sqlaload is a small library that I use to handle databases in Python data processing. In many projects, the process starts with very messy data (something you’ve scraped or loaded from a hand-prepared Excel sheet). In subsequent stages, you gradually add cleaned values in new columns or new tables. Managing a full SQL schema for such operations can be a hassle: what you really want is something closer to MongoDB, a NoSQL data store you can throw fairly random data at and get it back later.
With sqlaload, the idea is to combine some of that schema flexibility with the benefits of keeping your data in a structured database in the background.
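To illustrate the idea (this is a concept sketch using only the standard library, not sqlaload’s actual API — the `save_row` helper is made up for this example): you hand the store plain dicts, and the table grows new columns as new keys show up.

```python
import sqlite3

def save_row(conn, table, row):
    """Insert a dict into ``table``, creating the table and any
    missing columns on the fly (schema expansion sketch)."""
    conn.execute('CREATE TABLE IF NOT EXISTS "%s" (id INTEGER PRIMARY KEY)' % table)
    # which columns exist already?
    existing = {r[1] for r in conn.execute('PRAGMA table_info("%s")' % table)}
    for col in row:
        if col not in existing:
            conn.execute('ALTER TABLE "%s" ADD COLUMN "%s"' % (table, col))
    cols = ", ".join('"%s"' % c for c in row)
    marks = ", ".join("?" for _ in row)
    conn.execute('INSERT INTO "%s" (%s) VALUES (%s)' % (table, cols, marks),
                 list(row.values()))

conn = sqlite3.connect(":memory:")
save_row(conn, "places", {"name": "Berlin", "country": "DE"})
# the second row carries a key the table has never seen; a
# "population" column gets added rather than raising an error
save_row(conn, "places", {"name": "Paris", "population": 2200000})
```

This is the behaviour the rest of the post refers to as schema expansion: rows with unknown keys widen the table instead of failing.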
I first saw this type of SQL schema generation implemented in ScraperWiki: they have a couple of high-level SQLite wrappers that expand your database schema as you feed them data. We later adopted that concept for the joint CKAN/ScraperWiki webstore, which neither project ended up using.
Still, webstore had become an essential part of many of my data projects as an operational data store. Eventually, I decided to kick out the networking aspect: data access via HTTP is terribly slow and I wanted to have my data in Postgres, not SQLite. The webstore code went into sqlaload and became a thin wrapper on top of SQLAlchemy core (the non-ORM database abstraction part of SQLAlchemy).
Running on top of SQLAlchemy also means that all of its functionality (for example, the query expression language) is available and can be used for more advanced operations.
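For example, dropping down to SQLAlchemy core lets you build queries programmatically instead of writing raw SQL. A minimal sketch, using plain SQLAlchemy against an in-memory SQLite database (the table and column names here are invented for the example):

```python
from sqlalchemy import (create_engine, MetaData, Table, Column,
                        Integer, String, select)

engine = create_engine("sqlite://")  # in-memory SQLite for the demo
metadata = MetaData()
places = Table("places", metadata,
               Column("id", Integer, primary_key=True),
               Column("name", String),
               Column("population", Integer))
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(places.insert(), [
        {"name": "Berlin", "population": 3500000},
        {"name": "Paris", "population": 2200000},
        {"name": "Bonn", "population": 320000},
    ])
    # expression language: compose filtering and ordering in Python
    stmt = (select(places.c.name)
            .where(places.c.population > 1000000)
            .order_by(places.c.name))
    big = [row[0] for row in conn.execute(stmt)]
```

Because sqlaload is only a thin layer on top of this, nothing stops you from mixing the convenience functions with hand-built expressions like `stmt` when a query gets more involved.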
If you want to try it out, sqlaload is now on PyPI and the README has a lot of detailed documentation on the library.