Cleaning up Greater London Authority Spending (for OpenSpending)

I’ve been working to get Greater London Authority spending data cleaned up and into OpenSpending. Primary motivation comes from this question:

Which companies got paid the most (and for doing what)? (see this issue for more)

I wanted to share where I’m up to and some of the experience so far as I think these can inform our wider efforts - and illustrate the challenges just getting and cleaning up data. I note that the code and README for this ongoing work is in a repo on github:

Data Quality Issues

There are 61 CSV files as of March 2013 (a list can be found in scrape.json).

Unfortunately the “format” varies substantially across files (even though they are all CSV!) which makes using this data real pain. Some examples:

  • no of fields and there names vary across files (e.g. SAP Document no vs Document no)
  • number of blank columns or blank lines (some files have no blank lines (good!), many have blank lines plus some metadata etc etc)
  • There is also at least one “bad” file which looks to be an excel file saved as CSV
  • Amounts are frequently formatted with “,” making them appear as strings to computers.
  • Dates vary substantially in format e.g. “16 Mar 2011”, “21.01.2011” etc
  • No unique transaction number (possibly document number)

They also switched from monthly reporting to period reporting (where there are 13 periods of approx 28d each).

Progress so far

I do have one month loaded (Jan 2013) with a nice breakdown by “Expenditure Account”:

Interestingly after some fairly standard grants to other bodies, “Claim Settlements” comes in as the biggest item at £2.3m