I’m playing around with some large(ish) CSV files as part of an OpenSpending-related data investigation to look at UK government spending last year – example question: which companies were the top 10 recipients of government money? (More details can be found in this issue on OpenSpending’s things-to-do repo).
The dataset I’m working with is the consolidated spending (over £25k) by all UK government departments. Thanks to the efforts of the OpenSpending folks (and specifically Friedrich Lindenberg), this data has already been nicely ETL’d from thousands of individual CSV (and XLS) files into one big 3.7 GB file (see below for links and details).
My question is: what is the best way to do quick and dirty analysis on this?
Examples of the kinds of options I was considering were:
- Simple scripting (Python, Perl, etc.) – a minimal sketch is below
- PostgreSQL – load, build indexes and then sum, avg, etc.
- Elastic MapReduce (AWS Hadoop)
- Google BigQuery
I’d love to hear what folks think and whether there are tools or approaches they would specifically recommend.
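
To make the simple-scripting option concrete, here is a minimal sketch of a one-pass aggregation in Python that answers the “top 10 recipients” question. The file name and the column names (`supplier`, `amount`) are assumptions for illustration only; the real field names are in the Data Package descriptor linked below.

```python
import csv
import heapq
from collections import defaultdict

# Stream the big CSV once, summing spending per recipient.
# NOTE: the file name and column names ("supplier", "amount") are assumed;
# check datapackage.json for the actual field names.
totals = defaultdict(float)

with open("uk25k-spending.csv", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        supplier = (row.get("supplier") or "").strip()
        raw_amount = (row.get("amount") or "").replace(",", "").strip()
        if not supplier or not raw_amount:
            continue
        try:
            totals[supplier] += float(raw_amount)
        except ValueError:
            # Skip rows with unparseable amounts rather than crashing.
            continue

# Print the top 10 recipients by total amount received.
for supplier, total in heapq.nlargest(10, totals.items(), key=lambda kv: kv[1]):
    print(f"{supplier}\t{total:,.2f}")
```

A single streaming pass like this keeps memory usage proportional to the number of distinct recipients rather than the file size, so it should cope with a 3.7 GB input on an ordinary laptop.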
The Data
- Here’s the 3.7 GB CSV
- A Data Package file for the data describing the fields: https://raw.github.com/openspending/dpkg-uk25k/master/datapackage.json
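
If you want to script against the file, a quick way to see the available columns is to read the field list out of datapackage.json. The exact descriptor layout assumed here (resources with a schema whose fields carry a `name` or `id`) may differ, so check the file itself.

```python
import json
import urllib.request

# Fetch the Data Package descriptor and print the declared field names.
# The layout ("resources" -> "schema" -> "fields") is an assumption about
# this particular datapackage.
url = "https://raw.github.com/openspending/dpkg-uk25k/master/datapackage.json"

with urllib.request.urlopen(url) as resp:
    datapackage = json.load(resp)

for resource in datapackage.get("resources", []):
    for field in resource.get("schema", {}).get("fields", []):
        print(field.get("name") or field.get("id"))
```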