The Data Wrangling Blog

  • 05 June 2014 Neil Ashton

    Labs newsletter: 5 June, 2014

    Welcome back to the OKFN Labs! Members of the Labs have been building tools, visualizations, and even new data protocols—as well as setting up conferences and events. Read on to learn more. If you’d like to suggest a piece of news for next month’s newsletter,...
  • At the first International Sports Hackdays in Basel, Sierre and Milan, over 120 developers and designers, journalists and scientists, professionals and amateurs came together to prototype new approaches to make creative use of sports data. They built new types of hardware, new interfaces for fitness...
  • Football is the world’s most popular sport and the World Cup in Brazil - kicking off next month in São Paulo on June 12th (in 38 days 3 hours 15 minutes and counting) - is the world’s biggest (sport) event with 32 national teams from...
  • 05 May 2014 Rufus Pollock

    CSV Conf 2014 - for Data Makers Everywhere

    Announcing CSV,Conf - the conference for data makers everywhere which takes place on 15 July 2014 in Berlin. This one day conference will focus on practical, real-world stories, examples and techniques of how to scrape, wrangle, analyze, and visualize data. Whether your data is big...
  • In an ideal world… In an ideal world we would go in search of a piece of data by using our favorite search engine and we would land on a page with a big download button. It would give you a few options for formats....
  • 20 March 2014 Neil Ashton

    Labs newsletter: 20 March, 2014

    We’re back with a bumper crop of updates in this new edition of the now-monthly Labs newsletter! Textus Viewer refactoring The TEXTUS Viewer is an HTML + JS application for viewing texts in the format of TEXTUS, Labs’s open source platform for collaborating around collections...
  • 04 March 2014 Rufus Pollock

    The SEC EDGAR Database

    This post looks at the Securities and Exchange Commission (SEC) EDGAR database. EDGAR is a rich source of data containing regulatory filings from publicly-traded US corporations including their annual and quarterly reports: All companies, foreign and domestic, are required to file registration statements, periodic reports,...
  • 20 February 2014 Neil Ashton

    Labs newsletter: 20 February, 2014

    The past few weeks have seen major improvements to the Labs website, another Open Data Maker Night in London, updates to the TimeMapper project, and more. Labs Hangout: today The next Labs online hangout is taking place today in just a few hours—now’s your chance...
  • 30 January 2014 Neil Ashton

    Labs newsletter: 30 January, 2014

    From now on, the Labs newsletter will arrive through a special announce-only mailing list, newsletter@okfnlabs.org, more details on which can be found below. Keep reading for other new developments including the fifth Labs Hangout, the launch of SayIt, and new developments in the vision of...
  • 20 January 2014 Stefan Urbanek

    OLAP Cubes and Logical Models

    Last time we talked about OLAP in general – what it is and why it is useful. Today we are going to look at the data – how they are structured and why. What are cubes? What does “multi-dimensional” mean? Data Cubes and Logical...
  • 16 January 2014 Neil Ashton

    Labs newsletter: 16 January, 2014

    Welcome back from the holidays! A new year of Labs activities is well underway, with long-discussed improvements to the Labs projects page, many new PyBossa developments, a forthcoming community hangout, and more. Labs projects page Getting the Labs project page organized better has been high...
  • 10 January 2014 Stefan Urbanek

    Introduction to OLAP

    What is OLAP? “Online Analytical Processing – OLAP is an approach to answering multi-dimensional analytical queries swiftly” says Wikipedia. What does that mean? What are multi-dimensional analytical queries? Why this approach? We will learn all this in a short blog series. The term OLAP is...
  • 25 December 2013 Thomas Levine

    How I parse PDF files

    Much of the world’s data are stored in portable document format (PDF) files. This is not my preferred storage or presentation format, so I often convert such files into databases, graphs, or spreadsheets. I sort of follow this decision process. Do we need to read...
  • Data Converters is a command line tool and Python library making routine data conversion tasks easier. It helps data wranglers with everyday tasks like moving between tabular data formats—for example, converting an Excel spreadsheet to a CSV or a CSV to a JSON object. The...
  • 12 December 2013 Neil Ashton

    Labs newsletter: 12 December, 2013

    We’re back after taking a break last week with a bumper crop of updates. A few things have changed: Labs activities are now coordinated entirely through GitHub. Meanwhile, there have been some updates around the Nomenklatura, Annotator, and Data Protocols projects and some new posts on...
  • 06 December 2013 Michael Bauer

    Introducing Reconcile-CSV

    Recently I spent a week in Tanzania working on education data with the ministry of education (blog post here). One of the problems we faced there was spreadsheets that we wanted to merge but that had no unique IDs. I quickly realized we can do this through...
  • This post introduces one of the handiest features of Data Pipes: fast (pre) viewing of CSV files in your browser (and you can share the result by just copying a URL). The Raw CSV CSV files are frequently used for storing tabular data and are...
  • 28 November 2013 Neil Ashton

    Labs newsletter: 28 November, 2013

    Another busy week at the Labs! We’ve had lots of discussion around the idea of “bad data”, a blog post about Mark’s aid tracker, new PyBossa developments, and a call for help with a couple of projects. Next week we can look forward to another...
  • 25 November 2013 Mark Brough

    Looking at aid in the Philippines

    See also: “A closer look at aid in the Philippines” Since Typhoon Yolanda/Haiyan struck the Philippines on 8th November there has been some discussion around the availability of information to help coordinate activities effectively in the disaster response phase. To see what data was already...
  • 21 November 2013 Neil Ashton

    Labs newsletter: 21 November, 2013

    This week, Labs members gathered in an online hangout to discuss what they’ve been up to and what’s next for Labs. This special edition of the newsletter recaps that hangout for those who weren’t there (or who want a reminder). Data Pipes update Last week...
  • We’ve just started a mini-project called Bad Data. Bad Data provides real-world examples of how not to publish data. It showcases the poorly structured, the mis-formatted, and the just plain ugly. This isn’t about being critical but about educating—providing examples of how not to do...
  • 14 November 2013 Neil Ashton

    Labs newsletter: 14 November, 2013

    Labs was bristling with discussion and creation this week, with major improvements to two projects, interesting conversations around a few others, and an awesome new blog post. Data Pipes: lots of improvements Data Pipes is a Labs project that provides a web API for a...
  • 11 November 2013 Tarek Amr

    Natural Language Processing using Python

    This weekend the Google Developer Group in Cairo arranged two days of workshops followed by a hackathon. During this event, I organized a workshop about NLTK and the use of Python in Natural Language Processing (NLP). The session’s slides can be found here. The beauty of NLP...
  • 07 November 2013 Neil Ashton

    Labs newsletter: 7 November, 2013

    There was lots of interesting activity around Labs this week, with two launched projects, a new initiative in the works, and an Open Data Maker Night in London. Webshot: online screenshot service webshot.okfnlabs.org, an online service for taking screenshots of websites, is now live, thanks...
  • 06 November 2013 Rufus Pollock

    Tracking Issues with Data the Simple Way

    Data Issues is a prototype initiative to track “issues” with data using a simple bug tracker—in this case, GitHub Issues. We’ve all come across “issues” with data, whether it’s “data” that turns out to be provided as a PDF, the many ways to badly format...
  • 17 October 2013 Anastasios Ventouris

    A Python guide for open data file formats

    If you are an open data researcher you will need to handle a lot of different file formats from datasets. Sadly, most of the time, you don’t have the opportunity to choose which file format is the best for your project, but you have to...
  • TimeMapper lets you create elegant and embeddable timemaps quickly and easily from a simple spreadsheet. A timemap is an interactive timeline whose items connect to a geomap. Creating a timemap with TimeMapper is as easy as filling in a spreadsheet template and copying its URL....
  • Datapackages are a neat idea along the “using data like we use code” way. While Tryggvi has created a nice python module to handle datapackages - there is a problem using datapackages in javascript. In an ideal world I’d just call something like d3.csv() on...
  • 07 October 2013 Rufus Pollock

    PublicBodies.org - Update no. 2

    Herewith is a report on recent improvements to PublicBodies.org, our Open Knowledge Foundation Labs project to provide “a URL (and information) on every ‘public body’” - that’s every government-funded agency, department or organization. New data New data contributed over the last couple...
  • 04 October 2013 Rufus Pollock

    Data as Code Deja-Vu

    Someone just pointed me at this post from Ben Balter about Data as Code in which he emphasizes the analogies between data and code (and especially open data and open-source – e.g. “data is where code was 2 decades ago” …). I was delighted to...
  • 24 September 2013 Michael Bauer

    Full stack datavis - scraperwiki, d3 and github.

    The city of Vienna started releasing waiting times for some of its service offices recently. I followed my usual hunch and just wrote a small script on scraperwiki that stows away the JSON released by the city not knowing yet what to do with it....
  • 16 September 2013 Michael Bauer

    Using d3 as user input

    Recently, I was at Chicas Poderosas in Bogota - the three day event featured talks on two days and a hackday on the last. During the event I was approached by Natalia, an industrial designer, who introduced a project of hers: Electrocardiogr_ama. She wanted to...
  • 11 September 2013 Rufus Pollock

    Data Pipes - streaming online data transformations

    Data Pipes provides an online service built in NodeJS to do simple data transformations – deleting rows and columns, find and replace, filtering, viewing as HTML – and, furthermore, to connect these transformations together Unix pipes style to make more complex transformations. Because Data Pipes...
  • I’m pleased to announce the Miga Data Viewer, or Miga, an open source tool I created that lets you create a web/mobile app nearly automatically from a set of CSV data. There are already various applications/frameworks that provide a JavaScript-enabled front-end for structured data -...
  • 22 August 2013 Pierre-Yves Vandenbussche

    How a bad experience made an OKFN labs project

    From theory to experimentation Back in November 2010, I faced a problem while teaching my students about the Semantic Web. I wanted to convey the idea that Semantic Web technologies can break down the barriers between dataset silos on the Web and simplify the publication...
  • Tonight a couple of us were having a discussion on the OpenSpending IRC channel on how we can promote and better document the usage of the API. Tony had already begun to work on OpenSpending using R. I had previously done so as well. This...
  • 08 August 2013 Paul Fitzpatrick

    Diffing and patching tabular data

    A few years ago at the Eastern Conference for Workplace Democracy in New Hampshire, a bunch of friends chatting on a grassy knoll realized they were all working on overlapping directories of their communities, and decided to pool their efforts. They tracked down some techies...
  • 06 August 2013 Daniel Lombraña González

    Mapping Antimatter tracks with CrowdCrafting.org

    This last weekend, CERN hosted a very special event: the 2nd CERN Summer Student Webfest organized by the Citizen Cyberscience Centre. The Webfest invites CERN summer students to participate in a 48-hour hacking marathon, building new applications, tools, games, etc. about physics. This year, I...
  • 06 August 2013 Neil Ashton

    data.okfn.org - update no. 2

    data.okfn.org is the Labs’ repository of high-quality, easy-to-use open data. This update summarizes some of the improvements to data.okfn.org that have taken place over the past two months. New tools Several tools which make it easier to use the Data Package standard are now operational....
  • 31 July 2013 Daniel Lombraña González

    Analyzing Icelandic conviction rates with CrowdCrafting.org

    CrowdCrafting.org hosts a wide variety of applications that range from science to humanities. Since the official launch of CrowdCrafting.org, lots of applications have been created, but one of them has done a really impressive job: Héraðsdómar - sýknað eða sakfellt. Héraðsdómar - sýknað eða...
  • For a while I’ve been thinking about how to make Open Data more tangible. Even with great visualizations, it tends to remain stuck in computers and smartphones. Recently, I had the idea to start taking geodata, released by cities, and start making it into physical...
  • Having developed the Greek DBpedia, the first Internationalized DBpedia, OKFN Greece is now involved in the OKFN Labs by introducing three applications using DBpedia. 1. DBpedia Spotlight DBpedia Spotlight is an application that automatically spots and disambiguates words or phrases of text documents that might...
  • Spanish society has been bombarded recently with a flurry of news stories about possible cases of corruption in the major political parties like the Partido Socialista Obrero Español and the Partido Popular. In January of 2013 the party that rules the country, Partido Popular (PP),...
  • 09 July 2013 Neil Ashton

    PublicBodies.org progress

    There have been many new developments with PublicBodies.org, the Labs project which aims to provide “a URL for every part of government”, since the last update on the Labs blog. The news includes: a new and improved backend; a push for integration with Nomenklatura; discussion...
  • The next Open Data Maker Night London will be on Tuesday 16th July 6-9pm (you can drop in any time during the evening). Like the last two it is kindly hosted by the wonderful Centre for Creative Collaboration, 16 Acton Street, London. When: Tuesday 16th...
  • Back in April, I wrote on the Open Knowledge Foundation main blog to launch the first component of our Aid Transparency Tracker, a tool to analyse aid donors’ commitments to publish more open data about their aid activities. At the end of that post, I...
  • ElasticSearch is a great open-source search tool that’s built on Lucene (like SOLR) but is natively JSON + RESTful. It’s been used quite a bit at the Open Knowledge Foundation over the last few years. Plus, as it’s easy to set up locally it’s an attractive...
  • 28 June 2013 Neil Ashton

    Basic data cleaning with Data Explorer

    Data Explorer is a client-side web application for data processing and visualization. With Data Explorer, you can import data, transform it with JavaScript code, and visualize it on a graph or a map – all fully within the browser and with your data and code...
  • 28 May 2013 Rufus Pollock

    data.okfn.org - update no. 1

    This is the first in a series of regular updates on the Labs project http://data.okfn.org/ and summarizes some of the changes and improvements over the last few weeks. 1. Refactor of site layout and focus. We’ve done a refactor of the site to have a stronger focus on the data....
  • Our next Open Humanities Hangout will take place next Tuesday, 28th May. This is the latest in the series of regular hangouts we’ve been organizing over the past few months with people interested in tapping in to the growing amount of open cultural data and...
  • Nomenklatura is a simple service that makes it easy to maintain a canonical list of entities such as persons, companies or even streets and to match messy input, such as their names, against that canonical list – for example, matching Acme Widgets, Acme Widgets Inc...
  • This is an update on PublicBodies.org - a Labs project whose aim is to provide a “URL for every part of Government”: http://publicbodies.org/ PublicBodies.org is a database and website of “Public Bodies” – that is Government-run or controlled organizations (which may or may not have...
  • 11 April 2013 Rufus Pollock

    Quick and Dirty Analysis on Large CSVs

    I’m playing around with some large(ish) CSV files as part of an OpenSpending-related data investigation to look at UK government spending last year – example question: which companies were the top 10 recipients of government money? (More details can be found in this issue...
  • I’ve been working to get Greater London Authority spending data cleaned up and into OpenSpending. Primary motivation comes from this question: Which companies got paid the most (and for doing what)? (see this issue for more) I wanted to share where I’m up to and...
  • 30 March 2013 Friedrich Lindenberg

    sqlaload, an ETL wrapper for SQLAlchemy

    sqlaload is a small library that I use to handle databases in Python data processing. In many projects, your process starts with very messy data (something you’ve scraped or loaded from a hand-prepared Excel sheet). In subsequent stages, you gradually add cleaned values in new...
  • 27 March 2013 Sam Leon

    Next Steps for Textus

    At the Culture Labs hangout yesterday we wrote up the plans for the next steps for Textus we have been discussing over the last few months. The result is this slide deck overview. It both introduces Textus and outlines next steps (slide 12 onwards). Key...
  • 18 March 2013 Rufus Pollock

    Progress on the Data Explorer

    This is an update on progress with the Data Explorer (aka Data Transformer). Progress is best seen from this demo which takes you on a tour of house prices and the difference between real and nominal values. More information on recent developments can be found...
  • 26 February 2013 Rufus Pollock

    Recline JS - Componentization and a Smaller Core

    Over time Recline JS has grown. In particular, since the first public announcement of Recline last summer we’ve had several people producing new backends and views (e.g. backends for Couch, a view for d3, a map view based on Ordnance Survey’s tiles etc etc). As...
  • 20 February 2013 Daniel Lombraña González

    Exporting PyBossa data to CSV or JSON with one click

    I’m really happy to announce that today we have finally added a feature that allows you to export your data to CSV format with just one click (we also support the same for JSON). For this purpose, all the applications in PyBossa now feature...
  • 29 January 2013 Daniel Lombraña González

    Mozilla FirefoxOS App Days & Crowdcrafting.org

    Last Saturday, the 26th of January, Mozilla held a hack day, the #FirefoxOSAppDay, in parallel in 25 cities all over the world, about creating new web applications for their new FirefoxOS mobile OS and the desktop web browser (this is still in beta and alpha mode!)....
  • 28 January 2013 Daniel Lombraña González

    PyBossa.JS or how you can easily create new PyBossa applications

    In the last few weeks we have been working hard to make it easier to develop new PyBossa applications. For this reason, we are happy to announce a new version of PyBossa.JS. This new version introduces several improvements: Creating an app is much easier! You...
  • 25 January 2013 Friedrich Lindenberg

    Journoid, data notifications

    At the Open Interests hackday in November, a discussion with Martin Stabe from the FT’s interactive desk led to a prototype of Journoid. The idea is to monitor changing on-line datasets for remarkable information, like earthquakes, procurement in a particular industry or a close parliamentary vote....
  • I’ve traditionally used python for web scraping but I’d been increasingly thinking about using Node given that it is pure JS and therefore could be a more natural fit when getting info out of web pages. In particular, when my first steps when looking to...
  • 08 January 2013 Rufus Pollock

    Archiving Twitter the Hacky Way

    There are many circumstances where you want to archive tweets - maybe just from your own account or perhaps for a hashtag for an event or topic. Unfortunately Twitter search queries do not give data more than 7 days old and for a given...
  • 13 December 2012 Stefan Wehrmeyer

    Bundes-Git – German Laws on GitHub

    If you compare software code and legislation you can find many similarities: both are big bodies of text spread over multiple units (laws/files). The total amount of text inevitably grows bigger over time with many small changes to existing parts while most of the corpus...
  • 12 December 2012 Gregor Aisch

    Speeding Up Your PyBossa App

    Thanks to the free crowd-crafting tool PyBossa, nowadays the biggest challenge for successful crowd-sourcing is engaging users to participate in tasks and keeping that motivation at a high level over time. Therefore the user experience of crowd-sourcing apps plays a crucial role. After participating...
  • 04 December 2012 Rufus Pollock

    Javascript Timeline Libraries - A Review

    This post is a rough and ready overview of various javascript timeline libraries that arose from research in creating a timeline view for Recline JS. Note this material hung around on my hard disk for a few months so some of it may already be...
  • Making sense of massive datasets that document the processes of lobbying and public procurement at European Union level is not an easy task. Yet a group of 25 journalists, developers, graphic designers and activists worked together at the Open Interests Europe hackathon last weekend to...
  • 13 November 2012 Vitor Baptista

    Scraping Data Behind a CAPTCHA

    How much does the highest paid person in the Brazilian Federal Senate earn? That’s the question I asked myself a few weeks ago, and one that should be easy to answer. In Brazil, every public body must publish its employees’ salaries online, but some do...
  • 01 November 2012 Rufus Pollock

    Recline JS Search Demo

    We’ve recently finished a demo for ReclineJS showing how it can be used to build JS-based (ajax-style) search interfaces in minutes (or even seconds!): http://reclinejs.com/demos/search/ Because of Recline’s pluggable backends you get out of the box support for data sources such as SOLR, Google Spreadsheet,...
  • 23 October 2012 Nigel Babu

    Labs Show and Tell - 26th October!

    We’re having the next Show and Tell on Friday, 26 October at 2:30 pm BST via Google Hangout on Air. As usual, the URL will be posted on OKFN Labs’ G+ Page. If you’d like to present, add your name to the list. Remember, #okfn...
  • 22 October 2012 Friedrich Lindenberg

    Wrangling dirty data with messytables.

    One of the largest data collection projects we have done so far has been the consolidation of the UK’s departmental expenditure. Over 370 different government entities have published a total of more than 7000 spreadsheets. Many of those have obviously been hand-crafted or at least...
  • 15 October 2012 Velichka Dimitrova

    Open Interests Hackathon in London, 24-25 November

    The European Journalism Centre and the Open Knowledge Foundation, sponsored by Knight-Mozilla OpenNews, invite you to the Open Interests Hackathon to track the interests and money flows which shape European policy. When: 24-25 November Where: Google Campus Cafe, 4-5 Bonhill Street, EC2A 4BX London...
  • 10 October 2012 Nigel Babu

    Labs Show and Tell - All Welcome!

    Built an app or tool you want to show people? Played around with some interesting data? Know of a new development people should know about? Want to find out what others are doing? Come to the Show and Tell this Friday and share what you...
  • 25 September 2012 Friedrich Lindenberg

    Data Catalogues are People!

    Last week, Matej Kurian published a message on the okfn-labs mailing list, describing the various sources he had discovered for machine-readable excerpts of the EU’s joint procurement system, TED. What struck me about this message was that, apparently, this polite and brilliant policy wonk had...
  • WikipediaJS is a simple JS library for accessing information in Wikipedia articles such as dates, places, abstracts etc. The library is the work of Labs member Rufus Pollock. In essence, it is a small wrapper around the data and APIs of the DBPedia project and...
  • 08 August 2012 Rufus Pollock

    Timeliner - Make Nice Timelines Fast

    As part of the Recline launch I quickly put together some very simple demo apps, one of which was called Timeliner: http://timeliner.reclinejs.com/ This uses the Recline timeline component (which itself is a relatively thin wrapper around the excellent Verite timeline) plus the Recline Google docs...
  • This is a brief post to announce an alpha prototype version of the Data Transformer, an app to let you clean up data in the browser using javascript: http://transformer.datahub.io/ 2m overview video: What does this app do? You load a CSV file from github (fixed...
  • 14 July 2012 Daniel Lombraña González

    Displaying PyBossa Urban Parks Data on a 3D Globe

    Labs member Daniel Lombraña González has built a 3-d globe showing the locations of urban parks around the world as located by volunteers using the PyBossa Urban Park geocoding app: http://teleyinex.github.com/pybossa-urbanpark-globe/ — (Source code) Background The Urban Parks geo-coding application is a micro-tasking app running...
  • On June 21st, the Knight News Challenge Round on Data ended. The day before, Rufus, Ross and I sat down to write out some ideas that we’d been discussing for a while. While we submitted proposals for Grano and DataProtocols, we decided to hold back...
  • On June 21st, the Knight News Challenge Round on Data ended. The day before, Rufus, Ross and I sat down to write out some ideas that we’d been discussing for a while. The first idea I want to repost here is a proposal for Grano,...