The Data Wrangling Blog

01 August 2019 Daniel Fireman

Introduction to Statistics With Data Packages and Gonum

After 6 years at Google, Daniel Fireman is currently a Ph.D. student, professor and activist for government transparency and accountability in the Northeast of Brazil. He was one of 2017’s Frictionless Data Tool Fund grantees and implemented the core Frictionless Data specification in the...

●●●●●
18 October 2018 Adam Kariv

Announcing datapackage-pipelines version 2.0

Today we’re releasing a major version for datapackage-pipelines, version 2.0.0. This new version marks a big step forward in realizing the Data Factory concept and framework. We integrated datapackage-pipelines with its younger sister dataflows, and created a set of common building blocks you can now...

●●●●●
30 August 2018 Adam Kariv

Data Factory & DataFlows - Tutorial

Data Factory is an open framework for building and running lightweight data processing workflows quickly and easily. We recommend reading this introductory blogpost to gain a better understanding of underlying Data Factory concepts before diving into the tutorial below. Learn how to write your own...

●●●●●
29 August 2018 Adam Kariv

Data Factory & DataFlows - An Introduction

Today I’d like to introduce a new library we’ve been working on - dataflows. DataFlows is a part of a larger conceptual framework for data processing. We call it ‘Data Factory’ - an open framework for building and running lightweight data processing workflows quickly and...

●●●●●
07 May 2018 Matt Thompson

Processing Tabular Data Packages in Clojure

Matt Thompson was one of 2017’s Frictionless Data Tool Fund grantees tasked with extending implementation of the core Frictionless Data data package and table schema libraries in the Clojure programming language. You can read more about this in his grantee profile. In this post, Thompson will show...

●●●●●
28 April 2018 Georges Labrèche

Processing Tabular Data Packages in Java

Georges Labrèche was one of 2017’s Frictionless Data Tool Fund grantees tasked with extending implementation of the core Frictionless Data libraries in the Java programming language. You can read more about this in his grantee profile. In this post, Labrèche will show you how to install and...

●●●●●
08 March 2018 Serah Rono

Collecting, Analysing and Sharing Twitter Data

On March 3, communities around the world marked Open Data Day in over 400 events. Here’s the dataset for all Open Data Day 2018 events. In this post, we will harvest Open Data Day affiliated content from Twitter and analyze it using R before packaging...

●●●●●
16 February 2018 Daniel Fireman

Processing Tabular Data Packages in Go

Daniel Fireman was one of 2017’s Frictionless Data Tool Fund grantees tasked with extending implementation of the core Frictionless Data libraries in the Go programming language. You can read more about this in his grantee profile. In this post, Fireman will show you how to install and...

●●●●●
15 February 2018 Rufus Pollock

Frictionless Data Lib - A Design Pattern for Accessing Files and Datasets

This document outlines a simple design pattern for a “core” data library, “data”. The pattern is focused on access and use of individual files (streams) and collections of files (“datasets”). Its primary operation is open: file = open('path/to/file.csv'); dataset = open('path/to/files/'). It defines a standardized “stream-plus-metadata”...
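As a rough illustration of the pattern described in this excerpt, here is a minimal Python sketch in which one call opens either a single file or a directory of files and returns an object pairing a stream with metadata. The names `open_data`, `File`, `Dataset` and the descriptor keys are assumptions made for this sketch; they are not taken from the post or from any Frictionless Data library.

```python
import csv
import io
import os
from dataclasses import dataclass, field


@dataclass
class File:
    """Hypothetical 'stream-plus-metadata' wrapper around one file."""
    path: str
    descriptor: dict = field(default_factory=dict)

    def stream(self):
        # Return a fresh binary stream on every call.
        return io.open(self.path, "rb")

    def rows(self):
        # Convenience for CSV files: yield each row as a list of strings.
        with io.open(self.path, newline="", encoding="utf-8") as f:
            yield from csv.reader(f)


@dataclass
class Dataset:
    """Hypothetical collection of File objects found in one directory."""
    path: str
    files: list


def open_data(path):
    """Open a single file, or a directory of files (a 'dataset')."""
    if os.path.isdir(path):
        candidates = [os.path.join(path, n) for n in sorted(os.listdir(path))]
        files = [open_data(p) for p in candidates if os.path.isfile(p)]
        return Dataset(path=path, files=files)
    name = os.path.basename(path)
    stem, ext = os.path.splitext(name)
    # Assumed descriptor keys, chosen only for illustration.
    descriptor = {
        "name": stem,
        "path": path,
        "format": ext.lstrip(".").lower(),
        "bytes": os.path.getsize(path),
    }
    return File(path=path, descriptor=descriptor)
```

The function is named `open_data` rather than `open` so that it does not shadow Python's built-in; a caller would use `open_data('path/to/file.csv')` or `open_data('path/to/files/')` in the same way as the excerpt's two calls.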

●●●●●
14 February 2018 Open Knowledge Greece

Creating and Using Data Packages in R

Open Knowledge Greece was one of 2017’s Frictionless Data Tool Fund grantees tasked with extending implementation of the core Frictionless Data libraries in the R programming language. You can read more about this in their grantee profile. In this post, Kleanthis Koupidis, a Data Scientist and Statistician...

●●●●●
05 February 2018 Serah Rono

Working with Data Package Creator

The Data Package Creator, create.frictionlessdata.io, is a revamp of the Data Packagist app that lets you create, edit and validate your data packages with ease. Read on and find out how. Frictionless Data aims to make it effortless to transport high quality data among...

●●●●●
10 January 2018 Adam Kariv

Interactive Data wrangling using Data Package Pipelines new UI

datapackage-pipelines is a framework for defining data processing steps to generate self-describing Data Packages, built on the concepts and tooling of the Frictionless Data project. You can read more about datapackage-pipelines in this introductory post. Data wrangling can be quite a tedious task - We...

●●●●●
21 December 2017 Paul Walsh

Bootstrapping data standards with Frictionless Data

When it comes to tabular data, the Frictionless Data specifications provide users with strong conventions for declaring both the shape of data (via schemas) and information about the data (as metadata on package and resource descriptors). Within the Frictionless Data world, we purposefully refer to...

●●●●●
29 November 2017 Vitor Baptista

Validating scraped data using goodtables

We have to deal with many challenges when scraping a page. What’s the page’s layout? How do I extract the bits of data I want? How do I know when their layout changes and breaks my code? How can I be sure that my code...

●●●●●
03 November 2017 DataHub Team

Core Data on DataHub.io

This blog post was originally published on datahub.io by Rufus Pollock, Meiran Zhiyenbayev & Anuar Ustayev. The “Core Data” project provides essential data for the data wranglers and data science community. Its online home is on the DataHub: https://datahub.io/core https://datahub.io/docs/core-data This post introduces you to...

●●●●●
11 October 2017 Meiran Zhiyenbayev

Data Package v1 Specifications. What has Changed and how to Upgrade

This post walks you through the major changes in the Data Package v1 specs compared to pre-v1. It covers changes in the full suite of Data Package specifications including Data Resources and Table Schema. It is particularly valuable if: you were using Data Packages pre...

●●●●●
05 October 2017 Serah Rono

Frictionless Data Specs v1 Updates

The Frictionless Data team released v1 specifications in the first week of September 2017 and Paul Walsh, Chief Product Officer at Open Knowledge International, wrote a detailed blogpost about it. With this milestone, in addition to modifications on pre-existing specifications like Table Schema and CSV...

●●●●●
13 July 2017 Dan Fowler

Measure for Measure

In his Open Knowledge International Tech Talk, Developer Brook Elgie describes how we are using Data Package Pipelines and Redash to gain insight into our organization in a declarative, reproducible, and easy to modify way. This post briefly introduces a newly launched internal project at...

●●●●●
10 July 2017 Andy Lulham

DAC and CRS code lists – Now available as Frictionless Data!

This blog was originally posted on the Publish What You Fund website. Maintained, machine readable versions of the DAC and CRS code lists are now available as CSV and JSON! Here’s how Publish What You Fund and Open Knowledge made it happen… The OECD’s Development...

●●●●●
22 May 2017 Serah Rono

Introducing the new goodtables library and goodtables.io

Information is everywhere. This means that there is so much we need to know at any given time, but such limited capacity and time to internalize it all. True art, therefore, lies in the ability to draw summaries adequate enough to save time and impart...

●●●●●
27 February 2017 Adam Kariv

Data Package Pipelines

datapackage-pipelines is the newest part of the Frictionless Data toolchain. Originally developed through work on OpenSpending, it is a framework for defining data processing steps to generate self-describing Data Packages. OpenSpending is an open database for uploading fiscal data for countries or municipalities to better...

●●●●●
30 November 2016 Dan Fowler

Case Studies for Frictionless Data

For our Frictionless Data project, we were curious to learn about some of the common issues users face when working with data. To that end, we started a Case Study series to highlight projects and organizations working with the Frictionless Data specifications and tooling in...

●●●●●
17 October 2016 Dan Fowler

Frictionless Data Specs Working Group

Last month, we had the first call of the Frictionless Data Specifications Working Group, starting a new chapter in the project. The call covered the status of the specifications to date, current adoption, upcoming technical pilots and partnerships, and how work will be organized going...

●●●●●
13 October 2016 Matt Fullerton

Building 2030-watch.de: measuring progress towards the sustainable development goals (SDGs)

For the last 15 months the Open Knowledge Foundation Germany has been working on a prototype to monitor progress towards the sustainable development goals (SDGs) from an independent, civil society-led perspective. There’s a detailed blog post on why such independent monitoring is necessary at our...

●●●●●
04 August 2016 Dan Fowler

Embulk at csv,conf,v2

Having co-organized csv,conf,v2 this past May, a few of us from Open Knowledge International had the awesome opportunity to travel to Berlin and sit in on a range of fascinating talks on the current state-of-the-art on wrangling messy data. Previously, I posted about Comma Chameleon...

●●●●●
01 August 2016 Dan Fowler

Using Data Packages with Pandas

Frictionless Data is about making it effortless to transport high quality data among different tools and platforms for further analysis. We obviously ♥ data science, and pandas is one of the most popular Python libraries for advanced data analysis and modeling. This post highlights our...

●●●●●
25 July 2016 Dan Fowler

Publish Data Packages to DataHub (CKAN)

Back in March, I wrote about a CKAN extension for publishing and exporting Data Packages. This extension, datapackager, has been updated and is now live on our very own CKAN instance, DataHub. DataHub users can now import and export Data Packages via the CKAN UI...

●●●●●
18 July 2016 Dan Fowler

Comma Chameleon at csv,conf,v2

Having co-organized csv,conf,v2 this past May, a few of us from Open Knowledge International had the awesome opportunity to travel to Berlin and sit in on a range of fascinating talks on the current state-of-the-art on wrangling messy data. One such talk was given by...

●●●●●
14 July 2016 Dan Fowler

Using Data Packages with R

R is a popular open-source programming language and platform for data analysis. Frictionless Data is an Open Knowledge International project aimed at making it easy to publish and load high-quality data into tools like R through the creation of a standard wrapper format called the...

●●●●●
13 July 2016 Alexandre Bonnasseau

'Continuous Processing' with Data Packages

When storing your data in Data Packages, it is considered good practice to store scripts for updating, processing, or analyzing your data in a directory called scripts/ placed at the root of your Data Package. I’ve written a tutorial to show how to achieve continuous...

●●●●●
17 May 2016 Dan Fowler

Automated Data Validation with Data Packages

Much of the open data on the web is published in CSV or Excel format. Unfortunately, it is often messy and can require significant manipulation to actually be usable. In this post, I walk through a workflow for automating data validation on every update to...

●●●●●
19 April 2016 Rufus Pollock

Tools for Extracting Data and Text from PDFs - A Review

Extracting data from PDFs remains, unfortunately, a common data wrangling task. This post reviews various tools and services for doing this with a focus on free (and preferably open source) options. The tools we can consider fall into three categories: Extracting text from PDF; Extracting...

●●●●●
25 March 2016 Alexandre Bonnasseau

Tools for Data Packages: Make vs. Tuttle

When crafting data from some other data, like packaging public data, using the right tools can really ease the development process and improve the reliability of the data. The venerable make, which has been used for decades to build software, is a very good option as advocated...

●●●●●
11 March 2016 Dan Fowler

Frictionless Data Transport in Python

Tool and platform integrations for “Data Packages” are key elements of our Frictionless Data Initiative at Open Knowledge International. We recently posted on the main blog about some integration work funded by our friends at Google. We’ve built useful Python libraries for working with Tabular...

●●●●●
18 February 2016 Josh Wieder

Submit your Newsletter ideas today!

The first quarter of 2016 is almost through, which means that the OKFN Labs Newsletter is on its way! But we have a problem. We know that you have spent the last 3 months writing awesome code, founding disruptive new projects and basically changing the...

●●●●●
05 December 2015 Josh Wieder

Labs newsletter: Q4 2015

Hey there hackers & hackettes! Welcome to the 4th quarter 2015 Open Knowledge Labs Newsletter: A Very Special Holiday Edition of the Open Knowledge Labs Newsletter. We hope that all of our readers, volunteers, team members & contributors have a great holiday season. Labs is...

●●●●●
28 September 2015 Daniel Fowler

Labs newsletter: Q2/Q3 2015

Welcome to the second Labs Newsletter of 2015! There has been excellent progress on various open data tools and initiatives across the Open Knowledge network since the last newsletter. Let’s take a look: Labs Still <3 Discourse Open Knowledge is in the process of centralizing...

●●●●●
04 September 2015 Osahon Okungbowa

Open Data Companion (ODC) – Bringing Open Data to the Mobile Platform

As software developers, we are always looking for data to solve a problem or address a shortcoming. It’s just how we’re wired. So, you heard of open data [1], and now you’re excited to go exploring and get the open data needed for the project....

●●●●●
28 August 2015 Matt Fullerton

Document to text conversion web service gets a nice name, a nice URL and a web interface

Give Me Text! In a previous post, I detailed a web service where you can throw documents of many kinds at it, and get text in return. We’ve now given this service a name, “Give Me Text!”, and a nice URL at http://givemetext.okfnlabs.org/ for both...

●●●●●
14 August 2015 Chris Hutchins

Improving the openness of health and social care data

The Health and Social Care Information Centre (HSCIC) is responsible for publishing a large proportion of the official statistics related to health and care in England. Each year we release about 250 statistical publications, ranging from high-level summary data on hospital admissions, through to detail...

●●●●●
11 July 2015 Dan Fowler

Featured Core Datasets: Comprehensive Country Codes and Country List

Are you in need of a clean, well maintained list of all countries and their associated international codes in CSV and JSON? If so, you might consider the country-codes and country-list data packages available at data.okfn.org. Country Codes, using source data from ISO, the CIA...

●●●●●
11 May 2015 Paul Walsh

Labs newsletter: Q1 2015

Welcome to the first Labs Newsletter of 2015! There has been some great activity around open data and tech in the Open Knowledge network over the first quarter of 2015. Let’s dive straight in! Labs <3 Discourse In case you don’t know, Discourse is an...

●●●●●
26 April 2015 Gerald Bauer

Introducing datapak - Work with Tabular Data Packages using Ruby and ActiveRecord

Tabular data packages are a pragmatic way of both publishing your own data and consuming the data that others share with the world. The newly published datapak is a Ruby library that lets you work with tabular data packages using ActiveRecord and, thus, your SQL...

●●●●●
06 March 2015 Paul Walsh

The Good Tables web service

Introducing the Good Tables web service Good Tables is a free online service that helps you find out if your tabular data is actually good to use - it can check for structural problems (blank rows and columns) as well as ensure that data fits...

●●●●●
21 February 2015 Matt Fullerton

A public web service for document to text conversion including OCR

Getting text out of documents Last year I was working on beta.offenedaten.de, a catalog of data catalogs in Germany using the CKAN platform as the basis. Although the topic of how to enable full-text search of documents in CKAN data catalogs is somewhat open, I...

●●●●●
20 February 2015 Paul Walsh

Introducing Good Tables

What is it? Good Tables is a Python package for validating tabular data through a processing pipeline. It is built by Open Knowledge, with funding from the Open Data User Group. Good Tables is currently an alpha release. Applications range from simple validation checks on...

●●●●●
03 January 2015 Rufus Pollock

Wanted - Data Curators to Maintain Key Datasets in High-Quality, Easy-to-Use and Open Form

Wanted: volunteers to join a team of “Data Curators” maintaining “core” datasets (like GDP or ISO-codes) in high-quality, easy-to-use and open form. What is the project about: Collecting and maintaining important and commonly-used (“core”) datasets in high-quality, standardized and easy-to-use form - in particular, as...

●●●●●
11 September 2014 Rufus Pollock

A Data API for Data Packages in Seconds Using CKAN and its DataStore

dpm, the command-line ‘data package manager’, now supports pushing (Tabular) Data Packages straight into a CKAN instance (including pushing all the data into the CKAN DataStore): dpm ckan {ckan-instance-url} This allows you, in seconds, to get a fully-featured web data API – including JSON and...

●●●●●
01 September 2014 Stefan Urbanek

Bubbles: Python ETL Framework (prototype)

Introduction and ETL The abbreviation ETL stands for extract, transform and load. What is it good for? For everything between data sources and fancy visualisations. In the data warehouse the data will spend most of the time going through some kind of ETL, before they...

●●●●●
19 August 2014 Ricardo Lafuente

Data Central: a static frontend for data package collections

This post explains our issues at the Portuguese open data front when it comes to providing bulk datasets in standard and easy-to-parse ways. It also introduces Data Central, our tentative solution to those issues: a Python tool to generate static web frontends for your data...

●●●●●
05 June 2014 Neil Ashton

Labs newsletter: 5 June, 2014

Welcome back to the OKFN Labs! Members of the Labs have been building tools, visualizations, and even new data protocols—as well as setting up conferences and events. Read on to learn more. If you’d like to suggest a piece of news for next month’s newsletter,...

●●●●●
04 June 2014 Hannes Gassert

First International Sport Hackdays Kick Off New OK Working Group

At the first International Sports Hackdays in Basel, Sierre and Milan, over 120 developers and designers, journalists and scientists, professionals and amateurs came together to prototype new approaches to make creative use of sports data. They built new types of hardware, new interfaces for fitness...

●●●●●
06 May 2014 Gerald Bauer

Using open football data - Get ready for the World Cup in Brazil 2014

Football is the world’s most popular sport and the World Cup in Brazil - kicking off next month in São Paulo on June 12th (in 38 days 3 hours 15 minutes and counting) - is the world’s biggest (sport) event with 32 national teams from...

●●●●●
05 May 2014 Rufus Pollock

CSV Conf 2014 - for Data Makers Everywhere

Announcing CSV,Conf - the conference for data makers everywhere which takes place on 15 July 2014 in Berlin. This one day conference will focus on practical, real-world stories, examples and techniques of how to scrape, wrangle, analyze, and visualize data. Whether your data is big...

●●●●●
22 March 2014 Matthew Landauer

Morph, a scraper platform for hackers and would be hackers

In an ideal world… In an ideal world we would go in search of a piece of data by using our favorite search engine and we would land on a page with a big download button. It would give you a few options for formats....

●●●●●
20 March 2014 Neil Ashton

Labs newsletter: 20 March, 2014

We’re back with a bumper crop of updates in this new edition of the now-monthly Labs newsletter! Textus Viewer refactoring The TEXTUS Viewer is an HTML + JS application for viewing texts in the format of TEXTUS, Labs’s open source platform for collaborating around collections...

●●●●●
04 March 2014 Rufus Pollock

The SEC EDGAR Database

This post looks at the Securities and Exchange Commission (SEC) EDGAR database. EDGAR is a rich source of data containing regulatory filings from publicly-traded US corporations including their annual and quarterly reports: All companies, foreign and domestic, are required to file registration statements, periodic reports,...

●●●●●
20 February 2014 Neil Ashton

Labs newsletter: 20 February, 2014

The past few weeks have seen major improvements to the Labs website, another Open Data Maker Night in London, updates to the TimeMapper project, and more. Labs Hangout: today The next Labs online hangout is taking place today in just a few hours—now’s your chance...

●●●●●
30 January 2014 Neil Ashton

Labs newsletter: 30 January, 2014

From now on, the Labs newsletter will arrive through a special announce-only mailing list, more details on which can be found below. Keep reading for other new developments including the fifth Labs Hangout, the launch of SayIt, and new developments in the vision of...

●●●●●
20 January 2014 Stefan Urbanek

OLAP Cubes and Logical Models

Last time we talked about OLAP in general – what it is and why it is useful. Today we are going to look at the data – how are they structured, and why? What are cubes? What does “multi-dimensional” mean? Data Cubes and Logical...

●●●●●
16 January 2014 Neil Ashton

Labs newsletter: 16 January, 2014

Welcome back from the holidays! A new year of Labs activities is well underway, with long-discussed improvements to the Labs projects page, many new PyBossa developments, a forthcoming community hangout, and more. Labs projects page Getting the Labs project page organized better has been high...

●●●●●
10 January 2014 Stefan Urbanek

Introduction to OLAP

What is OLAP? “Online Analytical Processing – OLAP is an approach to answering multi-dimensional analytical queries swiftly” says Wikipedia. What does that mean? What are multi-dimensional analytical queries? Why this approach? We will learn all this in a short blog series. The term OLAP is...

●●●●●
17 December 2013 Neil Ashton

Convert data between formats with Data Converters

Data Converters is a command line tool and Python library making routine data conversion tasks easier. It helps data wranglers with everyday tasks like moving between tabular data formats—for example, converting an Excel spreadsheet to a CSV or a CSV to a JSON object. The...
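To make the kind of conversion mentioned in this excerpt concrete, here is a minimal sketch of turning a CSV into a JSON array of objects. It uses only the Python standard library and does not use or reproduce the Data Converters API; the function name `csv_to_json` and the sample data are assumptions for illustration.

```python
import csv
import io
import json


def csv_to_json(csv_text):
    """Convert CSV text with a header row into a JSON array of objects."""
    # DictReader uses the first row as keys; every value stays a string.
    reader = csv.DictReader(io.StringIO(csv_text))
    return json.dumps([dict(row) for row in reader], indent=2)


if __name__ == "__main__":
    sample = "country,code\nBrazil,BR\nGreece,GR\n"
    print(csv_to_json(sample))
```

Values are left as strings here; inferring numeric or date types is a separate step that a dedicated conversion tool would typically handle.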

●●●●●
12 December 2013 Neil Ashton

Labs newsletter: 12 December, 2013

We’re back after taking a break last week with a bumper crop of updates. A few things have changed: Labs activities are now coordinated entirely through GitHub. Meanwhile, there have been some updates around the Nomenklatura, Annotator, and Data Protocols projects and some new posts on...

●●●●●
06 December 2013 Michael Bauer

Introducing Reconcile-CSV

Recently I spent a week in Tanzania working on education data with the ministry of education (blog post here). One of the problems we faced there was spreadsheets that we wanted to merge but that lacked any unique IDs. I quickly realized we can do this through...

●●●●●
05 December 2013 Neil Ashton

View a CSV (Comma Separated Values) in Your Browser

This post introduces one of the handiest features of Data Pipes: fast (pre) viewing of CSV files in your browser (and you can share the result by just copying a URL). The Raw CSV CSV files are frequently used for storing tabular data and are...

●●●●●
28 November 2013 Neil Ashton

Labs newsletter: 28 November, 2013

Another busy week at the Labs! We’ve had lots of discussion around the idea of “bad data”, a blog post about Mark’s aid tracker, new PyBossa developments, and a call for help with a couple of projects. Next week we can look forward to another...

●●●●●
25 November 2013 Mark Brough

Looking at aid in the Philippines

See also: “A closer look at aid in the Philippines” Since Typhoon Yolanda/Haiyan struck the Philippines on 8th November there has been some discussion around the availability of information to help coordinate activities effectively in the disaster response phase. To see what data was already...

●●●●●
21 November 2013 Neil Ashton

Labs newsletter: 21 November, 2013

This week, Labs members gathered in an online hangout to discuss what they’ve been up to and what’s next for Labs. This special edition of the newsletter recaps that hangout for those who weren’t there (or who want a reminder). Data Pipes update Last week...

●●●●●
19 November 2013 Rufus Pollock

Bad Data: real-world examples of how not to do data

We’ve just started a mini-project called Bad Data. Bad Data provides real-world examples of how not to publish data. It showcases the poorly structured, the mis-formatted, and the just plain ugly. This isn’t about being critical but about educating—providing examples of how not to do...

●●●●●
14 November 2013 Neil Ashton

Labs newsletter: 14 November, 2013

Labs was bristling with discussion and creation this week, with major improvements to two projects, interesting conversations around a few others, and an awesome new blog post. Data Pipes: lots of improvements Data Pipes is a Labs project that provides a web API for a...

●●●●●
11 November 2013 Tarek Amr

Natural Language Processing using Python

This weekend the Google Developer Group in Cairo arranged two days of workshops followed by a hackathon. During this event, I organized a workshop about NLTK and the use of Python in Natural Language Processing (NLP). The session’s slides can be found here. The beauty of NLP...

●●●●●
07 November 2013 Neil Ashton

Labs newsletter: 7 November, 2013

There was lots of interesting activity around Labs this week, with two launched projects, a new initiative in the works, and an Open Data Maker Night in London. Webshot: online screenshot service webshot.okfnlabs.org, an online service for taking screenshots of websites, is now live, thanks...

●●●●●
06 November 2013 Rufus Pollock

Tracking Issues with Data the Simple Way

Data Issues is a prototype initiative to track “issues” with data using a simple bug tracker—in this case, GitHub Issues. We’ve all come across “issues” with data, whether it’s “data” that turns out to be provided as a PDF, the many ways to badly format...

●●●●●
17 October 2013 Anastasios Ventouris

A Python guide for open data file formats

If you are an open data researcher you will need to handle a lot of different file formats from datasets. Sadly, most of the time, you don’t have the opportunity to choose which file format is the best for your project, but you have to...

●●●●●
11 October 2013 Neil Ashton

Introducing TimeMapper - Create Elegant TimeMaps in Seconds

TimeMapper lets you create elegant and embeddable timemaps quickly and easily from a simple spreadsheet. A timemap is an interactive timeline whose items connect to a geomap. Creating a timemap with TimeMapper is as easy as filling in a spreadsheet template and copying its URL....

●●●●●
11 October 2013 Michael Bauer

Datapackageproxy - work with datapackages in your browser

Datapackages are a neat idea along the “using data like we use code” way. While Tryggvi has created a nice Python module to handle datapackages, there is a problem using datapackages in JavaScript. In an ideal world I’d just call something like d3.csv() on...

●●●●●
07 October 2013 Rufus Pollock

PublicBodies.org - Update no. 2

Herewith is a report on recent improvements to PublicBodies.org, our Open Knowledge Foundation Labs project to provide “a URL (and information) on every public body” - that’s every government-funded agency, department or organization. New data New data contributed over the last couple...

●●●●●
04 October 2013 Rufus Pollock

Data as Code Deja-Vu

Someone just pointed me at this post from Ben Balter about Data as Code in which he emphasizes the analogies between data and code (and especially open data and open-source – e.g. “data is where code was 2 decades ago” …). I was delighted to...

●●●●●
24 September 2013 Michael Bauer

Full stack datavis - scraperwiki, d3 and github.

The city of Vienna started releasing waiting times for some of its service offices recently. I followed my usual hunch and just wrote a small script on scraperwiki that stows away the JSON released by the city not knowing yet what to do with it....

●●●●●
16 September 2013 Michael Bauer

Using d3 as user input

Recently, I was at Chicas Poderosas in Bogota - the three-day event featured talks on two days and a hackday on the last. During the event I was approached by Natalia, an industrial designer who introduced a project of hers: Electrocardiogr_ama. She wanted to...

●●●●●
11 September 2013 Rufus Pollock

Data Pipes - streaming online data transformations

Data Pipes provides an online service built in NodeJS to do simple data transformations – deleting rows and columns, find and replace, filtering, viewing as HTML – and, furthermore, to connect these transformations together Unix pipes style to make more complex transformations. Because Data Pipes...

●●●●●
27 August 2013 Yaron Koren

Miga, a new app generator for structured data

I’m pleased to announce the Miga Data Viewer, or Miga, an open source tool I created that lets you create a web/mobile app nearly automatically from a set of CSV data. There are already various applications/frameworks that provide a JavaScript-enabled front-end for structured data -...

●●●●●
22 August 2013 Pierre-Yves Vandenbussche

How a bad experience made an OKFN labs project

From theory to experimentation Back in November 2010, I faced a problem while teaching my students about the Semantic Web. I wanted to convey the idea that Semantic Web technologies can break down the barriers between dataset silos on the Web and simplify the publication...

●●●●●
14 August 2013 Michael Bauer

ropenspending - accessing the OpenSpending API through R

Tonight a couple of us were having a discussion on the OpenSpending IRC channel on how we can promote and better document the usage of the API. Tony had already begun to work on OpenSpending using R. I had previously done so as well. This...

●●●●●
08 August 2013 Paul Fitzpatrick

Diffing and patching tabular data

A few years ago at the Eastern Conference for Workplace Democracy in New Hampshire, a bunch of friends chatting on a grassy knoll realized they were all working on overlapping directories of their communities, and decided to pool their efforts. They tracked down some techies...

●●●●●
06 August 2013 Daniel Lombraña González

Mapping Antimatter tracks with CrowdCrafting.org

This last weekend, CERN hosted a very special event: the 2nd CERN Summer Student Webfest organized by the Citizen Cyberscience Centre. The Webfest invites CERN summer students to participate in a 48-hour marathon of hacking new applications, tools, games, etc. about physics. This year, I...

●●●●●
06 August 2013 Neil Ashton

data.okfn.org - update no. 2

data.okfn.org is the Labs’ repository of high-quality, easy-to-use open data. This update summarizes some of the improvements to data.okfn.org that have taken place over the past two months. New tools Several tools which make it easier to use the Data Package standard are now operational....

●●●●●
31 July 2013 Daniel Lombraña González

Analyzing Icelandic conviction rates with CrowdCrafting.org

CrowdCrafting.org hosts a wide variety of applications that range from science to humanities. Since the official launch of CrowdCrafting.org, lots of applications have been created, but one of them has done a really impressive job: Héraðsdómar - sýknað eða sakfellt. Héraðsdómar - sýknað eða...

●●●●●
28 July 2013 Michael Bauer

Making puzzles out of Shapefiles - bringing Open Data to the physical world

For a while I’ve been thinking about how to make Open Data more tangible. Even with great visualizations, it tends to remain stuck in computers and smartphones. Recently, I had the idea to start taking geodata, released by cities, and start making it into physical...

●●●●●
23 July 2013 Charalampos Bratsas

Apps using DBpedia Wikipedia from Open Knowledge Foundation Greece

Having developed the Greek DBpedia, the first Internationalized DBpedia, OKFN Greece is now involved in the OKFN Labs by introducing three applications using DBpedia. 1. DBpedia Spotlight DBpedia Spotlight is an application that automatically spots and disambiguates words or phrases of text documents that might...

●●●●●
11 July 2013 Daniel Lombraña González

Spanish Party Financing Scandals - CrowdSourcing Data Extraction with CrowdCrafting

Spanish society has been bombarded recently with a flurry of news stories about possible cases of corruption in the major political parties like the Partido Socialista Obrero Español and the Partido Popular. In January of 2013 the party that rules the country, Partido Popular (PP),...

●●●●●
09 July 2013 Neil Ashton

PublicBodies.org progress

There have been many new developments with PublicBodies.org, the Labs project which aims to provide “a URL for every part of government”, since the last update on the Labs blog. The news includes: a new and improved backend; a push for integration with Nomenklatura; discussion...

●●●●●
08 July 2013 Rufus Pollock

Open Data Maker Night London No 3 - Tuesday 16th July

The next Open Data Maker Night London will be on Tuesday 16th July 6-9pm (you can drop in any time during the evening). Like the last two it is kindly hosted by the wonderful Centre for Creative Collaboration, 16 Acton Street, London. When: Tuesday 16th...

●●●●●
08 July 2013 Mark Brough

Open Data QA - the Aid Transparency Tracker

Back in April, I wrote on the Open Knowledge Foundation main blog to launch the first component of our Aid Transparency Tracker, a tool to analyse aid donors’ commitments to publish more open data about their aid activities. At the end of that post, I...

●●●●●
01 July 2013 Rufus Pollock

Querying ElasticSearch - A Tutorial and Guide

ElasticSearch is a great open-source search tool that’s built on Lucene (like SOLR) but is natively JSON + RESTful. It’s been used quite a bit at the Open Knowledge Foundation over the last few years. Plus, as it’s easy to set up locally it’s an attractive...

●●●●●
28 June 2013 Neil Ashton

Basic data cleaning with Data Explorer

Data Explorer is a client-side web application for data processing and visualization. With Data Explorer, you can import data, transform it with JavaScript code, and visualize it on a graph or a map – all fully within the browser and with your data and code...

●●●●●
28 May 2013 Rufus Pollock

data.okfn.org - update no. 1

This is the first of regular updates on Labs project http://data.okfn.org/ and summarizes some of the changes and improvements over the last few weeks. 1. Refactor of site layout and focus. We’ve done a refactor of the site to have stronger focus on the data....

●●●●●
21 May 2013 Sam Leon

Open Humanities Hangout - Open Correspondence and the Letter Net

Our next Open Humanities Hangout will take place next Tuesday, 28th May. This is the latest in the series of regular hangouts we’ve been organizing over the past few months with people interested in tapping into the growing amount of open cultural data and...

●●●●●
16 May 2013 Friedrich Lindenberg

Nomenklatura - Data Matching and Reconciliation Made Easy

Nomenklatura is a simple service that makes it easy to maintain a canonical list of entities such as persons, companies or even streets and to match messy input, such as their names, against that canonical list – for example, matching Acme Widgets, Acme Widgets Inc...

●●●●●
01 May 2013 Rufus Pollock

Update on PublicBodies.org - a URL for every part of Government

This is an update on PublicBodies.org - a Labs project whose aim is to provide a “URL for every part of Government”: http://publicbodies.org/ PublicBodies.org is a database and website of “Public Bodies” – that is Government-run or controlled organizations (which may or may not have...

●●●●●
11 April 2013 Rufus Pollock

Quick and Dirty Analysis on Large CSVs

I’m playing around with some large(ish) CSV files as part of an OpenSpending-related data investigation to look at UK government spending last year – example question: which companies were the top 10 recipients of government money? (More details can be found in this issue...

●●●●●
03 April 2013 Rufus Pollock

Cleaning up Greater London Authority Spending (for OpenSpending)

I’ve been working to get Greater London Authority spending data cleaned up and into OpenSpending. Primary motivation comes from this question: Which companies got paid the most (and for doing what)? (see this issue for more) I wanted to share where I’m up to and...

●●●●●
30 March 2013 Friedrich Lindenberg

sqlaload, an ETL wrapper for SQLAlchemy

sqlaload is a small library that I use to handle databases in Python data processing. In many projects, your process starts with very messy data (something you’ve scraped or loaded from a hand-prepared Excel sheet). In subsequent stages, you gradually add cleaned values in new...

●●●●●
27 March 2013 Sam Leon

Next Steps for Textus

At the Culture Labs hangout yesterday we wrote up the plans for the next steps for Textus we have been discussing over the last few months. The result is this slide deck overview. It both introduces Textus and outlines next steps (slide 12 onwards). Key...

●●●●●
18 March 2013 Rufus Pollock

Progress on the Data Explorer

This is an update on progress with the Data Explorer (aka Data Transformer). Progress is best seen from this demo which takes you on a tour of house prices and the difference between real and nominal values. More information on recent developments can be found...

●●●●●
26 February 2013 Rufus Pollock

Recline JS - Componentization and a Smaller Core

Over time Recline JS has grown. In particular, since the first public announcement of Recline last summer we’ve had several people producing new backends and views (e.g. backends for Couch, a view for d3, a map view based on Ordnance Survey’s tiles etc etc). As...

●●●●●
20 February 2013 Daniel Lombraña González

Exporting PyBossa data to CSV or JSON with one click

I’m really happy to announce that today we have finally added a feature that will allow you to export your data into CSV format with just one click (we also support the same for JSON). For this purpose, all the applications in PyBossa now feature...

●●●●●
29 January 2013 Daniel Lombraña González

Mozilla FirefoxOS App Days & Crowdcrafting.org

Last Saturday, the 26th of January, Mozilla held a hack day, the #FirefoxOSAppDay, in parallel in 25 cities all over the world, about creating new web applications for their new FirefoxOS mobile OS and the desktop web browser (this is still in beta and alpha mode!)....

●●●●●
28 January 2013 Daniel Lombraña González

PyBossa.JS or how you can easily create new PyBossa applications

Over the last few weeks we have been working hard to make it easier to develop new PyBossa applications. For this reason, we are happy to announce a new version of PyBossa.JS. This new version introduces several improvements: Creating an app is much easier! You...

●●●●●
25 January 2013 Friedrich Lindenberg

Journoid, data notifications

At the Open Interests hackday in November, a discussion with Martin Stabe from the FT’s interactive desk led to a prototype of Journoid. The idea is to monitor changing on-line datasets for remarkable information, like earthquakes, procurement in a particular industry or a close parliamentary vote....

●●●●●
15 January 2013 Rufus Pollock

Web Scraping with CSS Selectors in Node using JSDOM or Cheerio

I’ve traditionally used Python for web scraping but I’d been increasingly thinking about using Node given that it is pure JS and therefore could be a more natural fit when getting info out of web pages. In particular, when my first steps when looking to...

●●●●●
08 January 2013 Rufus Pollock

Archiving Twitter the Hacky Way

There are many circumstances where you want to archive tweets - maybe just from your own account or perhaps for a hashtag for an event or topic. Unfortunately Twitter search queries do not give data more than 7 days old and for a given...

●●●●●
13 December 2012 Stefan Wehrmeyer

Bundes-Git – German Laws on GitHub

If you compare software code and legislation you can find many similarities: both are big bodies of text spread over multiple units (laws/files). The total amount of text inevitably grows bigger over time with many small changes to existing parts while most of the corpus...

●●●●●
12 December 2012 Gregor Aisch

Speeding Up Your PyBossa App

Thanks to the free crowd-crafting tool PyBossa, nowadays the biggest challenge for successful crowd-sourcing is engaging users to participate in tasks and keeping that motivation at a high level over time. Therefore, the user experience of crowd-sourcing apps plays a crucial role. After participating...

●●●●●
04 December 2012 Rufus Pollock

Javascript Timeline Libraries - A Review

This post is a rough and ready overview of various javascript timeline libraries that arose from research in creating a timeline view for Recline JS. Note this material hung around on my hard disk for a few months so some of it may already be...

●●●●●
29 November 2012 Liliana Bounegru

Following Money and Influence in the EU - the Open Interests Hackathon

Making sense of massive datasets that document the processes of lobbying and public procurement at European Union level is not an easy task. Yet a group of 25 journalists, developers, graphic designers and activists worked together at the Open Interests Europe hackathon last weekend to...

●●●●●
13 November 2012 Vitor Baptista

Scraping Data Behind a CAPTCHA

How much does the highest paid person in the Brazilian Federal Senate earn? That’s the question I asked myself a few weeks ago, and one that should be easy to answer. In Brazil, every public body must publish its employees’ salaries online, but some do...

●●●●●
01 November 2012 Rufus Pollock

Recline JS Search Demo

We’ve recently finished a demo for ReclineJS showing how it can be used to build JS-based (ajax-style) search interfaces in minutes (or even seconds!): http://reclinejs.com/demos/search/ Because of Recline’s pluggable backends you get out of the box support for data sources such as SOLR, Google Spreadsheet,...

●●●●●
23 October 2012 Nigel Babu

Labs Show and Tell - 26th October!

We’re having the next Show and Tell on Friday, 26 October at 2:30 pm BST via Google Hangout on Air. As usual, the URL will be posted on OKFN Labs’ G+ Page. If you’d like to present, add your name to the list. Remember, #okfn...

●●●●●
22 October 2012 Friedrich Lindenberg

Wrangling dirty data with messytables.

One of the largest data collection projects we have done so far has been the consolidation of the UK’s departmental expenditure. Over 370 different government entities have published a total of more than 7000 spreadsheets. Many of those have obviously been hand-crafted or at least...

●●●●●
15 October 2012 Velichka Dimitrova

Open Interests Hackathon in London, 24-25 November

The European Journalism Centre and the Open Knowledge Foundation, sponsored by Knight-Mozilla OpenNews, invite you to the Open Interests Hackathon to track the interests and money flows which shape European policy. When: 24-25 November Where: Google Campus Cafe, 4-5 Bonhill Street, EC2A 4BX London...

●●●●●
10 October 2012 Nigel Babu

Labs Show and Tell - All Welcome!

Built an app or tool you want to show people? Played around with some interesting data? Know of a new development people should know about? Want to find out what others are doing? Come to the Show and Tell this Friday and share what you...

●●●●●
25 September 2012 Friedrich Lindenberg

Data Catalogues are People!

Last week, Matej Kurian published a message on the okfn-labs mailing list, describing the various sources he had discovered for machine-readable excerpts of the EU’s joint procurement system, TED. What struck me about this message was that, apparently, this polite and brilliant policy wonk had...

●●●●●
10 September 2012 Rufus Pollock

WikipediaJS - accessing Wikipedia article data through Javascript

WikipediaJS is a simple JS library for accessing information in Wikipedia articles such as dates, places, abstracts etc. The library is the work of Labs member Rufus Pollock. In essence, it is a small wrapper around the data and APIs of the DBpedia project and...

●●●●●
08 August 2012 Rufus Pollock

Timeliner - Make Nice Timelines Fast

As part of the Recline launch I quickly put together some very simple demo apps, one of which was called Timeliner: http://timeliner.reclinejs.com/ This uses the Recline timeline component (which itself is a relatively thin wrapper around the excellent Verite timeline) plus the Recline Google docs...

●●●●●
31 July 2012 Rufus Pollock

The Data Transformer - Cleaning Up Data in the Browser

This is a brief post to announce an alpha prototype version of the Data Transformer, an app to let you clean up data in the browser using javascript: http://transformer.datahub.io/ 2m overview video: What does this app do? You load a CSV file from github (fixed...

●●●●●
14 July 2012 Daniel Lombraña González

Displaying PyBossa Urban Parks Data on a 3D Globe

Labs member Daniel Lombraña González has built a 3-d globe showing the locations of urban parks around the world as located by volunteers using the PyBossa Urban Park geocoding app: http://teleyinex.github.com/pybossa-urbanpark-globe/ — (Source code) Background The Urban Parks geo-coding application is a micro-tasking app running...

●●●●●
10 July 2012 Friedrich Lindenberg

dataissues.org - public issue tracking for data defects

On June 21st, the Knight News Challenge Round on Data ended. The day before, Rufus, Ross and I sat down to write out some ideas that we’d been discussing for a while. While we submitted proposals for Grano and DataProtocols, we decided to hold back...

●●●●●
09 July 2012 Friedrich Lindenberg

Grano - social network analysis for advocates and journalists.

On June 21st, the Knight News Challenge Round on Data ended. The day before, Rufus, Ross and I sat down to write out some ideas that we’d been discussing for a while. The first idea I want to repost here is a proposal for Grano,...

●●●●●

Have your say!

Do you have a topic that you'd like to write about? We love guest posts. Here's how to submit one »
