Open Knowledge Labs: The Data Wrangling Blog
2022-08-18T16:35:53+00:00
http://okfnlabs.org/
Open Knowledge Labs
Introduction to Statistics With Data Packages and Gonum
2019-08-01T00:00:00+00:00
http://okfnlabs.org/blog/2019/08/01/intro-statistics-datapackage-gonum
<p>After 6 years at Google, Daniel Fireman is currently a Ph.D. student, professor and activist for government transparency and accountability in the Northeast of Brazil. He was one of the 2017 Frictionless Data Tool Fund grantees and implemented the core Frictionless Data specifications in the <a href="https://golang.org/">Go</a> programming language: <a href="https://github.com/frictionlessdata/datapackage-go">datapackage</a> and <a href="https://github.com/frictionlessdata/tableschema-go">tableschema</a>, which he still maintains. You can read more about this in <a href="https://frictionlessdata.io/articles/daniel-fireman/">his grantee profile</a>.</p>
<p>Since their first release in 2017, we’ve been improving the <a href="https://github.com/frictionlessdata/datapackage-go">datapackage</a> and <a href="https://github.com/frictionlessdata/tableschema-go">tableschema</a> packages. Besides fixing bugs, we’ve tried to make it easier to use data packages together with statistical and plotting libraries like <a href="https://gonum.org/">Gonum</a>. This post shows an example of such usage and was inspired by <a href="https://sbinet.github.io/posts/2017-10-04-intro-to-stats-with-gonum/">this post</a> by <a href="https://github.com/sbinet">Sebastian Binet</a>.</p>
<hr />
<p>Our goal in this tutorial is to load a data package from the web and use <a href="https://gonum.org/">Gonum</a> to calculate some basic statistics.</p>
<h2 id="go-data-packages--gonum">Go, Data Packages & Gonum</h2>
<p><a href="https://github.com/frictionlessdata/datapackage-go/tree/master/datapackage">datapackage</a> is <em>“a package for working with <a href="http://specs.frictionlessdata.io/data-package/">Data Packages</a>“</em>. A Data Package consists of:</p>
<ul>
<li>Metadata that describes the structure and contents of the package</li>
<li>Resources such as data files that form the contents of the package</li>
</ul>
<p><a href="https://gonum.org/">Gonum</a> is <em>“a set of packages designed to make writing numeric and scientific algorithms productive, performant and scalable.”</em></p>
<p>Before being able to use <code class="language-plaintext highlighter-rouge">datapackage</code> and <code class="language-plaintext highlighter-rouge">Gonum</code>, we need to install <a href="https://golang.org/">Go</a>. We can download and install the <code class="language-plaintext highlighter-rouge">Go</code> toolchain for a variety of platforms and operating systems from <a href="https://golang.org/dl">golang.org/dl</a>. This post assumes the installation of version 1.11 or newer.</p>
<p>After installing Go, the toolchain will download <code class="language-plaintext highlighter-rouge">Gonum</code>, <code class="language-plaintext highlighter-rouge">datapackage</code> and all of their dependencies automatically the first time we run the program.</p>
<h2 id="reading-datapackage">Reading Datapackage</h2>
<p>In this post, we are using a <a href="https://frictionlessdata.io/specs/tabular-data-package/">Tabular Data Package</a> containing the periodic table. The package descriptor (<a href="https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json">datapackage.json</a>) and contents (<a href="https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/data.csv">data.csv</a>) are stored on <a href="http://github.com/">GitHub</a>. This dataset includes the atomic number, symbol, element name, atomic mass, and the metallicity of the element. Let’s start by taking a quick look at the header and the first rows.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// file: stats.go</span>
<span class="k">package</span> <span class="n">main</span>
<span class="k">import</span> <span class="p">(</span>
<span class="s">"fmt"</span>
<span class="s">"github.com/frictionlessdata/datapackage-go/datapackage"</span>
<span class="p">)</span>
<span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">pkg</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">datapackage</span><span class="o">.</span><span class="n">Load</span><span class="p">(</span><span class="s">"https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json"</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">res</span> <span class="o">:=</span> <span class="n">pkg</span><span class="o">.</span><span class="n">GetResource</span><span class="p">(</span><span class="s">"data"</span><span class="p">)</span>
<span class="n">table</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">res</span><span class="o">.</span><span class="n">ReadAll</span><span class="p">()</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="k">for</span> <span class="n">i</span> <span class="o">:=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="m">4</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span> <span class="p">{</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="n">table</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<h2 id="gonum-and-statistics">Gonum and statistics</h2>
<p>Gonum provides many statistical functions. Let’s use it to calculate the mean, median, standard deviation and variance of the atomic masses.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// file: stats.go</span>
<span class="k">package</span> <span class="n">main</span>
<span class="k">import</span> <span class="p">(</span>
<span class="s">"fmt"</span>
<span class="s">"math"</span>
<span class="s">"sort"</span>
<span class="s">"github.com/frictionlessdata/datapackage-go/datapackage"</span>
<span class="s">"github.com/frictionlessdata/tableschema-go/csv"</span>
<span class="s">"gonum.org/v1/gonum/stat"</span>
<span class="p">)</span>
<span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">pkg</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">datapackage</span><span class="o">.</span><span class="n">Load</span><span class="p">(</span><span class="s">"https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json"</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="k">var</span> <span class="n">masses</span> <span class="p">[]</span><span class="kt">float64</span>
<span class="n">res</span> <span class="o">:=</span> <span class="n">pkg</span><span class="o">.</span><span class="n">GetResource</span><span class="p">(</span><span class="s">"data"</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">res</span><span class="o">.</span><span class="n">CastColumn</span><span class="p">(</span><span class="s">"atomic mass"</span><span class="p">,</span> <span class="o">&</span><span class="n">masses</span><span class="p">,</span> <span class="n">csv</span><span class="o">.</span><span class="n">LoadHeaders</span><span class="p">());</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Printf</span><span class="p">(</span><span class="s">"data: %v</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">masses</span><span class="p">)</span>
<span class="n">sort</span><span class="o">.</span><span class="n">Float64s</span><span class="p">(</span><span class="n">masses</span><span class="p">)</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Printf</span><span class="p">(</span><span class="s">"data: %v (sorted)</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">masses</span><span class="p">)</span>
<span class="c">// computes the weighted mean of the dataset.</span>
<span class="c">// we don't have any weights (ie, all weights are 1)</span>
<span class="c">// so we just pass a nil slice.</span>
<span class="n">mean</span> <span class="o">:=</span> <span class="n">stat</span><span class="o">.</span><span class="n">Mean</span><span class="p">(</span><span class="n">masses</span><span class="p">,</span> <span class="no">nil</span><span class="p">)</span>
<span class="c">// computes the median of the dataset.</span>
<span class="c">// here as well, we pass a nil slice as weights.</span>
<span class="n">median</span> <span class="o">:=</span> <span class="n">stat</span><span class="o">.</span><span class="n">Quantile</span><span class="p">(</span><span class="m">0.5</span><span class="p">,</span> <span class="n">stat</span><span class="o">.</span><span class="n">Empirical</span><span class="p">,</span> <span class="n">masses</span><span class="p">,</span> <span class="no">nil</span><span class="p">)</span>
<span class="n">variance</span> <span class="o">:=</span> <span class="n">stat</span><span class="o">.</span><span class="n">Variance</span><span class="p">(</span><span class="n">masses</span><span class="p">,</span> <span class="no">nil</span><span class="p">)</span>
<span class="n">stddev</span> <span class="o">:=</span> <span class="n">math</span><span class="o">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="n">variance</span><span class="p">)</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Printf</span><span class="p">(</span><span class="s">"mean= %v</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">mean</span><span class="p">)</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Printf</span><span class="p">(</span><span class="s">"median= %v</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">median</span><span class="p">)</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Printf</span><span class="p">(</span><span class="s">"variance= %v</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">variance</span><span class="p">)</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Printf</span><span class="p">(</span><span class="s">"std-dev= %v</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">stddev</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The program above performs some basic statistical operations on our dataset:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$></span> go run stats.go
... dependency download logs ...
data: <span class="o">[</span>1.00794 4.002602 6.941 9.012182 10.811 12.0107 14.0067 15.9994 18.9984032 20.1797 22.98976928 24.305 26.9815386 28.0855 30.973762 32.065 35.453 39.948 39.0983 40.078 44.955912 47.867 50.9415 51.9961 54.938045 55.845 58.933195 58.6934 63.546 65.38 69.723 72.64 74.9216 78.96 79.904 83.798 85.4678 87.62 88.90585 91.224 92.90638 95.96 98 101.07 102.9055 106.42 107.8682 112.411 114.818 118.71 121.76 127.6 126.90447 131.293 132.9054519 137.327 138.90547 140.116 140.90765 144.242 145 150.36 151.964 157.25 158.92535 162.5 164.93032 167.259 168.93421 173.054 174.9668 178.49 180.94788 183.84 186.207 190.23 192.217 195.084 196.966569 200.59 204.3833 207.2 208.9804 209 210 222 223 226 227 232.03806 231.03588 238.02891 237 244 243 247 247 251 252 257 258 259 262 267 268 271 272 270 276 281 280 285 284 289 288 293 294 294]
data: <span class="o">[</span>1.00794 4.002602 6.941 9.012182 10.811 12.0107 14.0067 15.9994 18.9984032 20.1797 22.98976928 24.305 26.9815386 28.0855 30.973762 32.065 35.453 39.0983 39.948 40.078 44.955912 47.867 50.9415 51.9961 54.938045 55.845 58.6934 58.933195 63.546 65.38 69.723 72.64 74.9216 78.96 79.904 83.798 85.4678 87.62 88.90585 91.224 92.90638 95.96 98 101.07 102.9055 106.42 107.8682 112.411 114.818 118.71 121.76 126.90447 127.6 131.293 132.9054519 137.327 138.90547 140.116 140.90765 144.242 145 150.36 151.964 157.25 158.92535 162.5 164.93032 167.259 168.93421 173.054 174.9668 178.49 180.94788 183.84 186.207 190.23 192.217 195.084 196.966569 200.59 204.3833 207.2 208.9804 209 210 222 223 226 227 231.03588 232.03806 237 238.02891 243 244 247 247 251 252 257 258 259 262 267 268 270 271 272 276 280 281 284 285 288 289 293 294 294] <span class="o">(</span>sorted<span class="o">)</span>
<span class="nv">mean</span><span class="o">=</span> 146.43746355915252
<span class="nv">median</span><span class="o">=</span> 140.90765
<span class="nv">variance</span><span class="o">=</span> 8026.634755570227
std-dev<span class="o">=</span> 89.59148818704948
</code></pre></div></div>
<p>Thanks for reading!</p>
<p>We welcome your feedback and questions via our <a href="http://gitter.im/frictionlessdata/chat">Frictionless Data Gitter chat</a> or via <a href="https://github.com/frictionlessdata/datapackage-go/issues">GitHub issues</a> on the datapackage-go repository.</p>
Daniel Fireman
Announcing datapackage-pipelines version 2.0
2018-10-18T00:00:00+00:00
http://okfnlabs.org/blog/2018/10/18/announcing-datapackage-pipelines-v2
<p>Today we’re releasing a major new version of <a href="https://github.com/frictionlessdata/datapackage-pipelines">datapackage-pipelines</a>: version 2.0.0.</p>
<p>This new version marks a big step forward in realizing the Data Factory concept and framework. We integrated <em>datapackage-pipelines</em> with its younger sister <em><a href="https://github.com/datahq/dataflows">dataflows</a></em>, and created a set of common building blocks you can now use interchangeably between the two frameworks.</p>
<p><img src="/img/posts/dataflows-and-dpp.png" alt="diagram showing the relationship between dataflows and datapackage-pipelines" />
<br />
<em>figure 1: diagram showing the relationship between dataflows and datapackage-pipelines</em></p>
<p>It’s now possible to bootstrap and develop flows using <em>dataflows</em>, and then run these flows as-is on a <em>datapackage-pipelines</em> server - or effortlessly convert them to the declarative yaml syntax.</p>
<p>Install datapackage-pipelines using <code class="language-plaintext highlighter-rouge">pip</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip install datapackage-pipelines
</code></pre></div></div>
<h2 id="what-changed">What Changed?</h2>
<h3 id="new-low-level-api-and-stdout-redirect">New Low-level API and stdout Redirect</h3>
<p>One big change (and a long-standing request) is that processors are now allowed to print from inside their processing code without interfering with the correct operation of the pipeline. All prints are automatically converted to logging.info(…) calls. This behaviour is enabled when using the new low-level API. The main change we’ve introduced is that ingest() is now a context manager. This means that you should now run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># New style for ingest and spew
with ingest() as ctx:
    # Do stuff with datapackage and resource_iterator
    spew(ctx.datapackage,
         ctx.resource_iterator,
         ctx.stats)
</code></pre></div></div>
<p>Backward compatibility is maintained for the old way of using ingest(), so you don’t have to update all your code immediately.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># This still works, but won’t handle print()s
parameters, datapackage, resource_iterator = ingest()
spew(datapackage, resource_iterator)
</code></pre></div></div>
<h3 id="dataflows-integration">Dataflows integration</h3>
<p>There’s a new integration with dataflows which allows running Flows directly from the <code class="language-plaintext highlighter-rouge">pipeline-spec.yaml</code> file.
You can integrate dataflows within pipeline specs using the <code class="language-plaintext highlighter-rouge">flow</code> attribute instead of <code class="language-plaintext highlighter-rouge">run</code>. For example, given the following flow file, saved under <code class="language-plaintext highlighter-rouge">my-flow.py</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataflows import Flow, dump_to_path, load, update_package

def flow(parameters, datapackage, resources, stats):
    stats['multiplied_fields'] = 0

    def multiply(field, n):
        def step(row):
            row[field] = row[field] * n
            stats['multiplied_fields'] += 1
        return step

    return Flow(update_package(name='my-datapackage'),
                load((datapackage, resources)),
                multiply('my-field', 2))
</code></pre></div></div>
<p>And a <code class="language-plaintext highlighter-rouge">pipeline-spec.yaml</code> in the same directory:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>my-flow:
  pipeline:
    - run: load_resource
      parameters:
        url: http://example.com/my-datapackage/datapackage.json
        resource: my-resource
    - flow: my-flow
    - run: dump.to_path
</code></pre></div></div>
<p>You can run the pipeline using <code class="language-plaintext highlighter-rouge">dpp run my-flow</code>.</p>
<p>If you want to wrap a flow inside a processor, you can use the <code class="language-plaintext highlighter-rouge">spew_flow</code> helper function:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataflows import Flow
from datapackage_pipelines.wrapper import ingest
from datapackage_pipelines.utilities.flow_utils import spew_flow

def flow(parameters):
    return Flow(
        # Flow processing comes here
    )

if __name__ == '__main__':
    with ingest() as ctx:
        spew_flow(flow(ctx.parameters), ctx)
</code></pre></div></div>
<h3 id="standard-processor-refactoring">Standard Processor Refactoring</h3>
<p>We refactored all standard processors to use their counterparts from dataflows, thus removing code duplication and allowing us to move forward more quickly. As a result, we’re also introducing a couple of <strong>new</strong> processors:</p>
<ul>
<li>
<p><code class="language-plaintext highlighter-rouge">load</code> - Loads and streams a new resource (or resources) into the data package. It’s based on the dataflows processor with the same name, so it supports loading from local files, remote URL, data packages, locations in environment variables etc. For more information, consult the <a href="https://github.com/datahq/dataflows/blob/master/PROCESSORS.md#load">dataflows documentation</a>.</p>
</li>
<li>
<p><code class="language-plaintext highlighter-rouge">printer</code> - Smart printing processor for displaying the contents of the stream - comes in handy for development or monitoring a pipeline.It will not print all rows, but an logarithmically sparse sample - in other words, it will print rows 1-20, 100-110, 1000-1010 etc. It also prints the last 10 rows of the dataset.</p>
</li>
</ul>
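<p>Since these new processors mirror their <em>dataflows</em> counterparts, the quickest way to get a feel for them is in a plain flow. Here is a minimal sketch (the file name is illustrative) that loads a CSV and prints a sample of the stream:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataflows import Flow, load, printer

Flow(
    load('my-data.csv'),  # local path or remote URL
    printer()             # prints a logarithmically sparse sample of the rows
).process()
</code></pre></div></div>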
<h2 id="deprecations">Deprecations</h2>
<p>We are <strong>deprecating</strong> a few processors — you can still use them as usual but they will be removed in the next major version (3.0):</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">add_metadata</code> - was renamed to <code class="language-plaintext highlighter-rouge">update_package</code> for consistency</li>
<li><code class="language-plaintext highlighter-rouge">add_resource</code> and <code class="language-plaintext highlighter-rouge">stream_remote_resources</code> - are being replaced by the <code class="language-plaintext highlighter-rouge">load</code></li>
<li><code class="language-plaintext highlighter-rouge">dump.to_path</code>, <code class="language-plaintext highlighter-rouge">dump.to_zip</code>, <code class="language-plaintext highlighter-rouge">dump.to_sql</code> - are being deprecated - you should use <code class="language-plaintext highlighter-rouge">dump_to_path</code>, <code class="language-plaintext highlighter-rouge">dump_to_zip</code> and <code class="language-plaintext highlighter-rouge">dump_to_sql</code> instead.
Note that <code class="language-plaintext highlighter-rouge">dump_to_path</code> and <code class="language-plaintext highlighter-rouge">dump_to_zip</code> lack some features that exist in the current processors — for example, custom file formatters and non-tabular file support. We might introduce some of that functionality into the new processors as well in the next versions - <em>in the meantime, please let us know what you think about these features and how badly you need them</em>.</li>
</ul>
<h2 id="the-road-ahead">The Road Ahead</h2>
<p>In the next versions we’re planning to further the integration of dataflows and datapackage-pipelines. We’re going to work on streamlining development and deployment as well as taking care of naming and documentation to harmonize all aspects of the dataflows ecosystem.
We’re also working on decomposing datapackage-pipelines into smaller, self-contained components. In this version we took apart the standard processor code and some supporting libraries (e.g. <code class="language-plaintext highlighter-rouge">kvstore</code>) and delegated them to external libraries.</p>
<h2 id="links-and-references">Links and References</h2>
<ul>
<li>Read more on datapackage-pipelines here: <a href="https://github.com/frictionlessdata/datapackage-pipelines">https://github.com/frictionlessdata/datapackage-pipelines</a></li>
<li>Read more on dataflows here: <a href="https://github.com/datahq/dataflows">https://github.com/datahq/dataflows</a></li>
<li>Read more on Data Factory here: <a href="/blog/2018/08/29/data-factory-data-flows-introduction.html">http://okfnlabs.org/blog/2018/08/29/data-factory-data-flows-introduction.html</a></li>
</ul>
<h2 id="contributors">Contributors</h2>
<p>Thanks to <a href="https://github.com/OriHoch">Ori Hoch</a> for contributing code and other invaluable assistance with this release.</p>
Adam Kariv
Data Factory & DataFlows - Tutorial
2018-08-30T00:00:00+00:00
http://okfnlabs.org/blog/2018/08/30/data-factory-data-flows-tutorial
<p><em>Data Factory is an open framework for building and running lightweight data processing workflows quickly and easily. We recommend reading <a href="/blog/2018/08/29/data-factory-data-flows-introduction.html">this introductory blogpost</a> to gain a better understanding of underlying Data Factory concepts before diving into the tutorial below.</em></p>
<hr />
<h2 id="learn-how-to-write-your-own-processing-flows">Learn how to write your own processing flows</h2>
<p>Let’s start with the traditional ‘hello, world’ example:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">dataflows</span> <span class="kn">import</span> <span class="n">Flow</span>
<span class="n">data</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">{</span><span class="s">'data'</span><span class="p">:</span> <span class="s">'Hello'</span><span class="p">},</span>
<span class="p">{</span><span class="s">'data'</span><span class="p">:</span> <span class="s">'World'</span><span class="p">}</span>
<span class="p">]</span>
<span class="k">def</span> <span class="nf">lowerData</span><span class="p">(</span><span class="n">row</span><span class="p">):</span>
<span class="n">row</span><span class="p">[</span><span class="s">'data'</span><span class="p">]</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s">'data'</span><span class="p">].</span><span class="n">lower</span><span class="p">()</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">Flow</span><span class="p">(</span>
<span class="n">data</span><span class="p">,</span>
<span class="n">lowerData</span>
<span class="p">)</span>
<span class="n">data</span><span class="p">,</span> <span class="o">*</span><span class="n">_</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">results</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="c1"># -->
# [
# [
# {'data': 'hello'},
# {'data': 'world'}
# ]
# ]
</span></code></pre></div></div>
<p>This very simple flow takes a list of <code class="language-plaintext highlighter-rouge">dict</code>s and applies a row processing function on each one of them.</p>
<p>We can load data from a file instead:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">dataflows</span> <span class="kn">import</span> <span class="n">Flow</span><span class="p">,</span> <span class="n">load</span>
<span class="c1"># beatles.csv:
# name,instrument
# john,guitar
# paul,bass
# george,guitar
# ringo,drums
</span>
<span class="k">def</span> <span class="nf">titleName</span><span class="p">(</span><span class="n">row</span><span class="p">):</span>
<span class="n">row</span><span class="p">[</span><span class="s">'name'</span><span class="p">]</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s">'name'</span><span class="p">].</span><span class="n">title</span><span class="p">()</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">Flow</span><span class="p">(</span>
<span class="n">load</span><span class="p">(</span><span class="s">'beatles.csv'</span><span class="p">),</span>
<span class="n">titleName</span>
<span class="p">)</span>
<span class="n">data</span><span class="p">,</span> <span class="o">*</span><span class="n">_</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">results</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="c1"># -->
# [
# [
# {'name': 'John', 'instrument': 'guitar'},
# {'name': 'Paul', 'instrument': 'bass'},
# {'name': 'George', 'instrument': 'guitar'},
# {'name': 'Ringo', 'instrument': 'drums'}
# ]
# ]
</span></code></pre></div></div>
<p>The source file can be a CSV file, an Excel file or a JSON file. You can use a local file name or a URL for a file hosted somewhere on the web.</p>
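<p>For example, pointing <code class="language-plaintext highlighter-rouge">load</code> at a remote copy of the same file is just a matter of swapping the file name for a URL (the URL below is a hypothetical placeholder):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataflows import Flow, load

f = Flow(
    load('https://example.com/beatles.csv'),  # hypothetical remote URL
)
data, *_ = f.results()
print(data)
</code></pre></div></div>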
<p>Data sources can be generators and not just lists or files. Let’s take as an example a very simple scraper:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">dataflows</span> <span class="kn">import</span> <span class="n">Flow</span>
<span class="kn">from</span> <span class="nn">xml.etree</span> <span class="kn">import</span> <span class="n">ElementTree</span>
<span class="kn">from</span> <span class="nn">urllib.request</span> <span class="kn">import</span> <span class="n">urlopen</span>
<span class="c1"># Get from Wikipedia the population count for each country
</span><span class="k">def</span> <span class="nf">country_population</span><span class="p">():</span>
<span class="c1"># Read the Wikipedia page and parse it using etree
</span> <span class="n">page</span> <span class="o">=</span> <span class="n">urlopen</span><span class="p">(</span><span class="s">'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'</span><span class="p">).</span><span class="n">read</span><span class="p">()</span>
<span class="n">tree</span> <span class="o">=</span> <span class="n">ElementTree</span><span class="p">.</span><span class="n">fromstring</span><span class="p">(</span><span class="n">page</span><span class="p">)</span>
<span class="c1"># Iterate on all tables, rows and cells
</span> <span class="k">for</span> <span class="n">table</span> <span class="ow">in</span> <span class="n">tree</span><span class="p">.</span><span class="n">findall</span><span class="p">(</span><span class="s">'.//table'</span><span class="p">):</span>
<span class="k">if</span> <span class="s">'wikitable'</span> <span class="ow">in</span> <span class="n">table</span><span class="p">.</span><span class="n">attrib</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'class'</span><span class="p">,</span> <span class="s">''</span><span class="p">):</span>
<span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">table</span><span class="p">.</span><span class="n">findall</span><span class="p">(</span><span class="s">'tr'</span><span class="p">):</span>
<span class="n">cells</span> <span class="o">=</span> <span class="n">row</span><span class="p">.</span><span class="n">findall</span><span class="p">(</span><span class="s">'td'</span><span class="p">)</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">cells</span><span class="p">)</span> <span class="o">></span> <span class="mi">3</span><span class="p">:</span>
<span class="c1"># If a matching row is found...
</span> <span class="n">name</span> <span class="o">=</span> <span class="n">cells</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">find</span><span class="p">(</span><span class="s">'.//a'</span><span class="p">).</span><span class="n">attrib</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'title'</span><span class="p">)</span>
<span class="n">population</span> <span class="o">=</span> <span class="n">cells</span><span class="p">[</span><span class="mi">2</span><span class="p">].</span><span class="n">text</span>
<span class="c1"># ... yield a row with the information
</span> <span class="k">yield</span> <span class="nb">dict</span><span class="p">(</span>
<span class="n">name</span><span class="o">=</span><span class="n">name</span><span class="p">,</span>
<span class="n">population</span><span class="o">=</span><span class="n">population</span>
<span class="p">)</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">Flow</span><span class="p">(</span>
<span class="n">country_population</span><span class="p">(),</span>
<span class="p">)</span>
<span class="n">data</span><span class="p">,</span> <span class="o">*</span><span class="n">_</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">results</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="c1"># --->
# [
# [
# {'name': 'China', 'population': '1,391,090,000'},
# {'name': 'India', 'population': '1,332,140,000'},
# {'name': 'United States', 'population': '327,187,000'},
# {'name': 'Indonesia', 'population': '261,890,900'},
# ...
# ]
# ]
</span></code></pre></div></div>
<p>This is nice, but we do prefer the numbers to be actual numbers and not strings.</p>
<p>In order to do that, let’s simply define their type to be numeric:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">dataflows</span> <span class="kn">import</span> <span class="n">Flow</span><span class="p">,</span> <span class="n">set_type</span>
<span class="k">def</span> <span class="nf">country_population</span><span class="p">():</span>
<span class="c1"># same as before
</span> <span class="p">...</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">Flow</span><span class="p">(</span>
<span class="n">country_population</span><span class="p">(),</span>
<span class="n">set_type</span><span class="p">(</span><span class="s">'population'</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="s">'number'</span><span class="p">,</span> <span class="n">groupChar</span><span class="o">=</span><span class="s">','</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">data</span><span class="p">,</span> <span class="o">*</span><span class="n">_</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">results</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="c1"># -->
# [
# [
# {'name': 'China', 'population': Decimal('1391090000')},
# {'name': 'India', 'population': Decimal('1332140000')},
# {'name': 'United States', 'population': Decimal('327187000')},
# {'name': 'Indonesia', 'population': Decimal('261890900')},
# ...
# ]
# ]
</span>
</code></pre></div></div>
<p>Data is automatically converted to the correct native Python type.</p>
<p>Apart from data-types, it’s also possible to set other constraints on the data. If the data fails validation (or does not fit the assigned data-type) an exception will be thrown - making this method highly effective for validating data and ensuring data quality.</p>
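<p>Here is a minimal sketch of that behaviour, using a deliberately invalid value (the data is made up):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataflows import Flow, set_type

bad_data = [
    {'name': 'Atlantis', 'population': 'not-a-number'},  # cannot be cast to a number
]

try:
    Flow(
        bad_data,
        set_type('population', type='number'),
    ).results()
except Exception as e:
    print('row failed validation:', e)
</code></pre></div></div>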
<p>What about large data files? In the above examples, the results are loaded into memory, which is not always preferable or acceptable. In many cases, we’d like to store the results directly onto a hard drive - without letting the machine’s RAM limit the amount of data we can process.</p>
<p>We do it by using <em>dump</em> processors:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">dataflows</span> <span class="kn">import</span> <span class="n">Flow</span><span class="p">,</span> <span class="n">set_type</span><span class="p">,</span> <span class="n">dump_to_path</span>
<span class="k">def</span> <span class="nf">country_population</span><span class="p">():</span>
<span class="c1"># same as before
</span> <span class="p">...</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">Flow</span><span class="p">(</span>
<span class="n">country_population</span><span class="p">(),</span>
<span class="n">set_type</span><span class="p">(</span><span class="s">'population'</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="s">'number'</span><span class="p">,</span> <span class="n">groupChar</span><span class="o">=</span><span class="s">','</span><span class="p">),</span>
<span class="n">dump_to_path</span><span class="p">(</span><span class="s">'country_population'</span><span class="p">)</span>
<span class="p">)</span>
<span class="o">*</span><span class="n">_</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">process</span><span class="p">()</span>
</code></pre></div></div>
<p>Running this code will create a local directory called <code class="language-plaintext highlighter-rouge">country_population</code>, containing two files:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>├── country_population
│ ├── datapackage.json
│ └── res_1.csv
</code></pre></div></div>
<p>The CSV file - <code class="language-plaintext highlighter-rouge">res_1.csv</code> - is where the data is stored. The <code class="language-plaintext highlighter-rouge">datapackage.json</code> file is a metadata file, holding information about the data, including its schema.</p>
<p>We can now open the CSV file with any spreadsheet program or code library supporting the CSV format - or using one of the <strong>data package</strong> libraries out there, like so:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">datapackage</span> <span class="kn">import</span> <span class="n">Package</span>
<span class="n">pkg</span> <span class="o">=</span> <span class="n">Package</span><span class="p">(</span><span class="s">'country_population/res_1.csv'</span><span class="p">)</span>
<span class="n">it</span> <span class="o">=</span> <span class="n">pkg</span><span class="p">.</span><span class="n">resources</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="nb">iter</span><span class="p">(</span><span class="n">keyed</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="nb">next</span><span class="p">(</span><span class="n">it</span><span class="p">))</span>
<span class="c1"># prints:
# {'name': 'China', 'population': Decimal('1391110000')}
</span></code></pre></div></div>
<p>Note how using the data package meta-data, data-types are restored and there’s no need to ‘re-parse’ the data. This works with other types too, such as dates, booleans and even <code class="language-plaintext highlighter-rouge">list</code>s and <code class="language-plaintext highlighter-rouge">dict</code>s.</p>
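<p>As a quick sketch of that round-trip with a date field (the directory name here is illustrative):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataflows import Flow, set_type, dump_to_path
from datapackage import Package

Flow(
    [{'when': '2018-08-30'}],       # dates start out as plain strings
    set_type('when', type='date'),  # cast and record the type in the schema
    dump_to_path('dates_demo')
).process()

row = next(Package('dates_demo/datapackage.json').resources[0].iter(keyed=True))
print(row)  # {'when': datetime.date(2018, 8, 30)} - restored from the metadata
</code></pre></div></div>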
<p>So far we’ve seen how to load data, process it row by row, and then inspect the results or store them in a data package.</p>
<p>Let’s see how we can do more complex processing by manipulating the entire data stream:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">dataflows</span> <span class="kn">import</span> <span class="n">Flow</span><span class="p">,</span> <span class="n">set_type</span><span class="p">,</span> <span class="n">dump_to_path</span>
<span class="c1"># Generate all triplets (a,b,c) so that 1 <= a <= b < c <= 20
</span><span class="k">def</span> <span class="nf">all_triplets</span><span class="p">():</span>
<span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">20</span><span class="p">):</span>
<span class="k">for</span> <span class="n">b</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="mi">20</span><span class="p">):</span>
<span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">b</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="mi">21</span><span class="p">):</span>
<span class="k">yield</span> <span class="nb">dict</span><span class="p">(</span><span class="n">a</span><span class="o">=</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="o">=</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="o">=</span><span class="n">c</span><span class="p">)</span>
<span class="c1"># Yield row only if a^2 + b^2 == c^1
</span><span class="k">def</span> <span class="nf">filter_pythagorean_triplets</span><span class="p">(</span><span class="n">rows</span><span class="p">):</span>
<span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">rows</span><span class="p">:</span>
<span class="k">if</span> <span class="n">row</span><span class="p">[</span><span class="s">'a'</span><span class="p">]</span><span class="o">**</span><span class="mi">2</span> <span class="o">+</span> <span class="n">row</span><span class="p">[</span><span class="s">'b'</span><span class="p">]</span><span class="o">**</span><span class="mi">2</span> <span class="o">==</span> <span class="n">row</span><span class="p">[</span><span class="s">'c'</span><span class="p">]</span><span class="o">**</span><span class="mi">2</span><span class="p">:</span>
<span class="k">yield</span> <span class="n">row</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">Flow</span><span class="p">(</span>
<span class="n">all_triplets</span><span class="p">(),</span>
<span class="n">set_type</span><span class="p">(</span><span class="s">'a'</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="s">'integer'</span><span class="p">),</span>
<span class="n">set_type</span><span class="p">(</span><span class="s">'b'</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="s">'integer'</span><span class="p">),</span>
<span class="n">set_type</span><span class="p">(</span><span class="s">'c'</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="s">'integer'</span><span class="p">),</span>
<span class="n">filter_pythagorean_triplets</span><span class="p">,</span>
<span class="n">dump_to_path</span><span class="p">(</span><span class="s">'pythagorean_triplets'</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">process</span><span class="p">()</span>
<span class="c1"># -->
# pythagorean_triplets/res_1.csv contains:
# a,b,c
# 3,4,5
# 5,12,13
# 6,8,10
# 8,15,17
# 9,12,15
# 12,16,20
</span></code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">filter_pythagorean_triplets</code> function takes an iterator of rows, and yields only the ones that pass its condition.</p>
<p>The flow framework knows whether a function is meant to handle a single row or a row iterator based on its parameters:</p>
<ul>
<li>if it accepts a single <code class="language-plaintext highlighter-rouge">row</code> parameter, then it’s a row processor.</li>
<li>if it accepts a single <code class="language-plaintext highlighter-rouge">rows</code> parameter, then it’s a rows processor.</li>
<li>if it accepts a single <code class="language-plaintext highlighter-rouge">package</code> parameter, then it’s a package processor.</li>
</ul>
<p>Let’s see a few examples of what we can do with package processors.</p>
<p>First, let’s add a field to the data:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">dataflows</span> <span class="kn">import</span> <span class="n">Flow</span><span class="p">,</span> <span class="n">load</span><span class="p">,</span> <span class="n">dump_to_path</span>
<span class="k">def</span> <span class="nf">add_is_guitarist_column_to_schema</span><span class="p">(</span><span class="n">package</span><span class="p">):</span>
<span class="c1"># Add a new field to the first resource
</span> <span class="n">package</span><span class="p">.</span><span class="n">pkg</span><span class="p">.</span><span class="n">resources</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="p">.</span><span class="n">descriptor</span><span class="p">[</span><span class="s">'schema'</span><span class="p">][</span><span class="s">'fields'</span><span class="p">]</span>
<span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="nb">dict</span><span class="p">(</span>
<span class="n">name</span><span class="o">=</span><span class="s">'is_guitarist'</span><span class="p">,</span>
<span class="nb">type</span><span class="o">=</span><span class="s">'boolean'</span>
<span class="p">))</span>
<span class="c1"># Must yield the modified datapackage
</span> <span class="k">yield</span> <span class="n">package</span><span class="p">.</span><span class="n">pkg</span>
<span class="c1"># And its resources
</span> <span class="k">yield</span> <span class="k">from</span> <span class="n">package</span>
<span class="k">def</span> <span class="nf">add_is_guitarist_column</span><span class="p">(</span><span class="n">row</span><span class="p">):</span>
<span class="n">row</span><span class="p">[</span><span class="s">'is_guitarist'</span><span class="p">]</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s">'instrument'</span><span class="p">]</span> <span class="o">==</span> <span class="s">'guitar'</span>
<span class="k">return</span> <span class="n">row</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">Flow</span><span class="p">(</span>
<span class="c1"># Same one as above
</span> <span class="n">load</span><span class="p">(</span><span class="s">'beatles.csv'</span><span class="p">),</span>
<span class="n">add_is_guitarist_column_to_schema</span><span class="p">,</span>
<span class="n">add_is_guitarist_column</span><span class="p">,</span>
<span class="n">dump_to_path</span><span class="p">(</span><span class="s">'beatles_guitarists'</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">process</span><span class="p">()</span>
</code></pre></div></div>
<p>In this example we create two steps - one for adding the new field (<code class="language-plaintext highlighter-rouge">is_guitarist</code>) to the schema and another step to modify the actual data.</p>
<p>We can combine the two into one step:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">dataflows</span> <span class="kn">import</span> <span class="n">Flow</span><span class="p">,</span> <span class="n">load</span><span class="p">,</span> <span class="n">dump_to_path</span>
<span class="k">def</span> <span class="nf">add_is_guitarist_column</span><span class="p">(</span><span class="n">package</span><span class="p">):</span>
<span class="c1"># Add a new field to the first resource
</span> <span class="n">package</span><span class="p">.</span><span class="n">pkg</span><span class="p">.</span><span class="n">resources</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">descriptor</span><span class="p">[</span><span class="s">'schema'</span><span class="p">][</span><span class="s">'fields'</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="nb">dict</span><span class="p">(</span>
<span class="n">name</span><span class="o">=</span><span class="s">'is_guitarist'</span><span class="p">,</span>
<span class="nb">type</span><span class="o">=</span><span class="s">'boolean'</span>
<span class="p">))</span>
<span class="c1"># Must yield the modified datapackage
</span> <span class="k">yield</span> <span class="n">package</span><span class="p">.</span><span class="n">pkg</span>
<span class="c1"># Now iterate on all resources
</span> <span class="n">resources</span> <span class="o">=</span> <span class="nb">iter</span><span class="p">(</span><span class="n">package</span><span class="p">)</span>
<span class="c1"># Take the first resource
</span> <span class="n">beatles</span> <span class="o">=</span> <span class="nb">next</span><span class="p">(</span><span class="n">resources</span><span class="p">)</span>
<span class="c1"># And yield it with with the modification
</span> <span class="k">def</span> <span class="nf">f</span><span class="p">(</span><span class="n">row</span><span class="p">):</span>
<span class="n">row</span><span class="p">[</span><span class="s">'is_guitarist'</span><span class="p">]</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s">'instrument'</span><span class="p">]</span> <span class="o">==</span> <span class="s">'guitar'</span>
<span class="k">return</span> <span class="n">row</span>
<span class="k">yield</span> <span class="nb">map</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">beatles</span><span class="p">)</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">Flow</span><span class="p">(</span>
<span class="c1"># Same one as above
</span> <span class="n">load</span><span class="p">(</span><span class="s">'beatles.csv'</span><span class="p">),</span>
<span class="n">add_is_guitarist_column</span><span class="p">,</span>
<span class="n">dump_to_path</span><span class="p">(</span><span class="s">'beatles_guitarists'</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">process</span><span class="p">()</span>
</code></pre></div></div>
<p>The contract for the <code class="language-plaintext highlighter-rouge">package</code> processing function is simple:</p>
<p>First modify <code class="language-plaintext highlighter-rouge">package.pkg</code> (which is a <code class="language-plaintext highlighter-rouge">Package</code> instance) and yield it.</p>
<p>Then, yield any resources that should exist on the output, with or without modifications.</p>
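<p>Put schematically, every package processor can follow this minimal skeleton (the descriptor tweak is purely illustrative):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def my_package_processor(package):
    # 1. Modify the metadata first...
    package.pkg.descriptor['name'] = 'renamed-package'  # illustrative change
    # 2. ...and yield the modified datapackage
    yield package.pkg
    # 3. Then yield every resource that should appear in the output
    yield from package
</code></pre></div></div>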
<p>In the next example we remove an entire resource inside a package processor. This one filters the list of Academy Award nominees down to those who won both an Oscar and an Emmy award:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kn">from</span> <span class="nn">dataflows</span> <span class="kn">import</span> <span class="n">Flow</span><span class="p">,</span> <span class="n">load</span><span class="p">,</span> <span class="n">dump_to_path</span>
<span class="k">def</span> <span class="nf">find_double_winners</span><span class="p">(</span><span class="n">package</span><span class="p">):</span>
<span class="c1"># Remove the emmies resource -
</span> <span class="c1"># we're going to consume it now
</span> <span class="n">package</span><span class="p">.</span><span class="n">pkg</span><span class="p">.</span><span class="n">remove_resource</span><span class="p">(</span><span class="s">'emmies'</span><span class="p">)</span>
<span class="c1"># Must yield the modified datapackage
</span> <span class="k">yield</span> <span class="n">package</span><span class="p">.</span><span class="n">pkg</span>
<span class="c1"># Now iterate on all resources
</span> <span class="n">resources</span> <span class="o">=</span> <span class="nb">iter</span><span class="p">(</span><span class="n">package</span><span class="p">)</span>
<span class="c1"># Emmies is the first -
</span> <span class="c1"># read all its data and create a set of winner names
</span> <span class="n">emmy</span> <span class="o">=</span> <span class="nb">next</span><span class="p">(</span><span class="n">resources</span><span class="p">)</span>
<span class="n">emmy_winners</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span>
<span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s">'nominee'</span><span class="p">],</span>
<span class="nb">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s">'winner'</span><span class="p">],</span>
<span class="n">emmy</span><span class="p">))</span>
<span class="p">)</span>
<span class="c1"># Oscars are next -
</span> <span class="c1"># filter rows based on the emmy winner set
</span> <span class="n">academy</span> <span class="o">=</span> <span class="nb">next</span><span class="p">(</span><span class="n">resources</span><span class="p">)</span>
<span class="k">yield</span> <span class="nb">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">row</span><span class="p">:</span> <span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="s">'Winner'</span><span class="p">]</span> <span class="ow">and</span>
<span class="n">row</span><span class="p">[</span><span class="s">'Name'</span><span class="p">]</span> <span class="ow">in</span> <span class="n">emmy_winners</span><span class="p">),</span>
<span class="n">academy</span><span class="p">)</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">Flow</span><span class="p">(</span>
<span class="c1"># Emmy award nominees and winners
</span> <span class="n">load</span><span class="p">(</span><span class="s">'emmy.csv'</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'emmies'</span><span class="p">),</span>
<span class="c1"># Academy award nominees and winners
</span> <span class="n">load</span><span class="p">(</span><span class="s">'academy.csv'</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s">'utf8'</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'oscars'</span><span class="p">),</span>
<span class="n">find_double_winners</span><span class="p">,</span>
<span class="n">dump_to_path</span><span class="p">(</span><span class="s">'double_winners'</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">process</span><span class="p">()</span>
<span class="c1"># -->
# double_winners/academy.csv contains:
# 1931/1932,5,Actress,1,Helen Hayes,The Sin of Madelon Claudet
# 1932/1933,6,Actress,1,Katharine Hepburn,Morning Glory
# 1935,8,Actress,1,Bette Davis,Dangerous
# 1938,11,Actress,1,Bette Davis,Jezebel
# ...
</span></code></pre></div></div>
<h2 id="builtin-processors">Builtin Processors</h2>
<p>DataFlows comes with a few built-in processors which do most of the heavy lifting in many common scenarios, leaving you to implement only the minimum code that is specific to your own problem.</p>
<p>A complete list, which also includes an API reference for each one of them, can be found in the <a href="https://github.com/datahq/dataflows/blob/master/PROCESSORS.md">DataFlows Built-in Processors</a> page.</p>
Adam Kariv
Data Factory & DataFlows - An Introduction
2018-08-29T00:00:00+00:00
http://okfnlabs.org/blog/2018/08/29/data-factory-data-flows-introduction
<p>Today I’d like to introduce a new library we’ve been working on - <code class="language-plaintext highlighter-rouge">dataflows</code>. DataFlows is a part of a larger conceptual framework for data processing.</p>
<p>We call it ‘<strong>Data Factory</strong>’ - an open framework for building and running lightweight data processing workflows quickly and easily. LAMP for data wrangling!</p>
<p>Most of you already know what <em><a href="http://frictionlessdata.io/data-packages/">Data Packages</a></em> are. In short, it is a portable format for packaging different resources (tabular or otherwise) in a standard way that takes care of most interoperability problems (e.g. <em>“what’s the character encoding of the file?”</em> or <em>“what is the data type for this column?”</em> or <em>“which date format are they using?”</em>). It also provides rich and flexible metadata, which users can then use to understand what the data is about (take a look at <a href="http://frictionlessdata.io/">frictionlessdata.io</a> to learn more!).</p>
<p><em>Data Factory</em> complements the <em>Data Package</em> concepts by adding dynamics to the mix.</p>
<p>While Data Packages are a great solution for describing data sets, these data sets are always <em>static</em> - located in one place. <em>Data Factory</em> is all about transforming Data Packages - modifying their data or meta-data and transmitting them from one location to another.</p>
<p><em>Data Factory</em> defines standard interfaces for building <em>processors</em> - software modules for mutating a Data Package - and protocols for streaming the contents of a Data Package for efficient processing.</p>
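<p>Conceptually, a processor can be as simple as a function that consumes a stream of rows and yields a transformed stream, and a chain of such functions forms a Data Flow. Here is a minimal, framework-free Python sketch of the idea (the field name is invented for illustration):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A processor: consumes a stream of rows, yields a (modified) stream of rows.
def uppercase_names(rows):
    for row in rows:
        row['Name'] = row['Name'].upper()
        yield row

# Chaining such processors over a stream is, in essence, a Data Flow.
stream = [{'Name': 'George'}, {'Name': 'John'}]
for row in uppercase_names(stream):
    print(row)
</code></pre></div></div>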
<h2 id="philosophy-and-goals">Philosophy and Goals</h2>
<p><em>Data Factory</em> is more pattern/convention than library.</p>
<p>An analogy is with web frameworks, which are less a single library than a core pattern plus a set of ready-to-use components. Python frameworks such as Pylons and Flask were built around WSGI, for example, and ExpressJS plays the same role for Node.</p>
<p>In this sense these frameworks favour convention over configuration: they try to decrease the number of decisions that a developer using the framework is required to make, without necessarily losing flexibility.</p>
<p>Data Factory takes the same approach, aiming to reduce the number of decisions a data developer has to make without sacrificing flexibility.</p>
<p>By following a standard scheme, developers are able to use a large and growing library of existing, reusable processors. This also increases readability and maintainability of data processing code.</p>
<p><strong>Our focus is on:</strong></p>
<ul>
<li>Small- to medium-sized data (KBs to GBs)</li>
<li>Desktop wrangling - people who start on their desktop</li>
<li>Easy transition from desktop to “cloud”</li>
<li>Heterogeneous data sources</li>
<li>Process using basic building blocks that are extensible</li>
<li>Less technical audience</li>
<li>Limited resources - limit on memory, CPU, etc.</li>
</ul>
<p>What are we <strong>not</strong>?</p>
<ul>
<li>Big data processing and machine learning: if you want to wrangle TBs of data in a distributed setup, or to train a machine learning model with GBs of data, you probably don’t want this.</li>
<li>Processing real-time event data.</li>
<li>Technical know-how <strong><em>is</em></strong> needed: we aren’t a fancy ETL UI – you probably need a bit of technical sophistication</li>
</ul>
<h2 id="architecture">Architecture</h2>
<p>This new framework is built on the foundations of the Frictionless Data project - both conceptually and technically. That project provided us with the definition of <em>Data Packages</em> and the software to read and write them.</p>
<blockquote>
<p>On top of this Frictionless Data basis, we’re introducing a few new concepts:</p>
<ul>
<li>the <strong>Data Stream</strong> - essentially a Data Package in transit;</li>
<li>the <strong>Data Processor</strong> - a module that manipulates a Data Package, receiving one Data Stream as its input and producing a new Data Stream as its output;</li>
<li>the <strong>Data Flow</strong> - a chain of Data Processors.</li>
</ul>
</blockquote>
<p>We will be providing a library of such processors: some for loading data from various sources, some for storing data in different locations, services or databases, and some for doing common manipulation and transformation on the processed data.</p>
<p>On top of all that we’re building a few integrated services:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">dataflows-server</code> (formerly known as <code class="language-plaintext highlighter-rouge">datapackage-pipelines</code>) - a server-side multi-processor runner for Data Flows.</li>
<li><code class="language-plaintext highlighter-rouge">dataflows-cli</code> - a client library for building and running Data Flows locally.</li>
<li><code class="language-plaintext highlighter-rouge">dataflows-blueprints</code> - ready-made flow generators for common scenarios (e.g. ‘I want to regularly pull all my analytics from these X services and dump them in a database’).</li>
<li>and more to come.</li>
</ul>
<p><img src="/img/posts/data-factory.png" alt="Data Factory" /></p>
<h2 id="on-data-wrangling">On Data Wrangling</h2>
<p>In our experience, data processing starts simple - downloading and inspecting a CSV, deleting a column or a row. We wanted something that was as fast as the command line to get started but would also provide a solid basis as your pipeline grows. We also wanted something that provided some standardization and conventions over completely bespoke code.</p>
<p>With integration in mind, DataFlows comes with very little environmental requirements, and can be embedded in your existing data processing setup.</p>
<p>In short, DataFlows provides a simple, quick and easy-to-setup, and extensible way to build lightweight data processing pipelines.</p>
<h2 id="introducing-dataflows">Introducing dataflows</h2>
<p>The first piece of software we’re introducing today is <code class="language-plaintext highlighter-rouge">dataflows</code> and its standard library of processors.</p>
<p><code class="language-plaintext highlighter-rouge">dataflows</code> introduces the concept of a <code class="language-plaintext highlighter-rouge">Flow</code> - a chain of data processors, reading, transforming and modifying a stream of data and writing it to any location (or loading it to memory for further analysis).</p>
<p><code class="language-plaintext highlighter-rouge">dataflows</code> also comes with a rich set of built-in data processors, ready to do most of the heavy-lifting you’ll need to reduce boilerplate code and increase your productivity.</p>
<h3 id="a-demo-is-worth-a-thousand-words">A demo is worth a thousand words</h3>
<p>Most data processing starts simple: getting a file and having a look.</p>
<p>With <code class="language-plaintext highlighter-rouge">dataflows</code> you can do this in a few seconds <em>and</em> you’ll have a solid basis for whatever you want to do next.</p>
<p><strong><em>Bootstrapping a data processing script</em></strong></p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>pip <span class="nb">install </span>dataflows
<span class="nv">$ </span>dataflows init https://rawgit.com/datahq/demo/_/first.csv
Writing processing code into first_csv.py
Running first_csv.py
first:
<span class="c"># Name Composed DOB</span>
<span class="o">(</span>string<span class="o">)</span> <span class="o">(</span>string<span class="o">)</span> <span class="o">(</span><span class="nb">date</span><span class="o">)</span>
<span class="nt">---</span> <span class="nt">----------</span> <span class="nt">----------</span> <span class="nt">----------</span>
1 George 22 1943-02-25
2 John 90 1940-10-09
3 Richard 2 1940-07-07
4 Paul 88 1942-06-18
5 Brian n/a 1934-09-19
Done!
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">dataflows init</code> actually does 3 things:</p>
<ul>
<li>Analyzes the source file</li>
<li>Creates a processing script for reading it</li>
<li>Runs that script for you</li>
</ul>
<p>In our case, a script named <code class="language-plaintext highlighter-rouge">first_csv.py</code> was created - here’s what it contains:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># ...
</span>
<span class="k">def</span> <span class="nf">first_csv</span><span class="p">():</span>
<span class="n">flow</span> <span class="o">=</span> <span class="n">Flow</span><span class="p">(</span>
<span class="c1"># Load inputs
</span> <span class="n">load</span><span class="p">(</span><span class="s">'https://rawgit.com/datahq/demo/_/first.csv'</span><span class="p">,</span>
<span class="nb">format</span><span class="o">=</span><span class="s">'csv'</span><span class="p">,</span> <span class="p">),</span>
<span class="c1"># Process them (if necessary)
</span> <span class="c1"># Save the results
</span> <span class="n">add_metadata</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">'first_csv'</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'first.csv'</span><span class="p">),</span>
<span class="n">printer</span><span class="p">(),</span>
<span class="p">)</span>
<span class="n">flow</span><span class="p">.</span><span class="n">process</span><span class="p">()</span>
<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">'__main__'</span><span class="p">:</span>
<span class="n">first_csv</span><span class="p">()</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">flow</code> variable contains the chain of processing steps (i.e. the processors). In this simple flow, <code class="language-plaintext highlighter-rouge">load</code> loads the source data, <code class="language-plaintext highlighter-rouge">add_metadata</code> modifies the file’s metadata and <code class="language-plaintext highlighter-rouge">printer</code> outputs the contents to the standard output.</p>
<p>You can run this script again at any time, and it will re-run the processing flow:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>python first_csv.py
first:
<span class="c"># Name Composed DOB</span>
<span class="o">(</span>string<span class="o">)</span> <span class="o">(</span>string<span class="o">)</span> <span class="o">(</span><span class="nb">date</span><span class="o">)</span>
<span class="nt">---</span> <span class="nt">----------</span> <span class="nt">----------</span> <span class="nt">----------</span>
1 George 22 1943-02-25
...
</code></pre></div></div>
<p>This is all very nice, but now it’s time for some real data wrangling. By editing the processing script it’s possible to add more functionality to the flow - <code class="language-plaintext highlighter-rouge">dataflows</code> provides a simple, solid basis for building up your pipeline quickly, reliably and repeatedly.</p>
<p><strong><em>Fixing some bad data</em></strong></p>
<p>Let’s start by getting rid of that annoying <code class="language-plaintext highlighter-rouge">n/a</code> in the last line of the data.</p>
<p>We edit <code class="language-plaintext highlighter-rouge">first_csv.py</code> and add two more steps to the flow:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">removeNa</span><span class="p">(</span><span class="n">row</span><span class="p">):</span>
<span class="n">row</span><span class="p">[</span><span class="s">'Composed'</span><span class="p">]</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s">'Composed'</span><span class="p">].</span><span class="n">replace</span><span class="p">(</span><span class="s">'n/a'</span><span class="p">,</span> <span class="s">''</span><span class="p">)</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">Flow</span><span class="p">(</span>
<span class="n">load</span><span class="p">(</span><span class="s">'https://rawgit.com/datahq/demo/_/first.csv'</span><span class="p">),</span>
<span class="c1"># added here custom processing:
</span> <span class="n">removeNa</span><span class="p">,</span>
<span class="c1"># now parse column as Integer:
</span> <span class="n">set_type</span><span class="p">(</span><span class="s">'Composed'</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="s">'integer'</span><span class="p">),</span>
<span class="n">printer</span><span class="p">()</span>
<span class="p">)</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">removeNa</code> is a simple function which modifies each row it sees, replacing <code class="language-plaintext highlighter-rouge">n/a</code>s with the empty string. After it, we call <code class="language-plaintext highlighter-rouge">set_type</code>, which declares that the <code class="language-plaintext highlighter-rouge">Composed</code> column should be an integer - and verifies that it is indeed an integer while processing the data.</p>
<p><strong><em>Writing the cleaned data</em></strong></p>
<p>Finally, let’s write the output to a file using the <code class="language-plaintext highlighter-rouge">dump_to_path</code> processor:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">removeNa</span><span class="p">(</span><span class="n">row</span><span class="p">):</span>
<span class="n">row</span><span class="p">[</span><span class="s">'Composed'</span><span class="p">]</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s">'Composed'</span><span class="p">].</span><span class="n">replace</span><span class="p">(</span><span class="s">'n/a'</span><span class="p">,</span> <span class="s">''</span><span class="p">)</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">Flow</span><span class="p">(</span>
<span class="n">load</span><span class="p">(</span><span class="s">'https://rawgit.com/datahq/demo/_/first.csv'</span><span class="p">),</span>
<span class="n">add_metadata</span><span class="p">(</span>
<span class="n">name</span><span class="o">=</span><span class="s">'beatles_infoz'</span><span class="p">,</span>
<span class="n">title</span><span class="o">=</span><span class="s">'Beatle Member Information'</span><span class="p">,</span>
<span class="p">),</span>
<span class="n">removeNa</span><span class="p">,</span>
<span class="n">set_type</span><span class="p">(</span><span class="s">'Composed'</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="s">'integer'</span><span class="p">),</span>
<span class="n">dump_to_path</span><span class="p">(</span><span class="s">'first_csv/'</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div></div>
<p>Now, we re-run our modified processing script…</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>python first_csv.py
...
</code></pre></div></div>
<p>we get a valid Data Package which we can use…</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>tree
├── first_csv
│ ├── datapackage.json
│ └── first.csv
</code></pre></div></div>
<p>which contains a normalized and cleaned-up CSV file…</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">head </span>first_csv/first.csv
Name,Composed,DOB
George,22,1943-02-25
John,90,1940-10-09
Richard,2,1940-07-07
Paul,88,1942-06-18
Brian,,1934-09-19
</code></pre></div></div>
<p>as well as <code class="language-plaintext highlighter-rouge">datapackage.json</code>, a JSON file containing the package’s metadata…</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat </span>first_csv/datapackage.json <span class="c"># Edited for brevity</span>
<span class="o">{</span>
<span class="s2">"count_of_rows"</span>: 5,
<span class="s2">"name"</span>: <span class="s2">"beatles_infoz"</span>,
<span class="s2">"title"</span>: <span class="s2">"Beatle Member Information"</span>,
<span class="s2">"resources"</span>: <span class="o">[</span>
<span class="o">{</span>
<span class="s2">"name"</span>: <span class="s2">"first"</span>,
<span class="s2">"path"</span>: <span class="s2">"first.csv"</span>,
<span class="s2">"schema"</span>: <span class="o">{</span>
<span class="s2">"fields"</span>: <span class="o">[</span>
<span class="o">{</span><span class="s2">"name"</span>: <span class="s2">"Name"</span>, <span class="s2">"type"</span>: <span class="s2">"string"</span><span class="o">}</span>,
<span class="o">{</span><span class="s2">"name"</span>: <span class="s2">"Composed"</span>, <span class="s2">"type"</span>: <span class="s2">"integer"</span><span class="o">}</span>,
<span class="o">{</span><span class="s2">"name"</span>: <span class="s2">"DOB"</span>, <span class="s2">"type"</span>: <span class="s2">"date"</span><span class="o">}</span>
<span class="o">]</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="o">]</span>
<span class="o">}</span>
</code></pre></div></div>
<p>and is very simple to use in Python (or JS, Ruby, PHP and many other programming languages):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">$</span> <span class="n">python</span>
<span class="o">>>></span> <span class="kn">from</span> <span class="nn">datapackage</span> <span class="kn">import</span> <span class="n">Package</span>
<span class="o">>>></span> <span class="n">p</span> <span class="o">=</span> <span class="n">Package</span><span class="p">(</span><span class="s">'first_csv/datapackage.json'</span><span class="p">)</span>
<span class="o">>>></span> <span class="nb">list</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">resources</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="nb">iter</span><span class="p">())</span>
<span class="p">[[</span><span class="s">'George'</span><span class="p">,</span> <span class="mi">22</span><span class="p">,</span> <span class="n">datetime</span><span class="p">.</span><span class="n">date</span><span class="p">(</span><span class="mi">1943</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">25</span><span class="p">)],</span>
<span class="p">[</span><span class="s">'John'</span><span class="p">,</span> <span class="mi">90</span><span class="p">,</span> <span class="n">datetime</span><span class="p">.</span><span class="n">date</span><span class="p">(</span><span class="mi">1940</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">9</span><span class="p">)],</span>
<span class="p">[</span><span class="s">'Richard'</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">datetime</span><span class="p">.</span><span class="n">date</span><span class="p">(</span><span class="mi">1940</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">7</span><span class="p">)],</span>
<span class="p">[</span><span class="s">'Paul'</span><span class="p">,</span> <span class="mi">88</span><span class="p">,</span> <span class="n">datetime</span><span class="p">.</span><span class="n">date</span><span class="p">(</span><span class="mi">1942</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">18</span><span class="p">)],</span>
<span class="p">[</span><span class="s">'Brian'</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="n">datetime</span><span class="p">.</span><span class="n">date</span><span class="p">(</span><span class="mi">1934</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">19</span><span class="p">)]]</span>
<span class="o">>>></span>
</code></pre></div></div>
<h2 id="more-">More…</h2>
<p>Lots, lots more - there is a whole suite of built-in processors, and you can quickly add your own with a few lines of Python code.</p>
<p>Dig in at the project’s <a href="https://github.com/datahq/dataflows">GitHub Page</a> or continue reading the in-depth tutorial <a href="/blog/2018/08/30/data-factory-data-flows-tutorial.html">here</a>.</p>
Adam Kariv
Processing Tabular Data Packages in Clojure
2018-05-07T00:00:00+00:00
http://okfnlabs.org/blog/2018/05/07/datapackages-in-clojure
<p>Matt Thompson was one of 2017’s <a href="https://toolfund.frictionlessdata.io">Frictionless Data Tool Fund</a> grantees, tasked with extending the implementation of the core Frictionless Data <a href="https://github.com/frictionlessdata/datapackage-clj">data package</a> and <a href="https://github.com/frictionlessdata/tableschema-clj">table schema</a> libraries in the Clojure programming language. You can read more about this in <a href="/articles/matt-thompson/">his grantee profile</a>. In this post, Thompson will show you how to set up and use the <a href="http://clojure.org">Clojure</a> libraries for working with <a href="https://frictionlessdata.io/specs/tabular-data-package/">Tabular Data Packages</a>.</p>
<hr />
<p>This tutorial uses a worked example of downloading a data package from a remote location on the web, and using the Frictionless Data tools to read its contents and metadata into Clojure data structures.</p>
<h2 id="setup">Setup</h2>
<p>First, we need to set up the project structure using the <a href="http://leiningen.org">Leiningen</a> tool. If you don’t have Leiningen set up on your system, follow the link to download and install it. Once it is set up, run the following command from the command line to create the folders and files for a basic Clojure project:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>lein new periodic-table</code></pre></figure>
<p>This will create the <em>periodic-table</em> folder. Inside the <em>periodic-table/src/periodic_table</em> folder (Leiningen converts hyphens in namespace names to underscores on disk) should be a file named <em>core.clj</em>. This is the file you need to edit during this tutorial.</p>
<h2 id="the-data">The Data</h2>
<p>For this tutorial, we will use a pre-created data package, the Periodic Table Data Package hosted by the Frictionless Data project. A <a href="https://frictionlessdata.io/specs/data-package/">Data Package</a> is a simple container format used to describe and package a collection of data. It consists of two parts:</p>
<ul>
<li>Metadata that describes the structure and contents of the package</li>
<li>Resources such as data files that form the contents of the package</li>
</ul>
<p>Our Clojure code will download the data package and process it using the metadata information contained in the
package. The data package can be found <a href="https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json">here on GitHub</a>.</p>
<p>The data package contains data about elements in the periodic table, including each element’s atomic number, symbol, name, atomic mass, and whether it is a metal or nonmetal. The table below shows a sample taken from the first three rows of the CSV file:</p>
<table class="table table-striped table-bordered" style="display: block;">
<thead>
<tr>
<th>atomic number</th>
<th>symbol</th>
<th>name</th>
<th>atomic mass</th>
<th>metal or nonmetal?</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>H</td>
<td>Hydrogen</td>
<td>1.00794</td>
<td>nonmetal</td>
</tr>
<tr>
<td>2</td>
<td>He</td>
<td>Helium</td>
<td>4.002602</td>
<td>noble gas</td>
</tr>
<tr>
<td>3</td>
<td>Li</td>
<td>Lithium</td>
<td>6.941</td>
<td>alkali metal</td>
</tr>
</tbody>
</table>
<h2 id="loading-the-data-package">Loading the Data Package</h2>
<p>The first step is to load the data package into a Clojure data structure (a map). To do this, we require the data package library in our code (giving it the alias <strong>dp</strong>) and then use its <strong>load</strong> function to load the data package into our project. Enter the following code into the core.clj file:</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="nf">ns</span><span class="w"> </span><span class="n">periodic-table.core</span><span class="w">
</span><span class="p">(</span><span class="no">:require</span><span class="w"> </span><span class="p">[</span><span class="n">frictionlessdata.datapackage</span><span class="w"> </span><span class="no">:as</span><span class="w"> </span><span class="n">dp</span><span class="p">]</span><span class="w">
</span><span class="p">[</span><span class="n">frictionlessdata.tableschema</span><span class="w"> </span><span class="no">:as</span><span class="w"> </span><span class="n">ts</span><span class="p">]</span><span class="w">
</span><span class="p">[</span><span class="n">clojure.spec.alpha</span><span class="w"> </span><span class="no">:as</span><span class="w"> </span><span class="n">s</span><span class="p">]))</span><span class="w">
</span><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">pkg</span><span class="w">
</span><span class="p">(</span><span class="nf">dp/load</span><span class="w"> </span><span class="s">"https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json"</span><span class="p">))</span></code></pre></figure>
<p>This pulls the data in from the remote GitHub location and converts the metadata into a Clojure map. We can access this metadata by using the <code class="language-plaintext highlighter-rouge">descriptor</code> function along with keys such as <code class="language-plaintext highlighter-rouge">:name</code> and <code class="language-plaintext highlighter-rouge">:title</code> to get the relevant information:</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="nb">println</span><span class="w"> </span><span class="p">(</span><span class="nb">str</span><span class="w"> </span><span class="s">"Package name:"</span><span class="w"> </span><span class="p">(</span><span class="nf">dp/descriptor</span><span class="w"> </span><span class="n">pkg</span><span class="w"> </span><span class="no">:name</span><span class="p">)))</span><span class="w">
</span><span class="p">(</span><span class="nb">println</span><span class="w"> </span><span class="p">(</span><span class="nb">str</span><span class="w"> </span><span class="s">"Package title:"</span><span class="w"> </span><span class="p">(</span><span class="nf">dp/descriptor</span><span class="w"> </span><span class="n">pkg</span><span class="w"> </span><span class="no">:title</span><span class="p">)))</span></code></pre></figure>
<p>The package descriptor contains metadata that describes the contents of the data package. What about accessing the data itself? We can get to it using the <code class="language-plaintext highlighter-rouge">get-resources</code> function:</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">table</span><span class="w"> </span><span class="p">(</span><span class="nf">dp/get-resources</span><span class="w"> </span><span class="n">pkg</span><span class="w"> </span><span class="no">:data</span><span class="p">))</span><span class="w">
</span><span class="p">(</span><span class="nb">doseq</span><span class="w"> </span><span class="p">[</span><span class="n">row</span><span class="w"> </span><span class="n">table</span><span class="p">]</span><span class="w">
</span><span class="p">(</span><span class="nb">println</span><span class="w"> </span><span class="n">row</span><span class="p">))</span></code></pre></figure>
<p>The above code locates the data in the data package, then goes through it line by line and prints the contents.</p>
<h2 id="casting-types-with-corespec">Casting Types with core.spec</h2>
<p>We can use Clojure’s <a href="https://clojure.org/guides/spec">spec</a> library to define a schema for our data, which can then be used to cast the types of the data in the CSV file.</p>
<p>Below is a spec description of a periodic element type, consisting of an atomic number, atomic symbol, the element’s name, its mass, and whether the element is a metal or non-metal:</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="nf">s/def</span><span class="w"> </span><span class="no">::number</span><span class="w"> </span><span class="n">int?</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="nf">s/def</span><span class="w"> </span><span class="no">::symbol</span><span class="w"> </span><span class="nb">string?</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="nf">s/def</span><span class="w"> </span><span class="no">::name</span><span class="w"> </span><span class="nb">string?</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="nf">s/def</span><span class="w"> </span><span class="no">::mass</span><span class="w"> </span><span class="n">float?</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="nf">s/def</span><span class="w"> </span><span class="no">::metal</span><span class="w"> </span><span class="nb">string?</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="nf">s/def</span><span class="w"> </span><span class="no">::element</span><span class="w"> </span><span class="p">(</span><span class="nf">s/keys</span><span class="w"> </span><span class="no">:req</span><span class="w"> </span><span class="p">[</span><span class="no">::number</span><span class="w"> </span><span class="no">::symbol</span><span class="w"> </span><span class="no">::name</span><span class="w"> </span><span class="no">::mass</span><span class="w"> </span><span class="no">::metal</span><span class="p">]))</span></code></pre></figure>
<p>The above spec can be used to cast values in our tabular data so that they match the specified schema. The example below shows our tabular data values being cast to fit the spec description. Then the <code class="language-plaintext highlighter-rouge">-main</code> function loops through the elements, printing only those with an atomic mass of under 10.</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="nf">ns</span><span class="w"> </span><span class="n">periodic-table.core</span><span class="w">
</span><span class="p">(</span><span class="no">:require</span><span class="w"> </span><span class="p">[</span><span class="n">frictionlessdata.datapackage</span><span class="w"> </span><span class="no">:as</span><span class="w"> </span><span class="n">dp</span><span class="p">]</span><span class="w">
</span><span class="p">[</span><span class="n">frictionlessdata.tableschema</span><span class="w"> </span><span class="no">:as</span><span class="w"> </span><span class="n">ts</span><span class="p">]</span><span class="w">
</span><span class="p">[</span><span class="n">clojure.spec.alpha</span><span class="w"> </span><span class="no">:as</span><span class="w"> </span><span class="n">s</span><span class="p">]))</span><span class="w">
</span><span class="p">(</span><span class="nf">s/def</span><span class="w"> </span><span class="no">::number</span><span class="w"> </span><span class="n">int?</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="nf">s/def</span><span class="w"> </span><span class="no">::symbol</span><span class="w"> </span><span class="nb">string?</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="nf">s/def</span><span class="w"> </span><span class="no">::name</span><span class="w"> </span><span class="nb">string?</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="nf">s/def</span><span class="w"> </span><span class="no">::mass</span><span class="w"> </span><span class="n">float?</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="nf">s/def</span><span class="w"> </span><span class="no">::metal</span><span class="w"> </span><span class="nb">string?</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="nf">s/def</span><span class="w"> </span><span class="no">::element</span><span class="w"> </span><span class="p">(</span><span class="nf">s/keys</span><span class="w"> </span><span class="no">:req</span><span class="w"> </span><span class="p">[</span><span class="no">::number</span><span class="w"> </span><span class="no">::symbol</span><span class="w"> </span><span class="no">::name</span><span class="w"> </span><span class="no">::mass</span><span class="w"> </span><span class="no">::metal</span><span class="p">]))</span><span class="w">
</span><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">pkg</span><span class="w">
</span><span class="p">(</span><span class="nf">dp/load</span><span class="w"> </span><span class="s">"https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json"</span><span class="p">))</span><span class="w">
</span><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">resources</span><span class="w"> </span><span class="p">(</span><span class="nf">dp/get-resources</span><span class="w"> </span><span class="n">pkg</span><span class="w"> </span><span class="no">:data</span><span class="p">))</span><span class="w">
</span><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">elements</span><span class="w"> </span><span class="p">(</span><span class="nf">dp/cast</span><span class="w"> </span><span class="n">resources</span><span class="w"> </span><span class="no">::element</span><span class="p">))</span><span class="w">
</span><span class="p">(</span><span class="k">defn</span><span class="w"> </span><span class="n">-main</span><span class="w"> </span><span class="p">[]</span><span class="w">
</span><span class="p">(</span><span class="nb">doseq</span><span class="w"> </span><span class="p">[</span><span class="n">e</span><span class="w"> </span><span class="n">elements</span><span class="p">]</span><span class="w">
</span><span class="p">(</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="nb"><</span><span class="w"> </span><span class="p">(</span><span class="no">::mass</span><span class="w"> </span><span class="n">e</span><span class="p">)</span><span class="w"> </span><span class="mi">10</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="nb">println</span><span class="w"> </span><span class="n">e</span><span class="p">))))</span></code></pre></figure>
<p>When run, the program produces the following output:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>lein run
<span class="o">{</span>::number 1 ::symbol <span class="s2">"H"</span> ::name <span class="s2">"Hydrogen"</span> ::mass 1.00794 ::metal <span class="s2">"nonmetal"</span><span class="o">}</span>
<span class="o">{</span>::number 2 ::symbol <span class="s2">"He"</span> ::name <span class="s2">"Helium"</span> ::mass 4.002602 ::metal <span class="s2">"noble gas"</span><span class="o">}</span>
<span class="o">{</span>::number 3 ::symbol <span class="s2">"Li"</span> ::name <span class="s2">"Lithium"</span> ::mass 6.941 ::metal <span class="s2">"alkali metal"</span><span class="o">}</span>
<span class="o">{</span>::number 4 ::symbol <span class="s2">"Be"</span> ::name <span class="s2">"Beryllium"</span> ::mass 9.012182 ::metal <span class="s2">"alkaline earth metal"</span><span class="o">}</span></code></pre></figure>
<p>This concludes our simple tutorial for using the Clojure libraries for Frictionless Data.</p>
<hr />
<p>We welcome your feedback and questions via our <a href="http://gitter.im/frictionlessdata/chat">Frictionless Data Gitter chat</a> or via <a href="https://github.com/frictionlessdata/datapackage-clj/issues">GitHub issues</a> on the <a href="https://github.com/frictionlessdata/datapackage-clj">datapackage-clj</a> repository.</p>
Matt Thompson
Processing Tabular Data Packages in Java
2018-04-28T00:00:00+00:00
http://okfnlabs.org/blog/2018/04/28/datapackages-in-java
<p>Georges Labrèche was one of 2017’s <a href="https://toolfund.frictionlessdata.io">Frictionless Data Tool Fund</a> grantees, tasked with extending the implementation of the core Frictionless Data libraries in the Java programming language. You can read more about this in <a href="https://frictionlessdata.io/articles/georges-labreche/">his grantee profile</a>.</p>
<p>In this post, Labrèche will show you how to install and use the <a href="https://www.java.com/en/">Java</a> libraries for working with <a href="https://frictionlessdata.io/specs/tabular-data-package/">Tabular Data Packages</a>.</p>
<hr />
<p>Our goal in this tutorial is to load tabular data from a CSV file and infer both its data types and the table’s schema.</p>
<h2 id="setup">Setup</h2>
<p>First things first, you’ll want to grab <a href="https://github.com/frictionlessdata/datapackage-java">datapackage-java</a> and the <a href="https://github.com/frictionlessdata/tableschema-java">tableschema-java</a> libraries.</p>
<h2 id="the-data">The Data</h2>
<p>For our example, we will use a <a href="https://frictionlessdata.io/specs/tabular-data-package/">Tabular Data Package</a> containing the periodic table. You can find the <a href="https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json">data package descriptor</a> and the <a href="https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/data.csv">data</a> on GitHub.</p>
<h2 id="packaging">Packaging</h2>
<p>Let’s start by fetching and packaging the data:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="c1">// fetch the data</span>
<span class="no">URL</span> <span class="n">url</span> <span class="o">=</span> <span class="k">new</span> <span class="no">URL</span><span class="o">(</span><span class="s">"https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json"</span><span class="o">);</span>
<span class="c1">// package the data</span>
<span class="nc">Package</span> <span class="n">pkg</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Package</span><span class="o">(</span><span class="n">url</span><span class="o">);</span></code></pre></figure>
<p>That’s it, you’re all set to start playing with the packaged data. There are parameters you can set, such as loading a schema or imposing strict validation, so be sure to go through the project’s <a href="https://github.com/frictionlessdata/datapackage-java/blob/master/README.md">README</a> for more detail.</p>
<h2 id="iterating">Iterating</h2>
<p>Now that you have a Data Package instance, let’s see what the data looks like. A data package can contain more than one resource, so you have to use the <code class="language-plaintext highlighter-rouge">Package.getResource()</code> method to specify which resource you’d like to access.</p>
<p>Let’s iterate over the data:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="c1">// Get a resource named data from the data package</span>
<span class="nc">Resource</span> <span class="n">resource</span> <span class="o">=</span> <span class="n">pkg</span><span class="o">.</span><span class="na">getResource</span><span class="o">(</span><span class="s">"data"</span><span class="o">);</span>
<span class="c1">// Get the Iterator</span>
<span class="nc">Iterator</span><span class="o"><</span><span class="nc">String</span><span class="o">[]></span> <span class="n">iter</span> <span class="o">=</span> <span class="n">resource</span><span class="o">.</span><span class="na">iter</span><span class="o">();</span>
<span class="c1">// Iterate</span>
<span class="k">while</span><span class="o">(</span><span class="n">iter</span><span class="o">.</span><span class="na">hasNext</span><span class="o">()){</span>
<span class="nc">String</span><span class="o">[]</span> <span class="n">row</span> <span class="o">=</span> <span class="n">iter</span><span class="o">.</span><span class="na">next</span><span class="o">();</span>
<span class="nc">String</span> <span class="n">atomicNumber</span> <span class="o">=</span> <span class="n">row</span><span class="o">[</span><span class="mi">0</span><span class="o">];</span>
<span class="nc">String</span> <span class="n">symbol</span> <span class="o">=</span> <span class="n">row</span><span class="o">[</span><span class="mi">1</span><span class="o">];</span>
<span class="nc">String</span> <span class="n">name</span> <span class="o">=</span> <span class="n">row</span><span class="o">[</span><span class="mi">2</span><span class="o">];</span>
<span class="nc">String</span> <span class="n">atomicMass</span> <span class="o">=</span> <span class="n">row</span><span class="o">[</span><span class="mi">3</span><span class="o">];</span>
<span class="nc">String</span> <span class="n">metalOrNonMetal</span> <span class="o">=</span> <span class="n">row</span><span class="o">[</span><span class="mi">4</span><span class="o">];</span>
<span class="o">}</span></code></pre></figure>
<p>Notice how we’re fetching all values as <code class="language-plaintext highlighter-rouge">String</code>. This may not be what you want, particularly for the atomic number and mass. Alternatively, you can trigger data type inference and casting like this:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="c1">// Get Iterator.</span>
<span class="c1">// Third boolean is the cast flag.</span>
<span class="nc">Iterator</span><span class="o"><</span><span class="nc">Object</span><span class="o">[]></span> <span class="n">iter</span> <span class="o">=</span> <span class="n">resource</span><span class="o">.</span><span class="na">iter</span><span class="o">(</span><span class="kc">false</span><span class="o">,</span> <span class="kc">false</span><span class="o">,</span> <span class="kc">true</span><span class="o">);</span>
<span class="c1">// Iterator</span>
<span class="k">while</span><span class="o">(</span><span class="n">iter</span><span class="o">.</span><span class="na">hasNext</span><span class="o">()){</span>
<span class="nc">Object</span><span class="o">[]</span> <span class="n">row</span> <span class="o">=</span> <span class="n">iter</span><span class="o">.</span><span class="na">next</span><span class="o">();</span>
<span class="kt">int</span> <span class="n">atomicNumber</span> <span class="o">=</span> <span class="o">(</span><span class="kt">int</span><span class="o">)</span> <span class="n">row</span><span class="o">[</span><span class="mi">0</span><span class="o">];</span>
<span class="nc">String</span> <span class="n">symbol</span> <span class="o">=</span> <span class="o">(</span><span class="nc">String</span><span class="o">)</span> <span class="n">row</span><span class="o">[</span><span class="mi">1</span><span class="o">];</span>
<span class="nc">String</span> <span class="n">name</span> <span class="o">=</span> <span class="o">(</span><span class="nc">String</span><span class="o">)</span> <span class="n">row</span><span class="o">[</span><span class="mi">2</span><span class="o">];</span>
<span class="kt">float</span> <span class="n">atomicMass</span> <span class="o">=</span> <span class="o">(</span><span class="kt">float</span><span class="o">)</span> <span class="n">row</span><span class="o">[</span><span class="mi">3</span><span class="o">];</span>
<span class="nc">String</span> <span class="n">metalOrNonMetal</span> <span class="o">=</span> <span class="o">(</span><span class="nc">String</span><span class="o">)</span> <span class="n">row</span><span class="o">[</span><span class="mi">4</span><span class="o">];</span>
<span class="o">}</span></code></pre></figure>
<p>And that’s it, your data is now associated with the appropriate data types!</p>
<h2 id="inferring-the-schema">Inferring the Schema</h2>
<p>We wouldn’t have had to infer the data types if we had included a <a href="https://frictionlessdata.io/docs/table-schema/">Table Schema</a> when creating an instance of our Data Package. If a Table Schema is not available, then it’s something that can also be inferred and created with <code class="language-plaintext highlighter-rouge">tableschema-java</code>:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="no">URL</span> <span class="n">url</span> <span class="o">=</span> <span class="k">new</span> <span class="no">URL</span><span class="o">(</span><span class="s">"https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/data.csv"</span><span class="o">);</span>
<span class="nc">Table</span> <span class="n">table</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Table</span><span class="o">(</span><span class="n">url</span><span class="o">);</span>
<span class="nc">Schema</span> <span class="n">schema</span> <span class="o">=</span> <span class="n">table</span><span class="o">.</span><span class="na">inferSchema</span><span class="o">();</span>
<span class="n">schema</span><span class="o">.</span><span class="na">write</span><span class="o">(</span><span class="s">"/path/to/write/schema.json"</span><span class="o">);</span></code></pre></figure>
<p>The type inference algorithm tries to cast each value to the available types, and every successful cast increments a popularity score for the type in question. At the end, the type with the best score is returned.</p>
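<p>To make the popularity-score idea concrete, here is an illustrative Python sketch of that kind of inference (not <code class="language-plaintext highlighter-rouge">tableschema-java</code>’s actual code; the candidate types and the tie-breaking order are assumptions):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"># Illustrative sketch of popularity-score type inference;
# not tableschema-java's actual implementation.
from datetime import date

CANDIDATES = {
    'integer': int,
    'number': float,
    'date': date.fromisoformat,
}

def infer_column_type(values):
    scores = {'string': len(values)}  # every value casts to string
    for value in values:
        for name, cast in CANDIDATES.items():
            try:
                cast(value)  # a successful cast increments the type's score
                scores[name] = scores.get(name, 0) + 1
            except ValueError:
                pass
    best = max(scores.values())
    # Among top scorers, prefer the most specific type (assumed order).
    for name in ('integer', 'number', 'date', 'string'):
        if scores.get(name) == best:
            return name

print(infer_column_type(['1', '2', '3']))          # integer
print(infer_column_type(['1.00794', '4.002602']))  # number
</code></pre></figure>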
<p>The inference algorithm traverses all of the table’s rows and attempts to cast every single value of the table. When dealing with large tables, you might want to limit the number of rows that the inference algorithm processes:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="c1">// Only process the first 25 rows for type inference.</span>
<span class="nc">Schema</span> <span class="n">schema</span> <span class="o">=</span> <span class="n">table</span><span class="o">.</span><span class="na">inferSchema</span><span class="o">(</span><span class="mi">25</span><span class="o">);</span></code></pre></figure>
<p>Be sure to go through <code class="language-plaintext highlighter-rouge">tableschema-java</code>’s <a href="https://github.com/frictionlessdata/tableschema-java/blob/master/README.md">README</a> as well to learn more about how to operate with <a href="https://frictionlessdata.io/docs/table-schema/">Table Schema</a>.</p>
<h2 id="contributing">Contributing</h2>
<p>In case you discovered an issue that you’d like to contribute a fix for, or if you would like to extend functionality:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c"># install jabba and maven2</span>
<span class="nv">$ </span><span class="nb">cd </span>tableschema-java
<span class="nv">$ </span>jabba <span class="nb">install </span>1.8
<span class="nv">$ </span>jabba use 1.8
<span class="nv">$ </span>mvn <span class="nb">install</span> <span class="nt">-DskipTests</span><span class="o">=</span><span class="nb">true</span> <span class="nt">-Dmaven</span>.javadoc.skip<span class="o">=</span><span class="nb">true</span> <span class="nt">-B</span> <span class="nt">-V</span>
<span class="nv">$ </span>mvn <span class="nb">test</span> <span class="nt">-B</span></code></pre></figure>
<p>Make sure that all tests pass, and submit a PR with your contributions once you’re ready.</p>
<hr />
<p>We also welcome your feedback and questions via our <a href="http://gitter.im/frictionlessdata/chat">Frictionless Data Gitter chat</a> or via <a href="https://github.com/frictionlessdata/datapackage-java/issues">GitHub issues</a> on the datapackage-java repository.</p>
Georges Labrèche
Collecting, Analysing and Sharing Twitter Data
2018-03-08T00:00:00+00:00
http://okfnlabs.org/blog/2018/03/08/open-data-day-tweets
<p>On March 3, communities around the world marked Open Data Day <a href="http://opendataday.org/#map">in over 400 events</a>. Here’s the <a href="https://github.com/okfn/opendataday/blob/master/Datasets/Events2018.csv">dataset for all Open Data Day 2018 events</a>.</p>
<p>In this post, we will harvest Open Data Day affiliated content from Twitter and analyze it using R before packaging and publishing the data and associated resources publicly on GitHub.</p>
<h2 id="collecting-the-data">Collecting the Data</h2>
<p>With over 300 million monthly users [<a href="https://www.omnicoreagency.com/twitter-statistics/">source</a>, January 2018], Twitter is a popular social network that I particularly like for its abbreviated messages, known as Tweets. <a href="https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets">Twitter’s Standard Search API</a> allows users to mine tweets from as far back as a week for free.</p>
<p><a href="https://www.r-project.org">R</a> is a popular programming language for data analysis and has an active community of contributors that add to its capabilities by writing custom packages for interacting with different tools and platforms and achieving different tasks. In this post, we will employ two such packages:</p>
<ul>
<li><strong><a href="https://cran.r-project.org/web/packages/twitteR/README.html">twitteR</a></strong> allows us to interact with the Twitter API. We will install this from CRAN, the official packages repository for R.</li>
<li>Frictionless Data’s <strong><a href="https://github.com/frictionlessdata/datapackage-r">datapackage.r</a></strong> library will allow us to collate our open data day data and associated resources, such as the R script in one place before we publish it. We will install this from GitHub.</li>
</ul>
<p>To get started, create a new application on <a href="https://apps.twitter.com">apps.twitter.com</a> and take note of the API and access tokens. We will need to specify these in our R script.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># install and load the twitteR library</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"twitteR"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">twitteR</span><span class="p">)</span><span class="w">
</span><span class="c1"># specify Twitter API and Access Tokens</span><span class="w">
</span><span class="n">api_key</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"YOUR_API_KEY"</span><span class="w">
</span><span class="n">api_secret</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"YOUR_API_SECRET"</span><span class="w">
</span><span class="n">access_token</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"YOUR_ACCESS_TOKEN"</span><span class="w">
</span><span class="n">access_secret</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"YOUR_ACCESS_SECRET"</span><span class="w">
</span><span class="n">setup_twitter_oauth</span><span class="p">(</span><span class="n">api_key</span><span class="p">,</span><span class="w"> </span><span class="n">api_secret</span><span class="p">,</span><span class="w"> </span><span class="n">access_token</span><span class="p">,</span><span class="w"> </span><span class="n">access_secret</span><span class="p">)</span></code></pre></figure>
<p>We are now ready to read tweets from the two official Open Data Day hashtags: <a href="https://twitter.com/hashtag/OpenDataDay">#opendataday</a> and <a href="https://twitter.com/hashtag/ODD18">#odd18</a>. With a maximum of 100 tweets per request, <a href="https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets">Twitter’s Search API allows</a> for 180 requests every 15 minutes. Since we are interested in as many tweets as we can get, we will specify the upper limit as 18,000 (180 requests × 100 tweets), which tells the twitteR library the maximum number of tweets to retrieve for us.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># read tweets from the two official hashtags, #opendataday and #odd18</span><span class="w">
</span><span class="n">tweets_opendataday</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">searchTwitteR</span><span class="p">(</span><span class="s2">"#opendataday"</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">18000</span><span class="p">)</span><span class="w">
</span><span class="n">tweets_odd18</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">searchTwitteR</span><span class="p">(</span><span class="s2">"#odd18"</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">18000</span><span class="p">)</span><span class="w">
</span><span class="c1"># view lists of mined tweets from both hashtags</span><span class="w">
</span><span class="n">tweets_opendataday</span><span class="w">
</span><span class="n">tweets_odd18</span></code></pre></figure>
<p>Note: run each <code class="language-plaintext highlighter-rouge">searchTwitteR()</code> call separately, 15 minutes apart, to avoid exceeding the rate limit.</p>
<p>In the R script snippet above, we assigned the results of our search to the variables <code class="language-plaintext highlighter-rouge">tweets_opendataday</code> and <code class="language-plaintext highlighter-rouge">tweets_odd18</code> and called the two variables to view the entire list of tweets obtained. Luckily for us, the total number of tweets on either hashtag is within Twitter’s 15-minute request limit. Here’s the feedback we receive:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># tweets mined on March 7, 2018</span><span class="w">
</span><span class="c1">#opendataday</span><span class="w">
</span><span class="m">18000</span><span class="w"> </span><span class="n">tweets</span><span class="w"> </span><span class="n">were</span><span class="w"> </span><span class="n">requested</span><span class="w"> </span><span class="n">but</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">API</span><span class="w"> </span><span class="n">can</span><span class="w"> </span><span class="n">only</span><span class="w"> </span><span class="n">return</span><span class="w"> </span><span class="m">11458</span><span class="w">
</span><span class="c1">#odd18</span><span class="w">
</span><span class="m">18000</span><span class="w"> </span><span class="n">tweets</span><span class="w"> </span><span class="n">were</span><span class="w"> </span><span class="n">requested</span><span class="w"> </span><span class="n">but</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">API</span><span class="w"> </span><span class="n">can</span><span class="w"> </span><span class="n">only</span><span class="w"> </span><span class="n">return</span><span class="w"> </span><span class="m">3497</span></code></pre></figure>
<p>Here’s a snippet of the list obtained from the #opendataday hashtag:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="s2">"OpenDataAnd: Con motivo del pasado #OpenDataDay, @ODIHQ nos recuerda qué es y para qué sirven los #DatosAbiertos… https://t.co/Fib4rSukbs"</span><span class="w">
</span><span class="p">[[</span><span class="m">2</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="s2">"johnfaig: RT @ODIHQ: Here's our list of seven weird and wonderful open datasets (nominated by you) https://t.co/H42bV5oIhw\n\n#opendataday #opendataday…"</span><span class="w">
</span><span class="p">[[</span><span class="m">3</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="s2">"SurianoRodrigo: RT @CETGAPue: Desde el auditorio Ing. Antonio Osorio García de la @fi_buap, se lleva a cabo el BootCamp #OpenDataDay, al que asisten académ…"</span><span class="w">
</span><span class="p">[[</span><span class="m">4</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="s2">"Carolrozu: RT @CETGAPue: Desde el auditorio Ing. Antonio Osorio García de la @fi_buap, se lleva a cabo el BootCamp #OpenDataDay, al que asisten académ…"</span><span class="w">
</span><span class="p">[[</span><span class="m">5</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="s2">"Josefina_Buxade: RT @CETGAPue: Desde el auditorio Ing. Antonio Osorio García de la @fi_buap, se lleva a cabo el BootCamp #OpenDataDay, al que asisten académ…"</span></code></pre></figure>
<p>Since the entire lists are long (~11,500 tweets on the #opendataday hashtag alone) and hard to comprehend, our best bet is to convert the lists to data frames. In R, data frames allow us to store data in tables and manipulate and analyse it easily. twitteR’s <code class="language-plaintext highlighter-rouge">twListToDF</code> function allows us to convert lists to data frames. After scraping data, it is always a good idea to save the original raw data, as it provides a good base for any analysis work. We will write our data to a CSV file so we can publish it widely. The CSV format is machine-readable and easy to import into any spreadsheet application or advanced tools for analysis.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># convert the list of mined tweets from each hashtag to a dataframe</span><span class="w">
</span><span class="n">tweets_opendataday_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">twListToDF</span><span class="p">(</span><span class="n">tweets_opendataday</span><span class="p">)</span><span class="w">
</span><span class="n">tweets_odd18_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">twListToDF</span><span class="p">(</span><span class="n">tweets_odd18</span><span class="p">)</span><span class="w">
</span><span class="c1"># save scraped data in CSV files</span><span class="w">
</span><span class="n">write.csv</span><span class="p">(</span><span class="n">tweets_opendataday_df</span><span class="p">,</span><span class="w"> </span><span class="n">file</span><span class="o">=</span><span class="s2">"data/opendataday_raw.csv"</span><span class="p">)</span><span class="w">
</span><span class="n">write.csv</span><span class="p">(</span><span class="n">tweets_odd18_df</span><span class="p">,</span><span class="w"> </span><span class="n">file</span><span class="o">=</span><span class="s2">"data/odd18_raw.csv"</span><span class="p">)</span></code></pre></figure>
<p>Here’s what the first five rows of our data frame look like:</p>
<table class="table table-striped table-bordered" style="display: block; overflow:auto">
<thead>
<tr>
<th> </th>
<th>text</th>
<th>favorited</th>
<th>favoriteCount</th>
<th>replyToSN</th>
<th>created</th>
<th>truncated</th>
<th>replyToSID</th>
<th>id</th>
<th>replyToUID</th>
<th>statusSource</th>
<th>screenName</th>
<th>retweetCount</th>
<th>isRetweet</th>
<th>retweeted</th>
<th>longitude</th>
<th>latitude</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Participan como panelistas de la mesa “Datos mañaneros, qué son y para qué sirven los #DatosAbiertos”, Karla Ramos… https://t.co/wFBYaUP68n</td>
<td>FALSE</td>
<td>3</td>
<td>NA</td>
<td>05/03/18 16:29</td>
<td>TRUE</td>
<td>NA</td>
<td>9.70698E+17</td>
<td>NA</td>
<td><a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a></td>
<td>CETGAPue</td>
<td>2</td>
<td>FALSE</td>
<td>FALSE</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>2</td>
<td>RT @Transparen_Xal: A unos minutos de empezar el Open Data Day Xalapa#ODD18 #Xalapa https://t.co/VH3m0QGeOJ</td>
<td>FALSE</td>
<td>0</td>
<td>NA</td>
<td>05/03/18 16:28</td>
<td>FALSE</td>
<td>NA</td>
<td>9.70698E+17</td>
<td>NA</td>
<td><a href="http://www.hootsuite.com" rel="nofollow">Hootsuite</a></td>
<td>AytoXalapa</td>
<td>1</td>
<td>TRUE</td>
<td>FALSE</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>3</td>
<td>Nos encontramos ya en @ImacXalapa con @AytoXalapa para sumar esfuerzos a favor de la cultura de participación ciuda… https://t.co/VdIcF16Ub4</td>
<td>FALSE</td>
<td>0</td>
<td>NA</td>
<td>05/03/18 16:22</td>
<td>TRUE</td>
<td>NA</td>
<td>9.70696E+17</td>
<td>NA</td>
<td><a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a></td>
<td>VERIVAI</td>
<td>1</td>
<td>FALSE</td>
<td>FALSE</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>4</td>
<td>A unos minutos de empezar el Open Data Day Xalapa#ODD18 #Xalapa https://t.co/VH3m0QGeOJ</td>
<td>FALSE</td>
<td>0</td>
<td>NA</td>
<td>05/03/18 16:20</td>
<td>FALSE</td>
<td>NA</td>
<td>9.70696E+17</td>
<td>NA</td>
<td><a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a></td>
<td>Transparen_Xal</td>
<td>1</td>
<td>FALSE</td>
<td>FALSE</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>5</td>
<td>El gobierno de @TonyGali promueve el uso de los #DatosAbiertos. Entra al portal https://t.co/Jz23xpJLAS y consult… https://t.co/UoWP43R8Km</td>
<td>FALSE</td>
<td>5</td>
<td>NA</td>
<td>05/03/18 16:09</td>
<td>TRUE</td>
<td>NA</td>
<td>9.70693E+17</td>
<td>NA</td>
<td><a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a></td>
<td>CETGAPue</td>
<td>4</td>
<td>FALSE</td>
<td>FALSE</td>
<td>NA</td>
<td>NA</td>
</tr>
</tbody>
</table>
<p>For ease of analysis, and because the two data frames have the same columns, let’s merge the two datasets.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># combine dataframes from the two hashtags</span><span class="w">
</span><span class="n">alltweets_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="n">tweets_opendataday_df</span><span class="p">,</span><span class="w"> </span><span class="n">tweets_odd18_df</span><span class="p">)</span><span class="w">
</span><span class="n">write.csv</span><span class="p">(</span><span class="n">alltweets_df</span><span class="p">,</span><span class="w"> </span><span class="n">file</span><span class="o">=</span><span class="s2">"data/allopendatadaytweets.csv"</span><span class="p">)</span></code></pre></figure>
<h2 id="analysing-the-data">Analysing the Data</h2>
<p>Data analysis in R is quite a joy. We will use R’s <code class="language-plaintext highlighter-rouge">dplyr</code> package to analyse our data and answer a few questions:</p>
<ul>
<li>How many Open Data Day attendees tweeted from Android phones?</li>
</ul>
<p>We can answer this using dplyr’s <code class="language-plaintext highlighter-rouge">filter()</code> function, which, as the name suggests, keeps only the rows we are interested in: in this case, tweets sent from the Twitter for Android app. <code class="language-plaintext highlighter-rouge">tally()</code> then counts the matching rows.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># install and load dplyr</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"dplyr"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="c1"># find out number of open data day tweets from android phones</span><span class="w">
</span><span class="n">android_tweets</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">alltweets_df</span><span class="p">,</span><span class="w"> </span><span class="n">grepl</span><span class="p">(</span><span class="s2">"Twitter for Android"</span><span class="p">,</span><span class="w"> </span><span class="n">statusSource</span><span class="p">))</span><span class="w">
</span><span class="n">tally</span><span class="p">(</span><span class="n">android_tweets</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># the result</span><span class="w">
</span><span class="n">n</span><span class="w">
</span><span class="m">1</span><span class="w"> </span><span class="m">5180</span></code></pre></figure>
<p>5,180 of the 14,955 (34.6%) #opendataday and #odd18 tweets were sent from Android phones.</p>
<ul>
<li>Naturally, Open Data Day events cut across many topics and disciplines, and some events included hands-on workshop sessions or hackathons. Let’s find out which open data day tweets point to open source projects and resources that are available on GitHub.</li>
</ul>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># open data day tweets whose text mentions resources on GitHub</span><span class="w">
</span><span class="n">github_resources</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">alltweets_df</span><span class="p">,</span><span class="w"> </span><span class="n">grepl</span><span class="p">(</span><span class="s2">"github.com"</span><span class="p">,</span><span class="w"> </span><span class="n">text</span><span class="p">))</span><span class="w">
</span><span class="n">tally</span><span class="p">(</span><span class="n">github_resources</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># the result</span><span class="w">
</span><span class="n">n</span><span class="w">
</span><span class="m">1</span><span class="w"> </span><span class="m">32</span></code></pre></figure>
<p>Only 32 #opendataday and #odd18 tweets contain GitHub links.</p>
<ul>
<li>Not all open data day tweets are geotagged, but from the few that are, we can create a very basic map to show where people tweeted from. To do this, we will use the <a href="http://leafletjs.com">Leaflet</a> library for R.</li>
</ul>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># install and load leaflet</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"leaflet"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">leaflet</span><span class="p">)</span><span class="w">
</span><span class="c1"># create basic map</span><span class="w">
</span><span class="n">map</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">leaflet</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">addTiles</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">addCircles</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">alltweets_df</span><span class="p">,</span><span class="w"> </span><span class="n">lat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">latitude</span><span class="p">,</span><span class="w"> </span><span class="n">lng</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">longitude</span><span class="p">)</span><span class="w">
</span><span class="c1"># view map</span><span class="w">
</span><span class="n">map</span></code></pre></figure>
<p><img src="/img/posts/opendataday-geotagged-tweets.png" alt="map showing where geotagged #opendataday and #odd18 tweets originated from" />
<br />
<em>figure 1: map showing where geotagged #opendataday and #odd18 tweets originated from</em></p>
<h2 id="sharing-the-data">Sharing the Data</h2>
<p>Due to Twitter’s terms of use, we can only share a stripped-down version of the raw data. Our final dataset contains tweet IDs and retweet counts, and will be packaged alongside this R script so you can re-download the full tweets yourself.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># filter out retweets and leave original tweets</span><span class="w">
</span><span class="n">notretweets_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">dplyr</span><span class="o">::</span><span class="n">filter</span><span class="p">(</span><span class="n">alltweets_df</span><span class="p">,</span><span class="w"> </span><span class="n">grepl</span><span class="p">(</span><span class="s2">"FALSE"</span><span class="p">,</span><span class="w"> </span><span class="n">isRetweet</span><span class="p">))</span><span class="w">
</span><span class="c1"># strip down tweets data to comply with Twitter's terms of use.</span><span class="w">
</span><span class="n">subsetoftweets</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="n">notretweets_df</span><span class="p">,</span><span class="w"> </span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">retweetCount</span><span class="p">)</span><span class="w">
</span><span class="n">write.csv</span><span class="p">(</span><span class="n">subsetoftweets</span><span class="p">,</span><span class="w"> </span><span class="n">file</span><span class="o">=</span><span class="s2">"data/subsetofopendatadaytweets.csv"</span><span class="p">)</span></code></pre></figure>
<h3 id="packaging-the-data-and-associated-resources">Packaging the Data and associated resources</h3>
<p>Providing context when sharing data is important, and Frictionless Data’s <a href="http://frictionlessdata.io/data-packages/">Data Package</a> format makes it possible. Using <a href="https://github.com/frictionlessdata/datapackage-r">datapackage.r</a>, we can infer a schema for the stripped-down tweets CSV file and publish it alongside the other resources.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># specify filepath and infer schema</span><span class="w">
</span><span class="n">filepath</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'/data/subsetofopendatadaytweets.csv'</span><span class="w">
</span><span class="n">schema</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tableschema.r</span><span class="o">::</span><span class="n">infer</span><span class="p">(</span><span class="n">filepath</span><span class="p">)</span></code></pre></figure>
<p>Read more about the datapackage-r package <a href="http://okfnlabs.org/blog/2018/02/14/datapackages-in-r.html">in this post by Open Knowledge Greece</a>.</p>
<p>Alternatively, we can use the <a href="https://create.frictionlessdata.io">Data Package Creator</a> to package our data and associated resources.</p>
<p><img src="/img/posts/opendataday-data-package.png" alt="creating the data package on Data Package Creator" />
<br />
<em>figure 2: creating the data package on Data Package Creator</em></p>
<p>Read more about the data package creator in <a href="http://okfnlabs.org/blog/2018/02/05/data-package-creator.html">this post</a>.</p>
<h3 id="publishing-on-github">Publishing on GitHub</h3>
<p>Once our data package is ready, we can simply publish it to GitHub. Find the open data day tweets data package <a href="https://github.com/frictionlessdata/example-data-packages/tree/master/open-data-day-tweets-2018">here</a>.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Data Packages are a great format for sharing data collections with contextual information: for this dataset, we added metadata and a schema to accompany the final CSV file. Read more about <a href="http://frictionlessdata.io/data-packages/">Data Packages in Frictionless Data</a> and reach out in <a href="http://gitter.im/frictionlessdata/chat">our community chat on Gitter</a>.</p>
Serah Rono
Processing Tabular Data Packages in Go
2018-02-16T00:00:00+00:00
http://okfnlabs.org/blog/2018/02/16/datapackages-in-go
<p>Daniel Fireman was one of 2017’s <a href="https://toolfund.frictionlessdata.io">Frictionless Data Tool Fund</a> grantees, tasked with extending the implementation of the core Frictionless Data libraries in the Go programming language. You can read more about this in <a href="https://frictionlessdata.io/articles/daniel-fireman/">his grantee profile</a>.</p>
<p>In this post, Fireman will show you how to install and use the <a href="http://golang.org">Go</a> libraries for working with <a href="https://frictionlessdata.io/specs/tabular-data-package/">Tabular Data Packages</a>.</p>
<hr />
<p>Our goal in this tutorial is to load a data package from the web and read its metadata and contents.</p>
<h2 id="setup">Setup</h2>
<p>For this tutorial, we will need the <a href="https://github.com/frictionlessdata/datapackage-go">datapackage-go</a> and <a href="https://github.com/frictionlessdata/tableschema-go">tableschema-go</a> packages, which provide all the functionality to deal with a Data Package’s metadata and its contents.</p>
<p>We are going to use the <a href="https://golang.github.io/dep/">dep tool</a> to manage the dependencies of our new project:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span><span class="nb">cd</span> <span class="nv">$GOPATH</span>/src/newdataproj
<span class="nv">$ </span>dep init</code></pre></figure>
<h2 id="the-periodic-table-data-package">The Periodic Table Data Package</h2>
<p>A <a href="https://frictionlessdata.io/specs/data-package/">Data Package</a> is a simple container format used to describe and package a collection of data. It consists of two parts:</p>
<ul>
<li>Metadata that describes the structure and contents of the package</li>
<li>Resources such as data files that form the contents of the package</li>
</ul>
<p>In this tutorial, we are using a <a href="https://frictionlessdata.io/specs/tabular-data-package/">Tabular Data Package</a> containing the periodic table. The package descriptor (<a href="https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json">datapackage.json</a>) and contents (<a href="https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/data.csv">data.csv</a>) are stored on GitHub. This dataset includes the atomic number, symbol, element name, atomic mass, and the metallicity of the element. Here are the header and the first three rows:</p>
<table class="table table-striped table-bordered" style="display: block; overflow:auto">
<thead>
<tr>
<th>atomic number</th>
<th>symbol</th>
<th>name</th>
<th>atomic mass</th>
<th>metal or nonmetal?</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>H</td>
<td>Hydrogen</td>
<td>1.00794</td>
<td>nonmetal</td>
</tr>
<tr>
<td>2</td>
<td>He</td>
<td>Helium</td>
<td>4.002602</td>
<td>noble gas</td>
</tr>
<tr>
<td>3</td>
<td>Li</td>
<td>Lithium</td>
<td>6.941</td>
<td>alkali metal</td>
</tr>
</tbody>
</table>
<h2 id="inspecting-package-metadata">Inspecting Package Metadata</h2>
<p>Let’s start off by creating the <code class="language-plaintext highlighter-rouge">main.go</code>, which loads the data package and inspects some of its metadata.</p>
<figure class="highlight"><pre><code class="language-go" data-lang="go"><span class="k">package</span> <span class="n">main</span>
<span class="k">import</span> <span class="p">(</span>
<span class="s">"fmt"</span>
<span class="s">"github.com/frictionlessdata/datapackage-go/datapackage"</span>
<span class="p">)</span>
<span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">pkg</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">datapackage</span><span class="o">.</span><span class="n">Load</span><span class="p">(</span><span class="s">"https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json"</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="s">"Package loaded successfully."</span><span class="p">)</span>
<span class="p">}</span></code></pre></figure>
<p>Before running the code, we need to tell the dep tool to fetch our project dependencies. Don’t worry; you won’t need to do this again in this tutorial.</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>dep ensure
<span class="nv">$ </span>go run main.go
Package loaded successfully.</code></pre></figure>
<p>Now that you have loaded the periodic table Data Package, you have access to its <code class="language-plaintext highlighter-rouge">title</code> and <code class="language-plaintext highlighter-rouge">name</code> fields through the <a href="https://godoc.org/github.com/frictionlessdata/datapackage-go/datapackage#Package.Descriptor">Package.Descriptor() function</a>. To do so, let’s change our main function to (omitting error handling for the sake of brevity, but we know it is <em>very</em> important):</p>
<figure class="highlight"><pre><code class="language-go" data-lang="go"><span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">pkg</span><span class="p">,</span> <span class="n">_</span> <span class="o">:=</span> <span class="n">datapackage</span><span class="o">.</span><span class="n">Load</span><span class="p">(</span><span class="s">"https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json"</span><span class="p">)</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="s">"Name:"</span><span class="p">,</span> <span class="n">pkg</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">()[</span><span class="s">"name"</span><span class="p">])</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="s">"Title:"</span><span class="p">,</span> <span class="n">pkg</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">()[</span><span class="s">"title"</span><span class="p">])</span>
<span class="p">}</span></code></pre></figure>
<p>And rerun the program:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>go run main.go
Name: period-table
Title: Periodic Table</code></pre></figure>
<p>And as you can see, the printed fields match the <a href="https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json">package descriptor</a>. For more information about the Data Package structure, please take a look at the <a href="https://frictionlessdata.io/specs/data-package/">specification</a>.</p>
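<p>Because <code class="language-plaintext highlighter-rouge">Package.Descriptor()</code> returns a plain map, you are not limited to the <code class="language-plaintext highlighter-rouge">name</code> and <code class="language-plaintext highlighter-rouge">title</code> keys. As a small illustrative sketch (not part of the original example), you could iterate over every top-level descriptor field:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch: print every top-level key in the package descriptor.
// Assumes the pkg variable loaded in the main function above;
// map iteration order is not deterministic.
for key, value := range pkg.Descriptor() {
	fmt.Println(key, "=", value)
}
</code></pre></div></div>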
<h2 id="quick-look-at-the-data">Quick Look At the Data</h2>
<p>Now that you have loaded your Data Package, it is time to process its contents. The package content consists of one or more resources. You can access <a href="https://godoc.org/github.com/frictionlessdata/datapackage-go/datapackage#Resource">Resources</a> via the <a href="https://godoc.org/github.com/frictionlessdata/datapackage-go/datapackage#Package.GetResource()">Package.GetResource()</a> method. Let’s print the periodic table <code class="language-plaintext highlighter-rouge">data</code> resource contents.</p>
<figure class="highlight"><pre><code class="language-go" data-lang="go"><span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">pkg</span><span class="p">,</span> <span class="n">_</span> <span class="o">:=</span> <span class="n">datapackage</span><span class="o">.</span><span class="n">Load</span><span class="p">(</span><span class="s">"https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json"</span><span class="p">)</span>
<span class="n">res</span> <span class="o">:=</span> <span class="n">pkg</span><span class="o">.</span><span class="n">GetResource</span><span class="p">(</span><span class="s">"data"</span><span class="p">)</span>
<span class="n">table</span><span class="p">,</span> <span class="n">_</span> <span class="o">:=</span> <span class="n">res</span><span class="o">.</span><span class="n">ReadAll</span><span class="p">()</span>
<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">row</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">table</span> <span class="p">{</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="n">row</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>go run main.go
<span class="o">[</span>atomic number symbol name atomic mass metal or nonmetal?]
<span class="o">[</span>1 H Hydrogen 1.00794 nonmetal]
<span class="o">[</span>2 He Helium 4.002602 noble gas]
<span class="o">[</span>3 Li Lithium 6.941 alkali metal]
<span class="o">[</span>4 Be Beryllium 9.012182 alkaline earth metal]
...</code></pre></figure>
<p>The <a href="https://godoc.org/github.com/frictionlessdata/datapackage-go/datapackage#Resource.ReadAll">Resource.ReadAll()</a> method loads the whole table into memory as raw strings and returns it as a Go <code class="language-plaintext highlighter-rouge">[][]string</code>. This can be quite useful for taking a quick look at the data or performing a visual sanity check.</p>
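<p>Because <code class="language-plaintext highlighter-rouge">ReadAll()</code> hands you raw strings, any numeric work requires explicit parsing. Here is a short sketch, assuming the layout shown earlier (header in row 0, atomic mass in column 3), that computes the average atomic mass with the standard <code class="language-plaintext highlighter-rouge">strconv</code> package:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch: average atomic mass from the raw [][]string returned by
// Resource.ReadAll(). The column index is an assumption based on
// the table shown earlier in this post.
package main

import (
	"fmt"
	"strconv"

	"github.com/frictionlessdata/datapackage-go/datapackage"
)

func main() {
	pkg, _ := datapackage.Load("https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json")
	table, _ := pkg.GetResource("data").ReadAll()
	var sum float64
	var n int
	for _, row := range table[1:] { // skip the header row
		mass, err := strconv.ParseFloat(row[3], 64)
		if err != nil {
			continue // skip rows whose mass cannot be parsed
		}
		sum += mass
		n++
	}
	fmt.Printf("average atomic mass: %.3f (%d elements)\n", sum/float64(n), n)
}
</code></pre></div></div>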
<h2 id="processing-the-data-packages-content">Processing the Data Package’s Content</h2>
<p>Even though the string representation can be useful for a quick sanity check, you probably want to use actual language types to process the data. Don’t worry, you won’t need to fight the casting battle yourself. Data Package Go libraries provide a rich set of methods to deal with data loading in a very idiomatic way (very similar to <a href="https://golang.org/pkg/encoding/json/">encoding/json</a>).</p>
<p>As an example, let’s change our <code class="language-plaintext highlighter-rouge">main</code> function to use actual types to store the periodic table and print the elements with atomic mass smaller than 10.</p>
<figure class="highlight"><pre><code class="language-go" data-lang="go"><span class="k">package</span> <span class="n">main</span>
<span class="k">import</span> <span class="p">(</span>
<span class="s">"fmt"</span>
<span class="s">"github.com/frictionlessdata/datapackage-go/datapackage"</span>
<span class="s">"github.com/frictionlessdata/tableschema-go/csv"</span>
<span class="p">)</span>
<span class="k">type</span> <span class="n">element</span> <span class="k">struct</span> <span class="p">{</span>
<span class="n">Number</span> <span class="kt">int</span> <span class="s">`tableheader:"atomic number"`</span>
<span class="n">Symbol</span> <span class="kt">string</span> <span class="s">`tableheader:"symbol"`</span>
<span class="n">Name</span> <span class="kt">string</span> <span class="s">`tableheader:"name"`</span>
<span class="n">Mass</span> <span class="kt">float64</span> <span class="s">`tableheader:"atomic mass"`</span>
<span class="n">Metal</span> <span class="kt">string</span> <span class="s">`tableheader:"metal or nonmetal?"`</span>
<span class="p">}</span>
<span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">pkg</span><span class="p">,</span> <span class="n">_</span> <span class="o">:=</span> <span class="n">datapackage</span><span class="o">.</span><span class="n">Load</span><span class="p">(</span><span class="s">"https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json"</span><span class="p">)</span>
<span class="n">resource</span> <span class="o">:=</span> <span class="n">pkg</span><span class="o">.</span><span class="n">GetResource</span><span class="p">(</span><span class="s">"data"</span><span class="p">)</span>
<span class="k">var</span> <span class="n">elements</span> <span class="p">[]</span><span class="n">element</span>
<span class="n">resource</span><span class="o">.</span><span class="n">Cast</span><span class="p">(</span><span class="o">&</span><span class="n">elements</span><span class="p">,</span> <span class="n">csv</span><span class="o">.</span><span class="n">LoadHeaders</span><span class="p">())</span>
<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">e</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">elements</span> <span class="p">{</span>
<span class="k">if</span> <span class="n">e</span><span class="o">.</span><span class="n">Mass</span> <span class="o"><</span> <span class="m">10</span> <span class="p">{</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Printf</span><span class="p">(</span><span class="s">"%+v</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">e</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>go run main.go
<span class="o">{</span>Number:1 Symbol:H Name:Hydrogen Mass:1.00794 Metal:nonmetal<span class="o">}</span>
<span class="o">{</span>Number:2 Symbol:He Name:Helium Mass:4.002602 Metal:noble gas<span class="o">}</span>
<span class="o">{</span>Number:3 Symbol:Li Name:Lithium Mass:6.941 Metal:alkali metal<span class="o">}</span>
<span class="o">{</span>Number:4 Symbol:Be Name:Beryllium Mass:9.012182 Metal:alkaline earth metal<span class="o">}</span></code></pre></figure>
<p>In the example above, all rows in the table are loaded into memory. Then every row is parsed into an <code class="language-plaintext highlighter-rouge">element</code> object and appended to the slice. The <code class="language-plaintext highlighter-rouge">resource.Cast</code> call returns an error if the whole table cannot be successfully parsed.</p>
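<p>In real code you would not discard that error. A minimal sketch of the stricter version, replacing the <code class="language-plaintext highlighter-rouge">resource.Cast</code> call in the main function above:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Fail loudly if any row cannot be parsed into an element.
// (This fragment assumes the imports above plus the standard "log" package.)
var elements []element
if err := resource.Cast(&elements, csv.LoadHeaders()); err != nil {
	log.Fatalf("casting periodic table: %v", err)
}
</code></pre></div></div>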
<p>If you don’t want to load all data in memory at once, you can lazily access each row using <a href="https://godoc.org/github.com/frictionlessdata/datapackage-go/datapackage#Resource.Iter">Resource.Iter</a> and use <a href="https://godoc.org/github.com/frictionlessdata/tableschema-go/schema#Schema.CastRow">Schema.CastRow</a> to cast each row into an <code class="language-plaintext highlighter-rouge">element</code> object. That would change our main function to:</p>
<figure class="highlight"><pre><code class="language-go" data-lang="go"><span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">pkg</span><span class="p">,</span> <span class="n">_</span> <span class="o">:=</span> <span class="n">datapackage</span><span class="o">.</span><span class="n">Load</span><span class="p">(</span><span class="s">"https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json"</span><span class="p">)</span>
<span class="n">resource</span> <span class="o">:=</span> <span class="n">pkg</span><span class="o">.</span><span class="n">GetResource</span><span class="p">(</span><span class="s">"data"</span><span class="p">)</span>
<span class="n">iter</span><span class="p">,</span> <span class="n">_</span> <span class="o">:=</span> <span class="n">resource</span><span class="o">.</span><span class="n">Iter</span><span class="p">(</span><span class="n">csv</span><span class="o">.</span><span class="n">LoadHeaders</span><span class="p">())</span>
<span class="n">sch</span><span class="p">,</span> <span class="n">_</span> <span class="o">:=</span> <span class="n">resource</span><span class="o">.</span><span class="n">GetSchema</span><span class="p">()</span>
<span class="k">var</span> <span class="n">e</span> <span class="n">element</span>
<span class="k">for</span> <span class="n">iter</span><span class="o">.</span><span class="n">Next</span><span class="p">()</span> <span class="p">{</span>
<span class="n">sch</span><span class="o">.</span><span class="n">CastRow</span><span class="p">(</span><span class="n">iter</span><span class="o">.</span><span class="n">Row</span><span class="p">(),</span> <span class="o">&</span><span class="n">e</span><span class="p">)</span>
<span class="k">if</span> <span class="n">e</span><span class="o">.</span><span class="n">Mass</span> <span class="o"><</span> <span class="m">10</span> <span class="p">{</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Printf</span><span class="p">(</span><span class="s">"%+v</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">e</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>go run main.go
<span class="o">{</span>Number:1 Symbol:H Name:Hydrogen Mass:1.00794 Metal:nonmetal<span class="o">}</span>
<span class="o">{</span>Number:2 Symbol:He Name:Helium Mass:4.002602 Metal:noble gas<span class="o">}</span>
<span class="o">{</span>Number:3 Symbol:Li Name:Lithium Mass:6.941 Metal:alkali metal<span class="o">}</span>
<span class="o">{</span>Number:4 Symbol:Be Name:Beryllium Mass:9.012182 Metal:alkaline earth metal<span class="o">}</span></code></pre></figure>
<p>And our code is ready to deal with the growth of the periodic table in a very memory-efficient way :-)</p>
<hr />
<p>We welcome your feedback and questions via our <a href="http://gitter.im/frictionlessdata/chat">Frictionless Data Gitter chat</a> or via <a href="https://github.com/frictionlessdata/datapackage-go/issues">GitHub issues</a> on the datapackage-go repository.</p>
Daniel Fireman
Frictionless Data Lib - A Design Pattern for Accessing Files and Datasets
2018-02-15T00:00:00+00:00
http://okfnlabs.org/blog/2018/02/15/design-pattern-for-a-core-data-library
<p>This document outlines a simple design pattern for a “core” data library, referred to here as <code class="language-plaintext highlighter-rouge">data</code>.</p>
<p>The pattern is focused on access and use of:</p>
<ul>
<li>individual files (streams)</li>
<li>collections of files (“datasets”)</li>
</ul>
<p>Its primary operation is <code class="language-plaintext highlighter-rouge">open</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>file = open('path/to/file.csv')
dataset = open('path/to/files/')
</code></pre></div></div>
<p>It defines a standardized “stream-plus-metadata” interface for file and dataset objects, along with methods for creating these from file or dataset pointers such as file paths or urls.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>file = open('path/to/file.csv')
file.stream()
file.rows()
file.descriptor
file.descriptor.path
...
</code></pre></div></div>
<p>This pattern derives from many years’ experience working on data tools and projects like <a href="https://frictionlessdata.io/">Frictionless Data</a>. Specifically:</p>
<ul>
<li><strong>Data “plus”</strong>: when you work with data you always find yourself needing the data itself plus a little bit more – things like where the data came from on disk (or is going to), or how large it is. This pattern gives you that information in a standardized way.</li>
<li><strong>Streams (and strings)</strong>: streams are the standard way to access data (though strings are useful too) and you should get the same interface whether you’ve loaded data from online, on disk or inline; and, finally, we want both raw byte streams <em>and</em> (for tabular data) object/row streams aka iterators.</li>
<li><strong>Building blocks</strong>: most data wrangling, even in simple cases, involves building data processing pipelines. Pipelines need a standard stream-plus-metadata interface to pass data between steps. For example, suppose you want to load a CSV file, convert it to JSON and write it to stdout: that’s already three steps (load, convert, dump). Then suppose you want to delete the first 3 rows and the 2nd column: now you have a more complex processing pipeline (see the sketch after figure 1 below).</li>
</ul>
<!--
```mermaid
graph TD
loader -- file pointer/stream + metadata -> op1
op1 -- file pointer/stream + metadata -> op2
op2 -- file pointer/stream + metadata -> writer
```
-->
<p><img src="/img/frictionless-data-lib-data-pipeline-20180215.png" alt="" style="width: 220px; display: block; margin: auto;" /></p>
<p style="text-align: center; font-style: italic">Fig 1: data pipelines and the stream-plus-metadata pattern</p>
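<p>To make the pipeline idea concrete, here is a minimal illustrative sketch in Go (deliberately not the data.js API): load an inline CSV, drop a row as the transform step, and dump JSON to stdout.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch of a three-step pipeline: load, transform, dump.
// The inline CSV string stands in for a loader step.
package main

import (
	"encoding/csv"
	"encoding/json"
	"os"
	"strings"
)

func main() {
	src := "name,size\na,100\nb,200\nc,300\n"
	rows, err := csv.NewReader(strings.NewReader(src)).ReadAll() // load
	if err != nil {
		panic(err)
	}
	header, data := rows[0], rows[1:]
	data = data[1:] // transform: drop the first data row
	var out []map[string]string
	for _, r := range data { // key each row by column name
		keyed := map[string]string{}
		for i, h := range header {
			keyed[h] = r[i]
		}
		out = append(out, keyed)
	}
	json.NewEncoder(os.Stdout).Encode(out) // dump to stdout
}
</code></pre></div></div>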
<p>The pattern leverages the Frictionless Data specs including <a href="https://frictionlessdata.io/specs/data-resource/">Data Resource</a>, <a href="https://frictionlessdata.io/specs/table-schema/">Table Schema</a> and <a href="https://frictionlessdata.io/specs/data-package/">Data Package</a>. But it keeps metadata implicit rather than explicit and focuses on giving users the simplest most direct interface possible (put most crudely: <code class="language-plaintext highlighter-rouge">open</code> then <code class="language-plaintext highlighter-rouge">stream</code>). You can find more about the connection with the <a href="https://frictionlessdata.io/">Frictionless Data</a> tooling in the appendix.</p>
<p>Finally, we already have one working implementation of the pattern in JavaScript:</p>
<p><a href="https://github.com/datahq/data.js">https://github.com/datahq/data.js</a></p>
<p>Work on a Python implementation is underway (most of the code is already there in the Python Data Package libraries).</p>
<p><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#overview-of-the-pattern" id="markdown-toc-overview-of-the-pattern">Overview of the Pattern</a></li>
<li><a href="#the-pattern-in-detail" id="markdown-toc-the-pattern-in-detail">The Pattern in Detail</a> <ul>
<li><a href="#open-method" id="markdown-toc-open-method"><code class="language-plaintext highlighter-rouge">open</code> method</a> <ul>
<li><a href="#file-locators" id="markdown-toc-file-locators">File locators</a></li>
</ul>
</li>
<li><a href="#file" id="markdown-toc-file">File</a> <ul>
<li><a href="#metadata-descriptor" id="markdown-toc-metadata-descriptor">Metadata: <code class="language-plaintext highlighter-rouge">descriptor</code></a></li>
<li><a href="#accessing-data" id="markdown-toc-accessing-data">Accessing data</a> <ul>
<li><a href="#stream" id="markdown-toc-stream"><code class="language-plaintext highlighter-rouge">stream</code></a></li>
<li><a href="#rows" id="markdown-toc-rows"><code class="language-plaintext highlighter-rouge">rows</code></a> <ul>
<li><a href="#support-for-tableschema-and-csv-dialect" id="markdown-toc-support-for-tableschema-and-csv-dialect">Support for TableSchema and CSV Dialect</a></li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li><a href="#dataset" id="markdown-toc-dataset">Dataset</a> <ul>
<li><a href="#open-for-datasets" id="markdown-toc-open-for-datasets"><code class="language-plaintext highlighter-rouge">open</code> for datasets</a> <ul>
<li><a href="#dataset-locators" id="markdown-toc-dataset-locators">Dataset Locators</a></li>
</ul>
</li>
<li><a href="#descriptor" id="markdown-toc-descriptor"><code class="language-plaintext highlighter-rouge">descriptor</code></a></li>
<li><a href="#identifier-optional" id="markdown-toc-identifier-optional"><code class="language-plaintext highlighter-rouge">identifier</code> (optional)</a></li>
<li><a href="#readme" id="markdown-toc-readme">README</a></li>
<li><a href="#files" id="markdown-toc-files"><code class="language-plaintext highlighter-rouge">files</code></a></li>
<li><a href="#addfile" id="markdown-toc-addfile">addFile</a></li>
</ul>
</li>
<li><a href="#operators" id="markdown-toc-operators">Operators</a></li>
</ul>
</li>
<li><a href="#conclusion" id="markdown-toc-conclusion">Conclusion</a></li>
<li><a href="#appendix-why-we-need-a-pattern-like-this" id="markdown-toc-appendix-why-we-need-a-pattern-like-this">Appendix: Why we need a pattern like this</a> <ul>
<li><a href="#all-data-wrangling-tools-need-to-load-and-then-pass-around-file-like-objects-as-they-process-data" id="markdown-toc-all-data-wrangling-tools-need-to-load-and-then-pass-around-file-like-objects-as-they-process-data">All data wrangling tools need to load and then pass around “file-like objects” as they process data</a></li>
<li><a href="#a-file-is-more-than-a-byte-stream-the-stream-may-be-structured-and-there-is-usually-the-need-for-associated-metadata" id="markdown-toc-a-file-is-more-than-a-byte-stream-the-stream-may-be-structured-and-there-is-usually-the-need-for-associated-metadata">A file is more than a byte stream: the stream may be structured and there is usually the need for associated metadata</a></li>
<li><a href="#tool-authors-find-themselves-inventing-their-own-stream-plus-metadata-objects--but-they-are-all-different" id="markdown-toc-tool-authors-find-themselves-inventing-their-own-stream-plus-metadata-objects--but-they-are-all-different">Tool authors find themselves inventing their own “stream-plus-metadata” objects … but they are all different</a></li>
<li><a href="#plus-many-tools-also-need-to-access-collection-of-files-ie-datasets" id="markdown-toc-plus-many-tools-also-need-to-access-collection-of-files-ie-datasets">Plus, many tools also need to access collection of files, i.e. datasets</a></li>
<li><a href="#having-a-common-api-pattern-for-files-stream-plus-metadata-and-datasets-would-reduce-duplication-and-support-plug-and-play-with-tooling" id="markdown-toc-having-a-common-api-pattern-for-files-stream-plus-metadata-and-datasets-would-reduce-duplication-and-support-plug-and-play-with-tooling">Having a common API pattern for files (stream-plus-metadata) and datasets would reduce duplication and support plug and play with tooling</a></li>
</ul>
</li>
<li><a href="#appendix-design-principles" id="markdown-toc-appendix-design-principles">Appendix: Design Principles</a> <ul>
<li><a href="#orient-to-the-data-wrangler-workflow" id="markdown-toc-orient-to-the-data-wrangler-workflow">Orient to the data wrangler workflow</a></li>
<li><a href="#zen---maximum-viable-simplicity" id="markdown-toc-zen---maximum-viable-simplicity">Zen - maximum viable simplicity</a> <ul>
<li><a href="#core-objects-should-be-kept-as-simple-as-possible-and-no-simpler" id="markdown-toc-core-objects-should-be-kept-as-simple-as-possible-and-no-simpler">Core objects should be kept as simple as possible (and no simpler)</a></li>
</ul>
</li>
<li><a href="#use-streams" id="markdown-toc-use-streams">Use Streams</a></li>
</ul>
</li>
<li><a href="#appendix-internal-library-structure-suggestions" id="markdown-toc-appendix-internal-library-structure-suggestions">Appendix: Internal Library Structure Suggestions</a> <ul>
<li><a href="#library-components" id="markdown-toc-library-components">Library Components</a></li>
<li><a href="#streams" id="markdown-toc-streams">Streams</a></li>
<li><a href="#loadersparsers-and-writers" id="markdown-toc-loadersparsers-and-writers">Loaders/Parsers and Writers</a></li>
</ul>
</li>
<li><a href="#appendix-api-with-data-package-terminology" id="markdown-toc-appendix-api-with-data-package-terminology">Appendix: API with Data Package terminology</a></li>
<li><a href="#appendix-connection-with-frictionless-data" id="markdown-toc-appendix-connection-with-frictionless-data">Appendix: Connection with Frictionless Data</a> <ul>
<li><a href="#recommendations-for-frictionless-data-community" id="markdown-toc-recommendations-for-frictionless-data-community">Recommendations for Frictionless Data community</a></li>
<li><a href="#why-do-it" id="markdown-toc-why-do-it">Why do it?</a></li>
<li><a href="#relation-to-data-packages" id="markdown-toc-relation-to-data-packages">Relation to Data Packages</a></li>
</ul>
</li>
</ul>
<h1 id="overview-of-the-pattern">Overview of the Pattern</h1>
<p>The pattern is based on the following principles:</p>
<ul>
<li>Data wrangler focused: focus on the core data wrangler workflow: open a file and do something with it</li>
<li>Zen-like: Simplicity and power. As simple as possible: does just what it needs and no more.</li>
<li>Use Streams: a stream-focused library, including object streams (aka iterators).</li>
</ul>
<p>A minimal viable interface for the file case:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// this example uses javascript but the example is generic</span>
<span class="c1">// data.js is just an illustrative name for the library</span>
<span class="kd">const</span> <span class="nx">data</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">data.js</span><span class="dl">'</span><span class="p">)</span>
<span class="c1">// path can be local or remote</span>
<span class="c1">// file is now a data.File object</span>
<span class="kd">const</span> <span class="nx">file</span> <span class="o">=</span> <span class="nx">data</span><span class="p">.</span><span class="nx">open</span><span class="p">(</span><span class="nx">pathOrUrl</span><span class="p">)</span>
<span class="c1">// a byte stream</span>
<span class="nx">file</span><span class="p">.</span><span class="nx">stream</span><span class="p">()</span>
<span class="c1">// if this file is tabular this will give me a row stream (iterator)</span>
<span class="nx">file</span><span class="p">.</span><span class="nx">rows</span><span class="p">()</span>
<span class="c1">// descriptor for this file including info like size (if available)</span>
<span class="c1">// the descriptor follows the Data Resource specification</span>
<span class="c1">// (and if Tabular the Tabular Data Resource spec)</span>
<span class="nx">file</span><span class="p">.</span><span class="nx">descriptor</span>
</code></pre></div></div>
<p>For datasets:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// path or url to a directory (or datapackage.json)</span>
<span class="c1">// dataset is a data.Dataset object</span>
<span class="c1">// note: may rename to openDataset if need to disambiguate from open(file)</span>
<span class="kd">const</span> <span class="nx">dataset</span> <span class="o">=</span> <span class="nx">open</span><span class="p">(</span><span class="nx">pathOrUrl</span><span class="p">)</span>
<span class="c1">// list of files</span>
<span class="nx">dataset</span><span class="p">.</span><span class="nx">files</span>
<span class="c1">// readme (if README.md existed)</span>
<span class="nx">dataset</span><span class="p">.</span><span class="nx">readme</span>
<span class="c1">// any metadata (either inferred or from datapackage.json)</span>
<span class="c1">// this follows the Data Package spec</span>
<span class="nx">dataset</span><span class="p">.</span><span class="nx">descriptor</span>
</code></pre></div></div>
<p>These interfaces can then form the standard basis for lots of additional functionality e.g.</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">infer</span><span class="p">(</span><span class="nx">file</span><span class="p">)</span> <span class="o">=></span> <span class="nx">inferred</span> <span class="nx">tableschema</span> <span class="p">(</span><span class="nx">and</span> <span class="nx">types</span><span class="p">)</span> <span class="k">for</span> <span class="nx">the</span> <span class="nx">columns</span>
<span class="nx">writer</span><span class="p">(</span><span class="nx">file</span><span class="p">)</span> <span class="o">=></span> <span class="nx">stream</span> <span class="p">(</span><span class="k">for</span> <span class="nx">saving</span> <span class="nx">to</span> <span class="nx">disk</span><span class="p">)</span>
<span class="nx">validate</span><span class="p">(</span><span class="nx">file</span><span class="p">)</span> <span class="o">=></span> <span class="nx">validate</span> <span class="nx">a</span> <span class="nx">file</span> <span class="p">(</span><span class="nx">assumes</span> <span class="nx">it</span> <span class="nx">has</span> <span class="nx">a</span> <span class="nx">tableschema</span><span class="p">)</span>
</code></pre></div></div>
<p><em>NOTE: here we have used <code class="language-plaintext highlighter-rouge">file</code> and <code class="language-plaintext highlighter-rouge">dataset</code> terminology. If you are more familiar with the package and resource of the Frictionless Data specs please mentally substitute file => resource and dataset => package.</em></p>
<h1 id="the-pattern-in-detail">The Pattern in Detail</h1>
<p><strong>Note: Support for Datasets is optional.</strong> Supporting datasets is an added layer of complexity and some implementors MAY choose to support files only. If so, they MUST indicate this clearly.</p>
<h2 id="open-method"><code class="language-plaintext highlighter-rouge">open</code> method</h2>
<p>The library MUST provide a method <code class="language-plaintext highlighter-rouge">open</code> which takes a locator to a file and returns a File object:</p>
<pre><code class="language-javascript">open(path/to/file.csv, [options]) => File object
</code></pre>
<p><code class="language-plaintext highlighter-rouge">options</code> is a dictionary (or keyword-argument list) of options. The library MUST support an option <code class="language-plaintext highlighter-rouge">basePath</code>. <code class="language-plaintext highlighter-rouge">basePath</code> is for cases where you want to create a File with a path that is relative to a base directory / path e.g.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>file = open('data.csv', {basePath: '/my/base/path'})
</code></pre></div></div>
<p>will open the file <code class="language-plaintext highlighter-rouge">/my/base/path/data.csv</code>.</p>
<p>This functionality is mainly useful when using Files as part of Datasets where it can be convenient for a File to have a path relative to the directory of the Dataset. (See also Data Package and Data Resource in the Frictionless Data specs).</p>
<h3 id="file-locators">File locators</h3>
<p>Locators can be:</p>
<ul>
<li>A file path</li>
<li>A URL</li>
<li>Raw data in JSON format</li>
<li>A Data Resource (in native language structure)</li>
</ul>
<p>Implementors MUST support file paths, SHOULD support URLs and MAY support the last two.</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">file</span> <span class="o">=</span> <span class="nx">open</span><span class="p">(</span><span class="dl">'</span><span class="s1">/path/to/file.csv</span><span class="dl">'</span><span class="p">)</span>
<span class="nx">file</span> <span class="o">=</span> <span class="nx">open</span><span class="p">(</span><span class="dl">'</span><span class="s1">https://example.com/data.xls</span><span class="dl">'</span><span class="p">)</span>
<span class="c1">// loading raw data</span>
<span class="nx">file</span> <span class="o">=</span> <span class="nx">open</span><span class="p">({</span>
<span class="na">name</span><span class="p">:</span> <span class="dl">'</span><span class="s1">mydata</span><span class="dl">'</span><span class="p">,</span>
<span class="na">data</span><span class="p">:</span> <span class="p">{</span> <span class="c1">// can be any javascript - an object, an array or a string or ...</span>
<span class="na">a</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
<span class="na">b</span><span class="p">:</span> <span class="mi">2</span>
<span class="p">}</span>
<span class="p">})</span>
<span class="c1">// Loading with a descriptor - this allows more fine-grained configuration</span>
<span class="c1">// The descriptor should follow the Frictionless Data Resource model</span>
<span class="c1">// http://specs.frictionlessdata.io/data-resource/</span>
<span class="nx">file</span> <span class="o">=</span> <span class="nx">open</span><span class="p">({</span>
<span class="c1">// file or url path</span>
<span class="na">path</span><span class="p">:</span> <span class="dl">'</span><span class="s1">https://example.com/data.csv</span><span class="dl">'</span><span class="p">,</span>
<span class="c1">// a Table Schema - https://specs.frictionlessdata.io/table-schema/</span>
<span class="na">schema</span><span class="p">:</span> <span class="p">{</span>
<span class="na">fields</span><span class="p">:</span> <span class="p">[</span>
<span class="p">...</span>
<span class="p">]</span>
<span class="p">}</span>
<span class="c1">// CSV dialect - https://specs.frictionlessdata.io/csv-dialect/</span>
<span class="nl">dialect</span><span class="p">:</span> <span class="p">{</span>
<span class="c1">// this is tab separated CSV/DSV</span>
<span class="na">delimiter</span><span class="p">:</span> <span class="dl">'</span><span class="se">\\</span><span class="s1">t</span><span class="dl">'</span>
<span class="p">}</span>
<span class="p">})</span>
</code></pre></div></div>
<h2 id="file">File</h2>
<p>The File instance MUST have the following properties and methods</p>
<h3 id="metadata-descriptor">Metadata: <code class="language-plaintext highlighter-rouge">descriptor</code></h3>
<p>Main metadata is available via the <code class="language-plaintext highlighter-rouge">descriptor</code>:</p>
<pre><code class="language-javascript">file.descriptor
</code></pre>
<p>The descriptor follows the Frictionless Data <a href="https://frictionlessdata.io/specs/data-resource/">Data Resource</a> spec.</p>
<p>The descriptor metadata is a combination of the metadata passed in at File creation (if you created the File with a descriptor object) and auto-inferred information from the File path. This is the info that SHOULD be auto-inferred:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>path: path this was instantiated with - may not be same as file.path (depending on basePath)
pathType: remote | local
name: file name (without extension)
format: the extension
mediatype: mimetype based on file name and extension
</code></pre></div></div>
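<p>As a rough illustration (the inference rules below are assumptions for the sketch, not mandated by any spec), such auto-inference could look like:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch: infer descriptor fields from a file locator, in the
// spirit of the list above.
package main

import (
	"fmt"
	"mime"
	"path/filepath"
	"strings"
)

func inferDescriptor(path string) map[string]string {
	ext := filepath.Ext(path) // e.g. ".csv"
	pathType := "local"
	if strings.HasPrefix(path, "http://") || strings.HasPrefix(path, "https://") {
		pathType = "remote"
	}
	return map[string]string{
		"path":     path,
		"pathType": pathType,
		"name":     strings.TrimSuffix(filepath.Base(path), ext),
		"format":   strings.TrimPrefix(ext, "."),
		// may be empty when the extension is unknown to the mime package
		"mediatype": mime.TypeByExtension(ext),
	}
}

func main() {
	fmt.Println(inferDescriptor("https://example.com/data.csv"))
}
</code></pre></div></div>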
<p>In addition to this metadata there are certain properties which MAY be computed on demand and SHOULD be available as getters on the file object:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// the full path to the file (using basepath)</span>
<span class="kd">const</span> <span class="nx">path</span> <span class="o">=</span> <span class="nx">file</span><span class="p">.</span><span class="nx">path</span>
<span class="kd">const</span> <span class="nx">size</span> <span class="o">=</span> <span class="nx">file</span><span class="p">.</span><span class="nx">size</span>
<span class="c1">// md5 hash of the file</span>
<span class="kd">const</span> <span class="nx">hash</span> <span class="o">=</span> <span class="nx">file</span><span class="p">.</span><span class="nx">hash</span>
<span class="c1">// file encoding</span>
<span class="kd">const</span> <span class="nx">encoding</span> <span class="o">=</span> <span class="nx">file</span><span class="p">.</span><span class="nx">encoding</span>
</code></pre></div></div>
<p><strong>Note</strong>: size and hash are not available for remote Files (those created from URLs).</p>
<h3 id="accessing-data">Accessing data</h3>
<p>Accessing data in the file:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// byte stream</span>
<span class="nx">file</span><span class="p">.</span><span class="nx">stream</span><span class="p">()</span>
<span class="c1">// if file is tabular</span>
<span class="c1">// crude rows - no type casting etc</span>
<span class="nx">file</span><span class="p">.</span><span class="nx">rows</span><span class="p">(</span><span class="nx">cast</span><span class="o">=</span><span class="nx">False</span><span class="p">,</span> <span class="nx">keyed</span><span class="o">=</span><span class="nx">False</span><span class="p">,</span> <span class="p">...)</span>
<span class="c1">// entire file as a buffer/string (be careful with large files!)</span>
<span class="nx">file</span><span class="p">.</span><span class="nx">buffer</span><span class="p">()</span>
<span class="c1">// (optional)</span>
<span class="c1">// if tabular return entire set of rows as an array</span>
<span class="nx">file</span><span class="p">.</span><span class="nx">array</span><span class="p">()</span>
<span class="c1">// EXPERIMENTAL</span>
<span class="c1">// file object packed into a stream</span>
<span class="c1">// metadata is first line (\n separated)</span>
<span class="c1">// motivation: way to send object over single stdin/stdout pipe</span>
<span class="nx">file</span><span class="p">.</span><span class="nx">packed</span><span class="p">()</span>
</code></pre></div></div>
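<p>As a sketch of the experimental <code class="language-plaintext highlighter-rouge">packed</code> format described above (the descriptor as a JSON first line, followed by the raw bytes), assuming <code class="language-plaintext highlighter-rouge">file.stream()</code> is async-iterable as Node streams are:</p>
<pre><code class="language-javascript=">// pack a file object into a single stream: metadata first line, then bytes
async function* packed(file) {
  yield JSON.stringify(file.descriptor) + '\n'  // metadata as the first line
  yield* file.stream()                          // then the raw byte stream
}
</code></pre>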
<h4 id="stream"><code class="language-plaintext highlighter-rouge">stream</code></h4>
<p>A raw byte stream:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>stream()
</code></pre></div></div>
<h4 id="rows"><code class="language-plaintext highlighter-rouge">rows</code></h4>
<p>Get the rows for this file as an object stream / iterator.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>file.rows(cast=False, keyed=False, ...) =>
iterator with items [val1, val2, val3, ...]
</code></pre></div></div>
<ul>
<li><code class="language-plaintext highlighter-rouge">keyed</code>: if <code class="language-plaintext highlighter-rouge">false</code> (default) returns rows as arrays i.e. [val1, val2, val3]. If <code class="language-plaintext highlighter-rouge">true</code> returns rows as objects i.e.. <code class="language-plaintext highlighter-rouge">{col1: val1, col2: val2, ...}</code>.</li>
<li><code class="language-plaintext highlighter-rouge">cast</code>: if <code class="language-plaintext highlighter-rouge">false</code> (default) returns values uncast. If true attempts to cast values either using best-effort or TableSchema if available</li>
<li><code class="language-plaintext highlighter-rouge">addRowNumber</code>: default <code class="language-plaintext highlighter-rouge">false</code>. Add first value or column <code class="language-plaintext highlighter-rouge">_id</code> to resulting rows with row number. [OPTIONAL for implementors]</li>
</ul>
<p><strong>Note:</strong> this method assumes underlying data is tabular. The library SHOULD raise an appropriate error if called on a non-tabular file. It is also up to implementors what tabular formats they support (there are many). At a minimum the library MUST support CSV. It SHOULD support JSON and it MAY (it is desirable) support Excel.</p>
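<p>To illustrate the flags, here is how the same file might read under each mode (a sketch written with a JavaScript-style options object; the exact calling convention is up to the implementation):</p>
<pre><code class="language-javascript=">const file = open('mydata.csv')
// default: uncast rows as arrays of strings
for (const row of file.rows()) {
  // row = ['1', 'H', 'Hydrogen', ...]
}
// keyed and cast: typed objects, using the Table Schema if one was provided
for (const row of file.rows({cast: true, keyed: true})) {
  // row = {'atomic number': 1, symbol: 'H', name: 'Hydrogen', ...}
}
</code></pre>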
<h5 id="support-for-tableschema-and-csv-dialect">Support for TableSchema and CSV Dialect</h5>
<p>The library SHOULD support <a href="https://frictionlessdata.io/specs/table-schema/">Table Schema</a> and <a href="https://frictionlessdata.io/specs/csv-dialect/">CSV Dialect</a> in the <code class="language-plaintext highlighter-rouge">rows</code> method using metadata provided when the file was <code class="language-plaintext highlighter-rouge">open</code>ed:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="c1">// load a CSV with a non-standard dialect e.g. tab separated or semi-colon separated</span>
<span class="nx">file</span> <span class="o">=</span> <span class="nx">open</span><span class="p">({</span>
<span class="na">path</span><span class="p">:</span> <span class="dl">'</span><span class="s1">mydata.tsv</span><span class="dl">'</span>
<span class="c1">// Full support for http://specs.frictionlessdata.io/csv-dialect/</span>
<span class="na">dialect</span><span class="p">:</span> <span class="p">{</span>
<span class="na">delimiter</span><span class="p">:</span> <span class="dl">'</span><span class="se">\\</span><span class="s1">t</span><span class="dl">'</span> <span class="c1">// for tabs or ';' for semi-colons etc</span>
<span class="p">}</span>
<span class="p">})</span>
<span class="nx">file</span><span class="p">.</span><span class="nx">rows</span><span class="p">()</span> <span class="c1">// use the dialect info in parsing the csv</span>
<span class="c1">// open a CSV with a Table Schema</span>
<span class="nx">file</span> <span class="o">=</span> <span class="nx">open</span><span class="p">({</span>
<span class="na">path</span><span class="p">:</span> <span class="dl">'</span><span class="s1">mydata.csv</span><span class="dl">'</span>
<span class="c1">// Full support for Table Schema https://specs.frictionlessdata.io/table-schema/</span>
<span class="na">schema</span><span class="p">:</span> <span class="p">{</span>
<span class="na">fields</span><span class="p">:</span> <span class="p">[</span>
<span class="p">{</span>
<span class="na">name</span><span class="p">:</span> <span class="dl">'</span><span class="s1">Column 1</span><span class="dl">'</span><span class="p">,</span>
<span class="na">type</span><span class="p">:</span> <span class="dl">'</span><span class="s1">integer</span><span class="dl">'</span>
<span class="p">},</span>
<span class="p">...</span>
<span class="p">]</span>
<span class="p">}</span>
<span class="p">})</span>
</code></pre></div></div>
<h2 id="dataset">Dataset</h2>
<p>A collection of data files with optional metadata.</p>
<p>Under the hood it heavily uses <a href="https://frictionlessdata.io/specs/data-package/">Data Package</a> formats and natively supports them, including loading from <code class="language-plaintext highlighter-rouge">datapackage.json</code> files. However, it does not require knowledge or use of Data Packages.</p>
<p>A Dataset has two key properties:</p>
<pre><code class="language-javascript=">// metadata
dataset.descriptor
// files in the dataset
dataset.files
</code></pre>
<h3 id="open-for-datasets"><code class="language-plaintext highlighter-rouge">open</code> for datasets</h3>
<p>The library MUST provide a method <code class="language-plaintext highlighter-rouge">openDataset</code> that takes a locator to a dataset and returns a Dataset object:</p>
<pre><code class="language-javascript=">openDataset(path/to/dataset/) => Dataset object
</code></pre>
<p>The library MAY overload the <code class="language-plaintext highlighter-rouge">open</code> method to support datasets as well as files:</p>
<pre><code class="language-javascript=">open(path/to/dataset/) => Dataset object
</code></pre>
<p><em>Note: overloading can be tricky as disambiguating locators for files from locators for datasets is not always trivial.</em></p>
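<p>One possible, purely illustrative heuristic for that disambiguation: treat descriptor objects, directories and <code class="language-plaintext highlighter-rouge">datapackage.json</code> paths as datasets, and everything else as a file (<code class="language-plaintext highlighter-rouge">openFile</code> here is a stand-in for the file-opening path):</p>
<pre><code class="language-javascript=">function looksLikeDataset(locator) {
  if (typeof locator === 'object') return true          // descriptor object
  if (locator.endsWith('/')) return true                // directory
  if (locator.endsWith('datapackage.json')) return true // package descriptor file
  return false
}

function open(locator) {
  return looksLikeDataset(locator) ? openDataset(locator) : openFile(locator)
}
</code></pre>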
<h4 id="dataset-locators">Dataset Locators</h4>
<p><code class="language-plaintext highlighter-rouge">path/to/dataset</code> - can be one of:</p>
<ul>
<li>local path to Dataset</li>
<li>remote url to Dataset</li>
<li>descriptor object (i.e. datapackage.json)</li>
</ul>
<h3 id="descriptor"><code class="language-plaintext highlighter-rouge">descriptor</code></h3>
<p>A Dataset MUST have a <code class="language-plaintext highlighter-rouge">descriptor</code> which holds the Dataset metadata. The descriptor MUST follow the <a href="https://frictionlessdata.io/specs/data-package/">Data Package</a> spec.</p>
<p>The Dataset SHOULD have the convenience attribute <code class="language-plaintext highlighter-rouge">path</code> which is the path (remote or local) to this dataset.</p>
<h3 id="identifier-optional"><code class="language-plaintext highlighter-rouge">identifier</code> (optional)</h3>
<p>A Dataset MAY have an <code class="language-plaintext highlighter-rouge">identifier</code> property that encapsulates the location (or origin) of this Dataset. The identifier property MUST have the following structure:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
name: <name>, // computed from path
owner: <owner>, // may be null
path: <path>, // computed path
type: <type>, // e.g. local, url, github, datahub, ...
original: <path>, // path (file or url) as originally supplied
version: <version> // version as computed
}
</code></pre></div></div>
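<p>For example, a dataset opened from a GitHub URL might parse to something like this (all values hypothetical):</p>
<pre><code class="language-javascript=">{
  name: 'finance-vix',
  owner: 'datasets',
  path: 'https://raw.githubusercontent.com/datasets/finance-vix/master/',
  type: 'github',
  original: 'https://github.com/datasets/finance-vix',
  version: 'master'
}
</code></pre>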
<p><em>Note: the identifier is parsed from the locator passed into the open method. See the Data Package identifier spec https://frictionlessdata.io/specs/data-package-identifier/ and implementation in data.js library https://github.com/datahq/data.js#parsedatasetidentifier</em></p>
<h3 id="readme">README</h3>
<p>The Dataset object MAY support a <code class="language-plaintext highlighter-rouge">readme</code> property which returns a string corresponding to the README for this Dataset (if it exists).</p>
<p>The readme content is taken from the README.md file located in the Dataset root directory or, if that does not exist, from the <code class="language-plaintext highlighter-rouge">readme</code> property on the descriptor. If neither of those exists, the readme will be undefined or null.</p>
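<p>A sketch of that fallback logic for a local dataset, assuming the <code class="language-plaintext highlighter-rouge">path</code> and <code class="language-plaintext highlighter-rouge">descriptor</code> attributes described above (Node-flavoured):</p>
<pre><code class="language-javascript=">const fs = require('fs')
const path = require('path')

// resolve the readme: README.md on disk first, then descriptor.readme, else null
function readme(dataset) {
  const readmePath = path.join(dataset.path, 'README.md')
  if (fs.existsSync(readmePath)) {
    return fs.readFileSync(readmePath, 'utf8')
  }
  return dataset.descriptor.readme || null
}
</code></pre>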
<h3 id="files"><code class="language-plaintext highlighter-rouge">files</code></h3>
<p>A Dataset MUST have a <code class="language-plaintext highlighter-rouge">files</code> property which returns an array of the Files contained in this Dataset:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dataset.files => Array(<File>)
</code></pre></div></div>
<h3 id="addfile">addFile</h3>
<p>The library SHOULD implement an <code class="language-plaintext highlighter-rouge">addFile</code> method to add a <code class="language-plaintext highlighter-rouge">File</code> to a Dataset:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dataset.addFile(file)
</code></pre></div></div>
<ul>
<li><code class="language-plaintext highlighter-rouge">file</code>: an already instantiated File object or a File descriptor</li>
</ul>
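<p>Usage might look like this (the descriptor form shown is an assumption consistent with the Data Resource spec):</p>
<pre><code class="language-javascript=">const dataset = openDataset('path/to/dataset/')
// add an already instantiated File ...
dataset.addFile(open('extra.csv'))
// ... or a File descriptor
dataset.addFile({path: 'extra.csv', format: 'csv'})
</code></pre>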
<h2 id="operators">Operators</h2>
<p>Finally, we discuss some operators. These SHOULD NOT be part of the core library, but it is useful to be aware of them (a combined sketch follows the list):</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">infer(file) => TableSchema</code>: infer the <a href="https://frictionlessdata.io/specs/table-schema/">Table Schema</a> for a CSV file or other tabular file
<ul>
<li><code class="language-plaintext highlighter-rouge">inferStructure(file)</code>: infer the structure i.e. <a href="https://frictionlessdata.io/specs/csv-dialect/">CSV Dialect</a> of a CSV or other tabular file. In addition to CSV dialect properties this may include things like <code class="language-plaintext highlighter-rouge">skipRows</code> i.e. number of rows to skip</li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">validate(file/dataset, metadataOnly=False)</code>: validate the data in a file e.g. against its schema
<ul>
<li><code class="language-plaintext highlighter-rouge">metadataOnly</code>: only validate the metadata e.g. against the <a href="https://frictionlessdata.io/specs/data-package/">Data Package</a> or <a href="https://frictionlessdata.io/specs/data-resource/">Data Resource</a> schemas.</li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">write(file/dataset)</code>: write a File or Dataset to disk</li>
</ul>
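<p>Putting the operators together, a typical flow could look like the sketch below (written with a JavaScript-style options object; these operators live outside the core library):</p>
<pre><code class="language-javascript=">const file = open('mydata.csv')
// guess the Table Schema from the data
const schema = infer(file)
// check the data against the (inferred or supplied) schema
validate(file)
// or only check the metadata against the Data Resource schema
validate(file, {metadataOnly: true})
// persist the file to disk
write(file)
</code></pre>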
<h1 id="conclusion">Conclusion</h1>
<p>In this document we’ve outlined a “Frictionless Data Library” pattern that standardizes the design of a “core” data library API focused on accessing files and datasets.</p>
<p>Almost all data wrangling work involves opening data streams and passing them between processes. Standardizing the API would have major benefits for tool creators and users, making it quicker and easier to develop tooling as well as making tooling more “plug and play”.</p>
<h1 id="appendix-why-we-need-a-pattern-like-this">Appendix: Why we need a pattern like this</h1>
<h2 id="all-data-wrangling-tools-need-to-load-and-then-pass-around-file-like-objects-as-they-process-data">All data wrangling tools need to load and then pass around “file-like objects” as they process data</h2>
<p>All data tools need to access files/streams:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data = open(path/to/file.csv)
</code></pre></div></div>
<p>And so every programming language and every tool has a method for opening a file path and returning a byte stream.</p>
<p>But …</p>
<h2 id="a-file-is-more-than-a-byte-stream-the-stream-may-be-structured-and-there-is-usually-the-need-for-associated-metadata">A file is more than a byte stream: the stream may be structured and there is usually the need for associated metadata</h2>
<p>Often we need more than just a byte stream, for example:</p>
<ul>
<li>We may want the stream to be structured: if it is a CSV file we’re opening we’d like to get a stream of row objects not just a stream of bytes</li>
<li>We may want file metadata to be available (where did the file come from, how big is the file, when was it last modified)</li>
<li>We may want schema information: not just the CSV file but type information on its columns (this would allow us to reliably cast the CSV data to proper types when reading)</li>
<li>And we may even want to add metadata ourselves (perhaps automatically), for example guessing the types of the columns in a CSV</li>
</ul>
<p><em>A file is more than a byte stream: the stream may be structured and there is usually the need for associated metadata, at a minimum the name and size of the file but also extending to things like a file schema.</em></p>
<h2 id="tool-authors-find-themselves-inventing-their-own-stream-plus-metadata-objects--but-they-are-all-different">Tool authors find themselves inventing their own “stream-plus-metadata” objects … but they are all different</h2>
<p>Tool authors find themselves inventing their own file-like “stream-plus-metadata” objects to describe the files they open.</p>
<p><em>Note: Many languages have a “file-like” object that usually consists of a stream plus some metadata (e.g. the Python <code class="language-plaintext highlighter-rouge">file</code> object, Node Streams etc). But this is not standardized and is often inadequate, so tool makers end up wrapping or replacing it.</em></p>
<p>This is not just about opening files but about passing streams around, because most tools, even very simple ones, start to contain implicit mini data pipelines:</p>
<!--
```mermaid
graph LR
file[File on Disk] --"open"-> fileobj[Stream / File-like Object]
fileobj --parse-> strstream[Structured Stream]
strstream -.-> other[More ...]
```
-->
<p><img src="/img/frictionless-data-lib-streams-20180215.png" alt="" style="width: 600px; display: block; margin: auto;" /></p>
<p>These stream-plus-metadata objects contain implicit mini-metadata standards for describing files and collections of files (“datasets”). These mini-metadata standards look like <a href="https://frictionlessdata.io/specs/data-resource/">Data Resource</a>, <a href="https://frictionlessdata.io/specs/table-schema/">Table Schema</a>, <a href="https://frictionlessdata.io/specs/data-package/">Data Package</a> etc.</p>
<p>But these stream-plus-metadata objects and their mini-metadata are all a little different across the various languages and tools.</p>
<h2 id="plus-many-tools-also-need-to-access-collection-of-files-ie-datasets">Plus, many tools also need to access collection of files, i.e. datasets</h2>
<p>Many tools want to access collections of files e.g. datasets:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dataset = open(/path/to/dataset)
</code></pre></div></div>
<p>Datasets already require some structure to list their collection of files and usually require some additional metadata ranging from where the dataset was loaded from to items such as its license.</p>
<p>You can even have datasets without multiple files when the file you are using is implicitly a dataset. For example, an Excel file is really a dataset if you think of each sheet as a separate file stream, or think of an SQLite database.</p>
<h2 id="having-a-common-api-pattern-for-files-stream-plus-metadata-and-datasets-would-reduce-duplication-and-support-plug-and-play-with-tooling">Having a common API pattern for files (stream-plus-metadata) and datasets would reduce duplication and support plug and play with tooling</h2>
<p>Standardizing the structure of these stream-plus-metadata file objects (and dataset objects), and building standard libraries to create them from file/dataset pointers would:</p>
<ul>
<li>Reduce repetition / allow for reuse across tools: at present, data wrangling tools write this themselves. With a standard they would have a common pattern and may even be able to use a common underlying library.</li>
<li>Support plug and play: new wrangling tools can operate on these standard file and dataset objects. For example, an inference library that given a file object returns an inferred schema, or a converter that converts xls => csv.</li>
</ul>
<h1 id="appendix-design-principles">Appendix: Design Principles</h1>
<p>The pattern is based on the following principles:</p>
<ul>
<li>Data wrangler focused: focus on the core data wrangler workflow: open a file and do something with it</li>
<li>Zen-like: Simplicity and power. As simple as possible: does just what it needs and no more.</li>
<li>Use Streams: stream focused library, including object streams.</li>
</ul>
<h2 id="orient-to-the-data-wrangler-workflow">Orient to the data wrangler workflow</h2>
<p><em>See motivation section above</em></p>
<ul>
<li>Open => Read / Stream</li>
<li>[optional] Inspect</li>
<li>Check</li>
<li>Operate on</li>
<li>Write</li>
</ul>
<h2 id="zen---maximum-viable-simplicity">Zen - maximum viable simplicity</h2>
<p>As simple as possible. Does just what it needs and no more. Simple and powerful.</p>
<p>Zen =></p>
<ul>
<li>“thin” (vs fat) objects: all complex operators such as infer or dump operate <em>on</em> objects rather than becoming part of them</li>
<li>a single open method to get data (file or dataset)</li>
<li>hide metadata by default (data package, data resource etc are in the background)</li>
</ul>
<h3 id="core-objects-should-be-kept-as-simple-as-possible-and-no-simpler">Core objects should be kept as simple as possible (and no simpler)</h3>
<p>=> Inversion of control where possible so that we don’t end up with “fat” core classes e.g.</p>
<p>A. Saving data to disk should be done by separate objects that operate on the main objects rather than being built into them, e.g.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>const writer = CSVWriter()
writer.write(dataLibFileObjectInstance, filePath, [options])
</code></pre></div></div>
<p>rather than e.g.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dataLibFileObjectInstance.saveToCsv(filePath)
</code></pre></div></div>
<p><em>If there is a simple way to invert the dependency (i.e. not have all the different dumpers in the main lib) but still have a simple save method, that would be fine.</em></p>
<p>B. Similarly for parsers (though reading is so essential that read needs to be part of the class)</p>
<p>C. infer, validate etc should operate <em>on</em> Files rather than be part of them …</p>
<pre><code class="language-javascript=">const tableschema = infer(fileObj)
</code></pre>
<p>Rather than</p>
<pre><code class="language-javascript=">fileObj.inferSchema()
</code></pre>
<h2 id="use-streams">Use Streams</h2>
<p>Streams are the natural way to handle data, and they scale to large datasets.</p>
<p>The library should be stream-focused, including support for object streams.</p>
<h1 id="appendix-internal-library-structure-suggestions">Appendix: Internal Library Structure Suggestions</h1>
<p><em>These are some suggestions for how implementors could structure their library internally. They are entirely optional.</em></p>
<h2 id="library-components">Library Components</h2>
<p>In the top-level library, just have Dataset and File (+ TabularFile):</p>
<pre><code class="language-mermaid">graph TD
Dataset[Dataset/Package] --> File[File/Resource]
File --> TabularFile
TabularFile -.-> TableSchema
TableSchema -.-> Field
parsers((Parsers))
dumpers((Writers))
tools((Tools))
tools --> infer
infer --> validate
classDef medium fill:lightblue,stroke:#333,stroke-width:4px;
</code></pre>
<pre><code class="language-mermaid">graph TD
parsers((Parsers))
dumpers((Writers))
subgraph Parsers - Tabular
csv["CSV parse(resource) -> row stream"]
xls["XLS ..."]
end
subgraph Writers - Tabular
ascii
csvdump[CSV]
xlsdump[XLS]
markdown
end
parsers --> csv
parsers --> xls
dumpers --> ascii
dumpers --> markdown
dumpers --> xlsdump
dumpers --> csvdump
</code></pre>
<h2 id="streams">Streams</h2>
<pre><code class="language-mermaid">graph LR
in1[File,URL,Stream] -- stream--> stream[Byte Stream + Meta]
stream --parse--> objstream[Obj Stream+Meta]
objstream --unparse--> stream2[Byte Stream + Meta]
stream --write--> out
stream2 --writer--> out[file/stream]
</code></pre>
<p>open => yields descriptor and file stream
parse => yields file rows (internally uses parsers)
writer => writes out to a file/stream (internally uses writers)</p>
<pre><code class="language-javascript=">// aka write
writer(File) => readable stream
parser(File) => object stream
</code></pre>
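<p>As a sketch, the whole pipeline can be wired together from these pieces (the <code class="language-plaintext highlighter-rouge">csv.parse</code> and <code class="language-plaintext highlighter-rouge">csv.unparse</code> names are illustrative, matching the diagram above; the output is assumed to be a Node stream):</p>
<pre><code class="language-javascript=">const file = open('mydata.csv')  // descriptor + byte stream
const rows = csv.parse(file)     // parse: byte stream to object stream of rows
const out = csv.unparse(rows)    // unparse: object stream back to a byte stream
out.pipe(process.stdout)         // write to a destination
</code></pre>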
<h2 id="loadersparsers-and-writers">Loaders/Parsers and Writers</h2>
<p>Loaders/Parsers and Writers should be an extensible list.</p>
<p>Inversion of control is important: the core library does <strong>not</strong> depend directly on parsers (that way we can hot swap and/or extend the list at runtime).</p>
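<p>A minimal sketch of such a runtime-extensible registry (names are illustrative; <code class="language-plaintext highlighter-rouge">csvParse</code> would be supplied by a plugin):</p>
<pre><code class="language-javascript=">const parsers = {}

// plugins register themselves; the core library never imports them directly
function registerParser(format, parse) {
  parsers[format] = parse
}

function rows(file) {
  const parse = parsers[file.descriptor.format]
  if (!parse) throw new Error('no parser for format: ' + file.descriptor.format)
  return parse(file)
}

registerParser('csv', csvParse)
</code></pre>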
<p>Parsers:</p>
<pre><code class="language-javascript=">// file is a data.File object
parse(file) => row stream
</code></pre>
<p>Writers are similar:</p>
<pre><code class="language-javascript=">// e.g. csv.js
// dump to CSV file
write(file, path) => null
</code></pre>
<p>Note we may want a writer for datasets as well, e.g. a writer to datapackage.json or to SQL or …</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">write</span><span class="p">(</span><span class="nx">dataset</span><span class="p">,</span> <span class="nx">destination</span> <span class="p">...)</span>
</code></pre></div></div>
<h1 id="appendix-api-with-data-package-terminology">Appendix: API with Data Package terminology</h1>
<p><em>In progress</em></p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// data.js is just an illustrative name for the library</span>
<span class="kd">var</span> <span class="nx">data</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">data.js</span><span class="dl">'</span><span class="p">)</span>
<span class="c1">// path can be local or remote</span>
<span class="kd">const</span> <span class="nx">resource</span> <span class="o">=</span> <span class="nx">data</span><span class="p">.</span><span class="nx">open</span><span class="p">(</span><span class="nx">pathOrUrl</span><span class="p">)</span>
<span class="c1">// a byte stream</span>
<span class="nx">resource</span><span class="p">.</span><span class="nx">stream</span><span class="p">()</span>
<span class="c1">// if this file is tabular this will give me a row stream (iterator)</span>
<span class="nx">resource</span><span class="p">.</span><span class="nx">rows</span><span class="p">()</span>
</code></pre></div></div>
<p>For packages</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// path or url to a directory (or datapackage.json)</span>
<span class="kd">const</span> <span class="kr">package</span> <span class="o">=</span> <span class="nx">data</span><span class="p">.</span><span class="nx">open</span><span class="p">(</span><span class="nx">pathOrUrl</span><span class="p">)</span>
<span class="c1">// list of files</span>
<span class="kr">package</span><span class="p">.</span><span class="nx">resources</span>
<span class="c1">// readme (if README.md exists or there is a description in the metadata)</span>
<span class="kr">package</span><span class="p">.</span><span class="nx">readme</span>
<span class="c1">// any metadata (either inferred or from datapackage.json)</span>
<span class="kr">package</span><span class="p">.</span><span class="nx">descriptor</span>
</code></pre></div></div>
<h1 id="appendix-connection-with-frictionless-data">Appendix: Connection with Frictionless Data</h1>
<p>I’ve distilled this pattern out of the work of myself and others who have worked on <a href="https://frictionlessdata.io/">Frictionless Data</a> specs and tooling.</p>
<p>It is motivated by the following observations about the Data Package suite of libraries and their Table Schema, Data Resource and Data Package interfaces:</p>
<ol>
<li>These libraries contain functions and metadata that standardize operations common to almost all data wrangling tools, because almost all data wrangling tools need to handle files/streams and datasets – and the core metadata is designed around describing files and datasets, or inferring and validating that.</li>
<li>BUT: by presenting the underlying metadata such as Data Resource, Data Package front and centre and hiding the common operations (e.g. open this file) they make a rather <strong>unnatural</strong> interface for data wranglers.</li>
<li>Most data wranglers start from an immediate need: display this csv on the command line, convert this excel file to csv etc. At the simplest, most data wrangling tools need some function like <code class="language-plaintext highlighter-rouge">open(file) => file-like object</code> where the file-like object can be used for other tasks</li>
</ol>
<blockquote>
<p><strong>Metaphorically: the current data package libraries put the skeleton (the metadata) “on the outside” and the “flesh” (the actual methods wranglers want to use) on the “inside” (they are implicit or hidden within the overall library)</strong></p>
</blockquote>
<p>What follows from this insight is that we should invert this:</p>
<ul>
<li>“Put the flesh on the outside”: Create a simple interface that addresses the common needs of data wranglers and data wrangler tooling e.g. <code class="language-plaintext highlighter-rouge">open(file)</code></li>
<li>Put the bones on the inside: leverage the Frictionless Data metadata structures but put them on the inside, out of sight but still available if needed.</li>
</ul>
<p><em>Note: it may be appropriate to continue to have a dedicated Data Package or Table Schema library but keep it <strong>really</strong> simple</em></p>
<p>Here’s how I put this in the original issue https://github.com/frictionlessdata/tableschema-js/issues/78:</p>
<blockquote>
<p><strong>People don’t care about Data Packages / Resources, they care about opening a data file and doing something with it</strong></p>
<blockquote>
<p><em>Data Packages / Resources come up because they are a nicely agreed metadata structure for all the stuff that comes up in the <strong>background</strong> when you do that.</em></p>
</blockquote>
<p>Put crudely: Most people are doing stuff with a file (or dataset), and they want to grab it and read it preferably in a structured way e.g. as a row iterator – sometimes inferring or specifying stuff along the way e.g. encoding, formatting, field types.</p>
<p>=> <strong>Our job is to help users to open that file (or dataset) and stream it as quickly as possible.</strong></p>
</blockquote>
<h3 id="recommendations-for-frictionless-data-community">Recommendations for Frictionless Data community</h3>
<p>Suggestions:</p>
<ul>
<li>Users want to do stuff with data fast. This implies that a library like tabulator is more immediately appropriate to end users than data-package or table-schema</li>
<li>The current set of FD libraries is bewildering and confusing, especially for new users. There are several complementary libraries, and some of the API is pretty confusing (see appendix for more on this)</li>
</ul>
<p>Recommendations:</p>
<ul>
<li>Have a primary “gateway” library oriented around reading and writing data and datasets.</li>
<li>This can be based around a simplified Package and Resource interface and library
<ul>
<li>Move auxiliary functionality to separate libraries, e.g. infer</li>
<li>Move parsers / loaders (and writers) to a plugin model so the list can be extended easily</li>
</ul>
</li>
<li>Consider renaming Package and Resource to Dataset and File in the simple library as these are more accessible and common terms</li>
</ul>
<h3 id="why-do-it">Why do it?</h3>
<ul>
<li>Massively grow the potential audience: Create an interface non-DP fanatics can use and want to use (and DP ones too)</li>
<li>Ease of use: easier for us and others to use</li>
<li>Elegance: do it right - this is the elegant, functional, beautiful way to do this library</li>
</ul>
<h3 id="relation-to-data-packages">Relation to Data Packages</h3>
<ul>
<li>We use Data Package and Table Schema as the metadata model for data files and datasets</li>
<li>Data Package libraries already implement APIs a bit like this and support many features we want (e.g. infer)</li>
</ul>
Rufus Pollock
Creating and Using Data Packages in R
2018-02-14T00:00:00+00:00
http://okfnlabs.org/blog/2018/02/14/datapackages-in-r
<p><a href="http://okfn.gr/">Open Knowledge Greece</a> was one of 2017’s <a href="https://toolfund.frictionlessdata.io">Frictionless Data Tool Fund</a> grantees tasked with extending implementation of core Frictionless Data libraries in R programming language. You can read more about this in <a href="https://frictionlessdata.io/articles/open-knowledge-greece/">their grantee profile</a>.</p>
<p>In this post, <a href="https://twitter.com/Kleanthis_k10">Kleanthis Koupidis</a>, a Data Scientist and Statistician at Open Knowledge Greece, explains how to <a href="#creating-data-packages-in-r">create</a> and <a href="#using-data-packages-in-r">use</a> Data Packages in R.</p>
<hr />
<h1 id="creating-data-packages-in-r">Creating Data Packages in R</h1>
<p>This section of the tutorial will show you how to install the R library for working with Data Packages and Table Schema, load a CSV file, infer its schema, and write a Tabular Data Package.</p>
<h2 id="setup">Setup</h2>
<p>For this tutorial, we will need the Data Package R library (<a href="https://github.com/frictionlessdata/datapackage-r">datapackage.r</a>).</p>
<p><a href="https://cran.r-project.org/package=devtools">devtools library</a> is required to install the <code class="language-plaintext highlighter-rouge">datapackage.r</code> library from github.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # Install devtools package if not already
install.packages("devtools")
</code></pre></div></div>
<p>And then install the development version of <a href="https://github.com/frictionlessdata/datapackage-r">datapackage.r</a> from github.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> devtools::install_github("frictionlessdata/datapackage-r")
</code></pre></div></div>
<h2 id="load">Load</h2>
<p>You can start using the library by loading <code class="language-plaintext highlighter-rouge">datapackage.r</code>.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">datapackage.r</span><span class="p">)</span></code></pre></figure>
<p>You can add useful metadata by adding keys to the package descriptor. Below, we are adding the required <code class="language-plaintext highlighter-rouge">name</code> key as well as a human-readable <code class="language-plaintext highlighter-rouge">title</code> key. For the keys supported, please consult the full <a href="https://frictionlessdata.io/specs/data-package/">Data Package spec</a>. Note, we will be creating the required <code class="language-plaintext highlighter-rouge">resources</code> key further down below.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">dataPackage</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Package.load</span><span class="p">()</span><span class="w">
</span><span class="n">dataPackage</span><span class="o">$</span><span class="n">descriptor</span><span class="p">[</span><span class="s1">'name'</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'period-table'</span><span class="w">
</span><span class="n">dataPackage</span><span class="o">$</span><span class="n">descriptor</span><span class="p">[</span><span class="s1">'title'</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Periodic Table'</span><span class="w">
</span><span class="c1"># commit the changes to Package class</span><span class="w">
</span><span class="n">dataPackage</span><span class="o">$</span><span class="n">commit</span><span class="p">()</span><span class="w">
</span><span class="c1">## [1] TRUE</span></code></pre></figure>
<h2 id="infer-a-csv-schema">Infer a CSV Schema</h2>
<p>We will use periodic-table data from a <a href="https://raw.githubusercontent.com/okgreece/datapackage-r/master/vignettes/example%20data/data.csv">remote path</a>:</p>
<table class="table table-striped table-bordered" style="display: table; overflow:auto">
<thead>
<tr>
<th>atomic.number</th>
<th>symbol</th>
<th>name</th>
<th>atomic.mass</th>
<th>metal.or.nonmetal.</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>H</td>
<td>Hydrogen</td>
<td>1.00794</td>
<td>nonmetal</td>
</tr>
<tr>
<td>2</td>
<td>He</td>
<td>Helium</td>
<td>4.002602</td>
<td>noble gas</td>
</tr>
<tr>
<td>3</td>
<td>Li</td>
<td>Lithium</td>
<td>6.941</td>
<td>alkali metal</td>
</tr>
<tr>
<td>4</td>
<td>Be</td>
<td>Beryllium</td>
<td>9.012182</td>
<td>alkaline earth metal</td>
</tr>
<tr>
<td>5</td>
<td>B</td>
<td>Boron</td>
<td>10.811</td>
<td>metalloid</td>
</tr>
<tr>
<td>6</td>
<td>C</td>
<td>Carbon</td>
<td>12.0107</td>
<td>nonmetal</td>
</tr>
<tr>
<td>7</td>
<td>N</td>
<td>Nitrogen</td>
<td>14.0067</td>
<td>nonmetal</td>
</tr>
<tr>
<td>8</td>
<td>O</td>
<td>Oxygen</td>
<td>15.9994</td>
<td>nonmetal</td>
</tr>
<tr>
<td>9</td>
<td>F</td>
<td>Fluorine</td>
<td>18.9984032</td>
<td>halogen</td>
</tr>
<tr>
<td>10</td>
<td>Ne</td>
<td>Neon</td>
<td>20.1797</td>
<td>noble gas</td>
</tr>
</tbody>
</table>
<p>We can guess our CSV’s <a href="https://frictionlessdata.io/guides/table-schema/">schema</a> by using <code class="language-plaintext highlighter-rouge">infer</code> from the Table Schema library. We pass the remote link directly to the infer function, and the result is an inferred schema. For example, if the processor detects only integers in a given column, it will assign <code class="language-plaintext highlighter-rouge">integer</code> as the column type.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">filepath</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'https://raw.githubusercontent.com/okgreece/datapackage-r/master/vignettes/example_data/data.csv'</span><span class="w">
</span><span class="n">schema</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tableschema.r</span><span class="o">::</span><span class="n">infer</span><span class="p">(</span><span class="n">filepath</span><span class="p">)</span></code></pre></figure>
<p>Once we have a schema, we are ready to add a <code class="language-plaintext highlighter-rouge">resources</code> key to the Data Package which points to the resource path and its newly created schema. Below we define resources in three ways: using JSON text, using a plain R list object with the usual assignment operator, and directly using the <code class="language-plaintext highlighter-rouge">addResource</code> function of the <code class="language-plaintext highlighter-rouge">Package</code> class:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="c1"># define resources using json text</span><span class="w">
</span><span class="n">resources</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">helpers.from.json.to.list</span><span class="p">(</span><span class="w">
</span><span class="s1">'[{
"name": "data",
"path": "filepath",
"schema": "schema"
}]'</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">resources</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="o">$</span><span class="n">schema</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">schema</span><span class="w">
</span><span class="n">resources</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="o">$</span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">filepath</span><span class="w">
</span><span class="c1"># or define resources using list object</span><span class="w">
</span><span class="n">resources</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="w">
</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"data"</span><span class="p">,</span><span class="w">
</span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">filepath</span><span class="p">,</span><span class="w">
</span><span class="n">schema</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">schema</span><span class="w">
</span><span class="p">))</span></code></pre></figure>
<p>And now, add resources to the Data Package:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">dataPackage</span><span class="o">$</span><span class="n">descriptor</span><span class="p">[[</span><span class="s1">'resources'</span><span class="p">]]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">resources</span><span class="w">
</span><span class="n">dataPackage</span><span class="o">$</span><span class="n">commit</span><span class="p">()</span><span class="w">
</span><span class="c1">## [1] TRUE</span></code></pre></figure>
<p>Or you can directly add resources using the <code class="language-plaintext highlighter-rouge">addResource</code> function of the <code class="language-plaintext highlighter-rouge">Package</code> class:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">resources</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="w">
</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"data"</span><span class="p">,</span><span class="w">
</span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">filepath</span><span class="p">,</span><span class="w">
</span><span class="n">schema</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">schema</span><span class="w">
</span><span class="p">))</span><span class="w">
</span><span class="n">dataPackage</span><span class="o">$</span><span class="n">addResource</span><span class="p">(</span><span class="n">resources</span><span class="p">)</span></code></pre></figure>
<p>Now we are ready to write our <code class="language-plaintext highlighter-rouge">datapackage.json</code> file to the current working directory.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">dataPackage</span><span class="o">$</span><span class="n">save</span><span class="p">(</span><span class="s1">'example_data'</span><span class="p">)</span></code></pre></figure>
<p>The <code class="language-plaintext highlighter-rouge">datapackage.json</code> (<a href="https://raw.githubusercontent.com/okgreece/datapackage-r/master/vignettes/example_data/package.json">download</a>) is inlined below. Note that atomic number has been correctly inferred as an <code class="language-plaintext highlighter-rouge">integer</code> and atomic mass as a <code class="language-plaintext highlighter-rouge">number</code> (float) while every other column is a <code class="language-plaintext highlighter-rouge">string</code>.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">jsonlite</span><span class="o">::</span><span class="n">prettify</span><span class="p">(</span><span class="n">helpers.from.list.to.json</span><span class="p">(</span><span class="n">dataPackage</span><span class="o">$</span><span class="n">descriptor</span><span class="p">))</span><span class="w">
</span><span class="c1">## {</span><span class="w">
</span><span class="c1">## "profile": "data-package",</span><span class="w">
</span><span class="c1">## "name": "period-table",</span><span class="w">
</span><span class="c1">## "title": "Periodic Table",</span><span class="w">
</span><span class="c1">## "resources": [</span><span class="w">
</span><span class="c1">## {</span><span class="w">
</span><span class="c1">## "name": "data",</span><span class="w">
</span><span class="c1">## "path": "https://raw.githubusercontent.com/okgreece/datapackage-r/master/vignettes/example_data/data.csv",</span><span class="w">
</span><span class="c1">## "schema": {</span><span class="w">
</span><span class="c1">## "fields": [</span><span class="w">
</span><span class="c1">## {</span><span class="w">
</span><span class="c1">## "name": "atomic number",</span><span class="w">
</span><span class="c1">## "type": "integer",</span><span class="w">
</span><span class="c1">## "format": "default"</span><span class="w">
</span><span class="c1">## },</span><span class="w">
</span><span class="c1">## {</span><span class="w">
</span><span class="c1">## "name": "symbol",</span><span class="w">
</span><span class="c1">## "type": "string",</span><span class="w">
</span><span class="c1">## "format": "default"</span><span class="w">
</span><span class="c1">## },</span><span class="w">
</span><span class="c1">## {</span><span class="w">
</span><span class="c1">## "name": "name",</span><span class="w">
</span><span class="c1">## "type": "string",</span><span class="w">
</span><span class="c1">## "format": "default"</span><span class="w">
</span><span class="c1">## },</span><span class="w">
</span><span class="c1">## {</span><span class="w">
</span><span class="c1">## "name": "atomic mass",</span><span class="w">
</span><span class="c1">## "type": "number",</span><span class="w">
</span><span class="c1">## "format": "default"</span><span class="w">
</span><span class="c1">## },</span><span class="w">
</span><span class="c1">## {</span><span class="w">
</span><span class="c1">## "name": "metal or nonmetal?",</span><span class="w">
</span><span class="c1">## "type": "string",</span><span class="w">
</span><span class="c1">## "format": "default"</span><span class="w">
</span><span class="c1">## }</span><span class="w">
</span><span class="c1">## ],</span><span class="w">
</span><span class="c1">## "missingValues": [</span><span class="w">
</span><span class="c1">## ""</span><span class="w">
</span><span class="c1">## ]</span><span class="w">
</span><span class="c1">## },</span><span class="w">
</span><span class="c1">## "profile": "data-resource",</span><span class="w">
</span><span class="c1">## "encoding": "utf-8"</span><span class="w">
</span><span class="c1">## }</span><span class="w">
</span><span class="c1">## ]</span><span class="w">
</span><span class="c1">## }</span><span class="w">
</span><span class="c1">##</span></code></pre></figure>
<h2 id="publishing">Publishing</h2>
<p>Now that you have created your Data Package, you might want to <a href="https://frictionlessdata.io/guides/publish-online/">publish your data online</a> so that you can share it with others.</p>
<hr />
<h1 id="using-data-packages-in-r">Using Data Packages in R</h1>
<p>This section of the tutorial will show you how to install the R libraries for working with Tabular Data Packages and demonstrate a very simple example of loading a Tabular Data Package from the web, pushing it directly into a local SQL database, and sending queries to retrieve results.</p>
<h2 id="setup-1">Setup</h2>
<p>For this tutorial, we will need the Data Package R library (<a href="https://github.com/frictionlessdata/datapackage-r">datapackage.r</a>). <a href="https://cran.r-project.org/package=devtools">Devtools library</a> is also required to install the datapackage.r library from github.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Install devtools package if not already
install.packages("devtools")
</code></pre></div></div>
<p>And then install the development version of <a href="https://github.com/frictionlessdata/datapackage-r">datapackage.r</a> from github.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>devtools::install_github("frictionlessdata/datapackage.r")
</code></pre></div></div>
<h2 id="load-1">Load</h2>
<p>You can start using the library by loading <code class="language-plaintext highlighter-rouge">datapackage.r</code>.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">datapackage.r</span><span class="p">)</span></code></pre></figure>
<h2 id="reading-basic-metadata">Reading Basic Metadata</h2>
<p>In this case, we are using an example Tabular Data Package containing the periodic table stored on <a href="https://github.com/frictionlessdata/example-data-packages/tree/master/periodic-table">GitHub</a> (<a href="https://raw.githubusercontent.com/frictionlessdata/example-data-packages/master/periodic-table/datapackage.json">datapackage.json</a>, <a href="https://raw.githubusercontent.com/frictionlessdata/example-data-packages/master/periodic-table/data.csv">data.csv</a>). This dataset includes the atomic number, symbol, element name, atomic mass, and the metallicity of the element. Here are the first five rows:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">url</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'https://raw.githubusercontent.com/okgreece/datapackage-r/master/vignettes/example_data/data.csv'</span><span class="w">
</span><span class="n">pt_data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">read.csv2</span><span class="p">(</span><span class="n">url</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">','</span><span class="p">)</span><span class="w">
</span><span class="n">knitr</span><span class="o">::</span><span class="n">kable</span><span class="p">(</span><span class="n">head</span><span class="p">(</span><span class="n">pt_data</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="p">),</span><span class="w"> </span><span class="n">align</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'c'</span><span class="p">)</span></code></pre></figure>
<table class="table table-striped table-bordered" style="display: table; overflow:auto">
<thead>
<tr>
<th>atomic.number</th>
<th>symbol</th>
<th>name</th>
<th>atomic.mass</th>
<th>metal.or.nonmetal.</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>H</td>
<td>Hydrogen</td>
<td>1.00794</td>
<td>nonmetal</td>
</tr>
<tr>
<td>2</td>
<td>He</td>
<td>Helium</td>
<td>4.002602</td>
<td>noble gas</td>
</tr>
<tr>
<td>3</td>
<td>Li</td>
<td>Lithium</td>
<td>6.941</td>
<td>alkali metal</td>
</tr>
<tr>
<td>4</td>
<td>Be</td>
<td>Beryllium</td>
<td>9.012182</td>
<td>alkaline earth metal</td>
</tr>
<tr>
<td>5</td>
<td>B</td>
<td>Boron</td>
<td>10.811</td>
<td>metalloid</td>
</tr>
</tbody>
</table>
<p>Data Packages can be loaded either from a local path or directly from the web.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">url</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'https://raw.githubusercontent.com/okgreece/datapackage-r/master/vignettes/example_data/package.json'</span><span class="w">
</span><span class="n">datapackage</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Package.load</span><span class="p">(</span><span class="n">url</span><span class="p">)</span><span class="w">
</span><span class="n">datapackage</span><span class="o">$</span><span class="n">resources</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="o">$</span><span class="n">descriptor</span><span class="o">$</span><span class="n">profile</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'tabular-data-resource'</span><span class="w"> </span><span class="c1"># tabular resource descriptor profile</span><span class="w">
</span><span class="n">datapackage</span><span class="o">$</span><span class="n">resources</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="o">$</span><span class="n">commit</span><span class="p">()</span><span class="w"> </span><span class="c1"># commit changes</span><span class="w">
</span><span class="c1">## [1] TRUE</span></code></pre></figure>
<p>At the most basic level, Data Packages provide a standardized format for general metadata (for example, the dataset title, source, author, and/or description) about your dataset. Now that you have loaded this Data Package, you have access to this metadata via the <code class="language-plaintext highlighter-rouge">descriptor</code> attribute. Note that these fields are optional and may not be specified for all Data Packages. For more information on which fields are supported, see <a href="https://frictionlessdata.io/specs/data-package/">the full Data Package standard</a>.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">datapackage</span><span class="o">$</span><span class="n">descriptor</span><span class="o">$</span><span class="n">title</span><span class="w">
</span><span class="c1">## [1] "Periodic Table"</span></code></pre></figure>
<h2 id="reading-data">Reading Data</h2>
<p>Now that you have loaded your Data Package, you can read its data. A Data Package can contain multiple files which are accessible via the <code class="language-plaintext highlighter-rouge">resources</code> attribute. The <code class="language-plaintext highlighter-rouge">resources</code> attribute is an array of objects containing information (e.g. path, schema, description) about each file in the package.</p>
<p>You can access the data in a given resource in the <code class="language-plaintext highlighter-rouge">resources</code> array via its <code class="language-plaintext highlighter-rouge">table</code> attribute, whose <code class="language-plaintext highlighter-rouge">read()</code> method returns the data.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">table</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">datapackage</span><span class="o">$</span><span class="n">resources</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="o">$</span><span class="n">table</span><span class="w">
</span><span class="n">periodic_table_data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">table</span><span class="o">$</span><span class="n">read</span><span class="p">()</span></code></pre></figure>
<p>You can further manipulate list objects in R by using the <a href="https://cran.r-project.org/package=purrr">purrr</a> and <a href="https://cran.r-project.org/package=rlist">rlist</a> packages.</p>
<h2 id="loading-into-an-sql-database">Loading into an SQL database</h2>
<p><a href="https://frictionlessdata.io/specs/tabular-data-package/">Tabular Data Packages</a> contains schema information about its data using <a href="https://frictionlessdata.io/guides/table-schema/">Table Schema</a>. This means you can easily import your Data Package into the SQL backend of your choice. In this case, we are creating an <a href="http://sqlite.org/">SQLite</a> database.</p>
<p>To create a new SQLite database and load the data into SQL we will need <a href="https://cran.r-project.org/package=DBI">DBI</a> package and <a href="https://cran.r-project.org/package=RSQLite">RSQLite</a> package, which contains <a href="https://www.sqlite.org/">SQLite</a> (no external software is needed).</p>
<p>You can install and load them by using:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">install.packages</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"DBI"</span><span class="p">,</span><span class="s2">"RSQLite"</span><span class="p">))</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">DBI</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">RSQLite</span><span class="p">)</span></code></pre></figure>
<p>To create a new SQLite database, you simply supply the filename to <code class="language-plaintext highlighter-rouge">dbConnect()</code>:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">dp.database</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="n">RSQLite</span><span class="o">::</span><span class="n">SQLite</span><span class="p">(),</span><span class="w"> </span><span class="s2">""</span><span class="p">)</span><span class="w"> </span><span class="c1"># temporary database</span></code></pre></figure>
<p>We will use <a href="https://cran.r-project.org/package=RSQLite">data.table</a> package to convert the list object with the data to a data frame object to copy them to database table.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="c1"># install data.table package if not already</span><span class="w">
</span><span class="c1"># install.packages("data.table")</span><span class="w">
</span><span class="n">periodic_table_sql</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data.table</span><span class="o">::</span><span class="n">rbindlist</span><span class="p">(</span><span class="n">periodic_table_data</span><span class="p">)</span><span class="w">
</span><span class="n">periodic_table_sql</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">setNames</span><span class="p">(</span><span class="n">periodic_table_sql</span><span class="p">,</span><span class="n">unlist</span><span class="p">(</span><span class="n">datapackage</span><span class="o">$</span><span class="n">resources</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="o">$</span><span class="n">headers</span><span class="p">))</span></code></pre></figure>
<p>You can easily copy an R data frame into a SQLite database with <code class="language-plaintext highlighter-rouge">dbWriteTable()</code>:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">dbWriteTable</span><span class="p">(</span><span class="n">dp.database</span><span class="p">,</span><span class="w"> </span><span class="s2">"periodic_table_sql"</span><span class="p">,</span><span class="w"> </span><span class="n">periodic_table_sql</span><span class="p">)</span><span class="w">
</span><span class="c1"># show remote tables accessible through this connection</span><span class="w">
</span><span class="n">dbListTables</span><span class="p">(</span><span class="n">dp.database</span><span class="p">)</span><span class="w">
</span><span class="c1">## [1] "periodic_table_sql"</span></code></pre></figure>
<p>The data are now in the database.</p>
<p>We can now issue queries against the database. For example, to return the first 5 elements:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">dbGetQuery</span><span class="p">(</span><span class="n">dp.database</span><span class="p">,</span><span class="w"> </span><span class="s1">'SELECT * FROM periodic_table_sql LIMIT 5'</span><span class="p">)</span><span class="w">
</span><span class="c1">## atomic number symbol name atomic mass metal or nonmetal?</span><span class="w">
</span><span class="c1">## 1 1 H Hydrogen 1.007940 nonmetal</span><span class="w">
</span><span class="c1">## 2 2 He Helium 4.002602 noble gas</span><span class="w">
</span><span class="c1">## 3 3 Li Lithium 6.941000 alkali metal</span><span class="w">
</span><span class="c1">## 4 4 Be Beryllium 9.012182 alkaline earth metal</span><span class="w">
</span><span class="c1">## 5 5 B Boron 10.811000 metalloid</span></code></pre></figure>
<p>Or return all elements with an atomic number of less than 10:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">dbGetQuery</span><span class="p">(</span><span class="n">dp.database</span><span class="p">,</span><span class="w"> </span><span class="s1">'SELECT * FROM periodic_table_sql WHERE "atomic number" < 10'</span><span class="p">)</span><span class="w">
</span><span class="c1">## atomic number symbol name atomic mass metal or nonmetal?</span><span class="w">
</span><span class="c1">## 1 1 H Hydrogen 1.007940 nonmetal</span><span class="w">
</span><span class="c1">## 2 2 He Helium 4.002602 noble gas</span><span class="w">
</span><span class="c1">## 3 3 Li Lithium 6.941000 alkali metal</span><span class="w">
</span><span class="c1">## 4 4 Be Beryllium 9.012182 alkaline earth metal</span><span class="w">
</span><span class="c1">## 5 5 B Boron 10.811000 metalloid</span><span class="w">
</span><span class="c1">## 6 6 C Carbon 12.010700 nonmetal</span><span class="w">
</span><span class="c1">## 7 7 N Nitrogen 14.006700 nonmetal</span><span class="w">
</span><span class="c1">## 8 8 O Oxygen 15.999400 nonmetal</span><span class="w">
</span><span class="c1">## 9 9 F Fluorine 18.998403 halogen</span></code></pre></figure>
<p>You can find more about using databases and SQLite in R in the vignettes of the <a href="https://cran.r-project.org/package=DBI">DBI</a> and <a href="https://cran.r-project.org/package=RSQLite">RSQLite</a> packages.</p>
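<p>For readers who work in Python rather than R, the same load-into-SQLite flow can be sketched with the <a href="https://github.com/frictionlessdata/datapackage-py">datapackage</a> library and the standard <code class="language-plaintext highlighter-rouge">sqlite3</code> module. This is a minimal sketch, not part of the original R tutorial; it assumes the datapackage-py API (<code class="language-plaintext highlighter-rouge">Package</code>, <code class="language-plaintext highlighter-rouge">Resource.read</code>), and the descriptor URL is illustrative:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import sqlite3
from decimal import Decimal

from datapackage import Package

# The descriptor URL is illustrative; any Tabular Data Package works the same way.
DESCRIPTOR = ('https://raw.githubusercontent.com/frictionlessdata/'
              'example-data-packages/master/periodic-table/datapackage.json')

package = Package(DESCRIPTOR)
resource = package.resources[0]
headers = resource.schema.field_names   # column names from the Table Schema
rows = resource.read(keyed=True)        # list of dicts, one row per element


def adapt(value):
    # sqlite3 cannot bind the Decimal values produced by Table Schema casting
    return float(value) if isinstance(value, Decimal) else value


conn = sqlite3.connect(':memory:')      # temporary database, as in the R example
cols = ', '.join('"%s"' % h for h in headers)
marks = ', '.join('?' for _ in headers)
conn.execute('CREATE TABLE periodic_table (%s)' % cols)
conn.executemany('INSERT INTO periodic_table VALUES (%s)' % marks,
                 [tuple(adapt(row[h]) for h in headers) for row in rows])

print(conn.execute(
    'SELECT * FROM periodic_table WHERE "atomic number" < 10').fetchall())
</code></pre></div></div>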
<p>We welcome your feedback and questions via our <a href="http://gitter.im/frictionlessdata/chat">Frictionless Data Gitter chat</a> or via <a href="https://github.com/frictionlessdata/datapackage-r/issues">Github issues</a> on the datapackage-r repository.</p>
Open Knowledge Greece
Working with Data Package Creator
2018-02-05T00:00:00+00:00
http://okfnlabs.org/blog/2018/02/05/data-package-creator
<p><em>The Data Package Creator, <a href="https://create.frictionlessdata.io">create.frictionlessdata.io</a>, is a revamp of the Data Packagist app that lets you create, edit and validate your data packages with ease. Read on and find out how.</em></p>
<hr />
<p><a href="https://frictionlessdata.io">Frictionless Data</a> aims to make it effortless to transport high quality data among different tools and platforms for further analysis. At the heart of this work is the <a href="https://frictionlessdata.io/specs/data-package/">Data Package</a>, a simple format that makes it possible to package a collection of data and attach contextual information to it before sharing it. Where tabular data is involved, the ensuing <a href="https://frictionlessdata.io/specs/tabular-data-package/">Tabular Data Package</a> contains the dataset, its schema and descriptive metadata associated with the dataset collated in a JSON file.</p>
<p>The basic building block of a Data Package is its <code class="language-plaintext highlighter-rouge">datapackage.json</code> file. The Frictionless Data team and community have developed libraries and continue to actively support users who wish to create and work with Data Packages in <a href="https://github.com/frictionlessdata/datapackage-js">Javascript</a>, <a href="https://github.com/frictionlessdata/datapackage-py">Python</a>, <a href="https://github.com/frictionlessdata/datapackage-rb">Ruby</a>, <a href="https://github.com/frictionlessdata/datapackage-r">R</a>, <a href="https://github.com/frictionlessdata/datapackage-php">PHP</a>, <a href="https://github.com/frictionlessdata/datapackage-java">Java</a>, <a href="https://github.com/frictionlessdata/datapackage-go">Go</a>, <a href="https://github.com/frictionlessdata/datapackage-clj">Clojure</a> and <a href="https://github.com/frictionlessdata/datapackage-jl">Julia</a>. Up until now, the <a href="http://datapackagist.openknowledge.io">Data Packagist app</a>, which was developed as an Open Knowledge Labs initiative, has also been a helpful resource to help people create Data Packages quickly and with relative ease.</p>
<p>At Open Knowledge International and as part of the Frictionless Data project, we are constantly thinking about streamlining processes and making it easier for users to adopt the software we develop for use in their data work. New improvements to the Data Package specification as part of the <a href="https://blog.okfn.org/2017/09/05/frictionless-data-v1-0/">September 2017 update</a> have also led our team to carry out subsequent iterations on the original Data Packagist app. The outcome of this work is the <a href="https://create.frictionlessdata.io">Data Package Creator</a>, which boasts a revamped user interface and additional functionality to streamline the data package creation process.</p>
<p><a href="https://create.frictionlessdata.io">Data Package Creator</a> is an online service that lets users generate tabular data packages from their datasets (<a href="https://frictionlessdata.io/specs/tabular-data-package/">more on the Tabular Data Package specification</a>). Let’s see how it works.</p>
<p>As mentioned earlier, a data package contains a collection of data. Each unique data file is referred to as a <a href="https://frictionlessdata.io/specs/data-resource/">data resource</a>.</p>
<p>You can add as many resources as your data collection contains, either by linking to them, uploading them from your local machine or creating them from scratch and specifying their fields. You can also edit each resource (rename it, add and remove fields, etc.) within the <a href="https://create.frictionlessdata.io">Data Package Creator</a>.</p>
<p>For our example, I am looking to package <a href="https://frictionlessdata.io/specs/data-resource/">data resources</a> that contain information on three cities I am interested in: Paris, Rome and London.</p>
<ul>
<li>The first resource is <code class="language-plaintext highlighter-rouge">location.csv</code> which contains city names and their coordinates. I will load this file from my local machine. Here’s what the data in the <code class="language-plaintext highlighter-rouge">location.csv</code> file looks like.</li>
</ul>
<pre><code class="language-csv">city,location
london,"51.50,-0.11"
paris,"48.85,2.30"
rome,N/A
</code></pre>
<ul>
<li>The second resource is <code class="language-plaintext highlighter-rouge">data.csv</code> which contains population information on the three cities. I will load this tabular data resource <a href="https://github.com/frictionlessdata/datapackage-py/blob/master/data/data.csv">from a GitHub repository</a>.</li>
</ul>
<pre><code class="language-csv">city,population
london,8787892
paris,2244000
rome,2877215
</code></pre>
<ul>
<li>The third resource is one that doesn’t exist yet and which I will create and add fields to in the <a href="https://create.frictionlessdata.io">Data Package Creator</a>. I’ll call it <code class="language-plaintext highlighter-rouge">rome.csv</code>. Once I download the data package, I will add this resource to the data package before sharing it elsewhere.</li>
</ul>
<pre><code class="language-csv">city,location
rome,"41.89,12.51"
</code></pre>
<p>The <code class="language-plaintext highlighter-rouge">datapackage.json</code> file is updated every time a resource is added, edited or removed. This JSON file can be viewed on the right hand side of the Data Package Creator by clicking on the <code class="language-plaintext highlighter-rouge">{···}</code> symbol to expand the section.</p>
<p><img src="/img/posts/datapackagecreator.png" alt="screengrab of data package creator" />
<em>screen grab of the new Data Package Creator</em></p>
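<p>As a rough sketch (the resource names and the remote path here are illustrative rather than the app’s exact output), the generated descriptor for the three resources above might look like:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "name": "three-cities",
  "resources": [
    {"name": "location", "path": "location.csv"},
    {"name": "population", "path": "https://github.com/frictionlessdata/datapackage-py/raw/master/data/data.csv"},
    {"name": "rome", "path": "rome.csv"}
  ]
}
</code></pre></div></div>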
<p>Metadata attached to any Data Package is also stored in the <code class="language-plaintext highlighter-rouge">datapackage.json</code> file. However, editing JSON files directly can be a laborious and error-prone task. The MetaData section on the left side makes it easy to write and edit descriptive metadata that will be included in your Data Package alongside your data.</p>
<p>The Profile Section allows you to specify what kind of Data Package you are going for. There are 3 options:</p>
<ul>
<li><a href="https://frictionlessdata.io/specs/data-package/">Data Package</a>: This can contain a collection of any type of data resource and a JSON file.</li>
<li><a href="https://frictionlessdata.io/specs/tabular-data-package/">Tabular Data Package</a>: This collection must contain tabular data. It is possible to load data in any machine readable format - csv, tsv, xls, etc and a JSON file</li>
<li><a href="https://frictionlessdata.io/specs/fiscal-data-package/">Fiscal Data Package</a>: This is a subset of the tabular data package, specifically designed for use with budget and fiscal data (a sketch of how the chosen profile is recorded in the descriptor follows this list).</li>
</ul>
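<p>Under the hood, the chosen profile is recorded in the descriptor’s <code class="language-plaintext highlighter-rouge">profile</code> property. A minimal sketch (resource names and paths are illustrative):</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "name": "three-cities",
  "profile": "tabular-data-package",
  "resources": [
    {
      "name": "population",
      "path": "data/population.csv",
      "profile": "tabular-data-resource"
    }
  ]
}
</code></pre></div></div>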
<p>The keyword section also allows you to add up to 3 tags to your Data Package to make it more discoverable.</p>
<p>Before downloading your data package, click on the <strong>Validate</strong> button at the bottom of the side navigation to check whether the generated schema is valid. The Validate button prompts Data Package Creator to check whether the selected profile fits the resources that constitute your data package. Should you see a warning, such as the one below, it is likely that the wrong profile is specified in the MetaData section.</p>
<p><img src="/img/posts/datapackagecreator-invalid.png" alt="screengrab of an alert for an invalid data package message on data package creator" />
<em>Error message that ensues on choosing the wrong data package profile. My data package consists of tabular data resources, so the Fiscal Data Package profile is ill-suited for it; the Tabular Data Package profile is the right fit.</em></p>
<p>Aim for the eureka message below, and in case you feel stuck, reach out and we’ll work with you to resolve the issue.</p>
<p><img src="/img/posts/datapackagecreator-valid.png" alt="screengrab of a valid data package message on data package creator" /></p>
<p>Finally, click on the download button, which gives you a local copy of the generated datapackage.json file, complete with your data schema and metadata attached to it. Score 1 for data provenance!
Next, create a folder and place your downloaded <code class="language-plaintext highlighter-rouge">datapackage.json</code> file in it. Create a new folder within it, call it <code class="language-plaintext highlighter-rouge">data</code>, and add all the data resources in your data package to it. You are now ready to share your data package.
Here’s what my final data package folder looks like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Three-Cities-Data-Package
|-- datapackage.json
|-- data
    |-- location.csv
    |-- rome.csv
</code></pre></div></div>
<p>Please note, there are cases where all you would need to share is the <code class="language-plaintext highlighter-rouge">datapackage.json</code> file, i.e. if all your resources are online and publicly accessible. For this reason, as the population data resource is publicly available and already linked in the resulting JSON file, I need not include its csv file in my final data package.</p>
<p><a href="[dpc-git]">This</a> is the code repository for <a href="https://create.frictionlessdata.io">Data Package Creator</a>. We welcome your feedback and questions via our <a href="http://gitter.im/frictionlessdata/chat">Frictionless Data Gitter chat</a> or via <a href="https://github.com/frictionlessdata/datapackage-ui/issues">Github issues</a>.</p>
<p>Happy days!</p>
Serah Rono
Interactive Data wrangling using Data Package Pipelines new UI
2018-01-10T00:00:00+00:00
http://okfnlabs.org/blog/2018/01/10/datapackage-pipelines-ui
<p><em><a href="https://github.com/frictionlessdata/datapackage-pipelines">datapackage-pipelines</a> is a framework for defining data processing steps to generate self-describing Data Packages, built on the concepts and tooling of the <a href="http://frictionlessdata.io/">Frictionless Data</a> project. You can read more about datapackage-pipelines in this <a href="http://okfnlabs.org/blog/2017/02/27/datapackage-pipelines.html">introductory post</a>.</em></p>
<hr />
<p>Data wrangling can be quite a tedious task -</p>
<ul>
<li>We download a few files from a data portal or some other source.</li>
<li>We use Excel or other applications to view the data.</li>
<li>We select columns from all these files and copy-paste data to construct the data-set that we need.</li>
<li>We filter the data so that it contains only the rows that are required.</li>
<li>We use formulas to compute data for new columns, to un-pivot the data or to verify that the data ‘makes sense’.</li>
</ul>
<p>Finally, we have the wrangled data, ready to be analysed, used in our application or published in an article.</p>
<p>The big problem with this process is that it’s not repeatable or verifiable. In many cases, the ability to show the various transformations and processes that the data underwent is crucial to establish the data’s authenticity and correctness.</p>
<p>The common solution for this is to ditch the spreadsheet programs and bring out the power tools - the programming languages. By writing a data processing program, we are able to repeatedly run the same processing sequence on the source data and consistently receive the same results. This processing code can also be presented and reviewed as a proof of the validity of the resulting data.</p>
<p>However, as anyone who has tinkered with programming knows - writing code is hard. Things that are simple to do using spreadsheet programs often require a complex mental effort to accomplish using custom code. Making sure that the code you’ve written is correct and actually produces the intended result is another major obstacle (not to mention the time it takes to learn how to program in the first place). Furthermore, even with the most readable code, it’s still hard for a third-party reviewer to verify the validity of the process - unless they’re familiar with the exact toolset and method you’ve used (which is often not the case).</p>
<p>At Open Knowledge International, we’ve tried to tackle this problem by finding a middle way. The <code class="language-plaintext highlighter-rouge">datapackage-pipelines</code> framework allows users to build processing pipelines - essentially, sequences of processing steps. Each of these steps is a reusable and flexible building-block which performs a single action. For example, you might use a ‘Load data from source’ block, a ‘Select columns’ block or ‘Sort data’ block. By combining these blocks together in a chain, one could construct powerful and simple to understand data processors.</p>
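<p>As a rough illustration of what such a chain looks like, here is a sketch of a <code class="language-plaintext highlighter-rouge">pipeline-spec.yaml</code> file. The processor names follow the standard building blocks in the datapackage-pipelines documentation, but the dataset name, URL and parameters are made up:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A sketch of a pipeline: each "run" step is one reusable building block.
my-dataset:
  pipeline:
    - run: add_metadata              # attach package-level metadata
      parameters:
        name: my-dataset
    - run: add_resource              # the 'Load data from source' block
      parameters:
        name: my-data
        url: http://example.com/my-data.csv
    - run: stream_remote_resources   # stream the rows of the remote resource
    - run: sort                      # the 'Sort data' block
      parameters:
        resources: my-data
        sort-by: "{some_column}"
    - run: dump.to_path              # write the resulting data package to disk
      parameters:
        out-path: output
</code></pre></div></div>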
<p>While providing a good solution to issues like repeatability and difficult-to-understand processes, pipelines were still difficult to develop and test. Most users building the pipelines were already proficient programmers, and getting the right result turned out to be a tricky business.</p>
<p>So, we’ve decided to tackle this problem by creating an interactive user interface for building pipelines.</p>
<p>Our approach is based on a few principles:</p>
<ul>
<li>Modularity: each step in the pipeline should be as small and simple as possible. In case of a failure, users can tell exactly which step caused the problem. Since each step is very simple, debugging also becomes a non-issue.</li>
<li>Interactivity: each decision or change the user makes is immediately reflected in the UI. If there was an error, the user can change their mind or try something else. The effects of the change are instantly visible and no long build/run/test cycles are needed.</li>
<li>Server side processing: by leveraging smart caching heuristics, the server can optimize on the required processing and further improve speed and snappiness of the user interface.</li>
</ul>
<p>In our proof-of-concept implementation, users are prompted to select a source file - either a URL for a datafile hosted anywhere, or choose a dataset from <a href="http://datahub.io">datahub.io</a>. Once selected, users can choose to remove columns or rows (to position the data table and remove filler columns), add a schema (to validate data) or filter some of the rows with specific constraints.</p>
<p>As this is a demonstration only, the list of building blocks is still limited - however, we’re planning to add more so that this product becomes more powerful and useful.</p>
<p>Check out our proof-of-concept here at <a href="http://dppui.openknowledge.io">http://dppui.openknowledge.io</a>!</p>
<p>Feel free to ask any questions / start a discussion about this in our <a href="http://gitter.im/frictionlessdata/chat">Frictionless Data Gitter chat</a>. You can also find the code repositories for this work <a href="https://github.com/frictionlessdata/datapackage-pipelines-ui-client">here</a> and <a href="https://github.com/frictionlessdata/datapackage-pipelines-ui-server">here</a>.</p>
Adam Kariv
Bootstrapping data standards with Frictionless Data
2017-12-21T00:00:00+00:00
http://okfnlabs.org/blog/2017/12/21/bootstrapping-data-standards-with-frictionless-data
<p>When it comes to tabular data, the <a href="https://frictionlessdata.io">Frictionless Data</a> specifications provide users with strong conventions for declaring both the shape of data (via schemas) and information about the data (as metadata on package and resource descriptors).</p>
<p>Within the Frictionless Data world, we purposefully refer to specification work as <em>specifications</em>, and not <em>standards</em>. The specifications therein provide clear conventions for working with data, and declare fundamental interfaces on which a modular software system that works with these specifications can be built. It is very meta. However, the specifications and software foundation <em>do</em> make the Frictionless Data ecosystem a powerful and compelling technical foundation on which to build data standards.</p>
<p>Some reasons why:</p>
<ul>
<li>Data is serialised in a format that software developers can use to build tools such as APIs, and that can also be read by many consumer programs used by data consumers with little to no technical know-how.</li>
<li>Built-in progressive enhancement, where metadata, as well as structural and schematic information about the data, can be incorporated over time without modifying the original data source.</li>
<li>A large and growing collection of tools, in many programming languages, for working with the Frictionless Data specifications.</li>
<li>The specifications and the software are platform agnostic. A major example of this is being web-friendly without being dependent on the web (as with many linked data approaches). Linkable data, not Linked Data.</li>
</ul>
<p>We’ll demonstrate this with some examples below, which are a proof of concept for the idea of using Frictionless Data as a technical foundation for data standards. This is ongoing work that we intend to iterate on in response to feedback on this initial take.</p>
<p>Of course, we do not in any way think that the technical implementation of a data standard is what “data standards” is about. Data standards are about communities of practice, stakeholder engagement, and increasingly, a vehicle of change at the level of policy and governance. Technical implementation, in this wider context, is but a small, yet crucial, component. Indeed, this is a critical part of the promise we are pointing to here - that by building on a common foundation, communities building data standards can focus a little less on the technical implementation details and a little more on the change they want to see by creating them.</p>
<h2 id="grant-funding">Grant funding</h2>
<p><a href="http://www.threesixtygiving.org/">360Giving</a> is an organization that helps funders to be transparent about the grants they award. It provides a <a href="http://standard.threesixtygiving.org/en/latest/">standard</a> for publishing grants data in a common format, and a <a href="http://www.threesixtygiving.org/data/data-registry/">Registry</a> to host the data. Publishers can upload a spreadsheet that contains various fields describing the different activities they funded. We will demonstrate how a custom Data Package profile could describe one of these spreadsheets, ensuring that the required metadata fields are present and that the contents of the file conform to the <a href="http://standard.threesixtygiving.org/en/latest/reference/#grants-sheet">schema</a>.</p>
<p>We will use this <a href="http://www.blagravetrust.org/wp-content/uploads/2017/01/360G-blagravetrust-2016.xlsx">sample dataset</a>, taken directly from the Registry without any changes:</p>
<table class="table table-striped table-bordered" style="display: block; overflow: auto;">
<thead>
<tr>
<th>Identifier</th>
<th>Title</th>
<th>Description</th>
<th>Currency</th>
<th>Amount Awarded</th>
<th>Amount Disbursed</th>
<th>Award Date</th>
</tr>
</thead>
<tbody>
<tr>
<td>360G-blagravetrust-00658000009YZRq</td>
<td>Achieving Further</td>
<td>Work with 22 FE colleges to improve attainment; attendance and participation</td>
<td>GBP</td>
<td>300000</td>
<td>300000</td>
<td>2014-07-08</td>
</tr>
<tr>
<td>360G-blagravetrust-00658000007A1UQ</td>
<td>Training on feedback for Portsmouth VCS</td>
<td>Improving feedback skills for Portsmouth VCS - Feedback Fund 2016</td>
<td>GBP</td>
<td>3933</td>
<td>3933</td>
<td>2016-08-09</td>
</tr>
<tr>
<td>360G-blagravetrust-00658000008vdAl</td>
<td>Creative learning programme</td>
<td>Portsmouth young people leaving care</td>
<td>GBP</td>
<td>75000</td>
<td>25000</td>
<td>2016-11-08</td>
</tr>
<tr>
<td>360G-blagravetrust-00658000007lweS</td>
<td>Feedback Fund</td>
<td>Feedback Fund 2016</td>
<td>GBP</td>
<td>2094</td>
<td>2094</td>
<td>2016-08-09</td>
</tr>
</tbody>
</table>
<p>Our first step was to create a Table Schema describing the expected contents of the fields, which was then <a href="https://github.com/frictionlessdata/profiles/blob/c3423d1266439ffebfdac2b681d3dd0bffd81964/assets/grants/datapackage.json#L39">embedded</a> in the Data Package descriptor. This was easy because, as we mentioned before, there is already a well-defined <a href="http://standard.threesixtygiving.org/en/latest/reference/#grants-sheet">schema</a> for the fields. For the purposes of this example we just focused on a subset of all available fields. Here are some example fields, with a sketch of the corresponding Table Schema JSON after the table:</p>
<table class="table table-striped table-bordered" style="display: block; overflow: auto;">
<thead>
<tr>
<th>Name / Title</th>
<th>Type</th>
<th>Constraints</th>
<th> </th>
</tr>
</thead>
<tbody>
<tr>
<td> </td>
<td>Identifier</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>Title</td>
<td>string</td>
<td><strong>maxLength</strong>: 140</td>
</tr>
<tr>
<td> </td>
<td>Description</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>Currency</td>
<td>string</td>
<td><strong>enum</strong>: [‘AED’, ‘AFN’, ‘ALL’, ‘AMD’, …]</td>
</tr>
<tr>
<td> </td>
<td>Amount Awarded</td>
<td>number</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>Amount Disbursed</td>
<td>number</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>Award Date</td>
<td>date</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>URL</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>…</td>
<td>…</td>
<td>…</td>
</tr>
<tr>
<td> </td>
<td>Funding Org:Name</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>Funding Org:Department</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>Grant Programme:Code</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>Grant Programme:Title</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>Grant Programme:URL</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>From an open call?</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>Related Activity</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>Last modified</td>
<td>datetime</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>Data Source</td>
<td>string</td>
<td> </td>
</tr>
</tbody>
</table>
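<p>In Table Schema terms, a fragment of the embedded schema for three of these fields might look like the following sketch (the currency <code class="language-plaintext highlighter-rouge">enum</code> is abridged here):</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "fields": [
    {"name": "Identifier", "type": "string"},
    {"name": "Title", "type": "string", "constraints": {"maxLength": 140}},
    {"name": "Currency", "type": "string", "constraints": {"enum": ["AED", "AFN", "ALL", "AMD"]}}
  ]
}
</code></pre></div></div>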
<p>Our custom <a href="https://github.com/frictionlessdata/profiles/blob/master/assets/grants/datapackage.json">Grants Data Package</a> extends the <a href="https://frictionlessdata.io/specs/data-package/">Data Package</a> specification by adding the following fields:</p>
<table class="table table-striped table-bordered" style="display: block; overflow: auto;">
<thead>
<tr>
<th>Name</th>
<th>Description</th>
<th>Type</th>
</tr>
</thead>
<tbody>
<tr>
<td>funder</td>
<td>A JSON object describing the funding organization. It can include the following properties: <code class="language-plaintext highlighter-rouge">id</code>, <code class="language-plaintext highlighter-rouge">name</code>, <code class="language-plaintext highlighter-rouge">email</code>, <code class="language-plaintext highlighter-rouge">url</code></td>
<td>object</td>
</tr>
<tr>
<td>year</td>
<td>The year that the grants data in this file covers</td>
<td>integer</td>
</tr>
<tr>
<td>modified</td>
<td>The timestamp of when this dataset was last modified</td>
<td>datetime</td>
</tr>
</tbody>
</table>
<p>This closely follows the <a href="https://threesixtygiving.github.io/getdata/">JSON specification</a> that 360Giving provides, with the rest of the fields covered by the standard Data Package specification.</p>
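<p>Putting it together, a descriptor carrying these custom properties might look like the following sketch. The values are illustrative, drawn from the sample dataset where possible; the <code class="language-plaintext highlighter-rouge">modified</code> timestamp is made up:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "name": "360g-blagravetrust-2016",
  "funder": {
    "id": "GB-CHC-1164021",
    "name": "The Blagrave Trust",
    "url": "http://www.blagravetrust.org"
  },
  "year": 2016,
  "modified": "2017-01-15T00:00:00Z",
  "resources": [
    {"name": "grants", "path": "360G-blagravetrust-2016.xlsx"}
  ]
}
</code></pre></div></div>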
<p>Once we have our data packaged in this way, we can leverage all the ecosystem of tools built around Data Packages to work with it. For instance, using the <a href="https://github.com/frictionlessdata/datapackage-py"><code class="language-plaintext highlighter-rouge">datapackage</code></a> library we can iterate over the contents of the file:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">datapackage</span>
<span class="n">datapackage_url</span> <span class="o">=</span> <span class="s">'https://raw.githubusercontent.com/frictionlessdata/profiles/master/assets/grants/datapackage.json'</span>
<span class="n">dp</span> <span class="o">=</span> <span class="n">datapackage</span><span class="p">.</span><span class="n">Package</span><span class="p">(</span><span class="n">datapackage_url</span><span class="p">)</span>
<span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">dp</span><span class="p">.</span><span class="n">resources</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="nb">iter</span><span class="p">(</span><span class="n">keyed</span><span class="o">=</span><span class="bp">True</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="n">row</span><span class="p">)</span>
<span class="c1"># {'Funding Org:Identifier': 'GB-CHC-1164021', 'Beneficiary Location:Geographic Code Type': 'UA', 'From an open call?': None, 'Beneficiary Location:Name': 'Reading', 'Grant Programme:Code': None, 'Beneficiary Location:Geographic Code': 'E06000038', 'Amount Disbursed': Decimal('300000'), 'Recipient Org:City': 'Newbury', 'Award Date': datetime.datetime(2014, 7, 8, 0, 0), 'Beneficiary Location:Longitude': Decimal('-0.95543100000000003024780426130746491253376007080078125'), 'Recipient Org:Web Address': 'http://www.afaeducation.org', 'Recipient Org:Charity Number': '1142154', 'Grant Programme:Title': None, 'Related Activity': None, 'Grant Programme:URL': None, 'Recipient Org:Country': 'UK', 'Funding Org:Name': 'The Blagrave Trust', 'Title': 'Achieving Further', 'Planned Dates:End Date': datetime.datetime(2017, 6, 30, 0, 0), 'Recipient Org:Postal Code': 'RG14 1JQ', 'Identifier': '360G-blagravetrust-00658000009YZRq', 'Data Source': None, 'Planned Dates:Start Date': None, 'Currency': 'GBP', 'Description': 'Work with 22 FE colleges to improve attainment; attendance and participation', 'Recipient Org:Identifier': 'GB-CHC-1142154', 'Recipient Org:Description': 'Charity working with nurseries schools and colleges to raise attainment and achivement of children particularly those with barriers to learning', 'Funding Org:Department': None, 'Beneficiary Location:Country Code': None, 'Last modified': None, 'URL': None, 'Amount Awarded': Decimal('300000'), 'Beneficiary Location:Latitude': Decimal('51.4541449999999969122654874809086322784423828125'), 'Recipient Org:County': 'Berkshire', 'Recipient Org:Name': 'Achievement for All', 'Recipient Org:Street Address': 'Oxford House, Oxford Street', 'Planned Dates:Duration (months)': None, 'Recipient Org:Company Number': None}
</span>
</code></pre></div></div>
<p>Also, as we define the Table Schema, we can use <a href="https://github.com/frictionlessdata/goodtables-py">goodtables</a> to perform data validation and get a report of issues found:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">goodtables</span> <span class="kn">import</span> <span class="n">validate</span>
<span class="n">datapackage_url</span> <span class="o">=</span> <span class="s">'https://raw.githubusercontent.com/frictionlessdata/profiles/master/assets/grants/datapackage.json'</span>
<span class="n">validate</span><span class="p">(</span><span class="n">datapackage_url</span><span class="p">)</span>
<span class="s">'''
{'error-count': 0,
'preset': 'datapackage',
'table-count': 1,
'tables': [{'datapackage': 'https://raw.githubusercontent.com/frictionlessdata/profiles/c3423d1266439ffebfdac2b681d3dd0bffd81964/assets/grants/datapackage.json',
'encoding': None,
'error-count': 0,
'errors': [],
'format': 'inline',
'headers': ['Identifier',
'Title',
'Description',
'Currency',
'Amount Awarded',
'Amount Disbursed',
'Award Date',
'URL',
'Planned Dates:Start Date',
...
'Grant Programme:URL',
'From an open call?',
'Related Activity',
'Last modified',
'Data Source'],
'row-count': 70,
'schema': 'table-schema',
'scheme': None,
'source': 'https://raw.githubusercontent.com/frictionlessdata/profiles/c3423d1266439ffebfdac2b681d3dd0bffd81964/assets/grants/360G-blagravetrust-2016.xlsx',
'time': 0.53,
'valid': True}],
'time': 1.386,
'valid': True,
'warnings': []}
'''</span>
</code></pre></div></div>
<h2 id="iati-registry">IATI Registry</h2>
<p>The <a href="http://iatistandard.org/">IATI Standard</a> is a technical framework to publish aid, development, and humanitarian data in a standard way. Data published in the IATI standard is indexed on the <a href="https://iatiregistry.org/">IATI Registry</a>. Here we will demonstrate the creation of a custom Data Package profile to package data meant to be published in the registry, ensuring that it has the required metadata.</p>
<p>Here are the fields available when publishing a new IATI file on the registry:</p>
<table class="table table-striped table-bordered" style="display: block; overflow: auto;">
<thead>
<tr>
<th>Name</th>
<th>Data Package field</th>
<th>Description</th>
<th>Type</th>
</tr>
</thead>
<tbody>
<tr>
<td><code class="language-plaintext highlighter-rouge">registry-file-id</code></td>
<td><code class="language-plaintext highlighter-rouge">name </code></td>
<td>A unique identifier for the activity record</td>
<td>string</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">registry-publisher-id</code></td>
<td>-</td>
<td>Publisher identifier on the IATI Registry</td>
<td>string</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">title</code></td>
<td><code class="language-plaintext highlighter-rouge">title </code></td>
<td>The title of the dataset</td>
<td>string</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">description</code></td>
<td><code class="language-plaintext highlighter-rouge">description</code></td>
<td>Some useful notes about the data</td>
<td>string</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">source-url</code></td>
<td><code class="language-plaintext highlighter-rouge">resources[0]['path']</code></td>
<td>URL to a publicly accessible IATI file</td>
<td>string</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">contact-email</code></td>
<td>-</td>
<td>Contact email for publisher</td>
<td>string</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">file-type</code></td>
<td>-</td>
<td>Must be either ‘Activity’ or ‘Organization’</td>
<td>string</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">recipient-country</code></td>
<td>-</td>
<td>Recipient country</td>
<td>string</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">last-updated-datetime</code></td>
<td>-</td>
<td>Timestamp of the last modification</td>
<td>date-time</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">activity-count</code></td>
<td>-</td>
<td>Number of activities described in the data</td>
<td>integer</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">default-language</code></td>
<td>-</td>
<td>Language of the data</td>
<td>string</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">secondary-publisher</code></td>
<td>-</td>
<td>The publisher this dataset is published on behalf of</td>
<td>string</td>
</tr>
</tbody>
</table>
<p>To create the new profile, we will add those fields that do not map directly to the <a href="https://frictionlessdata.io/specs/data-package/">Data Package specification</a> to a standard Data Package descriptor and create a custom JSON Schema to validate it. Here is the <a href="https://github.com/frictionlessdata/profiles/blob/master/assets/iatiregistry/datapackage.json">resulting Data Package descriptor</a>.</p>
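<p>For illustration, a fragment of such a JSON Schema might look like the sketch below. The property names come from the table above, but the exact structure of the real profile may differ:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "required": ["registry-publisher-id", "contact-email", "file-type"],
  "properties": {
    "registry-publisher-id": {"type": "string"},
    "contact-email": {"type": "string"},
    "file-type": {"type": "string", "enum": ["Activity", "Organization"]},
    "activity-count": {"type": "integer"}
  }
}
</code></pre></div></div>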
<h3 id="trees">Trees</h3>
<p>The <a href="https://opencouncildata.org/" title="Open Council Data">Open Council Data</a> defined the standard <a href="http://standards.opencouncildata.org/#/trees" title="Open Council Data: Trees 1.3 Specification">Trees 1.3</a> for describing the trees in a geographical region (e.g. a council). This standard includes information about the location, type, and other characteristics of individual trees, which is useful for planning future growth, maintenance of canopy cover, managing risk of falling branches, etc.</p>
<p>We are using the from <a href="https://data.gov.au/dataset/colac-otway-shire-trees/resource/bcf1d62b-9e72-4eca-b183-418f83dedcea" title="Colac Otway Shire Trees">Colac Otway Shire Trees</a> as an example.</p>
<table class="table table-striped table-bordered" style="display: block; overflow: auto;">
<thead>
<tr>
<th>lat</th>
<th>lon</th>
<th>genus</th>
<th>species</th>
<th>dbh</th>
<th>dbh_min</th>
<th>dbh_max</th>
<th>year_min</th>
<th>year_max</th>
<th>crown</th>
<th>crown_min</th>
<th>crown_max</th>
<th>height</th>
<th>height_min</th>
<th>height_max</th>
<th>common</th>
<th>location</th>
<th>ref</th>
<th>maintenance</th>
<th>maturity</th>
<th>planted</th>
<th>updated</th>
<th>health</th>
<th>variety</th>
<th>description</th>
<th>family</th>
<th>ule_min</th>
<th>ule_max</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<td>-38.344595</td>
<td>143.592171</td>
<td>Melaleuca</td>
<td>Stypheliodes</td>
<td>1</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>5</td>
<td> </td>
<td> </td>
<td>Prickly Paperback</td>
<td>street</td>
<td>10001</td>
<td> </td>
<td>mature</td>
<td>1975-01-01</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>106 Queen ST COLAC VIC 3250</td>
</tr>
<tr>
<td>-38.346198</td>
<td>143.591812</td>
<td>Melaleuca</td>
<td>Stypheliodes</td>
<td>1</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>4</td>
<td> </td>
<td> </td>
<td>Prickly Paperback</td>
<td>street</td>
<td>10004</td>
<td> </td>
<td>mature</td>
<td>1975-01-01</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>122 Queen ST COLAC VIC 3250</td>
</tr>
<tr>
<td>-38.342097</td>
<td>143.588944</td>
<td>Fraxinus</td>
<td>Excelsior</td>
<td>1.2</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>12</td>
<td> </td>
<td> </td>
<td>Golden Ash</td>
<td>street</td>
<td>10007</td>
<td> </td>
<td>mature</td>
<td>1980-01-01</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>40 Rae ST COLAC VIC 3250</td>
</tr>
<tr>
<td>-38.341927</td>
<td>143.588715</td>
<td>Agonis</td>
<td>Flexuosa</td>
<td>0.4</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>5</td>
<td> </td>
<td> </td>
<td>Weeping Willow Myrtle</td>
<td>street</td>
<td>10018</td>
<td> </td>
<td>semi-mature</td>
<td>1980-01-01</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>47 Rae ST COLAC VIC 3250//next to coles coaches</td>
</tr>
<tr>
<td>-38.342044</td>
<td>143.591182</td>
<td>Eucalyptus</td>
<td>Nichollii</td>
<td>0.3</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>6</td>
<td> </td>
<td> </td>
<td>Willow Peppermint</td>
<td>street</td>
<td>10021</td>
<td> </td>
<td>mature</td>
<td>1980-01-01</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>56 Rae ST COLAC VIC 3250//Between Queen St & CCDA</td>
</tr>
</tbody>
</table>
<p>This data was modified from the source to conform to the <a href="http://standards.opencouncildata.org/#/trees" title="Open Council Data: Trees 1.3 Specification">Trees 1.3</a> specification. All the data is available <a href="https://github.com/frictionlessdata/profiles/blob/master/assets/trees/data.csv" title="Trees CSV">here</a>.</p>
<p>The <a href="https://github.com/frictionlessdata/profiles/blob/master/assets/trees/trees-data-package" title="Trees Data Package JSON Schema">Trees Data Package</a> extends the <a href="https://frictionlessdata.io/specs/data-package/">Data Package</a> specification by adding the following fields:</p>
<table class="table table-striped table-bordered" style="display: block; overflow: auto;">
<thead>
<tr>
<th>Name</th>
<th>Description</th>
<th>Type</th>
</tr>
</thead>
<tbody>
<tr>
<td>countryCode</td>
<td>A single 2-letter ISO country code, or an array of such codes, defining the country(ies) present in the data</td>
<td>string</td>
</tr>
<tr>
<td>geospatialCoverage</td>
<td>Geospatial area contained in the dataset</td>
<td>geojson</td>
</tr>
</tbody>
</table>
<table class="table table-striped table-bordered" style="display: block; overflow: auto;">
<thead>
<tr>
<th>Name</th>
<th>Title</th>
<th>Type</th>
<th>Constraints</th>
</tr>
</thead>
<tbody>
<tr>
<td>lat</td>
<td>Latitude in decimal degrees (EPSG:4326)</td>
<td>number</td>
<td><strong>required</strong>: True</td>
</tr>
<tr>
<td>lon</td>
<td>Longitude in decimal degrees (EPSG:4326)</td>
<td>number</td>
<td><strong>required</strong>: True</td>
</tr>
<tr>
<td>genus</td>
<td>Botanical genus, in title case (e.g. Eucalyptus)</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td>species</td>
<td>Botanical species, in title case (e.g. Regnans)</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td>dbh</td>
<td>Diameter at breast height (130cm above ground), in centimeters. If this information is available only as a range, this contains the middle of the range.</td>
<td>number</td>
<td><strong>minimum</strong>: 0</td>
</tr>
<tr>
<td>dbh_min</td>
<td>Minimum diameter at breast height (130cm above ground)</td>
<td>number</td>
<td><strong>minimum</strong>: 0</td>
</tr>
<tr>
<td>dbh_max</td>
<td>Maximum diameter at breast height (130cm above ground)</td>
<td>number</td>
<td><strong>minimum</strong>: 0</td>
</tr>
<tr>
<td>year_min</td>
<td>Lower bound on year that tree is expected to live to (e.g. A tree surveyed in 2008 with useful life expectancy range of 10-15 years would be 2018).</td>
<td>year</td>
<td> </td>
</tr>
<tr>
<td>year_max</td>
<td>Upper bound on year that tree is expected to live to (e.g. A tree surveyed in 2008 with useful life expectancy range of 10-15 years would be 2023).</td>
<td>year</td>
<td> </td>
</tr>
<tr>
<td>crown</td>
<td>Width in metres of the tree’s foliage (also known as crown spread). If this information is available only as a range, this contains the middle of the range.</td>
<td>number</td>
<td><strong>minimum</strong>: 0</td>
</tr>
<tr>
<td>crown_min</td>
<td>Minimum width in meters of the tree’s foliage</td>
<td>number</td>
<td><strong>minimum</strong>: 0</td>
</tr>
<tr>
<td>crown_max</td>
<td>Maximum width in meters of the tree’s foliage</td>
<td>number</td>
<td><strong>minimum</strong>: 0</td>
</tr>
<tr>
<td>height</td>
<td>Height in meters. If this information is available only as a range, this contains the middle of the range.</td>
<td>number</td>
<td><strong>minimum</strong>: 0</td>
</tr>
<tr>
<td>height_min</td>
<td>Minimum height in meters</td>
<td>number</td>
<td><strong>minimum</strong>: 0</td>
</tr>
<tr>
<td>height_max</td>
<td>Maximum height in meters</td>
<td>number</td>
<td><strong>minimum</strong>: 0</td>
</tr>
<tr>
<td>common</td>
<td>Common name for species (non-standardised)</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td>location</td>
<td>Where the tree is located</td>
<td>string</td>
<td><strong>enum</strong>: [‘park’, ‘street’, ‘council’]</td>
</tr>
<tr>
<td>ref</td>
<td>Council-specific identifier, enabling joining to other datasets</td>
<td>number</td>
<td> </td>
</tr>
<tr>
<td>maintenance</td>
<td>How often the tree is inspected (in months)</td>
<td>number</td>
<td><strong>minimum</strong>: 0</td>
</tr>
<tr>
<td>maturity</td>
<td> </td>
<td>string</td>
<td><strong>enum</strong>: [‘young’, ‘semi-mature’, ‘mature’, ‘over-mature’]</td>
</tr>
<tr>
<td>planted</td>
<td>Date of planting</td>
<td>date</td>
<td> </td>
</tr>
<tr>
<td>updated</td>
<td>Date of addition to database or most recent revision</td>
<td>date</td>
<td> </td>
</tr>
<tr>
<td>health</td>
<td>Health of tree growth</td>
<td>string</td>
<td><strong>enum</strong>: [‘stump’, ‘dead’, ‘poor’, ‘fair’, ‘good’]</td>
</tr>
<tr>
<td>variety</td>
<td>Any part of the scientific name below species level, including subspecies or variety</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td>description</td>
<td>Other information about the tree that is not in its scientific name or species</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td>family</td>
<td>Botanical family</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td>ule_min</td>
<td>Lower bound on useful life expectancy when surveyed</td>
<td>number</td>
<td><strong>minimum</strong>: 0</td>
</tr>
<tr>
<td>ule_max</td>
<td>Upper bound on useful life expectancy when surveyed</td>
<td>number</td>
<td><strong>minimum</strong>: 0</td>
</tr>
<tr>
<td>address</td>
<td>Street address</td>
<td>string</td>
<td> </td>
</tr>
</tbody>
</table>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">datapackage</span>
<span class="n">datapackage_url</span> <span class="o">=</span> <span class="s">'https://raw.githubusercontent.com/frictionlessdata/profiles/master/assets/trees/datapackage.json'</span>
<span class="n">dp</span> <span class="o">=</span> <span class="n">datapackage</span><span class="p">.</span><span class="n">Package</span><span class="p">(</span><span class="n">datapackage_url</span><span class="p">)</span>
<span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">dp</span><span class="p">.</span><span class="n">resources</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="nb">iter</span><span class="p">(</span><span class="n">keyed</span><span class="o">=</span><span class="bp">True</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="n">row</span><span class="p">)</span>
<span class="c1"># {'lat': Decimal('-38.347497'), 'lon': Decimal('143.595686'), 'genus': 'Melaleuca', 'species': 'Nesophila', 'dbh': Decimal('0.25'), 'dbh_min': None, 'dbh_max': None, 'year_min': None, 'year_max': None, 'crown': None, 'crown_min': None, 'crown_max': None, 'height': Decimal('2'), 'height_min': None, 'height_max': None, 'common': 'Snowy Honey Myrtle', 'location': 'street', 'ref': Decimal('10379'), 'maintenance': None, 'maturity': 'semi-mature', 'planted': datetime.date(1980, 1, 1), 'updated': None, 'health': None, 'variety': None, 'description': None, 'family': None, 'ule_min': None, 'ule_max': None, 'address': '18 Thomas ST COLAC VIC 3250'}
</span></code></pre></div></div>
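<p>Because the schema gives every row proper types as it streams, ordinary Python is enough for quick summaries. A small sketch building on the snippet above (the aggregation itself is our own illustration, not part of the profile):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Count trees by maturity, using the same datapackage-py API as above.
from collections import Counter

import datapackage

datapackage_url = ('https://raw.githubusercontent.com/frictionlessdata/'
                   'profiles/master/assets/trees/datapackage.json')
dp = datapackage.Package(datapackage_url)

maturity_counts = Counter(
    row['maturity'] or 'unknown'              # missing values become 'unknown'
    for row in dp.resources[0].iter(keyed=True))
print(maturity_counts.most_common())
</code></pre></div></div>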
<h3 id="conculsion">Conculsion</h3>
<p>This has been a high-level exploration of using <a href="https://frictionlessdata.io/specs/tabular-data-package/">Tabular Data Package</a> and <a href="https://frictionlessdata.io/specs/table-schema/">Table Schema</a> as a “specification framework”, allowing one to bootstrap a proof-of-concept data standard. Taking this approach, one gains access to a collection of modular software libraries that provide powerful APIs for working with this data according to the rules and conditions of the declared standard. Data validation, processing, transport, and consumption do not require custom tool chains once the data standard is declared as a <a href="https://frictionlessdata.io/specs/profiles/">Tabular Data Package Profile</a>.</p>
<p>The approach described here is a first step in the direction of domain-specific tabular data profiles. A future iteration would likely integrate work we are currently undertaking in the <a href="https://frictionlessdata.io/specs/fiscal-data-package/">Fiscal Data Package</a> which enables the simple declaration of <em>domain concepts</em> via <code class="language-plaintext highlighter-rouge">columnType</code> annotations on Table Schemas. This enables data standard authors to work at a level of abstraction of domain concepts, rather than the “primitive types” we work with here via Table Schema. We plan to revisit this work once the <code class="language-plaintext highlighter-rouge">columnType</code> work from Fiscal Data Package is stable for general use.</p>
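<p>To give a flavour of the idea, a hypothetical Table Schema fragment with such annotations might look like this. The <code class="language-plaintext highlighter-rouge">columnType</code> values below are illustrative only, not a stable vocabulary from the Fiscal Data Package work:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "fields": [
    {"name": "Amount Awarded", "type": "number", "columnType": "value"},
    {"name": "Funding Org:Name", "type": "string", "columnType": "administrative-classification"}
  ]
}
</code></pre></div></div>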
<p>For now, all the schemas above work as described, and open up all the software in the Frictionless Data ecosystem to those following this approach.</p>
<p>You can check the source code for all the examples listed in the following GitHub repository:</p>
<p><a href="https://github.com/frictionlessdata/profiles">https://github.com/frictionlessdata/profiles</a></p>
Paul Walsh
Validating scraped data using goodtables
2017-11-29T00:00:00+00:00
http://okfnlabs.org/blog/2017/11/29/validating-scraped-data-using-goodtables
<p>We have to deal with many challenges when scraping a page. What’s the page’s layout? How do I extract the bits of data I want? How do I know when the layout changes and breaks my code? How can I be sure that my code isn’t introducing errors into the data? There are many tools to test that code works, but not so many to test the actual data. This is especially important when you don’t control the source of the data, which is almost always the case when you’re scraping (otherwise, you wouldn’t be scraping). In this post, I’ll show you how I used <a href="https://github.com/frictionlessdata/goodtables-py/" title="goodtables">goodtables</a> to validate scraped data.</p>
<p><a href="https://github.com/frictionlessdata/goodtables-py/" title="goodtables">Goodtables</a> is an open source data validator for tabular data (think spreadsheets and CSVs). It can check both the structure of the file (do all rows have the same number of columns?), and its contents (is this a valid date?). Goodtables gives you a safety net that guarantees that your data files are valid.</p>
<p>We’ll work step by step. First, I’ll show you what the data looks like, then we’ll check what goodtables can find out of the box, without any information about the data contents. Finally, we’ll define the types and constraints of each column, so goodtables can validate that the rows contain what we expect.</p>
<p>By the end of this post, you’ll have a better idea on how goodtables can help you be more confident about your data’s quality.</p>
<h2 id="the-data"><a name="data"></a>The data</h2>
<p>We’ll use the remuneration of the civil servants working for São Paulo’s City Council as an example. This data was scraped from <a href="http://www.camara.sp.gov.br/transparencia/salarios-abertos/remuneracao-dos-servidores-e-comissionados/" title="Remuneration of Sao Paulo City Council's Civil Servants">their website</a>. The first few rows look like:</p>
<table class="table table-striped table-bordered" style="display: block; overflow: auto;">
<thead>
<tr>
<th>name</th>
<th>role</th>
<th>function</th>
<th>remuneration</th>
<th>department</th>
<th>year</th>
<th>month</th>
</tr>
</thead>
<tbody>
<tr>
<td>MILTON LEITE DA SILVA</td>
<td>VEREADOR</td>
<td>VEREADOR</td>
<td>11534.82</td>
<td>PRESIDÊNCIA</td>
<td>2017</td>
<td>9</td>
</tr>
<tr>
<td>PAULO CESAR TAGLIAVINI</td>
<td>CHEFE DE GABINETE DA PRESIDÊNCIA</td>
<td>CHEFE DE GABINETE DA PRESIDÊNCIA</td>
<td>14124.71</td>
<td>GABINETE DA PRESIDÊNCIA</td>
<td>2017</td>
<td>9</td>
</tr>
<tr>
<td>CECILIA DE ARRUDA</td>
<td>CHEFE DE CERIMONIAL</td>
<td>CHEFE DE CERIMONIAL</td>
<td>22455.9</td>
<td>GABINETE DA PRESIDÊNCIA</td>
<td>2017</td>
<td>9</td>
</tr>
<tr>
<td>ANTONIO JAIR DA ROSA</td>
<td>ASSISTENTE LEGISLATIVO III</td>
<td> </td>
<td>7383.64</td>
<td>GABINETE DA PRESIDÊNCIA</td>
<td>2017</td>
<td>9</td>
</tr>
<tr>
<td>BRASILINO SILVA BRANDAO</td>
<td>ASSISTENTE LEGISLATIVO III</td>
<td> </td>
<td>8135.51</td>
<td>GABINETE DA PRESIDÊNCIA</td>
<td>2017</td>
<td>9</td>
</tr>
</tbody>
</table>
<p>Some of the columns are strings (name, role, function, and department), one is numeric (remuneration), and two are date parts (year and month). We’ll think about the types and constraints on each of these columns in a minute, but first let’s see what goodtables tells us out of the box.</p>
<h2 id="initial-validations">Initial validations</h2>
<p><a href="https://github.com/frictionlessdata/goodtables-py/" title="goodtables">Goodtables</a> is written in Python, and can be used both as a command-line tool or imported in your Python code. We’ll use it in the command-line. Considering that our data lives in <code class="language-plaintext highlighter-rouge">data/remunerations.csv</code>, we validate it by running <code class="language-plaintext highlighter-rouge">goodtables data/remunerations.csv</code>. This is the output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DATASET
=======
{'error-count': 0,
'preset': 'table',
'table-count': 1,
'time': 0.025,
'valid': True}
---------
Warning: Table "data/remunerations.csv" inspection has reached 1000 row(s) limit
TABLE [1]
=========
{'encoding': 'utf-8',
'error-count': 0,
'format': 'csv',
'headers': ['name', 'role', 'function', 'remuneration', 'department', 'year', 'month'],
'row-count': 1000,
'schema': None,
'scheme': 'file',
'source': 'data/remunerations.csv',
'time': 0.024,
'valid': True}
</code></pre></div></div>
<p>It hasn’t found any errors, good! However, there’s a warning: it just analyzed the first 1,000 rows. Maybe there’s an error in the other rows? As our data is very small, with a bit over 2,000 rows, analyzing everything should be quick. Let’s try again with a high row limit, using <code class="language-plaintext highlighter-rouge">goodtables --row-limit 999999 data/remunerations.csv</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DATASET
=======
{'error-count': 1,
'preset': 'table',
'table-count': 1,
'time': 0.046,
'valid': False}
TABLE [1]
=========
{'encoding': 'utf-8',
'error-count': 1,
'format': 'csv',
'headers': ['name', 'role', 'function', 'remuneration', 'department', 'year', 'month'],
'row-count': 2043,
'schema': None,
'scheme': 'file',
'source': 'data/remunerations.csv',
'time': 0.045,
'valid': False}
---------
[1859,-] [duplicate-row] Row 1859 is duplicated to row(s) 1858
</code></pre></div></div>
<p>A-ha! Now it found an error: duplicate rows. Depending on the data, this might or might not be an issue. Goodtables is helpful enough to tell us the row numbers, so let’s take a look at them:</p>
<table class="table table-striped table-bordered" style="display: block; overflow: auto;">
<thead>
<tr>
<th>name</th>
<th>role</th>
<th>function</th>
<th>remuneration</th>
<th>department</th>
<th>year</th>
<th>month</th>
</tr>
</thead>
<tbody>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>CTI-4 - EQUIPE DE TELECOMUNICAÇÕES E INFRAESTRUTURA</td>
<td>2017</td>
<td>9</td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>CTI-4 - EQUIPE DE TELECOMUNICAÇÕES E INFRAESTRUTURA</td>
<td>2017</td>
<td>9</td>
</tr>
</tbody>
</table>
<p>This does look like a valid error (no names?). After investigating for a while, I found the culprit: the source website was modified. There are now a few cases where the civil servant’s name was removed by judicial order, and it broke my code. The joys of scraping, right?</p>
<p>After fixing it and running goodtables again, this is what I got:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DATASET
=======
{'error-count': 0,
'preset': 'table',
'table-count': 1,
'time': 0.083,
'valid': True}
TABLE [1]
=========
{'encoding': 'utf-8',
'error-count': 0,
'format': 'csv',
'headers': ['name', 'role', 'function', 'remuneration', 'department', 'year', 'month'],
'row-count': 4083,
'schema': None,
'scheme': 'file',
'source': 'data/remunerations.csv',
'time': 0.081,
'valid': True}
</code></pre></div></div>
<p>Great, no more errors!</p>
<p>Without giving any information about my data, goodtables found out there was a duplicate row. This led me to discover that the website I’m scraping had been modified in a way that broke my code. Even if we stopped now, this would already have been useful. We won’t, though: there are still a few useful tricks up goodtables’ sleeve.</p>
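<p>For completeness, the same check can be run from Python rather than the command line. A minimal sketch, assuming the goodtables-py API of the time (a <code class="language-plaintext highlighter-rouge">validate()</code> function with a <code class="language-plaintext highlighter-rouge">row_limit</code> option mirroring the <code class="language-plaintext highlighter-rouge">--row-limit</code> flag):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># The CLI runs above, done programmatically.
from goodtables import validate

report = validate('data/remunerations.csv', row_limit=999999)
print(report['valid'])        # True once the scraper is fixed
print(report['error-count'])  # 0
</code></pre></div></div>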
<h2 id="improving-the-validations">Improving the validations</h2>
<p>Although goodtables provides valuable information for an arbitrary CSV, its real power comes when we tell it the data schema. It’ll validate that the data is what we expect it to be (numbers are numbers, dates are valid, etc.). The easiest way to define this schema is by creating a <a href="http://frictionlessdata.io/data-packages/" title="Data Package">Data Package</a>.</p>
<p>The first thing we need is to create a JSON file named <code class="language-plaintext highlighter-rouge">datapackage.json</code>:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"remunerations-cmsp"</span><span class="p">,</span><span class="w">
</span><span class="nl">"title"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Remuneration of the civil servants from the Sao Paulo's City Council"</span><span class="p">,</span><span class="w">
</span><span class="nl">"resources"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"remunerations"</span><span class="p">,</span><span class="w">
</span><span class="nl">"path"</span><span class="p">:</span><span class="w"> </span><span class="s2">"data/remunerations.csv"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>This is the simplest data package we can create for this data. It just defines a <code class="language-plaintext highlighter-rouge">name</code> and <code class="language-plaintext highlighter-rouge">title</code> for the dataset, and a single resource, our CSV file. Goodtables supports data packages out of the box, so we can run <code class="language-plaintext highlighter-rouge">goodtables datapackage.json</code> and it’ll give us the same result as running <code class="language-plaintext highlighter-rouge">goodtables data/remunerations.csv</code> directly. With this in place, we can start writing the schema.</p>
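<p>By the way, if you prefer running the validation from Python rather than the command line, goodtables also exposes a <code class="language-plaintext highlighter-rouge">validate</code> function. Here is a minimal sketch, assuming the goodtables Python package is installed and that its API accepts the same row limit we passed to the CLI:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># pip install goodtables
from pprint import pprint

from goodtables import validate

# Validate the data package descriptor; every resource it lists gets checked
report = validate('datapackage.json', row_limit=999999)

pprint(report['valid'])        # overall result for the dataset
pprint(report['error-count'])  # total number of errors found
</code></pre></div></div>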
<p>Think of the schema as a data dictionary. It defines what each column means, what type of data it contains, which format it is in, its description, and so on. Looking at the data, these are the data types of each of our columns:</p>
<ul>
<li>String
<ul>
<li>Name</li>
<li>Role</li>
<li>Function</li>
<li>Department</li>
</ul>
</li>
<li>Currency
<ul>
<li>Remuneration</li>
</ul>
</li>
<li>Dates
<ul>
<li>Year</li>
<li>Month</li>
</ul>
</li>
</ul>
<p>Schemas in data packages follow the <a href="http://frictionlessdata.io/guides/table-schema/" title="Table Schema">Table Schema</a> specification, which defines how to write the schema, a few basic types, and how to add constraints (e.g. uniqueness, required values, valid ranges). It sounds more complicated than it actually is. For instance, this is how we would write the column types defined above using Table Schema:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"remunerations-cmsp"</span><span class="p">,</span><span class="w">
</span><span class="nl">"title"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Remuneration of the civil servants from the Sao Paulo's City Council"</span><span class="p">,</span><span class="w">
</span><span class="nl">"resources"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"remunerations"</span><span class="p">,</span><span class="w">
</span><span class="nl">"path"</span><span class="p">:</span><span class="w"> </span><span class="s2">"data/remunerations.csv"</span><span class="p">,</span><span class="w">
</span><span class="nl">"schema"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"fields"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"name"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"string"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"role"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"string"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"function"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"string"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"remuneration"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"number"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"department"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"string"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"year"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"year"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"month"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"integer"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>The only thing we changed was adding the <code class="language-plaintext highlighter-rouge">schema</code> attribute to our resource; everything else is the same. When we run goodtables again, it still succeeds, but now it’s not only running the basic validations but also checking the cells’ types.</p>
<p>Can we improve it further? Of course!</p>
<p>Take a look at the <code class="language-plaintext highlighter-rouge">month</code> column. As Table Schema doesn’t have a “month” data type, we had to use the closest one: integer. A month is an integer, but it’s not <em>any</em> integer: it can’t be zero, -1, or 42; it must be between 1 and 12. Table Schema allows us to define these constraints in our schema, but before I show you how, what about the other columns? Are there other similar constraints, not only valid ranges, but also whether values are required or must be unique?</p>
<p>I went through all the columns, looking at the data to understand which constraints they have, and this is what I defined:</p>
<ul>
<li>Department
<ul>
<li>Required</li>
</ul>
</li>
<li>Remuneration
<ul>
<li>Required</li>
</ul>
</li>
<li>Year
<ul>
<li>Required</li>
<li>2017 or later (there’s no historical data)</li>
</ul>
</li>
<li>Month
<ul>
<li>Required</li>
<li>Between 1 and 12</li>
</ul>
</li>
</ul>
<p>There are no constraints for <code class="language-plaintext highlighter-rouge">name</code>, <code class="language-plaintext highlighter-rouge">role</code> and <code class="language-plaintext highlighter-rouge">function</code> other than the type. In the <code class="language-plaintext highlighter-rouge">datapackage.json</code>, these fields will look like this:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"department"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"string"</span><span class="p">,</span><span class="w">
</span><span class="nl">"constraints"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"required"</span><span class="p">:</span><span class="w"> </span><span class="s2">"true"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="err">,</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"remuneration"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"number"</span><span class="p">,</span><span class="w">
</span><span class="nl">"constraints"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"required"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="err">,</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"year"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"number"</span><span class="p">,</span><span class="w">
</span><span class="nl">"constraints"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"required"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"minimum"</span><span class="p">:</span><span class="w"> </span><span class="mi">2017</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="err">,</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"month"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"number"</span><span class="p">,</span><span class="w">
</span><span class="nl">"constraints"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"required"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"minimum"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
</span><span class="nl">"maximum"</span><span class="p">:</span><span class="w"> </span><span class="mi">12</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>Goodtables now raises a few errors on <code class="language-plaintext highlighter-rouge">remuneration</code>: there are some rows where it’s empty. Looking back at the original website, I confirmed that I was wrong; there really are some rows without a <code class="language-plaintext highlighter-rouge">remuneration</code> (apparently the councillors’ remunerations are published somewhere else). After removing this constraint, everything runs successfully.</p>
<p>The final <code class="language-plaintext highlighter-rouge">datapackage.json</code> looks like:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"remunerations-cmsp"</span><span class="p">,</span><span class="w">
</span><span class="nl">"title"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Remuneration of the civil servants from the Sao Paulo's City Council"</span><span class="p">,</span><span class="w">
</span><span class="nl">"resources"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"remunerations"</span><span class="p">,</span><span class="w">
</span><span class="nl">"path"</span><span class="p">:</span><span class="w"> </span><span class="s2">"data/remunerations.csv"</span><span class="p">,</span><span class="w">
</span><span class="nl">"schema"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"fields"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"name"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"string"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"role"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"string"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"function"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"string"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"remuneration"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"number"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"department"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"string"</span><span class="p">,</span><span class="w">
</span><span class="nl">"constraints"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"required"</span><span class="p">:</span><span class="w"> </span><span class="s2">"true"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"year"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"year"</span><span class="p">,</span><span class="w">
</span><span class="nl">"constraints"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"required"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"minimum"</span><span class="p">:</span><span class="w"> </span><span class="mi">2017</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"month"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"integer"</span><span class="p">,</span><span class="w">
</span><span class="nl">"constraints"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"required"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"minimum"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
</span><span class="nl">"maximum"</span><span class="p">:</span><span class="w"> </span><span class="mi">12</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>I could’ve added constraints on the <code class="language-plaintext highlighter-rouge">role</code>, <code class="language-plaintext highlighter-rouge">function</code>, and <code class="language-plaintext highlighter-rouge">department</code> fields, as they can only take a limited set of values (e.g. there’s no department “Foobar”). I decided it wasn’t worth the trouble for now, as I don’t have a list of possible values at hand. If I want to add these or other constraints in the future, the structure is already in place, so it’s straightforward.</p>
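<p>For the record, here is roughly what such a constraint would look like, using Table Schema’s <code class="language-plaintext highlighter-rouge">enum</code> constraint with a hypothetical, incomplete list of departments:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "name": "department",
  "type": "string",
  "constraints": {
    "required": true,
    "enum": [
      "CTI-4 - EQUIPE DE TELECOMUNICAÇÕES E INFRAESTRUTURA",
      "..."
    ]
  }
}
</code></pre></div></div>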
<h2 id="conclusion">Conclusion</h2>
<p>My intent with this post was to show you the value of adding even a little bit of data validation to your toolbox, and how easy it is to do so with goodtables. We started by running it without giving any information about our data. It found duplicate rows that led me to discover that the website I’m scraping has changed, so my scraper was out of date. After I updated the code and ran it again, goodtables was successful.</p>
<p>We then told goodtables more about our data by writing a schema using the <a href="http://frictionlessdata.io/data-packages/" title="Data Package">Data Package</a> and <a href="http://frictionlessdata.io/guides/table-schema/" title="Table Schema">Table Schema</a> specifications. This led me to get to know the data better, as my initial assumption that all rows must have a remuneration was wrong.</p>
<p>With all this in place, goodtables is now able to check not only the structure of the data, but that its contents are valid. The next step is how to make sure it stays this way. In a future post, I’ll show you how to run goodtables automatically as part of your test suite when your data is on GitHub.</p>
<p>I hope you found this interesting. If you’re curious about how this all fits together, check it out on <a href="https://github.com/vitorbaptista/remuneracao_cmsp">https://github.com/vitorbaptista/remuneracao_cmsp</a>.</p>
<p>If you have any questions, feedback, or would just like to chat, join our <a href="http://gitter.im/frictionlessdata/chat" title="Frictionless Data Gitter">Frictionless Data Gitter chat</a>. We’d love to hear from you, so we can make these tools as useful as they can be.</p>
Vitor Baptista
Core Data on DataHub.io
2017-11-03T00:00:00+00:00
http://okfnlabs.org/blog/2017/11/03/core-data
<p>This blog post was originally published on <a href="http://datahub.io/">datahub.io</a> by <a href="http://datahub.io/rufuspollock">Rufus Pollock</a>, <a href="http://datahub.io/Mikanebu">Meiran Zhiyenbayev</a> & <a href="http://datahub.io/anuveyatsu">Anuar Ustayev</a>.</p>
<hr />
<p>The “Core Data” project provides essential data for data wranglers and the data science community. Its online home is on the DataHub:</p>
<p><a href="https://datahub.io/core">https://datahub.io/core</a></p>
<p><a href="https://datahub.io/docs/core-data">https://datahub.io/docs/core-data</a></p>
<p>This post introduces you to Core Data, presents a couple of examples, and shows how you can easily access and use core data from your own tools and systems, including R, Python, Pandas and more.</p>
<ul id="markdown-toc">
<li><a href="#why-core-data" id="markdown-toc-why-core-data">Why Core Data</a></li>
<li><a href="#examples" id="markdown-toc-examples">Examples</a> <ul>
<li><a href="#list-of-countries" id="markdown-toc-list-of-countries">List of Countries</a></li>
<li><a href="#country-codes" id="markdown-toc-country-codes">Country Codes</a></li>
<li><a href="#population" id="markdown-toc-population">Population</a></li>
</ul>
</li>
<li><a href="#use-core-data-from-your-favorite-language-or-tool" id="markdown-toc-use-core-data-from-your-favorite-language-or-tool">Use Core Data from your favorite language or tool</a> <ul>
<li><a href="#csv-and-json" id="markdown-toc-csv-and-json">CSV and JSON</a></li>
<li><a href="#curl" id="markdown-toc-curl">cURL</a></li>
<li><a href="#r" id="markdown-toc-r">R</a></li>
<li><a href="#python" id="markdown-toc-python">Python</a></li>
<li><a href="#pandas" id="markdown-toc-pandas">Pandas</a></li>
<li><a href="#ruby-javascript-and-many-more" id="markdown-toc-ruby-javascript-and-many-more">Ruby, JavaScript and many more</a></li>
</ul>
</li>
<li><a href="#conclusion" id="markdown-toc-conclusion">Conclusion</a></li>
</ul>
<h2 id="why-core-data">Why Core Data</h2>
<p>If you build data-driven applications or produce data-driven insights, you regularly find yourself wanting common “core” data: things like lists of countries, populations, geographic boundaries and more.</p>
<p>However, finding good quality data has always been challenging. Professionals can spend lots of time finding and preparing data before they get to do any real work analysing or presenting it.</p>
<p>To address this, a few years ago we started the “core data” project as part of the Frictionless Data initiative. Its purpose was to curate important, commonly used datasets including reference data like country codes, indicators like population and GDP, and geodata like country boundaries. It provides them in a high-quality, easy-to-use, and standard form.</p>
<p>Recently the Core Data project has got even better with a new home on the newly upgraded DataHub and has expanded thanks to new partners like Datopian and John Snow Labs (more on this in a future post!).
<br /><br /></p>
<h2 id="examples">Examples</h2>
<p>There are dozens of core datasets already available and many more being worked on, including a list of countries and their 2-digit codes, and a more extensive version.</p>
<h3 id="list-of-countries">List of Countries</h3>
<p>Ever needed to build a drop-down list of countries in a web application? Or ever needed to add country name labels for a graph and only had country codes?</p>
<p>Then these datasets are for you!</p>
<p>First up is the very simple “country-list” dataset:</p>
<p><a href="https://datahub.io/core/country-list">https://datahub.io/core/country-list</a></p>
<p>You can see a preview table for the dataset on the showcase page:
<br /><br /></p>
<p><img src="/img/posts/country-list-preview-table.png" alt="" /></p>
<p><br />
You can download it in either CSV or JSON formats:
<br /><br /></p>
<p><img src="/img/posts/country-list-downloads.png" alt="" />
<br /></p>
<ul>
<li>CSV: <a href="https://datahub.io/core/country-list/r/data.csv">https://datahub.io/core/country-list/r/data.csv</a></li>
<li>JSON: <a href="https://datahub.io/core/country-list/r/data.json">https://datahub.io/core/country-list/r/data.json</a></li>
</ul>
<h3 id="country-codes">Country Codes</h3>
<p>Maybe the simple list of countries is not enough for you. Perhaps you need phone codes for each country, or want to know their currencies?</p>
<p>We’ve got you covered with the more extensive country codes dataset:</p>
<p><a href="https://datahub.io/core/country-codes">https://datahub.io/core/country-codes</a></p>
<p>It contains all the countries from Country List plus a number of associated codes: ISO 3166 codes, ITU dialing codes, ISO 4217 currency codes, and many others. In total, this dataset includes <strong>26</strong> different codes and associated pieces of information.</p>
<p>You can also preview the data and download it in different formats, just as described for the Country List dataset above:</p>
<ul>
<li>CSV: <a href="https://datahub.io/core/country-codes/r/country-codes.csv">https://datahub.io/core/country-codes/r/country-codes.csv</a></li>
<li>JSON: <a href="https://datahub.io/core/country-codes/r/country-codes.json">https://datahub.io/core/country-codes/r/country-codes.json</a></li>
</ul>
<h3 id="population">Population</h3>
<p>This is another dataset people find very useful: you regularly need population figures in order to do normalisations and calculate per capita values as part of a statistical analysis.</p>
<p>This dataset includes population figures for countries, regions (e.g. Asia) and the world. The data comes originally from the World Bank and has been converted into a standard tabular data package with CSV data and a table schema:</p>
<p><a href="https://datahub.io/core/population">https://datahub.io/core/population</a></p>
<p>Preview the data on the showcase page:
<br /><br /></p>
<p><img src="/img/posts/population-preview-table.png" alt="" />
<br />
Get the data in CSV or JSON formats just like for any other Core Datasets:</p>
<ul>
<li>CSV: <a href="https://datahub.io/core/population/r/population.csv">https://datahub.io/core/population/r/population.csv</a></li>
<li>JSON: <a href="https://datahub.io/core/population/r/population.json">https://datahub.io/core/population/r/population.json</a></li>
</ul>
<h2 id="use-core-data-from-your-favorite-language-or-tool">Use Core Data from your favorite language or tool</h2>
<p>We have made Core Data easy to use from various programming languages and tools. We will walk through our Country List example, but you can apply these instructions to any other core dataset on the DataHub.</p>
<h3 id="csv-and-json">CSV and JSON</h3>
<p>If you just need to get the data, you have a direct link usable from any tool or app, e.g. for the country list (see the short Pandas sketch after these links):</p>
<ul>
<li>CSV - <a href="https://datahub.io/core/country-list/r/data.csv">https://datahub.io/core/country-list/r/data.csv</a></li>
<li>JSON - <a href="https://datahub.io/core/country-list/r/data.json">https://datahub.io/core/country-list/r/data.json</a></li>
</ul>
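<p>Because these are plain URLs, you can read them directly with any HTTP-capable tool. For instance, here is a quick Python sketch using Pandas, which happily reads a CSV straight from a URL and follows the redirect (for full Data Package support, see the Pandas section below):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># pip install pandas
import pandas as pd

# Read the country list CSV directly from its DataHub URL
countries = pd.read_csv('https://datahub.io/core/country-list/r/data.csv')
print(countries.head())
</code></pre></div></div>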
<div class="alert alert-info">
For more read our "Getting Data" tutorial:
<br />
<p><a href="https://datahub.io/docs/getting-started/getting-data">https://datahub.io/docs/getting-started/getting-data</a></p>
</div>
<h3 id="curl">cURL</h3>
<p>The following commands help you get the data using the cURL tool. Use the <code class="language-plaintext highlighter-rouge">-L</code> flag so cURL follows redirects:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"> <span class="c"># Get the data:</span>
curl <span class="nt">-L</span> https://datahub.io/core/country-list/r/data.csv
<span class="c"># datapackage.json provides metadata and a list of all data files</span>
curl <span class="nt">-L</span> https://datahub.io/core/country-list/datapackage.json
<span class="c"># See just the available data files (resources):</span>
curl <span class="nt">-L</span> https://datahub.io/core/country-list/datapackage.json | jq <span class="s2">".resources"</span></code></pre></figure>
<h3 id="r">R</h3>
<p>If you are using R here’s how to get the data you want quickly loaded:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"jsonlite"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"jsonlite"</span><span class="p">)</span><span class="w">
</span><span class="n">json_file</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"https://datahub.io/core/country-list/datapackage.json"</span><span class="w">
</span><span class="n">json_data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">fromJSON</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="n">readLines</span><span class="p">(</span><span class="n">json_file</span><span class="p">),</span><span class="w"> </span><span class="n">collapse</span><span class="o">=</span><span class="s2">""</span><span class="p">))</span><span class="w">
</span><span class="c1"># access csv file by the index starting from 1</span><span class="w">
</span><span class="n">path_to_file</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">json_data</span><span class="o">$</span><span class="n">resources</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="o">$</span><span class="n">path</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read.csv</span><span class="p">(</span><span class="n">url</span><span class="p">(</span><span class="n">path_to_file</span><span class="p">))</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">data</span><span class="p">)</span></code></pre></figure>
<h3 id="python">Python</h3>
<p>Here we take a look at how to get the Country List in the Python programming language.</p>
<p>First, install the <code class="language-plaintext highlighter-rouge">datapackage</code> library (all the datasets on the DataHub are Data Packages):</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"> pip <span class="nb">install </span>datapackage</code></pre></figure>
<p>Again, we’ll use the <code class="language-plaintext highlighter-rouge">country-list</code> dataset:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"> <span class="kn">from</span> <span class="nn">datapackage</span> <span class="kn">import</span> <span class="n">Package</span>
<span class="n">package</span> <span class="o">=</span> <span class="n">Package</span><span class="p">(</span><span class="s">'https://datahub.io/core/country-list/datapackage.json'</span><span class="p">)</span>
<span class="c1"># get list of resources:
</span> <span class="n">resources</span> <span class="o">=</span> <span class="n">package</span><span class="p">.</span><span class="n">descriptor</span><span class="p">[</span><span class="s">'resources'</span><span class="p">]</span>
<span class="n">resourceList</span> <span class="o">=</span> <span class="p">[</span><span class="n">resources</span><span class="p">[</span><span class="n">x</span><span class="p">][</span><span class="s">'name'</span><span class="p">]</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">resources</span><span class="p">))]</span>
<span class="k">print</span><span class="p">(</span><span class="n">resourceList</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">package</span><span class="p">.</span><span class="n">resources</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">read</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">data</span><span class="p">)</span></code></pre></figure>
<h3 id="pandas">Pandas</h3>
<p>In order to work with Data Packages in Pandas, you need to install the Frictionless Data <code class="language-plaintext highlighter-rouge">datapackage</code> library and its Pandas extension:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"> pip <span class="nb">install </span>datapackage
pip <span class="nb">install </span>jsontableschema-pandas</code></pre></figure>
<p>To get the data, run the following code:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"> <span class="kn">import</span> <span class="nn">datapackage</span>
<span class="n">data_url</span> <span class="o">=</span> <span class="s">"https://datahub.io/core/country-list/datapackage.json"</span>
<span class="c1"># to load Data Package into storage
</span> <span class="n">storage</span> <span class="o">=</span> <span class="n">datapackage</span><span class="p">.</span><span class="n">push_datapackage</span><span class="p">(</span><span class="n">data_url</span><span class="p">,</span> <span class="s">'pandas'</span><span class="p">)</span>
<span class="c1"># data frames available (corresponding to data files in original dataset)
</span> <span class="n">storage</span><span class="p">.</span><span class="n">buckets</span>
<span class="c1"># you can access datasets inside storage, e.g. the first one:
</span> <span class="n">storage</span><span class="p">[</span><span class="n">storage</span><span class="p">.</span><span class="n">buckets</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span></code></pre></figure>
<h3 id="ruby-javascript-and-many-more">Ruby, JavaScript and many more</h3>
<p>We also have support for JavaScript, SQL, Ruby and PHP. See our “Getting Data” tutorial for more:</p>
<p><a href="https://datahub.io/docs/getting-started/getting-data">https://datahub.io/docs/getting-started/getting-data</a></p>
<h2 id="conclusion">Conclusion</h2>
<p>This post has shown how you can import datasets in a high-quality, standard form quickly and easily.</p>
<p>There are many more datasets to explore than the three we showed you here. You can find a full list here:</p>
<p><a href="https://datahub.io/core">https://datahub.io/core</a></p>
<p>Finally, we would love collaborators to help us curate even more core datasets. If you’re interested you can find out more about the Core Data Curator program here:</p>
<p><a href="https://datahub.io/docs/core-data/curators">https://datahub.io/docs/core-data/curators</a></p>
<hr />
<p><em>If you have questions, comments or feedback join <a href="https://gitter.im/datahubio/chat">DataHub’s chat channel</a> or open an issue on <a href="https://github.com/datahubio/qa">DataHub’s tracker</a>.</em></p>
DataHub Team
Data Package v1 Specifications. What has Changed and how to Upgrade
2017-10-11T00:00:00+00:00
http://okfnlabs.org/blog/2017/10/11/upgrade-to-data-package-specs-v1
<p>This post walks you through the major changes in the Data Package v1 specs compared to pre-v1. It covers changes in the full suite of Data Package specifications including Data Resources and Table Schema. It is particularly valuable if:</p>
<ul>
<li>you were using Data Packages pre v1 and want to know how to upgrade your datasets</li>
<li>you are implementing Data Package related tooling and want to know how to upgrade your tools, or want to support or auto-upgrade pre-v1 Data Packages for backwards compatibility</li>
</ul>
<p>It also includes a script we have created (in JavaScript) that we’ve been using ourselves to automate upgrades of the <a href="https://github.com/datahq/datapackage-normalize-js">Core Data</a>.</p>
<h2 id="the-changes">The Changes</h2>
<p>Two major changes in v1 were presentational:</p>
<ul>
<li>Creating Data Resource as a separate spec from Data Package. This did not change anything substantive in terms of how data packages worked but is important presentationally. In parallel, we also split out a Tabular Data Resource from the Tabular Data Package.</li>
<li>Renaming JSON Table Schema to just Table Schema</li>
</ul>
<p>In addition, there were a fair number of substantive changes. We summarize these in the sections below. For more detailed info see the <a href="https://specs.frictionlessdata.io/">current specifications</a> and <a href="https://pre-v1.frictionlessdata.io/">the old site containing the pre spec v1 specifications</a>.</p>
<h3 id="table-schema">Table Schema</h3>
<p>Link to spec: <a href="https://specs.frictionlessdata.io/table-schema/">https://specs.frictionlessdata.io/table-schema/</a></p>
<table class="table table-striped table-bordered" style="display: block; overflow: auto;">
<thead>
<tr>
<th>Property</th>
<th>Pre v1</th>
<th>v1 Spec</th>
<th>Notes</th>
<th>Issue</th>
</tr>
</thead>
<tbody>
<tr>
<td>id/name</td>
<td>id</td>
<td>name</td>
<td>Renamed id to name to be consistent across specs</td>
<td> </td>
</tr>
<tr>
<td>type/number</td>
<td>format: currency</td>
<td>format: currency removed; additional properties: bareNumber, decimalChar and groupChar</td>
<td> </td>
<td><a href="https://github.com/frictionlessdata/specs/issues/509">#509</a><br /><a href="https://github.com/frictionlessdata/specs/issues/246">#246</a></td>
</tr>
<tr>
<td>type/integer</td>
<td>No additional properties</td>
<td>Additional properties: bareNumber</td>
<td> </td>
<td><a href="https://github.com/frictionlessdata/specs/issues/509">#509</a></td>
</tr>
<tr>
<td>type/boolean</td>
<td>true: [yes, y, true, t, 1],false: [no, n, false, f, 0]</td>
<td>true: [ true, True, TRUE, 1],false: [false, False, FALSE, 0]</td>
<td> </td>
<td><a href="https://github.com/frictionlessdata/specs/issues/415">#415</a></td>
</tr>
<tr>
<td>type/year + yearmonth</td>
<td> </td>
<td>year and yearmonth (NB: these were temporarily gyear and gyearmonth)</td>
<td> </td>
<td><a href="https://github.com/frictionlessdata/specs/issues/346">#346</a></td>
</tr>
<tr>
<td>type/duration</td>
<td> </td>
<td>duration</td>
<td> </td>
<td><a href="https://github.com/frictionlessdata/specs/issues/210">#210</a></td>
</tr>
<tr>
<td>type/rdfType</td>
<td> </td>
<td>rdfType</td>
<td>Support rich “semantic web” types for fields</td>
<td><a href="https://github.com/frictionlessdata/specs/issues/217">#217</a></td>
</tr>
<tr>
<td>type/null</td>
<td> </td>
<td>removed (see missingValues)</td>
<td> </td>
<td><a href="https://github.com/frictionlessdata/specs/issues/262">#262</a></td>
</tr>
<tr>
<td>missingValues</td>
<td> </td>
<td>missingValues</td>
<td>Missing values support did not exist pre v1.</td>
<td><a href="https://github.com/frictionlessdata/specs/issues/97">#97</a></td>
</tr>
</tbody>
</table>
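<p>To make a few of these changes concrete, here is a sketch of a small v1 Table Schema (the field names are illustrative) using the renamed <code class="language-plaintext highlighter-rouge">name</code> property, the new <code class="language-plaintext highlighter-rouge">year</code> type and the new <code class="language-plaintext highlighter-rouge">missingValues</code> support:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "fields": [
    { "name": "country", "type": "string" },
    { "name": "year", "type": "year" },
    { "name": "population", "type": "number" }
  ],
  "missingValues": ["", "N/A"]
}
</code></pre></div></div>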
<h3 id="data-resource">Data Resource</h3>
<p>Link to spec: <a href="https://specs.frictionlessdata.io/data-resource/">https://specs.frictionlessdata.io/data-resource/</a></p>
<p><em>Note: Data Resource did not exist as a separate spec pre-v1 so strictly we are comparing the Data Resource section of the old Data Package spec with the new Data Resource spec.</em></p>
<table class="table table-striped table-bordered" style="display: block; overflow: auto;">
<thead>
<tr>
<th>Property</th>
<th>Pre v1</th>
<th>v1 Spec</th>
<th>Notes</th>
<th>Issue</th>
</tr>
</thead>
<tbody>
<tr>
<td>path</td>
<td>path and url</td>
<td>path only</td>
<td>url merged into path; path can now be a URL or local path</td>
<td><a href="https://github.com/frictionlessdata/specs/issues/250">#250</a></td>
</tr>
<tr>
<td>path</td>
<td>string</td>
<td>string or array</td>
<td>path can be an array to support a single resource split across multiple files</td>
<td><a href="https://github.com/frictionlessdata/specs/issues/228">#228</a></td>
</tr>
<tr>
<td>name</td>
<td>recommended</td>
<td>required</td>
<td>Made name required to enable access to resources by name consistently across tools</td>
<td> </td>
</tr>
<tr>
<td>profile</td>
<td> </td>
<td>recommended</td>
<td>See profiles discussion</td>
<td> </td>
</tr>
<tr>
<td>sources, licenses …</td>
<td> </td>
<td> </td>
<td>Inherited metadata from Data Package, like sources or licenses, upgraded in line with changes in Data Package</td>
<td> </td>
</tr>
</tbody>
</table>
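<p>For example, a v1 Data Resource must have a <code class="language-plaintext highlighter-rouge">name</code> and can use an array <code class="language-plaintext highlighter-rouge">path</code> for a single logical resource split across multiple files; a sketch with illustrative file names:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "name": "monthly-data",
  "profile": "data-resource",
  "path": [
    "data/2017-01.csv",
    "data/2017-02.csv"
  ]
}
</code></pre></div></div>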
<h3 id="tabular-data-resource">Tabular Data Resource</h3>
<p>Link to spec: <a href="https://specs.frictionlessdata.io/tabular-data-resource/">https://specs.frictionlessdata.io/tabular-data-resource/</a></p>
<p>Just as Data Resource was split out from Data Package, so Tabular Data Resource was split out from the old Tabular Data Package spec.</p>
<p>There were no significant changes here beyond those in Data Resource.</p>
<h3 id="data-package">Data Package</h3>
<p>Link to spec: <a href="https://specs.frictionlessdata.io/data-package/">https://specs.frictionlessdata.io/data-package/</a></p>
<table class="table table-striped table-bordered" style="display: block; overflow: auto;">
<thead>
<tr>
<th>Property</th>
<th>Pre v1</th>
<th>v1 Spec</th>
<th>Notes</th>
<th>Issue</th>
</tr>
</thead>
<tbody>
<tr>
<td>name</td>
<td>required</td>
<td>recommended</td>
<td>Unique names are not essential to any part of the present tooling so we have moved to recommended.</td>
<td> </td>
</tr>
<tr>
<td>id</td>
<td> </td>
<td>id property (globally unique)</td>
<td>Globally unique id property</td>
<td><a href="https://github.com/frictionlessdata/specs/issues/228">#228</a></td>
</tr>
<tr>
<td>licenses</td>
<td>license - object or string. The object structure must contain a type property and a url property linking to the actual text</td>
<td>licenses is an array. Each item in the array is a License and must be an object. The object must contain a name property and/or a path property, and may contain a title property.</td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>author</td>
<td>author</td>
<td>author is removed in favour of contributors</td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>contributor</td>
<td>name, email, web properties with name required</td>
<td>title property required; role property values must be one of author, publisher, maintainer, wrangler, and contributor (defaults to contributor)</td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>sources</td>
<td>name, web and email; none required</td>
<td>title, path and email; title is required</td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>resources</td>
<td> </td>
<td>resources array is required</td>
<td> </td>
<td><a href="https://github.com/frictionlessdata/specs/issues/434">#434</a></td>
</tr>
<tr>
<td>dataDependencies</td>
<td>dataDependencies</td>
<td> </td>
<td>Moved to a pattern until we have greater clarity on need.</td>
<td><a href="https://github.com/frictionlessdata/specs/issues/341">#341</a></td>
</tr>
</tbody>
</table>
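<p>Putting the metadata changes together, a minimal v1 descriptor using the new <code class="language-plaintext highlighter-rouge">licenses</code>, <code class="language-plaintext highlighter-rouge">contributors</code> and <code class="language-plaintext highlighter-rouge">sources</code> shapes might look like this (all values are illustrative):</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "name": "example-package",
  "licenses": [
    { "name": "CC0-1.0", "title": "CC0 1.0", "path": "https://creativecommons.org/publicdomain/zero/1.0/" }
  ],
  "contributors": [
    { "title": "Jane Wrangler", "role": "author" }
  ],
  "sources": [
    { "title": "Example source", "path": "https://example.com/data" }
  ],
  "resources": [
    { "name": "data", "path": "data.csv" }
  ]
}
</code></pre></div></div>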
<h3 id="tabular-data-package">Tabular Data Package</h3>
<p>Link to spec: <a href="https://specs.frictionlessdata.io/tabular-data-package/">https://specs.frictionlessdata.io/tabular-data-package/</a></p>
<p>Tabular Data Package is unchanged.</p>
<h3 id="profiles">Profiles</h3>
<p>Profiles arrived in v1:</p>
<p><a href="http://specs.frictionlessdata.io/profiles/">http://specs.frictionlessdata.io/profiles/</a></p>
<p>Profiles are the first step in supporting a rich ecosystem of “micro-schemas” for data. They provide a very simple way to quickly state that your data follows a specific structure and/or schema. From the docs:</p>
<blockquote>
<p>Different kinds of data need different formats for their data and metadata. To support these different data and metadata formats we need to extend and specialise the generic Data Package. These specialized types of Data Package (or Data Resource) are termed profiles.</p>
<p>For example, there is a Tabular Data Package profile that specializes Data Packages specifically for tabular data. And there is a “Fiscal” Data Package profile designed for government financial data that includes requirements that certain columns are present in the data e.g. Amount or Date and that they contain data of certain types.</p>
</blockquote>
<p>We think profiles are an easy, lightweight way to start adding more structure to your data.</p>
<p>Profiles can be specified on both resources and packages.</p>
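<p>Declaring a profile is just a single property on the package and/or on each resource; a minimal sketch:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "profile": "tabular-data-package",
  "resources": [
    {
      "name": "data",
      "path": "data.csv",
      "profile": "tabular-data-resource"
    }
  ]
}
</code></pre></div></div>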
<h2 id="automate-upgrading-your-descriptor-according-to-the-spec-v1">Automate upgrading your descriptor according to the spec v1</h2>
<p>We have created a <a href="https://github.com/datahq/datapackage-normalize-js">data package normalization script</a> that you can use to automate the process of upgrading a <code class="language-plaintext highlighter-rouge">datapackage.json</code> or Table Schema from pre-v1 to v1.</p>
<p>The script enables you to automate updating your <code class="language-plaintext highlighter-rouge">datapackage.json</code> for the following properties: <code class="language-plaintext highlighter-rouge">path</code>, <code class="language-plaintext highlighter-rouge">contributors</code>, <code class="language-plaintext highlighter-rouge">resources</code>, <code class="language-plaintext highlighter-rouge">sources</code> and <code class="language-plaintext highlighter-rouge">licenses</code>.</p>
<p>This is a simple script that you can download directly from here:</p>
<p><a href="https://raw.githubusercontent.com/datahq/datapackage-normalize-js/master/normalize.js">https://raw.githubusercontent.com/datahq/datapackage-normalize-js/master/normalize.js</a></p>
<p>e.g. using wget:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wget https://raw.githubusercontent.com/datahq/datapackage-normalize-js/master/normalize.js
</code></pre></div></div>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># path (optional) is the path to datapackage.json</span>
<span class="c"># if not provided looks in current directory</span>
normalize.js <span class="o">[</span>path]
<span class="c"># prints out updated datapackage.json</span>
</code></pre></div></div>
<p>You can also use as a library:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># install it from npm</span>
npm <span class="nb">install </span>datapackage-normalize
</code></pre></div></div>
<p>so you can use it in your javascript:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="nx">normalize</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">datapackage-normalize</span><span class="dl">'</span><span class="p">)</span>
<span class="kd">const</span> <span class="nx">path</span> <span class="o">=</span> <span class="dl">'</span><span class="s1">path/to/datapackage.json</span><span class="dl">'</span>
<span class="nx">normalize</span><span class="p">(</span><span class="nx">path</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="conclusion">Conclusion</h2>
<p>The above summarizes the main changes in v1 of the Data Package suite of specs, along with instructions on how to upgrade.</p>
<p>If you want to see the specifications in more detail, please visit the <a href="https://specs.frictionlessdata.io/">Data Package specifications</a>. You can also visit the <a href="http://frictionlessdata.io/">Frictionless Data initiative</a> for more information about Data Packages.</p>
<hr />
<p><em>This blog post was originally published on <a href="http://datahub.io/">datahub.io</a> by Meiran Zhiyenbayev. Meiran works for <a href="https://datopian.com/">Datopian</a> who have been developing datahub.io as part of the Frictionless Data initative</em>.</p>
Meiran Zhiyenbayev
Frictionless Data Specs v1 Updates
2017-10-05T00:00:00+00:00
http://okfnlabs.org/blog/2017/10/05/frictionless-data-specs-v1-updates
<p>The Frictionless Data team released the v1 specifications in the first week of September 2017, and Paul Walsh, Chief Product Officer at Open Knowledge International, <a href="https://blog.okfn.org/2017/09/05/frictionless-data-v1-0/">wrote a detailed blogpost about it</a>. With this milestone, in addition to modifications to pre-existing specifications like Table Schema<sup id="fnref:tableschema" role="doc-noteref"><a href="#fn:tableschema" class="footnote" rel="footnote">1</a></sup> and CSV Dialect<sup id="fnref:csvdialect" role="doc-noteref"><a href="#fn:csvdialect" class="footnote" rel="footnote">2</a></sup> in line with our design philosophy<sup id="fnref:philosophy" role="doc-noteref"><a href="#fn:philosophy" class="footnote" rel="footnote">3</a></sup>, the team created two new specifications, Data Resource<sup id="fnref:dr" role="doc-noteref"><a href="#fn:dr" class="footnote" rel="footnote">4</a></sup> and Tabular Data Resource<sup id="fnref:tdr" role="doc-noteref"><a href="#fn:tdr" class="footnote" rel="footnote">5</a></sup>, which employ explicit pattern rules to help describe data resources unambiguously.</p>
<p>Following the September release, the team has now updated our range of frictionless data implementations to work with v1 specs - from <code class="language-plaintext highlighter-rouge">tableschema</code> and <code class="language-plaintext highlighter-rouge">datapackage</code> libraries to <code class="language-plaintext highlighter-rouge">tableschema</code> plugins and the <code class="language-plaintext highlighter-rouge">goodtables.io</code> service.</p>
<p>Some of the highlights from this update include:</p>
<ul>
<li>SQL/BigQuery/Pandas plugins now work with all 15 Table Schema types<sup id="fnref:types" role="doc-noteref"><a href="#fn:types" class="footnote" rel="footnote">6</a></sup> with no data loss,</li>
<li>use Frictionless Data tools <sup id="fnref:tools" role="doc-noteref"><a href="#fn:tools" class="footnote" rel="footnote">7</a></sup> to infer, package and use data from different online sources,</li>
<li>create data packages from a select few tables in your database.</li>
</ul>
<h2 id="table-schema-plugins-update">Table Schema Plugins update</h2>
<p>The <a href="https://github.com/frictionlessdata/tableschema-pandas-py">Pandas</a>, <a href="https://github.com/frictionlessdata/tableschema-sql-py">SQL</a> and <a href="https://github.com/frictionlessdata/tableschema-bigquery-py">BigQuery</a> plugins have now been updated to work with v1 specifications.</p>
<p>Here’s how you can infer arbitrary CSV files from an online source, create a data package with the data and analyze it in a widely used data analysis tool like Pandas or an SQL database:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#pip install datapackage tableschema tableschema_sql tableschema_pandas
</span><span class="kn">from</span> <span class="nn">pprint</span> <span class="kn">import</span> <span class="n">pprint</span>
<span class="kn">from</span> <span class="nn">tableschema</span> <span class="kn">import</span> <span class="n">Storage</span>
<span class="kn">from</span> <span class="nn">datapackage</span> <span class="kn">import</span> <span class="n">Package</span>
<span class="kn">from</span> <span class="nn">sqlalchemy</span> <span class="kn">import</span> <span class="n">create_engine</span>
<span class="c1"># Infer data package from some CSVs in the internet
</span><span class="n">package</span> <span class="o">=</span> <span class="n">Package</span><span class="p">()</span>
<span class="n">package</span><span class="p">.</span><span class="n">add_resource</span><span class="p">({</span><span class="s">'name'</span><span class="p">:</span> <span class="s">'teams'</span><span class="p">,</span> <span class="s">'path'</span><span class="p">:</span> <span class="s">'https://raw.githubusercontent.com/danielfrg/espn-nba-scrapy/master/data/teams.csv'</span><span class="p">})</span>
<span class="n">package</span><span class="p">.</span><span class="n">add_resource</span><span class="p">({</span><span class="s">'name'</span><span class="p">:</span> <span class="s">'games'</span><span class="p">,</span> <span class="s">'path'</span><span class="p">:</span> <span class="s">'https://raw.githubusercontent.com/danielfrg/espn-nba-scrapy/master/data/games.csv'</span><span class="p">})</span>
<span class="n">package</span><span class="p">.</span><span class="n">infer</span><span class="p">()</span>
<span class="n">pprint</span><span class="p">(</span><span class="n">package</span><span class="p">.</span><span class="n">descriptor</span><span class="p">)</span>
<span class="c1"># Check data package integrity
</span><span class="n">package</span><span class="p">.</span><span class="n">descriptor</span><span class="p">[</span><span class="s">'resources'</span><span class="p">][</span><span class="mi">1</span><span class="p">][</span><span class="s">'schema'</span><span class="p">][</span><span class="s">'foreignKeys'</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">{</span><span class="s">'fields'</span><span class="p">:</span> <span class="s">'home_team'</span><span class="p">,</span> <span class="s">'reference'</span><span class="p">:</span> <span class="p">{</span><span class="s">'resource'</span><span class="p">:</span> <span class="s">'teams'</span><span class="p">,</span> <span class="s">'fields'</span><span class="p">:</span> <span class="s">'name'</span><span class="p">}},</span>
<span class="p">{</span><span class="s">'fields'</span><span class="p">:</span> <span class="s">'visit_team'</span><span class="p">,</span> <span class="s">'reference'</span><span class="p">:</span> <span class="p">{</span><span class="s">'resource'</span><span class="p">:</span> <span class="s">'teams'</span><span class="p">,</span> <span class="s">'fields'</span><span class="p">:</span> <span class="s">'name'</span><span class="p">}},</span>
<span class="p">]</span>
<span class="n">package</span><span class="p">.</span><span class="n">commit</span><span class="p">()</span>
<span class="n">package</span><span class="p">.</span><span class="n">get_resource</span><span class="p">(</span><span class="s">'games'</span><span class="p">).</span><span class="n">check_relations</span><span class="p">()</span>
<span class="n">pprint</span><span class="p">(</span><span class="s">'Integrity is checked'</span><span class="p">)</span>
<span class="c1"># Analyze data package in SQL
</span><span class="n">engine</span> <span class="o">=</span> <span class="n">create_engine</span><span class="p">(</span><span class="s">'sqlite:///'</span><span class="p">)</span>
<span class="n">package</span><span class="p">.</span><span class="n">save</span><span class="p">(</span><span class="n">storage</span><span class="o">=</span><span class="s">'sql'</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="n">engine</span><span class="p">)</span>
<span class="n">pprint</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">engine</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="s">"""
SELECT home_team, round(avg(home_team_score), 1) as score
FROM games GROUP BY home_team ORDER BY score DESC
"""</span><span class="p">)))</span>
<span class="c1"># Analyze data package in Pandas
</span><span class="n">storage</span> <span class="o">=</span> <span class="n">Storage</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="s">'pandas'</span><span class="p">)</span>
<span class="n">package</span><span class="p">.</span><span class="n">save</span><span class="p">(</span><span class="n">storage</span><span class="o">=</span><span class="n">storage</span><span class="p">)</span>
<span class="n">pprint</span><span class="p">(</span><span class="n">storage</span><span class="p">[</span><span class="s">'games'</span><span class="p">].</span><span class="n">loc</span><span class="p">[</span><span class="n">storage</span><span class="p">[</span><span class="s">'games'</span><span class="p">][</span><span class="s">'home_team_score'</span><span class="p">].</span><span class="n">idxmax</span><span class="p">()])</span>
</code></pre></div></div>
<h2 id="data-package-storage-api-update">Data Package Storage API update</h2>
<p>We are working to make the Data Package specification<sup id="fnref:datapackage" role="doc-noteref"><a href="#fn:datapackage" class="footnote" rel="footnote">8</a></sup> the go-to metadata format for moving datasets from one persistent storage system to another. The Storage API (example below) now allows you to move data between Pandas, SQL, and BigQuery.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># pip install datapackage tableschema tableschema_sql tableschema_pandas tableschema_bigquery
</span><span class="kn">import</span> <span class="nn">io</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">json</span>
<span class="kn">from</span> <span class="nn">pprint</span> <span class="kn">import</span> <span class="n">pprint</span>
<span class="kn">from</span> <span class="nn">tableschema</span> <span class="kn">import</span> <span class="n">Storage</span>
<span class="kn">from</span> <span class="nn">datapackage</span> <span class="kn">import</span> <span class="n">Package</span>
<span class="kn">from</span> <span class="nn">sqlalchemy</span> <span class="kn">import</span> <span class="n">create_engine</span>
<span class="kn">from</span> <span class="nn">apiclient.discovery</span> <span class="kn">import</span> <span class="n">build</span>
<span class="kn">from</span> <span class="nn">oauth2client.client</span> <span class="kn">import</span> <span class="n">GoogleCredentials</span>
<span class="n">engine</span> <span class="o">=</span> <span class="n">create_engine</span><span class="p">(</span><span class="s">'sqlite:///'</span><span class="p">)</span> <span class="c1"># use your persistent database
</span>
<span class="c1"># From BigQuery to SQL
</span><span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">'GOOGLE_APPLICATION_CREDENTIALS'</span><span class="p">]</span> <span class="o">=</span> <span class="s">'.credentials.json'</span>
<span class="n">credentials</span> <span class="o">=</span> <span class="n">GoogleCredentials</span><span class="p">.</span><span class="n">get_application_default</span><span class="p">()</span>
<span class="n">service</span> <span class="o">=</span> <span class="n">build</span><span class="p">(</span><span class="s">'bigquery'</span><span class="p">,</span> <span class="s">'v2'</span><span class="p">,</span> <span class="n">credentials</span><span class="o">=</span><span class="n">credentials</span><span class="p">)</span>
<span class="n">package</span> <span class="o">=</span> <span class="n">Package</span><span class="p">(</span><span class="n">storage</span><span class="o">=</span><span class="s">'bigquery'</span><span class="p">,</span> <span class="n">service</span><span class="o">=</span><span class="n">service</span><span class="p">,</span> <span class="n">project</span><span class="o">=</span><span class="s">'bigquery-public-data'</span><span class="p">,</span> <span class="n">dataset</span><span class="o">=</span><span class="s">'usa_names'</span><span class="p">)</span>
<span class="n">package</span><span class="p">.</span><span class="n">save</span><span class="p">(</span><span class="n">storage</span><span class="o">=</span><span class="s">'sql'</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="n">engine</span><span class="p">)</span>
<span class="c1"># From SQL to Pandas
</span><span class="n">storage</span> <span class="o">=</span> <span class="n">Storage</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="s">'pandas'</span><span class="p">)</span>
<span class="n">package</span> <span class="o">=</span> <span class="n">Package</span><span class="p">(</span><span class="n">storage</span><span class="o">=</span><span class="s">'sql'</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="n">engine</span><span class="p">)</span>
<span class="n">package</span><span class="p">.</span><span class="n">save</span><span class="p">(</span><span class="n">storage</span><span class="o">=</span><span class="n">storage</span><span class="p">)</span>
<span class="n">pprint</span><span class="p">(</span><span class="n">storage</span><span class="p">[</span><span class="s">'usa_1910_current'</span><span class="p">].</span><span class="n">head</span><span class="p">())</span>
</code></pre></div></div>
<p>For more examples and ideas on how to use the Storage API in your data wrangling and publishing workflow, take a look at the <code class="language-plaintext highlighter-rouge">datapackage-py</code> documentation<sup id="fnref:datapackagepy" role="doc-noteref"><a href="#fn:datapackagepy" class="footnote" rel="footnote">9</a></sup>.
We welcome community contributions to allow for more integrations. Interested in contributing? <a href="https://github.com/frictionlessdata/tableschema-py/blob/master/README.md#storage">Start here</a>.</p>
<h2 id="use-table-schemas-data-types-with-no-data-loss">Use Table Schema’s Data Types with no data loss</h2>
<p>With the new update, it is now possible to store and retain all of your data even when your storage backend supports only a subset of Table Schema’s data types. For example, SQLite doesn’t support a JSON data type:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">datapackage</span> <span class="kn">import</span> <span class="n">Package</span>
<span class="kn">from</span> <span class="nn">tableschema</span> <span class="kn">import</span> <span class="n">Table</span><span class="p">,</span> <span class="n">Storage</span>
<span class="kn">from</span> <span class="nn">sqlalchemy</span> <span class="kn">import</span> <span class="n">create_engine</span>
<span class="n">engine</span><span class="o">=</span><span class="n">create_engine</span><span class="p">(</span><span class="s">'sqlite:///'</span><span class="p">)</span>
<span class="c1"># Resource
</span><span class="n">data</span> <span class="o">=</span> <span class="p">[[{</span><span class="s">'key'</span><span class="p">:</span> <span class="s">'value'</span><span class="p">}]]</span>
<span class="n">schema</span> <span class="o">=</span> <span class="p">{</span><span class="s">'fields'</span><span class="p">:</span> <span class="p">[{</span><span class="s">'name'</span><span class="p">:</span> <span class="s">'object'</span><span class="p">,</span> <span class="s">'type'</span><span class="p">:</span> <span class="s">'object'</span><span class="p">}]}</span>
<span class="c1"># Save
</span><span class="n">storage</span> <span class="o">=</span> <span class="n">Storage</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="s">'sql'</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="n">engine</span><span class="p">)</span>
<span class="n">table</span> <span class="o">=</span> <span class="n">Table</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">schema</span><span class="o">=</span><span class="n">schema</span><span class="p">)</span>
<span class="n">table</span><span class="p">.</span><span class="n">save</span><span class="p">(</span><span class="s">'objects'</span><span class="p">,</span> <span class="n">storage</span><span class="o">=</span><span class="n">storage</span><span class="p">)</span>
<span class="c1"># Load
</span><span class="n">table</span> <span class="o">=</span> <span class="n">Table</span><span class="p">(</span><span class="s">'objects'</span><span class="p">,</span> <span class="n">schema</span><span class="o">=</span><span class="n">schema</span><span class="p">,</span> <span class="n">storage</span><span class="o">=</span><span class="n">storage</span><span class="p">)</span>
<span class="n">table</span><span class="p">.</span><span class="n">read</span><span class="p">()</span>
<span class="c1"># [[{'key': 'value'}]] - objects inside as we'd like
</span></code></pre></div></div>
<h2 id="create-datapackages-from-a-few-tables-in-your-db">Create datapackages from a few tables in your DB</h2>
<p>You can now create a data package from a select few SQL, BigQuery, or Pandas tables in your database, instead of loading every table.
Example:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">package</span> <span class="o">=</span> <span class="n">Package</span><span class="p">({</span><span class="s">'resources'</span><span class="p">:</span> <span class="p">[{</span><span class="s">'path'</span><span class="p">:</span> <span class="s">'table1'</span><span class="p">},</span> <span class="p">{</span><span class="s">'path'</span><span class="p">:</span> <span class="s">'table3'</span><span class="p">}]},</span> <span class="n">storage</span><span class="o">=</span><span class="s">'sql'</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="n">engine</span><span class="p">)</span>
<span class="n">package</span><span class="p">.</span><span class="n">resource_names</span> <span class="c1"># ['table1', 'table3']
</span><span class="n">package</span><span class="p">.</span><span class="n">infer</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">package</span><span class="p">.</span><span class="n">descriptor</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="goodtables-web-service-works-with-v1-specs">Goodtables web service works with v1 specs</h2>
<p>Our Goodtables web service is now updated to work with v1 specifications<sup id="fnref:specs" role="doc-noteref"><a href="#fn:specs" class="footnote" rel="footnote">10</a></sup>. This tool allows you to set up a continuous data validation workflow to ensure that published data is always valid. <a href="https://try.goodtables.io">try.goodtables.io</a> offers one-time validation of arbitrary tabular files against structure and schema checks and is perfect for demo or trial purposes.</p>
<h2 id="next-steps">Next steps</h2>
<ul>
<li>We are looking to write more in-depth documentation and guides for the Frictionless Data specs and tools as we update our codebase<sup id="fnref:github" role="doc-noteref"><a href="#fn:github" class="footnote" rel="footnote">11</a></sup>.</li>
<li>We are also looking to extend the number of our Storage API implementations. In addition to the SQL/BigQuery/Pandas implementations, we are working on SPSS<sup id="fnref:spss" role="doc-noteref"><a href="#fn:spss" class="footnote" rel="footnote">12</a></sup> and Elasticsearch<sup id="fnref:elasticsearch" role="doc-noteref"><a href="#fn:elasticsearch" class="footnote" rel="footnote">13</a></sup> plugins. Contributors play a very important role in this work. Feel free to write your own <code class="language-plaintext highlighter-rouge">tableschema</code> plugin - it’s fun and a relatively simple task! A minimal plugin skeleton is sketched after this list.</li>
</ul>
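<p>To make the plugin contract concrete, here is a minimal sketch of a <code class="language-plaintext highlighter-rouge">tableschema</code> storage plugin. The method names follow the Storage interface documented in <code class="language-plaintext highlighter-rouge">tableschema-py</code>; the in-memory backend itself is purely illustrative, and a real plugin would talk to an actual storage system.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative sketch only: an in-memory "backend" implementing the
# tableschema Storage interface (buckets/create/delete/describe/iter/read/write).
from tableschema import Storage


class InMemoryStorage(Storage):

    def __init__(self, **options):
        self.__descriptors = {}  # bucket name -> Table Schema descriptor
        self.__rows = {}         # bucket name -> list of rows

    @property
    def buckets(self):
        return list(self.__descriptors)

    def create(self, bucket, descriptor):
        self.__descriptors[bucket] = descriptor
        self.__rows[bucket] = []

    def delete(self, bucket=None):
        self.__descriptors.pop(bucket, None)
        self.__rows.pop(bucket, None)

    def describe(self, bucket, descriptor=None):
        return self.__descriptors[bucket]

    def iter(self, bucket):
        return iter(self.__rows[bucket])

    def read(self, bucket):
        return list(self.iter(bucket))

    def write(self, bucket, rows):
        self.__rows[bucket].extend(rows)
</code></pre></div></div>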
<p>We welcome community contributions to our codebase, and are keen to interact with you on <a href="http://gitter.im/frictionlessdata/chat">Frictionless Data Gitter chat</a>.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:tableschema" role="doc-endnote">
<p>Table Schema: <a href="http://specs.frictionlessdata.io/table-schema/">http://specs.frictionlessdata.io/table-schema/</a> <a href="#fnref:tableschema" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:csvdialect" role="doc-endnote">
<p>CSV Dialect: <a href="http://specs.frictionlessdata.io/csv-dialect/">http://specs.frictionlessdata.io/csv-dialect/</a> <a href="#fnref:csvdialect" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:philosophy" role="doc-endnote">
<p>Frictionless Data Design Philosophy: <a href="http://specs.frictionlessdata.io/#design-philosophy">http://specs.frictionlessdata.io/#design-philosophy</a> <a href="#fnref:philosophy" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:dr" role="doc-endnote">
<p>Data Resource: <a href="http://specs.frictionlessdata.io/data-resource/">http://specs.frictionlessdata.io/data-resource/</a> <a href="#fnref:dr" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:tdr" role="doc-endnote">
<p>Tabular Data Resource: <a href="http://specs.frictionlessdata.io/tabular-data-resource/">http://specs.frictionlessdata.io/tabular-data-resource/</a> <a href="#fnref:tdr" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:types" role="doc-endnote">
<p>Table Schema Types: <a href="http://specs.frictionlessdata.io/table-schema/#types-and-formats">http://specs.frictionlessdata.io/table-schema/#types-and-formats</a> <a href="#fnref:types" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:tools" role="doc-endnote">
<p>Frictionless Data Tools: <a href="http://frictionlessdata.io/software/">http://frictionlessdata.io/software/</a> <a href="#fnref:tools" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:datapackage" role="doc-endnote">
<p>Data Package: <a href="http://specs.frictionlessdata.io/data-package/">http://specs.frictionlessdata.io/data-package/</a> <a href="#fnref:datapackage" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:datapackagepy" role="doc-endnote">
<p>Data Package Python Library: <a href="https://github.com/frictionlessdata/datapackage-py">https://github.com/frictionlessdata/datapackage-py</a> <a href="#fnref:datapackagepy" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:specs" role="doc-endnote">
<p>Frictionless Data Specifications: <a href="http://specs.frictionlessdata.io/">http://specs.frictionlessdata.io/</a> <a href="#fnref:specs" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:github" role="doc-endnote">
<p>Frictionless Data on GitHub: <a href="http://github.com/frictionlessdata">http://github.com/frictionlessdata</a> <a href="#fnref:github" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:spss" role="doc-endnote">
<p>Frictionless Data SPSS Plugin: <a href="https://github.com/frictionlessdata/tableschema-spss-py">https://github.com/frictionlessdata/tableschema-spss-py</a> <a href="#fnref:spss" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:elasticsearch" role="doc-endnote">
<p>Frictionless Data ElasticSearch Plugin: <a href="https://github.com/frictionlessdata/tableschema-elasticsearch-py">https://github.com/frictionlessdata/tableschema-elasticsearch-py</a> <a href="#fnref:elasticsearch" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Serah Rono
Measure for Measure
2017-07-13T00:00:00+00:00
http://okfnlabs.org/blog/2017/07/13/measure-for-measure
<p><em>In his Open Knowledge International Tech Talk, Developer Brook Elgie
describes how we are using Data Package Pipelines and Redash to gain
insight into our organization in a declarative, reproducible, and easy
to modify way.</em></p>
<p>This post briefly introduces a newly launched internal project at
<a href="https://okfn.org/">Open Knowledge International</a> called <a href="https://github.com/okfn/measure">Measure</a>, its
history, motivation, and the tech that drives it. To learn more,
watch the embedded video demonstration by developer
<a href="https://twitter.com/brew">Brook Elgie</a> and check out the
<a href="https://github.com/okfn/measure">code</a>.</p>
<h2 id="what-is-measure">What is Measure?</h2>
<p><a href="https://github.com/okfn/measure">Measure</a> is a system that allows us to collect and analyze
metrics from various internal sources and external platforms through a
combination of easy-to-write YAML docs and a user-friendly interface.
These include the number of views on our main website, downloads of
our libraries from <a href="https://pypi.python.org/pypi">PyPI</a>, retweets on Twitter, and form-based
records of project outputs (e.g. recent talks we’ve given). Like many
organizations, we rely heavily on hosted platforms to execute on our
mission, each of which has its own interface to useful data. This can
make it harder to correlate events (e.g. how many downloads did this
software package have after this blog post?) and yield insight across
platforms. It’s critical to harmonize access to this data not only
for us to learn how to be more effective, but also to demonstrate to
external funders the impact of our work advancing the cause of
openness. It’s also important for this data to be accessible to
everyone at the organization, regardless of their technical skill.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/NVuJq_WseJQ?list=PLOGV29UsPM6hTC5Nvd2ySyI_5q-C1_i1S" frameborder="0" allowfullscreen=""></iframe>
<p><em>Brook Elgie describes Measure in an Open Knowledge International Tech Talk</em></p>
<h3 id="how-does-it-work">How Does it Work?</h3>
<p>Measure relies on several technologies we are developing here at Open
Knowledge International around our <a href="http://frictionlessdata.io/">Frictionless Data</a> project.
Each of our projects has a <a href="https://github.com/okfn/measure#project-configuration">source specification file</a>
defined in YAML and split into themes. For example, <code class="language-plaintext highlighter-rouge">social-media</code> is
a theme for data sources such as Twitter and Facebook, while
<code class="language-plaintext highlighter-rouge">code-packaging</code> is a theme for PyPI and other software repositories
we upload to. Each theme has a <a href="https://github.com/frictionlessdata/datapackage-pipelines#pipelines">pipeline</a> which is
composed of <a href="https://github.com/frictionlessdata/datapackage-pipelines#custom-processors">processors</a> which do the actual work of
fetching data and transforming the <a href="http://specs.frictionlessdata.io/data-package/">Data Package</a> (a collection of
data and descriptive metadata) and its resources. Data is moved
through the thematic pipeline using <a href="https://github.com/frictionlessdata/datapackage-pipelines">Data Package Pipelines</a> and
a handful of other tools in the Frictionless Data project. The final
processor writes the processed resources to the Measure database,
which is used as the data source for our visualisation tool,
<a href="https://redash.io/">Redash</a>. Each pipeline is configured to run once a day. You
can read more about Data Package Pipelines and how it enables this
process in its <a href="/blog/2017/02/27/datapackage-pipelines.html">introductory blog post</a>.</p>
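<p>As a purely hypothetical illustration (the theme names come from this post, but the layout below is invented and is not Measure’s actual configuration format), a project’s source specification might look something like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical layout, for illustration only
project: frictionlessdata
themes:
  social-media:
    twitter:
      entities:
        - '#frictionlessdata'
  code-packaging:
    pypi:
      packages:
        - datapackage
        - goodtables
</code></pre></div></div>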
<p>By consolidating our metrics into a single database and surfacing them
through Redash, it’s easy to create and share visualisations across
one or more data sources, create dashboards of project and
organization health, and make truly data-driven decisions with minimal
friction.</p>
<h2 id="tech-talks">Tech Talks</h2>
<p>If you enjoyed this, you can see similar content on our
<a href="https://www.youtube.com/playlist?list=PLOGV29UsPM6hTC5Nvd2ySyI_5q-C1_i1S">Open Knowledge International Tech Talks YouTube Playlist</a>.</p>
Dan Fowler
DAC and CRS code lists – Now available as Frictionless Data!
2017-07-10T00:00:00+00:00
http://okfnlabs.org/blog/2017/07/10/dac-and-crs-code-lists-frictionless-data
<p><em>This blog was originally posted on the <a href="http://www.publishwhatyoufund.org/maintained-machine-readable-dac-crs-code-lists-de-rien/">Publish What You Fund</a> website.</em></p>
<p>Maintained, machine readable versions of <a href="http://data.okfn.org/data/core/dac-and-crs-code-lists">the DAC and CRS code lists are now available as CSV and JSON!</a> Here’s how <a href="http://www.publishwhatyoufund.org/">Publish What You Fund</a> and <a href="https://okfn.org/">Open Knowledge</a> made it happen…</p>
<p><img src="https://d26dzxoao6i3hh.cloudfront.net/items/1u361E1a3W1U3H2U3v0A/android.png" alt="DAC CRS Bot" /></p>
<p>The <a href="http://www.oecd.org">OECD</a>’s Development Assistance Committee (<a href="http://www.oecd.org/dac/">DAC</a>) maintains a set of code lists used by donors to report on their aid flows. These are used as part of donors’ DAC reporting, but also in their <a href="https://iatiregistry.org/">IATI publications</a>. Not only that, but since some of the codes e.g. for aid classification, are so widely used, they are also useful to recipient country governments to <a href="http://aidonbudget.org/">map aid activities to their own budgets</a>. So they’re super important!</p>
<h2 id="keeping-in-sync">Keeping in sync</h2>
<p>Now, these code lists are <a href="http://www.oecd.org/dac/stats/dacandcrscodelists.htm">available on the OECD website</a> as a non-machine-readable XLS file. There’s also an XML version, but it was last updated 18 months ago, and as such it differs significantly from the standard, canonical XLS version on the OECD website.</p>
<p>Because of this lack of a machine-readable version, <a href="https://github.com/IATI/IATI-Codelists-NonEmbedded/tree/master/xml">IATI maintains its own replicated versions of these code lists</a>. These replicated versions are used by <a href="http://d-portal.org/">d-portal</a>, the <a href="http://dashboard.iatistandard.org/">IATI Dashboard</a> and others. However, due to the overheads involved in maintaining them, these too have fallen out of sync with the source file.</p>
<p>There has been a-rumbling (and some grumbling!) within the IATI community about <a href="https://discuss.iatistandard.org/t/planning-for-machine-readable-version-controlled-oecd-dac-codelists/866/8">getting the DAC to produce a machine-readable version</a> of these code lists. This idea has long been in the offing, and we at Publish What You Fund would very much welcome such a development.</p>
<p>In the meantime, though, we have taken matters into our own hands. Together with <a href="https://okfn.org/">Open Knowledge</a>, we’ve published <a href="http://data.okfn.org/data/core/dac-and-crs-code-lists">a frictionless data package of the DAC code lists</a> – with data available in machine-readable CSV and JSON formats. This is published as an <a href="http://data.okfn.org/roadmap/core-datasets">Open Knowledge Core Dataset</a> – a group of <strong>important</strong> and <strong>commonly-used</strong> datasets in <strong>high quality, easy-to-use and open</strong> form.</p>
<h2 id="but-how-does-it-work-the-science-bit">But how does it work? The science bit!</h2>
<p>The data is <a href="https://github.com/datasets/dac-crs-codes/tree/master/data">stored on github</a>, and maintained by a scraper that runs nightly on <a href="https://morph.io/">morph.io</a> (created by the wonderful <a href="https://www.openaustraliafoundation.org.au/">Open Australia Foundation</a>). When a change to the data is detected, a pull request is sent by <a href="https://github.com/dac-crs-bot">DAC CRS Bot</a>, and reviewed by a (human) maintainer. Via github, <a href="https://github.com/datasets/dac-crs-codes/commits/master/data">we maintain a version history of changes to the data</a>, so it’s possible to tell what changed and when.</p>
<p>The next logical step would be for IATI to <a href="https://github.com/IATI/IATI-Codelists-NonEmbedded/pull/51">use this data to maintain their replicated lists</a> as a routine maintenance task. We’ve already tested this as a proof of concept one-off task, to <a href="https://github.com/IATI/IATI-Codelists-NonEmbedded/pull/153">bring all the relevant replicated IATI code lists up-to-date</a>, including adding all French translations. De rien!</p>
Andy Lulham
Introducing the new goodtables library and goodtables.io
2017-05-22T00:00:00+00:00
http://okfnlabs.org/blog/2017/05/22/introducing-the-new-goodtables-library-and-goodtablesio
<p>Information is everywhere. There is so much we need to know at any given time, and only limited capacity and time to internalize it all. True art, therefore, lies in the ability to draw summaries adequate to save time and impart knowledge. Since the 1880s, tabulation has been our go-to method for compacting information, not only to preserve it, but also to analyze it and draw meaningful conclusions from it.</p>
<p>Tables, composed of rows and columns of related data, are not always easy to analyze, especially when there are thousands of rows of data. Mixed data types, missing data, or ill-suited data in tables are but a few reasons why tabular data is often a nightmare to work with in its raw state, often referred to as “dirty” data.</p>
<p>Enter <strong>goodtables</strong>.</p>
<p><a href="https://github.com/frictionlessdata/goodtables.io"><img src="/img/posts/goodtables-python-library.png" alt="goodtables python library" /></a></p>
<h2 id="the-goodtables-library">The goodtables library</h2>
<p><a href="https://github.com/frictionlessdata/goodtables-py/">goodtables</a> is a Python library that allows users to inspect tabular data, checking it for both structural and schematic errors, and giving pointers on plausible error fixes, before users draw analyses on the data using other tools. At its most basic level, goodtables highlights general errors in tabular files that would otherwise prevent loading or parsing.</p>
<p>Since <a href="/blog/2015/02/20/introducing-goodtables.html">the release of goodtables v0.7 in early 2015</a>, the codebase has evolved, allowing for additional use cases while working with tabular data. Without cutting back on functionality, goodtables v1 has been simplified, and the focus is now on extensible data validation.</p>
<h2 id="using-goodtables">Using goodtables</h2>
<p>goodtables is still in alpha, so we need to pass the pre-release flag (<code class="language-plaintext highlighter-rouge">--pre</code>) to <code class="language-plaintext highlighter-rouge">pip</code> to install. With that, installation of goodtables v1 is as easy as <code class="language-plaintext highlighter-rouge">pip install goodtables --pre</code>.</p>
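<p>Once installed, the library can also be used directly from Python. The sketch below assumes the <code class="language-plaintext highlighter-rouge">validate</code> entry point exposed by goodtables v1, with report keys following the data quality spec; exact names may differ between alpha releases.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A sketch of programmatic validation (assumes goodtables v1's `validate`
# entry point; report keys follow the data quality spec).
from pprint import pprint

from goodtables import validate

report = validate('valid.csv')  # roughly equivalent to: goodtables --json table valid.csv
print(report['valid'])          # overall validity flag
print(report['error-count'])    # total number of errors found
pprint(report['tables'])        # per-table details, including any errors
</code></pre></div></div>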
<p>The goodtables v1 CLI supports two presets by default: <strong><em>table</em></strong> and <strong><em>datapackage</em></strong>. The <code class="language-plaintext highlighter-rouge">table</code> preset allows you to inspect a single tabular file.</p>
<p><em>Example:</em></p>
<p><code class="language-plaintext highlighter-rouge">goodtables --json table valid.csv</code> returns a JSON report for the file specifying the error count, source, and validity of the data file among other things.</p>
<p>The <code class="language-plaintext highlighter-rouge">datapackage</code> preset allows you to run checks on datasets aggregated in one container. <a href="http://specs.frictionlessdata.io/data-package/">Data Packages</a> are a format for coalescing data in one ‘container’ before shipping it for use by different people and with different tools.</p>
<p><em>Example:</em></p>
<p><code class="language-plaintext highlighter-rouge">goodtables datapackage datapackage.json</code> allows a user to check a data package’s schema, table by table, and gives a detailed report on errors, row count, headers, and validity or lack thereof of a data package.</p>
<p>You can try out these commands on your own data or you can use datasets <a href="https://github.com/frictionlessdata/goodtables-py/tree/master/data">from this folder</a>.</p>
<h2 id="customization">Customization</h2>
<p>In addition to general structure and schema checks on tabular files available in v0.7, the goodtables library now allows users to define custom (data source) presets and run custom checks on tabular files. So what is the difference?</p>
<p>While basic schema checks inspect data against <a href="https://github.com/frictionlessdata/data-quality-spec">the data quality spec</a>, <code class="language-plaintext highlighter-rouge">custom_check</code> gives developers leeway to specify acceptable values for data fields, so that any values outside of the defined rules are flagged as errors.</p>
<p><code class="language-plaintext highlighter-rouge">custom_preset</code> allows users to define custom interfaces to their data storage platform of choice. Presets tell goodtables where a dataset is held, whether it is hosted on CKAN, Dropbox, or Google Drive.</p>
<p>Any presets outside of the built-in ones above are made possible and registered through a provisional API.</p>
<p><em>Examples:</em></p>
<ul>
<li>
<p><strong><em>CKAN custom preset</em></strong>:
<a href="http://ckan.org">CKAN</a> is the world’s leading open data platform developed by Open Knowledge Foundation to help streamline the publishing, sharing, finding and using of data.
<a href="https://github.com/frictionlessdata/goodtables-py/blob/master/examples/ckan.py">Here’s a custom preset</a> that, for example, could help the user run an inspection on datasets from <a href="http://data.surrey.ca">Surrey’s Data Portal</a> which utilizes CKAN.</p>
</li>
<li>
<p><strong><em>Dropbox custom preset</em></strong>:
Dropbox is one of the most popular file storage and collaboration cloud services in use. It ships with an API that makes it possible for third-party apps to read files stored on Dropbox as long as a user’s access token is specified. Here’s our <a href="https://github.com/frictionlessdata/goodtables-py/blob/master/examples/dropbox.py">goodtables custom preset for Dropbox</a>. Remember to generate an access token by first <a href="https://www.dropbox.com/developers/apps">creating a Dropbox app with full permissions</a>.</p>
</li>
<li>
<p><strong><em>Google Sheets custom preset</em></strong>:
The Google Sheets parser to enable custom preset definition is currently in development. At present, for any data file stored in Google Drive and published on the web, the command <code class="language-plaintext highlighter-rouge">goodtables table google_drive_file_url</code> inspects your dataset and checks for validity, or lack thereof.</p>
</li>
</ul>
<h2 id="validating-multiple-tables">Validating multiple tables</h2>
<p>goodtables also allows users to carry out parallel validation for multi-table datasets. The <strong><em>datapackage</em></strong> preset makes this possible.</p>
<p><em>Example:</em></p>
<p><a href="http://frictionlessdata.io">Frictionless Data</a> is a core Open Knowledge Foundation project and all goodtables work falls under its umbrella. One of the pilots working with Frictionless Data is <a href="https://github.com/frictionlessdata/pilot-dm4t">DM4T</a>, with an aim to understand the extent to which Data Package concepts can be applied in the energy sector. DM4T pilot’s issue tracker <a href="https://github.com/frictionlessdata/pilot-dm4t">lives here</a> and its <a href="https://s3-eu-west-1.amazonaws.com/frictionlessdata.io/pilots/pilot-dm4t/datapackage.json">Data Package</a> comprises of <a href="http://data.okfn.org/tools/view?url=https%3A%2F%2Fs3-eu-west-1.amazonaws.com%2Ffrictionlessdata.io%2Fpilots%2Fpilot-dm4t%2Fdatapackage.json">20 CSV files</a> and is approximately 6.7 GB in size.</p>
<p>To inspect DM4T’s energy consumption data collected from 20 households in the UK, run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>goodtables --table-limit 20 datapackage https://s3-eu-west-1.amazonaws.com/frictionlessdata.io/pilots/pilot-dm4t/datapackage.json
</code></pre></div></div>
<p>In the command above, the <code class="language-plaintext highlighter-rouge">--table-limit</code> option allows you to check all 20 tables, since by default goodtables only runs checks on the first ten tables. You can find plenty of sample Data Packages for use with goodtables <a href="https://github.com/datasets/">in this repository</a>.</p>
<p>So why use GitHub for storage of data files? At Open Knowledge Foundation, we <a href="http://blog.okfn.org/2013/07/02/git-and-github-for-data/">highly recommend</a> and <a href="http://blog.okfn.org/2016/11/29/git-for-data-analysis-why-version-control-is-essential-collaboration-public-trust/">work with others to</a> use GitHub repositories for dataset storage.</p>
<p><strong>PRO TIP:</strong>
In working with datasets hosted on GitHub, say <a href="https://github.com/datasets/country-codes">the country codes Data Package</a>, users should use the raw file URL with goodtables, since support for GitHub URL resolution is still in development.</p>
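<p>For example, something like the following (the raw URL assumes the conventional <code class="language-plaintext highlighter-rouge">data/</code> layout of that repository):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>goodtables table https://raw.githubusercontent.com/datasets/country-codes/master/data/country-codes.csv
</code></pre></div></div>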
<h2 id="standards-and-other-enhancements">Standards and other enhancements</h2>
<p>goodtables v1 also works with our proposed <a href="https://github.com/frictionlessdata/data-quality-spec">data quality specification standard</a>, which defines <a href="https://github.com/frictionlessdata/goodtables-py/blob/master/goodtables/spec.json">an extensive list</a> of standard tabular data errors. Other enhancements from goodtables v0.7 include:</p>
<ol>
<li>Breaking out <a href="http://github.com/frictionlessdata/tabulator">tabulator</a> into its own library. As part of the Frictionless Data framework, <strong><em>tabulator</em></strong> is a Python library that provides a consistent interface for stream reading and writing tabular data, whatever the format (CSV, XML, etc.). The library is installable via pip: <code class="language-plaintext highlighter-rouge">pip install tabulator</code>.</li>
<li>Close to 100% support for <a href="http://specs.frictionlessdata.io/table-schema/">Table Schema</a> due to lots of work on the underlying <a href="https://github.com/frictionlessdata/jsontableschema-py">Python library</a>. The Table Schema Python library allows users to validate dataset schema and, given headers and data, infer a schema as a python dictionary based on its initial values.</li>
<li>Better CSV parsing, better HTML detection, and fewer false positives.</li>
</ol>
<h2 id="goodtablesio">goodtables.io</h2>
<p><a href="https://github.com/frictionlessdata/goodtables.io"><img src="/img/posts/goodtablesio.jpg" alt="goodtablesio" /></a></p>
<p>Moving forward, at Open Knowledge Foundation we want to streamline the process of data validation and ensure seamless integration is possible in different publishing workflows. To do so, <a href="https://discuss.okfn.org/t/launching-goodtables-io-tell-us-what-you-think/5165">we are launching a hosted continuous data validation service</a> that builds on top of this suite of Frictionless Data libraries. <a href="http://goodtables.io">goodtables.io</a> will provide support for different backends. At this time, users can use it to check any datasets hosted on GitHub and Amazon S3 buckets, automatically running validation against data files every time they are updated, and providing a user-friendly report of any issues found.</p>
<p>Try it here: <a href="http://goodtables.io">goodtables.io</a></p>
<p>This kind of continuous feedback allows data publishers to release better, higher quality data and helps ensure that this quality is maintained over time, even if different people publish the data.</p>
<p>Using <a href="http://goodtables.io/github/frictionlessdata/example-goodtables.io">this dataset on Github</a>, here’s sample output from data validation run on goodtables.io:</p>
<p><a href="http://goodtables.io/github/amercader/car-fuel-and-emissions"><img src="/img/posts/goodtablesio-validation.png" alt="illustrating data validation on goodtables.io" /></a></p>
<p>Updates on the files in the dataset will trigger a validation check on goodtables.io. As with other projects at Open Knowledge International, <a href="https://github.com/frictionlessdata/goodtables.io">goodtables.io code is open source</a> and contributions are welcome. We hope to build functionality to support additional data storage platforms in the coming months, please let us know which ones to consider in our <a href="https://gitter.im/frictionlessdata/chat">Gitter chat</a> or on the <a href="https://discuss.okfn.org/c/frictionless-data">Frictionless Data forum</a>.</p>
Serah Rono
Data Package Pipelines
2017-02-27T00:00:00+00:00
http://okfnlabs.org/blog/2017/02/27/datapackage-pipelines
<p><em><a href="https://github.com/frictionlessdata/datapackage-pipelines">datapackage-pipelines</a> is the newest part of the
<a href="http://frictionlessdata.io/">Frictionless Data</a> toolchain. Originally developed through work
on OpenSpending, it is a framework for defining data processing steps
to generate self-describing Data Packages.</em></p>
<p><a href="http://next.openspending.org/">OpenSpending</a> is an open database for uploading fiscal data for
countries or municipalities to better understand how governments spend
public money. In this project, we’re often presented with requests to
upload large amounts of potentially messy budget data, often CSV or
Excel files, to the platform. We looked for existing ETL (extract,
transform, load) solutions for extracting data from these different
sources, transforming them into a format that OpenSpending supports
(the <a href="http://specs.frictionlessdata.io/fiscal-data-package">Open Fiscal Data Package</a>) and loading them into the
platform. A few powerful solutions already exist, but none suited
our needs. Most were optimised for a use case in which you have a few
different data sources, on which a large dependency graph can be built
out of complex processing nodes. The OpenSpending use case is
radically different. Not only do we have <em>many</em> data sources, but our
processing flows are <em>independent</em> (i.e. not an intricate dependency
graph) and mostly quite <em>similar</em> (i.e. built from the same building
blocks).</p>
<p><img src="/img/posts/dpp-openspending.png" alt="OpenSpending image" /></p>
<p>We also found that typical ETL solutions were intended to be used by
data scientists and developers with processing pipelines defined in
code. While this is very convenient for coders, it is less so for the
kind of non-techies (e.g. government officials) we want to use the
platform. Writing processing nodes in code gives developers a lot of
flexibility but also provides very few assurances about the
computational resources the code will use. This creates problems when
having to make decisions regarding deployment or concurrency.</p>
<h2 id="pipelines-for-data-packages">Pipelines for Data Packages</h2>
<p>Based on these observations, we implemented a new ETL library,
<a href="https://github.com/frictionlessdata/datapackage-pipelines">datapackage-pipelines</a>, with a different set of assumptions and
use cases.</p>
<p><img src="/img/posts/dpp-pipelines.jpg" alt="Pipelines for something other than data" /></p>
<p><em><a href="https://www.flickr.com/photos/loopkid/359144645">Pipelines</a> for
something other than data -
<a href="https://www.flickr.com/photos/loopkid/">Stefan Schmidt</a> -
<a href="https://creativecommons.org/licenses/by-nc/2.0/">CC BY-NC 2.0</a></em></p>
<p>datapackage-pipelines assumptions and use cases:</p>
<ol>
<li>
<p><strong>Processing flows (or ‘pipelines’) are defined in a configuration
file and not code.</strong></p>
<p>This allows non-techies to write pipeline definitions, and enables
other possibilities, such as strict validation of definition files.</p>
<p>Writing custom processing code is possible, but the framework
encourages small, simple processing nodes and not processing
behemoths. This creates better design and easier-to-understand
pipelines.</p>
</li>
<li>
<p><strong>Input and output works through streaming data.</strong></p>
<p>While this means processing nodes have limited flexibility, it
also means they must adhere to strict use of computing
resources. This constraint allows us to deploy processing flows
more easily, without having to worry about a processing node
taking too much memory or disk space.</p>
</li>
<li>
<p><strong>We are based on the Data Package, like OpenSpending.</strong></p>
<p>All pipelines process and produce valid
<a href="http://specs.frictionlessdata.io/data-package">Data Packages</a>. This
means that metadata (both descriptive and structural) and data
validation are built into the framework. The resulting files can
then be seamlessly used with any <a href="http://frictionlessdata.io/software/">compliant tool or library</a>, which
makes the produced data extremely portable and
machine-processable.</p>
</li>
</ol>
<h2 id="quick-start">Quick Start</h2>
<p>To start using <a href="https://github.com/frictionlessdata/datapackage-pipelines">datapackage-pipelines</a>, you must first create a
<code class="language-plaintext highlighter-rouge">pipeline-spec.yaml</code> file in your current directory. Here’s an
example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>worldbank-co2-emissions:
pipeline:
-
run: add_metadata
parameters:
name: 'co2-emissions'
title: 'CO2 emissions (metric tons per capita)'
homepage: 'http://worldbank.org/'
-
run: add_resource
parameters:
name: 'global-data'
url: "http://api.worldbank.org/v2/en/indicator/EN.ATM.CO2E.PC?downloadformat=excel"
format: xls
headers: 4
-
run: stream_remote_resources
cache: True
-
run: set_types
parameters:
resources: global-data
types:
"[12][0-9]{3}":
type: number
-
run: dump.to_path
parameters:
out-path: co2-emissions
</code></pre></div></div>
<p>Running a pipeline from the command line is done using the <code class="language-plaintext highlighter-rouge">dpp</code>
tool. Install the latest version of <a href="https://github.com/frictionlessdata/datapackage-pipelines">datapackage-pipelines</a> from
PyPI (Requirements: Python 3.5 or higher):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ pip install datapackage-pipelines
</code></pre></div></div>
<p>At this point, running <code class="language-plaintext highlighter-rouge">dpp</code> will show the list of available pipelines
by scanning the current directory and its subdirectories, searching
for <code class="language-plaintext highlighter-rouge">pipeline-spec.yaml</code> files. (You can ignore the “:Skipping redis
connection, host:None, port:6379” warning for now.)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ dpp
Available Pipelines:
- ./worldbank-co2-emissions (*)
</code></pre></div></div>
<p>Each pipeline has an identifier, composed of the path to the
<code class="language-plaintext highlighter-rouge">pipeline-spec.yaml</code> file and the name of the pipeline, as defined
within that description file. In this case, the identifier is
<code class="language-plaintext highlighter-rouge">./worldbank-co2-emissions</code>.</p>
<p>In order to run a pipeline, you use <code class="language-plaintext highlighter-rouge">dpp run <pipeline-id></code>. You can
also use <code class="language-plaintext highlighter-rouge">dpp run all</code> to run all pipelines and <code class="language-plaintext highlighter-rouge">dpp run dirty</code> to
run just the dirty pipelines (more on that in the
<a href="https://github.com/frictionlessdata/datapackage-pipelines/blob/master/README.md">README</a>).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ dpp run ./worldbank-co2-emissions
INFO :Main:RUNNING ./worldbank-co2-emissions
INFO :Main:- lib/add_metadata.py
INFO :Main:- lib/add_resource.py
INFO :Main:- lib/stream_remote_resources.py
INFO :Main:- lib/dump/to_zip.py
INFO :Main:DONE lib/add_metadata.py
INFO :Main:DONE lib/add_resource.py
INFO :Main:stream_remote_resources: OPENING http://api.worldbank.org/v2/en/indicator/EN.ATM.CO2E.PC?downloadformat=excel
INFO :Main:stream_remote_resources: TOTAL 264 rows
INFO :Main:stream_remote_resources: Processed 264 rows
INFO :Main:DONE lib/stream_remote_resources.py
INFO :Main:dump.to_zip: INFO :Main:Processed 264 rows
INFO :Main:DONE lib/dump/to_zip.py
INFO :Main:RESULTS:
INFO :Main:SUCCESS: ./worldbank-co2-emissions
{'dataset-name': 'co2-emissions', 'total_row_count': 264}
</code></pre></div></div>
<p>At the end of this, you should have a new directory <code class="language-plaintext highlighter-rouge">co2-emissions</code>
with a <code class="language-plaintext highlighter-rouge">/data</code> directory and a <code class="language-plaintext highlighter-rouge">datapackage.json</code> file. This is a
<a href="http://specs.frictionlessdata.io/data-package">Data Package</a>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tree
.
├── co2-emissions
│ ├── data
│ │ └── EN.ATM.CO2E.csv
│ └── datapackage.json
└── pipeline-spec.yaml
2 directories, 3 files
</code></pre></div></div>
<p>So what exactly happened? Let’s explore what a pipeline actually is,
and what it does.</p>
<h2 id="the-pipeline">The Pipeline</h2>
<p>The basic concept in this framework is the pipeline. A pipeline has a
list of processing steps, and it generates a single Data Package as
its output. Each step is executed in a processor and consists of the
following stages:</p>
<ul>
<li><strong>Modify the Data Package descriptor file</strong> (<code class="language-plaintext highlighter-rouge">datapackage.json</code>) - For
example: add metadata, add or remove resources, change resources’
data schema etc. For valid elements, see the
<a href="http://specs.frictionlessdata.io/data-package">spec</a>.</li>
<li><strong>Process resources</strong> - Each row of each resource is processed
sequentially. The processor can drop rows, add new ones, or modify
their contents.</li>
<li><strong>Return stats</strong> - If necessary, the processor can report a
dictionary of data which will be returned to the user when the
pipeline execution terminates. This can be used, for example, for
calculating quality measures for the processed data.</li>
</ul>
<p>Not every processor needs to do all of these. In fact, you would often
find each processing step doing only one of these.</p>
<h3 id="pipeline-specyaml-file">pipeline-spec.yaml file</h3>
<p>Pipelines are defined in a declarative way, and not in code. One or
more pipelines can be defined in a <code class="language-plaintext highlighter-rouge">pipeline-spec.yaml</code> file. This
file specifies the list of processors (referenced by name) and the
execution parameters for each of the processors.</p>
<p>In the above example we see one pipeline called
<code class="language-plaintext highlighter-rouge">worldbank-co2-emissions</code>. Its pipeline consists of 4 steps:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">metadata</code>: This processor (see the <a href="https://github.com/frictionlessdata/datapackage-pipelines#the-standard-processor-library">repo</a> for more), which
modifies the Data Package’s descriptor (in our case: the initial,
empty descriptor) - adding name, title, and other properties to the
<code class="language-plaintext highlighter-rouge">datapackage.json</code>.</li>
<li><code class="language-plaintext highlighter-rouge">add_resource</code>: This processor adds a single resource to the Data
Package. This resource has a <code class="language-plaintext highlighter-rouge">name</code> and a <code class="language-plaintext highlighter-rouge">url</code>, pointing to the
remote location of the data.</li>
<li><code class="language-plaintext highlighter-rouge">stream_remote_resources</code>: This processor converts remote resources
(like the one we defined in the previous step) to local resources,
streaming the data to processors further down the pipeline (see
“Mechanics” below).</li>
<li><code class="language-plaintext highlighter-rouge">set_types</code>: This processor assigns data types to fields in the
data. In this example, field headers looking like years will be
assigned the number type.</li>
<li><code class="language-plaintext highlighter-rouge">dump.to_path</code>: Create a validated Data Package in the provided path
<code class="language-plaintext highlighter-rouge">co2-emissions-wb</code></li>
</ul>
<h3 id="mechanics">Mechanics</h3>
<p>An important aspect of how the pipelines are run is the fact that data
is passed in streams from one processor to another. Each processor is
run in its own dedicated process, where the Data Package is read from
its STDIN and output to its STDOUT. No processor holds the entire data
set at any point.</p>
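<p>To make this concrete, here is a minimal sketch of a custom processor, assuming the <code class="language-plaintext highlighter-rouge">ingest</code>/<code class="language-plaintext highlighter-rouge">spew</code> wrapper API described in the repo; the field it touches is hypothetical.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Minimal custom processor sketch: read the Data Package and its row
# streams from STDIN via ingest(), transform rows lazily, and write the
# descriptor plus rows back to STDOUT via spew().
from datapackage_pipelines.wrapper import ingest, spew

parameters, datapackage, resource_iterator = ingest()


def process(resources):
    for resource in resources:
        def rows(resource_rows):
            for row in resource_rows:
                # Rows arrive as dicts keyed by field name; 'country_name'
                # is a hypothetical field used here for illustration.
                row['country_name'] = row['country_name'].strip()
                yield row
        yield rows(resource)


spew(datapackage, process(resource_iterator))
</code></pre></div></div>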
<h2 id="dirty-tasks-and-keeping-state">Dirty tasks and keeping state</h2>
<p>As you modify and re-run your pipeline, you can also avoid
unnecessarily repeating steps. By setting the <code class="language-plaintext highlighter-rouge">cache:</code> property on a
specific pipeline step to <code class="language-plaintext highlighter-rouge">True</code>, this step’s output will be stored on
disk (in the <code class="language-plaintext highlighter-rouge">.cache</code> directory, in the same location as the
<code class="language-plaintext highlighter-rouge">pipeline-spec.yaml</code> file). Re-running the pipeline will make use of
that cache, thus avoiding the execution of the cached step and its
precursors.</p>
<p>The cache hash is also used to determine whether a pipeline is “dirty”. When
a pipeline completes executing successfully, <code class="language-plaintext highlighter-rouge">dpp</code> stores the cache
hash along with the pipeline id. If the stored hash is different than
the currently calculated hash, it means that either the code or the
execution parameters were modified, and that the pipeline needs to be
re-run.</p>
<h2 id="validating">Validating</h2>
<ul>
<li>The Data Package metadata is always validated before being passed to
a processor, so there’s no possibility for a processor to modify a
Data Package in a way that renders it invalid.</li>
<li>The data itself is not validated against its respective Table
Schema, unless explicitly requested by setting the <code class="language-plaintext highlighter-rouge">validate</code> flag
to <code class="language-plaintext highlighter-rouge">True</code> in the step’s properties. This is done for two main
reasons:
<ul>
<li>Performance: validating the data in every step is very CPU-intensive</li>
<li>In some cases, you modify the schema in one step and the data in
another, so you would only like to validate the data once all
the changes were made</li>
</ul>
</li>
<li>In any case, all the <code class="language-plaintext highlighter-rouge">dump.to_*</code> (<code class="language-plaintext highlighter-rouge">dump.to_path</code>, <code class="language-plaintext highlighter-rouge">dump.to_sql</code>,
<code class="language-plaintext highlighter-rouge">dump.to_zip</code>) standard processors validate their input data
regardless of the <code class="language-plaintext highlighter-rouge">validate</code> flag - so in case you’re using them,
your data validity is covered 👍🏽.</li>
</ul>
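<p>As a sketch, opting in to validation on a single step might look like this, assuming the <code class="language-plaintext highlighter-rouge">validate</code> flag sits alongside <code class="language-plaintext highlighter-rouge">run</code>, like the <code class="language-plaintext highlighter-rouge">cache</code> property in the quick start example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-
  run: set_types
  validate: True
  parameters:
    resources: global-data
    types:
      "[12][0-9]{3}":
        type: number
</code></pre></div></div>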
<h2 id="try-it-out">Try it out</h2>
<p>This all adds up to a highly modular, configurable, and
resource-considerate framework for processing and packaging tabular
data. Once you have created a Data Package, you can publish it
anywhere on the web, comfortable in the knowledge that its embedded
metadata will make it much easier to document and use. Developers can
process Data Packages using our <a href="http://frictionlessdata.io/guides/using-data-packages-in-python/">Python</a> and
<a href="https://github.com/frictionlessdata/datapackage-js">JavaScript</a> libraries. Data analysts can use the
<a href="http://okfnlabs.org/blog/2016/07/14/using-data-packages-with-r.html">R library for Data Packages</a> or our
<a href="http://okfnlabs.org/blog/2016/08/01/using-data-packages-with-pandas.html">Python Pandas</a> library to load the data.</p>
<p>For more information about <a href="https://github.com/frictionlessdata/datapackage-pipelines">datapackage-pipelines</a>, including
running pipelines on a schedule, using the dashboard, configuring the
standard processors, and information on how to write your own
processor, visit the <a href="https://github.com/frictionlessdata/datapackage-pipelines">GitHub repo</a>.</p>
Adam Kariv
Case Studies for Frictionless Data
2016-11-30T00:00:00+00:00
http://okfnlabs.org/blog/2016/11/30/case-studies-for-frictionless-data
<p>For our <a href="http://frictionlessdata.io/">Frictionless Data</a> project, we
were curious to learn about some of the common issues users face when
working with data. To that end, we started a
<a href="http://frictionlessdata.io/case-studies/">Case Study series</a> to
highlight projects and organizations working with the Frictionless
Data specifications and tooling in interesting and innovative ways.</p>
<p>Through several interviews over the past several months, we now have
three case studies published on a range of topics: data science in the
browser, the provision of clean power system data for energy
researchers, and re-usable components for cloud-based data-intensive
workflows.</p>
<h2 id="dataship">Dataship</h2>
<p><a href="http://frictionlessdata.io/case-studies/dataship/">http://frictionlessdata.io/case-studies/dataship/</a></p>
<p>We interviewed Waylon Flinn of <a href="https://dataship.io/">Dataship</a> to
learn more about how he uses the Data Package specifications.
Dataship is a way to share data and analysis, from simple charts to
complex machine learning, with anyone in the world easily and for
free. The
<a href="http://specs.frictionlessdata.io/data-package/">Data Package</a> acts
as the base format for Dataship notebooks.</p>
<p><a href="http://frictionlessdata.io/case-studies/dataship/"><img src="/img/posts/dataship.png" alt="Dataship" /></a></p>
<hr />
<h3 id="open-power-system-data">Open Power System Data</h3>
<p><a href="http://frictionlessdata.io/case-studies/open-power-system-data/">http://frictionlessdata.io/case-studies/open-power-system-data/</a></p>
<p>We spoke to Lion Hirth and Ingmar Schlecht about
<a href="http://open-power-system-data.org/">Open Power System Data</a>, a
free-of-charge and open platform providing clean, high quality data
needed for power system analysis and modeling. The project is aimed
at resolving some of the most common, persistent issues energy
researchers face when working with data.</p>
<p><a href="http://frictionlessdata.io/case-studies/open-power-system-data/"><img src="/img/posts/opsd.png" alt="Open Power System Data" /></a></p>
<hr />
<h2 id="tesera">Tesera</h2>
<p><a href="http://frictionlessdata.io/case-studies/tesera/">http://frictionlessdata.io/case-studies/tesera/</a></p>
<p>Spencer Cox of <a href="http://tesera.com">Tesera Systems, Inc.</a> shared how
his team is using the
<a href="http://specs.frictionlessdata.io/">Frictionless Data specifications</a>
across a range of purpose-built tools to power data-driven
applications in the cloud.</p>
<p><a href="http://frictionlessdata.io/case-studies/tesera/"><img src="/img/posts/tesera.png" alt="Tesera" /></a></p>
<hr />
<h2 id="reach-out-to-us">Reach out to us</h2>
<p>If you are using any of the Frictionless Data
specifications—<a href="http://specs.frictionlessdata.io/table-schema/">JSON Table Schema</a>,
<a href="http://specs.frictionlessdata.io/data-package/">Data Packages</a>—for
your project, big or small, reach out to us. We can work together on
developing a case study to share your project with the world!</p>
Dan Fowler
Frictionless Data Specs Working Group
2016-10-17T00:00:00+00:00
http://okfnlabs.org/blog/2016/10/17/specs-working-group
<p>Last month, we had the first call of the <strong>Frictionless Data
Specifications Working Group</strong>, starting a new chapter in the project.
The call covered the status of the specifications to date, current
adoption, upcoming technical pilots and partnerships, and how work
will be organized going forward. In this post, I will lay out the
purpose for this initiative, who is participating, and how you can get
involved.</p>
<p><a href="http://frictionlessdata.io/"><img src="/img/posts/frictionlessdata-logo.png" alt="Frictionless Data Logo" /></a></p>
<h2 id="overview">Overview</h2>
<p><a href="http://frictionlessdata.io/">Frictionless Data</a> is a project
encompassing a set of tooling and specifications to ease the transport
and reuse of data. The specifications have grown out of a long
engagement with issues around data interoperability, publication
workflows, and analysis. For most of the history of this project, the
specifications were curated by Rufus Pollock as one of several “Data
Protocols” with input and assistance from individuals from Open
Knowledge International and other organizations. As a result, the
specifications have steadily gained traction across various projects
and software developed by, among others, the
<a href="http://theodi.org/">Open Data Institute (ODI)</a>,
<a href="http://tesera.com/">Tesera Systems, Inc.</a>,
<a href="https://dataship.io/">Dataship</a>, and
<a href="http://open-power-system-data.org/">Open Power System Data</a>.</p>
<p>This adoption validates the approach we’ve taken: creating a minimum
viable set of specifications to significantly improve transport of
data. In reaching out to <em>new</em> users, we would like to make sure that
we have resolved some of the outstanding edge cases to ensure that
Data Packages can serve as a solid foundation for many more types of
data-intensive applications. This work is all the more important as
“core” libraries in
<a href="https://github.com/frictionlessdata/datapackage-py">Python</a>,
<a href="https://github.com/frictionlessdata/datapackage-js">Javascript</a>, and
<a href="https://github.com/theodi/datapackage.rb">Ruby</a> are currently being
refined, and newer libraries, like
<a href="https://github.com/frictionlessdata/datapackage-r">R</a>, are being
developed. With that in mind, we have organized a working group with
a specific goal: to deliver a first, complete version of the
specifications by end of this year.</p>
<h2 id="working-group">Working Group</h2>
<p>Members of the working group currently include:</p>
<ul>
<li><a href="https://twitter.com/_pwalsh">Paul Walsh</a> (Open Knowledge International)</li>
<li><a href="https://twitter.com/rufuspollock">Rufus Pollock</a> (Open Knowledge International)</li>
<li><a href="https://twitter.com/danfowler">Dan Fowler</a> (Open Knowledge International)</li>
<li><a href="https://twitter.com/domoritz">Dominik Moritz</a> (<a href="http://www.cs.washington.edu/">University of Washington</a>)</li>
<li><a href="https://twitter.com/starl3n">Steven De Costa</a> (<a href="http://linkdigital.com.au/">Link Digital</a>)</li>
<li><a href="https://twitter.com/mckinneyjames">James McKinney</a> (<a href="http://www.opennorth.ca/">Open North</a>)</li>
<li><a href="https://twitter.com/okdistribute">Karissa McKelvey</a> (<a href="http://dat-data.com/">Dat Data</a></li>
<li><a href="https://twitter.com/TheSpencerCox">Spencer Cox</a> (<a href="http://tesera.com/">Tesera Systems, Inc.</a>)</li>
</ul>
<p>Work will continue to happen asynchronously, in the open,
without excessive rules around voting. Rather, we will listen to
feedback and act in favor of consensus (without requiring it). Rufus
Pollock, having led this work for many years with a strong focus on
keeping it simple, will remain the curator; decisions of what stays or
goes from the specs will rest with him. Having more eyes on the specs,
with a variety of different perspectives, will allow us to solidify the
specs, remove ambiguous statements, eliminate unnecessary repetition and
logical errors, and, hopefully, achieve a minimal 1.0 by the end of 2016.
Beyond the core Data Package specifications, open topics might include
defining further custom “profiles”
(e.g. <a href="http://specs.frictionlessdata.io/fiscal-data-package/">Fiscal Data Package</a>),
as well as potential extensions, including specifications for
<a href="https://discuss.okfn.org/t/data-packages-views-graphs-maps-tables-etc/2667">visualizations</a>,
statistics, and quality metrics for data.</p>
<h2 id="feedback-needed">Feedback Needed</h2>
<p>Are you currently using or considering using the Frictionless Data
specifications for your data or application? If so, please let us
know!</p>
<p>Work is managed via an
<a href="https://github.com/frictionlessdata/specs/issues">issue tracker</a> on
GitHub, which is the best way to raise specific questions. If you
would like to specifically flag an issue for the Working Group,
mention <strong>@frictionlessdata/specs-working-group</strong> in the comment. For
general commentary on any aspect of Frictionless Data, you can leave a
comment on the <a href="https://discuss.okfn.org/c/frictionless-data">forum</a>.</p>
<ul>
<li>Current Specifications: <a href="http://specs.frictionlessdata.io/">http://specs.frictionlessdata.io/</a>
<ul>
<li>JSON Schema (for validation): <a href="https://github.com/frictionlessdata/schemas">https://github.com/frictionlessdata/schemas</a></li>
</ul>
</li>
<li>Specs Issue Tracker: <a href="https://github.com/frictionlessdata/specs/issues">https://github.com/frictionlessdata/specs/issues</a>
<ul>
<li>Current Milestone: <a href="https://github.com/frictionlessdata/specs/milestone/1">https://github.com/frictionlessdata/specs/milestone/1</a></li>
</ul>
</li>
<li>Forum: <a href="https://discuss.okfn.org/c/frictionless-data">https://discuss.okfn.org/c/frictionless-data</a></li>
</ul>
<hr />
<p><em>Thanks to Paul Walsh, who provided the motivating text that served as
the basis for this post, and Jo Barratt, who did much of the organizing
necessary to make it happen.</em></p>
Dan Fowler
Building 2030-watch.de: measuring progress towards the sustainable development goals (SDGs)
2016-10-13T00:00:00+00:00
http://okfnlabs.org/blog/2016/10/13/2030-watch
<p>For the last 15 months the Open Knowledge Foundation Germany has been working on a prototype to monitor progress towards the sustainable development goals (SDGs) from an independent, civil society-led perspective. There’s a detailed blog post on why such independent monitoring is necessary at <a href="https://www.2030-watch.de/en/blog/2016/10/06/blog/">our blog</a>. To give a quick example, the UN Commission agreed to measure the tax revenue generated by low-income countries but doesn’t propose an indicator to measure the financial secrecy of European countries, which, for example, encourages tax evasion. At 2030-watch we have the Tax Justice Network as our data partner, providing an indicator on this topic. The Tax Justice Network also collaborates with Open Knowledge on <a href="http://datafortaxjustice.net/">tax justice</a>.</p>
<p>Due to “cherry picking” of indicators, there is a high risk that the ambition of the 2030 Agenda is watered down at the monitoring stage. This is why we have created 2030-Watch: a tool that focuses on high-income countries and uses a visualisation to highlight which countries are doing well at achieving which goals, drawing on over 60 indicators built from data from official sources like Eurostat and the OECD as well as from civil society organisations. The results might surprise you, <a href="https://www.2030-watch.de/en/">so head on over</a> and take a look!</p>
<p>It’s been my pleasure to lead development work on the project the last few months: creating a workflow for uploading indicator data, reworking the site to showcase indicator “sponsors” and allowing multilingual texts. Last week we were very proud to launch the English version: <a href="https://www.2030-watch.de/en/">2030-watch.de/en/</a>. The site is generated using the static site generator <a href="https://jekyllrb.com/">Jekyll</a>. Recent Jekyll versions have a wonderful ability <a href="https://jekyllrb.com/docs/datafiles/">to ingest JSON data and make it available to templates</a>. We’ve used this facility to make a JSON database of all indicators available to various small visualisation web applications written in AngularJS. We have recently moved from direct editing of JSON files (one per indicator) via GitHub to reading in data from standardized Google Sheets and automatically outputting JSON files as updates to the GitHub repository. This change was made to make indicator sponsorship easier for external parties. It has the side benefit of allowing conversion to CSV, ODS and XLSX formats using the Google Drive API. <a href="https://github.com/okfde/2030-watch.de">Source code for the website</a> is of course open and you can take a look at batch processing Google Drive sheets at <a href="https://github.com/okfde/2030-watch-dataprocessing">our data processing repository</a>.</p>
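<p>For a rough sense of what that conversion step involves, here is a minimal Python sketch. Note that the project itself reads the sheets via the Google Drive API; this sketch instead assumes a sheet published to the web as CSV, and the URL, column name, and output paths are all placeholders:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import csv
import io
import json
import urllib.request

# Hypothetical URL of a Google Sheet published to the web as CSV
SHEET_CSV_URL = 'https://docs.google.com/spreadsheets/d/SHEET_ID/export?format=csv'

with urllib.request.urlopen(SHEET_CSV_URL) as response:
    rows = list(csv.DictReader(io.TextIOWrapper(response, encoding='utf-8')))

# Write one JSON file per indicator so Jekyll can pick the data up
# from its data directory (the file layout here is illustrative)
for row in rows:
    with open('_data/indicators/{}.json'.format(row['id']), 'w') as target:
        json.dump(row, target, indent=2, ensure_ascii=False)</code></pre></figure>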
<p><img src="/img/posts/2030watchsample.png" alt="The country comparison tool shows multiple countries’ performance across all indicators in a simple color-coded fashion as well as allowing comparisons to be made between the countries for each indicator" />
<em>The country comparison tool shows multiple countries’ performance across all indicators in a simple color-coded fashion as well as allowing comparisons to be made between the countries for each indicator</em></p>
<p>2030 Watch is still a prototype, developed with scarce resources and a lot of voluntary work. We still have a lot to do and are looking for help in many areas, technical and non-technical. For example, on the technical side we would love to see ideas for how the site can become even more user-friendly and informative, how the visualisations could work with up to 90 indicators, or how we could adapt the data tool for mobile use. For further details on how to get involved, <a href="mailto:info@2030-watch.de">contact us at info@2030-watch.de</a>. We are also raising financial contributions at <a href="https://www.betterplace.org/en/projects/25565-2030-watch-de-germany-on-the-path-to-sustainability">betterplace.org</a>. Despite the remaining challenges, we feel that 2030 Watch already demonstrates that civil society monitoring is possible.</p>
<p><strong>Disclaimer and acknowledgements:</strong> This post has reused some of Claudia’s <a href="https://www.2030-watch.de/en/blog/2016/10/06/blog/">longer post at 2030-watch.de</a>. I am responsible for the development and data preparation effort carried out since September 2016. Special thanks go to fellow labs member <a href="http://okfnlabs.org/members/markbrough/">Mark Brough</a> who has reworked the visuals wonderfully in the last months, to <a href="http://katjadittrich.com/">Katja Dittrich</a> who created the data visualisations and to <a href="https://www.xing.com/profile/Christian_Pape18">Christian Pape</a> who developed the first version of the site in 2015.</p>
Matt Fullerton
Embulk at csv,conf,v2
2016-08-04T00:00:00+00:00
http://okfnlabs.org/blog/2016/08/04/embulk
<p>Having co-organized csv,conf,v2 this past May, a few of us from Open
Knowledge International had the awesome opportunity to travel to
Berlin and sit in on a range of fascinating talks on the current
state-of-the-art on wrangling messy data. Previously, I posted about
Comma Chameleon by Stuart Harrison<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. Another such talk was given
by Sadayuki Furuhashi of Treasure Data<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> who presented on a tool he is
developing called <strong>Embulk</strong>. Embulk is an open-source tool for
moving messy data.</p>
<p><img src="/img/posts/embulk-sadayuki.jpg" alt="Sadayuki Furuhashi" /></p>
<h2 id="friction-in-data-transport">Friction in Data Transport</h2>
<p>In his talk, Sadayuki talked about the <em>friction</em> commonly experienced
in moving large amounts of data from one system to another. He gives
a relatively simple example of trying to push a 10GB CSV file to
PostgreSQL and encountering a series of issues—broken or missing
records, unsupported time formats—that can typically be dealt with
only through <em>trial and error</em>. Multiply these kinds of issues across
the various and growing number of backends and file formats, and it
quickly becomes clear that there’s not enough time in the day for data
wranglers to write and optimize their scripts to move data flexibly
and efficiently. Enter Embulk.</p>
<h2 id="embulk">Embulk</h2>
<p><a href="http://www.embulk.org/">Embulk</a> is open-source tool for transporting massive, messy
datasets—in parallel—from one system to another. In this context,
“system” can refer to any number of endpoints including Amazon S3, an
SQL database, or even a CSV file on your local computer. Embulk
attempts to solve the issues above by creating a plugin-based
framework that supports various data transport tasks, including file
type and format guessing, processing, filtering, and encryption.</p>
<p><img src="/img/posts/embulk-logo.png" alt="Sadayuki Furuhashi" /></p>
<p>Specialized connectors for supporting different storage
engines—RDBMSs, cloud services, etc.—as well as various file types
—CSV, XML, JSON, HDF5, etc.—can be created by the community as
<a href="http://www.embulk.org/plugins/">plugins</a>. The core of Embulk is actually quite small
and gains most of its power from this plugin architecture.</p>
<h2 id="frictionless-data">Frictionless Data</h2>
<p>While watching his presentation, I realized that there is a lot of
opportunity for collaboration between the work Embulk is doing and the
ecosystem we’re trying to build through our <a href="http://frictionlessdata.io/">Frictionless Data</a>
project. In our project, we’re looking to support easy and efficient
transport of data primarily through the development and promotion of
the <a href="http://specs.frictionlessdata.io">Frictionless Data specifications</a> and the development of
various <a href="http://frictionlessdata.io/software/">libraries, tools, and integrations</a>. For instance,
our Python library for reading and working with
<a href="http://frictionlessdata.io/guides/table-schema/">JSON Table Schema</a> also supports a plugin-architecture for
reading and storing data in a variety of backends. Currently, we have
support for <a href="https://github.com/frictionlessdata/jsontableschema-pandas-py">Pandas</a>, <a href="https://github.com/frictionlessdata/jsontableschema-bigquery-py">BigQuery</a>, and
<a href="https://github.com/frictionlessdata/jsontableschema-sql-py">SQL</a> (visit our <a href="http://frictionlessdata.io/user-stories/">User Stories</a> page to vote for and
comment on what you’d like to see next).</p>
<p><img src="/img/posts/embulk-presentation.jpg" alt="Embulk Presentation" /></p>
<h3 id="embulk-guess-and-data-packages">Embulk Guess and Data Packages</h3>
<p>As an example of the potential overlap, we can demonstrate the
<code class="language-plaintext highlighter-rouge">schema</code> and <code class="language-plaintext highlighter-rouge">dialect</code> guessing that Embulk employs to load data. To
support loading CSV data into a variety of backends, Embulk needs a
good idea of the types of records (<code class="language-plaintext highlighter-rouge">schema</code>) and also the rules by
which these values are separated in the file (<code class="language-plaintext highlighter-rouge">dialect</code>). Embulk’s
<code class="language-plaintext highlighter-rouge">guess</code> function (<code class="language-plaintext highlighter-rouge">embulk guess</code>) makes guesses about the file
structure and outputs something similar to the <code class="language-plaintext highlighter-rouge">datapackage.json</code> (see
our <a href="http://specs.frictionlessdata.io">specifications</a> for more details). To demonstrate, we can
use Embulk’s convenient <code class="language-plaintext highlighter-rouge">example</code> function which creates an example
CSV. It looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id,account,time,purchase,comment
1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby
3,27559,2015-01-28 02:20:02,20150128,"Embulk ""csv"" parser plugin"
4,11270,2015-01-29 11:54:36,20150129,NULL
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">guess</code> function reads the data and generates a configuration file
used by Embulk that looks like the following. Of particular interest
is the <code class="language-plaintext highlighter-rouge">parser</code> section.</p>
<figure class="highlight"><pre><code class="language-yaml" data-lang="yaml"><span class="na">in</span><span class="pi">:</span>
<span class="na">type</span><span class="pi">:</span> <span class="s">file</span>
<span class="na">path_prefix</span><span class="pi">:</span> <span class="s">/Users/dan/Desktop/demo/csv/sample_</span>
<span class="na">decoders</span><span class="pi">:</span>
<span class="pi">-</span> <span class="pi">{</span><span class="nv">type</span><span class="pi">:</span> <span class="nv">gzip</span><span class="pi">}</span>
<span class="na">parser</span><span class="pi">:</span>
<span class="na">charset</span><span class="pi">:</span> <span class="s">UTF-8</span>
<span class="na">newline</span><span class="pi">:</span> <span class="s">CRLF</span>
<span class="na">type</span><span class="pi">:</span> <span class="s">csv</span>
<span class="na">delimiter</span><span class="pi">:</span> <span class="s1">'</span><span class="s">,'</span>
<span class="na">quote</span><span class="pi">:</span> <span class="s1">'</span><span class="s">"'</span>
<span class="na">escape</span><span class="pi">:</span> <span class="s1">'</span><span class="s">"'</span>
<span class="na">null_string</span><span class="pi">:</span> <span class="s1">'</span><span class="s">NULL'</span>
<span class="na">trim_if_not_quoted</span><span class="pi">:</span> <span class="no">false</span>
<span class="na">skip_header_lines</span><span class="pi">:</span> <span class="m">1</span>
<span class="na">allow_extra_columns</span><span class="pi">:</span> <span class="no">false</span>
<span class="na">allow_optional_columns</span><span class="pi">:</span> <span class="no">false</span>
<span class="na">columns</span><span class="pi">:</span>
<span class="pi">-</span> <span class="pi">{</span><span class="nv">name</span><span class="pi">:</span> <span class="nv">id</span><span class="pi">,</span> <span class="nv">type</span><span class="pi">:</span> <span class="nv">long</span><span class="pi">}</span>
<span class="pi">-</span> <span class="pi">{</span><span class="nv">name</span><span class="pi">:</span> <span class="nv">account</span><span class="pi">,</span> <span class="nv">type</span><span class="pi">:</span> <span class="nv">long</span><span class="pi">}</span>
<span class="pi">-</span> <span class="pi">{</span><span class="nv">name</span><span class="pi">:</span> <span class="nv">time</span><span class="pi">,</span> <span class="nv">type</span><span class="pi">:</span> <span class="nv">timestamp</span><span class="pi">,</span> <span class="nv">format</span><span class="pi">:</span> <span class="s1">'</span><span class="s">%Y-%m-%d</span><span class="nv"> </span><span class="s">%H:%M:%S'</span><span class="pi">}</span>
<span class="pi">-</span> <span class="pi">{</span><span class="nv">name</span><span class="pi">:</span> <span class="nv">purchase</span><span class="pi">,</span> <span class="nv">type</span><span class="pi">:</span> <span class="nv">timestamp</span><span class="pi">,</span> <span class="nv">format</span><span class="pi">:</span> <span class="s1">'</span><span class="s">%Y%m%d'</span><span class="pi">}</span>
<span class="pi">-</span> <span class="pi">{</span><span class="nv">name</span><span class="pi">:</span> <span class="nv">comment</span><span class="pi">,</span> <span class="nv">type</span><span class="pi">:</span> <span class="nv">string</span><span class="pi">}</span>
<span class="na">out</span><span class="pi">:</span> <span class="pi">{</span><span class="nv">type</span><span class="pi">:</span> <span class="nv">stdout</span><span class="pi">}</span></code></pre></figure>
<p>As you can see, <code class="language-plaintext highlighter-rouge">embulk guess</code> goes a bit further than
similar type-guessing functions: it guesses not only that a column is
a date, but also the expected date format.</p>
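<p>On the Frictionless Data side, the <a href="https://github.com/frictionlessdata/jsontableschema-py">jsontableschema</a> Python library offers type guessing of its own through its <code class="language-plaintext highlighter-rouge">infer</code> function. A minimal sketch, re-typing the example rows above by hand:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">from jsontableschema import infer  # pip install jsontableschema

headers = ['id', 'account', 'time', 'purchase', 'comment']
rows = [
    ['1', '32864', '2015-01-27 19:23:49', '20150127', 'embulk'],
    ['2', '14824', '2015-01-27 19:01:23', '20150127', 'embulk jruby'],
]

# infer() returns a JSON Table Schema descriptor guessed from the rows
schema = infer(headers, rows)
print(schema['fields'])</code></pre></figure>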
<p>Many of the fields in the <code class="language-plaintext highlighter-rouge">parser</code> section can be represented in a
Data Package following the <a href="http://frictionlessdata.io/guides/table-schema/">JSON Table Schema</a> and
<a href="http://specs.frictionlessdata.io/csv-dialect/">CSV Dialect Description Format</a> specifications, which are
part of the <a href="http://frictionlessdata.io/guides/data-package/">Data Package</a> specifications. Here’s what the
equivalent <code class="language-plaintext highlighter-rouge">datapackage.json</code> file would look like:</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="p">{</span>
<span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">sample_01</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">resources</span><span class="dl">"</span><span class="p">:</span> <span class="p">[</span>
<span class="p">{</span>
<span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">sample-01</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">path</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">sample_01.csv</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">format</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">csv</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">encoding</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">UTF-8</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">dialect</span><span class="dl">"</span><span class="p">:</span> <span class="p">{</span>
<span class="dl">"</span><span class="s2">lineTerminator</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="se">\r\n</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">delimiter</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">,</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">quoteChar</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="se">\"</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">escapeChar</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="se">\"</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">header</span><span class="dl">"</span><span class="p">:</span> <span class="kc">true</span>
<span class="p">}</span>
<span class="dl">"</span><span class="s2">schema</span><span class="dl">"</span><span class="p">:</span> <span class="p">{</span>
<span class="dl">"</span><span class="s2">fields</span><span class="dl">"</span><span class="p">:</span> <span class="p">[</span>
<span class="p">{</span> <span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">id</span><span class="dl">"</span><span class="p">,</span> <span class="dl">"</span><span class="s2">type</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">integer</span><span class="dl">"</span> <span class="p">},</span>
<span class="p">{</span> <span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">account</span><span class="dl">"</span><span class="p">,</span> <span class="dl">"</span><span class="s2">type</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">integer</span><span class="dl">"</span> <span class="p">},</span>
<span class="p">{</span> <span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">time</span><span class="dl">"</span><span class="p">,</span> <span class="dl">"</span><span class="s2">type</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">datetime</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">format</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">fmt:%Y-%m-%d %H:%M:%S</span><span class="dl">"</span> <span class="p">},</span>
<span class="p">{</span> <span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">purchase</span><span class="dl">"</span><span class="p">,</span> <span class="dl">"</span><span class="s2">type</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">date</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">format</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">fmt:%Y%m%d</span><span class="dl">"</span> <span class="p">},</span>
<span class="p">{</span> <span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">comment</span><span class="dl">"</span><span class="p">,</span> <span class="dl">"</span><span class="s2">type</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">string</span><span class="dl">"</span> <span class="p">}</span>
<span class="p">]</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">]</span>
<span class="p">}</span></code></pre></figure>
<h3 id="the-potential-future">The Potential Future</h3>
<p>Embulk looks like a really powerful tool that can definitely be part
of the Frictionless Data ecosystem we envision. I would love to see an
input plugin that reads a <code class="language-plaintext highlighter-rouge">datapackage.json</code> to give Embulk’s CSV parser
what it needs. It would also be great to see an output plugin that
can produce a valid <code class="language-plaintext highlighter-rouge">datapackage.json</code> file with your data. Once
you’ve generated a schema using Embulk’s powerful guessing
functionality, publishing the schema with your data in a standard
format like the <a href="http://frictionlessdata.io/guides/data-package/">Data Package</a> is an excellent step towards making
your data more <em>findable</em> and <em>reusable</em>. The Frictionless Data
project is, at its heart, about highlighting the benefits of adopting
just such a standardized <a href="http://frictionlessdata.io/about/#data-containerization">containerization</a> approach
to data. Of course, even without this, Embulk is a really powerful
tool for solving some of the problems of data transport today. Give
it a try!</p>
<ul>
<li>Download Embulk: <a href="https://github.com/embulk/embulk">https://github.com/embulk/embulk</a></li>
<li>Follow Treasure Data on Twitter: <a href="https://twitter.com/TreasureData">https://twitter.com/TreasureData</a></li>
<li>Follow Sadayuki Furuhashi on Twitter: <a href="https://twitter.com/frsyuki/">https://twitter.com/frsyuki/</a></li>
<li>Follow Open Knowledge Labs on Twitter: <a href="https://twitter.com/okfnlabs">https://twitter.com/okfnlabs</a></li>
<li>See the full range of speakers from csv,conf,v2: <a href="http://csvconf.com">http://csvconf.com</a></li>
</ul>
<p>See Sadayuki’s full talk:</p>
<iframe width="576px" height="360px" src="https://www.youtube.com/embed/RuA_SL5-sXY" frameborder="0" allowfullscreen=""></iframe>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Comma Chameleon: <a href="/blog/2016/07/18/comma-chameleon.html">/blog/2016/07/18/comma-chameleon.html</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Treasure Data: <a href="https://www.treasuredata.com/">https://www.treasuredata.com/</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Dan Fowler
Using Data Packages with Pandas
2016-08-01T00:00:00+00:00
http://okfnlabs.org/blog/2016/08/01/using-data-packages-with-pandas
<p>Frictionless Data is about making it effortless to transport high
quality data among different tools and platforms for further analysis.
We obviously ♥ data science, and pandas is one of the most
popular Python libraries for advanced <em>data analysis and modeling</em>.
This post highlights our most recent community
contribution<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>—pandas integration for Data Packages—what it
means, and how you can contribute.</p>
<h2 id="pandas">Pandas</h2>
<p><a href="http://pandas.pydata.org/"><img src="/img/posts/pandas_logo.png" alt="Pandas" /></a></p>
<p>From the
<a href="http://pandas.pydata.org/pandas-docs/stable/">pandas documentation</a>:</p>
<blockquote>
<p>pandas is a Python package providing fast, flexible, and expressive
data structures designed to make working with “relational” or
“labeled” data both easy and intuitive. It aims to be the
fundamental high-level building block for doing practical, real
world data analysis in Python.</p>
</blockquote>
<p>One of the primary data structures in pandas is the <strong>DataFrame</strong>. The
DataFrame, similar to <a href="https://www.r-project.org/">R</a>’s <strong>data
frame</strong>, stores the kind of 2-dimensional, tabular data common across
various data analysis use cases. While pandas has extremely powerful
tools for importing, exporting, and manipulating data, the process of
loading data from, say, a single CSV file, often requires some trial
and error to do optimally. For instance, one might need to manually
specify CSV dialect parameters, index columns, datetime fields, etc.
Pandas has automatic type and encoding guessing, but guessing often
fails, requiring manual intervention to accurately describe and load
your data. (See
<a href="/blog/2016/07/14/using-data-packages-with-r.html">my recent post on R</a>
for an example of this.)</p>
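<p>To make that concrete, here is a sketch of the kind of boilerplate a single CSV can require; the file name and column names are placeholders:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import pandas as pd

# Without a schema, dialect and type details are spelled out by hand
df = pd.read_csv(
    'data.csv',              # placeholder file name
    sep=',',                 # CSV dialect: delimiter
    encoding='utf-8',        # file encoding
    parse_dates=['Date'],    # columns to parse as datetimes
    index_col='Date',        # column to use as the index
)</code></pre></figure>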
<p>A Tabular Data Package consists of one or more CSV resources, each
containing a <em>schema</em> (indicating type, constraints, and other
metadata useful for validation and analysis) and, optionally, a
<em>dialect</em> (specifying characters for separating or quoting values).
See our
<a href="http://frictionlessdata.io/guides/table-schema/">JSON Table Schema guide</a>
and the <a href="http://dataprotocols.org/csv-dialect/">CSVDDF</a> specification
for more information. Given that a single Tabular Data Package can
consist of multiple tables, pandas integration means loading multiple
DataFrames—with appropriately set types, encodings, indexes and
dialects—at once. And once you have Tabular Data Packages in a
pandas DataFrame, you now get all the power provided by Pandas to
reshape, explore and visualise data as well as access to Pandas’
<a href="http://pandas.pydata.org/pandas-docs/stable/io.html">variety of export formats</a>.</p>
<h2 id="jsontableschema-pandas">jsontableschema-pandas</h2>
<p>The newly developed
<a href="https://github.com/frictionlessdata/jsontableschema-pandas-py">Pandas plugin</a>
allows users to generate and load Pandas DataFrames based on JSON
Table Schema descriptors. In order to use it, you first need to
install the <code class="language-plaintext highlighter-rouge">datapackage</code> and <code class="language-plaintext highlighter-rouge">jsontableschema-pandas</code> libraries.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">pip <span class="nb">install </span>datapackage
pip <span class="nb">install </span>jsontableschema-pandas</code></pre></figure>
<p>You can load a Data Package into your environment by using the
<code class="language-plaintext highlighter-rouge">datapackage.push_datapackage</code> function. We pass the URL of the
descriptor file (<code class="language-plaintext highlighter-rouge">datapackage.json</code>) and choose <code class="language-plaintext highlighter-rouge">pandas</code> as
our backend:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">datapackage</span>
<span class="kn">import</span> <span class="nn">pandas</span>
<span class="n">url</span> <span class="o">=</span> <span class="s">'https://raw.githubusercontent.com/frictionlessdata/example-data-packages/master/cpi/datapackage.json'</span>
<span class="n">storage</span> <span class="o">=</span> <span class="n">datapackage</span><span class="p">.</span><span class="n">push_datapackage</span><span class="p">(</span><span class="n">descriptor</span><span class="o">=</span><span class="n">url</span><span class="p">,</span><span class="n">backend</span><span class="o">=</span><span class="s">'pandas'</span><span class="p">)</span></code></pre></figure>
<p>Once loaded into memory, the <code class="language-plaintext highlighter-rouge">tables</code> attribute lists the names of the
tables stored in the Data Package.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">storage</span><span class="p">.</span><span class="n">tables</span></code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">['data__cpi']</code></p>
<p>In this case, we have a single table, <code class="language-plaintext highlighter-rouge">data__cpi</code>, which we can take a
peek at using the Pandas <code class="language-plaintext highlighter-rouge">head()</code> method.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">storage</span><span class="p">[</span><span class="s">'data__cpi'</span><span class="p">].</span><span class="n">head</span><span class="p">()</span></code></pre></figure>
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Country Name</th>
<th>Country Code</th>
<th>Year</th>
<th>CPI</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Afghanistan</td>
<td>AFG</td>
<td>2004-01-01</td>
<td>63.131893</td>
</tr>
<tr>
<th>1</th>
<td>Afghanistan</td>
<td>AFG</td>
<td>2005-01-01</td>
<td>71.140974</td>
</tr>
<tr>
<th>2</th>
<td>Afghanistan</td>
<td>AFG</td>
<td>2006-01-01</td>
<td>76.302178</td>
</tr>
<tr>
<th>3</th>
<td>Afghanistan</td>
<td>AFG</td>
<td>2007-01-01</td>
<td>82.774807</td>
</tr>
<tr>
<th>4</th>
<td>Afghanistan</td>
<td>AFG</td>
<td>2008-01-01</td>
<td>108.066600</td>
</tr>
</tbody>
</table>
</div>
<p>At this point, you can treat <code class="language-plaintext highlighter-rouge">storage['data__cpi']</code> as you would any
other DataFrame in Pandas. For more detail on how to interact with
the library and where to go from here, please visit the links below:</p>
<ul>
<li>Package on PyPI: <a href="https://pypi.python.org/pypi/jsontableschema-pandas">https://pypi.python.org/pypi/jsontableschema-pandas</a></li>
<li>Source on GitHub: <a href="https://github.com/frictionlessdata/jsontableschema-pandas-py">https://github.com/frictionlessdata/jsontableschema-pandas-py</a></li>
</ul>
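<p>As a quick illustration, here is a short sketch that filters and exports the CPI table with ordinary pandas operations; the output file name is arbitrary:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">df = storage['data__cpi']

# Filter to a single country and summarise the CPI column
afghanistan = df[df['Country Code'] == 'AFG']
print(afghanistan['CPI'].describe())

# Export with one of pandas' many writers
afghanistan.to_csv('afghanistan-cpi.csv', index=False)</code></pre></figure>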
<h2 id="contributing">Contributing</h2>
<p>The Python library
<a href="https://github.com/frictionlessdata/jsontableschema-py">jsontableschema-py</a>
provides the core set of utilities for working with Tabular Data
Package tables, and it implements a plugin-based system for adding
different
<a href="https://github.com/frictionlessdata/jsontableschema-py#storage">storage</a>
backends. In a
<a href="http://okfnlabs.org/blog/2016/03/11/frictionless-data-transport-in-python.html">recent post</a>,
I highlighted the first two of these storage integrations:
<a href="https://github.com/frictionlessdata/jsontableschema-sql-py">SQL</a> and
<a href="https://github.com/frictionlessdata/jsontableschema-bigquery-py">BigQuery</a>.
These libraries, and the Pandas library, were written as drivers
implementing the <code class="language-plaintext highlighter-rouge">jsontableschema.storage.Storage</code>
<a href="https://github.com/frictionlessdata/jsontableschema-py#storage">interface</a>.
If you have another storage backend you’d like to use with Data
Packages in Python, consider writing a
<a href="https://github.com/frictionlessdata/jsontableschema-py#plugins">plugin</a>.</p>
<p><img src="http://okfnlabs.org/img/posts/tabular-storage-diagram.png" alt="Plugins" /></p>
<p>We’re also looking to support other integrations beyond Python. You
can find user stories we’re looking to support on the
<a href="http://frictionlessdata.io/user-stories/">User Stories</a> section of
the Frictionless Data site. Do you have a library, tool, or platform
that you’d like to see support importing and exporting Data Packages?
Let us know by voting and commenting on what you’d like to see! If
you have any questions about how to contribute, jump into the
<a href="https://gitter.im/frictionlessdata/chat">Frictionless Data chat</a> or
<a href="https://discuss.okfn.org/c/frictionless-data">post in the forum</a>.</p>
<p>To see the code used in this post, visit its
<a href="https://github.com/okfn/okfn.github.com/blob/master/resources/using-data-packages-with-pandas.ipynb">Jupyter Notebook</a>.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Thanks @sirex for the contribution! <a href="http://sirex.lt">http://sirex.lt</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Dan Fowler
Publish Data Packages to DataHub (CKAN)
2016-07-25T00:00:00+00:00
http://okfnlabs.org/blog/2016/07/25/publish-data-packages-to-datahub-ckan
<p>Back in March, I wrote about a CKAN extension for publishing and
exporting Data Packages<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. This extension, <code class="language-plaintext highlighter-rouge">datapackager</code>, has been
updated and is now <strong>live</strong> on our very own CKAN instance,
<strong>DataHub</strong>. DataHub users can now import and export Data Packages
via the CKAN UI and API. This post will show you how.</p>
<p><a href="https://datahub.io"><img src="/img/posts/datahub.png" alt="DataHub" /></a></p>
<h2 id="datahub-and-data-packages">DataHub and Data Packages</h2>
<p><a href="https://datahub.io">DataHub</a> is a free, powerful data management platform hosted by
Open Knowledge International. It is powered by <a href="http://ckan.org/">CKAN</a>, the
leading open-source data management system used by governments and
civic organizations
<a href="http://ckan.org/instances/#">around the world</a>—including
<a href="http://www.data.gov/">Data.gov</a> and
<a href="https://data.gov.uk/">data.gov.uk</a>. In this post, I describe how to
load “Data Packages” onto DataHub to take advantage of CKAN’s powerful
visualization and analytics features.</p>
<p>A <a href="http://frictionlessdata.io/guides/data-package">Data Package</a> is a coherent collection of data, metadata, and
other assets. Open Knowledge International is currently working on
<a href="http://frictionlessdata.io/">Frictionless Data</a>, a project aimed at creating an ecosystem for
<em>frictionless</em> data transport by defining the Data Package standard
and designing the tools and integrations that support them. Given its
ubiquity as a data publishing platform, CKAN support is an important
part of this strategy.</p>
<h2 id="importing-a-data-package-into-datahubio">Importing a Data Package into DataHub.io</h2>
<ol>
<li><strong>Register on DataHub</strong>: If you’re not already a DataHub user, you
will need to <a href="https://datahub.io/user/register">register for an account</a>. Once registered,
you will also need to <a href="https://discuss.okfn.org/t/creating-a-dataset-on-the-datahub/1627">request an “organization”</a> via our
forum. New datasets can only be loaded on DataHub if they are
associated with an organization.</li>
<li><strong>Create Your Data Package</strong>: If you don’t have your data in a Data
Package already, you can visit this
<a href="http://datapackagist.okfnlabs.org/">online Data Package creator</a> or
<a href="http://frictionlessdata.io/guides/creating-tabular-data-packages-in-python/">create a Data Package programmatically in Python</a>. If you
are just interested in trying out this demo, you should be able to
visit the <a href="https://github.com/datasets/">datasets organization</a> on GitHub and download any
of the repos as a zip file.</li>
<li><strong>Zip your Data Package</strong>: If you created your Data
Package in the previous step, create a new zip file from the Data
Package folder with the <code class="language-plaintext highlighter-rouge">datapackage.json</code> at the root. If you are on
a Unix-type machine, you can usually run <code class="language-plaintext highlighter-rouge">zip -r
my-datapackage-to-import.zip <data package directory></code> (a Python
equivalent is sketched after this list). <strong>Note</strong>:
make sure your packaged data, unzipped, is <strong>less than 100MB</strong>, as
this is the current size limit on DataHub.</li>
<li><strong>Import your Data Package</strong>: While signed in, click on
“Import Data Package” on the page of the
organization you created in Step 1, and upload the zipped Data Package
you created in the previous step.</li>
</ol>
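<p>As promised above, here is a minimal Python equivalent of the zip step; the directory and archive names are placeholders:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import shutil

# Archive the Data Package directory so that datapackage.json ends up
# at the root of the resulting zip file; paths are placeholders
shutil.make_archive(
    'my-datapackage-to-import',        # produces my-datapackage-to-import.zip
    'zip',
    root_dir='path/to/data-package',   # directory containing datapackage.json
)</code></pre></figure>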
<p>Once your Data Package has been successfully imported, you should be
able to use the dataset as you would any dataset on DataHub. This
includes adding or editing any of your dataset’s metadata, or
accessing the dataset using the <a href="http://docs.ckan.org/en/latest/api/">CKAN API</a>.</p>
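<p>For example, here is a minimal sketch of fetching the imported dataset’s metadata through the API using the third-party <code class="language-plaintext highlighter-rouge">ckanapi</code> package (<code class="language-plaintext highlighter-rouge">pip install ckanapi</code>); the dataset name is a placeholder:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">from ckanapi import RemoteCKAN

datahub = RemoteCKAN('https://datahub.io')

# package_show returns the dataset's metadata as a dictionary
dataset = datahub.action.package_show(id='my-imported-datapackage')
print(dataset['title'])
for resource in dataset['resources']:
    print(resource['name'], resource['url'])</code></pre></figure>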
<h3 id="screencast">Screencast</h3>
<p>This screencast walks through the import steps outlined above.</p>
<p><img src="https://github.com/ckan/ckanext-datapackager/raw/master/doc/images/ckanext-datapackager-import-demo.gif" alt="Screencast" /></p>
<h3 id="exporting-a-data-package-from-datahubio">Exporting a Data Package from DataHub.io</h3>
<p>Exporting a Data Package from DataHub is even easier. Just navigate
to the dataset you’d like to export, click on “Download Data Package”,
and a <code class="language-plaintext highlighter-rouge">datapackage.json</code> file will be downloaded to your computer.
The JSON file will contain the Data Package representation of the
metadata stored on DataHub as well as links to the resources stored on
DataHub.</p>
<h3 id="ckan-data-packager-and-other-extensions">CKAN Data Packager and Other Extensions</h3>
<p>For information on importing and exporting data via the
<a href="http://docs.ckan.org/en/latest/api/">CKAN API</a>, or if you are interested in adding <code class="language-plaintext highlighter-rouge">datapackager</code> to
your own CKAN instance, you can read more in the extension
<a href="https://github.com/ckan/ckanext-datapackager">repository</a>.</p>
<p>Of course, CKAN is not the only data repository software we are
looking to support. A major aim of Frictionless Data is to create
integrations with the many different types of tools and platforms
people already use for working with data. Visit our
<a href="http://frictionlessdata.io/user-stories/">User Stories</a> page to learn about the kinds of use cases and data
workflows we’re looking to support. Let us know how you store your
data and what you would like to see next!</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p><a href="/blog/2016/03/11/frictionless-data-transport-in-python.html">Frictionless Data Transport in Python: 11 March 2016</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Dan Fowler
Comma Chameleon at csv,conf,v2
2016-07-18T00:00:00+00:00
http://okfnlabs.org/blog/2016/07/18/comma-chameleon
<p>Having co-organized csv,conf,v2 this past May, a few of us from Open
Knowledge International had the awesome opportunity to travel to
Berlin and sit in on a range of fascinating talks on the current
state-of-the-art on wrangling messy data. One such talk was given by
Stuart Harrison of the <strong>Open Data Institute</strong> (ODI), who presented on
a tool he is developing called Comma Chameleon. Comma Chameleon is a
desktop CSV editor with <em>validation magic</em> 🌟.</p>
<p><img src="/img/posts/comma-chameleon-small.jpg" alt="Stuart Harrison" /></p>
<h2 id="comma-chameleon">Comma Chameleon</h2>
<p>CSV is a great, simple format that is easy to publish and use by
technical and non-technical users alike. But CSV can also be <em>abused</em>
(see <a href="http://okfnlabs.org/bad-data/">Bad Data</a> for examples), leading Stuart and his team
at ODI Labs—the R&D team at ODI—to develop <a href="https://github.com/theodi/comma-chameleon">Comma Chameleon</a>.
Comma Chameleon is a desktop tool that uses <a href="http://csvlint.io">CSVLint</a> under
the hood to validate CSVs for their structural integrity as well as
their adherence to a schema specified in <a href="http://frictionlessdata.io/guides/table-schema/">JSON Table Schema</a>
(<a href="http://w3c.github.io/csvw/">CSV on the Web</a> support in progress).</p>
<p><img src="/img/posts/comma-chameleon-period-table.png" alt="Comma Chameleon" /></p>
<p>The point of Comma Chameleon is to give non-technical users the
ability to create and edit CSV files in a more appropriate tool than
Excel, software designed for manipulating spreadsheets first and
foremost. The app allows users to fix errors in their data in place
<em>before</em> publishing using the handy validation functions described
above. Comma Chameleon also allows users to add useful metadata—for
instance, a title, description, and a license—and export it all as a
zipped <a href="http://frictionlessdata.io/guides/data-package/">Data Package</a>.</p>
<h2 id="frictionless-data">Frictionless Data</h2>
<p>Comma Chameleon—built with <a href="http://electron.atom.io/">Electron</a>—is an excellent
example of the kind of tool that can provide the foundation for real
advances in data quality thanks to adherence to a few simple, open
standards. At Open Knowledge International, we are currently working
hard on <a href="http://frictionlessdata.io/">Frictionless Data</a>, an initiative to define and promote
just such tools and standards. We are <em>delighted</em> to be partnering
with the ODI in the coming months on this and other initiatives around
Frictionless Data.</p>
<ul>
<li>Download Comma Chameleon: <a href="https://github.com/theodi/comma-chameleon">https://github.com/theodi/comma-chameleon</a></li>
<li>Follow ODILabs on Twitter: <a href="https://twitter.com/odilabs">https://twitter.com/odilabs</a></li>
<li>Follow Stuart Harrison on Twitter: <a href="https://twitter.com/pezholio">https://twitter.com/pezholio</a></li>
<li>Follow Open Knowledge Labs on Twitter: <a href="https://twitter.com/okfnlabs">https://twitter.com/okfnlabs</a></li>
<li>See the full range of speakers from csv,conf,v2: <a href="http://csvconf.com">http://csvconf.com</a></li>
</ul>
<p>See Stuart’s full talk:</p>
<iframe width="576px" height="360px" src="https://www.youtube.com/embed/wIIw0cTeUG0" frameborder="0" allowfullscreen=""></iframe>
Dan Fowler
Using Data Packages with R
2016-07-14T00:00:00+00:00
http://okfnlabs.org/blog/2016/07/14/using-data-packages-with-r
<p>R is a popular open-source programming language and platform for data
analysis. <em>Frictionless Data</em> is an Open Knowledge International
project aimed at making it easy to publish and load <em>high-quality
data</em> into tools like R through the creation of a standard wrapper
format called the Data Package.</p>
<p>In this post, I will demonstrate an in-progress version of
<strong>datapkg</strong>, an R package that makes it easy to load Data Packages
into your R environment by automating otherwise manual import steps
using information provided in the Data Package descriptor file
<code class="language-plaintext highlighter-rouge">datapackage.json</code>. datapkg was developed through a collaboration
between <a href="https://okfn.org/">Open Knowledge International</a> and <a href="https://ropensci.org/">rOpenSci</a>,
an organization that specializes in creating open-source tools using R
for advancing open science.</p>
<h2 id="loading-tabular-data-in-r">Loading Tabular Data in R</h2>
<p><img src="/img/posts/rlogo.png" alt="R Logo" /></p>
<p>R’s core strengths as a data analysis framework lie in its support for
a wide array of statistical tests, its straightforward, powerful
options for static visualization, and the ease with which its
functionality can be extended. For these reasons, R enjoys a vibrant
online community who contribute daily to thousands of packages on
<a href="https://cran.r-project.org/">CRAN</a>. For this post, we will avoid going deep into what makes
R so powerful, and instead focus on the typical first step in any data
analysis project: loading source data. <em>This post assumes you have a
fairly basic understanding of R and a working R environment on your
machine.</em></p>
<p>When loading tabular data from a file into an R environment, it is
common to use the functions <code class="language-plaintext highlighter-rouge">read.csv</code> or <code class="language-plaintext highlighter-rouge">read.delim</code>. These are
wrappers for the more generic <code class="language-plaintext highlighter-rouge">read.table</code> function that provide sane
defaults for reading from commonly formatted
<a href="http://frictionlessdata.io/guides/csv/">CSV</a> and tab-delimited files,
respectively. These commands read data into what’s called a “data
frame”, R’s basic data structure for storing data tables. In this
structure, each column (“vector”) in the original tabular data file
may be assigned a different type (e.g. string, integer, date).</p>
<p>As a simple example, let’s load a CSV file containing the
<a href="https://en.wikipedia.org/wiki/VIX">CBOE Volatility Index</a> using <code class="language-plaintext highlighter-rouge">read.csv()</code>. This dataset can be
found on our <a href="https://github.com/frictionlessdata/example-data-packages">example Data Packages repo</a> in
the subdirectory “finance-vix”. Once downloaded, we can set R’s
working directory to where the data is stored and take a peek at the
files within its <code class="language-plaintext highlighter-rouge">data</code> subdirectory:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">setwd</span><span class="p">(</span><span class="s1">'/Users/dan/Downloads/example-data-packages-master/finance-vix'</span><span class="p">)</span><span class="w">
</span><span class="n">list.files</span><span class="p">(</span><span class="s2">"data"</span><span class="p">)</span></code></pre></figure>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>'vix-daily.csv'
</code></pre></div></div>
<p>We can read this single CSV, <code class="language-plaintext highlighter-rouge">vix-daily</code>, using R’s <code class="language-plaintext highlighter-rouge">read.csv()</code>
function and assign its output to a data frame called
<code class="language-plaintext highlighter-rouge">volatility_raw</code>. Afterwards, we can get a sample of the data by
viewing the first few rows of the file using the <code class="language-plaintext highlighter-rouge">head()</code> function.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">volatility_raw</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read.csv</span><span class="p">(</span><span class="s2">"data/vix-daily.csv"</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">volatility_raw</span><span class="p">)</span></code></pre></figure>
<table>
<thead><tr><th></th><th scope="col">Date</th><th scope="col">VIXOpen</th><th scope="col">VIXHigh</th><th scope="col">VIXLow</th><th scope="col">VIXClose</th></tr></thead>
<tbody>
<tr><th scope="row">1</th><td>1/2/2004</td><td>17.96 </td><td>18.68 </td><td>17.54 </td><td>18.22 </td></tr>
<tr><th scope="row">2</th><td>1/5/2004</td><td>18.45 </td><td>18.49 </td><td>17.44 </td><td>17.49 </td></tr>
<tr><th scope="row">3</th><td>1/6/2004</td><td>17.66 </td><td>17.67 </td><td>16.19 </td><td>16.73 </td></tr>
<tr><th scope="row">4</th><td>1/7/2004</td><td>16.72 </td><td>16.75 </td><td>15.5 </td><td>15.5 </td></tr>
<tr><th scope="row">5</th><td>1/8/2004</td><td>15.42 </td><td>15.68 </td><td>15.32 </td><td>15.61 </td></tr>
<tr><th scope="row">6</th><td>1/9/2004</td><td>16.15 </td><td>16.88 </td><td>15.57 </td><td>16.75 </td></tr>
</tbody>
</table>
<p>In the process of loading this data into a data frame, R made an
educated guess as to the types of data found in each column. We can
display those types by looking at the “structure” of an R object
using the <code class="language-plaintext highlighter-rouge">str</code> command.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">str</span><span class="p">(</span><span class="n">volatility_raw</span><span class="p">)</span></code></pre></figure>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>'data.frame': 3122 obs. of 5 variables:
$ Date : Factor w/ 3122 levels "01/02/2014","01/02/2015",..: 543 644 652 659 666 672 493 501 508 515 ...
$ VIXOpen : num 18 18.4 17.7 16.7 15.4 ...
$ VIXHigh : num 18.7 18.5 17.7 16.8 15.7 ...
$ VIXLow : num 17.5 17.4 16.2 15.5 15.3 ...
$ VIXClose: num 18.2 17.5 16.7 15.5 15.6 ...
</code></pre></div></div>
<p>We can see that while R has correctly guessed the types of “VIXOpen”,
“VIXHigh”, “VIXLow”, and “VIXClose” to be <code class="language-plaintext highlighter-rouge">num</code>, it has incorrectly
guessed the type of the “Date” to be <code class="language-plaintext highlighter-rouge">Factor</code> when R has a much more
appropriate type for the kind of data in this column called,
predictably, <code class="language-plaintext highlighter-rouge">Date</code>. This is a problem easily demonstrable by
attempting to plot the data.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">volatility_raw</span><span class="o">$</span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="n">volatility_raw</span><span class="o">$</span><span class="n">VIXOpen</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s1">'l'</span><span class="p">)</span></code></pre></figure>
<p><img src="/img/posts/r-vix-bad-type.png" alt="Bad Type" /></p>
<p>What should be a steadily increasing Date on the X axis is, instead,
out of order because the Date column has not been assigned its
correct type.  In this very simple case, there is a straightforward
fix: manually re-assign the Date column (in our data frame
represented as <code class="language-plaintext highlighter-rouge">volatility_raw$Date</code>) to type <code class="language-plaintext highlighter-rouge">Date</code>, passing the
format <code class="language-plaintext highlighter-rouge">%m/%d/%Y</code>, which we found out by previewing the data.
After this, we can revisit its structure using the <code class="language-plaintext highlighter-rouge">str()</code> command.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">volatility_raw</span><span class="o">$</span><span class="n">Date</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.Date</span><span class="p">(</span><span class="n">volatility_raw</span><span class="o">$</span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="s2">"%m/%d/%Y"</span><span class="p">)</span><span class="w">
</span><span class="n">str</span><span class="p">(</span><span class="n">volatility_raw</span><span class="p">)</span></code></pre></figure>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>'data.frame': 3122 obs. of 5 variables:
$ Date : Date, format: "2004-01-02" "2004-01-05" ...
$ VIXOpen : num 18 18.4 17.7 16.7 15.4 ...
$ VIXHigh : num 18.7 18.5 17.7 16.8 15.7 ...
$ VIXLow : num 17.5 17.4 16.2 15.5 15.3 ...
$ VIXClose: num 18.2 17.5 16.7 15.5 15.6 ...
</code></pre></div></div>
<p>We have successfully given the Date column a <code class="language-plaintext highlighter-rouge">Date</code> type, and we
should be able to run the same <code class="language-plaintext highlighter-rouge">plot()</code> function above and get a
better result. While this is a good solution for this single dataset
with a single incorrectly guessed column, it doesn’t scale well to
multiple incorrectly guessed columns across multiple datasets. In
addition, it only represents one type of manual task to be performed
on a new set of data. We designed the Data Package format to
obviate this and other kinds of tedious “data wrangling” tasks. In
the next section, we will perform the same task using the
<code class="language-plaintext highlighter-rouge">datapkg</code> library.</p>
<h2 id="loading-tabular-data-packages-in-r">Loading Tabular Data Packages in R</h2>
<p>A Data Package is a <a href="http://frictionlessdata.io/data-packages/">specification</a> for creating a
“<a href="http://frictionlessdata.io/about/#data-containerization">container</a>” for transporting data by saving useful
metadata in a specially formatted file. This file is called
<code class="language-plaintext highlighter-rouge">datapackage.json</code>, and it is stored in the root of a directory
containing a given dataset. When loading a Data Package,
<a href="https://github.com/frictionlessdata/datapackage-r">datapkg</a>—the new R Data Package library developed by
<a href="https://ropensci.org/">rOpenSci</a>—reads this extra metadata in order to
conveniently load high quality, well formatted data into your R
environment.</p>
<h3 id="installing-datapkg">Installing datapkg</h3>
<p><em>Note: the Data Package library for R is still in testing and subject
to change. For this reason, it is not yet on CRAN and must be
installed from its <a href="https://github.com/frictionlessdata/datapackage-r">GitHub repository</a> using the
<a href="https://github.com/hadley/devtools">devtools</a> package.</em></p>
<p>To install, start your R environment and run the following commands:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">install.packages</span><span class="p">(</span><span class="s2">"devtools"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">devtools</span><span class="p">)</span><span class="w">
</span><span class="n">install_github</span><span class="p">(</span><span class="s2">"hadley/readr"</span><span class="p">)</span><span class="w">
</span><span class="n">install_github</span><span class="p">(</span><span class="s2">"ropenscilabs/jsonvalidate"</span><span class="p">)</span><span class="w">
</span><span class="n">install_github</span><span class="p">(</span><span class="s2">"frictionlessdata/datapackage-r"</span><span class="p">)</span></code></pre></figure>
<h3 id="reading-data">Reading Data</h3>
<p>Revisiting our data directory, we can examine the files in the root
using the <code class="language-plaintext highlighter-rouge">list.files()</code> function:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>'data' 'datapackage.json'
</code></pre></div></div>
<p>The presence of the <code class="language-plaintext highlighter-rouge">datapackage.json</code> file indicates that our current R
working directory points to a Data Package, so we can load the
<code class="language-plaintext highlighter-rouge">datapkg</code> library and use the <code class="language-plaintext highlighter-rouge">datapkg_read()</code> function to read our
Data Package (note: we can also pass a path or URL to this function).</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">datapkg</span><span class="p">)</span><span class="w">
</span><span class="n">volatility</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">datapkg_read</span><span class="p">()</span></code></pre></figure>
<p>The <code class="language-plaintext highlighter-rouge">datapkg_read()</code> function reads not only the data in the dataset,
but also the metadata stored with it. This metadata includes high
level information like the author, source, and license of the dataset.
We can inspect this information by reading various variables stored on
this object. For instance, to get a fuller, human-readable title, we
can access <code class="language-plaintext highlighter-rouge">volatility$title</code> or, if the Data Package has a “homepage”
variable set, we can access it using <code class="language-plaintext highlighter-rouge">volatility$homepage</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>'VIX - CBOE Volatility Index'
'http://www.cboe.com/micro/VIX/'
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">datapkg_read()</code> also uses <em>schema</em> information stored in the
<code class="language-plaintext highlighter-rouge">datapackage.json</code> to facilitate the loading of data. As shown above,
one misstep we encountered when loading a new dataset into R was
neglecting to correct an incorrectly guessed column type. What the
Data Package format provides is a simple, standard way to store that
information with a dataset to automate this and other steps. The
following snippet shows how the <code class="language-plaintext highlighter-rouge">datapackage.json</code> describes this
information:</p>
<figure class="highlight"><pre><code class="language-js" data-lang="js"> <span class="dl">"</span><span class="s2">schema</span><span class="dl">"</span><span class="p">:</span> <span class="p">{</span>
<span class="dl">"</span><span class="s2">fields</span><span class="dl">"</span><span class="p">:</span> <span class="p">[</span>
<span class="p">{</span>
<span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">Date</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">type</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">date</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">format</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">fmt:%m/%d/%Y</span><span class="dl">"</span>
<span class="p">},</span>
<span class="p">{</span>
<span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">VIXOpen</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">type</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">number</span><span class="dl">"</span>
<span class="p">},</span>
<span class="p">{</span>
<span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">VIXHigh</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">type</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">number</span><span class="dl">"</span>
<span class="p">},</span>
<span class="p">{</span>
<span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">VIXLow</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">type</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">number</span><span class="dl">"</span>
<span class="p">},</span>
<span class="p">{</span>
<span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">VIXClose</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">type</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">number</span><span class="dl">"</span>
<span class="p">}</span>
<span class="p">]</span>
<span class="p">}</span></code></pre></figure>
<p>As above, we can verify that <code class="language-plaintext highlighter-rouge">datapkg_read()</code> used this information to
construct its data frame by calling the <code class="language-plaintext highlighter-rouge">str()</code> function. The <code class="language-plaintext highlighter-rouge">data</code>
variable on the <code class="language-plaintext highlighter-rouge">volatility</code> object created by <code class="language-plaintext highlighter-rouge">datapkg_read()</code> points
to a list of files (“resources”) in the dataset; <code class="language-plaintext highlighter-rouge">vix-daily</code> is the
name of the resource—expressed as a data frame—we want.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">str</span><span class="p">(</span><span class="n">volatility</span><span class="o">$</span><span class="n">data</span><span class="o">$</span><span class="n">`vix-daily`</span><span class="p">)</span></code></pre></figure>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 3122 obs. of 5 variables:
$ Date : Date, format: "2004-01-02" "2004-01-05" ...
$ VIXOpen : num 18 18.4 17.7 16.7 15.4 ...
$ VIXHigh : num 18.7 18.5 17.7 16.8 15.7 ...
$ VIXLow : num 17.5 17.4 16.2 15.5 15.3 ...
$ VIXClose: num 18.2 17.5 16.7 15.5 15.6 ...
</code></pre></div></div>
<p>The output shows that the Date column has been assigned the correct
type, so we can immediately plot the data.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">vix.daily</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">volatility</span><span class="o">$</span><span class="n">data</span><span class="o">$</span><span class="n">`vix-daily`</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">vix.daily</span><span class="o">$</span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="n">vix.daily</span><span class="o">$</span><span class="n">VIXOpen</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s1">'l'</span><span class="p">)</span></code></pre></figure>
<p><img src="/img/posts/r-vix-good-type.png" alt="Good Type" /></p>
<h2 id="going-forward">Going Forward</h2>
<p>This has been a very small example of the basic functionality of the R
library. This software is still in testing, so if you are an R user
and would like to use Data Packages to help manage your data in R,
please let us know. You can leave a comment here on the
<a href="https://discuss.okfn.org/t/using-data-packages-with-r/3271">forum</a>.</p>
<p>To see the code used in this post, visit its <a href="https://github.com/okfn/okfn.github.com/blob/master/resources/using-data-packages-with-r.ipynb">Jupyter Notebook</a>.</p>
Dan Fowler
'Continuous Processing' with Data Packages
2016-07-13T00:00:00+00:00
http://okfnlabs.org/blog/2016/07/13/continuous-processing-with-data-packages
<p>When storing your data in Data Packages, it is considered good
practice to store scripts for updating, processing, or analyzing your
data in a directory called <code class="language-plaintext highlighter-rouge">scripts/</code> placed at the root of your Data
Package. I’ve written a tutorial to show how to achieve <strong>continuous
processing</strong>: <em>that is, the delivery of updated data every time
something changes, either in the source data or the processing code</em>.</p>
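<p>As a concrete (and purely illustrative) example, a minimal <code class="language-plaintext highlighter-rouge">scripts/update.py</code> could look like the sketch below; the source URL and output path are made up, and a real script would usually also clean or reshape the data:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import urllib.request

# Hypothetical upstream source for the packaged data
SOURCE_URL = 'http://example.com/source.csv'

def main():
    # Download the source and refresh the resource under data/
    with urllib.request.urlopen(SOURCE_URL) as response:
        raw = response.read()
    with open('data/data.csv', 'wb') as out:
        out.write(raw)

if __name__ == '__main__':
    main()</code></pre></figure>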
<p>Depending on the timeliness of your dataset, you’ll want to
periodically run update scripts stored in your <code class="language-plaintext highlighter-rouge">scripts/</code> directory,
but what if you don’t want to run the update script of your Data
Package by yourself? Instead, why not let
<a href="https://travis-ci.org">Travis CI</a> do it for you?</p>
<p>If your Data Package already…</p>
<ol>
<li>has scripts that download the source data, cleans it or reformats it into a nice interoperable format</li>
<li><a href="http://okfnlabs.org/blog/2016/03/25/make-vs-tuttle.html">relies on <code class="language-plaintext highlighter-rouge">make</code></a> to run the scripts</li>
<li><a href="http://okfnlabs.org/blog/2016/05/17/automated-data-validation.html">has tests</a> to validate the data</li>
</ol>
<p>…then you’re ready to go to the next level of automation! Here’s a
<a href="https://github.com/lexman/ex-continuous-processing">tutorial</a> to
enable regular updates of the data with Travis CI.</p>
<p>This approach is very well suited to small data (less than 300 MB) and
short processing steps (i.e. less than 10 minutes), which makes the
workflow perfect for Data Packages!</p>
<p>Read the
<a href="https://github.com/lexman/ex-continuous-processing">tutorial</a> to find
out more!</p>
Alexandre Bonnasseau
Automated Data Validation with Data Packages
2016-05-17T00:00:00+00:00
http://okfnlabs.org/blog/2016/05/17/automated-data-validation
<p>Much of the open data on the web is published in CSV or Excel format.
Unfortunately, it is often messy and can require significant
manipulation to actually be usable. In this post, I walk through a
workflow for automating data validation on every update to a shared
repository inspired by existing practices in software development and
enabled by <em>Frictionless Data</em> standards and tooling.</p>
<p>Software projects have long benefited from Continuous Integration
services like Travis CI and others for ensuring and maintaining
<strong>code</strong> quality. Continuous integration is a process where all tests
are automatically run and a report is generated on every update
(“commit”) to a project’s shared repository. This allows developers to
find and resolve errors quickly and reliably. In addition, by
displaying the “build status”, those outside the project can
clearly see the state of the project’s test compliance.</p>
<p><img src="/img/posts/build-passing.png" alt="Build Passing" /></p>
<p>As with software, datasets are often collaboratively created, edited,
and updated over time, sometimes introducing subtle (or not so subtle)
structural and schematic errors (see
<a href="http://okfnlabs.org/bad-data/">Bad Data</a> for examples). Much of the
“friction” in using the data comes from the time and effort needed to
identify and address these errors before analyzing in a given tool.
Automatically flagging <strong>data</strong> quality issues at upload time in a
repository can go a long way in making data more useful and have
significant follow-on effects in the data ecosystem, both open and
closed.</p>
<h2 id="continuous-data-integration">Continuous Data Integration</h2>
<p>As the <a href="http://frictionlessdata.io/">Frictionless Data</a> tooling and
standards ecosystem continues to grow, we now have the elements
necessary to provide data managers with the same type of service for
tabular data (e.g. Excel and CSV). In less than one hour, a few of us
at Open Knowledge booted a small demo to show what <em>continuous data
integration</em> could look like. On each commit to
<a href="https://github.com/frictionlessdata/ex-continuous-data-integration">our example repository</a>,
a set of validation tests are run on the data, raising an exception if
the data is invalid. If a user adds “bad” data, the “build” fails and
issues a report indicating what went wrong.</p>
<p><a href="https://github.com/frictionlessdata/ex-continuous-data-integration"><img src="/img/posts/data_ci_travis.png" alt="Data CI" /></a></p>
<p>As an example, the following CSV has a few issues with its values. In
the schema defined in the <code class="language-plaintext highlighter-rouge">datapackage.json</code> file below (i.e. the
object that the <code class="language-plaintext highlighter-rouge">schema</code> key points to), we set the “Number” column
type to <code class="language-plaintext highlighter-rouge">number</code> and the “Date” column type to <code class="language-plaintext highlighter-rouge">date</code>. However, the CSV
contains invalid values for those types: “x23.5” and “2015-02”,
respectively.</p>
<h3 id="csv">CSV</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Date,Country,Number
2015-01-01,3,20.3
2015-02-01,United States,23.5
2015-02,United States,x23.5
</code></pre></div></div>
<h3 id="datapackagejson">datapackage.json</h3>
<figure class="highlight"><pre><code class="language-json" data-lang="json"><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"fd-continuous-data-integration"</span><span class="p">,</span><span class="w">
</span><span class="nl">"title"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
</span><span class="nl">"resources"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"data"</span><span class="p">,</span><span class="w">
</span><span class="nl">"path"</span><span class="p">:</span><span class="w"> </span><span class="s2">"data/data.csv"</span><span class="p">,</span><span class="w">
</span><span class="nl">"format"</span><span class="p">:</span><span class="w"> </span><span class="s2">"csv"</span><span class="p">,</span><span class="w">
</span><span class="nl">"mediatype"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text/csv"</span><span class="p">,</span><span class="w">
</span><span class="nl">"schema"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"fields"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Date"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"date"</span><span class="p">,</span><span class="w">
</span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Country"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"string"</span><span class="p">,</span><span class="w">
</span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Number"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"number"</span><span class="p">,</span><span class="w">
</span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>When we try to add this invalid data to the repository, the following
report is generated:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+----------------+------------------------------------------------------------+
| result_name | result_message |
+================+============================================================+
| Incorrect Type | The value "2015-02" in column "Date" is not a valid Date. |
+----------------+------------------------------------------------------------+
| Incorrect Type | The value "x23.5" in column "Number" is not a valid Number.|
+----------------+------------------------------------------------------------+
</code></pre></div></div>
<h2 id="how-it-works">How It Works</h2>
<p>The Data Package descriptor file,
<a href="http://dataprotocols.org/data-packages/">datapackage.json</a>, provides
both high-level metadata as well as a
<a href="http://frictionlessdata.io/guides/table-schema/">schema</a> for
tabular data. We use the Python library
<a href="http://github.com/frictionlessdata/datapackage-py">datapackage-py</a> to
create a high-level model of the Data Package that allows us to
inspect and work with the data inside. The real work is accomplished using
<a href="http://goodtables.okfnlabs.org/">GoodTables</a>.</p>
<p>We
<a href="http://okfnlabs.org/blog/2015/03/06/goodtables-web-service.html">previously blogged about using Good Tables</a>
to validate our tabular data. On every update, two small test
functions use the <code class="language-plaintext highlighter-rouge">datapackage.json</code> to locate and validate the tabular
data contained therein, checking both its structure and its adherence to a
<a href="http://frictionlessdata.io/guides/table-schema/">schema</a>.
Here’s the first:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">test_schema</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="c1"># We heart CSV :)
</span>
<span class="n">data_format</span> <span class="o">=</span> <span class="s">'csv'</span>
<span class="c1"># Load our Data Package path and schema
</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">dp</span><span class="p">.</span><span class="n">metadata</span><span class="p">[</span><span class="s">'resources'</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="s">'path'</span><span class="p">]</span>
<span class="n">schema</span> <span class="o">=</span> <span class="n">dp</span><span class="p">.</span><span class="n">metadata</span><span class="p">[</span><span class="s">'resources'</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="s">'schema'</span><span class="p">]</span>
<span class="c1"># We use the "schema" processor to test the data against its
</span> <span class="c1"># expected schema. There is also a "structure" processor.
</span>
<span class="n">processor</span> <span class="o">=</span> <span class="n">processors</span><span class="p">.</span><span class="n">SchemaProcessor</span><span class="p">(</span><span class="n">schema</span><span class="o">=</span><span class="n">schema</span><span class="p">,</span>
<span class="nb">format</span><span class="o">=</span><span class="n">data_format</span><span class="p">,</span>
<span class="n">row_limit</span><span class="o">=</span><span class="n">row_limit</span><span class="p">,</span>
<span class="n">report_limit</span><span class="o">=</span><span class="n">report_limit</span><span class="p">)</span>
<span class="n">valid</span><span class="p">,</span> <span class="n">report</span><span class="p">,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">processor</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="c1"># Various formatting options for our report follow.
</span>
<span class="n">output_format</span> <span class="o">=</span> <span class="s">'txt'</span>
<span class="n">exclude</span> <span class="o">=</span> <span class="p">[</span><span class="s">'result_context'</span><span class="p">,</span> <span class="s">'processor'</span><span class="p">,</span> <span class="s">'row_name'</span><span class="p">,</span>
<span class="s">'result_category'</span><span class="p">,</span> <span class="s">'column_name'</span><span class="p">,</span> <span class="s">'result_id'</span><span class="p">,</span>
<span class="s">'result_level'</span><span class="p">]</span>
<span class="c1"># And here's our report!
</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">report</span><span class="p">.</span><span class="n">generate</span><span class="p">(</span><span class="n">output_format</span><span class="p">,</span> <span class="n">exclude</span><span class="o">=</span><span class="n">exclude</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">assertTrue</span><span class="p">(</span><span class="n">valid</span><span class="p">,</span> <span class="n">out</span><span class="p">)</span></code></pre></figure>
<p>For more information, read the guide on
<a href="http://frictionlessdata.io/">frictionlessdata.io</a> about
<a href="http://frictionlessdata.io/guides/validating-data/">validating data</a>.
Behind the scenes, this is just a normal Travis CI configuration (see
the
<a href="https://github.com/frictionlessdata/ex-continuous-data-integration/blob/master/.travis.yml">.travis.yml</a>).</p>
<h2 id="try-it-yourself">Try It Yourself</h2>
<p>Our example relies on
<a href="http://blog.okfn.org/2013/07/02/git-and-github-for-data/">GitHub as a data storage mechanism</a>
and <a href="http://travis-ci.org/">Travis CI</a> as a host for the actual
validation. However, this approach is broadly applicable to any
storage and processing backend with some extra tweaking (e.g. using
AWS <a href="https://aws.amazon.com/lambda/">Lambda</a> and
<a href="https://aws.amazon.com/s3/">S3</a>).</p>
<p>Check out the
<a href="https://github.com/frictionlessdata/ex-continuous-data-integration">ex-continuous-data-integration</a>
repository on our
<a href="https://github.com/frictionlessdata">frictionlessdata</a> organization
on GitHub to see how you can try this out with your own data! Let us
know how it works on our <a href="/contact/">chat channel</a>.</p>
Dan Fowler
Tools for Extracting Data and Text from PDFs - A Review
2016-04-19T00:00:00+00:00
http://okfnlabs.org/blog/2016/04/19/pdf-tools-extract-text-and-data-from-pdfs
<p>Extracting data from PDFs remains, unfortunately, a common data wrangling task. This post reviews various tools and services for doing this with a focus on free (and preferably open source) options.</p>
<p>The tools we can consider fall into three categories:</p>
<ul>
<li>Extracting text from PDF</li>
<li>Extracting tables from PDF</li>
<li>Extracting data (text or otherwise) from PDFs where the content is not text but is images (for example, scans)</li>
</ul>
<p>The last case is really a situation for OCR (optical character recognition), so we’re going to ignore it here. We may do a follow-up post on this.</p>
<p><img src="/img/posts/pdf-tools-climate-treaty-paris-pdf.png" alt="Climate Treaty PDF" style="width: 70%; margin: auto; display: block;" /></p>
<div style="text-align: center;">
<p><em>The Paris Climate Agreement text was <a href="http://unfccc.int/resource/docs/2015/cop21/eng/l09r01.pdf">published as PDF</a>. Some of the tools described here – plus the usual blood, sweat and tears – were used to turn it back into usable HTML for our <a href="http://cop21.okfnlabs.org/">Paris COP21 Climate Treaty Texts site</a></em></p>
</div>
<p><img src="/img/posts/pdf-tools-senate-report-pdf.png" alt="Example PDF" style="width: 70%; margin: auto; display: block;" /></p>
<div style="text-align: center;">
<p><em>A classic example of an important government report published as PDF only</em></p>
</div>
<h2 id="generic-pdf-to-text">Generic (PDF to text)</h2>
<ul>
<li><a href="http://www.unixuser.org/~euske/python/pdfminer/">PDFMiner</a> - PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes than text analysis.
<ul>
<li>Pure python</li>
<li>In our trials, PDFMiner has performed excellently, and we rate it as one of the best tools out there (see the short sketch after this list).</li>
</ul>
</li>
<li><a href="http://pdftohtml.sourceforge.net/">pdftohtml</a> - pdftohtml is a utility which converts PDF files into HTML and XML formats. Based on xpdf. One of the better for tables but have found PDFMiner somewhat better for a while. Command-line Linux</li>
<li><a href="http://pdftoxml.sourceforge.net/">pdftoxml</a> - command line utility to convert PDF to XML built on poppler.</li>
<li><a href="http://documentcloud.github.io/docsplit/">docsplit</a> - part of DocumentCloud. Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages…)</li>
<li><a href="https://github.com/zejn/pypdf2xml">pypdf2xml</a> - convert PDF to XML. Built on pdfminer. Started as an alternative to poppler’s pdftoxml, which didn’t properly decode CID Type2 fonts in PDFs.</li>
<li><a href="http://coolwanglu.github.io/pdf2htmlEX/">pdf2htmlEX</a> - Convert PDF to HTML without losing text or format. C++. Fast. Primarily focused on producing HTML that exactly resembles the original PDF. Limited use for straightforward text extraction as it generates css-heavy HTML that replicates the exact look of a PDF document.</li>
<li><a href="http://mozilla.github.io/pdf.js/">pdf.js</a> - you probably want a fork like <a href="https://github.com/modesty/pdf2json">pdf2json</a> or <a href="https://github.com/jviereck/node-pdfreader">node-pdfreader</a> that integrates this better with node. Not tried this on tables though …
<ul>
<li>Max Ogden has this list of Node libraries and tools for working with PDFs: <a href="https://gist.github.com/maxogden/5842859">https://gist.github.com/maxogden/5842859</a></li>
<li>Here’s a gist showing how to use pdf2json: <a href="https://gist.github.com/rgrp/5944247">https://gist.github.com/rgrp/5944247</a></li>
</ul>
</li>
<li><a href="https://tika.apache.org/">Apache Tika</a> - Java library for extracting metadata and content from all types of document types including PDF.</li>
<li><a href="https://pdfbox.apache.org/">Apache PDFBox</a> - Java library specifically for creating, manipulating and getting content from PDFs.</li>
</ul>
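<p>To give a flavour of the generic text-extraction workflow (see the PDFMiner note above), here is a minimal sketch that shells out to <code class="language-plaintext highlighter-rouge">pdf2txt.py</code>, the converter bundled with PDFMiner; we assume it is on your PATH, and <code class="language-plaintext highlighter-rouge">report.pdf</code> is a stand-in for your own file:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import subprocess

# pdf2txt.py ships with PDFMiner; -t picks the output format
# (text, html or xml). 'report.pdf' is a placeholder input file.
text = subprocess.check_output(['pdf2txt.py', '-t', 'text', 'report.pdf'])

with open('report.txt', 'wb') as out:
    out.write(text)</code></pre></figure>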
<h3 id="tables-from-pdf">Tables from PDF</h3>
<ul>
<li><a href="http://tabula.technology/">Tabula</a> - open-source, designed specifically for tabular data. Now easy to install. Ruby-based.</li>
<li><a href="https://github.com/okfn/pdftables">https://github.com/okfn/pdftables</a> - open-source. Created by Scraperwiki but now closed-source and powering <a href="https://pdftables.com/">PDFTables</a> so here is a fork.</li>
<li><a href="http://pdftohtml.sourceforge.net/">pdftohtml</a> - one of the better for tables but have not used for a while</li>
<li><a href="https://github.com/liberit/scraptils/blob/master/scraptils/tools/pdf2csv.py">https://github.com/liberit/scraptils/blob/master/scraptils/tools/pdf2csv.py</a> AGPLv3+, python, scraptils has other useful tools as well, pdf2csv needs pdfminer==20110515</li>
<li>Using scraperwiki + pdftoxml - see this recent tutorial <a href="http://schoolofdata.org/2013/06/18/get-started-with-scraping-extracting-simple-tables-from-pdf-documents/">Get Started With Scraping – Extracting Simple Tables from PDF Documents</a></li>
</ul>
<h3 id="existing-open-services">Existing open services</h3>
<ul>
<li><a href="http://givemetext.okfnlabs.org/">http://givemetext.okfnlabs.org/</a> - Give me Text is a free, easy to use open source web service that extracts text from PDFs and other documents using Apache Tika (and built by <a href="http://okfnlabs.org/members/mattfullerton/">Labs member Matt Fullerton</a>)</li>
<li><a href="http://pdfx.cs.man.ac.uk/">http://pdfx.cs.man.ac.uk/</a> - has a nice command line interface
<ul>
<li>Is this open? It says at the <a href="http://pdfx.cs.man.ac.uk/usage">bottom of the usage page</a> that it is powered by http://www.utopiadocs.com/</li>
<li>Note that as of 2016 this seems more focused on conversion to structured XML for scientific articles but may still be useful</li>
</ul>
</li>
<li><del>Scraperwiki - https://views.scraperwiki.com/run/pdf-to-html-preview-1/ and <a href="http://blog.scraperwiki.com/2010/12/17/scraping-pdfs-now-26-less-unpleasant-with-scraperwiki/">this tutorial</a></del> - no longer working as of 2016</li>
</ul>
<h3 id="existing-proprietary-free-or-paid-for-services">Existing proprietary free or paid-for services</h3>
<p>There are many online – just do a search – so we do not propose a comprehensive list. Two that we have tried and that seem promising are:</p>
<ul>
<li><a href="http://www.newocr.com/">http://www.newocr.com/</a> - free, with an API, very bare bones site but quite good results based on our limiting testing</li>
<li><a href="https://pdftables.com/">https://pdftables.com/</a> - pay-per-page service focused on tabular data extraction from the folks at ScraperWiki</li>
</ul>
<p>We also note that Google App Engine <a href="http://developers.google.com/appengine/docs/python/conversion/overview">used to do this</a>, but unfortunately it seems to have been discontinued.</p>
<h2 id="other-good-intros">Other good intros</h2>
<ul>
<li><a href="http://okfnlabs.org/blog/2013/12/25/parsing-pdfs.html">Thomas Levine on Parsing PDFs</a></li>
<li><a href="http://schoolofdata.org/handbook/courses/extracting-data-from-pdf/">Extracting Data from PDFs - School of Data</a></li>
</ul>
Rufus Pollock
Tools for Data Packages: Make vs. Tuttle
2016-03-25T00:00:00+00:00
http://okfnlabs.org/blog/2016/03/25/make-vs-tuttle
<p>When crafting data from other data, as when packaging public data, using good tools
can really ease the development process and improve the reliability of the data.</p>
<p>The venerable <code class="language-plaintext highlighter-rouge">make</code>, which has been used for decades to build software, is a very good option, as advocated by Mike Bostock in his <a href="https://bost.ocks.org/mike/make/">blog</a>.</p>
<h2 id="a-state-of-the-art-makefile">A state-of-the-art Makefile</h2>
<p>Let’s take an example: crafting the <a href="http://github.com/datasets/geo-countries">geo-countries</a> datapackage. We need to download data from Natural Earth, extract the zip, convert it to JSON with ogr (the “swiss-army knife” of maps), and rename a column. Following Mike Bostock’s instructions, here’s an appropriate <code class="language-plaintext highlighter-rouge">Makefile</code> (which should lie in the <code class="language-plaintext highlighter-rouge">scripts</code> folder of the project):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>all: ../data/countries.geojson
ne_10m_admin_0_countries.zip:
wget http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_0_countries.zip
ne_10m_admin_0_countries.README.html ne_10m_admin_0_countries.VERSION.txt ne_10m_admin_0_countries.dbf ne_10m_admin_0_countries.prj ne_10m_admin_0_countries.shp ne_10m_admin_0_countries.shx: ne_10m_admin_0_countries.zip
unzip ne_10m_admin_0_countries.zip
ne_10m_admin_0_countries.geojson: ne_10m_admin_0_countries.dbf ne_10m_admin_0_countries.prj ne_10m_admin_0_countries.shp ne_10m_admin_0_countries.shx
ogr2ogr -select admin,iso_a3 -f geojson ne_10m_admin_0_countries.geojson ne_10m_admin_0_countries.shp
../data:
mkdir ../data
../data/countries.geojson: ne_10m_admin_0_countries.geojson ../data
# Change the name of the fields after conversion
cat ne_10m_admin_0_countries.geojson | sed 's/"admin": /"name": /g' | sed 's/"iso_a3": /"ISO3166-1-Alpha-3": /g' > ../data/countries.geojson
</code></pre></div></div>
<p>If you’re not familiar with Makefiles, the last section reads: “When both files <code class="language-plaintext highlighter-rouge">ne_10m_admin_0_countries.geojson</code> and <code class="language-plaintext highlighter-rouge">../data</code> are available, you can run the command <code class="language-plaintext highlighter-rouge">cat ne_10m_admin_0_countries.geojson | sed 's/"admin": /"name": /g' | sed 's/"iso_a3": /"ISO3166-1-Alpha-3": /g' > ../data/countries.geojson</code>
and it will produce the file <code class="language-plaintext highlighter-rouge">../data/countries.geojson</code>”. <code class="language-plaintext highlighter-rouge">Make</code> deduces the commands to be run, starting with the ones for which everything is available, until it produces the <em>target</em> <code class="language-plaintext highlighter-rouge">all</code>.</p>
<p>We achieve two very important goals with this <code class="language-plaintext highlighter-rouge">Makefile</code>:</p>
<ul>
<li>it covers the whole process, even the download part. It’s so easy to forget whether we downloaded <code class="language-plaintext highlighter-rouge">ne_10m_admin_0_countries.zip</code> or <code class="language-plaintext highlighter-rouge">ne_110m_admin_0_countries.zip</code> when it is done by hand. But now everything is written down, so we can keep track of it in our source repository (like git), even if we change our mind.</li>
<li>Running <code class="language-plaintext highlighter-rouge">make</code> checks the date consistency of the files. Suppose Scotland had gone independent in 2015: that would have created a new country, which Natural Earth would have added. You could then download the updated version of <code class="language-plaintext highlighter-rouge">ne_10m_admin_0_countries.zip</code>. When running <code class="language-plaintext highlighter-rouge">make</code> again, it would notice that the unzipped files like <code class="language-plaintext highlighter-rouge">ne_10m_admin_0_countries.dbf</code> are older than their source, so the <code class="language-plaintext highlighter-rouge">unzip</code> command has to be run again, and so on, because <code class="language-plaintext highlighter-rouge">ne_10m_admin_0_countries.geojson</code> would not be up to date until every dependent file is updated.</li>
</ul>
<p>Even though this is a great improvement over <em>running-all-the-commands-manually-and-not-remembering-them</em>, as well as over a <em>custom-script-that-must-start-from-scratch-every-time</em>, it is not enough for a fluid and reliable development experience.</p>
<h2 id="improve-collaboration-with-tuttle">Improve collaboration with <code class="language-plaintext highlighter-rouge">tuttle</code></h2>
<p>Before we look at two major improvements in detail, let’s see the same workflow written in a <code class="language-plaintext highlighter-rouge">tuttlefile</code> (still in the <code class="language-plaintext highlighter-rouge">scripts</code> folder):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>file://ne_10m_admin_0_countries.zip <- http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_0_countries.zip
wget http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_0_countries.zip
file://ne_10m_admin_0_countries.README.html, file://ne_10m_admin_0_countries.VERSION.txt, file://ne_10m_admin_0_countries.dbf, file://ne_10m_admin_0_countries.prj, file://ne_10m_admin_0_countries.shp, file://ne_10m_admin_0_countries.shx <- file://ne_10m_admin_0_countries.zip
unzip ne_10m_admin_0_countries.zip
file://ne_10m_admin_0_countries.geojson <- file://ne_10m_admin_0_countries.dbf, file://ne_10m_admin_0_countries.prj, file://ne_10m_admin_0_countries.shp, file://ne_10m_admin_0_countries.shx
ogr2ogr -select admin,iso_a3 -f geojson ne_10m_admin_0_countries.geojson ne_10m_admin_0_countries.shp
file://../data <-
cd ..
mkdir data
file://../data/countries.geojson <- file://ne_10m_admin_0_countries.geojson, file://../data
# Change the name of the fields after conversion
cat ne_10m_admin_0_countries.geojson | sed 's/"admin": /"name": /g' | sed 's/"iso_a3": /"ISO3166-1-Alpha-3": /g' > ../data/countries.geojson
</code></pre></div></div>
<p>Look familiar?</p>
<p>It is very close to the Makefile, except for the URLs everywhere: <code class="language-plaintext highlighter-rouge">tuttle</code> aims at giving a URL to every bit of data in order to link them together.</p>
<p>You can see that the first section of the tuttlefile clearly states the dependency of the file <code class="language-plaintext highlighter-rouge">ne_10m_admin_0_countries.zip</code> on the URL <code class="language-plaintext highlighter-rouge">http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_0_countries.zip</code>.
This means that when the online list of countries changes, no unusual action is required. You just have to execute <code class="language-plaintext highlighter-rouge">tuttle run</code> as <em>if you were building the data for the first time</em>. It will notice the source URL has changed and will reprocess the dependencies accordingly.</p>
<p>The other difference from <code class="language-plaintext highlighter-rouge">make</code> is not in the syntax; it’s in how it deals with changes in the <code class="language-plaintext highlighter-rouge">tuttlefile</code>. If you have ever worked with the <code class="language-plaintext highlighter-rouge">ogr2ogr</code> command line tool, you know it’s impossible to get it right the first time. But if you change the command in a <code class="language-plaintext highlighter-rouge">Makefile</code>, unfortunately running <code class="language-plaintext highlighter-rouge">make</code> again won’t update the data, because the timestamp of the file <code class="language-plaintext highlighter-rouge">ne_10m_admin_0_countries.geojson</code> still looks consistent.</p>
<p>To improve on this, <code class="language-plaintext highlighter-rouge">tuttle</code> reacts to changes in every command. When you run it, it will first roll back the previous command as if it had never run, by deleting whatever data it produced. Then it will run the updated ogr2ogr command. That’s very handy when prototyping, because you want to focus on your code without side effects from leftover data.</p>
<p>This feature also proves really useful when working in a team. With <code class="language-plaintext highlighter-rouge">make</code>, if you change the makefile, you need to send an email to your whole team with instructions on how to clean the workspace (e.g. “Please remove the file ../data/countries.geojson because I have changed the ogr2ogr command”), and hope nobody misses it, because that would lead to undebuggable behaviour. <code class="language-plaintext highlighter-rouge">tuttle</code>, on the other hand, guarantees that the data corresponds exactly to the <code class="language-plaintext highlighter-rouge">tuttlefile</code>, so you can safely share or merge changes with your fellow contributors.</p>
<h2 id="conclusion">Conclusion</h2>
<p>If you put both improvements over <code class="language-plaintext highlighter-rouge">make</code> together (remote dependencies, and reliably reprocessing what has changed), you can set up a system that automatically updates datapackages when either the original data changes or someone modifies the source code. Pretty cool, huh?</p>
<p>I hope I’ve convinced you of the advantages of tuttle for collectively crafting data. If you’re interested, the best way to learn more about inline languages, URLs for databases, or online resources is to read the <a href="https://github.com/lexman/tuttle/master/doc/tuttorial">main tutorial</a>.</p>
<p>And one more thing about the syntactic sugar you can expect… You could simplify the first section of the <code class="language-plaintext highlighter-rouge">tuttlefile</code> into only one line:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>file://ne_10m_admin_0_countries.zip <- http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_0_countries.zip ! download
</code></pre></div></div>
Alexandre Bonnasseau
Frictionless Data Transport in Python
2016-03-11T00:00:00+00:00
http://okfnlabs.org/blog/2016/03/11/frictionless-data-transport-in-python
<p>Tool and platform integrations for “Data Packages” are key elements of
our <a href="http://datapackages.org/">Frictionless Data Initiative</a> at
<a href="https://okfn.org/">Open Knowledge International</a>. We recently posted
on the <a href="http://blog.okfn.org">main blog</a> about some
<a href="http://blog.okfn.org/2016/02/01/google-funds-frictionless-data-initiative-at-open-knowledge/">integration work</a>
funded by our friends at Google. We’ve built useful Python libraries
for working with Tabular Data Packages in some of the most popular
tools in use today by data wranglers and developers. These
integrations allow for easily getting data into and out of your tool
of choice for further manipulation while reducing the tedious
<em>wrangling</em> sometimes needed. In this post, I will give some more
details of the work done on adding support for these open standards
within <a href="http://ckan.org/">CKAN</a>, Google’s
<a href="http://bigquery.cloud.google.com/">BigQuery</a>, and common SQL database
software. But first, here is an introduction to the format for those
who are unfamiliar.</p>
<h2 id="tabular-data-package">Tabular Data Package</h2>
<p><img src="/img/posts/tabular-data-package.png" alt="" /></p>
<p>Tabular Data Package is a simple structure for publishing and sharing
tabular data in
<a href="http://datapackages.org/doc/tabular-data-package#csv">CSV</a> format.
You can find more information about the standards
<a href="http://datapackages.org/standards">here</a>, but here are the key
features:</p>
<ul>
<li>
<p>Your dataset is stored as a collection of flat files.</p>
</li>
<li>
<p>Useful information about this dataset is stored in a specially
formatted JSON file, <code class="language-plaintext highlighter-rouge">datapackage.json</code>, stored with your
data. For tabular data, this information is a combination of
<em>general metadata</em> and <em>schema</em> information.</p>
<ul>
<li>
<p>General metadata (e.g. name, title, sources) are stored as
top-level attributes of the file</p>
</li>
<li>
<p>The exact schema (e.g. type, constraint information per
column, and relations between resources) for the tabular
data is stored in a resources attribute. For each resource,
a schema is specified using the
<a href="http://dataprotocols.org/table-schema/">JSON Table Schema</a>
standard.</p>
</li>
</ul>
</li>
</ul>
<p>As an example, for the following <code class="language-plaintext highlighter-rouge">data.csv</code> file…</p>
<table>
<thead>
<tr>
<th>date</th>
<th>price</th>
</tr>
</thead>
<tbody>
<tr>
<td>2014-01-01</td>
<td>1243.068</td>
</tr>
<tr>
<td>2014-02-01</td>
<td>1298.713</td>
</tr>
<tr>
<td>2014-03-01</td>
<td>1336.560</td>
</tr>
<tr>
<td>2014-04-01</td>
<td>1299.175</td>
</tr>
</tbody>
</table>
<p>…we can define the associated <code class="language-plaintext highlighter-rouge">datapackage.json</code> file describing it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"name": "gold-prices",
"title": "Gold Prices (Monthly in USD)",
"resources": [
{
"path": "data.csv",
"format": "csv",
"schema": {
"fields": [
{
"name":"date",
"type":"date"
},
{
"name":"price",
"type":"number",
"constraints": {
"minimum": 0.0
}
}
]
}
}
]
}
</code></pre></div></div>
<p>By providing a simple, easy-to-use standard for packaging data and
building a suite of integrations to easily and losslessly import and
export packaged data using existing software, we foresee a radical
improvement in the quality and speed of data-driven analysis.</p>
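<p>As a minimal sketch of what that looks like in practice (assuming the two files above sit in the current directory and the <code class="language-plaintext highlighter-rouge">datapackage</code> Python library is installed), loading the package and walking its schema takes only a few lines:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import datapackage

# Build a model of the package from its descriptor
dp = datapackage.DataPackage('datapackage.json')

print(dp.metadata['title'])  # Gold Prices (Monthly in USD)

# Walk the declared resources and their schemas
for resource in dp.metadata['resources']:
    fields = resource['schema']['fields']
    print(resource['path'], [(f['name'], f['type']) for f in fields])</code></pre></figure>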
<p>So, without further ado, let’s look at some of the actual tooling built :).</p>
<h2><img src="/img/posts/ckan-logo-s.png" alt="CKAN Logo" /></h2>
<p><a href="http://ckan.org/">CKAN</a>, originally developed by Open Knowledge, is
the leading open-source data management system used by governments and
civic organizations around the world. It allows organizations and
ordinary users to streamline the publishing, sharing, and use of open
data. In the US and the UK, <a href="http://www.data.gov/">data.gov</a> and
<a href="https://data.gov.uk/">data.gov.uk</a> run on CKAN, and there are many
more <a href="http://ckan.org/instances/#">instances around the world</a>.</p>
<p>Given its ubiquity, CKAN was a natural target for supporting Data
Packages, so we built a
<a href="http://docs.ckan.org/en/latest/extensions/index.html">CKAN extension</a>
for importing and exporting Data Packages both via the UI and the API:
<a href="https://github.com/ckan/ckanext-datapackager">ckanext-datapackager</a>.
This work
<a href="https://github.com/ckan/ckanext-datapackager#where-is-the-old-open-knowledges-data-packager">replaces</a>
a previous implementation for an earlier version of CKAN.</p>
<h3 id="ckan-data-packager-extension">CKAN Data Packager Extension</h3>
<ul>
<li>Source and usage information: <a href="https://github.com/ckan/ckanext-datapackager">https://github.com/ckan/ckanext-datapackager</a></li>
<li>Screencast (UI): <a href="https://youtu.be/qEaAJB_GYmQ">https://youtu.be/qEaAJB_GYmQ</a></li>
<li>Screencast (API): <a href="https://asciinema.org/a/8jrpft2etpubte8jupfko8ci5">https://asciinema.org/a/8jrpft2etpubte8jupfko8ci5</a></li>
</ul>
<h2 id="bigquery-and-sql-integration">BigQuery and SQL Integration</h2>
<p><a href="https://developers.google.com/apps-script/advanced/bigquery">BigQuery</a>
is Google’s web service for querying massive datasets. By providing a
Python library, we can allow data wranglers to easily import and
export “big” Data Packages for analysis in the cloud. Likewise, by
supporting general SQL import and export for Data Packages, a wide
variety of software that depend on typical SQL databases can support
Data Packages natively. The library powering both implementations is
<a href="https://github.com/frictionlessdata/jsontableschema-py">jsontableschema-py</a>,
which provides a high level interface for importing and exporting
tabular data to and from
<a href="https://github.com/frictionlessdata/jsontableschema-py#storage">Tabular Storage</a>
objects based on
<a href="http://dataprotocols.org/table-schema/">JSON Table Schema</a>
descriptors.</p>
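<p>As a rough sketch of the Tabular Storage idea (the method names follow the Storage interface linked above, but treat the exact import path and signatures as assumptions and check the library READMEs):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">from datetime import date

from sqlalchemy import create_engine
from jsontableschema_sql import Storage  # import path is an assumption

# A JSON Table Schema and a couple of rows matching it
schema = {'fields': [{'name': 'date', 'type': 'date'},
                     {'name': 'price', 'type': 'number'}]}
rows = [[date(2014, 1, 1), 1243.068],
        [date(2014, 2, 1), 1298.713]]

# Any SQLAlchemy engine can back the Storage object
engine = create_engine('sqlite:///prices.db')
storage = Storage(engine=engine)

# Create a table ("bucket") from the schema, write the rows, read them back
storage.create('prices', schema)
storage.write('prices', rows)
print(list(storage.read('prices')))</code></pre></figure>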
<h3 id="bigquery">BigQuery</h3>
<ul>
<li>Source: <a href="https://github.com/frictionlessdata/jsontableschema-bigquery-py">jsontableschema-bigquery-py</a>.</li>
<li>Screencast: <a href="https://www.youtube.com/watch?v=i_YHSwl-7VU&feature=youtu.be">https://www.youtube.com/watch?v=i_YHSwl-7VU</a></li>
<li>Walkthrough: <a href="https://gist.github.com/vitorbaptista/998aed29097945aaccff">https://gist.github.com/vitorbaptista/998aed29097945aaccff</a></li>
</ul>
<h3 id="sql">SQL</h3>
<ul>
<li>Source and usage information: <a href="https://github.com/frictionlessdata/jsontableschema-sql-py">jsontableschema-sql-py</a>.</li>
<li>Screencast: <a href="https://asciinema.org/a/cyzd0lz0kqvcqmg4zneifohov">https://asciinema.org/a/cyzd0lz0kqvcqmg4zneifohov</a></li>
<li>Walkthrough: <a href="https://gist.github.com/vitorbaptista/19d476d99595584e9ad5">https://gist.github.com/vitorbaptista/19d476d99595584e9ad5</a></li>
</ul>
<h3 id="beyond">Beyond</h3>
<p>This modular approach allows us to easily build support across many
more tools and databases. We already have plans to support
<a href="https://www.mongodb.org/">MongoDB</a> and
<a href="https://github.com/maxogden/dat">DAT</a>. Of course, we need feedback
from <strong>you</strong> to pick the next libraries to focus on. What tool do you
think could benefit from Data Package integration? Tell us in the
<a href="https://discuss.okfn.org/c/frictionless-data">forum</a>.</p>
<p><img src="/img/posts/tabular-storage-diagram.png" alt="" /></p>
<p>For more information on Data Packages and our Frictionless Data
approach, please visit
<a href="http://frictionlessdata.io/">http://frictionlessdata.io/</a>.</p>
Dan Fowler
Submit your Newsletter ideas today!
2016-02-18T00:00:00+00:00
http://okfnlabs.org/blog/2016/02/18/submissions
<p>The first quarter of 2016 is almost through, which means that the OKFN Labs Newsletter is on its way! But we have a problem. We know that you have spent the last 3 months writing awesome code, founding disruptive new projects and basically changing the world. But we haven’t received your newsletter submissions, which we need in order to help spread the word about what you’ve been up to.</p>
<p>Submitting your ideas or work for inclusion in the newsletter is easy and only takes about a minute. Start by clicking the link to the Open Knowledge Labs issue tracker <a href="https://github.com/okfn/okfn.github.com/issues/new?title=[newsletter]">here</a>.</p>
<p>Here’s what your submission should include:</p>
<ul>
<li>A title that is succinct, and describes the item well</li>
<li>A link (or links) to the project/initiative you are submitting</li>
<li>A single paragraph that presents an introduction to the project/initiative</li>
</ul>
<p>And that’s it! It’s fast, it’s easy, and it helps get the word out about your latest project - so submit your newsletter ideas today.</p>
Josh Wieder
Labs newsletter: Q4 2015
2015-12-05T00:00:00+00:00
http://okfnlabs.org/blog/2015/12/05/newsletter
<p>Hey there hackers & hackettes! Welcome to the 4th quarter 2015 edition of the Open Knowledge Labs Newsletter: A Very Special Holiday Edition. We hope that all of our readers, volunteers, team members & contributors have a great holiday season. Labs is doing our part to keep things festive:</p>
<p><img src="https://raw.githubusercontent.com/okfn/okfn.github.com/master/img/newsletter/xmas-computer.jpg" alt="Holiday computer" /></p>
<p>Despite the hustle and bustle of the season, we are happy to report that Labs has made some serious progress with our existing projects and that we also have a few very cool tools to assist with your year-end data analysis.</p>
<h3 id="tuttle---language-platform--version-control-agnostic-tool-for-collaborating-on-complex-coding-projects">Tuttle - language, platform & version-control agnostic tool for collaborating on complex coding projects</h3>
<p>Our very own @lexman (Alexandre Bonnasseau of mappy.com) was kind enough to provide Labs with a tool called <a href="https://github.com/lexman/tuttle">tuttle</a> that should come in handy when submitting code for large projects.</p>
<p>@lexman does an excellent job describing the purpose of <a href="https://github.com/lexman/tuttle">tuttle</a> in a <a href="https://discuss.okfn.org/t/a-tool-for-collaborating-on-datapackages/1397">recent post to the Labs discussion site</a>:</p>
<blockquote>
<p>“When we write scripts to create data, we don’t make it right on the first time. How many times did you have to comment the beginning of a script, so that executions jumps directly to a bug fix? With tuttle, you won’t have to. First, it computes only what is necessary : for example if a file has already been downloaded, it won’t do it again. But also, when you change a line of code, tuttle knows exactly what data must be removed and what part of the code must be run instead.”</p>
</blockquote>
<p>Tuttle can be used to generate reports that map out workflows based on submission history and also highlight errors, as illustrated below (or in more detail <a href="http://stuff.lexman.org/s-and-p-500/scripts/.tuttle/report.html">here</a>):</p>
<p><img src="https://raw.githubusercontent.com/okfn/okfn.github.com/master/img/newsletter/tuttle-report-2.PNG" alt="Tuttle report 2" /></p>
<p>@lexman provides a detailed (and incredibly helpful) <a href="https://github.com/lexman/tuttle/blob/master/doc/tutorial_musketeers/tutorial.md">tutorial that helps acquaint new users with tuttle</a>. We highly recommend giving the tutorial a try and using tuttle for complex development projects.</p>
<h3 id="mira-turns-csv-files-into-an-http-api">Mira turns CSV files into an HTTP API</h3>
<p><a href="https://github.com/davbre/mira">Mira</a> is a new tool that comes to Labs from @davbre and is built using Ruby on Rails & relies on Postgres. <a href="https://github.com/davbre/mira">Mira</a> allows users to generate an API using <a href="http://dataprotocols.org/data-packages/">data packages</a>, a way to describe csv files using JSON - greatly simplifying what can often be a lengthy, tedious process. Here is how @davbre describes his utility:</p>
<blockquote>
<p>“This is a small application developed using Ruby-on-Rails. You upload a datapackage.json file to it along with the corresponding CSV files and it gives you a read-only HTTP API. It’s pretty simple - it uses the metadata in the datapackage.json file to import each CSV file into its own database table. Once imported, various API endpoints become available for metadata and data. You can perform simple queries on the data, controlling the ordering, paging and variable selection. It also talks to the DataTables jQuery plug-in.”</p>
</blockquote>
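<p>Consuming the generated API is then plain HTTP. The sketch below shows the general shape of such a client in Python; the base URL, endpoint paths and table name are hypothetical placeholders, so check Mira’s README for the real routes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import requests

# A locally running Mira instance (assumption).
BASE = "http://localhost:3000"

# Hypothetical routes for illustration only - Mira's actual
# endpoint names are documented in its README.
metadata = requests.get(BASE + "/datapackage").json()
rows = requests.get(BASE + "/tables/constituents/data",
                    params={"page": 1}).json()
print(metadata)
print(rows)
</code></pre></div></div>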
<h3 id="python-data-analysis-library-agate-has-reached-version-11">Python data analysis library Agate has reached version 1.1</h3>
<p>A new Python library has begun to come of age. <a href="https://github.com/onyxfish/agate">Agate</a> was built by @onyxfish (NPR data journalist <a href="https://source.opennews.org/en-US/people/christopher-groskopf/">Christopher Groskopf</a>) as an alternative to numpy and pandas. Whereas numpy and pandas were designed for scientists, Agate is designed with the needs of journalists in mind. Agate places a premium on ease of use and flexibility, even at the expense of the performance optimizations present in other libraries. As @onyxfish puts it in <a href="https://source.opennews.org/en-US/articles/introducing-agate/">a post</a> announcing the new version of Agate:</p>
<blockquote>
<p>“In greater depth, agate is a Python data analysis library in the vein of numpy or pandas, but with one crucial difference. Whereas those libraries optimize for the needs of scientists—namely, being incredibly fast when working with vast numerical datasets—agate instead optimizes for the performance of the human who is using it. That means stripping out those technical optimizations and instead focusing on designing code that is easy to learn, readable, and flexible enough to handle any weird data you throw at it.”</p>
</blockquote>
<p>Agate’s leap from version 0.11.0 to version 1.0.0 on October 22nd of this year marked the <a href="https://agate.readthedocs.io/en/1.1.0/changelog.html">first major release</a> for the up-and-coming library (version 1.1 was released November 4th). While Agate was fully functional at v0.11.0, the changes since then have been substantial. Among some of the more impressive additions:</p>
<ul>
<li>Agate can now be used as a drop-in replacement for Python’s csv module</li>
<li>Migrated csvkit’s unicode CSV reading/writing support into agate</li>
<li>100% test coverage reached</li>
<li>Added support for Python 3.5</li>
<li>Massive performance increases for joins</li>
<li>Dozens of other resolved issues …</li>
</ul>
<p>Agate has an impressive array of documentation for developers. Take a look at <a href="https://agate.readthedocs.io/en/1.1.0/">the manual</a>, the <a href="https://agate.readthedocs.io/en/1.1.0/tutorial.html">standard tutorials</a>, a tutorial for <a href="http://nbviewer.ipython.org/urls/gist.githubusercontent.com/onyxfish/36f459dab02545cbdce3/raw/534698388e5c404996a7b570a7228283344adbb1/example.py.ipynb">using Agate with Jupyter notebook</a>, the <a href="https://agate.readthedocs.io/en/1.1.0/cookbook.html">Agate Cookbook</a> and the <a href="https://agate.readthedocs.io/en/1.1.0/api.html">Agate API documentation</a>.</p>
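<p>To give a flavor of that readability, here is a minimal example using agate’s CSV loading and aggregation API (the file and column names are illustrative assumptions):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import agate

# Load a CSV; agate infers the column types for you.
# A "spending.csv" with "department" and "amount" columns is assumed.
table = agate.Table.from_csv("spending.csv")

# Group, sum, sort and print - no loops required.
totals = table.group_by("department").aggregate([
    ("total", agate.Sum("amount")),
])
totals.order_by("total", reverse=True).limit(10).print_table()
</code></pre></div></div>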
<p>Agate does indeed look promising, and there is an immense need for tools like it. Journalism is changing rapidly. With a flood of new information from Open Data advocates (like Open Knowledge) making its way to the newsroom, organizations that can effectively interpret that data will maintain a significant advantage over their competitors. Meanwhile, the public can only benefit from more accurate analysis of the larger sets of information that impact their lives. Clearly, analytical skills once required only at universities are now needed well beyond the Ivory Tower.</p>
<h3 id="webshot-improvements">Webshot improvements</h3>
<p><a href="http://webshot.okfnlabs.org/">Webshot</a> is a free, automated utility that allows for the generation of live screenshots. Screenshots serve an important role in demonstrating accountability (when content is removed, defaced or censored from the internet), but just as frequently are critical for troubleshooting & diagnostic services. There are many scenarios in which manually creating screenshots would not be feasible - because of routing issues, or because a screenshot needs to be generated at an exact time (Webshot can be called via an API).</p>
<p>The <a href="https://github.com/okfn/webshot/">Github page for Webshot</a> includes not just the Webshot source code, but also a node-based web server with a default Heroku configuration. This enables users to spin up a fully functional Webshot instance using Heroku in just a few minutes, starting from scratch. Install the node package manager, create your heroku instance and push the configuration and you are all set!</p>
<p>When calling screenshots that have been generated by Webshot, the URLs reference the source website of the screenshot and allow for resizing of the image. Here are some examples from the Webshot documentation:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> http://localhost:5000/api/generate?url=google.com&width=500
http://localhost:5000/api/generate?url=google.com&height=300
http://localhost:5000/api/generate?url=google.com&width=200&height=400
</code></pre></div></div>
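<p>The same calls are easy to script. A minimal sketch using Python’s requests library and the parameters documented above (it assumes a locally running instance and that the endpoint returns the image bytes directly):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import requests

# Request a 500px-wide screenshot of google.com, per the docs above.
resp = requests.get("http://localhost:5000/api/generate",
                    params={"url": "google.com", "width": 500})
resp.raise_for_status()

# Save the returned image bytes (PNG is an assumption).
with open("screenshot.png", "wb") as f:
    f.write(resp.content)
</code></pre></div></div>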
<p>Check out Webshot - and don’t forget to <a href="https://github.com/okfn/webshot/issues/">contribute to the project on Github</a>!</p>
<h3 id="new-tasks-need-your-help-within-the-core-datasets-issue-registry">New tasks need your help within the Core Datasets Issue Registry</h3>
<p><a href="http://data.okfn.org/roadmap/core-datasets">Open Knowledge’s Core Datasets</a> are a selection of commonly-used datasets on a variety of topics that can be put to use for a variety of different research topics. All of the tools are free and available on Github - being able to find so many diverse, reliable and useful datasets in one place can save those of us who rely on open data a lot of time and hassle. Because so many people rely on these tools, submitting code to these tools allows your submissions to make a real and significant difference to projects all over the world. Here are some examples of the types of packages that are included:</p>
<ul>
<li><a href="https://github.com/datasets/geoip2">geoip2</a>, a free IP geolocation database based on data from the <a href="http://dev.maxmind.com/geoip/geoip2/geolite2/">Geolite2 MaxMind databases</a></li>
<li><a href="https://github.com/datasets/imf-weo">imf-weo</a>, a copy of the <a href="http://www.imf.org/external/ns/cs.aspx?id=28">International Monetary Fund World Economic Outlook database</a></li>
<li><a href="https://github.com/datasets/clinical-trials-us">clinical-trials-us</a>, javascript-based tool listing official US clinical trial outcomes from the FDA, relies on data from <a href="http://clinicaltrials.gov/">clinicaltrials.gov</a></li>
<li><a href="https://github.com/datasets/crime-uk">crime-uk</a>, UK-specific crime data from multiple sources, including http://police.uk/data</li>
<li><a href="https://github.com/datasets/browser-stats">browser-stats</a>, a Python based tool that collects browser usage statistics trends, primarily gathered from <a href="http://www.w3schools.com/browsers/browsers_stats.asp">W3Schools log files</a></li>
<li>and much more …</li>
</ul>
<p>Thanks in large part to @pdehaye, the new Core Datasets Managing Curator, there has been a flurry of new activity and project additions on the Core Datasets Issue Registry. Take a look at <a href="https://github.com/datasets/registry/issues">the Issue Registry</a> for issues that you think you could help resolve and start tackling them! For example, <a href="https://github.com/datasets/registry/issues/122">a thread has been created</a> for the <a href="http://www.economist.com/content/big-mac-index">Big Mac Index Dataset</a>. Don’t know what a Big Mac Index is? Not a problem! You don’t need to be an economist to write a script that will poll <a href="http://infographics.economist.com/2015/databank/BMfile2000-Jul2015.xls">the correct datasets</a> (note: XLS file). If you want to help out but aren’t sure how to start, or you’re having trouble, browse the <a href="https://github.com/datasets/registry/issues?q=is%3Aopen+is%3Aissue+label%3A%22Difficulty%3A+easy%22">easier issues</a> and leave a comment on the relevant thread! Also, be sure to note in a thread that you are working on a specific project, to avoid duplicating effort. Now that you know all this, go get coding!</p>
<h3 id="labs-establishes-organizational-structure-open-positions-still-available">Labs establishes organizational structure, open positions still available</h3>
<p>Labs continues to expand and attract interest from talented developers and all manner of smarty-pantses. With more people and more projects there is more responsibility and more to get done. To that end, Labs has begun to develop an organizational structure so that all of our team members can focus on what they are best at, to prevent duplication of effort, and to make communication easier and more effective. So far, the assigned positions are:</p>
<ul>
<li>@danfowler
<ul>
<li>Team Lead</li>
</ul>
</li>
<li>@loleg
<ul>
<li>Team Lead</li>
</ul>
</li>
<li>@mattfullerton
<ul>
<li>Team Lead</li>
</ul>
</li>
<li>@pdehaye
<ul>
<li>Core Datasets Managing Curator</li>
</ul>
</li>
<li>@davbre
<ul>
<li>Advisory Group Member</li>
</ul>
</li>
<li>@davidmiller
<ul>
<li>Advisory Group Member</li>
</ul>
</li>
<li>@jgkim
<ul>
<li>Advisory Group Member</li>
</ul>
</li>
<li>@jwieder
<ul>
<li>Advisory Group Member</li>
</ul>
</li>
</ul>
<p>For more detailed information about each position, be sure to check out <a href="https://github.com/okfn/okfn.github.com/issues/367">this thread</a>. There are still positions available and a significant need for assistance from those with all sorts of different skills - leave a comment on the thread to let us know that you want to help step up to keep Labs growing!</p>
<h3 id="its-time-to-get-involved">It’s time to get involved</h3>
<p>The New Year is a time for reflection on the year gone by and an opportunity to resolve to engage in good deeds for the year ahead. This year, OKFN Labs urges you to forget about silly New Year’s resolutions like more exercise or fewer carbs. Do something important with 2016 and <strong>write more code</strong>! The first thing to do is to make sure that you are a part of the Labs team by <a href="http://okfnlabs.org/join/">signing up</a>. Once you have joined the Labs community, check out <a href="http://okfnlabs.org/ideas/">our Ideas page</a> or our <a href="http://okfnlabs.org/projects/">current Projects</a> and find something that you would be interested in collaborating on. Do you have a plan for something we haven’t thought of yet? Tell us about it <a href="https://twitter.com/intent/user?screen_name=OKFNLabs">on Twitter</a> or, better yet, jump on <a href="http://okfnlabs.org/contact/">the mailing list</a>.</p>
<p>For all of you already contributing to Labs: keep up the great work! Open Data is important, and your efforts continue to provide transparency for critical information. With your help, Labs will continue its success into 2016. See you then!</p>
Josh Wieder
Labs newsletter: Q2/Q3 2015
2015-09-28T00:00:00+00:00
http://okfnlabs.org/blog/2015/09/28/newsletter
<p>Welcome to the second Labs Newsletter of 2015! There has been
excellent progress on various open data tools and initiatives across
the Open Knowledge network since the last newsletter. Let’s take a
look:</p>
<h2 id="labs-still-3-discourse">Labs Still <3 Discourse</h2>
<p>Open Knowledge is in the process of centralizing community discussions
on our Discourse <a href="http://discuss.okfn.org">forums</a>. In order to do
this, we’ve been enabling many new features to support
mailing-list-style communication such as starting and replying to new
topics via email. There’s already a lot of discussion there, so,
check out the
<a href="https://discuss.okfn.org/c/open-knowledge-labs">Open Knowledge Labs</a>
category, sign up, and tell us about your favorite tools!</p>
<h2 id="jts-sql">JTS-SQL</h2>
<p>Friedrich Lindenberg, AKA <a href="http://okfnlabs.org/members/pudo/">pudo</a>,
has booted up <a href="https://github.com/okfn/jts-sql">JTS-SQL</a>, a Python
library that removes some of the friction in dealing with data by
automatically generating database table models based on
<a href="http://dataprotocols.org/table-schema/">JSON Table Schema</a> field
descriptors.</p>
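<p>In general terms, the trick is a straightforward mapping from JSON Table Schema field types to SQL column types. The sketch below illustrates that idea with SQLAlchemy - note this shows the concept, not JTS-SQL’s actual API:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import sqlalchemy as sa

# A minimal mapping from JSON Table Schema types to SQL column types.
TYPE_MAP = {"string": sa.Text, "integer": sa.Integer,
            "number": sa.Float, "date": sa.Date}

def table_from_schema(name, schema, metadata):
    """Build a SQLAlchemy Table from a JSON Table Schema dict."""
    columns = [sa.Column(field["name"],
                         TYPE_MAP.get(field.get("type", "string"), sa.Text))
               for field in schema["fields"]]
    return sa.Table(name, metadata, *columns)

metadata = sa.MetaData()
table_from_schema("countries", {"fields": [
    {"name": "code", "type": "string"},
    {"name": "population", "type": "integer"},
]}, metadata)
metadata.create_all(sa.create_engine("sqlite://"))  # in-memory SQLite
</code></pre></div></div>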
<h2 id="give-me-text">Give Me Text!</h2>
<p>The Labs web service booted by
<a href="/members/mattfullerton/">Matt Fullerton</a> for converting documents
(e.g. PDF) to text using OCR has been given a new name “Give Me Text!”
and <a href="http://givemetext.okfnlabs.org/">a nice URL</a>. Read the original
announcement
<a href="http://okfnlabs.org/blog/2015/02/21/documents-to-text.html">here</a>.</p>
<h2 id="core-data-curators">Core Data Curators</h2>
<p>We’re always looking for curators to help support our Core Datasets
project which is aimed at collecting and maintaining (curating)
important and commonly-used (“core”) datasets (e.g. GDP, ISO-codes) in
high-quality, standardized and easy-to-use form. Interested in
joining? Have questions? Visit us
<a href="https://discuss.okfn.org/c/open-knowledge-labs/core-datasets">here</a>.</p>
<h2 id="openspending-next">OpenSpending Next</h2>
<p>Work on the next iteration of
<a href="http://community.openspending.org/next">OpenSpending</a> is picking up
steam. We recently hosted an
<a href="https://discuss.okfn.org/t/tech-hangout-sept-2015/1046/12">OpenSpending Tech Hangout</a>
to demo some of our recent work. We can’t do it without you, so jump
into the
<a href="https://discuss.okfn.org/c/openspending">OpenSpending category</a> and
tell us how you’d like your government’s finances aggregated, stored,
and visualized :).</p>
<h2 id="building-a-100-open-source-based-open-data-platform">Building a 100% Open-Source-based Open Data platform</h2>
<p>Alex Corbi
<a href="http://www.open-steps.org/my-experience-building-a-100-open-source-based-open-data-platform/">shares his experience</a>
participating in the development of the
<a href="http://www.opendevelopmentmekong.net">Open Development Mekong</a>
project, a platform built using 100% open-source technologies whose
aim is to increase transparency in the Southeast Asian countries of the
Mekong basin region. The article focuses on its main component; a
knowledge base platform built using
<a href="http://ckan.org/">CKAN</a>. Additionally, Alex gives an overview on how
the development was coordinated on <a href="https://github.com/">GitHub</a> and
the set of additional tools developed around the project, including a
<a href="http://extensions.ckan.org/extension/wpckan/">plugin to connect Wordpress and CKAN</a>,
which can be reused and repurposed in future projects.</p>
<h2 id="open-data-companion">Open Data Companion</h2>
<p>Osahon Okungbowa let us know about a mobile app he has created,
<a href="http://odc.utopiasoftwareonline.com/">Open Data Companion (ODC)</a>,
which provides a unified access point to over 120 CKAN open data
portals and thousands of datasets from around the world, right from
your mobile device. Crafted with mobile-optimised features and design,
this is an easy and convenient way to find, access and share open
data.</p>
<h2 id="mexicos-new-open-data-portal">Mexico’s new Open Data Portal</h2>
<p>Juan Ortiz Freuler has pointed us to
<a href="https://es.scribd.com/doc/274622757/Supply-and-Demand-of-Data-Through-Mexico-s-Open-Data-Portal">this report</a>,
written in non-technical language, which provides the reader with
contextual data to understand the challenges the government of a
developing country is facing in the implementation of an Open Data
Portal. The report includes analysis of availability and quality of
key datasets, critical analysis of the existing normative framework,
an analysis of web traffic towards the portal, as well as insights
from interviews with over two dozen professional Mexican data users.</p>
<p><img src="https://cloud.githubusercontent.com/assets/14280123/9857467/9edcb930-5b12-11e5-8d2d-6cb89b3dd710.png" alt="graph" /></p>
<h2 id="a-model-for-frictionless-science">A Model for Frictionless Science?</h2>
<p>Steven De Costa has been doing work with open government data for the
last few years, mostly around the platform capability of CKAN, and has
been considering what frictionless science might look like:</p>
<blockquote>
<p>I’ve been thinking about how to publish the full set of research
artifacts needed to replicate and review work undertaken by labs, or
to swap out data and reconstitute the research in a new context. That
thinking, done only with little access to end users, has revealed the
following short list of what might be published as a ‘dataset’ listing
of ‘resources’…</p>
</blockquote>
<p>Check out the forums to contribute to the
<a href="https://discuss.okfn.org/t/is-there-a-model-for-frictionless-science/1203/1">discussion</a>!</p>
<h2 id="get-involved">Get Involved</h2>
<p>Anyone can join the Labs community and get involved! Read more about
how you can <a href="/join">join the community</a> and participate by coding,
wrangling data, or doing outreach and engagement. Also check out the
<a href="/ideas">ideas page</a> to see what’s cooking in the Labs, and the
<a href="/newsletter">newsletter page</a> if you have items to submit to the next
newsletter.</p>
Daniel Fowler
Open Data Companion (ODC) – Bringing Open Data to the Mobile Platform
2015-09-04T00:00:00+00:00
http://okfnlabs.org/blog/2015/09/04/bringing-open-data-to-mobile
<p>
As software developers, we are always looking for data to solve a problem or address a shortcoming.
It’s just how we’re wired. So, you heard of <a href="http://opendatahandbook.org/" target="_blank">open data [1]</a>,
and now you’re excited to go exploring and get the open data needed for the project.
That’s when you face the first major obstacle – open data with the right ‘open’ license can be so
difficult to locate on the Internet that you spend hours and hours searching before you find data you can actually use.
Luckily, <a href="http://data.okfn.org/vision" target="_blank">OKFN’s frictionless data project [2]</a>
and <a href="https://namara.io/#/" target="_blank">ThinkData Works [3]</a> are taking action to help solve this problem.
</p>
<p>
As a data journalist, researcher or open data enthusiast, have you ever wished you could just whip out your mobile
device and get instant access to a specific open dataset? Or maybe you want to know as soon as new
datasets are available from a portal you’re interested in, without having to manually poll or visit the portal
for regular checks?
</p>
<p>
Nearly everyone now has a mobile device.
Popular <a href="http://www.bloomberg.com/bw/articles/2014-11-19/we-now-spend-more-time-staring-at-phones-than-tvs" target="_blank">consensus [4]</a> and
<a href="http://www.geekwire.com/2014/flurry-report-mobile-phones-162-minutes/" target="_blank">reports [5]</a> show that mobile device
usage and time spent on mobile devices are rapidly increasing. This means that mobile devices are now one of the fastest
and easiest means of accessing data and information. Yet, as of now, open data lacks a strong mobile presence.
</p>
<img src="https://writeosahon.files.wordpress.com/2015/08/smartphone-survey.png" title="Motorola Smartphone Relationship Survey"
alt="Motorola Smartphone Relationship Survey" width="400" height="300"> [Image Source – Motorola Smartphone Relationship Survey]<br>
<p>
To tackle the highlighted challenges, Open Data Companion was developed to bridge the gap between open data portals and the
mobile platform.
</p>
<h2>What is Open Data Companion (ODC)?</h2>
<p>
ODC is a FREE Android productivity app which provides a unified access point to, and common repository for, over 120
<a href="http://ckan.org/">CKAN [6]</a> open data portals and thousands of datasets from around the world,
right from your mobile device. Crafted with mobile-optimised features and design,
it is an easy and convenient way to find, access and share open data.
</p>
<p>
Open Data Companion provides a framework for all private-sector, state, regional, national and worldwide CKAN open
data portals to deliver open data to all mobile users.
</p>
<img src="https://writeosahon.files.wordpress.com/2015/08/screen_shot11.png" title="ODC Screen Shot"
alt="ODC Screen Shot" width="200" height="400">
<b><a href="https://play.google.com/store/apps/details?id=writeosahon.utopiasoftware.odcompanion" target="_blank">
Download the ODC Android App here</a></b><br>
<h3>Key Features</h3>
<ul>
<li>
ODC is built on a distributed system of CKAN portals. This means the app (or its server) does not cache or
store harvested datasets. Rather, the app uses the powerful
<a href="http://docs.ckan.org/" target="_blank">CKAN API [7]</a> to retrieve live/current datasets from any
active CKAN portal on the Internet. Users are free to set up access to as many data portals as they want
(a raw call to the CKAN API is sketched after this list).<br>
<img src="https://writeosahon.files.wordpress.com/2015/08/screen4.png" title="Portal Access Setup on ODC"
alt="Portal Access Setup on ODC" width="200" height="400">
</li>
<li>
Browse datasets from any accessed portal using the categorisation/classification (i.e. organisations and groups)
provided by the portal. This category-based navigation, along with mobile gestures, allows users to
locate datasets more quickly.<br>
<img src="https://writeosahon.files.wordpress.com/2015/08/screen_shot3.png" title="Browsing Datasets by Categories on ODC"
alt="Browsing Datasets by Categories on ODC" width="200" height="400">
</li>
<li>
Receive push notifications on your mobile device whenever new datasets become available from any portal of your choosing.
This helps to keep track of new uploads on any portal without having to poll it manually for changes.<br>
<img src="https://writeosahon.files.wordpress.com/2015/08/screen_shot2.png" width="200" height="400">
</li>
<li>
Preview datasets and create in-app data visualisations.
The app uses the data viewer extensions installed on CKAN portals to enable mobile users to produce data visualisations.
This is a great feature for data journalists and data visualizers.<br>
<img src="https://writeosahon.files.wordpress.com/2015/08/preview_1.png" title="ODC Dataset Viewing & Data Viz"
alt="ODC Dataset Viewing & Data Viz" width="540" height="270"><br>
<img src="https://writeosahon.files.wordpress.com/2015/08/screen_shot5.png" title="ODC Dataset Viewing & Data Viz"
alt="ODC Dataset Viewing & Data Viz" width="200" height="400">
<img src="https://writeosahon.files.wordpress.com/2015/08/preview_3-e1440833607613.png" title="ODC Dataset Viewing & Data Viz"
alt="ODC Dataset Viewing & Data Viz" width="200" height="400">
</li>
<li>
Other key features - download datasets to your mobile device; bookmark datasets for later viewing;
share links to datasets on social media.
</li>
</ul>
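<p>Because every CKAN portal exposes the same Action API, the app can treat 120+ portals interchangeably. For reference, here is what a raw call to that API looks like from Python (demo.ckan.org is used purely for illustration; any active CKAN portal responds the same way):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import requests

# The CKAN Action API lives under /api/3/action on every CKAN portal.
base = "http://demo.ckan.org/api/3/action"

# Search for datasets matching a keyword.
response = requests.get(base + "/package_search",
                        params={"q": "budget", "rows": 5}).json()
for package in response["result"]["results"]:
    print(package["name"], "-", package.get("title"))
</code></pre></div></div>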
<p>
ODC complements and supplements the services of CKAN portals; it allows data producers and portal administrators to reach
mobile users at no added cost and with no additional configuration.
<a href="http://odc.utopiasoftwareonline.com/" target="_blank">Find out more about Open Data Companion (ODC) [8]</a>.
</p>
<p>
<a href="https://play.google.com/store/apps/details?id=writeosahon.utopiasoftware.odcompanion" target="_blank">
Use ODC and give feedback</a>. More features are actively being developed.
</p>
<!-- the entire section below contains the references for links used in the blog -->
<h3>
References
</h3>
<div>[1] Open Data Handbook http://opendatahandbook.org/</div>
<div>[2] OKFN’s Frictionless Data Project http://data.okfn.org/vision</div>
<div>[3] ThinkData Work’s Project Namara https://namara.io/#/</div>
<div>[4] We Now Spend More Time Staring at Phones Than TVs
http://www.bloomberg.com/bw/articles/2014-11-19/we-now-spend-more-time-staring-at-phones-than-tvs</div>
<div>[5] Study: Americans spend 162 minutes on their mobile device per day, mostly with apps
http://www.geekwire.com/2014/flurry-report-mobile-phones-162-minutes/</div>
<div>[6] CKAN http://ckan.org/</div>
<div>[7] CKAN API http://docs.ckan.org/</div>
<div>[8] Open Data Companion (ODC) http://odc.utopiasoftwareonline.com/</div>
Osahon Okungbowa
Document to text conversion web service gets a nice name, a nice URL and a web interface
2015-08-28T00:00:00+00:00
http://okfnlabs.org/blog/2015/08/28/give-me-text
<h2 id="give-me-text">Give Me Text!</h2>
<p>In a <a href="http://okfnlabs.org/blog/2015/02/21/documents-to-text.html">previous post</a>, I detailed a web service where you can throw documents of many kinds at it, and get text in return. We’ve now given this service a name, “Give Me Text!”, and a nice URL at <a href="http://givemetext.okfnlabs.org/">http://givemetext.okfnlabs.org/</a> for both the API, which lives at the subpath <a href="http://givemetext.okfnlabs.org/tika">/tika</a>, and a web interface for uploading documents to the service. The web service is based on <a href="https://github.com/tpalsulich/TikaExamples/tree/gh-pages">some nice work by Tyler Palsulich</a> who got in touch via <a href="https://github.com/okfn/ideas/issues/88#issuecomment-100107044">GitHub</a>. Thanks Tyler!</p>
Matt Fullerton
Improving the openness of health and social care data
2015-08-14T00:00:00+00:00
http://okfnlabs.org/blog/2015/08/14/improving-the-openness-of-health-and-social-care-data
<p>The <a href="http://www.hscic.gov.uk/">Health and Social Care Information Centre</a> (HSCIC) is responsible for publishing a large proportion of the official statistics related to health and care in England. Each year we release about 250 statistical publications, ranging from high-level summary data on hospital admissions, through to detail on prescriptions, and results from surveys on lifestyles and smoking, drinking and drug use habits. We publish a vast array of aggregated non-identifiable data, all under the Open Government Licence, and are working with an <a href="http://www.hscic.gov.uk/transparency">open data</a> mind-set to ensure that these data can be used to maximum effect.</p>
<p>Most of our statistical data is presented in formatted spreadsheets, providing context and detail in accordance with the Statistics Code of Conduct, but we are also making the data available for re-use in machine-readable, comma-separated values (CSV) format. We hope that this encourages our non-identifiable data to be consumed by a greater array of users, for more purposes. An example of this is the annual <a href="http://www.hscic.gov.uk/pubs/hes1314">Hospital Episode Statistics</a> (HES) publication for admitted patient care – the publication contains a raft of Excel tables of various statistics, but we have also now created a set of CSV files, which use a consistent structure.</p>
<p>To improve the discoverability of our non-identifiable data, as well as being published on our own site, our datasets are also available through data.gov.uk (DGU). We hope that organising datasets in this way makes it easier for users to find exactly the data they need. An area that we’ve recently worked on is our Clinical Indicators. The HSCIC is responsible for assuring the quality of health and care indicators, and publishes over 1700 indicators on our <a href="https://indicators.ic.nhs.uk/webview/">Indicator Portal</a>.</p>
<p>Until recently, the only way to search or access the data was from within the portal. Now, the aggregated datasets that support over 100 indicators on health outcomes can be accessed using DGU, thanks to its harvesting tool. Our portal makes the metadata available for each indicator using the DDI XML standard – so we have converted this into a <a href="https://github.com/hscic-open-data/indicator-portal">data.json</a> equivalent, which will be maintained in line with the ongoing release of indicators.</p>
<p>This means that these indicators, and more in future, can be found without the user needing to be within our own portal. Users can also benefit from DGU’s additional CKAN functionality (for example, the ‘Preview’ function) and of course, being a CKAN implementation, the aggregated datasets have their own API, allowing other portals to re-harvest the data. All of which will hopefully increase the different ways in which our data is used.</p>
<p>To begin with, the indicators that can be found on data.gov.uk are:</p>
<ul>
<li>
<p><a href="http://data.gov.uk/data/search?sort=title_string+asc&q=title%3A+%22nhsof%22&publisher=health-and-social-care-information-centre#search-sort-by">NHS Outcomes Framework</a> (‘NHSOF’; 50 indicators) – which sets out the outcomes and corresponding indicators used by the Secretary of State to hold NHS England to account for improvements in health outcomes</p>
</li>
<li>
<p><a href="http://data.gov.uk/data/search?sort=title_string+asc&q=title%3A+%22ccgois%22&publisher=health-and-social-care-information-centre#search-sort-by">Clinical Commissioning Group Outcomes Indicator Set</a> (‘CCGOIS’; 53 indicators) – which is an integral part of NHS England’s systematic approach to quality improvement.</p>
</li>
</ul>
<p>If you’re already using our open data to generate benefits, we’d love to learn more – it will help us to prioritise our efforts. Tweet us at <a href="http://twitter.com/HSCICOpenData">@HSCICOpenData</a>.</p>
<p>Chris Hutchins is Open Data Lead, Health and Social Care Information Centre (HSCIC)</p>
Chris Hutchins
Featured Core Datasets: Comprehensive Country Codes and Country List
2015-07-11T00:00:00+00:00
http://okfnlabs.org/blog/2015/07/11/country-list-and-codes
<p>Are you in need of a clean, well maintained list of all countries and
their associated international codes in CSV and JSON? If so, you
might consider the
<a href="http://data.okfn.org/data/core/country-codes">country-codes</a> and
<a href="http://data.okfn.org/data/core/country-list">country-list</a> data
packages available at <a href="http://data.okfn.org">data.okfn.org</a>.</p>
<ul>
<li>
<p><strong><a href="http://data.okfn.org/data/core/country-codes">Country Codes</a></strong>,
using source data from <a href="http://www.iso.org/iso/home.htm">ISO</a>, the
<a href="https://www.cia.gov/library/publications/the-world-factbook/">CIA World Factbook</a>,
and others, provides comprehensive information for countries in the
world, including their respective ISO 3166 codes, ITU dialing codes,
ISO 4217 currency codes, and many others.</p>
</li>
<li>
<p><strong><a href="http://data.okfn.org/data/core/country-list">Country List</a></strong>
provides a subset of the information found in
<a href="http://data.okfn.org/data/core/country-codes">country-codes</a>, for
use when you only need a country’s two-character ISO 3166 code and
its name in English.</p>
</li>
</ul>
<p><img src="/img/posts/country-codes.png" alt="Country Codes" /></p>
<p>Often, different databases that provide country-level information may
use different unique identifiers for each country, adding significant
friction to the process of using the data they provide. By linking
all standardized identifiers in one place,
<a href="http://data.okfn.org/data/core/country-codes">Country Codes</a> can
provide a useful table on which to join distinct datasets and much
more.</p>
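<p>For example, two datasets keyed on different country identifiers can be joined through Country Codes as a bridge table. A short, hedged sketch using pandas - the resource URL and the ISO column headers are assumptions, so check the data package page for the canonical names:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd

# Country Codes CSV (URL pattern assumed from data.okfn.org).
codes = pd.read_csv(
    "http://data.okfn.org/data/core/country-codes/r/country-codes.csv")

# Two toy datasets that use different country identifiers.
gdp = pd.DataFrame({"iso3": ["DEU", "FRA"], "gdp_tn": [3.8, 2.9]})
population = pd.DataFrame({"iso2": ["DE", "FR"], "pop_mn": [83, 67]})

# Use Country Codes as the bridge between the two key systems
# (column headers assumed; see datapackage.json for exact names).
bridge = codes[["ISO3166-1-Alpha-2", "ISO3166-1-Alpha-3"]]
merged = (gdp.merge(bridge, left_on="iso3", right_on="ISO3166-1-Alpha-3")
             .merge(population, left_on="ISO3166-1-Alpha-2",
                    right_on="iso2"))
print(merged[["iso3", "gdp_tn", "pop_mn"]])
</code></pre></div></div>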
<h2 id="core-datasets">Core Datasets</h2>
<p><img src="http://assets.okfn.org/p/data/img/icon-128.png" alt="Data Package" /></p>
<p>Due to the value they provide, these datasets have been designated
<a href="http://data.okfn.org/roadmap/core-datasets">“core” datasets</a> as part
of the <a href="/projects/frictionless-data/">Frictionless Data</a>
initiative. The Core Datasets project is about collecting and
maintaining (curating) important and commonly-used datasets in a
high-quality, standardized, and easy-to-use form: in particular, as
up-to-date, well-structured
<a href="http://dataprotocols.org/data-packages/">Data Packages</a>. These
datasets are available now for inspection on the site, and are
downloadable in either CSV or JSON format for use in your next
application, website, or spreadsheet.</p>
<p>We need help suggesting, preparing and maintaining a set of “core”
datasets as Data Packages. To get involved, see our previous
<a href="http://okfnlabs.org/blog/2015/01/03/data-curators-wanted-for-core-datasets.html">call for data curators</a>.
Also check out the
<a href="https://discuss.okfn.org/t/about-the-core-datasets-category/144">Core Datasets category</a>
on our new <a href="https://discuss.okfn.org">discussion forum</a>.</p>
Dan Fowler
Labs newsletter: Q1 2015
2015-05-11T00:00:00+00:00
http://okfnlabs.org/blog/2015/05/11/newsletter
<p>Welcome to the first Labs Newsletter of 2015! There has been some great activity around open data and tech in the Open Knowledge network over the first quarter of 2015. Let’s dive straight in!</p>
<h2 id="labs-3-discourse">Labs <3 Discourse</h2>
<p>In case you don’t know, <a href="http://www.discourse.org/">Discourse</a> is an open source forum/mailing list hybrid for communities. Open Knowledge runs a Discourse server, and of course, there is a home there for the Open Knowledge Labs community. We hope to move community discussion there going forward, so check out the <a href="https://discuss.okfn.org/c/open-knowledge-labs">Open Knowledge Labs</a> category, sign up, and set your digest preferences.</p>
<h2 id="labs-hangouts">Labs hangouts</h2>
<p>The first Open Knowledge Labs hangout for 2015 was held on April 16th to a full house, and the next one is currently scheduled for May 14. Check out the previous agenda, and planning for the next one, <a href="https://pad.okfn.org/p/labs-hangouts">here at okfnpad</a>.</p>
<h2 id="core-datasets">Core datasets</h2>
<p>Core datasets is a project for collecting and maintaining important and commonly-used (“core”) datasets in high-quality, standardized and easy-to-use form. There has been quite some activity here, with a <a href="http://okfnlabs.org/blog/2015/01/03/data-curators-wanted-for-core-datasets.html">call for data curators</a> (jump in if you are interested!). Currently, 35+ volunteers are contributing, with leadership from super contributor @sxren.</p>
<p>Most action takes place <a href="https://github.com/datasets/registry/issues">here</a> with datasets then appearing on the <a href="http://data.okfn.org/data/">frictionless data site</a>.</p>
<p>Some notable recent contributions include:</p>
<ul>
<li><a href="http://data.okfn.org/data/core/media-types">Media Types</a> (@bluechi and @sxren)</li>
<li><a href="http://data.okfn.org/data/core/membership-to-copyright-treaties">Membership to Copyright Treaties</a> (@bluechi)</li>
<li><a href="http://data.okfn.org/data/core/corruption-perceptions-index">Transparency International - Corruptions Perceptions Index</a></li>
<li><a href="http://data.okfn.org/data/core/top-level-domain-names">Top Level Domain Names</a></li>
<li><a href="http://data.okfn.org/data/core/geo-nuts-administrative-boundaries">NUTS administrative boundaries</a></li>
</ul>
<h2 id="data-package-libraries">Data Package libraries</h2>
<p><a href="http://data.okfn.org/doc/data-package">Data Packages</a> are a simple set of specifications for packaging data. Some great libraries have recently been released (and updated) for working with the Data Package format and related specs such as <a href="http://dataprotocols.org/table-schema/">JSON Table Schema</a>.</p>
<h3 id="dpmr-data-package-management-in-r">dpmr: Data Package management in R</h3>
<p>dpmr is for working with Data Packages in R. <a href="http://christophergandrud.github.io/dpmr/">Check it out here</a>.</p>
<h3 id="datapak-data-package-management-in-ruby">DataPak: Data Package management in Ruby</h3>
<p>DataPak is for working with Data Packages in Ruby, and provides some really nice extras like managing your packages locally, SQL integration and more. Read the announcement <a href="http://okfnlabs.org/blog/2015/04/26/datapak">on the Labs blog</a>, and check out the code <a href="https://github.com/textkit/datapak">here</a>.</p>
<h3 id="data-package-data-package-management-in-python">Data Package: Data Package management in Python</h3>
<p>Data Package, and Budget Data Package, are Python packages for working with Data Packages. These libraries have been around for a while, but recently were updated to add Python 3 support. Check out Data Package <a href="https://github.com/tryggvib/datapackage">here</a>, and Budget Data Package <a href="https://github.com/tryggvib/budgetdatapackage">here</a>.</p>
<h3 id="jtskit-working-with-json-table-schema-in-python">JTSKit: Working with JSON Table Schema in Python</h3>
<p><a href="http://dataprotocols.org/table-schema/">JSON Table Schema</a> is a specification for declaring schemas for data, and is used within Data Packages. JTSKit is a Python library for working with JSON Table Schema, providing interfaces for validating schemas, inferring schema from data, and a schema model class for easy use in Python code. Check it out <a href="https://github.com/okfn/jtskit-py">here</a>.</p>
<h2 id="ocr-pdf-to-text">OCR PDF to Text</h2>
<p>A new web service is available via Labs for converting documents (eg: PDF) to text using OCR. Read the announcement <a href="http://okfnlabs.org/blog/2015/02/21/documents-to-text.html">here</a>, and check out the code <a href="https://github.com/mattfullerton/tika-tesseract-docker">here</a>.</p>
<h2 id="goodtables">GoodTables</h2>
<p>GoodTables is a web service (and Python library/CLI) for validating tabular data. Read more about it in the announcement <a href="http://okfnlabs.org/blog/2015/03/06/goodtables-web-service.html">here</a>, check out the web service <a href="http://goodtables.okfnlabs.org/">here</a>, and the library <a href="https://github.com/okfn/goodtables">here</a>.</p>
<h2 id="databaker">Databaker</h2>
<p>ScraperWiki have released a new library for getting data out of spreadsheets. Read the announcement <a href="https://blog.scraperwiki.com/2015/03/databaker-making-spreadsheets-usable/">here</a>, and check out the code <a href="https://github.com/scraperwiki/databaker">here</a>.</p>
<h2 id="council-data-visualisations-and-standards">Council data visualisations and standards</h2>
<p>Steve Bennett of Open Knowledge Australia has been doing some awesome work standardising and visualising council data in Victoria, Australia. He’s hoping to gain wider adoption of the standards that are emerging, in Australia and beyond. The standardisation work is happening <a href="https://github.com/OKFNau/open-council-data">here</a>, on the OKFNAU repository on GitHub. See some of the data visualised on the <a href="http://openbinmap.org/">Open Bin Map</a> and <a href="http://opentrees.org/">Open Trees</a>.</p>
<h2 id="new-data-portal-for-washington-dc">New data portal for Washington DC</h2>
<p>Washington, DC’s <a href="http://opendata.dc.gov/">data catalog has a new home</a>. It operates on the ArcGIS Open Data platform and houses data relevant to city services in a variety of formats and with built-in APIs. The service is run out of the DC Office of the Chief Technology Officer, who have been quite responsive to issues and requests. You can give them a shout on Twitter as @opendatadc. Old datasets are still accessible <a href="http://legacy.data.dc.gov/">here</a> as they transition to the new site.</p>
<h2 id="remote-data-access-wrapper-for-the-nomis-api">Remote data access wrapper for the Nomis API</h2>
<p>Here’s an <a href="http://blog.ouseful.info/2015/03/09/sketching-out-a-python-pandas-remote-data-access-wrapper-for-the-nomis-api/">interesting blog post</a> detailing work in Python/Pandas over the Nomis API, coming out of work Tony Hirst is doing teaching data wrangling for the UK Cabinet Office.</p>
<h2 id="get-involved">Get involved</h2>
<p>Anyone can join the Labs community and get involved! Read more about how you can <a href="http://okfnlabs.org/join/">join the community</a> and participate by coding, wrangling data, or doing outreach and engagement. Also check out the <a href="http://okfnlabs.org/ideas/">ideas page</a> to see what’s cooking in the Labs, and the <a href="http://okfnlabs.org/newsletter.html">newsletter page</a> if you have items to submit to the next newsletter.</p>
Paul Walsh
Introducing datapak - Work with Tabular Data Packages using Ruby and ActiveRecord
2015-04-26T00:00:00+00:00
http://okfnlabs.org/blog/2015/04/26/datapak
<p><a href="http://data.okfn.org/doc/tabular-data-package">Tabular data packages</a>
are a pragmatic way of both publishing your own data and consuming the
data that others share with the world. The newly published
<a href="https://rubygems.org/gems/datapak">datapak</a> is a Ruby library that
lets you work with tabular data packages using ActiveRecord
and, thus, your SQL database of choice (by default the library
uses an in-memory SQLite database).</p>
<h2 id="using-datapak">Using datapak</h2>
<p>Let’s try using the datapak gem in a simple example that pulls a
<a href="http://data.okfn.org/data/core/s-and-p-500-companies">list of S&P 500 companies</a>
from the Frictionless Data <a href="http://data.okfn.org/data">dataset registry</a>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>require 'datapak'
Datapak.import(
's-and-p-500-companies'
)
</code></pre></div></div>
<p>Using <code class="language-plaintext highlighter-rouge">Datapak.import</code> will:</p>
<p>1) download all data packages to the <code class="language-plaintext highlighter-rouge">./pak</code> folder</p>
<p>2) (auto-)add all tables to an in-memory SQLite database using SQL <code class="language-plaintext highlighter-rouge">create_table</code>
commands via <code class="language-plaintext highlighter-rouge">ActiveRecord</code> migrations e.g.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>create_table :constituents_financials do |t|
t.string :symbol # Symbol (string)
t.string :name # Name (string)
t.string :sector # Sector (string)
t.float :price # Price (number)
t.float :dividend_yield # Dividend Yield (number)
t.float :price_earnings # Price/Earnings (number)
t.float :earnings_share # Earnings/Share (number)
t.float :book_value # Book Value (number)
t.float :_52_week_low # 52 week low (number)
t.float :_52_week_high # 52 week high (number)
t.float :market_cap # Market Cap (number)
t.float :ebitda # EBITDA (number)
t.float :price_sales # Price/Sales (number)
t.float :price_book # Price/Book (number)
t.string :sec_filings # SEC Filings (string)
end
</code></pre></div></div>
<p>3) (auto-)import all records using SQL inserts e.g.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>INSERT INTO constituents_financials
(symbol,
name,
sector,
price,
dividend_yield,
price_earnings,
earnings_share,
book_value,
_52_week_low,
_52_week_high,
market_cap,
ebitda,
price_sales,
price_book,
sec_filings)
VALUES
('MMM',
'3M Co',
'Industrials',
162.27,
2.11,
22.28,
7.284,
25.238,
123.61,
162.92,
104.0,
8.467,
3.28,
6.43,
'http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=MMM')
</code></pre></div></div>
<p>4) (auto-)add <code class="language-plaintext highlighter-rouge">ActiveRecord</code> models for all tables.</p>
<p>Now you can use all the “magic” of <code class="language-plaintext highlighter-rouge">ActiveRecord</code> to work
with the datasets. Example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class Constituent < ActiveRecord::Base
end
pp Constituent.count
# SELECT COUNT(*) FROM "constituents"
# => 496
pp Constituent.first
# SELECT "constituents".* FROM "constituents" ORDER BY "constituents"."id" ASC LIMIT 1
# => #<Constituent:0x9f8cb78
id: 1,
symbol: "MMM",
name: "3M Co",
sector: "Industrials">
pp Constituent.find_by!( symbol: 'MMM' )
# SELECT "constituents".*
FROM "constituents"
WHERE "constituents"."symbol" = "MMM"
LIMIT 1
# => #<Constituent:0x9f8cb78
id: 1,
symbol: "MMM",
name: "3M Co",
sector: "Industrials">
pp Constituent.find_by!( name: '3M Co' )
# SELECT "constituents".*
FROM "constituents"
WHERE "constituents"."name" = "3M Co"
LIMIT 1
# => #<Constituent:0x9f8cb78
id: 1,
symbol: "MMM",
name: "3M Co",
sector: "Industrials">
pp Constituent.where( sector: 'Industrials' ).count
# SELECT COUNT(*) FROM "constituents"
WHERE "constituents"."sector" = "Industrials"
# => 63
pp Constituent.where( sector: 'Industrials' ).all
# SELECT "constituents".*
FROM "constituents"
WHERE "constituents"."sector" = "Industrials"
# => [#<Constituent:0x9f8cb78
id: 1,
symbol: "MMM",
name: "3M Co",
sector: "Industrials">,
#<Constituent:0xa2a4180
id: 8,
symbol: "ADT",
name: "ADT Corp (The)",
sector: "Industrials">,...]
</code></pre></div></div>
<h3 id="how-to-manually-download-a-data-package">How to manually download a data package</h3>
<p>Use the <code class="language-plaintext highlighter-rouge">Datapak::Downloader</code> class to download a data package
to your disk (by default data packages get stored in <code class="language-plaintext highlighter-rouge">./pak</code>).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dl = Datapak::Downloader.new
dl.fetch( 'language-codes' )
dl.fetch( 's-and-p-500-companies' )
dl.fetch( 'un-locode' )
</code></pre></div></div>
<p>Will result in:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- pak
|-- language-codes
| |-- data
| | |-- language-codes-3b2.csv
| | |-- language-codes.csv
| | `-- language-codes-full.csv
| `-- datapackage.json
|-- s-and-p-500-companies
| |-- data
| | |-- constituents.csv
| | `-- constituents-financials.csv
| `-- datapackage.json
`-- un-locode
|-- data
| |-- code-list.csv
| |-- country-codes.csv
| |-- function-classifiers.csv
| |-- status-indicators.csv
| `-- subdivision-codes.csv
`-- datapackage.json
</code></pre></div></div>
<h3 id="how-to-manually-add-and-import-a-data-package">How to manually add and import a data package</h3>
<p>Use the <code class="language-plaintext highlighter-rouge">Datapak::Pak</code> class to read a data package and import it into
an SQL database.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pak = Datapak::Pak.new( './pak/un-locode/datapackage.json' )
pak.tables.each do |table|
table.up! # (auto-) add table using SQL create_table via ActiveRecord migration
table.import! # import all records using SQL inserts
end
</code></pre></div></div>
<p>That’s it.</p>
<h2 id="bonus-how-to-connect-to-a-different-sql-database">Bonus: How to connect to a different SQL database</h2>
<p>You can connect to any database supported by ActiveRecord. If you do not
establish a connection in your script, the default fallback
is an in-memory SQLite3 database.</p>
<h3 id="sqlite">SQLite</h3>
<p>For example, to create an SQLite3 database on disk, let’s say <code class="language-plaintext highlighter-rouge">datapak.db</code>,
use in your script (before the <code class="language-plaintext highlighter-rouge">Datapak.import</code> statement):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ActiveRecord::Base.establish_connection( adapter: 'sqlite3',
database: './datapak.db' )
</code></pre></div></div>
<h3 id="postgresql">PostgreSQL</h3>
<p>For example, to connect to a PostgreSQL database, use in your script
(before the <code class="language-plaintext highlighter-rouge">Datapak.import</code> statement):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>require 'pg' ## pull in PostgreSQL (pg) machinery
ActiveRecord::Base.establish_connection( adapter: 'postgresql',
username: 'ruby',
password: 'topsecret',
database: 'database' )
</code></pre></div></div>
<h2 id="find-out-more">Find Out More</h2>
<p>datapak</p>
<ul>
<li>home :: <a href="https://github.com/textkit/datapak">github.com/textkit/datapak</a></li>
<li>gem :: <a href="https://rubygems.org/gems/datapak">rubygems.org/gems/datapak</a></li>
<li>rdoc :: <a href="http://rubydoc.info/gems/datapak">rubydoc.info/gems/datapak</a></li>
</ul>
<p>Tabular Data Package</p>
<ul>
<li>spec :: <a href="http://dataprotocols.org/tabular-data-package">dataprotocols.org/tabular-data-package</a></li>
<li>datasets :: <a href="http://data.okfn.org/data">data.okfn.org/data</a></li>
</ul>
Gerald Bauer
The Good Tables web service
2015-03-06T00:00:00+00:00
http://okfnlabs.org/blog/2015/03/06/goodtables-web-service
<h1 id="introducing-the-good-tables-web-service">Introducing the Good Tables web service</h1>
<p>Good Tables is a free online service that helps you find out if your tabular data is actually good to use - it can check for structural problems (blank rows and columns) as well as ensure that data fits a specific schema.</p>
<p>Tabular data in CSV and Excel formats is one of the most common forms of data available on the web - especially when looking at <a href="http://okfn.org/opendata/">open data</a>. Unfortunately, much of that data is messy, with blank and incorrect rows, and unexpected values in some fields. (For example, date columns that do not feature well-formed dates. <a href="http://okfnlabs.org/bad-data/">See here for more examples of “bad data”</a>.)</p>
<p>That’s where Good Tables comes in: it checks your data for you, giving you quick and simple feedback on where your tabular data may not yet be quite perfect.</p>
<p>Good Tables uses the <a href="http://okfnlabs.org/blog/2015/02/20/introducing-tabular-validator.html">previously announced</a> <a href="https://github.com/okfn/goodtables">Good Tables Python library</a>, and is developed by <a href="https://okfn.org">Open Knowledge</a> with funding from the <a href="https://www.gov.uk/government/groups/open-data-user-group">Open Data User Group</a>.</p>
<p>Good Tables is currently an alpha release; we invite the community to start using and contributing to it to help us move towards v1.0.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/f1bTx6Zaotk" frameborder="0" allowfullscreen=""></iframe>
<h2 id="api">API</h2>
<p>The documentation for the API can be found <a href="http://goodtables.okfnlabs.org/api">here</a>.</p>
<p>Using the API is easy: POST or GET your data, and get back a JSON object containing the report.</p>
<p>For example:</p>
<pre><code>
# make a request
curl http://goodtables.okfnlabs.org/api/run --data "data=https://raw.githubusercontent.com/okfn/goodtables/master/examples/row_limit_structure.csv&schema=https://raw.githubusercontent.com/okfn/goodtables/master/examples/test_schema.json"
# the response will be like
{
"report": {
"summary": {
"bad_row_count": 1,
"total_row_count": 10,
...
},
"results": [
{
"result_id": "structure_001", # the ID of this result type
"result_level": "error", # the severity of this result type (info/warning/error)
"result_message": "Row 1 is defective: there are more cells than headers", # a message that describes the result
"result_name": "Defective Row", # a human-readable title for this result
"result_context": ['38', 'John', '', ''], # the row values from which this result triggered
"row_index": 1, # the idnex of the row
"row_name": "", # If the row has an id field, this is displayed, otherwise empty
"column_index": 4, # the index of the column
"column_name": "" # the name of the column (the header), if applicable
},
...
]
}
}
</code></pre>
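<p>The same request is a one-liner from Python - this mirrors the curl call above using the requests library:</p>
<pre><code>
import requests

# Same endpoint and parameters as the curl example above.
response = requests.post("http://goodtables.okfnlabs.org/api/run", data={
    "data": "https://raw.githubusercontent.com/okfn/goodtables/master/examples/row_limit_structure.csv",
    "schema": "https://raw.githubusercontent.com/okfn/goodtables/master/examples/test_schema.json",
})
summary = response.json()["report"]["summary"]
print(summary["bad_row_count"], "bad rows out of", summary["total_row_count"])
</code></pre>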
<p>For more details on the report response, see the <a href="https://goodtables.readthedocs.io/en/latest/reports.html">report section</a> of the <a href="https://goodtables.readthedocs.io/en/latest/index.html">Good Tables documentation</a>.</p>
<h2 id="ui">UI</h2>
<p>The web service also features a form for manual validation of data via a UI.</p>
<p>Let’s see it in action:</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/f1bTx6Zaotk" frameborder="0" allowfullscreen=""></iframe>
<iframe width="560" height="315" src="https://www.youtube.com/embed/hblUuIjobrc" frameborder="0" allowfullscreen=""></iframe>
<h2 id="contributions">Contributions</h2>
<p>We invite all contributions. Feel free to <a href="https://github.com/okfn/goodtables-web/issues">open an issue</a> if you encounter any problems, or just start hacking and send a pull request.</p>
Paul Walsh
A public web service for document to text conversion including OCR
2015-02-21T00:00:00+00:00
http://okfnlabs.org/blog/2015/02/21/documents-to-text
<h2 id="getting-text-out-of-documents">Getting text out of documents</h2>
<p>Last year I was working on <a href="http://beta.offenedaten.de">beta.offenedaten.de</a>, a catalog of data catalogs in Germany using the <a href="http://www.ckan.org/">CKAN</a> platform as the basis. Although the topic of <a href="https://lists.okfn.org/pipermail/ckan-dev/2014-September/008051.html">how to enable full-text search of documents in CKAN data catalogs</a> is somewhat open, I wanted to be able to collect the full text of open data resources for searching. We can’t assume that PDFs are always nice PDFs full of text: they can just as easily be scans of paper documents without any optical character recognition (OCR) having taken place. So when we extract text from documents, it would be nice to have an option to do OCR too. This is a need common to other projects we have at <a href="http://www.okfn.de">OKF Germany</a>, and, after discussion on the <a href="https://lists.okfn.org/pipermail/okfn-labs/2014-October/001491.html">Labs list</a>, apparently something people would like to have.</p>
<h2 id="lend-me-your-files-i-send-you-back-text">Lend me your files, I send you back text</h2>
<p>In short, there is now a web service available for converting <a href="http://tika.apache.org/1.8/formats.html">a multitude of document types</a> to simple text. It lives at:</p>
<p>http://beta.offenedaten.de:9998/tika</p>
<p>To test it, just throw some images with text in them at it. For example, on a terminal on Mac or Linux:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -T tiff_example.tif http://beta.offenedaten.de:9998/tika
</code></pre></div></div>
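<p>The same thing from Python, for anyone scripting against the service: curl’s <code class="language-plaintext highlighter-rouge">-T</code> flag performs an HTTP PUT of the raw file body, so we do the same with the requests library:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import requests

# curl -T is an HTTP PUT of the file contents; Accept asks for plain text.
with open("tiff_example.tif", "rb") as f:
    response = requests.put("http://beta.offenedaten.de:9998/tika",
                            data=f,
                            headers={"Accept": "text/plain"})
print(response.text)  # the extracted (possibly OCR'd) text
</code></pre></div></div>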
<h3 id="how-it-was-built">How it was built</h3>
<p>My involvement in the code for this project was zero. I just took the <a href="http://wiki.apache.org/tika/TikaJAXRS">web server part</a> of the developer version of the <a href="http://tika.apache.org/">Apache Tika Project</a> and put it on a server. OCR support using <a href="https://code.google.com/p/tesseract-ocr/">Tesseract</a> has <a href="http://wiki.apache.org/tika/TikaOCR">recently been added</a> to Tika.</p>
<h3 id="roll-your-own">Roll your own</h3>
<p>For intensive use of the service or to include it in your own infrastructure, you can use this <a href="https://registry.hub.docker.com/u/mattfullerton/tika-tesseract-docker/">Docker image</a>, built on this <a href="https://github.com/mattfullerton/tika-tesseract-docker">GitHub Repository</a>. In case you don’t know <a href="https://www.docker.com/whatisdocker/">what Docker is</a>, don’t ask me, as I won’t do a great job of explaining it to you. I’m sure there’s a few Docker experts out there who could improve the Dockerfile setup: pull requests <a href="https://github.com/mattfullerton/tika-tesseract-docker">on GitHub</a> are welcome!</p>
<h3 id="improvements">Improvements</h3>
<p>The big missing feature from this, and from Tika generally, is the ability to perform OCR on a PDF when little or no text comes back. There is <a href="https://github.com/okfn/ideas/issues/88#issuecomment-71388714">a trick to get the OCR on a PDF</a>, but your application will need to decide when to employ it, for example based on the non-OCR results.</p>
<h3 id="get-involved">Get involved</h3>
<p>A quick look at the <a href="https://github.com/okfn/ideas/issues/88">discussion on GitHub</a> shows how many ideas there are floating around to improve open document processing tooling on the web. This is just one tiny piece of that puzzle. More concretely, it would be great to get some Open Knowledge involvement in the Tika project going to support them, particularly with the “no text found in PDF” conundrum above. Just <a href="http://tika.apache.org/contribute.html">get in touch with them</a> directly or with me via the <a href="https://github.com/okfn/ideas/issues/88">GitHub issue</a> or <a href="mailto:matt.fullerton@gmail.com">old-fashioned email</a>.</p>
<h2 id="avoiding-the-ocr-problem-in-the-first-place">Avoiding the OCR problem in the first place</h2>
<p>I thought it might be worth mentioning to anyone involved in putting open data and open documents on the web that there <a href="http://computers.tutsplus.com/tutorials/how-to-ocr-text-in-pdf-and-image-files-in-adobe-acrobat--cms-20406">is a procedure for adding the text to a scan-based PDF</a>, using Adobe Acrobat. If anyone knows of an open source solution for this (i.e. embedding and attaching the OCR text in the images in the PDF), I would love to <a href="mailto:matt.fullerton@gmail.com">hear from you</a>.</p>
Matt Fullerton
Introducing Good Tables
2015-02-20T00:00:00+00:00
http://okfnlabs.org/blog/2015/02/20/introducing-goodtables
<h2 id="what-is-it">What is it?</h2>
<p><a href="https://github.com/okfn/goodtables">Good Tables</a> is a Python package for validating tabular data through a processing pipeline.</p>
<p>It is built by <a href="https://okfn.org">Open Knowledge</a>, with funding from the <a href="https://www.gov.uk/government/groups/open-data-user-group">Open Data User Group</a>. Good Tables is currently an <em>alpha release</em>.</p>
<p>Applications range from simple validation checks on CSV files, to integration with a larger ETL pipeline.</p>
<p>The codebase currently ships with two validators that can be used in a pipeline:</p>
<ul>
<li>The <a href="https://github.com/okfn/goodtables/blob/master/goodtables/processors/structure.py">StructureProcessor</a> checks for common structural errors</li>
<li>The <a href="https://github.com/okfn/tabular-validator/blob/master/goodtables/processors/schema.py">SchemaProcessor</a> checks for conformance to a JSON Table Schema</li>
</ul>
<p>There is a hook to add custom processors, and there are plans to include more processors in the core library.</p>
<p>Good Tables ships with <a href="https://goodtables.readthedocs.io/en/latest/">some documentation</a>, but it is not yet complete. You are welcome to <a href="https://github.com/okfn/goodtables">check out the code</a>, <a href="https://github.com/okfn/goodtables/blob/master/test.sh">run the tests</a> (or <a href="https://travis-ci.org/okfn/goodtables">check them on Travis</a>), <a href="https://github.com/okfn/goodtables/issues">open an issue</a>, or make a pull request to help us iterate to a version one release (<a href="https://github.com/okfn/goodtables/milestones/Backlog">here is the backlog</a>).</p>
<p>We have also released some packages that are used in Good Tables: <a href="https://github.com/okfn/goodtables-web">Good Tables Web</a>, <a href="https://github.com/okfn/jtskit-py">JTSKit</a>, and <a href="https://github.com/okfn/tellme">TellMe</a>. You can read more about each of these below.</p>
<h2 id="why">Why?</h2>
<p>The development of Good Tables has been driven by a real-world pain point: monitoring and validating government spending data in the United Kingdom (the dashboard for this project is under development <a href="https://github.com/okfn/spend-publishing-dashboard">here</a>). A brief overview of this use case can demonstrate the value proposition of Good Tables.</p>
<h3 id="the-problem">The Problem</h3>
<p>In the UK, various government departments publish spend data. This data is required to be accessible: that is, machine-readable and publicly available. Additionally, the data must conform to a schema.</p>
<p>Monitoring the publication of such data, and validating its well-formedness, is a difficult task. The data is produced under a wide variety of circumstances (e.g. with very different resources available), and the producers of this data have no tools at hand to confirm that their work is correct.</p>
<p>Considering that spend data is produced at regular periodic intervals, and departments are expected to publish in a timely manner, the problem of producing well-formed data is compounded.</p>
<h3 id="the-solution">The Solution</h3>
<p>Good Tables provides part of the solution with tooling to ensure data is machine readable and well formed. All spend data across the various government departments is collected and run through a Good Tables pipeline at regular intervals.</p>
<p>The validation pipeline for this data looks something like the following:</p>
<ul>
<li>Is the file readable as CSV?</li>
<li>Are there headers in the first line of the file?</li>
<li>Are there any empty headers, empty rows, or ragged rows?</li>
<li>Do all the values in the file conform with the expected schema (columns of numbers, dates, etc.)?</li>
</ul>
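<p>A rough sketch of what the structural part of such a pipeline looks for, in plain Python (standard library only, independent of Good Tables; every name below is ours, not the library’s):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustration only: hand-rolled versions of the structural checks above.
# Good Tables implements these (and more) in its StructureProcessor.
import csv

def check_structure(path):
    errors = []
    with open(path) as f:
        reader = csv.reader(f)
        try:
            headers = next(reader)
        except StopIteration:
            return ["file is empty: no header row"]
        if any(not h.strip() for h in headers):
            errors.append("empty header(s) in the first row")
        for lineno, row in enumerate(reader, start=2):
            if not any(cell.strip() for cell in row):
                errors.append("row %d is empty" % lineno)
            elif len(row) != len(headers):
                errors.append("row %d is ragged (%d cells, expected %d)"
                              % (lineno, len(row), len(headers)))
    return errors

print(check_structure("spend-data.csv"))
</code></pre></div></div>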
<p>Any errors detected while the pipeline is running are written into a report. When the pipeline finishes running, a user-facing report is generated, providing actionable data on what exactly is wrong with the file (so the data producers can take steps to fix such errors).</p>
<h2 id="how-can-i-use-it-now">How can I use it now?</h2>
<p>If you are running Python 2.7, 3.3 or 3.4, you can start using Good Tables today.</p>
<p>As mentioned above, this is an <em>alpha release</em>. Still, we have decent test coverage, and we are hoping to uncover bugs and weirdness through wider usage.</p>
<p>Here’s how you can use Good Tables right now:</p>
<h3 id="in-existing-code-bases">In existing code bases</h3>
<p>See some examples in the <a href="https://github.com/okfn/goodtables/tree/master/tests">test suite</a> to get a working idea of the API and how you could integrate a Good Tables pipeline, or stand-alone processor, into your existing workflow with tabular data.</p>
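<p>For orientation before diving into the tests, the basic shape of the API is roughly as follows. Treat the module path and signatures as assumptions to verify against the test suite, since this is an alpha release and the interface may shift:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Rough sketch of running a pipeline from code -- verify the exact
# names against the test suite before relying on them.
from goodtables.pipeline import Pipeline

pipeline = Pipeline('spend-data.csv', processors=('structure', 'schema'))
valid, report = pipeline.run()

if not valid:
    # the report carries actionable detail for data producers
    print(report)
</code></pre></div></div>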
<h3 id="as-a-cli">As a CLI</h3>
<p>If you are doing data wrangling in the terminal, Good Tables comes with a CLI called “goodtables”. See <a href="https://github.com/okfn/goodtables/blob/master/goodtables/cli/main.py">here</a> for the CLI interface. This is still very much a work in progress, and currently exposes a subset of the Good Tables pipeline interface.</p>
<h3 id="via-the-web">Via the web</h3>
<p>The <a href="https://github.com/okfn/goodtables-web">Good Tables Web</a> package provides both a Web API and a simple form UI over Good Tables. Read about the current API <a href="https://github.com/okfn/goodtables-web/blob/master/README.md">here</a>.</p>
<h2 id="extra-goodies">Extra goodies</h2>
<p>Good Tables has been developed as part of a larger project, and we are pulling functionality out into standalone packages where possible and practical.</p>
<p>Like the Good Tables package, these are all <em>alpha releases</em>, but each has a passing test suite on Python 2.7, 3.3, and 3.4.</p>
<h3 id="tellme">TellMe</h3>
<p><a href="https://github.com/okfn/tellme">TellMe</a> is a Python package for creating user-facing reports from things happening in code. It is a simple library that provides a logger-like interface to build reports, and then generate them in several output formats.</p>
<h3 id="jtskit">JTSKit</h3>
<p><a href="https://github.com/okfn/jtskit-py">JTSKit</a> is a Python package providing a set of utilities for working with JSON Table Schema.</p>
<h3 id="good-tables-web">Good Tables Web</h3>
<p><a href="https://github.com/okfn/goodtables-web">Good Tables Web</a> is a Flask application that provides a Web API over Good Tables, as well as a simple form UI.</p>
Paul Walsh
Wanted - Data Curators to Maintain Key Datasets in High-Quality, Easy-to-Use and Open Form
2015-01-03T00:00:00+00:00
http://okfnlabs.org/blog/2015/01/03/data-curators-wanted-for-core-datasets
<p>Wanted: volunteers to join a team of “Data Curators” maintaining <strong>“core” datasets</strong> (like GDP or ISO-codes) in <strong>high-quality, easy-to-use and open</strong> form.</p>
<ul>
<li><strong>What is the project about</strong>: Collecting and maintaining important and commonly-used (“core”) datasets in high-quality, standardized and easy-to-use form - in particular, as up-to-date, well-structured <a href="http://data.okfn.org/doc/data-package/">Data Packages</a>.<br />
The “Core Datasets” effort is part of the broader <a href="http://data.okfn.org/">Frictionless Data initiative</a>.</li>
<li><strong>What would you be doing</strong>: identifying and locating core (public) datasets, cleaning and standardizing the data and making sure the results are kept up to date and easy to use</li>
<li><strong>Who can participate</strong>: anyone can contribute. Details on the skills needed are below.</li>
<li><strong>Get involved</strong>: read more below or jump straight to <a href="#sign-up">the sign-up section</a>.</li>
</ul>
<p><img src="http://assets.okfn.org/p/data/img/icon-128.png" alt="" style="display: block; margin: auto;" /></p>
<h2 id="what-is-the-core-datasets-effort">What is the Core Datasets effort?</h2>
<p>Summary: Collect and maintain important and commonly-used (“core”) datasets in high-quality, reliable and easy-to-use form (as Data Packages).</p>
<p>Core = important and commonly-used datasets e.g. reference data (country codes) and indicators (inflation, GDP)</p>
<p>Curate = take existing data and provide it in high-quality, reliable, and easy-to-use form (standardized, structured, open)</p>
<ul>
<li><strong>Full details</strong>: including slide-deck at <a href="http://data.okfn.org/roadmap/core-datasets">http://data.okfn.org/roadmap/core-datasets</a>.</li>
<li><strong>Live examples</strong>: You can find already packaged core datasets at <a href="http://data.okfn.org/data/">http://data.okfn.org/data/</a> and in “raw” form on Github at <a href="https://github.com/datasets/">https://github.com/datasets/</a></li>
</ul>
<iframe src="https://docs.google.com/presentation/d/1-BLImNBv2RtEkFVq_DdWjy05baHfprWHHdXZiMrmihQ/embed?start=false&loop=false&delayms=3000" frameborder="0" width="480" height="389" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>
<h2 id="what-roles-and-skills-are-needed">What Roles and Skills are Needed</h2>
<p>We need a variety of roles from identifying new “core” datasets to packaging the data to performing quality control (checking metadata etc).</p>
<p><strong>Core Skills</strong> - at least one of these skills will be needed:</p>
<ul>
<li><strong>Data Wrangling Experience</strong>. Many of our source datasets are not complex (just an Excel file or similar) and can be “wrangled” in a Spreadsheet program. What we therefore recommend is at least one of:
<ul>
<li>Experience with a Spreadsheet application such as Excel or (preferably) Google Docs including use of formulas and (desirably) macros (you should at least know how you could quickly convert a cell containing ‘2014’ to ‘2014-01-01’ across 1000 rows)</li>
<li>Coding for data processing (especially scraping) in one or more of python, javascript, bash</li>
</ul>
</li>
<li><strong>Data sleuthing</strong> - the ability to dig up data on the web (specific desirable skills: you know how to search by filetype in google, you know where the developer tools are in chrome or firefox, you know how to find the URL a form posts to)</li>
</ul>
<p><strong>Desirable Skills</strong> (the more the better!):</p>
<ul>
<li>Data vs Metadata: know the difference between data and metadata</li>
<li>Familiarity with Git (and Github)</li>
<li>Familiarity with a command line (preferably bash)</li>
<li>Know what JSON is</li>
<li>Mac or Unix is your default operating system (will make access to relevant tools that much easier)</li>
<li>Knowledge of Web APIs and/or HTML</li>
<li>Use of curl or similar command line tool for accessing Web APIs or web pages</li>
<li>Scraping using a command line tool or (even better) by coding yourself</li>
<li>Know what a Data Package and a Tabular Data Package are</li>
<li>Know what a text editor is (e.g. notepad, textmate, vim, emacs, …) and know how to use it (useful for both working with data and for editing Data Package metadata)</li>
</ul>
<p><a name="sign-up" id="sign-up"></a></p>
<h2 id="get-involved---sign-up-now">Get Involved - Sign Up Now!</h2>
<p>We are looking for volunteer contributors to form a “curation team”.</p>
<ul>
<li><strong>Time commitment</strong>: Members of the team commit to at least 8-16h per month (though this will be an average - if you are especially busy with other things one month and do less that is fine)</li>
<li><strong>Schedule</strong>: There is no schedule, so you can contribute at any time that is good for you - evenings, weekends, lunch-times etc</li>
<li><strong>Location</strong>: all activity will be carried out online so you can be based anywhere in the world</li>
<li><strong>Skills</strong>: see above</li>
</ul>
<p>To register your interest fill in the following form. Any questions, please <a href="/contact/">get in touch directly</a>.</p>
<iframe src="https://docs.google.com/forms/d/1d9chMK0jU9CJs0_mnK_JQU9iIJocjm7AEp0ZM5eSiNg/viewform?embedded=true" width="620" height="1425" frameborder="0" marginheight="0" marginwidth="0">Loading...</iframe>
<h3 id="want-to-dive-straight-in">Want to Dive Straight In?</h3>
<p>Can’t wait to get started as a Data Curator? You can dive straight in and start packaging the already-selected (but not packaged) core datasets. Full instructions here:</p>
<p><a href="http://data.okfn.org/roadmap/core-datasets#contribute">http://data.okfn.org/roadmap/core-datasets#contribute</a></p>
Rufus Pollock
A Data API for Data Packages in Seconds Using CKAN and its DataStore
2014-09-11T00:00:00+00:00
http://okfnlabs.org/blog/2014/09/11/data-api-for-data-packages-with-dpm-and-ckan
<p><code class="language-plaintext highlighter-rouge">dpm</code> the command-line ‘data package manager’ now supports pushing <a href="http://data.okfn.org/standards">(Tabular)
Data Packages</a> straight into a <a href="http://ckan.org/">CKAN instance</a> (including
pushing all the data into the <a href="http://docs.ckan.org/en/latest/maintaining/datastore.html">CKAN DataStore</a>):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dpm ckan {ckan-instance-url}
</code></pre></div></div>
<p>This allows you, in seconds, to get a fully-featured web data API – including <a href="http://docs.ckan.org/en/latest/maintaining/datastore.html#ckanext.datastore.logic.action.datastore_search">JSON</a> and
<a href="http://docs.ckan.org/en/latest/maintaining/datastore.html#ckanext.datastore.logic.action.datastore_search_sql">SQL-based</a> query APIs:</p>
<p><img src="http://assets.okfnlabs.org/p/dpm/img/dpm-ckan.gif" alt="dpm ckan demo" /></p>
<p style="text-align: center; font-size: x-small"><a href="http://assets.okfnlabs.org/p/dpm/img/dpm-ckan.gif">View fullsize</a></p>
<p>Once you have a nice web data API like this we can very easily create data-driven applications and visualizations. As a simple demonstration, there’s the <a href="http://dev.rufuspollock.org/ckan-explorer/">CKAN Data Explorer</a> (<a href="http://dev.rufuspollock.org/ckan-explorer/?endpoint=http://datahub.io&resource=ea3926e3-43a8-46d0-832a-e53efd61ebb0">example with IMF data</a> - see below).</p>
<h2 id="where-can-i-find-a-ckan-instance-to-upload-to">Where Can I Find a CKAN instance to Upload to?</h2>
<p>If you’re looking for a CKAN site to upload your Data Packages to, we recommend
the <a href="http://datahub.io/">DataHub</a>, which is community-run and free. To upload to the DataHub
you’ll want to:</p>
<ol>
<li>
<p>Configure the DataHub CKAN instance in your <code class="language-plaintext highlighter-rouge">.dpmrc</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[ckan.datahub]
url = http://datahub.io/
apikey = your-api-key
</code></pre></div> </div>
</li>
<li>
<p>Upload your Data Package</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dpm ckan datahub --owner_org=your-organization
</code></pre></div> </div>
<p>You have to set the owner organization, as all datasets on the DataHub need an
owner organization.</p>
</li>
</ol>
<h2 id="one-i-did-earlier">One I Did Earlier</h2>
<p>Here’s a live example of one “I did earlier”:</p>
<ul>
<li>Here’s the source Data Package: <a href="http://data.okfn.org/data/core/imf-weo">IMF World Economic Outlook in data.okfn.org
registry</a> (<a href="https://github.com/datasets/imf-weo">Data Package on github
(source)</a>)</li>
<li>Get this on your local machine (<code class="language-plaintext highlighter-rouge">dpm install</code> or just clone the github repo)</li>
<li>Then I uploaded it: <code class="language-plaintext highlighter-rouge">dpm ckan http://datahub.io/ --owner_org=rufuspollock</code></li>
<li>Now it’s live on the DataHub: <a href="http://datahub.io/dataset/imf-weo">http://datahub.io/dataset/imf-weo</a>
<ul>
<li>Indicators: <a href="http://datahub.io/dataset/imf-weo/resource/ea3926e3-43a8-46d0-832a-e53efd61ebb0">http://datahub.io/dataset/imf-weo/resource/ea3926e3-43a8-46d0-832a-e53efd61ebb0</a></li>
<li>Values: <a href="http://datahub.io/dataset/imf-weo/resource/24cd8ebe-fa3f-4353-9ad9-d53bd88751a6">http://datahub.io/dataset/imf-weo/resource/24cd8ebe-fa3f-4353-9ad9-d53bd88751a6</a></li>
<li>Note this is a normalized dataset in which there are 2 tables (the
DataStore supports JOINS if we want to put them back together)</li>
</ul>
</li>
<li>Here’s a sample API query to get all indicators related to GDP: <a href="http://datahub.io/api/action/datastore_search?resource_id=ea3926e3-43a8-46d0-832a-e53efd61ebb0&limit=5&q=GDP">http://datahub.io/api/action/datastore_search?resource_id=ea3926e3-43a8-46d0-832a-e53efd61ebb0&limit=5&q=GDP</a></li>
<li>Now the data has a nice web Data API you can easily build data-driven apps or
visualizations. For example, the <a href="http://dev.rufuspollock.org/ckan-explorer/">CKAN Explorer</a> is a simple JS +
HTML app which allows you to explore CKAN DataStore data. Here’s the app
pre-loaded with the <a href="http://dev.rufuspollock.org/ckan-explorer/?endpoint=http://datahub.io&resource=ea3926e3-43a8-46d0-832a-e53efd61ebb0">DataStore indicator data</a></li>
</ul>
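<p>The sample API query above can also be exercised straight from a terminal; piping through <code class="language-plaintext highlighter-rouge">python -m json.tool</code> just pretty-prints the response:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl "http://datahub.io/api/action/datastore_search?resource_id=ea3926e3-43a8-46d0-832a-e53efd61ebb0&limit=5&q=GDP" | python -m json.tool
</code></pre></div></div>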
<p>Context: a big motivation (personally) for doing this is that I’d like to see a
nice web data API available for the <a href="http://data.okfn.org/data/">“Core” Data Packages</a> we’re creating
as part of the <a href="http://data.okfn.org/">Frictionless Data effort</a>. If you’re interested
in helping, <a href="http://discuss.okfn.org/t/data-packages-creating-finding-and-tooling/48">get in touch</a>.</p>
<h2 id="links">Links</h2>
<ul>
<li><a href="http://data.okfn.org/tools/">dpm Homepage</a></li>
<li><a href="https://github.com/okfn/dpm">dpm on Github</a></li>
<li><a href="https://github.com/okfn/datapackage-ckan">data package to ckan (node) library</a></li>
<li>IRC: freenode.net Channel: #okfn</li>
</ul>
Rufus Pollock
Bubbles: Python ETL Framework (prototype)
2014-09-01T00:00:00+00:00
http://okfnlabs.org/blog/2014/09/01/bubbles-python-etl
<h2 id="introduction-and-etl">Introduction and ETL</h2>
<p>The abbreviation <em>ETL</em> stands for <em>extract, transform and load</em>. What is it
good for? For everything between data sources and fancy visualisations. In a
data warehouse, data spend most of their time going through some kind of ETL
before they reach their final state. ETL is mostly automated and
reproducible, and should be designed so that it is not difficult to track
how the data move through the processing pipes.</p>
<p>A data warehouse stands and falls on its ETL.</p>
<h2 id="bubbles">Bubbles</h2>
<p><a href="http://bubbles.databrewery.org">Bubbles</a> is, or rather is meant to be, an
ETL framework written in Python, though not necessarily meant to be used from
Python only. Bubbles is based on metadata describing the data processing
pipeline (the ETL) rather than on a script-based description. The
principles of the framework can be summarized as follows:</p>
<ul>
<li>ETL is described as a data processing pipeline which is a <a href="https://en.wikipedia.org/wiki/Directed_graph">directed
graph</a></li>
<li>Processing operations are nodes in the graph, such as <em>aggregation</em>,
<em>filtering</em>, <em>dataset comparison (diff)</em>, <em>conversion</em>, …</li>
<li>Nodes might have multiple different inputs and a single output (there might
be multiple outgoing connections, but all of them are the same) – the inputs
are considered <em>operands</em> to the operation and the output is the operation
<em>result</em>.</li>
<li>Data do not flow unless it is necessary</li>
</ul>
<p>The pipeline is described in such a way that it is technology agnostic – the
ETL developer, the person who wants data to be processed, does not have to
care about how to access and work with data in a particular data store; they can
just focus on the task: delivering the data in the form it needs to be
delivered.</p>
<h3 id="data-objects-and-data-store">Data Objects and Data Store</h3>
<p>The core of Bubbles are <em>data objects</em> – an abstract concept of datasets which
might have multiple internal representations. What actually flows between the
nodes is not the data itself, but those virtual representations of data and their
compositions. Data are fetched only if really necessary – if there is no
other way to compose the data, such as a join between a database table
and a CSV file.</p>
<p>Here are a few objects with different representations:</p>
<p><img src="http://okfnlabs.org/img/posts/bubbles/bubbles-object_representations.png" alt="Object Representations" /></p>
<p>The objects are:</p>
<ul>
<li>an object which originates from a <em>CSV file</em>: it can be processed mainly using
Python iterators, but it retains its textual CSV nature, just in case some
of the nodes know how to work with it more efficiently, for example
row filtering without actually parsing the CSV into row objects</li>
<li>a <em>SQL object representing a table</em> – it can be composed into other SQL
statements or can be used directly as a Python iterable</li>
<li>a <em>MongoDB collection</em> – similar to the previous SQL table, it can be iterated as
a raw stream of documents</li>
<li>a <em>SQL statement</em> which might be the result of previous operations or our own
complex query. It can be used as a statement and composed with further
operations, or the data can be fetched and iterated over in Python. Since
this SQL object comes from a known database (PostgreSQL in this case) which
implements a <a href="http://www.postgresql.org/docs/current/static/sql-copy.html">COPY</a>
command that generates CSV output, we can treat the object as such and
provide the option to use a CSV representation as well</li>
<li>a <em>Twitter API object</em> – an example of a data object that does not actually
exist for us as a physical table; we do not even know how many
original tables Twitter is feeding us the data from, and we do not have to
care at all. We are just fine with having the impression of an iterable
dataset.</li>
</ul>
<p>To be more concrete, take simple filtering as an example. Say we have a sample
of tweets stored in a SQL database, in
<a href="http://docs.mongodb.org/manual/tutorial/query-documents/">MongoDB</a>
and, obviously, <a href="https://dev.twitter.com/docs/api/1/get/statuses/user_timeline">on Twitter</a>.
We want to get all tweets by OKFN. In SQL we use a SQL driver, connect to the
database and run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT * FROM tweets WHERE screen_name = 'okfn'
</code></pre></div></div>
<p>in Mongo we use a mongodb driver, connect to the database and do:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>db.tweets.find(
    { screen_name: 'okfn' }
)
</code></pre></div></div>
<p>and in Twitter we just issue the following HTTP request:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>https://api.twitter.com/1/statuses/user_timeline.json?screen_name=okfn
</code></pre></div></div>
<p>We asked for the same data object – <em>a tweet</em> – in three different data stores,
and we had to use three different approaches. That does not look too bad to us, “the
tech people”. But what if we could just write:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>p = Pipeline(...)
p.source("data", "tweets")
p.filter_value("screen_name", "okfn")
p.pretty_print()
p.run()
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">"data"</code> is the data store. We as ETL designers do not have to worry about what
kind of data store it is, how to talk to it, or how to get data from it.</p>
<p>Now we would like to count the tweets, so let us add the <code class="language-plaintext highlighter-rouge">aggregate()</code> operation,
which by default yields only the record count:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>p = Pipeline(...)
p.source("data", "tweets")
p.filter_value("screen_name", "okfn")
p.aggregate()
p.pretty_print()
p.run()
</code></pre></div></div>
<p>What happens here? For example, in the SQL case the <code class="language-plaintext highlighter-rouge">COUNT()</code> aggregation
function will be used. For Twitter, because our backend does not know better,
all the tweets will have to be pulled from the Twitter API and counted
one by one. This is sad, but good for our example: the objective was to
deliver the desired result, and that happened.</p>
<h3 id="context">Context</h3>
<p>One thing is missing in my examples above: <code class="language-plaintext highlighter-rouge">Pipeline(...)</code> – the pipeline
works in a context. We need to provide the description of data stores. For
example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>stores = { "data": {"type": "sql", "url": "postgresql://localhost/twitter" }}
</code></pre></div></div>
<p>Stores have an interface for getting datasets by name or creating new
datasets. A dataset might be:</p>
<ul>
<li><em>table</em> in a <em>SQL</em> store</li>
<li><em>collection</em> in a <em>MongoDB</em> store</li>
<li><em>CSV file</em> in a store represented by a directory of CSV files</li>
<li>a <em>newline-delimited JSON</em> file in a store represented by a directory of
such JSON files</li>
<li>resource collection over an API, such as the Twitter example above</li>
<li>dataset from a <a href="http://data.okfn.org/doc/data-package">datapackage</a></li>
</ul>
<p>The ETL designer should not have to care about the underlying implementation; they should
care only about having “a set of data that looks like a table”. A store object
responds to methods such as <code class="language-plaintext highlighter-rouge">object_names()</code> or <code class="language-plaintext highlighter-rouge">get_object(name)</code>.</p>
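<p>As a sketch of how that interface feels in use (the <code class="language-plaintext highlighter-rouge">open_store</code> helper and the surrounding scaffolding are assumptions for illustration – check the Bubbles documentation for the exact calls):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustration of "a set of data that looks like a table".
# object_names() and get_object(name) are the methods mentioned above;
# the rest is made up for the example.
from bubbles import open_store  # assumed import path

store = open_store("sql", url="postgresql://localhost/twitter")

print(store.object_names())          # e.g. ['tweets', 'users']

tweets = store.get_object("tweets")
for row in tweets:                   # every dataset is at least iterable
    print(row)
</code></pre></div></div>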
<h3 id="operations">Operations</h3>
<p>The ETL operations work on data objects provided as operands. An operation
returns another data object. As mentioned above, the flow of data is just
virtual. That means that when we are filtering the data, the framework might
actually be composing a SQL <code class="language-plaintext highlighter-rouge">WHERE</code> statement instead of just pulling the data
out of the database and filtering it row by row in Python.</p>
<p>It is similar with fields in the dataset – if we want to keep just certain columns,
why pass them all around in the first place? Why not ask only for those
that we actually need at the end? That is what Bubbles should do. Therefore
the <code class="language-plaintext highlighter-rouge">keep_fields()</code> operation just selects certain columns when used in the
SQL context.</p>
<p>There might be multiple implementations of the same operation. Which
implementation (function) is used is determined at the time of pipeline
execution. <code class="language-plaintext highlighter-rouge">aggregate()</code> might be in-python row-by-row aggregation using a
dictionary or it might be <code class="language-plaintext highlighter-rouge">SUM()</code> or <code class="language-plaintext highlighter-rouge">AVG()</code> with <code class="language-plaintext highlighter-rouge">GROUP BY</code> statement in SQL,
depending on which kind of object is passed to the operation.</p>
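<p>The mechanics of that choice can be pictured as a registry of implementations keyed by data representation. This is a conceptual sketch of the dispatch idea only, not Bubbles’ actual internals:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Conceptual sketch of representation-based dispatch -- not Bubbles'
# actual internals. One operation, one implementation per representation;
# the engine picks an implementation at execution time.
OPERATIONS = {}

def operation(name, representation):
    """Register an implementation of `name` for a representation."""
    def register(func):
        OPERATIONS[(name, representation)] = func
        return func
    return register

@operation("aggregate", "sql")
def aggregate_sql(obj):
    # compose a COUNT() over the object's SQL statement
    return "SELECT COUNT(*) FROM (%s) AS t" % obj["statement"]

@operation("aggregate", "rows")
def aggregate_rows(obj):
    # fall back to counting row by row in Python
    return sum(1 for _ in obj["rows"])

def apply_operation(name, obj):
    # prefer the object's most capable representation first
    for representation in obj["representations"]:
        impl = OPERATIONS.get((name, representation))
        if impl is not None:
            return impl(obj)
    raise TypeError("no implementation of %r for %r" % (name, obj))

sql_obj = {"representations": ["sql", "rows"],
           "statement": "SELECT * FROM tweets"}
print(apply_operation("aggregate", sql_obj))
</code></pre></div></div>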
<p>In the following image you can see how the most appropriate operation is
chosen for you depending on the data source. You can also see that for
certain representations the operations are combined to produce just a
single data query for the source system:</p>
<p><img src="http://okfnlabs.org/img/posts/bubbles/bubbles-operation_representations.png" alt="Operations and Object
Representations" /></p>
<h2 id="examples">Examples</h2>
<p><a href="https://gist.github.com/Stiivi/5937938">Here is an example</a> of the Bubbles
framework in action: “list customer details of customers who ordered something
between 2011 and 2013”. Note that the source is a directory of
CSV files. For comparison, in the SQL example we <code class="language-plaintext highlighter-rouge">create()</code> a table, so the
rest of the pipeline will happen as SQL, not in Python.</p>
<p><a href="https://gist.github.com/Stiivi/5907305">Another example</a> shows aggregation
and the joining of details.</p>
<p><a href="https://gist.github.com/Stiivi/9104719">An example</a> that uses a data package
(<a href="http://data.okfn.org/doc/data-package">according to spec</a>) as a data store:</p>
<p>The pipeline looks like this:</p>
<p><img src="http://okfnlabs.org/img/posts/bubbles/bubbles-join_example.png" alt="Pipeline Example" /></p>
<p>The Python source code for the pipeline:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Aggregate population per independence type for every year
# Sources: Population and Country Codes datasets
#
from bubbles import Pipeline

# List of stores with datasets. In this example we are using the "datapackage"
# store
stores = {
    "source": {"type": "datapackages", "url": "."}
}

p = Pipeline(stores=stores)

# Set the source dataset
p.source("source", "population")

# Prepare another dataset and keep just relevant fields
cc = p.fork(empty=True)
cc.source("source", "country-codes")
cc.keep_fields(["ISO3166-1-Alpha-3", "is_independent"])

# Join them – left inner join against the country-codes fork
p.join_details(cc, "Country Code", "ISO3166-1-Alpha-3")

# Aggregate Value by status and year
p.aggregate(["is_independent", "Year"],
            [["Value", "sum"]],
            include_count=True)

# Sort for nicer output...
p.sort(["is_independent", "Year"])

# Print pretty table.
p.pretty_print()

p.run()
</code></pre></div></div>
<h2 id="note-about-metadata">Note about Metadata</h2>
<p>I have been using Python as a scripting language to define my pipelines.
An observant reader might have noticed that all I did was compose
some messages, which is true. The <code class="language-plaintext highlighter-rouge">p</code> Pipeline object contains just a graph,
and the <code class="language-plaintext highlighter-rouge">run()</code> method uses an execution engine to resolve the graph and pick
the appropriate operations for the given task. That means my whole
processing pipeline does not need to be written in Python at all. It might be
described as JSON, for example, or even generated from some
graphical user interface for flow-based programming.</p>
<p>There is more to metadata in Bubbles than is mentioned in this blog post.
The framework understands higher-level metadata, such as analytical metadata – the role of
a field from a data analysis perspective. For example, the <code class="language-plaintext highlighter-rouge">aggregate()</code>
operation might by default aggregate all fields that are of analytical type
<code class="language-plaintext highlighter-rouge">measure</code>, and that information is passed on. This results in less writing and
less noise on the side of the pipeline designer.</p>
<h2 id="summary">Summary</h2>
<p>Why should someone who just wants to extract, transform and present data
care about the underlying technology and
query language? These days, when we are dealing with so many systems, it
is an unnecessary distraction. Moreover, many ETL blocks are generic and reusable –
why would we have to write the same code for every system we use?</p>
<p>Having an abstract ETL framework allows us to share transformations, cleaning
methods, quality checks and more, much more easily.</p>
<p>In addition, it leaves the optimization of the process to the operation
writers – the people with the technical skills, who know when it is good to move
data over the network and through the disks, and when we can just compose an
operation and issue a single statement that the source system understands.</p>
<h3 id="future">Future</h3>
<p>Bubbles is still just a prototype, for the brave ones. But I would love to
see it become a Python ETL/data integration framework. The short-term needs and
objectives are:</p>
<ul>
<li>Simpler pipeline definition interface, more functional programming oriented</li>
<li>Larger library of higher level reusable components, such as dimension
loaders (there is
<a href="https://en.wikipedia.org/wiki/Slowly_changing_dimension">more</a> to <code class="language-plaintext highlighter-rouge">UPSERT</code>
than many of us think, but that is another story)</li>
<li>Easier way to write operations.</li>
<li>Larger variety of supported backends and services</li>
</ul>
<p>If anyone is willing to help with the prototype, I will gladly guide them. Let us
build an open source Python data integration framework together. Extensible.
Understandable. Focused on usage, ways of thinking and the pipeline design
workflow.</p>
<p>Links:</p>
<ul>
<li><a href="http://bubbles.databrewery.org">Homepage</a></li>
<li><a href="https://github.com/stiivi/bubbles">Github</a></li>
<li>IRC: freenode.net Channel: #databrewery</li>
</ul>
Stefan Urbanek
Data Central: a static frontend for data package collections
2014-08-19T00:00:00+00:00
http://okfnlabs.org/blog/2014/08/19/datacentral
<p><a href="http://centraldedados.pt"><img src="/img/posts/datacentral.png" /></a></p>
<p>This post explains our issues at the Portuguese open data front when it
comes to providing bulk datasets in standard and easy-to-parse ways. It
also introduces <a href="https://github.com/centraldedados/datacentral">Data Central</a>,
our tentative solution to those issues: a Python tool to
generate static web frontends for your data packages.</p>
<h2 id="first-problem-have-a-common-format-for-storing-datasets">First problem: Have a common format for storing datasets</h2>
<p>At <a href="http://transparenciahackday.org">Transparência Hackday Portugal</a>, as with any other open data interest group,
we work with many datasets. An issue that has been slowing us down for a long
time is that we never had a centralized solution for storing datasets: some are
in Google Docs, others in Git repositories, others live on web servers.</p>
<p>Before that, another issue was the data format: we found ourselves lost among
CSV or JSON files, SQL database dumps, spreadsheets and plaintext files.
Converting these was something we’d do on an ad hoc basis, and the challenge of
finding (or devising) a common format usually stumbled into differing personal
preferences and the difficulty involved in mass-conversion of heterogeneous
data collections.</p>
<h2 id="solution-tabular-data-packages">Solution: Tabular data packages</h2>
<p>We stumbled almost accidentally into the <a href="http://data.okfn.org/standards">Data Package standards page</a>. It was a
revelation to see how elegant a solution this was to our format problems: using
the <a href="http://data.okfn.org/doc/tabular-data-package">Tabular Data Package</a> spec, we could go ahead and convert our datasets into
CSV, along with their metadata – which is fairly easy to generate and maintain
using the existing tools for the job. From there, we can also develop scripts
to re-fetch and update the datasets, as well as post-processing tools to
generate other formats from the data package.</p>
<p>There is already much information available on Data Packages:</p>
<ul>
<li>the <a href="http://data.okfn.org/vision">Frictionless Data vision</a>, which clearly lays out the problem and the
proposed workflow to deal with heterogeneous sets of data</li>
<li>the <a href="http://data.okfn.org/doc/data-package">Data Package</a> info page</li>
<li>the <a href="http://data.okfn.org/doc/tabular-data-package">Tabular Data Package</a> info page, which is the format we use</li>
<li>the comprehensive specifications for <a href="http://www.dataprotocols.org/data-packages/">Data Packages</a> and <a href="http://www.dataprotocols.org/simple-data-format/">Tabular Data Packages</a></li>
<li>many tools to manage and publish data packages at <a href="http://data.okfn.org/tools">data.okfn.org/tools</a></li>
</ul>
<p>So our common data format problem is now solved. We then faced another issue:
how to publish and distribute these datasets in an equally frictionless way.</p>
<h2 id="second-problem-simple-system-to-publish-data-packages">Second problem: Simple system to publish data packages</h2>
<p>Something that we’ve also been missing was a central point from which to
distribute the datasets we have. Having a site to aggregate all of our data
packages would be a necessary step for some requirements we had:</p>
<ul>
<li>It would make hosting data workshops easier, by providing a quick way to
access bulk data instead of fumbling around with USB sticks, Google documents
and Dropbox links.</li>
<li>It’d make our efforts more visible, by aggregating all our work that is
currently all over the place and presenting it in a simple manner.</li>
<li>More importantly, it gives us an easier way to present our work in gathering
and converting data, and a better argument to present to public entities for
publishing their data: instead of saying “Give us your data so we can convert
it and make it open”, we can simply say “Give us your data so it can be
available at OurGreatOpenDataPortal.pt”. Having a separate “brand” makes
things easier to explain – and open data matters are involved enough to be
able to hold people’s attention.</li>
</ul>
<p>There are existing solutions, such as <a href="http://thedatatank.com">DataTank</a> or, more prominently, <a href="http://ckan.org">CKAN</a>. So
why wouldn’t CKAN be an option?</p>
<p>CKAN is a brilliant framework for hosting, managing and dealing with groups of
heterogeneous datasets. However, installing CKAN is an <a href="http://docs.ckan.org/en/latest/maintaining/installing/index.html">involved process</a>, and its
power comes at the cost of maintaining a full web application: it requires a
carefully configured server, doing regular updates, and ensuring server
resources are not going above a reasonable level. And since we’re a small team,
we don’t require most of its advanced features (like permissions).</p>
<p>Finally, at Transparência Hackday we
already have to manage many web applications, and being all too familiar with that
experience made us look for a simpler application design.</p>
<h2 id="solution-data-central-a-static-site-generator-for-data-package-collections">Solution: Data Central, a static site generator for data package collections</h2>
<p>We set out to design a simple application that could meet our purposes. The main
design principles are:</p>
<ul>
<li>
<p><em>Enable access to bulk data sets</em>. Easy, straightforward access to
the actual files is the main driver behind the current implementation. This
differs from an API-driven approach which, while powerful, would require
significant additional complexity.</p>
</li>
<li>
<p><em>Generated static HTML site</em> – Publishing datasets doesn’t need a real-time
server-based application to query the data and show it. We would only need to
update the site daily, at most, and we could then skip the server-side logic.</p>
</li>
<li>
<p><em>Generate locally and upload</em> – The site generation ought to happen locally.
We decided to have one of our non-remote servers take care of the hard work of
generating the site, and then upload it with rsync to a hosted service.</p>
</li>
<li>
<p><em>Low hardware footprint</em> – Local generation means that our system spec
requirements are low. Not needing specialized hardware means that we can use
an old computer for this task. It’s actually what we do – the site
generation is being done on an old 2007 Sony Vaio laptop with a broken screen.</p>
</li>
<li>
<p><em>Separate the datasets from the site</em> – By hosting each data package on a
separate Git repository, the local generator could fetch it and re-generate
the site without having to host and manage a separate copy of the data
package and run the risk of both versions going out of sync. We found this
happens often when building a database-driven web application. By separating
the data packages and the web frontend, packagers and editors can work
independently on the data, while the site generator updates the live version
periodically.</p>
</li>
<li>
<p><em>Operated via the command line</em> – For the sake of simplicity and at the cost
of user-friendliness, we settled for a CLI-centered management workflow. We
realised that managing this kind of site should be a mostly automated process,
and an efficient way to do this would be to restrict the application to a set
of scripts that can be managed through Makefiles and run by cron jobs.</p>
</li>
</ul>
<p>There are some significant downsides to this direction, though.</p>
<ul>
<li>
<p>There is no API since it’s all just HTML. This might be the most evident
shortcoming of a static approach.</p>
</li>
<li>
<p>This also means there are no search capabilities. One could always consider
using a third-party search engine since the site is plain HTML that can be
scraped by Google, DuckDuckGo and other web crawlers.</p>
</li>
<li>
<p>There is no support for dynamic content, such as a site blog. Listing external
feeds could be done through widgets in JavaScript.</p>
</li>
<li>
<p>Since the application management is done locally through the command line,
there isn’t any web interface to make edits or changes inside the
browser.</p>
</li>
</ul>
<h2 id="how-it-works">How it works</h2>
<p>The workflow goes like this:</p>
<ol>
<li>Data packages are published and updated on individual repositories by
package maintainers.</li>
<li>The Datacentral application is configured to become aware of which
repositories it should track.</li>
<li>The first run of the application clones all repositories and generates the
HTML pages for each data package.</li>
<li>Individual HTML pages (About, Contact) are generated from local Markdown
files.</li>
<li>The generated output can then be pushed through FTP or rsync to a remote,
public web server.</li>
</ol>
<p>In practice, there is a <code class="language-plaintext highlighter-rouge">generate.py</code> script that inspects each data package
and uses <a href="http://jinja.pocoo.org">Jinja</a> to fill up a set of HTML template files. It saves the generated
HTML in an <code class="language-plaintext highlighter-rouge">_output</code> directory, which can then be inspected using a local
webserver or pushed into a live VPS. All actions, from installation to generation and upload, can be carried out by means of a Makefile.</p>
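<p>In spirit, the core of that script is small. A condensed sketch (the file names, directory layout and template variables below are illustrative, not Data Central’s actual code):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Condensed sketch of the generate-and-render step -- illustrative only.
# It reads each clone's datapackage.json and renders one page per package.
import json
import os
from jinja2 import Environment, FileSystemLoader

env = Environment(loader=FileSystemLoader("templates"))
template = env.get_template("package.html")  # assumed template name

for name in os.listdir("repos"):             # one Git clone per data package
    descriptor = os.path.join("repos", name, "datapackage.json")
    if not os.path.exists(descriptor):
        continue
    with open(descriptor) as f:
        package = json.load(f)
    outdir = os.path.join("_output", name)
    os.makedirs(outdir, exist_ok=True)
    with open(os.path.join(outdir, "index.html"), "w") as f:
        f.write(template.render(package=package))
</code></pre></div></div>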
<p>If you’re interested in reading more about Data Central and even trying it out (it’s simple!), <a href="https://github.com/centraldedados/datacentral">check out the project site</a>. We’d heartily welcome all possible feedback, so please let us know about any bugs, suggestions or feature requests at the Datacentral <a href="https://github.com/centraldedados/datacentral/issues">issue tracker</a>. Finally, you can see it in action in our (in development) Portuguese independent data hub, <a href="http://centraldedados.pt">Central de Dados</a>.</p>
Ricardo Lafuente
Labs newsletter: 5 June, 2014
2014-06-05T00:00:00+00:00
http://okfnlabs.org/blog/2014/06/05/newsletter
<p>Welcome back to the OKFN Labs! Members of the Labs have been building tools, visualizations, and even new data protocols—as well as setting up conferences and events. Read on to learn more.</p>
<p>If you’d like to suggest a piece of news for next month’s newsletter, leave a comment on its <a href="https://github.com/okfn/okfn.github.com/issues/215">GitHub issue</a>.</p>
<h2 id="commasearch">commasearch</h2>
<p><a href="http://okfnlabs.org/members/tlevine/">Thomas Levine</a> has been working on an innovative new approach to searching tabular data, <a href="https://github.com/tlevine/commasearch">commasearch</a>.</p>
<p>Unlike a normal search engine, where you submit words and get pages of words back, with commasearch, you submit spreadsheets and get spreadsheets in return.</p>
<p>What does that mean, and how does it work? Check out Thomas’s excellent blog post “<a href="http://dada.pink/dada/pagerank-for-spreadsheets/">Pagerank for Spreadsheets</a>” to learn more.</p>
<h2 id="github-diffs-for-csv-files">GitHub diffs for CSV files</h2>
<p><em>Submitted by <a href="http://okfnlabs.org/members/paulfitz/">Paul Fitzpatrick</a>.</em></p>
<p>GitHub has added CSV viewing support in their web interface, which is fantastic, but it still doesn’t handle changes well. If you use Chrome, and want lovely diffs, check out James Smith’s <a href="https://github.com/theodi/csvhub">CSVHub</a> extension (<a href="http://theodi.org/blog/csvhub-github-diffs-for-csv-files">blogpost and screenshot</a>). The diffs are produced using the <a href="http://paulfitz.github.io/daff/">daff</a> library, available in javascript, ruby, php, and python3.</p>
<h2 id="textus-wordpress-plugin">Textus Wordpress plugin</h2>
<p><em>Update from Iain Emsley.</em></p>
<p>The Open Literature project to provide a <a href="https://github.com/okfn/textus-wordpress">Wordpress plugin back-end for the Textus viewer</a> has made new progress.</p>
<p>This project’s goal was to keep the existing Textus frontend—which has been <a href="https://github.com/okfn/textus-viewer">split off as its own project</a> by Rufus Pollock—and replace the backend with a Wordpress plugin, to make it easier to deploy. A version of this plugin backend is now available.</p>
<p>The new plugin acts as a stand-alone module that can be enabled and disabled as required by the administrative user. It creates a new Wordpress post type called “Textus” which is available as part of the menu, giving the user a place to upload text and annotation files using the Media uploader.</p>
<p>If you are interested in the project, check out its <a href="https://github.com/okfn/textus-wordpress/issues">issues</a> and discussion on the <a href="https://lists.okfn.org/mailman/listinfo/open-humanities">Open Humanities list</a>.</p>
<h2 id="data-protocols-updates">Data protocols: updates</h2>
<p><a href="http://dataprotocols.org/">Data Protocols</a>, the Labs’s set of lightweight standards and patterns for open data, has had a couple of interesting developments.</p>
<p>The <a href="http://dataprotocols.org/table-schema/">JSON Table Schema</a> protocol has just added support for constraints (i.e. validation), thanks to <a href="http://www.ldodds.com/">Leigh Dodds</a>. This adds a <code class="language-plaintext highlighter-rouge">constraints</code> attribute containing requirements on the content of fields. See the full <a href="http://dataprotocols.org/table-schema/#field-constraints">list of valid constraints</a> on the JSON Table Schema site.</p>
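<p>For example, a field carrying constraints might look like the following (a small illustrative snippet; see the linked page for the full list of valid constraints):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "fields": [
    {
      "name": "amount",
      "type": "number",
      "constraints": {
        "required": true,
        "minimum": 0
      }
    }
  ]
}
</code></pre></div></div>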
<p>The <a href="https://github.com/okfn/dpm/">Data Package Manager</a> tool for Data Packages is shaping up nicely: the <code class="language-plaintext highlighter-rouge">install</code> and <code class="language-plaintext highlighter-rouge">init</code> commands have now been implemented. You can see an <a href="https://github.com/okfn/dpm/issues/3#issuecomment-43440812">animated GIF</a> of the former in the issue thread.</p>
<h2 id="annotatorjs-new-home">AnnotatorJS: new home</h2>
<p>Annotator is “an open-source JavaScript library to easily add annotation functionality to any webpage”.</p>
<p>The project now lives on its own domain at <a href="http://annotatorjs.org/">annotatorjs.org</a>. Check it out and see how easy it is to add comments and notes to your pages!</p>
<h2 id="csvconf">csv,conf</h2>
<p>Data makers everywhere will want to check out <a href="http://csvconf.com/">csv,conf</a>, a fringe event of <a href="http://2014.okfestival.org/">Open Knowledge Festival 2014</a> taking place in Berlin on 15 July.</p>
<p>csv,conf is a non-profit community conference that will “bring together data makers/doers/hackers from backgrounds like science, journalism, open government and the wider software industry to share tools and stories”.</p>
<p><a href="http://register.csvconf.com/">Tickets are $75, $50 with an OKFest ticket</a>. If you can make it to Berlin in July and you’re into “advancing the art of data collaboration”, come join in!</p>
Neil Ashton
First International Sport Hackdays Kick Off New OK Working Group
2014-06-04T00:00:00+00:00
http://okfnlabs.org/blog/2014/06/04/open-sport-kickoff
<p>At the first <a href="http://opendata.ch/2014/04/sports-hackdays/">International Sports Hackdays</a> in Basel, Sierre and Milan, over 120 developers and designers, journalists and scientists, professionals and amateurs came together to prototype new approaches to make creative use of sports data. They built new types of hardware, new interfaces for fitness equipment and spectator apps, analyzed Tour de France performances and FC Basel’s tactics, sport education policies, infrastructures and much more – and thus brought the spirit of open innovation and creative technology use to the field of sports. More <a href="https://blog.scraperwiki.com/2014/06/world-cup-hack-day-london-10th-june-a-teaser/">hackdays</a> are coming up, and a new international <a href="https://lists.okfn.org/mailman/listinfo/open-sports">OK Working Group</a> is being kicked off!</p>
<div style="width: 300px; float: right;"><a href="http://opendata.ch/files/2014/05/hacksports-desktop.jpg"><img src="http://opendata.ch/files/2014/05/hacksports-desktop.jpg" alt="Desktop of a participant, sketching and prototyping a Tour de France data visualization"></a><p><i>Desktop of a participant, sketching and prototyping a Tour de France data visualization.</i></p></div>
<div style="width: 300px;"><a href="http://make.opendata.ch/wiki/project:secondlamp"><img src="http://opendata.ch/files/2014/05/hacksports-hue.jpg" alt="Project visualizing the intensity and tendency of a football match with the color and brightness of two lightbulbs"></a><p><i><a href="http://make.opendata.ch/wiki/project:secondlamp">Project Secondlamp</a> is visualizing the intensity and tendency of a football match with the color and brightness of two light bulbs.</i></p></div>
<p>While open government data has become an established force for transparency, efficiency and innovation in the public sector, the world of sports stands at a beginning: even though there’s so much passion, even though there’s so much potential, sports data often remain in the closed coffers of functionaries. Last weekend, the International Sports Hackdays were just one successful play to change this game, just one step towards opening up sports to data, and sports data to the world. Therefore: Mr. Blatter, tear down this wall, make FIFA’s data available to all!</p>
<p><a href="https://twitter.com/sportmetrics">Martin Rumo</a>, Embedded Computer Scientist at Switzerland’s Federal Institute of Sports in Magglingen explained the situation: “In elite sport, we collect more and more data every day, but to make it used and useful, we must build bridges between developers, designers and data scientists and the world of sports.”</p>
<p>With experts for athletic data from leading companies such as <a href="http://www.deltatre.com/">Deltatre</a> or <a href="http://www.technogym.com/">Technogym</a> or from <a href="http://en.wikipedia.org/wiki/Federal_Office_of_Sport">Switzerland’s National Sports Centre</a>, with data visualization experts from companies such as <a href="http://tuxtax.it/">Tuxtax</a> or<a href="http://www.interactivethings.com/"> Interactive Things</a> as well as academics, hackers and makers from leading local tech firms, the event attracted talents hardly ever brought together, a set of interdisciplinary innovators that proved to be extraordinarily productive – in building bridges, but also in making actual progress.</p>
<div style="width: 300px; float: right;"><a href="http://repository.opendata.ch/sport/hackdays-14-tdf/"><img height="300" width="300" src="http://opendata.ch/files/2014/05/hacksports-tdf.png" alt="Plot helping to visualize and analyze the career of a road bicycle racer, using data from the Tour de France."></a><p><i><a href="http://repository.opendata.ch/sport/hackdays-14-tdf/">Plot</a> helping to visualize and analyze the career of a road bicycle racer, using data from the Tour de France.</i></p></div>
<div style="width: 300px;"><a href="http://make.opendata.ch/wiki/project:matchquote"><img height="300" width="300" src="http://opendata.ch/files/2014/05/hacksports-foosball.png"></a><p><i>Sensors connected to a foosball table, collecting data and correlating it to historic pro soccer matches – to make connections like <a href="http://make.opendata.ch/wiki/project:matchquote">“you’re playing like Liverpool-Basel today”</a>.</i></p></div>
<p>The fascinating projects developed by the creative industry volunteers included hardware projects, software projects and data visualizations. They concentrated mainly on football (<a href="http://make.opendata.ch/wiki/project:secondlamp">Secondlamp</a>, second-screen match app <a href="http://make.opendata.ch/wiki/project:blitzpoll">BlitzPoll</a>, ..), cycling (such as <a href="http://repository.opendata.ch/sport/hackdays-14-tdf/">deep</a>, <a href="http://make.opendata.ch/wiki/project:tour_de_france_history">historical</a> analyses of Tour de France performances). The prototypes covered both personal apps like <a href="http://make.opendata.ch/wiki/project:sportee">Sportee</a> or <a href="http://make.opendata.ch/wiki/project:beatit">BeatIt</a> and the organizational level, e.g for the financial analysis of state-funded sport promotion as done in <a href="http://make.opendata.ch/wiki/project:optisports">OptiSports</a> or, specifically for Switzerland, the tool called <a href="http://make.opendata.ch/wiki/project:spoertle">Spörtle</a>. In all cases, the approaches taken were extraordinarily inventive and of amazing quality: the creativity and innovation that could be experienced in these two days in Milan, Sierre and Basel was very impressive indeed.</p>
<p>The data sources used at the event are all <a href="http://datahub.io/organization/sport">available on datahub.io</a>. These datasets include crowdsourced data, data extracted from public websites as well as official data releases made available to a broader audience for the very first time.
To foster and facilitate the <a href="http://opendefinition.org/">open</a> publication and productive use of open sports data, <a href="http://okfn.org/">Open Knowledge</a> is currently incubating an official sports data working group. The group will cover a wide range of issues that can be tackled with sports data: from leisure sport performance data in professional sport to using data for financial transparency and governance in sports institutions. The group is already attracting top-notch experts as well as data-wrangling fans and data journalists, so it's time to <a href="https://lists.okfn.org/mailman/listinfo/open-sports">join the game</a> now!</p>
Hannes Gassert
Using open football data - Get ready for the World Cup in Brazil 2014
2014-05-06T00:00:00+00:00
http://okfnlabs.org/blog/2014/05/06/open-data-world-cup
<p>Football is the world’s most popular sport and
the World Cup in Brazil - kicking off next month in São Paulo on June 12th
(in 38 days 3 hours 15 minutes and counting) -
is the world’s biggest (sport) event with 32 national teams
from six continents competing in 64 matches in 12 cities for the championship title.</p>
<h2 id="wheres-the-open-football-data-lets-ask-the-intertubes">Where’s the open football data? Let’s ask the intertubes</h2>
<p>Now let’s say you want to build a world cup match day widget for
your site using a web service (HTTP JSON API) that gets
you all teams, groups, matches, players, and so on.</p>
<p>Example - HTTP JSON API:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GET /event/worldcup.2014/teams
{
"event": {
"key": "worldcup.2014", "title": "World Cup 2014"
},
"teams": [
{ "key": "gre", "title": "Greece", "code": "GRE" },
{ "key": "ned", "title": "Netherlands", "code": "NED" },
{ "key": "ger", "title": "Germany", "code": "GER" },
{ "key": "por", "title": "Portugal", "code": "POR" },
...
]
}
</code></pre></div></div>
<p>Ideally, there’s a free service using open football data from the world football federation,
from the world’s sport cable channels, from the world’s sport newspapers, and so on.
Let’s ask the intertubes to find out the state of open football data in the real world -
let’s google <a href="http://www.google.com/search?q=json+world+cup+brazil"><code class="language-plaintext highlighter-rouge">json world cup brazil</code></a>
or post a question on the open data stackexchange <a href="http://opendata.stackexchange.com/questions/1791/any-open-data-sets-for-the-football-world-cup-in-brazil-2014">‘Q: Any Open Data Sets for the (Football) World Cup (in Brazil 2014)?’</a>.</p>
<p>Nothing. Nada. Nichts. Niente. Zilch. Zero.
So what? Let’s build an open football data project.</p>
<h2 id="whats-footballdb">What’s <code class="language-plaintext highlighter-rouge">football.db?</code></h2>
<p>Let’s welcome <code class="language-plaintext highlighter-rouge">football.db</code> - an open football data project
offering - surprise, surprise - free, open, public domain
football data for the World Cup in Brazil 2014, and more.</p>
<p><img src="/img/posts/openfootball/worldcup2014-db-download.png" alt="" /></p>
<p>The open football project also sports a free self-hosted HTTP JSON API service
for football data. Get started in two steps:</p>
<ul>
<li>Step 1: Download the <code class="language-plaintext highlighter-rouge">worldcup2014.db</code> SQLite Database</li>
<li>Step 2: Serve up teams, rounds, matches, etc. via HTTP JSON API using the <code class="language-plaintext highlighter-rouge">sportdb</code> command line tool</li>
</ul>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sportdb serve
</code></pre></div></div>
<p>Services available include:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">/event/world.2014/teams</code> – List all teams</li>
<li><code class="language-plaintext highlighter-rouge">/event/world.2014/rounds</code> – List all rounds (matchdays)</li>
<li><code class="language-plaintext highlighter-rouge">/event/world.2014/round/20</code> – List all matches in a round e.g. - 20th Round (=> Final)</li>
</ul>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GET /event/world.2014/round/1
{
"event": { "key": "world.2014", "title": "World Cup 2014" },
"round": { "pos": 1, "title": "Matchday 1" },
"games": [
{
"team1_key": "bra",
"team1_title": "Brazil",
"team1_code": "BRA",
"team2_key": "cro",
"team2_title": "Croatia",
"team2_code": "CRO",
"play_at": "2014/06/12",
"score1": null,
"score2": null,
"score1ot": null,
"score2ot": null,
"score1p": null,
"score2p": null
}
]
}
</code></pre></div></div>
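<p>To round off the example, here is a minimal Python sketch of a client for such a self-hosted service. The base URL is an assumption - point it at wherever <code class="language-plaintext highlighter-rouge">sportdb serve</code> is actually listening:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json
import urllib.request

# Base URL of the self-hosted sportdb HTTP JSON API service (placeholder;
# adjust the host and port to your own setup).
BASE_URL = "http://localhost:3000"

with urllib.request.urlopen(BASE_URL + "/event/world.2014/round/1") as resp:
    data = json.loads(resp.read().decode("utf-8"))

# Print the fixtures of matchday 1, per the response shape shown above.
print(data["round"]["title"])
for game in data["games"]:
    print("%s - %s on %s" % (game["team1_title"], game["team2_title"], game["play_at"]))
</code></pre></div></div>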
<h2 id="how-does-it-work--distributed-is-the-new-centralized">How does it work? Distributed is the new centralized</h2>
<p>The open football data project collects public domain data sets
in plain old text files that you store on your hard disk
and that you can share via a distributed version tracker (that is, git repos)
with your friends or the world.
A free public domain command line tool (that is, <code class="language-plaintext highlighter-rouge">sportdb</code>)
lets you read the plain text data sets into your SQL database of choice
(for example, MySQL, PostgreSQL, SQLite, etc).</p>
<p><img src="/img/posts/openfootball/github-openfootball-worldcup.png" alt="" /></p>
<h3 id="example-europeteamstxt---comma-separated-values">Example: <code class="language-plaintext highlighter-rouge">europe/teams.txt</code> - Comma-separated values</h3>
<p>Let’s look at a plain text file for national teams in Europe, for example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>############
# UEFA (Union of European Football Associations)
# - 54 members
aut, Austria, AUT, at, fifa|uefa
bel, Belgium, BEL, be, fifa|uefa
cyp, Cyprus, CYP, cy, fifa|uefa
...
</code></pre></div></div>
<p>(Source: <a href="https://github.com/openfootball/national-teams/blob/master/europe/teams.txt">europe/teams.txt</a>)</p>
<p>The plain text file uses the comma-separated values (CSV) format
with some extras for comments, blank lines, etc.</p>
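<p>As a rough sketch (not the sportdb reader itself), such a file can be read with a few lines of Python - skip comments and blank lines, feed the rest to a CSV reader; the column meanings are inferred from the sample above and the path is a placeholder:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import csv

def read_teams(path):
    """Yield rows like ('aut', 'Austria', 'AUT', 'at', 'fifa|uefa'),
    skipping comment lines (starting with '#') and blank lines."""
    with open(path, encoding="utf-8") as f:
        rows = (line for line in f
                if line.strip() and not line.lstrip().startswith("#"))
        for row in csv.reader(rows, skipinitialspace=True):
            yield tuple(row)

for team in read_teams("europe/teams.txt"):   # path is a placeholder
    print(team)
</code></pre></div></div>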
<h3 id="example-worldcup2014cuptxt----mini-football-data-language">Example: <code class="language-plaintext highlighter-rouge">worldcup/2014/cup.txt</code> - Mini football data language</h3>
<p>For match schedules the open football project uses a new structured data format,
that is, a new domain-specific language (DSL).</p>
<p>Example - Open Football Match Schedule Language:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(1) Thu Jun/12 17:00 Brazil - Croatia @ Arena de São Paulo, São Paulo (UTC-3)
(2) Fri Jun/13 13:00 Mexico - Cameroon @ Estádio das Dunas, Natal (UTC-3)
</code></pre></div></div>
<p>(Source: <a href="https://github.com/openfootball/world-cup/blob/master/2014--brazil/cup.txt">world-cup/2014/cup.txt</a>)</p>
<p>Why invent yet another data format?
The new mini language for structured football match schedule data
offers you the best of both worlds, that is,
1) it looks and feels like free-form plain text - easy to read and easy to write -
2) but offers a 100% data accuracy guarantee (when loading into SQL tables, for example).</p>
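<p>To illustrate the “easy to read” claim, here is a rough regular-expression sketch in Python that parses a single schedule line of the exact shape shown above (the real sportdb parser is far more forgiving; this is just a toy):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re

# Matches lines like:
# "(1) Thu Jun/12 17:00 Brazil - Croatia @ Arena de São Paulo, São Paulo (UTC-3)"
LINE = re.compile(
    r"\((?P<num>\d+)\)\s+"
    r"(?P<day>\w{3})\s+(?P<date>\w{3}/\d{1,2})\s+(?P<time>\d{1,2}:\d{2})\s+"
    r"(?P<team1>.+?)\s+-\s+(?P<team2>.+?)\s+"
    r"@\s+(?P<ground>.+?)\s+\((?P<tz>UTC[+-]\d+)\)")

m = LINE.match("(1) Thu Jun/12 17:00 Brazil - Croatia @ Arena de São Paulo, São Paulo (UTC-3)")
if m:
    print(m.groupdict())
# {'num': '1', 'day': 'Thu', 'date': 'Jun/12', 'time': '17:00',
#  'team1': 'Brazil', 'team2': 'Croatia',
#  'ground': 'Arena de São Paulo, São Paulo', 'tz': 'UTC-3'}
</code></pre></div></div>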
<p>The mini language also includes
support for groups, matchdays, grounds, and more. Example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>############################
# World Cup 2014 Brazil
Group A | Brazil Croatia Mexico Cameroon
Group B | Spain Netherlands Chile Australia
Group C | Colombia Greece Côte d'Ivoire Japan
Group D | Uruguay Costa Rica England Italy
Group E | Switzerland Ecuador France Honduras
Group F | Argentina Bosnia-Herzegovina Iran Nigeria
Group G | Germany Portugal Ghana United States
Group H | Belgium Algeria Russia South Korea
Matchday 1 | Thu Jun/12
Matchday 2 | Fri Jun/13
Matchday 3 | Sat Jun/14
...
(16) Round of 16 | Sat Jun/28 - Tue Jul/1
(17) Quarter-finals | Fri Jul/4 - Sat Jul/5
(18) Semi-finals | Tue Jul/8 - Wed Jul/9
(19) Match for third place | Sat Jul/12
(20) Final | Sun Jul/13
Group A:
(1) Thu Jun/12 17:00 Brazil - Croatia @ Arena de São Paulo, São Paulo (UTC-3)
(2) Fri Jun/13 13:00 Mexico - Cameroon @ Estádio das Dunas, Natal (UTC-3)
(17) Tue Jun/17 16:00 Brazil - Mexico @ Estádio Castelão, Fortaleza (UTC-3)
(18) Wed Jun/18 18:00 Cameroon - Croatia @ Arena Amazônia, Manaus (UTC-4)
(33) Mon Jun/23 17:00 Cameroon - Brazil @ Brasília (UTC-3)
(34) Mon Jun/23 17:00 Croatia - Mexico @ Recife (UTC-3)
Group B:
(3) Fri Jun/13 16:00 Spain - Netherlands @ Arena Fonte Nova, Salvador (UTC-3)
(4) Fri Jun/13 18:00 Chile - Australia @ Arena Pantanal, Cuiabá (UTC-4)
(19) Wed Jun/18 16:00 Spain - Chile @ Estádio do Maracanã, Rio de Janeiro (UTC-3)
(20) Wed Jun/18 13:00 Australia - Netherlands @ Estádio Beira-Rio, Porto Alegre (UTC-3)
(35) Mon Jun/23 13:00 Australia - Spain @ Curitiba (UTC-3)
(36) Mon Jun/23 13:00 Netherlands - Chile @ São Paulo (UTC-3)
...
</code></pre></div></div>
<p>Interested? Find out more at the <a href="https://github.com/openfootball">project site</a>
or post your questions or comments to the <a href="http://groups.google.com/group/opensport">forum/mailing list</a>. Thanks.</p>
<h2 id="appendix-basics---whats-not-open-structured-data">Appendix: Basics - What’s (Not) Open (Structured) Data?</h2>
<p>What’s (Not) Open (Structured) Data?</p>
<p>Example 1:</p>
<ul>
<li>A Free One-Page Booklet (PDF) Download for the Match Schedule from <a href="http://fifa.com/worldcup/matches"><code class="language-plaintext highlighter-rouge">fifa.com</code></a>.
<ul>
<li>Copyright © FIFA 2014. All Rights Reserved.</li>
</ul>
</li>
</ul>
<p><img src="/img/posts/openfootball/fifa-match-schedule-download.png" alt="" /></p>
<p>Example 2a:</p>
<ul>
<li>Match Schedule on FIFA Website</li>
</ul>
<p><img src="/img/posts/openfootball/fifa-match-schedule.png" alt="" /></p>
<p>Example 2b:</p>
<ul>
<li>Match Schedule on FIFA Website (Source - Document Object Model Tree)</li>
</ul>
<p><img src="/img/posts/openfootball/fifa-match-schedule-inside.png" alt="" /></p>
<p>Example 3a:</p>
<ul>
<li>Match Schedule on Wikipedia</li>
</ul>
<p><img src="/img/posts/openfootball/wikipedia-worldcup.png" alt="" /></p>
<p>Example 3b:</p>
<ul>
<li>Match Schedule on Wikipedia (Source - Plain Text or Mediawiki Text)</li>
</ul>
<p>Cut-n-Paste Text:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>12 June 2014 17:00 Brazil Match 1 Croatia Arena de São Paulo, São Paulo
13 June 2014 13:00 Mexico Match 2 Cameroon Arena das Dunas, Natal
17 June 2014 16:00 Brazil Match 17 Mexico Estádio Castelão, Fortaleza
18 June 2014 19:00 Cameroon Match 18 Croatia Arena Amazônia, Manaus
23 June 2014 17:00 Cameroon Match 33 Brazil Estádio Nacional Mané Garrincha, Brasília
23 June 2014 17:00 Croatia Match 34 Mexico Arena Pernambuco, Recife
</code></pre></div></div>
<p>Wikipedia Source:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>===Group A===
\{\{\{\{main|2014 FIFA World Cup Group A}}}}
\{\{\{\{Fb cl2 header navbar}}}}
\{\{\{\{Fb cl2 team |t=\{\{\{\{fb|BRA}}}} |w=0 |d=0 |l=0 |gf=0 |ga=0 |bc=}}}}
\{\{\{\{Fb cl2 team |t=\{\{\{\{fb|CRO}}}} |w=0 |d=0 |l=0 |gf=0 |ga=0 |bc=|border=green}}}}
\{\{\{\{Fb cl2 team |t=\{\{\{\{fb|MEX}}}} |w=0 |d=0 |l=0 |gf=0 |ga=0 |bc=}}}}
\{\{\{\{Fb cl2 team |t=\{\{\{\{fb|CMR}}}} |w=0 |d=0 |l=0 |gf=0 |ga=0 |bc=}}}}
|}
\{\{\{\{Football box
|date=12 June 2014
|time=17:00
|team1=\{\{\{\{fb-rt|BRA}}}}
|score=[[2014 FIFA World Cup Group A#Brazil v Croatia|Match 1]]
|report=
|team2=\{\{\{\{fb|CRO}}}}
|goals1=
|goals2=
|stadium=[[Arena Corinthians|Arena de São Paulo]], [[São Paulo]]
|attendance=
|referee=
}}}}
\{\{\{\{Football box
|date=13 June 2014
|time=13:00
|team1=\{\{\{\{fb-rt|MEX}}}}
|score=[[2014 FIFA World Cup Group A#Mexico v Cameroon|Match 2]]
|report=
|team2=\{\{\{\{fb|CMR}}}}
|goals1=
|goals2=
|stadium=[[Arena das Dunas]], [[Natal, Rio Grande do Norte|Natal]]
|attendance=
|referee=
}}}}
</code></pre></div></div>
Gerald Bauer
CSV Conf 2014 - for Data Makers Everywhere
2014-05-05T00:00:00+00:00
http://okfnlabs.org/blog/2014/05/05/csv-conf-2014
<p>Announcing <a href="http://csvconf.com/">CSV,Conf - the conference for data makers everywhere</a> which
takes place on <strong>15 July 2014</strong> in <strong>Berlin</strong>.</p>
<p>This one day conference will focus on <strong>practical</strong>, <strong>real-world</strong> stories,
examples and techniques of how to <strong>scrape</strong>, <strong>wrangle</strong>, <strong>analyze</strong>, and
<strong>visualize</strong> data. Whether your data is big or small, tabular or spatial,
graphs or rows this event is for you.</p>
<h2 id="key-info">Key Info</h2>
<ul>
<li><strong>Where</strong>: Kalkscheune, Berlin, Germany</li>
<li><strong>When</strong>: 15 July 2014, all day</li>
<li><strong>Web</strong>: <a href="http://csvconf.com/">http://csvconf.com/</a></li>
<li><strong>Register</strong>: <a href="http://register.csvconf.com/">http://register.csvconf.com/</a></li>
<li><strong>Submit a talk</strong>: <a href="http://csvconf.com/#help">http://csvconf.com/#help</a> (Deadline: Noon GMT 31st May)</li>
</ul>
<p>CSV,Conf is run in conjunction with the week long <a href="http://okfestival.org">Open Knowledge Festival</a>.</p>
<h2 id="what-is-it-about">What Is It About?</h2>
<h3 id="building-community">Building Community</h3>
<p>We want to bring together data makers/doers/hackers from backgrounds like
science, journalism, open government and the wider software industry to share
tools and stories.</p>
<h3 id="for-those-who-love-data">For those who love data</h3>
<p>CSV Conf is a non-profit community conference run by some folks who really love
data and sharing knowledge. If you are as passionate as we are about data and its
application to society, then you should join us!</p>
<h3 id="big-and-small">Big and small</h3>
<p>This isn’t a conference just about spreadsheets. We are curating content about
advancing the art of data collaboration, from putting your CSV on GitHub to
producing meaningful insight by running large scale distributed processing.</p>
<h2 id="colophon-why-csv">Colophon: Why CSV?</h2>
<p>This conference isn’t just about <a href="http://data.okfn.org/doc/csv">CSV</a> data. But we chose to call it CSV
Conf because we think CSV embodies certain important qualities that set the
tone for the event:</p>
<ul>
<li><strong>Simplicity</strong>: CSV is incredibly simple - perhaps the simplest structured data
format there is</li>
<li><strong>Openness</strong>: the CSV ‘standard’ is well-known and open - free for anyone to use</li>
<li><strong>Easy to use</strong>: CSV is widely supported - practically every spreadsheet
program, relational database and programming language in existence can handle
CSV in some form or other</li>
<li><strong>Hackable</strong>: CSV is text-based and therefore amenable to manipulation and access
from a wide range of standard tools (including revision control systems such
as git, mercurial and subversion)</li>
<li><strong>Big or small</strong>: CSV files can range from under a kilobyte to gigabytes, and their
line-oriented structure means they can be incrementally processed – you do not
need to read an entire file to extract a single row (see the short sketch after this list).</li>
</ul>
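<p>To make the incremental-processing point concrete, a small Python sketch (the file name and lookup condition are placeholders):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import csv

# Stream a (potentially multi-gigabyte) CSV file row by row; only one
# row is ever held in memory, and we can stop as soon as we find ours.
with open("data.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        if row["id"] == "42":   # placeholder lookup condition
            print(row)
            break               # no need to read the rest of the file
</code></pre></div></div>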
<p>More informally:</p>
<blockquote>
<p>CSV is the data Kalashnikov: not pretty, but many [data] wars have been
fought with it and even kids can use it. <a href="http://pudo.org/">@pudo</a> (Friedrich Lindenberg)</p>
</blockquote>
<blockquote>
<p>CSV is the ultimate simple, standard data format - streamable, text-based, no
need for proprietary tools etc <a href="http://rufuspollock.org/">@rufuspollock</a> (Rufus Pollock)</p>
</blockquote>
<p>[The above is adapted from the <a href="http://dataprotocols.org/tabular-data-package/#why-csv">“Why CSV” section</a> of the Tabular Data
Package specification]</p>
Rufus Pollock
Morph, a scraper platform for hackers and would be hackers
2014-03-22T00:00:00+00:00
http://okfnlabs.org/blog/2014/03/22/morph
<h2 id="in-an-ideal-world">In an ideal world…</h2>
<p><strong>In an ideal world</strong> we would go in search of a piece of data by using our favorite search engine and we would land on a page with a big download button. It would give you a few options for formats. You pick the right one and off you go.</p>
<p>Unfortunately, we all know from bitter experience that this is not yet routinely how the real world operates. We’re getting there - there’s more data out there than ever before - but we still have a very, very long way to go.</p>
<p><strong>In the real world</strong> we hopefully find the data we’re after. Maybe it’s published on a government website somewhere. There’s no big download button, just a big html table.</p>
<p>If we just need to grab a snapshot copy of the data and it’s just on a single page we can copy and paste it with a bit of luck into a spreadsheet.</p>
<p>What if the data is spread over hundreds of pages or we need to keep it regularly updated? Well, we have to write a <a href="https://en.wikipedia.org/wiki/Web_scraping">scraper</a>.</p>
<p>If you know the basics of a language like PHP, Python or Ruby, writing a scraper isn’t very hard at all.</p>
<p>What is a pain is all the stuff around it. Where do I run it? How do I schedule it to run automatically? What if the website that I’m scraping changes? How do I check that the data is still coming in regularly? What do I do if I need an API so another application can regularly access the scraped data?</p>
<p>All of these things you can solve but why bother if something else can take care of that for you?</p>
<h2 id="introducing-morphio">Introducing Morph.io</h2>
<p><a href="https://morph.io"><img src="/img/posts/morph/logo.png" width="300" /></a></p>
<p>This is where <a href="https://morph.io">Morph.io</a> comes in. It’s a new free scraping platform made by the not-for-profit <a href="https://www.openaustraliafoundation.org.au">OpenAustralia Foundation</a>.</p>
<p>The basic idea is you write your scraper in PHP, Python or Ruby. You can do your scraping in pretty much whatever way you want. All that matters is that the final data is written to an SQLite database in your local directory. This gives you enormous power and flexibility while for the simple case it all stays nice and easy.</p>
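<p>To give a feel for how little is involved, here is a minimal sketch of such a scraper in Python using only the standard library (the URL, the parsing step, and the table name and columns are all placeholders; real scrapers typically pull in a parsing library such as lxml or BeautifulSoup):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import sqlite3
import urllib.request

# Fetch the page to scrape (placeholder URL).
html = urllib.request.urlopen("http://example.com/").read().decode("utf-8")

# ... parse `html` here with your tool of choice ...
records = [("2014-03-22", "example value")]   # placeholder parsed rows

# All that matters: write the final data to an SQLite database
# in the local directory.
conn = sqlite3.connect("data.sqlite")
conn.execute("CREATE TABLE IF NOT EXISTS data (date TEXT, value TEXT)")
conn.executemany("INSERT INTO data VALUES (?, ?)", records)
conn.commit()
conn.close()
</code></pre></div></div>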
<p>All your scraper code is stored in <a href="https://github.com">GitHub</a> under version control. People can fork your scraper and contribute fixes and everything on <a href="https://morph.io">Morph.io</a> integrates tightly with GitHub.</p>
<p>You can run that scraper from the commandline or manually via the web interface. You can also schedule it to run automatically every day.</p>
<p>You can then download the resulting data as CSVs or JSON or even do a custom SQL query against your SQLite database using the API.</p>
<p>You can also watch your scraper (or anyone else’s) and get notified via email if the scraper errors.</p>
<p>You can work on the command line with your scraper and when you’re happy push the changes to GitHub and then the next time your scraper runs on <a href="https://morph.io">Morph.io</a> (either manually or automatically) it will pick up your changes.</p>
<p><a href="https://morph.io/planningalerts-scrapers/blue-mountains"><img src="/img/posts/morph/screenshot.png" style="border: 1px solid;" /></a></p>
<p>Apart from making the straightforward use case of hosting and running a scraper easy, the focus of <a href="https://morph.io">Morph.io</a> is around collaboration. It’s easy to collaborate with someone else on developing and maintaining a scraper - you use GitHub in the way you know and there’s no mucking around with server deployments or the like.</p>
<h2 id="migrating-from-scraperwiki-classic">Migrating from ScraperWiki Classic</h2>
<p>If you’re a user of <a href="https://classic.scraperwiki.com/">ScraperWiki Classic</a> you probably have already received an email letting you know that the ScraperWiki Classic service is shutting down. Sad news as we’ve been long time users of ScraperWiki ourselves. However, we’ve gotten together with the ScraperWiki folks to make it super easy to migrate your existing ScraperWiki Classic scrapers over to <a href="https://morph.io">Morph.io</a>. It literally only requires two clicks.</p>
<h2 id="open-source">Open Source</h2>
<p><a href="https://morph.io">Morph.io</a> is also <a href="https://github.com/openaustralia/morph/">open source</a> licensed under the Affero GPL. So you can use <a href="https://morph.io">Morph.io</a> without any fear of vendor lock-in.</p>
<p>If your needs outgrow using <a href="https://morph.io">Morph.io</a> the service, then install your own private instance.</p>
<h2 id="get-started">Get started</h2>
<p>Hopefully this has given you enough motivation to give <a href="https://morph.io">Morph.io</a> a try.</p>
<p>Go to <a href="https://morph.io">Morph.io</a> and write a scraper!</p>
<p><a href="mailto:contact@oaf.org.au">Feedback</a> on how <a href="https://morph.io">Morph.io</a> is working for you is always appreciated.</p>
Matthew Landauer
Labs newsletter: 20 March, 2014
2014-03-20T00:00:00+00:00
http://okfnlabs.org/blog/2014/03/20/newsletter
<p>We’re back with a bumper crop of updates in this new edition of the now-monthly Labs newsletter!</p>
<h2 id="textus-viewer-refactoring">Textus Viewer refactoring</h2>
<p>The <a href="http://okfnlabs.org/textus-viewer/">TEXTUS Viewer</a> is an HTML + JS application for viewing texts in the format of <a href="http://okfnlabs.org/projects/textus/">TEXTUS</a>, Labs’s open source platform for collaborating around collections of texts. The viewer has now been <a href="https://github.com/okfn/textus-viewer/issues/5">stripped down</a> to its bare essentials, becoming a leaner and more streamlined beast that’s easier to integrate into your projects.</p>
<p>Check out <a href="http://okfnlabs.org/textus-viewer/">the demo</a> to see the new Viewer in action, and see the <a href="https://github.com/okfn/textus-viewer#usage">full usage instructions</a> in the repo.</p>
<h2 id="json-table-schema-foreign-key-support">JSON Table Schema: foreign key support</h2>
<p>The <a href="http://dataprotocols.org/table-schema/">JSON Table Schema</a>, Labs’s schema for tabular data, has just added an important new feature: support for <a href="http://dataprotocols.org/table-schema/#foreign-keys">foreign keys</a>. This means that the schema now provides a method for linking entries in a table to entries in a separate resource.</p>
<p>This update has been in the works for a long time, as you can see from <a href="https://github.com/dataprotocols/dataprotocols/issues/23">the discussion thread on GitHub</a>. Many thanks to everyone who participated in that year-long discussion, including <a href="http://trestletechnology.net">Jeff Allen</a>, <a href="http://deadpansincerity.com">David Miller</a>, <a href="https://github.com/gthb">Gunnlaugur Thor Briem</a>, <a href="http://standardanalytics.io">Sebastien Ballesteros</a>, <a href="http://opennorth.ca">James McKinney</a>, <a href="http://robotrebuilt.com/people/paulfitz/">Paul Fitzpatrick</a>, <a href="https://github.com/besquared">Josh Ferguson</a>, <a href="https://github.com/tryggvib">Tryggvi Björgvinsson</a>, and <a href="http://okfnlabs.org/members/rgrp">Rufus Pollock</a>.</p>
<h2 id="renaming-of-data-explorer">Renaming of Data Explorer</h2>
<p><a href="http://okfnlabs.org/projects/data-explorer/">Data Explorer</a> is Labs’s in-browser data cleaning and visualization app—and it’s about to get a name change.</p>
<p>For the past four months, <a href="https://github.com/okfn/dataexplorer/issues/150">discussion around the new name</a> has been bubbling. As of right now, <a href="http://okfnlabs.org/members/rgrp">Rufus Pollock</a> is proposing to go with the new name <em>DataDeck</em>.</p>
<p>What do you think? If you object, now’s your chance to jump in the thread and re-open the issue!</p>
<h2 id="on-the-blog-sec-edgar-database">On the blog: SEC EDGAR database</h2>
<p>Rufus has been doing some work with the <a href="http://okfnlabs.org/blog/2014/03/04/sec-edgar-database.html">Securities and Exchange Commission (SEC) EDGAR database</a>, “a rich source of data containing regulatory filings from publicly-traded US corporations including their annual and quarterly reports”. He has written up his initial findings on the blog and created a <a href="https://github.com/datasets/edgar">repo</a> for the extracted data.</p>
<p>This is an interesting example of working with XBRL, the popular XML framework for financial reporting. You can find several good Python libraries for working with XBRL in <a href="https://lists.okfn.org/pipermail/okfn-labs/2014-March/001337.html">Rufus’s message to the mailing list</a>.</p>
<h2 id="labs-hangout-today">Labs Hangout: today!</h2>
<p>Labs Hangouts are a fun and informal way for Labs members and friends to get together, discuss their work, and seek out new contributions—and the next one is happening today (20 March) at 1700-1800 GMT!</p>
<p>If you want to join in, <a href="http://pad.okfn.org/p/labs-hangouts">visit the hangout Etherpad</a> and record your name. The URL of the Hangout will be announced on the Labs mailing list as well as reported on the pad.</p>
<h2 id="get-involved">Get involved</h2>
<p>Want to join in Labs activities? There’s lots to do! Possibilities for contribution include:</p>
<ul>
<li><a href="https://github.com/okfn/data.okfn.org/issues/24">Google Spreadsheet imports</a> for <a href="http://data.okfn.org">data.okfn.org</a></li>
<li><a href="https://github.com/okfn/timemapper/issues/107#issuecomment-37631369">JSON and CSV import</a> for <a href="http://timemapper.okfnlabs.org/">TimeMapper</a></li>
<li><a href="https://github.com/okfn/datapipes/issues/107">developer documentation</a> for <a href="http://datapipes.okfnlabs.org">Data Pipes</a></li>
</ul>
<p>And much much more. Leave an idea on the <a href="http://okfnlabs.org/ideas/">Ideas Page</a>, or visit the Labs site to learn more about how you can <a href="http://okfnlabs.org/join/">join the community</a>.</p>
Neil Ashton
The SEC EDGAR Database
2014-03-04T00:00:00+00:00
http://okfnlabs.org/blog/2014/03/04/sec-edgar-database
<p>This post looks at the Securities and Exchange Commission (SEC) EDGAR database.
EDGAR is a rich source of data containing regulatory filings from
publicly-traded US corporations including their annual and quarterly reports:</p>
<blockquote>
<p>All companies, foreign and domestic, are required to file registration
statements, periodic reports, and other forms electronically through EDGAR.
Anyone can access and download this information for free. [from the <a href="http://www.sec.gov/edgar.shtml">SEC
website</a>]</p>
</blockquote>
<p>This post introduces the basic structure of the database, and how to get access
to filings via ftp. Subsequent posts will look at how to use the structured
information in the form of XBRL files.</p>
<div class="alert alert-success">
<strong>Note</strong>: an extended version of the notes here plus additional data and scripts
can be found in this <a href="https://github.com/datasets/edgar">SEC EDGAR Data
Package on Github</a>.
</div>
<h2 id="human-interface">Human Interface</h2>
<p>See <a href="http://www.sec.gov/edgar/searchedgar/companysearch.html">http://www.sec.gov/edgar/searchedgar/companysearch.html</a></p>
<p><img src="http://webshot.okfnlabs.org/api/generate?url=http%3A%2F%2Fwww.sec.gov%2Fedgar%2Fsearchedgar%2Fcompanysearch.html" /></p>
<h2 id="bulk-data">Bulk Data</h2>
<p>EDGAR provides bulk access via FTP: <a href="ftp://ftp.sec.gov/">ftp://ftp.sec.gov/</a> - <a href="https://www.sec.gov/edgar/searchedgar/ftpusers.htm">official
documentation</a>. We summarize here the main points.</p>
<p>Each company in EDGAR gets an identifier known as the CIK, which is a 10-digit
number. You can find the CIK by searching EDGAR using a company name or stock market
ticker.</p>
<p>For example, <a href="http://www.sec.gov/cgi-bin/browse-edgar?CIK=ibm&action=getcompany">searching for IBM by ticker</a> shows us that
the CIK is <code class="language-plaintext highlighter-rouge">0000051143</code>.</p>
<p>Note that leading zeroes are often omitted (e.g. in the ftp access) so this
would become <code class="language-plaintext highlighter-rouge">51143</code>.</p>
<p><img src="http://webshot.okfnlabs.org/api/generate?url=http%3A%2F%2Fwww.sec.gov%2Fcgi-bin%2Fbrowse-edgar%3FCIK%3Dibm%26action%3Dgetcompany&width=1024&height=768" /></p>
<p>Next each submission receives an ‘Accession Number’ (acc-no). For example,
IBM’s quarterly financial filing (form 10-Q) in October 2013 had accession
number: <code class="language-plaintext highlighter-rouge">0000051143-13-000007</code>.</p>
<h3 id="ftp-file-paths">FTP File Paths</h3>
<p>Given a company with CIK (company ID) XXX (omitting leading zeroes) and
document accession number YYY (acc-no on search results), file paths are of the form:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/edgar/data/XXX/YYY.txt
</code></pre></div></div>
<p>For example, for the IBM data above it would be:</p>
<p><a href="ftp://ftp.sec.gov/edgar/data/51143/0000051143-13-000007.txt">ftp://ftp.sec.gov/edgar/data/51143/0000051143-13-000007.txt</a></p>
<p>Note: if you are looking for a nice HTML version, you can find it in the
Archives section at a similar URL (just add -index.htm):</p>
<p><a href="http://www.sec.gov/Archives/edgar/data/51143/000005114313000007/0000051143-13-000007-index.htm">http://www.sec.gov/Archives/edgar/data/51143/000005114313000007/0000051143-13-000007-index.htm</a></p>
<h3 id="indices">Indices</h3>
<p>If you want to get a list of all filings you’ll want to grab an Index. As the help page explains:</p>
<blockquote>
<p>The EDGAR indices are a helpful resource for FTP retrieval, listing the
following information for each filing: Company Name, Form Type, CIK, Date
Filed, and File Name (including folder path).</p>
<p>Four types of indexes are available:</p>
<ul>
<li>company — sorted by company name</li>
<li>form — sorted by form type</li>
<li>master — sorted by CIK number</li>
<li>XBRL — list of submissions containing XBRL financial files, sorted by CIK
number; these include Voluntary Filer Program submissions</li>
</ul>
</blockquote>
<p>URLs are like:</p>
<p><a href="ftp://ftp.sec.gov/edgar/full-index/2008/QTR4/master.gz">ftp://ftp.sec.gov/edgar/full-index/2008/QTR4/master.gz</a></p>
<p>That is, they have the following general form:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ftp://ftp.sec.gov/edgar/full-index/{YYYY}/QTR{1-4}/{index-name}.[gz|zip]
</code></pre></div></div>
<p>So for XBRL in the 3rd quarter of 2010 we’d do:</p>
<p><a href="ftp://ftp.sec.gov/edgar/full-index/2010/QTR3/xbrl.gz">ftp://ftp.sec.gov/edgar/full-index/2010/QTR3/xbrl.gz</a></p>
<h3 id="cik-lists-and-lookup">CIK lists and lookup</h3>
<p>There’s a full list of all companies along with their CIK code here: <a href="http://www.sec.gov/edgar/NYU/cik.coleft.c">http://www.sec.gov/edgar/NYU/cik.coleft.c</a></p>
<p>If you want to look up a CIK or company by its ticker you can do the following query against the normal search system:</p>
<p><a href="http://www.sec.gov/cgi-bin/browse-edgar?CIK=ibm&Find=Search&owner=exclude&action=getcompany&output=atom">http://www.sec.gov/cgi-bin/browse-edgar?CIK=ibm&Find=Search&owner=exclude&action=getcompany&output=atom</a></p>
<p>Then parse the atom to grab the CIK. (If you prefer HTML output just omit output=atom).</p>
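<p>A small sketch of that lookup in Python. The regular expression assumes the feed carries the zero-padded CIK in a <code class="language-plaintext highlighter-rouge">cik</code> element, so treat the exact element name as an assumption to verify against the live feed:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re
import urllib.request

def cik_for_ticker(ticker):
    """Look up a CIK by ticker via the EDGAR company search (atom output)."""
    url = ("http://www.sec.gov/cgi-bin/browse-edgar?CIK=%s"
           "&Find=Search&owner=exclude&action=getcompany&output=atom" % ticker)
    atom = urllib.request.urlopen(url).read().decode("utf-8")
    # Element name assumed from the feed structure; matched case-insensitively.
    match = re.search(r"<cik>(\d+)</cik>", atom, re.IGNORECASE)
    return match.group(1) if match else None

print(cik_for_ticker("ibm"))   # expected: 0000051143
</code></pre></div></div>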
<p>There is also a full-text company name to CIK lookup here:</p>
<p><a href="http://www.sec.gov/edgar/searchedgar/cik.htmL">http://www.sec.gov/edgar/searchedgar/cik.htmL</a></p>
<p>(Note this does a POST to a ‘text’ API at <a href="http://www.sec.gov/cgi-bin/cik.pl.c">http://www.sec.gov/cgi-bin/cik.pl.c</a>)</p>
Rufus Pollock
Labs newsletter: 20 February, 2014
2014-02-20T00:00:00+00:00
http://okfnlabs.org/blog/2014/02/20/newsletter
<p>The past few weeks have seen major improvements to the Labs website, another Open Data Maker Night in London, updates to the TimeMapper project, and more.</p>
<h2 id="labs-hangout-today">Labs Hangout: today</h2>
<p>The next Labs online hangout is taking place today in just a few hours—now’s your chance to sign up on <a href="http://pad.okfn.org/p/labs-hangouts">the hangout’s Etherpad</a>!</p>
<p>Labs hangouts are informal online gatherings held on Google Hangout at which Labs members and friends get together to discuss their work and to set the agenda for Labs activities.</p>
<p>Today’s hangout will take place at 1700 - 1800 GMT. Check <a href="http://pad.okfn.org/p/labs-hangouts">the hangout pad</a> for more details, and watch the pad for notes from the meeting.</p>
<h2 id="crowdcrafting-at-citizen-cyberscience-summit-2014">Crowdcrafting at Citizen Cyberscience Summit 2014</h2>
<p>In today’s other news, Labs’s <a href="http://okfnlabs.org/members/teleyinex">Daniel Lombraña González</a> is presenting <a href="http://okfnlabs.org/projects/crowdcrafting-and-pybossa/">Crowdcrafting</a> at the <a href="http://lanyrd.com/2014/citizen-cyberscience-summit/">Citizen Cyberscience Summit 2014</a>. You can read more about his presentation <a href="http://lanyrd.com/2014/citizen-cyberscience-summit/sctxth/">here</a>.</p>
<p>Crowdcrafting is an open-source citizen science platform that “empowers citizens to become active players in scientific projects by donating their time in order to solve micro-task problems”. Crowdcrafting has been used by institutions including CERN, the United Nations, and the National Institute of Space Research of Brazil.</p>
<h2 id="labs-site-updates">Labs site updates</h2>
<p>Labs has been discussing improving the website for some time now, and the past weeks have seen many of those proposed improvements being put into action.</p>
<p>One of the biggest changes is a new <a href="http://okfnlabs.org/projects/">projects page</a>. Besides having a beautiful new layout, the new projects page implements <a href="https://github.com/okfn/okfn.github.com/issues/160">filtering</a> by tags, language, and more.</p>
<p>The site now also features a reciprocal linking of users and projects. The projects page now shows projects’ maintainers (n.b. plural!), and user pages now show which projects users contribute to (e.g. <a href="http://okfnlabs.org/members/andylolz/">Andy Lulham’s page</a> highlights his <a href="http://okfnlabs.org/projects/data-pipes/">Data Pipes</a> contributions).</p>
<h2 id="timemapper-improvements">TimeMapper improvements</h2>
<p><a href="http://timemapper.okfnlabs.org/">TimeMapper</a> is a Labs project allowing you to create elegant timelines with inline maps from Google Spreadsheets in a matter of seconds.</p>
<p>A number of improvements have been made to TimeMapper:</p>
<ul>
<li>support for <a href="https://github.com/okfn/timemapper/issues/121">different types of data views</a> (including simple timeline and map views)</li>
<li><a href="https://github.com/okfn/timemapper/issues/109">nice URLs for anonymous timemaps</a></li>
<li>bugfix: <a href="https://github.com/okfn/timemapper/issues/86">missing start date doesn’t cause a crash</a></li>
</ul>
<h2 id="open-data-maker-night-february">Open Data Maker Night February</h2>
<p>Two weeks ago today, the <a href="http://www.meetup.com/OpenKnowledgeFoundation/London-GB/1093152/">ninth Open Data Maker Night London</a> was hosted by <a href="http://okfnlabs.org/members/andylolz">Andy Lulham</a>. This edition was a <em>mapping special</em>, featuring <a href="http://www.openstreetmap.org/">OpenStreetMap</a> contributor <a href="http://harrywood.co.uk/">Harry Wood</a>.</p>
<p><a href="http://okfnlabs.org/events/open-data-maker/">Open Data Maker Nights</a> are informal, action-oriented get-togethers where <em>things get made</em> with open data. Visit the Labs website for <a href="http://okfnlabs.org/events/open-data-maker/">more information</a> on them, including info on how to host your own.</p>
<h2 id="datapackage--bubbles">DataPackage + Bubbles</h2>
<p>In last week’s newsletter, you heard about <a href="http://okfnlabs.org/members/Stiivi/">Štefan Urbánek</a>’s abstract data processing framework <a href="https://github.com/Stiivi/bubbles">Bubbles</a>. Štefan just notified the OKFN Labs list that he has created a <a href="https://gist.github.com/Stiivi/9104719">demo of Bubbles using Data Packages</a>, Labs’s simple standard for data publication.</p>
<p>“The example is artificial”, Štefan says, but it highlights the power of the Bubbles framework and the potential of the Data Package format.</p>
<h2 id="get-involved">Get involved</h2>
<p>We’re always looking for new contributions at the Labs. Read about <a href="http://okfnlabs.org/join/">how you can join</a>, and see the <a href="http://okfnlabs.org/ideas/">Ideas Page</a> to get in on the ground floor of a Labs project—or just join the <a href="http://lists.okfn.org/mailman/listinfo/okfn-labs">Labs mailing list</a> to participate by offering feedback.</p>
Neil Ashton
Labs newsletter: 30 January, 2014
2014-01-30T00:00:00+00:00
http://okfnlabs.org/blog/2014/01/30/newsletter
<p>From now on, the Labs newsletter will arrive through a special announce-only mailing list, <em>newsletter@okfnlabs.org</em>, more details on which can be found below.</p>
<p>Keep reading for other new developments including the fifth Labs Hangout, the launch of SayIt, and new developments in the vision of “Frictionless Data”.</p>
<h2 id="new-newsletter-format">New newsletter format</h2>
<p>Not everyone who wants to know about Labs activities wants or needs to observe those activities unfolding on the main Labs list. For friends of Labs who just want occasional updates, we’ve created a new, <a href="http://sendy.co/">Sendy</a>-based announce-only list that will bring you a Labs newsletter every two weeks.</p>
<p>Everyone currently subscribed to <em>okfn-labs@lists.okfn.org</em> has been added to the new list. To join the new announce list, see the <a href="http://okfnlabs.org/contact/">Labs Contact page</a>, where there’s a form.</p>
<h2 id="labs-hangout-no-5">Labs Hangout no. 5</h2>
<p>Last Thursday, <a href="http://okfnlabs.org/members/andylolz">Andy Lulham</a> hosted the fifth OKFN Labs Hangout. The Labs Hangouts are a way for people curious about Labs projects to informally get together, share their work, and talk about the future of Labs.</p>
<p>For full details, check out the <a href="http://pad.okfn.org/p/labs-hangouts">minutes from the hangout</a>. Highlights included:</p>
<ul>
<li>SayIt, a new publication platform for speeches & transcripts, introduced by <a href="http://twitter.com/steiny">Tom Steinberg</a> of <a href="http://t.co/KKNpVhbitu">mySociety</a> (see below for more!)</li>
<li>announcement of an <a href="http://humanities.okfn.org/open-literature-sprint-jan-2014/">Open Literature Sprint</a> this past Saturday</li>
<li>full coverage of PyBossa source code with <a href="https://coveralls.io/r/PyBossa/pybossa">unit tests</a></li>
<li><a href="http://twitter.com/tfmorris">Tom Morris</a>’s work parsing and importing e-publications from <a href="http://openlibrary.org">Open Library</a></li>
<li>updates to <a href="http://data.okfn.org/vision">Frictionless Data</a> (see below)</li>
</ul>
<h2 id="sayit">SayIt</h2>
<p><a href="http://sayit.mysociety.org/">SayIt</a>, an open-source tool for publishing and sharing transcripts, has just been launched by <a href="http://poplus.org/">Poplus</a>. At last week’s Labs Hangout, <a href="http://twitter.com/steiny">Tom Steinberg</a> of <a href="http://t.co/KKNpVhbitu">mySociety</a> (one half of Poplus, alongside <a href="http://www.ciudadanointeligente.org/?lang=en">Ciudadano Inteligente</a>) shared some of the motivations behind the creation of the tool, which was also discussed <a href="https://lists.okfn.org/pipermail/okfn-discuss/2014-January/010083.html">on the okfn-discuss mailing list</a>.</p>
<p>As Tom explained, mySociety’s <a href="http://www.theyworkforyou.com/">They Work For You</a> has proven the popularity of transcript data. But making the transcripts available in a nice way (e.g. with a decent API) has so far called for bespoke software development. SayIt is designed to encourage “nice” publication as the starting-point—and to serve as a pedagogical example of what a good data publication tool looks like.</p>
<h2 id="frictionless-data-vision-roadmap-composability">Frictionless data: vision, roadmap, composability</h2>
<p>We’ve heard about <a href="http://okfnlabs.org/members/rgrp">Rufus</a>’s vision for an ecosystem of “frictionless data” in the past. Now the discussion is starting to get serious. <a href="http://data.okfn.org/">data.okfn.org</a> now hosts two key documents generated through the conversation:</p>
<ul>
<li><a href="http://data.okfn.org/vision">the vision</a>: what will create a dynamic, productive, and attractive open data ecosystem?</li>
<li><a href="http://data.okfn.org/roadmap">the roadmap</a>: what has to happen to bring this vision to life?</li>
</ul>
<p>The new roadmap is a particularly lucid overview of how the frictionless data vision connects with concrete actions. Would-be creators of this new ecosystem should consult the roadmap to see where to join in.</p>
<p><a href="https://lists.okfn.org/pipermail/okfn-labs/2014-January/001260.html">Discussion on the Labs list</a> has also generated some interesting insights. <a href="http://t.co/pL0Yy7uNuf">Data Unity</a>’s Kev Kirkland discussed his work with Semantic Web formalization of composable data manipulation processes, and <a href="http://okfnlabs.org/members/Stiivi/">Štefan Urbánek</a> made a connection with his work on “abstracting datasets and operations” in the ETL framework <a href="https://github.com/Stiivi/bubbles">Bubbles</a>.</p>
<h2 id="on-the-blog-olap-part-two">On the blog: OLAP part two</h2>
<p>Last week, <a href="http://okfnlabs.org/members/Stiivi/">Štefan Urbánek</a> wrote us an <a href="http://okfnlabs.org/blog/2014/01/10/olap-introduction.html">introduction to Online Analytical Processing</a>. Shortly afterwards, he followed up with a second post taking a closer look at <a href="http://okfnlabs.org/blog/2014/01/20/olap-cubes-and-logical-model.html">how OLAP data is structured and why</a>.</p>
<p>Check out Štefan’s post to learn about how OLAP represents data as multidimensional “cubes” that users can slice and dice to explore the data along its many dimensions.</p>
<h2 id="timemapper-improvements">TimeMapper improvements</h2>
<p><a href="http://okfnlabs.org/members/andylolz">Andy Lulham</a> has started working on <a href="http://timemapper.okfnlabs.org">TimeMapper</a>, Labs’s easy-to-use tool for the creation of interactive timelines linked to geomaps.</p>
<p>Some of the improvements he has made so far have been bugfixes (e.g. <a href="https://github.com/okfn/timemapper/pull/119">preventing overflowing form controls</a>, <a href="https://github.com/okfn/timemapper/pull/118">fixing the template settings file</a>), but one of them is a new user feature: adding a way to <a href="http://timemapper.okfnlabs.org">change the starting event</a> on a timeline so that they don’t always have to start at the beginning.</p>
<h2 id="get-involved">Get involved</h2>
<p>Want to get involved with Labs’s projects? Now is a great time to join in! Check out the <a href="http://okfnlabs.org/ideas/">Ideas Page</a> to see some of the many things you can do once you <a href="http://okfnlabs.org/join/">join Labs</a>, or just jump on the <a href="http://lists.okfn.org/mailman/listinfo/okfn-labs">Labs mailing list</a> and take part in a conversation.</p>
Neil Ashton
OLAP Cubes and Logical Models
2014-01-20T00:00:00+00:00
http://okfnlabs.org/blog/2014/01/20/olap-cubes-and-logical-model
<p>Last time we talked about OLAP in general – what it is and why it is useful.
Today we are going to look at the data – how they are structured and why. What
are cubes? What does “multi-dimensional” mean?</p>
<h2 id="data-cubes-and-logical-model">Data Cubes and Logical Model</h2>
<p>Application data might be a mess from the user’s perspective. Not only that, the data
might be scattered all around the place in multiple systems. Even when the
data are put into one place called a “data warehouse”, they still have
their original form, which is not ready to answer our questions quickly.
The purpose of the logical model is to hide the physical structure of the data (how
applications use it) and provide a user-oriented view of the data (how the business
sees it).</p>
<p>“Answering questions quickly” does not depend only on database performance and
the amount of data. We might have the fastest database and computation engine in
the world and still not get the answer quickly, because it can take weeks
to properly translate the human (business) question into technical terms.
The challenges are:</p>
<ul>
<li>Where are the data stored? What table? Which column?</li>
<li>What are the categories and what can I summarize?</li>
<li>What are the relationships between columns?</li>
<li>Is this <code class="language-plaintext highlighter-rouge">category_id</code> column the same as this <code class="language-plaintext highlighter-rouge">pk_prod_cat</code>?</li>
<li>Does this column contain a key (which is unique) or is it a label (which
might not be, due to data evolution)?</li>
<li>How can I group the data?</li>
</ul>
<p>All this information is collected in <em>metadata</em> called the <em>logical model</em>.
Analysts or report writers do not have to know where the name of an organisation
or category is stored, nor do they have to care whether customer data are
stored in a single table or spread across multiple tables (customer, customer
types, …). They just ask for “customer name” or “category code”.</p>
<h2 id="cubes">Cubes</h2>
<p>The data structures used in the OLAP are multidimensional data cubes or <a href="http://en.wikipedia.org/wiki/OLAP_cube"><em>OLAP
cubes</em></a>:</p>
<p><img src="/img/posts/olap-data_cube.png" alt="" /></p>
<p>A cube is a data structure that can be imagined as a multi-dimensional
spreadsheet. How can we imagine it? Take a spreadsheet, put year on columns,
department on rows – that’s a two-dimensional cube. Now create multiple sheets
with data of the same structure, say one sheet per country. Now you have a
three-dimensional cube.</p>
<h2 id="facts-and-measures">Facts and Measures</h2>
<p>A fact is the most detailed piece of information that can be measured.</p>
<p><img src="/img/posts/olap-data_cube-fact.png" alt="" /></p>
<p>Examples of facts might be a contract, an item of spending, a phone call or a visit. We
can measure:</p>
<ul>
<li>contract: financial amount, discount, planned amount</li>
<li>spending: financial amount, quantity</li>
<li>phone call: duration, cost</li>
<li>visit: duration</li>
</ul>
<p>Those measurable properties, such as amount, discount or duration, are
called <em>measures</em>.</p>
<p>We are mostly interested in a summarized view: “what was the overall spending?”,
“what is the average call duration?” or “how many contracts are there?” Those
computed values are called <em>aggregates</em> or <em>aggregated measures</em>.</p>
<p>Facts might have multiple measures, or they might even have none. If there are
no measures, we can still at least answer questions of the type “how many?”.</p>
<p>Note: the terminology might differ slightly across literature and systems.
For example, Microsoft calls a <em>measure</em> a <em>measure group</em> and labels
<em>aggregates</em> as <em>measures</em>.</p>
<h2 id="dimensions">Dimensions</h2>
<p>OLAP is suitable mostly for data which can be categorized – grouped by
categories. The categorical view of the data should also be the main interest of
the data analysis. Examples of categories might be: color, department, location
or even a date.</p>
<p>The categories are called <em>dimensions</em>.</p>
<p><img src="/img/posts/olap-data_cube-dimensions.png" alt="" /></p>
<p>Dimensions provide <em>context for facts</em>:</p>
<ul>
<li>Where did that happen?</li>
<li>When was the contract signed?</li>
<li>What kind of goods or services was in the contract?</li>
</ul>
<p>Dimensions are used to <em>filter</em> queries:</p>
<ul>
<li>What was the spending last year?</li>
<li>How many contracts signed by the department of Health?</li>
</ul>
<p>They are used to control the <em>scope of aggregation</em> of facts (a small sketch follows the list below):</p>
<ul>
<li>What was the number of contracts by department?</li>
<li>What was the average visit duration per month?</li>
<li>What are the sales of each product?</li>
</ul>
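<p>A toy illustration of these two uses of dimensions - filtering and scoping aggregation - with pandas standing in for an OLAP browser (the fact table and its values are made up):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd

# Toy fact table: each row is one contract (a fact) with its measure
# (amount) and dimensions (year, country, department).
contracts = pd.DataFrame({
    "year":       [2010, 2010, 2011, 2011],
    "country":    ["Estonia", "Estonia", "Slovakia", "Estonia"],
    "department": ["Health", "IT", "Health", "IT"],
    "amount":     [100.0, 250.0, 80.0, 310.0],
})

# Filter by dimension values: "contracts in Estonia in 2010".
estonia_2010 = contracts[(contracts.country == "Estonia") & (contracts.year == 2010)]
print(len(estonia_2010), "contracts in the slice")

# A dimension controls the scope of aggregation: "number of contracts
# and total amount, by department".
print(contracts.groupby("department").agg(
    contract_count=("amount", "size"),
    total_amount=("amount", "sum")))
</code></pre></div></div>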
<h2 id="concept-hierarchies">Concept Hierarchies</h2>
<p>We might be interested in the amount per year, then per month for a particular
year; products can be grouped by categories and subcategories; a location might
be defined by country, and a country might have multiple cities… Those are concept
hierarchies of dimensions.</p>
<p>A hierarchy has multiple levels, and there might be various hierarchical views of
any dimension. For example, the date might be split by year, month and day. Or
it might be split by year, quarter, month and no day (because we have no daily
data), or by year and week (for weekly data).</p>
<p>From a technical perspective you might associate an attribute with a dimension.
Depending on the modelling method, a dimension might be composed of just one
attribute or of multiple attributes grouped into hierarchies.</p>
<p>Note: there are multiple approaches to concept hierarchies. The one described
here is: a dimension might be composed of multiple levels, and the levels are grouped into
hierarchies. Another approach might be “hierarchies are lists of dimensions”,
where a dimension represents just a single attribute.</p>
<h2 id="slicing-and-dicing">Slicing and Dicing</h2>
<p>We have a data cube full of facts; how can we explore the data? We slice the
cube! What does that mean?</p>
<p>Say we have a data cube of contracts with the dimensions <em>time</em>, <em>country</em> and
<em>type (of procured subject)</em>.</p>
<p>We might be interested in spending in 2010:</p>
<p><img src="/img/posts/olap-slice_and_dice-time.png" alt="" /></p>
<p>… or contracts in Estonia:</p>
<p><img src="/img/posts/olap-slice_and_dice-country.png" alt="" /></p>
<p>… or contracts in Estonia in 2010:</p>
<p><img src="/img/posts/olap-slice_and_dice-country_time.png" alt="" /></p>
<p>… or just IT contracts in general:</p>
<p><img src="/img/posts/olap-slice_and_dice-type.png" alt="" /></p>
<p>… or IT contracts in Estonia in 2010:</p>
<p><img src="/img/posts/olap-slice_and_dice-all_dimensions.png" alt="" /></p>
<p>Some OLAP systems might have this information readily available in a
pre-computed (pre-aggregated) form, therefore we might get the answer very quickly
despite the huge amount of original data. Even if the system does not store the
pre-aggregated data cells, it might use some other transparent tricks to
achieve fast responses.</p>
<p>Slicing and dicing is an operation that filters the data cells of a cube and
narrows our focus from a broader view:</p>
<p><img src="/img/posts/olap-slice_and_dice-overview.png" alt="" /></p>
<h2 id="drilling-down">Drilling down</h2>
<p><em>How many contracts per year?</em> or <em>Which type of products was most wanted in
2012?</em> are the kinds of questions that are answered by “drilling down” through the
data. Drilling down means changing our focus to more detailed data.</p>
<p>Drilling down can be done along concept hierarchies – for example going from a year
summary to a month summary to daily sales, or by going from the country level to the
regional level.</p>
<p>The opposite operation is called “roll-up” – for example going from a monthly
view to a yearly view.</p>
<h2 id="try-it">Try It</h2>
<p>You might try OLAP with the light-weight Python framework
<a href="http://cubes.databrewery.org">Cubes</a>. I’ll be talking about the framework in
more detail in the future; meanwhile, here are the main features (a short sketch follows the list):</p>
<ul>
<li>ROLAP – OLAP on top of relational database</li>
<li>quick prototyping on top of existing database schemas</li>
<li>metadata driven with user-oriented metadata</li>
<li>localizable</li>
<li>OLAP API with HTTP <a href="http://pythonhosted.org/cubes/server.html">server</a></li>
<li>no need to know Python</li>
</ul>
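<p>As a starting point, here is a sketch in the style of the Cubes tutorial - it assumes a SQLite database and a <code class="language-plaintext highlighter-rouge">model.json</code> logical model already exist, the cube and dimension names are placeholders, and the exact API varies between Cubes releases:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from cubes import Workspace

# Placeholder store and model names; both come from your own setup.
workspace = Workspace()
workspace.register_default_store("sql", url="sqlite:///data.sqlite")
workspace.import_model("model.json")

browser = workspace.browser("contracts")   # cube name from the model

# Aggregate over the whole cube ...
result = browser.aggregate()
print(result.summary)

# ... then drill down along a dimension for per-member summaries.
for record in browser.aggregate(drilldown=["department"]):
    print(record)
</code></pre></div></div>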
<p>The <a href="https://github.com/Stiivi/cubes">development version</a> includes pluggable
datawarehouse (cubes from external sources) and many new backends such as
MongoDB.</p>
<p>For reporting and data exploration you might use
<a href="http://jjmontesl.github.io/cubesviewer/">CubesViewer</a>. More visualisation
software is being developed.</p>
<h2 id="summary">Summary</h2>
<p>The concept of OLAP cubes and multidimensional modeling brings more understandable
and usable data to end-users. It is very easy and straightforward to
translate business questions into multidimensional queries.</p>
<p>The OLAP systems, thanks to the nature of multi-dimensional data cubes, can
prepare data by aggregating them up-front to provide answers faster.</p>
<p>Moreover, explicit metadata (the logical model) allows not only more flexible data
navigation but also easy transformation of the data for use in various
reporting software. Some OLAP tools can work with certain database schemas
immediately.</p>
<p>To sum it up in a few words, the multidimensional modeling of OLAP cubes brings:
understandability, better usability, speed and logical data reusability.</p>
<p>Next time we will look at the <a href="http://cubes.databrewery.org">Cubes – Lightweight Python
framework</a> – how to have an OLAP server running
“in 15 minutes”.</p>
Stefan Urbanek
Labs newsletter: 16 January, 2014
2014-01-16T00:00:00+00:00
http://okfnlabs.org/blog/2014/01/16/newsletter
<p>Welcome back from the holidays! A new year of Labs activities is well underway, with long-discussed improvements to the Labs projects page, many new PyBossa developments, a forthcoming community hangout, and more.</p>
<h2 id="labs-projects-page">Labs projects page</h2>
<p><a href="https://github.com/okfn/okfn.github.com/issues/46">Getting the Labs project page organized better</a> has been high on the agenda for some time now. In the past little while, significant progress has been made. New improvements to the project page include:</p>
<ul>
<li><a href="https://github.com/okfn/okfn.github.com/pull/168">a custom filter menu</a></li>
<li><a href="https://github.com/okfn/okfn.github.com/pull/165">individual project lightbox</a></li>
<li><a href="https://github.com/okfn/okfn.github.com/pull/159">attributes for projects</a></li>
</ul>
<p><a href="http://okfnlabs.org/members/loleg/">Oleg Lavrosky</a>, <a href="http://okfnlabs.org/members/teleyinex/">Daniel Lombraña González</a>, and <a href="http://okfnlabs.org/members/andylolz/">Andy Lulham</a> have all contributed to this development—and work is still ongoing, with <a href="https://github.com/okfn/okfn.github.com/issues/161">further enhancements to attributes</a> and <a href="https://github.com/okfn/okfn.github.com/issues/160">more work on the UI</a> still to come.</p>
<h2 id="lots-of-pybossa-milestones">Lots of PyBossa milestones</h2>
<p>PyBossa has achieved so many milestones since the last newsletter that it’s hard to know where to begin.</p>
<p>PyBossa v0.2.1 was released by <a href="http://okfnlabs.org/members/teleyinex/">Daniel Lombraña González</a>, becoming a more robust service through the inclusion of a new rate-limiting feature for API calls. Alongside rate limits, the new PyBossa has improved security through the addition of a secure cookie-based solution for posting task runs. Full details can be found <a href="http://docs.pybossa.com/en/latest/api.html#rate-limiting">in the documentation</a>.</p>
<p>Daniel also released <a href="http://daniellombrana.es/taggingpictures.html">a new PyBossa template</a> for annotating pictures. The template, which incorporates the <a href="http://annotorious.github.io/">Annotorious.JS</a> JavaScript library, “allow[s] anyone to extract structured information from pictures or photos in a very simple way”.</p>
<p>The <a href="https://github.com/PyBossa/enki">Enki</a> package for analyzing PyBossa applications was also released over the break. Enki makes it possible to download completed PyBossa tasks and associated task runs, analyze them with <a href="http://pandas.pydata.org/">Pandas</a>, and share the result as an <a href="http://ipython.org/notebook.html">IPython Notebook</a>. Check out Daniel’s <a href="http://daniellombrana.es/blog/2013/12/16/pybossa-enki.html">blog post</a> on Enki to see what it’s about.</p>
<h2 id="new-on-the-blog">New on the blog</h2>
<p>We’ve had a couple of great new contributions on the <a href="http://okfnlabs.org/blog/">Labs blog</a> since the last newsletter.</p>
<p><a href="http://okfnlabs.org/members/tlevine/">Thomas Levine</a> has written about <a href="http://okfnlabs.org/blog/2013/12/25/parsing-pdfs.html">how he parses PDF files</a>, lovingly exploring a problem that all data wranglers will encounter and gnash their teeth over at least a few times in their lives.</p>
<p><a href="http://okfnlabs.org/members/Stiivi/">Stefan Urbanek</a>, meanwhile, has written an <a href="http://okfnlabs.org/blog/2014/01/10/olap-introduction.html">introduction to OLAP</a>, “an approach to answering multi-dimensional analytical queries swiftly”, explaining what that means and why we should take notice.</p>
<h2 id="dānabox">Dānabox</h2>
<p>Labs friend <a href="http://okfn.org/members/darwin/">Darwin Peltan</a> reached out to the list to point out that his friend’s project <a href="http://danabox.io/">Dānabox</a> is looking for testers and general feedback. Labs members are invited to pitch in by finding bugs and breaking it.</p>
<p>Dānabox is “Heroku but with public payment pages”, crowdsourcing the payment for an app’s hosting costs. Dānabox is <a href="https://github.com/danabox">open source</a> and built on the <a href="http://deis.io/">Deis platform</a>.</p>
<h2 id="community-hangout">Community hangout</h2>
<p>It’s almost time for the Labs community hangout. The Labs hangout is the regular event where Labs members meet up online to discuss their work, find ways to collaborate, and set the agenda for the weeks to come.</p>
<p>When will the hangout take place? <a href="http://okfnlabs.org/members/rgrp">Rufus</a> proposes <a href="https://github.com/okfn/okfn.github.com/issues/167">moving the hangout from the 21st to the 23rd</a>. If you want to participate, leave a comment on the thread to let Labs know what time would work for you.</p>
<h2 id="get-involved">Get involved</h2>
<p>Labs is the Labs community, no more and no less, and you’re invited to become a part of it! <a href="http://okfnlabs.org/join/">Join the community</a> by coding, blogging, kicking around ideas on the <a href="http://okfnlabs.org/ideas/">Ideas Page</a>, or joining the conversation on the <a href="http://lists.okfn.org/mailman/listinfo/okfn-labs">Labs mailing list</a>.</p>
Neil Ashton
Introduction to OLAP
2014-01-10T00:00:00+00:00
http://okfnlabs.org/blog/2014/01/10/olap-introduction
<h2 id="what-is-olap">What is OLAP?</h2>
<p><em>“Online Analytical Processing – OLAP is an approach to answering
multi-dimensional analytical queries swiftly”</em> says
<a href="http://en.wikipedia.org/wiki/Online_analytical_processing">Wikipedia</a>. What
does that mean? What are multi-dimensional analytical queries? Why this
approach? We will learn all this in a short blog series.</p>
<p>The term OLAP is becoming a bit less appropriate. It comes from
traditional data warehousing, from times when “big data” would fit on your
current laptop and even that little amount was time-consuming to process by
today’s standards. Nowadays the majority of analytical
processing can be considered online. A more appropriate term is
“multidimensional data processing”, as we will see later. For now we will stick
with the original name of the approach.</p>
<p>The basic concepts of OLAP are:</p>
<ul>
<li><em>data cubes</em> – multi-dimensional approach to data</li>
<li>fast aggregation or pre-aggregation</li>
</ul>
<h2 id="why-olap">Why OLAP?</h2>
<p>There are two sides to data: data used by application systems
(transactional, operational) and data used by humans for decision
making. The two kinds of data are in many cases very different, as
applications have different needs than humans do.</p>
<p><img src="/img/posts/olap-overview.png" alt="" /></p>
<p>Applications require and store more detailed data. They require, for example,
efficiency and integrity at the transaction (operation) level. The data might be
stored in different places, depending on how they are used by the systems. On
the other hand, decision makers want to see the data in one place and in a
form that reflects their view of the world. They don’t care how long it
takes to store a million transactions; they want to know when those million
transactions happened and where.</p>
<p>Why OLAP then? Decision makers, analysts, or any other curious people would
like to have their questions answered quickly. The data in operational database
systems are not stored in a way that allows those questions to be answered
easily. OLAP is the technical and semantic bridge between the two ways of
using the data.</p>
<p>Why can an OLAP system <em>answer the analytical queries swiftly</em>, or at least
faster than an operational system? The data in the analytical system are already
modeled and prepared in a form closer to the typical questions:</p>
<p><em>Pre-aggregation</em> – data are aggregated at different levels of granularity,
such as monthly totals, and stored. Questions like “what were the sales in
January 2010?” then require no computation, just a direct fetch of the
answer. The data might be pre-aggregated at all levels and combinations of
their properties (dimensions).</p>
<p><em>Multi-dimensional databases</em> – data are stored in alternative kinds of
faster structures.</p>
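<p>To make pre-aggregation concrete, here is a minimal sketch using <a href="http://pandas.pydata.org/">pandas</a> (an illustration added here, with made-up column names, not part of the original text): detail rows are rolled up to monthly totals once, so the January 2010 question becomes a lookup rather than a scan.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd

# illustrative transactional data: one row per sale
sales = pd.DataFrame({
    'date': pd.to_datetime(['2010-01-05', '2010-01-20', '2010-02-03']),
    'amount': [120.0, 80.0, 200.0],
})

# pre-aggregate once: monthly totals stored alongside the detail data
monthly = sales.set_index('date').resample('M')['amount'].sum()

# "what were the sales in January 2010?" is now a direct fetch,
# not a scan over the transaction table
print(monthly.loc['2010-01'])
</code></pre></div></div>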
<h2 id="different-approach">Different Approach</h2>
<p>In operational systems, updates of information are permitted, even required.
Keeping a change history might be impractical for events with high frequency. From
an analytical point of view this can be undesirable, as we would like to know the
historical evolution of a state. For example: while it is sufficient for an
application to know the current account balance or the available budget of some
kind, analysts would like to see how the amount changed over time and how it
might have been influenced by other events.</p>
<p>One of the differences that I consider crucial is the system requirements:
for operational systems the requirements for data design are well known up
front. It is very unlikely that an application will generate an unexpected
query, so the system is built around the application’s
behavior. The requirements of analysts, on the other hand, are ad hoc. The
analytical system has to be designed in a way that allows quick answers
to business questions.</p>
<p>Redundancy: in applications, redundancy can introduce quite a lot of errors,
mostly through data inconsistency. In analytical systems, redundancy is often
desired: any information that is readily available close to
the data being queried makes responses faster. The design of
analytical systems and the presence of historical data allow the redundant
information to be reconstructed, so any inconsistencies can be corrected.
Examples of redundancy in analytical systems:</p>
<ul>
<li>multiple copies of the same data in different contexts</li>
<li>denormalized data</li>
</ul>
<p>One more difference I will mention here is the amount of data being
processed at a single time. In operational systems, only a small amount of data is
required to complete the desired operation. For example: a change of a client’s address
(the client’s identification and new address) or a budget expense (the budget line and the
expended amount). In an analytical system, a large amount of data has to be
“touched” to answer an analyst’s question such as “what was the spending by country?”</p>
<p>Here are the differences between analytical and operational data, summarized:</p>
<ul>
<li>subject oriented vs. application oriented</li>
<li>summarized vs. detailed</li>
<li>analysis driven vs. transaction driven</li>
<li>read-only vs. updateable</li>
<li>unknown processing requirements vs. well known processing requirements</li>
<li>redundancy allowed vs. redundancy undesired</li>
<li>large amount of data per operation vs. tiny amount of data per operation</li>
</ul>
<p>The two systems can be physically separate, but with modern tools the
analytical system can also be integrated with the transactional one on a
single database platform.</p>
<h2 id="software">Software</h2>
<p>There are many commercial databases and applications for multi-dimensional
data modelling and OLAP, in the form of Business Intelligence suites from
historically big names such as Oracle, Microsoft, SAS and others. Search for
“OLAP documentation” together with a company name to get an idea of their
approach, capabilities and features.</p>
<p>A newer trend is to offer OLAP or OLAP-like systems as a service, where one
imports data, the software transforms the data into data cubes, and a
reporting interface is provided on top.</p>
<p>There are very few open-source OLAP packages, though, and even fewer
general-purpose ones. Just to mention two:</p>
<ul>
<li><a href="http://cubes.databrewery.org">Cubes</a> – lightweight OLAP framework (written
in Python, <a href="https://github.com/Stiivi/cubes">Github</a>); a short usage sketch follows this list</li>
<li><a href="http://www.pentaho.com">Pentaho</a> – full-featured Business Intelligence
suite (written in Java)</li>
</ul>
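<p>To give a taste of the first one, here is a minimal, hedged sketch based on the Cubes tutorial. It assumes a <code class="language-plaintext highlighter-rouge">slicer.ini</code> configuration describing a hypothetical “sales” cube with a “date” dimension, and the exact API may differ between Cubes versions.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from cubes import Workspace

# load the workspace from a configuration file (assumed to exist)
workspace = Workspace("slicer.ini")

# get an aggregation browser for the hypothetical "sales" cube
browser = workspace.browser("sales")

# aggregate measures over the whole cube
result = browser.aggregate()
print(result.summary)

# drill down by a dimension assumed to be in the model
for record in browser.aggregate(drilldown=["date"]):
    print(record)
</code></pre></div></div>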
<h2 id="summary">Summary</h2>
<p>OLAP is a way of making transactional data usable and understandable for
decision making.</p>
<p>Further reading:</p>
<ul>
<li><a href="http://en.wikipedia.org/wiki/Data_warehousing">Data Warehousing</a></li>
<li><a href="http://en.wikipedia.org/wiki/Business_intelligence">Business Intelligence</a></li>
</ul>
<p>Next time we will look at multi-dimensional modeling.</p>
Stefan Urbanek
How I parse PDF files
2013-12-25T00:00:00+00:00
http://okfnlabs.org/blog/2013/12/25/parsing-pdfs
<p>Much of the world’s data are stored in portable document format (PDF) files.
This is not my preferred storage or presentation format, so I often convert
such files into databases, graphs, or <a href="http://csvsoundsystem.com">spreadsheets</a>.
I sort of follow this decision process.</p>
<ol>
<li>Do we need to read the file contents at all?</li>
<li>Do we only need to extract the text and/or images?</li>
<li>Do we care about the layout of the file?</li>
</ol>
<h2 id="example-pdfs">Example PDFs</h2>
<p>I’ll show a few different approaches to parsing and analyzing
<a href="https://github.com/tlevine/scott-documents">these</a> PDF files.
Different approaches make sense depending on the question you ask.</p>
<p>These files are public notices of applications for permits to dredge or fill
wetlands. The Army Corps of Engineers posts these notices so that the public
may comment on the notices before the Corps approves them; people are thus
able to voice concerns about whether these permits would fall within the rules
about what sorts of construction is permissible.</p>
<p>These files are
<a href="https://github.com/tlevine/scott/tree/master/reader">downloaded daily</a>
from the <a href="http://www2.mvn.usace.army.mil/ops/regulatory/publicnotices.asp?ShowLocationOrder=False">New Orleans Army Corps of Engineers website</a>
and renamed according to the permit application and the date of download.
They once fed into the Gulf Restoration Network’s efforts to
protect the wetlands from reckless destruction.</p>
<h2 id="if-i-dont-need-the-file-contents">If I don’t need the file contents</h2>
<p>Basic things like file size, file name and modification date might be useful
in some contexts. In the case of PDFs, file size will give you an idea of how
many/much of the PDFs are text and how many/much are images.</p>
<p>Let’s <a href="https://github.com/dzerbino/ascii_plots/blob/master/hist">plot a histogram</a>
of the file sizes. I’m running this from the root of the documents repository,
and I cleaned up the output a tiny bit.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ls --block-size=K -Hs */public_notice.pdf | sed 's/[^0-9 ].*//' | hist 5
15 | 2 | **
20 | 55 | ********************************************************************************
25 | 4 | *****
30 | 4 | *****
35 | 11 | ****************
40 | 4 | *****
45 | 2 | **
50 | 2 | **
60 | 1 | *
75 | 1 | *
80 | 1 | *
95 | 1 | *
100 | 2 | **
120 | 1 | *
125 | 2 | **
135 | 1 | *
145 | 3 | ****
150 | 6 | ********
155 | 4 | *****
160 | 8 | ***********
165 | 3 | ****
170 | 6 | ********
175 | 7 | **********
180 | 24 | **********************************
185 | 11 | ****************
190 | 6 | ********
195 | 4 | *****
200 | 23 | *********************************
205 | 7 | **********
210 | 7 | **********
215 | 3 | ****
220 | 3 | ****
225 | 1 | *
230 | 1 | *
235 | 1 | *
240 | 2 | **
245 | 2 | **
250 | 1 | *
255 | 3 | ****
265 | 1 | *
280 | 1 | *
460 | 1 | *
545 | 1 | *
585 | 1 | *
740 | 1 | *
860 | 2 | **
885 | 1 | *
915 | 1 | *
920 | 1 | *
945 | 1 | *
950 | 1 | *
980 | 1 | *
2000 | 1 | *
2240 | 1 | *
2335 | 1 | *
7420 | 1 | *
TOTAL| 248 |
</code></pre></div></div>
<p>The histogram shows us two modes. The smaller mode, around 20 kb, corresponds to
files with no images (PDF export from Microsoft Word), and the larger mode
corresponds to files with images (scans of print-outs of the Microsoft Word
documents). It looks like about 80 are just text and the other 170 are scans.</p>
<p>This isn’t a real histogram, but if we’d used a real one with an interval scale,
the outliers would be more obvious. Let’s cut off the distribution at 400 kb
and look more closely at the unusually large documents that are above that
cutoff.</p>
<p>What’s in that 7 mb file? Well, let’s find it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ls --block-size=K -Hs */public_notice.pdf | grep '742.K'
7424K MVN-2010-1080-WLL_MVN-2010-1032-WLLB/public_notice.pdf
</code></pre></div></div>
<p>You can see it <a href="https://github.com/tlevine/scott-documents/raw/master/MVN-2010-1080-WLL_MVN-2010-1032-WLLB/public_notice-2012-08-09.pdf">here</a>.
It’s not a typical public notice; rather, it is a series of scanned documents
related to a permit transfer request. Interesting.</p>
<p>Next, how are two large files within 5 kb of each other?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ls --block-size=K -Hs */public_notice.pdf | grep 860K
860K MVN-2012-006152-WII/public_notice.pdf
860K MVN-2012-1797-CU/public_notice.pdf
</code></pre></div></div>
<p>Those are here</p>
<ul>
<li><a href="https://github.com/tlevine/scott-documents/raw/master/MVN-2012-006152-WII/public_notice-2012-11-20.pdf">MVN-2012-006152-WII</a></li>
<li><a href="https://github.com/tlevine/scott-documents/raw/master/MVN-2012-1797-CU/public_notice-2012-10-02.pdf">MVN-2012-1797-CU</a></li>
</ul>
<p>Hmm. Nothing special about those. People see patterns in randomness.</p>
<p>Now let’s look at some basic properties of the pdf files. This will give us a
basic overview of one file.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ pdfinfo MVN-2013-00026-WKK/public_notice.pdf
Creator: FUJITSU fi-4010CU
Producer: Adobe Acrobat 9.52 Paper Capture Plug-in
CreationDate: Fri Jan 25 09:45:08 2013
ModDate: Fri Jan 25 09:46:16 2013
Tagged: yes
Form: none
Pages: 3
Encrypted: no
Page size: 606.1 x 792 pts
Page rot: 0
File size: 199251 bytes
Optimized: yes
PDF version: 1.6
</code></pre></div></div>
<p>Let’s run it on all of the files.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ for file in */public_notice.pdf; do pdfinfo $file && echo; done
# Lots of output here
</code></pre></div></div>
<p>What was used to produce these files?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ for file in */public_notice.pdf; do pdfinfo $file|sed -n 's/Creator: *//p' ; done|sort|uniq -c
33 Acrobat PDFMaker 10.1 for Word
48 Acrobat PDFMaker 9.1 for Word
10 FUJITSU fi-4010CU
135 HardCopy
7 HP Digital Sending Device
2 Oracle9iAS Reports Services
6 PScript5.dll Version 5.2.2
4 Writer
</code></pre></div></div>
<p>When were they created?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ for file in */public_notice.pdf; do pdfinfo $file|grep CreationDate: > /dev/null && date -d "$(pdfinfo $file|sed -n 's/CreationDate: *//p')" --rfc-3339 date ; done
2012-07-03
2012-07-06
2012-07-06
2012-07-06
# ...
</code></pre></div></div>
<p>How many pages do they have?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ for file in */public_notice.pdf; do pdfinfo $file|sed -n 's/Pages: *//p' ; done | hist 1
1 | 1 |
2 | 27 | **********
3 | 198 | ********************************************************************************
4 | 16 | ******
5 | 1 |
8 | 2 |
10 | 1 |
31 | 1 |
40 | 1 |
TOTAL | 248 |
</code></pre></div></div>
<p>It might actually be fun to relate these variables to each other. For
example, when did the Corps upgrade from PDFMaker 9.1 to PDFMaker 10.1?</p>
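<p>A rough sketch of how one might check, shelling out to <code class="language-plaintext highlighter-rouge">pdfinfo</code> for each notice and collecting creation dates per creator string (illustrative only, not something from the original analysis):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import glob
import re
import subprocess
from collections import defaultdict

# collect the CreationDate string for every Creator seen in the notices
dates_by_creator = defaultdict(list)
for path in glob.glob('*/public_notice.pdf'):
    info = subprocess.check_output(['pdfinfo', path]).decode('utf-8', 'replace')
    creator = re.search(r'^Creator:\s*(.+)$', info, re.MULTILINE)
    created = re.search(r'^CreationDate:\s*(.+)$', info, re.MULTILINE)
    if creator and created:
        dates_by_creator[creator.group(1)].append(created.group(1))

# rough eyeball: lexicographic min/max of the raw date strings;
# parse the dates properly for a real analysis
for creator, dates in sorted(dates_by_creator.items()):
    print(creator, min(dates), max(dates))
</code></pre></div></div>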
<p>Anyway, we got somewhere interesting without looking at the files. Now let’s
look at them.</p>
<h2 id="if-messy-raw-file-contents-are-fine">If messy, raw file contents are fine</h2>
<p>The main automatic processing that I run on the PDFs is a search for a few
identification numbers. The Army Corps of Engineers uses a number that starts
with “MVN”, but other agencies use different numbers. I also search for two
key paragraphs.</p>
<p><a href="https://github.com/tlevine/scott/blob/master/reader/bin/translate">My approach</a>
is pretty crude. For the PDFs that aren’t scans, I just use <code class="language-plaintext highlighter-rouge">pdftotext</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># translate
pdftotext "$FILE" "$FILE"
</code></pre></div></div>
<p>Then I just use regular expressions to search the resulting text file.</p>
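<p>As an illustration, a minimal version of that search might look like this in Python (not the actual script; the pattern is a guess at the permit-number format):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re

# hypothetical pattern for permit numbers such as MVN-2013-00026-WKK
MVN_PATTERN = re.compile(r'MVN-\d{4}-[\w-]+')

# search the text file produced by pdftotext
with open('public_notice.txt') as f:
    print(MVN_PATTERN.findall(f.read()))
</code></pre></div></div>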
<p><code class="language-plaintext highlighter-rouge">pdftotext</code> normally screws up the layout of PDF files, especially when they
have multiple columns, but it’s fine for what I’m doing because I only need to
find small chunks of text rather than a whole table or a specific line on
multiple pages.</p>
<p>As we saw earlier, most of the files contain images, so I need to run OCR.
Like <code class="language-plaintext highlighter-rouge">pdftotext</code>, OCR programs often mess up the page layout, but I don’t
care because I’m using regular expressions to look for small chunks.</p>
<p>I don’t even care whether the images are in order; I just use <code class="language-plaintext highlighter-rouge">pdfimages</code>
to pull out the images and then <code class="language-plaintext highlighter-rouge">tesseract</code> to OCR each image and add that
to the text file. (This is all in the
<a href="https://github.com/tlevine/scott/blob/master/reader/bin/translate"><code class="language-plaintext highlighter-rouge">translate</code></a>
script that I linked above.)</p>
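<p>Sketched in Python, that extract-and-OCR step might look roughly like this (the real work happens in the shell script linked above; the file names here are illustrative):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import glob
import subprocess

# pull the embedded images out of the PDF (img-000.ppm, img-001.ppm, ...)
subprocess.check_call(['pdfimages', 'public_notice.pdf', 'img'])

# OCR each image with tesseract and append the text to the text file
with open('public_notice.txt', 'a') as out:
    for image in sorted(glob.glob('img-*')):
        subprocess.check_call(['tesseract', image, 'ocr_page'])
        with open('ocr_page.txt') as page:
            out.write(page.read())
</code></pre></div></div>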
<h2 id="if-i-care-about-the-layout-of-the-page">If I care about the layout of the page</h2>
<p>If I care about the layout of the page, <code class="language-plaintext highlighter-rouge">pdftotext</code> probably won’t work.
Instead, I use <code class="language-plaintext highlighter-rouge">pdftohtml</code> or <code class="language-plaintext highlighter-rouge">inkscape</code>. I’ve never needed to go deeper,
but if I did, I’d use something like
<a href="http://www.unixuser.org/~euske/python/pdfminer/">PDFMiner</a>.</p>
<h3 id="pdftohtml">pdftohtml</h3>
<p><code class="language-plaintext highlighter-rouge">pdftohtml</code> is useful because of its <code class="language-plaintext highlighter-rouge">-xml</code> flag.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ pdftohtml -xml MVN-2013-00180-ETT/public_notice.pdf
Page-1
Page-2
Page-3
$ head MVN-2013-00180-ETT/public_notice.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml producer="poppler" version="0.22.0">
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="37" family="Times" color="#000000"/>
<fontspec id="1" size="21" family="Times" color="#000000"/>
<fontspec id="2" size="16" family="Times" color="#000000"/>
<fontspec id="3" size="13" family="Times" color="#000000"/>
<fontspec id="4" size="16" family="Times" color="#000000"/>
</code></pre></div></div>
<p>Open that with an XML parser like lxml.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># This is python
import lxml.etree
pdf2xml = lxml.etree.parse('MVN-2013-00180-ETT/public_notice.xml')
</code></pre></div></div>
<p>One of the things that I try to extract is the “CHARACTER OF WORK” section.
I do this with regular expressions, but we could also do this with the XML.
Here are some XPath selectors that get us somewhere.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># This is python
print pdf2xml.xpath('//text/b[text()="CHARACTER OF WORK"]/../text()')
print pdf2xml.xpath('//text/b[text()="CHARACTER OF WORK"]/../following-sibling::text/text()')
</code></pre></div></div>
<h3 id="inkscape">Inkscape</h3>
<p>Inkscape can convert a PDF page to an SVG file. I have a
<a href="https://github.com/scraperwiki/pdf2svg">little script</a> that runs this across
all pages within a PDF file.</p>
<p>Once you’ve converted the PDF file to a bunch of SVG files, you can open it
with an XML parser just like you could with the <code class="language-plaintext highlighter-rouge">pdftohtml</code> output, except
this time much more of the layout is preserved, including the groupings of
elements on the page.</p>
<p>Here’s a snippet from one project where I used Inkscape to parse PDF files.
I created a crazy system for receiving a very messy PDF table over email and
converting it into a spreadsheet that is hosted on a website.</p>
<p>This function contains all of the parsing logic for a specific page of
the PDF file once it has been converted to SVG. It takes an
<code class="language-plaintext highlighter-rouge">lxml.etree._ElementTree</code> object like the one we get from <code class="language-plaintext highlighter-rouge">lxml.etree.parse</code>,
along with some metadata. It runs a crazy XPath selector (determined only after
much test-driven development) to pick out the table rows, and then runs a bunch
of functions (not included) to pick out the cells within the rows.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def page(svg, file_name, page_number):
    'I turn a svg tree into a list of dictionaries.'
    # County name
    county = unicode(svg.xpath(
        '//svg:g/svg:path[position()=1]/following-sibling::svg:text/svg:tspan/text()',
        namespaces = { 'svg': 'http://www.w3.org/2000/svg' }
    )[0])
    rows = _page_tspans(svg)
    def skip(reason):
        print 'Skipped a row on %s page %d because %s.' % (file_name, page_number, reason)
    data = []
    for _row in rows:
        row_text = [text.xpath('string()') for text in _row]
        try:
            if row_text == []:
                skip('the row is empty')
                print row_text
            elif _is_header(row_text):
                skip('it appears to be a header.')
                print row_text
            # ...
</code></pre></div></div>
<p>I’d like to point out the <code class="language-plaintext highlighter-rouge">string()</code> XPath function. It converts the current
node and its descendants into plain text; it’s particularly nice for
inconsistently structured files like this one.</p>
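<p>A tiny illustration with a made-up fragment:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import lxml.etree

# a hypothetical fragment with text split across child elements
node = lxml.etree.fromstring('<text><b>CHARACTER</b> OF <b>WORK</b></text>')

# string() flattens the node and all its descendants into plain text
print(node.xpath('string()'))   # prints "CHARACTER OF WORK"
</code></pre></div></div>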
<h2 id="optical-character-recognition">Optical character recognition</h2>
<p>People often think that optical character recognition (OCR) is going to be
a hard part. It might be, but it doesn’t really change this decision process.
If I care about where the images are positioned on the page, I’d probably
use Inkscape. If I don’t, I’d probably use <code class="language-plaintext highlighter-rouge">pdfimages</code>, as I did here.</p>
<h2 id="review">Review</h2>
<p>When I’m parsing PDFs, I use some combination of these tools.</p>
<ol>
<li>Basic file analysis tools (<code class="language-plaintext highlighter-rouge">ls</code> or another language’s equivalent)</li>
<li>PDF metadata tools (<code class="language-plaintext highlighter-rouge">pdfinfo</code> or an equivalent)</li>
<li><code class="language-plaintext highlighter-rouge">pdftotext</code></li>
<li><code class="language-plaintext highlighter-rouge">pdftohtml -xml</code></li>
<li>Inkscape via <a href="https://github.com/scraperwiki/pdf2svg"><code class="language-plaintext highlighter-rouge">pdf2svg</code></a></li>
<li><a href="http://www.unixuser.org/~euske/python/pdfminer/">PDFMiner</a></li>
</ol>
<p>I prefer the
ones earlier in the list when the parsing is less involved because the tools
do more of the work for me. I prefer the ones towards the end as the job gets
more complex because these tools give me more control.</p>
<p>If I need OCR, I use <code class="language-plaintext highlighter-rouge">pdfimages</code> to pull out the images and <code class="language-plaintext highlighter-rouge">tesseract</code> to run
OCR. If I needed to run OCR and know more about the layout, I might convert the
PDFs to SVG with Inkscape and then take the images out of the SVG in order
to know more precisely where they are in the page’s structure.</p>
<p><em>This article was originally posted <a href="http://thomaslevine.com/!/parsing-pdfs">on Thomas Levine’s site</a>.</em></p>
Thomas Levine
Convert data between formats with Data Converters
2013-12-17T00:00:00+00:00
http://okfnlabs.org/blog/2013/12/17/convert-data-between-formats-data-converters
<p><a href="http://okfnlabs.org/dataconverters/">Data Converters</a> is a command line tool and Python library making routine data conversion tasks easier. It helps data wranglers with everyday tasks like moving between tabular data formats—for example, converting an Excel spreadsheet to a CSV or a CSV to a JSON object.</p>
<p>The current release of Data Converters can convert between Excel spreadsheets, CSV data, and JSON tables, as well as some geodata formats (with additional requirements).</p>
<p>Its smart parser can guess the types of data, correctly recognizing dates, numbers, strings, and so on. It works as easily with URLs as with local files, and it is designed to handle very large files (bigger than memory) as easily as small ones.</p>
<p><img src="http://i.imgur.com/kDDrgPW.png" alt="Data Converters homepage" /></p>
<h2 id="converting-data">Converting data</h2>
<p>Converting an Excel spreadsheet to a CSV or a <a href="http://dataprotocols.org/en/latest/table-schema.html">JSON table</a> with the Data Converters command line tool is easy. Data Converters is able to read XLS(X) and CSV files and to write CSV and JSON, and input files can be either local or remote.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dataconvert simple.xls out.csv
dataconvert out.csv out.json
# URLs also work
dataconvert https://github.com/okfn/dataconverters/raw/master/testdata/xls/simple.xls out.csv
</code></pre></div></div>
<p>Data Converters will try to guess the format of your input data, but you can also specify it manually.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dataconvert --format=xls input.spreadsheet out.csv
</code></pre></div></div>
<p>Instead of writing the converted output to a file, you can also send it to <em>stdout</em> (and then pipe it to other command-line utilities).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dataconvert simple.xls _.json # JSON table to stdout
dataconvert simple.xls _.csv # CSV to stdout
</code></pre></div></div>
<p>Converting data files can also be done within Python using the Data Converters library. The <code class="language-plaintext highlighter-rouge">dataconvert</code> convenience function shares the <code class="language-plaintext highlighter-rouge">dataconvert</code> command line utility’s file reading and writing functionality.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataconverters import dataconvert
dataconvert('simple.xls', 'out.csv')
dataconvert('out.csv', 'out.json')
dataconvert('input.spreadsheet', 'out.csv', format='xls')
</code></pre></div></div>
<h2 id="parsing-data">Parsing data</h2>
<p>Data Converters can do more than just convert data files. It can also parse tabular data into Python objects that capture the semantics of the source data.</p>
<p>Data Converters’ various <code class="language-plaintext highlighter-rouge">parse</code> functions each return an iterator over the records of the source data along with a metadata dictionary containing information about the data. The records returned by <code class="language-plaintext highlighter-rouge">parse</code> are not just (e.g.) split strings: they’re hash representations of the contents of the row, with column names and data types auto-detected.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import dataconverters.xls as xls
with open('simple.xls') as f:
    records, metadata = xls.parse(f)
    print metadata
    print [r for r in records]
=> {'fields': [{'type': 'DateTime', 'id': u'date'}, {'type': 'Integer', 'id': u'temperature'}, {'type': 'String', 'id': u'place'}]}
=> [{u'date': datetime.datetime(2011, 1, 1, 0, 0), u'place': u'Galway', u'temperature': 1.0}, {u'date': datetime.datetime(2011, 1, 2, 0, 0), u'place': u'Galway', u'temperature': -1.0}, {u'date': datetime.datetime(2011, 1, 3, 0, 0), u'place': u'Galway', u'temperature': 0.0}, {u'date': datetime.datetime(2011, 1, 1, 0, 0), u'place': u'Berkeley', u'temperature': 6.0}, {u'date': datetime.datetime(2011, 1, 2, 0, 0), u'place': u'Berkeley', u'temperature': 8.0}, {u'date': datetime.datetime(2011, 1, 3, 0, 0), u'place': u'Berkeley', u'temperature': 5.0}]
</code></pre></div></div>
<h2 id="whats-next">What’s next?</h2>
<p>Excel spreadsheets and CSVs aren’t the only kinds of data that need converting.</p>
<p>Data Converters also supports geodata conversion, including converting between <a href="https://developers.google.com/kml/documentation/">KML</a> (the format for geographical data used in Google Maps and Google Earth), <a href="http://geojson.org/">GeoJSON</a>, and <a href="http://www.esri.com/library/whitepapers/pdfs/shapefile.pdf">ESRI Shapefiles</a>.</p>
<p>Data Converters’ ability to convert between tabular data may also grow, adding JSON support on the input side and XLS(X) support on the output side—as well as new conversions for <a href="https://github.com/okfn/dataconverters/issues/15">XML</a>, <a href="https://github.com/okfn/dataconverters/issues/11">SQL dumps</a>, and <a href="https://github.com/okfn/dataconverters/issues/7">SPSS</a>.</p>
<p>Visit the <a href="http://okfnlabs.org/dataconverters/">Data Converters home page</a> to learn how to install Data Converters and its dependencies, and check out <a href="https://github.com/okfn/dataconverters">Data Converters on GitHub</a> to see how you can contribute to the project.</p>
Neil Ashton
Labs newsletter: 12 December, 2013
2013-12-12T00:00:00+00:00
http://okfnlabs.org/blog/2013/12/12/newsletter
<p>We’re back after taking a break last week with a bumper crop of updates. A few things have changed: Labs activities are now coordinated entirely through GitHub. Meanwhile, there have been some updates around the <a href="http://nomenklatura.okfnlabs.org">Nomenklatura</a>, <a href="http://okfnlabs.org/annotator/">Annotator</a>, and <a href="http://www.dataprotocols.org">Data Protocols</a> projects and some new posts on the <a href="http://okfnlabs.org/blog/">Labs blog</a>.</p>
<h2 id="migration-from-trello-to-github">Migration from Trello to GitHub</h2>
<p>For some time now, Labs activities requiring coordination have been organized on <a href="http://trello.com">Trello</a>—but those days are now over. Labs has moved its organizational setup over to <a href="http://github.com">GitHub</a>, coordinating actions and making plans by means of GitHub issues. This change comes as a big relief to the many Labs members who already use GitHub as their main platform for collaboration.</p>
<p>General Labs-related activities are now tracked on the <a href="https://github.com/okfn/okfn.github.com/issues/">Labs site’s issues</a>, and activities around individual projects are managed (as before!) through those projects’ own issues.</p>
<h2 id="new-bad-data">New Bad Data</h2>
<p>New examples of <a href="http://okfnlabs.org/bad-data/">bad data</a> continue to roll in—and we invite even more <a href="http://okfnlabs.org/bad-data/add/">new submissions</a>.</p>
<p>Bad datasets added since last newsletter include the <a href="http://okfnlabs.org/bad-data/ex/gla-spending/">UK’s Greater London Authority spend data</a> (65+ files with 25+ different structures!), <a href="http://okfnlabs.org/bad-data/ex/nature-magazine-supplementary/">Nature Magazine’s supplementary data</a> (an awful PDF jumble), and more.</p>
<h2 id="nomenklatura-new-alpha">Nomenklatura: new alpha</h2>
<p>As we’ve previously noted, Labs member <a href="http://pudo.org/">Friedrich Lindenberg</a> has been thinking about producing “a fairly radical re-framing” of the <a href="http://nomenklatura.okfnlabs.org/">Nomenklatura</a> data reconciliation service.</p>
<p>Friedrich has now released an alpha version of a new release of Nomenklatura at <a href="http://nk-dev.pudo.org/">nk-dev.pudo.org</a>. The major changes with this alpha include:</p>
<ul>
<li>A fully JavaScript-driven frontend</li>
<li>String matching now happens inside the PostgreSQL database</li>
<li>Better introductory text explaining what Nomenklatura does</li>
<li>“entity” and “alias” domain objects have been merged into “entity”</li>
</ul>
<p>Friedrich is keen to hear what people think about this prototype—so jump in, give it a try, and leave your comments at the <a href="https://github.com/pudo/nomenklatura">Nomenklatura repo</a>.</p>
<h2 id="annotator-v129">Annotator v1.2.9</h2>
<p>A new maintenance release of <a href="http://okfnlabs.org/annotator/">Annotator</a> came out ten days ago. This new version is intended to be one of the last in the v1.2.x series—indeed, v1.2.8 itself was intended to be the last, but that version had some significant issues that this new release corrects.</p>
<p>Fixes in this version include:</p>
<ul>
<li>Fixed a major packaging error in v1.2.8. Annotator no longer exports an excessive number of tokens to the page namespace.</li>
<li>Notification display bugfixes. Notification levels are now correctly removed after notifications are hidden.</li>
</ul>
<p>The new Annotator is available, as always, <a href="https://github.com/okfn/annotator/releases/tag/v1.2.9">from GitHub</a>.</p>
<h2 id="data-protocols-updates">Data Protocols updates</h2>
<p><a href="http://dataprotocols.org/">Data Protocols</a> is a project to develop simple protocols and formats for working with open data. <a href="http://okfnlabs.org/members/rgrp/">Rufus Pollock</a> wrote a <a href="https://lists.okfn.org/pipermail/okfn-labs/2013-December/001185.html">cross-post to the list</a> about several new developments with Data Protocols of interest to Labs. These included:</p>
<ul>
<li>Close to final agreement on a spec for adding “primary keys” to the <a href="http://dataprotocols.org/table-schema/">JSON Table Schema</a> (<a href="https://github.com/dataprotocols/dataprotocols/issues/21">discussion</a>)</li>
<li>Close to consensus on spec for “foreign keys” (<a href="https://github.com/dataprotocols/dataprotocols/issues/23">discussion</a>)</li>
<li>Proposal for a JSON spec for views of data, e.g. graphs or maps (<a href="https://github.com/dataprotocols/dataprotocols/issues/77">discussion</a>)</li>
</ul>
<p>For more, check out <a href="https://lists.okfn.org/pipermail/okfn-labs/2013-December/001185.html">Rufus’s message</a> and the <a href="https://github.com/dataprotocols/dataprotocols/issues">Data Protocols issues</a>.</p>
<h2 id="on-the-blog">On the blog</h2>
<p>Labs members have added a couple new posts to the blog since the last newsletter. Yours truly (with extensive help from Rufus) posted on <a href="http://okfnlabs.org/blog/2013/12/05/view-csv-with-data-pipes.html">using Data Pipes to view a CSV</a>. <a href="http://okfnlabs.org/members/mihi">Michael Bauer</a>, meanwhile, wrote about the new <a href="http://okfnlabs.org/blog/2013/12/06/Introducing-Reconcile-csv.html">Reconcile-CSV service</a> he developed while working on education data in Tanzania. Look to the <a href="http://okfnlabs.org/blog/">Labs blog</a> for the full scoop.</p>
<h2 id="get-involved">Get involved</h2>
<p>If you have some spare time this holiday season, why not spend it helping out with Labs? We’re always always looking for new people to <a href="http://okfnlabs.org/join/">join the community</a>—visit the <a href="https://github.com/okfn/okfn.github.com/issues/">Labs issues</a> and the <a href="http://okfnlabs.org/ideas/">Ideas Page</a> to get some ideas for how you can join in.</p>
Neil Ashton
Introducing Reconcile-CSV
2013-12-06T00:00:00+00:00
http://okfnlabs.org/blog/2013/12/06/Introducing-Reconcile-csv
<p>Recently I spent a week in Tanzania working on education data with the
ministry of education (<a href="http://schoolofdata.org/2013/12/06/a-deep-dive-into-fuzzy-matching-in-tanzania/">blog post
here</a>).
One of the problems we faced there was spreadsheets we wanted to merge
without having any unique IDs. I quickly realized we could do this through
reconciliation services in <a href="http://openrefine.org">OpenRefine</a>. The API and
some projects implementing the reconciliation service are described in the
<a href="https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-API">OpenRefine
wiki</a>.
Nevertheless, most of the projects had a ton of requirements or needed a
database.</p>
<p>I wanted a service that is:</p>
<ul>
<li>easy to install and run</li>
<li>works on top of a CSV file</li>
</ul>
<p>Because I love <a href="http://clojure.org">Clojure</a> and I already had a
<a href="https://github.com/mihi-tr/fuzzy-string">fuzzy-matching library</a> at hand,
I chose to go down that route. Clojure has the great advantage of being
able to generate .jar files that include all the dependencies, so running
the service is a matter of executing a single .jar file.</p>
<p>All that was left was implementing the reconciliation API around it. I am
proudly introducing <a href="http://okfnlabs.org/reconcile-csv">reconcile-csv</a>: a
reconciliation service that runs on your machine without much hassle. See
<a href="http://okfnlabs.org/reconcile-csv">http://okfnlabs.org/reconcile-csv</a> for
more details and instructions.</p>
<p>While this might not be the first reconciliation service to be written, I
do think it’s by now the easiest to use: you’ll only need Java, a CSV file
and the .jar provided.</p>
Michael Bauer
View a CSV (Comma Separated Values) in Your Browser
2013-12-05T00:00:00+00:00
http://okfnlabs.org/blog/2013/12/05/view-csv-with-data-pipes
<p>This post introduces one of the handiest features of <a href="http://datapipes.okfnlabs.org/">Data Pipes</a>: <strong><a href="http://datapipes.okfnlabs.org/html/">fast (pre) viewing of CSV files in your browser</a></strong> (and you can share the result by just copying a URL).</p>
<p><a href="http://datapipes.okfnlabs.org/html/"><img src="http://i.imgur.com/LKjphLo.png" alt="" /></a></p>
<h2 id="the-raw-csv">The Raw CSV</h2>
<p><a href="http://data.okfn.org/standards/csv/">CSV files are frequently used for storing tabular data</a> and are widely supported by spreadsheets and databases. However, you can’t usually look at a CSV file in your browser - most browsers will simply download the file instead. And even if you <em>could</em> look at a CSV file, it is <a href="http://datapipes.okfnlabs.org/csv/?url=https://raw.github.com/okfn/datapipes/master/test/data/gla.csv">not very pleasant to look at</a>:</p>
<p><a href="http://datapipes.okfnlabs.org/csv/?url=https://raw.github.com/okfn/datapipes/master/test/data/gla.csv">
<img src="http://i.imgur.com/zVGW1zD.png" alt="Raw CSV" />
</a></p>
<h2 id="the-result">The Result</h2>
<p>But using the <a href="http://datapipes.okfnlabs.org/html/">Data Pipes <code class="language-plaintext highlighter-rouge">html</code> feature</a>, you can turn an online CSV into a pretty HTML table in a few seconds. For example, the CSV you’ve just seen would become <a href="http://datapipes.okfnlabs.org/csv/html/?url=https://raw.github.com/okfn/datapipes/master/test/data/gla.csv">this pretty table</a>:</p>
<p><a href="http://datapipes.okfnlabs.org/csv/html/?url=https://raw.github.com/okfn/datapipes/master/test/data/gla.csv"><img src="http://i.imgur.com/fbR8DvX.png" alt="CSV, HTML view" /></a></p>
<h2 id="using-it">Using it</h2>
<p>To use this service, just visit <a href="http://datapipes.okfnlabs.org/html/">http://datapipes.okfnlabs.org/html/</a> and paste your CSV’s URL into the form.</p>
<p>For power users (or for use from the command line or API), you can just append your CSV url to:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>http://datapipes.okfnlabs.org/csv/html/?url=
</code></pre></div></div>
<h3 id="previewing-just-the-first-part-of-a-csv-file">Previewing Just the First Part of a CSV File</h3>
<p>You can also extend this basic previewing using other datapipes features. For example, suppose you have a big CSV file (say with more than a few thousand rows). If you tried to turn this into an HTML table and then view it in your browser, it would probably crash your browser.</p>
<p>So what if you could just see a part of the file? After all, you may well only be interested in seeing what that CSV file looks like, not every row. Fortunately, <strong>Data Pipes supports only showing the first 10 lines of a CSV file</strong> using a <a href="http://datapipes.okfnlabs.org/head"><code class="language-plaintext highlighter-rouge">head</code> operation</a>. To demonstrate, let’s just extend our example above to use <code class="language-plaintext highlighter-rouge">head</code>. This gives us the following URL (click to see the live result):</p>
<p><code><a href="http://datapipes.okfnlabs.org/csv/head/html/?url=https://raw.github.com/okfn/datapipes/master/test/data/gla.csv">http://datapipes.okfnlabs.org/csv/<strong>head</strong>/html/?url=https://raw.github.com/okfn/datapipes/master/test/data/gla.csv</a></code></p>
<h2 id="colophon">Colophon</h2>
<p>Data Pipes is a free and open service run by <a href="http://okfnlabs.org/">Open Knowledge Foundation Labs</a>. You can find the source code on GitHub at: <a href="https://github.com/okfn/datapipes">https://github.com/okfn/datapipes</a>. It is also available as a <a href="http://nodejs.org/">Node</a> library and command line tool.</p>
<p>If you like previewing CSV files in your browser, you might also be interested in the <a href="https://chrome.google.com/webstore/detail/recline-csv-viewer/ibfcfelnbfhlbpelldnngdcklnndhael">Recline CSV Viewer</a>, a Chrome plugin that automatically turns CSVs into searchable HTML tables in your browser.</p>
Neil Ashton
Labs newsletter: 28 November, 2013
2013-11-28T00:00:00+00:00
http://okfnlabs.org/blog/2013/11/28/newsletter
<p>Another busy week at the Labs! We’ve had lots of discussion around the idea of “bad data”, a blog post about Mark’s aid tracker, new PyBossa developments, and a call for help with a couple of projects. Next week we can look forward to another <a href="http://okfnlabs.org/events/open-data-maker/">Open Data Maker Night</a> in London.</p>
<h2 id="bad-data">Bad Data</h2>
<p>Last Friday, Rufus announced <a href="http://okfnlabs.org/bad-data/">Bad Data</a>, a new educational mini-project that highlights real-world examples of how data <em>shouldn’t</em> be published.</p>
<p>This announcement was greeted with glee and with contributions of new examples. Open government activist <a href="http://about.me/IvanBegtin">Ivan Begtin</a> chimed in with the Russian Ministry of the Interior’s <a href="http://mvd.ru/opendata/od1">list of regional offices</a> and the Russian government’s <a href="http://nalog.ru/ru/opendata/p9/">tax rates for municipalities</a>. Labs member <a href="http://okfnlabs.org/members/pudo/">Friedrich Lindenberg</a> added the <a href="http://www.bundeshaushalt-info.de/download.html">German finance ministry</a>’s new open data initiative. As <a href="http://okfnlabs.org/members/andylolz/">Andy Lulham</a> said, “bad data” will be very useful for testing the new <a href="http://datapipes.okfnlabs.org/">Data Pipes</a> operators.</p>
<p>You can follow the whole discussion thread <a href="http://lists.okfn.org/pipermail/okfn-labs/2013-November/001165.html">in the list archive</a>.</p>
<h2 id="blog-post-looking-at-aid-in-the-philippines">Blog post: Looking at aid in the Philippines</h2>
<p>At last week’s hangout, you heard about <a href="http://okfnlabs.org/members/markbrough/">Mark Brough</a>’s new project, a <a href="http://pwyf.github.io/philippines/">browser for aid projects in the Philippines</a> generated from <a href="http://iatiregistry.org/">IATI data</a>.</p>
<p>Now you can read more about Mark’s project <a href="http://okfnlabs.org/blog/2013/11/25/philippines.html">on the blog</a>, learning about where the data comes from, how the site is generated from the data (interestingly, it uses the Python-based static site generator <a href="http://packages.python.org/Frozen-Flask/">Frozen-Flask</a>), and what Mark plans to do next.</p>
<h2 id="new-pybossa-cache-system">New PyBossa cache system</h2>
<p>Labs member and citizen science expert <a href="http://okfnlabs.org/members/teleyinex/">Daniel Lombraña González</a> has been “working really hard to add a new cache system to <a href="https://github.com/PyBossa/pybossa">PyBossa</a>”, the open source crowdsourcing platform.</p>
<p>As Daniel has discovered, the <a href="http://redis.io/">Redis</a> key-value store meets all his requirements for a load-balanced, high-availability, persistent cache. As he put it: “Redis is <em>amazing</em>. Let me repeat it: amazing.”</p>
<p>Read the <a href="http://daniellombrana.es/blog/2013/11/26/pybossa-cache.html">blog post</a> to learn more about the new Redis-based PyBossa setup and its benefits.</p>
<h2 id="contributions-needed-ios-and-python-development">Contributions needed: iOS and Python development</h2>
<p>Philippe Plagnol of <a href="http://www.product-open-data.com">Product Open Data</a> needs a few good developers to help with some projects.</p>
<p>Firstly, the <a href="https://play.google.com/store/apps/details?id=org.okfn.pod">Product Open Data Android app</a> has been out for a while (<a href="https://github.com/okfn/product-browser-android">source code</a>), and it’s high time there was a port for Apple devices. If you’re interested in contributing to the port, leave a comment at <a href="https://github.com/okfn/product-browser-ios/issues/1">this GitHub issue</a>.</p>
<p>Secondly, work is now underway on a brand repository which will assign a Brand Standard Identifier Number (BSIN) to each brand worldwide, making it possible to integrate products in the product repository. Python developers are needed to help make this happen. If you want to help out, join in <a href="https://github.com/okfn/brand-manager/issues/9">this GitHub thread</a>. (Lots of people have already signed up!)</p>
<h2 id="next-week-open-data-maker-night-london-7">Next week: Open Data Maker Night London #7</h2>
<p>On the 4th of December next week, the <a href="http://www.meetup.com/OpenKnowledgeFoundation/London-GB/1062882/">seventh London Open Data Maker Night</a> is taking place. Anyone interested in building tools or insights from data is invited to drop in at any time after 6:30 and join the fun. (Please note that the event will take place on Wednesday rather than the usual Tuesday.)</p>
<p>What is an Open Data Maker Night? <a href="http://okfnlabs.org/events/open-data-maker/">Read more about them here</a>.</p>
<h2 id="get-involved">Get involved</h2>
<p>Labs is always looking for new contributors. Read more about how you can <a href="http://okfnlabs.org/join/">join the community</a>, whether you’re a coder, a data wrangler, or a communicator, and check out the <a href="http://okfnlabs.org/ideas/">Ideas Page</a> to see what else is brewing.</p>
Neil Ashton
Looking at aid in the Philippines
2013-11-25T00:00:00+00:00
http://okfnlabs.org/blog/2013/11/25/philippines
<p><em>See also: “<a href="http://www.publishwhatyoufund.org/updates/by-topic/techfortransparency/closer-look-aid-philippines/">A closer look at aid in the Philippines</a>”</em></p>
<p>Since Typhoon Yolanda/Haiyan struck the Philippines on 8th November there has been some discussion around the availability of information to help coordinate activities effectively in the disaster response phase.</p>
<p>To see what data was already available, I put together <a href="http://pwyf.github.io/philippines/">a quick projects browser</a>, which generates a static site for all projects currently available in the Philippines that have been published to the <a href="http://iatiregistry.org">IATI Registry</a>, a CKAN instance for sharing aid information in the standard <a href="http://iatistandard.org">IATI format</a>.</p>
<p><a href="http://pwyf.github.io/philippines/"><img src="http://publishwhatyoufund.org/files/philippines-front-page.png" alt="Philippines projects browser" /></a></p>
<p><em>Philippines Projects browser</em></p>
<h2 id="where-the-data-comes-from">Where the data comes from</h2>
<p>IATI data is available in a standard XML schema, which makes it relatively easy to pull together quickly. However, with almost 200 publishers and over 3000 individual packages now published, searching for data for a single country would be complicated.</p>
<p>The <a href="http://iati-datastore.herokuapp.com">IATI datastore</a> simplifies this task by pulling together all of the data published to the IATI Registry each night, and provides a queryable API for requesting results in CSV, XML or JSON format.</p>
<p>It was therefore possible to get all data for the Philippines by querying:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>http://iati-datastore.herokuapp.com/api/1/access/activity.json?recipient-country=PH&limit=50&offset=%s
</code></pre></div></div>
<p>… and then paging through the results (by increasing the offset by 50 each time).</p>
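<p>In Python, that paging loop might look like the following sketch (illustrative, not the project’s code; it uses the <code class="language-plaintext highlighter-rouge">requests</code> library, and the top-level <code class="language-plaintext highlighter-rouge">results</code> key in the JSON response is an assumption):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import requests

URL = 'http://iati-datastore.herokuapp.com/api/1/access/activity.json'

activities = []
offset, limit = 0, 50
while True:
    resp = requests.get(URL, params={
        'recipient-country': 'PH',
        'limit': limit,
        'offset': offset,
    })
    resp.raise_for_status()
    page = resp.json().get('results', [])  # assumed response key
    if not page:
        break
    activities.extend(page)
    offset += limit

print(len(activities))
</code></pre></div></div>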
<p><a href="http://iati-datastore.herokuapp.com"><img src="http://publishwhatyoufund.org/files/iati-datastore-front-page.png" alt="IATI Datastore" /></a></p>
<p><em>The IATI Datastore</em></p>
<p><strong>NB:</strong> IATI was originally designed with a focus on traditional development aid, which is why the number of projects that specifically relate to the typhoon are limited. But the key concepts are mostly the same, which is why you do see some humanitarian aid in there.</p>
<h2 id="creating-a-static-site">Creating a static site</h2>
<p>I decided to create a static site so that:</p>
<ol>
<li>I would not have to think about creating a database, or modelling the data;</li>
<li>I could deploy the site to Github pages, so I wouldn’t have to think about server setup;</li>
<li>The site would run fast.</li>
</ol>
<p>Ruby has some nice modules for creating static sites, such as <a href="http://jekyllrb.com/">Jekyll</a> and particularly <a href="http://middlemanapp.com/">Middleman</a> in this sort of context.</p>
<p>However, I wanted to write this one in Python, and unfortunately all of the <a href="http://gistpages.com/2013/08/12/complete_list_of_static_site_generators_for_python">static site generators available in Python</a> come with a lot of assumptions about the structure of your site (basically, they all think it should look like a blog).</p>
<p>One exception (discovered via <a href="https://nicolas.perriault.net/code/2012/dead-easy-yet-powerful-static-website-generator-with-flask/">this blogpost</a>) is <a href="http://packages.python.org/Frozen-Flask/">Frozen-Flask</a>. This is great because <a href="http://flask.pocoo.org/">Flask</a> is super simple and easy to work with, and provides a lot of the stuff you need out of the box.</p>
<p>In this instance, all I had to do was add the lines:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>app.config['FREEZER_RELATIVE_URLS'] = True
freezer = Freezer(app)
</code></pre></div></div>
<p>and then to generate a static site, which is output to <code class="language-plaintext highlighter-rouge">/build</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>freezer.freeze()
</code></pre></div></div>
<p>Frozen-Flask wants to set all of the URLs as absolute and beginning with <code class="language-plaintext highlighter-rouge">/</code>. Because I’m deploying to Github pages, I need to be able to have relative URLs. Frozen-Flask provides <a href="http://pythonhosted.org/Frozen-Flask/#configuration">an option to do that</a> with <code class="language-plaintext highlighter-rouge">FREEZER_RELATIVE_URLS</code>, but the weird side effect of this is that all the URLs have to end with <code class="language-plaintext highlighter-rouge">index.html</code>.</p>
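<p>Putting those pieces together, a minimal Frozen-Flask app might look like the following sketch (a hedged illustration, not the project’s actual code; the route and page content are placeholders):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from flask import Flask
from flask_frozen import Freezer

app = Flask(__name__)
app.config['FREEZER_RELATIVE_URLS'] = True   # relative links for GitHub Pages

@app.route('/')
def index():
    # placeholder page; the real site renders templates from IATI data
    return '<h1>Philippines aid projects</h1>'

freezer = Freezer(app)

if __name__ == '__main__':
    freezer.freeze()   # writes the static site to ./build
</code></pre></div></div>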
<h2 id="whats-next">What’s next?</h2>
<p>Firstly, <strong>it would be great to get more data in there</strong>. The projects browser can be updated automatically each night by pulling the data in from the datastore – something I’m going to try and get set up in the next couple of days.</p>
<p>However, with a couple of exceptions (particularly GlobalGiving and UNOCHA FTS), in general IATI publishers update their data a maximum of once a month. The projects browser suggests that may not be frequent enough in this sort of situation. Additionally, it is somewhat difficult to see which projects are related to this crisis as opposed to previous earthquakes and typhoons in the region (or alternatively, development rather than humanitarian activities). Some discussions have begun about adding an extension to IATI to capture additional fields that might be relevant to humanitarian actors, for example to tag a project as related to the Haiyan response.</p>
<p>Secondly, <strong>the interface could be improved somewhat</strong>. Some other ways of filtering and searching through the data would be useful, and improving the performance of the existing filter would be sensible.</p>
<h2 id="let-me-know-what-you-think">Let me know what you think!</h2>
<ul>
<li>Email: <a href="mailto:mark.brough@publishwhatyoufund.org">mark.brough@publishwhatyoufund.org</a></li>
<li>Tweet: <a href="http://twitter.com/mark_brough">@mark_brough</a> or <a href="http://twitter.com/aidtransparency">@aidtransparency</a></li>
<li>Source code: <a href="http://github.com/pwyf/philippines">http://github.com/pwyf/philippines</a></li>
<li>Live site: <a href="http://pwyf.github.io/philippines">http://pwyf.github.io/philippines</a></li>
</ul>
Mark Brough
Labs newsletter: 21 November, 2013
2013-11-21T00:00:00+00:00
http://okfnlabs.org/blog/2013/11/21/newsletter
<p>This week, Labs members gathered in an online hangout to discuss what they’ve been up to and what’s next for Labs. This special edition of the newsletter recaps that hangout for those who weren’t there (or who want a reminder).</p>
<h2 id="data-pipes-update">Data Pipes update</h2>
<p>Last week you heard about <a href="http://okfnlabs.org/members/andylolz">Andy Lulham</a>’s improvements to <a href="http://datapipes.okfnlabs.org">Data Pipes</a>, the online streaming data transformations service. He didn’t stop there, and in this week’s hangout, Andy described some of the new features he has been adding:</p>
<ul>
<li>parse and render are now <em>streaming</em> operations</li>
<li>option parsing now uses <a href="https://github.com/substack/node-optimist">optimist</a></li>
<li>a basic command-line interface</li>
<li>… and much, much more</li>
</ul>
<p>Coming up next: <a href="https://github.com/okfn/datapipes/issues/21">map & filter with arbitrary functions</a>!</p>
<h2 id="crowdcrafting-progress-and-projects">Crowdcrafting: progress and projects</h2>
<p>New <a href="http://www.shuttleworthfoundation.org/fellows/daniel-lombrana/">Shuttleworth fellow</a> <a href="http://okfnlabs.org/members/teleyinex">Daniel Lombraña González</a> reported on progress with <a href="http://crowdcrafting.org/">CrowdCrafting</a>, the citizen science platform built with <a href="http://dev.pybossa.com/">PyBossa</a>.</p>
<p>CrowdCrafting now has more than 3,500 users (though Daniel cautions that this doesn’t mean much in terms of participation), and the site now has more answers than tasks.</p>
<p>Last week, the team at <a href="http://micromappers.com/">MicroMappers</a> used CrowdCrafting to classify tweets about the typhoon disaster in the Philippines. Digital mapping activists <a href="http://skytruth.org/">SkyTruth</a>, meanwhile, have used CrowdCrafting to <a href="http://crowdcrafting.org/app/frackfinder_tadpole/">map and track fracking sites</a> in the northeast United States. Daniel has also been in contact with <a href="http://www.epicollect.net/">EpiCollect</a> about a project on trash collection in Spain.</p>
<h2 id="open-data-button">Open Data Button</h2>
<p>Labs member <a href="http://okfnlabs.org/members/loleg">Oleg Lavrovsky</a> discussed the <a href="http://button.datalets.ch/">Open Data Button</a>, an interesting fork of the recently-launched <a href="https://www.openaccessbutton.org/">Open Access Button</a>.</p>
<p>The Open Access Button, an idea of the Open Science working group at <a href="http://okcon.org">OKCon 2013</a>, is a bookmarklet that allows users to report their experiences of having their research blocked by paywalls. The Open Data Button applies this same idea to Open Data: users can use it to report their problems with legal and technical restrictions on data. (As Rufus pointed out, this ties in nicely with the <a href="https://github.com/okfn/ideas/issues/41">IsItOpenData</a> project.)</p>
<h2 id="queremos-saber">Queremos Saber</h2>
<p>Labs ally <a href="http://vitorbaptista.com/">Vítor Baptista</a> reported on a new development with <a href="http://www.queremossaber.org.br/">Queremos Saber</a>, the Brazilian FOI request portal.</p>
<p>Changes in the way the Brazilian federal government accepts FOI requests have caused Queremos Saber problems. The federal government no longer accepts requests by email, forcing the use of a specialized FOI system which they are now promoting for local governments as well. This limits the number of places that will accept requests from Queremos Saber.</p>
<p>A solution to this problem is underway: an <em>email-based API</em> that will take emails received at certain addresses (e.g. <em>ministryofhealthcare@queremossaber.org.br</em>) and turn them into instructions for a web crawler to create an FOI request in the appropriate system. An interesting side effect of this would be the creation of an <em>anonymization layer</em>, allowing users to bypass the legal requirement that FOI requests not be placed anonymously.</p>
<h2 id="philippines-projects">Philippines Projects</h2>
<p>Labs data wrangler <a href="http://okfnlabs.org/members/markbrough">Mark Brough</a> showed off a test project collecting <a href="http://markbrough.github.io/philippines/">data on aid activities in the Philippines</a>. Mark’s small static site, updated each night, collects <a href="http://iatistandard.org">IATI</a> aid data on projects in the Philippines and republishes it in a more browsable form.</p>
<p>Mark also discussed another data-mashup project, still in the planning stage, that would combine budget and aid data for Tanzania (or any other developing country)—similar to Publish What You Fund’s old <a href="http://publishwhatyoufund.org/uganda/">Uganda project</a> but based on a non-static dataset.</p>
<h2 id="global-economic-map">Global Economic Map</h2>
<p>Alex Peek discussed his initiative to create the <a href="http://meta.wikimedia.org/wiki/Global_Economic_Map">Global Economic Map</a>, “a collection of standardized data set of economic statistics that can be applied to every country, region and city in the world”.</p>
<p>The GEM will draw data from sources like government publications and SEC filings and will cover <a href="https://meta.wikimedia.org/wiki/Grants:IdeaLab/Global_Economic_Map#Format_and_economic_statistics">eleven statistics</a> that touch on GDP, employment, corporations, and budgets. The GEM aims to be <a href="https://meta.wikimedia.org/wiki/Grants:IdeaLab/Global_Economic_Map#Wikidata_integration">fully integrated with Wikidata</a>.</p>
<h2 id="frictionless-data">Frictionless data</h2>
<p>Finally, <a href="http://okfnlabs.org/members/rgrp">Rufus Pollock</a> discussed <a href="http://data.okfn.org">data.okfn.org</a> and the mission of “frictionless data”: making it “as simple as possible to get the data you want into the tool of your choice.”</p>
<p>data.okfn.org aims to help achieve this goal by promoting, among other things, <a href="http://data.okfn.org/standards">simple data standards</a> and the tooling to support them. As reported in last week’s newsletter, this now includes a <a href="https://github.com/okfn/dpm">Data Package Manager</a> based on <a href="https://npmjs.org/">npm</a>, now working at a very basic level. It also includes the data.okfn.org <a href="http://data.okfn.org/tools/view">Data Package Viewer</a>, which provides a nice view on data packages hosted on GitHub, S3, or wherever else.</p>
<h2 id="improving-the-labs-site">Improving the Labs site</h2>
<p>The hangout wrapped up with a discussion of how to improve the Labs site. Besides some discussion of the possibility of a <a href="https://github.com/okfn/okfn.github.com/issues/134">one-click creation system for Open Data Maker Nights</a>, talk focused on <a href="https://github.com/okfn/okfn.github.com/issues/46">improving the projects page</a>.</p>
<p>Oleg, who has volunteered to take the lead in reforming the projects page, highlighted the need for a way to differentiate projects by their activity level and their need for more contributors. Mark agreed, suggesting also that it would be nice to be able to filter projects by the languages and technologies they use. Both ideas were proposed as a way to fill out <a href="http://www.todrobbins.com/">Tod Robbins</a>’s suggestion that the projects page needs <em>categories</em>.</p>
<p>See the <a href="http://pad.okfn.org/p/labs-hangouts">Labs hangout notes</a> for the full details of this discussion.</p>
<h2 id="get-involved">Get involved</h2>
<p>As always, Labs wants you to join in and get involved! Read more about how you can <a href="http://okfnlabs.org/join/">join the community</a> and participate by coding, wrangling data, or doing outreach and engagement, and have a look at the <a href="http://okfnlabs.org/ideas/">Ideas Page</a> to see what other members have been thinking.</p>
Neil Ashton
Bad Data: real-world examples of how not to do data
2013-11-19T00:00:00+00:00
http://okfnlabs.org/blog/2013/11/19/bad-data-examples-how-not-to-do-data
<p>We’ve just started a mini-project called <a href="http://okfnlabs.org/bad-data/">Bad Data</a>. Bad Data provides real-world examples of how <em>not</em> to publish data. It showcases the poorly structured, the mis-formatted, and the just plain ugly.</p>
<p>This isn’t about being critical but about <strong>educating</strong>—providing examples of how <strong>not</strong> to do something may be one of the best ways of showing how to do it right. It also provides a source of good practice material for budding data wranglers!</p>
<p><img src="http://i.imgur.com/FNBf3aR.png" alt="Bad Data: ASCII spreadsheet" /></p>
<p>Each “bad” dataset gets a simple page on the site that shows what’s wrong along with a preview or screenshot.</p>
<p>We’ve started to stock the site with some of the better examples of bad data that we’ve come across over the years. This includes machine-<em>un</em>readable <a href="http://okfnlabs.org/bad-data/ex/tfl-passenger-numbers/">Transport for London passenger numbers from the London Datastore</a> and a classic “<a href="http://okfnlabs.org/bad-data/ex/bls-us-employment/">ASCII spreadsheet</a>” from the US Bureau of Labor Statistics.</p>
<p><strong>We welcome contributions of new examples! <a href="http://okfnlabs.org/bad-data/add">Submit them here</a>.</strong></p>
Rufus Pollock
Labs newsletter: 14 November, 2013
2013-11-14T00:00:00+00:00
http://okfnlabs.org/blog/2013/11/14/newsletter
<p>Labs was bristling with discussion and creation this week, with major improvements to two projects, interesting conversations around a few others, and an awesome new blog post.</p>
<h2 id="data-pipes-lots-of-improvements">Data Pipes: lots of improvements</h2>
<p><a href="http://datapipes.okfnlabs.org/">Data Pipes</a> is a Labs project that provides a web API for a set of simple data-transforming operations that can be chained together in the style of Unix pipes.</p>
<p>This past week, <a href="http://okfnlabs.org/members/andylolz">Andy Lulham</a> has made a <em>huge</em> number of improvements to Data Pipes. Just a few of the new features and fixes:</p>
<ul>
<li>new operations: <code class="language-plaintext highlighter-rouge">strip</code> (removes empty rows), <code class="language-plaintext highlighter-rouge">tail</code> (truncate dataset to its last rows)</li>
<li>new features: a <code class="language-plaintext highlighter-rouge">range</code> function and a “complement” switch for <code class="language-plaintext highlighter-rouge">cut</code>; options for <code class="language-plaintext highlighter-rouge">grep</code></li>
<li>all operations in pipeline are now trimmed for whitespace</li>
<li>basic tests have been added</li>
</ul>
<p>Have a look at the <a href="https://github.com/okfn/datapipes/issues?page=1&state=closed">closed issues</a> to see more of what Andy has been up to.</p>
<h2 id="webshot-new-homepage-and-feature">Webshot: new homepage and feature</h2>
<p>Last week we introduced you to <a href="http://webshot.okfnlabs.org/">Webshot</a>, a web API for screenshots of web pages.</p>
<p>Back then, Webshot’s home page was just a screenshot of GitHub. Now Webshot has a <a href="http://webshot.okfnlabs.org/">proper home page</a> with a form interface to the API.</p>
<p>Webshot has also added support for <em>full page</em> screenshots. Now you can capture the whole page rather than just its visible portion.</p>
<h2 id="on-the-blog-natural-language-processing-with-python">On the blog: natural language processing with Python</h2>
<p>Labs member <a href="http://okfnlabs.org/members/tamr/">Tarek Amr</a> has contributed an awesome post on <a href="http://okfnlabs.org/blog/2013/11/11/python-nlp.html">Python natural language processing</a> with the NLTK toolkit to the Labs blog.</p>
<p>“The beauty of NLP,” Tarek says, “is that it enables computers to extract knowledge from unstructured data inside textual documents.” Read his post to learn how to do text normalization, frequency analysis, and text classification with Python.</p>
<h2 id="data-packages-workflow-à-la-node">Data Packages workflow à la Node</h2>
<p>Wouldn’t it be nice to be able to initialize new <a href="http://data.okfn.org/standards/data-package">Data Packages</a> as easily as you can initialize a Node module with <code class="language-plaintext highlighter-rouge">npm init</code>?</p>
<p><a href="http://www.maxogden.com">Max Ogden</a> started a <a href="https://github.com/okfn/datapackage.js/issues/3">discussion thread</a> around this enticing idea, eventually leading to <a href="http://okfnlabs.org/members/rgrp">Rufus Pollock</a> booting a new repo for <a href="https://github.com/okfn/dpm">dpm</a>, the Data Package Manager. Check out <a href="https://github.com/okfn/dpm/issues">dpm’s Issues</a> to see what needs to happen next with this project.</p>
<h2 id="nomenklatura-looking-forward">Nomenklatura: looking forward</h2>
<p><a href="http://nomenklatura.okfnlabs.org/">Nomenklatura</a> is a Labs project that does data reconciliation, making it possible “to maintain a canonical list of entities such as persons, companies or event streets and to match messy input, such as their names, against that canonical list”.</p>
<p><a href="http://okfnlabs.org/members/pudo">Friedrich Lindenberg</a> has noted on the Labs mailing list that <a href="http://lists.okfn.org/pipermail/okfn-labs/2013-November/001138.html">Nomenklatura has some serious problems</a>, and he has proposed “a fairly radical re-framing of the service”.</p>
<p>The conversation around what this re-framing should look like is still underway—check out <a href="http://lists.okfn.org/pipermail/okfn-labs/2013-November/001138.html">the discussion thread</a> and jump in with your ideas.</p>
<h2 id="data-issues-following-issues">Data Issues: following issues</h2>
<p>Last week, the idea of <a href="http://okfnlabs.org/blog/2013/11/06/tracking-data-issues.html">Data Issues</a> was floated: using GitHub Issues to track problems with public datasets. The idea has generated a few comments, and we’d love to hear more.</p>
<p>Discussion on the Labs list highlighted another benefit of using GitHub. <a href="https://github.com/aliounedia">Alioune Dia</a> suggested that Data Issues should let users register to be notified when a particular issue is fixed. But <a href="http://feedmechocolate.com/">Chris Mear</a> pointed out that GitHub already makes this possible: “Any GitHub user can ‘follow’ a specific issue by using the notification button at the bottom of the issue page.”</p>
<h2 id="get-involved">Get involved</h2>
<p>Anyone can join the Labs community and get involved! Read more about how you can <a href="http://okfnlabs.org/join/">join the community</a> and participate by coding, wrangling data, or doing outreach and engagement. Also check out the <a href="http://okfnlabs.org/ideas/">Ideas Page</a> to see what’s cooking in the Labs.</p>
Neil Ashton
Natural Language Processing using Python
2013-11-11T00:00:00+00:00
http://okfnlabs.org/blog/2013/11/11/python-nlp
<p>This weekend the <a href="http://www.gdgcairo.org/" title="GDG Cairo">Google Developer Group in Cairo</a> arranged two days of workshops followed by a hackathon. During this event, I organized a workshop about <a href="http://nltk.org/" title="Natural Language Toolkit">NLTK</a> and the use of Python in Natural Language Processing (NLP). The session’s slides can be found <a href="http://tarekamr.appspot.com/slides/pynlp" title="Python NLP Slides">here</a>. The beauty of NLP is that it enables computers to extract knowledge from unstructured data inside textual documents. Websites like Zite use NLP to deliver custom news to readers based on their taste. NLP enables Google to extract times and dates from email messages so that Gmail users can automatically add events mentioned in their emails to their calendars. The same technology enables us to translate text and predict the language of a tweet. Data journalists can also use NLP to analyse transcripts of the speeches of politicians and MPs to find newsworthy information that would not be feasible to find without such technology.</p>
<h2 id="normalization-and-tokenization">Normalization and Tokenization.</h2>
<p>Two initial steps should take place before dealing with text. Words can take various forms according to their context. For example, the same word is capitalized when it is placed at the beginning of a sentence and written in lower case elsewhere. Plural words have different endings from their singular counterparts, while conjugation changes the endings of verbs.</p>
<p>The word “free” appears twice in the following sentence, “Free hosting and free domain”, but for a computer to know that it is the same word regardless of its case, we may need to convert the whole sentence to lower case. This is simply done in Python as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>“Free hosting and free domain”.lower()
</code></pre></div></div>
<p>Nevertheless, in some cases we might need to take the case of the words into consideration, especially when it carries information about their meaning. Consider the following example: “The CEO of Apple gave me an apple”.</p>
<p>Additionally, stemming is used to make sure that plural and singular words become the same. It also normalizes adjectives, adverbs and verbs given their various conjugations:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import nltk
stemmer = nltk.PorterStemmer()
stemmer.stem('running') # => run
stemmer.stem('shoes') # => shoe
stemmer.stem('expensive') # => expens
</code></pre></div></div>
<p>Another useful command in NLTK is <code class="language-plaintext highlighter-rouge">clean_html()</code>, which is capable of removing all HTML tags from a given text.</p>
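<p>As a rough sketch of how that looks (assuming NLTK 2.x, where clean_html() is available, and a placeholder file name):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import nltk

html = open('page.html').read()  # placeholder: any HTML file
text = nltk.clean_html(html)     # the same content with the tags stripped
print text
</code></pre></div></div>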
<p>After normalizing our text, we usually need to divide it into sentences and words. The split() method is capable of converting the following string, “We sell you finger-licking fries.”, into this list of words:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"We sell you finger-licking fries.".split()
['We', 'sell', 'you', 'finger-licking', 'fries.']
</code></pre></div></div>
<p>One problem with the previous command is that it does not deal well with the hyphen and the full stop. An alternative is the wordpunct_tokenize() method provided by NLTK:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from nltk.tokenize import wordpunct_tokenize
wordpunct_tokenize('We sell you finger-licking fries.')
['We', 'sell', 'you', 'finger', '-', 'licking', 'fries', '.']
</code></pre></div></div>
<h2 id="text-analysis">Text Analysis</h2>
<p>NLTK allows us to find the frequency of each word in our textual data. In the demo <a href="https://github.com/gr33ndata/NLP_GDGCairo2013" title="GNU GPL Demo">‘gnugpl.py’</a>, you can see how to use nltk.Text() to list the top n words in the GPL3 license. Similarly, we can get the frequency distribution of characters in text, rather than words. We will show later on how we can detect the language of some text using the frequency distribution of its characters.</p>
<p><img src="http://i.imgur.com/DxbwkGrl.png" alt="Frequency distribution of characters" /></p>
<p><em>Frequency distribution of characters in both English and Arabizi</em></p>
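<p>As a minimal sketch of the word-frequency side of this (the file name is a placeholder, and we sort explicitly so the snippet doesn’t depend on a particular NLTK version):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import nltk

text = open('gpl3.txt').read()  # placeholder: any plain-text file
words = nltk.wordpunct_tokenize(text.lower())
fdist = nltk.FreqDist(words)
# Sort by count and print the ten most frequent words
for word, count in sorted(fdist.items(), key=lambda p: p[1], reverse=True)[:10]:
    print word, count
</code></pre></div></div>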
<p>One problem with word frequencies is that a big percentage of the top n words are stop-words. Stop-words are common words in a language that are not related to the topic of the document, such as “the”, “of”, “and”, etc. In the demo <a href="https://github.com/gr33ndata/NLP_GDGCairo2013" title="Wikipedia pages for Egypt, Tunisia and Lebanon">wikianalysis.py</a>, we grabbed the text of the Wikipedia pages for Egypt, Tunisia and Lebanon. The top n words from each page were put in a table <a href="https://docs.google.com/spreadsheet/ccc?key=0AmbldjoHWBdZdGpvWDFBcjBneDBlY05ScHZ2dU8yU3c" title="Wikipedia Analysis">here</a>. One way to deal with stop-words is to re-weight terms: words that appear in one page but not in the others should be given higher weight than words common to all three pages, even if the latter have higher frequencies. Thus, we divided the count of each word in a page by its total count across the three pages. The results were put in the next tab, where we can see that the words marked in green are the ones related to each country. Additionally, you can use the collocations() method in NLTK to find word pairs that frequently appear together.</p>
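<p>A simple complementary approach, sketched below, is to drop stop-words before counting. This assumes NLTK’s stopwords corpus has been downloaded (e.g. via nltk.download()), and the file name is again a placeholder:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import nltk
from nltk.corpus import stopwords  # requires the stopwords corpus

words = nltk.wordpunct_tokenize(open('egypt.txt').read().lower())
english_stops = set(stopwords.words('english'))
# Keep only alphabetic tokens that are not stop-words
content_words = [w for w in words if w.isalpha() and w not in english_stops]
</code></pre></div></div>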
<h2 id="text-classification">Text Classification</h2>
<p>People sometimes find it easier to write Arabic words in Latin letters on social media websites. This way of writing is commonly known as Arabizi or Francoarab. We needed a way to tell whether a text is written in English or in Francoarab. Initially, we noticed that the letter distribution varies between the two, and the distribution of consecutive letter pairs varies as well. Thus, in the demo <a href="https://github.com/gr33ndata/NLP_GDGCairo2013" title="Text Language Classification">franco.py</a>, we used a Naive Bayes classifier to predict the language of a given text. We trained the classifier on the distribution of character pairs (bigrams) in our training set, “corpus/franco.txt”.</p>
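<p>The demo has the full details, but the general shape of such a classifier is roughly the following; the two training examples here are invented stand-ins for the real training set in “corpus/franco.txt”:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import nltk

def bigram_features(text):
    # Represent a text by the character pairs (bigrams) it contains
    text = text.lower()
    return dict((text[i:i+2], True) for i in range(len(text) - 1))

# Invented toy training data; the real demo trains on corpus/franco.txt
labelled_texts = [('how are you doing today', 'english'),
                  ('ezayak 3amel eh ya basha', 'arabizi')]
train_set = [(bigram_features(t), label) for t, label in labelled_texts]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print classifier.classify(bigram_features('see you tomorrow'))
</code></pre></div></div>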
<p>In addition to classifying a whole document, one might also need to predict the category of each word in the document. These categories are better known as Part of Speech (PoS) tags. For example, the word “book” is a noun in “I am reading a book”, while in the phrase “I am going to book my train ticket” it happens to be a verb. PoS tagging is thus used to determine whether a word is a noun, adjective, verb, etc. There are built-in PoS taggers in NLTK; however, in our demo, <a href="https://github.com/gr33ndata/NLP_GDGCairo2013" title="#CairoTraffic Analysis">cairotraffic.py</a>, we wanted to have our own set of PoS tags.</p>
<p>People normally use the hashtag #CairoTraffic on Twitter to update their friends about the status of the traffic in the different streets of Cairo. Although such tweets are easily understood by humans, it is hard for a computer to extract structured data from them. For example, the following two phrases carry the same meaning despite their different wording:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>The road to Zamalek from Tahrir is blocked.
Traffic was totally blocked from Tahrir in the direction of Zamalek.
</code></pre></div></div>
<p>Thus, in “cairotraffic.py” we needed to create PoS tags for the “FROM” and “TO” locations in each tweet. We used NLTK’s UnigramTagger() and BigramTagger(), in addition to a Naive Bayes classifier, to extract the words reflecting the “FROM” and “TO” locations from each tweet. It was clear from our demo that the Naive Bayes classifier outperformed the unigram and bigram taggers on unseen words and on variations in sentence structure.</p>
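<p>As a rough sketch of the tagger side of this (the training sentence and the FROM/TO/O tag set below are invented for illustration):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import nltk

# Invented training data: each sentence is a list of (word, tag) pairs
train_sents = [[('road', 'O'), ('from', 'O'), ('Tahrir', 'FROM'),
                ('to', 'O'), ('Zamalek', 'TO'), ('blocked', 'O')]]
unigram = nltk.UnigramTagger(train_sents)
bigram = nltk.BigramTagger(train_sents, backoff=unigram)
print bigram.tag(['traffic', 'from', 'Tahrir', 'to', 'Zamalek'])
</code></pre></div></div>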
<p>In addition to PoS tagging, we also applied a simple Naive Bayes classifier to tell whether a road is blocked (za7ma) or alright (7alawa) given the words used in a tweet.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Python and NLTK make it easy to carry out complex natural language processing tasks in a few lines of code. We have seen here that the toolkit provides suitable methods for text tokenization, analysis and classification. You can also read more about the concepts discussed here and the other capabilities of NLTK in this <a href="http://nltk.org/book/" title="Natural Language Processing with Python">free book</a> by Steven Bird, Ewan Klein, and Edward Loper.</p>
<p>If any of the topics discussed here is unclear, please feel free to ask me about it. Also feel free to <a href="http://tarekamr.appspot.com/" title="Tarek Amr Homepage">contact me</a> if you have any comments about the code or would like help using it in any of your projects.</p>
Tarek Amr
Labs newsletter: 7 November, 2013
2013-11-07T00:00:00+00:00
http://okfnlabs.org/blog/2013/11/07/newsletter
<p>There was lots of interesting activity around Labs this week, with two launched projects, a new initiative in the works, and an Open Data Maker Night in London.</p>
<h2 id="webshot-online-screenshot-service">Webshot: online screenshot service</h2>
<p><a href="http://webshot.okfnlabs.org/">webshot.okfnlabs.org</a>, an online service for taking screenshots of websites, is now live, thanks to <a href="https://github.com/opsb">Oliver Searle-Barnes</a> and <a href="https://github.com/simong">Simon Gaeremynck</a>.</p>
<p>Try it out with an API call like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>http://webshot.okfnlabs.org/?url=http://okfnlabs.org&width=640&height=480
</code></pre></div></div>
<p>Read more about the development behind the service <a href="https://github.com/okfn/ideas/issues/63">here</a>.</p>
<h2 id="product-open-data-android-app">Product Open Data Android app</h2>
<p>The first version of the <a href="https://play.google.com/store/apps/details?id=org.okfn.pod">Android app</a> for <a href="http://www.product-open-data.com/">Product Open Data</a> has launched, allowing you to conveniently look up open data associated with a product on your phone.</p>
<p>The <a href="https://github.com/okfn/product-browser-android">source code</a> for the app is available on GitHub.</p>
<h2 id="crowdcrafting-for-public-bodies">Crowdcrafting for Public Bodies</h2>
<p><a href="http://publicbodies.org">PublicBodies.org</a> aims to provide “a URL for every part of government”. Many entries in the database lack good description text, though, making them harder to use effectively. Fixing this would be <a href="http://lists.okfn.org/pipermail/okfn-labs/2013-October/001117.html">a good use of CrowdCrafting.org</a>, the crowd-sourcing platform powered by <a href="http://dev.pybossa.com/">PyBossa</a>.</p>
<p>Rufus suggests this start small and begin with EU public bodies. It should be <a href="http://lists.okfn.org/pipermail/okfn-labs/2013-October/001120.html">easy to build a CrowdCrafting app</a> to cover those, says Daniel Lombraña González. Friedrich Lindenberg thinks this approach <a href="http://lists.okfn.org/pipermail/okfn-labs/2013-November/001125.html">could work for other datasets</a> as well.</p>
<p>Discussion of this idea is still happening on the list, so jump in and say what you think—or help build the app!</p>
<h2 id="open-data-maker-night-6">Open Data Maker Night #6</h2>
<p>The sixth <a href="http://okfnlabs.org/events/open-data-maker/">Open Data Maker Night</a> took place this past Tuesday in London. Open Data Maker Nights are informal events where people make things with open data, whether apps or insights.</p>
<p>This night’s focus was on adding more UK and London data to <a href="http://openspending.org">OpenSpending</a>, and it featured special guest <a href="http://maxogden.com/">Max Ogden</a>. It was hosted by the Centre for Creative Collaboration.</p>
<p>Our next Open Data Maker Night will happen in early December. If you want to organize your own, though, it’s super easy: just see the <a href="http://okfnlabs.org/events/open-data-maker/">Open Data Maker Night website</a> for help booting, promoting, and running the event.</p>
Neil Ashton
Tracking Issues with Data the Simple Way
2013-11-06T00:00:00+00:00
http://okfnlabs.org/blog/2013/11/06/tracking-data-issues
<p><a href="https://github.com/datasets/issues">Data Issues</a> is a prototype initiative to track “issues” with data using a simple bug tracker—in this case, GitHub Issues.</p>
<p>We’ve all come across “issues” with data, whether it’s “data” that turns out to be provided as a PDF, the many ways to badly format tabular data (<a href="http://okfnlabs.org/bad-data/ex/tfl-passenger-numbers/">empty rows, empty columns</a>, inlined metadata …), “<a href="http://okfnlabs.org/bad-data/ex/bls-us-employment/">ASCII spreadsheets</a>”, or simply erroneous data.</p>
<p>Key to starting to improve data quality is a way to report and record these issues.</p>
<p>We’ve thought about ways to address this for <a href="http://blog.okfn.org/2011/03/31/building-the-open-data-ecosystem/">quite some time</a> and, led by <a href="http://okfnlabs.org/members/pudo/">Labs member Friedrich Lindenberg</a>, even experimented with building our <a href="http://okfnlabs.org/blog/2012/07/10/dataissues.html">own service</a>. But recently, thanks to a comment from <a href="http://okfnlabs.org/members/david/">Labs member David Miller</a>, we were hit with a blinding insight: why not do the simplest thing possible and just use an <strong>existing bug tracker tool</strong>? And so was born the current version of <a href="https://github.com/datasets/issues">Data Issues based on a github issue tracker</a>!</p>
<p><img src="http://i.imgur.com/lyIJYGo.png" alt="Data Issues" /></p>
<p><em>Aside: Before you decide we were completely crazy not to see this in the first place, it should be said that doing data issues “properly” (in the medium term) probably does require something a bit more than a normal bug tracker. For example, it would be nice to be able to both pinpoint an issue precisely (e.g. the date in column 5 on line 3751 is invalid) and group similar issues (e.g. all amounts in column 7 have commas in them). Doing this would require a tracker that was customized for data. The solution described in this post, however, seems like a great way to get started.</em></p>
<h2 id="introducing-data-issues">Introducing Data Issues</h2>
<p>Given the existence of so many excellent issue-tracking systems, we thought the best way to start is to reuse one—in the simplest possible way.</p>
<p>With <a href="https://github.com/datasets/issues">Data Issues</a>, we’re using GitHub Issues to track issues with datasets. Data Issues is essentially just a GitHub repository whose Issues are used to report problems on open datasets. Any problem with any dataset can be reported on Data Issues.</p>
<p>To report an issue with some data, just <a href="https://github.com/datasets/issues/issues/new">open an issue in the tracker</a>, add relevant info on the data (its URL, who’s responsible for it, the line number of the bug, etc.), and explain the problem. You can add labels to group related issues—for example, if multiple datasets from the same site have problems, you can add a label that identifies the dataset’s site of origin.</p>
<p>Straightaway, the issue you raise becomes a <em>public notice</em> of the problem with the dataset. Everyone interested in the dataset has access to the issue. The issue is also <em>actionable</em>: each issue contains a thread of comments that can be used to track the issue’s status, and the issue can be <em>closed</em> when it has been fixed. All issues submitted to Data Issues are visible in a central list, which can be filtered by keyword or label to zoom in on relevant issues. All of these great features come <em>for free</em> because we’re using GitHub Issues.</p>
<h2 id="get-involved">Get Involved</h2>
<p>For Data Issues to work, people need to use it. If civic hackers, journalists, and other data wranglers learn about Data Issues and start using it to track their work on datasets, we might find that the problem of tracking issues with datasets has already been solved.</p>
<p>You can also contribute by helping develop the project into something richer than a simple Issues page. One limitation of Data Issues is that raising an issue does not actually contact the parties responsible for the data. Our next goal is to automate sending along feedback from Data Issues, making it a more effective bug tracker.</p>
<p>If you want to discuss new directions for Data Issues or point out something you’ve built that contributes to the project, get in touch via the <a href="http://lists.okfn.org/mailman/listinfo/okfn-labs">Labs mailing list</a>.</p>
Rufus Pollock
A Python guide for open data file formats
2013-10-17T00:00:00+00:00
http://okfnlabs.org/blog/2013/10/17/python-guide-for-file-formats
<p>If you are an open data researcher you will need to handle a lot of different file formats from datasets. Sadly, most of the time you don’t have the opportunity to choose which file format is best for your project; you have to cope with all of them to be sure that you won’t hit a dead end. There’s always someone who knows the solution to your problem, but that doesn’t mean that answers come easy. Here is a guide to each file format from the <a href="http://opendatahandbook.org/">open data handbook</a>, with a suggested Python library for each.</p>
<p><strong>JSON</strong> is a simple file format that is very easy for any programming language to read. Its simplicity means that it is generally easier for computers to process than other formats, such as XML. Working with JSON in Python is almost the same as working with a Python dictionary. You will need the json library, which is preinstalled with every Python from 2.6 onwards.</p>
<pre>import json

# Load a JSON file into Python data structures (dicts and lists)
with open("file.json") as json_file:
    data = json.load(json_file)</pre>
<p>Then data["key"] returns the value stored under that key.</p>
<p><strong>XML</strong> is a widely used format for data exchange because it preserves the structure of the data and the way files are built, and it allows developers to write parts of the documentation in with the data without interfering with reading it. Parsing it is pretty easy in Python as well. You will need the minidom library, which is also preinstalled.</p>
<pre>from xml.dom import minidom

# Parse an XML file and collect every element with the "name" tag
xmldoc = minidom.parse("file.xml")
itemlist = xmldoc.getElementsByTagName("name")</pre>
<p>This returns a list of all the elements with the “name” tag.</p>
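<p>Continuing that example, you can walk the list to pull out actual values; the “id” attribute here is purely illustrative, and reading firstChild.data assumes the element directly contains text:</p>
<pre>for item in itemlist:
    print item.getAttribute("id")   # an illustrative attribute
    print item.firstChild.data      # the text inside the element</pre>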
<p><strong>RDF</strong> is a W3C-recommended format and makes it possible to represent data in a form that makes it easier to combine data from multiple sources. RDF data can be stored in XML and JSON, among other serializations. RDF encourages the use of URLs as identifiers, which provides a convenient way to directly interconnect existing open data initiatives on the Web. RDF is still not widespread, but it has been a trend among Open Government initiatives, including the British and Spanish Government Linked Open Data projects. The inventor of the Web, Tim Berners-Lee, has recently proposed a five-star scheme that includes linked RDF data as a goal to be sought for open data initiatives. I use rdflib for this file format. Here is an example.</p>
<pre>from rdflib.graph import Graph

g = Graph()
g.parse("file.rdf", format="xml")  # or "n3", "nt", ... to match the serialization
for stmt in g:
    print(stmt)</pre>
<p>With RDF you can also run queries and return only the data you want, but this isn’t as easy as parsing. You can find a tutorial <a href="http://code.alcidesfonseca.com/docs/rdflib/gettingstarted.html#run-a-query">here</a>.</p>
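<p>For a taste of querying, here is a sketch that runs a SPARQL query against the graph g built above (this assumes a version of rdflib with SPARQL support; older releases needed the rdfextras plugin):</p>
<pre># List ten triples from the graph built above
results = g.query("""
    SELECT ?subject ?predicate ?object
    WHERE { ?subject ?predicate ?object }
    LIMIT 10
""")
for row in results:
    print row</pre>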
<p><strong>Spreadsheets</strong>. Many authorities have information stored in spreadsheets, for example Microsoft Excel. This data can often be used immediately, given correct descriptions of what the different columns mean. However, in some cases there can be macros and formulas in spreadsheets, which may be somewhat more cumbersome to handle. It is therefore advisable to document such calculations next to the spreadsheet, since that is generally more accessible for users to read. I prefer to use a tool like xls2csv and then use the output file as a CSV. But if you want for any reason to work with an xls file, the best source I have found is <a href="http://www.python-excel.org/">www.python-excel.org</a>. The most popular library is the first one, xlrd. There is also another library, <a href="http://pythonhosted.org/openpyxl/">openpyxl</a>, which lets you work with xlsx files.</p>
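<p>A minimal sketch of reading an xls file with xlrd (the file name is a placeholder):</p>
<pre>import xlrd

book = xlrd.open_workbook("file.xls")
sheet = book.sheet_by_index(0)        # the first worksheet
for rownum in range(sheet.nrows):
    print sheet.row_values(rownum)    # each row as a list of cell values</pre>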
<p><strong>Comma Separated Values (CSV)</strong> files can be a very useful format because they are compact and thus suitable for transferring large sets of data with the same structure. However, the format is so spartan that data is often useless without documentation, since it can be almost impossible to guess the significance of the different columns. It is therefore particularly important for comma-separated formats that the documentation of the individual fields is accurate. Furthermore, it is essential that the structure of the file is respected, as a single omitted field may disturb the reading of all remaining data in the file without any real opportunity to rectify it, because it cannot be determined how the remaining data should be interpreted. You can use the csv Python library. Here is an example.</p>
<pre>import csv

with open('eggs.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in reader:
        print ', '.join(row)</pre>
<p><strong>Plain Text (txt)</strong> files are very easy for computers to read. However, they generally exclude structural metadata from inside the document, meaning that developers will need to create a parser that can interpret each document as it appears. Some problems can be caused by switching plain text files between operating systems: MS Windows, Mac OS X and other Unix variants each have their own way of telling the computer that the end of a line has been reached. You can load a txt file easily, but how you use it after that depends on the data format.</p>
<pre>text_file = open("file.txt", "r")
text = text_file.read()   # the whole file as a single string
text_file.close()</pre>
<p>This example returns the whole text as a single string.</p>
<p><strong>PDF</strong>. Here is the biggest problem among open data file formats. Many datasets have their data in PDF and unfortunately it isn’t easy to read, let alone edit, them: PDF is really presentation-oriented, not content-oriented. But you can use <a href="https://pypi.python.org/pypi/pdfminer/">PDFMiner</a> to work with it. I won’t include an example here since it isn’t a trivial one, but you can find everything you need in their documentation.</p>
<p><strong>HTML</strong>. Nowadays much data is available in HTML format on various sites. This may well be sufficient if the data is very stable and limited in scope. In some cases, it could be preferable to have the data in a form that is easier to download and manipulate, but as it is cheap and easy to refer to a page on a website, it might be a good starting point for displaying data. Typically, it is most appropriate to use tables in HTML documents to hold data, and it is then important that the various data fields are displayed and given IDs which make it easy to find and manipulate the data. Yahoo has developed a tool, <a href="http://developer.yahoo.com/yql/">yql</a>, that can extract structured information from a website, and such tools can do much more with the data if it is carefully tagged. I have used a Python library called Beautiful Soup many times in my projects.</p>
<pre>from bs4 import BeautifulSoup

soup = BeautifulSoup(html_file)
soup.title                 # the title element
soup.title.name            # its tag name
soup.title.string          # its text content
soup.title.parent.name     # the name of its parent tag
soup.p                     # the first paragraph element
soup.p['class']            # that element's class attribute
soup.a                     # the first link element
soup.find_all('a')         # every link element
soup.find(id="link3")      # the element with id="link3"</pre>
<p>Those are only a few of the things you can do with this library. Accessing a tag returns its content. You can find more in their <a href="http://www.crummy.com/software/BeautifulSoup/bs4/doc/">documentation</a>.</p>
<p><strong>Scanned image</strong>. Yes, it is true. This is probably the least suitable form for most data, but both TIFF and JPEG-2000 can at least be marked up with documentation of what is in the picture - right up to the full text content of a scanned document. If the images are clean, containing only text and without any noise, you can use a library called pytesser. You will need the PIL library to use it. Here is an example.</p>
<pre>from pytesser import *
image = Image.open('fnord.tif') # Open image object using PIL
print image_to_string(image)</pre>
<p><strong>Proprietary formats</strong>. Last but not least, some dedicated systems have their own data formats that they can save or export data in. It can sometimes be enough to expose data in such a format, especially if it is expected that further use would be in a similar system to the one it came from. Where further information on these proprietary formats can be found should always be indicated, for example by providing a link to the supplier’s website. Generally it is recommended to publish data in non-proprietary formats where feasible. I suggest googling whether there is a library specific to the format in question.</p>
<p><strong>Additional Info</strong>. You may also find the <a href="http://pandas.pydata.org/pandas-docs/stable/io.html">Pandas</a> library useful; its I/O capabilities integrate and unify access to and from most of these formats: CSV, Excel, HDF, SQL, JSON, HTML, Pickle.</p>
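<p>For example, a minimal sketch of loading a CSV with Pandas (the file name is a placeholder; similar read functions exist for several of the other formats):</p>
<pre>import pandas as pd

df = pd.read_csv("file.csv")
print df.head()   # the first few rows, nicely formatted</pre>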
Anastasios Ventouris
Introducing TimeMapper - Create Elegant TimeMaps in Seconds
2013-10-11T00:00:00+00:00
http://okfnlabs.org/blog/2013/10/11/timemapper
<p><a href="http://timemapper.okfnlabs.org">TimeMapper</a> lets you create elegant and embeddable timemaps quickly and easily from a simple spreadsheet.</p>
<p><a href="http://timemapper.okfnlabs.org/okfn/medieval-philosophers"><img src="http://i.imgur.com/FmPTZlr.png" alt="Medieval philosophers timemap" /></a></p>
<p>A timemap is an interactive timeline whose items connect to a geomap. Creating a timemap with TimeMapper is as easy as filling in a spreadsheet template and copying its URL.</p>
<p>In this quick walkthrough, we’ll learn how to recreate the <a href="http://timemapper.okfnlabs.org/okfn/medieval-philosophers">timemap of medieval philosophers</a> shown above using TimeMapper.</p>
<h2 id="getting-started-with-timemapper">Getting started with TimeMapper</h2>
<p>To get started, go to the <a href="http://timemapper.okfnlabs.org/">TimeMapper website</a> and sign in using your <a href="http://twitter.com">Twitter</a> account. Then click <strong>Create a new Timeline or TimeMap</strong> to start a new project. As you’ll see, it really is as easy as 1-2-3.</p>
<p>TimeMapper projects are generated from <a href="http://docs.google.com">Google Sheets</a> spreadsheets. Each item on the timemap – an event, an individual, or anything else associated with a date (or two, for the start and end of a period) – is a spreadsheet row.</p>
<p>What can you put in the spreadsheet? Check out the <a href="https://docs.google.com/a/okfn.org/spreadsheet/ccc?key=0AqR8dXc6Ji4JdFRNOTVYYTRqTmh6TUNNd3U2X2pKMGc#gid=0">TimeMapper template</a>. It contains all of the columns that TimeMapper understands, plus a row of cells explaining what each of them means. Your timemap doesn’t have to use all of these columns, though—it just requires a <em>Start</em> date, a <em>Title</em>, and a <em>Description</em> for each item, plus geographical coordinates for the map.</p>
<p>So you’ve put your data in a Google spreadsheet—how can you make it into a timemap? Easy! From Google Sheets, go to <strong>File -> Publish to the web</strong> and hit <strong>Start publishing</strong>. Then click on your sheet’s <strong>Share</strong> button and set the sheet’s visibility to <em>Anyone who has the link can <strong>view</strong></em>. You can either copy the URL from <em>Link to share</em> and paste that URL into the box in Step 2 of the TimeMapper creation process or click on <strong>Select from Your Google Drive</strong> to just browse to the sheet. Whichever you do, then hit <strong>Connect and Publish</strong>—and voilà!</p>
<p><img src="http://i.imgur.com/5SLOURu.png" alt="Share your spreadsheet" /></p>
<p>Embedding your new timemap is just as easy as creating it. Click on <strong>Embed</strong> in the top right corner. It will pop up a snippet of HTML which you can paste into your webpage to embed the timemap. And that’s all it takes!</p>
<p><img src="http://i.imgur.com/3KWL6p6.png" alt="Embed your timemap" /></p>
<h2 id="coming-next">Coming next</h2>
<p>We have big plans for TimeMapper, including:</p>
<ul>
<li>Support for indicating size and time on the map</li>
<li>Quickly create TimeMaps using information from Wikipedia</li>
<li>Connect markers in maps to form a route</li>
<li>Options for timeline- and map-only project layouts</li>
<li><a href="http://disqus.com">Disqus</a>-based comments</li>
<li>Core JS library, <strong>timemapper.js</strong>, so you can build your own apps with timemaps</li>
</ul>
<p>Check out the <a href="https://github.com/okfn/timemapper/issues">TimeMapper issues list</a> to see what ideas we’ve got and to leave suggestions.</p>
<h2 id="code">Code</h2>
<p>In terms of the internals, the app is a simple node.js app with storage in S3. The timemap visualization is pure JS, built using KnightLab’s excellent <a href="http://timeline.knightlab.com/">Timeline.js</a> for the timeline and <a href="http://leafletjs.com/">Leaflet</a> (with OSM) for the maps. For those interested, the code can be found at: <a href="https://github.com/okfn/timemapper/">https://github.com/okfn/timemapper/</a></p>
<h2 id="history-and-credits">History and credits</h2>
<p>TimeMapper is made possible by awesome open source libraries like <a href="http://timeline.verite.co">TimelineJS</a>, <a href="http://backbonejs.org">Backbone</a>, and <a href="http://leafletjs.com">Leaflet</a>, not to mention open data from <a href="http://www.openstreetmap.org">OpenStreetMap</a>. When we first built a TimeMapper-style site in 2007 under the title “Weaving History”, it was a real struggle over many months to build a responsive JavaScript-heavy app. Today, thanks to libraries like these and advances in browsers, it’s now a matter of weeks.</p>
Neil Ashton
Datapackageproxy - work with datapackages in your browser
2013-10-11T00:00:00+00:00
http://okfnlabs.org/blog/2013/10/11/Datapackageproxy-work-with-datapackages-in-your-browser
<p><a href="http://data.okfn.org/standards/data-package">Datapackages</a> are a neat idea
along the “using data like we use code” way. While Tryggvi has created a
nice <a href="https://github.com/tryggvib/datapackage">python module to handle datapackages</a> - there is a problem
using datapackages in javascript.</p>
<p>In an ideal world I’d just call something like <code class="language-plaintext highlighter-rouge">d3.csv()</code> on any csv file on the web. Browser restrictions, however, don’t allow loading arbitrary files from arbitrary websites (for good reasons). To allow it anyway, the serving site needs to opt in explicitly (read more on <a href="http://cors-enable.org">CORS</a>).</p>
<p>Datapackages are hosted by a variety of hosters and many don’t support CORS - thus we’ll need to proxy them through a system that understands the format and is CORS-enabled: datapackageproxy.</p>
<p>Using datapackageproxy is simple. To access a resource of a package, use:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>http://datapackageproxy.appspot.com/resource?url=http://data.okfn.org/data/bond-yields-uk-10y
</code></pre></div></div>
<p>The optional id parameter allows specifying a particular resource (though this is not implemented yet). The proxy returns the data as CSV, so you can use it in d3.csv(). To get the metadata (the package definition) of a datapackage, use:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>http://datapackageproxy.appspot.com/metadata?url=http://data.okfn.org/data/bond-yields-uk-10y
</code></pre></div></div>
<p>The datapackageproxy is built on appspot, so if there is very heavy load it might go over usage limits. (If this happens, I’ll try to either move it or figure something else out…)</p>
<p>Find the code on <a href="https://github.com/mihi-tr/datapackageproxy">github</a> -
and the proxy itself on <a href="http://datapackageproxy.appspot.com">datapackageproxy.appspot.com</a></p>
Michael Bauer
PublicBodies.org - Update no. 2
2013-10-07T00:00:00+00:00
http://okfnlabs.org/blog/2013/10/07/publicbodies.org-update-no-2
<p>Herewith is a report on recent improvements to <a href="http://publicbodies.org/">PublicBodies.org</a>, our Open Knowledge Foundation Labs project to provide “a URL (and information) for every public body” - that’s every government-funded agency, department or organization.</p>
<p><a href="http://publicbodies.org/"><img src="http://farm6.staticflickr.com/5349/10141929423_d84e45764d_z.jpg" alt="" /></a></p>
<h2 id="new-data">New data</h2>
<p>New data contributed over the last couple of months is now validated and live - this includes new data for <strong>Switzerland, Greece, Brazil and the US</strong>. Huge thank-you to contributors here including Hannes, Charalampos, Augusto and Todd.</p>
<p>We also have pending data for Italy and China to get in once it has been reviewed, and we have data in progress for Canada!</p>
<p>We’d love to have more data - if you’re interested in contributing see <a href="https://github.com/okfn/publicbodies#contribute-data">https://github.com/okfn/publicbodies#contribute-data</a></p>
<h2 id="updated-schema-for-data">Updated Schema for Data</h2>
<p>Thanks to input from James McKinney and others we’ve <a href="https://github.com/okfn/publicbodies/issues/29">reworked the schema</a> quite extensively to match up as much as possible with the Popolo spec. You can see the new schema in the <a href="https://github.com/okfn/publicbodies/blob/master/datapackage.json#L13-L105">datapackage.json</a>.</p>
<p>Or, if you don’t like raw JSON, there is a prettier HTML version at: <a href="http://data.okfn.org/community/okfn/publicbodies">http://data.okfn.org/community/okfn/publicbodies</a></p>
<h2 id="search-support">Search support</h2>
<p>We now have basic search support via google custom search: <a href="http://publicbodies.org/search">http://publicbodies.org/search</a></p>
<h2 id="get-involved">Get Involved</h2>
<p>As always we’d love help! There is a <a href="https://github.com/okfn/publicbodies/issues">full list of issues here</a> and example items:</p>
<ul>
<li><a href="https://github.com/okfn/publicbodies/issues/35">Getting Descriptions for 130 EU Public Bodies</a></li>
<li><a href="https://github.com/okfn/publicbodies/issues/8">Support for sending corrections / additions on the website</a></li>
<li><a href="https://github.com/okfn/publicbodies/issues/2">Support for reconciliation (e.g. via nomenklatura)</a></li>
</ul>
Rufus Pollock
Data as Code Deja-Vu
2013-10-04T00:00:00+00:00
http://okfnlabs.org/blog/2013/10/04/data-as-code-dejavu
<p>Someone just pointed me at <a href="http://ben.balter.com/2013/09/16/treat-data-as-code/">this post from Ben Balter about Data as Code</a> in which he emphasizes the analogies between data and code (and especially open data and open-source – e.g. “data is where code was 2 decades ago” …).</p>
<p>I was delighted to see this post as it makes many points I deeply agree with - and have for some time. In fact, reading it gave me something of a sense of (very positive) déjà-vu, since it made similar points to several posts I and others had written several years ago - suggesting that perhaps we’re now getting close to the critical mass we need to create a real distributed and collaborative <a href="http://blog.okfn.org/2011/03/31/building-the-open-data-ecosystem/">open data ecosystem</a>!</p>
<p>It also suggested it was worth dusting off and recapping some of this earlier material as much of it was written more than 6 years ago, a period which, in tech terms, can seem like the stone age.</p>
<h2 id="previous-thinking">Previous Thinking</h2>
<p>For example, there is this essay from 2007 on <a href="http://blog.okfn.org/writings/componentization/">Componentization and Open Data</a> that Jo Walsh and I wrote for our XTech talk that year on CKAN. It emphasized analogies with code and the importance of componentization and packaging.</p>
<p>This, in turn, was based on <a href="http://blog.okfn.org/2006/05/09/the-four-principles-of-open-knowledge-development/">Four principles for Open Knowledge Development</a> and <a href="http://blog.okfn.org/2007/04/30/what-do-we-mean-by-componentization-for-knowledge/">What do we mean by componentization for knowledge</a>. We also emphasized the importance of “version control” in facilitating distributed collaboration, for example in <a href="http://blog.okfn.org/2007/02/20/collaborative-development-of-data/">Collaborative Development of Data (2006/2007)</a> and, more recently, in <a href="http://blog.okfn.org/2010/07/12/we-need-distributed-revisionversion-control-for-data/">Distributed Revision Control for Data (2010)</a> and this year in <a href="http://blog.okfn.org/2013/07/02/git-and-github-for-data/">Git (and GitHub) for Data</a>.</p>
<h2 id="package-managers-and-ckan">Package Managers and CKAN</h2>
<p>This also brings me to a point relevant both to Ben’s post and Michal’s comment: the original purpose (and design) of CKAN was <em>precisely</em> to be a package manager a la rubygems, pypi, debian etc. It has evolved a lot from that into more of a “wordpress for data” - i.e. a platform for publishing, managing (and storing) data because of user demand. (Note that in early CKAN “datasets” were called packages in both the interface and code - a poor UX decision ;-) that illustrated we were definitely ahead of our time - or just wrong!)</p>
<p>Some sense of what was intended is evidenced by the fact that in 2007 we were writing a command-line tool called datapkg (since renamed to dpm, for data package manager) to act as the command-line equivalent of gem / pip / apt-get - see <a href="http://blog.okfn.org/2010/02/23/introducing-datapkg/">this Introducing DataPkg post</a>, which included this diagram illustrating how things were supposed to work.</p>
<p><img src="http://m.okfn.org/files/talks/media/debian_of_data.png" alt="" /></p>
<h2 id="recent-developments">Recent Developments</h2>
<p>As CKAN has evolved into a more general-purpose tool – with less of a focus on just being a registry supporting automated access – we’ve continued to develop those ideas. For example:</p>
<ul>
<li>The basic “package” idea from CKAN has evolved into the <a href="http://data.okfn.org/standards/data-package">Data Package spec</a> - and <a href="http://data.okfn.org/standards/simple-data-format">Simple Data Format</a></li>
<li>We’ve <a href="http://blog.okfn.org/2013/07/02/git-and-github-for-data/">explored storing data using code tools like git</a> - with a dedicated <a href="http://github.com/datasets">datasets organization on Github</a></li>
<li>We’ve re-booted the idea of a simple registry and storage mechanism in the form of <a href="http://data.okfn.org/">http://data.okfn.org/</a> - with data stored in simple data format in git repos on github, displayed in a very simple registry with good tool integration, and curated by a dedicated group of maintainers</li>
<li>We’ve booted the <a href="http://blog.okfn.org/2013/04/24/frictionless-data-making-it-radically-easier-to-get-stuff-done-with-data/">“Frictionless Data” initiative</a> as a way to bring together these different activities in one coherent vision of how we can do something simple to make progress</li>
</ul>
<p><a href="http://data.okfn.org/standards"><img src="http://assets.okfn.org/p/data.okfn.org/img/the-idea.png" alt="" /></a></p>
<p class="caption">Data Packages and Frictionless Data - from <a href="http://data.okfn.org/about">data.okfn.org</a></p>
Rufus Pollock
Full stack datavis - scraperwiki, d3 and github.
2013-09-24T00:00:00+00:00
http://okfnlabs.org/blog/2013/09/24/Full-stack-datavis-using-scraperwiki-d3-and-github
<p>The city of Vienna started releasing waiting times for some of its service offices recently. I followed my usual hunch and just wrote a <a href="https://scraperwiki.com/dataset/guvh44q">small script on scraperwiki</a> that stows away the JSON released by the city, not knowing yet what to do with it.</p>
<p>Weeks later <a href="http://hackshackers.at">Hacks/Hackers Vienna</a> decided to host a hackathon. I couldn’t make it (I thought I might), but I had the idea of developing the data into a visualization. I sat down later that week and published a <a href="http://wannaufsamt.tentacleriot.eu">visualization of waiting times</a>.</p>
<p><img src="http://wannaufsamt.tentacleriot.eu/waa.png" alt="Wann aufs amt?" /></p>
<h3 id="so-why-am-i-rambling-on-about-this">So why am I rambling on about this?</h3>
<p>I realized a couple of things while doing this:</p>
<p>One or two years back, facing a problem like this I would have: made space on a server, written an extensive scraper in python, set up a database to store the data, and written a backend web application to generate graphics and spit them out.</p>
<p>Today: I have my scraper and backend run as a service by
<a href="http://scraperwiki.com">scraperwiki</a>, use <a href="http://d3js.org">d3</a> to
generate graphics on the client (much better looking ones) and host the
whole thing for free on <a href="http://github.com">github</a> - because I don’t need
a backend anymore.</p>
<p>This is made possible by:</p>
<ul>
<li>More and more things offered as a service (often for free)</li>
<li>Amazing frameworks in modern languages, that make development easier</li>
<li>Fantastic resources to exchange knowledge</li>
</ul>
<p>Developing a small data-driven application used to be a lot of work - not anymore. While it takes a while to get used to the intricate ways of some frameworks (<a href="http://d3js.org">d3</a> has a quite unique way of doing things), once you’re over the hump things get a lot easier. This leaves you, in the end, thinking about the visualization or application you’re building - not worrying about server security, costs and setup.</p>
<p>This also means that full-stack data visualization has become easier. You used to need a team of specialists (sysadmins, backend developers, designers) to do a decent dataviz; now you just learn the missing parts and you’re able to pull it off.</p>
Michael Bauer
Using d3 as user input
2013-09-16T00:00:00+00:00
http://okfnlabs.org/blog/2013/09/16/using-d3-as-user-input
<p>Recently, I was at <a href="http://chicaspoderosas.org/">Chicas Poderosas</a> in Bogota - the three-day event featured talks on two days and a hackday on the last. During the event I was approached by <a href="http://cuyabracadabra.wordpress.com/">Natalia</a>, an industrial designer, who introduced a project of hers: <a href="http://cuyabracadabra.wordpress.com/electrocardiograma-%C2%B7-%C2%B7-%C2%B7/">Electrocardiogr_ama</a>. She wanted to build an app with similar features and pitched it on the hackday. I ended up working with Natalia and Knight/Mozilla OpenNews Fellow Sonya Song on the project.</p>
<p>Using <a href="http://d3js.org">D3</a> for visualizing the output was quite straigt
forward. But then, we wanted to have some easy to use user input - we
graded mood on a scale, but how to represent it best? Numbers from 1-x as
they are often used didn’t seem very intuitive (is 1 best or 10 best?).
After thinking about it for a while we had an idea of using a smiley as a
slider - the smiley would smile if happy and look sad if dragged to a sad
status.</p>
<p>See it working here (try dragging it up and down):</p>
<iframe src="http://sonya2song.github.io/moodlog/input.html" width="250" height="350" frameborder="0"></iframe>
<p>To read its value we use the following code:</p>
<pre><code class="javascript">
function sbmt() {
smilescale=d3.scale.linear()
.domain([50,250])
.range([1,10])
note=document.getElementById("note").value;
d3.select("svg > g#smiley").each(function(d) {
score=smilescale(d.y);
// XHTTP Post Request follows here
})
}
</code></pre>
<p>If you want to see it in action: try out the
<a href="http://moodlogr.appspot.com">Moodlog</a> app, or check out the
<a href="https://github.com/sonya2song/moodlog">github repo</a>.</p>
<p>User inputs are often not very intuitive, let’s make them better!</p>
Michael Bauer
Data Pipes - streaming online data transformations
2013-09-11T00:00:00+00:00
http://okfnlabs.org/blog/2013/09/11/datapipes
<p><strong><a href="http://datapipes.okfnlabs.org/">Data Pipes</a></strong> provides an online service built in NodeJS to do <strong>simple data transformations</strong> – deleting rows and columns, find and replace, filtering, viewing as HTML – and, furthermore, to <strong>connect these transformations together</strong> <em>Unix pipes style</em> to make more complex transformations. Because Data Pipes is a web service, data transformation with Data Pipes takes place entirely online and the results <strong>and</strong> process are completely shareable simply by sharing the URL.</p>
<h2 id="an-example">An example</h2>
<p>This takes the <a href="https://raw.github.com/okfn/datapipes/master/test/data/gla.csv">input data</a> (sourced from this <a href="http://static.london.gov.uk/gla/expenditure/docs/2012-13-P12-250.csv">original Greater London Authority financial data</a>), slices out the first 50 rows (head), deletes the first column (its blank!) (cut), deletes rows 1 through 7 (delete) and finally renders the result as HTML (html).</p>
<p><a href="http://datapipes.okfnlabs.org/csv/head%20-n%2050/cut%200/delete%201:7/html?url=https://raw.github.com/okfn/datapipes/master/test/data/gla.csv"><code>http://datapipes.okfnlabs.org/csv/head -n 50/cut 0/delete 1:7/html?url=https://raw.github.com/okfn/datapipes/master/test/data/gla.csv</code></a></p>
<h3 id="before">Before</h3>
<p><a href="http://datapipes.okfnlabs.org/csv/html?url=https://raw.github.com/okfn/datapipes/master/test/data/gla.csv">
<img src="http://farm3.staticflickr.com/2827/9726020844_0301af2ded.jpg" width="500" height="213" alt="Data pipes: GLA data, HTML view" />
</a></p>
<h3 id="after">After</h3>
<p><a href="http://datapipes.okfnlabs.org/csv/head%20-n%2050/cut%200/delete%201:7/html?url=https://raw.github.com/okfn/datapipes/master/test/data/gla.csv">
<img src="http://farm4.staticflickr.com/3728/9726020800_ff01da582e.jpg" width="500" height="177" alt="Data pipes: GLA data, trimmed" />
</a></p>
<h2 id="motivation---data-wrangling-pipes-nodejs-and-the-unix-philosophy">Motivation - Data Wrangling, Pipes, NodeJS and the Unix Philosophy</h2>
<p>When you find data in the wild you usually need to poke around in it and then do some cleaning for it to be usable.</p>
<p>Much of the inspiration for Data Pipes comes from our experience using Unix command-line tools like <code class="language-plaintext highlighter-rouge">grep</code>, <code class="language-plaintext highlighter-rouge">sed</code>, and <code class="language-plaintext highlighter-rouge">head</code> to do this kind of work. These tools are a powerful way to operate on <em>streams</em> of text (or more precisely streams of lines of text, since Unix tools process text files line by line). By using streams, they can scale to large files easily (they don’t load the whole file but process it bit by bit) and, more importantly, allow “piping” – that is, direct connection of the output of one command with the input of another.</p>
<p>This already provides quite a powerful way to do data wrangling (see <a href="https://github.com/rgrp/command-line-data-wrangling">here</a> for more). But there are limits: data isn’t always line-oriented, plus command line tools aren’t online, so it’s difficult to share and repeat what you are doing. Inspired by a combination of Unix pipes and the possibilities of <a href="http://nodejs.org/">NodeJS</a>’s great streaming capabilities, we wanted to take the pipes online for data processing – and so Data Pipes was born.</p>
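<p>To make the streaming idea concrete, here is a minimal sketch - not Data Pipes’ actual code - of the style it builds on: a Node Transform stream that drops the first <code>n</code> lines, in the spirit of the <code>delete</code> operation, processing the input chunk by chunk rather than loading the whole file:</p>
<pre><code class="javascript">
// A line-oriented Transform stream: drop the first n lines of the input.
var Transform = require('stream').Transform;

function deleteRows(n) {
  var t = new Transform();
  var seen = 0, buf = '';
  t._transform = function(chunk, enc, done) {
    buf += chunk.toString();
    var lines = buf.split('\n');
    buf = lines.pop(); // keep any partial trailing line for the next chunk
    lines.forEach(function(line) {
      if (seen++ >= n) this.push(line + '\n');
    }, this);
    done();
  };
  t._flush = function(done) {
    if (buf !== '' && seen++ >= n) this.push(buf + '\n');
    done();
  };
  return t;
}

// Unix style: pipe stdin through the transform and straight out again.
process.stdin.pipe(deleteRows(7)).pipe(process.stdout);
</code></pre>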
<p>We wanted to use the <a href="http://www.faqs.org/docs/artu/ch01s06.html">Unix philosophy</a> that teaches us to solve problems with cascades of simple, composable operations that manipulate streams, an approach which has proven almost <em>universally</em> effective.</p>
<p>Data Pipes brings the Unix philosophy and the Unix pipes style to online data. Any <a href="http://data.okfn.org/standards/csv">CSV</a> data can be piped through a cascade of transformations to produce a modified dataset, without ever downloading the data and with no need for your own backend. Being online means that the operations are immediately shareable and linkable.</p>
<h2 id="more-examples">More Examples</h2>
<p>Take, for example, this copy of a set of <a href="https://raw.github.com/okfn/datapipes/master/test/data/gla.csv">Greater London Authority financial data</a>. It’s unusable for most purposes, simply because it doesn’t abide by the CSV convention that the first line should contain the headers of the table. The header is preceded by six lines of useless commentary. Another problem is that the first column is totally empty.</p>
<p><img src="http://farm4.staticflickr.com/3824/9726020908_bb2d26b694.jpg" width="500" height="363" alt="Data pipes: Greater London Authority financial data, in the raw" /></p>
<p>First of all, let’s use the Data Pipes <code class="language-plaintext highlighter-rouge">html</code> operation to get a nicer-looking view of the table.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GET /csv/html/?url=http://static.london.gov.uk/gla/expenditure/docs/2012-13-P12-250.csv
</code></pre></div></div>
<p><img src="http://farm3.staticflickr.com/2827/9726020844_0301af2ded.jpg" width="500" height="213" alt="Data pipes: GLA data, HTML view" /></p>
<p>Now let’s get rid of those first six lines and the empty column. We can do this by chaining together the <code class="language-plaintext highlighter-rouge">delete</code> operation and the <code class="language-plaintext highlighter-rouge">cut</code> operation:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GET /csv/delete 0:6/cut 0/html/?url=http://static.london.gov.uk/gla/expenditure/docs/2012-13-P12-250.csv
</code></pre></div></div>
<p>And just like that, we’ve got a well-formed CSV!</p>
<p><img src="http://farm4.staticflickr.com/3728/9726020800_ff01da582e.jpg" width="500" height="177" alt="Data pipes: GLA data, trimmed" /></p>
<p>But why stop there? Why not take the output of that transformation and, say, search it for the string “LONDON” with the <code class="language-plaintext highlighter-rouge">grep</code> transform, then take just the first 20 entries with <code class="language-plaintext highlighter-rouge">head</code>?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GET /csv/delete 0:6/cut 0/grep LONDON/head -n 20/html/?url=http://static.london.gov.uk/gla/expenditure/docs/2012-13-P12-250.csv
</code></pre></div></div>
<p><img src="http://farm6.staticflickr.com/5505/9726020732_c5ca38c10a.jpg" width="500" height="370" alt="Data pipes: GLA data, final view" /></p>
<p>Awesome!</p>
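<p>And since a whole pipeline is just a URL, Data Pipes is easy to drive programmatically too. A minimal sketch in Node, reusing the exact pipeline above:</p>
<pre><code class="javascript">
// Build the pipeline URL from a list of operations and stream the result.
var http = require('http');

var ops = ['delete 0:6', 'cut 0', 'grep LONDON', 'head -n 20', 'html'];
var source = 'http://static.london.gov.uk/gla/expenditure/docs/2012-13-P12-250.csv';
var url = 'http://datapipes.okfnlabs.org/csv/' +
          ops.map(encodeURIComponent).join('/') +
          '?url=' + encodeURIComponent(source);

http.get(url, function(res) {
  res.pipe(process.stdout); // the transformed output, fresh from the web
});
</code></pre>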
<h2 id="whats-next">What’s next?</h2>
<p>Data Pipes already supports a useful collection of operations, but it’s still in development, and more are yet to come, including a find-and-replace operation (<code class="language-plaintext highlighter-rouge">sed</code>) plus support for <a href="https://github.com/okfn/datapipes/issues/21">arbitrary map and filter functions</a>.</p>
<p>You can see the full list <a href="http://datapipes.okfnlabs.org/">on the Data Pipes site</a>, and you can suggest more transforms to implement by <a href="https://github.com/okfn/datapipes/issues/new">raising an issue</a>.</p>
<p>Data Pipes needs more operations for its toolkit. That means its developers need to know what you do with data – and to think about how it can be broken down in the grand old Unix fashion. To join in, check out <a href="https://github.com/okfn/datapipes">Data Pipes on GitHub</a> and let us know what you think.</p>
Rufus Pollock
Miga, a new app generator for structured data
2013-08-27T00:00:00+00:00
http://okfnlabs.org/blog/2013/08/27/miga
<p>I’m pleased to announce the <a href="http://migadv.com/">Miga Data Viewer</a>, or Miga, an open source tool I created that lets you create a web/mobile app nearly automatically from a set of CSV data.</p>
<p>There are already various applications/frameworks that provide a JavaScript-enabled front-end for structured data - and not just structured data in general, but CSV data in particular. These include two published by the Open Knowledge Foundation itself, <a href="http://okfnlabs.org/recline/">Recline.js</a> and the related <a href="http://explorer.okfnlabs.org">Data Explorer</a>.</p>
<p>Miga is different from these and (I think) other tools in a few ways. Instead of presenting an aggregate view of the data, its interface is more similar to that of an app, where each entity/row has its own page. (There are also ways to view the data in aggregate, with maps and schedules and the like.)</p>
<p>There are some other features that Miga has that I don’t believe other data-browsing tools have at the moment, including the ability to browse multiple, linked tables (like having one file about countries and another about cities, where a column in the latter connects the two). But the most important difference is that Miga provides offline viewing: once a site, i.e. a dataset, has been accessed, it can be viewed even if the internet connection is lost. The offline capability is provided through two very useful technologies: Web SQL and Application Cache. (The use of Web SQL unfortunately means that Miga can’t be used on Firefox or Internet Explorer, though it works fine on all other major browsers, including mobile browsers.)</p>
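<p>For the curious, the Web SQL side boils down to the standard <code>openDatabase</code> API. A hypothetical sketch - not Miga’s actual code - of the cache-then-query pattern it enables:</p>
<pre><code class="javascript">
// Open (or create) a client-side database; rows stored here remain
// readable even after the network connection is lost.
var db = openDatabase('miga-demo', '1.0', 'Cached dataset', 2 * 1024 * 1024);

// Cache a row locally...
db.transaction(function(tx) {
  tx.executeSql('CREATE TABLE IF NOT EXISTS cities (name TEXT, country TEXT)');
  tx.executeSql('INSERT INTO cities (name, country) VALUES (?, ?)',
                ['Bogota', 'Colombia']);
});

// ...and query it back, online or offline.
db.readTransaction(function(tx) {
  tx.executeSql('SELECT name FROM cities WHERE country = ?', ['Colombia'],
                function(tx, results) {
                  for (var i = 0; i < results.rows.length; i++) {
                    console.log(results.rows.item(i).name);
                  }
                });
});
</code></pre>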
<p>I think Miga could potentially be useful for publishing a lot of different kinds of data - for both regular browsing and mobile app-style functionality. I would encourage anyone to try it out for themselves - if you have a set of data in CSV format, or can create it, the rest of the setup is not that hard.</p>
Yaron Koren
How a bad experience made an OKFN labs project
2013-08-22T00:00:00+00:00
http://okfnlabs.org/blog/2013/08/22/how-a-bad-experience-made-an-okfn-labs-project
<h2 id="from-theory-to-experimentation">From theory to experimentation</h2>
<p>Back in November 2010, I faced a problem while teaching my students about the Semantic Web. I wanted to convey the idea that Semantic Web technologies can break down the barriers between dataset silos on the Web and simplify the publication and consumption of open data. This idealistic idea was suddenly undermined when we moved from theory to practice. My exercise used open data to answer the following question: which films may have been biased by a partnership relation between the film director and a member of the cast? This question, when written in the SPARQL Semantic Web querying language, can be executed on the DBPedia SPARQL Endpoint (if you are curious about the results, visit <a href="http://bit.ly/14didjd">http://bit.ly/14didjd</a>). Based on this query I built a whole exercise for my students to discover the potential of the SPARQL language.</p>
<p>When the day came for my students to experiment with the amazing capabilities of Semantic Web technology, the DBPedia server (hosting the SPARQL Endpoint service used by this exercise) was down, leaving me to field awkward remarks like “it was too beautiful to be true”. As a result of that experience I made two decisions: 1) instead of using one exercise and one endpoint, I would provide three exercises using three different endpoints, to maximise my chances of having at least one running; and 2) I would develop an application that gives the real picture of one aspect of the Semantic Web architecture: the availability of SPARQL Endpoints.</p>
<h2 id="sparql-endpoint-status">SPARQL Endpoint Status</h2>
<p><img src="https://sites.google.com/site/pierreyvesvandenbussche/resources/SES.png" alt="SPARQL Endpoint Status logo" />
The SPARQL Endpoint Status application has been monitoring the publicly available SPARQL Endpoints listed on <a href="http://datahub.io/">datahub.io</a> for two and a half years. From this study, we can see in the following figure:</p>
<p><img src="https://sites.google.com/site/pierreyvesvandenbussche/resources/sparqles_fig1.png" alt="Evolution of the average endpoint availability between February 2011 and April 2013" /></p>
<p>that the mean endpoint availability has been decreasing over time; however, the mean trend is not followed by all endpoints. For instance, the DBPedia endpoint has always been above 90% availability, whereas the mean is dragged down by a growing number of offline endpoints, exemplified by the Kasabi NASA endpoint.</p>
<p>The variance of endpoint-availability profiles is illustrated by the distribution</p>
<p><img src="https://sites.google.com/site/pierreyvesvandenbussche/resources/sparqles_fig2.png" alt="Evolution of Endpoints number per availability rate between February 2011 and April 2013" /></p>
<p>where many endpoints fall into one of two extremes: 24.3% of endpoints are always down, whereas 31% of endpoints have an availability rate higher than 95%. The apparent overall decline in endpoint availability is possibly an effect of maturation. SPARQL is currently moving away from experimentation, leaving permanently offline endpoints in its wake (e.g. Kasabi endpoints) with fewer new <em>experimental</em> endpoints being reported. However, other endpoints (such as data.gov) are supported by well-established stakeholders (here, the U.S. government), and are part of a sustainable policy to deliver a high quality of service to end-user applications.</p>
<h2 id="an-okfn-labs-project">An OKFN Labs Project</h2>
<p>Thanks to a growing community interest and the support of the Open Knowledge Foundation, our project scope is now being extended to further monitor the performance, the discoverability and the interoperability of SPARQL Endpoints. A new version of the tool will soon be hosted by OKFN and will be presented during the ISWC 2013 conference.</p>
<p>As soon as the tool is up and running, we will announce it on the OKFN Labs blog, so stay tuned!</p>
Pierre-Yves Vandenbussche
ropenspending - accessing the OpenSpending API through R
2013-08-14T00:00:00+00:00
http://okfnlabs.org/blog/2013/08/14/R-openspending-accessing-the-openspending-API-in-R
<p>Tonight a couple of us were having a discussion on the
<a href="http://openspending.org">OpenSpending</a> IRC channel about how we can promote
and better document the usage of the API. Tony had already begun to work on
OpenSpending using <a href="http://r-project.org">R</a>. I had previously done so as
well. This prompted me to set out and create an R package:
<code class="language-plaintext highlighter-rouge">ropenspending</code>.</p>
<p><code class="language-plaintext highlighter-rouge">ropenspending</code> aims at making it easier to access the OpenSpending API
from within R. It provides access to certain bits of the API - most
importantly the aggregate function (through <code class="language-plaintext highlighter-rouge">openspending.aggregate</code>).
While it is still in its infancy, it is functional and can be obtained on
<a href="http://github.com/mihi-tr/r-openspending">github</a>. I’ll work to push this
to CRAN as well (as soon as I figure out how to do that).</p>
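<p>Under the hood, an aggregate query is just an HTTP GET against the OpenSpending API. A rough sketch of that raw call in Node - the endpoint path, parameters, dataset name and response fields are written from memory of the API docs, so treat the details as assumptions:</p>
<pre><code class="javascript">
// Fetch aggregated spending figures straight from the API (assumed endpoint).
var http = require('http');

var url = 'http://openspending.org/api/2/aggregate' +
          '?dataset=ukgov-finances-cra&drilldown=year';

http.get(url, function(res) {
  var body = '';
  res.on('data', function(chunk) { body += chunk; });
  res.on('end', function() {
    var result = JSON.parse(body);
    console.log(result.summary); // assumed field holding the totals
  });
});
</code></pre>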
Michael Bauer
Diffing and patching tabular data
2013-08-08T00:00:00+00:00
http://okfnlabs.org/blog/2013/08/08/diffing-and-patching-data
<p>A few years ago at the Eastern Conference for
Workplace Democracy in New Hampshire, a bunch of friends
chatting on a grassy knoll
realized they were all working on overlapping directories of their
communities, and decided to pool their efforts.
They tracked down some techies (I was one) and set them
to work building a directory website. Someone should have warned
them about techies.</p>
<p><img src="http://imgs.xkcd.com/comics/the_general_problem.png" alt="From http://xkcd.com/974" title="From http://xkcd.com/974" /></p>
<p>Eight years later, okay, there is a <a href="http://find.coop">directory website</a>, but the project has
morphed into something a lot more ambitious:</p>
<ul>
<li>A full-blown co-op to deal with the cultural and legal side of data sharing. This is the <a href="http://datacommons.find.coop">Data Commons Co-op</a>.</li>
<li>A growing toolbox to deal with the technological side of data sharing, specifically how to have fun (rather than get depressed) collaborating on data projects. This is the <a href="https://github.com/paulfitz/coopy">Coopy Toolbox</a>.</li>
</ul>
<p>We, like <a href="http://blog.okfn.org/2010/07/12/we-need-distributed-revisionversion-control-for-data/">others</a> in the Open Data world, have been asking: where’s the
git(hub) for data? More fundamentally, where’s the <code class="language-plaintext highlighter-rouge">diff</code> and <code class="language-plaintext highlighter-rouge">patch</code>
programs for data? Where’s something like <code class="language-plaintext highlighter-rouge">diff3</code> for doing 3-way
merges? Can we bring the whole free and open toolchain of diffing,
patching, merging, and version control to the world of data?</p>
<h2 id="the-toolchain-is-there-already-for-some">The toolchain is there already, for some</h2>
<p>Fun data collaboration is possible today by inventive use of
existing tools, as
Rufus Pollock <a href="http://blog.okfn.org/2013/07/02/git-and-github-for-data/">has noted</a>.
Here’s an example
of a pull request found in the wild, made to a repository on github
that tracks some bus routes in Iceland in regular CSV files:</p>
<blockquote>
<p><a href="https://github.com/gudmundur/straeto-data/pull/4">https://github.com/gudmundur/straeto-data/pull/4</a><br />
“Fixed stops on the wrong side”</p>
</blockquote>
<p><img src="/img/coopy-bus-stop.png" alt="patching a bus schedule" /></p>
<p>I’m a bit embarrassed to remember how excited I was to
stumble across this real live data-oriented pull request. I ran around showing people, saying
“this!” “this!” (One of those moments when
you realize: you really are a nerd). Some asked, why is this better than,
for example, a shared spreadsheet online, edited live? For the same reason that <code class="language-plaintext highlighter-rouge">git</code> and <code class="language-plaintext highlighter-rouge">hg</code> were so much more
exciting than <code class="language-plaintext highlighter-rouge">svn</code> and <code class="language-plaintext highlighter-rouge">cvs</code>. The awful artificial problem of
who to bestow write-access upon and who to keep outside the clique
just evaporates,
and the equivalent of social coding kicks in.</p>
<p>There are definitely some technical drawbacks to working this way today though. For example: what happens when things go wrong? A poor merge
or a merge conflict in a text file still results in a text file, which
a user can edit as usual to fix up. Text file in, text file out. But a poor
merge or a conflict in a CSV file can leave you with an invalid CSV
file with missing/surplus columns on some rows, or with conflict
markers inserted. This bumps the user out of
Gnumeric/LibreOffice/Excel/Sqlite/… or whatever they are using to edit the
table, and leaves them staring at randomly garbled text. Another problem: column changes look awful in line-oriented diffs, which isn’t
a deal-breaker but which is certainly a pity. We can do better.</p>
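<p>For instance, a conflicted CSV can come back looking like this, with the standard conflict markers dropped mid-table - no longer valid CSV, and baffling inside a spreadsheet program (the rows here are made up from the example further below):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Name,Address,City
Block 11,11 Bow St,Somerville
&lt;&lt;&lt;&lt;&lt;&lt;&lt; HEAD
Borderlands Café,870 Valencia St,San Francisco
=======
Bordelands Cafe,870 Valencia St,San Francisco
&gt;&gt;&gt;&gt;&gt;&gt;&gt; their-branch
Cartel Coffee Lab,225 W University Dr,Tempe
</code></pre></div></div>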
<h2 id="a-diff-for-tabular-data">A diff for (tabular) data</h2>
<p>There didn’t seem to be any neutral format for comparing tables out
there when I went looking. SQL could be abused to serve (DELETE
clauses to express removed rows, INSERTs for added rows, UPDATEs for
modified cells, etc.) but I couldn’t see that leading anywhere happy.
My first idea was to express diffs as tables in CSV form, as a list
of operations. <a href="http://share.find.coop/doc/patch_format_csv_v_0_2.html">This was ugly</a>. Then Joe Panico of <a href="http://www.diffkit.org">http://www.diffkit.org</a> and I hammered out something we
called <a href="http://share.find.coop/doc/patch_format_tdiff.html">TDIFF</a>,
“tabular diff format”, very much inspired by classic diffs, with added
awareness of columns. This was better, but still felt a bit clunky.
Finally, I settled on what now seems obvious:
a “<a href="http://share.find.coop/doc/spec_hilite.html">highlighter</a>” format
that is just the original table with stylized editing marks to show
changes (and large chunks of unchanged material removed).</p>
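<p>Since the action markers live in an ordinary leading column, the same diff can also be written out as plain CSV. A simplified sketch of the example below, trimmed to two data columns, might look roughly like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@@,Name,City
...,...,...
→,Bordelands Cafe→Borderlands Café,San Francisco
---,Five Elephant,Berlin
+++,Sandwich Theory,Montclair
</code></pre></div></div>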
<p>For example, here is a highlighter data diff against Jessica Lord’s
<a href="http://jlord.github.io/hack-spots/">Hack Spots</a>
<a href="https://docs.google.com/spreadsheet/ccc?key=0Ao5u1U6KYND7dFVkcnJRNUtHWUNKamxoRGg4ZzNiT3c#gid=0">spreadsheet</a> at the time of writing (Hack Spots is a list
of hacking-friendly coffee shops and the like, demoing <a href="http://jlord.github.io/sheetsee.js/">sheetsee.js</a>):</p>
<table style="border-collapse:collapse">
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">@@</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Contributer's Twitter</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Name</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Address</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">City</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">State</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">long</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">lat</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Country</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Wifi Password</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Outlets</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Couch</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Large Table</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Brewing</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Outdoor Seating</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">hexcolor</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">cwmma</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Block 11</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">11 Bow St</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Somerville</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">MA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">-71.096974</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">42.380881</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Intelligentsia</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; background-color:#7f7fff" bgcolor="#7f7fff">→</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">thomaslevine</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7f7fff" bgcolor="#7f7fff">Bordelands Cafe→Borderlands Café</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">870 Valencia St</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">San Francisco</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">CA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">-122.42151</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">37.759031</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">open</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">?</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">?</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">coffee</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">lukekarrys</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Cartel Coffee Lab</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">225 W University Dr</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Tempe</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">AZ</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">-111.942978</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">33.421907</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">espresso</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">In-house</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">thomaslevine</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">El Beit</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">158 Bedford Ave</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Brooklyn</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">NY</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">-73.956847</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">40.718529</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">brooklyn</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">few</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">?</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">---</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">hij1nx</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">Five Elephant</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">Reichenberger Straße 101, 10999</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">Berlin</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">Berlin</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">13.43829</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">52.493365</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">DE</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f; color:#888" bgcolor="#ff7f7f"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">5 Elephant</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">uhduh</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Gangplank</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">260 S Arizona Ave</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Chandler</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">AZ</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">-111.841302</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">33.244008</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">walktheplank</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">sfrdmn</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Noisebridge</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">2169 Mission St</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">San Francisco</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">CA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">-122.419161</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">37.762372</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Open</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">BYOC</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">possibly</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">+++</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">fitzyfitzyfitzy</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">Sandwich Theory</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">590 Valley Rd</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f"> Montclair</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f"> NJ</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">-74.208086</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">40.840497</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">sandwich</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">coffee</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">#B9FCFC</td>
</tr>
</table>
<p>This was generated by taking the Hack Spots spreadsheet, editing it
in <code class="language-plaintext highlighter-rouge">gnumeric</code>, then comparing with the original using <a href="https://npmjs.org/package/coopyhx">coopyhx</a>.
I corrected a typo, added an entry for Sandwich Theory in my
neighborhood, and - completely accidentally - deleted the
entry for Five Elephant.</p>
<p>Rather than editing the Hack Spots spreadsheet directly in Google
Docs, in my ideal world I’d send a pull request (and someone would catch
my Five Elephant goof).</p>
<p>So far, the diff we have is just row-based.
Suppose I also added a column for the location’s website, and deleted
the password column (rather missing the whole point) - the
highlighter data diff would now look like this:</p>
<table style="border-collapse:collapse">
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; background-color:#aaa" bgcolor="#aaa">!</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa; color:#888" bgcolor="#aaa"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa; color:#888" bgcolor="#aaa"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa" bgcolor="#aaa">+++</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa; color:#888" bgcolor="#aaa"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa; color:#888" bgcolor="#aaa"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa; color:#888" bgcolor="#aaa"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa; color:#888" bgcolor="#aaa"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa; color:#888" bgcolor="#aaa"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa; color:#888" bgcolor="#aaa"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa" bgcolor="#aaa">---</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa; color:#888" bgcolor="#aaa"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa; color:#888" bgcolor="#aaa"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa; color:#888" bgcolor="#aaa"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa; color:#888" bgcolor="#aaa"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa; color:#888" bgcolor="#aaa"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa; color:#888" bgcolor="#aaa"></td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">@@</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Contributer's Twitter</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Name</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Website</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Address</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">City</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">State</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">long</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">lat</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Country</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Wifi Password</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Outlets</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Couch</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Large Table</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Brewing</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Outdoor Seating</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">hexcolor</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">cwmma</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Block 11</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f; color:#888" bgcolor="#7fff7f"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">11 Bow St</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Somerville</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">MA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">-71.096974</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">42.380881</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f; color:#888" bgcolor="#ff7f7f"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Intelligentsia</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; background-color:#7f7fff" bgcolor="#7f7fff">→</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">thomaslevine</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7f7fff" bgcolor="#7f7fff">Bordelands Cafe→Borderlands Café</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f; color:#888" bgcolor="#7fff7f"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">870 Valencia St</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">San Francisco</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">CA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">-122.42151</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">37.759031</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">open</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">?</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">?</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">coffee</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">lukekarrys</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Cartel Coffee Lab</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f; color:#888" bgcolor="#7fff7f"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">225 W University Dr</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Tempe</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">AZ</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">-111.942978</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">33.421907</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">espresso</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">In-house</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">thomaslevine</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">El Beit</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f; color:#888" bgcolor="#7fff7f"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">158 Bedford Ave</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Brooklyn</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">NY</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">-73.956847</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">40.718529</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">brooklyn</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">few</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">?</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">---</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">hij1nx</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">Five Elephant</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f; color:#888" bgcolor="#ff7f7f"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">Reichenberger Straße 101, 10999</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">Berlin</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">Berlin</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">13.43829</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">52.493365</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">DE</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f; color:#888" bgcolor="#ff7f7f"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">5 Elephant</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">uhduh</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Gangplank</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f; color:#888" bgcolor="#7fff7f"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">260 S Arizona Ave</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Chandler</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">AZ</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">-111.841302</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">33.244008</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">walktheplank</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">sfrdmn</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Noisebridge</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f; color:#888" bgcolor="#7fff7f"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">2169 Mission St</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">San Francisco</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">CA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">-122.419161</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">37.762372</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">Open</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">BYOC</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">possibly</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">+++</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">fitzyfitzyfitzy</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">Sandwich Theory</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">www.sandwichtheory.com</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">590 Valley Rd</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">Montclair</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">NJ</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">-74.208086</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">40.840497</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f; color:#888" bgcolor="#ff7f7f"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">coffee</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">#B9FCFC</td>
</tr>
</table>
<p>The highlighter format is tabular, and designed to be as simple as I
could make it without introducing ambiguity. The first column in a
highlighter diff is called the “action” column, containing marks
meaning “inserted row”, “deleted row”, “modified row”, etc. Remaining
columns are drawn from either or both of the tables being compared.
If there are column differences to note, as there are here, an extra row called the
“schema row” is inserted, which has marks for the inserted, deleted,
or otherwise modified columns.
The whole diff can be transmitted
safely in CSV, then optionally formatted for prettiness using some
mechanical rules.</p>
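<p>To make that concrete, here is a minimal sketch - in Python, entirely separate from Coopy’s actual code - of how two versions of a keyed table could be turned into action-column rows using the marks above. The function name, the choice of key, and the demo data are all illustrative.</p>
<pre><code class="language-python"># Toy "highlighter diff": compare two versions of a table, keyed on one
# column, and emit rows whose first cell is the action column. Only a
# sketch of the idea described above; the real format also has schema
# rows, context ("...") rows, and handles column/row reordering.

def highlighter_diff(old_rows, new_rows, key, columns):
    old = {r[key]: r for r in old_rows}
    new = {r[key]: r for r in new_rows}
    diff = [["@@"] + columns]                      # header row
    for k in sorted(set(old) | set(new)):
        if k not in new:                           # deleted row
            diff.append(["---"] + [old[k][c] for c in columns])
        elif k not in old:                         # inserted row
            diff.append(["+++"] + [new[k][c] for c in columns])
        elif old[k] != new[k]:                     # modified row
            row = ["→"]
            for c in columns:
                a, b = old[k][c], new[k][c]
                row.append(a if a == b else a + "→" + b)
            diff.append(row)
    return diff                                    # unchanged rows omitted

old_rows = [{"Name": "Five Elephant", "City": "Berlin", "Wifi": "yes"}]
new_rows = [{"Name": "Five Elephant", "City": "Berlin", "Wifi": "no"},
            {"Name": "Gangplank", "City": "Chandler", "Wifi": "yes"}]
for row in highlighter_diff(old_rows, new_rows, "Name", ["Name", "City", "Wifi"]):
    print(",".join(row))       # the whole diff is itself valid CSV
</code></pre>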
<p>There are plenty of other details, but that is the basic flavor of the
Coopy highlighter diff format today. You can <a href="http://share.find.coop/doc/spec_hilite.html">read more about
it</a> or
try out two different implementations live, a
<a href="http://paulfitz.github.io/coopyhx/">Javascript implementation</a>
and a
<a href="http://share.find.coop/">C++ implementation</a>.
You can also get a feel for using this kind of diff in a workflow
at <a href="http://growrows.com/">GrowRows.com</a>. Please send bug reports, or
ideas for better alternatives!</p>
<h2 id="dealing-with-conflict">Dealing with conflict</h2>
<p>What happens if two people make conflicting changes to a table?
A regular text-based merge would stick in <code class="language-plaintext highlighter-rouge">>>>>>>></code> <code class="language-plaintext highlighter-rouge">========</code> <code class="language-plaintext highlighter-rouge"><<<<<<<</code>
blocks, which would destroy our table’s structure. I’ve played
with a few ways to do better. The method I’m happiest with
so far is to report on conflicts in an extension of the
highlighter diff format that shows the alternate updates
possible. Imagine if, as I fixed the spelling of “Bordelands Cafe”
to “Borderlands Café”,
someone else had already corrected it to
the slightly different “Borderlands Cafe”. So the diff would be:</p>
<table style="border-collapse:collapse">
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">@@</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Contributer's Twitter</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Name</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Address</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">City</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">State</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">long</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">lat</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Country</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Wifi Password</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Outlets</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Couch</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Large Table</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Brewing</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Outdoor Seating</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">hexcolor</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">cwmma</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Block 11</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">11 Bow St</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Somerville</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">MA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">-71.096974</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">42.380881</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Intelligentsia</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; background-color:#f00" bgcolor="#f00">→</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">thomaslevine</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#f00" bgcolor="#f00">Bordelands Cafe→Borderlands Cafe→Borderlands Café</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">870 Valencia St</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">San Francisco</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">CA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">-122.42151</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">37.759031</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">open</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">?</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">?</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">coffee</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">lukekarrys</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Cartel Coffee Lab</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">225 W University Dr</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Tempe</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">AZ</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">-111.942978</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">33.421907</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">espresso</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">In-house</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
</tr>
</table>
<p>Resolving the conflict amounts to just editing the diff, deleting the parts you
don’t want and keeping the parts you do.
You can get a sense for how this works by testing on
<a href="http://paulfitz.github.io/coopyhx/">http://paulfitz.github.io/coopyhx/</a>. Be sure to select
“Use 3-way comparison” option, which will set up two
versions of a table with a shared “common ancestor”.
Double-click on cells in the diff to view their plain “CSV”
representation, and edit them.</p>
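<p>Here is a minimal sketch of the cell-level rule behind that conflict display. The merge logic follows the usual 3-way convention; the ordering of the two updates inside the conflict cell is my guess at the format’s convention.</p>
<pre><code class="language-python"># Toy 3-way merge for a single cell. Agreeing or one-sided edits merge
# cleanly; a genuine conflict is encoded as "ancestor→theirs→mine" in
# the cell, for hand-resolution later. The ordering inside the
# conflict cell is an assumption, not taken from the Coopy spec.

def merge_cell(ancestor, theirs, mine):
    if theirs == mine:        # both sides agree (or neither changed)
        return mine
    if theirs == ancestor:    # only my side changed the cell
        return mine
    if mine == ancestor:      # only their side changed the cell
        return theirs
    return "→".join([ancestor, theirs, mine])      # conflict cell

print(merge_cell("Bordelands Cafe", "Borderlands Cafe", "Borderlands Café"))
# Bordelands Cafe→Borderlands Cafe→Borderlands Café
</code></pre>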
<h2 id="full-blown-revision-control">Full-blown revision control</h2>
<p>I’ve tried two methods to build revision control with all this:</p>
<ul>
<li>Modifying <a href="http://www.fossil-scm.org"><code class="language-plaintext highlighter-rouge">fossil</code></a>, a distributed revision
control system with beautifully compact and hackable source code,
to use tabular diffs and merges natively. The result is <code class="language-plaintext highlighter-rouge">ssfossil</code>
(“spreadsheet fossil”) in the Coopy toolbox.</li>
<li>Using custom diff and merge drivers with <code class="language-plaintext highlighter-rouge">git</code>, to achieve a similar
result. A tutorial for doing this is at <a href="http://share.find.coop/doc/tutorial_git.html">http://share.find.coop/doc/tutorial_git.html</a>.</li>
</ul>
<p>Both approaches share the same features:</p>
<ul>
<li>No change to how the SCM stores data internally. For example,
<code class="language-plaintext highlighter-rouge">fossil</code> will continue using
its <a href="http://www.fossil-scm.org/xfer/doc/trunk/www/delta_format.wiki">delta encoding</a>, likewise <code class="language-plaintext highlighter-rouge">git</code> (technically in pack files only).</li>
<li>The <em>visualization</em> of diffs changes, and how merges happen. This is good, since changes that would conflict in text-file world may well <em>not</em> conflict in tabular world, and we are guaranteed to always have valid tables.</li>
</ul>
<p>Until we make more radical changes to the SCM itself, it makes
sense to store tables in a text format. Formats I’ve experimented
with are:</p>
<ul>
<li>CSV. Simple, globally understood. But just a table.</li>
<li>CSVS. I made this up. It is an extension to CSV with multiple tables,
an unambiguous spot for column names, and a place for table names. <a href="https://github.com/paulfitz/coopy/blob/master/tests/fold/contacts.csvs">Looks like this</a>.</li>
<li>Sqlitext, pronounced “Sqlite Text”. I made this up. This is a text dump
of an Sqlite database, with consistent ordering of rows. With
careful use of <code class="language-plaintext highlighter-rouge">clean</code> and <code class="language-plaintext highlighter-rouge">smudge</code> filters, a “live” Sqlite
database can be stored in <code class="language-plaintext highlighter-rouge">git</code>, using this format as
an intermediate. This has the nice property of storing more
meta-data (keys, references, etc.). A rough sketch of the idea follows this list.</li>
<li>SocialCalc. A text format for representing spreadsheets used
by <a href="https://github.com/DanBricklin/socialcalc">SocialCalc</a> and
inherited by Audrey Tang’s <a href="http://ethercalc.org">http://ethercalc.org</a>. Stores table
formatting and other good stuff.</li>
</ul>
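<p>The consistent-ordering idea behind Sqlitext is easy to approximate. Here is a rough sketch using Python’s standard <code class="language-plaintext highlighter-rouge">sqlite3</code> module - not the real Sqlitext format (the quoting is naive and no keys or references are recorded), but it shows why such a dump diffs well as text.</p>
<pre><code class="language-python"># Rough approximation of the "Sqlitext" idea: dump an Sqlite database
# as text with a consistent ordering of rows, so that line-based diffs
# of the dump stay meaningful. The real format records more metadata,
# and repr() is a stand-in for proper SQL quoting.
import sqlite3

def dump_ordered(path):
    con = sqlite3.connect(path)
    lines = []
    tables = con.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type='table' ORDER BY name").fetchall()
    for name, sql in tables:
        lines.append(sql + ";")
        cols = [r[1] for r in con.execute("PRAGMA table_info(%s)" % name)]
        order = ", ".join(cols)                 # order by every column
        for row in con.execute("SELECT * FROM %s ORDER BY %s" % (name, order)):
            values = ", ".join(repr(v) for v in row)
            lines.append("INSERT INTO %s VALUES (%s);" % (name, values))
    con.close()
    return "\n".join(lines)
</code></pre>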
<p>Conspicuously absent from this list are common formats like those
of Excel. We need one more trick to deal with those.</p>
<h2 id="the-last-mile">The last mile</h2>
<p>Complicated spreadsheets are not great candidates for version control
as I’ve imagined it so far, since we don’t have a way to diff/merge
non-data features. So arbitrary spreadsheets in Gnumeric,
LibreOffice, and other programs (for simplicity I’m going
to call all these programs “Excel” from now on, forgive me)
with charts and formulae
aren’t really in our scope. But simple spreadsheets, just storing data
without anything fancy, can be very useful. And Excel is certainly a
convenient, familiar editor for tables.</p>
<p>Putting an Excel file in a git/fossil repository won’t lead anywhere good.
But what we can do is this:</p>
<ul>
<li>We use git/fossil/… to do version control on data in a
version-control-friendly format.</li>
<li>We keep that data in sync with an Excel file
using a merge method that preserves formatting as much as possible. The Excel
file is never regenerated from scratch (except perhaps once,
on initial cloning), but instead incrementally patched - see the sketch after this list.</li>
</ul>
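<p>The incremental patching step might look something like the sketch below. I’m using the <code class="language-plaintext highlighter-rouge">openpyxl</code> library purely for illustration - Coopy itself is C++ and handles more formats - and the point is simply that only the changed cells are written back, so the workbook’s formatting survives.</p>
<pre><code class="language-python"># Sketch of incrementally patching a spreadsheet instead of
# regenerating it: load the existing workbook, overwrite only the
# cells a diff says have changed, and save. Formatting attached to
# untouched cells (column widths, fonts, ...) is left alone.
# openpyxl and the file/sheet names are illustrative choices.
from openpyxl import load_workbook

def patch_workbook(path, changes):
    """changes: list of (sheet, row, column, new_value), 1-based."""
    wb = load_workbook(path)
    for sheet, row, column, value in changes:
        wb[sheet].cell(row=row, column=column, value=value)
    wb.save(path)

patch_workbook("numbers.xlsx", [("numbers", 2, 3, 42)])
</code></pre>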
<p>In principle a modified SCM could collapse these steps but we’re
definitely not there yet. So what I’ve written is a program called
(perhaps confusingly) <code class="language-plaintext highlighter-rouge">Coopy</code> that handles the end-to-end work
of versioning Excel (and similar) files. Here is Coopy cloning
a repository with a single table in it called “numbers” (the user
needed a URL for the repository in order to do this; see the
manual at <a href="http://share.find.coop/doc/CoopyGuide.pdf">http://share.find.coop/doc/CoopyGuide.pdf</a> for a complete
walkthrough):</p>
<p><img src="/img/coopy-clone.png" alt="x" /></p>
<p>I won’t win any awards for UI design, I know. At this point,
under the hood, the repository is checked out on the user’s machine,
with data in a neutral format. The list of tables is shown, in this
example just a table called “numbers”. When the user selects that
table for the very first time, they are prompted to save it:</p>
<p><img src="/img/coopy-save-table.png" alt="x" /></p>
<p>They can choose the format to save the data, for example in
an Excel-compatible format. The appropriate conversion
happens, and the file opens in an appropriate editor (<code class="language-plaintext highlighter-rouge">gnumeric</code> for
me):</p>
<p><img src="/img/coopy-save-xls.png" alt="x" /></p>
<p>We can now go ahead and edit the table at will. When we’re ready,
in Coopy, we click “push out”. We’ll be prompted for a commit
message describing the changes, and (the first time) where to actually
push to:</p>
<p><img src="/img/coopy-commit.png" alt="x" /></p>
<p>From then on, “pulling in” and “pushing out” will act as if they are
operating on the spreadsheet, with local formatting being preserved
even if no format information is in fact being stored in the neutral
repository format. It is perhaps hard to see why that is important,
but imagine how annoying it would be if, for example, the column sizes
of a spreadsheet kept getting reset every time you pulled in a
collaborator’s changes.</p>
<p>There’s a lot more to say, but a key point is that we could now have
one person editing a table in Excel, another in Gnumeric, another
tweaking it using Sqlite, and the whole thing being periodically
sync’d to a MySQL database on a webserver. Fun!</p>
<h2 id="the-power-of-patching">The power of patching</h2>
<p>Stepping back from full-on revision control, I’d like to mention
something nice that popped out of this that I hadn’t anticipated.
Once you have diff + patch, you can play games like this:</p>
<ul>
<li>Store data in some form optimized for machine access, e.g. a MySQL database
with carefully chosen keys, indexes, cross-references etc.</li>
<li>Export part of data to some easy-to-edit form, e.g. a spreadsheet.</li>
<li>Make changes in the spreadsheet.</li>
<li>Generate a diff for that spreadsheet.</li>
<li>Apply that diff as a patch to the original data store e.g. in MySQL.</li>
</ul>
<p>The export step here will usually blow away all sorts of meta-data
vital to the database. It may also scramble stuff due to type mismatches
or other muddles. But remember, the patch will get applied with all the original meta-data
available, so things work out just fine more often than I expected. I’m
excited to push forward on reducing the irreversibility of data exports. Today, as
soon as a format conversion happens, fixes to the converted data are much
less likely to ever make it back to the original source. This is sad, and
can’t be allowed to continue.</p>
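<p>The last step - turning a tabular diff back into SQL, which is what <code class="language-plaintext highlighter-rouge">ssrediff</code> does (see the appendix) - can be sketched as follows. The diff structure here (table name, primary-key value, changed cells) is hypothetical and only covers updates, not inserts, deletes, or schema changes.</p>
<pre><code class="language-python"># Minimal sketch of applying a cell-level diff back to a database, in
# the spirit of ssrediff's "diff to SQL" conversion. The diff shape is
# hypothetical; values are bound as parameters rather than pasted in.
import sqlite3

def apply_patch(con, table, key_column, updates):
    """updates: dict mapping a primary-key value to {column: new_value}."""
    for key, cells in updates.items():
        for column, value in cells.items():
            con.execute(
                "UPDATE %s SET %s = ? WHERE %s = ?" % (table, column, key_column),
                (value, key))
    con.commit()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cafes (name TEXT PRIMARY KEY, wifi TEXT)")
con.execute("INSERT INTO cafes VALUES ('Borderlands Cafe', 'open')")
apply_patch(con, "cafes", "name", {"Borderlands Cafe": {"wifi": "closed"}})
</code></pre>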
<h2 id="next-steps">Next steps</h2>
<p>There’s so much to do it is hard to summarize. And this is all just
one piece of the puzzle, for one kind of data (the carefully curated
kind listing your neighborhood wifi hotspots, not the
gigabyte-per-minute stream from a set of temperature sensors). Here
are just some of the things that need doing:</p>
<ul>
<li>Nail down the diff format or formats. There are wobbly areas such as
column/row reordering. It’d also be exciting to
use cross-references between tables when known (e.g. in Sqlite, and
maybe <a href="http://data.okfn.org/standards/simple-data-format">Simple Data Format</a> in the future)
to produce more meaningful diffs on relational data.</li>
<li>Support more data formats, more completely. I’ve just been scratching
my own random itches so far.</li>
<li>Get really really solid community-tested implementations of diffing, patching, merging etc.</li>
<li>Get nice repository hosting in place. In fact, <code class="language-plaintext highlighter-rouge">fossil</code> is interesting
for that since it is self-hosting, but it is a bit geeky for
non-programmers. I’m hoping to push <a href="http://GrowRows.com">http://GrowRows.com</a> in this
direction. And of course, GitHub works, and would work even better
if they supported data diffs.</li>
<li>Think about how systematic data transformations might be
handled better, if they can be handled at all. Personally,
I’m waiting for <a href="https://github.com/maxogden/dat"><code class="language-plaintext highlighter-rouge">dat</code></a>.</li>
</ul>
<p>This is all frankly way too much for me. Help!</p>
<h2 id="appendix-a-list-of-the-main-coopy-tools">Appendix: a list of the main Coopy tools</h2>
<p>The Coopy toolbox (<a href="http://share.find.coop">website</a>, <a href="https://github.com/paulfitz/coopy">repository</a>, <a href="http://share.find.coop/doc/CoopyGuide.pdf">manual</a>) contains the following utilities:</p>
<ul>
<li><a href="http://share.find.coop/doc/ssdiff.html"><code class="language-plaintext highlighter-rouge">ssdiff</code></a>: Show the difference between two tables/databases/spreadsheets.</li>
<li><a href="http://share.find.coop/doc/sspatch.html"><code class="language-plaintext highlighter-rouge">sspatch</code></a>: Modify a table/database/spreadsheet to apply the changes described in a pre-computed difference.</li>
<li><a href="http://share.find.coop/doc/ssmerge.html"><code class="language-plaintext highlighter-rouge">ssmerge</code></a>: Integrate changes in table/database/spreadsheets that have a common ancestor.</li>
<li><a href="http://share.find.coop/doc/ssresolve.html"><code class="language-plaintext highlighter-rouge">ssresolve</code></a>: Select a particular resolution to a merge conflict.</li>
<li><a href="http://share.find.coop/doc/ssformat.html"><code class="language-plaintext highlighter-rouge">ssformat</code></a>: Convert tables/databases/spreadsheets from one format to another.</li>
<li><a href="http://share.find.coop/doc/ssrediff.html"><code class="language-plaintext highlighter-rouge">ssrediff</code></a>: Convert a diff from one format to another (for example,
from highlighter format to a sequence of SQL instructions).</li>
<li><a href="http://share.find.coop/doc/ssfossil.html"><code class="language-plaintext highlighter-rouge">ssfossil</code></a>: A lightly modified version of <a href="http://www.fossil-scm.org"><code class="language-plaintext highlighter-rouge">fossil</code></a> to use ssmerge’s 3-way
merge algorithm on data.</li>
<li><a href="http://share.find.coop/doc/coopy.html"><code class="language-plaintext highlighter-rouge">Coopy</code></a>: A first pass at a user interface for versioning Excel and other
non-textual formats.</li>
</ul>
<p>The toolbox is written in C++. Recently I’ve ported some of the core
parts of the toolbox to a JavaScript (via Haxe) implementation.
This port is called coopyhx (<a href="http://paulfitz.github.io/coopyhx/">website</a>, <a href="https://github.com/paulfitz/coopyhx">repository</a>). The reimplementation
is better in several respects than the original (need to merge them!), but
supports far fewer formats. The port contains:</p>
<ul>
<li>The <a href="https://npmjs.org/package/coopyhx"><code class="language-plaintext highlighter-rouge">coopyhx</code></a> program, which is a stripped down
version of <code class="language-plaintext highlighter-rouge">ssdiff</code> and <code class="language-plaintext highlighter-rouge">sspatch</code>, operating only on basic CSV/JSON tables
and the highlighter diff format.</li>
<li>A JavaScript library for diffing and patching, suitable for in-browser use.</li>
<li>A render function for converting highlighter diffs in CSV format into
pretty HTML.</li>
<li>A render function for <a href="http://handsontable.com/">handsontable</a> to allow online
editing of diffs in a pretty format.</li>
</ul>
<p>Awkwardly, there’s also an entirely separate Ruby implementation (<a href="https://github.com/paulfitz/coopy/tree/master/rb_coopy">source</a>, <a href="https://rubygems.org/gems/coopy">gem</a>), strictly limited to Sqlite, that was written for use on <a href="https://scraperwiki.com/">ScraperWiki</a> (classic).</p>
<p>Related websites:</p>
<ul>
<li>
<p><a href="http://growrows.com">http://growrows.com</a>, a start at a service for crowd-sourcing tables, using diffs and patches (without calling them that).</p>
</li>
<li>
<p><a href="http://datacommons.find.coop/vision">http://datacommons.find.coop/vision</a>, the Data Commons Co-op, incorporated in July 2012 in Massachusetts.
The co-op has 20 member organizations, mostly in the US, a couple in Canada, plus one recently in the UK.
This co-op specializes in archiving, correlating, and disseminating data about alternative economic activity,
and needs lots of software that doesn’t quite exist yet!</p>
</li>
</ul>
Paul Fitzpatrick
Mapping Antimatter tracks with CrowdCrafting.org
2013-08-06T00:00:00+00:00
http://okfnlabs.org/blog/2013/08/06/mapping-antimatter-with-crowdcrafting
<p>This last weekend, CERN hosted a very special event: <a href="http://www.citizencyberscience.net/wiki/index.php?title=Main_Page">the 2nd CERN Summer Student Webfest</a> organized by the <a href="http://www.citizencyberscience.net/">Citizen Cyberscience Centre</a>.</p>
<p><img src="http://www.citizencyberscience.net/wiki/images/1/1b/Cernwebfest.png" alt="CERN Summer Student Webfest logo" /></p>
<p>The Webfest invites CERN summer students to participate in a 48-hour marathon hacking new applications, tools, games, etc. about physics. This year, I participated and worked on a very interesting one: <a href="http://crowdcrafting.org/app/antimatter/">the Antimatter project</a>.</p>
<p>With a team of around 8 people, we divided the work into different areas and learned about the project and the goals of the CrowdCrafting application.</p>
<p>Michael Doser, a CERN physicist and the spokesperson of the <a href="http://aegis.web.cern.ch/aegis/">AEgIS experiment</a>, is studying antimatter.</p>
<iframe width="640" height="360" src="//www.youtube-nocookie.com/embed/8PXSQjjsPUo?rel=0" frameborder="0" allowfullscreen=""></iframe>
<h2 id="but-what-is-antimatter">But, what is antimatter?</h2>
<p>The observable Universe is composed almost entirely of matter but we can produce stuff called antimatter in the lab. Antimatter is material composed of antiparticles. So for example, a positron (the antiparticle of an electron) combines with an antiproton to form an antihydrogen atom.</p>
<p>Antiparticles have the same mass as normal matter particles but the opposite charge. When an antiparticle collides with an ordinary matter particle, the two destroy each other, emitting radiation and some other particles - this is called annihilation.</p>
<p>Because of Einstein’s weak equivalence principle (gravity doesn’t depend on composition), antiparticles should interact gravitationally just like particles of ordinary matter - and that’s what scientists expect to observe - but if they don’t, then Einstein was wrong…</p>
<h2 id="whats-the-experiment">What’s the experiment?</h2>
<p>The Antihydrogen Experiment: Gravity, Interferometry, Spectroscopy (AEgIS) experiment at CERN shoots antihydrogen atoms horizontally, whereupon they fly (and drop) until they hit a wall made of matter - any matter will do: silicon, silver, paper… - and annihilate there.</p>
<p>On hitting the wall, the antihydrogen annihilates with a nucleus of the wall to produce mostly pions and some other particles - which we’ll call a starburst.</p>
<p>The starburst travels through a special gel called an emulsion, where we can see its tracks. If we trace these tracks to their point of origin, then we know exactly where the annihilation occurred.</p>
<p>Then, since we know the starting position of the antiparticles, the distance they travelled to the point of annihilation, and where they landed, we can work out how far each antiparticle fell during its journey.</p>
<p>Then we can figure out how antimatter interacts gravitationally.</p>
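<p>As a back-of-the-envelope illustration of that calculation (the numbers below are invented, not AEgIS parameters): with horizontal velocity v and distance d to the wall, the flight time is t = d/v, and under ordinary gravity the drop would be gt²/2.</p>
<pre><code class="language-python"># Back-of-the-envelope drop of a horizontally launched (anti)atom
# under ordinary gravity. The velocity and distance are invented for
# illustration; they are not AEgIS parameters.
g = 9.81      # m/s^2, ordinary gravitational acceleration
v = 500.0     # m/s, horizontal beam velocity (made up)
d = 1.0       # m, distance to the wall (made up)

t = d / v                  # flight time: 2 milliseconds
drop = 0.5 * g * t * t     # fall during the flight
print("drop = %.2e m" % drop)   # about 2e-05 m, i.e. some 20 microns
</code></pre>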
<p><img src="http://i.imgur.com/uVVjKzD.jpg" alt="AEgIS Experiment Installation" /></p>
<p>Michael Doser gave us access to a set of 99 areas photographed with a microscope, which allows us to see the tracks and the starbursts. Each of the areas has 40 pictures. These pictures cover the same area but at a different depth.</p>
<p>As we discussed the project, we decided to create a “movie style” task, where the CrowdCrafting application plays all the images for an area in a loop. Volunteers can then map the tracks using their mouse, as in any image-editing software. The coordinates of each track’s starting and ending points are saved, and we use those points to render a 3D model of the tracks in real time thanks to WebGL.</p>
<p>We divided the work between different groups, and we worked together on the following areas:</p>
<ul>
<li>Creating tasks based on the data</li>
<li>2D movie style using HTML5 canvas feature</li>
<li>3D model of tracks using HTML5 WebGL</li>
<li>Physics description of the problem and tutorial.</li>
</ul>
<p>For the 2D canvas solution we decided to use the popular <a href="http://www.kineticjs.com/">Kinetic.JS</a> library. This library is very versatile: you can not only render images on the 2D canvas, but also paint lines.</p>
<p>For the 3D model we decided to use the popular <a href="http://threejs.org/">Three.JS library</a>. We created a 3D area using the Tron color palette to draw the tracks reported by the users.</p>
<p>Another group worked really hard on explaining the physics of the experiment and writing the tutorial. We even created a <a href="https://juanracasti.makes.org/popcorn/1adt">Mozilla Webmaker project</a> about it.</p>
<p>At the end of Sunday we had a <a href="http://crowdcrafting.org/app/antimatter">fully operational prototype</a> that allows you to actually track antimatter in CrowdCrafting:</p>
<p><img src="https://github-camo.global.ssl.fastly.net/9a7c3a33b5470bf0c42f19f74a7443adf0e116ef/687474703a2f2f692e696d6775722e636f6d2f716b32393067352e706e67" alt="Screenshot" /></p>
<p>From here I would like to thank all the team members, because they truly loved the project and pushed it to the next level. These efforts will help other CrowdCrafting/PyBossa developers use the new HTML5 Canvas and WebGL features developed for this application, as the source code is already available on GitHub and can be used as a template for any CrowdCrafting/PyBossa application.</p>
<p>If you want, you can follow the project’s development in the <a href="https://github.com/CERNSummerWebfest/antimatter">GitHub repository</a>.</p>
Daniel Lombraña González
data.okfn.org - update no. 2
2013-08-06T00:00:00+00:00
http://okfnlabs.org/blog/2013/08/06/data-update-2
<p><a href="http://data.okfn.org">data.okfn.org</a> is the Labs’ repository of high-quality, easy-to-use <a href="http://opendefinition.org/">open data</a>. This update summarizes some of the improvements to data.okfn.org that have taken place over the past two months.</p>
<h2 id="new-tools">New tools</h2>
<p>Several tools which make it easier to use the <a href="http://data.okfn.org/standards/data-package">Data Package standard</a> are now operational. These include a <a href="http://data.okfn.org/tools/create">Data Package creator</a>, a <a href="http://data.okfn.org/tools/view">Data Package viewer</a>, and there’s progress on a <a href="https://github.com/okfn/data.okfn.org/issues/27">validator for Data Packages</a>.</p>
<h3 id="data-package-creator">Data Package Creator</h3>
<p>Turning a CSV into a Data Package means creating a file, <code class="language-plaintext highlighter-rouge">datapackage.json</code>, which houses the metadata associated with the CSV. The <a href="http://data.okfn.org/tools/create">Data Package Creator</a> simplifies this process.</p>
<p>Provide the Creator with the URL of a CSV and it will return a well-formed JSON object with the required fields, as well as a raw JSON URL (the JSON URL serves as a basic machine-accessible API).</p>
<p><img src="http://farm8.staticflickr.com/7362/9449152387_962624e792.jpg" alt="Data Package Creator in action" /></p>
<h3 id="data-package-viewer">Data Package Viewer</h3>
<p>The metadata included with Data Packages makes it possible to construct a simple view of the data. We now provide an online <a href="http://data.okfn.org/tools/view">Data Package Viewer</a> to do this for you.</p>
<p>Just provide the link to your Data Package and the Viewer generates a user-friendly description, a graph of the data, and a summary of the data fields. Here, for example, is the Viewer’s display of <a href="http://data.okfn.org/tools/view?url=https://raw.github.com/rgrp/wheat-us/master/datapackage.json">US wheat production data</a>.</p>
<p><img src="http://farm6.staticflickr.com/5340/9449152367_13b33222df.jpg" alt="Data Package Viewer in action" /></p>
<h2 id="new-datasets">New datasets</h2>
<p>The biggest data news was having our first ‘out-of-the-blue’ contribution of an ‘official’ dataset! <a href="https://github.com/ewheeler">Evan Wheeler</a> pinged us to offer a comprehensive collection of <a href="http://data.okfn.org/data/country-codes-comprehensive">country codes</a> for the world’s countries in <a href="http://data.okfn.org/standards/simple-data-format">Simple Data Format</a>. Here it is:</p>
<ul>
<li><a href="http://data.okfn.org/data/country-codes-comprehensive">Comprehensive Country Codes dataset on data.okfn.org</a></li>
<li><a href="https://github.com/datasets/country-codes-comprehensive">Associated GitHub repo</a> for the dataset</li>
</ul>
<p><img src="http://farm8.staticflickr.com/7324/9451935968_32719167a7.jpg" alt="Country codes data, table view" /></p>
<p>Also new:</p>
<ul>
<li><a href="http://data.okfn.org/data/s-and-p-500">Standard and Poor’s 500 Index Data including Dividend, Earnings, and P/E Ratio</a> (<a href="https://github.com/datasets/s-and-p-500">GitHub</a>)</li>
<li><a href="http://data.okfn.org/data/cpi-us">US Consumer Price Index and Inflation monthly time series from January 1913</a> (<a href="https://github.com/datasets/cpi-us">GitHub</a>)</li>
</ul>
<p>If you want to contribute a new dataset, check out the <a href="http://data.okfn.org/about/contribute#data">instructions</a> and the <a href="https://github.com/datasets/registry/issues">outstanding requests</a>.</p>
<h2 id="new-standards-pages">New standards pages</h2>
<p>Among data.okfn.org’s chief purposes is promoting simple <a href="http://data.okfn.org/standards">standards for data transport</a> in the form of Data Package and Simple Data Format - helping to create a world of <a href="http://blog.okfn.org/2013/04/24/frictionless-data-making-it-radically-easier-to-get-stuff-done-with-data/">frictionless data</a>.</p>
<p>Key here is providing simple, easy-to-understand information, so we’ve <a href="http://data.okfn.org/standards">revamped the standards page</a> and created two new pages dedicated to providing a simple introduction and overview for Data Package and Simple Data Format:</p>
<ul>
<li><a href="http://data.okfn.org/standards/data-package">Data Package Overview and Introduction</a></li>
<li><a href="http://data.okfn.org/standards/simple-data-format">Simple Data Format Overview and Introduction</a></li>
</ul>
<h2 id="get-involved">Get involved</h2>
<p>Anyone can contribute, and it’s easy – if you can use a spreadsheet, you can help!</p>
<p>Instructions for <a href="http://data.okfn.org/about/contribute">getting involved can be found here</a>.</p>
Neil Ashton
Analyzing Icelandic conviction rates with CrowdCrafting.org
2013-07-31T00:00:00+00:00
http://okfnlabs.org/blog/2013/07/31/crodcrafting-data-journalism
<p><a href="http://crowdcrafting.org">CrowdCrafting.org</a> hosts a wide variety of applications that range from <a href="http://crowdcrafting.org/app/airquality/">science</a> to
<a href="http://crowdcrafting.org/app/bardomatic/">humanities</a>. Since the official launch of <a href="http://crowdcrafting.org">CrowdCrafting.org</a>, <a href="http://crowdcrafting.org/app/category/featured/">lots of applications have been created</a>
, but one of them has done a really impressive job: <a href="http://crowdcrafting.org/app/heradsdomar/">Héraðsdómar -
sýknað eða sakfellt</a>.</p>
<p><strong>Héraðsdómar - sýknað eða sakfellt</strong> is an application developed by <a href="http://gogn.in/">Páll Hilmarsson</a> (<a href="https://twitter.com/pallih">@pallih</a>, <a href="https://github.com/pallih">GitHub</a>). The application was one of the most popular and active
applications on CrowdCrafting.org when it was published (300 volunteers helped!),
so I wanted to interview the author and ask him some questions about it: why he
created the application, what the result was, and so on.</p>
<p>Páll told me that he created the application after reading <a href="http://www.visir.is/simon-sigvaldason-sakfellir-naer-alltaf/article/2012121229180">an article</a> published on an Icelandic news website.</p>
<p><img src="http://i.imgur.com/6GlMJ1p.png" alt="Application UI translated using Google Translate" /></p>
<p>The article analyzed the <strong>conviction rates of a named judge in the Reykjavik district court</strong>,
stating that the conviction rate for cases where he presided as a judge was 99%.
Páll found it interesting, but also “biased”, as the reporter had only analyzed one judge.</p>
<p>After the publication of the story, some bloggers and readers of the post
questioned why only one judge had been analyzed, and reported this back to the author.
The journalist addressed all the questions and comments by answering that
calculating the conviction rates for every case would take too long.</p>
<p>Páll was not happy with this answer, so he decided to show him, and other reporters, that
this could easily be done by crowdsourcing the job, and that it would not take too long.</p>
<p><strong>Páll uploaded around 4,700 rulings as tasks, and the volunteers analyzed them in 7 days!</strong> Each ruling
went to at least three different users, totaling 14,208 assessments. In the end more than
17,000 assessments were made by over 300 users! (You can check the stats <a href="http://crowdcrafting.org/app/heradsdomar/stats">here</a>.)</p>
<p>But here comes the best part: <strong>Páll only spent 10 hours on this project</strong> (including
the time to scrape the rulings, set up the tasks on CrowdCrafting.org, and display
the <a href="http://gogn.in/heradsdomar/">results on his blog</a>). Amazing!</p>
Daniel Lombraña González
Making puzzles out of Shapefiles - bringing Open Data to the physical world
2013-07-28T00:00:00+00:00
http://okfnlabs.org/blog/2013/07/28/making-puzzles-from-shapefiles
<p>For a while I’ve been thinking about how to make Open Data more tangible.
Even with great visualizations, it tends to remain stuck in computers and
smartphones. Recently, I had the idea of taking geodata released by
cities and making it into physical things. These are the first steps
and prototypes: making a puzzle out of district borders.</p>
<p><img src="http://farm6.staticflickr.com/5469/9380801297_dd91e4b99e_z_d.jpg" alt="Shapefile Puzzle" /></p>
<p>Thanks to the <a href="http://openscience.alpine-geckos.at/events/open-week-graz-3/">Open Week</a>
organized by <a href="http://at.okfn.org">OKFN Austria</a> member <a href="https://twitter.com/stefankasberger">Stefan Kasberger</a>
I finally got the chance to put this idea into action. Here’s what we did:</p>
<ul>
<li>First download the city boundaries as a Shapefile</li>
<li>Make sure it’s in WGS 84 (EPSG:4326)</li>
<li>Convert it to SVG using <a href="http://kartograph.org/about/kartograph.py/">kartograph.py</a> (see the sketch after this list)</li>
<li>Convert it to PDF - so the lasercutter can understand it</li>
<li>Adapt the PDF for lasercutting (set line widths to hairlines…)</li>
<li>Cut!</li>
<li>Try to assemble the puzzle</li>
</ul>
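<p>The kartograph.py step looked roughly like the sketch below. The config keys follow my recollection of Kartograph’s layer configuration and the file names are placeholders, so check the kartograph.py documentation before relying on it.</p>
<pre><code class="language-python"># Rough sketch of the Shapefile-to-SVG step with kartograph.py. The
# config keys reflect my recollection of Kartograph's layer config;
# the file names are placeholders.
from kartograph import Kartograph

config = {
    "layers": [
        {"src": "districts.shp"}    # WGS 84 / EPSG:4326 boundaries
    ]
}

K = Kartograph()
K.generate(config, outfile="districts.svg")
</code></pre>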
<p>To make the process a little easier, Stefan created a <a href="https://github.com/skasberger/lazzzor-puzzle">script for converting Shapefiles to SVG</a>
along with the data.</p>
<p><img src="http://farm6.staticflickr.com/5490/9383586970_99d359e228_z_d.jpg" alt="finished puzzle" /></p>
<p>We created puzzles for both Vienna and Graz using district boundaries
released as Open Data. Once done, we noticed that solving a district boundary puzzle is not as
easy as it seems… (even though the number of pieces is limited)</p>
Michael Bauer
Apps using DBpedia Wikipedia from Open Knowledge Foundation Greece
2013-07-23T00:00:00+00:00
http://okfnlabs.org/blog/apps/2013/07/23/DBpedia-apps
<p>Having developed the Greek DBpedia, the first Internationalized DBpedia, OKFN Greece is now contributing to OKFN Labs by introducing three applications that use DBpedia.</p>
<h2 id="1-dbpedia-spotlight">1. DBpedia Spotlight</h2>
<p>DBpedia Spotlight is an application that automatically spots and disambiguates words or phrases in text documents that may refer to DBpedia resources, and annotates them with DBpedia URIs.</p>
<p>DBpedia Spotlight implements the Aho-Corasick string matching algorithm in the spotting stage described above, along with the use of <a href="http://lucene.apache.org/">Apache Lucene</a> over the index built in the offline training/configuration stage. For the disambiguation of the spotted words/phrases, a vector space model (VSM) representation of the DBpedia resources is used, along with a variant of the TF-IDF technique for determining the weight of words based on their ability to distinguish between the candidates for a given term.</p>
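<p>To give a feel for the disambiguation step, here is a toy illustration - emphatically not Spotlight’s actual code - of scoring candidate resources with TF-IDF weights and cosine similarity in a vector space model:</p>
<pre><code class="language-python"># Toy VSM + TF-IDF disambiguation: score each candidate DBpedia
# resource by the cosine similarity between its context vector and
# the words around the spotted phrase. Illustration only; Spotlight's
# real index, weighting, and candidate data are far richer.
import math
from collections import Counter

def tfidf_vector(words, doc_freq, n_docs):
    tf = Counter(words)
    return {w: tf[w] * math.log(n_docs / doc_freq[w])
            for w in tf if w in doc_freq}

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def disambiguate(context_words, candidates, doc_freq, n_docs):
    """candidates: dict mapping a resource URI to its context words."""
    query = tfidf_vector(context_words, doc_freq, n_docs)
    scored = {uri: cosine(query, tfidf_vector(words, doc_freq, n_docs))
              for uri, words in candidates.items()}
    return max(scored, key=scored.get)

df = {"coffee": 2, "bank": 2, "river": 1, "money": 1}   # toy corpus stats
candidates = {"dbpedia:Bank_(geography)": ["river", "bank"],
              "dbpedia:Bank": ["money", "bank"]}
print(disambiguate(["money", "bank"], candidates, df, n_docs=4))  # dbpedia:Bank
</code></pre>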
<p>The Greek DBpedia Spotlight, fully compatible with Greek characters encoded in UTF-8, was implemented by graduate student Ioannis Avraam under the supervision of Dr. Charalampos Bratsas, coordinator of OKFN Greece. The project was organised by OKFN Greece in coordination with the Web Science Master Program and the Semantic Web Unit of the Aristotle University of Thessaloniki. The Greek DBpedia Spotlight is deployed as a Web service and features a user interface at <a href="http://dbpedia-spotlight.okfn.gr/">http://dbpedia-spotlight.okfn.gr/</a>. The source code is open and available under the Apache License v2 at <a href="https://github.com/iavraam/dbpedia-spotlight.git">https://github.com/iavraam/dbpedia-spotlight.git</a> (dbpediaSpotlight_el branch).</p>
<h2 id="2--day-like-today">2. Day Like Today</h2>
<p>Day Like Today (<a href="http://el.dbpedia.org/apps/DayLikeToday/">http://el.dbpedia.org/apps/DayLikeToday/</a>) is the second application that uses DBpedia, informing the user about what happened on a day like today in the past. Similar existing applications use data that their authors have bundled into the application. This application differs in that the data displayed has been extracted from Wikipedia with DBpedia queries, and the results are visualized in a timeline using the OKFN Labs timeliner.</p>
<p>At the back end of the application, the user can choose which DBpedia will be queried, as well as the queries themselves. The queries are submitted, the data is analyzed, exported into JSON format, forwarded to the frontend, and illustrated in a pie chart.</p>
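<p>Querying a DBpedia endpoint from code is straightforward; here is a sketch using the SPARQLWrapper Python library. The query shown is deliberately simple and hypothetical - the application’s real “day like today” queries live in its repository.</p>
<pre><code class="language-python"># Sketch of querying a DBpedia SPARQL endpoint and getting JSON back
# via the SPARQLWrapper library. The query is a simple, hypothetical
# example, not one of the application's real "day like today" queries.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: &lt;http://dbpedia.org/ontology/&gt;
    SELECT ?person ?birth WHERE {
        ?person a dbo:Person ;
                dbo:birthDate ?birth .
    } LIMIT 5
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for b in results["results"]["bindings"]:
    print(b["person"]["value"], b["birth"]["value"])
</code></pre>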
<p>Queries return data such as:</p>
<ul>
<li>
<p>Title of the fact</p>
</li>
<li>
<p>Date of the fact</p>
</li>
<li>
<p>A small description</p>
</li>
<li>
<p>A small thumbnail</p>
</li>
<li>
<p>A large picture</p>
</li>
<li>
<p>Link to the related article in Wikipedia</p>
</li>
<li>
<p>Link to the corresponding DBpedia which we got the data.</p>
</li>
</ul>
<p>The source code is open and available under the Apache License v2 at <a href="https://github.com/okfngr/DayLikeToday">https://github.com/okfngr/DayLikeToday</a>.</p>
<p>Here are some statistics on the amount of information that was exported:</p>
<p><img src="http://farm8.staticflickr.com/7294/9347989761_29b1227361_o.jpg" alt="" /></p>
<p>From Greek DBpedia</p>
<p><img src="http://farm3.staticflickr.com/2845/9347989793_f25bdb1d37.jpg" width="300" height="180" alt="english_dbpedia-300x180" /></p>
<p>From English DBpedia</p>
<p><img src="http://farm3.staticflickr.com/2884/9350771082_6efe745d5e.jpg" width="300" height="180" alt="german_dbpedia-300x180" /></p>
<p>From German DBpedia</p>
<h2 id="3dbpedia-game">3.DBpedia Game</h2>
<p>The third application is DBpedia <strong>Game</strong> (<a href="http://wiki.el.dbpedia.org/apps/ssg/dist/ssg.html">http://wiki.el.dbpedia.org/apps/ssg/dist/ssg.html</a>), an entertaining and educational tool for producing and evaluating knowledge. It consists of a series of games such as multiple choice, anagram, hangman, and matching. The novelty of the DBpedia <a href="http://wiki.el.dbpedia.org/apps/ssg/dist/ssg.html">Game</a> is the immediate and automatic generation of games from Wikipedia’s data using Semantic Web technologies on Greek Linked Data (DBpedia). The DBpedia data is accessed with the SPARQL Query Language. A total of 31 SPARQL queries were created that retrieve facts belonging to 8 general categories (geography, history, athletics, astronomy, general, chemistry, politics, economy). These facts form the basis of the next processing steps of the application. A sample query is depicted in Listing 1, retrieving Animal labels along with their depictions and feedback links for the DBpedia and Wikipedia pages.</p>
<p><img src="http://farm8.staticflickr.com/7281/9347989751_c6f71f30ca.jpg" width="500" height="228" alt="sparql" /></p>
<p><img src="http://farm3.staticflickr.com/2836/9347989725_1687807f5c.jpg" width="367" height="500" alt="table" /></p>
<p>Even with this relatively small number of queries (around 30), we managed to produce a total of 12,000 sets of results from the Greek DBpedia (cf. Table 1).</p>
<p>The present version runs as a Java applet and is a rapid prototype to be improved (<a href="https://github.com/okfngr/DBpedia-Game">https://github.com/okfngr/DBpedia-Game</a>).</p>
Charalampos Bratsas
Spanish Party Financing Scandals - CrowdSourcing Data Extraction with CrowdCrafting
2013-07-11T00:00:00+00:00
http://okfnlabs.org/blog/2013/07/11/transcribing-corruption
<p>Spanish society has been bombarded recently with a flurry of news stories about possible cases of corruption in the major political parties like the <a href="http://www.elmundo.es/elmundo/2013/07/02/andalucia/1372764547.html">Partido Socialista Obrero Español</a> and the <a href="http://politica.elpais.com/politica/2013/07/11/actualidad/1373542957_537498.html">Partido Popular</a>.</p>
<p>In January of 2013 the party that rules the country, Partido Popular (PP), was featured on the front page of the newspaper <a href="http://politica.elpais.com/politica/2013/01/30/actualidad/1359583204_085918.html">El País</a> with a new story about a possible financial scandal in the party. The story disclosed several scanned copies of the party’s accounting book – with donations from companies to members of the party – allegedly handwritten by the official treasurer of the PP, Luis Bárcenas, as well as several accounts in Swiss banks.</p>
<p><img src="http://i.imgur.com/7lv6Mjw.png" alt="Screenshot of the application" /></p>
<p>Since then, there have been many news stories and press conferences; however, the decision of 28 June brought a very interesting twist. The <a href="http://www.elmundo.es/elmundo/2013/06/27/espana/1372330113.html">judge ordered Luis Bárcenas to be sent to prison</a> after the financial anticorruption district attorney requested it.</p>
<p>After this action, the Spanish mass media started to ask questions about this imprisonment and the consequences it could have for the party that rules the country.</p>
<p>This last Sunday, 7 July, one of the main Spanish newspapers <a href="http://www.elmundo.es/elmundo/2013/07/07/espana/1373186360.html">published an interview with the suspect, Luis Bárcenas</a>. The interesting part is that, until now, Bárcenas had denied all the accusations; however, now that he is in jail he has started to admit that the donations were made by Spanish development companies and other enterprises, that the money was in some cases delivered in plastic bags, and that those companies usually obtained contracts with the administrations governed by the party.</p>
<p>The next day, the public face of the party, María Dolores de Cospedal, gave <a href="http://politica.elpais.com/politica/2013/07/08/actualidad/1373269187_324066.html">a press conference about these accusations saying that everything is completely false</a>.</p>
<p>A few minutes after the press conference was over, <a href="http://www.eldiario.es/politica/Anonymous-filtra-cuentas-Partido-Popular_0_151535327.html">the Anonymous hacker group disclosed the last 20 years of the party’s accounting information on bayfiles.net</a> and distributed the links on Twitter, in newspapers, etc.</p>
<p><img src="http://images.lainformacion.com/cms/anonymous-publica-las-cuentas-del-partido-popular/2013_7_8_HzMQI0BrlpxBRTPyBJPom1-d85e8a83579bec67db0c71db28a306bd-1373300234-77.jpg?width=645&height=645&type=flat&id=HzMQI0BrlpxBRTPyBJPom1&time=1373300238&project=lainformacion" alt="Screenshot courtesy of Lainformacion.com" /></p>
<p>The links spread like wildfire and people started to coordinate under the hashtag <a href="https://twitter.com/search?q=CuentasDelPP">#cuentasdelpp</a> to analyze the data. They found that most of the documents were PDF scans of handwritten notes, some reports from different fiscal accounting packages, etc., so they asked for help. As the data format is not machine readable, someone suggested using <a href="http://crowdcrafting.org">CrowdCrafting.org</a> to do the transcription, taking as inspiration the sample <a href="http://crowdcrafting.org/app/pdftranscribe">PDF transcription app</a>. A few hours later the application was up and running:</p>
<p><img src="http://i.imgur.com/7lv6Mjw.png" alt="Screenshot of the application" /></p>
<p>This is a great example of how citizens can coordinate to analyze a problem in their society using open tools like <a href="http://crowdcrafting.org">CrowdCrafting</a> and PyBossa. Unfortunately, due to legal threats regarding the leaked data, the author has, as of today, felt obliged to take the app down (we hope temporarily). I have contacted the author, and as soon as we know whether the app can be re-opened, we will let you know!</p>
Daniel Lombraña González
PublicBodies.org progress
2013-07-09T00:00:00+00:00
http://okfnlabs.org/blog/2013/07/09/publicbodies-progress
<p>There have been many new developments with <a href="http://publicbodies.org">PublicBodies.org</a>, the Labs project which aims to provide “a URL for every part of government”, since <a href="http://okfnlabs.org/blog/2013/05/01/publicbodies.org-an-update.html">the last update</a> on the Labs blog.</p>
<p>The news includes: a new and improved backend; a push for integration with <a href="http://nomenklatura.okfnlabs.org/">Nomenklatura</a>; discussion of a revamp of the PublicBodies schema; lots of new data waiting to be integrated; and a new idea for how PublicBodies might be useful.</p>
<h2 id="publicbodies-now-much-shinier">PublicBodies: now much shinier</h2>
<p>Thanks to the hard work of Labs member <a href="http://okfnlabs.org/members/wombleton/">Rowan Crawford</a>, PublicBodies is now a proper webapp. It’s now a <a href="https://github.com/okfn/publicbodies">Node.js app</a> running on <a href="https://www.heroku.com/">Heroku</a>, and its interface is much nicer than before. Let’s all give Rowan a hand!</p>
<p>Development of the PublicBodies website is ongoing. The next task for improving the site will be <a href="https://github.com/okfn/publicbodies/issues/3">adding search</a>.</p>
<h2 id="nomenklatura-integration">Nomenklatura integration</h2>
<p>Entity reconciliation is crucial for a service like PublicBodies. Luckily, the Labs has another project that simplifies reconciliation, namely <a href="http://nomenklatura.okfnlabs.org/">Nomenklatura</a>. The obvious step is to start pushing PublicBodies data to Nomenklatura and pulling it when it gets updated. This idea is discussed more fully in <a href="https://github.com/okfn/publicbodies/issues/2">an issue</a>.</p>
<p>Contributor <a href="https://github.com/davidread">David Read</a> has got the ball rolling with Nomenklatura integration by pushing <a href="http://nomenklatura.okfnlabs.org/uk-public-bodies">UK public bodies data</a>. This is a great start – but we want to automate this and start automatically pushing CSVs across to Nomenklatura. Volunteers to build this functionality, please step up!</p>
<h2 id="popolo-schema-integration">Popolo schema integration</h2>
<p><a href="http://popoloproject.com/">Popolo</a> is a project with a goal very relevant to PublicBodies: the creation of “international open government data specifications relating to the legislative branch of government”. These include a data specification for <a href="http://popoloproject.com/specs/organization.html">organizations</a>.</p>
<p>We’re considering reworking the PublicBodies schema to follow the Popolo organization spec. The changes would be nontrivial but wouldn’t involve any massive reorganization of the data. Please help us think this through by joining in the discussion <a href="https://github.com/okfn/publicbodies/issues/29">in the issues</a>.</p>
<h2 id="lots-of-new-data">Lots of new data</h2>
<p>Once the matter of revamping the schema is resolved, we can start integrating the heaps of new data which has been contributed. The new data includes public bodies from the US, Germany, China, Quebec, Italy, and Slovenia. You can see it all <a href="https://github.com/okfn/publicbodies/issues?direction=desc&labels=Data&page=1&sort=updated&state=open">here</a>. Thanks to the contributors who have brought this data together.</p>
<p>The sooner we come to a decision about the Popolo schema, the sooner we can start incorporating all of this new material – so please let us know what you think!</p>
<h2 id="discussion-organization-identifiers">Discussion: organization identifiers</h2>
<p>Contributor <a href="https://github.com/markbrough">Mark Brough</a> has come up with an interesting idea for how PublicBodies might be useful: it could be used to generate organisation identifiers usable in situations calling for unique identifiers, such as IATI data publication. As Mark observes, public organizations often lack these identifiers, which makes publishing data a struggle.</p>
<p>Read the details of Mark’s proposal <a href="https://github.com/okfn/publicbodies/issues/41">in the issues</a>, and let him know what you think.</p>
Neil Ashton
Open Data Maker Night London No 3 - Tuesday 16th July
2013-07-08T00:00:00+00:00
http://okfnlabs.org/blog/events/2013/07/08/open-data-maker-night-london-3
<p>The next <strong>Open Data Maker Night London</strong> will be on <strong>Tuesday 16th July 6-9pm</strong> (you can drop in any time during the evening). Like the last two it is kindly hosted by the wonderful <a href="http://creative-collaboration.net/about/contact/">Centre for Creative Collaboration, 16 Acton Street, London</a>.</p>
<ul>
<li>When: Tuesday 16th July 2013</li>
<li>Where: <a href="http://creative-collaboration.net/about/contact/">Centre for Creative Collaboration, 16 Acton Street, London</a>.</li>
<li>Signup: <a href="http://www.meetup.com/OpenKnowledgeFoundation/London-GB/984832/">on Meetup page</a> (optional but nice to know numbers!)</li>
</ul>
<p>Look forward to seeing folks there!</p>
<p><img src="http://farm9.staticflickr.com/8524/8500104205_4e209ef952.jpg" alt="" /></p>
<h3 id="what">What</h3>
<p>Open Data Maker Nights are informal events focused on “making” with open data – whether that’s creating apps or insights. They aren’t a general meetup – if you come, expect to get pulled into actually building something, though we won’t force you!</p>
<h3 id="who">Who</h3>
<p>The events usually have short introductory talks about specific projects and suggestions for things to work on – it’s absolutely fine to turn up knowing nothing about data or openness or tech, as there’ll be an activity for you to help with and someone to guide you in contributing!</p>
<h3 id="organize-your-own">Organize your own!</h3>
<p>Not in London? Why not <a href="/events/open-data-maker/">organize your own Open Data Maker night</a> in your city? Anyone can and it’s easy to do – <a href="/events/open-data-maker/">find out more »</a></p>
Rufus Pollock
Open Data QA - the Aid Transparency Tracker
2013-07-08T00:00:00+00:00
http://okfnlabs.org/blog/2013/07/08/aid-transparency-tracker
<p>Back in April, I <a href="http://blog.okfn.org/2013/03/13/launching-the-aid-transparency-tracker/">wrote on the Open Knowledge Foundation main blog</a> to launch the <a href="http://tracker.publishwhatyoufund.org/plan/">first component of our Aid Transparency Tracker</a>, a tool to analyse aid donors’ commitments to publish more open data about their aid activities.</p>
<p>At the end of that post, I pointed to our future plans to also monitor the quality of publication. It is possible to do this programmatically because donors have agreed to publish their data according to the <a href="http://iatistandard.org">IATI Standard</a>.</p>
<p><a href="http://publishwhatyoufund.org/files/tracker-frontpage.png"><img src="http://publishwhatyoufund.org/files/tracker-frontpage-small.png" alt="Aid Transparency Tracker - data quality" /></a></p>
<p>Over the last six months we’ve spent a lot of time building a framework for testing the quality of aid donors’ <a href="http://iatiregistry.org">IATI data</a>, as well as a survey tool to capture data not available in the IATI format. We launched this to donors last month.</p>
<p>We will be releasing the results as part of our 2013 <a href="http://publishwhatyoufund.org/index">Aid Transparency Index</a> in October. In the meantime, I wanted to give a sneak peek of some of the things the Tracker can now do. All the <a href="https://github.com/markbrough/IATI-Data-Quality">source code is on Github</a>.</p>
<h2 id="automatic-testing-framework">Automatic testing framework</h2>
<p><a href="http://publishwhatyoufund.org/files/tracker-iati.png"><img src="http://publishwhatyoufund.org/files/tracker-iati-small.png" alt="One donor's IATI data" /></a></p>
<p>The biggest part of this tool is the automated data quality analysis. This works as follows:</p>
<ol>
<li><strong>Tests</strong>: A series of tests is written in <a href="https://github.com/mk270/foxpath-tools">FoXPath</a> (“a cunning version of XPath”), a language we created for this purpose. The idea was to make the tests a bit more readable for non-programmers, agnostic about the language used to run them, and structured so that they could be implemented with regular expressions in whatever language is required.</li>
<li><strong>Registry</strong>: The <a href="http://iatiregistry.org">IATI Registry</a> (a CKAN instance) is queried to check for any changes to the data. The Registry uses <a href="https://github.com/okfn/ckanext-archiver">CKAN Archiver</a> to create a hash of each package every night.</li>
<li><strong>Testing</strong>: all of the tests are run against each package found for testing. This is run as a background process, with RabbitMQ used for the queue.</li>
<li><strong>Results</strong>: Each result is stored as a pass, fail, or error, alongside the package id, the test id, the publishing organisation’s id, and the <a href="http://iatistandard.org/activities-standard/iati-identifier">activity identifier</a> (if applicable and available).</li>
<li><strong>Aggregations</strong>: results are then aggregated up to create a percentage of passes for each test for each package.</li>
<li><strong>Indicators</strong>: when presenting the results in the user interface, tests are grouped into indicators to make the information more readable. At the moment, there is only one set of indicators (our 2013 Index indicators), but some fairly small changes would make it possible to add other sets of indicators - for example, indicators that show whether data is good for an Aid Information Management System, or for making maps with, or for results data.</li>
</ol>
<h3 id="remaining-hurdles">Remaining hurdles</h3>
<ol>
<li><strong>Improving tests</strong>: The tests need to become more expressive, adding more conditions to when they should or shouldn’t be run. Some of these expressions are supported within the existing version of FoXPath.</li>
<li><strong>Changed packages</strong>: Refreshing packages currently has to be done manually, because of the way the IATI Registry records changes to files. This should be fixed within the next couple of weeks.</li>
<li><strong>Space</strong>: The data quality tool currently stores the result for each test alongside each activity. This uses <strong>15GB</strong> of space each time it runs. I’m considering dropping this data, and only storing the aggregates, because such detailed data doesn’t appear to be as useful as I originally thought it might be.</li>
<li><strong>Speed</strong>: testing is quite slow at the moment, and aggregation takes a particularly long time. I’m going to revisit that section of the code (and in fact the aggregation architecture as a whole) to optimise it. However, in the medium term, some more substantial changes might be needed, possibly including re-writing this component in a compiled language.</li>
</ol>
<h2 id="survey-component">Survey component</h2>
<p><a href="http://publishwhatyoufund.org/files/tracker-other.png"><img src="http://publishwhatyoufund.org/files/tracker-other-small.png" alt="An example of non-IATI data" /></a></p>
<p>In previous years, we used Global Integrity’s <a href="http://getindaba.org">Indaba platform</a> for the survey. However, because of the quite different way this year’s Index is constructed, we decided to build our own bespoke survey tool.</p>
<p>Many of the donors we include in the Index have not yet begun publishing data to IATI, and none of them are yet publishing all of the fields. We need to capture this information in the Index, while encouraging donors to publish as much as possible in their IATI data.</p>
<h3 id="what-it-does">What it does</h3>
<ol>
<li><strong>Donor-specific indicators</strong>: if information is found in the donor’s IATI data, then there’s no need to ask again where that information can be found. If it’s not found at all in the IATI data, then we look to see where else we can find that information on a donor’s website.</li>
<li><strong>Format matters</strong>: more accessible formats are scored higher in this year’s Index. We’re encouraging donors to move as much information out of PDFs into websites, then into CSV, Excel, or some other machine readable format, and then into the IATI-XML format. Obviously, it’s great if they can jump steps and go straight to IATI-XML - we’re seeing that from several donors this year.</li>
<li><strong>Retaining an audit trail</strong>: Each of the steps in the survey is recorded and will be published, so that if there are disagreements between us, the donor, or the independent reviewer, readers can see that correspondence and reach their own conclusions.
<a href="http://publishwhatyoufund.org/files/tracker-survey.png"><img src="http://publishwhatyoufund.org/files/tracker-survey-small.png" alt="Example of several stages in the survey" /></a></li>
</ol>
<h3 id="how-it-works">How it works</h3>
<ol>
<li>When a new survey is created, indicators are only created if they score 0 in the IATI data quality assessment.</li>
<li>When a user has finished responding to the survey, they submit the form and a simple linear workflow moves the survey to the next step.</li>
<li>Users have access to specific parts of the workflow for a specific organisation, depending on whether they’re a donor or an independent reviewer, and whether they should have edit permissions or read-only permissions.</li>
</ol>
<h2 id="what-were-aiming-to-achieve">What we’re aiming to achieve</h2>
<p>Finally, it’s worth emphasising what we’re trying to achieve from all of this, and looking at the extent to which we’re doing that already.</p>
<ol>
<li><strong>Non-IATI publishers begin publishing to IATI</strong>: the incentives in the Index are very clearly structured this year: more points are awarded for publishing in more open formats, with the internationally comparable IATI standard format scored highest. Several donors are trying to start publishing by the July 31st deadline, which is when automated data collection will end and we’ll begin writing our analysis.</li>
<li><strong>IATI publishers publish more fields, and improve their data</strong>: several donors are working to add more fields into their data where they have that information to hand.</li>
<li><strong>Donors can use the data quality tool to improve their own publication</strong>: several donors are using the tool to flag areas where there could be improvements in their data. We want this tool to be useful on an ongoing basis to donors, but that will require both that tests run nightly, and also that donors can test unpublished data. We’ll be working on those features over the next month.</li>
</ol>
<h2 id="whats-next">What’s next</h2>
<p>We’ll be presenting the Aid Transparency Tracker at <a href="http://okcon.org">OKCon</a> in Geneva in September, and talking about how it could be used as a basis for monitoring the quality of data in other open data spending standards.</p>
<p>We would also very much welcome any feedback. Please get in touch:</p>
<ul>
<li>Email: mark.brough@publishwhatyoufund.org</li>
<li>
<p>Twitter: <a href="http://twitter.com/mark_brough">@mark_brough</a></p>
</li>
</ul>
Mark Brough
Querying ElasticSearch - A Tutorial and Guide
2013-07-01T00:00:00+00:00
http://okfnlabs.org/blog/2013/07/01/elasticsearch-query-tutorial
<p>ElasticSearch is a great open-source search tool that’s built on Lucene (like
SOLR) but is natively JSON + RESTful. It’s been used quite a bit at the <a href="http://okfn.org/">Open
Knowledge Foundation</a> over the last few years. Plus, as it’s easy to
<a href="http://www.elasticsearch.org/guide/reference/setup/">set up locally</a>, it’s an attractive option for digging into data on your
local machine.</p>
<p>While its general interface is pretty natural, I must confess I’ve sometimes
struggled to find my way around ElasticSearch’s powerful, but also quite
complex, query system and the associated JSON-based “<a href="http://www.elasticsearch.org/guide/reference/query-dsl/">query DSL</a>” (domain
specific language).</p>
<p>This post therefore provides a simple introduction and guide to querying
ElasticSearch that provides a short overview of how it all works together with
a good set of examples of some of the most standard queries.</p>
<div class="alert alert-success">
Note: here at Open Knowledge Foundation Labs we have several open-source
ElasticSearch-related projects, including an <strong><a href="/projects/elasticsearch-js/">easy-to-use Javascript Library for
ElasticSearch</a></strong> and the <strong><a href="/projects/reclinejs/">Recline suite of JS Data Components</a></strong>,
which make it easy and fast to build powerful JS+HTML-based interfaces to
ElasticSearch.
</div>
<p><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#terminology-and-urls" id="markdown-toc-terminology-and-urls">Terminology and URLs</a></li>
<li><a href="#quickstart" id="markdown-toc-quickstart">Quickstart</a> <ul>
<li><a href="#curl-or-browser" id="markdown-toc-curl-or-browser">cURL (or Browser)</a></li>
<li><a href="#javascript" id="markdown-toc-javascript">Javascript</a></li>
<li><a href="#python" id="markdown-toc-python">Python</a></li>
</ul>
</li>
<li><a href="#querying" id="markdown-toc-querying">Querying</a> <ul>
<li><a href="#basic-queries-using-only-the-query-string" id="markdown-toc-basic-queries-using-only-the-query-string">Basic Queries Using Only the Query String</a></li>
<li><a href="#full-query-api" id="markdown-toc-full-query-api">Full Query API</a></li>
</ul>
</li>
<li><a href="#query-language" id="markdown-toc-query-language">Query Language</a> <ul>
<li><a href="#query-dsl-overview" id="markdown-toc-query-dsl-overview">Query DSL: Overview</a></li>
<li><a href="#examples" id="markdown-toc-examples">Examples</a> <ul>
<li><a href="#match-all--find-everything" id="markdown-toc-match-all--find-everything">Match all / Find Everything</a></li>
<li><a href="#classic-search-box-style-full-text-query" id="markdown-toc-classic-search-box-style-full-text-query">Classic Search-Box Style Full-Text Query</a></li>
<li><a href="#filter-on-one-field" id="markdown-toc-filter-on-one-field">Filter on One Field</a></li>
<li><a href="#find-all-documents-with-value-in-a-range" id="markdown-toc-find-all-documents-with-value-in-a-range">Find all documents with value in a range</a></li>
<li><a href="#full-text-query-plus-filter-on-a-field" id="markdown-toc-full-text-query-plus-filter-on-a-field">Full-Text Query plus Filter on a Field</a></li>
<li><a href="#filter-on-two-fields" id="markdown-toc-filter-on-two-fields">Filter on two fields</a></li>
<li><a href="#geospatial-query-to-find-results-near-a-given-point" id="markdown-toc-geospatial-query-to-find-results-near-a-given-point">Geospatial Query to find results near a given point</a></li>
</ul>
</li>
<li><a href="#facets" id="markdown-toc-facets">Facets</a></li>
</ul>
</li>
<li><a href="#appendix" id="markdown-toc-appendix">Appendix</a> <ul>
<li><a href="#adding-updating-and-deleting-data" id="markdown-toc-adding-updating-and-deleting-data">Adding, Updating and Deleting Data</a></li>
<li><a href="#schema-mapping" id="markdown-toc-schema-mapping">Schema Mapping</a></li>
<li><a href="#jsonp-support" id="markdown-toc-jsonp-support">JSONP support</a></li>
</ul>
</li>
</ul>
<h2 id="terminology-and-urls">Terminology and URLs</h2>
<p>Throughout <code class="language-plaintext highlighter-rouge">{endpoint}</code> refers to the ElasticSearch <a href="http://www.elasticsearch.org/guide/reference/glossary/#type">index type</a> (aka
table). Note that ElasticSearch often lets you run the same queries on both
“<a href="http://www.elasticsearch.org/guide/reference/glossary/#index">indexes</a>” (aka database) and types.</p>
<p>If you were just using ElasticSearch standalone, an example of an endpoint would be:
http://localhost:9200/gold-prices/monthly-price-table.</p>
<p>Key urls:</p>
<ul>
<li>
<p>Query: <code class="language-plaintext highlighter-rouge">{endpoint}/_search</code> (in ElasticSearch < 0.19 this will return an
error if visited without a query parameter)</p>
<ul>
<li>Query example: <code class="language-plaintext highlighter-rouge">{endpoint}/_search?size=5&pretty=true</code></li>
</ul>
</li>
<li>
<p>Schema (Mapping): <code class="language-plaintext highlighter-rouge">{endpoint}/_mapping</code></p>
</li>
</ul>
<h2 id="quickstart">Quickstart</h2>
<h3 id="curl-or-browser">cURL (or Browser)</h3>
<p>The following examples utilize the <a href="http://curl.haxx.se/">cURL</a> command line utility. If you prefer,
you can just open the relevant URLs in your browser:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"> <span class="c"># query for documents / rows with title field containing 'jones'</span>
<span class="c"># added pretty=true to get the json results pretty printed</span>
curl <span class="o">{</span>endpoint<span class="o">}</span>/_search?q<span class="o">=</span>title:jones&size<span class="o">=</span>5&pretty<span class="o">=</span><span class="nb">true</span></code></pre></figure>
<p>Adding some data:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"> <span class="c"># Data (argument to -d) should be a JSON document</span>
curl <span class="nt">-X</span> POST <span class="o">{</span>endpoint<span class="o">}</span> <span class="nt">-d</span> <span class="s1">'{
"title": "jones",
"amount": 5.7
}'</span></code></pre></figure>
<h3 id="javascript">Javascript</h3>
<p>A simple ajax (JSONP) request to the data API using jQuery:</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"> <span class="kd">var</span> <span class="nx">data</span> <span class="o">=</span> <span class="p">{</span>
<span class="na">size</span><span class="p">:</span> <span class="mi">5</span> <span class="c1">// get 5 results</span>
<span class="na">q</span><span class="p">:</span> <span class="dl">'</span><span class="s1">title:jones</span><span class="dl">'</span> <span class="c1">// query on the title field for 'jones'</span>
<span class="p">};</span>
<span class="nx">$</span><span class="p">.</span><span class="nx">ajax</span><span class="p">({</span>
<span class="na">url</span><span class="p">:</span> <span class="p">{</span><span class="nx">endpoint</span><span class="p">}</span><span class="sr">/_search</span><span class="err">,
</span> <span class="na">dataType</span><span class="p">:</span> <span class="dl">'</span><span class="s1">jsonp</span><span class="dl">'</span><span class="p">,</span>
<span class="na">success</span><span class="p">:</span> <span class="kd">function</span><span class="p">(</span><span class="nx">data</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">alert</span><span class="p">(</span><span class="dl">'</span><span class="s1">Total results found: </span><span class="dl">'</span> <span class="o">+</span> <span class="nx">data</span><span class="p">.</span><span class="nx">hits</span><span class="p">.</span><span class="nx">total</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">});</span></code></pre></figure>
<p><em>Note: we’ve written a simple <a href="https://github.com/okfn/elasticsearch.js">JS library for ElasticSearch</a> which makes working with ElasticSearch much easier. Here’s a sample:</em></p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="c1">// Your ElasticSearch instance is running at http://localhost:9200/</span>
<span class="c1">// We are using index 'twitter' and type (table) 'tweet'</span>
<span class="kd">var</span> <span class="nx">endpoint</span> <span class="o">=</span> <span class="dl">'</span><span class="s1">http://localhost:9200/twitter/tweet</span><span class="dl">'</span><span class="p">;</span>
<span class="c1">// Table = an ElasticSearch Type (aka Table)</span>
<span class="c1">// http://www.elasticsearch.org/guide/reference/glossary/#type</span>
<span class="kd">var</span> <span class="nx">table</span> <span class="o">=</span> <span class="nx">ES</span><span class="p">.</span><span class="nx">Table</span><span class="p">(</span><span class="nx">endpoint</span><span class="p">);</span>
<span class="c1">// Create some data</span>
<span class="nx">table</span><span class="p">.</span><span class="nx">upsert</span><span class="p">({</span>
<span class="na">id</span><span class="p">:</span> <span class="dl">'</span><span class="s1">123</span><span class="dl">'</span><span class="p">,</span>
<span class="na">title</span><span class="p">:</span> <span class="dl">'</span><span class="s1">My new tweet</span><span class="dl">'</span>
<span class="p">}).</span><span class="nx">done</span><span class="p">(</span><span class="kd">function</span><span class="p">()</span> <span class="p">{</span>
<span class="c1">// now get it</span>
<span class="nx">table</span><span class="p">.</span><span class="kd">get</span><span class="p">(</span><span class="dl">'</span><span class="s1">123</span><span class="dl">'</span><span class="p">).</span><span class="nx">done</span><span class="p">(</span><span class="kd">function</span><span class="p">(</span><span class="nx">doc</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">doc</span><span class="p">);</span>
<span class="p">});</span>
<span class="p">});</span>
<span class="c1">// Query for data</span>
<span class="c1">// Queries follow Recline Query spec -</span>
<span class="c1">// http://okfnlabs.org/recline/docs/models.html#query-structure</span>
<span class="c1">// (very similar to ES)</span>
<span class="nx">table</span><span class="p">.</span><span class="nx">query</span><span class="p">({</span>
<span class="na">q</span><span class="p">:</span> <span class="dl">'</span><span class="s1">hello</span><span class="dl">'</span>
<span class="na">filters</span><span class="p">:</span> <span class="p">[</span>
<span class="p">{</span> <span class="na">term</span><span class="p">:</span> <span class="p">{</span> <span class="dl">'</span><span class="s1">owner</span><span class="dl">'</span><span class="p">:</span> <span class="dl">'</span><span class="s1">jones</span><span class="dl">'</span> <span class="p">}</span> <span class="p">}</span>
<span class="p">]</span>
<span class="p">}).</span><span class="nx">done</span><span class="p">(</span><span class="kd">function</span><span class="p">(</span><span class="nx">out</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">out</span><span class="p">);</span>
<span class="p">});</span></code></pre></figure>
<h3 id="python">Python</h3>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">urllib2</span>
<span class="kn">import</span> <span class="nn">json</span>
<span class="c1"># =================================
# Store some data
</span>
<span class="n">url</span> <span class="o">=</span> <span class="s">'{endpoint}'</span>
<span class="n">data</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">'title'</span><span class="p">:</span> <span class="s">'jones'</span><span class="p">,</span>
<span class="s">'amount'</span><span class="p">:</span> <span class="mf">5.7</span>
<span class="p">}</span>
<span class="c1"># have to send the data as JSON
</span><span class="n">data</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="n">req</span> <span class="o">=</span> <span class="n">urllib2</span><span class="p">.</span><span class="n">Request</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span> <span class="n">headers</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">urllib2</span><span class="p">.</span><span class="n">urlopen</span><span class="p">(</span><span class="n">req</span><span class="p">)</span>
<span class="k">print</span> <span class="n">out</span><span class="p">.</span><span class="n">read</span><span class="p">()</span>
<span class="c1"># =================================
# Query the resulting "table"
</span>
<span class="n">url</span> <span class="o">=</span> <span class="s">'{endpoint}/_search?q=title:jones&size=5'</span>
<span class="n">req</span> <span class="o">=</span> <span class="n">urllib2</span><span class="p">.</span><span class="n">Request</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">urllib2</span><span class="p">.</span><span class="n">urlopen</span><span class="p">(</span><span class="n">req</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">out</span><span class="p">.</span><span class="n">read</span><span class="p">()</span>
<span class="k">print</span> <span class="n">data</span>
<span class="c1"># returned data is JSON
</span><span class="n">data</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">loads</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="c1"># total number of results
</span><span class="k">print</span> <span class="n">data</span><span class="p">[</span><span class="s">'hits'</span><span class="p">][</span><span class="s">'total'</span><span class="p">]</span></code></pre></figure>
<h2 id="querying">Querying</h2>
<h3 id="basic-queries-using-only-the-query-string">Basic Queries Using Only the Query String</h3>
<p>Basic queries can be done using only query string parameters in the URL. For
example, the following searches for text ‘hello’ in any field in any document
and returns at most 5 results:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{endpoint}/_search?q=hello&size=5
</code></pre></div></div>
<p>Basic queries like this have the advantage that they only involve accessing a
URL and thus, for example, can be performed just using any web browser.
However, this method is limited and does not give you access to most of the
more powerful query features.</p>
<p>Basic queries use the <code class="language-plaintext highlighter-rouge">q</code> query string parameter which supports the <a href="http://lucene.apache.org/core/old_versioned_docs/versions/3_0_0/queryparsersyntax.html">Lucene
query parser syntax</a> and hence filters on specific fields (e.g.
<code class="language-plaintext highlighter-rouge">fieldname:value</code>), wildcards (e.g. <code class="language-plaintext highlighter-rouge">abc*</code>) and more.</p>
<p>There are a variety of other options (e.g. size, from etc) that you can also
specify to customize the query and its results. Full details can be found in
the <a href="http://www.elasticsearch.org/guide/reference/api/search/uri-request.html">ElasticSearch URI request docs</a>.</p>
<h3 id="full-query-api">Full Query API</h3>
<p>More powerful and complex queries, including those that involve faceting and
statistical operations, should use the full ElasticSearch query language and API.</p>
<p>In the query language, queries are written as a JSON structure which is then sent
to the query endpoint (details of the query language below). There are two
options for how a query is sent to the search endpoint:</p>
<ol>
<li>
<p>Either as the value of a source query parameter e.g.:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> {endpoint}/_search?source={Query-as-JSON}
</code></pre></div> </div>
</li>
<li>
<p>Or in the request body, e.g.:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> curl -XGET {endpoint}/_search -d 'Query-as-JSON'
</code></pre></div> </div>
<p>For example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> curl -XGET {endpoint}/_search -d '{
"query" : {
"term" : { "user": "kimchy" }
}
}'
</code></pre></div> </div>
</li>
</ol>
<h2 id="query-language">Query Language</h2>
<p>Queries are JSON objects with the following structure (each of the main
sections has more detail below):</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"> <span class="p">{</span>
<span class="nl">size</span><span class="p">:</span> <span class="err">#</span> <span class="nx">number</span> <span class="k">of</span> <span class="nx">results</span> <span class="nx">to</span> <span class="k">return</span> <span class="p">(</span><span class="nx">defaults</span> <span class="nx">to</span> <span class="mi">10</span><span class="p">)</span>
<span class="k">from</span><span class="p">:</span> <span class="err">#</span> <span class="nx">offset</span> <span class="nx">into</span> <span class="nx">results</span> <span class="p">(</span><span class="nx">defaults</span> <span class="nx">to</span> <span class="mi">0</span><span class="p">)</span>
<span class="nx">fields</span><span class="p">:</span> <span class="err">#</span> <span class="nx">list</span> <span class="k">of</span> <span class="nb">document</span> <span class="nx">fields</span> <span class="nx">that</span> <span class="nx">should</span> <span class="nx">be</span> <span class="nx">returned</span> <span class="o">-</span> <span class="nx">http</span><span class="p">:</span><span class="c1">//elasticsearch.org/guide/reference/api/search/fields.html</span>
<span class="nx">sort</span><span class="p">:</span> <span class="err">#</span> <span class="nx">define</span> <span class="nx">sort</span> <span class="nx">order</span> <span class="o">-</span> <span class="nx">see</span> <span class="nx">http</span><span class="p">:</span><span class="c1">//elasticsearch.org/guide/reference/api/search/sort.html</span>
<span class="nx">query</span><span class="p">:</span> <span class="p">{</span>
<span class="err">#</span> <span class="dl">"</span><span class="s2">query</span><span class="dl">"</span> <span class="nx">object</span> <span class="nx">following</span> <span class="nx">the</span> <span class="nx">Query</span> <span class="nx">DSL</span><span class="p">:</span> <span class="nx">http</span><span class="p">:</span><span class="c1">//elasticsearch.org/guide/reference/query-dsl/</span>
<span class="err">#</span> <span class="nx">details</span> <span class="nx">below</span>
<span class="p">},</span>
<span class="nx">facets</span><span class="p">:</span> <span class="p">{</span>
<span class="err">#</span> <span class="nx">facets</span> <span class="nx">specifications</span>
<span class="err">#</span> <span class="nx">Facets</span> <span class="nx">provide</span> <span class="nx">summary</span> <span class="nx">information</span> <span class="nx">about</span> <span class="nx">a</span> <span class="nx">particular</span> <span class="nx">field</span> <span class="nx">or</span> <span class="nx">fields</span> <span class="k">in</span> <span class="nx">the</span> <span class="nx">data</span>
<span class="p">}</span>
<span class="err">#</span> <span class="nx">special</span> <span class="k">case</span> <span class="k">for</span> <span class="nx">situations</span> <span class="nx">where</span> <span class="nx">you</span> <span class="nx">want</span> <span class="nx">to</span> <span class="nx">apply</span> <span class="nx">filter</span><span class="o">/</span><span class="nx">query</span> <span class="nx">to</span> <span class="nx">results</span> <span class="nx">but</span> <span class="o">*</span><span class="nx">not</span><span class="o">*</span> <span class="nx">to</span> <span class="nx">facets</span>
<span class="nx">filter</span><span class="p">:</span> <span class="p">{</span>
<span class="err">#</span> <span class="nx">filter</span> <span class="nx">objects</span>
<span class="err">#</span> <span class="nx">a</span> <span class="nx">filter</span> <span class="nx">is</span> <span class="nx">a</span> <span class="nx">simple</span> <span class="dl">"</span><span class="s2">filter</span><span class="dl">"</span> <span class="p">(</span><span class="nx">query</span><span class="p">)</span> <span class="nx">on</span> <span class="nx">a</span> <span class="nx">specific</span> <span class="nx">field</span><span class="p">.</span>
<span class="err">#</span> <span class="nx">Simple</span> <span class="nx">means</span> <span class="nx">e</span><span class="p">.</span><span class="nx">g</span><span class="p">.</span> <span class="nx">checking</span> <span class="nx">against</span> <span class="nx">a</span> <span class="nx">specific</span> <span class="nx">value</span> <span class="nx">or</span> <span class="nx">range</span> <span class="k">of</span> <span class="nx">values</span>
<span class="p">},</span>
<span class="p">}</span></code></pre></figure>
<p>Query results look like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
# some info about the query (which shards it used, how long it took etc)
...
# the results
hits: {
total: # total number of matching documents
hits: [
# list of "hits" returned
{
_id: # id of document
score: # the search index score
_source: {
# document 'source' (i.e. the original JSON document you sent to the index
}
}
]
}
# facets if these were requested
facets: {
...
}
}
</code></pre></div></div>
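<p>As a quick sketch of consuming this structure in Javascript – assuming <code class="language-plaintext highlighter-rouge">response</code> holds the parsed JSON result of a search – the hits can be walked like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// response is the parsed JSON returned from {endpoint}/_search
console.log('total matching documents: ' + response.hits.total);
response.hits.hits.forEach(function (hit) {
  // hit._source is the original JSON document you sent to the index
  console.log(hit._id, hit._source);
});
</code></pre></div></div>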
<h3 id="query-dsl-overview">Query DSL: Overview</h3>
<p>Query objects are built up of sub-components. These sub-components are either
basic or compound. Compound sub-components may contain other sub-components
while basic ones may not. Example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"query": {
# compound component
"bool": {
# compound component
"must": {
# basic component
"term": {
"user": "jones"
}
}
# compound component
"must_not": {
# basic component
"range" : {
"age" : {
"from" : 10,
"to" : 20
}
}
}
}
}
}
</code></pre></div></div>
<p>In addition, and somewhat confusingly, ElasticSearch distinguishes between
sub-components that are “queries” and those that are “filters”. Filters are
really a special kind of query: they are mostly basic (though boolean
compounding is allowed), limited to one field or operation and, as such,
especially performant.</p>
<p>Examples of filters are (full list on the RHS at the bottom of the <a href="http://elasticsearch.org/guide/reference/query-dsl/">query-dsl</a> page):</p>
<ul>
<li>term: filter on a value for a field</li>
<li>range: filter for a field having a range of values (>=, <= etc)</li>
<li>geo_bbox: geo bounding box</li>
<li>geo_distance: geo distance</li>
</ul>
<p>Rather than attempting to set out all the constraints and options of the
query-dsl we now offer a variety of examples.</p>
<h3 id="examples">Examples</h3>
<h4 id="match-all--find-everything">Match all / Find Everything</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"query": {
"match_all": {}
}
}
</code></pre></div></div>
<h4 id="classic-search-box-style-full-text-query">Classic Search-Box Style Full-Text Query</h4>
<p>This will perform a full-text style query across all fields. The query string
supports the <a href="http://lucene.apache.org/core/old_versioned_docs/versions/3_0_0/queryparsersyntax.html">Lucene query parser syntax</a> and hence filters on specific fields
(e.g. <code class="language-plaintext highlighter-rouge">fieldname:value</code>), wildcards (e.g. <code class="language-plaintext highlighter-rouge">abc*</code>) as well as a variety of
options. For full details see the <a href="http://elasticsearch.org/guide/reference/query-dsl/query-string-query.html">query-string</a> documentation.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"query": {
"query_string": {
"query": {query string}
}
}
}
</code></pre></div></div>
<h4 id="filter-on-one-field">Filter on One Field</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"query": {
"term": {
{field-name}: {value}
}
}
}
</code></pre></div></div>
<p>High performance equivalent using filters:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"query": {
"constant_score": {
"filter": {
"term": {
# note that value should be *lower-cased*
{field-name}: {value}
}
}
}
}
</code></pre></div></div>
<h4 id="find-all-documents-with-value-in-a-range">Find all documents with value in a range</h4>
<p>This can be used for text ranges (e.g. A to Z), numeric ranges (10-20) and
for dates (ElasticSearch converts dates to ISO 8601 format, so you can
search e.g. from 1900-01-01 to 1920-02-03).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"query": {
"constant_score": {
"filter": {
"range": {
{field-name}: {
"from": {lower-value}
"to": {upper-value}
}
}
}
}
}
}
</code></pre></div></div>
<p>For more details see <a href="http://elasticsearch.org/guide/reference/query-dsl/range-filter.html">range filters</a>.</p>
<h4 id="full-text-query-plus-filter-on-a-field">Full-Text Query plus Filter on a Field</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"query": {
"query_string": {
"query": {query string}
},
"term": {
{field}: {value}
}
}
}
</code></pre></div></div>
<h4 id="filter-on-two-fields">Filter on two fields</h4>
<p>Note that you cannot, unfortunately, have a simple <code class="language-plaintext highlighter-rouge">and</code> query by adding two
filters inside the query element. Instead you need an ‘and’ clause in a filter
(which in turn requires nesting in ‘filtered’). You could also achieve the same
result here using a <a href="http://elasticsearch.org/guide/reference/query-dsl/bool-query.html">bool query</a>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"and": [
{
"range" : {
"b" : {
"from" : 4,
"to" : "8"
}
},
},
{
"term": {
"a": "john"
}
}
]
}
}
}
}
</code></pre></div></div>
<h4 id="geospatial-query-to-find-results-near-a-given-point">Geospatial Query to find results near a given point</h4>
<p>This uses the <a href="http://www.elasticsearch.org/guide/reference/query-dsl/geo-distance-filter.html">Geo Distance filter</a>. It requires that indexed documents have
a field of <a href="http://www.elasticsearch.org/guide/reference/mapping/geo-point-type.html">geo point type</a>.</p>
<p>Source data (a point in San Francisco!):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># This should be in lat,lon order
{
...
"Location": "37.7809035011582, -122.412119695795"
}
</code></pre></div></div>
<p>There are alternative formats to provide lon/lat locations e.g. (see
ElasticSearch documentation for more):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Note this must have lon,lat order (opposite of previous example!)
{
"Location":[-122.414753390488, 37.7762147914147]
}
# or ...
{
"Location": {
"lon": -122.414753390488,
"lat": 37.7762147914147
}
}
</code></pre></div></div>
<p>We also need a mapping to specify that the Location field is of type geo_point, as this will not usually be guessed from the data (see below for more on mappings):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"properties": {
"Location": {
"type": "geo_point"
}
...
}
</code></pre></div></div>
<p>Now the actual query:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"query": {
"filtered" : {
"query" : {
"match_all" : {}
},
"filter" : {
"geo_distance" : {
"distance" : "20km",
"Location" : {
"lat" : 37.776,
"lon" : -122.41
}
}
}
}
}
}
</code></pre></div></div>
<p>Note that you can specify the query using specific lat, lon attributes even
though the original data did not have this structure (you can also use a query
similar to the original structure if you wish - see <a href="http://www.elasticsearch.org/guide/reference/query-dsl/geo-distance-filter.html">Geo distance filter</a> for
more information).</p>
<h3 id="facets">Facets</h3>
<p>Facets provide a way to get summary information about the data in an
ElasticSearch table, for example counts of distinct values.</p>
<p>ElasticSearch (and hence the Data API) provides <a href="http://www.elasticsearch.org/guide/reference/api/search/facets/">rich faceting
capabilities</a>. The <a href="http://www.elasticsearch.org/guide/reference/api/search/facets/">ES facet docs</a> do a great job of listing the various
kinds of facets available and their structure, so I won’t repeat it all here.
Here is a list of some of the most important (full list on the facets page):</p>
<ul>
<li><a href="http://www.elasticsearch.org/guide/reference/api/search/facets/terms-facet.html">Terms</a> - counts by distinct terms (values) in a field</li>
<li><a href="http://www.elasticsearch.org/guide/reference/api/search/facets/range-facet.html">Range</a> - counts for a given set of ranges in a field</li>
<li><a href="http://www.elasticsearch.org/guide/reference/api/search/facets/histogram-facet.html">Histogram</a> and <a href="http://www.elasticsearch.org/guide/reference/api/search/facets/date-histogram-facet.html">Date Histogram</a> - counts by constant interval ranges</li>
<li><a href="http://www.elasticsearch.org/guide/reference/api/search/facets/statistical-facet.html">Statistical</a> - statistical summary of a field (mean, sum etc)</li>
<li><a href="http://www.elasticsearch.org/guide/reference/api/search/facets/terms-stats-facet.html">Terms Stats</a> - statistical summary on one field (stats field) for distinct
terms in another field. For example, spending stats per department or per
region.</li>
<li><a href="http://www.elasticsearch.org/guide/reference/api/search/facets/geo-distance-facet.html">Geo Distance</a>: counts by distance ranges from a given point</li>
</ul>
<p>Note that you can apply multiple facets per query.</p>
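<p>As a sketch of what this looks like in practice – reusing the owner and amount fields from the earlier examples – the following query requests a terms facet and a statistical facet alongside a match-all query:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "query": {
    "match_all": {}
  },
  "facets": {
    # counts of documents per distinct value of the owner field
    "owners": {
      "terms": { "field": "owner" }
    },
    # mean, min, max, sum etc of the amount field
    "amount_stats": {
      "statistical": { "field": "amount" }
    }
  }
}
</code></pre></div></div>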
<h2 id="appendix">Appendix</h2>
<h3 id="adding-updating-and-deleting-data">Adding, Updating and Deleting Data</h3>
<p>ElasticSearch, and hence the Data API, have a standard RESTful API. Thus:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>POST     {endpoint}      : INSERT
PUT/POST {endpoint}/{id} : UPDATE (or INSERT)
DELETE   {endpoint}/{id} : DELETE
</code></pre></div></div>
<p>For more on INSERT and UPDATE see the <a href="http://elasticsearch.org/guide/reference/api/index_.html">Index API</a> documentation.</p>
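<p>For instance, with cURL (the document id 123 is just an illustration):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># insert a new document (ElasticSearch assigns an id)
curl -XPOST {endpoint} -d '{"title": "jones", "amount": 5.7}'

# insert or update the document with id 123
curl -XPUT {endpoint}/123 -d '{"title": "jones", "amount": 6.2}'

# delete the document with id 123
curl -XDELETE {endpoint}/123
</code></pre></div></div>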
<p>There is also support for bulk inserts and updates via the <a href="http://elasticsearch.org/guide/reference/api/bulk.html">Bulk API</a>.</p>
<h3 id="schema-mapping">Schema Mapping</h3>
<p>As the ElasticSearch documentation states:</p>
<blockquote>
<p>Mapping is the process of defining how a document should be mapped to the
Search Engine, including its searchable characteristics such as which fields
are searchable and if/how they are tokenized. In ElasticSearch, an index may
store documents of different “mapping types”. ElasticSearch allows one to
associate multiple mapping definitions for each mapping type.</p>
</blockquote>
<blockquote>
<p>Explicit mapping is defined on an index/type level. By default, there isn’t a
need to define an explicit mapping, since one is automatically created and
registered when a new type or new field is introduced (with no performance
overhead) and have sensible defaults. Only when the defaults need to be
overridden must a mapping definition be provided.</p>
</blockquote>
<p>Relevant docs: <a href="http://elasticsearch.org/guide/reference/mapping/">http://elasticsearch.org/guide/reference/mapping/</a>.</p>
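<p>For example, to declare the Location field from the geospatial example above as a geo point before indexing any documents, you could PUT a mapping like the following sketch (substitute your own type name for {type-name}):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -XPUT {endpoint}/_mapping -d '{
  "{type-name}": {
    "properties": {
      "Location": {
        "type": "geo_point"
      }
    }
  }
}'
</code></pre></div></div>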
<h3 id="jsonp-support">JSONP support</h3>
<p>JSONP support is available on any request via a simple callback query string parameter:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>?callback=my_callback_name
</code></pre></div></div>
Rufus Pollock
Basic data cleaning with Data Explorer
2013-06-28T00:00:00+00:00
http://okfnlabs.org/blog/2013/06/28/basic-cleaning-with-data-explorer
<p><a href="http://explorer.okfnlabs.org/">Data Explorer</a> is a client-side web application for
data processing and visualization. With Data Explorer, you can import data,
transform it with JavaScript code, and visualize it on a graph or a map – all
fully within the browser and with your data and code nicely persisted to
<a href="https://gist.github.com/">gists</a>.</p>
<p>This tutorial will get you started using Data Explorer by walking you through
the cleaning of a messy data set. It introduces some of the basic concepts of
the <a href="http://okfnlabs.org/recline/">Recline</a> library which provides Data Explorer’s model of data
and highlights why it’s nice to be able to use JavaScript to wrangle data.</p>
<h2 id="getting-started">Getting started</h2>
<p>For this tutorial, we’re going to use a set of <a href="http://static.london.gov.uk/gla/expenditure/docs/2012-13-P12-250.csv">Greater London Authority (GLA) financial data</a>,
a report of payments of more than £250 made by the GLA over a one-month
period in 2013. Conveniently for our purposes, this dataset is a little buggy.</p>
<p>To get started, go to the <a href="http://explorer.okfnlabs.org/">Data Explorer</a> website, and click <strong>Get
started with your own data</strong> to proceed to the <em>Import</em> page. From there, you will be
able to load the data in a number of ways: uploading a local file, pasting in a URL, or
pasting the data itself into a text box. Choose your preferred method and hit the appropriate
<strong>Load</strong> button, which will take you to the <em>Preview & Save</em> page.</p>
<p>The <em>Preview</em> shows what the data will look like as a grid. Already some
fiddling is necessary to get things started. The row containing the names of fields is six rows
down, and the fields are all nameless – except for one with an erroneous name! To fix this,
change the <strong>Skip initial rows</strong> value to 6.</p>
<p>You can also see that there is a blank column, but you can’t do anything about this yet.
Just choose a name for the dataset, click <strong>Save</strong>, and move on to actually working with the data.</p>
<h2 id="the-grid-and-the-graph">The grid and the graph</h2>
<p>Once the data has been loaded and named, you are taken to the Data Explorer proper. Your
first view of the data will be the <strong>Grid</strong>, a tabular display identical to what was already
shown in the <em>Preview</em> screen.</p>
<p><img src="http://i.imgur.com/B48sGc9.png" alt="The initial grid" /></p>
<p>Data visualizations are constructed with the <strong>Graph</strong>. Let’s try to make a graph
of the data. Click <strong>Graph</strong> to go to the graph screen, which will ask you to choose
which of the data’s fields to bind to the two axes, Axis 1 (= x) and Axis 2 (= y). First
change the Graph Type to <strong>Points</strong>. Then, for Group Column, choose “Clearing Date”, and
for Series A, choose “Amount”. You should get a graph that looks like this.</p>
<p><img src="http://i.imgur.com/NDPFkLN.png" alt="First graph" /></p>
<p>This graph is useless. There are no points with an Amount higher than about £990.
A quick look at the grid will tell you that in fact many points have Amounts
running into the millions of pounds. Also note that the x axis is completely unlabeled. If you scan
your cursor over the data points, which displays their underlying value, you’ll see that
their horizontal arrangement is meaningless.</p>
<p>The problem is that the dataset is formatted badly. All of the values in the Amounts field
that run higher than £999.99 include a comma, which prevents them from being parsed as
numbers. The dates, too, are not being treated as dates but just as ordinary strings,
making it impossible to put them on a scale.</p>
<p>To fix these problems, we’ll write some code. Roll up your sleeves and get ready.</p>
<h2 id="basics-of-de-coding">Basics of DE coding</h2>
<p>To pull up Data Explorer’s tool for editing and running JavaScript code, click <strong>Code</strong>
at the top right of the page. This will cause the <strong>JavaScript console</strong> to drop down.
This console consists of a panel for editing code and, beneath it, an area where
messages are printed.</p>
<p>A bit of code is included in the edit panel by default:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>loadDataset("current", function (error, dataset) {
// error will be null unless there is an error
// dataset is a Recline memory store (http://reclinejs.com//docs/src/backend.memory.html).
console.log(dataset);
});
</code></pre></div></div>
<p>This code loads up the current dataset by calling the function <code class="language-plaintext highlighter-rouge">loadDataset</code>
with the string <code class="language-plaintext highlighter-rouge">"current"</code> (the name of the current dataset) and an anonymous
callback function which binds a representation of the dataset to the name <code class="language-plaintext highlighter-rouge">dataset</code>.
The callback function, as defined, prints the dataset to the console’s
message area by calling <code class="language-plaintext highlighter-rouge">console.log</code> on <code class="language-plaintext highlighter-rouge">dataset</code>.
Watch this code work by clicking <strong>Run the Code</strong>.</p>
<p>The console output might surprise you. The dataset is represented as a JavaScript
object with attributes <code class="language-plaintext highlighter-rouge">"records"</code> and <code class="language-plaintext highlighter-rouge">"fields"</code>: the former is an array
of objects, one per row, each with an attribute for every field listed in <code class="language-plaintext highlighter-rouge">"fields"</code>. This is
an instance of the <a href="http://reclinejs.com//docs/src/backend.memory.html">Recline memory store</a>. A dataset is a collection
of records, and a record is just an object.</p>
<p>If you understand that, you’re ready to clean the dataset.</p>
<h2 id="cleaning-with-javascript">Cleaning with JavaScript</h2>
<p>The full gamut of JavaScript tools and tricks is available to you when you
clean data in Data Explorer. Besides handy core JavaScript functionality like
regular expressions, Data Explorer makes the <a href="http://underscorejs.org">Underscore.js</a>
suite of functional programming utilities available for data cleaning.</p>
<p>To clean a dataset, write code inside a <code class="language-plaintext highlighter-rouge">loadDataset</code> call that modifies the dataset
in the appropriate way, and finish by calling <code class="language-plaintext highlighter-rouge">saveDataset</code> on the modified dataset.
All code presented in this section is to be placed inside the curly brackets of the <code class="language-plaintext highlighter-rouge">loadDataset</code>
callback function.</p>
<p>Let’s start by getting rid of that annoying blank column we noticed earlier. To do this,
we have to delete <code class="language-plaintext highlighter-rouge">_noname_</code> from the dataset’s fields. We must also drop the <code class="language-plaintext highlighter-rouge">_noname_</code>
attribute from every record in the dataset.</p>
<p>To get rid of the bad field, set the dataset’s fields attribute to be the old value
of the attribute minus the field named <code class="language-plaintext highlighter-rouge">"_noname_"</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dataset.fields =
_.reject(dataset.fields,
function (f) {
return f.id === "_noname_";
}) ;
</code></pre></div></div>
<p>Erasing the bad field from each record can be done with an application of <code class="language-plaintext highlighter-rouge">each</code>,
which calls a function with side effects on each item in a collection.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>_.each(dataset.records,
function (r) {
delete r._noname_ ;
}) ;
</code></pre></div></div>
<p>Now let’s look at the next problem: the unparsed Amounts with commas. To fix these,
we need to eliminate the commas and then parse the resulting string as a float.
Since we’re already iterating over every record in the dataset, we can add to the
anonymous function in the <code class="language-plaintext highlighter-rouge">each</code> call:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (typeof r.Amount === "string") {
r.Amount = parseFloat(r.Amount.replace(/,/g, "")) ;
}
</code></pre></div></div>
<p>Finally, we can fix the dates. There are two problems with these. The first is that
the Recline dataset object needs to know that the type of their field is <em>date</em>.
The second is that the dates haven’t been parsed. To fix the first problem, add
to the <code class="language-plaintext highlighter-rouge">loadDataset</code> callback function:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>_.find(dataset.fields,
function (f) {
return f.id === "Clearing Date" ;
})
.type = "date" ;
</code></pre></div></div>
<p>Next, add another bit to the anonymous function in <code class="language-plaintext highlighter-rouge">each</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (typeof r["Clearing Date"] === "string") {
r["Clearing Date"] = new Date(r["Clearing Date"]) ;
}
</code></pre></div></div>
<p>That’s it! All that remains is to save the modified dataset. At the bottom of
the <code class="language-plaintext highlighter-rouge">loadDataset</code> callback function, add a line to save the data:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>saveDataset(dataset) ;
</code></pre></div></div>
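<p>For reference, here is the whole cleaning script assembled from the pieces above:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>loadDataset("current", function (error, dataset) {
  // drop the blank column from the field list
  dataset.fields = _.reject(dataset.fields, function (f) {
    return f.id === "_noname_";
  });

  // tell Recline that the Clearing Date field holds dates
  _.find(dataset.fields, function (f) {
    return f.id === "Clearing Date";
  }).type = "date";

  // fix up every record
  _.each(dataset.records, function (r) {
    delete r._noname_;
    if (typeof r.Amount === "string") {
      r.Amount = parseFloat(r.Amount.replace(/,/g, ""));
    }
    if (typeof r["Clearing Date"] === "string") {
      r["Clearing Date"] = new Date(r["Clearing Date"]);
    }
  });

  saveDataset(dataset);
});
</code></pre></div></div>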
<p>Click <strong>Run the Code</strong> and watch the data transform before your very eyes.
The new graph view of the data is now meaningful, correct, and fully consistent
with your expectations. Awesome!</p>
<p><img src="http://i.imgur.com/Bl1cxL8.png" alt="Fixed graph" /></p>
<p>You can also have another look at the grid, which will show you exactly what
has changed in your data.</p>
<p><img src="http://i.imgur.com/WfQxGdV.png" alt="Final grid" /></p>
<p>If you have logged in to GitHub, you will be able to save the result of your
work. To share the work, simply copy the URL of your project. An example of
a project constructed according to the instructions above can be found <a href="http://explorer.okfnlabs.org/#nmashton/e4f4ab6a21471e1aa1b8/view/graph">here</a>.</p>
<h2 id="conclusion">Conclusion</h2>
<p>With Data Explorer, the full power of arbitrary JavaScript code (enhanced
with Underscore functionality) can be brought to bear on tough data cleaning
problems. The cleaning script’s effects are immediately visible in the grid
and graph views of the data, which enables an easy, interactive style of data
cleaning. And it is all done without a backend, in memory and in the browser.</p>
Neil Ashton
data.okfn.org - update no. 1
2013-05-28T00:00:00+00:00
http://okfnlabs.org/blog/2013/05/28/data-okfn-org-update-no-1
<p>This is the first of regular updates on Labs project <a href="http://data.okfn.org/">http://data.okfn.org/</a>
and summarizes some of the changes and improvements over the last few weeks.</p>
<h3 id="1-refactor-of-site-layout-and-focus">1. Refactor of site layout and focus.</h3>
<p>We’ve done a refactor of the site to have a stronger focus on the data. The front-page tagline is now:</p>
<blockquote>
<p>We’re providing key datasets in high quality, easy-to-use and open form</p>
</blockquote>
<p>Tools and standards are there in a clear supporting role. Thanks for all the suggestions and feedback on this so far - more is welcome, as we’re still iterating.</p>
<h3 id="2-pull-request-data-workflow">2. Pull request data workflow</h3>
<p>There was a nice example of the pull request data workflow being used (by a complete stranger!): <a href="https://github.com/datasets/house-prices-uk/pull/1">https://github.com/datasets/house-prices-uk/pull/1</a></p>
<h3 id="3-new-datasets">3. New datasets</h3>
<p>For example:</p>
<ul>
<li>US house prices <a href="http://data.okfn.org/data/house-prices-us">http://data.okfn.org/data/house-prices-us</a></li>
<li>Annual consumer price index <a href="http://data.okfn.org/data/cpi">http://data.okfn.org/data/cpi</a></li>
</ul>
<p>Looking to contribute data? Check out the instructions <a href="http://data.okfn.org/about/contribute#data">http://data.okfn.org/about/contribute#data</a> and the outstanding requests: <a href="https://github.com/datasets/registry/issues">https://github.com/datasets/registry/issues</a></p>
<h3 id="4-tooling">4. Tooling</h3>
<ul>
<li>We have a DataPackage.JSON creator tool in progress at http://data.okfn.org/tools/dp/create (<a href="https://github.com/okfn/data.okfn.org/issues/28">here’s the relevant github issue</a>)</li>
<li>We have a new <a href="http://data.okfn.org/tools">data package viewer created by James Smith</a></li>
</ul>
<h3 id="5-feedback-on-standards">5. Feedback on standards</h3>
<p>There’s been a lot of valuable feedback on the <a href="http://data.okfn.org/standards">data package and json table schema standards</a> including some quite major suggestions (e.g. substantial change to JSON Table Schema to align more closely with JSON Schema - thx to jpmckinney)</p>
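<p>For readers who haven’t met the standard yet, a datapackage.json is a small JSON descriptor along roughly these lines (a minimal sketch of the draft spec - treat the exact field names as illustrative, not normative):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "name": "house-prices-uk",
  "title": "UK House Prices",
  "resources": [
    {
      "path": "data/data.csv",
      "schema": {
        "fields": [
          {"id": "Date", "type": "date"},
          {"id": "Price", "type": "number"}
        ]
      }
    }
  ]
}
</code></pre></div></div>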
<h3 id="next-steps">Next steps</h3>
<p>There’s plenty more coming up soon in terms of data and the site and tools.</p>
<ul>
<li>Complete the <a href="https://github.com/okfn/data.okfn.org/issues/28">datapackage.json generator</a> (support for gdocs especially)</li>
<li>Complete the <a href="https://github.com/okfn/data.okfn.org/issues/27">datapackage.json validator</a></li>
<li>More <a href="http://data.okfn.org/about/contribute#data">datasets especially key indices</a></li>
</ul>
<h3 id="get-involved">Get Involved</h3>
<p>Anyone can contribute and it’s easy – if you can use a spreadsheet you can help!</p>
<p>Instructions for getting involved here: <a href="http://data.okfn.org/about/contribute">http://data.okfn.org/about/contribute</a></p>
Rufus Pollock
Open Humanities Hangout - Open Correspondence and the Letter Net
2013-05-21T00:00:00+00:00
http://okfnlabs.org/blog/2013/05/21/humanities-hangout-open-correspondence-letter-net
<p>Our next Open Humanities Hangout will take place next <strong>Tuesday, 28th May</strong>. This is the latest in the series of regular hangouts we’ve been organizing over the past few months with people interested in tapping in to the growing amount of <strong>open cultural data and content</strong>.</p>
<ul>
<li><strong>What:</strong> Open Humanities Hangout looking at opening up historical correspondence and mapping the “letter net” – e.g. did Dickens write to George Eliot and did she write back? Come help us find out! <a href="#more">Read more below</a></li>
<li><strong>When:</strong> Tuesday 28th May 2013 at 1700 BST, 12:00 EDT, 1800 CET</li>
<li><strong>Where:</strong> Online via Google Hangout and <a href="/contact">IRC</a> – we’ll publish the hangout url nearer the time</li>
<li><strong>Who:</strong> anyone who loves the humanities and wants to see the great works of our past accessible and re-usable by anyone regardless of background or location.</li>
<li><strong>Signup:</strong> please <a href="https://docs.google.com/a/okfn.org/document/d/1WIzi7n3D5_c7QtaGKQAFbm7bMGmi-u_vjmI5NX8MWJA/edit#">sign up here</a> or email sam.leon@okfn.org. Note you can always just drop in on the day but it helps us if we have a sense of numbers!</li>
</ul>
<h2 id="about-the-hangouts">About the Hangouts</h2>
<p>The <a href="/events/hangouts/">Humanities Hangouts</a> are an informal virtual get together to build apps and insights using open cultural material. Among other things participants have put together an app that helps you to get to know Shakespeare better called <a href="http://crowdcrafting.org/app/bardomatic/">Bardomatic</a>, hacked on an annotation tool for public domain texts called <a href="http://textusproject.org">TEXTUS</a> and created interactive <a href="http://timeliner.okfnlabs.org/">timelines of the great Western medieval philosophers</a> (helping to improve and de-bug the <a href="http://timeliner.okfnlabs.org/">Timeliner tool</a> in the process).</p>
<p><img src="http://farm6.staticflickr.com/5323/8768093210_3343870b2a_c.jpg" alt="Screen Shot 2013-05-16 at 13.26.10" width="1272" height="768" class="aligncenter size-full wp-image-2185" /></p>
<h2 id="more">The Challenge: Mapping Networks of Correspondence</h2>
<p>We want to construct a workflow that will enable <em>anyone</em> to take a published set of letters and turn it into open data and content that we can explore and visualize. Ultimately we want the network of correspondence – the “letter net”.</p>
<h3 id="suggested-process">Suggested process</h3>
<p>This is something to discuss on the hangout, but we think the effort will involve at least 3 steps:</p>
<ol>
<li>Locate published collection of letters
<ul>
<li>Great if these are already digitized on Project Gutenberg</li>
</ul>
</li>
<li>Extract structured data like author, recipient, date, location
<ul>
<li>Geo-code all those locations</li>
<li>If the texts are not digitized start thinking about that!</li>
</ul>
</li>
<li>Visualise the results</li>
</ol>
<p>We’ve already done work on steps 1 and even 2 in the <a href="https://github.com/okfn/openletters">case of Dickens</a>. For geocoding there’s already a simple <a href="http://schoolofdata.org/2013/02/19/geocoding-part-i-introduction-to-geocoding/">geocoding guide on the School of Data</a>. For visualization there are plenty of options that we’ll explore on the hangout. (And if we want to start scanning and OCRing there are <a href="http://www.diybookscanner.org/">manuals on how to build your own scanner</a>).</p>
<h3 id="our-goal">Our Goal</h3>
<p>Our basic goal is a set of beautiful and insightful visualisations about the correspondence of key cultural figures.</p>
<p>Longer term we would love to see a database of correspondence that is open to everyone to use and add to.</p>
Sam Leon
Nomenklatura - Data Matching and Reconciliation Made Easy
2013-05-16T00:00:00+00:00
http://okfnlabs.org/blog/2013/05/16/nomenklatura-matching-service-reconciliation-made-easy
<p><a href="http://nomenklatura.okfnlabs.org/">Nomenklatura</a> is a simple service that makes it easy to maintain a canonical list of entities such as persons, companies or event streets and to match messy input, such as their names against that canonical list – for example, matching Acme Widgets, Acme Widgets Inc and Acme Widgets Incorporated to the canonical “Acme Widgets”.</p>
<p>With Nomenklatura it’s a matter of minutes to set up your own set of master data to match against, and it provides a simple user interface and <a href="http://nomenklatura.okfnlabs.org/about">API</a> which you can then use to do matching (the API is compatible with Open Refine’s reconciliation function).</p>
<p><a href="http://nomenklatura.okfnlabs.org/">Nomenklatura</a> can not only store the master set of entities you want to match against but also will learn and record the various aliases for a given entity - such as a person, organisation or place - may have in various datasets.</p>
<p><a href="http://nomenklatura.okfnlabs.org/"><img src="http://i.imgur.com/h9411NU.jpg" /></a></p>
<p>As such, Nomenklatura chooses a design halfway between an entity database (such as OpenCorporates, PopIt or similar services) and automated de-duplication software (such as dedupe or SILK).</p>
<p>Nomenklatura has been battle-tested with real-world usage, for example to de-duplicate the names of <a href="http://nomenklatura.okfnlabs.org/offenesparlament">German parliamentarians</a>, <a href="http://nomenklatura.okfnlabs.org/uk25k-departments">UK government departments</a> and <a href="http://nomenklatura.okfnlabs.org/openinterests-entities">spending data schemas and EU lobbyists</a>.</p>
<p>Typically, a data extraction process will check all the entity names it discovers in the source data against Nomenklatura’s API. If Nomenklatura does not recognize a name, a new alias record is stored as a placeholder. This alias can then be matched to an entity by the user through a simple-to-use reconciliation user interface.</p>
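<p>As a rough sketch of what such a lookup could look like against the Refine-compatible API (the endpoint path and response fields here are my assumptions - check the <a href="http://nomenklatura.okfnlabs.org/about">API docs</a> for the real ones):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>var request = require('request');

// hypothetical reconciliation endpoint for a dataset named "my-companies"
var endpoint = 'http://nomenklatura.okfnlabs.org/my-companies/reconcile';

request.get({ url: endpoint, qs: { query: 'Acme Widgets Inc' }, json: true },
  function (err, resp, body) {
    if (err) throw err;
    // Refine-style responses carry a ranked list of candidate entities
    body.result.forEach(function (candidate) {
      console.log(candidate.name, candidate.score, candidate.match);
    });
  });
</code></pre></div></div>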
<p>To kickstart such a process, data can be uploaded via CSV - but new entities can be created dynamically as well. The advantage of a manual approach is that it minimizes the risk of false matches – this level of quality assurance can be crucial if, for example, the output will be displayed in an application that is intended to hold government to account.</p>
<h2 id="this-release">This Release</h2>
<p>This latest release of Nomenklatura includes a number of important changes:</p>
<ul>
<li>
<p>The domain model was refactored to use a clearer naming scheme, canonical values are now called “entities”, and their alternative spellings are now “aliases”.</p>
</li>
<li>
<p>CSV upload support allows users to submit a list of entities, aliases or fully executed mappings.</p>
</li>
<li>
<p>Support for the Open Refine API was added, so that each Nomenklatura dataset can be added as a reconciliation service and used to clean data from inside Refine.</p>
</li>
<li>
<p>Keyboard shortcuts were added to the reconciliation tool, so that matches can be identified without using a mouse - a fast user can now match a few hundred records an hour.</p>
</li>
<li>
<p>The Python client library has been refactored and submitted to PyPi, it can be installed via “pip install pynomenklatura”.</p>
</li>
</ul>
<h2 id="credits-and-links">Credits and Links</h2>
<p>Nomenklatura was developed by <a href="/members/pudo/">Labs Member Friedrich Lindenberg</a> with contributions from other folks including fellow Labs member Michael Bauer.</p>
<p><a href="https://github.com/pudo/nomenklatura">Nomenklatura source code on GitHub</a></p>
Friedrich Lindenberg
Update on PublicBodies.org - a URL for every part of Government
2013-05-01T00:00:00+00:00
http://okfnlabs.org/blog/2013/05/01/publicbodies.org-an-update
<p>This is an update on <a href="http://publicbodies.org/">PublicBodies.org</a> - a Labs project whose aim is to provide a “URL for every part of Government”: <a href="http://publicbodies.org/">http://publicbodies.org/</a></p>
<p>PublicBodies.org is a database and website of “Public Bodies” – that is Government-run or controlled organizations (which may or may not have distinct corporate existence). Examples would include government ministries or departments, state-run organizations such as libraries, police and fire departments and more.</p>
<p><a href="http://publicbodies.org/"><img src="http://i.imgur.com/2AbIjSu.png" alt="" style="margin-top: 15px; margin-bottom: 15px;" /></a></p>
<p>We run into public bodies all the time in projects like OpenSpending (either as spenders or recipients). Back in 2011 as part of the “Organizations” data workshop at OGD Camp 2011, Labs member Friedrich Lindenberg scraped together a first database and site of “public bodies” from various sources (primarily FoI sites like WhatDoTheyKnow, FragDenStaat and AskTheEU).</p>
<p>We’ve recently redone the site, converting the sqlite DB to simple flat CSV files:</p>
<ul>
<li>Main github repo: <a href="https://github.com/okfn/publicbodies">https://github.com/okfn/publicbodies</a></li>
<li>Example raw CSV: <a href="https://raw.github.com/okfn/publicbodies/master/data/gb.csv">https://raw.github.com/okfn/publicbodies/master/data/gb.csv</a></li>
</ul>
<p>The site itself is now super-simple: flat files hosted on S3 (<a href="https://github.com/okfn/publicbodies/tree/master/site">build code here</a>). Here’s an example of the output:</p>
<ul>
<li>European Parliament: <a href="http://publicbodies.org/eu/european-parliament.html">http://publicbodies.org/eu/european-parliament.html</a></li>
<li>Associated JSON API (with CORS!) <a href="http://publicbodies.org/eu/european-parliament.json">http://publicbodies.org/eu/european-parliament.json</a></li>
</ul>
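<p>Because the JSON API sends CORS headers, it can be queried straight from browser-side JavaScript on another domain - a minimal sketch using jQuery (inspect the output to see which fields are available):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// works cross-origin thanks to the CORS headers on the JSON API
$.getJSON('http://publicbodies.org/eu/european-parliament.json', function (body) {
  console.log(body); // dump the record to see which fields are available
});
</code></pre></div></div>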
<p>The simplicity of CSV for the data plus simple templating to flat files is very attractive. There are some drawbacks (e.g. a change to the primary template means a full rebuild and upload of ~6k files), so, especially as the data grows, we may want to look into something a bit nicer, but for the time being this works well.</p>
<h2 id="next-steps">Next Steps</h2>
<p>There’s plenty that could be improved e.g.</p>
<ul>
<li>More data - other jurisdictions (we only cover EU, UK and Germany) + descriptions for the bodies (this could be a nice crowdcrafting app)</li>
<li>Search and Reconciliation (via nomenklatura)</li>
<li>Making it easier to submit corrections or additions</li>
</ul>
<p>The full list of issues is on github here: <a href="https://github.com/okfn/publicbodies/issues">https://github.com/okfn/publicbodies/issues</a></p>
<p>Help is most definitely wanted! Just grab one of the issues or <a href="http://okfnlabs.org/contact/">get in touch</a> …</p>
Rufus Pollock
Quick and Dirty Analysis on Large CSVs
2013-04-11T00:00:00+00:00
http://okfnlabs.org/blog/2013/04/11/quick-and-dirty-analysis-on-large-csv
<p>I’m playing around with some large(ish) CSV files as part of an <a href="http://openspending.org/">OpenSpending</a>-related data investigation to look at UK government spending last year – example question: which companies were the top 10 recipients of government money? (More details can be
found in <a href="https://github.com/openspending/thingstodo/issues/5">this issue on OpenSpending’s things-to-do repo</a>).</p>
<p>The dataset I’m working with is the consolidated spending (over £25k) by all UK government departments. Thanks to the efforts of OpenSpending folks (and specifically Friedrich Lindenberg) this data is already nicely ETL’d from thousands of individual CSV (and xls) files into one big 3.7 GB file (see below for links and details).</p>
<p>My question is what is the best way to do quick and dirty analysis on this?</p>
<p>Examples of the kinds of options I was considering were:</p>
<ul>
<li>Simple scripting (python, perl etc)</li>
<li>Postgresql - load, build indexes and then sum, avg etc</li>
<li>Elastic MapReduce (AWS Hadoop)</li>
<li>Google BigQuery</li>
</ul>
<p>Love to hear what folks think and if there are tools or approaches they would specifically recommend.</p>
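<p>For a baseline comparison, here is a minimal streaming pass in Node that sums spending per supplier. The column positions below are placeholders (the real ones are described in the Data Package file linked under “The Data”), and a proper CSV parser would be needed to handle quoted commas:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>var fs = require('fs');
var readline = require('readline');

// column positions are placeholders - the real ones are described in datapackage.json
var SUPPLIER_COL = 5, AMOUNT_COL = 7;
var totals = {};

var rl = readline.createInterface({ input: fs.createReadStream('spending-latest.csv') });
rl.on('line', function (line) {
  // naive split: quoted fields containing commas need a real CSV parser
  var cells = line.split(',');
  var amount = parseFloat(cells[AMOUNT_COL]);
  if (!isNaN(amount)) {
    var supplier = cells[SUPPLIER_COL];
    totals[supplier] = (totals[supplier] || 0) + amount;
  }
});
rl.on('close', function () {
  // print the ten biggest recipients
  Object.keys(totals)
    .sort(function (a, b) { return totals[b] - totals[a]; })
    .slice(0, 10)
    .forEach(function (supplier) { console.log(supplier, totals[supplier]); });
});
</code></pre></div></div>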
<h3 id="the-data">The Data</h3>
<ul>
<li>Here’s the <a href="http://data.etl.openspending.org/uk25k/spending-latest.csv">3.7 GB CSV</a></li>
<li>A <a href="http://www.dataprotocols.org/en/latest/data-packages.html">Data Package file</a> for the data describing the fields: <a href="https://raw.github.com/openspending/dpkg-uk25k/master/datapackage.json">https://raw.github.com/openspending/dpkg-uk25k/master/datapackage.json</a></li>
</ul>
Rufus Pollock
Cleaning up Greater London Authority Spending (for OpenSpending)
2013-04-03T00:00:00+00:00
http://okfnlabs.org/blog/2013/04/03/greater-london-authority-spending
<p>I’ve been working to get Greater London Authority spending data cleaned up and
into <a href="http://openspending.org/">OpenSpending</a>. Primary motivation comes from this question:</p>
<blockquote>
<p><strong>Which companies got paid the most (and for doing what)?</strong> (see this <a href="https://github.com/openspending/thingstodo/issues/5">issue for
more</a>)</p>
</blockquote>
<p>I wanted to share where I’m up to and some of the experience so far, as I think
these can inform our wider efforts - and illustrate the challenges of just getting
and cleaning up data. I note that the <a href="https://github.com/rgrp/dataset-gla#readme">code and README</a> for this
ongoing work is in a repo on github: <a href="https://github.com/rgrp/dataset-gla">https://github.com/rgrp/dataset-gla</a></p>
<p><a href="http://openspending.org/gb-local-gla"><img src="http://awesomeness.openphoto.me/custom/201307/gla-spend-function-476af3_870x870.jpg" alt="" /></a></p>
<h2 id="data-quality-issues">Data Quality Issues</h2>
<p>There are 61 CSV files as of March 2013 (a list can be found in <a href="https://github.com/rgrp/dataset-gla/blob/master/scrape.json">scrape.json</a>).</p>
<p>Unfortunately the “format” varies substantially across files (even though they
are all CSV!), which makes using this data a real pain. Some examples:</p>
<ul>
<li>the number of fields and their names vary across files (e.g. SAP Document no vs
Document no)</li>
<li>the number of blank columns and blank lines varies (some files have none
(good!), while many have blank lines plus some metadata etc etc)</li>
<li>There is also at least one “bad” file which looks to be an Excel file saved
as CSV</li>
<li>Amounts are frequently formatted with “,” making them appear as strings to
computers.</li>
<li>Dates vary substantially in format e.g. “16 Mar 2011”, “21.01.2011” etc</li>
<li>No unique transaction number (though the document number may serve as one)</li>
</ul>
<p>They also switched from monthly reporting to period reporting (where there are 13 periods of approx 28d each).</p>
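<p>Whatever tooling ends up doing the load, the amount and date handling boils down to something like the following sketch (function names are mine - the real logic lives in the clean-up script linked below):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>function cleanAmount(value) {
  // strip thousands separators, so "1,234,567.89" becomes 1234567.89
  return parseFloat(String(value).replace(/,/g, ''));
}

function cleanDate(value) {
  // "21.01.2011" style (dd.mm.yyyy) needs explicit handling ...
  var m = /^(\d{1,2})\.(\d{1,2})\.(\d{4})$/.exec(value);
  if (m) return new Date(+m[3], +m[2] - 1, +m[1]);
  // ... while "16 Mar 2011" style parses natively
  return new Date(value);
}
</code></pre></div></div>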
<h2 id="progress-so-far">Progress so far</h2>
<p>I do have one month loaded (Jan 2013) with a nice breakdown by “Expenditure
Account”:</p>
<p><a href="http://openspending.org/gb-local-gla">http://openspending.org/gb-local-gla</a></p>
<p>Interestingly, after some fairly standard grants to other bodies, <a href="http://openspending.org/gb-local-gla/expenditure-account/542420">“Claim Settlements”</a>
comes in as the biggest item at £2.3m.</p>
<ul>
<li>Data getting archived at <a href="http://data.openspending.org/datasets/gb-local-gla/">http://data.openspending.org/datasets/gb-local-gla/</a></li>
<li>Clean up script: <a href="https://github.com/rgrp/dataset-gla/blob/master/scripts/process.js">https://github.com/rgrp/dataset-gla/blob/master/scripts/process.js</a></li>
</ul>
Rufus Pollock
sqlaload, an ETL wrapper for SQLAlchemy
2013-03-30T00:00:00+00:00
http://okfnlabs.org/blog/2013/03/30/sqlaload
<p><a href="https://github.com/okfn/sqlaload">sqlaload</a> is a small library that I use to handle databases in Python data processing. In many projects, your process starts with very messy data (something you’ve scraped or loaded from a hand-prepared Excel sheet). In subsequent stages, you gradually add cleaned values in new columns or new tables. Managing a full SQL schema for such operations can be a hassle, you really want something close to <a href="http://www.mongodb.org/">MongoDB</a>: a NoSQL data store you can throw fairly random data at and get it back later.</p>
<p>With sqlaload, the idea is to combine some of the schema flexibility, while still keeping things in a structured database in the background:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">sqlaload</span> <span class="k">as</span> <span class="n">sl</span>
<span class="n">engine</span> <span class="o">=</span> <span class="n">sl</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="s">'sqlite:///test.db'</span><span class="p">)</span>
<span class="c1"># add some data:
</span><span class="n">sl</span><span class="p">.</span><span class="n">add_row</span><span class="p">(</span><span class="n">engine</span><span class="p">,</span> <span class="s">'mytable'</span><span class="p">,</span> <span class="p">{</span><span class="s">'name'</span><span class="p">:</span> <span class="s">'Foo'</span><span class="p">,</span> <span class="s">'has_this'</span><span class="p">:</span> <span class="bp">True</span><span class="p">})</span>
<span class="n">sl</span><span class="p">.</span><span class="n">add_row</span><span class="p">(</span><span class="n">engine</span><span class="p">,</span> <span class="s">'mytable'</span><span class="p">,</span> <span class="p">{</span><span class="s">'name'</span><span class="p">:</span> <span class="s">'Bar'</span><span class="p">,</span> <span class="s">'has_other'</span><span class="p">:</span> <span class="bp">True</span><span class="p">})</span>
<span class="c1"># Look up a record
</span><span class="n">row</span> <span class="o">=</span> <span class="n">sl</span><span class="p">.</span><span class="n">find_one</span><span class="p">(</span><span class="n">engine</span><span class="p">,</span> <span class="s">'mytable'</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'Foo'</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">row</span><span class="p">[</span><span class="s">'has_this'</span><span class="p">]</span><span class="o">==</span><span class="bp">True</span>
<span class="c1"># Update a record:
</span><span class="n">sl</span><span class="p">.</span><span class="n">upsert</span><span class="p">(</span><span class="n">engine</span><span class="p">,</span> <span class="s">'mytable'</span><span class="p">,</span> <span class="p">{</span><span class="s">'name'</span><span class="p">:</span> <span class="s">'Foo'</span><span class="p">,</span> <span class="s">'location'</span><span class="p">:</span> <span class="s">'Atlantis'</span><span class="p">},</span> <span class="p">[</span><span class="s">'name'</span><span class="p">])</span>
<span class="c1"># Or create one:
</span><span class="n">sl</span><span class="p">.</span><span class="n">upsert</span><span class="p">(</span><span class="n">engine</span><span class="p">,</span> <span class="s">'mytable'</span><span class="p">,</span> <span class="p">{</span><span class="s">'name'</span><span class="p">:</span> <span class="s">'Qux'</span><span class="p">,</span> <span class="s">'location'</span><span class="p">:</span> <span class="s">'Elsewhere'</span><span class="p">},</span> <span class="p">[</span><span class="s">'name'</span><span class="p">])</span></code></pre></figure>
<p>I first saw this type of SQL schema generation implemented in <a href="http://scraperwiki.com">ScraperWiki</a>: they have a couple of <a href="https://scraperwiki.com/docs/python/python_help_documentation/">high-level SQLite wrappers</a> that expand your database as you feed them data. We later adopted that concept for the joint CKAN/ScraperWiki <a href="https://github.com/okfn/webstore">webstore</a>, which neither project ended up using.</p>
<p>Still, webstore had become an essential part of many of my data projects as an <a href="http://en.wikipedia.org/wiki/Operational_data_store">operational data store</a>. Eventually, I decided to kick out the networking aspect: data access via HTTP is terribly slow and I wanted to have my data in Postgres, not SQLite. The webstore code went into sqlaload, and became a thin wrapper on top of <a href="http://docs.sqlalchemy.org/en/rel_0_8/">SQLAlchemy core</a> (the non-ORM database abstraction part of SQLAlchemy).</p>
<p>Running on top of SQLAlchemy also means that all of its functionality - for example the <a href="http://docs.sqlalchemy.org/en/rel_0_8/core/expression_api.html">query expression language</a> - are available and can be used to call up more advanced functionality.</p>
<p>If you want to try it out, sqlaload is now on <a href="https://pypi.python.org/pypi/sqlaload">PyPI</a> and the <a href="https://github.com/okfn/sqlaload/blob/master/README.md">README</a> has a lot of detailed documentation on the library.</p>
Friedrich Lindenberg
Next Steps for Textus
2013-03-27T00:00:00+00:00
http://okfnlabs.org/blog/2013/03/27/next-steps-for-textus
<p>At the Culture Labs hangout yesterday we wrote up the plans for the next steps for Textus that we have been discussing over the last few months.</p>
<p>The result is this slide deck overview. It both introduces Textus and outlines next steps (slide 12 onwards).</p>
<iframe src="https://docs.google.com/presentation/d/1OlXIaGgntenmBLNMu0tZYTdrP09TvzZ-R5bpJAgznF4/embed?start=false&loop=false&delayms=3000" frameborder="0" width="580" height="464" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>
<h2 id="key-points">Key Points</h2>
<p>We want to:</p>
<ul>
<li>Maximize simplicity</li>
<li>Connect with a CMS (people always want other content than just the texts)</li>
</ul>
<p>Implications are:</p>
<ul>
<li>Componentize “Textus” and separate text preparation / import from presentation</li>
<li>Create a plugin to make “Textus” style functionality one-click install into Wordpress</li>
<li>Eliminate dependencies on ElasticSearch & NodeJS (texts & markup stored in plain files online or in WP …)</li>
</ul>
<p>Specifically, we plan to break Textus into 3 components:</p>
<ul>
<li><a href="https://github.com/CultureLabs/textus-formatter">textus-formatter</a> - nodejs app/command line tool for formatting texts</li>
<li>textus-viewer - JS-only viewer</li>
<li>textus-wordpress - wordpress integration</li>
</ul>
<p><img src="https://docs.google.com/drawings/d/1S9Hv98LWdcfuG3KjF1qELsZBp-RQ08Ylo3gxaO6tyQg/pub?w=960&h=720" alt="" title="New Architecture" /></p>
Sam Leon
Progress on the Data Explorer
2013-03-18T00:00:00+00:00
http://okfnlabs.org/blog/2013/03/18/progress-with-data-explorer
<p>This is an update on progress with the <a href="http://explorer.okfnlabs.org/">Data Explorer</a> (aka Data Transformer).</p>
<p>Progress is best seen from this <a href="http://explorer.okfnlabs.org/#rgrp/e3e0b0f18dfe151f9f7e">demo which takes you on a tour of house prices and the difference between real and nominal values</a>.</p>
<p>More information on recent developments can be found below. Feedback is <em>very welcome</em> - either here or the issues <a href="https://github.com/okfn/dataexplorer">https://github.com/okfn/dataexplorer</a>.</p>
<p><a href="http://explorer.okfnlabs.org/#rgrp/e3e0b0f18dfe151f9f7e"><img src="http://i.imgur.com/WeDO0vK.png" alt="House prices tutorial" /></a></p>
<h2 id="what-is-the-data-explorer">What is the Data Explorer</h2>
<p>For those not familiar, the <a href="http://explorer.okfnlabs.org/">Data Explorer is an HTML+JS app</a> to view, visualize and process data <em>just in the browser</em> (no backend!). It draws heavily on the <a href="http://okfnlabs.org/recline/">Recline library</a> and features now include:</p>
<ul>
<li>Importing data from various sources (the UX of this could be much improved!)</li>
<li>Viewing and visualizing using Recline to create grids, graphs and maps</li>
<li>Cleaning and transforming data using a scripting component that allows you to write and run javascript</li>
<li>Saving and sharing: everything you create (scripts, graphs etc) can be saved and then shared via public URL.</li>
</ul>
<p>Note that persistence (for sharing) is to Gists (here’s the <a href="https://gist.github.com/rgrp/e3e0b0f18dfe151f9f7e">gist for the House Prices demo linked above</a>). This has some nice benefits such as versioning; offline editing (clone the gist, edit and push); and bl.ocks.org-style ability to create a gist and have it result in publicly viewable output (though with substantial differences vs blocks …).</p>
<h2 id="whats-next">What’s Next</h2>
<p>There are many areas that could be worked on – a full list of <a href="https://github.com/okfn/dataexplorer/issues">issues is in github</a>. The most important I think at the moment are:</p>
<ul>
<li><a href="https://github.com/okfn/dataexplorer/issues/88">Storing the data “locally” in the data project</a>. At present, data is always loaded from an “external” source. This probably involves extending the current Recline datastore to back on to IndexedDB.</li>
<li>A <a href="https://github.com/okfn/dataexplorer/issues/60">better project creation & data import process</a> - I think we could learn a lot from Refine here</li>
<li><a href="https://github.com/okfn/dataexplorer/issues/84">“Fork” support</a></li>
<li>More <a href="https://github.com/okfn/dataexplorer/issues/52">documentation and tutorials especially for scripting</a></li>
<li>Getting rid of the many rough edges especially on the UX side of things!</li>
</ul>
<p>I’d be very interested in people’s thoughts on the app so far and on what should be done next, and code contributions are also very welcome (the app has already benefitted from the efforts of many people including the likes of <a href="http://mk.ucant.org/">Martin Keegan</a> and <a href="https://github.com/michael">Michael Aufreiter</a> to the app itself; and from folks like <a href="http://maxogden.com/">Max Ogden</a>, <a href="http://pudo.org/">Friedrich Lindenberg</a>, <a href="http://casbon.me/">James Casbon</a>, <a href="http://driven-by-data.net/">Gregor Aisch</a>, <a href="http://nigelb.me/">Nigel Babu</a> (and many more) in the form of ideas, feedback, work on Recline etc).</p>
Rufus Pollock
Recline JS - Componentization and a Smaller Core
2013-02-26T00:00:00+00:00
http://okfnlabs.org/blog/2013/02/26/recline-js-componentization-and-a-smaller-core
<p>Over time <a href="http://okfnlabs.org/recline/">Recline JS</a> has grown. In particular, since the first <a href="http://blog.okfn.org/2012/07/05/announcing-recline-js-a-javascript-library-for-building-data-applications-in-the-browser/">public
announcement of Recline</a> last summer we’ve had several people producing
new backends and views (e.g. <a href="https://github.com/okfn/recline/wiki/Extensions">backends for Couch, a view for d3, a map view
based on Ordnance Survey’s tiles etc etc</a>).</p>
<p>As <a href="http://lists.okfn.org/pipermail/okfn-labs/2013-February/000638.html">I wrote to the labs list recently</a>, continually adding these to
core Recline runs the risk of bloat. Instead, we think it’s better to keep the
core lean and move more of these “extensions” out of core with a clear listing
and curation process - the design of Recline means that <a href="http://okfnlabs.org/recline/docs/backends.html">new backends</a> and
<a href="http://okfnlabs.org/recline/docs/views.html">views</a> can extend the core easily and without any complex dependencies.</p>
<p>This approach is useful in other ways. For example, Recline backends are
designed to support standalone use as well as use with Recline core (they have
no dependency on <em>any</em> other part of Recline - <em>including core</em>) but this is
not very obvious as it stands (where the backend is bundled with Recline). To
take a concrete example, the Google Docs backend is a useful wrapper for the
Google Spreadsheets API in its own right. That usefulness is hard to see while the
code lives in the main Recline repository; splitting it out into its own
repo with its own README makes this much clearer.</p>
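<p>For instance, standalone use of a split-out backend might look something like this (a sketch only: it assumes the repo keeps the recline.Backend.GDocs namespace and the fetch() contract from the backend spec linked above):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// assumes the split-out repo keeps the recline.Backend.GDocs namespace and the
// fetch() contract from the backend spec
var dataset = { url: 'https://docs.google.com/spreadsheet/ccc?key=YOUR-SPREADSHEET-KEY' };

recline.Backend.GDocs.fetch(dataset).done(function (result) {
  console.log(result.fields);  // column definitions
  console.log(result.records); // row objects
});
</code></pre></div></div>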
<h2 id="so-the-plan-is-">So the plan is …</h2>
<ul>
<li>Announce this approach of a leaner core and more “Extensions”
<ul>
<li>Link to the specifications for <a href="http://okfnlabs.org/recline/docs/backends.html">Backends</a> and <a href="http://okfnlabs.org/recline/docs/views.html">Views</a></li>
<li>Create an official <a href="https://github.com/okfn/recline/wiki/Extensions">Recline Extensions page</a></li>
</ul>
</li>
<li>Identify first items to split out from core - see <a href="https://github.com/okfn/recline/issues/314">this issue</a></li>
<li>Identify what components <em>should</em> remain in core? (I’m thinking Dataset +
Memory DataStore plus one Grid, Graph and Map)</li>
</ul>
<p>So far I’ve already started the process of factoring out some backends (and
soon views) into standalone repos, e.g. here’s GDocs:</p>
<p><a href="https://github.com/okfn/recline.backend.gdocs">https://github.com/okfn/recline.backend.gdocs</a></p>
<p>Any thoughts are very welcome, and if you already have Recline extensions lurking in
your repos please add them to the <a href="https://github.com/okfn/recline/wiki/Extensions">wiki page</a>.</p>
Rufus Pollock
Exporting PyBossa data to CSV or JSON with one click
2013-02-20T00:00:00+00:00
http://okfnlabs.org/blog/2013/02/20/exporting-pybossa-data-to-csv-with-one-click
<p>I’m really happy to announce that today we have finally added a feature that
will allow you to <a href="http://docs.pybossa.com/en/latest/user/tutorial.html#exporting-the-obtained-results">export your data</a> into CSV format with just one click
(we also support the same for JSON).</p>
<p><img src="http://i.imgur.com/zqPkMST.png" alt="" /></p>
<p>For this purpose, all the applications in PyBossa now feature a new URI:</p>
<blockquote>
<p>http://PYBOSSA-SERVER/app/slug/export</p>
</blockquote>
<p>There you will find several options to export the tasks or task runs (the answers)
to different formats. In the case of the CSV format, you will get a CSV file
that can be downloaded to your computer and loaded later into any spreadsheet
software :-)</p>
<p><img src="http://i.imgur.com/zVZCYW8.png" alt="" /></p>
<p><strong>NOTE</strong>: bear in mind that CSV is a flat format, so nested JSON objects will
be “dumped” as they are. For example, if you are using GeoJSON to store
a location, the CSV file will contain the JSON object as a string.
You can see <a href="http://crowdcrafting.org/app/urbanpark/export?type=task&format=csv">an example of this issue in the Urban Parks application</a>, as this
demo application uses the <a href="http://www.geojson.org/">GeoJSON</a> format for storing the location of the parks.</p>
<p>If you prefer JSON, just click any of the buttons and save the generated file!</p>
<p><img src="http://i.imgur.com/vBDWLeb.png" alt="" /></p>
<p>If you want to try the new feature, just go ahead and check it in <a href="http://crowdcrafting.org">CrowdCrafting.org</a></p>
Daniel Lombraña González
Mozilla FirefoxOS App Days & Crowdcrafting.org
2013-01-29T00:00:00+00:00
http://okfnlabs.org/blog/2013/01/29/firefoxappday
<p><img class="pull-left" src="https://hacks.mozilla.org/wp-content/uploads/2012/12/firefoxOS-app-days_graphic_RGB.png" />
Last Saturday, the 26th of January, <a href="https://hacks.mozilla.org/2013/01/join-us-for-firefox-os-app-days/">Mozilla held a hack day in parallel in 25 cities all over the world</a>, the <a href="https://twitter.com/search?q=%23firefoxosappdays&src=tyah">#FirefoxOSAppDay</a>, about creating new web applications for their new <a href="http://www.mozilla.org/en-US/firefoxos/">FirefoxOS mobile OS</a> and the desktop web browser (both are still in alpha/beta!).</p>
<p>One of the events was held in Madrid, Spain, organized by the <a href="http://www.mozilla-hispano.org/">Mozilla Hispano Community</a>, so I had the chance to spend some time with the Mozilla community and play with the <a href="https://developer.mozilla.org/en/docs/Mozilla/Firefox_OS">new APIs and developer tools for their new platform</a>.</p>
<p><img class="pull-right" src="http://mozorg.cdn.mozilla.net/media/img/firefoxos/firefox-phone.png" /></p>
<p>In the morning we attended several talks by experts on the new APIs that
Mozilla is developing to integrate mobile capabilities: for example the <a href="http://www.w3.org/TR/battery-status/">battery
API</a>, which allows you to check the
device battery status (right now integrated in the W3C standards), or the <a href="https://wiki.mozilla.org/WebAPI/AlarmAPI">Alarm
API</a>, which you can use to schedule a
notification, or for an application to be started, at a specific time.</p>
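<p>To give a flavour, reading the battery status from JavaScript looked roughly like this at the time (a sketch - the prefix handling is my assumption about the builds then shipping):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Gecko shipped the API behind a moz prefix at the time; the fallback is an assumption
var battery = navigator.battery || navigator.mozBattery;
if (battery) {
  console.log('charging: ' + battery.charging + ', level: ' + (battery.level * 100) + '%');
  // fires whenever the charge level changes
  battery.addEventListener('levelchange', function () {
    console.log('battery level is now ' + (battery.level * 100) + '%');
  });
}
</code></pre></div></div>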
<p>Mozilla is working really hard to standardize and integrate most of <a href="https://wiki.mozilla.org/WebAPI">these APIs</a> into the W3C in order to make them available in any web browser. Some of the APIs are actually now accepted in the W3C as for example the <a href="http://dvcs.w3.org/hg/dap/raw-file/tip/battery/Overview.html">Battery Status API</a>, <a href="http://dvcs.w3.org/hg/dap/raw-file/tip/network-api/index.html">Network Information API</a>, <a href="http://www.w3.org/TR/ambient-light/">Ambient light sensor</a> or the <a href="http://www.w3.org/TR/2012/WD-proximity-20120712/">Proximity sensor</a>.</p>
<p>Mozilla also presented their efforts in making it as easy as possible to create an
application from scratch by re-using several <a href="http://buildingfirefoxos.com/">building-blocks</a>
they have created for their new platform. Basically, they have created <a href="http://buildingfirefoxos.com/">a web page</a> where you can copy and paste code snippets to re-use in your own application,
keeping the look and feel of the platform.</p>
<p>After the talks, all the participants had a better idea of what we could
develop with the platform: a web application that could use the hardware of the
new mobile phone devices, as well as Android phones out of the box!</p>
<p>As the goal of the day was to create an app for the FirefoxOS, my idea was to create an application that could help to track when a new scientific application has been added to <a href="http://crowdcrafting.org"><strong>crowd</strong>crafting</a> so you could help with some tasks in the new application.</p>
<p>The web application basically lets you know which apps are new since the last time you checked it out.</p>
<p>The application works in any web browser (even Chrome) but if you want to feel how it will be in the new OS you can try it on your phone if you have an Android device. You will need to install the <a href="http://nightly.mozilla.org/">latest Firefox nightly</a> (<strong>note: </strong><em>this is an experimental build, so it may crash on your phone!</em>) and then type this URL:</p>
<p><a href="http://daniellombrana.es/crowdcrafting-app">http://daniellombrana.es/crowdcrafting-app</a></p>
<p>You will be able to install it in your phone and run it whenever you want directly from your home screen. If you don’t want to install the browser, just open the link with a modern web browser and you should see it running (the install button will only work in <a href="http://www.mozilla.org/en-US/firefox/channel/">Firefox Beta</a> and <a href="http://www.mozilla.org/en-US/firefox/channel/#aurora">Aurora builds</a>).</p>
<p><img src="http://i.imgur.com/xjljFcc.png" alt="FirefoxOS Crowdcrafting app" /></p>
Daniel Lombraña González
PyBossa.JS or how you can easily create new PyBossa applications
2013-01-28T00:00:00+00:00
http://okfnlabs.org/blog/2013/01/28/pybossa-js
<p>In the last weeks we have been working hard to make it easier to develop new PyBossa applications. For this reason, we are happy to announce a new version of PyBossa.JS. This new version introduces several improvements:</p>
<ul>
<li><strong>Creating an app is much easier!</strong> You only have to override two methods, pybossa.taskLoaded and pybossa.presentTask, to fit your app, and call pybossa.run(‘your-app-slug’) - see the sketch after this list.</li>
<li><strong>Pre-loading tasks by default!</strong> Now your app can improve its performance, as the next task for the user will be loaded in the background for you while the user is still solving the first one!</li>
<li><strong>Automatically update the task URL</strong>. The library will change the browser’s URL to the current task automatically, so using services like Disqus for comments is really simple (check the updated version of <a href="http://crowdcrafting.org/app/flickrperson">Flickr Person Finder</a> for more details!).</li>
</ul>
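<p>Put together, a minimal app skeleton looks something like the sketch below (the pybossa.saveTask call for persisting answers is my assumption - check the docs for the exact call):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pybossa.taskLoaded(function (task, deferred) {
  // pre-load anything the task needs (e.g. an image), then hand the task on
  deferred.resolve(task);
});

pybossa.presentTask(function (task, deferred) {
  // render task.info into the page, collect the user's answer,
  // save it (pybossa.saveTask is assumed here - check the docs),
  // then resolve to move on to the next task
  deferred.resolve();
});

pybossa.run('your-app-slug');
</code></pre></div></div>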
<p>As a result of this new version, there are at least two applications using the new PyBossa.JS version:</p>
<ul>
<li><a href="http://crowdcrafting.org/app/flickrperson">Flickr Person Finder</a> has been updating, using this new set of features. If you try the application you will see that loading the next task (in this case an image which is usually 1024x1024px big) is almost instantly. Additionally, the app shows how you can use the Disqus service to allow your users to add comments for each task, but only loading them when the user wants.</li>
<li><a href="http://crowdcrafting.org/app/thefacewemake">The Face We Make</a> is a new application where you have to guess the emoticon that a person is representing in a photo. This app is a joint effort with the official <a href="http://thefacewemake.org/about/">The Face We Make</a> project by <a href="http://dxtr.com/">Dexter Miranda</a> and <a href="http://daniellombrana.es">Daniel Lombraña González</a>. The app has been updated for using the new pre-loading of tasks, and once you have completed all of them (only 10 photos!) show you your results, in other words, how many of your guesses are right/wrong.</li>
</ul>
<p>Finally, we have also added the “missing features” that allow you to create an application without using the API. Right now, you can create an application using only the web forms:
<a rel="lightbox" title="Web form for creating an application" href="/img/pybossa-create-app.png"><img src="/img/pybossa-create-app.png" alt="Web form for creating an app" /></a></p>
<p>You can also add and work on the task presenter (we have included the <a href="http://codemirror.net">CodeMirror plugin</a>, so you can see how your code looks as you type it!):
<a rel="lightbox" title="Web form for editing the task presenter" href="/img/pybossa-task-presenter-editor.png"><img src="/img/pybossa-task-presenter-editor.png" alt="Web form for editing the task presenter" /></a></p>
<p>As well as importing the tasks via a CSV file importer (you can even import the CSV file from a Google Spreadsheet!):</p>
<p><a rel="lightbox" title="Web form for importing tasks from a CSV file" href="/img/pybossa-csv-import.png"><img src="/img/pybossa-csv-import.png" alt="Web form for importing tasks from a CSV file" /></a></p>
<p>The documentation has been updated to reflect these new features, and as a result you should be able to write an application really fast. However, we are far from perfect, so any feedback you can give us will be really valuable. Please leave your feedback in the comments or send us an e-mail to info@pybossa.com. We will be more than happy to hear your thoughts on PyBossa!</p>
Daniel Lombraña González
Journoid, data notifications
2013-01-25T00:00:00+00:00
http://okfnlabs.org/blog/2013/01/25/journoid
<p>At the <a href="http://okfnlabs.org/events/hackdays/lobbying.html">Open Interests</a> hackday in November, a discussion with <a href="http://www.martinstabe.com/">Martin Stabe</a> from the <a href="http://www.ft.com/intl/interactive">FT’s interactive desk</a> led to a prototype of <a href="https://github.com/pudo/journoid">Journoid</a>. The idea is to monitor changing on-line datasets for remarkable information, like <a href="http://datadesk.latimes.com/">earthquakes</a>, procurement in a particular industry or a close parliamentary vote. While we’d discussed alerting in the context of <a href="http://openspending.org/">OpenSpending</a> before, Martin had a pretty specific list of wishes that neither <a href="http://pandaproject.net/">PANDA</a> nor <a href="http://ifttt.com/">IFTTT</a> can handle:</p>
<ul>
<li>
<p>Search not just for a single keyword or query, but compare the incoming data to a table of matches, such as a list of famous people, well-known companies or any other set of items that you may be interested in.</p>
</li>
<li>
<p>Use Google Docs for configuration. The FT uses Google Apps internally and it’s an interface that their reporters already understand - just add a “Config” sheet to your keyword document, and store all relevant settings - like the source URL and recipient email - in there.</p>
</li>
</ul>
<p>The <a href="https://github.com/pudo/journoid">Journoid</a> prototype from the hackday only fulfills the first of those requirements - and I’m still struggling with #2, as it’s surprisingly hard to find a good Google Docs client library for Python.</p>
<p>Still, the hack was a nice demo: sift through a <a href="http://data.etl.openspending.org/uk25k/">data dump from the UK departmental spending</a>, check the supplier information against a list of companies of interest and finally send a message if there is a hit.</p>
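<p>Journoid itself is written in Python, but the core of that demo is simple enough to sketch in a few lines of JavaScript (the watchlist names and the row field are illustrative only):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// the table of matches - names and the row field below are illustrative only
var watchlist = ['ACME WIDGETS LTD', 'EXAMPLE HOLDINGS PLC'];

function checkRow(row) {
  var supplier = (row.supplier || '').toUpperCase();
  watchlist.forEach(function (name) {
    if (supplier.indexOf(name) !== -1) {
      // a real notifier would send an email here rather than log
      console.log('hit: ' + supplier + ' matched ' + name);
    }
  });
}
</code></pre></div></div>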
<p>As a further experiment, I was able to use <a href="http://opencorporates.com/">OpenCorporates</a> to check the supplier’s company status, answering a simple but interesting question: does the government do business with insolvent (or even dissolved) companies? It’s interesting to think what other matches can be made when the comparison list is actually an API.</p>
<p>What’s next? It’s time to clean up the <a href="https://github.com/pudo/journoid/tree/master/journoid">messy hackday code</a>, to finish up GDocs configuration, to build some hosted solution and possibly to add a few other input formats.</p>
<p>This will also probably be my last post to OKFN Labs - early next month,
I’ll join <a href="http://mozillaopennews.org">Knight-Mozilla OpenNews</a> at
<a href="http://spiegel.de">Spiegel Online</a> to spend ten months working on tools
like this, assisting journalists in telling more compelling stories on
the web. I hope that by continuing to cooperate with my friends in the
<a href="http://spendingstories.org">Spending Stories</a> project on Journoid and
similar efforts we can bring open (and some non-open) data into the
media, making a difference.</p>
<p><em>Photo credit: Mike Tigas, <a href="http://www.flickr.com/photos/madmannova/8384618902/sizes/l/in/set-72157632527677275/">If this then news demo</a> (similar project we’ve
started at OpenNews)</em></p>
Friedrich Lindenberg
Web Scraping with CSS Selectors in Node using JSDOM or Cheerio
2013-01-15T00:00:00+00:00
http://okfnlabs.org/blog/2013/01/15/web-scraping-with-node-css-selectors
<p>I’ve traditionally used python for web scraping but I’d been increasingly thinking about using Node given that it is pure JS and therefore could be a more natural fit when getting info out of <em>web</em> pages.</p>
<p>In particular, my first step when looking to extract information from a website is to open up the Chrome Developer tools (or Firebug in Firefox) and try to extract information by inspecting the page and playing around in the console - the latter is especially attractive if jQuery is available.</p>
<p>What I often end up with from this is a few lines of jQuery selectors. My desire here was to find a way to reuse these same css selectors from my browser experimentation directly in the scraping script. Now, things like <a href="http://packages.python.org/pyquery/">pyquery</a> do exist in python (and there is some css selector support in the brilliant BeautifulSoup) but a connection with something like Node seems even more natural - it is, after all, the JS engine from a browser!</p>
<h2 id="uk-crime-data">UK Crime Data</h2>
<p>My immediate motivation for this work was wanting to play around with the <a href="http://police.uk/data">UK Crime data</a> (all <a href="http://opendefinition.org/">open data</a> now!).</p>
<p>To do this I needed to:</p>
<ol>
<li>Get the data in consolidated form by scraping the file list and data files from <a href="http://police.uk/data/">http://police.uk/data/</a> - while they commendably provide the data in bulk there is no single file to download; instead there is one file per force per month.</li>
<li>Do data cleaning and analysis - this included some fun geo-conversion and csv parsing</li>
</ol>
<p>I’m just going to talk about the first part in what follows - though I hope to cover the second part in a follow-up post.</p>
<p>I should also note that all the code used for scraping and working with this data can be found in the <a href="https://github.com/datasets/crime-uk">UK Crime dataset data package on GitHub</a> - the <a href="https://github.com/datasets/crime-uk/blob/master/scripts/scrape.js">scrape.js file is here</a>. You can also see some of the ongoing results of these data experiments in an experimental <a href="http://okfnlabs.org/crime/">UK crime “dashboard” here</a>.</p>
<h2 id="scraping-using-css-selectors-in-node">Scraping using CSS Selectors in Node</h2>
<p>Two options present themselves when doing simple scraping using css selectors in node.js:</p>
<ul>
<li>Using <a href="https://github.com/tmpvar/jsdom">jsdom</a> (+ jquery)</li>
<li>Using <a href="https://github.com/MatthewMueller/cheerio">cheerio</a> (which provides jquery like access to html) + something to retrieve html (my preference is <a href="https://github.com/mikeal/request">request</a> but you can just uses <a href="http://nodejs.org/docs/v0.6.11/api/http.html#http.request">node’s built in http request</a>)</li>
</ul>
<p>For the UK crime work I used jsdom but I’ve subsequently used cheerio as it is substantially faster, so I’ll cover both here (I didn’t discover cheerio until I’d started on the crime work!).</p>
<p>Here’s an excerpted code example (full example in the <a href="https://github.com/datasets/crime-uk/blob/master/scripts/scrape.js">source file</a>):</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="kd">var</span> <span class="nx">url</span> <span class="o">=</span> <span class="dl">'</span><span class="s1">http://police.uk/data</span><span class="dl">'</span><span class="p">;</span>
<span class="c1">// holder for results</span>
<span class="kd">var</span> <span class="nx">out</span> <span class="o">=</span> <span class="p">{</span>
<span class="dl">'</span><span class="s1">streets</span><span class="dl">'</span><span class="p">:</span> <span class="p">[]</span>
<span class="p">}</span>
<span class="nx">jsdom</span><span class="p">.</span><span class="nx">env</span><span class="p">({</span>
<span class="na">html</span><span class="p">:</span> <span class="nx">url</span><span class="p">,</span>
<span class="na">scripts</span><span class="p">:</span> <span class="p">[</span>
<span class="dl">'</span><span class="s1">http://code.jquery.com/jquery.js</span><span class="dl">'</span>
<span class="p">],</span>
<span class="na">done</span><span class="p">:</span> <span class="kd">function</span><span class="p">(</span><span class="nx">errors</span><span class="p">,</span> <span class="nb">window</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">$</span> <span class="o">=</span> <span class="nb">window</span><span class="p">.</span><span class="nx">$</span><span class="p">;</span>
<span class="c1">// find all the html links to the street zip files</span>
<span class="nx">$</span><span class="p">(</span><span class="dl">'</span><span class="s1">#downloads .months table tr td:nth-child(2) a</span><span class="dl">'</span><span class="p">).</span><span class="nx">each</span><span class="p">(</span><span class="kd">function</span><span class="p">(</span><span class="nx">idx</span><span class="p">,</span> <span class="nx">elem</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// push the url (href attribute) onto the list</span>
<span class="nx">out</span><span class="p">[</span><span class="dl">'</span><span class="s1">streets</span><span class="dl">'</span><span class="p">].</span><span class="nx">push</span><span class="p">(</span> <span class="nx">$</span><span class="p">(</span><span class="nx">elem</span><span class="p">).</span><span class="nx">attr</span><span class="p">(</span><span class="dl">'</span><span class="s1">href</span><span class="dl">'</span><span class="p">)</span> <span class="p">);</span>
<span class="p">});</span>
<span class="p">}</span>
<span class="p">});</span></code></pre></figure>
<p>As an example of Cheerio scraping, here’s an excerpt from work <a href="https://github.com/datasets/opented">scraping info from the EU’s TED database</a> (sample <a href="http://files.opented.org.s3.amazonaws.com/scraped/100120-2011/summary.html">html file</a>):</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="c1">// request and cheerio must be required first (npm install request cheerio)</span>
<span class="kd">var</span> <span class="nx">request</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">request</span><span class="dl">'</span><span class="p">);</span>
<span class="kd">var</span> <span class="nx">cheerio</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">cheerio</span><span class="dl">'</span><span class="p">);</span>
<span class="kd">var</span> <span class="nx">url</span> <span class="o">=</span> <span class="dl">'</span><span class="s1">http://files.opented.org.s3.amazonaws.com/scraped/100120-2011/summary.html</span><span class="dl">'</span><span class="p">;</span>
<span class="c1">// place to store results</span>
<span class="kd">var</span> <span class="nx">data</span> <span class="o">=</span> <span class="p">{};</span>
<span class="c1">// do the request using the request library</span>
<span class="nx">request</span><span class="p">(</span><span class="nx">url</span><span class="p">,</span> <span class="kd">function</span><span class="p">(</span><span class="nx">err</span><span class="p">,</span> <span class="nx">resp</span><span class="p">,</span> <span class="nx">body</span><span class="p">){</span>
<span class="nx">$</span> <span class="o">=</span> <span class="nx">cheerio</span><span class="p">.</span><span class="nx">load</span><span class="p">(</span><span class="nx">body</span><span class="p">);</span>
<span class="nx">data</span><span class="p">.</span><span class="nx">winnerDetails</span> <span class="o">=</span> <span class="nx">$</span><span class="p">(</span><span class="dl">'</span><span class="s1">.txtmark .addr</span><span class="dl">'</span><span class="p">).</span><span class="nx">html</span><span class="p">();</span>
<span class="nx">$</span><span class="p">(</span><span class="dl">'</span><span class="s1">.mlioccur .txtmark</span><span class="dl">'</span><span class="p">).</span><span class="nx">each</span><span class="p">(</span><span class="kd">function</span><span class="p">(</span><span class="nx">i</span><span class="p">,</span> <span class="nx">html</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">spans</span> <span class="o">=</span> <span class="nx">$</span><span class="p">(</span><span class="nx">html</span><span class="p">).</span><span class="nx">find</span><span class="p">(</span><span class="dl">'</span><span class="s1">span</span><span class="dl">'</span><span class="p">);</span>
<span class="kd">var</span> <span class="nx">span0</span> <span class="o">=</span> <span class="nx">$</span><span class="p">(</span><span class="nx">spans</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">span0</span><span class="p">.</span><span class="nx">text</span><span class="p">()</span> <span class="o">==</span> <span class="dl">'</span><span class="s1">Initial estimated total value of the contract </span><span class="dl">'</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">amount</span> <span class="o">=</span> <span class="nx">$</span><span class="p">(</span><span class="nx">spans</span><span class="p">[</span><span class="mi">4</span><span class="p">]).</span><span class="nx">text</span><span class="p">()</span>
<span class="nx">data</span><span class="p">.</span><span class="nx">finalamount</span> <span class="o">=</span> <span class="nx">cleanAmount</span><span class="p">(</span><span class="nx">amount</span><span class="p">);</span>
<span class="nx">data</span><span class="p">.</span><span class="nx">initialamount</span> <span class="o">=</span> <span class="nx">cleanAmount</span><span class="p">(</span><span class="nx">$</span><span class="p">(</span><span class="nx">spans</span><span class="p">[</span><span class="mi">1</span><span class="p">]).</span><span class="nx">text</span><span class="p">());</span>
<span class="p">}</span>
<span class="p">});</span>
<span class="p">});</span></code></pre></figure>
Rufus Pollock
Archiving Twitter the Hacky Way
2013-01-08T00:00:00+00:00
http://okfnlabs.org/blog/2013/01/08/archiving-twitter-feeds-the-hacky-way
<p>There are many circumstances where you want to archive tweets - maybe just from your own account or perhaps for a hashtag for an event or topic.</p>
<p>Unfortunately Twitter search queries do not give data more than 7 days old and for a given account you can only get approximately the last 3200 of your tweets and 800 items from your timeline. [Update: People have pointed out that <a href="http://blog.twitter.com/2012/12/your-twitter-archive.html">Twitter released a feature to download an archive of your personal tweets at the end of December</a> - this, of course, still doesn’t help with queries or hashtags]</p>
<p>Thus, if you want to archive Twitter you’ll need to come up with another solution (or pay them, or a reseller, a bunch of money - see Appendix below!). Sadly, most of the online solutions have tended to disappear or be acquired over time (e.g. twapperkeeper). So a DIY solution would be attractive. After reading various proposals on the web I’ve found the following to work pretty well (but see also this <a href="http://mashe.hawksey.info/2012/01/twitter-archive-tagsv3/">excellent Google Spreadsheet-based solution</a>).</p>
<p>The proposed process involves 3 steps:</p>
<ol>
<li>Locate the Twitter Atom Feed for your Search</li>
<li>Use Google Reader as your Archiver</li>
<li>Get your data out of Google Reader (a 1000 items at a time!)</li>
</ol>
<p>One current drawback of this solution is that each stage has to be done by hand. It could be possible to automate more of this, and especially the important third step, if I could work out how to do more with the <a href="http://undoc.in/">Google Reader API</a>. Contributions or suggestions here would be very welcome!</p>
<p><strong><em>Note that the above method will become obsolete as of March 5 2013 when <a href="https://dev.twitter.com/docs/api/1.1/overview#New_Twitter_client_policies">Twitter close down RSS and Atom feeds</a> - continuing their long march to becoming a <del>fully</del> more closed and controlled ecosystem.</em></strong></p>
<p><strong><em>As you struggle, like me, to get precious archival information out of Twitter it may be worth reflecting on just how much information you’ve given to Twitter that you are now unable to retrieve (at least without paying) …</em></strong></p>
<h2 id="twitter-atom-feed">Twitter Atom Feed</h2>
<p>Twitter still have Atom feeds for their search queries:</p>
<p><a href="http://search.twitter.com/search.atom?q=my_search">http://search.twitter.com/search.atom?q=my_search</a></p>
<p>Note that if you want to search for a hash tag like #OpenData or a user e.g. @someone you’ll need to escape the symbols:</p>
<p><a href="http://search.twitter.com/search.atom?q=%23OpenData">http://search.twitter.com/search.atom?q=%23OpenData</a></p>
<p>Unfortunately Twitter Atom queries are limited to only a few items (around 20), so we’ll need to continuously archive that feed to get full coverage.</p>
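<p>Nothing stops you scripting that polling step yourself, of course, if you have a machine that stays online. Here is a minimal sketch of what a DIY poller could look like, assuming Node with the <code class="language-plaintext highlighter-rouge">request</code> module installed; the <code class="language-plaintext highlighter-rouge">archive.json</code> filename and the 10-minute interval are just illustrative choices of mine:</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript">var request = require('request');
var fs = require('fs');

var FEED = 'http://search.twitter.com/search.atom?q=%23OpenData';
var ARCHIVE = 'archive.json'; // my choice of filename

function poll() {
  request(FEED, function(err, resp, body) {
    if (err) { return console.error(err); }
    // load whatever we have archived so far
    var archive = fs.existsSync(ARCHIVE) ?
      JSON.parse(fs.readFileSync(ARCHIVE, 'utf8')) : [];
    var seen = {};
    archive.forEach(function(e) { seen[e.id] = true; });
    // crude Atom parsing: pull out each &lt;entry&gt; block and its &lt;id&gt;
    (body.match(/&lt;entry&gt;[\s\S]*?&lt;\/entry&gt;/g) || []).forEach(function(entry) {
      var id = (entry.match(/&lt;id&gt;(.*?)&lt;\/id&gt;/) || [])[1];
      if (id &amp;&amp; !seen[id]) {
        archive.push({ id: id, raw: entry });
        seen[id] = true;
      }
    });
    fs.writeFileSync(ARCHIVE, JSON.stringify(archive, null, 2));
  });
}

poll();
// re-poll every 10 minutes - often enough for a roughly 20-item feed
setInterval(poll, 10 * 60 * 1000);</code></pre></figure>
<p>The obvious drawback is that your machine has to stay online the whole time; Google Reader, described next, does the polling for you.</p>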
<h2 id="archiving-in-google-reader">Archiving in Google Reader</h2>
<p>Just add the previous feed URL in your Google Reader account. It will then start archiving.</p>
<p>Aside: because the Twitter Atom feed is limited to a small number of items and the check in Google Reader only happens every 3 hours (1h if someone else is archiving the same feed), you can miss a lot of tweets. One option could be to use Topsy’s RSS feeds, e.g. <a href="http://otter.topsy.com/searchdate.rss?q=%23okfn">http://otter.topsy.com/searchdate.rss?q=%23okfn</a> (though it’s not clear how to get more items from this feed either!)</p>
<h2 id="gettting-data-out-of-google-reader">Getting Data out of Google Reader</h2>
<p>Google Reader offers a decent (though still beta) API. Unofficial docs for it can be found here: <a href="http://undoc.in/">http://undoc.in/</a></p>
<p>The key URL we need is:</p>
<p><a href="http://www.google.com/reader/atom/feed/[feed_address]?n=1000">http://www.google.com/reader/atom/feed/[feed_address]?n=1000</a></p>
<p>Note that the feed is limited to a maximum of 1000 items and you can only access it for your account if you are logged in. This means:</p>
<ul>
<li>If you have more than 1000 items you need to find the continuation token in each set of results and then append &c={continuation-token} to your query (see the sketch below the example).</li>
<li>Because you need to be logged in in your browser, you need to do this by hand :-( (it may be possible to automate via the API but I couldn’t get anything to work - any tips much appreciated!)</li>
</ul>
<p>Here’s a concrete example (note, as you need to be logged in this won’t work for you):</p>
<p><a href="http://www.google.com/reader/atom/feed/http://search.twitter.com/search.atom%3Fq%3D%2523OpenData?n=1000">http://www.google.com/reader/atom/feed/http://search.twitter.com/search.atom%3Fq%3D%2523OpenData?n=1000</a></p>
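<p>For completeness, here is roughly what that continuation loop might look like if you <em>could</em> drive it from Node - a sketch only, assuming you have somehow copied an authenticated session cookie out of your browser (the <code class="language-plaintext highlighter-rouge">READER_COOKIES</code> environment variable is my own convention, the continuation token appears - if I recall the feed format correctly - in a <code class="language-plaintext highlighter-rouge">gr:continuation</code> element, and as noted above I haven’t managed to make this work myself):</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript">var request = require('request');

var BASE = 'http://www.google.com/reader/atom/feed/' +
  encodeURIComponent('http://search.twitter.com/search.atom?q=%23OpenData') +
  '?n=1000';

function fetchPage(continuation, done) {
  var url = continuation ? BASE + '&amp;c=' + continuation : BASE;
  request({
    url: url,
    // paste your browser's session cookies into this environment variable
    headers: { 'Cookie': process.env.READER_COOKIES }
  }, function(err, resp, body) {
    if (err) { return done(err); }
    // each page embeds a token pointing at the next page of results
    var m = body.match(/&lt;gr:continuation&gt;(.*?)&lt;\/gr:continuation&gt;/);
    done(null, body, m &amp;&amp; m[1]);
  });
}

// walk the pages until no continuation token comes back
function fetchAll(continuation) {
  fetchPage(continuation, function(err, page, next) {
    if (err) { return console.error(err); }
    console.log('got a page of %d characters', page.length);
    if (next) { fetchAll(next); }
  });
}
fetchAll();</code></pre></figure>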
<p>And that’s it! You should now have a local archive of all your tweets!</p>
<h2 id="appendix">Appendix</h2>
<p>Increasingly, Twitter is selling access to the full Twitter archive and there are a variety of 3rd-party services (such as Gnip, DataSift, Topsy <a href="https://dev.twitter.com/programs/twitter-certified-products/products#Data">and possibly more</a>) offering full or partial access for a fee.</p>
Rufus Pollock
Bundes-Git – German Laws on GitHub
2012-12-13T00:00:00+00:00
http://okfnlabs.org/blog/2012/12/13/bundesgit-german-laws-on-github
<p>If you compare software code and legislation you can find many similarities: both are big bodies of text spread over multiple units (laws/files). The total amount of text inevitably grows bigger over time with many small changes to existing parts while most of the corpus stays the same.</p>
<p>However, the tooling and editing process for these domains is very different: while developers are in the fortunate position that they can build and improve their own tools, legislators are stuck with proprietary tools like MS Word that are simply not built to collaboratively work on a big corpus of text.</p>
<p>But if source code and laws have a similar information structure, why not apply the tools used in software development to the legislative process? That is what Bundes-Git (“Federal Git”) is currently trying out in Germany.</p>
<p><a href="https://github.com/bundestag/gesetze">Bundes-Git</a> is a Git version control repository of all German Federal Laws and Regulations as Markdown. The goal was to come up with the simplest solution to handle laws that could possibly work and integrate it well into the existing developer ecosystem.</p>
<p>The idea has been well received with <a href="http://www.wired.com/wiredenterprise/2012/08/bundestag/">an article on Wired.com</a> and articles on German IT news sites <a href="http://www.heise.de/open/meldung/Entwicklungshistorie-von-Gesetzen-mit-Git-verfolgen-1662758.html">Heise</a> and <a href="http://www.golem.de/news/bundesgit-ein-git-repository-fuer-deutsche-gesetze-1208-93709.html">Golem</a>.</p>
<p>The popularity can surely also be attributed to our marvelous Bundes-Git mascot, dubbed octo eagle, thought up by myself and designed by <a href="https://kkaefer.com/">Konstantin Käfer</a> released under <a href="https://creativecommons.org/publicdomain/zero/1.0/">CC0</a> (please go this way if you are <a href="http://bundesgit.spreadshirt.de/">interested in a t-shirt or hoodie</a>).</p>
<h3 id="design-decisions-explained">Design decisions explained</h3>
<p>All other law storage formats use XML. But to me XML is neither human readable nor human writable. Let me get into the details of some of the design decisions:</p>
<ul>
<li><strong>Git</strong> because it’s the most popular distributed version control system right now.</li>
<li><strong>GitHub</strong> because it’s the most popular Git host right now and comes with some nice perks like Pull Request and GitHub Pages.</li>
<li><strong>Markdown</strong> because any more structure like XML or JSON would make it harder for humans to read or write the format and diffs would be difficult to read.</li>
<li>Naming files <code class="language-plaintext highlighter-rouge">index.md</code> because it works nicely with <strong>Jekyll and GitHub Pages</strong>, which render all laws into a currently very simple page.</li>
<li><strong>YAML Front Matter</strong> is necessary for Jekyll but also serves as a nice metadata store for laws.</li>
<li>Committing from branches with non-fast-forward merges because… uhmm. This is really up for discussion. I want to keep track of where changes originate, and branches are created for each law publication, but this heavily diverges from the clean-commit-history philosophy that e.g. the Linux kernel lives by.</li>
</ul>
<p>There are some more software development concepts that can be applied to the legislation process. Here are some fun things I’d like to try:</p>
<ul>
<li>A <a href="http://prose.io/">prose.io</a>-like editor to easily create law proposals and make a pull request.</li>
<li>Measuring the complexity of corpus/laws/paragraphs and using Travis CI to test pull requests if they make the complexity worse. <a href="http://www.clips.ua.ac.be/pages/pattern">Pattern</a> is a Python NLP library and they recently released a <a href="http://www.clips.ua.ac.be/pages/pattern-de">German module</a> which I want to try on our laws.</li>
<li>Testing foreign key integrity: are all referenced paragraphs still available? (See the sketch after this list.)</li>
<li>Create an informative visualization out of the Git log automatically like <a href="http://blog.openingparliament.org/post/37650393621/what-opening-parliamentary-information-can-tell-us">Gregor Aisch did by hand for the German political party law</a>.</li>
<li>Let the German president sign off on commits to master.</li>
</ul>
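<p>To make the referential-integrity idea from the list above concrete, here is a naive Node sketch - not something that exists in the repository. The file path and the <code class="language-plaintext highlighter-rouge">## § 1</code> heading convention are illustrative assumptions about the Markdown layout:</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript">var fs = require('fs');

// illustrative path - adjust to wherever the law's index.md actually lives
var text = fs.readFileSync('gesetze/b/bgb/index.md', 'utf8');

// collect every section that is actually defined as a heading
var defined = {};
(text.match(/^#+ § \d+[a-z]*/gm) || []).forEach(function(heading) {
  defined[heading.replace(/^#+ /, '')] = true;
});

// flag every in-text reference that points at a missing section
(text.match(/§ \d+[a-z]*/g) || []).forEach(function(ref) {
  if (!defined[ref]) {
    console.log('Dangling reference:', ref);
  }
});</code></pre></figure>
<p>A real check would of course have to handle cross-law references as well, which this single-file sketch would wrongly flag.</p>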
<p>The design decisions around Bundes-Git fit nicely into the Git/GitHub ecosystem but they are not set in stone. They also create some problems and annoyances that need to be fixed or circumvented. While I believe the general philosophy and the freshness of the approach is the right direction, we clearly need more discussion.</p>
<h3 id="future-happenings-around-bundes-git">Future happenings around Bundes-Git:</h3>
<ul>
<li>We applied for funding at <a href="http://innovation.globalintegrity.org/idea-submissions/2012/12/10/applying-version-control-to-the-legislative-process">Testing 123 Global Integrity Innovation Fund</a>. Bundes-Git definitely fits their criteria of brand new, innovative and high-risk. The decision will be made later this month, fingers crossed!</li>
<li>I will talk at the <a href="http://events.ccc.de/congress/2012/Fahrplan/events/5263.en.html">29th Chaos Communication Congress about Bundes-Git</a>.</li>
<li>There will be Bundes-Git Hacker Meetup in mid January. If you are interested, <a href="https://terminplaner.dfn.de/foodle.php?id=hhndrdx742az60wf">sign up here</a>.</li>
</ul>
<p>We decided that the language of discussion on GitHub will be German, but feel free to start a conversation on the <a href="http://lists.okfn.org/mailman/listinfo/open-legislation">OKF Open Legislation mailing list</a>.</p>
<p><strong>Also be sure to follow <a href="https://twitter.com/bundesgit">@bundesgit on Twitter</a>!</strong></p>
Stefan Wehrmeyer
Speeding Up Your PyBossa App
2012-12-12T00:00:00+00:00
http://okfnlabs.org/blog/2012/12/12/speeding-up-pybossa-apps
<p>Thanks to the free <a href="http://crowdcrafting.org">crowd-crafting</a> tool <a href="http://dev.pybossa.com/">PyBossa</a>, nowadays the biggest challenge for successful crowd-sourcing is engaging users to participate in tasks, and keeping that motivation at a high level over time. Therefore, the user experience of crowd-sourcing apps plays a crucial role.</p>
<p>After participating in quite a few tasks myself, I found that the loading time in between two tasks was the most annoying thing. Doing crowd-sourcing tasks often feels like doing something stupid, and you really want to get things done as fast as possible. Sometimes it needs just a single click to solve a task, but then it takes seconds to load the next one.</p>
<p>This is because all existing apps were designed in a synchronous fashion. The client requests a new task and presents it to the user as soon as it has been loaded. <em>After</em> the user has solved the task, the result is submitted and <em>after</em> the result has been stored a new task is requested and so on.</p>
<p><a rel="lightbox" title="Process flow in current PyBossa apps" href="/img/pybossa-workflow-old.png"><img src="/img/pybossa-workflow-old.png" alt="current workflow" /></a></p>
<p>(click to enlarge)</p>
<p>Some apps even need to load additional information, such as images or data coming from external APIs. This loading time accumulates quickly, and will most probably lower the motivation of your users!</p>
<h2 id="pre-loading-subsequent-tasks--magic">Pre-loading subsequent tasks == magic</h2>
<p>The idea for reducing the loading time is actually pretty simple: We let the app load the next task <em>while</em> the user is solving the current one. This results in a parallel process as described in the following chart:</p>
<p><a rel="lightbox" title="Proposed process flow for PyBossa apps" href="/img/pybossa-workflow-new.png"><img src="/img/pybossa-workflow-new.png" alt="proposed workflow" /></a></p>
<p>To implement this in PyBossa, we needed to change the PyBossa API a little bit (thanks @<a href="https://github.com/PyBossa/pybossa/commit/4f5bdd4698a1ac21f3021347cd9ec08e68f18bdc">teleyinex</a>). Before that change consecutive calls to the <a href="https://pybossa.readthedocs.io/en/latest/model.html#requesting-a-new-task-for-current-user">newtask endpoint</a> would return the same task again and again, until the user has solved it. Now with the newly introduced parameter <strong>offset</strong> you can request the next tasks in line.</p>
<p>Another requirement for pre-loading of tasks is to keep the entire app on one page as otherwise the cached task would be lost. The rest of this post describes a smart way to implement this using <a href="http://api.jquery.com/category/deferred-object/">jQuery.Deferred</a>.</p>
<h2 id="smart-implementation-using-jquerydeferred">Smart implementation using jQuery.Deferred</h2>
<p>Looking from our PyBossa app, the pre-loading of the next task and the user solving the current one are two asynchronous actions running in parallel. We have to wait until both are completed before we can proceed to the next task.</p>
<p><a href="http://eng.wealthfront.com/2012/12/jquerydeferred-is-most-important-client.html">This article</a> reminded me of a smart way to implement this using jQuery.Deferred. The following function shows everything we need for our main loop.</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="kd">function</span> <span class="nx">run</span><span class="p">(</span><span class="nx">task</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">nextLoaded</span> <span class="o">=</span> <span class="nx">loadTask</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span>
<span class="nx">taskSolved</span> <span class="o">=</span> <span class="nx">presentTask</span><span class="p">(</span><span class="nx">task</span><span class="p">);</span>
<span class="nx">$</span><span class="p">.</span><span class="nx">when</span><span class="p">(</span><span class="nx">nextLoaded</span><span class="p">,</span> <span class="nx">taskSolved</span><span class="p">).</span><span class="nx">done</span><span class="p">(</span><span class="nx">run</span><span class="p">);</span>
<span class="p">}</span></code></pre></figure>
<p>To start the loop, we need to load the first task and pass it to run.</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="nx">loadTask</span><span class="p">().</span><span class="nx">done</span><span class="p">(</span><span class="nx">run</span><span class="p">);</span></code></pre></figure>
<p>Now let’s take a look at <code class="language-plaintext highlighter-rouge">loadTask()</code>. The parameter offset is passed to the API. After the task and everything else we might need is loaded, we mark the deferred as resolved and pass the task to the done handler. Finally we return a ‘locked’ version of the deferred object.</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="kd">function</span> <span class="nx">loadTask</span><span class="p">(</span><span class="nx">offset</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">offset</span> <span class="o">=</span> <span class="nx">offset</span> <span class="o">||</span> <span class="mi">0</span><span class="p">;</span>
<span class="kd">var</span> <span class="nx">taskLoaded</span> <span class="o">=</span> <span class="nx">$</span><span class="p">.</span><span class="nx">Deferred</span><span class="p">();</span>
<span class="nx">$</span><span class="p">.</span><span class="nx">getJSON</span><span class="p">(</span><span class="dl">'</span><span class="s1">/api/app/</span><span class="dl">'</span><span class="o">+</span><span class="nx">appid</span><span class="o">+</span><span class="dl">'</span><span class="s1">/newtask?offset=</span><span class="dl">'</span> <span class="o">+</span> <span class="nx">offset</span><span class="p">,</span> <span class="kd">function</span><span class="p">(</span><span class="nx">task</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// load more data if you need</span>
<span class="c1">// and then, resolve Deferred</span>
<span class="nx">taskLoaded</span><span class="p">.</span><span class="nx">resolve</span><span class="p">(</span><span class="nx">task</span><span class="p">);</span>
<span class="p">});</span>
<span class="k">return</span> <span class="nx">taskLoaded</span><span class="p">.</span><span class="nx">promise</span><span class="p">();</span>
<span class="p">}</span></code></pre></figure>
<p>We can use exactly the same method to model the user action. Therefore <code class="language-plaintext highlighter-rouge">presentTask()</code> will return a deferred object, too. It gets resolved as soon as the user has solved the task and the answer is correctly submitted to PyBossa.</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="kd">function</span> <span class="nx">presentTask</span><span class="p">(</span><span class="nx">task</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">taskSolved</span> <span class="o">=</span> <span class="nx">$</span><span class="p">.</span><span class="nx">Deferred</span><span class="p">();</span>
<span class="c1">// update presenter html</span>
<span class="nx">$</span><span class="p">(</span><span class="dl">'</span><span class="s1">.question</span><span class="dl">'</span><span class="p">).</span><span class="nx">html</span><span class="p">(</span><span class="nx">task</span><span class="p">.</span><span class="nx">question</span><span class="p">);</span>
<span class="c1">// wait for user action</span>
<span class="nx">$</span><span class="p">(</span><span class="dl">'</span><span class="s1">button.submit</span><span class="dl">'</span><span class="p">).</span><span class="nx">off</span><span class="p">(</span><span class="dl">'</span><span class="s1">click</span><span class="dl">'</span><span class="p">).</span><span class="nx">on</span><span class="p">(</span><span class="dl">'</span><span class="s1">click</span><span class="dl">'</span><span class="p">,</span> <span class="kd">function</span><span class="p">()</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">answer</span> <span class="o">=</span> <span class="p">{</span> <span class="na">foo</span><span class="p">:</span> <span class="dl">"</span><span class="s2">Bar</span><span class="dl">"</span> <span class="p">};</span> <span class="c1">// fetch answer from UI</span>
<span class="nx">pybossa</span><span class="p">.</span><span class="nx">saveTask</span><span class="p">(</span><span class="nx">task</span><span class="p">.</span><span class="nx">id</span><span class="p">,</span> <span class="nx">answer</span><span class="p">).</span><span class="nx">done</span><span class="p">(</span><span class="kd">function</span><span class="p">()</span> <span class="p">{</span>
<span class="nx">taskSolved</span><span class="p">.</span><span class="nx">resolve</span><span class="p">();</span>
<span class="p">});</span>
<span class="p">});</span>
<span class="k">return</span> <span class="nx">taskSolved</span><span class="p">.</span><span class="nx">promise</span><span class="p">();</span>
<span class="p">}</span></code></pre></figure>
<p>And that’s it.</p>
<p>This method will significantly speed up your PyBossa app, especially if you need to fetch data from third party APIs. Remind yourself that even a speedup of a few seconds is a huge benefit for your voluntary users, as they are likely to go through this process quite often. And you really don’t want to waste their time, do you?</p>
<p><em>Update:</em> Why not try the <a href="http://crowdcrafting.org/app/flickrperson2/newtask">FlickrPerson demo app the speedy way</a>?</p>
Gregor Aisch
Javascript Timeline Libraries - A Review
2012-12-04T00:00:00+00:00
http://okfnlabs.org/blog/2012/12/04/javascript-timeline-libaries-a-review
<p>This post is a rough and ready overview of various javascript timeline libraries that arose from research in creating a timeline view for <a href="http://reclinejs.com/">Recline JS</a>. Note this material hung around on my hard disk for a few months so some of it may already be a little bit out of date!</p>
<div class="alert alert-info">
<strong>October 2013</strong>: We have released <strong><a href="http://timemapper.okfnlabs.org/">TimeMapper</a></strong> a new online app for creating <strong>Timelines and TimeMaps</strong> quickly and easily. Check it out at <strong><a href="http://timemapper.okfnlabs.org/">http://timemapper.okfnlabs.org/</a></strong>
</div>
<p>I want to start with a general comment. Timeline libraries consist of various components:</p>
<ul>
<li>Data loading
<ul>
<li>Date parsing</li>
</ul>
</li>
<li>Band (timeline) rendering</li>
<li>Showing render info on individual items</li>
</ul>
<p>For me a timeline visualization library needs only to be the second of these, but most that I’ve come across do more.</p>
<p>In fact a major issue in my opinion with most libraries is that they are <em>under-componentized</em> - they don’t separate cleanly into these different components and end up doing everything.</p>
<p>To take one example, the Verite timeline (in my view one of the best libraries out there) has a whole bunch of its own custom date parsing built in inside an internal utility library which is hard to override or replace, and also has a large chunk of code just for loading from google docs and other data sources. (You can of course somewhat solve this – as I do in Recline – by parsing the dates directly and then submitting in a standardized form.)</p>
<p>In my view, even if library authors do want to include these sorts of things, it would be good to do it in a way that allowed for a clean separation so that you could just use the parts you wanted (and/or over-ride parts more cleanly).</p>
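<p>To make the point concrete, here is the kind of thing I mean - a sketch (not any particular library’s API) where date parsing is done up front and the timeline component only ever sees standardized ISO 8601 strings:</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript">// normalize the many date forms found in real data to ISO 8601
function normalizeDate(value) {
  var d = (value instanceof Date) ? value : new Date(value);
  if (isNaN(d.getTime())) {
    throw new Error('Unparseable date: ' + value);
  }
  return d.toISOString();
}

// parse first, then hand the timeline pre-standardized values
var rawItems = [{ title: 'Simile Timeline released', start: 'Jun 15, 2006' }];
var items = rawItems.map(function(item) {
  return { title: item.title, start: normalizeDate(item.start) };
});</code></pre></figure>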
<h2 id="propublica-timeline-setter">Propublica Timeline Setter</h2>
<ul>
<li><a href="http://propublica.github.com/timeline-setter/">http://propublica.github.com/timeline-setter/</a></li>
<li>HTML + JS
<ul>
<li>But requires a build step (using Ruby)</li>
</ul>
</li>
<li>Very simple and compact design (nice!)</li>
</ul>
<h2 id="verite-timeline">Verite Timeline</h2>
<ul>
<li><a href="http://timeline.verite.co/">http://timeline.verite.co/</a></li>
<li>Very elegant frontend design</li>
<li>2 bands in timeline segment and tight integration of item display</li>
<li>Includes much more than Timeline (e.g. sourcing data from google docs etc)</li>
<li>Mozilla Public License (was GPL)</li>
</ul>
<h2 id="simile-timeline">Simile Timeline</h2>
<ul>
<li><a href="http://www.simile-widgets.org/timeline/">http://www.simile-widgets.org/timeline/</a></li>
<li>The original open-source JS timeline but less regularly updated and maintained today: “As of Spring 2012, Exhibit is the only Simile widget seeing active development.” and the timeline control has not been updated since 2009 (see this <a href="http://stackoverflow.com/questions/4700419/alternative-to-simile-timeline-for-timeline-visualization">stackoverflow question for more</a>)</li>
</ul>
<h2 id="chronoline">Chronoline</h2>
<ul>
<li><a href="http://stoicloofah.github.com/chronoline.js/">http://stoicloofah.github.com/chronoline.js/</a></li>
<li>Recently developed and updated</li>
<li>MIT licensed</li>
</ul>
<h2 id="timeglider">Timeglider</h2>
<ul>
<li><a href="https://github.com/timeglider/jquery_widget">https://github.com/timeglider/jquery_widget</a></li>
<li>Non-open license (but was MIT licensed <a href="https://github.com/timeglider/jquery_widget/tree/345442fa3dc7c66b23c36031a6569693ecf309bd">earlier on</a>)</li>
</ul>
<h2 id="chaps-timeline">CHAPS Timeline</h2>
<ul>
<li><a href="http://almende.github.com/chap-links-library/timeline.html">http://almende.github.com/chap-links-library/timeline.html</a></li>
<li>Looks pretty nice though CSS is not quite as elegant (probably fixable!)</li>
<li>Not clear whether it supports multiple bands</li>
</ul>
Rufus Pollock
Following Money and Influence in the EU - the Open Interests Hackathon
2012-11-29T00:00:00+00:00
http://okfnlabs.org/blog/2012/11/29/openinterests-review
<p>
Making sense of massive datasets that document
the processes of lobbying and public procurement at European Union level
is not an easy task. Yet a group of 25 journalists, developers, graphic
designers and activists worked together at the <a href="http://okfnlabs.org/events/hackdays/lobbying.html">Open Interests
Europe</a> hackathon last weekend to create tools and maps that make it
easier for citizens and journalists to see how lobbyists try to
influence European policies and to understand how governments award
contracts for public services. The hackathon was organised by the
European Journalism Centre and the Open Knowledge Foundation with
support from Knight-Mozilla OpenNews.</p>
<p>
At the Google Campus Cafe in London, one group dived into European
lobbying data made available via an API: <a href="http://api.lobbyfacts.eu/">api.lobbyfacts.eu</a>. Created by a
group of five NGOs: Corporate Europe Observatory, Friends of the Earth
Europe, Lobby Control, Tactical Tech and the Open Knowledge Foundation,
the API gives access to up-to-date, structured information about persons
and organisations registered as lobbyists in the <a href="http://europa.eu/transparency-register/">EU Transparency
Register</a>. The API is part of lobbyfacts.eu, a website that aims
to make it easy for anyone to track lobbyists and their influence at
European Union level, due to launch in January 2013.</p>
<p>
One of the projects created with the lobby register data is a map
showing the locations of the offices of lobby firms based on their
turnover. The size of the bubbles on the map corresponds to the turnover
of the firm. Built by <a href="https://twitter.com/pudo">Friedrich
Lindenberg</a>, the map is an overlay of a Stamen Design map with
Leafletjs.</p>
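<p>For readers who want to try something similar, a bubble map along these lines can be put together in a few lines of Leaflet. The sketch below is mine, not Friedrich’s code; it assumes Leaflet and jQuery are on the page, and the API endpoint and field names (<code class="language-plaintext highlighter-rouge">lat</code>, <code class="language-plaintext highlighter-rouge">lng</code>, <code class="language-plaintext highlighter-rouge">turnover</code>, <code class="language-plaintext highlighter-rouge">name</code>) are illustrative assumptions:</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript">// assumes the Leaflet and jQuery scripts are already loaded on the page
var map = L.map('map').setView([50.85, 4.35], 4); // centred on Brussels

// Stamen Design's toner tiles as the base layer
L.tileLayer('http://tile.stamen.com/toner/{z}/{x}/{y}.png', {
  attribution: 'Map tiles by Stamen Design'
}).addTo(map);

// illustrative endpoint and fields - adapt to the real API's output
$.getJSON('http://api.lobbyfacts.eu/api/entity?format=json', function(firms) {
  firms.forEach(function(firm) {
    L.circleMarker([firm.lat, firm.lng], {
      // scale bubble *area* (not radius) with turnover
      radius: Math.sqrt(firm.turnover / 100000)
    }).bindPopup(firm.name).addTo(map);
  });
});</code></pre></figure>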
<p style="text-align: center">
<img alt="" src="https://lh4.googleusercontent.com/Gz7dg2T1mfSb2U7uDfotj2_giiIj8-gSIa5GEpw0SoB7negarpQpeHEW13-QmxOF5YkC_vHg7fyQNeFGU65iyfYdx_cmzxf8nfLYVigKXBamuD8Roe0C" style="height: 320px;width: 600px" /></p>
<p style="text-align: center">
<em>Screenshot of <a href="http://api.lobbyfacts.eu/map">api.lobbyfacts.eu/map</a> showing
locations of lobbying firms across Europe</em></p>
<p>
Other teams focused on data analysis, comparing the data from the EU
Transparency Register with that of the <a href="http://www.google.com/url?q=http%3A%2F%2Fec.europa.eu%2Ftransparency%2Fregexpert%2F&sa=D&sntz=1&usg=AFQjCNE2JbDkGcyojnufFa8-lw8sMFEpyA">Register
of Expert Groups</a>. Interesting leads for possible further
investigative work resulted from the comparison of the figures reported
by lobby firms in the Transparency Register with those collected by the
<a href="http://www.google.com/url?q=http%3A%2F%2Fwww.nbb.be%2Fpub%2Fhome.htm&sa=D&sntz=1&usg=AFQjCNEOiiu39BbbE6C8eJF7FI_8J1vT9Q">National
Bank of Belgium</a>. “Some companies underreported massively to
the National Bank of Belgium and some of them were making themselves
look bigger in the Transparency Register,” said Eric Wesselius,
leader of the lobby transparency challenge and co-founder of <a href="http://corporateeurope.org/">Corporate Europe Observatory</a>.
Wesselius’ organisation will continue investigations in this
area.</p>
<p>
A second group of journalists and graphic designers led by Jack
Thurston, an activist involved in <a href="http://fishsubsidy.org/">Fishsubsidy.org</a>, discussed how fish
subsidy data could be used for finding journalistic stories and explored
various ways in which the unintended consequences of the EU fish
subsidies programme, such as overfishing, could be compellingly
presented to the general public. </p>
<p style="text-align: center">
<img alt="" src="https://lh3.googleusercontent.com/aRSFEijY87FeGF1vDcWwVJBYQlvNV1uordwuc7kVcjheSV6uDBvmRyKn9e4R5GgtFjTuA1-lh_1m2sAp-3S6qKb7QPW1ASFV3WIWWv_2ff9YX7gEWA0" style="width: 400px;height: 300px" /></p>
<p style="text-align: center">
<em>Sketch for interactive graphic showing fishing vessels, their
trajectory and the subsidies they receive, made by graphic designer <a href="http://helenesears.carbonmade.com/">Helene Sears</a></em></p>
<p>
A third group looked into European public procurement data.
“Public procurement is an area that is underreported by
journalists,” said data journalist Anders Pedersen, founder of <a href="http://opented.org/">OpenTED</a>. “9-25% of the GDP in the
EU is procurement - highest in the Netherlands where it is around 35%.
It’s a real issue in times of austerity who provides our
services,” he added.</p>
<p>
Several <a href="http://www.google.com/url?q=https%3A%2F%2Fgithub.com%2Fmiha-stopar%2Fsandbox&sa=D&sntz=1&usg=AFQjCNEPCecCTO1CWVEDufnaAtGGR4Q4Tw">scrapers</a>
were built to access the data relating to winners of contracts and the
values of these contracts from the EU publication <a href="http://ted.europa.eu/TED/main/HomePage.do">TED</a> (Tenders
Electronic Daily). A map of public procurement contracts by awarding
city was created using Google Fusion Tables by geocoding the original
CSV file, enriched with OpenStreetMap.</p>
<p style="text-align: center">
<img src="https://lh5.googleusercontent.com/oJnD9EYVOLshaLA4j3dsMHf4JxU3tzTHiQQcnjF8XFY20Psfm4Z4xlgWBOSePQzwE4SplYfyc_b_W19eCVtKMQgl00eDlDQDMxMjkkM2ghgmGYV6_AZc" style="height: 321px;width: 600px" /></p>
<p style="text-align: center">
<em>Screenshot of <a href="https://www.google.com/fusiontables/data?docid=1Cq8cKQ2r739is5gXegmX-fkI6ASAi5OOe9mepIo&pli=1#map:id=3">map
of public procurement contracts</a> by Benjamin Simatos and Martin
Stabe</em></p>
<p>
Pedersen’s long term goal is to create an interface and an API for
EU public procurement data and to publish some more visualisations.
“A lot of the work that got done here [at the hackathon] we would
not have gotten done in the next months maybe. It really helped us push
far ahead in terms of ideas and in terms of getting stuff
done.”</p>
<p>
This blog post is cross-posted from the <a href="http://datadrivenjournalism.net/news_and_analysis/Following_Money_and_Influence_in_the_EU_the_Open_Interests_Europe_Hackday">Data-driven
Journalism Blog</a>.
</p>
<p><em>Photo of participants at the hackathon by <a href="http://www.flickr.com/photos/fred2baro/">Mehdi Guiraud</a>.</em></p>
Liliana Bounegru
Scraping Data Behind a CAPTCHA
2012-11-13T00:00:00+00:00
http://okfnlabs.org/blog/2012/11/13/scrapping-data-behind-a-captcha
<p>How much does the highest-paid person in the Brazilian Federal Senate earn?
That’s the question I asked myself a few weeks ago, and one that should be
easy to answer. In Brazil, every public body must publish its employees’
salaries online, but some do so in a terrible way. The Federal Senate is
one of these.</p>
<p>To access its data you have to not only fill in your personal info, but also
solve a CAPTCHA for each salary you want to see. With no other tricks, it would
take ages to answer my question. I needed a way to gather all salaries and
compare them. But how to scrape a page that’s “protected” behind a CAPTCHA?</p>
<p><img src="/img/res/senado-gov-br-captcha.jpg" style="margin: 0 auto; display: block;" alt="senado.gov.br CAPTCHA" /></p>
<p><a href="http://decaptcher.com">Decaptcher</a> is a company that sells CAPTCHA-solving
services. They provide an API to which you can send an image and get back the
contained text. It’s really cheap (US$ 1.38 per 1,000 CAPTCHAs), and works well,
albeit a bit slow (30~40 secs). They promise a success rate of over 95%, but I got
only 43% in my tests, probably because the CAPTCHAs I’m sending are really hard to read.</p>
<p><a href="http://decaptcher.org/api">Their API</a> is simple to implement, with only 3
actions (upload, refund, and balance). There are examples in C# and PHP, and
I’ve hacked together <a href="https://gist.github.com/4063793">one in Ruby</a>. For a
bit more than US$ 5.92, I was able to access and publish the salaries of
4,487 public servants in <a href="http://senado.cc">http://senado.cc</a>.</p>
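<p>For a flavour of what such a call looks like, here is a hedged Node sketch (my working version is the Ruby gist linked above). The endpoint URL and form field names below are placeholders, <em>not</em> the actual Decaptcher API - check <a href="http://decaptcher.org/api">their docs</a> for the real parameters:</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript">var fs = require('fs');
var request = require('request');

function solveCaptcha(imagePath, done) {
  request.post({
    url: 'http://api.decaptcher.example/solve', // placeholder, not the real endpoint
    formData: {
      // placeholder field names - see the API docs for the real ones
      username: process.env.DC_USER,
      password: process.env.DC_PASS,
      image: fs.createReadStream(imagePath)
    }
  }, function(err, resp, body) {
    if (err) { return done(err); }
    done(null, body.trim()); // the solved text, 30-40 seconds later
  });
}

solveCaptcha('captcha.jpg', function(err, text) {
  if (err) { return console.error(err); }
  console.log('CAPTCHA says:', text);
});</code></pre></figure>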
<p>There are many other companies that offer the same service, like
<a href="http://deathbycaptcha.com">Death by CAPTCHA</a>, <a href="http://bypasscaptcha.com/">Bypass CAPTCHA</a>,
<a href="http://www.beatcaptchas.com/">Beat CAPTCHA</a>, and <a href="http://antigate.com/">Antigate</a>.
These services allow us to access public data that would be unreachable otherwise,
but they might be considered illegal in some countries. As we’re not breaking the
CAPTCHA, but paying people to solve them, we should be fine. But don’t take my word
for it: ask a lawyer.</p>
Vitor Baptista
Recline JS Search Demo
2012-11-01T00:00:00+00:00
http://okfnlabs.org/blog/2012/11/01/recline-js-search-demo
<p><a href="http://reclinejs.com/"><img src="http://assets.okfn.org/p/recline/img/logo.png" style="float: right; height: 100px;" alt="Recline JS" /></a></p>
<p>We’ve recently finished a demo for ReclineJS showing how it can be used to build
JS-based (ajax-style) search interfaces in minutes (or even seconds!):
<a href="http://reclinejs.com/demos/search/">http://reclinejs.com/demos/search/</a></p>
<p>Because of Recline’s <a href="http://reclinejs.com/docs/backends.html">pluggable backends</a> you get out of the box
support for data sources such as SOLR, Google Spreadsheet, ElasticSearch, or
plain old JSON or CSV – see examples below for live examples of using
different backends.</p>
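<p>For instance, pointing the dataset at a different source is essentially a one-liner - here is a minimal sketch (using a made-up CSV URL) along the lines of what the demo does:</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript">// wire a dataset to a backend - swap 'csv' for e.g. 'solr' or 'gdocs'
var dataset = new recline.Model.Dataset({
  url: 'http://example.com/data.csv', // made-up URL
  backend: 'csv'
});

// fetch returns a promise; once resolved the records collection is populated
dataset.fetch().done(function() {
  console.log(dataset.recordCount + ' records found');
});</code></pre></figure>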
<p>Interested in using this yourself? The <a href="http://reclinejs.com//docs/src/demo.search.app.html">(prettified) source JS for the demo is
available</a> (plus the <a href="http://reclinejs.com/demos/search/demo.search.app.js">raw version</a>) and it shows how simple
it is to build an app like this using Recline – plus it has tips on how
to customize and extend.</p>
<p><a href="http://reclinejs.com/demos/search/"><img src="http://i.imgur.com/Ja8SV.png" alt="demo" style="width: 100%" /></a></p>
<h2 id="more-examples">More Examples</h2>
<p>In addition to the simple example with local data there are several other
examples showing how one can use this with other data sources including Google
Docs and SOLR:</p>
<ol>
<li>
<p>A <a href="http://reclinejs.com/demos/search/?backend=gdocs&url=https://docs.google.com/spreadsheet/ccc?key=0Aon3JiuouxLUdExXSTl2Y01xZEszOTBFZjVzcGtzVVE">search example using a google docs listing Shell Oil spills in the Niger
delta</a></p>
</li>
<li>
<p>A <a href="http://reclinejs.com/demos/search/?backend=solr&url=http://openspending.org/api/search">search example running of OpenSpending SOLR
API</a>
– we suggest searching for something interesting like “Drugs” or “Nuclear
power”!</p>
</li>
</ol>
<h2 id="code">Code</h2>
<p>The full <a href="http://reclinejs.com//docs/src/demo.search.app.html">(prettified) source JS for the demo is available</a>
(plus the <a href="http://reclinejs.com/demos/search/demo.search.app.js">raw version</a>) but here’s a key code sample to give a flavour:</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="c1">// ## Simple Search View</span>
<span class="c1">//</span>
<span class="c1">// This is a simple bespoke Backbone view for the Search. It Pulls together</span>
<span class="c1">// various Recline UI components and the central Dataset and Query (state)</span>
<span class="c1">// object</span>
<span class="c1">//</span>
<span class="c1">// It also provides simple support for customization e.g. of template for list of results</span>
<span class="c1">//</span>
<span class="c1">// var view = new SearchView({</span>
<span class="c1">// el: $('some-element'),</span>
<span class="c1">// model: dataset</span>
<span class="c1">// // EITHER a mustache template (passed a JSON version of recline.Model.Record</span>
<span class="c1">// // OR a function which receives a record in JSON form and returns html</span>
<span class="c1">// template: mustache-template-or-function</span>
<span class="c1">// });</span>
<span class="kd">var</span> <span class="nx">SearchView</span> <span class="o">=</span> <span class="nx">Backbone</span><span class="p">.</span><span class="nx">View</span><span class="p">.</span><span class="nx">extend</span><span class="p">({</span>
<span class="na">initialize</span><span class="p">:</span> <span class="kd">function</span><span class="p">(</span><span class="nx">options</span><span class="p">)</span> <span class="p">{</span>
<span class="k">this</span><span class="p">.</span><span class="nx">el</span> <span class="o">=</span> <span class="nx">$</span><span class="p">(</span><span class="k">this</span><span class="p">.</span><span class="nx">el</span><span class="p">);</span>
<span class="nx">_</span><span class="p">.</span><span class="nx">bindAll</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="dl">'</span><span class="s1">render</span><span class="dl">'</span><span class="p">);</span>
<span class="k">this</span><span class="p">.</span><span class="nx">recordTemplate</span> <span class="o">=</span> <span class="nx">options</span><span class="p">.</span><span class="nx">template</span><span class="p">;</span>
<span class="c1">// Every time we do a search the recline.Dataset.records Backbone</span>
<span class="c1">// collection will get reset. We want to re-render each time!</span>
<span class="k">this</span><span class="p">.</span><span class="nx">model</span><span class="p">.</span><span class="nx">records</span><span class="p">.</span><span class="nx">bind</span><span class="p">(</span><span class="dl">'</span><span class="s1">reset</span><span class="dl">'</span><span class="p">,</span> <span class="k">this</span><span class="p">.</span><span class="nx">render</span><span class="p">);</span>
<span class="k">this</span><span class="p">.</span><span class="nx">templateResults</span> <span class="o">=</span> <span class="nx">options</span><span class="p">.</span><span class="nx">template</span><span class="p">;</span>
<span class="p">},</span>
<span class="c1">// overall template for this view</span>
<span class="na">template</span><span class="p">:</span> <span class="dl">'</span><span class="s1"> </span><span class="se">\</span><span class="s1">
<div class="controls"> </span><span class="se">\</span><span class="s1">
<div class="query-here"></div> </span><span class="se">\</span><span class="s1">
</div> </span><span class="se">\</span><span class="s1">
<div class="total"><h2><span></span> records found</h2></div> </span><span class="se">\</span><span class="s1">
<div class="body"> </span><span class="se">\</span><span class="s1">
<div class="sidebar"></div> </span><span class="se">\</span><span class="s1">
<div class="results"> </span><span class="se">\</span><span class="s1">
{{{results}}} </span><span class="se">\</span><span class="s1">
</div> </span><span class="se">\</span><span class="s1">
</div> </span><span class="se">\</span><span class="s1">
<div class="pager-here"></div> </span><span class="se">\</span><span class="s1">
</span><span class="dl">'</span><span class="p">,</span>
<span class="c1">// render the view</span>
<span class="na">render</span><span class="p">:</span> <span class="kd">function</span><span class="p">()</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">results</span> <span class="o">=</span> <span class="dl">''</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">_</span><span class="p">.</span><span class="nx">isFunction</span><span class="p">(</span><span class="k">this</span><span class="p">.</span><span class="nx">templateResults</span><span class="p">))</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">results</span> <span class="o">=</span> <span class="nx">_</span><span class="p">.</span><span class="nx">map</span><span class="p">(</span><span class="k">this</span><span class="p">.</span><span class="nx">model</span><span class="p">.</span><span class="nx">records</span><span class="p">.</span><span class="nx">toJSON</span><span class="p">(),</span> <span class="k">this</span><span class="p">.</span><span class="nx">templateResults</span><span class="p">).</span><span class="nx">join</span><span class="p">(</span><span class="dl">'</span><span class="se">\n</span><span class="dl">'</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="c1">// templateResults is just for one result ...</span>
<span class="kd">var</span> <span class="nx">tmpl</span> <span class="o">=</span> <span class="dl">'</span><span class="s1">{{#records}}</span><span class="dl">'</span> <span class="o">+</span> <span class="k">this</span><span class="p">.</span><span class="nx">templateResults</span> <span class="o">+</span> <span class="dl">'</span><span class="s1">{{/records}}</span><span class="dl">'</span><span class="p">;</span>
<span class="kd">var</span> <span class="nx">results</span> <span class="o">=</span> <span class="nx">Mustache</span><span class="p">.</span><span class="nx">render</span><span class="p">(</span><span class="nx">tmpl</span><span class="p">,</span> <span class="p">{</span>
<span class="na">records</span><span class="p">:</span> <span class="k">this</span><span class="p">.</span><span class="nx">model</span><span class="p">.</span><span class="nx">records</span><span class="p">.</span><span class="nx">toJSON</span><span class="p">()</span>
<span class="p">});</span>
<span class="p">}</span>
<span class="kd">var</span> <span class="nx">html</span> <span class="o">=</span> <span class="nx">Mustache</span><span class="p">.</span><span class="nx">render</span><span class="p">(</span><span class="k">this</span><span class="p">.</span><span class="nx">template</span><span class="p">,</span> <span class="p">{</span>
<span class="na">results</span><span class="p">:</span> <span class="nx">results</span>
<span class="p">});</span>
<span class="k">this</span><span class="p">.</span><span class="nx">el</span><span class="p">.</span><span class="nx">html</span><span class="p">(</span><span class="nx">html</span><span class="p">);</span>
<span class="c1">// Set the total records found info</span>
<span class="k">this</span><span class="p">.</span><span class="nx">el</span><span class="p">.</span><span class="nx">find</span><span class="p">(</span><span class="dl">'</span><span class="s1">.total span</span><span class="dl">'</span><span class="p">).</span><span class="nx">text</span><span class="p">(</span><span class="k">this</span><span class="p">.</span><span class="nx">model</span><span class="p">.</span><span class="nx">recordCount</span><span class="p">);</span>
<span class="c1">// ### Now setup all the extra mini-widgets</span>
<span class="c1">//</span>
<span class="c1">// Facets, Pager, QueryEditor etc</span>
<span class="kd">var</span> <span class="nx">view</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">recline</span><span class="p">.</span><span class="nx">View</span><span class="p">.</span><span class="nx">FacetViewer</span><span class="p">({</span>
<span class="na">model</span><span class="p">:</span> <span class="k">this</span><span class="p">.</span><span class="nx">model</span>
<span class="p">});</span>
<span class="nx">view</span><span class="p">.</span><span class="nx">render</span><span class="p">();</span>
<span class="k">this</span><span class="p">.</span><span class="nx">el</span><span class="p">.</span><span class="nx">find</span><span class="p">(</span><span class="dl">'</span><span class="s1">.sidebar</span><span class="dl">'</span><span class="p">).</span><span class="nx">append</span><span class="p">(</span><span class="nx">view</span><span class="p">.</span><span class="nx">el</span><span class="p">);</span>
<span class="kd">var</span> <span class="nx">pager</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">recline</span><span class="p">.</span><span class="nx">View</span><span class="p">.</span><span class="nx">Pager</span><span class="p">({</span>
<span class="na">model</span><span class="p">:</span> <span class="k">this</span><span class="p">.</span><span class="nx">model</span><span class="p">.</span><span class="nx">queryState</span>
<span class="p">});</span>
<span class="k">this</span><span class="p">.</span><span class="nx">el</span><span class="p">.</span><span class="nx">find</span><span class="p">(</span><span class="dl">'</span><span class="s1">.pager-here</span><span class="dl">'</span><span class="p">).</span><span class="nx">append</span><span class="p">(</span><span class="nx">pager</span><span class="p">.</span><span class="nx">el</span><span class="p">);</span>
<span class="kd">var</span> <span class="nx">queryEditor</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">recline</span><span class="p">.</span><span class="nx">View</span><span class="p">.</span><span class="nx">QueryEditor</span><span class="p">({</span>
<span class="na">model</span><span class="p">:</span> <span class="k">this</span><span class="p">.</span><span class="nx">model</span><span class="p">.</span><span class="nx">queryState</span>
<span class="p">});</span>
<span class="k">this</span><span class="p">.</span><span class="nx">el</span><span class="p">.</span><span class="nx">find</span><span class="p">(</span><span class="dl">'</span><span class="s1">.query-here</span><span class="dl">'</span><span class="p">).</span><span class="nx">append</span><span class="p">(</span><span class="nx">queryEditor</span><span class="p">.</span><span class="nx">el</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">});</span></code></pre></figure>
Rufus Pollock
Labs Show and Tell - 26th October!
2012-10-23T00:00:00+00:00
http://okfnlabs.org/blog/2012/10/23/show-and-tell
<p><img src="http://assets.okfn.org/p/labs/img/tent.png" style="margin-left: 30px; float: right;" /></p>
<p>We’re having the next Show and Tell on Friday, <a href="http://www.timeanddate.com/worldclock/fixedtime.html?iso=20121026T1430&p1=136">26 October at 2:30 pm BST</a> via Google Hangout on Air. As usual, the URL will be posted on <a href="https://plus.google.com/108417336285743833546/posts">OKFN Labs’ G+ Page</a>.</p>
<p>If you’d like to present, add your name to <a href="http://okfnpad.org/show-and-tell-Oct-26">the list</a>. Remember, <a href="http://webchat.freenode.net/?channels=okfn">#okfn on irc.freenode.net</a> will be the backchannel for discussion and questions, so don’t forget to hang out there.</p>
<h3 id="whats-show-and-tell">What’s Show and Tell?</h3>
<p>Have you built some cool tech you want to show everyone? Played around with some data? The Labs Show and Tell is your chance to share it with the OKFN Labs community! You get 2 to 5 minutes to show us what you built!</p>
<h3 id="missed-the-last-one">Missed the last one?</h3>
<p>On Oct 12, 2012, we had the first Labs Show and Tell. Here’s what we talked about:</p>
<h4 id="scientific-promiscuity---michael-bauer"><a href="http://promiscuity.tentacleriot.eu/">Scientific Promiscuity</a> - Michael Bauer</h4>
<p>Scientific papers are rarely written by a single person. Usually many authors come together to work on a specific issue. This visualization uses data obtained from Pubmed to show collaboration between authors.</p>
<p><img src="/img/dashboard.png" style="margin-left: 30px; float: right;" /></p>
<h4 id="activity-api---code---tom-rees"><a href="http://activityapi.herokuapp.com/">Activity API</a> - <a href="https://github.com/okfn/activityapi">Code</a> - Tom Rees</h4>
<p>Activity API scrapes through multiple data sources and creates one single PostgreSQL database with all the data. It scrapes GitHub, Twitter, and mailing list posts.</p>
<h4 id="dashboard---code---tom-rees"><a href="http://okfnlabs.org/dashboard/#project/labs">Dashboard</a> - <a href="https://github.com/okfn/dashboard">Code</a> - Tom Rees</h4>
<p>The OKFN Community Dashboard provides an overview of community activity. We have a flourishing and diverse set of activities and it can be hard, even for people ‘inside’, to see what is going on. The Dashboard helps us quickly see what is happening.</p>
<h4 id="nomenklatura---code---friedrich"><a href="http://nomenklatura.okfnlabs.org/">nomenklatura</a> - <a href="https://github.com/pudo/nomenklatura">Code</a> - Friedrich</h4>
<p>A lot of time in data wrangling is spent making mappings of variant names to a canonical form. This app provides an easy-to-use, web-based method for creating such mappings, to allow for a more managed data cleansing pipeline.</p>
<h4 id="messy-tables---friedrich"><a href="https://github.com/okfn/messytables">Messy Tables</a> - Friedrich</h4>
<p>A library for dealing with messy tabular data in several formats, guessing types and detecting headers.</p>
<h4 id="froide---stefanw"><a href="https://github.com/stefanw/froide">froide</a> - stefanw</h4>
<p>Froide is a Freedom Of Information tracker. The name comes from Freedom of Information (de). Also Froide sounds like Freude which is German for joy.</p>
<h4 id="pybossa-on-travis-ci---nigel"><a href="https://travis-ci.org/#!/PyBossa/pybossa">PyBossa on Travis CI</a> - Nigel</h4>
<p>PyBossa now uses Travis CI for continuous integration. Makes reviewing pull requests easier since we can see test status right away.</p>
Nigel Babu
Wrangling dirty data with messytables.
2012-10-22T00:00:00+00:00
http://okfnlabs.org/blog/2012/10/22/messytables
<p>One of the largest data collection projects we have done so far
has been the <a href="http://openspending.org/resources/gb-spending/">consolidation of the UK’s departmental expenditure</a>.
Over 370 different government entities have published a total
of more than 7000 spreadsheets. Many of those have obviously
been hand-crafted or at least manually processed. Our goal was to
consolidate the contained information into a single
spreadsheet, discarding all the eccentricities included by the individual
publishers.</p>
<p><a href="https://github.com/okfn/messytables">messytables</a> is a simple
Python library that tries to extract tabular contents from
spreadsheet documents created by human editors. Often, even files
released as CSV or Excel are still not easy to parse
programmatically. Some people like to start off spreadsheets with
a title column or some metadata, while others use inappropriate
formats to represent numbers or dates.</p>
<p>The tool offers a set of functions that help make parsing this data
easier; a short usage sketch follows the list:</p>
<ul>
<li>
<p>A <strong>headers detector</strong> tries to determine which row in a spreadsheet
contains the actual header definitions (as opposed to any surrounding
titles or notes).</p>
</li>
<li>
<p><strong>type detection</strong> attempts to guess the data type for each column,
including a wide range of commonly used date formats.</p>
</li>
<li>
<p>support for <strong>streaming data</strong>, so that extremely large tables can
be processed without loading the entire dataset into memory.</p>
</li>
<li>
<p>and, of course, it supports a <strong>range of spreadsheet types</strong> - from
trusty CSV to Excel and even OpenOffice formats.</p>
</li>
</ul>
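<p>Putting these pieces together, here is a minimal example in Python, following the messytables documentation; <code>messy.csv</code> is a placeholder for whatever file you are wrangling:</p>
<pre><code>from messytables import (CSVTableSet, headers_guess, headers_processor,
                         offset_processor, type_guess, types_processor)

# a table set is a collection of tables; a CSV file contains just one
fh = open('messy.csv', 'rb')
table_set = CSVTableSet(fh)
row_set = table_set.tables[0]

# guess the header names and the offset of the header row
offset, headers = headers_guess(row_set.sample)
row_set.register_processor(headers_processor(headers))

# skip ahead so that iteration begins with content, not the header
row_set.register_processor(offset_processor(offset + 1))

# guess the column types and apply them while iterating
types = type_guess(row_set.sample, strict=True)
row_set.register_processor(types_processor(types))

for row in row_set:
    print(row)
</code></pre>
<p>Each row comes back as a list of typed cells rather than raw strings, so downstream code can rely on the guessed types.</p>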
<p>We’ve since also started using messytables to load data into the
<a href="http://ckan.org/2012/10/22/ckan-1-8-released/">data API of CKAN</a>,
where it serves as the ETL for the datastore and related
<a href="http://reclinejs.com/">ReclineJS</a> previews.</p>
<p>If you’re interested, check out the <a href="https://messytables.readthedocs.io/en/latest/index.html">messytables documentation</a>
and the <a href="https://github.com/openspending/dpkg-uk25k/blob/master/extract.py">uk25k scripts</a>
which use it to gather UK government finance data.</p>
<p>Of course, messytables is not a cure-all, and it is only useful for
reading data. Other tools cover the rest:</p>
<ul>
<li><a href="http://docs.python-tablib.org/en/latest/">tablib</a>, for example, has
a fantastic API that makes writing, analyzing and converting data a
breeze.</li>
<li><a href="https://csvkit.readthedocs.io/en/latest/index.html">csvkit</a> has a
set of command line utilities that should be pre-installed on any
computer.</li>
</ul>
<p>But when it comes to tables that are a complete mess, give messytables a try!</p>
Friedrich Lindenberg
Open Interests Hackathon in London, 24-25 November
2012-10-15T00:00:00+00:00
http://okfnlabs.org/blog/2012/10/15/openinterests
<p>
The <a href="http://ejc.net">European Journalism Centre</a> and the Open Knowledge Foundation,
sponsored by <a href="http://mozillaopennews.org/">Knight-Mozilla
OpenNews</a>, invite you to the <a href="/events/hackdays/lobbying.html">Open Interests
Hackathon</a> to track the interests and money flows which shape European policy.
</p>
<p>
<strong>When</strong>: 24-25 November
</p>
<p>
<strong>Where</strong>: Google Campus Cafe, 4-5 Bonhill Street, EC2A 4BX London
</p>
<p>
How EU money is spent is an issue that concerns everyone who pays taxes to the EU. As the influence of Brussels lobbyists grows, it is increasingly important to draw the connections between lobbying, policy-making and funding. Journalists and activists need browsable databases, tools and platforms to investigate lobbyists’ influence and where the money goes in the EU. Join us and help build these tools!
</p>
<p>
Open Interests Europe brings together developers, designers, activists, journalists and other geeks for two days of collaboration, learning, fun, intense hacking and app building.
</p>
<div class="teaser boxed">
<a href="/events/hackdays/lobbying.html">Visit the event page to learn
more</a>
</div>
Velichka Dimitrova
Labs Show and Tell - All Welcome!
2012-10-10T00:00:00+00:00
http://okfnlabs.org/blog/2012/10/10/show-and-tell
<p><img src="http://assets.okfn.org/p/labs/img/tent.png" style="margin-left: 30px; float: right;" /></p>
<p><strong>Built an app or tool you want to show people? Played around with some
interesting data? Know of a new development people should know about? Want to
find out what others are doing?</strong></p>
<p>Come to the <strong>Show and Tell this Friday</strong> and share what you are up to with the
community!</p>
<h3 id="sign-up">Sign up</h3>
<p>Want to participate? Just add your name to <a href="http://okfnpad.org/show-and-tell-Oct-12">the list on the etherpad</a>! If
you want to present just add a brief title and/or short description.</p>
<p>Remember, <a href="http://webchat.freenode.net/?channels=okfn">#okfn on irc.freenode.net</a> will be the backchannel for
discussion and questions, so feel free to jump in there if you have questions or
queries or just want to shoot the breeze.</p>
<h3 id="when">When?</h3>
<p>Friday, <a href="http://www.timeanddate.com/worldclock/fixedtime.html?iso=20121012T1430&p1=136">12 October at 2:30 pm BST - that’s 10:30am EST, 3:30pm CET etc</a>.
The session will last <strong>30m with presentation slots of 2-5m</strong>.</p>
<h3 id="where">Where?</h3>
<p>Google Hangout on Air and <a href="http://webchat.freenode.net/?channels=okfn">#okfn on irc.freenode.net</a>. We’ll post the on-air
URL on <a href="https://plus.google.com/108417336285743833546/posts">OKFN Labs’ G+ Page</a>, here and on <a href="http://twitter.com/okfnlabs">OKFN Labs twitter</a>.</p>
Nigel Babu
Data Catalogues are People!
2012-09-25T00:00:00+00:00
http://okfnlabs.org/blog/2012/09/25/datacatalogues
<p>Last week, <a href="https://twitter.com/matejkurian">Matej Kurian</a> published
a message on the <a href="http://lists.okfn.org/mailman/listinfo/okfn-labs">okfn-labs mailing</a>
list, <a href="http://lists.okfn.org/pipermail/okfn-labs/2012-September/000376.html">describing</a> the various sources he had discovered for
machine-readable excerpts of the EU’s joint procurement system, TED.
What struck me about this message was that, apparently, this polite
and brilliant policy wonk had turned into something strange: into a
data catalogue.</p>
<p>While not quite a Kafka-grade transformation, it’s an odd turn to
take for a researcher. But Matej is not the only one: the team of
<a href="http://farmsubsidy.org/">FarmSubsidies.org</a> has experienced a similar re-definition, as did
the ERDF researchers at the <a href="http://www.thebureauinvestigates.com/">Bureau of Investigative Journalists</a>.</p>
<p>The best data catalogues today are well-informed people.</p>
<p>When I talk to journalists about data acquisition, they seem to know
this already: it’s often not just about where to look; it’s even more
important to know who to talk to. But why does this observation from a
telephone-and-filofax world hold true even in digital space, where
every bit of knowledge is supposed to be only a click away?</p>
<p>I believe that some blame goes to the simplistic model underlying our
efforts to catalogue data: the question of where to find a dataset is
certainly important, but for those actually working with the data it’s
just not enough. Once you dig into data, other questions rise to the
foreground:</p>
<ul>
<li>
<p>How do the different available datasets interact and integrate? Does
the data I am looking for even make sense on its own - or do I need
to combine several sources? Take, for example, the UK’s <em>Whole of
Government Accounts</em>: while data.gov.uk <a href="http://data.gov.uk/dataset/coins">lists</a> a few gigabytes worth of
downloads for this dataset, it is completely impossible to interpret
the data without also fetching Excel files (and PDF guidance) off the
Treasury web site, the Department of Communities and Local Government
site and - bonus points - emailing the Treasury for their internal
toolkit.</p>
</li>
<li>
<p>How complete and up-to-date is the data? What technical and political
constraints apply to the publication? Again, FarmSubsidy provides a
nice example, as a 2010 European Court of Justice verdict has severely
restricted the availability of the data - leading to an oddly limited
dataset today.</p>
</li>
<li>
<p>Who else is working with this data and what are they doing? Are there
derivative datasets that I should use instead of the source material?
It may be worth knowing, for example, that as well as browsing the
6000-odd departmental spending spreadsheets, journalists can also search
across a consolidated version of this data on OpenSpending.org.</p>
</li>
</ul>
<p>But why are current data portals so bad at capturing such information?
Certainly, adding a few comment boxes and an app gallery can do a good
job glossing over the problem, but the real problems seem to lie deeper in
the technology:</p>
<ul>
<li>
<p>Datasets are a useless unit. A while ago, <a href="http://richard.cyganiak.de/">Richard Cyganiak</a> defined a
dataset as “a set of data” - which I assume is a computer scientist’s
way of telling you to get lost. And while I’m not normally a big fan
of LOD-clouds, they got this right: all the interesting stuff is
happening in between datasets. Whether it’s about reconstructing a
process across several datasets or finding out about geographical and
temporal coverage - datasets are at best building blocks, more often
they are just arbitrary. So maybe it’s time to think about other
mechanisms to represent data sources: what about policy maps and
government wiring plans?</p>
</li>
<li>
<p>Even worse, the metadata we keep about datasets is mostly based on a
bureaucratic mindset: they’re library-inspired, static index
cards that hope to represent datasets, while data are really subject
to complex processes both within and outside the institutions that
produce them. For anyone using the data, activity metadata is
the interesting part. We’ve already figured this out for software,
where directories like FreshMeat and SourceForge have been replaced by
activity-driven platforms like GitHub. The key aspect here is that
GitHub doesn’t require me to explicitly make metadata - the relevant
narrative is simply summarized from my working pattern.</p>
<p>Of course, all of this is just a long way of saying that the best
metadata is in the data itself. So unless you’re working on the LHC
stuff there really isn’t much of a reason to separate the two any
longer: let’s make public, audit-trailed databases that report on
themselves. This, of course, is easier said than done as it implies
that all data will fit into one storage mechanism. In the real
world (i.e. outside Linked Data land), this is unlikely to be true
of structured data any time soon.</p>
</li>
</ul>
<p>Still, even after fixing our model of how we talk about datasets on the
web, I think we would still find that the best way to ensure that people
collaborate around data is community-building: creating networks that
garden the commons. Perhaps we should start cataloguing those.</p>
Friedrich Lindenberg
WikipediaJS - accessing Wikipedia article data through Javascript
2012-09-10T00:00:00+00:00
http://okfnlabs.org/blog/2012/09/10/wikipediajs-a-javascript-library-for-accessing-wikipedia-article-information
<p><a href="http://okfnlabs.org/wikipediajs/">WikipediaJS</a> is a simple JS library for accessing information in Wikipedia articles such as dates, places, abstracts etc.</p>
<p>The library is the work of Labs member <a href="http://rufuspollock.org/">Rufus
Pollock</a>. In essence, it is a small wrapper around the data and <a href="http://dbpedia.org/sparql/">APIs</a> of the <a href="http://dbpedia.org/">DBPedia project</a> and it is they who have done all
the heavy lifting of extracting structured data from Wikipedia - huge credit
and thanks to DBPedia folks!</p>
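<p>WikipediaJS itself is written in JavaScript, but the underlying idea is easy to show in a few lines of Python: fetch DBPedia’s structured extract for an article and pick out the fields you want. This is only a rough sketch of what the library does behind the scenes, and the <code>/data/</code> endpoint and property URIs below come from DBPedia’s public interface rather than from WikipediaJS:</p>
<pre><code># Rough sketch of what WikipediaJS does under the hood: fetch
# DBPedia's structured data for a Wikipedia article and extract
# the English abstract. The endpoint and property URIs are
# assumptions based on DBPedia's public interface.
import json
import urllib.request

article = "Normandy_landings"  # the last segment of the Wikipedia URL
url = "http://dbpedia.org/data/%s.json" % article

with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

# the response is keyed by entity URI; pick out the article's resource
resource = data.get("http://dbpedia.org/resource/" + article, {})

# abstracts come in many languages; keep only the English one
for entry in resource.get("http://dbpedia.org/ontology/abstract", []):
    if entry.get("lang") == "en":
        print(entry["value"])
</code></pre>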
<p><a href="http://okfnlabs.org/wikipediajs/"><img src="http://farm9.staticflickr.com/8029/7961793920_7436dba276_c.jpg" style="display: block; margin: auto; width: 80%; border: #ccc 5px solid; margin-top: 20px; margin-bottom: 20px;" /></a></p>
<h3 id="demo-and-examples">Demo and Examples</h3>
<p>A demo is included and you can see some examples of the library in action at the following links:</p>
<ul>
<li><a href="http://okfnlabs.org/wikipediajs/?url=http://en.wikipedia.org/wiki/Normandy_landings">http://okfnlabs.org/wikipediajs/?url=http://en.wikipedia.org/wiki/Normandy_landings</a></li>
<li><a href="http://okfnlabs.org/wikipediajs/?url=?url=http://en.wikipedia.org/wiki/Securitas_AB">http://okfnlabs.org/wikipediajs/?url=?url=http://en.wikipedia.org/wiki/Securitas_AB</a></li>
<li><a href="http://okfnlabs.org/wikipediajs/?url=http://en.wikipedia.org/wiki/Richard_I_of_England">http://okfnlabs.org/wikipediajs/?url=http://en.wikipedia.org/wiki/Richard_I_of_England</a></li>
<li><a href="http://okfnlabs.org/wikipediajs/?url=http://en.wikipedia.org/wiki/CERN">http://okfnlabs.org/wikipediajs/?url=http://en.wikipedia.org/wiki/CERN</a></li>
</ul>
<h3 id="colophon">Colophon</h3>
<ul>
<li><a href="https://github.com/okfn/wikipediajs">WikipediaJS source code is on github</a></li>
</ul>
<p>One of the reasons for creating WikipediaJS is that we think it can be
useful in <a href="http://timeliner.reclinejs.com/">Timeliner</a> and other apps as a
way to quickly add new items to your timeline.</p>
Rufus Pollock
Timeliner - Make Nice Timelines Fast
2012-08-08T00:00:00+00:00
http://okfnlabs.org/blog/2012/08/08/timeliner-make-nice-timelines-fast
<p>As part of the <a href="http://reclinejs.com/">Recline</a> launch I quickly put together some very simple demo apps, one of which was called Timeliner:</p>
<p><a href="http://timeliner.reclinejs.com/">http://timeliner.reclinejs.com/</a></p>
<p>This uses the Recline timeline component (which itself is a relatively thin wrapper around the <em>excellent</em> <a href="http://timeline.verite.co/">Verite timeline</a>) plus the Recline Google docs backend to provide an easy way for people to make timelines backed by a Google Docs spreadsheet.</p>
<p>As an example of use, I started work on a <a href="http://timeliner.reclinejs.com/?backend=gdocs&url=https://docs.google.com/spreadsheet/ccc?key=0Aon3JiuouxLUdDQ3QlJhOHJnS2x0NkxibUp1YnYwR1E%23gid=0#explorer">“spending stories” timeline about the bankruptcy of US cities (esp in California)</a> as a result of the “Great Recession” (<a href="https://docs.google.com/spreadsheet/ccc?key=0Aon3JiuouxLUdDQ3QlJhOHJnS2x0NkxibUp1YnYwR1E#gid=0">source spreadsheet</a>). I’ve also created an example <a href="http://timeliner.reclinejs.com/?backend=gdocs&url=https://docs.google.com/spreadsheet/ccc?key=0Aon3JiuouxLUdDQ3QlJhOHJnS2x0NkxibUp1YnYwR1E%23gid=0#explorer">timeline of major wars</a>, a screenshot of which I’ve inlined:</p>
<p><img src="http://farm9.staticflickr.com/8285/7508403206_420de3ce5e_b.jpg" style="width: 600px;; margin: auto; display: block; margin-top: 20px;" /></p>
<h3 id="code">Code</h3>
<p>Source code for the Timeliner is here: <a href="https://github.com/okfn/timeliner">https://github.com/okfn/timeliner</a></p>
<p>If you have suggestions for improvements, want to see the ones that already exist, or, <em>gasp</em>, find a bug, please see the issue tracker: <a href="https://github.com/okfn/timeliner/issues">https://github.com/okfn/timeliner/issues</a></p>
Rufus Pollock
The Data Transformer - Cleaning Up Data in the Browser
2012-07-31T00:00:00+00:00
http://okfnlabs.org/blog/2012/07/31/data-transformer-cleaning-up-data-in-the-browser
<p>This is a brief post to announce an alpha prototype version of the Data Transformer, an app that lets you clean up data in the browser using JavaScript:</p>
<p><a href="http://transformer.datahub.io/">http://transformer.datahub.io/</a></p>
<h3 id="2m-overview-video">2m overview video:</h3>
<iframe width="560" height="315" src="http://www.youtube.com/embed/zM1USNaEcVQ" frameborder="0" allowfullscreen="1" style="margin-bottom: 30px;"> </iframe>
<h3 id="what-does-this-app-do">What does this app do?</h3>
<ol>
<li>You load a CSV file from github (fixed at the moment but soon to be customizable)</li>
<li>You write simple javascript to edit this file (uses ReclineJS transform and grid views + CSV backends – here’s the <a href="http://reclinejs.com/demos/multiview/?currentView=transform">original ReclineJS transform demo</a>)</li>
<li>You save this updated file back to github (via oauth login - this utilizes Michael’s great work in Prose!)</li>
</ol>
<p>This prototype was hacked together a couple of weeks ago, when I was fortunate enough to spend an afternoon with Michael Aufreiter, Chris Herwig, Mike Morris and others at the Development Seed offices. It builds on ReclineJS + oauth / github connectors borrowed from Prose.</p>
<p>It’s part of an ongoing plan to create a “Data Orchestra” of lightweight data services that play nicely with each
other and connect to things like the DataHub (or GitHub …): <a href="http://notebook.okfn.org/2012/06/22/datahub-small-pieces-loosely-joined/">http://notebook.okfn.org/2012/06/22/datahub-small-pieces-loosely-joined/</a></p>
Rufus Pollock
Displaying PyBossa Urban Parks Data on a 3D Globe
2012-07-14T00:00:00+00:00
http://okfnlabs.org/blog/2012/07/14/pybossa-urban-parks-data-on-3d-globe
<p>Labs member <a href="http://twitter.com/teleyinex">Daniel Lombraña González</a> has built a <a href="http://teleyinex.github.com/pybossa-urbanpark-globe/">3D globe showing the locations of urban parks around the world</a> as located by volunteers using the <a href="http://pybossa.com/app/urbanpark">Pybossa Urban Park geocoding app</a>:</p>
<p><strong><a href="http://teleyinex.github.com/pybossa-urbanpark-globe/">http://teleyinex.github.com/pybossa-urbanpark-globe/</a></strong> — (<a href="https://github.com/teleyinex/pybossa-urbanpark-globe">Source code</a>)</p>
<p><img src="https://p.twimg.com/AxxDoY9CIAET_0L.png:large" alt="screenshot" /></p>
<h3 id="background">Background</h3>
<p>The Urban Parks geo-coding application is a micro-tasking app running on <a href="http://pybossa.com">PyBossa</a>. In the app, volunteers are asked to find an urban park in cities around the world. The volunteers use a web map to browse a city and then submit an answer: either the coordinates of an urban park, given by placing a marker on the map, or a report that they could not find any park.</p>
<p>More details about PyBossa can be found on the official site <a href="http://pybossa.com">http://pybossa.com</a> and also in the <a href="http://docs.pybossa.com">online documentation</a>.</p>
Daniel Lombraña Gonzalez
dataissues.org - public issue tracking for data defects
2012-07-10T00:00:00+00:00
http://okfnlabs.org/blog/2012/07/10/dataissues
<p><em>On June 21st, the Knight News Challenge Round on Data ended. The day before,
<a href="http://rufuspollock.org/">Rufus</a>, <a href="https://twitter.com/rossjones">Ross</a> and
I sat down to write out some ideas that we’d been discussing for a while. While
we submitted proposals for <a href="/blog/2012/07/09/grano.html">Grano</a> and <a href="http://newschallenge.tumblr.com/post/25576949597/data-protocols-rough-consensus-running-code-and">DataProtocols</a>, we decided to hold back on this idea for another round. Still, sharing is caring.</em></p>
<p><strong>1. What do you propose to do? [20 words]</strong></p>
<p>We’ll create a web service where data wranglers and consumers can log errors arising from processing, viewing or using data.</p>
<p><strong>2. How will your project make data more useful? [50 words]</strong></p>
<p>All data has errors. While data quality is often talked about, the best most data apps manage is half a paragraph on the ‘about’ page. We want to build a service that is useful to data wranglers, but can also serve as documentation for end-users and as a basis for further discussion.</p>
<p><strong>3. How is your project different from what already exists? [30 words]</strong></p>
<p>Error reporting for software is either done as task tickets (e.g. github.com) or by capturing raw application output (e.g. exceptional.io). For data, we want to combine these two approaches to let users group recurring errors into issues that can then be discussed and fixed.</p>
<p><strong>4. Why will it work? [100 words]</strong></p>
<p>While all data processing workflows differ from dataset to dataset, the types of errors that occur are often quite similar and can be stored in a shared service. This is immediately useful when doing data work - especially in scheduled, unsupervised processes - but it also serves as an activity log for other people to see.</p>
<p>We’ll create both an easy-to-use online validation tool to check spreadsheets against a given schema and an API with client libraries that can be integrated into existing processing pipelines. The reported issues can be outright errors, but also probes that highlight implausible values.</p>
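<p>As the service only exists as a basic prototype, there is no real client library to show; purely to illustrate the idea, integrating issue reporting into a pipeline might look roughly like the sketch below. Every name in it - the endpoint, the function, the payload fields - is hypothetical:</p>
<pre><code># Hypothetical sketch only: dataissues has no published client
# library, so the endpoint and payload here are invented purely
# to illustrate logging data defects to a shared service.
import json
import urllib.request

DATAISSUES_URL = "http://dataissues.example.org/api/issues"  # invented

def report_issue(dataset, row, message, kind="error"):
    # kind would be "error" for outright failures and "probe" for
    # implausible-but-parseable values that deserve a second look
    payload = json.dumps({
        "dataset": dataset,
        "row": row,
        "kind": kind,
        "message": message,
    }).encode("utf-8")
    req = urllib.request.Request(
        DATAISSUES_URL, data=payload,
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

# in a processing pipeline, e.g. when a spending amount is negative:
# report_issue("gb-spending", row_number,
#              "negative transaction amount", kind="probe")
</code></pre>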
<p><strong>5. Who is working on it? [100 words]</strong></p>
<p>The Open Knowledge Foundation is…</p>
<p><strong>6. What part of the project have you already built? [100 words]</strong></p>
<p>We’ve got extensive experience working with dataset metadata from DataHub.io and have produced a number of complex data processing pipelines (e.g. for UK spending data, merging over 5000 spreadsheets in different formats). These clearly show the need for better reporting; we have built several ad-hoc solutions, but we know this is a major area that is inadequately addressed in our work and that of others. We already have a basic prototype and can build a first increment quickly.</p>
<p><strong>7. How would you use News Challenge funds? [50 words]</strong></p>
<p>We’ll build it! We’ll develop a full version of this service iteratively, then test and promote it. We plan to work with civic data projects as early adopters to get quick feedback and adapt the service to suit their needs.</p>
<p><strong>8. How would you sustain the project after the funding expires? [50 words]</strong></p>
<p>This will be perfectly suited to a SaaS freemium model, in which heavy and/or professional users who need to report large numbers of errors and generate complex reports pay a subscription fee. In addition, as open-source software, the project can be re-used and extended by others.</p>
<p><strong>If you think this is a good idea, <a href="http://github.com/okfn/dataissues">help hack on and contribute patches to the dataissues repository</a>!</strong></p>
Friedrich Lindenberg
Grano - social network analysis for advocates and journalists.
2012-07-09T00:00:00+00:00
http://okfnlabs.org/blog/2012/07/09/grano
<p><em>On June 21st, the Knight News Challenge Round on Data ended. The day before,
<a href="http://rufuspollock.org/">Rufus</a>, <a href="https://twitter.com/rossjones">Ross</a> and
I sat down to write out some ideas that we’d been discussing for a while. The
first idea I want to repost here is a proposal for Grano, which I’ve <a href="http://pudo.org/2011/12/19/sna.html">discussed
in this blog before</a>.</em></p>
<p><strong>1. What do you propose to do? [20 words]</strong></p>
<p>We’ll make a powerful tool for journalists and advocates to keep track of actors and their relationships in complex environments.</p>
<p><img src="http://pudo.org/images/grano.png" alt="Grano" /></p>
<p><strong>2. How will your project make data more useful? [50 words]</strong></p>
<p>It’ll enable users to manage research in a structured way, helping them to link raw data to the actors, events and organisations they’re already investigating and to find those that they may have missed before. We’ll help users do their job more thoroughly, while creating a structure that can be re-used later.</p>
<p><strong>3. How is your project different from what already exists? [30 words]</strong></p>
<p>Network analysis means many things to people: it’s graph algorithms to coders, network diagrams to designers and CRM to business. Journalists and advocates need evidence gathering and information linkage to be at the core of these things.</p>
<p><strong>4. Why will it work? [100 words]</strong></p>
<p>We want to focus on four functions that will make this a practical tool instead of a gimmick:</p>
<p>a) allowing users to easily integrate bulk data to complement manually entered information,</p>
<p>b) helping them to keep track of the source for each fact that is entered and keeping a full version history,</p>
<p>c) providing easy access control so that users can choose which information to keep private and which links to publish with others and</p>
<p>d) text snippets, so that researchers can combine structured analysis and narrative fragments in which the tool will detect references to the network’s entities.</p>
<p><strong>5. Who is working on it? [100 words]</strong></p>
<p>The Open Knowledge Foundation wants to cooperate with investigative networks around the world to develop this project. We’ve already been pioneering data collection and presentation tools, such as DataHub.io and OpenSpending, as well as efforts like the Data Journalism Handbook and the School of Data to widen data literacy.</p>
<ul>
<li>Friedrich Lindenberg (OKFN) has worked on several data projects and data-journalism training and will lead this project.</li>
<li>Ross Jones (OKFN) will contribute as a software architect.</li>
<li>Stefan Candea (2011 Nieman Fellow at Harvard, Director of the Romanian Center for Investigative Journalism) has offered to advise us.</li>
</ul>
<p><strong>6. What part of the project have you already built? [100 words]</strong></p>
<p>We’ve already built Grano, a REST backend that can store network information, generate custom reports about nodes and relations, and run full-text search. Because we think that meaningful network analysis is hard, we are conservative in our choice of technology so that we can focus on outcomes. To force that, we decided to base our tool on a concrete use case. The software is now seeing its first use in an unannounced project that tracks lobbying in the EU, powering a special-purpose, JavaScript-only site. Unfortunately, this means the current prototype has neither a stand-alone web interface nor the serious data integration capabilities we think it needs.</p>
<p><strong>7. How would you use News Challenge funds? [50 words]</strong></p>
<p>We want to develop Grano to give investigative journalists and civic hackers a (re-usable) web interface to design their network structure, manually enter data, integrate bulk data sets and to explore the resulting network, make notes, calculate key metrics and to export reports, rankings and network visualizations.</p>
<p><strong>8. How would you sustain the project after the funding expires? [50 words]</strong></p>
<p>While the service is going to be of immediate use, we believe that advocacy groups and newsrooms will also deploy it as a backend to their features and campaign sites. We aim to make Grano into a thriving open source project, supported through custom services for power users.</p>
<p><strong>If you like this idea, please vote for it on the <a href="http://newschallenge.tumblr.com/post/25572174408/grano">proposal page</a>.</strong></p>
Friedrich Lindenberg