Measure for Measure

In his Open Knowledge International Tech Talk, developer Brook Elgie describes how we are using Data Package Pipelines and Redash to gain insight into our organization in a declarative, reproducible, and easy-to-modify way.

This post briefly introduces Measure, a newly launched internal project at Open Knowledge International: its history, motivation, and the tech that drives it. To learn more, watch the embedded video demonstration by developer Brook Elgie and check out the code.

What is Measure?

Measure is a system that allows us to collect and analyze metrics from various internal sources and external platforms through a combination of easy-to-write YAML docs and a user-friendly interface. These metrics include the number of views on our main website, downloads of our libraries from PyPI, retweets on Twitter, and form-based records of project outputs (e.g. recent talks we’ve given).

Like many organizations, we rely heavily on hosted platforms to execute on our mission, each of which has its own interface to useful data. This can make it harder to correlate events (e.g. how many downloads did this software package have after this blog post?) and to draw insight across platforms. It’s critical to harmonize access to this data, not only so we can learn how to be more effective, but also so we can demonstrate to external funders the impact of our work in advancing the cause of openness. It’s equally important for this data to be accessible to everyone at the organization, regardless of their technical skill.
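The post doesn’t reproduce one of these YAML docs, but to make the idea concrete, here is a hypothetical sketch of what a per-project source specification could look like. The schema, theme keys, account names, and package names below are illustrative assumptions, not Measure’s actual format:

    # Hypothetical source spec for one project (all names are illustrative)
    project: frictionless-data
    config:
      social-media:
        twitter:
          entities:
            - '#frictionlessdata'
            - '@OKFN'
        facebook:
          pages:
            - 'okfn'
      code-packaging:
        pypi:
          packages:
            - 'datapackage'
            - 'tableschema'

The point is that a project maintainer only edits a small file like this; the pipelines described in the next section take care of fetching and storing the numbers.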

Brook Elgie describes Measure in an Open Knowledge International Tech Talk

How Does it Work?

Measure relies on several technologies we are developing here at Open Knowledge International around our Frictionless Data project. Each of our projects has a source specification file, defined in YAML and split into themes. For example, social-media is a theme for data sources such as Twitter and Facebook, while code-packaging is a theme for PyPI and other software repositories we upload to.

Each theme has a pipeline composed of processors, which do the actual work of fetching data and transforming the Data Package (a collection of data and descriptive metadata) and its resources. Data moves through each thematic pipeline using Data Package Pipelines and a handful of other tools from the Frictionless Data project. The final processor writes the processed resources to the Measure database, which serves as the data source for our visualisation tool, Redash. Each pipeline is configured to run once a day. You can read more about Data Package Pipelines and how it enables this process in its introductory blog post.
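The post doesn’t show a pipeline definition, but as a rough sketch of how a theme’s pipeline could be declared with Data Package Pipelines, something along the following lines would fit the description above. The custom processor name (measure.add_pypi_resource), its parameters, the schedule, and the database table layout are assumptions for illustration, not the project’s actual configuration:

    # Hypothetical pipeline-spec.yaml entry for the code-packaging theme
    code-packaging:
      schedule:
        crontab: '0 6 * * *'               # run once a day
      pipeline:
        - run: add_metadata                # names the resulting Data Package
          parameters:
            name: code-packaging
        - run: measure.add_pypi_resource   # illustrative custom processor fetching PyPI download counts
          parameters:
            packages:
              - 'datapackage'
              - 'tableschema'
        - run: dump.to_sql                 # writes the processed resources to the Measure database
          parameters:
            engine: env://MEASURE_DB_ENGINE
            tables:
              pypi_downloads:
                resource-name: pypi

Each run: entry names a processor; the final dump step is what lands the processed resources in the Measure database, ready for Redash to query.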

By consolidating our metrics into a single database and surfacing them through Redash, it’s easy to create and share visualisations across one or more data sources, build dashboards of project and organization health, and make truly data-driven decisions with minimal friction.

Tech Talks

If you enjoyed this, you can see similar content on our Open Knowledge International Tech Talks YouTube Playlist.
