Bootstrapping data standards with Frictionless Data

Posted on 21 December 2017 by Paul Walsh

When it comes to tabular data, the Frictionless Data specifications provide users with strong conventions for declaring both the shape of data (via schemas) and information about the data (as metadata on package and resource descriptors).

Within the Frictionless Data world, we purposefully refer to specification work as specifications, and not standards. The specifications therein provide clear conventions for working with data, and declare fundamental interfaces on which a modular software system that works with these specifications can be built. It is very meta. However, the specifications and software foundation do make the Frictionless Data ecosystem a powerful and compelling technical foundation on which to build data standards.

Some reasons why:

Data is serialised in a format that can be read by the software developers use to build tools such as APIs, and also by the many consumer programs used by people with little to no technical know-how.
Built-in progressive enhancement, where metadata, as well as structural and schematic information about the data, can be incorporated over time without modifying the original data source.
A large and growing collection of tools, in many programming languages, for working with the Frictionless Data specifications.
The specifications and the software are platform agnostic. A major example of this is being web-friendly without being dependent on the web (as with many linked data approaches). Linkable data, not Linked Data.

We’ll demonstrate this with some examples below, which are a proof of concept for the idea of using Frictionless Data as a technical foundation for data standards. This is ongoing work that we intend to iterate on in response to feedback on this initial take.

Of course, we do not in any way think that the technical implementation of a data standard is what “data standards” is about. Data standards are about communities of practice, stakeholder engagement, and increasingly, a vehicle of change at the level of policy and governance. Technical implementation, in this wider context, is but a small, yet crucial, component. Indeed, this is a critical part of the promise we are pointing to here - that by building on a common foundation, communities building data standards can focus a little less on the technical implementation details and a little more on the change they want to see by creating them.

Grant funding

360Giving is an organization that helps funders to be transparent about the grants they award. It provides a standard for publishing grants data in a common format, and a Registry to host the data. Publishers can upload a spreadsheet that contains various fields describing the different activities they funded. We will demonstrate how a custom Data Package profile could describe one of these spreadsheets, ensuring that the required metadata fields are present and that the contents of the file conform to the schema.

We will use this sample dataset, taken directly from the Registry without any changes:

Identifier	Title	Description	Currency	Amount Awarded	Amount Disbursed	Award Date
360G-blagravetrust-00658000009YZRq	Achieving Further	Work with 22 FE colleges to improve attainment; attendance and participation	GBP	300000	300000	2014-07-08
360G-blagravetrust-00658000007A1UQ	Training on feedback for Portsmouth VCS	Improving feedback skills for Portsmouth VCS - Feedback Fund 2016	GBP	3933	3933	2016-08-09
360G-blagravetrust-00658000008vdAl	Creative learning programme	Portsmouth young people leaving care	GBP	75000	25000	2016-11-08
360G-blagravetrust-00658000007lweS	Feedback Fund	Feedback Fund 2016	GBP	2094	2094	2016-08-09

Our first step was to create a Table Schema describing the expected contents of the fields, which was then embedded in the Data Package descriptor. This was easy because, as mentioned above, 360Giving already provides a well-defined standard for what the fields should contain. For the purposes of this example we focused on a subset of the available fields. Here are some of them:

Name	Type	Constraints
Identifier	string
Title	string	maxLength: 140
Description	string
Currency	string	enum: [‘AED’, ‘AFN’, ‘ALL’, ‘AMD’, …]
Amount Awarded	number
Amount Disbursed	number
Award Date	date
URL	string
…	…	…
Funding Org:Name	string
Funding Org:Department	string
Grant Programme:Code	string
Grant Programme:Title	string
Grant Programme:URL	string
From an open call?	string
Related Activity	string
Last modified	datetime
Data Source	string
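
As a rough sketch, a few of the fields listed above could be written as a Table Schema descriptor along the following lines. This is an illustration rather than the descriptor from the repository: only some fields are included, the variable names are ours, and the currency enum is cut down to the four codes shown in the table, so the real list (which would include codes such as GBP) is longer.

```python
import json

# Illustrative subset of the grants Table Schema; not the full descriptor.
grants_schema = {
    "fields": [
        {"name": "Identifier", "type": "string"},
        {"name": "Title", "type": "string", "constraints": {"maxLength": 140}},
        {"name": "Description", "type": "string"},
        # Truncated: only the codes shown in the table above are listed here.
        {"name": "Currency", "type": "string",
         "constraints": {"enum": ["AED", "AFN", "ALL", "AMD"]}},
        {"name": "Amount Awarded", "type": "number"},
        {"name": "Amount Disbursed", "type": "number"},
        {"name": "Award Date", "type": "date"},
    ]
}

# Quick lookup of the declared type for each field name.
field_types = {field["name"]: field["type"] for field in grants_schema["fields"]}

# Table Schema descriptors are plain JSON, so they serialise directly.
print(json.dumps(grants_schema, indent=2))
```

Because the schema is plain JSON, it can be embedded under a resource's `schema` property in the Data Package descriptor.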

Our custom Grants Data Package extends the Data Package specification by adding the following fields:

Name	Description	Type
funder	A JSON object describing the funding organization. It can include the following properties: `id`, `name`, `email`, `url`	object
year	The year that the grants data in this file covers	integer
modified	The timestamp of when this dataset was last modified	datetime

This closely follows 360Giving's JSON specification, with the rest of the fields covered by the standard Data Package specification.
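
To make the extension concrete, here is a minimal sketch of checking the three extra properties on a descriptor using only the Python standard library. It assumes all three properties are required and that `modified` is an ISO 8601 string; neither is stated in the table above. The function name `profile_errors`, the package name, and the `modified` value are hypothetical. The actual profile is a JSON Schema, which this sketch does not reproduce.

```python
from datetime import datetime

# Properties added by the grants profile, with the Python type we expect
# once the JSON descriptor has been parsed.
PROFILE_PROPERTIES = {"funder": dict, "year": int, "modified": str}


def profile_errors(descriptor):
    """Return a list of problems with the grants-specific properties."""
    errors = []
    for key, expected in PROFILE_PROPERTIES.items():
        if key not in descriptor:
            # Assumption: every profile property is treated as required.
            errors.append(f"missing property: {key}")
        elif type(descriptor[key]) is not expected:
            errors.append(f"{key} should be of type {expected.__name__}")

    modified = descriptor.get("modified")
    if isinstance(modified, str):
        try:
            datetime.fromisoformat(modified)
        except ValueError:
            errors.append("modified is not an ISO 8601 datetime")
    return errors


descriptor = {
    "name": "example-grants",               # hypothetical package name
    "funder": {"name": "The Blagrave Trust"},
    "year": 2016,
    "modified": "2017-12-21T00:00:00",      # placeholder timestamp
    "resources": [],
}

print(profile_errors(descriptor))
```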

Once we have our data packaged in this way, we can leverage the whole ecosystem of tools built around Data Packages to work with it. For instance, using the datapackage library we can iterate over the contents of the file:

import datapackage

datapackage_url = 'https://raw.githubusercontent.com/frictionlessdata/profiles/master/assets/grants/datapackage.json'
dp = datapackage.Package(datapackage_url)

for row in dp.resources[0].iter(keyed=True):
    print(row)
    # {'Funding Org:Identifier': 'GB-CHC-1164021', 'Beneficiary Location:Geographic Code Type': 'UA', 'From an open call?': None, 'Beneficiary Location:Name': 'Reading', 'Grant Programme:Code': None, 'Beneficiary Location:Geographic Code': 'E06000038', 'Amount Disbursed': Decimal('300000'), 'Recipient Org:City': 'Newbury', 'Award Date': datetime.datetime(2014, 7, 8, 0, 0), 'Beneficiary Location:Longitude': Decimal('-0.95543100000000003024780426130746491253376007080078125'), 'Recipient Org:Web Address': 'http://www.afaeducation.org', 'Recipient Org:Charity Number': '1142154', 'Grant Programme:Title': None, 'Related Activity': None, 'Grant Programme:URL': None, 'Recipient Org:Country': 'UK', 'Funding Org:Name': 'The Blagrave Trust', 'Title': 'Achieving Further', 'Planned Dates:End Date': datetime.datetime(2017, 6, 30, 0, 0), 'Recipient Org:Postal Code': 'RG14 1JQ', 'Identifier': '360G-blagravetrust-00658000009YZRq', 'Data Source': None, 'Planned Dates:Start Date': None, 'Currency': 'GBP', 'Description': 'Work with 22 FE colleges to improve attainment; attendance and participation', 'Recipient Org:Identifier': 'GB-CHC-1142154', 'Recipient Org:Description': 'Charity working with nurseries schools and colleges to raise attainment and achivement of children particularly those with barriers to learning', 'Funding Org:Department': None, 'Beneficiary Location:Country Code': None, 'Last modified': None, 'URL': None, 'Amount Awarded': Decimal('300000'), 'Beneficiary Location:Latitude': Decimal('51.4541449999999969122654874809086322784423828125'), 'Recipient Org:County': 'Berkshire', 'Recipient Org:Name': 'Achievement for All', 'Recipient Org:Street Address': 'Oxford House, Oxford Street', 'Planned Dates:Duration (months)': None, 'Recipient Org:Company Number': None}
 

Also, because we have defined a Table Schema, we can use goodtables to validate the data and get a report of any issues found:

from pprint import pprint

from goodtables import validate

datapackage_url = 'https://raw.githubusercontent.com/frictionlessdata/profiles/master/assets/grants/datapackage.json'

# validate() returns the report as a dictionary
report = validate(datapackage_url)
pprint(report)
# {'error-count': 0,
#  'preset': 'datapackage',
#  'table-count': 1,
#  'tables': [{'datapackage': 'https://raw.githubusercontent.com/frictionlessdata/profiles/c3423d1266439ffebfdac2b681d3dd0bffd81964/assets/grants/datapackage.json',
#    'encoding': None,
#    'error-count': 0,
#    'errors': [],
#    'format': 'inline',
#    'headers': ['Identifier',
#     'Title',
#     'Description',
#     'Currency',
#     'Amount Awarded',
#     'Amount Disbursed',
#     'Award Date',
#     'URL',
#     'Planned Dates:Start Date',
#
#     ...
#
#     'Grant Programme:URL',
#     'From an open call?',
#     'Related Activity',
#     'Last modified',
#     'Data Source'],
#    'row-count': 70,
#    'schema': 'table-schema',
#    'scheme': None,
#    'source': 'https://raw.githubusercontent.com/frictionlessdata/profiles/c3423d1266439ffebfdac2b681d3dd0bffd81964/assets/grants/360G-blagravetrust-2016.xlsx',
#    'time': 0.53,
#    'valid': True}],
#  'time': 1.386,
#  'valid': True,
#  'warnings': []}
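
Since the report is an ordinary dictionary, it is easy to post-process. The sketch below condenses a report of the shape shown above into one line per table. The helper `summarise_report` is our own and not part of goodtables, and `sample_report` is a trimmed stand-in for the real output, with the source shortened to the file name.

```python
def summarise_report(report):
    """Return one summary line per table, followed by an overall verdict."""
    lines = []
    for table in report["tables"]:
        status = "valid" if table["valid"] else "invalid"
        lines.append(
            f"{table['source']}: {status}, "
            f"{table['row-count']} rows, {table['error-count']} errors"
        )
    overall = "valid" if report["valid"] else "invalid"
    lines.append(
        f"overall: {overall} "
        f"({report['error-count']} errors in {report['table-count']} table(s))"
    )
    return lines


# Trimmed stand-in for the report above; only the keys we read are kept.
sample_report = {
    "error-count": 0,
    "table-count": 1,
    "valid": True,
    "tables": [
        {
            "source": "360G-blagravetrust-2016.xlsx",
            "row-count": 70,
            "error-count": 0,
            "errors": [],
            "valid": True,
        }
    ],
}

for line in summarise_report(sample_report):
    print(line)
```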

IATI Registry

The IATI Standard is a technical framework to publish aid, development, and humanitarian data in a standard way. Data published in the IATI standard is indexed on the IATI Registry. Here we will demonstrate the creation of a custom Data Package profile to package data meant to be published in the registry, ensuring that it has the required metadata.

Here are the fields available when publishing a new IATI file on the registry:

Name	Data Package field	Description	Type
`registry-file-id`	`name`	A unique identifier for the activity record	string
`registry-publisher-id`	-	Publisher identifier on the IATI Registry	string
`title`	`title`	The title of the dataset	string
`description`	`description`	Some useful notes about the data	string
`source-url`	`resources[0]['path']`	URL to a publicly accessible IATI file	string
`contact-email`	-	Contact email for publisher	string
`file-type`	-	Must be either ‘Activity’ or ‘Organization’	string
`recipient-country`	-	Recipient country	string
`last-updated-datetime`	-	Timestamp of the last modification	date-time
`activity-count`	-	Number of activities described in the data	integer
`default-language`	-	Language of the data	string
`secondary-publisher`	-	The publisher this dataset is published on behalf of	string

To create the new profile, we will add those fields that do not map directly to the Data Package specification to a standard Data Package descriptor and create a custom JSON Schema to validate it. Here is the resulting Data Package descriptor.
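
The mapping in the table can be sketched as a small function that turns a registry record into a descriptor. This is an illustration under assumptions: the function name `to_descriptor` and all the values in `record` are hypothetical, and keeping the registry's own names for the custom properties is our choice, since the post does not show how the published profile names them.

```python
# Registry fields that map directly onto standard Data Package properties.
DIRECT_MAPPING = {
    "registry-file-id": "name",
    "title": "title",
    "description": "description",
}


def to_descriptor(record):
    """Build a Data Package descriptor dict from an IATI registry record."""
    descriptor = {}
    for key, value in record.items():
        if key in DIRECT_MAPPING:
            descriptor[DIRECT_MAPPING[key]] = value
        elif key == "source-url":
            # The file URL becomes the path of the first resource.
            descriptor["resources"] = [{"path": value}]
        else:
            # No standard equivalent: keep it as a custom property
            # (assumption: the registry name is reused unchanged).
            descriptor[key] = value
    return descriptor


# Hypothetical record; none of these values come from the real registry.
record = {
    "registry-file-id": "example-publisher-activities",
    "registry-publisher-id": "example-publisher",
    "title": "Example activities",
    "source-url": "https://example.org/iati/activities.xml",
    "file-type": "Activity",
    "activity-count": 12,
}

descriptor = to_descriptor(record)
print(descriptor)
```

The custom properties left over after the mapping are the ones a custom JSON Schema would then need to describe and validate.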

Trees

Open Council Data defined the Trees 1.3 standard for describing the trees in a geographical region (e.g. a council). This standard includes information about the location, type, and other characteristics of individual trees, which is useful for planning future growth, maintaining canopy cover, managing the risk of falling branches, etc.

We are using the Colac Otway Shire Trees dataset as an example.

lat	lon	genus	species	dbh	height	common	location	ref	maturity	planted	address
-38.344595	143.592171	Melaleuca	Stypheliodes	1	5	Prickly Paperback	street	10001	mature	1975-01-01	106 Queen ST COLAC VIC 3250
-38.346198	143.591812	Melaleuca	Stypheliodes	1	4	Prickly Paperback	street	10004	mature	1975-01-01	122 Queen ST COLAC VIC 3250
-38.342097	143.588944	Fraxinus	Excelsior	1.2	12	Golden Ash	street	10007	mature	1980-01-01	40 Rae ST COLAC VIC 3250
-38.341927	143.588715	Agonis	Flexuosa	0.4	5	Weeping Willow Myrtle	street	10018	semi-mature	1980-01-01	47 Rae ST COLAC VIC 3250//next to coles coaches
-38.342044	143.591182	Eucalyptus	Nichollii	0.3	6	Willow Peppermint	street	10021	mature	1980-01-01	56 Rae ST COLAC VIC 3250//Between Queen St & CCDA

This data was modified from the source to conform to the Trees 1.3 specification. All the data is available here.

The Trees Data Package extends the Data Package specification by adding the following fields:

Name	Description	Type
countryCode	A single 2-letter ISO country code, or an array of them, identifying the country or countries present in the data	string
geospatialCoverage	Geospatial area contained in the dataset	geojson
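
One plausible way to fill in these two properties is to derive a bounding box from the data itself. The sketch below does this for the five sample rows shown earlier and expresses the result as a GeoJSON Polygon. Whether the published package uses a bounding box for `geospatialCoverage` is an assumption on our part, and `bounding_polygon` is a hypothetical helper.

```python
def bounding_polygon(points):
    """Return a GeoJSON Polygon covering all (lat, lon) points."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    south, north = min(lats), max(lats)
    west, east = min(lons), max(lons)
    # GeoJSON positions are [longitude, latitude]; the ring must be closed.
    ring = [
        [west, south],
        [east, south],
        [east, north],
        [west, north],
        [west, south],
    ]
    return {"type": "Polygon", "coordinates": [ring]}


# (lat, lon) pairs taken from the sample rows above.
sample_points = [
    (-38.344595, 143.592171),
    (-38.346198, 143.591812),
    (-38.342097, 143.588944),
    (-38.341927, 143.588715),
    (-38.342044, 143.591182),
]

extra_properties = {
    "countryCode": "AU",  # Colac is in Victoria, Australia
    "geospatialCoverage": bounding_polygon(sample_points),
}
print(extra_properties)
```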

The tree data itself is described by a Table Schema with the following fields:

Name	Title	Type	Constraints
lat	Latitude in decimal degrees (EPSG:4326)	number	required: True
lon	Longitude in decimal degrees (EPSG:4326)	number	required: True
genus	Botanical genus, in title case (e.g. Eucalyptus)	string
species	Botanical species, in title case (e.g. Regnans)	string
dbh	Diameter at breast height (130cm above ground), in centimeters. If this information is available only as a range, this contains the middle of the range.	number	minimum: 0
dbh_min	Minimum diameter at breast height (130cm above ground)	number	minimum: 0
dbh_max	Maximum diameter at breast height (130cm above ground)	number	minimum: 0
year_min	Lower bound on year that tree is expected to live to (e.g. A tree surveyed in 2008 with useful life expectancy range of 10-15 years would be 2018).	year
year_max	Upper bound on year that tree is expected to live to (e.g. A tree surveyed in 2008 with useful life expectancy range of 10-15 years would be 2023).	year
crown	Width in meters of the tree’s foliage (also known as crown spread). If this information is available only as a range, this contains the middle of the range.	number	minimum: 0
crown_min	Minimum width in meters of the tree’s foliage	number	minimum: 0
crown_max	Maximum width in meters of the tree’s foliage	number	minimum: 0
height	Height in meters. If this information is available only as a range, this contains the middle of the range.	number	minimum: 0
height_min	Minimum height in meters	number	minimum: 0
height_max	Maximum height in meters	number	minimum: 0
common	Common name for species (non-standardised)	string
location	Where the tree is located	string	enum: [‘park’, ‘street’, ‘council’]
ref	Council-specific identifier, enabling joining to other datasets	number
maintenance	How often the tree is inspected (in months)	number	minimum: 0
maturity		string	enum: [‘young’, ‘semi-mature’, ‘mature’, ‘over-mature’]
planted	Date of planting	date
updated	Date of addition to database or most recent revision	date
health	Health of tree growth	string	enum: [‘stump’, ‘dead’, ‘poor’, ‘fair’, ‘good’]
variety	Any part of the scientific name below species level, including subspecies or variety	string
description	Other information about the tree that is not in its scientific name or species	string
family	Botanical family	string
ule_min	Lower bound on useful life expectancy when surveyed	number	minimum: 0
ule_max	Upper bound on useful life expectancy when surveyed	number	minimum: 0
address	Street address	string

As with the grants example, we can iterate over the rows using the datapackage library:

import datapackage

datapackage_url = 'https://raw.githubusercontent.com/frictionlessdata/profiles/master/assets/trees/datapackage.json'
dp = datapackage.Package(datapackage_url)

for row in dp.resources[0].iter(keyed=True):
    print(row)
    # {'lat': Decimal('-38.347497'), 'lon': Decimal('143.595686'), 'genus': 'Melaleuca', 'species': 'Nesophila', 'dbh': Decimal('0.25'), 'dbh_min': None, 'dbh_max': None, 'year_min': None, 'year_max': None, 'crown': None, 'crown_min': None, 'crown_max': None, 'height': Decimal('2'), 'height_min': None, 'height_max': None, 'common': 'Snowy Honey Myrtle', 'location': 'street', 'ref': Decimal('10379'), 'maintenance': None, 'maturity': 'semi-mature', 'planted': datetime.date(1980, 1, 1), 'updated': None, 'health': None, 'variety': None, 'description': None, 'family': None, 'ule_min': None, 'ule_max': None, 'address': '18 Thomas ST COLAC VIC 3250'}
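
To illustrate what the constraints in the field table mean for a single row, here is a hand-rolled check covering `required`, `minimum`, and `enum` for a handful of fields. It is only a sketch of the idea: in practice the Frictionless Data libraries apply these constraints for you, and `row_errors` is a hypothetical helper rather than part of any of them.

```python
from decimal import Decimal

# Constraints copied from a few rows of the field table above.
CONSTRAINTS = {
    "lat": {"required": True},
    "lon": {"required": True},
    "dbh": {"minimum": 0},
    "height": {"minimum": 0},
    "location": {"enum": ["park", "street", "council"]},
    "maturity": {"enum": ["young", "semi-mature", "mature", "over-mature"]},
}


def row_errors(row):
    """Return a list of constraint violations for one keyed row."""
    errors = []
    for name, rules in CONSTRAINTS.items():
        value = row.get(name)
        if value is None:
            # Missing values only matter for required fields.
            if rules.get("required"):
                errors.append(f"{name}: required value is missing")
            continue
        if "minimum" in rules and value < rules["minimum"]:
            errors.append(f"{name}: {value} is below the minimum of {rules['minimum']}")
        if "enum" in rules and value not in rules["enum"]:
            errors.append(f"{name}: {value!r} is not one of {rules['enum']}")
    return errors


# Subset of the row printed above.
row = {
    "lat": Decimal("-38.347497"),
    "lon": Decimal("143.595686"),
    "dbh": Decimal("0.25"),
    "height": Decimal("2"),
    "location": "street",
    "maturity": "semi-mature",
}
print(row_errors(row))
```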

Conclusion

This has been a high-level exploration of using Tabular Data Package and Table Schema as a “specification framework”, allowing one to bootstrap a proof of concept data standard. Taking this approach, one gains access to a collection of modular software libraries that provide powerful APIs for working with this data according to the rules and conditions of the standard that is declared. Data validation, processing, transport, and consumption do not require custom tool chains once the data standard is declared as a Tabular Data Package Profile.

The approach described here is a first step in the direction of domain-specific tabular data profiles. A future iteration would likely integrate work we are currently undertaking in the Fiscal Data Package which enables the simple declaration of domain concepts via columnType annotations on Table Schemas. This enables data standard authors to work at a level of abstraction of domain concepts, rather than the “primitive types” we work with here via Table Schema. We plan to revisit this work once the columnType work from Fiscal Data Package is stable for general use.

For now, all the schemas above work as described, and open up all the software in the Frictionless Data ecosystem to those following this approach.

The source code for all the examples above is available in the following GitHub repository:

https://github.com/frictionlessdata/profiles
