When it comes to tabular data, the Frictionless Data specifications provide users with strong conventions for declaring both the shape of data (via schemas) and information about the data (as metadata on package and resource descriptors).
Within the Frictionless Data world, we purposefully refer to specification work as specifications, and not standards. The specifications therein provide clear conventions for working with data, and declare fundamental interfaces on which a modular software system that works with these specifications can be built. It is very meta. However, the specifications and software foundation do make the Frictionless Data ecosystem a powerful and compelling technical foundation on which to build data standards.
Some reasons for why:
- data serialised in a format that can be read by software developers use to build tools such as APIs, and also by many consumer programs that are used by consumers of data with little to no technical know how.
- built in progressive enhancement, where metadata, as well as structural and schematic information about the data, can be incorporated over time without modifying the original data source.
- A large and growing collection of tools, in many programming languages, for working with the Frictionless Data specifications.
- The specifications and the software are platform agnostic. A major example of this is being web-friendly without being dependent on the web (as with many linked data approaches). Linkable data, not Linked Data.
We’ll demonstrate this with some examples below, which are a proof of concept for the idea of using Frictionless Data as a technical foundation for data standards. This is an ongoing work that we intend to iterate on in response to feedback to this initial take.
Of course, we do not in any way think that the technical implementation of a data standard is what “data standards” is about. Data standards are about communities of practice, stakeholder engagement, and increasingly, a vehicle of change at the level of policy and governance. Technical implementation, in this wider context, is but a small, yet crucial, component. Indeed, this is a critical part of the promise we are pointing to here - that by building on a common foundation, communities building data standards can focus a little less on the technical implementation details and a little more on the change they want to see by creating them.
Grant funding
360Giving is an organization that helps funders to be transparent about the grants they award. It provides a standard for publishing grants data in a common format, and a Registry to host the data. Publishers can upload a spreadsheet that contains various fields describing the different activities they funded. We will demonstrate how a custom Data Package profile could describe one of these spreadsheets, ensuring that the required metadata fields are present and that the contents of the file conform to the schema.
We will use this sample dataset, taken directly from the Registry without any changes:
Identifier | Title | Description | Currency | Amount Awarded | Amount Disbursed | Award Date |
---|---|---|---|---|---|---|
360G-blagravetrust-00658000009YZRq | Achieving Further | Work with 22 FE colleges to improve attainment; attendance and participation | GBP | 300000 | 300000 | 2014-07-08 |
360G-blagravetrust-00658000007A1UQ | Training on feedback for Portsmouth VCS | Improving feedback skills for Portsmouth VCS - Feedback Fund 2016 | GBP | 3933 | 3933 | 2016-08-09 |
360G-blagravetrust-00658000008vdAl | Creative learning programme | Portsmouth young people leaving care | GBP | 75000 | 25000 | 2016-11-08 |
360G-blagravetrust-00658000007lweS | Feedback Fund | Feedback Fund 2016 | GBP | 2094 | 2094 | 2016-08-09 |
Our first step was to create a Table Schema describing the expected contents of the fields, which was then embedded in the Data Package descriptor. This was easy as like we mentioned before, there already is a well defined schema for how the fields should be. For the purposes of this example we just focused on a subset of all available fields. Here are some example fields:
Name / Title | Type | Constraints | |
---|---|---|---|
Identifier | string | ||
Title | string | maxLength: 140 | |
Description | string | ||
Currency | string | enum: [‘AED’, ‘AFN’, ‘ALL’, ‘AMD’, …] | |
Amount Awarded | number | ||
Amount Disbursed | number | ||
Award Date | date | ||
URL | string | ||
… | … | … | |
Funding Org:Name | string | ||
Funding Org:Department | string | ||
Grant Programme:Code | string | ||
Grant Programme:Title | string | ||
Grant Programme:URL | string | ||
From an open call? | string | ||
Related Activity | string | ||
Last modified | datetime | ||
Data Source | string |
Our custom Grants Data Package extends the Data Package specification by adding the following fields:
Name | Description | Type |
---|---|---|
funder | A JSON object describing the funding organization. It can include the following properties: id , name , email , url
|
object |
year | The year that the grants data in this file covers | integer |
modified | The timestap of when this dataset was last modifed | datetime |
This follows closely the JSON specification that 360Giving has, with the rest of the fields covered by the standard Data Package specification.
Once we have our data packaged in this way, we can leverage all the ecosystem of tools built around Data Packages to work with it. For instance, using the datapackage
library we can iterate over the contents of the file:
import datapackage
datapackage_url = 'https://raw.githubusercontent.com/frictionlessdata/profiles/master/assets/grants/datapackage.json'
dp = datapackage.Package(datapackage_url)
for row in dp.resources[0].iter(keyed=True):
print(row)
# {'Funding Org:Identifier': 'GB-CHC-1164021', 'Beneficiary Location:Geographic Code Type': 'UA', 'From an open call?': None, 'Beneficiary Location:Name': 'Reading', 'Grant Programme:Code': None, 'Beneficiary Location:Geographic Code': 'E06000038', 'Amount Disbursed': Decimal('300000'), 'Recipient Org:City': 'Newbury', 'Award Date': datetime.datetime(2014, 7, 8, 0, 0), 'Beneficiary Location:Longitude': Decimal('-0.95543100000000003024780426130746491253376007080078125'), 'Recipient Org:Web Address': 'http://www.afaeducation.org', 'Recipient Org:Charity Number': '1142154', 'Grant Programme:Title': None, 'Related Activity': None, 'Grant Programme:URL': None, 'Recipient Org:Country': 'UK', 'Funding Org:Name': 'The Blagrave Trust', 'Title': 'Achieving Further', 'Planned Dates:End Date': datetime.datetime(2017, 6, 30, 0, 0), 'Recipient Org:Postal Code': 'RG14 1JQ', 'Identifier': '360G-blagravetrust-00658000009YZRq', 'Data Source': None, 'Planned Dates:Start Date': None, 'Currency': 'GBP', 'Description': 'Work with 22 FE colleges to improve attainment; attendance and participation', 'Recipient Org:Identifier': 'GB-CHC-1142154', 'Recipient Org:Description': 'Charity working with nurseries schools and colleges to raise attainment and achivement of children particularly those with barriers to learning', 'Funding Org:Department': None, 'Beneficiary Location:Country Code': None, 'Last modified': None, 'URL': None, 'Amount Awarded': Decimal('300000'), 'Beneficiary Location:Latitude': Decimal('51.4541449999999969122654874809086322784423828125'), 'Recipient Org:County': 'Berkshire', 'Recipient Org:Name': 'Achievement for All', 'Recipient Org:Street Address': 'Oxford House, Oxford Street', 'Planned Dates:Duration (months)': None, 'Recipient Org:Company Number': None}
Also, as we define the Table Schema, we can use goodtables to perform data validation and get a report of issues found:
from goodtables import validate
datapackage_url = 'https://raw.githubusercontent.com/frictionlessdata/profiles/master/assets/grants/datapackage.json'
validate(datapackage_url)
'''
{'error-count': 0,
'preset': 'datapackage',
'table-count': 1,
'tables': [{'datapackage': 'https://raw.githubusercontent.com/frictionlessdata/profiles/c3423d1266439ffebfdac2b681d3dd0bffd81964/assets/grants/datapackage.json',
'encoding': None,
'error-count': 0,
'errors': [],
'format': 'inline',
'headers': ['Identifier',
'Title',
'Description',
'Currency',
'Amount Awarded',
'Amount Disbursed',
'Award Date',
'URL',
'Planned Dates:Start Date',
...
'Grant Programme:URL',
'From an open call?',
'Related Activity',
'Last modified',
'Data Source'],
'row-count': 70,
'schema': 'table-schema',
'scheme': None,
'source': 'https://raw.githubusercontent.com/frictionlessdata/profiles/c3423d1266439ffebfdac2b681d3dd0bffd81964/assets/grants/360G-blagravetrust-2016.xlsx',
'time': 0.53,
'valid': True}],
'time': 1.386,
'valid': True,
'warnings': []}
'''
IATI Registry
The IATI Standard is a technical framework to publish aid, development, and humanitarian data in a standard way. Data published in the IATI standard is indexed on the IATI Registry. Here we will demonstrate the creation of a custom Data Package profile to package data meant to be published in the registry, ensuring that it has the required metadata.
Here are the fields available when publishing a new IATI file on the registry:
Name | Data Package field | Description | Type |
---|---|---|---|
registry-file-id |
name |
A unique identifier for the activity record | string |
registry-publisher-id |
- | Publisher identificator on the IATI Registry | string |
title |
title |
The title of the dataset | string |
description |
description |
Some useful notes about the data | string |
source-url |
resources[0]['path'] |
URL to a publicly accessible IATI file | string |
contact-email |
- | Contact email for publisher | string |
file-type |
- | Must be either ‘Activity’ or ‘Organization’ | string |
recipient-country |
- | Recipient country | string |
last-updated-datetime |
- | Timestamp of the last modification | date-time |
activity-count |
- | Number of activities described in the data | integer |
default-language |
- | Language of the data | string |
secondary-publisher |
- | The publisher this dataset is published on behalf of | string |
To create the new profile, we will add those fields that do not map directly to the Data Package specification to a standard Data Package descriptor and create a custom JSON Schema to validate it. Here is the resulting Data Package descriptor.
Trees
The Open Council Data defined the standard Trees 1.3 for describing the trees in a geographical region (e.g. a council). This standard includes information about the location, type, and other characteristics of individual trees, which is useful for planning future growth, maintenance of canopy cover, managing risk of falling branches, etc.
We are using the from Colac Otway Shire Trees as an example.
lat | lon | genus | species | dbh | dbh_min | dbh_max | year_min | year_max | crown | crown_min | crown_max | height | height_min | height_max | common | location | ref | maintenance | maturity | planted | updated | health | variety | description | family | ule_min | ule_max | address |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
-38.344595 | 143.592171 | Melaleuca | Stypheliodes | 1 | 5 | Prickly Paperback | street | 10001 | mature | 1975-01-01 | 106 Queen ST COLAC VIC 3250 | |||||||||||||||||
-38.346198 | 143.591812 | Melaleuca | Stypheliodes | 1 | 4 | Prickly Paperback | street | 10004 | mature | 1975-01-01 | 122 Queen ST COLAC VIC 3250 | |||||||||||||||||
-38.342097 | 143.588944 | Fraxinus | Excelsior | 1.2 | 12 | Golden Ash | street | 10007 | mature | 1980-01-01 | 40 Rae ST COLAC VIC 3250 | |||||||||||||||||
-38.341927 | 143.588715 | Agonis | Flexuosa | 0.4 | 5 | Weeping Willow Myrtle | street | 10018 | semi-mature | 1980-01-01 | 47 Rae ST COLAC VIC 3250//next to coles coaches | |||||||||||||||||
-38.342044 | 143.591182 | Eucalyptus | Nichollii | 0.3 | 6 | Willow Peppermint | street | 10021 | mature | 1980-01-01 | 56 Rae ST COLAC VIC 3250//Between Queen St & CCDA |
This data was modified from the source to conform to the Trees 1.3 specification. All the data is available here.
The Trees Data Package extends the Data Package specification by adding the following fields:
Name | Description | Type |
---|---|---|
countryCode | A single or an array of 2-letter ISO country code defining the country(ies) present in the data | string |
geospatialCoverage | Geospatial area contained in the dataset | geojson |
Name | Title | Type | Constraints |
---|---|---|---|
lat | Latitude in decimal degrees (EPSG:4326) | number | required: True |
lon | Longitude in decimal degrees (EPSG:4326) | number | required: True |
genus | Botanical genus, in title case (e.g. Eucalyptus) | string | |
species | Botanical species, in title case (e.g. Regnans) | string | |
dbh | Diameter at breast height (130cm above ground), in centimeters. If this information is available only as a range, this contains the middle of the range. | number | minimum: 0 |
dbh_min | Minimum diameter at breast height (130cm above ground) | number | minimum: 0 |
dbh_max | Maximum diameter at breast height (130cm above ground) | number | minimum: 0 |
year_min | Lower bound on year that tree is expected to live to (e.g. A tree surveyed in 2008 with useful life expectancy range of 10-15 years would be 2018). | year | |
year_max | Upper bound on year that tree is expected to live to (e.g. A tree surveyed in 2008 with useful life expectancy range of 10-15 years would be 2023). | year | |
crown | Width in metres of the tree’s foliage (also known as crown spread). If this information is available only as a range, this contains the middle of the range. | number | minimum: 0 |
crown_min | Minimum width in meters of the tree’s foliage | number | minimum: 0 |
crown_max | Maximum width in meters of the tree’s foliage | number | minimum: 0 |
height | Height in meters. If this information is available only as a range, this contains the middle of the range. | number | minimum: 0 |
height_min | Minimum height in meters | number | minimum: 0 |
height_max | Maximum height in meters | number | minimum: 0 |
common | Common name for species (non-standardised) | string | |
location | Where the tree is located | string | enum: [‘park’, ‘street’, ‘council’] |
ref | Council-specific identifier, enabling joining to other datasets | number | |
maintenance | How often the tree is inspected (in months) | number | minimum: 0 |
maturity | string | enum: [‘young’, ‘semi-mature’, ‘mature’, ‘over-mature’] | |
planted | Date of planting | date | |
updated | Date of addition to database or most recent revision | date | |
health | Health of tree growth | string | enum: [‘stump’, ‘dead’, ‘poor’, ‘fair’, ‘good’] |
variety | Any part of the scientific name below species level, including subspecies or variety | string | |
description | Other information about the tree that is not in its scientific name or species | string | |
family | Botanical family | string | |
ule_min | Lower bound on useful life expectancy when surveyed | number | minimum: 0 |
ule_max | Upper bound on useful life expectancy when surveyed | number | minimum: 0 |
address | Street address | string |
import datapackage
datapackage_url = 'https://raw.githubusercontent.com/frictionlessdata/profiles/master/assets/trees/datapackage.json'
dp = datapackage.Package(datapackage_url)
for row in dp.resources[0].iter(keyed=True):
print(row)
# {'lat': Decimal('-38.347497'), 'lon': Decimal('143.595686'), 'genus': 'Melaleuca', 'species': 'Nesophila', 'dbh': Decimal('0.25'), 'dbh_min': None, 'dbh_max': None, 'year_min': None, 'year_max': None, 'crown': None, 'crown_min': None, 'crown_max': None, 'height': Decimal('2'), 'height_min': None, 'height_max': None, 'common': 'Snowy Honey Myrtle', 'location': 'street', 'ref': Decimal('10379'), 'maintenance': None, 'maturity': 'semi-mature', 'planted': datetime.date(1980, 1, 1), 'updated': None, 'health': None, 'variety': None, 'description': None, 'family': None, 'ule_min': None, 'ule_max': None, 'address': '18 Thomas ST COLAC VIC 3250'}
Conculsion
This has been a high-level exploration of using Tabular Data Package and Table Schema as a “specification framework”, allowing one to bootstrap a proof of concept data standard. Taking this approach, one gains access to a collection of modular software libraries that provide powerful APIs for working with this data according to the rules and condition of the standard that is declared. Data validation, processing, transport, and consumption do not require custom tool chains once the data standard is declared as a Tabular Data Package Profile.
The approach described here is a first step in the direction of domain-specific tabular data profiles. A future iteration would likely integrate work we are currently undertaking in the Fiscal Data Package which enables the simple declaration of domain concepts via columnType
annotations on Table Schemas. This enables data standard authors to work at a level of abstraction of domain concepts, rather than the “primitive types” we work with here via Table Schema. We plan to revisit this work once the columnType
work from Fiscal Data Package is stable for general use.
For now, all the schemas above work as described, and open up all the software in the Frictionless Data ecosystem to those following this approach.
You can check the source code for all the examples listed in the following GitHub repository:
Comments