This post walks you through the major changes in the Data Package v1 specs compared to pre-v1. It covers changes in the full suite of Data Package specifications including Data Resources and Table Schema. It is particularly valuable if:
- you were using Data Packages pre-v1 and want to know how to upgrade your datasets
- you are implementing Data Package-related tooling and want to know how to upgrade your tools, or want to support or auto-upgrade pre-v1 Data Packages for backwards compatibility
It also includes a script we have created (in JavaScript) that we’ve been using ourselves to automate upgrades of the Core Data.
The Changes
Two major changes in v1 were presentational:
- Creating Data Resource as a separate spec from Data Package. This did not change anything substantive in terms of how data packages worked but is important presentationally. In parallel, we also split out a Tabular Data Resource from the Tabular Data Package.
- Renaming JSON Table Schema to just Table Schema
In addition, there were a fair number of substantive changes. We summarize these in the sections below. For more detail, see the current specifications and the old site containing the pre-v1 specifications.
Table Schema
Link to spec: https://specs.frictionlessdata.io/table-schema/
Property | Pre v1 | v1 Spec | Notes | Issue |
---|---|---|---|---|
id/name | id | name | Renamed id to name to be consistent across specs | |
type/number | format: currency | format: currency removed; additional properties: bareNumber, decimalChar and groupChar | | #509 #246 |
type/integer | No additional properties | Additional properties: bareNumber | | #509 |
type/boolean | true: [yes, y, true, t, 1], false: [no, n, false, f, 0] | true: [true, True, TRUE, 1], false: [false, False, FALSE, 0] | | #415 |
type/year + yearmonth | | year and yearmonth NB: these were temporarily gyear and gyearmonth | | #346 |
type/duration | | duration | | #210 |
type/rdfType | | rdfType | Support rich “semantic web” types for fields | #217 |
type/null | | removed (see missingValues) | | #262 |
missingValues | | missingValues | Missing values support did not exist pre v1. | #97 |
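As a rough illustration of the first two rows of the table, the sketch below renames `id` to `name` on each field and drops `format: currency` from number fields. The helper name `upgradeTableSchema` and the sample schema are hypothetical, and the other rows of the table (boolean values, new types, `missingValues`) are deliberately left untouched.

```javascript
// Hypothetical helper (not part of any official tooling): returns an upgraded
// copy of a pre-v1 Table Schema descriptor without mutating the input.
function upgradeTableSchema(schema) {
  const fields = (schema.fields || []).map((field) => {
    const { id, ...upgraded } = field;
    // v1 renames `id` to `name`; an existing `name` wins if both are present.
    if (upgraded.name === undefined && id !== undefined) {
      upgraded.name = id;
    }
    // `format: currency` was removed from the number type in v1.
    if (upgraded.type === 'number' && upgraded.format === 'currency') {
      delete upgraded.format;
    }
    return upgraded;
  });
  return { ...schema, fields };
}

// Sample pre-v1 schema (made up for illustration).
const oldSchema = {
  fields: [
    { id: 'country', type: 'string' },
    { id: 'gdp', type: 'number', format: 'currency' },
  ],
};

const newSchema = upgradeTableSchema(oldSchema);
console.log(JSON.stringify(newSchema, null, 2));
```

Returning a copy rather than editing in place makes it easy to diff the old and new descriptors before overwriting anything.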
Data Resource
Link to spec: https://specs.frictionlessdata.io/data-resource/
Note: Data Resource did not exist as a separate spec pre-v1, so strictly speaking we are comparing the Data Resource section of the old Data Package spec with the new Data Resource spec.
Property | Pre v1 | v1 Spec | Notes | Issue |
---|---|---|---|---|
path | path and url | path only | url merged into path and path can now be a url or local path | #250 |
path | string | string or array | path can be an array to support a single resource split across multiple files | #228 |
name | recommended | required | Made name required to enable access to resources by name consistently across tools | |
profile | | recommended | See profiles discussion | |
sources, licenses … | | | Inherited metadata from Data Package like sources or licenses upgraded in line with changes in Data Package | |
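The `path` and `name` rows can be sketched in a few lines. The helper `upgradeResource` and the sample resources are hypothetical. The `resource-N` fallback name is only a placeholder assumption; in practice you would choose a meaningful name. If a resource has both `path` and `url`, this sketch keeps `path`.

```javascript
// Hypothetical helper: returns a v1-style copy of a pre-v1 resource.
// `index` is used only to build a placeholder name when none is given.
function upgradeResource(resource, index) {
  const { url, ...upgraded } = resource;
  // v1 merges `url` into `path`; `path` may be a URL or a local path.
  // Assumption: when both are present, the existing `path` is kept.
  if (upgraded.path === undefined && url !== undefined) {
    upgraded.path = url;
  }
  // `name` is required in v1. The fallback below is a placeholder only.
  if (!upgraded.name) {
    upgraded.name = `resource-${index}`;
  }
  return upgraded;
}

// Sample pre-v1 resources (made up for illustration).
const oldResources = [
  { name: 'gdp', url: 'https://example.com/gdp.csv' },
  { path: 'data/population.csv' },
];

// Array.prototype.map passes (element, index), matching the helper's signature.
const newResources = oldResources.map(upgradeResource);
console.log(JSON.stringify(newResources, null, 2));

// In v1, `path` may also be an array for a resource split across several files:
const splitResource = {
  name: 'population',
  path: ['data/population-part1.csv', 'data/population-part2.csv'],
};
```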
Tabular Data Resource
Link to spec: https://specs.frictionlessdata.io/tabular-data-resource/
Just as Data Resource was split out from Data Package, Tabular Data Resource was split out from the old Tabular Data Package spec.
There were no significant changes here beyond those in Data Resource.
Data Package
Link to spec: https://specs.frictionlessdata.io/data-package/
Property | Pre v1 | v1 Spec | Notes | Issue |
---|---|---|---|---|
name | required | recommended | Unique names are not essential to any part of the present tooling so we have moved to recommended. | |
id | | id property-globally unique | Globally unique id property | #228 |
licenses | license - object or string. The object structure must contain a type property and a url property linking to the actual text | licenses - is an array. Each item in the array is a License. Each must be an object. The object must contain a name property and/or a path property. It may contain a title property. | | |
author | author | author is removed in favour of contributors | | |
contributors | name, email, web properties with name required | title property required; role property value must be one of author, publisher, maintainer, wrangler or contributor (defaults to contributor) | | |
sources | name, web and email, none required | title, path and email, with title required | | |
resources | | resources array is required | | #434 |
dataDependencies | dataDependencies | Moved to a pattern until we have greater clarity on need. | | #341 |
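To make the `licenses`, `author` and `sources` rows concrete, here is a minimal sketch of how a pre-v1 package descriptor might be mapped. The helpers `upgradeLicense` and `upgradePackage` are hypothetical, and the field mappings (`type` to `name`, `url` to `path`, `name` to `title`, `web` to `path`) are assumptions read off the table rather than an official migration. Only string-valued `author` is handled, and existing `contributors` entries are not rewritten.

```javascript
// Hypothetical helper: maps a pre-v1 `license` value to a v1 License object.
// Assumption: `type` becomes `name` and `url` becomes `path`.
function upgradeLicense(license) {
  if (typeof license === 'string') {
    return { name: license };
  }
  const upgraded = {};
  if (license.type !== undefined) upgraded.name = license.type;
  if (license.url !== undefined) upgraded.path = license.url;
  return upgraded;
}

// Hypothetical helper: returns a v1-style copy of a pre-v1 package descriptor.
function upgradePackage(pkg) {
  const { license, ...upgraded } = pkg;

  // `license` (string or object) becomes a `licenses` array.
  if (license !== undefined && upgraded.licenses === undefined) {
    upgraded.licenses = [upgradeLicense(license)];
  }

  // `author` is removed in favour of `contributors` with role "author".
  // Only the string form is handled; anything else is left as it was.
  if (typeof upgraded.author === 'string') {
    upgraded.contributors = [
      ...(upgraded.contributors || []),
      { title: upgraded.author, role: 'author' },
    ];
    delete upgraded.author;
  }

  // sources: `name` becomes `title`, `web` becomes `path`, `email` is kept.
  if (Array.isArray(upgraded.sources)) {
    upgraded.sources = upgraded.sources.map(({ name, web, ...rest }) => ({
      ...(name !== undefined ? { title: name } : {}),
      ...(web !== undefined ? { path: web } : {}),
      ...rest,
    }));
  }

  return upgraded;
}

// Sample pre-v1 descriptor (made up for illustration).
const oldPackage = {
  name: 'example-package',
  license: { type: 'ODC-PDDL-1.0', url: 'https://example.com/license' },
  author: 'Jane Doe',
  sources: [{ name: 'Example Source', web: 'https://example.com/data' }],
  resources: [],
};

const newPackage = upgradePackage(oldPackage);
console.log(JSON.stringify(newPackage, null, 2));
```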
Tabular Data Package
Link to spec: https://specs.frictionlessdata.io/tabular-data-package/
Tabular Data Package is unchanged.
Profiles
Profiles arrived in v1:
http://specs.frictionlessdata.io/profiles/
Profiles are the first step towards supporting a rich ecosystem of “micro-schemas” for data. They provide a very simple way to quickly state that your data follows a specific structure and/or schema. From the docs:
Different kinds of data need different formats for their data and metadata. To support these different data and metadata formats we need to extend and specialise the generic Data Package. These specialized types of Data Package (or Data Resource) are termed profiles.
For example, there is a Tabular Data Package profile that specializes Data Packages specifically for tabular data. And there is a “Fiscal” Data Package profile designed for government financial data that includes requirements that certain columns are present in the data e.g. Amount or Date and that they contain data of certain types.
We think profiles are an easy, lightweight way to start adding more structure to your data.
Profiles can be specified on both resources and packages.
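For instance, a descriptor might declare a profile at both levels as sketched below. The descriptor contents and the helper `listProfiles` are hypothetical. The identifiers `tabular-data-package` and `tabular-data-resource`, and the defaults `data-package` and `data-resource` used when no profile is given, are taken from our reading of the v1 profiles page and should be checked against it.

```javascript
// Sample descriptor (made up for illustration) with a profile on the package
// and on its resource.
const descriptor = {
  profile: 'tabular-data-package',
  name: 'example-package',
  resources: [
    {
      profile: 'tabular-data-resource',
      name: 'example-data',
      path: 'data/example.csv',
      schema: {
        fields: [
          { name: 'date', type: 'date' },
          { name: 'amount', type: 'number' },
        ],
      },
    },
  ],
};

// Hypothetical helper: reports which profile the package and each resource
// declare, falling back to the generic profiles when none is stated.
function listProfiles(pkg) {
  const resources = {};
  for (const resource of pkg.resources || []) {
    resources[resource.name] = resource.profile || 'data-resource';
  }
  return { package: pkg.profile || 'data-package', resources };
}

console.log(listProfiles(descriptor));
```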
Automate upgrading your descriptor according to the spec v1
We have created a data package normalization script that you can use to automate the process of upgrading a datapackage.json or Table Schema from pre-v1 to v1.
The script enables you to automate updating your datapackage.json for the following properties: path, contributors, resources, sources and licenses.
This is a simple script that you can download directly from here:
https://raw.githubusercontent.com/datahq/datapackage-normalize-js/master/normalize.js
e.g. using wget:
wget https://raw.githubusercontent.com/datahq/datapackage-normalize-js/master/normalize.js
# path (optional) is the path to datapackage.json
# if not provided looks in current directory
normalize.js [path]
# prints out updated datapackage.json
You can also use it as a library:
# install it from npm
npm install datapackage-normalize
Then use it in your JavaScript:
const normalize = require('datapackage-normalize')
const path = 'path/to/datapackage.json'
normalize(path)
Conclusion
The above summarizes the main changes in v1 of the Data Package suite of specs, along with instructions on how to upgrade.
For more details, please see the Data Package specifications. You can also visit the Frictionless Data initiative for more information about Data Packages.
This blog post was originally published on datahub.io by Meiran Zhiyenbayev. Meiran works for Datopian, who have been developing datahub.io as part of the Frictionless Data initiative.