When crafting data from some other data, like packaging public data, using the good tools can really ease development process and reliability of the data.
make which have already been used for decades to build software, is a very good option as advocated by Mike Bostock’s in his blog.
A state-of-the-art Makefile
Let’s take an example with crafting geo-countries datapackage. We need to download data from NaturalEarth, extract the zip, convert it to json with ogr (the ‘‘swiss-army-knife’’ of maps), and rename a column. Following Mike Bostok’s instructions, here’s an appropriate
Makefile (that should lie the
scripts folder of the project):
all: ../data/countries.geojson ne_10m_admin_0_countries.zip: wget http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_0_countries.zip ne_10m_admin_0_countries.README.html ne_10m_admin_0_countries.VERSION.txt ne_10m_admin_0_countries.dbf ne_10m_admin_0_countries.prj ne_10m_admin_0_countries.shp ne_10m_admin_0_countries.shx: ne_10m_admin_0_countries.zip unzip ne_10m_admin_0_countries.zip ne_10m_admin_0_countries.geojson: ne_10m_admin_0_countries.dbf ne_10m_admin_0_countries.prj ne_10m_admin_0_countries.shp ne_10m_admin_0_countries.shx ogr2ogr -select admin,iso_a3 -f geojson ne_10m_admin_0_countries.geojson ne_10m_admin_0_countries.shp ../data: mkdir ../data ../data/countries.geojson: ne_10m_admin_0_countries.geojson ../data # Change the name of the fields after conversion cat ne_10m_admin_0_countries.geojson | sed 's/"admin": /"name": /g' | sed 's/"iso_a3": /"ISO3166-1-Alpha-3": /g' > ../data/countries.geojson
If you’re not familiar with Makefiles, the last section reads : “When both files
../data are available, you can run command
cat ne_10m_admin_0_countries.geojson | sed 's/"admin": /"name": /g' | sed 's/"iso_a3": /"ISO3166-1-Alpha-3": /g' > ../data/countries.geojson
and it will produce file
Make deduces the commands to be run, starting with the ones where everything is available, until it produces target
We achieve two very important goals with this
- it covers the whole process even the download part. It’s so easy to forget wether we have downloaded
ne_110m_admin_0_countries.zipwhen it is done by hand. But now every thing is written down so we can keep track of it in our source repository (like git), even if we change our mind.
makechecks the date consistency of the files. That means that if Scottland has gone independent in 2015 it would have created a new country, that Natural Earth would have added. Now you can download the updated version of
ne_10m_admin_0_countries.zip. When running
makeagain, it would notice that the unziped files like
ne_10m_admin_0_countries.dbfand so on are older than their source, so the
unzipcommand has to be run again ! And so on because
ne_10m_admin_0_countries.geojsonwould not be up to date, until every depending file is updated.
Even if this is a great improvement over running-all-the-commands-manually-and-don’t-remember-them as much custom-script-that-must-start-from-scratch-every-time, it is not enough to have a fluid and reliable development experience.
Improve collaboration with
Before we see in detail two major improvements, let’s see the same workflow written in a
tuttlefile (still in folder
file://ne_10m_admin_0_countries.zip <- http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_0_countries.zip wget http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_0_countries.zip file://ne_10m_admin_0_countries.README.html, file://ne_10m_admin_0_countries.VERSION.txt, file://ne_10m_admin_0_countries.dbf, file://ne_10m_admin_0_countries.prj, file://ne_10m_admin_0_countries.shp, file://ne_10m_admin_0_countries.shx <- file://ne_10m_admin_0_countries.zip unzip ne_10m_admin_0_countries.zip file://ne_10m_admin_0_countries.geojson <- file://ne_10m_admin_0_countries.dbf, file://ne_10m_admin_0_countries.prj, file://ne_10m_admin_0_countries.shp, file://ne_10m_admin_0_countries.shx ogr2ogr -select admin,iso_a3 -f geojson ne_10m_admin_0_countries.geojson ne_10m_admin_0_countries.shp file://../data <- cd .. mkdir data file://../data/countries.geojson <- file://ne_10m_admin_0_countries.geojson, file://../data # Change the name of the fields after conversion cat ne_10m_admin_0_countries.geojson | sed 's/"admin": /"name": /g' | sed 's/"iso_a3": /"ISO3166-1-Alpha-3": /g' > ../data/countries.geojson
Looks familiar ?
It is very close to Makefile, except for urls everywhere. Because
tuttle aims at giving a url to every bit of data, in order link them together.
You can see the first section of the tuttlefile clearly states the dependency of file
ne_10m_admin_0_countries.zip to url
This means that when the online list of countries change, no unusual action is required. You just have to execute
tuttle run as if you where building the data for the first time. It will notice the source url has changed and will reprocess dependencies accordingly.
The other difference with
make is not in the syntax, it’s in how it deals with changes in the
tuttlefile. If you ever worked with the
ogr2ogr command line tool, you know it’s impossible to make it right the first time. But if you change the command in a
Makefile, unfortunately running
make again won’t update the data because the date of the file
ne_10m_admin_0_countries.geojson seem coherent.
To improve this,
tuttle reacts to changes in every command. When you run it, it will first roll back as the previous command as if had never run by deleting whatever data has been produced. Then it will run the updated ogr2ogr command. That’s very handy when prototyping because you want focus on your code without side effects caused by remaining data.
This feature also proves really useful when working in a team. With
make, if you change the makefile, you need to send an mail to all your team with instructions of how to clean the workspace (ie : “Please remove file ../data/countries.geojson because I have changed the ogr2ogr command”), and hope nobody misses it because it would lead to undebuggable behaviour. On the other hand
tuttle guarantees the data corresponds exactly the
tuttlefile, so you can safely share or merge changes with your fellow contributors.
If you put both improvements over
make together (remote dependencies and reliably reprocess what have changed), we can set up a system that automatically updates datapackages when either the original data changes or when someone modifies the source code. Pretty cool, huh ?
I hope I’ve convinced you of the advantages of tuttle for collectively crafting data. If you’re interested, the best way to learn more about inline languages, url to databases or online resources, is to read the main tutorial.
And one more thing about the sugar syntax you can expect… You could simplify the first section of the
tuttlefile in only one line :
file://ne_10m_admin_0_countries.zip <- http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_0_countries.zip ! download