When crafting data from some other data, like packaging public data, using the good tools can really ease development process and reliability of the data.
The venerable make
which have already been used for decades to build software, is a very good option as advocated by Mike Bostock’s in his blog.
A state-of-the-art Makefile
Let’s take an example with crafting geo-countries datapackage. We need to download data from NaturalEarth, extract the zip, convert it to json with ogr (the ‘‘swiss-army-knife’’ of maps), and rename a column. Following Mike Bostok’s instructions, here’s an appropriate Makefile
(that should lie the scripts
folder of the project):
all: ../data/countries.geojson
ne_10m_admin_0_countries.zip:
wget http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_0_countries.zip
ne_10m_admin_0_countries.README.html ne_10m_admin_0_countries.VERSION.txt ne_10m_admin_0_countries.dbf ne_10m_admin_0_countries.prj ne_10m_admin_0_countries.shp ne_10m_admin_0_countries.shx: ne_10m_admin_0_countries.zip
unzip ne_10m_admin_0_countries.zip
ne_10m_admin_0_countries.geojson: ne_10m_admin_0_countries.dbf ne_10m_admin_0_countries.prj ne_10m_admin_0_countries.shp ne_10m_admin_0_countries.shx
ogr2ogr -select admin,iso_a3 -f geojson ne_10m_admin_0_countries.geojson ne_10m_admin_0_countries.shp
../data:
mkdir ../data
../data/countries.geojson: ne_10m_admin_0_countries.geojson ../data
# Change the name of the fields after conversion
cat ne_10m_admin_0_countries.geojson | sed 's/"admin": /"name": /g' | sed 's/"iso_a3": /"ISO3166-1-Alpha-3": /g' > ../data/countries.geojson
If you’re not familiar with Makefiles, the last section reads : “When both files ne_10m_admin_0_countries.geojson
and ../data
are available, you can run command cat ne_10m_admin_0_countries.geojson | sed 's/"admin": /"name": /g' | sed 's/"iso_a3": /"ISO3166-1-Alpha-3": /g' > ../data/countries.geojson
and it will produce file ../data/countries.geojson
”. Make
deduces the commands to be run, starting with the ones where everything is available, until it produces target all
.
We achieve two very important goals with this Makefile
:
- it covers the whole process even the download part. It’s so easy to forget wether we have downloaded
ne_10m_admin_0_countries.zip
orne_110m_admin_0_countries.zip
when it is done by hand. But now every thing is written down so we can keep track of it in our source repository (like git), even if we change our mind. - Running
make
checks the date consistency of the files. That means that if Scottland has gone independent in 2015 it would have created a new country, that Natural Earth would have added. Now you can download the updated version ofne_10m_admin_0_countries.zip
. When runningmake
again, it would notice that the unziped files likene_10m_admin_0_countries.dbf
and so on are older than their source, so theunzip
command has to be run again ! And so on becausene_10m_admin_0_countries.geojson
would not be up to date, until every depending file is updated.
Even if this is a great improvement over running-all-the-commands-manually-and-don’t-remember-them as much custom-script-that-must-start-from-scratch-every-time, it is not enough to have a fluid and reliable development experience.
Improve collaboration with tuttle
Before we see in detail two major improvements, let’s see the same workflow written in a tuttlefile
(still in folder scripts
) :
file://ne_10m_admin_0_countries.zip <- http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_0_countries.zip
wget http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_0_countries.zip
file://ne_10m_admin_0_countries.README.html, file://ne_10m_admin_0_countries.VERSION.txt, file://ne_10m_admin_0_countries.dbf, file://ne_10m_admin_0_countries.prj, file://ne_10m_admin_0_countries.shp, file://ne_10m_admin_0_countries.shx <- file://ne_10m_admin_0_countries.zip
unzip ne_10m_admin_0_countries.zip
file://ne_10m_admin_0_countries.geojson <- file://ne_10m_admin_0_countries.dbf, file://ne_10m_admin_0_countries.prj, file://ne_10m_admin_0_countries.shp, file://ne_10m_admin_0_countries.shx
ogr2ogr -select admin,iso_a3 -f geojson ne_10m_admin_0_countries.geojson ne_10m_admin_0_countries.shp
file://../data <-
cd ..
mkdir data
file://../data/countries.geojson <- file://ne_10m_admin_0_countries.geojson, file://../data
# Change the name of the fields after conversion
cat ne_10m_admin_0_countries.geojson | sed 's/"admin": /"name": /g' | sed 's/"iso_a3": /"ISO3166-1-Alpha-3": /g' > ../data/countries.geojson
Looks familiar ?
It is very close to Makefile, except for urls everywhere. Because tuttle
aims at giving a url to every bit of data, in order link them together.
You can see the first section of the tuttlefile clearly states the dependency of file ne_10m_admin_0_countries.zip
to url http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_0_countries.zip
.
This means that when the online list of countries change, no unusual action is required. You just have to execute tuttle run
as if you where building the data for the first time. It will notice the source url has changed and will reprocess dependencies accordingly.
The other difference with make
is not in the syntax, it’s in how it deals with changes in the tuttlefile
. If you ever worked with the ogr2ogr
command line tool, you know it’s impossible to make it right the first time. But if you change the command in a Makefile
, unfortunately running make
again won’t update the data because the date of the file ne_10m_admin_0_countries.geojson
seem coherent.
To improve this, tuttle
reacts to changes in every command. When you run it, it will first roll back as the previous command as if had never run by deleting whatever data has been produced. Then it will run the updated ogr2ogr command. That’s very handy when prototyping because you want focus on your code without side effects caused by remaining data.
This feature also proves really useful when working in a team. With make
, if you change the makefile, you need to send an mail to all your team with instructions of how to clean the workspace (ie : “Please remove file ../data/countries.geojson because I have changed the ogr2ogr command”), and hope nobody misses it because it would lead to undebuggable behaviour. On the other hand tuttle
guarantees the data corresponds exactly the tuttlefile
, so you can safely share or merge changes with your fellow contributors.
Conclusion
If you put both improvements over make
together (remote dependencies and reliably reprocess what have changed), we can set up a system that automatically updates datapackages when either the original data changes or when someone modifies the source code. Pretty cool, huh ?
I hope I’ve convinced you of the advantages of tuttle for collectively crafting data. If you’re interested, the best way to learn more about inline languages, url to databases or online resources, is to read the main tutorial.
And one more thing about the sugar syntax you can expect… You could simplify the first section of the tuttlefile
in only one line :
file://ne_10m_admin_0_countries.zip <- http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_0_countries.zip ! download
Comments