Daniel Fireman was one of 2017’s Frictionless Data Tool Fund grantees, tasked with extending the implementation of the core Frictionless Data libraries in the Go programming language. You can read more about this in his grantee profile.
In this post, Fireman will show you how to install and use the Go libraries for working with Tabular Data Packages.
Our goal in this tutorial is to load a data package from the web and read its metadata and contents.
Setup
For this tutorial, we will need the datapackage-go and tableschema-go packages, which provide all the functionality to deal with a Data Package’s metadata and its contents.
We are going to use the dep tool to manage the dependencies of our new project:
The Periodic Table Data Package
A Data Package is a simple container format used to describe and package a collection of data. It consists of two parts:
- Metadata that describes the structure and contents of the package
- Resources such as data files that form the contents of the package
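To make the two parts concrete, here is a minimal sketch that parses a descriptor with nothing but the standard library's encoding/json. The sampleDescriptor below is a hand-written, trimmed-down illustration (its field values are assumptions, not a copy of the real datapackage.json), and the descriptor and resource types are hypothetical helpers for this sketch rather than part of datapackage-go.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// resource describes one data file listed in the descriptor.
// Hypothetical helper type covering only two fields.
type resource struct {
	Name string `json:"name"`
	Path string `json:"path"`
}

// descriptor covers a small subset of the package-level metadata.
type descriptor struct {
	Name      string     `json:"name"`
	Title     string     `json:"title"`
	Resources []resource `json:"resources"`
}

// sampleDescriptor is an illustrative descriptor written for this sketch;
// its values are assumptions, not the contents of the real file.
const sampleDescriptor = `{
  "name": "periodic-table",
  "title": "Periodic Table",
  "resources": [
    {"name": "data", "path": "data.csv"}
  ]
}`

// parseDescriptor decodes the raw JSON into the typed descriptor.
func parseDescriptor(raw string) (descriptor, error) {
	var d descriptor
	err := json.Unmarshal([]byte(raw), &d)
	return d, err
}

func main() {
	d, err := parseDescriptor(sampleDescriptor)
	if err != nil {
		panic(err)
	}
	// Part one: metadata about the package itself.
	fmt.Printf("metadata: name=%s title=%s\n", d.Name, d.Title)
	// Part two: the resources that hold the actual data.
	for _, r := range d.Resources {
		fmt.Printf("resource: %s (%s)\n", r.Name, r.Path)
	}
}
```

In the rest of the tutorial the datapackage-go library takes care of this parsing (and of validation), so you would not normally write these types yourself.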
In this tutorial, we are using a Tabular Data Package containing the periodic table. The package descriptor (datapackage.json) and contents (data.csv) are stored on GitHub. This dataset includes the atomic number, symbol, element name, atomic mass, and the metallicity of the element. Here are the header and the first three rows:
atomic number | symbol | name | atomic mass | metal or nonmetal? |
---|---|---|---|---|
1 | H | Hydrogen | 1.00794 | nonmetal |
2 | He | Helium | 4.002602 | noble gas |
3 | Li | Lithium | 6.941 | alkali metal |
Inspecting Package Metadata
Let’s start off by creating main.go, which loads the data package and inspects some of its metadata.
Before running the code, you need to tell the dep tool to update our project dependencies. Don’t worry; you won’t need to do it again in this tutorial.
Now that you have loaded the periodic table Data Package, you have access to its title and name fields through the Package.Descriptor() function. To do so, let’s change our main function to (omitting error handling for the sake of brevity, but we know it is very important):
And rerun the program:
And as you can see, the printed fields match the package descriptor. For more information about the Data Package structure, please take a look at the specification.
Quick Look At the Data
Now that you have loaded your Data Package, it is time to process its contents. The package content consists of one or more resources. You can access resources via the Package.GetResource() method. Let’s print the periodic table data resource contents.
The Resource.ReadAll() method loads the whole table in memory as raw strings and returns it as a Go [][]string. This can be quite useful for taking a quick look at the data or performing a visual sanity check.
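If you have not worked with a [][]string table before, the following sketch shows the shape of such a result. It deliberately uses only the standard library's encoding/csv on an inline copy of the rows shown earlier, so it is an analogy for what ReadAll hands back and not a call into datapackage-go. The comma-separated layout of sampleCSV and the readAll helper are assumptions made for this sketch.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// sampleCSV holds the header and first three rows shown earlier.
// Assumption: the data file is plain comma-separated text.
const sampleCSV = `atomic number,symbol,name,atomic mass,metal or nonmetal?
1,H,Hydrogen,1.00794,nonmetal
2,He,Helium,4.002602,noble gas
3,Li,Lithium,6.941,alkali metal
`

// readAll returns every record, header included, as raw strings.
func readAll(data string) ([][]string, error) {
	return csv.NewReader(strings.NewReader(data)).ReadAll()
}

func main() {
	rows, err := readAll(sampleCSV)
	if err != nil {
		panic(err)
	}
	// Each row is a []string; nothing has been converted to a number yet.
	for _, row := range rows {
		fmt.Println(strings.Join(row, " | "))
	}
}
```

Note that the atomic number and atomic mass are still strings at this point, which is exactly why the next section moves on to typed values.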
Processing the Data Package’s Content
Even though the string representation can be useful for a quick sanity check, you probably want to use actual language types to process the data. Don’t worry, you won’t need to fight the casting battle yourself. Data Package Go libraries provide a rich set of methods to deal with data loading in a very idiomatic way (very similar to encoding/json).
As an example, let’s change our main function to use actual types to store the periodic table and print the elements with atomic mass smaller than 10.
In the example above, all rows in the table are loaded into memory. Then every row is parsed into an element object and appended to the slice. The resource.Cast call returns an error if the whole table cannot be successfully parsed.
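To show what this eager, all-or-nothing casting amounts to, here is a hand-rolled sketch using only the standard library. It is not how datapackage-go implements Cast; the element fields, the castRow, castAll and lighterThan helpers, and the inline sampleCSV are all assumptions for illustration. The Carbon row is added here so that the filter has something to exclude and is not copied from the dataset.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// element is a hypothetical typed view of one table row.
type element struct {
	Number int
	Symbol string
	Name   string
	Mass   float64
	Metal  string
}

// sampleCSV: the three rows shown earlier plus one illustrative heavier row.
const sampleCSV = `atomic number,symbol,name,atomic mass,metal or nonmetal?
1,H,Hydrogen,1.00794,nonmetal
2,He,Helium,4.002602,noble gas
3,Li,Lithium,6.941,alkali metal
6,C,Carbon,12.0107,nonmetal
`

// castRow converts one raw record into an element, failing on bad values.
func castRow(row []string) (element, error) {
	if len(row) != 5 {
		return element{}, fmt.Errorf("expected 5 fields, got %d", len(row))
	}
	number, err := strconv.Atoi(row[0])
	if err != nil {
		return element{}, fmt.Errorf("atomic number: %w", err)
	}
	mass, err := strconv.ParseFloat(row[3], 64)
	if err != nil {
		return element{}, fmt.Errorf("atomic mass: %w", err)
	}
	return element{Number: number, Symbol: row[1], Name: row[2], Mass: mass, Metal: row[4]}, nil
}

// castAll loads the whole table into memory; a single bad row aborts the call.
func castAll(data string) ([]element, error) {
	rows, err := csv.NewReader(strings.NewReader(data)).ReadAll()
	if err != nil {
		return nil, err
	}
	if len(rows) == 0 {
		return nil, fmt.Errorf("empty table")
	}
	elements := make([]element, 0, len(rows)-1)
	for _, row := range rows[1:] { // rows[0] is the header
		e, err := castRow(row)
		if err != nil {
			return nil, err
		}
		elements = append(elements, e)
	}
	return elements, nil
}

// lighterThan keeps the elements whose atomic mass is below limit.
func lighterThan(elements []element, limit float64) []element {
	var out []element
	for _, e := range elements {
		if e.Mass < limit {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	elements, err := castAll(sampleCSV)
	if err != nil {
		panic(err)
	}
	for _, e := range lighterThan(elements, 10) {
		fmt.Printf("%s (%s): %g\n", e.Name, e.Symbol, e.Mass)
	}
}
```

The library spares you from writing castRow by hand: it uses the schema in the package descriptor to decide how each column should be converted.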
If you don’t want to load all data in memory at once, you can lazily access each row using Resource.Iter and use Schema.CastRow to cast each row into an element object. That would change our main function to:
And our code is ready to deal with the growth of the periodic table in a very memory-efficient way :-)
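For readers curious about why the row-by-row approach saves memory, here is a rough standard-library analogy. It reads one record at a time with csv.Reader.Read and never holds more than the current row. It does not use Resource.Iter or Schema.CastRow; forEachElement, lightSymbols, castRow and the inline sampleCSV are assumptions made for this sketch.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// element is a hypothetical typed view of one table row.
type element struct {
	Number int
	Symbol string
	Mass   float64
}

// sampleCSV is an inline stand-in for the data file (assumed comma-separated).
const sampleCSV = `atomic number,symbol,name,atomic mass,metal or nonmetal?
1,H,Hydrogen,1.00794,nonmetal
2,He,Helium,4.002602,noble gas
3,Li,Lithium,6.941,alkali metal
`

// castRow converts the columns this sketch needs from one raw record.
func castRow(row []string) (element, error) {
	if len(row) != 5 {
		return element{}, fmt.Errorf("expected 5 fields, got %d", len(row))
	}
	number, err := strconv.Atoi(row[0])
	if err != nil {
		return element{}, err
	}
	mass, err := strconv.ParseFloat(row[3], 64)
	if err != nil {
		return element{}, err
	}
	return element{Number: number, Symbol: row[1], Mass: mass}, nil
}

// forEachElement streams the table, calling fn once per parsed row.
// Only the current record is kept in memory.
func forEachElement(r io.Reader, fn func(element)) error {
	reader := csv.NewReader(r)
	if _, err := reader.Read(); err != nil { // consume the header
		return err
	}
	for {
		row, err := reader.Read()
		if err == io.EOF {
			return nil // end of table
		}
		if err != nil {
			return err
		}
		e, err := castRow(row)
		if err != nil {
			return err
		}
		fn(e)
	}
}

// lightSymbols collects the symbols of elements lighter than limit.
func lightSymbols(data string, limit float64) ([]string, error) {
	var symbols []string
	err := forEachElement(strings.NewReader(data), func(e element) {
		if e.Mass < limit {
			symbols = append(symbols, e.Symbol)
		}
	})
	return symbols, err
}

func main() {
	symbols, err := lightSymbols(sampleCSV, 10)
	if err != nil {
		panic(err)
	}
	fmt.Println(strings.Join(symbols, ", "))
}
```

The trade-off compared with the eager version is that a malformed row is only discovered when the loop reaches it, so rows processed before the error have already been handled.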
We welcome your feedback and questions via our Frictionless Data Gitter chat or via GitHub issues on the datapackage-go repository.