Processing Tabular Data Packages in Go

Daniel Fireman was one of 2017’s Frictionless Data Tool Fund grantees tasked with extending implementation of core Frictionless Data libraries in Go programming language. You can read more about this in his grantee profile.

In this post, Fireman will show you how to install and use the Go libraries for working with Tabular Data Packages.


Our goal in this tutorial is to load a data package from the web and read its metadata and contents.

Setup

For this tutorial, we will need the datapackage-go and tableschema-go packages, which provide all the functionality to deal with a Data Package’s metadata and its contents.

We are going to use the dep tool to manage the dependencies of our new project:

$ cd $GOPATH/src/newdataproj
$ dep init

The Periodic Table Data Package

A Data Package is a simple container format used to describe and package a collection of data. It consists of two parts:

  • Metadata that describes the structure and contents of the package
  • Resources such as data files that form the contents of the package

In this tutorial, we are using a Tabular Data Package containing the periodic table. The package descriptor (datapackage.json) and contents (data.csv) are stored on GitHub. This dataset includes the atomic number, symbol, element name, atomic mass, and the metallicity of the element. Here are the header and the first three rows:

atomic number symbol name atomic mass metal or nonmetal?
1 H Hydrogen 1.00794 nonmetal
2 He Helium 4.002602 noble gas
3 Li Lithium 6.941 alkali metal

Inspecting Package Metadata

Let’s start off by creating the main.go, which loads the data package and inspects some of its metadata.

package main

import (
    "fmt"

    "github.com/frictionlessdata/datapackage-go/datapackage"
)

func main() {
    pkg, err := datapackage.Load("https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json")
    if err != nil {
        panic(err)
    }
    fmt.Println("Package loaded successfully.")
}

Before running the code, you need to tell the dep tool to update our project dependencies. Don’t worry; you won’t need to do it again in this tutorial.

$ dep ensure
$ go run main.go
Package loaded successfully.

Now that you have loaded the periodic table Data Package, you have access to its title and name fields through the Package.Descriptor() function. To do so, let’s change our main function to (omitting error handling for the sake of brevity, but we know it is very important):

func main() {
    pkg, _ := datapackage.Load("https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json")
    fmt.Println("Name:", pkg.Descriptor()["name"])
    fmt.Println("Title:", pkg.Descriptor()["title"])
}

And rerun the program:

$ go run main.go
Name: period-table
Title: Periodic Table

And as you can see, the printed fields match the package descriptor. For more information about the Data Package structure, please take a look at the specification.

Quick Look At the Data

Now that you have loaded your Data Package, it is time to process its contents. The package content consists of one or more resources. You can access Resources via the Package.GetResource() method. Let’s print the periodic table data resource contents.

func main() {
    pkg, _ := datapackage.Load("https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json")
    res := pkg.GetResource("data")
    table, _ := res.ReadAll()
    for _, row := range table {
        fmt.Println(row)
    }
}
$ go run main.go
[atomic number symbol name atomic mass metal or nonmetal?]
[1 H Hydrogen 1.00794 nonmetal]
[2 He Helium 4.002602 noble gas]
[3 Li Lithium 6.941 alkali metal]
[4 Be Beryllium 9.012182 alkaline earth metal]
...

The Resource.ReadAll() method loads the whole table in memory as raw strings and returns it as a Go [][]string. This can be quick useful to take a quick look or perform a visual sanity check at the data.

Processing the Data Package’s Content

Even though the string representation can be useful for a quick sanity check, you probably want to use actual language types to process the data. Don’t worry, you won’t need to fight the casting battle yourself. Data Package Go libraries provide a rich set of methods to deal with data loading in a very idiomatic way (very similar to encoding/json).

As an example, let’s change our main function to use actual types to store the periodic table and print the elements with atomic mass smaller than 10.

package main

import (
    "fmt"

    "github.com/frictionlessdata/datapackage-go/datapackage"
    "github.com/frictionlessdata/tableschema-go/csv"
)

type element struct {
    Number int     `tableheader:"atomic number"`
    Symbol string  `tableheader:"symbol"`
    Name   string  `tableheader:"name"`
    Mass   float64 `tableheader:"atomic mass"`
    Metal  string  `tableheader:"metal or nonmetal?"`
}

func main() {
    pkg, _ := datapackage.Load("https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json")
    resource := pkg.GetResource("data")

    var elements []element
    resource.Cast(&elements, csv.LoadHeaders())
    for _, e := range elements {
        if e.Mass < 10 {
            fmt.Printf("%+v\n", e)
        }
    }
}
$ go run main.go
{Number:1 Symbol:H Name:Hydrogen Mass:1.00794 Metal:nonmetal}
{Number:2 Symbol:He Name:Helium Mass:4.002602 Metal:noble gas}
{Number:3 Symbol:Li Name:Lithium Mass:6.941 Metal:alkali metal}
{Number:4 Symbol:Be Name:Beryllium Mass:9.012182 Metal:alkaline earth metal}

In the example above, all rows in the table are loaded into memory. Then every row is parsed into an element object and appended to the slice. The resource.Cast call returns an error if the whole table cannot be successfully parsed.

If you don’t want to load all data in memory at once, you can lazily access each row using Resource.Iter and use Schema.CastRow to cast each row into an element object. That would change our main function to:

func main() {
    pkg, _ := datapackage.Load("https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json")
    resource := pkg.GetResource("data")

    iter, _ := resource.Iter(csv.LoadHeaders())
    sch, _ := resource.GetSchema()
    var e element
    for iter.Next() {
        sch.CastRow(iter.Row(), &e)
        if e.Mass < 10 {
            fmt.Printf("%+v\n", e)
        }
    }
}
$ go run main.go
{Number:1 Symbol:H Name:Hydrogen Mass:1.00794 Metal:nonmetal}
{Number:2 Symbol:He Name:Helium Mass:4.002602 Metal:noble gas}
{Number:3 Symbol:Li Name:Lithium Mass:6.941 Metal:alkali metal}
{Number:4 Symbol:Be Name:Beryllium Mass:9.012182 Metal:alkaline earth metal}

And our code is ready to deal with the growth of the periodic table in a very memory-efficient way :-)


We welcome your feedback and questions via our Frictionless Data Gitter chat or via GitHub issues on the datapackage-go repository.

Comments