Processing Tabular Data Packages in Java

Posted on 28 April 2018 by Georges Labrèche

Georges Labrèche was one of 2017’s Frictionless Data Tool Fund grantees, tasked with extending the implementation of the core Frictionless Data libraries in the Java programming language. You can read more about this in his grantee profile.

In this post, Labrèche will show you how to install and use the Java libraries for working with Tabular Data Packages.

Our goal in this tutorial is to load tabular data from a CSV file, then infer its data types and the table’s schema.

Setup

First things first, you’ll want to grab the datapackage-java and tableschema-java libraries.

The Data

For our example, we will use a Tabular Data Package containing the periodic table. You can find the data package descriptor and the data on GitHub.

Packaging

Let’s start by fetching and packaging the data:

// fetch the data
URL url = new URL("https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json");

// package the data
Package pkg = new Package(url);

That’s it, you’re all set to start playing with the packaged data. There are parameters you can set, such as loading a schema or imposing strict validation, so be sure to go through the project’s README for more detail.

Iterating

Now that you have a Data Package instance, let’s see what the data looks like. A data package can contain more than one resource, so you have to use the Package.getResource() method to specify which resource you’d like to access.

Let’s iterate over the data:

// Get a resource named data from the data package
Resource resource = pkg.getResource("data");

// Get the Iterator
Iterator<String[]> iter = resource.iter();

// Iterate
while (iter.hasNext()) {
    String[] row = iter.next();
    String atomicNumber = row[0];
    String symbol = row[1];
    String name = row[2];
    String atomicMass = row[3];
    String metalOrNonMetal = row[4];
}

Notice how we’re fetching all values as String. This may not be what you want, particularly for the atomic number and mass. Alternatively, you can trigger data type inference and casting like this:

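// Get the Iterator.
// The third boolean is the cast flag.
Iterator<Object[]> iter = resource.iter(false, false, true);

// Iterate
while (iter.hasNext()) {
    Object[] row = iter.next();
    // The explicit casts assume the library returns Integer and Float
    // objects for these columns; adjust them to the types you actually get.
    int atomicNumber = (int) row[0];
    String symbol = (String) row[1];
    String name = (String) row[2];
    float atomicMass = (float) row[3];
    String metalOrNonMetal = (String) row[4];
}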

And that’s it, your data is now associated with the appropriate data types!
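
To make the casting step less of a black box, here is a minimal, library-independent sketch of what casting a row of strings into typed values could look like. The CastSketch class, its cast() and castRow() helpers, and the reduced set of type names are hypothetical and are not part of datapackage-java or tableschema-java; the sample row is illustrative rather than copied from the dataset.

```java
// Library-independent sketch: CastSketch and its helpers are hypothetical
// and are not part of datapackage-java or tableschema-java.
class CastSketch {

    // Convert one raw CSV value into a Java object for the given field type.
    static Object cast(String raw, String type) {
        switch (type) {
            case "integer":
                return Integer.parseInt(raw.trim());
            case "number":
                return Double.parseDouble(raw.trim());
            default:
                return raw; // treat every other type as a plain string
        }
    }

    // Cast a whole row, using one declared type per column.
    static Object[] castRow(String[] row, String[] types) {
        Object[] typed = new Object[row.length];
        for (int i = 0; i < row.length; i++) {
            typed[i] = cast(row[i], types[i]);
        }
        return typed;
    }

    public static void main(String[] args) {
        // Assumed column types for the five columns used in this post.
        String[] types = {"integer", "string", "string", "number", "string"};
        // Illustrative sample row.
        String[] raw = {"1", "H", "Hydrogen", "1.00794", "nonmetal"};

        Object[] typed = castRow(raw, types);
        int atomicNumber = (Integer) typed[0];
        double atomicMass = (Double) typed[3];
        System.out.println(atomicNumber + " " + typed[2] + " " + atomicMass);
    }
}
```

The sketch throws a NumberFormatException on a value that does not match its declared type, which is roughly the situation strict validation is meant to surface.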

Inferring the Schema

We wouldn’t have had to infer the data types if we had included a Table Schema when creating an instance of our Data Package. If a Table Schema is not available, one can be inferred and created with tableschema-java:

URL url = new URL("https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/data.csv");
Table table = new Table(url);
Schema schema = table.inferSchema();
schema.write("/path/to/write/schema.json");

The type inference algorithm tries to cast each value to the available types, and every successful cast increments a popularity score for the type in question. At the end, the type with the best score is returned.

The inference algorithm traverses all of the table’s rows and attempts to cast every single value of the table. When dealing with large tables, you might want to limit the number of rows that the inference algorithm processes:

// Only process the first 25 rows for type inference.
Schema schema = table.inferSchema(25);
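
The scoring and the row limit described above can be illustrated with a small, self-contained sketch. This is one possible reading of the approach, not tableschema-java’s actual implementation: the InferenceSketch class, the reduced list of candidate types, and the rule that only the most specific matching type earns a point are all assumptions made for the example, and the sample rows are made up.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch of score-based type inference; hypothetical code that
// does not come from tableschema-java.
class InferenceSketch {

    // Candidate types, most specific first (assumed, reduced list).
    static final String[] TYPES = {"integer", "number", "string"};

    // True if the raw value can be cast to the given type.
    static boolean canCast(String raw, String type) {
        try {
            switch (type) {
                case "integer": Integer.parseInt(raw.trim()); return true;
                case "number":  Double.parseDouble(raw.trim()); return true;
                default:        return true; // any value is a valid string
            }
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Infer the type of one column, looking at no more than rowLimit rows.
    // A negative rowLimit means "use every row".
    static String inferColumn(List<String[]> rows, int column, int rowLimit) {
        Map<String, Integer> scores = new LinkedHashMap<>();
        int seen = 0;
        for (String[] row : rows) {
            if (rowLimit >= 0 && seen >= rowLimit) break;
            seen++;
            // Credit only the first (most specific) type that accepts the value.
            for (String type : TYPES) {
                if (canCast(row[column], type)) {
                    scores.merge(type, 1, Integer::sum);
                    break;
                }
            }
        }
        // Pick the highest score; ties keep the earlier, more specific type.
        String best = "string";
        int bestScore = 0;
        for (String type : TYPES) {
            int score = scores.getOrDefault(type, 0);
            if (score > bestScore) {
                best = type;
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up rows: the first column is only numeric in the first two rows.
        List<String[]> rows = Arrays.asList(
            new String[]{"1", "H", "1.00794"},
            new String[]{"2", "He", "4.002602"},
            new String[]{"n/a", "Li", "6.941"},
            new String[]{"unknown", "Be", "9.012182"},
            new String[]{"?", "B", "10.811"});

        System.out.println(inferColumn(rows, 0, -1)); // all rows
        System.out.println(inferColumn(rows, 0, 2));  // first two rows only
        System.out.println(inferColumn(rows, 2, -1));
    }
}
```

With these rows, the first column is inferred as a string when every row is considered but as an integer when only the first two rows are sampled, which shows the trade-off of limiting the rows: inference is faster, but the sample may not be representative of the whole table.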

Be sure to go through tableschema-java’s README as well to learn more about how to work with Table Schema.

Contributing

If you discover an issue that you’d like to contribute a fix for, or if you would like to extend functionality, you can build the project and run its tests like this:

# install jabba and maven2
$ cd tableschema-java
$ jabba install 1.8
$ jabba use 1.8
$ mvn install -DskipTests=true -Dmaven.javadoc.skip=true -B -V
$ mvn test -B

Make sure that all tests pass, and submit a PR with your contributions once you’re ready.

We also welcome your feedback and questions via our Frictionless Data Gitter chat or via GitHub issues on the datapackage-java repository.
