Georges Labrèche was one of 2017’s Frictionless Data Tool Fund grantees, tasked with extending the implementation of core Frictionless Data libraries in the Java programming language. You can read more about this in his grantee profile.
In this post, Labrèche will show you how to install and use the Java libraries for working with Tabular Data Packages.
Our goal in this tutorial is to load tabular data from a CSV file and infer both the data types and the table’s schema.
Setup
First things first, you’ll want to grab the datapackage-java and tableschema-java libraries.
The Data
For our example, we will use a Tabular Data Package containing the periodic table. You can find the data package descriptor and the data on GitHub.
Packaging
Let’s start by fetching and packaging the data:
That’s it: you’re all set to start playing with the packaged data. You can also set parameters, such as loading a schema or imposing strict validation, so be sure to go through the project’s README for more detail.
Iterating
Now that you have a Data Package instance, let’s see what the data looks like. A data package can contain more than one resource, so you have to use the Package.getResource() method to specify which resource you’d like to access.
Let’s iterate over the data:
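As a library-independent illustration of what row-by-row iteration yields, here is a minimal sketch that uses only the Java standard library. It does not call datapackage-java; the class name, the column labels, and the two hand-written sample rows are assumptions made for the example, and the comma splitting is deliberately naive (no quoted fields).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CsvIterationSketch {

    // Splits CSV text into data rows, skipping the header line.
    // Naive on purpose: it does not handle quoted fields or embedded commas.
    static List<String[]> readRows(String csv) {
        List<String[]> rows = new ArrayList<>();
        String[] lines = csv.split("\\r?\\n");
        for (int i = 1; i < lines.length; i++) { // index 0 is the header
            if (lines[i].isEmpty()) {
                continue;
            }
            rows.add(lines[i].split(",", -1)); // -1 keeps trailing empty cells
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hand-written sample, not fetched from the data package.
        String csv = "atomic number,symbol,name,atomic mass\n"
                + "1,H,Hydrogen,1.00794\n"
                + "2,He,Helium,4.002602\n";

        for (String[] row : readRows(csv)) {
            // Every cell is still a String at this point.
            System.out.println(Arrays.toString(row));
        }
    }
}
```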
Notice how we’re fetching all values as String. This may not be what you want, particularly for the atomic number and mass. Alternatively, you can trigger data type inference and casting like this:
And that’s it, your data is now associated with the appropriate data types!
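To make the idea of casting concrete, the following sketch converts a row of raw strings into typed Java objects using only the standard library. It is not how datapackage-java does it internally; the class and method names are hypothetical, only integers and decimals are recognised, and the sample row is hand-written.

```java
class RowCastingSketch {

    // Casts one raw cell: integer first, then decimal, otherwise keep the string.
    static Object castValue(String raw) {
        String value = raw.trim();
        if (value.matches("[-+]?\\d{1,9}")) {
            return Integer.valueOf(value); // at most 9 digits, so it always fits in an int
        }
        if (value.matches("[-+]?\\d+\\.\\d+")) {
            return Double.valueOf(value);
        }
        return raw;
    }

    // Casts every cell of a row independently.
    static Object[] castRow(String[] row) {
        Object[] typed = new Object[row.length];
        for (int i = 0; i < row.length; i++) {
            typed[i] = castValue(row[i]);
        }
        return typed;
    }

    public static void main(String[] args) {
        // Hand-written sample row: atomic number, symbol, name, atomic mass.
        String[] row = {"1", "H", "Hydrogen", "1.00794"};
        for (Object value : castRow(row)) {
            System.out.println(value + " -> " + value.getClass().getSimpleName());
        }
    }
}
```

With typed values, the atomic number can be compared or summed as a number instead of being parsed again at every use.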
Inferring the Schema
We wouldn’t have had to infer the data types if we had included a Table Schema when creating an instance of our Data Package. If a Table Schema is not available, then it’s something that can also be inferred and created with tableschema-java:
The type inference algorithm tries to cast each value to every available type. Each successful cast increments a popularity score for that type, and at the end the type with the best score is returned.
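The scoring idea can be sketched in a few lines of plain Java. This is a simplified illustration under stated assumptions, not tableschema-java’s actual code: the class name, the four candidate types, the regular expressions, and the tie-breaking rule (the more specific type wins a tie) are all choices made for this example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class TypeInferenceSketch {

    // Candidate types, ordered from most to least specific (an assumption of this sketch).
    static final Map<String, Predicate<String>> CASTERS = new LinkedHashMap<>();
    static {
        CASTERS.put("integer", v -> v.matches("[-+]?\\d+"));
        CASTERS.put("number", v -> v.matches("[-+]?\\d+(\\.\\d+)?"));
        CASTERS.put("boolean", v -> v.equalsIgnoreCase("true") || v.equalsIgnoreCase("false"));
        CASTERS.put("string", v -> true); // anything can stay a string
    }

    // Scores every candidate type against every value and returns the most popular one.
    static String inferType(List<String> columnValues) {
        Map<String, Integer> scores = new LinkedHashMap<>();
        for (String value : columnValues) {
            for (Map.Entry<String, Predicate<String>> caster : CASTERS.entrySet()) {
                if (caster.getValue().test(value.trim())) {
                    scores.merge(caster.getKey(), 1, Integer::sum); // successful cast: +1
                }
            }
        }
        String best = "string"; // fallback for an empty column
        int bestScore = 0;
        for (String type : CASTERS.keySet()) {
            int score = scores.getOrDefault(type, 0);
            if (score > bestScore) { // strict '>' lets the more specific type win ties
                best = type;
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(inferType(Arrays.asList("1", "2", "3")));          // integer
        System.out.println(inferType(Arrays.asList("1.00794", "4.002602"))); // number
        System.out.println(inferType(Arrays.asList("H", "He", "Li")));       // string
    }
}
```

Note that a single value that fails to cast, such as "n/a" in a numeric column, is enough to make "string" the most popular type in this sketch, because every value counts as a string.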
The inference algorithm traverses all of the table’s rows and attempts to cast every single value of the table. When dealing with large tables, you might want to limit the number of rows that the inference algorithm processes:
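The trade-off behind a row limit can be shown with a small standalone sketch. It does not use tableschema-java; the class name, the method, and the three sample rows are hypothetical, and only an integer check is performed to keep the example short.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class RowLimitSketch {

    // Examines at most rowLimit rows and returns {rows examined, values in the
    // given column that parse as integers}.
    static int[] sampleIntegerColumn(Iterator<String[]> rows, int column, int rowLimit) {
        int examined = 0;
        int integerCasts = 0;
        while (examined < rowLimit && rows.hasNext()) {
            String[] row = rows.next();
            if (row[column].trim().matches("[-+]?\\d+")) {
                integerCasts++;
            }
            examined++;
        }
        return new int[] {examined, integerCasts};
    }

    public static void main(String[] args) {
        // Hand-written rows; the last one breaks the integer pattern in column 0.
        List<String[]> rows = Arrays.asList(
                new String[] {"1", "H"},
                new String[] {"2", "He"},
                new String[] {"unknown", "X"});

        int[] sampled = sampleIntegerColumn(rows.iterator(), 0, 2);
        int[] full = sampleIntegerColumn(rows.iterator(), 0, Integer.MAX_VALUE);

        System.out.println("limit 2:  " + sampled[1] + "/" + sampled[0] + " integers");
        System.out.println("no limit: " + full[1] + "/" + full[0] + " integers");
    }
}
```

A limit keeps inference fast on large tables, but values that only appear after the sampled rows go unseen, so the inferred type can be too optimistic.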
Be sure to go through tableschema-java’s README as well to learn more about how to work with Table Schema.
Contributing
If you discover an issue that you’d like to contribute a fix for, or if you would like to extend functionality, make sure that all tests pass and submit a PR with your contributions once you’re ready.
We also welcome your feedback and questions via our Frictionless Data Gitter chat or via GitHub issues on the datapackage-java repository.