Open Knowledge Greece was one of 2017’s Frictionless Data Tool Fund grantees tasked with extending implementation of core Frictionless Data libraries in R programming language. You can read more about this in their grantee profile.
In this post, Kleanthis Koupidis, a Data Scientist and Statistician at Open Knowledge Greece, explains how to create and use Data Packages in R.
Creating Data Packages in R
This section of the tutorial will show you how to install the R library for working with Data Packages and Table Schema, load a CSV file, infer its schema, and write a Tabular Data Package.
Setup
For this tutorial, we will need the Data Package R library (datapackage.r).
devtools library is required to install the datapackage.r
library from github.
# Install devtools package if not already
install.packages("devtools")
And then install the development version of datapackage.r from github.
devtools::install_github("frictionlessdata/datapackage-r")
Load
You can start using the library by loading datapackage.r
.
You can add useful metadata by adding keys to metadata dict attribute. Below, we are adding the required name
key as well as a human-readable title
key. For the keys supported, please consult the full Data Package spec. Note, we will be creating the required resources
key further down below.
Infer a CSV Schema
We will use periodic-table data from remote path
atomic.number | symbol | name | atomic.mass | metal.or.nonmetal. |
---|---|---|---|---|
1 | H | Hydrogen | 1.00794 | nonmetal |
2 | He | Helium | 4.002602 | noble gas |
3 | Li | Lithium | 6.941 | alkali metal |
4 | Be | Beryllium | 9.012182 | alkaline earth metal |
5 | B | Boron | 10.811 | metalloid |
6 | C | Carbon | 12.0107 | nonmetal |
7 | N | Nitrogen | 14.0067 | nonmetal |
8 | O | Oxygen | 15.9994 | nonmetal |
9 | F | Fluorine | 18.9984032 | halogen |
10 | Ne | Neon | 20.1797 | noble gas |
We can guess our CSV’s schema by using infer
from the Table Schema library. We pass directly the remote link to the infer function, the result of which is an inferred schema. For example, if the processor detects only integers in a given column, it will assign integer
as a column type.
Once we have a schema, we are now ready to add a resource
key to the Data Package which points to the resource path and its newly created schema. Below we define resources with three ways, using json text format with usual assignment operator in R list objects and directly using addResource
function of Package
class:
And now, add resources to the Data Package:
Or you can directly add resources using addResources
function of Package
class:
Now we are ready to write our datapackage.json
file to the current working directory.
The datapackage.json
(download) is inlined below. Note that atomic number has been correctly inferred as an integer
and atomic mass as a number
(float) while every other column is a string
.
Publishing
Now that you have created your Data Package, you might want to publish your data online so that you can share it with others.
Using Data Packages in R
This section of the tutorial will show you how to install the R libraries for working with Tabular Data Packages and demonstrate a very simple example of loading a Tabular Data Package from the web and pushing it directly into a local SQL database and send query to retrieve results.
Setup
For this tutorial, we will need the Data Package R library (datapackage.r). Devtools library is also required to install the datapackage.r library from github.
# Install devtools package if not already
install.packages("devtools")
And then install the development version of datapackage.r from github.
devtools::install_github("frictionlessdata/datapackage.r")
Load
You can start using the library by loading datapackage.r
.
Reading Basic Metadata
In this case, we are using an example Tabular Data Package containing the periodic table stored on GitHub (datapackage.json, data.csv). This dataset includes the atomic number, symbol, element name, atomic mass, and the metallicity of the element. Here are the first five rows:
atomic.number | symbol | name | atomic.mass | metal.or.nonmetal. |
---|---|---|---|---|
1 | H | Hydrogen | 1.00794 | nonmetal |
2 | He | Helium | 4.002602 | noble gas |
3 | Li | Lithium | 6.941 | alkali metal |
4 | Be | Beryllium | 9.012182 | alkaline earth metal |
5 | B | Boron | 10.811 | metalloid |
Data Packages can be loaded either from a local path or directly from the web.
At the most basic level, Data Packages provide a standardized format for general metadata (for example, the dataset title, source, author, and/or description) about your dataset. Now that you have loaded this Data Package, you have access to this metadata
using the metadata dict attribute. Note that these fields are optional and may not be specified for all Data Packages. For more information on which fields are supported, see the full Data Package standard.
Reading Data
Now that you have loaded your Data Package, you can read its data. A Data Package can contain multiple files which are accessible via the resources
attribute. The resources
attribute is an array of objects containing information (e.g. path, schema, description) about each file in the package.
You can access the data in a given resource in the resources
array by reading the data
attribute.
You can further manipulate list objects in R by using
Loading into an SQL database
Tabular Data Packages contains schema information about its data using Table Schema. This means you can easily import your Data Package into the SQL backend of your choice. In this case, we are creating an SQLite database.
To create a new SQLite database and load the data into SQL we will need DBI package and RSQLite package, which contains SQLite (no external software is needed).
You can install and load them by using:
To create a new SQLite database, you simply supply the filename to dbConnect()
:
We will use data.table package to convert the list object with the data to a data frame object to copy them to database table.
You can easily copy an R data frame into a SQLite database with dbWriteTable()
:
The data are already to the database.
We can further issue queries to hte database and return first 5 elements:
Or return all elements with an atomic number of less than 10:
More about using databases, SQLite in R you can find in vignettes of DBI and RSQLite packages.
We welcome your feedback and questions via our Frictionless Data Gitter chat or via Github issues on the datapackage-r repository.
Comments