Creating and Using Data Packages in R

Open Knowledge Greece was one of 2017’s Frictionless Data Tool Fund grantees tasked with extending implementation of core Frictionless Data libraries in R programming language. You can read more about this in their grantee profile.

In this post, Kleanthis Koupidis, a Data Scientist and Statistician at Open Knowledge Greece, explains how to create and use Data Packages in R.


Creating Data Packages in R

This section of the tutorial will show you how to install the R library for working with Data Packages and Table Schema, load a CSV file, infer its schema, and write a Tabular Data Package.

Setup

For this tutorial, we will need the Data Package R library (datapackage.r).

devtools library is required to install the datapackage.r library from github.

    # Install devtools package if not already
    install.packages("devtools")

And then install the development version of datapackage.r from github.

    devtools::install_github("frictionlessdata/datapackage-r")

Load

You can start using the library by loading datapackage.r.

    library(datapackage.r)

You can add useful metadata by adding keys to metadata dict attribute. Below, we are adding the required name key as well as a human-readable title key. For the keys supported, please consult the full Data Package spec. Note, we will be creating the required resources key further down below.

    dataPackage = Package.load()
    dataPackage$descriptor['name'] = 'period-table'
    dataPackage$descriptor['title'] = 'Periodic Table'
    # commit the changes to Package class
    dataPackage$commit()

    ## [1] TRUE

Infer a CSV Schema

We will use periodic-table data from remote path

atomic.number symbol name atomic.mass metal.or.nonmetal.
1 H Hydrogen 1.00794 nonmetal
2 He Helium 4.002602 noble gas
3 Li Lithium 6.941 alkali metal
4 Be Beryllium 9.012182 alkaline earth metal
5 B Boron 10.811 metalloid
6 C Carbon 12.0107 nonmetal
7 N Nitrogen 14.0067 nonmetal
8 O Oxygen 15.9994 nonmetal
9 F Fluorine 18.9984032 halogen
10 Ne Neon 20.1797 noble gas

We can guess our CSV’s schema by using infer from the Table Schema library. We pass directly the remote link to the infer function, the result of which is an inferred schema. For example, if the processor detects only integers in a given column, it will assign integer as a column type.

    filepath = 'https://raw.githubusercontent.com/okgreece/datapackage-r/master/vignettes/example_data/data.csv'

    schema = tableschema.r::infer(filepath)

Once we have a schema, we are now ready to add a resource key to the Data Package which points to the resource path and its newly created schema. Below we define resources with three ways, using json text format with usual assignment operator in R list objects and directly using addResource function of Package class:

    # define resources using json text
    resources = helpers.from.json.to.list(
      '[{
        "name": "data",
        "path": "filepath",
        "schema": "schema"
      }]'
    )
    resources[[1]]$schema = schema
    resources[[1]]$path = filepath

    # or define resources using list object
    resources = list(list(
      name = "data",
      path = filepath,
      schema = schema
      ))

And now, add resources to the Data Package:

    dataPackage$descriptor[['resources']] = resources
    dataPackage$commit()

    ## [1] TRUE

Or you can directly add resources using addResources function of Package class:

    resources = list(list(
      name = "data",
      path = filepath,
      schema = schema
      ))      

    dataPackage$addResource(resources)

Now we are ready to write our datapackage.json file to the current working directory.

    dataPackage$save('example_data')

The datapackage.json (download) is inlined below. Note that atomic number has been correctly inferred as an integer and atomic mass as a number (float) while every other column is a string.

    jsonlite::prettify(helpers.from.list.to.json(dataPackage$descriptor))

    ## {
    ##     "profile": "data-package",
    ##     "name": "period-table",
    ##     "title": "Periodic Table",
    ##     "resources": [
    ##         {
    ##             "name": "data",
    ##             "path": "https://raw.githubusercontent.com/okgreece/datapackage-r/master/vignettes/example_data/data.csv",
    ##             "schema": {
    ##                 "fields": [
    ##                     {
    ##                         "name": "atomic number",
    ##                         "type": "integer",
    ##                         "format": "default"
    ##                     },
    ##                     {
    ##                         "name": "symbol",
    ##                         "type": "string",
    ##                         "format": "default"
    ##                     },
    ##                     {
    ##                         "name": "name",
    ##                         "type": "string",
    ##                         "format": "default"
    ##                     },
    ##                     {
    ##                         "name": "atomic mass",
    ##                         "type": "number",
    ##                         "format": "default"
    ##                     },
    ##                     {
    ##                         "name": "metal or nonmetal?",
    ##                         "type": "string",
    ##                         "format": "default"
    ##                     }
    ##                 ],
    ##                 "missingValues": [
    ##                     ""
    ##                 ]
    ##             },
    ##             "profile": "data-resource",
    ##             "encoding": "utf-8"
    ##         }
    ##     ]
    ## }
    ##

Publishing

Now that you have created your Data Package, you might want to publish your data online so that you can share it with others.


Using Data Packages in R

This section of the tutorial will show you how to install the R libraries for working with Tabular Data Packages and demonstrate a very simple example of loading a Tabular Data Package from the web and pushing it directly into a local SQL database and send query to retrieve results.

Setup

For this tutorial, we will need the Data Package R library (datapackage.r). Devtools library is also required to install the datapackage.r library from github.

# Install devtools package if not already
install.packages("devtools")

And then install the development version of datapackage.r from github.

devtools::install_github("frictionlessdata/datapackage.r")

Load

You can start using the library by loading datapackage.r.

    library(datapackage.r)

Reading Basic Metadata

In this case, we are using an example Tabular Data Package containing the periodic table stored on GitHub (datapackage.json, data.csv). This dataset includes the atomic number, symbol, element name, atomic mass, and the metallicity of the element. Here are the first five rows:

    url = 'https://raw.githubusercontent.com/okgreece/datapackage-r/master/vignettes/example_data/data.csv'
    pt_data = read.csv2(url, sep = ',')
    knitr::kable(head(pt_data, 5), align = 'c')
atomic.number symbol name atomic.mass metal.or.nonmetal.
1 H Hydrogen 1.00794 nonmetal
2 He Helium 4.002602 noble gas
3 Li Lithium 6.941 alkali metal
4 Be Beryllium 9.012182 alkaline earth metal
5 B Boron 10.811 metalloid

Data Packages can be loaded either from a local path or directly from the web.

    url = 'https://raw.githubusercontent.com/okgreece/datapackage-r/master/vignettes/example_data/package.json'
    datapackage = Package.load(url)
    datapackage$resources[[1]]$descriptor$profile = 'tabular-data-resource' # tabular resource descriptor profile
    datapackage$resources[[1]]$commit() # commit changes

    ## [1] TRUE

At the most basic level, Data Packages provide a standardized format for general metadata (for example, the dataset title, source, author, and/or description) about your dataset. Now that you have loaded this Data Package, you have access to this metadata using the metadata dict attribute. Note that these fields are optional and may not be specified for all Data Packages. For more information on which fields are supported, see the full Data Package standard.

    datapackage$descriptor$title

    ## [1] "Periodic Table"

Reading Data

Now that you have loaded your Data Package, you can read its data. A Data Package can contain multiple files which are accessible via the resources attribute. The resources attribute is an array of objects containing information (e.g. path, schema, description) about each file in the package.

You can access the data in a given resource in the resources array by reading the data attribute.

    table = datapackage$resources[[1]]$table
    periodic_table_data = table$read()

You can further manipulate list objects in R by using

 [purrr](https://cran.r-project.org/package=purrr), [rlist](https://cran.r-project.org/package=rlist) packages.

Loading into an SQL database

Tabular Data Packages contains schema information about its data using Table Schema. This means you can easily import your Data Package into the SQL backend of your choice. In this case, we are creating an SQLite database.

To create a new SQLite database and load the data into SQL we will need DBI package and RSQLite package, which contains SQLite (no external software is needed).

You can install and load them by using:

    install.packages(c("DBI","RSQLite"))

    library(DBI)
    library(RSQLite)

To create a new SQLite database, you simply supply the filename to dbConnect():

    dp.database = dbConnect(RSQLite::SQLite(), "") # temporary database

We will use data.table package to convert the list object with the data to a data frame object to copy them to database table.

    # install data.table package if not already
    # install.packages("data.table")

    periodic_table_sql = data.table::rbindlist(periodic_table_data)
    periodic_table_sql = setNames(periodic_table_sql,unlist(datapackage$resources[[1]]$headers))

You can easily copy an R data frame into a SQLite database with dbWriteTable():

    dbWriteTable(dp.database, "periodic_table_sql", periodic_table_sql)
    # show remote tables accessible through this connection
    dbListTables(dp.database)

    ## [1] "periodic_table_sql"

The data are already to the database.

We can further issue queries to hte database and return first 5 elements:

    dbGetQuery(dp.database, 'SELECT * FROM periodic_table_sql LIMIT 5')

    ##   atomic number symbol      name atomic mass   metal or nonmetal?
    ## 1             1      H  Hydrogen    1.007940             nonmetal
    ## 2             2     He    Helium    4.002602            noble gas
    ## 3             3     Li   Lithium    6.941000         alkali metal
    ## 4             4     Be Beryllium    9.012182 alkaline earth metal
    ## 5             5      B     Boron   10.811000            metalloid

Or return all elements with an atomic number of less than 10:

    dbGetQuery(dp.database, 'SELECT * FROM periodic_table_sql WHERE "atomic number" < 10')

    ##   atomic number symbol      name atomic mass   metal or nonmetal?
    ## 1             1      H  Hydrogen    1.007940             nonmetal
    ## 2             2     He    Helium    4.002602            noble gas
    ## 3             3     Li   Lithium    6.941000         alkali metal
    ## 4             4     Be Beryllium    9.012182 alkaline earth metal
    ## 5             5      B     Boron   10.811000            metalloid
    ## 6             6      C    Carbon   12.010700             nonmetal
    ## 7             7      N  Nitrogen   14.006700             nonmetal
    ## 8             8      O    Oxygen   15.999400             nonmetal
    ## 9             9      F  Fluorine   18.998403              halogen

More about using databases, SQLite in R you can find in vignettes of DBI and RSQLite packages.

We welcome your feedback and questions via our Frictionless Data Gitter chat or via Github issues on the datapackage-r repository.

Comments