Introducing datapak - Work with Tabular Data Packages using Ruby and ActiveRecord

Tabular data packages are a pragmatic way of both publishing your own data and consuming the data that others share with the world. The newly published datapak is a Ruby library that lets you work with tabular data packages using ActiveRecord and, thus, your SQL database of choice (by default the library uses an in-memory SQLite database).

Using datapak

Let’s try using the datapak gem in a simple example that pulls a list of S&P 500 companies from the Frictionless Data dataset registry.

require 'datapak'

Datapak.import(
  's-and-p-500-companies'
)

Using Datapak.import will:

1) download all data packages to the ./pak folder

2) (auto-)add all tables to an in-memory SQLite database using SQL create_table commands via ActiveRecord migrations e.g.

create_table :constituents_financials do |t|
  t.string :symbol             # Symbol         (string)
  t.string :name               # Name           (string)
  t.string :sector             # Sector         (string)
  t.float  :price              # Price          (number)
  t.float  :dividend_yield     # Dividend Yield (number)
  t.float  :price_earnings     # Price/Earnings (number)
  t.float  :earnings_share     # Earnings/Share (number)
  t.float  :book_value         # Book Value     (number)
  t.float  :_52_week_low       # 52 week low    (number)
  t.float  :_52_week_high      # 52 week high   (number)
  t.float  :market_cap         # Market Cap     (number)
  t.float  :ebitda             # EBITDA         (number)
  t.float  :price_sales        # Price/Sales    (number)
  t.float  :price_book         # Price/Book     (number)
  t.string :sec_filings        # SEC Filings    (string)
end

3) (auto-)import all records using SQL inserts e.g.

INSERT INTO constituents_financials
  (symbol,
   name,
   sector,
   price,
   dividend_yield,
   price_earnings,
   earnings_share,
   book_value,
   _52_week_low,
   _52_week_high,
   market_cap,
   ebitda,
   price_sales,
   price_book,
   sec_filings)
VALUES
  ('MMM',
   '3M Co',
   'Industrials',
   162.27,
   2.11,
   22.28,
   7.284,
   25.238,
   123.61,
   162.92,
   104.0,
   8.467,
   3.28,
   6.43,
   'http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=MMM')

4) (auto-)add ActiveRecord models for all tables.

Now you can use all the “magic” of ActiveRecord to work with the datasets. Example:

class Constituent < ActiveRecord::Base
end


pp Constituent.count

# SELECT COUNT(*) FROM "constituents"
# => 496


pp Constituent.first

# SELECT  "constituents".* FROM "constituents" ORDER BY "constituents"."id" ASC LIMIT 1
# => #<Constituent:0x9f8cb78
         id:     1,
         symbol: "MMM",
         name:   "3M Co",
         sector: "Industrials">


pp Constituent.find_by!( symbol: 'MMM' )

# SELECT  "constituents".*
         FROM "constituents"
         WHERE "constituents"."symbol" = "MMM"
         LIMIT 1
# => #<Constituent:0x9f8cb78
         id:     1,
         symbol: "MMM",
         name:   "3M Co",
         sector: "Industrials">


pp Constituent.find_by!( name: '3M Co' )

# SELECT  "constituents".*
          FROM "constituents"
          WHERE "constituents"."name" = "3M Co"
          LIMIT 1
# => #<Constituent:0x9f8cb78
         id:     1,
         symbol: "MMM",
         name:   "3M Co",
         sector: "Industrials">


pp Constituent.where( sector: 'Industrials' ).count

# SELECT COUNT(*) FROM "constituents"
         WHERE "constituents"."sector" = "Industrials"
# => 63


pp Constituent.where( sector: 'Industrials' ).all

# SELECT "constituents".*
         FROM "constituents"
         WHERE "constituents"."sector" = "Industrials"
# => [#<Constituent:0x9f8cb78
          id:     1,
          symbol: "MMM",
          name:   "3M Co",
          sector: "Industrials">,
      #<Constituent:0xa2a4180
          id:     8,
          symbol: "ADT",
          name:   "ADT Corp (The)",
          sector: "Industrials">,...]

How to manually download a data package

Use the Datapak::Downloader class to download a data package to your disk (by default data packages get stored in ./pak).

dl = Datapak::Downloader.new
dl.fetch( 'language-codes' )
dl.fetch( 's-and-p-500-companies' )
dl.fetch( 'un-locode`)

Will result in:

-- pak
   |-- language-codes
   |   |-- data
   |   |   |-- language-codes-3b2.csv
   |   |   |-- language-codes.csv
   |   |   `-- language-codes-full.csv
   |   `-- datapackage.json
   |-- s-and-p-500-companies
   |   |-- data
   |   |   |-- constituents.csv
   |   |   `-- constituents-financials.csv
   |   `-- datapackage.json
   `-- un-locode
       |-- data
       |   |-- code-list.csv
       |   |-- country-codes.csv
       |   |-- function-classifiers.csv
       |   |-- status-indicators.csv
       |   `-- subdivision-codes.csv
       `-- datapackage.json

How to manually add and import a data package

Use the Datapak::Pak class to read a data package and import it into an SQL database.

pak = Datapak::Pak.new( './pak/un-locode/datapackage.json' )
pak.tables.each do |table|
  table.up!      # (auto-) add table  using SQL create_table via ActiveRecord migration
  table.import!  # import all records using SQL inserts
end

That’s it.

Bonus: How to connect to a different SQL database

You can connect to any database supported by ActiveRecord. If you do NOT establish a connection in your script - the standard (default fallback) uses an in-memory SQLite3 database.

SQLite

For example, to create an SQLite3 database on disk, lets say datapak.db, use in your script (before the Datapak.import statement):

ActiveRecord::Base.establish_connection( adapter:  'sqlite3',
                                         database: './datapak.db' )

PostgreSQL

For example, to connect to a PostgreSQL database, use in your script (before the Datapak.import statement):

require 'pg'       ##  pull in PostgreSQL (pg) machinery

ActiveRecord::Base.establish_connection( adapter:  'postgresql'
                                         username: 'ruby',
                                         password: 'topsecret',
                                         database: 'database' )

Find Out More

datapak

Tabular Data Package

Comments