Introducing datapak - Work with Tabular Data Packages using Ruby and ActiveRecord
Tabular data packages are a pragmatic way of both publishing your own data and consuming the data that others share with the world. The newly published datapak is a Ruby library that lets you work with tabular data packages using ActiveRecord and, thus, your SQL database of choice (by default the library uses an in-memory SQLite database).
Using datapak
Let’s try using the datapak gem in a simple example that pulls a list of S&P 500 companies from the Frictionless Data dataset registry.
require 'datapak'
Datapak.import(
's-and-p-500-companies'
)
Using Datapak.import
will:
1) download all data packages to the ./pak
folder
2) (auto-)add all tables to an in-memory SQLite database using SQL create_table
commands via ActiveRecord
migrations e.g.
create_table :constituents_financials do |t|
t.string :symbol # Symbol (string)
t.string :name # Name (string)
t.string :sector # Sector (string)
t.float :price # Price (number)
t.float :dividend_yield # Dividend Yield (number)
t.float :price_earnings # Price/Earnings (number)
t.float :earnings_share # Earnings/Share (number)
t.float :book_value # Book Value (number)
t.float :_52_week_low # 52 week low (number)
t.float :_52_week_high # 52 week high (number)
t.float :market_cap # Market Cap (number)
t.float :ebitda # EBITDA (number)
t.float :price_sales # Price/Sales (number)
t.float :price_book # Price/Book (number)
t.string :sec_filings # SEC Filings (string)
end
3) (auto-)import all records using SQL inserts e.g.
INSERT INTO constituents_financials
(symbol,
name,
sector,
price,
dividend_yield,
price_earnings,
earnings_share,
book_value,
_52_week_low,
_52_week_high,
market_cap,
ebitda,
price_sales,
price_book,
sec_filings)
VALUES
('MMM',
'3M Co',
'Industrials',
162.27,
2.11,
22.28,
7.284,
25.238,
123.61,
162.92,
104.0,
8.467,
3.28,
6.43,
'http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=MMM')
4) (auto-)add ActiveRecord
models for all tables.
Now you can use all the “magic” of ActiveRecord
to work
with the datasets. Example:
class Constituent < ActiveRecord::Base
end
pp Constituent.count
# SELECT COUNT(*) FROM "constituents"
# => 496
pp Constituent.first
# SELECT "constituents".* FROM "constituents" ORDER BY "constituents"."id" ASC LIMIT 1
# => #<Constituent:0x9f8cb78
id: 1,
symbol: "MMM",
name: "3M Co",
sector: "Industrials">
pp Constituent.find_by!( symbol: 'MMM' )
# SELECT "constituents".*
FROM "constituents"
WHERE "constituents"."symbol" = "MMM"
LIMIT 1
# => #<Constituent:0x9f8cb78
id: 1,
symbol: "MMM",
name: "3M Co",
sector: "Industrials">
pp Constituent.find_by!( name: '3M Co' )
# SELECT "constituents".*
FROM "constituents"
WHERE "constituents"."name" = "3M Co"
LIMIT 1
# => #<Constituent:0x9f8cb78
id: 1,
symbol: "MMM",
name: "3M Co",
sector: "Industrials">
pp Constituent.where( sector: 'Industrials' ).count
# SELECT COUNT(*) FROM "constituents"
WHERE "constituents"."sector" = "Industrials"
# => 63
pp Constituent.where( sector: 'Industrials' ).all
# SELECT "constituents".*
FROM "constituents"
WHERE "constituents"."sector" = "Industrials"
# => [#<Constituent:0x9f8cb78
id: 1,
symbol: "MMM",
name: "3M Co",
sector: "Industrials">,
#<Constituent:0xa2a4180
id: 8,
symbol: "ADT",
name: "ADT Corp (The)",
sector: "Industrials">,...]
How to manually download a data package
Use the Datapak::Downloader
class to download a data package
to your disk (by default data packages get stored in ./pak
).
dl = Datapak::Downloader.new
dl.fetch( 'language-codes' )
dl.fetch( 's-and-p-500-companies' )
dl.fetch( 'un-locode`)
Will result in:
-- pak
|-- language-codes
| |-- data
| | |-- language-codes-3b2.csv
| | |-- language-codes.csv
| | `-- language-codes-full.csv
| `-- datapackage.json
|-- s-and-p-500-companies
| |-- data
| | |-- constituents.csv
| | `-- constituents-financials.csv
| `-- datapackage.json
`-- un-locode
|-- data
| |-- code-list.csv
| |-- country-codes.csv
| |-- function-classifiers.csv
| |-- status-indicators.csv
| `-- subdivision-codes.csv
`-- datapackage.json
How to manually add and import a data package
Use the Datapak::Pak
class to read a data package and import it into
an SQL database.
pak = Datapak::Pak.new( './pak/un-locode/datapackage.json' )
pak.tables.each do |table|
table.up! # (auto-) add table using SQL create_table via ActiveRecord migration
table.import! # import all records using SQL inserts
end
That’s it.
Bonus: How to connect to a different SQL database
You can connect to any database supported by ActiveRecord. If you do NOT establish a connection in your script - the standard (default fallback) uses an in-memory SQLite3 database.
SQLite
For example, to create an SQLite3 database on disk, lets say datapak.db
,
use in your script (before the Datapak.import
statement):
ActiveRecord::Base.establish_connection( adapter: 'sqlite3',
database: './datapak.db' )
PostgreSQL
For example, to connect to a PostgreSQL database, use in your script
(before the Datapak.import
statement):
require 'pg' ## pull in PostgreSQL (pg) machinery
ActiveRecord::Base.establish_connection( adapter: 'postgresql'
username: 'ruby',
password: 'topsecret',
database: 'database' )
Find Out More
datapak
- home :: github.com/textkit/datapak
- gem :: rubygems.org/gems/datapak
- rdoc :: rubydoc.info/gems/datapak
Tabular Data Package
- spec :: dataprotocols.org/tabular-data-package
- datasets :: data.okfn.org/data
Comments