Open Knowledge Labs: The Data Wrangling Blog
2022-08-18T16:35:53+00:00
http://okfnlabs.org/
Open Knowledge Labs
Introduction to Statistics With Data Packages and Gonum
2019-08-01T00:00:00+00:00
http://okfnlabs.org/blog/2019/08/01/intro-statistics-datapackage-gonum
<p>After 6 years at Google, Daniel Fireman is currently a Ph.D. student, professor and activist for government transparency and accountability in the Northeast of Brazil. He was one of the 2017 Frictionless Data Tool Fund grantees and implemented the core Frictionless Data specifications in the <a href="https://golang.org/">Go</a> programming language: <a href="https://github.com/frictionlessdata/datapackage-go">datapackage</a> and <a href="https://github.com/frictionlessdata/tableschema-go">tableschema</a>, which he still maintains. You can read more about this in <a href="https://frictionlessdata.io/articles/daniel-fireman/">his grantee profile</a>.</p>
<p>Since their first release in 2017, we’ve been improving the <a href="https://github.com/frictionlessdata/datapackage-go">datapackage</a> and <a href="https://github.com/frictionlessdata/tableschema-go">tableschema</a> packages. Besides fixing bugs, we’ve tried to make it easier to use data packages together with statistical and plotting libraries like <a href="https://gonum.org/">Gonum</a>. This post shows an example of such usage and was inspired by <a href="https://sbinet.github.io/posts/2017-10-04-intro-to-stats-with-gonum/">this post</a> by <a href="https://github.com/sbinet">Sebastian Binet</a>.</p>
<hr />
<p>Our goal in this tutorial is to load a data package from the web and use <a href="https://gonum.org/">Gonum</a> to calculate some basic statistics.</p>
<h2 id="go-data-packages--gonum">Go, Data Packages & Gonum</h2>
<p><a href="https://github.com/frictionlessdata/datapackage-go/tree/master/datapackage">datapackage</a> is <em>“a package for working with <a href="http://specs.frictionlessdata.io/data-package/">Data Packages</a>“</em>. A Data Package consists of:</p>
<ul>
<li>Metadata that describes the structure and contents of the package</li>
<li>Resources such as data files that form the contents of the package</li>
</ul>
<p><a href="https://gonum.org/">Gonum</a> is <em>“a set of packages designed to make writing numeric and scientific algorithms productive, performant and scalable.”</em></p>
<p>Before being able to use <code class="language-plaintext highlighter-rouge">datapackage</code> and <code class="language-plaintext highlighter-rouge">Gonum</code>, we need to install <a href="https://golang.org/">Go</a>. We can download and install the <code class="language-plaintext highlighter-rouge">Go</code> toolchain for a variety of platforms and operating systems from <a href="https://golang.org/dl">golang.org/dl</a>. This post assumes the installation of version 1.11 or newer.</p>
<p>After installing Go, the toolchain will download <code class="language-plaintext highlighter-rouge">Gonum</code>, <code class="language-plaintext highlighter-rouge">datapackage</code> and all of their dependencies automatically the first time we run the program.</p>
<h2 id="reading-datapackage">Reading Datapackage</h2>
<p>In this post, we are using a <a href="https://frictionlessdata.io/specs/tabular-data-package/">Tabular Data Package</a> containing the periodic table. The package descriptor (<a href="https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json">datapackage.json</a>) and contents (<a href="https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/data.csv">data.csv</a>) are stored on <a href="http://github.com/">GitHub</a>. This dataset includes the atomic number, symbol, element name, atomic mass, and the metallicity of the element. Let’s start by taking a quick look at the header and the first rows.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// file: stats.go</span>
<span class="k">package</span> <span class="n">main</span>
<span class="k">import</span> <span class="p">(</span>
<span class="s">"fmt"</span>
<span class="s">"github.com/frictionlessdata/datapackage-go/datapackage"</span>
<span class="p">)</span>
<span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">pkg</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">datapackage</span><span class="o">.</span><span class="n">Load</span><span class="p">(</span><span class="s">"https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json"</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">res</span> <span class="o">:=</span> <span class="n">pkg</span><span class="o">.</span><span class="n">GetResource</span><span class="p">(</span><span class="s">"data"</span><span class="p">)</span>
<span class="n">table</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">res</span><span class="o">.</span><span class="n">ReadAll</span><span class="p">()</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="k">for</span> <span class="n">i</span> <span class="o">:=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="m">4</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span> <span class="p">{</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="n">table</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<h2 id="gonum-and-statistics">Gonum and statistics</h2>
<p>Gonum provides many statistical functions. Let’s use it to calculate the mean, median, standard deviation and variance of the atomic masses.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// file: stats.go</span>
<span class="k">package</span> <span class="n">main</span>
<span class="k">import</span> <span class="p">(</span>
<span class="s">"fmt"</span>
<span class="s">"math"</span>
<span class="s">"sort"</span>
<span class="s">"github.com/frictionlessdata/datapackage-go/datapackage"</span>
<span class="s">"github.com/frictionlessdata/tableschema-go/csv"</span>
<span class="s">"gonum.org/v1/gonum/stat"</span>
<span class="p">)</span>
<span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">pkg</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">datapackage</span><span class="o">.</span><span class="n">Load</span><span class="p">(</span><span class="s">"https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json"</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="k">var</span> <span class="n">masses</span> <span class="p">[]</span><span class="kt">float64</span>
<span class="n">res</span> <span class="o">:=</span> <span class="n">pkg</span><span class="o">.</span><span class="n">GetResource</span><span class="p">(</span><span class="s">"data"</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">res</span><span class="o">.</span><span class="n">CastColumn</span><span class="p">(</span><span class="s">"atomic mass"</span><span class="p">,</span> <span class="o">&</span><span class="n">masses</span><span class="p">,</span> <span class="n">csv</span><span class="o">.</span><span class="n">LoadHeaders</span><span class="p">());</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Printf</span><span class="p">(</span><span class="s">"data: %v</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">masses</span><span class="p">)</span>
<span class="n">sort</span><span class="o">.</span><span class="n">Float64s</span><span class="p">(</span><span class="n">masses</span><span class="p">)</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Printf</span><span class="p">(</span><span class="s">"data: %v (sorted)</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">masses</span><span class="p">)</span>
<span class="c">// computes the weighted mean of the dataset.</span>
<span class="c">// we don't have any weights (ie, all weights are 1)</span>
<span class="c">// so we just pass a nil slice.</span>
<span class="n">mean</span> <span class="o">:=</span> <span class="n">stat</span><span class="o">.</span><span class="n">Mean</span><span class="p">(</span><span class="n">masses</span><span class="p">,</span> <span class="no">nil</span><span class="p">)</span>
<span class="c">// computes the median of the dataset.</span>
<span class="c">// here as well, we pass a nil slice as weights.</span>
<span class="n">median</span> <span class="o">:=</span> <span class="n">stat</span><span class="o">.</span><span class="n">Quantile</span><span class="p">(</span><span class="m">0.5</span><span class="p">,</span> <span class="n">stat</span><span class="o">.</span><span class="n">Empirical</span><span class="p">,</span> <span class="n">masses</span><span class="p">,</span> <span class="no">nil</span><span class="p">)</span>
<span class="n">variance</span> <span class="o">:=</span> <span class="n">stat</span><span class="o">.</span><span class="n">Variance</span><span class="p">(</span><span class="n">masses</span><span class="p">,</span> <span class="no">nil</span><span class="p">)</span>
<span class="n">stddev</span> <span class="o">:=</span> <span class="n">math</span><span class="o">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="n">variance</span><span class="p">)</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Printf</span><span class="p">(</span><span class="s">"mean= %v</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">mean</span><span class="p">)</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Printf</span><span class="p">(</span><span class="s">"median= %v</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">median</span><span class="p">)</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Printf</span><span class="p">(</span><span class="s">"variance= %v</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">variance</span><span class="p">)</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Printf</span><span class="p">(</span><span class="s">"std-dev= %v</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">stddev</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The program above performs some basic statistical operations on our dataset:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$></span> go run stats.go
... dependency download logs ...
data: <span class="o">[</span>1.00794 4.002602 6.941 9.012182 10.811 12.0107 14.0067 15.9994 18.9984032 20.1797 22.98976928 24.305 26.9815386 28.0855 30.973762 32.065 35.453 39.948 39.0983 40.078 44.955912 47.867 50.9415 51.9961 54.938045 55.845 58.933195 58.6934 63.546 65.38 69.723 72.64 74.9216 78.96 79.904 83.798 85.4678 87.62 88.90585 91.224 92.90638 95.96 98 101.07 102.9055 106.42 107.8682 112.411 114.818 118.71 121.76 127.6 126.90447 131.293 132.9054519 137.327 138.90547 140.116 140.90765 144.242 145 150.36 151.964 157.25 158.92535 162.5 164.93032 167.259 168.93421 173.054 174.9668 178.49 180.94788 183.84 186.207 190.23 192.217 195.084 196.966569 200.59 204.3833 207.2 208.9804 209 210 222 223 226 227 232.03806 231.03588 238.02891 237 244 243 247 247 251 252 257 258 259 262 267 268 271 272 270 276 281 280 285 284 289 288 293 294 294]
data: <span class="o">[</span>1.00794 4.002602 6.941 9.012182 10.811 12.0107 14.0067 15.9994 18.9984032 20.1797 22.98976928 24.305 26.9815386 28.0855 30.973762 32.065 35.453 39.0983 39.948 40.078 44.955912 47.867 50.9415 51.9961 54.938045 55.845 58.6934 58.933195 63.546 65.38 69.723 72.64 74.9216 78.96 79.904 83.798 85.4678 87.62 88.90585 91.224 92.90638 95.96 98 101.07 102.9055 106.42 107.8682 112.411 114.818 118.71 121.76 126.90447 127.6 131.293 132.9054519 137.327 138.90547 140.116 140.90765 144.242 145 150.36 151.964 157.25 158.92535 162.5 164.93032 167.259 168.93421 173.054 174.9668 178.49 180.94788 183.84 186.207 190.23 192.217 195.084 196.966569 200.59 204.3833 207.2 208.9804 209 210 222 223 226 227 231.03588 232.03806 237 238.02891 243 244 247 247 251 252 257 258 259 262 267 268 270 271 272 276 280 281 284 285 288 289 293 294 294] <span class="o">(</span>sorted<span class="o">)</span>
<span class="nv">mean</span><span class="o">=</span> 146.43746355915252
<span class="nv">median</span><span class="o">=</span> 140.90765
<span class="nv">variance</span><span class="o">=</span> 8026.634755570227
std-dev<span class="o">=</span> 89.59148818704948
</code></pre></div></div>
<p>Thanks for reading!</p>
<p>We welcome your feedback and questions via our <a href="http://gitter.im/frictionlessdata/chat">Frictionless Data Gitter chat</a> or via <a href="https://github.com/frictionlessdata/datapackage-go/issues">GitHub issues</a> on the datapackage-go repository.</p>
Daniel Fireman
Announcing datapackage-pipelines version 2.0
2018-10-18T00:00:00+00:00
http://okfnlabs.org/blog/2018/10/18/announcing-datapackage-pipelines-v2
<p>Today we’re releasing a major new version of <a href="https://github.com/frictionlessdata/datapackage-pipelines">datapackage-pipelines</a>: version 2.0.0.</p>
<p>This new version marks a big step forward in realizing the Data Factory concept and framework. We integrated <em>datapackage-pipelines</em> with its younger sister <em><a href="https://github.com/datahq/dataflows">dataflows</a></em>, and created a set of common building blocks you can now use interchangeably between the two frameworks.</p>
<p><img src="/img/posts/dataflows-and-dpp.png" alt="diagram showing the relationship between dataflows and datapackage-pipelines" />
<br />
<em>figure 1: diagram showing the relationship between dataflows and datapackage-pipelines</em></p>
<p>It’s now possible to bootstrap and develop flows using <em>dataflows</em>, and then run these flows as-is on a <em>datapackage-pipelines</em> server - or effortlessly convert them to the declarative yaml syntax.</p>
<p>Install datapackage-pipelines using <code class="language-plaintext highlighter-rouge">pip</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip install datapackage-pipelines
</code></pre></div></div>
<h2 id="what-changed">What Changed?</h2>
<h3 id="new-low-level-api-and-stdout-redirect">New Low-level API and stdout Redirect</h3>
<p>One big change (and a long-standing request) is that processors are now allowed to print from inside their processing code without interfering with the correct operation of the pipeline. All prints are automatically converted to logging.info(…) calls. This behaviour is enabled when using the new low-level API. The main change we’ve introduced is that ingest() is now a context manager. This means that you should now run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># New style for ingest and spew
with ingest() as ctx:
    # Do stuff with datapackage and resource_iterator
    spew(ctx.datapackage,
         ctx.resource_iterator,
         ctx.stats)
</code></pre></div></div>
<p>Backward compatibility is maintained for the old way of using ingest(), so you don’t have to update all your code immediately.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># This still works, but won’t handle print()s
parameters, datapackage, resource_iterator = ingest()
spew(datapackage, resource_iterator)
</code></pre></div></div>
<h3 id="dataflows-integration">Dataflows integration</h3>
<p>There’s a new integration with dataflows which allows running Flows directly from the <code class="language-plaintext highlighter-rouge">pipeline-spec.yaml</code> file.
You can integrate dataflows within pipeline specs using the <code class="language-plaintext highlighter-rouge">flow</code> attribute instead of <code class="language-plaintext highlighter-rouge">run</code>. For example, given the following flow file, saved under <code class="language-plaintext highlighter-rouge">my-flow.py</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataflows import Flow, dump_to_path, load, update_package

def flow(parameters, datapackage, resources, stats):
    stats['multiplied_fields'] = 0

    def multiply(field, n):
        def step(row):
            row[field] = row[field] * n
            stats['multiplied_fields'] += 1
        return step

    return Flow(update_package(name='my-datapackage'),
                load((datapackage, resources)),
                multiply('my-field', 2))
</code></pre></div></div>
<p>And a <code class="language-plaintext highlighter-rouge">pipeline-spec.yaml</code> in the same directory:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>my-flow:
  pipeline:
    - run: load_resource
      parameters:
        url: http://example.com/my-datapackage/datapackage.json
        resource: my-resource
    - flow: my-flow
    - run: dump.to_path
</code></pre></div></div>
<p>You can run the pipeline using <code class="language-plaintext highlighter-rouge">dpp run my-flow</code>.</p>
<p>If you want to wrap a flow inside a processor, you can use the <code class="language-plaintext highlighter-rouge">spew_flow</code> helper function:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataflows import Flow
from datapackage_pipelines.wrapper import ingest
from datapackage_pipelines.utilities.flow_utils import spew_flow

def flow(parameters):
    return Flow(
        # Flow processing comes here
    )

if __name__ == '__main__':
    with ingest() as ctx:
        spew_flow(flow(ctx.parameters), ctx)
</code></pre></div></div>
<h3 id="standard-processor-refactoring">Standard Processor Refactoring</h3>
<p>We refactored all standard processors to use their counterparts from dataflows, thus removing code duplication and allowing us to move forward more quickly. As a result, we’re also introducing a couple of <strong>new</strong> processors:</p>
<ul>
<li>
<p><code class="language-plaintext highlighter-rouge">load</code> - Loads and streams a new resource (or resources) into the data package. It’s based on the dataflows processor with the same name, so it supports loading from local files, remote URL, data packages, locations in environment variables etc. For more information, consult the <a href="https://github.com/datahq/dataflows/blob/master/PROCESSORS.md#load">dataflows documentation</a>.</p>
</li>
<li>
<p><code class="language-plaintext highlighter-rouge">printer</code> - Smart printing processor for displaying the contents of the stream - comes in handy for development or monitoring a pipeline.It will not print all rows, but an logarithmically sparse sample - in other words, it will print rows 1-20, 100-110, 1000-1010 etc. It also prints the last 10 rows of the dataset.</p>
</li>
</ul>
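<p>Since these new processors mirror their <em>dataflows</em> counterparts, the quickest way to get a feel for them is in a plain flow. Here is a minimal sketch (the file name is illustrative) that loads a CSV and prints a sample of the stream:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataflows import Flow, load, printer

Flow(
    load('my-data.csv'),  # local path or remote URL
    printer()             # prints a logarithmically sparse sample of the rows
).process()
</code></pre></div></div>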
<h2 id="deprecations">Deprecations</h2>
<p>We are <strong>deprecating</strong> a few processors — you can still use them as usual but they will be removed in the next major version (3.0):</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">add_metadata</code> - was renamed to <code class="language-plaintext highlighter-rouge">update_package</code> for consistency</li>
<li><code class="language-plaintext highlighter-rouge">add_resource</code> and <code class="language-plaintext highlighter-rouge">stream_remote_resources</code> - are being replaced by the <code class="language-plaintext highlighter-rouge">load</code></li>
<li><code class="language-plaintext highlighter-rouge">dump.to_path</code>, <code class="language-plaintext highlighter-rouge">dump.to_zip</code>, <code class="language-plaintext highlighter-rouge">dump.to_sql</code> - are being deprecated - you should use <code class="language-plaintext highlighter-rouge">dump_to_path</code>, <code class="language-plaintext highlighter-rouge">dump_to_zip</code> and <code class="language-plaintext highlighter-rouge">dump_to_sql</code> instead.
Note that <code class="language-plaintext highlighter-rouge">dump_to_path</code> and <code class="language-plaintext highlighter-rouge">dump_to_zip</code> lack some features that exist in the current processors — for example, custom file formatters and non-tabular file support. We might introduce some of that functionality into the new processors as well in the next versions - <em>in the meantime, please let us know what you think about these features and how badly you need them</em>.</li>
</ul>
<h2 id="the-road-ahead">The Road Ahead</h2>
<p>In the next versions we’re planning to further the integration of dataflows and datapackage-pipelines. We’re going to work on streamlining development and deployment as well as taking care of naming and documentation to harmonize all aspects of the dataflows ecosystem.
We’re also working on decomposing datapackage-pipelines into smaller, self-contained components. In this version we took apart the standard processor code and some supporting libraries (e.g. <code class="language-plaintext highlighter-rouge">kvstore</code>) and delegated them to external libraries.</p>
<h2 id="links-and-references">Links and References</h2>
<ul>
<li>Read more on datapackage-pipelines here: <a href="https://github.com/frictionlessdata/datapackage-pipelines">https://github.com/frictionlessdata/datapackage-pipelines</a></li>
<li>Read more on dataflows here: <a href="https://github.com/datahq/dataflows">https://github.com/datahq/dataflows</a></li>
<li>Read more on Data Factory here: <a href="/blog/2018/08/29/data-factory-data-flows-introduction.html">http://okfnlabs.org/blog/2018/08/29/data-factory-data-flows-introduction.html</a></li>
</ul>
<h2 id="contributors">Contributors</h2>
<p>Thanks to <a href="https://github.com/OriHoch">Ori Hoch</a> for contributing code and other invaluable assistance with this release.</p>
Adam Kariv
Data Factory & DataFlows - Tutorial
2018-08-30T00:00:00+00:00
http://okfnlabs.org/blog/2018/08/30/data-factory-data-flows-tutorial
<p><em>Data Factory is an open framework for building and running lightweight data processing workflows quickly and easily. We recommend reading <a href="/blog/2018/08/29/data-factory-data-flows-introduction.html">this introductory blogpost</a> to gain a better understanding of underlying Data Factory concepts before diving into the tutorial below.</em></p>
<hr />
<h2 id="learn-how-to-write-your-own-processing-flows">Learn how to write your own processing flows</h2>
<p>Let’s start with the traditional ‘hello, world’ example:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">dataflows</span> <span class="kn">import</span> <span class="n">Flow</span>
<span class="n">data</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">{</span><span class="s">'data'</span><span class="p">:</span> <span class="s">'Hello'</span><span class="p">},</span>
<span class="p">{</span><span class="s">'data'</span><span class="p">:</span> <span class="s">'World'</span><span class="p">}</span>
<span class="p">]</span>
<span class="k">def</span> <span class="nf">lowerData</span><span class="p">(</span><span class="n">row</span><span class="p">):</span>
<span class="n">row</span><span class="p">[</span><span class="s">'data'</span><span class="p">]</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s">'data'</span><span class="p">].</span><span class="n">lower</span><span class="p">()</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">Flow</span><span class="p">(</span>
<span class="n">data</span><span class="p">,</span>
<span class="n">lowerData</span>
<span class="p">)</span>
<span class="n">data</span><span class="p">,</span> <span class="o">*</span><span class="n">_</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">results</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="c1"># -->
# [
# [
# {'data': 'hello'},
# {'data': 'world'}
# ]
# ]
</span></code></pre></div></div>
<p>This very simple flow takes a list of <code class="language-plaintext highlighter-rouge">dict</code>s and applies a row processing function on each one of them.</p>
<p>We can load data from a file instead:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">dataflows</span> <span class="kn">import</span> <span class="n">Flow</span><span class="p">,</span> <span class="n">load</span>
<span class="c1"># beatles.csv:
# name,instrument
# john,guitar
# paul,bass
# george,guitar
# ringo,drums
</span>
<span class="k">def</span> <span class="nf">titleName</span><span class="p">(</span><span class="n">row</span><span class="p">):</span>
<span class="n">row</span><span class="p">[</span><span class="s">'name'</span><span class="p">]</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s">'name'</span><span class="p">].</span><span class="n">title</span><span class="p">()</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">Flow</span><span class="p">(</span>
<span class="n">load</span><span class="p">(</span><span class="s">'beatles.csv'</span><span class="p">),</span>
<span class="n">titleName</span>
<span class="p">)</span>
<span class="n">data</span><span class="p">,</span> <span class="o">*</span><span class="n">_</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">results</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="c1"># -->
# [
# [
# {'name': 'John', 'instrument': 'guitar'},
# {'name': 'Paul', 'instrument': 'bass'},
# {'name': 'George', 'instrument': 'guitar'},
# {'name': 'Ringo', 'instrument': 'drums'}
# ]
# ]
</span></code></pre></div></div>
<p>The source file can be a CSV file, an Excel file or a JSON file. You can use a local file name or a URL for a file hosted somewhere on the web.</p>
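<p>For example, pointing <code class="language-plaintext highlighter-rouge">load</code> at a remote copy of the same file is just a matter of swapping the file name for a URL (the URL below is a hypothetical placeholder):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataflows import Flow, load

f = Flow(
    load('https://example.com/beatles.csv'),  # hypothetical remote URL
)
data, *_ = f.results()
print(data)
</code></pre></div></div>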
<p>Data sources can be generators and not just lists or files. Let’s take as an example a very simple scraper:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">dataflows</span> <span class="kn">import</span> <span class="n">Flow</span>
<span class="kn">from</span> <span class="nn">xml.etree</span> <span class="kn">import</span> <span class="n">ElementTree</span>
<span class="kn">from</span> <span class="nn">urllib.request</span> <span class="kn">import</span> <span class="n">urlopen</span>
<span class="c1"># Get from Wikipedia the population count for each country
</span><span class="k">def</span> <span class="nf">country_population</span><span class="p">():</span>
<span class="c1"># Read the Wikipedia page and parse it using etree
</span> <span class="n">page</span> <span class="o">=</span> <span class="n">urlopen</span><span class="p">(</span><span class="s">'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'</span><span class="p">).</span><span class="n">read</span><span class="p">()</span>
<span class="n">tree</span> <span class="o">=</span> <span class="n">ElementTree</span><span class="p">.</span><span class="n">fromstring</span><span class="p">(</span><span class="n">page</span><span class="p">)</span>
<span class="c1"># Iterate on all tables, rows and cells
</span> <span class="k">for</span> <span class="n">table</span> <span class="ow">in</span> <span class="n">tree</span><span class="p">.</span><span class="n">findall</span><span class="p">(</span><span class="s">'.//table'</span><span class="p">):</span>
<span class="k">if</span> <span class="s">'wikitable'</span> <span class="ow">in</span> <span class="n">table</span><span class="p">.</span><span class="n">attrib</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'class'</span><span class="p">,</span> <span class="s">''</span><span class="p">):</span>
<span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">table</span><span class="p">.</span><span class="n">findall</span><span class="p">(</span><span class="s">'tr'</span><span class="p">):</span>
<span class="n">cells</span> <span class="o">=</span> <span class="n">row</span><span class="p">.</span><span class="n">findall</span><span class="p">(</span><span class="s">'td'</span><span class="p">)</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">cells</span><span class="p">)</span> <span class="o">></span> <span class="mi">3</span><span class="p">:</span>
<span class="c1"># If a matching row is found...
</span> <span class="n">name</span> <span class="o">=</span> <span class="n">cells</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">find</span><span class="p">(</span><span class="s">'.//a'</span><span class="p">).</span><span class="n">attrib</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'title'</span><span class="p">)</span>
<span class="n">population</span> <span class="o">=</span> <span class="n">cells</span><span class="p">[</span><span class="mi">2</span><span class="p">].</span><span class="n">text</span>
<span class="c1"># ... yield a row with the information
</span> <span class="k">yield</span> <span class="nb">dict</span><span class="p">(</span>
<span class="n">name</span><span class="o">=</span><span class="n">name</span><span class="p">,</span>
<span class="n">population</span><span class="o">=</span><span class="n">population</span>
<span class="p">)</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">Flow</span><span class="p">(</span>
<span class="n">country_population</span><span class="p">(),</span>
<span class="p">)</span>
<span class="n">data</span><span class="p">,</span> <span class="o">*</span><span class="n">_</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">results</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="c1"># --->
# [
# [
# {'name': 'China', 'population': '1,391,090,000'},
# {'name': 'India', 'population': '1,332,140,000'},
# {'name': 'United States', 'population': '327,187,000'},
# {'name': 'Indonesia', 'population': '261,890,900'},
# ...
# ]
# ]
</span></code></pre></div></div>
<p>This is nice, but we do prefer the numbers to be actual numbers and not strings.</p>
<p>In order to do that, let’s simply define their type to be numeric:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">dataflows</span> <span class="kn">import</span> <span class="n">Flow</span><span class="p">,</span> <span class="n">set_type</span>
<span class="k">def</span> <span class="nf">country_population</span><span class="p">():</span>
<span class="c1"># same as before
</span> <span class="p">...</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">Flow</span><span class="p">(</span>
<span class="n">country_population</span><span class="p">(),</span>
<span class="n">set_type</span><span class="p">(</span><span class="s">'population'</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="s">'number'</span><span class="p">,</span> <span class="n">groupChar</span><span class="o">=</span><span class="s">','</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">data</span><span class="p">,</span> <span class="o">*</span><span class="n">_</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">results</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="c1"># -->
# [
# [
# {'name': 'China', 'population': Decimal('1391090000')},
# {'name': 'India', 'population': Decimal('1332140000')},
# {'name': 'United States', 'population': Decimal('327187000')},
# {'name': 'Indonesia', 'population': Decimal('261890900')},
# ...
# ]
# ]
</span>
</code></pre></div></div>
<p>Data is automatically converted to the correct native Python type.</p>
<p>Apart from data-types, it’s also possible to set other constraints on the data. If the data fails validation (or does not fit the assigned data-type) an exception will be thrown - making this method highly effective for validating data and ensuring data quality.</p>
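<p>Here is a minimal sketch of that behaviour, using a deliberately invalid value (the data is made up):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataflows import Flow, set_type

bad_data = [
    {'name': 'Atlantis', 'population': 'not-a-number'},  # cannot be cast to a number
]

try:
    Flow(
        bad_data,
        set_type('population', type='number'),
    ).results()
except Exception as e:
    print('row failed validation:', e)
</code></pre></div></div>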
<p>What about large data files? In the above examples, the results are loaded into memory, which is not always preferable or acceptable. In many cases, we’d like to store the results directly onto a hard drive - without letting the machine’s RAM limit the amount of data we can process.</p>
<p>We do it by using <em>dump</em> processors:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">dataflows</span> <span class="kn">import</span> <span class="n">Flow</span><span class="p">,</span> <span class="n">set_type</span><span class="p">,</span> <span class="n">dump_to_path</span>
<span class="k">def</span> <span class="nf">country_population</span><span class="p">():</span>
<span class="c1"># same as before
</span> <span class="p">...</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">Flow</span><span class="p">(</span>
<span class="n">country_population</span><span class="p">(),</span>
<span class="n">set_type</span><span class="p">(</span><span class="s">'population'</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="s">'number'</span><span class="p">,</span> <span class="n">groupChar</span><span class="o">=</span><span class="s">','</span><span class="p">),</span>
<span class="n">dump_to_path</span><span class="p">(</span><span class="s">'country_population'</span><span class="p">)</span>
<span class="p">)</span>
<span class="o">*</span><span class="n">_</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">process</span><span class="p">()</span>
</code></pre></div></div>
<p>Running this code will create a local directory called <code class="language-plaintext highlighter-rouge">country_population</code>, containing two files:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>├── country_population
│ ├── datapackage.json
│ └── res_1.csv
</code></pre></div></div>
<p>The CSV file - <code class="language-plaintext highlighter-rouge">res_1.csv</code> - is where the data is stored. The <code class="language-plaintext highlighter-rouge">datapackage.json</code> file is a metadata file, holding information about the data, including its schema.</p>
<p>We can now open the CSV file with any spreadsheet program or code library supporting the CSV format - or using one of the <strong>data package</strong> libraries out there, like so:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">datapackage</span> <span class="kn">import</span> <span class="n">Package</span>
<span class="n">pkg</span> <span class="o">=</span> <span class="n">Package</span><span class="p">(</span><span class="s">'country_population/res_1.csv'</span><span class="p">)</span>
<span class="n">it</span> <span class="o">=</span> <span class="n">pkg</span><span class="p">.</span><span class="n">resources</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="nb">iter</span><span class="p">(</span><span class="n">keyed</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="nb">next</span><span class="p">(</span><span class="n">it</span><span class="p">))</span>
<span class="c1"># prints:
# {'name': 'China', 'population': Decimal('1391110000')}
</span></code></pre></div></div>
<p>Note how using the data package meta-data, data-types are restored and there’s no need to ‘re-parse’ the data. This works with other types too, such as dates, booleans and even <code class="language-plaintext highlighter-rouge">list</code>s and <code class="language-plaintext highlighter-rouge">dict</code>s.</p>
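<p>As a quick sketch of that round-trip with a date field (the directory name here is illustrative):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataflows import Flow, set_type, dump_to_path
from datapackage import Package

Flow(
    [{'when': '2018-08-30'}],       # dates start out as plain strings
    set_type('when', type='date'),  # cast and record the type in the schema
    dump_to_path('dates_demo')
).process()

row = next(Package('dates_demo/datapackage.json').resources[0].iter(keyed=True))
print(row)  # {'when': datetime.date(2018, 8, 30)} - restored from the metadata
</code></pre></div></div>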
<p>So far we’ve seen how to load data, process it row by row, and then inspect the results or store them in a data package.</p>
<p>Let’s see how we can do more complex processing by manipulating the entire data stream:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">dataflows</span> <span class="kn">import</span> <span class="n">Flow</span><span class="p">,</span> <span class="n">set_type</span><span class="p">,</span> <span class="n">dump_to_path</span>
<span class="c1"># Generate all triplets (a,b,c) so that 1 <= a <= b < c <= 20
</span><span class="k">def</span> <span class="nf">all_triplets</span><span class="p">():</span>
<span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">20</span><span class="p">):</span>
<span class="k">for</span> <span class="n">b</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="mi">20</span><span class="p">):</span>
<span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">b</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="mi">21</span><span class="p">):</span>
<span class="k">yield</span> <span class="nb">dict</span><span class="p">(</span><span class="n">a</span><span class="o">=</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="o">=</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="o">=</span><span class="n">c</span><span class="p">)</span>
<span class="c1"># Yield row only if a^2 + b^2 == c^1
</span><span class="k">def</span> <span class="nf">filter_pythagorean_triplets</span><span class="p">(</span><span class="n">rows</span><span class="p">):</span>
<span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">rows</span><span class="p">:</span>
<span class="k">if</span> <span class="n">row</span><span class="p">[</span><span class="s">'a'</span><span class="p">]</span><span class="o">**</span><span class="mi">2</span> <span class="o">+</span> <span class="n">row</span><span class="p">[</span><span class="s">'b'</span><span class="p">]</span><span class="o">**</span><span class="mi">2</span> <span class="o">==</span> <span class="n">row</span><span class="p">[</span><span class="s">'c'</span><span class="p">]</span><span class="o">**</span><span class="mi">2</span><span class="p">:</span>
<span class="k">yield</span> <span class="n">row</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">Flow</span><span class="p">(</span>
<span class="n">all_triplets</span><span class="p">(),</span>
<span class="n">set_type</span><span class="p">(</span><span class="s">'a'</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="s">'integer'</span><span class="p">),</span>
<span class="n">set_type</span><span class="p">(</span><span class="s">'b'</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="s">'integer'</span><span class="p">),</span>
<span class="n">set_type</span><span class="p">(</span><span class="s">'c'</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="s">'integer'</span><span class="p">),</span>
<span class="n">filter_pythagorean_triplets</span><span class="p">,</span>
<span class="n">dump_to_path</span><span class="p">(</span><span class="s">'pythagorean_triplets'</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">process</span><span class="p">()</span>
<span class="c1"># -->
# pythagorean_triplets/res_1.csv contains:
# a,b,c
# 3,4,5
# 5,12,13
# 6,8,10
# 8,15,17
# 9,12,15
# 12,16,20
</span></code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">filter_pythagorean_triplets</code> function takes an iterator of rows, and yields only the ones that pass its condition.</p>
<p>The flow framework knows whether a function is meant to handle a single row or a row iterator based on its parameters:</p>
<ul>
<li>if it accepts a single <code class="language-plaintext highlighter-rouge">row</code> parameter, then it’s a row processor.</li>
<li>if it accepts a single <code class="language-plaintext highlighter-rouge">rows</code> parameter, then it’s a rows processor.</li>
<li>if it accepts a single <code class="language-plaintext highlighter-rouge">package</code> parameter, then it’s a package processor.</li>
</ul>
<p>Let’s see a few examples of what we can do with package processors.</p>
<p>First, let’s add a field to the data:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">dataflows</span> <span class="kn">import</span> <span class="n">Flow</span><span class="p">,</span> <span class="n">load</span><span class="p">,</span> <span class="n">dump_to_path</span>
<span class="k">def</span> <span class="nf">add_is_guitarist_column_to_schema</span><span class="p">(</span><span class="n">package</span><span class="p">):</span>
<span class="c1"># Add a new field to the first resource
</span> <span class="n">package</span><span class="p">.</span><span class="n">pkg</span><span class="p">.</span><span class="n">resources</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="p">.</span><span class="n">descriptor</span><span class="p">[</span><span class="s">'schema'</span><span class="p">][</span><span class="s">'fields'</span><span class="p">]</span>
<span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="nb">dict</span><span class="p">(</span>
<span class="n">name</span><span class="o">=</span><span class="s">'is_guitarist'</span><span class="p">,</span>
<span class="nb">type</span><span class="o">=</span><span class="s">'boolean'</span>
<span class="p">))</span>
<span class="c1"># Must yield the modified datapackage
</span> <span class="k">yield</span> <span class="n">package</span><span class="p">.</span><span class="n">pkg</span>
<span class="c1"># And its resources
</span> <span class="k">yield</span> <span class="k">from</span> <span class="n">package</span>
<span class="k">def</span> <span class="nf">add_is_guitarist_column</span><span class="p">(</span><span class="n">row</span><span class="p">):</span>
<span class="n">row</span><span class="p">[</span><span class="s">'is_guitarist'</span><span class="p">]</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s">'instrument'</span><span class="p">]</span> <span class="o">==</span> <span class="s">'guitar'</span>
<span class="k">return</span> <span class="n">row</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">Flow</span><span class="p">(</span>
<span class="c1"># Same one as above
</span> <span class="n">load</span><span class="p">(</span><span class="s">'beatles.csv'</span><span class="p">),</span>
<span class="n">add_is_guitarist_column_to_schema</span><span class="p">,</span>
<span class="n">add_is_guitarist_column</span><span class="p">,</span>
<span class="n">dump_to_path</span><span class="p">(</span><span class="s">'beatles_guitarists'</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">process</span><span class="p">()</span>
</code></pre></div></div>
<p>In this example we create two steps - one for adding the new field (<code class="language-plaintext highlighter-rouge">is_guitarist</code>) to the schema and another step to modify the actual data.</p>
<p>We can combine the two into one step:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">dataflows</span> <span class="kn">import</span> <span class="n">Flow</span><span class="p">,</span> <span class="n">load</span><span class="p">,</span> <span class="n">dump_to_path</span>
<span class="k">def</span> <span class="nf">add_is_guitarist_column</span><span class="p">(</span><span class="n">package</span><span class="p">):</span>
<span class="c1"># Add a new field to the first resource
</span> <span class="n">package</span><span class="p">.</span><span class="n">pkg</span><span class="p">.</span><span class="n">resources</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">descriptor</span><span class="p">[</span><span class="s">'schema'</span><span class="p">][</span><span class="s">'fields'</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="nb">dict</span><span class="p">(</span>
<span class="n">name</span><span class="o">=</span><span class="s">'is_guitarist'</span><span class="p">,</span>
<span class="nb">type</span><span class="o">=</span><span class="s">'boolean'</span>
<span class="p">))</span>
<span class="c1"># Must yield the modified datapackage
</span> <span class="k">yield</span> <span class="n">package</span><span class="p">.</span><span class="n">pkg</span>
<span class="c1"># Now iterate on all resources
</span> <span class="n">resources</span> <span class="o">=</span> <span class="nb">iter</span><span class="p">(</span><span class="n">package</span><span class="p">)</span>
<span class="c1"># Take the first resource
</span> <span class="n">beatles</span> <span class="o">=</span> <span class="nb">next</span><span class="p">(</span><span class="n">resources</span><span class="p">)</span>
<span class="c1"># And yield it with with the modification
</span> <span class="k">def</span> <span class="nf">f</span><span class="p">(</span><span class="n">row</span><span class="p">):</span>
<span class="n">row</span><span class="p">[</span><span class="s">'is_guitarist'</span><span class="p">]</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s">'instrument'</span><span class="p">]</span> <span class="o">==</span> <span class="s">'guitar'</span>
<span class="k">return</span> <span class="n">row</span>
<span class="k">yield</span> <span class="nb">map</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">beatles</span><span class="p">)</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">Flow</span><span class="p">(</span>
<span class="c1"># Same one as above
</span> <span class="n">load</span><span class="p">(</span><span class="s">'beatles.csv'</span><span class="p">),</span>
<span class="n">add_is_guitarist_column</span><span class="p">,</span>
<span class="n">dump_to_path</span><span class="p">(</span><span class="s">'beatles_guitarists'</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">process</span><span class="p">()</span>
</code></pre></div></div>
<p>The contract for the <code class="language-plaintext highlighter-rouge">package</code> processing function is simple:</p>
<p>First modify <code class="language-plaintext highlighter-rouge">package.pkg</code> (which is a <code class="language-plaintext highlighter-rouge">Package</code> instance) and yield it.</p>
<p>Then, yield any resources that should exist on the output, with or without modifications.</p>
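<p>Put schematically, every package processor can follow this minimal skeleton (the descriptor tweak is purely illustrative):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def my_package_processor(package):
    # 1. Modify the metadata first...
    package.pkg.descriptor['name'] = 'renamed-package'  # illustrative change
    # 2. ...and yield the modified datapackage
    yield package.pkg
    # 3. Then yield every resource that should appear in the output
    yield from package
</code></pre></div></div>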
<p>In the next example we remove an entire resource inside a package processor. This one filters the list of Academy Award nominees down to those who won both an Oscar and an Emmy award:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kn">from</span> <span class="nn">dataflows</span> <span class="kn">import</span> <span class="n">Flow</span><span class="p">,</span> <span class="n">load</span><span class="p">,</span> <span class="n">dump_to_path</span>
<span class="k">def</span> <span class="nf">find_double_winners</span><span class="p">(</span><span class="n">package</span><span class="p">):</span>
<span class="c1"># Remove the emmies resource -
</span> <span class="c1"># we're going to consume it now
</span> <span class="n">package</span><span class="p">.</span><span class="n">pkg</span><span class="p">.</span><span class="n">remove_resource</span><span class="p">(</span><span class="s">'emmies'</span><span class="p">)</span>
<span class="c1"># Must yield the modified datapackage
</span> <span class="k">yield</span> <span class="n">package</span><span class="p">.</span><span class="n">pkg</span>
<span class="c1"># Now iterate on all resources
</span> <span class="n">resources</span> <span class="o">=</span> <span class="nb">iter</span><span class="p">(</span><span class="n">package</span><span class="p">)</span>
<span class="c1"># Emmies is the first -
</span> <span class="c1"># read all its data and create a set of winner names
</span> <span class="n">emmy</span> <span class="o">=</span> <span class="nb">next</span><span class="p">(</span><span class="n">resources</span><span class="p">)</span>
<span class="n">emmy_winners</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span>
<span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s">'nominee'</span><span class="p">],</span>
<span class="nb">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s">'winner'</span><span class="p">],</span>
<span class="n">emmy</span><span class="p">))</span>
<span class="p">)</span>
<span class="c1"># Oscars are next -
</span> <span class="c1"># filter rows based on the emmy winner set
</span> <span class="n">academy</span> <span class="o">=</span> <span class="nb">next</span><span class="p">(</span><span class="n">resources</span><span class="p">)</span>
<span class="k">yield</span> <span class="nb">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">row</span><span class="p">:</span> <span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="s">'Winner'</span><span class="p">]</span> <span class="ow">and</span>
<span class="n">row</span><span class="p">[</span><span class="s">'Name'</span><span class="p">]</span> <span class="ow">in</span> <span class="n">emmy_winners</span><span class="p">),</span>
<span class="n">academy</span><span class="p">)</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">Flow</span><span class="p">(</span>
<span class="c1"># Emmy award nominees and winners
</span> <span class="n">load</span><span class="p">(</span><span class="s">'emmy.csv'</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'emmies'</span><span class="p">),</span>
<span class="c1"># Academy award nominees and winners
</span> <span class="n">load</span><span class="p">(</span><span class="s">'academy.csv'</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s">'utf8'</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'oscars'</span><span class="p">),</span>
<span class="n">find_double_winners</span><span class="p">,</span>
<span class="n">dump_to_path</span><span class="p">(</span><span class="s">'double_winners'</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">process</span><span class="p">()</span>
<span class="c1"># -->
# double_winners/academy.csv contains:
# 1931/1932,5,Actress,1,Helen Hayes,The Sin of Madelon Claudet
# 1932/1933,6,Actress,1,Katharine Hepburn,Morning Glory
# 1935,8,Actress,1,Bette Davis,Dangerous
# 1938,11,Actress,1,Bette Davis,Jezebel
# ...
</span></code></pre></div></div>
<h2 id="builtin-processors">Builtin Processors</h2>
<p>DataFlows comes with a few built-in processors which do most of the heavy lifting in many common scenarios, leaving you to implement only the minimum code that is specific to your own problem.</p>
<p>A complete list, which also includes an API reference for each one of them, can be found in the <a href="https://github.com/datahq/dataflows/blob/master/PROCESSORS.md">DataFlows Built-in Processors</a> page.</p>
Adam Kariv
Data Factory & DataFlows - An Introduction
2018-08-29T00:00:00+00:00
http://okfnlabs.org/blog/2018/08/29/data-factory-data-flows-introduction
<p>Today I’d like to introduce a new library we’ve been working on - <code class="language-plaintext highlighter-rouge">dataflows</code>. DataFlows is a part of a larger conceptual framework for data processing.</p>
<p>We call it ‘<strong>Data Factory</strong>’ - an open framework for building and running lightweight data processing workflows quickly and easily. LAMP for data wrangling!</p>
<p>Most of you already know what <em><a href="http://frictionlessdata.io/data-packages/">Data Packages</a></em> are. In short, it is a portable format for packaging different resources (tabular or otherwise) in a standard way that takes care of most interoperability problems (e.g. <em>“what’s the character encoding of the file?”</em> or <em>“what is the data type for this column?”</em> or <em>“which date format are they using?”</em>). It also provides rich and flexible metadata, which users can then use to understand what the data is about (take a look at <a href="http://frictionlessdata.io/">frictionlessdata.io</a> to learn more!).</p>
<p><em>Data Factory</em> complements the <em>Data Package</em> concepts by adding dynamics to the mix.</p>
<p>While Data Packages are a great solution for describing data sets, these data sets are always <em>static</em> - located in one place. <em>Data Factory</em> is all about transforming Data Packages - modifying their data or meta-data and transmitting them from one location to another.</p>
<p><em>Data Factory</em> defines standard interfaces for building <em>processors</em> - software modules for mutating a Data Package - and protocols for streaming the contents of a Data Package for efficient processing.</p>
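<p>Conceptually, a processor can be as simple as a function that consumes a stream of rows and yields a transformed stream, and a chain of such functions forms a Data Flow. Here is a minimal, framework-free Python sketch of the idea (the field name is invented for illustration):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A processor: consumes a stream of rows, yields a (modified) stream of rows.
def uppercase_names(rows):
    for row in rows:
        row['Name'] = row['Name'].upper()
        yield row

# Chaining such processors over a stream is, in essence, a Data Flow.
stream = [{'Name': 'George'}, {'Name': 'John'}]
for row in uppercase_names(stream):
    print(row)
</code></pre></div></div>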
<h2 id="philosophy-and-goals">Philosophy and Goals</h2>
<p><em>Data Factory</em> is more pattern/convention than library.</p>
<p>An analogy is with web frameworks, which are less a single library than a core pattern plus a set of ready-to-use components. Python frameworks such as Pylons and Flask were built around WSGI, for example, and ExpressJS plays the same role for Node.</p>
<p>In this sense these frameworks favour convention over configuration: they try to decrease the number of decisions that a developer using the framework is required to make, without necessarily losing flexibility.</p>
<p>Data Factory takes the same approach, aiming to reduce the number of decisions a data developer has to make without sacrificing flexibility.</p>
<p>By following a standard scheme, developers are able to use a large and growing library of existing, reusable processors. This also increases readability and maintainability of data processing code.</p>
<p><strong>Our focus is on:</strong></p>
<ul>
<li>Small- to medium-sized data (KBs to GBs)</li>
<li>Desktop wrangling - people who start on their desktop</li>
<li>Easy transition from desktop to “cloud”</li>
<li>Heterogeneous data sources</li>
<li>Process using basic building blocks that are extensible</li>
<li>Less technical audience</li>
<li>Limited resources - limit on memory, CPU, etc.</li>
</ul>
<p>What are we <strong>not</strong>?</p>
<ul>
<li>Big data processing and machine learning: if you want to wrangle TBs of data in a distributed setup, or to train a machine learning model with GBs of data, you probably don’t want this.</li>
<li>Processing real-time event data.</li>
<li>Technical know-how <strong><em>is</em></strong> needed: we aren’t a fancy ETL UI – you probably need a bit of technical sophistication</li>
</ul>
<h2 id="architecture">Architecture</h2>
<p>This new framework is built on the foundations of the Frictionless Data project - both conceptually and technically. That project provided us with the definition of <em>Data Packages</em> and the software to read and write them.</p>
<blockquote>
<p>On top of this Frictionless Data basis, we’re introducing a few new concepts:</p>
<ul>
<li>the <strong>Data Stream</strong> - essentially a Data Package in transit;</li>
<li>the <strong>Data Processor</strong> - a module that manipulates a Data Package, receiving one Data Stream as its input and producing a new Data Stream as its output;</li>
<li>the <strong>Data Flow</strong> - a chain of Data Processors.</li>
</ul>
</blockquote>
<p>We will be providing a library of such processors: some for loading data from various sources, some for storing data in different locations, services or databases, and some for doing common manipulation and transformation on the processed data.</p>
<p>On top of all that we’re building a few integrated services:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">dataflows-server</code> (formerly known as <code class="language-plaintext highlighter-rouge">datapackage-pipelines</code>) - a server-side multi-processor runner for Data Flows.</li>
<li><code class="language-plaintext highlighter-rouge">dataflows-cli</code> - a client library for building and running Data Flows locally.</li>
<li><code class="language-plaintext highlighter-rouge">dataflows-blueprints</code> - ready-made flow generators for common scenarios (e.g. ‘I want to regularly pull all my analytics from these X services and dump them in a database’).</li>
<li>and more to come.</li>
</ul>
<p><img src="/img/posts/data-factory.png" alt="Data Factory" /></p>
<h2 id="on-data-wrangling">On Data Wrangling</h2>
<p>In our experience, data processing starts simple - downloading and inspecting a CSV, deleting a column or a row. We wanted something that was as fast as the command line to get started but would also provide a solid basis as your pipeline grows. We also wanted something that provided some standardization and conventions over completely bespoke code.</p>
<p>With integration in mind, DataFlows comes with very little environmental requirements, and can be embedded in your existing data processing setup.</p>
<p>In short, DataFlows provides a simple, quick and easy-to-setup, and extensible way to build lightweight data processing pipelines.</p>
<h2 id="introducing-dataflows">Introducing dataflows</h2>
<p>The first piece of software we’re introducing today is <code class="language-plaintext highlighter-rouge">dataflows</code> and its standard library of processors.</p>
<p><code class="language-plaintext highlighter-rouge">dataflows</code> introduces the concept of a <code class="language-plaintext highlighter-rouge">Flow</code> - a chain of data processors, reading, transforming and modifying a stream of data and writing it to any location (or loading it to memory for further analysis).</p>
<p><code class="language-plaintext highlighter-rouge">dataflows</code> also comes with a rich set of built-in data processors, ready to do most of the heavy-lifting you’ll need to reduce boilerplate code and increase your productivity.</p>
<h3 id="a-demo-is-worth-a-thousand-words">A demo is worth a thousand words</h3>
<p>Most data processing starts simple: getting a file and having a look.</p>
<p>With <code class="language-plaintext highlighter-rouge">dataflows</code> you can do this in a few seconds <em>and</em> you’ll have a solid basis for whatever you want to do next.</p>
<p><strong><em>Bootstrapping a data processing script</em></strong></p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>pip <span class="nb">install </span>dataflows
<span class="nv">$ </span>dataflows init https://rawgit.com/datahq/demo/_/first.csv
Writing processing code into first_csv.py
Running first_csv.py
first:
<span class="c"># Name Composed DOB</span>
<span class="o">(</span>string<span class="o">)</span> <span class="o">(</span>string<span class="o">)</span> <span class="o">(</span><span class="nb">date</span><span class="o">)</span>
<span class="nt">---</span> <span class="nt">----------</span> <span class="nt">----------</span> <span class="nt">----------</span>
1 George 22 1943-02-25
2 John 90 1940-10-09
3 Richard 2 1940-07-07
4 Paul 88 1942-06-18
5 Brian n/a 1934-09-19
Done!
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">dataflows init</code> actually does 3 things:</p>
<ul>
<li>Analyzes the source file</li>
<li>Creates a processing script for reading it</li>
<li>Runs that script for you</li>
</ul>
<p>In our case, a script named <code class="language-plaintext highlighter-rouge">first_csv.py</code> was created - here’s what it contains:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># ...
</span>
<span class="k">def</span> <span class="nf">first_csv</span><span class="p">():</span>
<span class="n">flow</span> <span class="o">=</span> <span class="n">Flow</span><span class="p">(</span>
<span class="c1"># Load inputs
</span> <span class="n">load</span><span class="p">(</span><span class="s">'https://rawgit.com/datahq/demo/_/first.csv'</span><span class="p">,</span>
<span class="nb">format</span><span class="o">=</span><span class="s">'csv'</span><span class="p">,</span> <span class="p">),</span>
<span class="c1"># Process them (if necessary)
</span> <span class="c1"># Save the results
</span> <span class="n">add_metadata</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">'first_csv'</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'first.csv'</span><span class="p">),</span>
<span class="n">printer</span><span class="p">(),</span>
<span class="p">)</span>
<span class="n">flow</span><span class="p">.</span><span class="n">process</span><span class="p">()</span>
<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">'__main__'</span><span class="p">:</span>
<span class="n">first_csv</span><span class="p">()</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">flow</code> variable contains the chain of processing steps (i.e. the processors). In this simple flow, <code class="language-plaintext highlighter-rouge">load</code> loads the source data, <code class="language-plaintext highlighter-rouge">add_metadata</code> modifies the file’s metadata and <code class="language-plaintext highlighter-rouge">printer</code> outputs the contents to the standard output.</p>
<p>You can run this script again at any time, and it will re-run the processing flow:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>python first_csv.py
first:
<span class="c"># Name Composed DOB</span>
<span class="o">(</span>string<span class="o">)</span> <span class="o">(</span>string<span class="o">)</span> <span class="o">(</span><span class="nb">date</span><span class="o">)</span>
<span class="nt">---</span> <span class="nt">----------</span> <span class="nt">----------</span> <span class="nt">----------</span>
1 George 22 1943-02-25
...
</code></pre></div></div>
<p>This is all very nice, but now it’s time for some real data wrangling. By editing the processing script it’s possible to add more functionality to the flow - <code class="language-plaintext highlighter-rouge">dataflows</code> provides a simple, solid basis for building up your pipeline quickly, reliably and repeatedly.</p>
<p><strong><em>Fixing some bad data</em></strong></p>
<p>Let’s start by getting rid of that annoying <code class="language-plaintext highlighter-rouge">n/a</code> in the last line of the data.</p>
<p>We edit <code class="language-plaintext highlighter-rouge">first_csv.py</code> and add two more steps to the flow:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">removeNa</span><span class="p">(</span><span class="n">row</span><span class="p">):</span>
<span class="n">row</span><span class="p">[</span><span class="s">'Composed'</span><span class="p">]</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s">'Composed'</span><span class="p">].</span><span class="n">replace</span><span class="p">(</span><span class="s">'n/a'</span><span class="p">,</span> <span class="s">''</span><span class="p">)</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">Flow</span><span class="p">(</span>
<span class="n">load</span><span class="p">(</span><span class="s">'https://rawgit.com/datahq/demo/_/first.csv'</span><span class="p">),</span>
<span class="c1"># added here custom processing:
</span> <span class="n">removeNa</span><span class="p">,</span>
<span class="c1"># now parse column as Integer:
</span> <span class="n">set_type</span><span class="p">(</span><span class="s">'Composed'</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="s">'integer'</span><span class="p">),</span>
<span class="n">printer</span><span class="p">()</span>
<span class="p">)</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">removeNa</code> is a simple function which modifies each row it sees, replacing <code class="language-plaintext highlighter-rouge">n/a</code>s with the empty string. After it, we call <code class="language-plaintext highlighter-rouge">set_type</code>, which declares that the <code class="language-plaintext highlighter-rouge">Composed</code> column should be an integer - and verifies that it is indeed an integer while processing the data.</p>
<p><strong><em>Writing the cleaned data</em></strong></p>
<p>Finally, let’s write the output to a file using the <code class="language-plaintext highlighter-rouge">dump_to_path</code> processor:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">removeNa</span><span class="p">(</span><span class="n">row</span><span class="p">):</span>
<span class="n">row</span><span class="p">[</span><span class="s">'Composed'</span><span class="p">]</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s">'Composed'</span><span class="p">].</span><span class="n">replace</span><span class="p">(</span><span class="s">'n/a'</span><span class="p">,</span> <span class="s">''</span><span class="p">)</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">Flow</span><span class="p">(</span>
<span class="n">load</span><span class="p">(</span><span class="s">'https://rawgit.com/datahq/demo/_/first.csv'</span><span class="p">),</span>
<span class="n">add_metadata</span><span class="p">(</span>
<span class="n">name</span><span class="o">=</span><span class="s">'beatles_infoz'</span><span class="p">,</span>
<span class="n">title</span><span class="o">=</span><span class="s">'Beatle Member Information'</span><span class="p">,</span>
<span class="p">),</span>
<span class="n">removeNa</span><span class="p">,</span>
<span class="n">set_type</span><span class="p">(</span><span class="s">'Composed'</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="s">'integer'</span><span class="p">),</span>
<span class="n">dump_to_path</span><span class="p">(</span><span class="s">'first_csv/'</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div></div>
<p>Now, we re-run our modified processing script…</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>python first_csv.py
...
</code></pre></div></div>
<p>we get a valid Data Package which we can use…</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>tree
├── first_csv
│ ├── datapackage.json
│ └── first.csv
</code></pre></div></div>
<p>which contains a normalized and cleaned-up CSV file…</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">head </span>first_csv/first.csv
Name,Composed,DOB
George,22,1943-02-25
John,90,1940-10-09
Richard,2,1940-07-07
Paul,88,1942-06-18
Brian,,1934-09-19
</code></pre></div></div>
<p>as well as <code class="language-plaintext highlighter-rouge">datapackage.json</code>, a JSON file containing the package’s metadata…</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat </span>first_csv/datapackage.json <span class="c"># Edited for brevity</span>
<span class="o">{</span>
<span class="s2">"count_of_rows"</span>: 5,
<span class="s2">"name"</span>: <span class="s2">"beatles_infoz"</span>,
<span class="s2">"title"</span>: <span class="s2">"Beatle Member Information"</span>,
<span class="s2">"resources"</span>: <span class="o">[</span>
<span class="o">{</span>
<span class="s2">"name"</span>: <span class="s2">"first"</span>,
<span class="s2">"path"</span>: <span class="s2">"first.csv"</span>,
<span class="s2">"schema"</span>: <span class="o">{</span>
<span class="s2">"fields"</span>: <span class="o">[</span>
<span class="o">{</span><span class="s2">"name"</span>: <span class="s2">"Name"</span>, <span class="s2">"type"</span>: <span class="s2">"string"</span><span class="o">}</span>,
<span class="o">{</span><span class="s2">"name"</span>: <span class="s2">"Composed"</span>, <span class="s2">"type"</span>: <span class="s2">"integer"</span><span class="o">}</span>,
<span class="o">{</span><span class="s2">"name"</span>: <span class="s2">"DOB"</span>, <span class="s2">"type"</span>: <span class="s2">"date"</span><span class="o">}</span>
<span class="o">]</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="o">]</span>
<span class="o">}</span>
</code></pre></div></div>
<p>and is very simple to use in Python (or JS, Ruby, PHP and many other programming languages):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">$</span> <span class="n">python</span>
<span class="o">>>></span> <span class="kn">from</span> <span class="nn">datapackage</span> <span class="kn">import</span> <span class="n">Package</span>
<span class="o">>>></span> <span class="n">p</span> <span class="o">=</span> <span class="n">Package</span><span class="p">(</span><span class="s">'first_csv/datapackage.json'</span><span class="p">)</span>
<span class="o">>>></span> <span class="nb">list</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">resources</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="nb">iter</span><span class="p">())</span>
<span class="p">[[</span><span class="s">'George'</span><span class="p">,</span> <span class="mi">22</span><span class="p">,</span> <span class="n">datetime</span><span class="p">.</span><span class="n">date</span><span class="p">(</span><span class="mi">1943</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">25</span><span class="p">)],</span>
<span class="p">[</span><span class="s">'John'</span><span class="p">,</span> <span class="mi">90</span><span class="p">,</span> <span class="n">datetime</span><span class="p">.</span><span class="n">date</span><span class="p">(</span><span class="mi">1940</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">9</span><span class="p">)],</span>
<span class="p">[</span><span class="s">'Richard'</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">datetime</span><span class="p">.</span><span class="n">date</span><span class="p">(</span><span class="mi">1940</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">7</span><span class="p">)],</span>
<span class="p">[</span><span class="s">'Paul'</span><span class="p">,</span> <span class="mi">88</span><span class="p">,</span> <span class="n">datetime</span><span class="p">.</span><span class="n">date</span><span class="p">(</span><span class="mi">1942</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">18</span><span class="p">)],</span>
<span class="p">[</span><span class="s">'Brian'</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="n">datetime</span><span class="p">.</span><span class="n">date</span><span class="p">(</span><span class="mi">1934</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">19</span><span class="p">)]]</span>
<span class="o">>>></span>
</code></pre></div></div>
<h2 id="more-">More…</h2>
<p>Lots, lots more - there is a whole suite of built-in processors, and you can quickly add your own with a few lines of Python code.</p>
<p>Dig in at the project’s <a href="https://github.com/datahq/dataflows">GitHub Page</a> or continue reading the in-depth tutorial <a href="/blog/2018/08/30/data-factory-data-flows-tutorial.html">here</a>.</p>
Adam Kariv
Processing Tabular Data Packages in Clojure
2018-05-07T00:00:00+00:00
http://okfnlabs.org/blog/2018/05/07/datapackages-in-clojure
<p>Matt Thompson was one of 2017’s <a href="https://toolfund.frictionlessdata.io">Frictionless Data Tool Fund</a> grantees, tasked with extending the implementation of the core Frictionless Data <a href="https://github.com/frictionlessdata/datapackage-clj">data package</a> and <a href="https://github.com/frictionlessdata/tableschema-clj">table schema</a> libraries in the Clojure programming language. You can read more about this in <a href="/articles/matt-thompson/">his grantee profile</a>. In this post, Thompson will show you how to set up and use the <a href="http://clojure.org">Clojure</a> libraries for working with <a href="https://frictionlessdata.io/specs/tabular-data-package/">Tabular Data Packages</a>.</p>
<hr />
<p>This tutorial uses a worked example of downloading a data package from a remote location on the web, and using the Frictionless Data tools to read its contents and metadata into Clojure data structures.</p>
<h2 id="setup">Setup</h2>
<p>First, we need to set up the project structure using the <a href="http://leiningen.org">Leiningen</a> tool. If you don’t have Leiningen set up on your system, follow the link to download and install it. Once it is set up, run the following command from the command line to create the folders and files for a basic Clojure project:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>lein new periodic-table</code></pre></figure>
<p>This will create the <em>periodic-table</em> folder. Inside the <em>periodic-table/src/periodic_table</em> folder (Leiningen converts hyphens in namespace names to underscores on disk) should be a file named <em>core.clj</em>. This is the file you need to edit during this tutorial.</p>
<h2 id="the-data">The Data</h2>
<p>For this tutorial, we will use a pre-created data package, the Periodic Table Data Package hosted by the Frictionless Data project. A <a href="https://frictionlessdata.io/specs/data-package/">Data Package</a> is a simple container format used to describe and package a collection of data. It consists of two parts:</p>
<ul>
<li>Metadata that describes the structure and contents of the package</li>
<li>Resources such as data files that form the contents of the package</li>
</ul>
<p>Our Clojure code will download the data package and process it using the metadata information contained in the
package. The data package can be found <a href="https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json">here on GitHub</a>.</p>
<p>The data package contains data about elements in the periodic table, including each element’s atomic number, symbol, name, atomic mass, and whether it is a metal or nonmetal. The table below shows a sample taken from the first three rows of the CSV file:</p>
<table class="table table-striped table-bordered" style="display: block;">
<thead>
<tr>
<th>atomic number</th>
<th>symbol</th>
<th>name</th>
<th>atomic mass</th>
<th>metal or nonmetal?</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>H</td>
<td>Hydrogen</td>
<td>1.00794</td>
<td>nonmetal</td>
</tr>
<tr>
<td>2</td>
<td>He</td>
<td>Helium</td>
<td>4.002602</td>
<td>noble gas</td>
</tr>
<tr>
<td>3</td>
<td>Li</td>
<td>Lithium</td>
<td>6.941</td>
<td>alkali metal</td>
</tr>
</tbody>
</table>
<h2 id="loading-the-data-package">Loading the Data Package</h2>
<p>The first step is to load the data package into a Clojure data structure (a map). To do this, we require the data package library in our code (giving it the alias <strong>dp</strong>) and then use its <strong>load</strong> function to load the data package into our project. Enter the following code into the core.clj file:</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="nf">ns</span><span class="w"> </span><span class="n">periodic-table.core</span><span class="w">
</span><span class="p">(</span><span class="no">:require</span><span class="w"> </span><span class="p">[</span><span class="n">frictionlessdata.datapackage</span><span class="w"> </span><span class="no">:as</span><span class="w"> </span><span class="n">dp</span><span class="p">]</span><span class="w">
</span><span class="p">[</span><span class="n">frictionlessdata.tableschema</span><span class="w"> </span><span class="no">:as</span><span class="w"> </span><span class="n">ts</span><span class="p">]</span><span class="w">
</span><span class="p">[</span><span class="n">clojure.spec.alpha</span><span class="w"> </span><span class="no">:as</span><span class="w"> </span><span class="n">s</span><span class="p">]))</span><span class="w">
</span><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">pkg</span><span class="w">
</span><span class="p">(</span><span class="nf">dp/load</span><span class="w"> </span><span class="s">"https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json"</span><span class="p">))</span></code></pre></figure>
<p>This pulls the data in from the remote GitHub location and converts the metadata into a Clojure map. We can access this metadata by using the <code class="language-plaintext highlighter-rouge">descriptor</code> function along with keys such as <code class="language-plaintext highlighter-rouge">:name</code> and <code class="language-plaintext highlighter-rouge">:title</code> to get the relevant information:</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="nb">println</span><span class="w"> </span><span class="p">(</span><span class="nb">str</span><span class="w"> </span><span class="s">"Package name:"</span><span class="w"> </span><span class="p">(</span><span class="nf">dp/descriptor</span><span class="w"> </span><span class="n">pkg</span><span class="w"> </span><span class="no">:name</span><span class="p">)))</span><span class="w">
</span><span class="p">(</span><span class="nb">println</span><span class="w"> </span><span class="p">(</span><span class="nb">str</span><span class="w"> </span><span class="s">"Package title:"</span><span class="w"> </span><span class="p">(</span><span class="nf">dp/descriptor</span><span class="w"> </span><span class="n">pkg</span><span class="w"> </span><span class="no">:title</span><span class="p">)))</span></code></pre></figure>
<p>The package descriptor contains metadata that describes the contents of the data package. What about accessing the data itself? We can get to it using the <code class="language-plaintext highlighter-rouge">get-resources</code> function:</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">table</span><span class="w"> </span><span class="p">(</span><span class="nf">dp/get-resources</span><span class="w"> </span><span class="n">pkg</span><span class="w"> </span><span class="no">:data</span><span class="p">))</span><span class="w">
</span><span class="p">(</span><span class="nb">doseq</span><span class="w"> </span><span class="p">[</span><span class="n">row</span><span class="w"> </span><span class="n">table</span><span class="p">]</span><span class="w">
</span><span class="p">(</span><span class="nb">println</span><span class="w"> </span><span class="n">row</span><span class="p">))</span></code></pre></figure>
<p>The above code locates the data in the data package, then goes through it line by line and prints the contents.</p>
<h2 id="casting-types-with-corespec">Casting Types with core.spec</h2>
<p>We can use Clojure’s <a href="https://clojure.org/guides/spec">spec</a> library to define a schema for our data, which can then be used to cast the types of the data in the CSV file.</p>
<p>Below is a spec description of a periodic element type, consisting of an atomic number, atomic symbol, the element’s name, its mass, and whether the element is a metal or non-metal:</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="nf">s/def</span><span class="w"> </span><span class="no">::number</span><span class="w"> </span><span class="n">int?</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="nf">s/def</span><span class="w"> </span><span class="no">::symbol</span><span class="w"> </span><span class="nb">string?</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="nf">s/def</span><span class="w"> </span><span class="no">::name</span><span class="w"> </span><span class="nb">string?</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="nf">s/def</span><span class="w"> </span><span class="no">::mass</span><span class="w"> </span><span class="n">float?</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="nf">s/def</span><span class="w"> </span><span class="no">::metal</span><span class="w"> </span><span class="nb">string?</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="nf">s/def</span><span class="w"> </span><span class="no">::element</span><span class="w"> </span><span class="p">(</span><span class="nf">s/keys</span><span class="w"> </span><span class="no">:req</span><span class="w"> </span><span class="p">[</span><span class="no">::number</span><span class="w"> </span><span class="no">::symbol</span><span class="w"> </span><span class="no">::name</span><span class="w"> </span><span class="no">::mass</span><span class="w"> </span><span class="no">::metal</span><span class="p">]))</span></code></pre></figure>
<p>The above spec can be used to cast values in our tabular data so that they match the specified schema. The example below shows our tabular data values being cast to fit the spec description. Then the <code class="language-plaintext highlighter-rouge">-main</code> function loops through the elements, printing only those with an atomic mass of under 10.</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="nf">ns</span><span class="w"> </span><span class="n">periodic-table.core</span><span class="w">
</span><span class="p">(</span><span class="no">:require</span><span class="w"> </span><span class="p">[</span><span class="n">frictionlessdata.datapackage</span><span class="w"> </span><span class="no">:as</span><span class="w"> </span><span class="n">dp</span><span class="p">]</span><span class="w">
</span><span class="p">[</span><span class="n">frictionlessdata.tableschema</span><span class="w"> </span><span class="no">:as</span><span class="w"> </span><span class="n">ts</span><span class="p">]</span><span class="w">
</span><span class="p">[</span><span class="n">clojure.spec.alpha</span><span class="w"> </span><span class="no">:as</span><span class="w"> </span><span class="n">s</span><span class="p">]))</span><span class="w">
</span><span class="p">(</span><span class="nf">s/def</span><span class="w"> </span><span class="no">::number</span><span class="w"> </span><span class="n">int?</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="nf">s/def</span><span class="w"> </span><span class="no">::symbol</span><span class="w"> </span><span class="nb">string?</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="nf">s/def</span><span class="w"> </span><span class="no">::name</span><span class="w"> </span><span class="nb">string?</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="nf">s/def</span><span class="w"> </span><span class="no">::mass</span><span class="w"> </span><span class="n">float?</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="nf">s/def</span><span class="w"> </span><span class="no">::metal</span><span class="w"> </span><span class="nb">string?</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="nf">s/def</span><span class="w"> </span><span class="no">::element</span><span class="w"> </span><span class="p">(</span><span class="nf">s/keys</span><span class="w"> </span><span class="no">:req</span><span class="w"> </span><span class="p">[</span><span class="no">::number</span><span class="w"> </span><span class="no">::symbol</span><span class="w"> </span><span class="no">::name</span><span class="w"> </span><span class="no">::mass</span><span class="w"> </span><span class="no">::metal</span><span class="p">]))</span><span class="w">
</span><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">pkg</span><span class="w">
</span><span class="p">(</span><span class="nf">dp/load</span><span class="w"> </span><span class="s">"https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json"</span><span class="p">))</span><span class="w">
</span><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">resources</span><span class="w"> </span><span class="p">(</span><span class="nf">dp/get-resources</span><span class="w"> </span><span class="n">pkg</span><span class="w"> </span><span class="no">:data</span><span class="p">))</span><span class="w">
</span><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">elements</span><span class="w"> </span><span class="p">(</span><span class="nf">dp/cast</span><span class="w"> </span><span class="n">resources</span><span class="w"> </span><span class="no">::element</span><span class="p">))</span><span class="w">
</span><span class="p">(</span><span class="k">defn</span><span class="w"> </span><span class="n">-main</span><span class="w"> </span><span class="p">[]</span><span class="w">
</span><span class="p">(</span><span class="nb">doseq</span><span class="w"> </span><span class="p">[</span><span class="n">e</span><span class="w"> </span><span class="n">elements</span><span class="p">]</span><span class="w">
</span><span class="p">(</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="nb"><</span><span class="w"> </span><span class="p">(</span><span class="no">::mass</span><span class="w"> </span><span class="n">e</span><span class="p">)</span><span class="w"> </span><span class="mi">10</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="nb">println</span><span class="w"> </span><span class="n">e</span><span class="p">))))</span></code></pre></figure>
<p>When run, the program produces the following output:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>lein run
<span class="o">{</span>::number 1 ::symbol <span class="s2">"H"</span> ::name <span class="s2">"Hydrogen"</span> ::mass 1.00794 ::metal <span class="s2">"nonmetal"</span><span class="o">}</span>
<span class="o">{</span>::number 2 ::symbol <span class="s2">"He"</span> ::name <span class="s2">"Helium"</span> ::mass 4.002602 ::metal <span class="s2">"noble gas"</span><span class="o">}</span>
<span class="o">{</span>::number 3 ::symbol <span class="s2">"Li"</span> ::name <span class="s2">"Lithium"</span> ::mass 6.941 ::metal <span class="s2">"alkali metal"</span><span class="o">}</span>
<span class="o">{</span>::number 4 ::symbol <span class="s2">"Be"</span> ::name <span class="s2">"Beryllium"</span> ::mass 9.012182 ::metal <span class="s2">"alkaline earth metal"</span><span class="o">}</span></code></pre></figure>
<p>This concludes our simple tutorial for using the Clojure libraries for Frictionless Data.</p>
<hr />
<p>We welcome your feedback and questions via our <a href="http://gitter.im/frictionlessdata/chat">Frictionless Data Gitter chat</a> or via <a href="https://github.com/frictionlessdata/datapackage-clj/issues">GitHub issues</a> on the <a href="https://github.com/frictionlessdata/datapackage-clj">datapackage-clj</a> repository.</p>
Matt Thompson
Processing Tabular Data Packages in Java
2018-04-28T00:00:00+00:00
http://okfnlabs.org/blog/2018/04/28/datapackages-in-java
<p>Georges Labrèche was one of 2017’s <a href="https://toolfund.frictionlessdata.io">Frictionless Data Tool Fund</a> grantees, tasked with extending the implementation of the core Frictionless Data libraries in the Java programming language. You can read more about this in <a href="https://frictionlessdata.io/articles/georges-labreche/">his grantee profile</a>.</p>
<p>In this post, Labrèche will show you how to install and use the <a href="https://www.java.com/en/">Java</a> libraries for working with <a href="https://frictionlessdata.io/specs/tabular-data-package/">Tabular Data Packages</a>.</p>
<hr />
<p>Our goal in this tutorial is to load tabular data from a CSV file and infer both its data types and the table’s schema.</p>
<h2 id="setup">Setup</h2>
<p>First things first, you’ll want to grab <a href="https://github.com/frictionlessdata/datapackage-java">datapackage-java</a> and the <a href="https://github.com/frictionlessdata/tableschema-java">tableschema-java</a> libraries.</p>
<h2 id="the-data">The Data</h2>
<p>For our example, we will use a <a href="https://frictionlessdata.io/specs/tabular-data-package/">Tabular Data Package</a> containing the periodic table. You can find the <a href="https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json">data package descriptor</a> and the <a href="https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/data.csv">data</a> on GitHub.</p>
<h2 id="packaging">Packaging</h2>
<p>Let’s start by fetching and packaging the data:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="c1">// fetch the data</span>
<span class="no">URL</span> <span class="n">url</span> <span class="o">=</span> <span class="k">new</span> <span class="no">URL</span><span class="o">(</span><span class="s">"https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json"</span><span class="o">);</span>
<span class="c1">// package the data</span>
<span class="nc">Package</span> <span class="n">pkg</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Package</span><span class="o">(</span><span class="n">url</span><span class="o">);</span></code></pre></figure>
<p>That’s it, you’re all set to start playing with the packaged data. There are parameters you can set, such as loading a schema or imposing strict validation, so be sure to go through the project’s <a href="https://github.com/frictionlessdata/datapackage-java/blob/master/README.md">README</a> for more detail.</p>
<h2 id="iterating">Iterating</h2>
<p>Now that you have a Data Package instance, let’s see what the data looks like. A data package can contain more than one resource, so you have to use the <code class="language-plaintext highlighter-rouge">Package.getResource()</code> method to specify which resource you’d like to access.</p>
<p>Let’s iterate over the data:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="c1">// Get a resource named data from the data package</span>
<span class="nc">Resource</span> <span class="n">resource</span> <span class="o">=</span> <span class="n">pkg</span><span class="o">.</span><span class="na">getResource</span><span class="o">(</span><span class="s">"data"</span><span class="o">);</span>
<span class="c1">// Get the Iterator</span>
<span class="nc">Iterator</span><span class="o"><</span><span class="nc">String</span><span class="o">[]></span> <span class="n">iter</span> <span class="o">=</span> <span class="n">resource</span><span class="o">.</span><span class="na">iter</span><span class="o">();</span>
<span class="c1">// Iterate</span>
<span class="k">while</span><span class="o">(</span><span class="n">iter</span><span class="o">.</span><span class="na">hasNext</span><span class="o">()){</span>
<span class="nc">String</span><span class="o">[]</span> <span class="n">row</span> <span class="o">=</span> <span class="n">iter</span><span class="o">.</span><span class="na">next</span><span class="o">();</span>
<span class="nc">String</span> <span class="n">atomicNumber</span> <span class="o">=</span> <span class="n">row</span><span class="o">[</span><span class="mi">0</span><span class="o">];</span>
<span class="nc">String</span> <span class="n">symbol</span> <span class="o">=</span> <span class="n">row</span><span class="o">[</span><span class="mi">1</span><span class="o">];</span>
<span class="nc">String</span> <span class="n">name</span> <span class="o">=</span> <span class="n">row</span><span class="o">[</span><span class="mi">2</span><span class="o">];</span>
<span class="nc">String</span> <span class="n">atomicMass</span> <span class="o">=</span> <span class="n">row</span><span class="o">[</span><span class="mi">3</span><span class="o">];</span>
<span class="nc">String</span> <span class="n">metalOrNonMetal</span> <span class="o">=</span> <span class="n">row</span><span class="o">[</span><span class="mi">4</span><span class="o">];</span>
<span class="o">}</span></code></pre></figure>
<p>Notice how we’re fetching all values as <code class="language-plaintext highlighter-rouge">String</code>. This may not be what you want, particularly for the atomic number and mass. Alternatively, you can trigger data type inference and casting like this:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="c1">// Get Iterator.</span>
<span class="c1">// Third boolean is the cast flag.</span>
<span class="nc">Iterator</span><span class="o"><</span><span class="nc">Object</span><span class="o">[]></span> <span class="n">iter</span> <span class="o">=</span> <span class="n">resource</span><span class="o">.</span><span class="na">iter</span><span class="o">(</span><span class="kc">false</span><span class="o">,</span> <span class="kc">false</span><span class="o">,</span> <span class="kc">true</span><span class="o">);</span>
<span class="c1">// Iterator</span>
<span class="k">while</span><span class="o">(</span><span class="n">iter</span><span class="o">.</span><span class="na">hasNext</span><span class="o">()){</span>
<span class="nc">Object</span><span class="o">[]</span> <span class="n">row</span> <span class="o">=</span> <span class="n">iter</span><span class="o">.</span><span class="na">next</span><span class="o">();</span>
<span class="kt">int</span> <span class="n">atomicNumber</span> <span class="o">=</span> <span class="o">(</span><span class="kt">int</span><span class="o">)</span> <span class="n">row</span><span class="o">[</span><span class="mi">0</span><span class="o">];</span>
<span class="nc">String</span> <span class="n">symbol</span> <span class="o">=</span> <span class="o">(</span><span class="nc">String</span><span class="o">)</span> <span class="n">row</span><span class="o">[</span><span class="mi">1</span><span class="o">];</span>
<span class="nc">String</span> <span class="n">name</span> <span class="o">=</span> <span class="o">(</span><span class="nc">String</span><span class="o">)</span> <span class="n">row</span><span class="o">[</span><span class="mi">2</span><span class="o">];</span>
<span class="kt">float</span> <span class="n">atomicMass</span> <span class="o">=</span> <span class="o">(</span><span class="kt">float</span><span class="o">)</span> <span class="n">row</span><span class="o">[</span><span class="mi">3</span><span class="o">];</span>
<span class="nc">String</span> <span class="n">metalOrNonMetal</span> <span class="o">=</span> <span class="o">(</span><span class="nc">String</span><span class="o">)</span> <span class="n">row</span><span class="o">[</span><span class="mi">4</span><span class="o">];</span>
<span class="o">}</span></code></pre></figure>
<p>And that’s it, your data is now associated with the appropriate data types!</p>
<h2 id="inferring-the-schema">Inferring the Schema</h2>
<p>We wouldn’t have had to infer the data types if we had included a <a href="https://frictionlessdata.io/docs/table-schema/">Table Schema</a> when creating an instance of our Data Package. If a Table Schema is not available, then it’s something that can also be inferred and created with <code class="language-plaintext highlighter-rouge">tableschema-java</code>:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="no">URL</span> <span class="n">url</span> <span class="o">=</span> <span class="k">new</span> <span class="no">URL</span><span class="o">(</span><span class="s">"https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/data.csv"</span><span class="o">);</span>
<span class="nc">Table</span> <span class="n">table</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Table</span><span class="o">(</span><span class="n">url</span><span class="o">);</span>
<span class="nc">Schema</span> <span class="n">schema</span> <span class="o">=</span> <span class="n">table</span><span class="o">.</span><span class="na">inferSchema</span><span class="o">();</span>
<span class="n">schema</span><span class="o">.</span><span class="na">write</span><span class="o">(</span><span class="s">"/path/to/write/schema.json"</span><span class="o">);</span></code></pre></figure>
<p>The type inference algorithm tries to cast each value to the available types, and every successful cast increments a popularity score for the type in question. At the end, the type with the best score is returned.</p>
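<p>To make the popularity-score idea concrete, here is an illustrative Python sketch of that kind of inference (not <code class="language-plaintext highlighter-rouge">tableschema-java</code>’s actual code; the candidate types and the tie-breaking order are assumptions):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"># Illustrative sketch of popularity-score type inference;
# not tableschema-java's actual implementation.
from datetime import date

CANDIDATES = {
    'integer': int,
    'number': float,
    'date': date.fromisoformat,
}

def infer_column_type(values):
    scores = {'string': len(values)}  # every value casts to string
    for value in values:
        for name, cast in CANDIDATES.items():
            try:
                cast(value)  # a successful cast increments the type's score
                scores[name] = scores.get(name, 0) + 1
            except ValueError:
                pass
    best = max(scores.values())
    # Among top scorers, prefer the most specific type (assumed order).
    for name in ('integer', 'number', 'date', 'string'):
        if scores.get(name) == best:
            return name

print(infer_column_type(['1', '2', '3']))          # integer
print(infer_column_type(['1.00794', '4.002602']))  # number
</code></pre></figure>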
<p>The inference algorithm traverses all of the table’s rows and attempts to cast every single value of the table. When dealing with large tables, you might want to limit the number of rows that the inference algorithm processes:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="c1">// Only process the first 25 rows for type inference.</span>
<span class="nc">Schema</span> <span class="n">schema</span> <span class="o">=</span> <span class="n">table</span><span class="o">.</span><span class="na">inferSchema</span><span class="o">(</span><span class="mi">25</span><span class="o">);</span></code></pre></figure>
<p>Be sure to go through <code class="language-plaintext highlighter-rouge">tableschema-java</code>’s <a href="https://github.com/frictionlessdata/tableschema-java/blob/master/README.md">README</a> as well to learn more about how to operate with <a href="https://frictionlessdata.io/docs/table-schema/">Table Schema</a>.</p>
<h2 id="contributing">Contributing</h2>
<p>In case you discovered an issue that you’d like to contribute a fix for, or if you would like to extend functionality:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c"># install jabba and maven2</span>
<span class="nv">$ </span><span class="nb">cd </span>tableschema-java
<span class="nv">$ </span>jabba <span class="nb">install </span>1.8
<span class="nv">$ </span>jabba use 1.8
<span class="nv">$ </span>mvn <span class="nb">install</span> <span class="nt">-DskipTests</span><span class="o">=</span><span class="nb">true</span> <span class="nt">-Dmaven</span>.javadoc.skip<span class="o">=</span><span class="nb">true</span> <span class="nt">-B</span> <span class="nt">-V</span>
<span class="nv">$ </span>mvn <span class="nb">test</span> <span class="nt">-B</span></code></pre></figure>
<p>Make sure that all tests pass, and submit a PR with your contributions once you’re ready.</p>
<hr />
<p>We also welcome your feedback and questions via our <a href="http://gitter.im/frictionlessdata/chat">Frictionless Data Gitter chat</a> or via <a href="https://github.com/frictionlessdata/datapackage-java/issues">GitHub issues</a> on the datapackage-java repository.</p>
Georges Labrèche
Collecting, Analysing and Sharing Twitter Data
2018-03-08T00:00:00+00:00
http://okfnlabs.org/blog/2018/03/08/open-data-day-tweets
<p>On March 3, communities around the world marked Open Data Day <a href="http://opendataday.org/#map">in over 400 events</a>. Here’s the <a href="https://github.com/okfn/opendataday/blob/master/Datasets/Events2018.csv">dataset for all Open Data Day 2018 events</a>.</p>
<p>In this post, we will harvest Open Data Day affiliated content from Twitter and analyze it using R before packaging and publishing the data and associated resources publicly on GitHub.</p>
<h2 id="collecting-the-data">Collecting the Data</h2>
<p>With over 300 million monthly users [<a href="https://www.omnicoreagency.com/twitter-statistics/">source</a>, January 2018], Twitter is a popular social network that I particularly like for its abbreviated messages, known as Tweets. <a href="https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets">Twitter’s Standard Search API</a> allows users to mine tweets from as far back as a week for free.</p>
<p><a href="https://www.r-project.org">R</a> is a popular programming language for data analysis and has an active community of contributors that add to its capabilities by writing custom packages for interacting with different tools and platforms and achieving different tasks. In this post, we will employ two such packages:</p>
<ul>
<li><strong><a href="https://cran.r-project.org/web/packages/twitteR/README.html">twitteR</a></strong> allows us to interact with the Twitter API. We will install this from CRAN, the official packages repository for R.</li>
<li>Frictionless Data’s <strong><a href="https://github.com/frictionlessdata/datapackage-r">datapackage.r</a></strong> library will allow us to collate our open data day data and associated resources, such as the R script in one place before we publish it. We will install this from GitHub.</li>
</ul>
<p>To get started, create a new application on <a href="https://apps.twitter.com">apps.twitter.com</a> and take note of the API and access tokens. We will need to specify these in our R script.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># install and load the twitteR library</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"twitteR"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">twitteR</span><span class="p">)</span><span class="w">
</span><span class="c1"># specify Twitter API and Access Tokens</span><span class="w">
</span><span class="n">api_key</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"YOUR_API_KEY"</span><span class="w">
</span><span class="n">api_secret</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"YOUR_API_SECRET"</span><span class="w">
</span><span class="n">access_token</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"YOUR_ACCESS_TOKEN"</span><span class="w">
</span><span class="n">access_secret</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"YOUR_ACCESS_SECRET"</span><span class="w">
</span><span class="n">setup_twitter_oauth</span><span class="p">(</span><span class="n">api_key</span><span class="p">,</span><span class="w"> </span><span class="n">api_secret</span><span class="p">,</span><span class="w"> </span><span class="n">access_token</span><span class="p">,</span><span class="w"> </span><span class="n">access_secret</span><span class="p">)</span></code></pre></figure>
<p>We are now ready to read tweets from the two official Open Data Day hashtags: <a href="https://twitter.com/hashtag/OpenDataDay">#opendataday</a> and <a href="https://twitter.com/hashtag/ODD18">#odd18</a>. With a maximum of 100 tweets per request, <a href="https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets">Twitter’s Search API allows</a> for 180 requests every 15 minutes. Since we are interested in as many tweets as we can get, we will specify the upper limit as 18,000 (180 requests × 100 tweets), which tells the twitteR library the maximum number of tweets to retrieve for us.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># read tweets from the two official hashtags, #opendataday and #odd18</span><span class="w">
</span><span class="n">tweets_opendataday</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">searchTwitteR</span><span class="p">(</span><span class="s2">"#opendataday"</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">18000</span><span class="p">)</span><span class="w">
</span><span class="n">tweets_odd18</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">searchTwitteR</span><span class="p">(</span><span class="s2">"#odd18"</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">18000</span><span class="p">)</span><span class="w">
</span><span class="c1"># view lists of mined tweets from both hashtags</span><span class="w">
</span><span class="n">tweets_opendataday</span><span class="w">
</span><span class="n">tweets_odd18</span></code></pre></figure>
<p>Note: run each <code class="language-plaintext highlighter-rouge">searchTwitteR()</code> call separately, 15 minutes apart, to avoid exceeding the rate limit.</p>
<p>In the R script snippet above, we assigned the results of our search to the variables <code class="language-plaintext highlighter-rouge">tweets_opendataday</code> and <code class="language-plaintext highlighter-rouge">tweets_odd18</code> and called the two variables to view the entire list of tweets obtained. Luckily for us, the total number of tweets on either hashtag is within Twitter’s 15-minute request limit. Here’s the feedback we receive:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># tweets mined on March 7, 2018</span><span class="w">
</span><span class="c1">#opendataday</span><span class="w">
</span><span class="m">18000</span><span class="w"> </span><span class="n">tweets</span><span class="w"> </span><span class="n">were</span><span class="w"> </span><span class="n">requested</span><span class="w"> </span><span class="n">but</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">API</span><span class="w"> </span><span class="n">can</span><span class="w"> </span><span class="n">only</span><span class="w"> </span><span class="n">return</span><span class="w"> </span><span class="m">11458</span><span class="w">
</span><span class="c1">#odd18</span><span class="w">
</span><span class="m">18000</span><span class="w"> </span><span class="n">tweets</span><span class="w"> </span><span class="n">were</span><span class="w"> </span><span class="n">requested</span><span class="w"> </span><span class="n">but</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">API</span><span class="w"> </span><span class="n">can</span><span class="w"> </span><span class="n">only</span><span class="w"> </span><span class="n">return</span><span class="w"> </span><span class="m">3497</span></code></pre></figure>
<p>Here’s a snippet of the list obtained from the #opendataday hashtag:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="s2">"OpenDataAnd: Con motivo del pasado #OpenDataDay, @ODIHQ nos recuerda qué es y para qué sirven los #DatosAbiertos… https://t.co/Fib4rSukbs"</span><span class="w">
</span><span class="p">[[</span><span class="m">2</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="s2">"johnfaig: RT @ODIHQ: Here's our list of seven weird and wonderful open datasets (nominated by you) https://t.co/H42bV5oIhw\n\n#opendataday #opendataday…"</span><span class="w">
</span><span class="p">[[</span><span class="m">3</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="s2">"SurianoRodrigo: RT @CETGAPue: Desde el auditorio Ing. Antonio Osorio García de la @fi_buap, se lleva a cabo el BootCamp #OpenDataDay, al que asisten académ…"</span><span class="w">
</span><span class="p">[[</span><span class="m">4</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="s2">"Carolrozu: RT @CETGAPue: Desde el auditorio Ing. Antonio Osorio García de la @fi_buap, se lleva a cabo el BootCamp #OpenDataDay, al que asisten académ…"</span><span class="w">
</span><span class="p">[[</span><span class="m">5</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="s2">"Josefina_Buxade: RT @CETGAPue: Desde el auditorio Ing. Antonio Osorio García de la @fi_buap, se lleva a cabo el BootCamp #OpenDataDay, al que asisten académ…"</span></code></pre></figure>
<p>Since the entire lists are long (~11,500 tweets on the #opendataday hashtag alone) and hard to comprehend, our best bet is to convert the lists to data frames. In R, data frames allow us to store data in tables and manipulate and analyse it easily. twitteR’s <code class="language-plaintext highlighter-rouge">twListToDF</code> function allows us to convert lists to data frames. After scraping data, it is always a good idea to save the original raw data, as it provides a good base for any analysis work. We will write our data to a CSV file so we can publish it widely. The CSV format is machine-readable and easy to import into any spreadsheet application or advanced tools for analysis.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># convert the list of mined tweets from each hashtag to a dataframe</span><span class="w">
</span><span class="n">tweets_opendataday_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">twListToDF</span><span class="p">(</span><span class="n">tweets_opendataday</span><span class="p">)</span><span class="w">
</span><span class="n">tweets_odd18_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">twListToDF</span><span class="p">(</span><span class="n">tweets_odd18</span><span class="p">)</span><span class="w">
</span><span class="c1"># save scraped data in CSV files</span><span class="w">
</span><span class="n">write.csv</span><span class="p">(</span><span class="n">tweets_opendataday_df</span><span class="p">,</span><span class="w"> </span><span class="n">file</span><span class="o">=</span><span class="s2">"data/opendataday_raw.csv"</span><span class="p">)</span><span class="w">
</span><span class="n">write.csv</span><span class="p">(</span><span class="n">tweets_odd18_df</span><span class="p">,</span><span class="w"> </span><span class="n">file</span><span class="o">=</span><span class="s2">"data/odd18_raw.csv"</span><span class="p">)</span></code></pre></figure>
<p>Here’s what the first five rows of our data frame look like:</p>
<table class="table table-striped table-bordered" style="display: block; overflow:auto">
<thead>
<tr>
<th> </th>
<th>text</th>
<th>favorited</th>
<th>favoriteCount</th>
<th>replyToSN</th>
<th>created</th>
<th>truncated</th>
<th>replyToSID</th>
<th>id</th>
<th>replyToUID</th>
<th>statusSource</th>
<th>screenName</th>
<th>retweetCount</th>
<th>isRetweet</th>
<th>retweeted</th>
<th>longitude</th>
<th>latitude</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Participan como panelistas de la mesa “Datos mañaneros, qué son y para qué sirven los #DatosAbiertos”, Karla Ramos… https://t.co/wFBYaUP68n</td>
<td>FALSE</td>
<td>3</td>
<td>NA</td>
<td>05/03/18 16:29</td>
<td>TRUE</td>
<td>NA</td>
<td>9.70698E+17</td>
<td>NA</td>
<td><a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a></td>
<td>CETGAPue</td>
<td>2</td>
<td>FALSE</td>
<td>FALSE</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>2</td>
<td>RT @Transparen_Xal: A unos minutos de empezar el Open Data Day Xalapa#ODD18 #Xalapa https://t.co/VH3m0QGeOJ</td>
<td>FALSE</td>
<td>0</td>
<td>NA</td>
<td>05/03/18 16:28</td>
<td>FALSE</td>
<td>NA</td>
<td>9.70698E+17</td>
<td>NA</td>
<td><a href="http://www.hootsuite.com" rel="nofollow">Hootsuite</a></td>
<td>AytoXalapa</td>
<td>1</td>
<td>TRUE</td>
<td>FALSE</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>3</td>
<td>Nos encontramos ya en @ImacXalapa con @AytoXalapa para sumar esfuerzos a favor de la cultura de participación ciuda… https://t.co/VdIcF16Ub4</td>
<td>FALSE</td>
<td>0</td>
<td>NA</td>
<td>05/03/18 16:22</td>
<td>TRUE</td>
<td>NA</td>
<td>9.70696E+17</td>
<td>NA</td>
<td><a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a></td>
<td>VERIVAI</td>
<td>1</td>
<td>FALSE</td>
<td>FALSE</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>4</td>
<td>A unos minutos de empezar el Open Data Day Xalapa#ODD18 #Xalapa https://t.co/VH3m0QGeOJ</td>
<td>FALSE</td>
<td>0</td>
<td>NA</td>
<td>05/03/18 16:20</td>
<td>FALSE</td>
<td>NA</td>
<td>9.70696E+17</td>
<td>NA</td>
<td><a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a></td>
<td>Transparen_Xal</td>
<td>1</td>
<td>FALSE</td>
<td>FALSE</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>5</td>
<td>El gobierno de @TonyGali promueve el uso de los #DatosAbiertos. Entra al portal https://t.co/Jz23xpJLAS y consult… https://t.co/UoWP43R8Km</td>
<td>FALSE</td>
<td>5</td>
<td>NA</td>
<td>05/03/18 16:09</td>
<td>TRUE</td>
<td>NA</td>
<td>9.70693E+17</td>
<td>NA</td>
<td><a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a></td>
<td>CETGAPue</td>
<td>4</td>
<td>FALSE</td>
<td>FALSE</td>
<td>NA</td>
<td>NA</td>
</tr>
</tbody>
</table>
<p>For ease of analysis, and because the two data frames have the same columns, let’s merge the two datasets.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># combine dataframes from the two hashtags</span><span class="w">
</span><span class="n">alltweets_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="n">tweets_opendataday_df</span><span class="p">,</span><span class="w"> </span><span class="n">tweets_odd18_df</span><span class="p">)</span><span class="w">
</span><span class="n">write.csv</span><span class="p">(</span><span class="n">alltweets_df</span><span class="p">,</span><span class="w"> </span><span class="n">file</span><span class="o">=</span><span class="s2">"data/allopendatadaytweets.csv"</span><span class="p">)</span></code></pre></figure>
<h2 id="analysing-the-data">Analysing the Data</h2>
<p>Data analysis in R is quite a joy. We will use R’s <code class="language-plaintext highlighter-rouge">dplyr</code> package to analyse our data and answer a few questions:</p>
<ul>
<li>How many Open Data Day attendees tweeted from Android phones?</li>
</ul>
<p>We can answer this using dplyr’s <code class="language-plaintext highlighter-rouge">filter()</code> function, which, as the name suggests, keeps only the rows we are interested in: in this case, tweets sent from the Twitter for Android app. <code class="language-plaintext highlighter-rouge">tally()</code> then counts the matching rows.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># install and load dplyr</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"dplyr"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="c1"># find out number of open data day tweets from android phones</span><span class="w">
</span><span class="n">android_tweets</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">alltweets_df</span><span class="p">,</span><span class="w"> </span><span class="n">grepl</span><span class="p">(</span><span class="s2">"Twitter for Android"</span><span class="p">,</span><span class="w"> </span><span class="n">statusSource</span><span class="p">))</span><span class="w">
</span><span class="n">tally</span><span class="p">(</span><span class="n">android_tweets</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># the result</span><span class="w">
</span><span class="n">n</span><span class="w">
</span><span class="m">1</span><span class="w"> </span><span class="m">5180</span></code></pre></figure>
<p>5,180 of the 14,955 (34.6%) #opendataday and #odd18 tweets were sent from Android phones.</p>
<ul>
<li>Naturally, Open Data Day events cut across many topics and disciplines, and some events included hands-on workshop sessions or hackathons. Let’s find out which open data day tweets point to open source projects and resources that are available on GitHub.</li>
</ul>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># open data day tweets whose text mentions resources on GitHub</span><span class="w">
</span><span class="n">github_resources</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">alltweets_df</span><span class="p">,</span><span class="w"> </span><span class="n">grepl</span><span class="p">(</span><span class="s2">"github.com"</span><span class="p">,</span><span class="w"> </span><span class="n">text</span><span class="p">))</span><span class="w">
</span><span class="n">tally</span><span class="p">(</span><span class="n">github_resources</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># the result</span><span class="w">
</span><span class="n">n</span><span class="w">
</span><span class="m">1</span><span class="w"> </span><span class="m">32</span></code></pre></figure>
<p>Only 32 #opendataday and #odd18 tweets contain GitHub links.</p>
<ul>
<li>Not all open data day tweets are geotagged, but from the few that are, we can create a very basic map to show where people tweeted from. To do this, we will use the <a href="http://leafletjs.com">Leaflet</a> library for R.</li>
</ul>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># install and load leaflet</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"leaflet"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">leaflet</span><span class="p">)</span><span class="w">
</span><span class="c1"># create basic map</span><span class="w">
</span><span class="n">map</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">leaflet</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">addTiles</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">addCircles</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">alltweets_df</span><span class="p">,</span><span class="w"> </span><span class="n">lat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">latitude</span><span class="p">,</span><span class="w"> </span><span class="n">lng</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">longitude</span><span class="p">)</span><span class="w">
</span><span class="c1"># view map</span><span class="w">
</span><span class="n">map</span></code></pre></figure>
<p><img src="/img/posts/opendataday-geotagged-tweets.png" alt="map showing where geotagged #opendataday and #odd18 tweets originated from" />
<br />
<em>figure 1: map showing where geotagged #opendataday and #odd18 tweets originated from</em></p>
<h2 id="sharing-the-data">Sharing the Data</h2>
<p>Due to Twitter’s terms of use, we can only share a stripped-down version of the raw data. Our final dataset contains tweet IDs and retweet counts, and will be packaged alongside this R script so you can re-download the full tweets yourself.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># filter out retweets and leave original tweets</span><span class="w">
</span><span class="n">notretweets_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">dplyr</span><span class="o">::</span><span class="n">filter</span><span class="p">(</span><span class="n">alltweets_df</span><span class="p">,</span><span class="w"> </span><span class="n">grepl</span><span class="p">(</span><span class="s2">"FALSE"</span><span class="p">,</span><span class="w"> </span><span class="n">isRetweet</span><span class="p">))</span><span class="w">
</span><span class="c1"># strip down tweets data to comply with Twitter's terms of use.</span><span class="w">
</span><span class="n">subsetoftweets</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="n">notretweets_df</span><span class="p">,</span><span class="w"> </span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">retweetCount</span><span class="p">)</span><span class="w">
</span><span class="n">write.csv</span><span class="p">(</span><span class="n">subsetoftweets</span><span class="p">,</span><span class="w"> </span><span class="n">file</span><span class="o">=</span><span class="s2">"data/subsetofopendatadaytweets.csv"</span><span class="p">)</span></code></pre></figure>
<h3 id="packaging-the-data-and-associated-resources">Packaging the Data and associated resources</h3>
<p>Providing context when sharing data is important, and Frictionless Data’s <a href="http://frictionlessdata.io/data-packages/">Data Package</a> format makes it possible. Using <a href="https://github.com/frictionlessdata/datapackage-r">datapackage.r</a>, we can infer a schema for the stripped-down tweets CSV file and publish it alongside the other resources.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># specify filepath and infer schema</span><span class="w">
</span><span class="n">filepath</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'/data/subsetofopendatadaytweets.csv'</span><span class="w">
</span><span class="n">schema</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tableschema.r</span><span class="o">::</span><span class="n">infer</span><span class="p">(</span><span class="n">filepath</span><span class="p">)</span></code></pre></figure>
<p>Read more about the datapackage-r package <a href="http://okfnlabs.org/blog/2018/02/14/datapackages-in-r.html">in this post by Open Knowledge Greece</a>.</p>
<p>Alternatively, we can use the <a href="https://create.frictionlessdata.io">Data Package Creator</a> to package our data and associated resources.</p>
<p><img src="/img/posts/opendataday-data-package.png" alt="creating the data package on Data Package Creator" />
<br />
<em>figure 2: creating the data package on Data Package Creator</em></p>
<p>Read more about the data package creator in <a href="http://okfnlabs.org/blog/2018/02/05/data-package-creator.html">this post</a>.</p>
<h3 id="publishing-on-github">Publishing on GitHub</h3>
<p>Once our data package is ready, we can simply publish it to GitHub. Find the open data day tweets data package <a href="https://github.com/frictionlessdata/example-data-packages/tree/master/open-data-day-tweets-2018">here</a>.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Data Packages are a great format for sharing data collections with contextual information: for this dataset, we added metadata and a schema to accompany the final CSV file. Read more about <a href="http://frictionlessdata.io/data-packages/">Data Packages in Frictionless Data</a> and reach out in <a href="http://gitter.im/frictionlessdata/chat">our community chat on Gitter</a>.</p>
Serah Rono
Processing Tabular Data Packages in Go
2018-02-16T00:00:00+00:00
http://okfnlabs.org/blog/2018/02/16/datapackages-in-go
<p>Daniel Fireman was one of 2017’s <a href="https://toolfund.frictionlessdata.io">Frictionless Data Tool Fund</a> grantees, tasked with extending the implementation of the core Frictionless Data libraries in the Go programming language. You can read more about this in <a href="https://frictionlessdata.io/articles/daniel-fireman/">his grantee profile</a>.</p>
<p>In this post, Fireman will show you how to install and use the <a href="http://golang.org">Go</a> libraries for working with <a href="https://frictionlessdata.io/specs/tabular-data-package/">Tabular Data Packages</a>.</p>
<hr />
<p>Our goal in this tutorial is to load a data package from the web and read its metadata and contents.</p>
<h2 id="setup">Setup</h2>
<p>For this tutorial, we will need the <a href="https://github.com/frictionlessdata/datapackage-go">datapackage-go</a> and <a href="https://github.com/frictionlessdata/tableschema-go">tableschema-go</a> packages, which provide all the functionality to deal with a Data Package’s metadata and its contents.</p>
<p>We are going to use the <a href="https://golang.github.io/dep/">dep tool</a> to manage the dependencies of our new project:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span><span class="nb">cd</span> <span class="nv">$GOPATH</span>/src/newdataproj
<span class="nv">$ </span>dep init</code></pre></figure>
<h2 id="the-periodic-table-data-package">The Periodic Table Data Package</h2>
<p>A <a href="https://frictionlessdata.io/specs/data-package/">Data Package</a> is a simple container format used to describe and package a collection of data. It consists of two parts:</p>
<ul>
<li>Metadata that describes the structure and contents of the package</li>
<li>Resources such as data files that form the contents of the package</li>
</ul>
<p>In this tutorial, we are using a <a href="https://frictionlessdata.io/specs/tabular-data-package/">Tabular Data Package</a> containing the periodic table. The package descriptor (<a href="https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json">datapackage.json</a>) and contents (<a href="https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/data.csv">data.csv</a>) are stored on GitHub. This dataset includes the atomic number, symbol, element name, atomic mass, and the metallicity of the element. Here are the header and the first three rows:</p>
<table class="table table-striped table-bordered" style="display: block; overflow:auto">
<thead>
<tr>
<th>atomic number</th>
<th>symbol</th>
<th>name</th>
<th>atomic mass</th>
<th>metal or nonmetal?</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>H</td>
<td>Hydrogen</td>
<td>1.00794</td>
<td>nonmetal</td>
</tr>
<tr>
<td>2</td>
<td>He</td>
<td>Helium</td>
<td>4.002602</td>
<td>noble gas</td>
</tr>
<tr>
<td>3</td>
<td>Li</td>
<td>Lithium</td>
<td>6.941</td>
<td>alkali metal</td>
</tr>
</tbody>
</table>
<h2 id="inspecting-package-metadata">Inspecting Package Metadata</h2>
<p>Let’s start off by creating the <code class="language-plaintext highlighter-rouge">main.go</code>, which loads the data package and inspects some of its metadata.</p>
<figure class="highlight"><pre><code class="language-go" data-lang="go"><span class="k">package</span> <span class="n">main</span>
<span class="k">import</span> <span class="p">(</span>
<span class="s">"fmt"</span>
<span class="s">"github.com/frictionlessdata/datapackage-go/datapackage"</span>
<span class="p">)</span>
<span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">pkg</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">datapackage</span><span class="o">.</span><span class="n">Load</span><span class="p">(</span><span class="s">"https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json"</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="s">"Package loaded successfully."</span><span class="p">)</span>
<span class="p">}</span></code></pre></figure>
<p>Before running the code, we need to tell the dep tool to fetch our project dependencies. Don’t worry; you won’t need to do this again in this tutorial.</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>dep ensure
<span class="nv">$ </span>go run main.go
Package loaded successfully.</code></pre></figure>
<p>Now that you have loaded the periodic table Data Package, you have access to its <code class="language-plaintext highlighter-rouge">title</code> and <code class="language-plaintext highlighter-rouge">name</code> fields through the <a href="https://godoc.org/github.com/frictionlessdata/datapackage-go/datapackage#Package.Descriptor">Package.Descriptor() function</a>. To do so, let’s change our main function to (omitting error handling for the sake of brevity, but we know it is <em>very</em> important):</p>
<figure class="highlight"><pre><code class="language-go" data-lang="go"><span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">pkg</span><span class="p">,</span> <span class="n">_</span> <span class="o">:=</span> <span class="n">datapackage</span><span class="o">.</span><span class="n">Load</span><span class="p">(</span><span class="s">"https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json"</span><span class="p">)</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="s">"Name:"</span><span class="p">,</span> <span class="n">pkg</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">()[</span><span class="s">"name"</span><span class="p">])</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="s">"Title:"</span><span class="p">,</span> <span class="n">pkg</span><span class="o">.</span><span class="n">Descriptor</span><span class="p">()[</span><span class="s">"title"</span><span class="p">])</span>
<span class="p">}</span></code></pre></figure>
<p>And rerun the program:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>go run main.go
Name: period-table
Title: Periodic Table</code></pre></figure>
<p>And as you can see, the printed fields match the <a href="https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json">package descriptor</a>. For more information about the Data Package structure, please take a look at the <a href="https://frictionlessdata.io/specs/data-package/">specification</a>.</p>
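<p>Because <code class="language-plaintext highlighter-rouge">Package.Descriptor()</code> returns a plain map, you are not limited to the <code class="language-plaintext highlighter-rouge">name</code> and <code class="language-plaintext highlighter-rouge">title</code> keys. As a small illustrative sketch (not part of the original example), you could iterate over every top-level descriptor field:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch: print every top-level key in the package descriptor.
// Assumes the pkg variable loaded in the main function above;
// map iteration order is not deterministic.
for key, value := range pkg.Descriptor() {
	fmt.Println(key, "=", value)
}
</code></pre></div></div>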
<h2 id="quick-look-at-the-data">Quick Look At the Data</h2>
<p>Now that you have loaded your Data Package, it is time to process its contents. The package content consists of one or more resources. You can access <a href="https://godoc.org/github.com/frictionlessdata/datapackage-go/datapackage#Resource">Resources</a> via the <a href="https://godoc.org/github.com/frictionlessdata/datapackage-go/datapackage#Package.GetResource()">Package.GetResource()</a> method. Let’s print the periodic table <code class="language-plaintext highlighter-rouge">data</code> resource contents.</p>
<figure class="highlight"><pre><code class="language-go" data-lang="go"><span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">pkg</span><span class="p">,</span> <span class="n">_</span> <span class="o">:=</span> <span class="n">datapackage</span><span class="o">.</span><span class="n">Load</span><span class="p">(</span><span class="s">"https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json"</span><span class="p">)</span>
<span class="n">res</span> <span class="o">:=</span> <span class="n">pkg</span><span class="o">.</span><span class="n">GetResource</span><span class="p">(</span><span class="s">"data"</span><span class="p">)</span>
<span class="n">table</span><span class="p">,</span> <span class="n">_</span> <span class="o">:=</span> <span class="n">res</span><span class="o">.</span><span class="n">ReadAll</span><span class="p">()</span>
<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">row</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">table</span> <span class="p">{</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="n">row</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>go run main.go
<span class="o">[</span>atomic number symbol name atomic mass metal or nonmetal?]
<span class="o">[</span>1 H Hydrogen 1.00794 nonmetal]
<span class="o">[</span>2 He Helium 4.002602 noble gas]
<span class="o">[</span>3 Li Lithium 6.941 alkali metal]
<span class="o">[</span>4 Be Beryllium 9.012182 alkaline earth metal]
...</code></pre></figure>
<p>The <a href="https://godoc.org/github.com/frictionlessdata/datapackage-go/datapackage#Resource.ReadAll">Resource.ReadAll()</a> method loads the whole table into memory as raw strings and returns it as a Go <code class="language-plaintext highlighter-rouge">[][]string</code>. This can be quite useful for taking a quick look at the data or performing a visual sanity check.</p>
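<p>Because <code class="language-plaintext highlighter-rouge">ReadAll()</code> hands you raw strings, any numeric work requires explicit parsing. Here is a short sketch, assuming the layout shown earlier (header in row 0, atomic mass in column 3), that computes the average atomic mass with the standard <code class="language-plaintext highlighter-rouge">strconv</code> package:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch: average atomic mass from the raw [][]string returned by
// Resource.ReadAll(). The column index is an assumption based on
// the table shown earlier in this post.
package main

import (
	"fmt"
	"strconv"

	"github.com/frictionlessdata/datapackage-go/datapackage"
)

func main() {
	pkg, _ := datapackage.Load("https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json")
	table, _ := pkg.GetResource("data").ReadAll()
	var sum float64
	var n int
	for _, row := range table[1:] { // skip the header row
		mass, err := strconv.ParseFloat(row[3], 64)
		if err != nil {
			continue // skip rows whose mass cannot be parsed
		}
		sum += mass
		n++
	}
	fmt.Printf("average atomic mass: %.3f (%d elements)\n", sum/float64(n), n)
}
</code></pre></div></div>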
<h2 id="processing-the-data-packages-content">Processing the Data Package’s Content</h2>
<p>Even though the string representation can be useful for a quick sanity check, you probably want to use actual language types to process the data. Don’t worry, you won’t need to fight the casting battle yourself. Data Package Go libraries provide a rich set of methods to deal with data loading in a very idiomatic way (very similar to <a href="https://golang.org/pkg/encoding/json/">encoding/json</a>).</p>
<p>As an example, let’s change our <code class="language-plaintext highlighter-rouge">main</code> function to use actual types to store the periodic table and print the elements with atomic mass smaller than 10.</p>
<figure class="highlight"><pre><code class="language-go" data-lang="go"><span class="k">package</span> <span class="n">main</span>
<span class="k">import</span> <span class="p">(</span>
<span class="s">"fmt"</span>
<span class="s">"github.com/frictionlessdata/datapackage-go/datapackage"</span>
<span class="s">"github.com/frictionlessdata/tableschema-go/csv"</span>
<span class="p">)</span>
<span class="k">type</span> <span class="n">element</span> <span class="k">struct</span> <span class="p">{</span>
<span class="n">Number</span> <span class="kt">int</span> <span class="s">`tableheader:"atomic number"`</span>
<span class="n">Symbol</span> <span class="kt">string</span> <span class="s">`tableheader:"symbol"`</span>
<span class="n">Name</span> <span class="kt">string</span> <span class="s">`tableheader:"name"`</span>
<span class="n">Mass</span> <span class="kt">float64</span> <span class="s">`tableheader:"atomic mass"`</span>
<span class="n">Metal</span> <span class="kt">string</span> <span class="s">`tableheader:"metal or nonmetal?"`</span>
<span class="p">}</span>
<span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">pkg</span><span class="p">,</span> <span class="n">_</span> <span class="o">:=</span> <span class="n">datapackage</span><span class="o">.</span><span class="n">Load</span><span class="p">(</span><span class="s">"https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json"</span><span class="p">)</span>
<span class="n">resource</span> <span class="o">:=</span> <span class="n">pkg</span><span class="o">.</span><span class="n">GetResource</span><span class="p">(</span><span class="s">"data"</span><span class="p">)</span>
<span class="k">var</span> <span class="n">elements</span> <span class="p">[]</span><span class="n">element</span>
<span class="n">resource</span><span class="o">.</span><span class="n">Cast</span><span class="p">(</span><span class="o">&</span><span class="n">elements</span><span class="p">,</span> <span class="n">csv</span><span class="o">.</span><span class="n">LoadHeaders</span><span class="p">())</span>
<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">e</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">elements</span> <span class="p">{</span>
<span class="k">if</span> <span class="n">e</span><span class="o">.</span><span class="n">Mass</span> <span class="o"><</span> <span class="m">10</span> <span class="p">{</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Printf</span><span class="p">(</span><span class="s">"%+v</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">e</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>go run main.go
<span class="o">{</span>Number:1 Symbol:H Name:Hydrogen Mass:1.00794 Metal:nonmetal<span class="o">}</span>
<span class="o">{</span>Number:2 Symbol:He Name:Helium Mass:4.002602 Metal:noble gas<span class="o">}</span>
<span class="o">{</span>Number:3 Symbol:Li Name:Lithium Mass:6.941 Metal:alkali metal<span class="o">}</span>
<span class="o">{</span>Number:4 Symbol:Be Name:Beryllium Mass:9.012182 Metal:alkaline earth metal<span class="o">}</span></code></pre></figure>
<p>In the example above, all rows in the table are loaded into memory. Then every row is parsed into an <code class="language-plaintext highlighter-rouge">element</code> object and appended to the slice. The <code class="language-plaintext highlighter-rouge">resource.Cast</code> call returns an error if the whole table cannot be successfully parsed.</p>
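<p>In real code you would not discard that error. A minimal sketch of the stricter version, replacing the <code class="language-plaintext highlighter-rouge">resource.Cast</code> call in the main function above:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Fail loudly if any row cannot be parsed into an element.
// (This fragment assumes the imports above plus the standard "log" package.)
var elements []element
if err := resource.Cast(&elements, csv.LoadHeaders()); err != nil {
	log.Fatalf("casting periodic table: %v", err)
}
</code></pre></div></div>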
<p>If you don’t want to load all data in memory at once, you can lazily access each row using <a href="https://godoc.org/github.com/frictionlessdata/datapackage-go/datapackage#Resource.Iter">Resource.Iter</a> and use <a href="https://godoc.org/github.com/frictionlessdata/tableschema-go/schema#Schema.CastRow">Schema.CastRow</a> to cast each row into an <code class="language-plaintext highlighter-rouge">element</code> object. That would change our main function to:</p>
<figure class="highlight"><pre><code class="language-go" data-lang="go"><span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">pkg</span><span class="p">,</span> <span class="n">_</span> <span class="o">:=</span> <span class="n">datapackage</span><span class="o">.</span><span class="n">Load</span><span class="p">(</span><span class="s">"https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json"</span><span class="p">)</span>
<span class="n">resource</span> <span class="o">:=</span> <span class="n">pkg</span><span class="o">.</span><span class="n">GetResource</span><span class="p">(</span><span class="s">"data"</span><span class="p">)</span>
<span class="n">iter</span><span class="p">,</span> <span class="n">_</span> <span class="o">:=</span> <span class="n">resource</span><span class="o">.</span><span class="n">Iter</span><span class="p">(</span><span class="n">csv</span><span class="o">.</span><span class="n">LoadHeaders</span><span class="p">())</span>
<span class="n">sch</span><span class="p">,</span> <span class="n">_</span> <span class="o">:=</span> <span class="n">resource</span><span class="o">.</span><span class="n">GetSchema</span><span class="p">()</span>
<span class="k">var</span> <span class="n">e</span> <span class="n">element</span>
<span class="k">for</span> <span class="n">iter</span><span class="o">.</span><span class="n">Next</span><span class="p">()</span> <span class="p">{</span>
<span class="n">sch</span><span class="o">.</span><span class="n">CastRow</span><span class="p">(</span><span class="n">iter</span><span class="o">.</span><span class="n">Row</span><span class="p">(),</span> <span class="o">&</span><span class="n">e</span><span class="p">)</span>
<span class="k">if</span> <span class="n">e</span><span class="o">.</span><span class="n">Mass</span> <span class="o"><</span> <span class="m">10</span> <span class="p">{</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Printf</span><span class="p">(</span><span class="s">"%+v</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">e</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>go run main.go
<span class="o">{</span>Number:1 Symbol:H Name:Hydrogen Mass:1.00794 Metal:nonmetal<span class="o">}</span>
<span class="o">{</span>Number:2 Symbol:He Name:Helium Mass:4.002602 Metal:noble gas<span class="o">}</span>
<span class="o">{</span>Number:3 Symbol:Li Name:Lithium Mass:6.941 Metal:alkali metal<span class="o">}</span>
<span class="o">{</span>Number:4 Symbol:Be Name:Beryllium Mass:9.012182 Metal:alkaline earth metal<span class="o">}</span></code></pre></figure>
<p>And our code is ready to deal with the growth of the periodic table in a very memory-efficient way :-)</p>
<hr />
<p>We welcome your feedback and questions via our <a href="http://gitter.im/frictionlessdata/chat">Frictionless Data Gitter chat</a> or via <a href="https://github.com/frictionlessdata/datapackage-go/issues">GitHub issues</a> on the datapackage-go repository.</p>
Daniel Fireman
Frictionless Data Lib - A Design Pattern for Accessing Files and Datasets
2018-02-15T00:00:00+00:00
http://okfnlabs.org/blog/2018/02/15/design-pattern-for-a-core-data-library
<p>This document outlines a simple design pattern for a “core” data library, referred to here as <code class="language-plaintext highlighter-rouge">data</code>.</p>
<p>The pattern is focused on access and use of:</p>
<ul>
<li>individual files (streams)</li>
<li>collections of files (“datasets”)</li>
</ul>
<p>Its primary operation is <code class="language-plaintext highlighter-rouge">open</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>file = open('path/to/file.csv')
dataset = open('path/to/files/')
</code></pre></div></div>
<p>It defines a standardized “stream-plus-metadata” interface for file and dataset objects, along with methods for creating these from file or dataset pointers such as file paths or urls.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>file = open('path/to/file.csv')
file.stream()
file.rows()
file.descriptor
file.descriptor.path
...
</code></pre></div></div>
<p>This pattern derives from many years’ experience working on data tools and projects like <a href="https://frictionlessdata.io/">Frictionless Data</a>. Specifically:</p>
<ul>
<li><strong>Data “plus”</strong>: when you work with data you always find yourself needing the data itself plus a little bit more – things like where the data came from on disk (or is going to), or how large it is. This pattern gives you that information in a standardized way.</li>
<li><strong>Streams (and strings)</strong>: streams are the standard way to access data (though strings are useful too) and you should get the same interface whether you’ve loaded data from online, on disk or inline; and, finally, we want both raw byte streams <em>and</em> (for tabular data) object/row streams aka iterators.</li>
<li><strong>Building blocks</strong>: most data wrangling, even in simple cases, involves building data processing pipelines. Pipelines need a standard stream-plus-metadata interface to pass data between steps. For example, suppose you want to load a CSV file, convert it to JSON and write it to stdout: that’s already three steps (load, convert, dump). Then suppose you want to delete the first 3 rows and the 2nd column: now you have a more complex processing pipeline (see the sketch after figure 1 below).</li>
</ul>
<!--
```mermaid
graph TD
loader -- file pointer/stream + metadata -> op1
op1 -- file pointer/stream + metadata -> op2
op2 -- file pointer/stream + metadata -> writer
```
-->
<p><img src="/img/frictionless-data-lib-data-pipeline-20180215.png" alt="" style="width: 220px; display: block; margin: auto;" /></p>
<p style="text-align: center; font-style: italic">Fig 1: data pipelines and the stream-plus-metadata pattern</p>
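<p>To make the pipeline idea concrete, here is a minimal illustrative sketch in Go (deliberately not the data.js API): load an inline CSV, drop a row as the transform step, and dump JSON to stdout.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch of a three-step pipeline: load, transform, dump.
// The inline CSV string stands in for a loader step.
package main

import (
	"encoding/csv"
	"encoding/json"
	"os"
	"strings"
)

func main() {
	src := "name,size\na,100\nb,200\nc,300\n"
	rows, err := csv.NewReader(strings.NewReader(src)).ReadAll() // load
	if err != nil {
		panic(err)
	}
	header, data := rows[0], rows[1:]
	data = data[1:] // transform: drop the first data row
	var out []map[string]string
	for _, r := range data { // key each row by column name
		keyed := map[string]string{}
		for i, h := range header {
			keyed[h] = r[i]
		}
		out = append(out, keyed)
	}
	json.NewEncoder(os.Stdout).Encode(out) // dump to stdout
}
</code></pre></div></div>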
<p>The pattern leverages the Frictionless Data specs including <a href="https://frictionlessdata.io/specs/data-resource/">Data Resource</a>, <a href="https://frictionlessdata.io/specs/table-schema/">Table Schema</a> and <a href="https://frictionlessdata.io/specs/data-package/">Data Package</a>. But it keeps metadata implicit rather than explicit and focuses on giving users the simplest most direct interface possible (put most crudely: <code class="language-plaintext highlighter-rouge">open</code> then <code class="language-plaintext highlighter-rouge">stream</code>). You can find more about the connection with the <a href="https://frictionlessdata.io/">Frictionless Data</a> tooling in the appendix.</p>
<p>Finally, we already have one working implementation of the pattern in JavaScript:</p>
<p><a href="https://github.com/datahq/data.js">https://github.com/datahq/data.js</a></p>
<p>Work on a Python implementation is underway (most of the code is already there in the Python Data Package libraries).</p>
<p><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#overview-of-the-pattern" id="markdown-toc-overview-of-the-pattern">Overview of the Pattern</a></li>
<li><a href="#the-pattern-in-detail" id="markdown-toc-the-pattern-in-detail">The Pattern in Detail</a> <ul>
<li><a href="#open-method" id="markdown-toc-open-method"><code class="language-plaintext highlighter-rouge">open</code> method</a> <ul>
<li><a href="#file-locators" id="markdown-toc-file-locators">File locators</a></li>
</ul>
</li>
<li><a href="#file" id="markdown-toc-file">File</a> <ul>
<li><a href="#metadata-descriptor" id="markdown-toc-metadata-descriptor">Metadata: <code class="language-plaintext highlighter-rouge">descriptor</code></a></li>
<li><a href="#accessing-data" id="markdown-toc-accessing-data">Accessing data</a> <ul>
<li><a href="#stream" id="markdown-toc-stream"><code class="language-plaintext highlighter-rouge">stream</code></a></li>
<li><a href="#rows" id="markdown-toc-rows"><code class="language-plaintext highlighter-rouge">rows</code></a> <ul>
<li><a href="#support-for-tableschema-and-csv-dialect" id="markdown-toc-support-for-tableschema-and-csv-dialect">Support for TableSchema and CSV Dialect</a></li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li><a href="#dataset" id="markdown-toc-dataset">Dataset</a> <ul>
<li><a href="#open-for-datasets" id="markdown-toc-open-for-datasets"><code class="language-plaintext highlighter-rouge">open</code> for datasets</a> <ul>
<li><a href="#dataset-locators" id="markdown-toc-dataset-locators">Dataset Locators</a></li>
</ul>
</li>
<li><a href="#descriptor" id="markdown-toc-descriptor"><code class="language-plaintext highlighter-rouge">descriptor</code></a></li>
<li><a href="#identifier-optional" id="markdown-toc-identifier-optional"><code class="language-plaintext highlighter-rouge">identifier</code> (optional)</a></li>
<li><a href="#readme" id="markdown-toc-readme">README</a></li>
<li><a href="#files" id="markdown-toc-files"><code class="language-plaintext highlighter-rouge">files</code></a></li>
<li><a href="#addfile" id="markdown-toc-addfile">addFile</a></li>
</ul>
</li>
<li><a href="#operators" id="markdown-toc-operators">Operators</a></li>
</ul>
</li>
<li><a href="#conclusion" id="markdown-toc-conclusion">Conclusion</a></li>
<li><a href="#appendix-why-we-need-a-pattern-like-this" id="markdown-toc-appendix-why-we-need-a-pattern-like-this">Appendix: Why we need a pattern like this</a> <ul>
<li><a href="#all-data-wrangling-tools-need-to-load-and-then-pass-around-file-like-objects-as-they-process-data" id="markdown-toc-all-data-wrangling-tools-need-to-load-and-then-pass-around-file-like-objects-as-they-process-data">All data wrangling tools need to load and then pass around “file-like objects” as they process data</a></li>
<li><a href="#a-file-is-more-than-a-byte-stream-the-stream-may-be-structured-and-there-is-usually-the-need-for-associated-metadata" id="markdown-toc-a-file-is-more-than-a-byte-stream-the-stream-may-be-structured-and-there-is-usually-the-need-for-associated-metadata">A file is more than a byte stream: the stream may be structured and there is usually the need for associated metadata</a></li>
<li><a href="#tool-authors-find-themselves-inventing-their-own-stream-plus-metadata-objects--but-they-are-all-different" id="markdown-toc-tool-authors-find-themselves-inventing-their-own-stream-plus-metadata-objects--but-they-are-all-different">Tool authors find themselves inventing their own “stream-plus-metadata” objects … but they are all different</a></li>
<li><a href="#plus-many-tools-also-need-to-access-collection-of-files-ie-datasets" id="markdown-toc-plus-many-tools-also-need-to-access-collection-of-files-ie-datasets">Plus, many tools also need to access collection of files, i.e. datasets</a></li>
<li><a href="#having-a-common-api-pattern-for-files-stream-plus-metadata-and-datasets-would-reduce-duplication-and-support-plug-and-play-with-tooling" id="markdown-toc-having-a-common-api-pattern-for-files-stream-plus-metadata-and-datasets-would-reduce-duplication-and-support-plug-and-play-with-tooling">Having a common API pattern for files (stream-plus-metadata) and datasets would reduce duplication and support plug and play with tooling</a></li>
</ul>
</li>
<li><a href="#appendix-design-principles" id="markdown-toc-appendix-design-principles">Appendix: Design Principles</a> <ul>
<li><a href="#orient-to-the-data-wrangler-workflow" id="markdown-toc-orient-to-the-data-wrangler-workflow">Orient to the data wrangler workflow</a></li>
<li><a href="#zen---maximum-viable-simplicity" id="markdown-toc-zen---maximum-viable-simplicity">Zen - maximum viable simplicity</a> <ul>
<li><a href="#core-objects-should-be-kept-as-simple-as-possible-and-no-simpler" id="markdown-toc-core-objects-should-be-kept-as-simple-as-possible-and-no-simpler">Core objects should be kept as simple as possible (and no simpler)</a></li>
</ul>
</li>
<li><a href="#use-streams" id="markdown-toc-use-streams">Use Streams</a></li>
</ul>
</li>
<li><a href="#appendix-internal-library-structure-suggestions" id="markdown-toc-appendix-internal-library-structure-suggestions">Appendix: Internal Library Structure Suggestions</a> <ul>
<li><a href="#library-components" id="markdown-toc-library-components">Library Components</a></li>
<li><a href="#streams" id="markdown-toc-streams">Streams</a></li>
<li><a href="#loadersparsers-and-writers" id="markdown-toc-loadersparsers-and-writers">Loaders/Parsers and Writers</a></li>
</ul>
</li>
<li><a href="#appendix-api-with-data-package-terminology" id="markdown-toc-appendix-api-with-data-package-terminology">Appendix: API with Data Package terminology</a></li>
<li><a href="#appendix-connection-with-frictionless-data" id="markdown-toc-appendix-connection-with-frictionless-data">Appendix: Connection with Frictionless Data</a> <ul>
<li><a href="#recommendations-for-frictionless-data-community" id="markdown-toc-recommendations-for-frictionless-data-community">Recommendations for Frictionless Data community</a></li>
<li><a href="#why-do-it" id="markdown-toc-why-do-it">Why do it?</a></li>
<li><a href="#relation-to-data-packages" id="markdown-toc-relation-to-data-packages">Relation to Data Packages</a></li>
</ul>
</li>
</ul>
<h1 id="overview-of-the-pattern">Overview of the Pattern</h1>
<p>The pattern is based on the following principles:</p>
<ul>
<li>Data wrangler focused: focus on the core data wrangler workflow: open a file and do something with it</li>
<li>Zen-like: Simplicity and power. As simple as possible: does just what it needs and no more.</li>
<li>Use Streams: a stream-focused library, including object streams (aka iterators).</li>
</ul>
<p>A minimal viable interface for the file case:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// this example uses javascript but the example is generic</span>
<span class="c1">// data.js is just an illustrative name for the library</span>
<span class="kd">const</span> <span class="nx">data</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">data.js</span><span class="dl">'</span><span class="p">)</span>
<span class="c1">// path can be local or remote</span>
<span class="c1">// file is now a data.File object</span>
<span class="kd">const</span> <span class="nx">file</span> <span class="o">=</span> <span class="nx">data</span><span class="p">.</span><span class="nx">open</span><span class="p">(</span><span class="nx">pathOrUrl</span><span class="p">)</span>
<span class="c1">// a byte stream</span>
<span class="nx">file</span><span class="p">.</span><span class="nx">stream</span><span class="p">()</span>
<span class="c1">// if this file is tabular this will give me a row stream (iterator)</span>
<span class="nx">file</span><span class="p">.</span><span class="nx">rows</span><span class="p">()</span>
<span class="c1">// descriptor for this file including info like size (if available)</span>
<span class="c1">// the descriptor follows the Data Resource specification</span>
<span class="c1">// (and if Tabular the Tabular Data Resource spec)</span>
<span class="nx">file</span><span class="p">.</span><span class="nx">descriptor</span>
</code></pre></div></div>
<p>For datasets:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// path or url to a directory (or datapackage.json)</span>
<span class="c1">// dataset is a data.Dataset object</span>
<span class="c1">// note: may rename to openDataset if need to disambiguate from open(file)</span>
<span class="kd">const</span> <span class="nx">dataset</span> <span class="o">=</span> <span class="nx">open</span><span class="p">(</span><span class="nx">pathOrUrl</span><span class="p">)</span>
<span class="c1">// list of files</span>
<span class="nx">dataset</span><span class="p">.</span><span class="nx">files</span>
<span class="c1">// readme (if README.md existed)</span>
<span class="nx">dataset</span><span class="p">.</span><span class="nx">readme</span>
<span class="c1">// any metadata (either inferred or from datapackage.json)</span>
<span class="c1">// this follows the Data Package spec</span>
<span class="nx">dataset</span><span class="p">.</span><span class="nx">descriptor</span>
</code></pre></div></div>
<p>These interfaces can then form the standard basis for lots of additional functionality e.g.</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">infer</span><span class="p">(</span><span class="nx">file</span><span class="p">)</span> <span class="o">=></span> <span class="nx">inferred</span> <span class="nx">tableschema</span> <span class="p">(</span><span class="nx">and</span> <span class="nx">types</span><span class="p">)</span> <span class="k">for</span> <span class="nx">the</span> <span class="nx">columns</span>
<span class="nx">writer</span><span class="p">(</span><span class="nx">file</span><span class="p">)</span> <span class="o">=></span> <span class="nx">stream</span> <span class="p">(</span><span class="k">for</span> <span class="nx">saving</span> <span class="nx">to</span> <span class="nx">disk</span><span class="p">)</span>
<span class="nx">validate</span><span class="p">(</span><span class="nx">file</span><span class="p">)</span> <span class="o">=></span> <span class="nx">validate</span> <span class="nx">a</span> <span class="nx">file</span> <span class="p">(</span><span class="nx">assumes</span> <span class="nx">it</span> <span class="nx">has</span> <span class="nx">a</span> <span class="nx">tableschema</span><span class="p">)</span>
</code></pre></div></div>
<p><em>NOTE: here we have used <code class="language-plaintext highlighter-rouge">file</code> and <code class="language-plaintext highlighter-rouge">dataset</code> terminology. If you are more familiar with the package and resource of the Frictionless Data specs please mentally substitute file => resource and dataset => package.</em></p>
<h1 id="the-pattern-in-detail">The Pattern in Detail</h1>
<p><strong>Note: Support for Datasets is optional.</strong> Supporting datasets is an added layer of complexity and some implementors MAY choose to support files only. If so, they MUST indicate this clearly.</p>
<h2 id="open-method"><code class="language-plaintext highlighter-rouge">open</code> method</h2>
<p>The library MUST provide a method <code class="language-plaintext highlighter-rouge">open</code> which takes a locator to a file and returns a File object:</p>
<pre><code class="language-javascript">open(path/to/file.csv, [options]) => File object
</code></pre>
<p><code class="language-plaintext highlighter-rouge">options</code> is a dictionary (or keyword-argument list) of options. The library MUST support an option <code class="language-plaintext highlighter-rouge">basePath</code>. <code class="language-plaintext highlighter-rouge">basePath</code> is for cases where you want to create a File with a path that is relative to a base directory / path e.g.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>file = open('data.csv', {basePath: '/my/base/path'})
</code></pre></div></div>
<p>will open the file <code class="language-plaintext highlighter-rouge">/my/base/path/data.csv</code>.</p>
<p>This functionality is mainly useful when using Files as part of Datasets where it can be convenient for a File to have a path relative to the directory of the Dataset. (See also Data Package and Data Resource in the Frictionless Data specs).</p>
<h3 id="file-locators">File locators</h3>
<p>Locators can be:</p>
<ul>
<li>A file path</li>
<li>A URL</li>
<li>Raw data in JSON format</li>
<li>A Data Resource (in native language structure)</li>
</ul>
<p>Implementors MUST support file paths, SHOULD support URLs and MAY support the last two.</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">file</span> <span class="o">=</span> <span class="nx">open</span><span class="p">(</span><span class="dl">'</span><span class="s1">/path/to/file.csv</span><span class="dl">'</span><span class="p">)</span>
<span class="nx">file</span> <span class="o">=</span> <span class="nx">open</span><span class="p">(</span><span class="dl">'</span><span class="s1">https://example.com/data.xls</span><span class="dl">'</span><span class="p">)</span>
<span class="c1">// loading raw data</span>
<span class="nx">file</span> <span class="o">=</span> <span class="nx">open</span><span class="p">({</span>
<span class="na">name</span><span class="p">:</span> <span class="dl">'</span><span class="s1">mydata</span><span class="dl">'</span><span class="p">,</span>
<span class="na">data</span><span class="p">:</span> <span class="p">{</span> <span class="c1">// can be any javascript - an object, an array or a string or ...</span>
<span class="na">a</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
<span class="na">b</span><span class="p">:</span> <span class="mi">2</span>
<span class="p">}</span>
<span class="p">})</span>
<span class="c1">// Loading with a descriptor - this allows more fine-grained configuration</span>
<span class="c1">// The descriptor should follow the Frictionless Data Resource model</span>
<span class="c1">// http://specs.frictionlessdata.io/data-resource/</span>
<span class="nx">file</span> <span class="o">=</span> <span class="nx">open</span><span class="p">({</span>
<span class="c1">// file or url path</span>
<span class="na">path</span><span class="p">:</span> <span class="dl">'</span><span class="s1">https://example.com/data.csv</span><span class="dl">'</span><span class="p">,</span>
<span class="c1">// a Table Schema - https://specs.frictionlessdata.io/table-schema/</span>
<span class="na">schema</span><span class="p">:</span> <span class="p">{</span>
<span class="na">fields</span><span class="p">:</span> <span class="p">[</span>
<span class="p">...</span>
<span class="p">]</span>
<span class="p">}</span>
<span class="c1">// CSV dialect - https://specs.frictionlessdata.io/csv-dialect/</span>
<span class="nl">dialect</span><span class="p">:</span> <span class="p">{</span>
<span class="c1">// this is tab separated CSV/DSV</span>
<span class="na">delimiter</span><span class="p">:</span> <span class="dl">'</span><span class="se">\\</span><span class="s1">t</span><span class="dl">'</span>
<span class="p">}</span>
<span class="p">})</span>
</code></pre></div></div>
<h2 id="file">File</h2>
<p>The File instance MUST have the following properties and methods</p>
<h3 id="metadata-descriptor">Metadata: <code class="language-plaintext highlighter-rouge">descriptor</code></h3>
<p>Main metadata is available via the <code class="language-plaintext highlighter-rouge">descriptor</code>:</p>
<pre><code class="language-javascript">file.descriptor
</code></pre>
<p>The descriptor follows the Frictionless Data <a href="https://frictionlessdata.io/specs/data-resource/">Data Resource</a> spec.</p>
<p>The descriptor metadata is a combination of the metadata passed in at File creation (if you created the File with a descriptor object) and auto-inferred information from the File path. This is the info that SHOULD be auto-inferred:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>path: path this was instantiated with - may not be same as file.path (depending on basePath)
pathType: remote | local
name: file name (without extension)
format: the extension
mediatype: mimetype based on file name and extension
</code></pre></div></div>
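<p>As a rough illustration (the inference rules below are assumptions for the sketch, not mandated by any spec), such auto-inference could look like:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch: infer descriptor fields from a file locator, in the
// spirit of the list above.
package main

import (
	"fmt"
	"mime"
	"path/filepath"
	"strings"
)

func inferDescriptor(path string) map[string]string {
	ext := filepath.Ext(path) // e.g. ".csv"
	pathType := "local"
	if strings.HasPrefix(path, "http://") || strings.HasPrefix(path, "https://") {
		pathType = "remote"
	}
	return map[string]string{
		"path":     path,
		"pathType": pathType,
		"name":     strings.TrimSuffix(filepath.Base(path), ext),
		"format":   strings.TrimPrefix(ext, "."),
		// may be empty when the extension is unknown to the mime package
		"mediatype": mime.TypeByExtension(ext),
	}
}

func main() {
	fmt.Println(inferDescriptor("https://example.com/data.csv"))
}
</code></pre></div></div>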
<p>In addition to this metadata there are certain properties which MAY be computed on demand and SHOULD be available as getters on the file object:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// the full path to the file (using basepath)</span>
<span class="kd">const</span> <span class="nx">path</span> <span class="o">=</span> <span class="nx">file</span><span class="p">.</span><span class="nx">path</span>
<span class="kd">const</span> <span class="nx">size</span> <span class="o">=</span> <span class="nx">file</span><span class="p">.</span><span class="nx">size</span>
<span class="c1">// md5 hash of the file</span>
<span class="kd">const</span> <span class="nx">hash</span> <span class="o">=</span> <span class="nx">file</span><span class="p">.</span><span class="nx">hash</span>
<span class="c1">// file encoding</span>
<span class="kd">const</span> <span class="nx">encoding</span> <span class="o">=</span> <span class="nx">file</span><span class="p">.</span><span class="nx">encoding</span>
</code></pre></div></div>
<p><strong>Note</strong>: size and hash are not available for remote Files (those created from URLs).</p>
<h3 id="accessing-data">Accessing data</h3>
<p>Accessing data in the file:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// byte stream</span>
<span class="nx">file</span><span class="p">.</span><span class="nx">stream</span><span class="p">()</span>
<span class="c1">// if file is tabular</span>
<span class="c1">// crude rows - no type casting etc</span>
<span class="nx">file</span><span class="p">.</span><span class="nx">rows</span><span class="p">(</span><span class="nx">cast</span><span class="o">=</span><span class="nx">False</span><span class="p">,</span> <span class="nx">keyed</span><span class="o">=</span><span class="nx">False</span><span class="p">,</span> <span class="p">...)</span>
<span class="c1">// entire file as a buffer/string (be careful with large files!)</span>
<span class="nx">file</span><span class="p">.</span><span class="nx">buffer</span><span class="p">()</span>
<span class="c1">// (optional)</span>
<span class="c1">// if tabular return entire set of rows as an array</span>
<span class="nx">file</span><span class="p">.</span><span class="nx">array</span><span class="p">()</span>
<span class="c1">// EXPERIMENTAL</span>
<span class="c1">// file object packed into a stream</span>
<span class="c1">// metadata is first line (\n separated)</span>
<span class="c1">// motivation: way to send object over single stdin/stdout pipe</span>
<span class="nx">file</span><span class="p">.</span><span class="nx">packed</span><span class="p">()</span>
</code></pre></div></div>
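<p>As a sketch of the experimental <code class="language-plaintext highlighter-rouge">packed</code> format described above (the descriptor as a JSON first line, followed by the raw bytes), assuming <code class="language-plaintext highlighter-rouge">file.stream()</code> is async-iterable as Node streams are:</p>
<pre><code class="language-javascript=">// pack a file object into a single stream: metadata first line, then bytes
async function* packed(file) {
  yield JSON.stringify(file.descriptor) + '\n'  // metadata as the first line
  yield* file.stream()                          // then the raw byte stream
}
</code></pre>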
<h4 id="stream"><code class="language-plaintext highlighter-rouge">stream</code></h4>
<p>A raw byte stream:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>stream()
</code></pre></div></div>
<h4 id="rows"><code class="language-plaintext highlighter-rouge">rows</code></h4>
<p>Get the rows for this file as an object stream / iterator.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>file.rows(cast=False, keyed=False, ...) =>
iterator with items [val1, val2, val3, ...]
</code></pre></div></div>
<ul>
<li><code class="language-plaintext highlighter-rouge">keyed</code>: if <code class="language-plaintext highlighter-rouge">false</code> (default) returns rows as arrays i.e. [val1, val2, val3]. If <code class="language-plaintext highlighter-rouge">true</code> returns rows as objects i.e.. <code class="language-plaintext highlighter-rouge">{col1: val1, col2: val2, ...}</code>.</li>
<li><code class="language-plaintext highlighter-rouge">cast</code>: if <code class="language-plaintext highlighter-rouge">false</code> (default) returns values uncast. If true attempts to cast values either using best-effort or TableSchema if available</li>
<li><code class="language-plaintext highlighter-rouge">addRowNumber</code>: default <code class="language-plaintext highlighter-rouge">false</code>. Add first value or column <code class="language-plaintext highlighter-rouge">_id</code> to resulting rows with row number. [OPTIONAL for implementors]</li>
</ul>
<p><strong>Note:</strong> this method assumes underlying data is tabular. The library SHOULD raise an appropriate error if called on a non-tabular file. It is also up to implementors what tabular formats they support (there are many). At a minimum the library MUST support CSV. It SHOULD support JSON and it MAY (it is desirable) support Excel.</p>
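<p>To illustrate the flags, here is how the same file might read under each mode (a sketch written with a JavaScript-style options object; the exact calling convention is up to the implementation):</p>
<pre><code class="language-javascript=">const file = open('mydata.csv')
// default: uncast rows as arrays of strings
for (const row of file.rows()) {
  // row = ['1', 'H', 'Hydrogen', ...]
}
// keyed and cast: typed objects, using the Table Schema if one was provided
for (const row of file.rows({cast: true, keyed: true})) {
  // row = {'atomic number': 1, symbol: 'H', name: 'Hydrogen', ...}
}
</code></pre>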
<h5 id="support-for-tableschema-and-csv-dialect">Support for TableSchema and CSV Dialect</h5>
<p>The library SHOULD support <a href="https://frictionlessdata.io/specs/table-schema/">Table Schema</a> and <a href="https://frictionlessdata.io/specs/csv-dialect/">CSV Dialect</a> in the <code class="language-plaintext highlighter-rouge">rows</code> method using metadata provided when the file was <code class="language-plaintext highlighter-rouge">open</code>ed:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="c1">// load a CSV with a non-standard dialect e.g. tab separated or semi-colon separated</span>
<span class="nx">file</span> <span class="o">=</span> <span class="nx">open</span><span class="p">({</span>
<span class="na">path</span><span class="p">:</span> <span class="dl">'</span><span class="s1">mydata.tsv</span><span class="dl">'</span>
<span class="c1">// Full support for http://specs.frictionlessdata.io/csv-dialect/</span>
<span class="na">dialect</span><span class="p">:</span> <span class="p">{</span>
<span class="na">delimiter</span><span class="p">:</span> <span class="dl">'</span><span class="se">\\</span><span class="s1">t</span><span class="dl">'</span> <span class="c1">// for tabs or ';' for semi-colons etc</span>
<span class="p">}</span>
<span class="p">})</span>
<span class="nx">file</span><span class="p">.</span><span class="nx">rows</span><span class="p">()</span> <span class="c1">// use the dialect info in parsing the csv</span>
<span class="c1">// open a CSV with a Table Schema</span>
<span class="nx">file</span> <span class="o">=</span> <span class="nx">open</span><span class="p">({</span>
<span class="na">path</span><span class="p">:</span> <span class="dl">'</span><span class="s1">mydata.csv</span><span class="dl">'</span>
<span class="c1">// Full support for Table Schema https://specs.frictionlessdata.io/table-schema/</span>
<span class="na">schema</span><span class="p">:</span> <span class="p">{</span>
<span class="na">fields</span><span class="p">:</span> <span class="p">[</span>
<span class="p">{</span>
<span class="na">name</span><span class="p">:</span> <span class="dl">'</span><span class="s1">Column 1</span><span class="dl">'</span><span class="p">,</span>
<span class="na">type</span><span class="p">:</span> <span class="dl">'</span><span class="s1">integer</span><span class="dl">'</span>
<span class="p">},</span>
<span class="p">...</span>
<span class="p">]</span>
<span class="p">}</span>
<span class="p">})</span>
</code></pre></div></div>
<h2 id="dataset">Dataset</h2>
<p>A collection of data files with optional metadata.</p>
<p>Under the hood it heavily uses <a href="https://frictionlessdata.io/specs/data-package/">Data Package</a> formats and natively supports them, including loading from <code class="language-plaintext highlighter-rouge">datapackage.json</code> files. However, it does not require knowledge or use of Data Packages.</p>
<p>A Dataset has two key properties:</p>
<pre><code class="language-javascript=">// metadata
dataset.descriptor
// files in the dataset
dataset.files
</code></pre>
<h3 id="open-for-datasets"><code class="language-plaintext highlighter-rouge">open</code> for datasets</h3>
<p>The library MUST provide a method <code class="language-plaintext highlighter-rouge">openDataset</code> that takes a locator to a dataset and returns a Dataset object:</p>
<pre><code class="language-javascript=">openDataset(path/to/dataset/) => Dataset object
</code></pre>
<p>The library MAY overload the <code class="language-plaintext highlighter-rouge">open</code> method to support datasets as well as files:</p>
<pre><code class="language-javascript=">open(path/to/dataset/) => Dataset object
</code></pre>
<p><em>Note: overloading can be tricky as disambiguating locators for files from locators for datasets is not always trivial.</em></p>
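<p>One possible, purely illustrative heuristic for that disambiguation: treat descriptor objects, directories and <code class="language-plaintext highlighter-rouge">datapackage.json</code> paths as datasets, and everything else as a file (<code class="language-plaintext highlighter-rouge">openFile</code> here is a stand-in for the file-opening path):</p>
<pre><code class="language-javascript=">function looksLikeDataset(locator) {
  if (typeof locator === 'object') return true          // descriptor object
  if (locator.endsWith('/')) return true                // directory
  if (locator.endsWith('datapackage.json')) return true // package descriptor file
  return false
}

function open(locator) {
  return looksLikeDataset(locator) ? openDataset(locator) : openFile(locator)
}
</code></pre>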
<h4 id="dataset-locators">Dataset Locators</h4>
<p><code class="language-plaintext highlighter-rouge">path/to/dataset</code> - can be one of:</p>
<ul>
<li>local path to Dataset</li>
<li>remote url to Dataset</li>
<li>descriptor object (i.e. datapackage.json)</li>
</ul>
<h3 id="descriptor"><code class="language-plaintext highlighter-rouge">descriptor</code></h3>
<p>A Dataset MUST have a <code class="language-plaintext highlighter-rouge">descriptor</code> which holds the Dataset metadata. The descriptor MUST follow the <a href="https://frictionlessdata.io/specs/data-package/">Data Package</a> spec.</p>
<p>The Dataset SHOULD have the convenience attribute <code class="language-plaintext highlighter-rouge">path</code> which is the path (remote or local) to this dataset.</p>
<h3 id="identifier-optional"><code class="language-plaintext highlighter-rouge">identifier</code> (optional)</h3>
<p>A Dataset MAY have an <code class="language-plaintext highlighter-rouge">identifier</code> property that encapsulates the location (or origin) of this Dataset. The identifier property MUST have the following structure:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
name: <name>, // computed from path
owner: <owner>, // may be null
path: <path>, // computed path
type: <type>, // e.g. local, url, github, datahub, ...
original: <path>, // path (file or url) as originally supplied
version: <version> // version as computed
}
</code></pre></div></div>
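<p>For example, a dataset opened from a GitHub URL might parse to something like this (all values hypothetical):</p>
<pre><code class="language-javascript=">{
  name: 'finance-vix',
  owner: 'datasets',
  path: 'https://raw.githubusercontent.com/datasets/finance-vix/master/',
  type: 'github',
  original: 'https://github.com/datasets/finance-vix',
  version: 'master'
}
</code></pre>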
<p><em>Note: the identifier is parsed from the locator passed into the open method. See the Data Package identifier spec https://frictionlessdata.io/specs/data-package-identifier/ and implementation in data.js library https://github.com/datahq/data.js#parsedatasetidentifier</em></p>
<h3 id="readme">README</h3>
<p>The Dataset object MAY support a <code class="language-plaintext highlighter-rouge">readme</code> property which returns a string corresponding to the README for this Dataset (if it exists).</p>
<p>The readme content is taken from the README.md file located in the Dataset root directory or, if that does not exist, from the <code class="language-plaintext highlighter-rouge">readme</code> property on the descriptor. If neither of those exists, the readme will be undefined or null.</p>
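<p>A sketch of that fallback logic for a local dataset, assuming the <code class="language-plaintext highlighter-rouge">path</code> and <code class="language-plaintext highlighter-rouge">descriptor</code> attributes described above (Node-flavoured):</p>
<pre><code class="language-javascript=">const fs = require('fs')
const path = require('path')

// resolve the readme: README.md on disk first, then descriptor.readme, else null
function readme(dataset) {
  const readmePath = path.join(dataset.path, 'README.md')
  if (fs.existsSync(readmePath)) {
    return fs.readFileSync(readmePath, 'utf8')
  }
  return dataset.descriptor.readme || null
}
</code></pre>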
<h3 id="files"><code class="language-plaintext highlighter-rouge">files</code></h3>
<p>A Dataset MUST have a <code class="language-plaintext highlighter-rouge">files</code> property which returns an array of the Files contained in this Dataset:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dataset.files => Array(<File>)
</code></pre></div></div>
<h3 id="addfile">addFile</h3>
<p>The library SHOULD implement an <code class="language-plaintext highlighter-rouge">addFile</code> method to add a <code class="language-plaintext highlighter-rouge">File</code> to a Dataset:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dataset.addFile(file)
</code></pre></div></div>
<ul>
<li><code class="language-plaintext highlighter-rouge">file</code>: an already instantiated File object or a File descriptor</li>
</ul>
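<p>Usage might look like this (the descriptor form shown is an assumption consistent with the Data Resource spec):</p>
<pre><code class="language-javascript=">const dataset = openDataset('path/to/dataset/')
// add an already instantiated File ...
dataset.addFile(open('extra.csv'))
// ... or a File descriptor
dataset.addFile({path: 'extra.csv', format: 'csv'})
</code></pre>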
<h2 id="operators">Operators</h2>
<p>Finally, we discuss some operators. These SHOULD NOT be part of the core library, but it is useful to be aware of them (a combined sketch follows the list):</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">infer(file) => TableSchema</code>: infer the <a href="https://frictionlessdata.io/specs/table-schema/">Table Schema</a> for a CSV file or other tabular file
<ul>
<li><code class="language-plaintext highlighter-rouge">inferStructure(file)</code>: infer the structure i.e. <a href="https://frictionlessdata.io/specs/csv-dialect/">CSV Dialect</a> of a CSV or other tabular file. In addition to CSV dialect properties this may include things like <code class="language-plaintext highlighter-rouge">skipRows</code> i.e. number of rows to skip</li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">validate(file/dataset, metadataOnly=False)</code>: validate the data in a file e.g. against its schema
<ul>
<li><code class="language-plaintext highlighter-rouge">metadataOnly</code>: only validate the metadata e.g. against the <a href="https://frictionlessdata.io/specs/data-package/">Data Package</a> or <a href="https://frictionlessdata.io/specs/data-resource/">Data Resource</a> schemas.</li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">write(file/dataset)</code>: write a File or Dataset to disk</li>
</ul>
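<p>Putting the operators together, a typical flow could look like the sketch below (written with a JavaScript-style options object; these operators live outside the core library):</p>
<pre><code class="language-javascript=">const file = open('mydata.csv')
// guess the Table Schema from the data
const schema = infer(file)
// check the data against the (inferred or supplied) schema
validate(file)
// or only check the metadata against the Data Resource schema
validate(file, {metadataOnly: true})
// persist the file to disk
write(file)
</code></pre>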
<h1 id="conclusion">Conclusion</h1>
<p>In this document we’ve outlined a “Frictionless Data Library” pattern that standardizes the design of a “core” data library API focused on accessing files and datasets.</p>
<p>Almost all data wrangling work involves opening data streams and passing them between processes. Standardizing the API would have major benefits for tool creators and users, making it quicker and easier to develop tooling as well as making tooling more “plug and play”.</p>
<h1 id="appendix-why-we-need-a-pattern-like-this">Appendix: Why we need a pattern like this</h1>
<h2 id="all-data-wrangling-tools-need-to-load-and-then-pass-around-file-like-objects-as-they-process-data">All data wrangling tools need to load and then pass around “file-like objects” as they process data</h2>
<p>All data tools need to access files/streams:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data = open(path/to/file.csv)
</code></pre></div></div>
<p>And so every programming language and every tool has a method for opening a file path and returning a byte stream.</p>
<p>But …</p>
<h2 id="a-file-is-more-than-a-byte-stream-the-stream-may-be-structured-and-there-is-usually-the-need-for-associated-metadata">A file is more than a byte stream: the stream may be structured and there is usually the need for associated metadata</h2>
<p>Often we need more than just a byte stream, for example:</p>
<ul>
<li>We may want the stream to be structured: if it is a CSV file we’re opening we’d like to get a stream of row objects not just a stream of bytes</li>
<li>We may want file metadata to be available (where did the file come from, how big is the file, when was it last modified)</li>
<li>We may want schema information: not just the CSV file but type information on its columns (this would allow us to reliably cast the CSV data to proper types when reading)</li>
<li>And we may even want to add metadata ourselves (perhaps automatically), for example guessing the types of the columns in a CSV</li>
</ul>
<p><em>A file is more than a byte stream: the stream may be structured and there is usually the need for associated metadata, at a minimum the name and size of the file but also extending to things like a file schema.</em></p>
<h2 id="tool-authors-find-themselves-inventing-their-own-stream-plus-metadata-objects--but-they-are-all-different">Tool authors find themselves inventing their own “stream-plus-metadata” objects … but they are all different</h2>
<p>Tool authors find themselves inventing their own file-like “stream-plus-metadata” objects to describe the files they open.</p>
<p><em>Note: Many languages have a “file-like” object that usually consists of a stream plus some metadata (e.g. the Python <code class="language-plaintext highlighter-rouge">file</code> object, Node Streams etc). But this is not standardized and is often inadequate, so tool makers end up wrapping or replacing it.</em></p>
<p>This is not just about opening files but about passing streams around, because most tools, even very simple ones, start to contain implicit mini data pipelines:</p>
<!--
```mermaid
graph LR
file[File on Disk] --"open"-> fileobj[Stream / File-like Object]
fileobj --parse-> strstream[Structured Stream]
strstream -.-> other[More ...]
```
-->
<p><img src="/img/frictionless-data-lib-streams-20180215.png" alt="" style="width: 600px; display: block; margin: auto;" /></p>
<p>These stream-plus-metadata objects contain implicit mini-metadata standards for describing files and collections of files (“datasets”). These mini-metadata standards look like <a href="https://frictionlessdata.io/specs/data-resource/">Data Resource</a>, <a href="https://frictionlessdata.io/specs/table-schema/">Table Schema</a>, <a href="https://frictionlessdata.io/specs/data-package/">Data Package</a> etc.</p>
<p>But these stream-plus-metadata objects and their mini-metadata are all a little different across the various languages and tools.</p>
<h2 id="plus-many-tools-also-need-to-access-collection-of-files-ie-datasets">Plus, many tools also need to access collection of files, i.e. datasets</h2>
<p>Many tools want to access collections of files e.g. datasets:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dataset = open(/path/to/dataset)
</code></pre></div></div>
<p>Datasets already require some structure to list their collection of files and usually require some additional metadata ranging from where the dataset was loaded from to items such as its license.</p>
<p>You can even have datasets without multiple files when the file you are using is implicitly a dataset. For example, an Excel file is really a dataset if you think of each sheet as a separate file stream, or think of an SQLite database.</p>
<h2 id="having-a-common-api-pattern-for-files-stream-plus-metadata-and-datasets-would-reduce-duplication-and-support-plug-and-play-with-tooling">Having a common API pattern for files (stream-plus-metadata) and datasets would reduce duplication and support plug and play with tooling</h2>
<p>Standardizing the structure of these stream-plus-metadata file objects (and dataset objects), and building standard libraries to create them from file/dataset pointers would:</p>
<ul>
<li>Reduce repetition / allow for reuse across tools: at present, data wrangling tools write this themselves. With a standard they would have a common pattern and may even be able to use a common underlying library.</li>
<li>Support plug and play: new wrangling tools can operate on these standard file and dataset objects. For example, an inference library that given a file object returns an inferred schema, or a converter that converts xls => csv.</li>
</ul>
<h1 id="appendix-design-principles">Appendix: Design Principles</h1>
<p>The pattern is based on the following principles:</p>
<ul>
<li>Data wrangler focused: focus on the core data wrangler workflow: open a file and do something with it</li>
<li>Zen-like: Simplicity and power. As simple as possible: does just what it needs and no more.</li>
<li>Use Streams: stream focused library, including object streams.</li>
</ul>
<h2 id="orient-to-the-data-wrangler-workflow">Orient to the data wrangler workflow</h2>
<p><em>See motivation section above</em></p>
<ul>
<li>Open => Read / Stream</li>
<li>[optional] Inspect</li>
<li>Check</li>
<li>Operate on</li>
<li>Write</li>
</ul>
<h2 id="zen---maximum-viable-simplicity">Zen - maximum viable simplicity</h2>
<p>As simple as possible. Does just what it needs and no more. Simple and powerful.</p>
<p>Zen =></p>
<ul>
<li>“thin” (vs fat) objects: all complex operators such as infer or dump operate <em>on</em> objects rather than becoming part of them</li>
<li>a single open method to get data (file or dataset)</li>
<li>hide metadata by default (data package, data resource etc are in the background)</li>
</ul>
<h3 id="core-objects-should-be-kept-as-simple-as-possible-and-no-simpler">Core objects should be kept as simple as possible (and no simpler)</h3>
<p>=> Inversion of control where possible so that we don’t end up with “fat” core classes e.g.</p>
<p>A. Saving data to disk should be done by separate objects that operate on the main objects rather than being built into them, e.g.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>const writer = CSVWriter()
writer.write(dataLibFileObjectInstance, filePath, [options])
</code></pre></div></div>
<p>rather than e.g.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dataLibFileObjectInstance.saveToCsv(filePath)
</code></pre></div></div>
<p><em>If there is a simple way to invert the dependency (i.e. not have all the different dumpers in the main lib) but still have a simple save method, that would be fine.</em></p>
<p>B. Similarly for parsers (though reading is so essential that read needs to be part of the class)</p>
<p>C. infer, validate etc should operate <em>on</em> Files rather than be part of them …</p>
<pre><code class="language-javascript=">const tableschema = infer(fileObj)
</code></pre>
<p>Rather than</p>
<pre><code class="language-javascript=">fileObj.inferSchema()
</code></pre>
<h2 id="use-streams">Use Streams</h2>
<p>Streams are the natural way to handle data, and they scale to large datasets.</p>
<p>The library should be stream-focused, including support for object streams.</p>
<h1 id="appendix-internal-library-structure-suggestions">Appendix: Internal Library Structure Suggestions</h1>
<p><em>These are some suggestions for how implementors could structure their library internally. They are entirely optional.</em></p>
<h2 id="library-components">Library Components</h2>
<p>In the top-level library, just have Dataset and File (+ TabularFile):</p>
<pre><code class="language-mermaid">graph TD
Dataset[Dataset/Package] --> File[File/Resource]
File --> TabularFile
TabularFile -.-> TableSchema
TableSchema -.-> Field
parsers((Parsers))
dumpers((Writers))
tools((Tools))
tools --> infer
infer --> validate
classDef medium fill:lightblue,stroke:#333,stroke-width:4px;
</code></pre>
<pre><code class="language-mermaid">graph TD
parsers((Parsers))
dumpers((Writers))
subgraph Parsers - Tabular
csv["CSV parse(resource) -> row stream"]
xls["XLS ..."]
end
subgraph Writers - Tabular
ascii
csvdump[CSV]
xlsdump[XLS]
markdown
end
parsers --> csv
parsers --> xls
dumpers --> ascii
dumpers --> markdown
dumpers --> xlsdump
dumpers --> csvdump
</code></pre>
<h2 id="streams">Streams</h2>
<pre><code class="language-mermaid">graph LR
in1[File,URL,Stream] -- stream--> stream[Byte Stream + Meta]
stream --parse--> objstream[Obj Stream+Meta]
objstream --unparse--> stream2[Byte Stream + Meta]
stream --write--> out
stream2 --writer--> out[file/stream]
</code></pre>
<p>open => yields descriptor and file stream
parse => yields file rows (internally uses parsers)
writer => writes out to a file/stream (internally uses writers)</p>
<pre><code class="language-javascript=">// aka write
writer(File) => readable stream
parser(File) => object stream
</code></pre>
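<p>As a sketch, the whole pipeline can be wired together from these pieces (the <code class="language-plaintext highlighter-rouge">csv.parse</code> and <code class="language-plaintext highlighter-rouge">csv.unparse</code> names are illustrative, matching the diagram above; the output is assumed to be a Node stream):</p>
<pre><code class="language-javascript=">const file = open('mydata.csv')  // descriptor + byte stream
const rows = csv.parse(file)     // parse: byte stream to object stream of rows
const out = csv.unparse(rows)    // unparse: object stream back to a byte stream
out.pipe(process.stdout)         // write to a destination
</code></pre>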
<h2 id="loadersparsers-and-writers">Loaders/Parsers and Writers</h2>
<p>Loaders/Parsers and Writers should be an extensible list.</p>
<p>Inversion of control is important: the core library does <strong>not</strong> depend directly on parsers (that way we can hot swap and/or extend the list at runtime).</p>
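<p>A minimal sketch of such a runtime-extensible registry (names are illustrative; <code class="language-plaintext highlighter-rouge">csvParse</code> would be supplied by a plugin):</p>
<pre><code class="language-javascript=">const parsers = {}

// plugins register themselves; the core library never imports them directly
function registerParser(format, parse) {
  parsers[format] = parse
}

function rows(file) {
  const parse = parsers[file.descriptor.format]
  if (!parse) throw new Error('no parser for format: ' + file.descriptor.format)
  return parse(file)
}

registerParser('csv', csvParse)
</code></pre>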
<p>Parsers:</p>
<pre><code class="language-javascript=">// file is a data.File object
parse(file) => row stream
</code></pre>
<p>Writers are similar:</p>
<pre><code class="language-javascript=">// e.g. csv.js
// dump to CSV file
write(file, path) => null
</code></pre>
<p>Note we may want a writer for datasets as well, e.g. a writer to datapackage.json or to SQL or …</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">write</span><span class="p">(</span><span class="nx">dataset</span><span class="p">,</span> <span class="nx">destination</span> <span class="p">...)</span>
</code></pre></div></div>
<h1 id="appendix-api-with-data-package-terminology">Appendix: API with Data Package terminology</h1>
<p><em>In progress</em></p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// data.js is just an illustrative name for the library</span>
<span class="kd">var</span> <span class="nx">data</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">data.js</span><span class="dl">'</span><span class="p">)</span>
<span class="c1">// path can be local or remote</span>
<span class="kd">const</span> <span class="nx">resource</span> <span class="o">=</span> <span class="nx">data</span><span class="p">.</span><span class="nx">open</span><span class="p">(</span><span class="nx">pathOrUrl</span><span class="p">)</span>
<span class="c1">// a byte stream</span>
<span class="nx">resource</span><span class="p">.</span><span class="nx">stream</span><span class="p">()</span>
<span class="c1">// if this file is tabular this will give me a row stream (iterator)</span>
<span class="nx">resource</span><span class="p">.</span><span class="nx">rows</span><span class="p">()</span>
</code></pre></div></div>
<p>For packages</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// path or url to a directory (or datapackage.json)</span>
<span class="kd">const</span> <span class="kr">package</span> <span class="o">=</span> <span class="nx">data</span><span class="p">.</span><span class="nx">open</span><span class="p">(</span><span class="nx">pathOrUrl</span><span class="p">)</span>
<span class="c1">// list of files</span>
<span class="kr">package</span><span class="p">.</span><span class="nx">resources</span>
<span class="c1">// readme (if README.md exists or there is a description in the metadata)</span>
<span class="kr">package</span><span class="p">.</span><span class="nx">readme</span>
<span class="c1">// any metadata (either inferred or from datapackage.json)</span>
<span class="kr">package</span><span class="p">.</span><span class="nx">descriptor</span>
</code></pre></div></div>
<h1 id="appendix-connection-with-frictionless-data">Appendix: Connection with Frictionless Data</h1>
<p>I’ve distilled this pattern out of the work of myself and others who have worked on <a href="https://frictionlessdata.io/">Frictionless Data</a> specs and tooling.</p>
<p>It is motivated by the following observations about the Data Package suite of libraries and their Table Schema, Data Resource and Data Package interfaces:</p>
<ol>
<li>These libraries contain functions and metadata that standardize operations common to almost all data wrangling tools, because almost all data wrangling tools need to handle files/streams and datasets – and the core metadata is designed around describing files and datasets, or inferring and validating that.</li>
<li>BUT: by presenting the underlying metadata such as Data Resource, Data Package front and centre and hiding the common operations (e.g. open this file) they make a rather <strong>unnatural</strong> interface for data wranglers.</li>
<li>Most data wranglers start from an immediate need: display this csv on the command line, convert this excel file to csv etc. At the simplest, most data wrangling tools need some function like <code class="language-plaintext highlighter-rouge">open(file) => file-like object</code> where the file-like object can be used for other tasks</li>
</ol>
<blockquote>
<p><strong>Metaphorically: the current data package libraries put the skeleton (the metadata) “on the outside” and the “flesh” (the actual methods wranglers want to use) on the “inside” (they are implicit or hidden within the overall library)</strong></p>
</blockquote>
<p>What follows from this insight is that we should invert this:</p>
<ul>
<li>“Put the flesh on the outside”: Create a simple interface that addresses the common needs of data wranglers and data wrangler tooling e.g. <code class="language-plaintext highlighter-rouge">open(file)</code></li>
<li>Put the bones on the inside: leverage the Frictionless Data metadata structures but put them on the inside, out of sight but still available if needed.</li>
</ul>
<p><em>Note: it may be appropriate to continue to have a dedicated Data Package or Table Schema library but keep it <strong>really</strong> simple</em></p>
<p>Here’s how I put this in the original issue https://github.com/frictionlessdata/tableschema-js/issues/78:</p>
<blockquote>
<p><strong>People don’t care about Data Packages / Resources, they care about opening a data file and doing something with it</strong></p>
<blockquote>
<p><em>Data Packages / Resources come up because they are a nicely agreed metadata structure for all the stuff that comes up in the <strong>background</strong> when you do that.</em></p>
</blockquote>
<p>Put crudely: Most people are doing stuff with a file (or dataset), and they want to grab it and read it preferably in a structured way e.g. as a row iterator – sometimes inferring or specifying stuff along the way e.g. encoding, formatting, field types.</p>
<p>=> <strong>Our job is to help users to open that file (or dataset) and stream it as quickly as possible.</strong></p>
</blockquote>
<h3 id="recommendations-for-frictionless-data-community">Recommendations for Frictionless Data community</h3>
<p>Suggestions:</p>
<ul>
<li>Users want to do stuff with data fast. This implies that a library like tabulator is more immediately appropriate to end users than data-package or table-schema</li>
<li>The current set of FD libraries is bewildering and confusing, especially for new users. There are several complementary libraries, and some of the API is pretty confusing (see appendix for more on this)</li>
</ul>
<p>Recommendations:</p>
<ul>
<li>Have a primary “gateway” library oriented around reading and writing data and datasets.</li>
<li>This can be based around a simplified Package and Resource interface and library
<ul>
<li>Move auxiliary functionality to separate libraries, e.g. infer</li>
<li>Move parsers / loaders (and writers) to a plugin model so the list can be extended easily</li>
</ul>
</li>
<li>Consider renaming Package and Resource to Dataset and File in the simple library as these are more accessible and common terms</li>
</ul>
<h3 id="why-do-it">Why do it?</h3>
<ul>
<li>Massively grow the potential audience: Create an interface non-DP fanatics can use and want to use (and DP ones too)</li>
<li>Ease of use: easier for us and others to use</li>
<li>Elegance: do it right - this is the elegant, functional, beautiful way to do this library</li>
</ul>
<h3 id="relation-to-data-packages">Relation to Data Packages</h3>
<ul>
<li>We use Data Package and Table Schema as the metadata model for data files and datasets</li>
<li>Data Package libraries already implement APIs a bit like this and support many features we want (e.g. infer)</li>
</ul>
Rufus Pollock
Creating and Using Data Packages in R
2018-02-14T00:00:00+00:00
http://okfnlabs.org/blog/2018/02/14/datapackages-in-r
<p><a href="http://okfn.gr/">Open Knowledge Greece</a> was one of 2017’s <a href="https://toolfund.frictionlessdata.io">Frictionless Data Tool Fund</a> grantees tasked with extending implementation of core Frictionless Data libraries in R programming language. You can read more about this in <a href="https://frictionlessdata.io/articles/open-knowledge-greece/">their grantee profile</a>.</p>
<p>In this post, <a href="https://twitter.com/Kleanthis_k10">Kleanthis Koupidis</a>, a Data Scientist and Statistician at Open Knowledge Greece, explains how to <a href="#creating-data-packages-in-r">create</a> and <a href="#using-data-packages-in-r">use</a> Data Packages in R.</p>
<hr />
<h1 id="creating-data-packages-in-r">Creating Data Packages in R</h1>
<p>This section of the tutorial will show you how to install the R library for working with Data Packages and Table Schema, load a CSV file, infer its schema, and write a Tabular Data Package.</p>
<h2 id="setup">Setup</h2>
<p>For this tutorial, we will need the Data Package R library (<a href="https://github.com/frictionlessdata/datapackage-r">datapackage.r</a>).</p>
<p><a href="https://cran.r-project.org/package=devtools">devtools library</a> is required to install the <code class="language-plaintext highlighter-rouge">datapackage.r</code> library from github.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # Install devtools package if not already
install.packages("devtools")
</code></pre></div></div>
<p>And then install the development version of <a href="https://github.com/frictionlessdata/datapackage-r">datapackage.r</a> from github.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> devtools::install_github("frictionlessdata/datapackage-r")
</code></pre></div></div>
<h2 id="load">Load</h2>
<p>You can start using the library by loading <code class="language-plaintext highlighter-rouge">datapackage.r</code>.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">datapackage.r</span><span class="p">)</span></code></pre></figure>
<p>You can add useful metadata by adding keys to the package descriptor. Below, we are adding the required <code class="language-plaintext highlighter-rouge">name</code> key as well as a human-readable <code class="language-plaintext highlighter-rouge">title</code> key. For the keys supported, please consult the full <a href="https://frictionlessdata.io/specs/data-package/">Data Package spec</a>. Note, we will be creating the required <code class="language-plaintext highlighter-rouge">resources</code> key further down below.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">dataPackage</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Package.load</span><span class="p">()</span><span class="w">
</span><span class="n">dataPackage</span><span class="o">$</span><span class="n">descriptor</span><span class="p">[</span><span class="s1">'name'</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'period-table'</span><span class="w">
</span><span class="n">dataPackage</span><span class="o">$</span><span class="n">descriptor</span><span class="p">[</span><span class="s1">'title'</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Periodic Table'</span><span class="w">
</span><span class="c1"># commit the changes to Package class</span><span class="w">
</span><span class="n">dataPackage</span><span class="o">$</span><span class="n">commit</span><span class="p">()</span><span class="w">
</span><span class="c1">## [1] TRUE</span></code></pre></figure>
<h2 id="infer-a-csv-schema">Infer a CSV Schema</h2>
<p>We will use periodic-table data from a <a href="https://raw.githubusercontent.com/okgreece/datapackage-r/master/vignettes/example%20data/data.csv">remote path</a>:</p>
<table class="table table-striped table-bordered" style="display: table; overflow:auto">
<thead>
<tr>
<th>atomic.number</th>
<th>symbol</th>
<th>name</th>
<th>atomic.mass</th>
<th>metal.or.nonmetal.</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>H</td>
<td>Hydrogen</td>
<td>1.00794</td>
<td>nonmetal</td>
</tr>
<tr>
<td>2</td>
<td>He</td>
<td>Helium</td>
<td>4.002602</td>
<td>noble gas</td>
</tr>
<tr>
<td>3</td>
<td>Li</td>
<td>Lithium</td>
<td>6.941</td>
<td>alkali metal</td>
</tr>
<tr>
<td>4</td>
<td>Be</td>
<td>Beryllium</td>
<td>9.012182</td>
<td>alkaline earth metal</td>
</tr>
<tr>
<td>5</td>
<td>B</td>
<td>Boron</td>
<td>10.811</td>
<td>metalloid</td>
</tr>
<tr>
<td>6</td>
<td>C</td>
<td>Carbon</td>
<td>12.0107</td>
<td>nonmetal</td>
</tr>
<tr>
<td>7</td>
<td>N</td>
<td>Nitrogen</td>
<td>14.0067</td>
<td>nonmetal</td>
</tr>
<tr>
<td>8</td>
<td>O</td>
<td>Oxygen</td>
<td>15.9994</td>
<td>nonmetal</td>
</tr>
<tr>
<td>9</td>
<td>F</td>
<td>Fluorine</td>
<td>18.9984032</td>
<td>halogen</td>
</tr>
<tr>
<td>10</td>
<td>Ne</td>
<td>Neon</td>
<td>20.1797</td>
<td>noble gas</td>
</tr>
</tbody>
</table>
<p>We can guess our CSV’s <a href="https://frictionlessdata.io/guides/table-schema/">schema</a> by using <code class="language-plaintext highlighter-rouge">infer</code> from the Table Schema library. We pass the remote link directly to the infer function, and the result is an inferred schema. For example, if the processor detects only integers in a given column, it will assign <code class="language-plaintext highlighter-rouge">integer</code> as the column type.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">filepath</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'https://raw.githubusercontent.com/okgreece/datapackage-r/master/vignettes/example_data/data.csv'</span><span class="w">
</span><span class="n">schema</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tableschema.r</span><span class="o">::</span><span class="n">infer</span><span class="p">(</span><span class="n">filepath</span><span class="p">)</span></code></pre></figure>
<p>Once we have a schema, we are ready to add a <code class="language-plaintext highlighter-rouge">resources</code> key to the Data Package which points to the resource path and its newly created schema. Below we define resources in three ways: using JSON text, using a plain R list object with the usual assignment operator, and directly using the <code class="language-plaintext highlighter-rouge">addResource</code> function of the <code class="language-plaintext highlighter-rouge">Package</code> class:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="c1"># define resources using json text</span><span class="w">
</span><span class="n">resources</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">helpers.from.json.to.list</span><span class="p">(</span><span class="w">
</span><span class="s1">'[{
"name": "data",
"path": "filepath",
"schema": "schema"
}]'</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">resources</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="o">$</span><span class="n">schema</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">schema</span><span class="w">
</span><span class="n">resources</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="o">$</span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">filepath</span><span class="w">
</span><span class="c1"># or define resources using list object</span><span class="w">
</span><span class="n">resources</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="w">
</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"data"</span><span class="p">,</span><span class="w">
</span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">filepath</span><span class="p">,</span><span class="w">
</span><span class="n">schema</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">schema</span><span class="w">
</span><span class="p">))</span></code></pre></figure>
<p>And now, add resources to the Data Package:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">dataPackage</span><span class="o">$</span><span class="n">descriptor</span><span class="p">[[</span><span class="s1">'resources'</span><span class="p">]]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">resources</span><span class="w">
</span><span class="n">dataPackage</span><span class="o">$</span><span class="n">commit</span><span class="p">()</span><span class="w">
</span><span class="c1">## [1] TRUE</span></code></pre></figure>
<p>Or you can directly add resources using the <code class="language-plaintext highlighter-rouge">addResource</code> function of the <code class="language-plaintext highlighter-rouge">Package</code> class:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">resources</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="w">
</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"data"</span><span class="p">,</span><span class="w">
</span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">filepath</span><span class="p">,</span><span class="w">
</span><span class="n">schema</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">schema</span><span class="w">
</span><span class="p">))</span><span class="w">
</span><span class="n">dataPackage</span><span class="o">$</span><span class="n">addResource</span><span class="p">(</span><span class="n">resources</span><span class="p">)</span></code></pre></figure>
<p>Now we are ready to write our <code class="language-plaintext highlighter-rouge">datapackage.json</code> file to the current working directory.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">dataPackage</span><span class="o">$</span><span class="n">save</span><span class="p">(</span><span class="s1">'example_data'</span><span class="p">)</span></code></pre></figure>
<p>The <code class="language-plaintext highlighter-rouge">datapackage.json</code> (<a href="https://raw.githubusercontent.com/okgreece/datapackage-r/master/vignettes/example_data/package.json">download</a>) is inlined below. Note that atomic number has been correctly inferred as an <code class="language-plaintext highlighter-rouge">integer</code> and atomic mass as a <code class="language-plaintext highlighter-rouge">number</code> (float) while every other column is a <code class="language-plaintext highlighter-rouge">string</code>.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">jsonlite</span><span class="o">::</span><span class="n">prettify</span><span class="p">(</span><span class="n">helpers.from.list.to.json</span><span class="p">(</span><span class="n">dataPackage</span><span class="o">$</span><span class="n">descriptor</span><span class="p">))</span><span class="w">
</span><span class="c1">## {</span><span class="w">
</span><span class="c1">## "profile": "data-package",</span><span class="w">
</span><span class="c1">## "name": "period-table",</span><span class="w">
</span><span class="c1">## "title": "Periodic Table",</span><span class="w">
</span><span class="c1">## "resources": [</span><span class="w">
</span><span class="c1">## {</span><span class="w">
</span><span class="c1">## "name": "data",</span><span class="w">
</span><span class="c1">## "path": "https://raw.githubusercontent.com/okgreece/datapackage-r/master/vignettes/example_data/data.csv",</span><span class="w">
</span><span class="c1">## "schema": {</span><span class="w">
</span><span class="c1">## "fields": [</span><span class="w">
</span><span class="c1">## {</span><span class="w">
</span><span class="c1">## "name": "atomic number",</span><span class="w">
</span><span class="c1">## "type": "integer",</span><span class="w">
</span><span class="c1">## "format": "default"</span><span class="w">
</span><span class="c1">## },</span><span class="w">
</span><span class="c1">## {</span><span class="w">
</span><span class="c1">## "name": "symbol",</span><span class="w">
</span><span class="c1">## "type": "string",</span><span class="w">
</span><span class="c1">## "format": "default"</span><span class="w">
</span><span class="c1">## },</span><span class="w">
</span><span class="c1">## {</span><span class="w">
</span><span class="c1">## "name": "name",</span><span class="w">
</span><span class="c1">## "type": "string",</span><span class="w">
</span><span class="c1">## "format": "default"</span><span class="w">
</span><span class="c1">## },</span><span class="w">
</span><span class="c1">## {</span><span class="w">
</span><span class="c1">## "name": "atomic mass",</span><span class="w">
</span><span class="c1">## "type": "number",</span><span class="w">
</span><span class="c1">## "format": "default"</span><span class="w">
</span><span class="c1">## },</span><span class="w">
</span><span class="c1">## {</span><span class="w">
</span><span class="c1">## "name": "metal or nonmetal?",</span><span class="w">
</span><span class="c1">## "type": "string",</span><span class="w">
</span><span class="c1">## "format": "default"</span><span class="w">
</span><span class="c1">## }</span><span class="w">
</span><span class="c1">## ],</span><span class="w">
</span><span class="c1">## "missingValues": [</span><span class="w">
</span><span class="c1">## ""</span><span class="w">
</span><span class="c1">## ]</span><span class="w">
</span><span class="c1">## },</span><span class="w">
</span><span class="c1">## "profile": "data-resource",</span><span class="w">
</span><span class="c1">## "encoding": "utf-8"</span><span class="w">
</span><span class="c1">## }</span><span class="w">
</span><span class="c1">## ]</span><span class="w">
</span><span class="c1">## }</span><span class="w">
</span><span class="c1">##</span></code></pre></figure>
<h2 id="publishing">Publishing</h2>
<p>Now that you have created your Data Package, you might want to <a href="https://frictionlessdata.io/guides/publish-online/">publish your data online</a> so that you can share it with others.</p>
<hr />
<h1 id="using-data-packages-in-r">Using Data Packages in R</h1>
<p>This section of the tutorial will show you how to install the R libraries for working with Tabular Data Packages and demonstrate a very simple example of loading a Tabular Data Package from the web, pushing it directly into a local SQL database, and sending queries to retrieve results.</p>
<h2 id="setup-1">Setup</h2>
<p>For this tutorial, we will need the Data Package R library (<a href="https://github.com/frictionlessdata/datapackage-r">datapackage.r</a>). <a href="https://cran.r-project.org/package=devtools">Devtools library</a> is also required to install the datapackage.r library from github.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Install devtools package if not already
install.packages("devtools")
</code></pre></div></div>
<p>And then install the development version of <a href="https://github.com/frictionlessdata/datapackage-r">datapackage.r</a> from github.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>devtools::install_github("frictionlessdata/datapackage.r")
</code></pre></div></div>
<h2 id="load-1">Load</h2>
<p>You can start using the library by loading <code class="language-plaintext highlighter-rouge">datapackage.r</code>.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">datapackage.r</span><span class="p">)</span></code></pre></figure>
<h2 id="reading-basic-metadata">Reading Basic Metadata</h2>
<p>In this case, we are using an example Tabular Data Package containing the periodic table stored on <a href="https://github.com/frictionlessdata/example-data-packages/tree/master/periodic-table">GitHub</a> (<a href="https://raw.githubusercontent.com/frictionlessdata/example-data-packages/master/periodic-table/datapackage.json">datapackage.json</a>, <a href="https://raw.githubusercontent.com/frictionlessdata/example-data-packages/master/periodic-table/data.csv">data.csv</a>). This dataset includes the atomic number, symbol, element name, atomic mass, and the metallicity of the element. Here are the first five rows:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">url</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'https://raw.githubusercontent.com/okgreece/datapackage-r/master/vignettes/example_data/data.csv'</span><span class="w">
</span><span class="n">pt_data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">read.csv2</span><span class="p">(</span><span class="n">url</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">','</span><span class="p">)</span><span class="w">
</span><span class="n">knitr</span><span class="o">::</span><span class="n">kable</span><span class="p">(</span><span class="n">head</span><span class="p">(</span><span class="n">pt_data</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="p">),</span><span class="w"> </span><span class="n">align</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'c'</span><span class="p">)</span></code></pre></figure>
<table class="table table-striped table-bordered" style="display: table; overflow:auto">
<thead>
<tr>
<th>atomic.number</th>
<th>symbol</th>
<th>name</th>
<th>atomic.mass</th>
<th>metal.or.nonmetal.</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>H</td>
<td>Hydrogen</td>
<td>1.00794</td>
<td>nonmetal</td>
</tr>
<tr>
<td>2</td>
<td>He</td>
<td>Helium</td>
<td>4.002602</td>
<td>noble gas</td>
</tr>
<tr>
<td>3</td>
<td>Li</td>
<td>Lithium</td>
<td>6.941</td>
<td>alkali metal</td>
</tr>
<tr>
<td>4</td>
<td>Be</td>
<td>Beryllium</td>
<td>9.012182</td>
<td>alkaline earth metal</td>
</tr>
<tr>
<td>5</td>
<td>B</td>
<td>Boron</td>
<td>10.811</td>
<td>metalloid</td>
</tr>
</tbody>
</table>
<p>Data Packages can be loaded either from a local path or directly from the web.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">url</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'https://raw.githubusercontent.com/okgreece/datapackage-r/master/vignettes/example_data/package.json'</span><span class="w">
</span><span class="n">datapackage</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Package.load</span><span class="p">(</span><span class="n">url</span><span class="p">)</span><span class="w">
</span><span class="n">datapackage</span><span class="o">$</span><span class="n">resources</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="o">$</span><span class="n">descriptor</span><span class="o">$</span><span class="n">profile</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'tabular-data-resource'</span><span class="w"> </span><span class="c1"># tabular resource descriptor profile</span><span class="w">
</span><span class="n">datapackage</span><span class="o">$</span><span class="n">resources</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="o">$</span><span class="n">commit</span><span class="p">()</span><span class="w"> </span><span class="c1"># commit changes</span><span class="w">
</span><span class="c1">## [1] TRUE</span></code></pre></figure>
<p>At the most basic level, Data Packages provide a standardized format for general metadata (for example, the dataset title, source, author, and/or description) about your dataset. Now that you have loaded this Data Package, you have access to this metadata via the <code class="language-plaintext highlighter-rouge">descriptor</code> attribute. Note that these fields are optional and may not be specified for all Data Packages. For more information on which fields are supported, see <a href="https://frictionlessdata.io/specs/data-package/">the full Data Package standard</a>.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">datapackage</span><span class="o">$</span><span class="n">descriptor</span><span class="o">$</span><span class="n">title</span><span class="w">
</span><span class="c1">## [1] "Periodic Table"</span></code></pre></figure>
<h2 id="reading-data">Reading Data</h2>
<p>Now that you have loaded your Data Package, you can read its data. A Data Package can contain multiple files which are accessible via the <code class="language-plaintext highlighter-rouge">resources</code> attribute. The <code class="language-plaintext highlighter-rouge">resources</code> attribute is an array of objects containing information (e.g. path, schema, description) about each file in the package.</p>
<p>You can access the data in a given resource in the <code class="language-plaintext highlighter-rouge">resources</code> array via its <code class="language-plaintext highlighter-rouge">table</code> attribute, whose <code class="language-plaintext highlighter-rouge">read()</code> method returns the data.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">table</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">datapackage</span><span class="o">$</span><span class="n">resources</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="o">$</span><span class="n">table</span><span class="w">
</span><span class="n">periodic_table_data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">table</span><span class="o">$</span><span class="n">read</span><span class="p">()</span></code></pre></figure>
<p>You can further manipulate list objects in R by using the <a href="https://cran.r-project.org/package=purrr">purrr</a> and <a href="https://cran.r-project.org/package=rlist">rlist</a> packages.</p>
<h2 id="loading-into-an-sql-database">Loading into an SQL database</h2>
<p><a href="https://frictionlessdata.io/specs/tabular-data-package/">Tabular Data Packages</a> contains schema information about its data using <a href="https://frictionlessdata.io/guides/table-schema/">Table Schema</a>. This means you can easily import your Data Package into the SQL backend of your choice. In this case, we are creating an <a href="http://sqlite.org/">SQLite</a> database.</p>
<p>To create a new SQLite database and load the data into SQL we will need <a href="https://cran.r-project.org/package=DBI">DBI</a> package and <a href="https://cran.r-project.org/package=RSQLite">RSQLite</a> package, which contains <a href="https://www.sqlite.org/">SQLite</a> (no external software is needed).</p>
<p>You can install and load them by using:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">install.packages</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"DBI"</span><span class="p">,</span><span class="s2">"RSQLite"</span><span class="p">))</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">DBI</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">RSQLite</span><span class="p">)</span></code></pre></figure>
<p>To create a new SQLite database, you simply supply the filename to <code class="language-plaintext highlighter-rouge">dbConnect()</code>:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">dp.database</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="n">RSQLite</span><span class="o">::</span><span class="n">SQLite</span><span class="p">(),</span><span class="w"> </span><span class="s2">""</span><span class="p">)</span><span class="w"> </span><span class="c1"># temporary database</span></code></pre></figure>
<p>We will use <a href="https://cran.r-project.org/package=RSQLite">data.table</a> package to convert the list object with the data to a data frame object to copy them to database table.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="c1"># install data.table package if not already</span><span class="w">
</span><span class="c1"># install.packages("data.table")</span><span class="w">
</span><span class="n">periodic_table_sql</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data.table</span><span class="o">::</span><span class="n">rbindlist</span><span class="p">(</span><span class="n">periodic_table_data</span><span class="p">)</span><span class="w">
</span><span class="n">periodic_table_sql</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">setNames</span><span class="p">(</span><span class="n">periodic_table_sql</span><span class="p">,</span><span class="n">unlist</span><span class="p">(</span><span class="n">datapackage</span><span class="o">$</span><span class="n">resources</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="o">$</span><span class="n">headers</span><span class="p">))</span></code></pre></figure>
<p>You can easily copy an R data frame into a SQLite database with <code class="language-plaintext highlighter-rouge">dbWriteTable()</code>:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">dbWriteTable</span><span class="p">(</span><span class="n">dp.database</span><span class="p">,</span><span class="w"> </span><span class="s2">"periodic_table_sql"</span><span class="p">,</span><span class="w"> </span><span class="n">periodic_table_sql</span><span class="p">)</span><span class="w">
</span><span class="c1"># show remote tables accessible through this connection</span><span class="w">
</span><span class="n">dbListTables</span><span class="p">(</span><span class="n">dp.database</span><span class="p">)</span><span class="w">
</span><span class="c1">## [1] "periodic_table_sql"</span></code></pre></figure>
<p>The data are now in the database.</p>
<p>We can now issue queries against the database. For example, to return the first 5 elements:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">dbGetQuery</span><span class="p">(</span><span class="n">dp.database</span><span class="p">,</span><span class="w"> </span><span class="s1">'SELECT * FROM periodic_table_sql LIMIT 5'</span><span class="p">)</span><span class="w">
</span><span class="c1">## atomic number symbol name atomic mass metal or nonmetal?</span><span class="w">
</span><span class="c1">## 1 1 H Hydrogen 1.007940 nonmetal</span><span class="w">
</span><span class="c1">## 2 2 He Helium 4.002602 noble gas</span><span class="w">
</span><span class="c1">## 3 3 Li Lithium 6.941000 alkali metal</span><span class="w">
</span><span class="c1">## 4 4 Be Beryllium 9.012182 alkaline earth metal</span><span class="w">
</span><span class="c1">## 5 5 B Boron 10.811000 metalloid</span></code></pre></figure>
<p>Or return all elements with an atomic number of less than 10:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">dbGetQuery</span><span class="p">(</span><span class="n">dp.database</span><span class="p">,</span><span class="w"> </span><span class="s1">'SELECT * FROM periodic_table_sql WHERE "atomic number" < 10'</span><span class="p">)</span><span class="w">
</span><span class="c1">## atomic number symbol name atomic mass metal or nonmetal?</span><span class="w">
</span><span class="c1">## 1 1 H Hydrogen 1.007940 nonmetal</span><span class="w">
</span><span class="c1">## 2 2 He Helium 4.002602 noble gas</span><span class="w">
</span><span class="c1">## 3 3 Li Lithium 6.941000 alkali metal</span><span class="w">
</span><span class="c1">## 4 4 Be Beryllium 9.012182 alkaline earth metal</span><span class="w">
</span><span class="c1">## 5 5 B Boron 10.811000 metalloid</span><span class="w">
</span><span class="c1">## 6 6 C Carbon 12.010700 nonmetal</span><span class="w">
</span><span class="c1">## 7 7 N Nitrogen 14.006700 nonmetal</span><span class="w">
</span><span class="c1">## 8 8 O Oxygen 15.999400 nonmetal</span><span class="w">
</span><span class="c1">## 9 9 F Fluorine 18.998403 halogen</span></code></pre></figure>
<p>You can find more about using databases and SQLite in R in the vignettes of the <a href="https://cran.r-project.org/package=DBI">DBI</a> and <a href="https://cran.r-project.org/package=RSQLite">RSQLite</a> packages.</p>
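<p>For readers who work in Python rather than R, the same load-into-SQLite flow can be sketched with the <a href="https://github.com/frictionlessdata/datapackage-py">datapackage</a> library and the standard <code class="language-plaintext highlighter-rouge">sqlite3</code> module. This is a minimal sketch, not part of the original R tutorial; it assumes the datapackage-py API (<code class="language-plaintext highlighter-rouge">Package</code>, <code class="language-plaintext highlighter-rouge">Resource.read</code>), and the descriptor URL is illustrative:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import sqlite3
from decimal import Decimal

from datapackage import Package

# The descriptor URL is illustrative; any Tabular Data Package works the same way.
DESCRIPTOR = ('https://raw.githubusercontent.com/frictionlessdata/'
              'example-data-packages/master/periodic-table/datapackage.json')

package = Package(DESCRIPTOR)
resource = package.resources[0]
headers = resource.schema.field_names   # column names from the Table Schema
rows = resource.read(keyed=True)        # list of dicts, one row per element


def adapt(value):
    # sqlite3 cannot bind the Decimal values produced by Table Schema casting
    return float(value) if isinstance(value, Decimal) else value


conn = sqlite3.connect(':memory:')      # temporary database, as in the R example
cols = ', '.join('"%s"' % h for h in headers)
marks = ', '.join('?' for _ in headers)
conn.execute('CREATE TABLE periodic_table (%s)' % cols)
conn.executemany('INSERT INTO periodic_table VALUES (%s)' % marks,
                 [tuple(adapt(row[h]) for h in headers) for row in rows])

print(conn.execute(
    'SELECT * FROM periodic_table WHERE "atomic number" < 10').fetchall())
</code></pre></div></div>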
<p>We welcome your feedback and questions via our <a href="http://gitter.im/frictionlessdata/chat">Frictionless Data Gitter chat</a> or via <a href="https://github.com/frictionlessdata/datapackage-r/issues">Github issues</a> on the datapackage-r repository.</p>
Open Knowledge Greece
Working with Data Package Creator
2018-02-05T00:00:00+00:00
http://okfnlabs.org/blog/2018/02/05/data-package-creator
<p><em>The Data Package Creator, <a href="https://create.frictionlessdata.io">create.frictionlessdata.io</a>, is a revamp of the Data Packagist app that lets you create, edit and validate your data packages with ease. Read on and find out how.</em></p>
<hr />
<p><a href="https://frictionlessdata.io">Frictionless Data</a> aims to make it effortless to transport high quality data among different tools and platforms for further analysis. At the heart of this work is the <a href="https://frictionlessdata.io/specs/data-package/">Data Package</a>, a simple format that makes it possible to package a collection of data and attach contextual information to it before sharing it. Where tabular data is involved, the ensuing <a href="https://frictionlessdata.io/specs/tabular-data-package/">Tabular Data Package</a> contains the dataset, its schema and descriptive metadata associated with the dataset collated in a JSON file.</p>
<p>The basic building block of a Data Package is its <code class="language-plaintext highlighter-rouge">datapackage.json</code> file. The Frictionless Data team and community have developed libraries and continue to actively support users who wish to create and work with Data Packages in <a href="https://github.com/frictionlessdata/datapackage-js">Javascript</a>, <a href="https://github.com/frictionlessdata/datapackage-py">Python</a>, <a href="https://github.com/frictionlessdata/datapackage-rb">Ruby</a>, <a href="https://github.com/frictionlessdata/datapackage-r">R</a>, <a href="https://github.com/frictionlessdata/datapackage-php">PHP</a>, <a href="https://github.com/frictionlessdata/datapackage-java">Java</a>, <a href="https://github.com/frictionlessdata/datapackage-go">Go</a>, <a href="https://github.com/frictionlessdata/datapackage-clj">Clojure</a> and <a href="https://github.com/frictionlessdata/datapackage-jl">Julia</a>. Up until now, the <a href="http://datapackagist.openknowledge.io">Data Packagist app</a>, which was developed as an Open Knowledge Labs initiative, has also been a helpful resource to help people create Data Packages quickly and with relative ease.</p>
<p>At Open Knowledge International and as part of the Frictionless Data project, we are constantly thinking about streamlining processes and making it easier for users to adopt the software we develop for use in their data work. New improvements to the Data Package specification as part of the <a href="https://blog.okfn.org/2017/09/05/frictionless-data-v1-0/">September 2017 update</a> have also led our team to carry out subsequent iterations on the original Data Packagist app. The outcome of this work is the <a href="https://create.frictionlessdata.io">Data Package Creator</a>, which boasts a revamped user interface and additional functionality to streamline the data package creation process.</p>
<p><a href="https://create.frictionlessdata.io">Data Package Creator</a> is an online service that lets users generate tabular data packages from their datasets (<a href="https://frictionlessdata.io/specs/tabular-data-package/">more on the Tabular Data Package specification</a>). Let’s see how it works.</p>
<p>As mentioned earlier, a data package contains a collection of data. Each unique data file is referred to as a <a href="https://frictionlessdata.io/specs/data-resource/">data resource</a>.</p>
<p>You can add as many resources as your data collection contains, either by linking to them, uploading them from your local machine or creating them from scratch and specifying their fields. You can also edit each resource (rename it, add and remove fields, etc.) within the <a href="https://create.frictionlessdata.io">Data Package Creator</a>.</p>
<p>For our example, I am looking to package <a href="https://frictionlessdata.io/specs/data-resource/">data resources</a> that contain information on three cities I am interested in: Paris, Rome and London.</p>
<ul>
<li>The first resource is <code class="language-plaintext highlighter-rouge">location.csv</code> which contains city names and their coordinates. I will load this file from my local machine. Here’s what the data in the <code class="language-plaintext highlighter-rouge">location.csv</code> file looks like.</li>
</ul>
<pre><code class="language-csv">city,location
london,"51.50,-0.11"
paris,"48.85,2.30"
rome,N/A
</code></pre>
<ul>
<li>The second resource is <code class="language-plaintext highlighter-rouge">data.csv</code> which contains population information on the three cities. I will load this tabular data resource <a href="https://github.com/frictionlessdata/datapackage-py/blob/master/data/data.csv">from a GitHub repository</a>.</li>
</ul>
<pre><code class="language-csv">city,population
london,8787892
paris,2244000
rome,2877215
</code></pre>
<ul>
<li>The third resource is one that doesn’t exist yet and which I will create and add fields to in the <a href="https://create.frictionlessdata.io">Data Package Creator</a>. I’ll call it <code class="language-plaintext highlighter-rouge">rome.csv</code>. Once I download the data package, I will add this resource to the data package before sharing it elsewhere.</li>
</ul>
<pre><code class="language-csv">city,location
rome,"41.89,12.51"
</code></pre>
<p>The <code class="language-plaintext highlighter-rouge">datapackage.json</code> file is updated every time a resource is added, edited or removed. This JSON file can be viewed on the right hand side of the Data Package Creator by clicking on the <code class="language-plaintext highlighter-rouge">{···}</code> symbol to expand the section.</p>
<p><img src="/img/posts/datapackagecreator.png" alt="screengrab of data package creator" />
<em>screen grab of the new Data Package Creator</em></p>
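<p>As a rough sketch (the resource names and the remote path here are illustrative rather than the app’s exact output), the generated descriptor for the three resources above might look like:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "name": "three-cities",
  "resources": [
    {"name": "location", "path": "location.csv"},
    {"name": "population", "path": "https://github.com/frictionlessdata/datapackage-py/raw/master/data/data.csv"},
    {"name": "rome", "path": "rome.csv"}
  ]
}
</code></pre></div></div>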
<p>Metadata attached to any Data Package is also stored in the <code class="language-plaintext highlighter-rouge">datapackage.json</code> file. However, editing JSON files directly can be a laborious and error-prone task. The MetaData section on the left side makes it easy to write and edit descriptive metadata that will be included in your Data Package alongside your data.</p>
<p>The Profile Section allows you to specify what kind of Data Package you are going for. There are 3 options:</p>
<ul>
<li><a href="https://frictionlessdata.io/specs/data-package/">Data Package</a>: This can contain a collection of any type of data resource and a JSON file.</li>
<li><a href="https://frictionlessdata.io/specs/tabular-data-package/">Tabular Data Package</a>: This collection must contain tabular data. It is possible to load data in any machine readable format - csv, tsv, xls, etc and a JSON file</li>
<li><a href="https://frictionlessdata.io/specs/fiscal-data-package/">Fiscal Data Package</a>: This is a subset of the tabular data package, specifically designed for use with budget and fiscal data (a sketch of how the chosen profile is recorded in the descriptor follows this list).</li>
</ul>
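<p>Under the hood, the chosen profile is recorded in the descriptor’s <code class="language-plaintext highlighter-rouge">profile</code> property. A minimal sketch (resource names and paths are illustrative):</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "name": "three-cities",
  "profile": "tabular-data-package",
  "resources": [
    {
      "name": "population",
      "path": "data/population.csv",
      "profile": "tabular-data-resource"
    }
  ]
}
</code></pre></div></div>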
<p>The keyword section also allows you to add up to 3 tags to your Data Package to make it more discoverable.</p>
<p>Before downloading your data package, click on the <strong>Validate</strong> button at the bottom of the side navigation to check whether the generated schema is valid. The Validate button prompts Data Package Creator to check whether the selected profile fits the resources that constitute your data package. Should you see a warning, such as the one below, it is likely that the wrong profile is specified in the MetaData section.</p>
<p><img src="/img/posts/datapackagecreator-invalid.png" alt="screengrab of an alert for an invalid data package message on data package creator" />
<em>Error message that ensues on choosing the wrong data package profile. My data package consists of tabular data resources, so the Fiscal Data Package profile is ill-suited for it; the Tabular Data Package profile is the right fit.</em></p>
<p>Aim for the eureka message below, and in case you feel stuck, reach out and we’ll work with you to resolve the issue.</p>
<p><img src="/img/posts/datapackagecreator-valid.png" alt="screengrab of a valid data package message on data package creator" /></p>
<p>Finally, click on the download button, which gives you a local copy of the generated datapackage.json file, complete with your data schema and metadata attached to it. Score 1 for data provenance!
Next, create a folder and place your downloaded <code class="language-plaintext highlighter-rouge">datapackage.json</code> file in it. Create a new folder within it, call it <code class="language-plaintext highlighter-rouge">data</code>, and add all the data resources in your data package to it. You are now ready to share your data package.
Here’s what my final data package folder looks like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Three-Cities-Data-Package
|-- datapackage.json
|-- data
    |-- location.csv
    |-- rome.csv
</code></pre></div></div>
<p>Please note, there are cases where all you would need to share is the <code class="language-plaintext highlighter-rouge">datapackage.json</code> file, i.e. if all your resources are online and publicly accessible. For this reason, as the population data resource is publicly available and already linked in the resulting JSON file, I need not include its csv file in my final data package.</p>
<p><a href="[dpc-git]">This</a> is the code repository for <a href="https://create.frictionlessdata.io">Data Package Creator</a>. We welcome your feedback and questions via our <a href="http://gitter.im/frictionlessdata/chat">Frictionless Data Gitter chat</a> or via <a href="https://github.com/frictionlessdata/datapackage-ui/issues">Github issues</a>.</p>
<p>Happy days!</p>
Serah Rono
Interactive Data wrangling using Data Package Pipelines new UI
2018-01-10T00:00:00+00:00
http://okfnlabs.org/blog/2018/01/10/datapackage-pipelines-ui
<p><em><a href="https://github.com/frictionlessdata/datapackage-pipelines">datapackage-pipelines</a> is a framework for defining data processing steps to generate self-describing Data Packages, built on the concepts and tooling of the <a href="http://frictionlessdata.io/">Frictionless Data</a> project. You can read more about datapackage-pipelines in this <a href="http://okfnlabs.org/blog/2017/02/27/datapackage-pipelines.html">introductory post</a>.</em></p>
<hr />
<p>Data wrangling can be quite a tedious task -</p>
<ul>
<li>We download a few files from a data portal or some other source.</li>
<li>We use Excel or other applications to view the data.</li>
<li>We select columns from all these files and copy-paste data to construct the data-set that we need.</li>
<li>We filter the data so that it contains only the rows that are required.</li>
<li>We use formulas to compute data for new columns, to un-pivot the data or to verify that the data ‘makes sense’.</li>
</ul>
<p>Finally, we have the wrangled data, ready to be analysed, used in our application or published in an article.</p>
<p>The big problem with this process is that it’s not repeatable or verifiable. In many cases, the ability to show the various transformations and processes that the data underwent is crucial to establish the data’s authenticity and correctness.</p>
<p>The common solution for this is to ditch the spreadsheet programs and bring out the power tools - the programming languages. By writing a data processing program, we are able to repeatedly run the same processing sequence on the source data and consistently receive the same results. This processing code can also be presented and reviewed as a proof of the validity of the resulting data.</p>
<p>However, as anyone who has tinkered with programming knows - writing code is hard. Things that are simple to do using spreadsheet programs often require a complex mental effort to accomplish using custom code. Making sure that the code you’ve written is correct and actually produces the intended result is another major obstacle (not to mention the time it takes to learn how to program in the first place). Furthermore, even with the most readable code, it’s still hard for a third-party reviewer to verify the validity of the process - unless they’re familiar with the exact toolset and method you’ve used (which is often not the case).</p>
<p>At Open Knowledge International, we’ve tried to tackle this problem by finding a middle way. The <code class="language-plaintext highlighter-rouge">datapackage-pipelines</code> framework allows users to build processing pipelines - essentially, sequences of processing steps. Each of these steps is a reusable and flexible building-block which performs a single action. For example, you might use a ‘Load data from source’ block, a ‘Select columns’ block or ‘Sort data’ block. By combining these blocks together in a chain, one could construct powerful and simple to understand data processors.</p>
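<p>As a rough illustration of what such a chain looks like, here is a sketch of a <code class="language-plaintext highlighter-rouge">pipeline-spec.yaml</code> file. The processor names follow the standard building blocks in the datapackage-pipelines documentation, but the dataset name, URL and parameters are made up:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A sketch of a pipeline: each "run" step is one reusable building block.
my-dataset:
  pipeline:
    - run: add_metadata              # attach package-level metadata
      parameters:
        name: my-dataset
    - run: add_resource              # the 'Load data from source' block
      parameters:
        name: my-data
        url: http://example.com/my-data.csv
    - run: stream_remote_resources   # stream the rows of the remote resource
    - run: sort                      # the 'Sort data' block
      parameters:
        resources: my-data
        sort-by: "{some_column}"
    - run: dump.to_path              # write the resulting data package to disk
      parameters:
        out-path: output
</code></pre></div></div>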
<p>While providing a good solution to issues like repeatability and difficult-to-understand processes, pipelines were still difficult to develop and test. Most users building the pipelines were already proficient programmers, and getting the right result turned out to be a tricky business.</p>
<p>So, we’ve decided to tackle this problem by creating an interactive user interface for building pipelines.</p>
<p>Our approach is based on a few principles:</p>
<ul>
<li>Modularity: each step in the pipeline should be as small and simple as possible. In case of a failure, users can tell exactly which step caused the problem. Since each step is very simple, debugging also becomes a non-issue.</li>
<li>Interactivity: each decision or change the user makes is immediately reflected in the UI. If there was an error, the user can change their mind or try something else. The effects of the change are instantly visible and no long build/run/test cycles are needed.</li>
<li>Server side processing: by leveraging smart caching heuristics, the server can optimize on the required processing and further improve speed and snappiness of the user interface.</li>
</ul>
<p>In our proof-of-concept implementation, users are prompted to select a source file - either a URL for a datafile hosted anywhere, or choose a dataset from <a href="http://datahub.io">datahub.io</a>. Once selected, users can choose to remove columns or rows (to position the data table and remove filler columns), add a schema (to validate data) or filter some of the rows with specific constraints.</p>
<p>As this is a demonstration only, the list of building blocks is still limited - however, we’re planning to add more so that this product becomes more powerful and useful.</p>
<p>Check out our proof-of-concept here at <a href="http://dppui.openknowledge.io">http://dppui.openknowledge.io</a>!</p>
<p>Feel free to ask any questions / start a discussion about this in our <a href="http://gitter.im/frictionlessdata/chat">Frictionless Data Gitter chat</a>. You can also find the code repositories for this work <a href="https://github.com/frictionlessdata/datapackage-pipelines-ui-client">here</a> and <a href="https://github.com/frictionlessdata/datapackage-pipelines-ui-server">here</a>.</p>
Adam Kariv
Bootstrapping data standards with Frictionless Data
2017-12-21T00:00:00+00:00
http://okfnlabs.org/blog/2017/12/21/bootstrapping-data-standards-with-frictionless-data
<p>When it comes to tabular data, the <a href="https://frictionlessdata.io">Frictionless Data</a> specifications provide users with strong conventions for declaring both the shape of data (via schemas) and information about the data (as metadata on package and resource descriptors).</p>
<p>Within the Frictionless Data world, we purposefully refer to specification work as <em>specifications</em>, and not <em>standards</em>. The specifications therein provide clear conventions for working with data, and declare fundamental interfaces on which a modular software system that works with these specifications can be built. It is very meta. However, the specifications and software foundation <em>do</em> make the Frictionless Data ecosystem a powerful and compelling technical foundation on which to build data standards.</p>
<p>Some reasons why:</p>
<ul>
<li>Data is serialised in a format that software developers can use to build tools such as APIs, and that can also be read by many consumer programs used by data consumers with little to no technical know-how.</li>
<li>Built-in progressive enhancement, where metadata, as well as structural and schematic information about the data, can be incorporated over time without modifying the original data source.</li>
<li>A large and growing collection of tools, in many programming languages, for working with the Frictionless Data specifications.</li>
<li>The specifications and the software are platform agnostic. A major example of this is being web-friendly without being dependent on the web (as with many linked data approaches). Linkable data, not Linked Data.</li>
</ul>
<p>We’ll demonstrate this with some examples below, which are a proof of concept for the idea of using Frictionless Data as a technical foundation for data standards. This is ongoing work that we intend to iterate on in response to feedback on this initial take.</p>
<p>Of course, we do not in any way think that the technical implementation of a data standard is what “data standards” is about. Data standards are about communities of practice, stakeholder engagement, and increasingly, a vehicle of change at the level of policy and governance. Technical implementation, in this wider context, is but a small, yet crucial, component. Indeed, this is a critical part of the promise we are pointing to here - that by building on a common foundation, communities building data standards can focus a little less on the technical implementation details and a little more on the change they want to see by creating them.</p>
<h2 id="grant-funding">Grant funding</h2>
<p><a href="http://www.threesixtygiving.org/">360Giving</a> is an organization that helps funders to be transparent about the grants they award. It provides a <a href="http://standard.threesixtygiving.org/en/latest/">standard</a> for publishing grants data in a common format, and a <a href="http://www.threesixtygiving.org/data/data-registry/">Registry</a> to host the data. Publishers can upload a spreadsheet that contains various fields describing the different activities they funded. We will demonstrate how a custom Data Package profile could describe one of these spreadsheets, ensuring that the required metadata fields are present and that the contents of the file conform to the <a href="http://standard.threesixtygiving.org/en/latest/reference/#grants-sheet">schema</a>.</p>
<p>We will use this <a href="http://www.blagravetrust.org/wp-content/uploads/2017/01/360G-blagravetrust-2016.xlsx">sample dataset</a>, taken directly from the Registry without any changes:</p>
<table class="table table-striped table-bordered" style="display: block; overflow: auto;">
<thead>
<tr>
<th>Identifier</th>
<th>Title</th>
<th>Description</th>
<th>Currency</th>
<th>Amount Awarded</th>
<th>Amount Disbursed</th>
<th>Award Date</th>
</tr>
</thead>
<tbody>
<tr>
<td>360G-blagravetrust-00658000009YZRq</td>
<td>Achieving Further</td>
<td>Work with 22 FE colleges to improve attainment; attendance and participation</td>
<td>GBP</td>
<td>300000</td>
<td>300000</td>
<td>2014-07-08</td>
</tr>
<tr>
<td>360G-blagravetrust-00658000007A1UQ</td>
<td>Training on feedback for Portsmouth VCS</td>
<td>Improving feedback skills for Portsmouth VCS - Feedback Fund 2016</td>
<td>GBP</td>
<td>3933</td>
<td>3933</td>
<td>2016-08-09</td>
</tr>
<tr>
<td>360G-blagravetrust-00658000008vdAl</td>
<td>Creative learning programme</td>
<td>Portsmouth young people leaving care</td>
<td>GBP</td>
<td>75000</td>
<td>25000</td>
<td>2016-11-08</td>
</tr>
<tr>
<td>360G-blagravetrust-00658000007lweS</td>
<td>Feedback Fund</td>
<td>Feedback Fund 2016</td>
<td>GBP</td>
<td>2094</td>
<td>2094</td>
<td>2016-08-09</td>
</tr>
</tbody>
</table>
<p>Our first step was to create a Table Schema describing the expected contents of the fields, which was then <a href="https://github.com/frictionlessdata/profiles/blob/c3423d1266439ffebfdac2b681d3dd0bffd81964/assets/grants/datapackage.json#L39">embedded</a> in the Data Package descriptor. This was easy because, as we mentioned before, there is already a well-defined <a href="http://standard.threesixtygiving.org/en/latest/reference/#grants-sheet">schema</a> for the fields. For the purposes of this example we just focused on a subset of all available fields. Here are some example fields, with a sketch of the corresponding Table Schema JSON after the table:</p>
<table class="table table-striped table-bordered" style="display: block; overflow: auto;">
<thead>
<tr>
<th>Name / Title</th>
<th>Type</th>
<th>Constraints</th>
<th> </th>
</tr>
</thead>
<tbody>
<tr>
<td> </td>
<td>Identifier</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>Title</td>
<td>string</td>
<td><strong>maxLength</strong>: 140</td>
</tr>
<tr>
<td> </td>
<td>Description</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>Currency</td>
<td>string</td>
<td><strong>enum</strong>: [‘AED’, ‘AFN’, ‘ALL’, ‘AMD’, …]</td>
</tr>
<tr>
<td> </td>
<td>Amount Awarded</td>
<td>number</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>Amount Disbursed</td>
<td>number</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>Award Date</td>
<td>date</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>URL</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>…</td>
<td>…</td>
<td>…</td>
</tr>
<tr>
<td> </td>
<td>Funding Org:Name</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>Funding Org:Department</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>Grant Programme:Code</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>Grant Programme:Title</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>Grant Programme:URL</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>From an open call?</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>Related Activity</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>Last modified</td>
<td>datetime</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td>Data Source</td>
<td>string</td>
<td> </td>
</tr>
</tbody>
</table>
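<p>In Table Schema terms, a fragment of the embedded schema for three of these fields might look like the following sketch (the currency <code class="language-plaintext highlighter-rouge">enum</code> is abridged here):</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "fields": [
    {"name": "Identifier", "type": "string"},
    {"name": "Title", "type": "string", "constraints": {"maxLength": 140}},
    {"name": "Currency", "type": "string", "constraints": {"enum": ["AED", "AFN", "ALL", "AMD"]}}
  ]
}
</code></pre></div></div>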
<p>Our custom <a href="https://github.com/frictionlessdata/profiles/blob/master/assets/grants/datapackage.json">Grants Data Package</a> extends the <a href="https://frictionlessdata.io/specs/data-package/">Data Package</a> specification by adding the following fields:</p>
<table class="table table-striped table-bordered" style="display: block; overflow: auto;">
<thead>
<tr>
<th>Name</th>
<th>Description</th>
<th>Type</th>
</tr>
</thead>
<tbody>
<tr>
<td>funder</td>
<td>A JSON object describing the funding organization. It can include the following properties: <code class="language-plaintext highlighter-rouge">id</code>, <code class="language-plaintext highlighter-rouge">name</code>, <code class="language-plaintext highlighter-rouge">email</code>, <code class="language-plaintext highlighter-rouge">url</code></td>
<td>object</td>
</tr>
<tr>
<td>year</td>
<td>The year that the grants data in this file covers</td>
<td>integer</td>
</tr>
<tr>
<td>modified</td>
<td>The timestamp of when this dataset was last modified</td>
<td>datetime</td>
</tr>
</tbody>
</table>
<p>This closely follows the <a href="https://threesixtygiving.github.io/getdata/">JSON specification</a> that 360Giving provides, with the rest of the fields covered by the standard Data Package specification.</p>
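<p>Putting it together, a descriptor carrying these custom properties might look like the following sketch. The values are illustrative, drawn from the sample dataset where possible; the <code class="language-plaintext highlighter-rouge">modified</code> timestamp is made up:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "name": "360g-blagravetrust-2016",
  "funder": {
    "id": "GB-CHC-1164021",
    "name": "The Blagrave Trust",
    "url": "http://www.blagravetrust.org"
  },
  "year": 2016,
  "modified": "2017-01-15T00:00:00Z",
  "resources": [
    {"name": "grants", "path": "360G-blagravetrust-2016.xlsx"}
  ]
}
</code></pre></div></div>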
<p>Once we have our data packaged in this way, we can leverage all the ecosystem of tools built around Data Packages to work with it. For instance, using the <a href="https://github.com/frictionlessdata/datapackage-py"><code class="language-plaintext highlighter-rouge">datapackage</code></a> library we can iterate over the contents of the file:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">datapackage</span>
<span class="n">datapackage_url</span> <span class="o">=</span> <span class="s">'https://raw.githubusercontent.com/frictionlessdata/profiles/master/assets/grants/datapackage.json'</span>
<span class="n">dp</span> <span class="o">=</span> <span class="n">datapackage</span><span class="p">.</span><span class="n">Package</span><span class="p">(</span><span class="n">datapackage_url</span><span class="p">)</span>
<span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">dp</span><span class="p">.</span><span class="n">resources</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="nb">iter</span><span class="p">(</span><span class="n">keyed</span><span class="o">=</span><span class="bp">True</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="n">row</span><span class="p">)</span>
<span class="c1"># {'Funding Org:Identifier': 'GB-CHC-1164021', 'Beneficiary Location:Geographic Code Type': 'UA', 'From an open call?': None, 'Beneficiary Location:Name': 'Reading', 'Grant Programme:Code': None, 'Beneficiary Location:Geographic Code': 'E06000038', 'Amount Disbursed': Decimal('300000'), 'Recipient Org:City': 'Newbury', 'Award Date': datetime.datetime(2014, 7, 8, 0, 0), 'Beneficiary Location:Longitude': Decimal('-0.95543100000000003024780426130746491253376007080078125'), 'Recipient Org:Web Address': 'http://www.afaeducation.org', 'Recipient Org:Charity Number': '1142154', 'Grant Programme:Title': None, 'Related Activity': None, 'Grant Programme:URL': None, 'Recipient Org:Country': 'UK', 'Funding Org:Name': 'The Blagrave Trust', 'Title': 'Achieving Further', 'Planned Dates:End Date': datetime.datetime(2017, 6, 30, 0, 0), 'Recipient Org:Postal Code': 'RG14 1JQ', 'Identifier': '360G-blagravetrust-00658000009YZRq', 'Data Source': None, 'Planned Dates:Start Date': None, 'Currency': 'GBP', 'Description': 'Work with 22 FE colleges to improve attainment; attendance and participation', 'Recipient Org:Identifier': 'GB-CHC-1142154', 'Recipient Org:Description': 'Charity working with nurseries schools and colleges to raise attainment and achivement of children particularly those with barriers to learning', 'Funding Org:Department': None, 'Beneficiary Location:Country Code': None, 'Last modified': None, 'URL': None, 'Amount Awarded': Decimal('300000'), 'Beneficiary Location:Latitude': Decimal('51.4541449999999969122654874809086322784423828125'), 'Recipient Org:County': 'Berkshire', 'Recipient Org:Name': 'Achievement for All', 'Recipient Org:Street Address': 'Oxford House, Oxford Street', 'Planned Dates:Duration (months)': None, 'Recipient Org:Company Number': None}
</span>
</code></pre></div></div>
<p>Also, as we define the Table Schema, we can use <a href="https://github.com/frictionlessdata/goodtables-py">goodtables</a> to perform data validation and get a report of issues found:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">goodtables</span> <span class="kn">import</span> <span class="n">validate</span>
<span class="n">datapackage_url</span> <span class="o">=</span> <span class="s">'https://raw.githubusercontent.com/frictionlessdata/profiles/master/assets/grants/datapackage.json'</span>
<span class="n">validate</span><span class="p">(</span><span class="n">datapackage_url</span><span class="p">)</span>
<span class="s">'''
{'error-count': 0,
'preset': 'datapackage',
'table-count': 1,
'tables': [{'datapackage': 'https://raw.githubusercontent.com/frictionlessdata/profiles/c3423d1266439ffebfdac2b681d3dd0bffd81964/assets/grants/datapackage.json',
'encoding': None,
'error-count': 0,
'errors': [],
'format': 'inline',
'headers': ['Identifier',
'Title',
'Description',
'Currency',
'Amount Awarded',
'Amount Disbursed',
'Award Date',
'URL',
'Planned Dates:Start Date',
...
'Grant Programme:URL',
'From an open call?',
'Related Activity',
'Last modified',
'Data Source'],
'row-count': 70,
'schema': 'table-schema',
'scheme': None,
'source': 'https://raw.githubusercontent.com/frictionlessdata/profiles/c3423d1266439ffebfdac2b681d3dd0bffd81964/assets/grants/360G-blagravetrust-2016.xlsx',
'time': 0.53,
'valid': True}],
'time': 1.386,
'valid': True,
'warnings': []}
'''</span>
</code></pre></div></div>
<h2 id="iati-registry">IATI Registry</h2>
<p>The <a href="http://iatistandard.org/">IATI Standard</a> is a technical framework to publish aid, development, and humanitarian data in a standard way. Data published in the IATI standard is indexed on the <a href="https://iatiregistry.org/">IATI Registry</a>. Here we will demonstrate the creation of a custom Data Package profile to package data meant to be published in the registry, ensuring that it has the required metadata.</p>
<p>Here are the fields available when publishing a new IATI file on the registry:</p>
<table class="table table-striped table-bordered" style="display: block; overflow: auto;">
<thead>
<tr>
<th>Name</th>
<th>Data Package field</th>
<th>Description</th>
<th>Type</th>
</tr>
</thead>
<tbody>
<tr>
<td><code class="language-plaintext highlighter-rouge">registry-file-id</code></td>
<td><code class="language-plaintext highlighter-rouge">name </code></td>
<td>A unique identifier for the activity record</td>
<td>string</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">registry-publisher-id</code></td>
<td>-</td>
<td>Publisher identifier on the IATI Registry</td>
<td>string</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">title</code></td>
<td><code class="language-plaintext highlighter-rouge">title </code></td>
<td>The title of the dataset</td>
<td>string</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">description</code></td>
<td><code class="language-plaintext highlighter-rouge">description</code></td>
<td>Some useful notes about the data</td>
<td>string</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">source-url</code></td>
<td><code class="language-plaintext highlighter-rouge">resources[0]['path']</code></td>
<td>URL to a publicly accessible IATI file</td>
<td>string</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">contact-email</code></td>
<td>-</td>
<td>Contact email for publisher</td>
<td>string</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">file-type</code></td>
<td>-</td>
<td>Must be either ‘Activity’ or ‘Organization’</td>
<td>string</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">recipient-country</code></td>
<td>-</td>
<td>Recipient country</td>
<td>string</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">last-updated-datetime</code></td>
<td>-</td>
<td>Timestamp of the last modification</td>
<td>date-time</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">activity-count</code></td>
<td>-</td>
<td>Number of activities described in the data</td>
<td>integer</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">default-language</code></td>
<td>-</td>
<td>Language of the data</td>
<td>string</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">secondary-publisher</code></td>
<td>-</td>
<td>The publisher this dataset is published on behalf of</td>
<td>string</td>
</tr>
</tbody>
</table>
<p>To create the new profile, we will add those fields that do not map directly to the <a href="https://frictionlessdata.io/specs/data-package/">Data Package specification</a> to a standard Data Package descriptor and create a custom JSON Schema to validate it. Here is the <a href="https://github.com/frictionlessdata/profiles/blob/master/assets/iatiregistry/datapackage.json">resulting Data Package descriptor</a>.</p>
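<p>For illustration, a fragment of such a JSON Schema might look like the sketch below. The property names come from the table above, but the exact structure of the real profile may differ:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "required": ["registry-publisher-id", "contact-email", "file-type"],
  "properties": {
    "registry-publisher-id": {"type": "string"},
    "contact-email": {"type": "string"},
    "file-type": {"type": "string", "enum": ["Activity", "Organization"]},
    "activity-count": {"type": "integer"}
  }
}
</code></pre></div></div>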
<h3 id="trees">Trees</h3>
<p>The <a href="https://opencouncildata.org/" title="Open Council Data">Open Council Data</a> defined the standard <a href="http://standards.opencouncildata.org/#/trees" title="Open Council Data: Trees 1.3 Specification">Trees 1.3</a> for describing the trees in a geographical region (e.g. a council). This standard includes information about the location, type, and other characteristics of individual trees, which is useful for planning future growth, maintenance of canopy cover, managing risk of falling branches, etc.</p>
<p>We are using the from <a href="https://data.gov.au/dataset/colac-otway-shire-trees/resource/bcf1d62b-9e72-4eca-b183-418f83dedcea" title="Colac Otway Shire Trees">Colac Otway Shire Trees</a> as an example.</p>
<table class="table table-striped table-bordered" style="display: block; overflow: auto;">
<thead>
<tr>
<th>lat</th>
<th>lon</th>
<th>genus</th>
<th>species</th>
<th>dbh</th>
<th>dbh_min</th>
<th>dbh_max</th>
<th>year_min</th>
<th>year_max</th>
<th>crown</th>
<th>crown_min</th>
<th>crown_max</th>
<th>height</th>
<th>height_min</th>
<th>height_max</th>
<th>common</th>
<th>location</th>
<th>ref</th>
<th>maintenance</th>
<th>maturity</th>
<th>planted</th>
<th>updated</th>
<th>health</th>
<th>variety</th>
<th>description</th>
<th>family</th>
<th>ule_min</th>
<th>ule_max</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<td>-38.344595</td>
<td>143.592171</td>
<td>Melaleuca</td>
<td>Stypheliodes</td>
<td>1</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>5</td>
<td> </td>
<td> </td>
<td>Prickly Paperback</td>
<td>street</td>
<td>10001</td>
<td> </td>
<td>mature</td>
<td>1975-01-01</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>106 Queen ST COLAC VIC 3250</td>
</tr>
<tr>
<td>-38.346198</td>
<td>143.591812</td>
<td>Melaleuca</td>
<td>Stypheliodes</td>
<td>1</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>4</td>
<td> </td>
<td> </td>
<td>Prickly Paperback</td>
<td>street</td>
<td>10004</td>
<td> </td>
<td>mature</td>
<td>1975-01-01</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>122 Queen ST COLAC VIC 3250</td>
</tr>
<tr>
<td>-38.342097</td>
<td>143.588944</td>
<td>Fraxinus</td>
<td>Excelsior</td>
<td>1.2</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>12</td>
<td> </td>
<td> </td>
<td>Golden Ash</td>
<td>street</td>
<td>10007</td>
<td> </td>
<td>mature</td>
<td>1980-01-01</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>40 Rae ST COLAC VIC 3250</td>
</tr>
<tr>
<td>-38.341927</td>
<td>143.588715</td>
<td>Agonis</td>
<td>Flexuosa</td>
<td>0.4</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>5</td>
<td> </td>
<td> </td>
<td>Weeping Willow Myrtle</td>
<td>street</td>
<td>10018</td>
<td> </td>
<td>semi-mature</td>
<td>1980-01-01</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>47 Rae ST COLAC VIC 3250//next to coles coaches</td>
</tr>
<tr>
<td>-38.342044</td>
<td>143.591182</td>
<td>Eucalyptus</td>
<td>Nichollii</td>
<td>0.3</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>6</td>
<td> </td>
<td> </td>
<td>Willow Peppermint</td>
<td>street</td>
<td>10021</td>
<td> </td>
<td>mature</td>
<td>1980-01-01</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>56 Rae ST COLAC VIC 3250//Between Queen St & CCDA</td>
</tr>
</tbody>
</table>
<p>This data was modified from the source to conform to the <a href="http://standards.opencouncildata.org/#/trees" title="Open Council Data: Trees 1.3 Specification">Trees 1.3</a> specification. All the data is available <a href="https://github.com/frictionlessdata/profiles/blob/master/assets/trees/data.csv" title="Trees CSV">here</a>.</p>
<p>The <a href="https://github.com/frictionlessdata/profiles/blob/master/assets/trees/trees-data-package" title="Trees Data Package JSON Schema">Trees Data Package</a> extends the <a href="https://frictionlessdata.io/specs/data-package/">Data Package</a> specification by adding the following fields:</p>
<table class="table table-striped table-bordered" style="display: block; overflow: auto;">
<thead>
<tr>
<th>Name</th>
<th>Description</th>
<th>Type</th>
</tr>
</thead>
<tbody>
<tr>
<td>countryCode</td>
<td>A single 2-letter ISO country code, or an array of such codes, defining the country(ies) present in the data</td>
<td>string</td>
</tr>
<tr>
<td>geospatialCoverage</td>
<td>Geospatial area contained in the dataset</td>
<td>geojson</td>
</tr>
</tbody>
</table>
<table class="table table-striped table-bordered" style="display: block; overflow: auto;">
<thead>
<tr>
<th>Name</th>
<th>Title</th>
<th>Type</th>
<th>Constraints</th>
</tr>
</thead>
<tbody>
<tr>
<td>lat</td>
<td>Latitude in decimal degrees (EPSG:4326)</td>
<td>number</td>
<td><strong>required</strong>: True</td>
</tr>
<tr>
<td>lon</td>
<td>Longitude in decimal degrees (EPSG:4326)</td>
<td>number</td>
<td><strong>required</strong>: True</td>
</tr>
<tr>
<td>genus</td>
<td>Botanical genus, in title case (e.g. Eucalyptus)</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td>species</td>
<td>Botanical species, in title case (e.g. Regnans)</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td>dbh</td>
<td>Diameter at breast height (130cm above ground), in centimeters. If this information is available only as a range, this contains the middle of the range.</td>
<td>number</td>
<td><strong>minimum</strong>: 0</td>
</tr>
<tr>
<td>dbh_min</td>
<td>Minimum diameter at breast height (130cm above ground)</td>
<td>number</td>
<td><strong>minimum</strong>: 0</td>
</tr>
<tr>
<td>dbh_max</td>
<td>Maximum diameter at breast height (130cm above ground)</td>
<td>number</td>
<td><strong>minimum</strong>: 0</td>
</tr>
<tr>
<td>year_min</td>
<td>Lower bound on year that tree is expected to live to (e.g. A tree surveyed in 2008 with useful life expectancy range of 10-15 years would be 2018).</td>
<td>year</td>
<td> </td>
</tr>
<tr>
<td>year_max</td>
<td>Upper bound on year that tree is expected to live to (e.g. A tree surveyed in 2008 with useful life expectancy range of 10-15 years would be 2023).</td>
<td>year</td>
<td> </td>
</tr>
<tr>
<td>crown</td>
<td>Width in metres of the tree’s foliage (also known as crown spread). If this information is available only as a range, this contains the middle of the range.</td>
<td>number</td>
<td><strong>minimum</strong>: 0</td>
</tr>
<tr>
<td>crown_min</td>
<td>Minimum width in meters of the tree’s foliage</td>
<td>number</td>
<td><strong>minimum</strong>: 0</td>
</tr>
<tr>
<td>crown_max</td>
<td>Maximum width in meters of the tree’s foliage</td>
<td>number</td>
<td><strong>minimum</strong>: 0</td>
</tr>
<tr>
<td>height</td>
<td>Height in meters. If this information is available only as a range, this contains the middle of the range.</td>
<td>number</td>
<td><strong>minimum</strong>: 0</td>
</tr>
<tr>
<td>height_min</td>
<td>Minimum height in meters</td>
<td>number</td>
<td><strong>minimum</strong>: 0</td>
</tr>
<tr>
<td>height_max</td>
<td>Maximum height in meters</td>
<td>number</td>
<td><strong>minimum</strong>: 0</td>
</tr>
<tr>
<td>common</td>
<td>Common name for species (non-standardised)</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td>location</td>
<td>Where the tree is located</td>
<td>string</td>
<td><strong>enum</strong>: [‘park’, ‘street’, ‘council’]</td>
</tr>
<tr>
<td>ref</td>
<td>Council-specific identifier, enabling joining to other datasets</td>
<td>number</td>
<td> </td>
</tr>
<tr>
<td>maintenance</td>
<td>How often the tree is inspected (in months)</td>
<td>number</td>
<td><strong>minimum</strong>: 0</td>
</tr>
<tr>
<td>maturity</td>
<td> </td>
<td>string</td>
<td><strong>enum</strong>: [‘young’, ‘semi-mature’, ‘mature’, ‘over-mature’]</td>
</tr>
<tr>
<td>planted</td>
<td>Date of planting</td>
<td>date</td>
<td> </td>
</tr>
<tr>
<td>updated</td>
<td>Date of addition to database or most recent revision</td>
<td>date</td>
<td> </td>
</tr>
<tr>
<td>health</td>
<td>Health of tree growth</td>
<td>string</td>
<td><strong>enum</strong>: [‘stump’, ‘dead’, ‘poor’, ‘fair’, ‘good’]</td>
</tr>
<tr>
<td>variety</td>
<td>Any part of the scientific name below species level, including subspecies or variety</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td>description</td>
<td>Other information about the tree that is not in its scientific name or species</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td>family</td>
<td>Botanical family</td>
<td>string</td>
<td> </td>
</tr>
<tr>
<td>ule_min</td>
<td>Lower bound on useful life expectancy when surveyed</td>
<td>number</td>
<td><strong>minimum</strong>: 0</td>
</tr>
<tr>
<td>ule_max</td>
<td>Upper bound on useful life expectancy when surveyed</td>
<td>number</td>
<td><strong>minimum</strong>: 0</td>
</tr>
<tr>
<td>address</td>
<td>Street address</td>
<td>string</td>
<td> </td>
</tr>
</tbody>
</table>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">datapackage</span>
<span class="n">datapackage_url</span> <span class="o">=</span> <span class="s">'https://raw.githubusercontent.com/frictionlessdata/profiles/master/assets/trees/datapackage.json'</span>
<span class="n">dp</span> <span class="o">=</span> <span class="n">datapackage</span><span class="p">.</span><span class="n">Package</span><span class="p">(</span><span class="n">datapackage_url</span><span class="p">)</span>
<span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">dp</span><span class="p">.</span><span class="n">resources</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="nb">iter</span><span class="p">(</span><span class="n">keyed</span><span class="o">=</span><span class="bp">True</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="n">row</span><span class="p">)</span>
<span class="c1"># {'lat': Decimal('-38.347497'), 'lon': Decimal('143.595686'), 'genus': 'Melaleuca', 'species': 'Nesophila', 'dbh': Decimal('0.25'), 'dbh_min': None, 'dbh_max': None, 'year_min': None, 'year_max': None, 'crown': None, 'crown_min': None, 'crown_max': None, 'height': Decimal('2'), 'height_min': None, 'height_max': None, 'common': 'Snowy Honey Myrtle', 'location': 'street', 'ref': Decimal('10379'), 'maintenance': None, 'maturity': 'semi-mature', 'planted': datetime.date(1980, 1, 1), 'updated': None, 'health': None, 'variety': None, 'description': None, 'family': None, 'ule_min': None, 'ule_max': None, 'address': '18 Thomas ST COLAC VIC 3250'}
</span></code></pre></div></div>
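<p>Because the schema gives every row proper types as it streams, ordinary Python is enough for quick summaries. A small sketch building on the snippet above (the aggregation itself is our own illustration, not part of the profile):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Count trees by maturity, using the same datapackage-py API as above.
from collections import Counter

import datapackage

datapackage_url = ('https://raw.githubusercontent.com/frictionlessdata/'
                   'profiles/master/assets/trees/datapackage.json')
dp = datapackage.Package(datapackage_url)

maturity_counts = Counter(
    row['maturity'] or 'unknown'              # missing values become 'unknown'
    for row in dp.resources[0].iter(keyed=True))
print(maturity_counts.most_common())
</code></pre></div></div>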
<h3 id="conculsion">Conculsion</h3>
<p>This has been a high-level exploration of using <a href="https://frictionlessdata.io/specs/tabular-data-package/">Tabular Data Package</a> and <a href="https://frictionlessdata.io/specs/table-schema/">Table Schema</a> as a “specification framework”, allowing one to bootstrap a proof-of-concept data standard. Taking this approach, one gains access to a collection of modular software libraries that provide powerful APIs for working with this data according to the rules and conditions of the declared standard. Data validation, processing, transport, and consumption do not require custom tool chains once the data standard is declared as a <a href="https://frictionlessdata.io/specs/profiles/">Tabular Data Package Profile</a>.</p>
<p>The approach described here is a first step in the direction of domain-specific tabular data profiles. A future iteration would likely integrate work we are currently undertaking in the <a href="https://frictionlessdata.io/specs/fiscal-data-package/">Fiscal Data Package</a> which enables the simple declaration of <em>domain concepts</em> via <code class="language-plaintext highlighter-rouge">columnType</code> annotations on Table Schemas. This enables data standard authors to work at a level of abstraction of domain concepts, rather than the “primitive types” we work with here via Table Schema. We plan to revisit this work once the <code class="language-plaintext highlighter-rouge">columnType</code> work from Fiscal Data Package is stable for general use.</p>
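<p>To give a flavour of the idea, a hypothetical Table Schema fragment with such annotations might look like this. The <code class="language-plaintext highlighter-rouge">columnType</code> values below are illustrative only, not a stable vocabulary from the Fiscal Data Package work:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "fields": [
    {"name": "Amount Awarded", "type": "number", "columnType": "value"},
    {"name": "Funding Org:Name", "type": "string", "columnType": "administrative-classification"}
  ]
}
</code></pre></div></div>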
<p>For now, all the schemas above work as described, and open up all the software in the Frictionless Data ecosystem to those following this approach.</p>
<p>You can check the source code for all the examples listed in the following GitHub repository:</p>
<p><a href="https://github.com/frictionlessdata/profiles">https://github.com/frictionlessdata/profiles</a></p>
Paul Walsh
Validating scraped data using goodtables
2017-11-29T00:00:00+00:00
http://okfnlabs.org/blog/2017/11/29/validating-scraped-data-using-goodtables
<p>We have to deal with many challenges when scraping a page. What’s the page’s layout? How do I extract the bits of data I want? How do I know when the layout changes and breaks my code? How can I be sure that my code isn’t introducing errors into the data? There are many tools to test that code works, but not so many to test the actual data. This is especially important when you don’t control the source of the data, which is almost always the case when you’re scraping (otherwise, you wouldn’t be scraping). In this post, I’ll show you how I used <a href="https://github.com/frictionlessdata/goodtables-py/" title="goodtables">goodtables</a> to validate scraped data.</p>
<p><a href="https://github.com/frictionlessdata/goodtables-py/" title="goodtables">Goodtables</a> is an open source data validator for tabular data (think spreadsheets and CSVs). It can check both the structure of the file (do all rows have the same number of columns?), and its contents (is this a valid date?). Goodtables gives you a safety net that guarantees that your data files are valid.</p>
<p>We’ll work step by step. First, I’ll show you what the data looks like, then we’ll check what goodtables can find out of the box, without any information about the data contents. Finally, we’ll define the types and constraints of each column, so goodtables can validate that the rows contain what we expect.</p>
<p>By the end of this post, you’ll have a better idea on how goodtables can help you be more confident about your data’s quality.</p>
<h2 id="the-data"><a name="data"></a>The data</h2>
<p>We’ll use the remuneration of the civil servants working for São Paulo’s City Council as an example. This data was scraped from <a href="http://www.camara.sp.gov.br/transparencia/salarios-abertos/remuneracao-dos-servidores-e-comissionados/" title="Remuneration of Sao Paulo City Council's Civil Servants">their website</a>. The first few rows look like:</p>
<table class="table table-striped table-bordered" style="display: block; overflow: auto;">
<thead>
<tr>
<th>name</th>
<th>role</th>
<th>function</th>
<th>remuneration</th>
<th>department</th>
<th>year</th>
<th>month</th>
</tr>
</thead>
<tbody>
<tr>
<td>MILTON LEITE DA SILVA</td>
<td>VEREADOR</td>
<td>VEREADOR</td>
<td>11534.82</td>
<td>PRESIDÊNCIA</td>
<td>2017</td>
<td>9</td>
</tr>
<tr>
<td>PAULO CESAR TAGLIAVINI</td>
<td>CHEFE DE GABINETE DA PRESIDÊNCIA</td>
<td>CHEFE DE GABINETE DA PRESIDÊNCIA</td>
<td>14124.71</td>
<td>GABINETE DA PRESIDÊNCIA</td>
<td>2017</td>
<td>9</td>
</tr>
<tr>
<td>CECILIA DE ARRUDA</td>
<td>CHEFE DE CERIMONIAL</td>
<td>CHEFE DE CERIMONIAL</td>
<td>22455.9</td>
<td>GABINETE DA PRESIDÊNCIA</td>
<td>2017</td>
<td>9</td>
</tr>
<tr>
<td>ANTONIO JAIR DA ROSA</td>
<td>ASSISTENTE LEGISLATIVO III</td>
<td> </td>
<td>7383.64</td>
<td>GABINETE DA PRESIDÊNCIA</td>
<td>2017</td>
<td>9</td>
</tr>
<tr>
<td>BRASILINO SILVA BRANDAO</td>
<td>ASSISTENTE LEGISLATIVO III</td>
<td> </td>
<td>8135.51</td>
<td>GABINETE DA PRESIDÊNCIA</td>
<td>2017</td>
<td>9</td>
</tr>
</tbody>
</table>
<p>Some of the columns are strings (name, role, function, and department), one is numeric (remuneration), and two are date parts (year and month). We’ll think about the types and constraints on each of these columns in a minute, but first let’s see what goodtables tells us out of the box.</p>
<h2 id="initial-validations">Initial validations</h2>
<p><a href="https://github.com/frictionlessdata/goodtables-py/" title="goodtables">Goodtables</a> is written in Python, and can be used both as a command-line tool or imported in your Python code. We’ll use it in the command-line. Considering that our data lives in <code class="language-plaintext highlighter-rouge">data/remunerations.csv</code>, we validate it by running <code class="language-plaintext highlighter-rouge">goodtables data/remunerations.csv</code>. This is the output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DATASET
=======
{'error-count': 0,
'preset': 'table',
'table-count': 1,
'time': 0.025,
'valid': True}
---------
Warning: Table "data/remunerations.csv" inspection has reached 1000 row(s) limit
TABLE [1]
=========
{'encoding': 'utf-8',
'error-count': 0,
'format': 'csv',
'headers': ['name', 'role', 'function', 'remuneration', 'department', 'year', 'month'],
'row-count': 1000,
'schema': None,
'scheme': 'file',
'source': 'data/remunerations.csv',
'time': 0.024,
'valid': True}
</code></pre></div></div>
<p>It hasn’t found any errors, good! However, there’s a warning: it just analyzed the first 1,000 rows. Maybe there’s an error in the other rows? As our data is very small, with a bit over 2,000 rows, analyzing everything should be quick. Let’s try again with a high row limit, using <code class="language-plaintext highlighter-rouge">goodtables --row-limit 999999 data/remunerations.csv</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DATASET
=======
{'error-count': 1,
'preset': 'table',
'table-count': 1,
'time': 0.046,
'valid': False}
TABLE [1]
=========
{'encoding': 'utf-8',
'error-count': 1,
'format': 'csv',
'headers': ['name', 'role', 'function', 'remuneration', 'department', 'year', 'month'],
'row-count': 2043,
'schema': None,
'scheme': 'file',
'source': 'data/remunerations.csv',
'time': 0.045,
'valid': False}
---------
[1859,-] [duplicate-row] Row 1859 is duplicated to row(s) 1858
</code></pre></div></div>
<p>A-ha! Now it found an error: duplicate rows. Depending on the data, this might or might not be an issue. Goodtables is helpful enough to tell us the row numbers, so let’s take a look at them:</p>
<table class="table table-striped table-bordered" style="display: block; overflow: auto;">
<thead>
<tr>
<th>name</th>
<th>role</th>
<th>function</th>
<th>remuneration</th>
<th>department</th>
<th>year</th>
<th>month</th>
</tr>
</thead>
<tbody>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>CTI-4 - EQUIPE DE TELECOMUNICAÇÕES E INFRAESTRUTURA</td>
<td>2017</td>
<td>9</td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>CTI-4 - EQUIPE DE TELECOMUNICAÇÕES E INFRAESTRUTURA</td>
<td>2017</td>
<td>9</td>
</tr>
</tbody>
</table>
<p>This does look like a valid error (no names?). After investigating for a while, I found the culprit: the source website was modified. There are now a few cases where the civil servant’s name was removed by judicial order, and it broke my code. The joys of scraping, right?</p>
<p>After fixing it and running goodtables again, this is what I got:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DATASET
=======
{'error-count': 0,
'preset': 'table',
'table-count': 1,
'time': 0.083,
'valid': True}
TABLE [1]
=========
{'encoding': 'utf-8',
'error-count': 0,
'format': 'csv',
'headers': ['name', 'role', 'function', 'remuneration', 'department', 'year', 'month'],
'row-count': 4083,
'schema': None,
'scheme': 'file',
'source': 'data/remunerations.csv',
'time': 0.081,
'valid': True}
</code></pre></div></div>
<p>Great, no more errors!</p>
<p>Without giving any information about my data, goodtables found out there was a duplicate row. This led me to discover that the website I’m scraping had been modified in a way that broke my code. Even if we stopped now, this would already have been useful. We won’t, though: there are still a few useful tricks up goodtables’ sleeve.</p>
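<p>For completeness, the same check can be run from Python rather than the command line. A minimal sketch, assuming the goodtables-py API of the time (a <code class="language-plaintext highlighter-rouge">validate()</code> function with a <code class="language-plaintext highlighter-rouge">row_limit</code> option mirroring the <code class="language-plaintext highlighter-rouge">--row-limit</code> flag):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># The CLI runs above, done programmatically.
from goodtables import validate

report = validate('data/remunerations.csv', row_limit=999999)
print(report['valid'])        # True once the scraper is fixed
print(report['error-count'])  # 0
</code></pre></div></div>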
<h2 id="improving-the-validations">Improving the validations</h2>
<p>Although goodtables provides valuable information for an arbitrary CSV, its real power comes when we tell it the data schema. It’ll validate that the data is what we expect it to be (numbers are numbers, dates are valid, etc.). The easiest way to define this schema is by creating a <a href="http://frictionlessdata.io/data-packages/" title="Data Package">Data Package</a>.</p>
<p>The first thing we need is to create a JSON file named <code class="language-plaintext highlighter-rouge">datapackage.json</code>:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"remunerations-cmsp"</span><span class="p">,</span><span class="w">
</span><span class="nl">"title"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Remuneration of the civil servants from the Sao Paulo's City Council"</span><span class="p">,</span><span class="w">
</span><span class="nl">"resources"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"remunerations"</span><span class="p">,</span><span class="w">
</span><span class="nl">"path"</span><span class="p">:</span><span class="w"> </span><span class="s2">"data/remunerations.csv"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>This is the simplest data package we can create for this data. It just defines a <code class="language-plaintext highlighter-rouge">name</code> and <code class="language-plaintext highlighter-rouge">title</code> for the dataset, and a single resource, our CSV file. Goodtables supports data packages out of the box, so we can run <code class="language-plaintext highlighter-rouge">goodtables datapackage.json</code> and it’ll give us the same result as running <code class="language-plaintext highlighter-rouge">goodtables data/remunerations.csv</code> directly. With this in place, we can start writing the schema.</p>
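<p>By the way, if you prefer running the validation from Python rather than the command line, goodtables also exposes a <code class="language-plaintext highlighter-rouge">validate</code> function. Here is a minimal sketch, assuming the goodtables Python package is installed and that its API accepts the same row limit we passed to the CLI:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># pip install goodtables
from pprint import pprint

from goodtables import validate

# Validate the data package descriptor; every resource it lists gets checked
report = validate('datapackage.json', row_limit=999999)

pprint(report['valid'])        # overall result for the dataset
pprint(report['error-count'])  # total number of errors found
</code></pre></div></div>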
<p>Think of the schema as a data dictionary. It defines what each column means, what type of data it contains, which format it is in, its description, and so on. Looking at the data, these are the data types of each of our columns:</p>
<ul>
<li>String
<ul>
<li>Name</li>
<li>Role</li>
<li>Function</li>
<li>Department</li>
</ul>
</li>
<li>Currency
<ul>
<li>Remuneration</li>
</ul>
</li>
<li>Dates
<ul>
<li>Year</li>
<li>Month</li>
</ul>
</li>
</ul>
<p>Schemas in data packages follow the <a href="http://frictionlessdata.io/guides/table-schema/" title="Table Schema">Table Schema</a> specification, which defines how to write the schema, a few basic types, and how to add constraints (e.g. uniqueness, required values, valid ranges). It sounds more complicated than it actually is. For instance, this is how we would write the column types defined above using Table Schema:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"remunerations-cmsp"</span><span class="p">,</span><span class="w">
</span><span class="nl">"title"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Remuneration of the civil servants from the Sao Paulo's City Council"</span><span class="p">,</span><span class="w">
</span><span class="nl">"resources"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"remunerations"</span><span class="p">,</span><span class="w">
</span><span class="nl">"path"</span><span class="p">:</span><span class="w"> </span><span class="s2">"data/remunerations.csv"</span><span class="p">,</span><span class="w">
</span><span class="nl">"schema"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"fields"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"name"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"string"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"role"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"string"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"function"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"string"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"remuneration"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"number"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"department"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"string"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"year"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"year"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"month"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"integer"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>The only thing we changed was adding the <code class="language-plaintext highlighter-rouge">schema</code> attribute to our resource; everything else is the same. When we run goodtables again, it still succeeds, but now it’s not only running the basic validations but also checking the cells’ types.</p>
<p>Can we improve it further? Of course!</p>
<p>Take a look at the <code class="language-plaintext highlighter-rouge">month</code> column. As Table Schema doesn’t have a “month” data type, we had to use the closest one: integer. A month is an integer, but it’s not <em>any</em> integer: it can’t be zero, -1, or 42; it must be between 1 and 12. Table Schema allows us to define these constraints in our schema, but before I show you how, what about the other columns? Are there other similar constraints, not only valid ranges, but also whether values are required or must be unique?</p>
<p>I went through all the columns, looking at the data to understand which constraints they have, and this is what I defined:</p>
<ul>
<li>Department
<ul>
<li>Required</li>
</ul>
</li>
<li>Remuneration
<ul>
<li>Required</li>
</ul>
</li>
<li>Year
<ul>
<li>Required</li>
<li>2017 or later (there’s no historical data)</li>
</ul>
</li>
<li>Month
<ul>
<li>Required</li>
<li>Between 1 and 12</li>
</ul>
</li>
</ul>
<p>There are no constraints for <code class="language-plaintext highlighter-rouge">name</code>, <code class="language-plaintext highlighter-rouge">role</code> and <code class="language-plaintext highlighter-rouge">function</code> other than the type. In the <code class="language-plaintext highlighter-rouge">datapackage.json</code>, these fields will look like this:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"department"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"string"</span><span class="p">,</span><span class="w">
</span><span class="nl">"constraints"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"required"</span><span class="p">:</span><span class="w"> </span><span class="s2">"true"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="err">,</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"remuneration"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"number"</span><span class="p">,</span><span class="w">
</span><span class="nl">"constraints"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"required"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="err">,</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"year"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"number"</span><span class="p">,</span><span class="w">
</span><span class="nl">"constraints"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"required"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"minimum"</span><span class="p">:</span><span class="w"> </span><span class="mi">2017</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="err">,</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"month"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"number"</span><span class="p">,</span><span class="w">
</span><span class="nl">"constraints"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"required"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"minimum"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
</span><span class="nl">"maximum"</span><span class="p">:</span><span class="w"> </span><span class="mi">12</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>Goodtables now raises a few errors on <code class="language-plaintext highlighter-rouge">remuneration</code>: there are some rows where it’s empty. Looking back at the original website, I confirmed that I was wrong; there really are some rows without a <code class="language-plaintext highlighter-rouge">remuneration</code> (apparently the councillors’ remunerations are published somewhere else). After removing this constraint, everything runs successfully.</p>
<p>The final <code class="language-plaintext highlighter-rouge">datapackage.json</code> looks like:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"remunerations-cmsp"</span><span class="p">,</span><span class="w">
</span><span class="nl">"title"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Remuneration of the civil servants from the Sao Paulo's City Council"</span><span class="p">,</span><span class="w">
</span><span class="nl">"resources"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"remunerations"</span><span class="p">,</span><span class="w">
</span><span class="nl">"path"</span><span class="p">:</span><span class="w"> </span><span class="s2">"data/remunerations.csv"</span><span class="p">,</span><span class="w">
</span><span class="nl">"schema"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"fields"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"name"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"string"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"role"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"string"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"function"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"string"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"remuneration"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"number"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"department"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"string"</span><span class="p">,</span><span class="w">
</span><span class="nl">"constraints"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"required"</span><span class="p">:</span><span class="w"> </span><span class="s2">"true"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"year"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"year"</span><span class="p">,</span><span class="w">
</span><span class="nl">"constraints"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"required"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"minimum"</span><span class="p">:</span><span class="w"> </span><span class="mi">2017</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"month"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"integer"</span><span class="p">,</span><span class="w">
</span><span class="nl">"constraints"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"required"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"minimum"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
</span><span class="nl">"maximum"</span><span class="p">:</span><span class="w"> </span><span class="mi">12</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>I could’ve added constraints on the <code class="language-plaintext highlighter-rouge">role</code>, <code class="language-plaintext highlighter-rouge">function</code>, and <code class="language-plaintext highlighter-rouge">department</code> fields, as they can only take a limited set of values (e.g. there’s no department “Foobar”). I decided it wasn’t worth the trouble for now, as I don’t have a list of possible values at hand. If I want to add these or other constraints in the future, the structure is already in place, so it’s straightforward.</p>
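<p>For the record, here is roughly what such a constraint would look like, using Table Schema’s <code class="language-plaintext highlighter-rouge">enum</code> constraint with a hypothetical, incomplete list of departments:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "name": "department",
  "type": "string",
  "constraints": {
    "required": true,
    "enum": [
      "CTI-4 - EQUIPE DE TELECOMUNICAÇÕES E INFRAESTRUTURA",
      "..."
    ]
  }
}
</code></pre></div></div>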
<h2 id="conclusion">Conclusion</h2>
<p>My intent with this post was to show you the value of adding even a little bit of data validation to your toolbox, and how easy it is to do so with goodtables. We started by running it without giving any information about our data. It found duplicate rows that led me to discover that the website I’m scraping has changed, so my scraper was out of date. After I updated the code and ran it again, goodtables was successful.</p>
<p>We then told goodtables more about our data by writing a schema using the <a href="http://frictionlessdata.io/data-packages/" title="Data Package">Data Package</a> and <a href="http://frictionlessdata.io/guides/table-schema/" title="Table Schema">Table Schema</a> specifications. This led me to get to know the data better, as my initial assumption that all rows must have a remuneration was wrong.</p>
<p>With all this in place, goodtables is now able to check not only the structure of the data, but that its contents are valid. The next step is how to make sure it stays this way. In a future post, I’ll show you how to run goodtables automatically as part of your test suite when your data is on GitHub.</p>
<p>I hope you found this interesting. If you’re curious about how this all fits together, check it out on <a href="https://github.com/vitorbaptista/remuneracao_cmsp">https://github.com/vitorbaptista/remuneracao_cmsp</a>.</p>
<p>If you have any questions, feedback, or would just like to chat, join our <a href="http://gitter.im/frictionlessdata/chat" title="Frictionless Data Gitter">Frictionless Data Gitter chat</a>. We’d love to hear from you, so we can make these tools as useful as they can be.</p>
Vitor Baptista
Core Data on DataHub.io
2017-11-03T00:00:00+00:00
http://okfnlabs.org/blog/2017/11/03/core-data
<p>This blog post was originally published on <a href="http://datahub.io/">datahub.io</a> by <a href="http://datahub.io/rufuspollock">Rufus Pollock</a>, <a href="http://datahub.io/Mikanebu">Meiran Zhiyenbayev</a> & <a href="http://datahub.io/anuveyatsu">Anuar Ustayev</a>.</p>
<hr />
<p>The “Core Data” project provides essential data for data wranglers and the data science community. Its online home is on the DataHub:</p>
<p><a href="https://datahub.io/core">https://datahub.io/core</a></p>
<p><a href="https://datahub.io/docs/core-data">https://datahub.io/docs/core-data</a></p>
<p>This post introduces you to Core Data, presents a couple of examples, and shows how you can easily access and use core data from your own tools and systems, including R, Python, Pandas and more.</p>
<ul id="markdown-toc">
<li><a href="#why-core-data" id="markdown-toc-why-core-data">Why Core Data</a></li>
<li><a href="#examples" id="markdown-toc-examples">Examples</a> <ul>
<li><a href="#list-of-countries" id="markdown-toc-list-of-countries">List of Countries</a></li>
<li><a href="#country-codes" id="markdown-toc-country-codes">Country Codes</a></li>
<li><a href="#population" id="markdown-toc-population">Population</a></li>
</ul>
</li>
<li><a href="#use-core-data-from-your-favorite-language-or-tool" id="markdown-toc-use-core-data-from-your-favorite-language-or-tool">Use Core Data from your favorite language or tool</a> <ul>
<li><a href="#csv-and-json" id="markdown-toc-csv-and-json">CSV and JSON</a></li>
<li><a href="#curl" id="markdown-toc-curl">cURL</a></li>
<li><a href="#r" id="markdown-toc-r">R</a></li>
<li><a href="#python" id="markdown-toc-python">Python</a></li>
<li><a href="#pandas" id="markdown-toc-pandas">Pandas</a></li>
<li><a href="#ruby-javascript-and-many-more" id="markdown-toc-ruby-javascript-and-many-more">Ruby, JavaScript and many more</a></li>
</ul>
</li>
<li><a href="#conclusion" id="markdown-toc-conclusion">Conclusion</a></li>
</ul>
<h2 id="why-core-data">Why Core Data</h2>
<p>If you build data-driven applications or produce data-driven insights, you regularly find yourself wanting common “core” data: things like lists of countries, populations, geographic boundaries and more.</p>
<p>However, finding good quality data has always been challenging. Professionals can spend lots of time finding and preparing data before they get to do any real work analysing or presenting it.</p>
<p>To address this, a few years ago we started the “core data” project as part of the Frictionless Data initiative. Its purpose was to curate important, commonly used datasets including reference data like country codes, indicators like population and GDP, and geodata like country boundaries. It provides them in a high-quality, easy-to-use, and standard form.</p>
<p>Recently the Core Data project has got even better with a new home on the newly upgraded DataHub and has expanded thanks to new partners like Datopian and John Snow Labs (more on this in a future post!).
<br /><br /></p>
<h2 id="examples">Examples</h2>
<p>There are dozens of core datasets already available and many more being worked on, including a list of countries and their 2-digit codes, and a more extensive version.</p>
<h3 id="list-of-countries">List of Countries</h3>
<p>Ever needed to build a drop-down list of countries in a web application? Or ever needed to add country name labels for a graph and only had country codes?</p>
<p>Then these datasets are for you!</p>
<p>First up is the very simple “country-list” dataset:</p>
<p><a href="https://datahub.io/core/country-list">https://datahub.io/core/country-list</a></p>
<p>You can see a preview table for the dataset on the showcase page:
<br /><br /></p>
<p><img src="/img/posts/country-list-preview-table.png" alt="" /></p>
<p><br />
You can download it in either CSV or JSON formats:
<br /><br /></p>
<p><img src="/img/posts/country-list-downloads.png" alt="" />
<br /></p>
<ul>
<li>CSV: <a href="https://datahub.io/core/country-list/r/data.csv">https://datahub.io/core/country-list/r/data.csv</a></li>
<li>JSON: <a href="https://datahub.io/core/country-list/r/data.json">https://datahub.io/core/country-list/r/data.json</a></li>
</ul>
<h3 id="country-codes">Country Codes</h3>
<p>Maybe the simple list of countries is not enough for you. Perhaps you need phone codes for each country, or want to know their currencies?</p>
<p>We’ve got you covered with the more extensive country codes dataset:</p>
<p><a href="https://datahub.io/core/country-codes">https://datahub.io/core/country-codes</a></p>
<p>It contains all the countries from Country List plus a number of associated codes: ISO 3166 codes, ITU dialing codes, ISO 4217 currency codes, and many others. In total, this dataset includes <strong>26</strong> different codes and associated pieces of information.</p>
<p>You can also preview the data and download it in different formats, just as described for the Country List dataset above:</p>
<ul>
<li>CSV: <a href="https://datahub.io/core/country-codes/r/country-codes.csv">https://datahub.io/core/country-codes/r/country-codes.csv</a></li>
<li>JSON: <a href="https://datahub.io/core/country-codes/r/country-codes.json">https://datahub.io/core/country-codes/r/country-codes.json</a></li>
</ul>
<h3 id="population">Population</h3>
<p>This is another dataset people find very useful: you regularly need population figures in order to do normalisations and calculate per capita values as part of a statistical analysis.</p>
<p>This dataset includes population figures for countries, regions (e.g. Asia) and the world. The data comes originally from the World Bank and has been converted into a standard tabular data package with CSV data and a table schema:</p>
<p><a href="https://datahub.io/core/population">https://datahub.io/core/population</a></p>
<p>Preview the data on the showcase page:
<br /><br /></p>
<p><img src="/img/posts/population-preview-table.png" alt="" />
<br />
Get the data in CSV or JSON formats just like for any other Core Datasets:</p>
<ul>
<li>CSV: <a href="https://datahub.io/core/population/r/population.csv">https://datahub.io/core/population/r/population.csv</a></li>
<li>JSON: <a href="https://datahub.io/core/population/r/population.json">https://datahub.io/core/population/r/population.json</a></li>
</ul>
<h2 id="use-core-data-from-your-favorite-language-or-tool">Use Core Data from your favorite language or tool</h2>
<p>We have made Core Data easy to use from various programming languages and tools. We will walk through our Country List example, but you can apply these instructions to any other core dataset on the DataHub.</p>
<h3 id="csv-and-json">CSV and JSON</h3>
<p>If you just need to get the data, you have a direct link usable from any tool or app, e.g. for the country list (see the short Pandas sketch after these links):</p>
<ul>
<li>CSV - <a href="https://datahub.io/core/country-list/r/data.csv">https://datahub.io/core/country-list/r/data.csv</a></li>
<li>JSON - <a href="https://datahub.io/core/country-list/r/data.json">https://datahub.io/core/country-list/r/data.json</a></li>
</ul>
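<p>Because these are plain URLs, you can read them directly with any HTTP-capable tool. For instance, here is a quick Python sketch using Pandas, which happily reads a CSV straight from a URL and follows the redirect (for full Data Package support, see the Pandas section below):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># pip install pandas
import pandas as pd

# Read the country list CSV directly from its DataHub URL
countries = pd.read_csv('https://datahub.io/core/country-list/r/data.csv')
print(countries.head())
</code></pre></div></div>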
<div class="alert alert-info">
For more read our "Getting Data" tutorial:
<br />
<p><a href="https://datahub.io/docs/getting-started/getting-data">https://datahub.io/docs/getting-started/getting-data</a></p>
</div>
<h3 id="curl">cURL</h3>
<p>The following commands help you get the data using the cURL tool. Use the <code class="language-plaintext highlighter-rouge">-L</code> flag so cURL follows redirects:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"> <span class="c"># Get the data:</span>
curl <span class="nt">-L</span> https://datahub.io/core/country-list/r/data.csv
<span class="c"># datapackage.json provides metadata and a list of all data files</span>
curl <span class="nt">-L</span> https://datahub.io/core/country-list/datapackage.json
<span class="c"># See just the available data files (resources):</span>
curl <span class="nt">-L</span> https://datahub.io/core/country-list/datapackage.json | jq <span class="s2">".resources"</span></code></pre></figure>
<h3 id="r">R</h3>
<p>If you are using R here’s how to get the data you want quickly loaded:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"jsonlite"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"jsonlite"</span><span class="p">)</span><span class="w">
</span><span class="n">json_file</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"https://datahub.io/core/country-list/datapackage.json"</span><span class="w">
</span><span class="n">json_data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">fromJSON</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="n">readLines</span><span class="p">(</span><span class="n">json_file</span><span class="p">),</span><span class="w"> </span><span class="n">collapse</span><span class="o">=</span><span class="s2">""</span><span class="p">))</span><span class="w">
</span><span class="c1"># access csv file by the index starting from 1</span><span class="w">
</span><span class="n">path_to_file</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">json_data</span><span class="o">$</span><span class="n">resources</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="o">$</span><span class="n">path</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read.csv</span><span class="p">(</span><span class="n">url</span><span class="p">(</span><span class="n">path_to_file</span><span class="p">))</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">data</span><span class="p">)</span></code></pre></figure>
<h3 id="python">Python</h3>
<p>Here we take a look at how to get the Country List in the Python programming language.</p>
<p>First, install the <code class="language-plaintext highlighter-rouge">datapackage</code> library (all the datasets on the DataHub are Data Packages):</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"> pip <span class="nb">install </span>datapackage</code></pre></figure>
<p>Again, we’ll use the <code class="language-plaintext highlighter-rouge">country-list</code> dataset:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"> <span class="kn">from</span> <span class="nn">datapackage</span> <span class="kn">import</span> <span class="n">Package</span>
<span class="n">package</span> <span class="o">=</span> <span class="n">Package</span><span class="p">(</span><span class="s">'https://datahub.io/core/country-list/datapackage.json'</span><span class="p">)</span>
<span class="c1"># get list of resources:
</span> <span class="n">resources</span> <span class="o">=</span> <span class="n">package</span><span class="p">.</span><span class="n">descriptor</span><span class="p">[</span><span class="s">'resources'</span><span class="p">]</span>
<span class="n">resourceList</span> <span class="o">=</span> <span class="p">[</span><span class="n">resources</span><span class="p">[</span><span class="n">x</span><span class="p">][</span><span class="s">'name'</span><span class="p">]</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">resources</span><span class="p">))]</span>
<span class="k">print</span><span class="p">(</span><span class="n">resourceList</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">package</span><span class="p">.</span><span class="n">resources</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">read</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">data</span><span class="p">)</span></code></pre></figure>
<h3 id="pandas">Pandas</h3>
<p>In order to work with Data Packages in Pandas, you need to install the Frictionless Data <code class="language-plaintext highlighter-rouge">datapackage</code> library and its Pandas extension:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"> pip <span class="nb">install </span>datapackage
pip <span class="nb">install </span>jsontableschema-pandas</code></pre></figure>
<p>To get the data, run the following code:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"> <span class="kn">import</span> <span class="nn">datapackage</span>
<span class="n">data_url</span> <span class="o">=</span> <span class="s">"https://datahub.io/core/country-list/datapackage.json"</span>
<span class="c1"># to load Data Package into storage
</span> <span class="n">storage</span> <span class="o">=</span> <span class="n">datapackage</span><span class="p">.</span><span class="n">push_datapackage</span><span class="p">(</span><span class="n">data_url</span><span class="p">,</span> <span class="s">'pandas'</span><span class="p">)</span>
<span class="c1"># data frames available (corresponding to data files in original dataset)
</span> <span class="n">storage</span><span class="p">.</span><span class="n">buckets</span>
<span class="c1"># you can access datasets inside storage, e.g. the first one:
</span> <span class="n">storage</span><span class="p">[</span><span class="n">storage</span><span class="p">.</span><span class="n">buckets</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span></code></pre></figure>
<h3 id="ruby-javascript-and-many-more">Ruby, JavaScript and many more</h3>
<p>We also have support for JavaScript, SQL, Ruby and PHP. See our “Getting Data” tutorial for more:</p>
<p><a href="https://datahub.io/docs/getting-started/getting-data">https://datahub.io/docs/getting-started/getting-data</a></p>
<h2 id="conclusion">Conclusion</h2>
<p>This post has shown how you can import datasets in a high-quality, standard form quickly and easily.</p>
<p>There are many more datasets to explore than the three we showed you here. You can find a full list here:</p>
<p><a href="https://datahub.io/core">https://datahub.io/core</a></p>
<p>Finally, we would love collaborators to help us curate even more core datasets. If you’re interested you can find out more about the Core Data Curator program here:</p>
<p><a href="https://datahub.io/docs/core-data/curators">https://datahub.io/docs/core-data/curators</a></p>
<hr />
<p><em>If you have questions, comments or feedback join <a href="https://gitter.im/datahubio/chat">DataHub’s chat channel</a> or open an issue on <a href="https://github.com/datahubio/qa">DataHub’s tracker</a>.</em></p>
DataHub Team
Data Package v1 Specifications. What has Changed and how to Upgrade
2017-10-11T00:00:00+00:00
http://okfnlabs.org/blog/2017/10/11/upgrade-to-data-package-specs-v1
<p>This post walks you through the major changes in the Data Package v1 specs compared to pre-v1. It covers changes in the full suite of Data Package specifications including Data Resources and Table Schema. It is particularly valuable if:</p>
<ul>
<li>you were using Data Packages pre v1 and want to know how to upgrade your datasets</li>
<li>you are implementing Data Package related tooling and want to know how to upgrade your tools, or want to support or auto-upgrade pre-v1 Data Packages for backwards compatibility</li>
</ul>
<p>It also includes a script we have created (in JavaScript) that we’ve been using ourselves to automate upgrades of the <a href="https://github.com/datahq/datapackage-normalize-js">Core Data</a>.</p>
<h2 id="the-changes">The Changes</h2>
<p>Two major changes in v1 were presentational:</p>
<ul>
<li>Creating Data Resource as a separate spec from Data Package. This did not change anything substantive in terms of how data packages worked but is important presentationally. In parallel, we also split out a Tabular Data Resource from the Tabular Data Package.</li>
<li>Renaming JSON Table Schema to just Table Schema</li>
</ul>
<p>In addition, there were a fair number of substantive changes. We summarize these in the sections below. For more detailed info see the <a href="https://specs.frictionlessdata.io/">current specifications</a> and <a href="https://pre-v1.frictionlessdata.io/">the old site containing the pre spec v1 specifications</a>.</p>
<h3 id="table-schema">Table Schema</h3>
<p>Link to spec: <a href="https://specs.frictionlessdata.io/table-schema/">https://specs.frictionlessdata.io/table-schema/</a></p>
<table class="table table-striped table-bordered" style="display: block; overflow: auto;">
<thead>
<tr>
<th>Property</th>
<th>Pre v1</th>
<th>v1 Spec</th>
<th>Notes</th>
<th>Issue</th>
</tr>
</thead>
<tbody>
<tr>
<td>id/name</td>
<td>id</td>
<td>name</td>
<td>Renamed id to name to be consistent across specs</td>
<td> </td>
</tr>
<tr>
<td>type/number</td>
<td>format: currency</td>
<td>format: currency removed; additional properties: bareNumber, decimalChar and groupChar</td>
<td> </td>
<td><a href="https://github.com/frictionlessdata/specs/issues/509">#509</a><br /><a href="https://github.com/frictionlessdata/specs/issues/246">#246</a></td>
</tr>
<tr>
<td>type/integer</td>
<td>No additional properties</td>
<td>Additional properties: bareNumber</td>
<td> </td>
<td><a href="https://github.com/frictionlessdata/specs/issues/509">#509</a></td>
</tr>
<tr>
<td>type/boolean</td>
<td>true: [yes, y, true, t, 1],false: [no, n, false, f, 0]</td>
<td>true: [ true, True, TRUE, 1],false: [false, False, FALSE, 0]</td>
<td> </td>
<td><a href="https://github.com/frictionlessdata/specs/issues/415">#415</a></td>
</tr>
<tr>
<td>type/year + yearmonth</td>
<td> </td>
<td>year and yearmonth (NB: these were temporarily gyear and gyearmonth)</td>
<td> </td>
<td><a href="https://github.com/frictionlessdata/specs/issues/346">#346</a></td>
</tr>
<tr>
<td>type/duration</td>
<td> </td>
<td>duration</td>
<td> </td>
<td><a href="https://github.com/frictionlessdata/specs/issues/210">#210</a></td>
</tr>
<tr>
<td>type/rdfType</td>
<td> </td>
<td>rdfType</td>
<td>Support rich “semantic web” types for fields</td>
<td><a href="https://github.com/frictionlessdata/specs/issues/217">#217</a></td>
</tr>
<tr>
<td>type/null</td>
<td> </td>
<td>removed (see missingValues)</td>
<td> </td>
<td><a href="https://github.com/frictionlessdata/specs/issues/262">#262</a></td>
</tr>
<tr>
<td>missingValues</td>
<td> </td>
<td>missingValues</td>
<td>Missing values support did not exist pre v1.</td>
<td><a href="https://github.com/frictionlessdata/specs/issues/97">#97</a></td>
</tr>
</tbody>
</table>
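<p>To make a few of these changes concrete, here is a sketch of a small v1 Table Schema (the field names are illustrative) using the renamed <code class="language-plaintext highlighter-rouge">name</code> property, the new <code class="language-plaintext highlighter-rouge">year</code> type and the new <code class="language-plaintext highlighter-rouge">missingValues</code> support:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "fields": [
    { "name": "country", "type": "string" },
    { "name": "year", "type": "year" },
    { "name": "population", "type": "number" }
  ],
  "missingValues": ["", "N/A"]
}
</code></pre></div></div>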
<h3 id="data-resource">Data Resource</h3>
<p>Link to spec: <a href="https://specs.frictionlessdata.io/data-resource/">https://specs.frictionlessdata.io/data-resource/</a></p>
<p><em>Note: Data Resource did not exist as a separate spec pre-v1 so strictly we are comparing the Data Resource section of the old Data Package spec with the new Data Resource spec.</em></p>
<table class="table table-striped table-bordered" style="display: block; overflow: auto;">
<thead>
<tr>
<th>Property</th>
<th>Pre v1</th>
<th>v1 Spec</th>
<th>Notes</th>
<th>Issue</th>
</tr>
</thead>
<tbody>
<tr>
<td>path</td>
<td>path and url</td>
<td>path only</td>
<td>url merged into path; path can now be a URL or local path</td>
<td><a href="https://github.com/frictionlessdata/specs/issues/250">#250</a></td>
</tr>
<tr>
<td>path</td>
<td>string</td>
<td>string or array</td>
<td>path can be an array to support a single resource split across multiple files</td>
<td><a href="https://github.com/frictionlessdata/specs/issues/228">#228</a></td>
</tr>
<tr>
<td>name</td>
<td>recommended</td>
<td>required</td>
<td>Made name required to enable access to resources by name consistently across tools</td>
<td> </td>
</tr>
<tr>
<td>profile</td>
<td> </td>
<td>recommended</td>
<td>See profiles discussion</td>
<td> </td>
</tr>
<tr>
<td>sources, licenses …</td>
<td> </td>
<td> </td>
<td>Inherited metadata from Data Package, like sources or licenses, upgraded in line with changes in Data Package</td>
<td> </td>
</tr>
</tbody>
</table>
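<p>For example, a v1 Data Resource must have a <code class="language-plaintext highlighter-rouge">name</code> and can use an array <code class="language-plaintext highlighter-rouge">path</code> for a single logical resource split across multiple files; a sketch with illustrative file names:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "name": "monthly-data",
  "profile": "data-resource",
  "path": [
    "data/2017-01.csv",
    "data/2017-02.csv"
  ]
}
</code></pre></div></div>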
<h3 id="tabular-data-resource">Tabular Data Resource</h3>
<p>Link to spec: <a href="https://specs.frictionlessdata.io/tabular-data-resource/">https://specs.frictionlessdata.io/tabular-data-resource/</a></p>
<p>Just as Data Resource was split out from Data Package, so Tabular Data Resource was split out from the old Tabular Data Package spec.</p>
<p>There were no significant changes here beyond those in Data Resource.</p>
<h3 id="data-package">Data Package</h3>
<p>Link to spec: <a href="https://specs.frictionlessdata.io/data-package/">https://specs.frictionlessdata.io/data-package/</a></p>
<table class="table table-striped table-bordered" style="display: block; overflow: auto;">
<thead>
<tr>
<th>Property</th>
<th>Pre v1</th>
<th>v1 Spec</th>
<th>Notes</th>
<th>Issue</th>
</tr>
</thead>
<tbody>
<tr>
<td>name</td>
<td>required</td>
<td>recommended</td>
<td>Unique names are not essential to any part of the present tooling so we have moved to recommended.</td>
<td> </td>
</tr>
<tr>
<td>id</td>
<td> </td>
<td>id property (globally unique)</td>
<td>Globally unique id property</td>
<td><a href="https://github.com/frictionlessdata/specs/issues/228">#228</a></td>
</tr>
<tr>
<td>licenses</td>
<td>license - object or string. The object structure must contain a type property and a url property linking to the actual text</td>
<td>licenses is an array. Each item in the array is a License and must be an object. The object must contain a name property and/or a path property, and may contain a title property.</td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>author</td>
<td>author</td>
<td>author is removed in favour of contributors</td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>contributor</td>
<td>name, email, web properties with name required</td>
<td>title property required; role property values must be one of author, publisher, maintainer, wrangler, and contributor (defaults to contributor)</td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>sources</td>
<td>name, web and email; none required</td>
<td>title, path and email; title is required</td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>resources</td>
<td> </td>
<td>resources array is required</td>
<td> </td>
<td><a href="https://github.com/frictionlessdata/specs/issues/434">#434</a></td>
</tr>
<tr>
<td>dataDependencies</td>
<td>dataDependencies</td>
<td> </td>
<td>Moved to a pattern until we have greater clarity on need.</td>
<td><a href="https://github.com/frictionlessdata/specs/issues/341">#341</a></td>
</tr>
</tbody>
</table>
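<p>Putting the metadata changes together, a minimal v1 descriptor using the new <code class="language-plaintext highlighter-rouge">licenses</code>, <code class="language-plaintext highlighter-rouge">contributors</code> and <code class="language-plaintext highlighter-rouge">sources</code> shapes might look like this (all values are illustrative):</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "name": "example-package",
  "licenses": [
    { "name": "CC0-1.0", "title": "CC0 1.0", "path": "https://creativecommons.org/publicdomain/zero/1.0/" }
  ],
  "contributors": [
    { "title": "Jane Wrangler", "role": "author" }
  ],
  "sources": [
    { "title": "Example source", "path": "https://example.com/data" }
  ],
  "resources": [
    { "name": "data", "path": "data.csv" }
  ]
}
</code></pre></div></div>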
<h3 id="tabular-data-package">Tabular Data Package</h3>
<p>Link to spec: <a href="https://specs.frictionlessdata.io/tabular-data-package/">https://specs.frictionlessdata.io/tabular-data-package/</a></p>
<p>Tabular Data Package is unchanged.</p>
<h3 id="profiles">Profiles</h3>
<p>Profiles arrived in v1:</p>
<p><a href="http://specs.frictionlessdata.io/profiles/">http://specs.frictionlessdata.io/profiles/</a></p>
<p>Profiles are the first step in supporting a rich ecosystem of “micro-schemas” for data. They provide a very simple way to quickly state that your data follows a specific structure and/or schema. From the docs:</p>
<blockquote>
<p>Different kinds of data need different formats for their data and metadata. To support these different data and metadata formats we need to extend and specialise the generic Data Package. These specialized types of Data Package (or Data Resource) are termed profiles.</p>
<p>For example, there is a Tabular Data Package profile that specializes Data Packages specifically for tabular data. And there is a “Fiscal” Data Package profile designed for government financial data that includes requirements that certain columns are present in the data e.g. Amount or Date and that they contain data of certain types.</p>
</blockquote>
<p>We think profiles are an easy, lightweight way to start adding more structure to your data.</p>
<p>Profiles can be specified on both resources and packages.</p>
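<p>Declaring a profile is just a single property on the package and/or on each resource; a minimal sketch:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "profile": "tabular-data-package",
  "resources": [
    {
      "name": "data",
      "path": "data.csv",
      "profile": "tabular-data-resource"
    }
  ]
}
</code></pre></div></div>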
<h2 id="automate-upgrading-your-descriptor-according-to-the-spec-v1">Automate upgrading your descriptor according to the spec v1</h2>
<p>We have created a <a href="https://github.com/datahq/datapackage-normalize-js">data package normalization script</a> that you can use to automate the process of upgrading a <code class="language-plaintext highlighter-rouge">datapackage.json</code> or Table Schema from pre-v1 to v1.</p>
<p>The script enables you to automate updating your <code class="language-plaintext highlighter-rouge">datapackage.json</code> for the following properties: <code class="language-plaintext highlighter-rouge">path</code>, <code class="language-plaintext highlighter-rouge">contributors</code>, <code class="language-plaintext highlighter-rouge">resources</code>, <code class="language-plaintext highlighter-rouge">sources</code> and <code class="language-plaintext highlighter-rouge">licenses</code>.</p>
<p>This is a simple script that you can download directly from here:</p>
<p><a href="https://raw.githubusercontent.com/datahq/datapackage-normalize-js/master/normalize.js">https://raw.githubusercontent.com/datahq/datapackage-normalize-js/master/normalize.js</a></p>
<p>e.g. using wget:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wget https://raw.githubusercontent.com/datahq/datapackage-normalize-js/master/normalize.js
</code></pre></div></div>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># path (optional) is the path to datapackage.json</span>
<span class="c"># if not provided looks in current directory</span>
normalize.js <span class="o">[</span>path]
<span class="c"># prints out updated datapackage.json</span>
</code></pre></div></div>
<p>You can also use as a library:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># install it from npm</span>
npm <span class="nb">install </span>datapackage-normalize
</code></pre></div></div>
<p>so you can use it in your javascript:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="nx">normalize</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">datapackage-normalize</span><span class="dl">'</span><span class="p">)</span>
<span class="kd">const</span> <span class="nx">path</span> <span class="o">=</span> <span class="dl">'</span><span class="s1">path/to/datapackage.json</span><span class="dl">'</span>
<span class="nx">normalize</span><span class="p">(</span><span class="nx">path</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="conclusion">Conclusion</h2>
<p>The above summarizes the main changes in v1 of the Data Package suite of specs, along with instructions on how to upgrade.</p>
<p>If you want to see the specifications in more detail, please visit the <a href="https://specs.frictionlessdata.io/">Data Package specifications</a>. You can also visit the <a href="http://frictionlessdata.io/">Frictionless Data initiative</a> for more information about Data Packages.</p>
<hr />
<p><em>This blog post was originally published on <a href="http://datahub.io/">datahub.io</a> by Meiran Zhiyenbayev. Meiran works for <a href="https://datopian.com/">Datopian</a> who have been developing datahub.io as part of the Frictionless Data initative</em>.</p>
Meiran Zhiyenbayev
Frictionless Data Specs v1 Updates
2017-10-05T00:00:00+00:00
http://okfnlabs.org/blog/2017/10/05/frictionless-data-specs-v1-updates
<p>The Frictionless Data team released the v1 specifications in the first week of September 2017, and Paul Walsh, Chief Product Officer at Open Knowledge International, <a href="https://blog.okfn.org/2017/09/05/frictionless-data-v1-0/">wrote a detailed blogpost about it</a>. With this milestone, in addition to modifications to pre-existing specifications like Table Schema<sup id="fnref:tableschema" role="doc-noteref"><a href="#fn:tableschema" class="footnote" rel="footnote">1</a></sup> and CSV Dialect<sup id="fnref:csvdialect" role="doc-noteref"><a href="#fn:csvdialect" class="footnote" rel="footnote">2</a></sup> in line with our design philosophy<sup id="fnref:philosophy" role="doc-noteref"><a href="#fn:philosophy" class="footnote" rel="footnote">3</a></sup>, the team created two new specifications, Data Resource<sup id="fnref:dr" role="doc-noteref"><a href="#fn:dr" class="footnote" rel="footnote">4</a></sup> and Tabular Data Resource<sup id="fnref:tdr" role="doc-noteref"><a href="#fn:tdr" class="footnote" rel="footnote">5</a></sup>, which employ explicit pattern rules to help describe data resources unambiguously.</p>
<p>Following the September release, the team has now updated our range of frictionless data implementations to work with v1 specs - from <code class="language-plaintext highlighter-rouge">tableschema</code> and <code class="language-plaintext highlighter-rouge">datapackage</code> libraries to <code class="language-plaintext highlighter-rouge">tableschema</code> plugins and the <code class="language-plaintext highlighter-rouge">goodtables.io</code> service.</p>
<p>Some of the highlights from this update include:</p>
<ul>
<li>SQL/BigQuery/Pandas plugins now work with all 15 Table Schema types<sup id="fnref:types" role="doc-noteref"><a href="#fn:types" class="footnote" rel="footnote">6</a></sup> with no data loss,</li>
<li>use Frictionless Data tools <sup id="fnref:tools" role="doc-noteref"><a href="#fn:tools" class="footnote" rel="footnote">7</a></sup> to infer, package and use data from different online sources,</li>
<li>create data packages from a select few tables in your database.</li>
</ul>
<h2 id="table-schema-plugins-update">Table Schema Plugins update</h2>
<p>The <a href="https://github.com/frictionlessdata/tableschema-pandas-py">Pandas</a>, <a href="https://github.com/frictionlessdata/tableschema-sql-py">SQL</a> and <a href="https://github.com/frictionlessdata/tableschema-bigquery-py">BigQuery</a> plugins have now been updated to work with v1 specifications.</p>
<p>Here’s how you can infer arbitrary CSV files from an online source, create a data package with the data and analyze it in a widely used data analysis tool like Pandas or an SQL database:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#pip install datapackage tableschema tableschema_sql tableschema_pandas
</span><span class="kn">from</span> <span class="nn">pprint</span> <span class="kn">import</span> <span class="n">pprint</span>
<span class="kn">from</span> <span class="nn">tableschema</span> <span class="kn">import</span> <span class="n">Storage</span>
<span class="kn">from</span> <span class="nn">datapackage</span> <span class="kn">import</span> <span class="n">Package</span>
<span class="kn">from</span> <span class="nn">sqlalchemy</span> <span class="kn">import</span> <span class="n">create_engine</span>
<span class="c1"># Infer data package from some CSVs in the internet
</span><span class="n">package</span> <span class="o">=</span> <span class="n">Package</span><span class="p">()</span>
<span class="n">package</span><span class="p">.</span><span class="n">add_resource</span><span class="p">({</span><span class="s">'name'</span><span class="p">:</span> <span class="s">'teams'</span><span class="p">,</span> <span class="s">'path'</span><span class="p">:</span> <span class="s">'https://raw.githubusercontent.com/danielfrg/espn-nba-scrapy/master/data/teams.csv'</span><span class="p">})</span>
<span class="n">package</span><span class="p">.</span><span class="n">add_resource</span><span class="p">({</span><span class="s">'name'</span><span class="p">:</span> <span class="s">'games'</span><span class="p">,</span> <span class="s">'path'</span><span class="p">:</span> <span class="s">'https://raw.githubusercontent.com/danielfrg/espn-nba-scrapy/master/data/games.csv'</span><span class="p">})</span>
<span class="n">package</span><span class="p">.</span><span class="n">infer</span><span class="p">()</span>
<span class="n">pprint</span><span class="p">(</span><span class="n">package</span><span class="p">.</span><span class="n">descriptor</span><span class="p">)</span>
<span class="c1"># Check data package integrity
</span><span class="n">package</span><span class="p">.</span><span class="n">descriptor</span><span class="p">[</span><span class="s">'resources'</span><span class="p">][</span><span class="mi">1</span><span class="p">][</span><span class="s">'schema'</span><span class="p">][</span><span class="s">'foreignKeys'</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">{</span><span class="s">'fields'</span><span class="p">:</span> <span class="s">'home_team'</span><span class="p">,</span> <span class="s">'reference'</span><span class="p">:</span> <span class="p">{</span><span class="s">'resource'</span><span class="p">:</span> <span class="s">'teams'</span><span class="p">,</span> <span class="s">'fields'</span><span class="p">:</span> <span class="s">'name'</span><span class="p">}},</span>
<span class="p">{</span><span class="s">'fields'</span><span class="p">:</span> <span class="s">'visit_team'</span><span class="p">,</span> <span class="s">'reference'</span><span class="p">:</span> <span class="p">{</span><span class="s">'resource'</span><span class="p">:</span> <span class="s">'teams'</span><span class="p">,</span> <span class="s">'fields'</span><span class="p">:</span> <span class="s">'name'</span><span class="p">}},</span>
<span class="p">]</span>
<span class="n">package</span><span class="p">.</span><span class="n">commit</span><span class="p">()</span>
<span class="n">package</span><span class="p">.</span><span class="n">get_resource</span><span class="p">(</span><span class="s">'games'</span><span class="p">).</span><span class="n">check_relations</span><span class="p">()</span>
<span class="n">pprint</span><span class="p">(</span><span class="s">'Integrity is checked'</span><span class="p">)</span>
<span class="c1"># Analyze data package in SQL
</span><span class="n">engine</span> <span class="o">=</span> <span class="n">create_engine</span><span class="p">(</span><span class="s">'sqlite:///'</span><span class="p">)</span>
<span class="n">package</span><span class="p">.</span><span class="n">save</span><span class="p">(</span><span class="n">storage</span><span class="o">=</span><span class="s">'sql'</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="n">engine</span><span class="p">)</span>
<span class="n">pprint</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">engine</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="s">"""
SELECT home_team, round(avg(home_team_score), 1) as score
FROM games GROUP BY home_team ORDER BY score DESC
"""</span><span class="p">)))</span>
<span class="c1"># Analyze data package in Pandas
</span><span class="n">storage</span> <span class="o">=</span> <span class="n">Storage</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="s">'pandas'</span><span class="p">)</span>
<span class="n">package</span><span class="p">.</span><span class="n">save</span><span class="p">(</span><span class="n">storage</span><span class="o">=</span><span class="n">storage</span><span class="p">)</span>
<span class="n">pprint</span><span class="p">(</span><span class="n">storage</span><span class="p">[</span><span class="s">'games'</span><span class="p">].</span><span class="n">loc</span><span class="p">[</span><span class="n">storage</span><span class="p">[</span><span class="s">'games'</span><span class="p">][</span><span class="s">'home_team_score'</span><span class="p">].</span><span class="n">idxmax</span><span class="p">()])</span>
</code></pre></div></div>
<h2 id="data-package-storage-api-update">Data Package Storage API update</h2>
<p>We are working to make the Data Package specification<sup id="fnref:datapackage" role="doc-noteref"><a href="#fn:datapackage" class="footnote" rel="footnote">8</a></sup> the go-to metadata format for moving datasets from one persistent storage system to another. The Storage API (example below) now allows you to move data between Pandas, SQL, and BigQuery.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># pip install datapackage tableschema tableschema_sql tableschema_pandas tableschema_bigquery
</span><span class="kn">import</span> <span class="nn">io</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">json</span>
<span class="kn">from</span> <span class="nn">pprint</span> <span class="kn">import</span> <span class="n">pprint</span>
<span class="kn">from</span> <span class="nn">tableschema</span> <span class="kn">import</span> <span class="n">Storage</span>
<span class="kn">from</span> <span class="nn">datapackage</span> <span class="kn">import</span> <span class="n">Package</span>
<span class="kn">from</span> <span class="nn">sqlalchemy</span> <span class="kn">import</span> <span class="n">create_engine</span>
<span class="kn">from</span> <span class="nn">apiclient.discovery</span> <span class="kn">import</span> <span class="n">build</span>
<span class="kn">from</span> <span class="nn">oauth2client.client</span> <span class="kn">import</span> <span class="n">GoogleCredentials</span>
<span class="n">engine</span> <span class="o">=</span> <span class="n">create_engine</span><span class="p">(</span><span class="s">'sqlite:///'</span><span class="p">)</span> <span class="c1"># use your persistent database
</span>
<span class="c1"># From BigQuery to SQL
</span><span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">'GOOGLE_APPLICATION_CREDENTIALS'</span><span class="p">]</span> <span class="o">=</span> <span class="s">'.credentials.json'</span>
<span class="n">credentials</span> <span class="o">=</span> <span class="n">GoogleCredentials</span><span class="p">.</span><span class="n">get_application_default</span><span class="p">()</span>
<span class="n">service</span> <span class="o">=</span> <span class="n">build</span><span class="p">(</span><span class="s">'bigquery'</span><span class="p">,</span> <span class="s">'v2'</span><span class="p">,</span> <span class="n">credentials</span><span class="o">=</span><span class="n">credentials</span><span class="p">)</span>
<span class="n">package</span> <span class="o">=</span> <span class="n">Package</span><span class="p">(</span><span class="n">storage</span><span class="o">=</span><span class="s">'bigquery'</span><span class="p">,</span> <span class="n">service</span><span class="o">=</span><span class="n">service</span><span class="p">,</span> <span class="n">project</span><span class="o">=</span><span class="s">'bigquery-public-data'</span><span class="p">,</span> <span class="n">dataset</span><span class="o">=</span><span class="s">'usa_names'</span><span class="p">)</span>
<span class="n">package</span><span class="p">.</span><span class="n">save</span><span class="p">(</span><span class="n">storage</span><span class="o">=</span><span class="s">'sql'</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="n">engine</span><span class="p">)</span>
<span class="c1"># From SQL to Pandas
</span><span class="n">storage</span> <span class="o">=</span> <span class="n">Storage</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="s">'pandas'</span><span class="p">)</span>
<span class="n">package</span> <span class="o">=</span> <span class="n">Package</span><span class="p">(</span><span class="n">storage</span><span class="o">=</span><span class="s">'sql'</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="n">engine</span><span class="p">)</span>
<span class="n">package</span><span class="p">.</span><span class="n">save</span><span class="p">(</span><span class="n">storage</span><span class="o">=</span><span class="n">storage</span><span class="p">)</span>
<span class="n">pprint</span><span class="p">(</span><span class="n">storage</span><span class="p">[</span><span class="s">'usa_1910_current'</span><span class="p">].</span><span class="n">head</span><span class="p">())</span>
</code></pre></div></div>
<p>For more examples and ideas on how to use the Storage API in your data wrangling and publishing workflow, take a look at the <code class="language-plaintext highlighter-rouge">datapackage-py</code> documentation<sup id="fnref:datapackagepy" role="doc-noteref"><a href="#fn:datapackagepy" class="footnote" rel="footnote">9</a></sup>.
We welcome community contributions to allow for more integrations. Interested in contributing? <a href="https://github.com/frictionlessdata/tableschema-py/blob/master/README.md#storage">Start here</a>.</p>
<h2 id="use-table-schemas-data-types-with-no-data-loss">Use Table Schema’s Data Types with no data loss</h2>
<p>With the new update, it is now possible to store and retain all of your data even when your storage backend supports only a subset of Table Schema’s data types. For example, SQLite doesn’t support a JSON data type:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">datapackage</span> <span class="kn">import</span> <span class="n">Package</span>
<span class="kn">from</span> <span class="nn">tableschema</span> <span class="kn">import</span> <span class="n">Table</span><span class="p">,</span> <span class="n">Storage</span>
<span class="kn">from</span> <span class="nn">sqlalchemy</span> <span class="kn">import</span> <span class="n">create_engine</span>
<span class="n">engine</span><span class="o">=</span><span class="n">create_engine</span><span class="p">(</span><span class="s">'sqlite:///'</span><span class="p">)</span>
<span class="c1"># Resource
</span><span class="n">data</span> <span class="o">=</span> <span class="p">[[{</span><span class="s">'key'</span><span class="p">:</span> <span class="s">'value'</span><span class="p">}]]</span>
<span class="n">schema</span> <span class="o">=</span> <span class="p">{</span><span class="s">'fields'</span><span class="p">:</span> <span class="p">[{</span><span class="s">'name'</span><span class="p">:</span> <span class="s">'object'</span><span class="p">,</span> <span class="s">'type'</span><span class="p">:</span> <span class="s">'object'</span><span class="p">}]}</span>
<span class="c1"># Save
</span><span class="n">storage</span> <span class="o">=</span> <span class="n">Storage</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="s">'sql'</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="n">engine</span><span class="p">)</span>
<span class="n">table</span> <span class="o">=</span> <span class="n">Table</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">schema</span><span class="o">=</span><span class="n">schema</span><span class="p">)</span>
<span class="n">table</span><span class="p">.</span><span class="n">save</span><span class="p">(</span><span class="s">'objects'</span><span class="p">,</span> <span class="n">storage</span><span class="o">=</span><span class="n">storage</span><span class="p">)</span>
<span class="c1"># Load
</span><span class="n">table</span> <span class="o">=</span> <span class="n">Table</span><span class="p">(</span><span class="s">'objects'</span><span class="p">,</span> <span class="n">schema</span><span class="o">=</span><span class="n">schema</span><span class="p">,</span> <span class="n">storage</span><span class="o">=</span><span class="n">storage</span><span class="p">)</span>
<span class="n">table</span><span class="p">.</span><span class="n">read</span><span class="p">()</span>
<span class="c1"># [[{'key': 'value'}]] - objects inside as we'd like
</span></code></pre></div></div>
<h2 id="create-datapackages-from-a-few-tables-in-your-db">Create datapackages from a few tables in your DB</h2>
<p>You can now create a data package from a select few SQL, BigQuery, or Pandas tables in your database, instead of loading every table.
Example:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">package</span> <span class="o">=</span> <span class="n">Package</span><span class="p">({</span><span class="s">'resources'</span><span class="p">:</span> <span class="p">[{</span><span class="s">'path'</span><span class="p">:</span> <span class="s">'table1'</span><span class="p">},</span> <span class="p">{</span><span class="s">'path'</span><span class="p">:</span> <span class="s">'table3'</span><span class="p">}]},</span> <span class="n">storage</span><span class="o">=</span><span class="s">'sql'</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="n">engine</span><span class="p">)</span>
<span class="n">package</span><span class="p">.</span><span class="n">resource_names</span> <span class="c1"># ['table1', 'table3']
</span><span class="n">package</span><span class="p">.</span><span class="n">infer</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">package</span><span class="p">.</span><span class="n">descriptor</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="goodtables-web-service-works-with-v1-specs">Goodtables web service works with v1 specs</h2>
<p>Our Goodtables web service is now updated to work with v1 specifications<sup id="fnref:specs" role="doc-noteref"><a href="#fn:specs" class="footnote" rel="footnote">10</a></sup>. This tool allows you to set up a continuous data validation workflow to ensure that published data is always valid. <a href="https://try.goodtables.io">try.goodtables.io</a> offers one-time validation of arbitrary tabular files against structure and schema checks and is perfect for demo or trial purposes.</p>
<h2 id="next-steps">Next steps</h2>
<ul>
<li>We are looking to write more in-depth documentation and guides for the Frictionless Data specs and tools as we update our codebase<sup id="fnref:github" role="doc-noteref"><a href="#fn:github" class="footnote" rel="footnote">11</a></sup>.</li>
<li>We are also looking to extend the number of our Storage API implementations. In addition to the SQL/BigQuery/Pandas implementations, we are working on SPSS<sup id="fnref:spss" role="doc-noteref"><a href="#fn:spss" class="footnote" rel="footnote">12</a></sup> and Elasticsearch<sup id="fnref:elasticsearch" role="doc-noteref"><a href="#fn:elasticsearch" class="footnote" rel="footnote">13</a></sup> plugins. Contributors play a very important role in this work. Feel free to write your own <code class="language-plaintext highlighter-rouge">tableschema</code> plugin - it’s fun and a relatively simple task! A minimal plugin skeleton is sketched after this list.</li>
</ul>
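<p>To make the plugin contract concrete, here is a minimal sketch of a <code class="language-plaintext highlighter-rouge">tableschema</code> storage plugin. The method names follow the Storage interface documented in <code class="language-plaintext highlighter-rouge">tableschema-py</code>; the in-memory backend itself is purely illustrative, and a real plugin would talk to an actual storage system.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative sketch only: an in-memory "backend" implementing the
# tableschema Storage interface (buckets/create/delete/describe/iter/read/write).
from tableschema import Storage


class InMemoryStorage(Storage):

    def __init__(self, **options):
        self.__descriptors = {}  # bucket name -> Table Schema descriptor
        self.__rows = {}         # bucket name -> list of rows

    @property
    def buckets(self):
        return list(self.__descriptors)

    def create(self, bucket, descriptor):
        self.__descriptors[bucket] = descriptor
        self.__rows[bucket] = []

    def delete(self, bucket=None):
        self.__descriptors.pop(bucket, None)
        self.__rows.pop(bucket, None)

    def describe(self, bucket, descriptor=None):
        return self.__descriptors[bucket]

    def iter(self, bucket):
        return iter(self.__rows[bucket])

    def read(self, bucket):
        return list(self.iter(bucket))

    def write(self, bucket, rows):
        self.__rows[bucket].extend(rows)
</code></pre></div></div>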
<p>We welcome community contributions to our codebase, and are keen to interact with you on <a href="http://gitter.im/frictionlessdata/chat">Frictionless Data Gitter chat</a>.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:tableschema" role="doc-endnote">
<p>Table Schema: <a href="http://specs.frictionlessdata.io/table-schema/">http://specs.frictionlessdata.io/table-schema/</a> <a href="#fnref:tableschema" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:csvdialect" role="doc-endnote">
<p>CSV Dialect: <a href="http://specs.frictionlessdata.io/csv-dialect/">http://specs.frictionlessdata.io/csv-dialect/</a> <a href="#fnref:csvdialect" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:philosophy" role="doc-endnote">
<p>Frictionless Data Design Philosophy: <a href="http://specs.frictionlessdata.io/#design-philosophy">http://specs.frictionlessdata.io/#design-philosophy</a> <a href="#fnref:philosophy" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:dr" role="doc-endnote">
<p>Data Resource: <a href="http://specs.frictionlessdata.io/data-resource/">http://specs.frictionlessdata.io/data-resource/</a> <a href="#fnref:dr" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:tdr" role="doc-endnote">
<p>Tabular Data Resource: <a href="http://specs.frictionlessdata.io/tabular-data-resource/">http://specs.frictionlessdata.io/tabular-data-resource/</a> <a href="#fnref:tdr" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:types" role="doc-endnote">
<p>Table Schema Types: <a href="http://specs.frictionlessdata.io/table-schema/#types-and-formats">http://specs.frictionlessdata.io/table-schema/#types-and-formats</a> <a href="#fnref:types" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:tools" role="doc-endnote">
<p>Frictionless Data Tools: <a href="http://frictionlessdata.io/software/">http://frictionlessdata.io/software/</a> <a href="#fnref:tools" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:datapackage" role="doc-endnote">
<p>Data Package: <a href="http://specs.frictionlessdata.io/data-package/">http://specs.frictionlessdata.io/data-package/</a> <a href="#fnref:datapackage" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:datapackagepy" role="doc-endnote">
<p>Data Package Python Library: <a href="https://github.com/frictionlessdata/datapackage-py">https://github.com/frictionlessdata/datapackage-py</a> <a href="#fnref:datapackagepy" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:specs" role="doc-endnote">
<p>Frictionless Data Specifications: <a href="http://specs.frictionlessdata.io/">http://specs.frictionlessdata.io/</a> <a href="#fnref:specs" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:github" role="doc-endnote">
<p>Frictionless Data on GitHub: <a href="http://github.com/frictionlessdata">http://github.com/frictionlessdata</a> <a href="#fnref:github" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:spss" role="doc-endnote">
<p>Frictionless Data SPSS Plugin: <a href="https://github.com/frictionlessdata/tableschema-spss-py">https://github.com/frictionlessdata/tableschema-spss-py</a> <a href="#fnref:spss" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:elasticsearch" role="doc-endnote">
<p>Frictionless Data ElasticSearch Plugin: <a href="https://github.com/frictionlessdata/tableschema-elasticsearch-py">https://github.com/frictionlessdata/tableschema-elasticsearch-py</a> <a href="#fnref:elasticsearch" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Serah Rono
Measure for Measure
2017-07-13T00:00:00+00:00
http://okfnlabs.org/blog/2017/07/13/measure-for-measure
<p><em>In his Open Knowledge International Tech Talk, Developer Brook Elgie
describes how we are using Data Package Pipelines and Redash to gain
insight into our organization in a declarative, reproducible, and easy
to modify way.</em></p>
<p>This post briefly introduces a newly launched internal project at
<a href="https://okfn.org/">Open Knowledge International</a> called <a href="https://github.com/okfn/measure">Measure</a>, its
history, motivation, and the tech that drives it. To learn more,
watch the embedded video demonstration by developer
<a href="https://twitter.com/brew">Brook Elgie</a> and check out the
<a href="https://github.com/okfn/measure">code</a>.</p>
<h2 id="what-is-measure">What is Measure?</h2>
<p><a href="https://github.com/okfn/measure">Measure</a> is a system that allows us to collect and analyze
metrics from various internal sources and external platforms through a
combination of easy-to-write YAML docs and a user-friendly interface.
These include the number of views on our main website, downloads of
our libraries from <a href="https://pypi.python.org/pypi">PyPI</a>, retweets on Twitter, and form-based
records of project outputs (e.g. recent talks we’ve given). Like many
organizations, we rely heavily on hosted platforms to execute on our
mission, each of which has its own interface to useful data. This can
make it harder to correlate events (e.g. how many downloads did this
software package have after this blog post?) and yield insight across
platforms. It’s critical to harmonize access to this data not only
for us to learn how to be more effective, but also to demonstrate to
external funders the impact of our work advancing the cause of
openness. It’s also important for this data to be accessible to
everyone at the organization, regardless of their technical skill.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/NVuJq_WseJQ?list=PLOGV29UsPM6hTC5Nvd2ySyI_5q-C1_i1S" frameborder="0" allowfullscreen=""></iframe>
<p><em>Brook Elgie describes Measure in an Open Knowledge International Tech Talk</em></p>
<h3 id="how-does-it-work">How Does it Work?</h3>
<p>Measure relies on several technologies we are developing here at Open
Knowledge International around our <a href="http://frictionlessdata.io/">Frictionless Data</a> project.
Each of our projects has a <a href="https://github.com/okfn/measure#project-configuration">source specification file</a>
defined in YAML and split into themes. For example, <code class="language-plaintext highlighter-rouge">social-media</code> is
a theme for data sources such as Twitter and Facebook, while
<code class="language-plaintext highlighter-rouge">code-packaging</code> is a theme for PyPI and other software repositories
we upload to. Each theme has a <a href="https://github.com/frictionlessdata/datapackage-pipelines#pipelines">pipeline</a> which is
composed of <a href="https://github.com/frictionlessdata/datapackage-pipelines#custom-processors">processors</a> which do the actual work of
fetching data and transforming the <a href="http://specs.frictionlessdata.io/data-package/">Data Package</a> (a collection of
data and descriptive metadata) and its resources. Data is moved
through the thematic pipeline using <a href="https://github.com/frictionlessdata/datapackage-pipelines">Data Package Pipelines</a> and
a handful of other tools in the Frictionless Data project. The final
processor writes the processed resources to the Measure database,
which is used as the data source for our visualisation tool,
<a href="https://redash.io/">Redash</a>. Each pipeline is configured to run once a day. You
can read more about Data Package Pipelines and how it enables this
process in its <a href="/blog/2017/02/27/datapackage-pipelines.html">introductory blog post</a>.</p>
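<p>As a purely hypothetical illustration (the theme names come from this post, but the layout below is invented and is not Measure’s actual configuration format), a project’s source specification might look something like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical layout, for illustration only
project: frictionlessdata
themes:
  social-media:
    twitter:
      entities:
        - '#frictionlessdata'
  code-packaging:
    pypi:
      packages:
        - datapackage
        - goodtables
</code></pre></div></div>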
<p>By consolidating our metrics into a single database and surfacing them
through Redash, it’s easy to create and share visualisations across
one or more data sources, create dashboards of project and
organization health, and make truly data-driven decisions with minimal
friction.</p>
<h2 id="tech-talks">Tech Talks</h2>
<p>If you enjoyed this, you can see similar content on our
<a href="https://www.youtube.com/playlist?list=PLOGV29UsPM6hTC5Nvd2ySyI_5q-C1_i1S">Open Knowledge International Tech Talks YouTube Playlist</a>.</p>
Dan Fowler
DAC and CRS code lists – Now available as Frictionless Data!
2017-07-10T00:00:00+00:00
http://okfnlabs.org/blog/2017/07/10/dac-and-crs-code-lists-frictionless-data
<p><em>This blog was originally posted on the <a href="http://www.publishwhatyoufund.org/maintained-machine-readable-dac-crs-code-lists-de-rien/">Publish What You Fund</a> website.</em></p>
<p>Maintained, machine readable versions of <a href="http://data.okfn.org/data/core/dac-and-crs-code-lists">the DAC and CRS code lists are now available as CSV and JSON!</a> Here’s how <a href="http://www.publishwhatyoufund.org/">Publish What You Fund</a> and <a href="https://okfn.org/">Open Knowledge</a> made it happen…</p>
<p><img src="https://d26dzxoao6i3hh.cloudfront.net/items/1u361E1a3W1U3H2U3v0A/android.png" alt="DAC CRS Bot" /></p>
<p>The <a href="http://www.oecd.org">OECD</a>’s Development Assistance Committee (<a href="http://www.oecd.org/dac/">DAC</a>) maintains a set of code lists used by donors to report on their aid flows. These are used as part of donors’ DAC reporting, but also in their <a href="https://iatiregistry.org/">IATI publications</a>. Not only that, but since some of the codes e.g. for aid classification, are so widely used, they are also useful to recipient country governments to <a href="http://aidonbudget.org/">map aid activities to their own budgets</a>. So they’re super important!</p>
<h2 id="keeping-in-sync">Keeping in sync</h2>
<p>Now, these code lists are <a href="http://www.oecd.org/dac/stats/dacandcrscodelists.htm">available on the OECD website</a> as a non-machine-readable XLS file. There’s also an XML version, but it was last updated 18 months ago, and as such it differs significantly from the standard, canonical XLS version on the OECD website.</p>
<p>Because of this lack of a machine-readable version, <a href="https://github.com/IATI/IATI-Codelists-NonEmbedded/tree/master/xml">IATI maintains its own replicated versions of these code lists</a>. These replicated versions are used by <a href="http://d-portal.org/">d-portal</a>, the <a href="http://dashboard.iatistandard.org/">IATI Dashboard</a> and others. However, due to the overheads involved in maintaining them, these too have fallen out of sync with the source file.</p>
<p>There has been a-rumbling (and some grumbling!) within the IATI community about <a href="https://discuss.iatistandard.org/t/planning-for-machine-readable-version-controlled-oecd-dac-codelists/866/8">getting the DAC to produce a machine-readable version</a> of these code lists. This idea has long been in the offing, and we at Publish What You Fund would very much welcome such a development.</p>
<p>In the meantime, though, we have taken matters into our own hands. Together with <a href="https://okfn.org/">Open Knowledge</a>, we’ve published <a href="http://data.okfn.org/data/core/dac-and-crs-code-lists">a frictionless data package of the DAC code lists</a> – with data available in machine-readable CSV and JSON formats. This is published as an <a href="http://data.okfn.org/roadmap/core-datasets">Open Knowledge Core Dataset</a> – a group of <strong>important</strong> and <strong>commonly-used</strong> datasets in <strong>high quality, easy-to-use and open</strong> form.</p>
<h2 id="but-how-does-it-work-the-science-bit">But how does it work? The science bit!</h2>
<p>The data is <a href="https://github.com/datasets/dac-crs-codes/tree/master/data">stored on github</a>, and maintained by a scraper that runs nightly on <a href="https://morph.io/">morph.io</a> (created by the wonderful <a href="https://www.openaustraliafoundation.org.au/">Open Australia Foundation</a>). When a change to the data is detected, a pull request is sent by <a href="https://github.com/dac-crs-bot">DAC CRS Bot</a>, and reviewed by a (human) maintainer. Via github, <a href="https://github.com/datasets/dac-crs-codes/commits/master/data">we maintain a version history of changes to the data</a>, so it’s possible to tell what changed and when.</p>
<p>The next logical step would be for IATI to <a href="https://github.com/IATI/IATI-Codelists-NonEmbedded/pull/51">use this data to maintain their replicated lists</a> as a routine maintenance task. We’ve already tested this as a proof of concept one-off task, to <a href="https://github.com/IATI/IATI-Codelists-NonEmbedded/pull/153">bring all the relevant replicated IATI code lists up-to-date</a>, including adding all French translations. De rien!</p>
Andy Lulham
Introducing the new goodtables library and goodtables.io
2017-05-22T00:00:00+00:00
http://okfnlabs.org/blog/2017/05/22/introducing-the-new-goodtables-library-and-goodtablesio
<p>Information is everywhere. There is so much we need to know at any given time, and only limited capacity and time to internalize it all. True art, therefore, lies in the ability to draw summaries adequate to save time and impart knowledge. Since the 1880s, tabulation has been our go-to method for compacting information, not only to preserve it, but also to analyze it and draw meaningful conclusions from it.</p>
<p>Tables, composed of rows and columns of related data, are not always easy to analyze, especially when there are thousands of rows of data. Mixed data types, missing data, or ill-suited data in tables are but a few reasons why tabular data is often a nightmare to work with in its raw state, often referred to as “dirty” data.</p>
<p>Enter <strong>goodtables</strong>.</p>
<p><a href="https://github.com/frictionlessdata/goodtables.io"><img src="/img/posts/goodtables-python-library.png" alt="goodtables python library" /></a></p>
<h2 id="the-goodtables-library">The goodtables library</h2>
<p><a href="https://github.com/frictionlessdata/goodtables-py/">goodtables</a> is a Python library that allows users to inspect tabular data, checking it for both structural and schematic errors, and giving pointers on plausible error fixes, before users draw analyses on the data using other tools. At its most basic level, goodtables highlights general errors in tabular files that would otherwise prevent loading or parsing.</p>
<p>Since <a href="/blog/2015/02/20/introducing-goodtables.html">the release of goodtables v0.7 in early 2015</a>, the codebase has evolved, allowing for additional use cases while working with tabular data. Without cutting back on functionality, goodtables v1 has been simplified, and the focus is now on extensible data validation.</p>
<h2 id="using-goodtables">Using goodtables</h2>
<p>goodtables is still in alpha, so we need to pass the pre-release flag (<code class="language-plaintext highlighter-rouge">--pre</code>) to <code class="language-plaintext highlighter-rouge">pip</code> to install. With that, installation of goodtables v1 is as easy as <code class="language-plaintext highlighter-rouge">pip install goodtables --pre</code>.</p>
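<p>Once installed, the library can also be used directly from Python. The sketch below assumes the <code class="language-plaintext highlighter-rouge">validate</code> entry point exposed by goodtables v1, with report keys following the data quality spec; exact names may differ between alpha releases.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A sketch of programmatic validation (assumes goodtables v1's `validate`
# entry point; report keys follow the data quality spec).
from pprint import pprint

from goodtables import validate

report = validate('valid.csv')  # roughly equivalent to: goodtables --json table valid.csv
print(report['valid'])          # overall validity flag
print(report['error-count'])    # total number of errors found
pprint(report['tables'])        # per-table details, including any errors
</code></pre></div></div>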
<p>The goodtables v1 CLI supports two presets by default: <strong><em>table</em></strong> and <strong><em>datapackage</em></strong>. The <code class="language-plaintext highlighter-rouge">table</code> preset allows you to inspect a single tabular file.</p>
<p><em>Example:</em></p>
<p><code class="language-plaintext highlighter-rouge">goodtables --json table valid.csv</code> returns a JSON report for the file specifying the error count, source, and validity of the data file among other things.</p>
<p>The <code class="language-plaintext highlighter-rouge">datapackage</code> preset allows you to run checks on datasets aggregated in one container. <a href="http://specs.frictionlessdata.io/data-package/">Data Packages</a> are a format for coalescing data in one ‘container’ before shipping it for use by different people and with different tools.</p>
<p><em>Example:</em></p>
<p><code class="language-plaintext highlighter-rouge">goodtables datapackage datapackage.json</code> allows a user to check a data package’s schema, table by table, and gives a detailed report on errors, row count, headers, and validity or lack thereof of a data package.</p>
<p>You can try out these commands on your own data or you can use datasets <a href="https://github.com/frictionlessdata/goodtables-py/tree/master/data">from this folder</a>.</p>
<h2 id="customization">Customization</h2>
<p>In addition to general structure and schema checks on tabular files available in v0.7, the goodtables library now allows users to define custom (data source) presets and run custom checks on tabular files. So what is the difference?</p>
<p>While basic schema checks inspect data against <a href="https://github.com/frictionlessdata/data-quality-spec">the data quality spec</a>, <code class="language-plaintext highlighter-rouge">custom_check</code> gives developers leeway to specify acceptable values for data fields, so that any values outside of the defined rules are flagged as errors.</p>
<p><code class="language-plaintext highlighter-rouge">custom_preset</code> allows users to define custom interfaces to their data storage platform of choice. Presets tell goodtables where a dataset is held, whether it is hosted on CKAN, Dropbox, or Google Drive.</p>
<p>Any presets outside of the built-in ones above are made possible and registered through a provisional API.</p>
<p><em>Examples:</em></p>
<ul>
<li>
<p><strong><em>CKAN custom preset</em></strong>:
<a href="http://ckan.org">CKAN</a> is the world’s leading open data platform developed by Open Knowledge Foundation to help streamline the publishing, sharing, finding and using of data.
<a href="https://github.com/frictionlessdata/goodtables-py/blob/master/examples/ckan.py">Here’s a custom preset</a> that, for example, could help the user run an inspection on datasets from <a href="http://data.surrey.ca">Surrey’s Data Portal</a> which utilizes CKAN.</p>
</li>
<li>
<p><strong><em>Dropbox custom preset</em></strong>:
Dropbox is one of the most popular file storage and collaboration cloud services in use. It ships with an API that makes it possible for third-party apps to read files stored on Dropbox as long as a user’s access token is specified. Here’s our <a href="https://github.com/frictionlessdata/goodtables-py/blob/master/examples/dropbox.py">goodtables custom preset for Dropbox</a>. Remember to generate an access token by first <a href="https://www.dropbox.com/developers/apps">creating a Dropbox app with full permissions</a>.</p>
</li>
<li>
<p><strong><em>Google Sheets custom preset</em></strong>:
The Google Sheets parser to enable custom preset definition is currently in development. At present, for any data file stored in Google Drive and published on the web, the command <code class="language-plaintext highlighter-rouge">goodtables table google_drive_file_url</code> inspects your dataset and checks for validity, or lack thereof.</p>
</li>
</ul>
<h2 id="validating-multiple-tables">Validating multiple tables</h2>
<p>goodtables also allows users to carry out parallel validation for multi-table datasets. The <strong><em>datapackage</em></strong> preset makes this possible.</p>
<p><em>Example:</em></p>
<p><a href="http://frictionlessdata.io">Frictionless Data</a> is a core Open Knowledge Foundation project and all goodtables work falls under its umbrella. One of the pilots working with Frictionless Data is <a href="https://github.com/frictionlessdata/pilot-dm4t">DM4T</a>, with an aim to understand the extent to which Data Package concepts can be applied in the energy sector. DM4T pilot’s issue tracker <a href="https://github.com/frictionlessdata/pilot-dm4t">lives here</a> and its <a href="https://s3-eu-west-1.amazonaws.com/frictionlessdata.io/pilots/pilot-dm4t/datapackage.json">Data Package</a> comprises of <a href="http://data.okfn.org/tools/view?url=https%3A%2F%2Fs3-eu-west-1.amazonaws.com%2Ffrictionlessdata.io%2Fpilots%2Fpilot-dm4t%2Fdatapackage.json">20 CSV files</a> and is approximately 6.7 GB in size.</p>
<p>To inspect DM4T’s energy consumption data collected from 20 households in the UK, run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>goodtables --table-limit 20 datapackage https://s3-eu-west-1.amazonaws.com/frictionlessdata.io/pilots/pilot-dm4t/datapackage.json
</code></pre></div></div>
<p>In the command above, the <code class="language-plaintext highlighter-rouge">--table-limit</code> option allows you to check all 20 tables, since by default goodtables only runs checks on the first ten tables. You can find plenty of sample Data Packages for use with goodtables <a href="https://github.com/datasets/">in this repository</a>.</p>
<p>So why use GitHub for storage of data files? At Open Knowledge Foundation, we <a href="http://blog.okfn.org/2013/07/02/git-and-github-for-data/">highly recommend</a> and <a href="http://blog.okfn.org/2016/11/29/git-for-data-analysis-why-version-control-is-essential-collaboration-public-trust/">work with others to</a> use GitHub repositories for dataset storage.</p>
<p><strong>PRO TIP:</strong>
In working with datasets hosted on GitHub, say <a href="https://github.com/datasets/country-codes">the country codes Data Package</a>, users should use the raw file URL with goodtables, since support for GitHub URL resolution is still in development.</p>
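<p>For example, something like the following (the raw URL assumes the conventional <code class="language-plaintext highlighter-rouge">data/</code> layout of that repository):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>goodtables table https://raw.githubusercontent.com/datasets/country-codes/master/data/country-codes.csv
</code></pre></div></div>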
<h2 id="standards-and-other-enhancements">Standards and other enhancements</h2>
<p>goodtables v1 also works with our proposed <a href="https://github.com/frictionlessdata/data-quality-spec">data quality specification standard</a>, which defines <a href="https://github.com/frictionlessdata/goodtables-py/blob/master/goodtables/spec.json">an extensive list</a> of standard tabular data errors. Other enhancements from goodtables v0.7 include:</p>
<ol>
<li>Breaking out <a href="http://github.com/frictionlessdata/tabulator">tabulator</a> into its own library. As part of the Frictionless Data framework, <strong><em>tabulator</em></strong> is a Python library that provides a consistent interface for stream reading and writing tabular data, whatever the format (CSV, XML, etc.). The library is installable via pip: <code class="language-plaintext highlighter-rouge">pip install tabulator</code>.</li>
<li>Close to 100% support for <a href="http://specs.frictionlessdata.io/table-schema/">Table Schema</a> due to lots of work on the underlying <a href="https://github.com/frictionlessdata/jsontableschema-py">Python library</a>. The Table Schema Python library allows users to validate dataset schema and, given headers and data, infer a schema as a python dictionary based on its initial values.</li>
<li>Better CSV parsing, better HTML detection, and fewer false positives.</li>
</ol>
<h2 id="goodtablesio">goodtables.io</h2>
<p><a href="https://github.com/frictionlessdata/goodtables.io"><img src="/img/posts/goodtablesio.jpg" alt="goodtablesio" /></a></p>
<p>Moving forward, at Open Knowledge Foundation we want to streamline the process of data validation and ensure seamless integration is possible in different publishing workflows. To do so, <a href="https://discuss.okfn.org/t/launching-goodtables-io-tell-us-what-you-think/5165">we are launching a hosted continuous data validation service</a> that builds on top of this suite of Frictionless Data libraries. <a href="http://goodtables.io">goodtables.io</a> will provide support for different backends. At this time, users can use it to check any datasets hosted on GitHub and Amazon S3 buckets, automatically running validation against data files every time they are updated, and providing a user-friendly report of any issues found.</p>
<p>Try it here: <a href="http://goodtables.io">goodtables.io</a></p>
<p>This kind of continuous feedback allows data publishers to release better, higher quality data and helps ensure that this quality is maintained over time, even if different people publish the data.</p>
<p>Using <a href="http://goodtables.io/github/frictionlessdata/example-goodtables.io">this dataset on Github</a>, here’s sample output from data validation run on goodtables.io:</p>
<p><a href="http://goodtables.io/github/amercader/car-fuel-and-emissions"><img src="/img/posts/goodtablesio-validation.png" alt="illustrating data validation on goodtables.io" /></a></p>
<p>Updates on the files in the dataset will trigger a validation check on goodtables.io. As with other projects at Open Knowledge International, <a href="https://github.com/frictionlessdata/goodtables.io">goodtables.io code is open source</a> and contributions are welcome. We hope to build functionality to support additional data storage platforms in the coming months, please let us know which ones to consider in our <a href="https://gitter.im/frictionlessdata/chat">Gitter chat</a> or on the <a href="https://discuss.okfn.org/c/frictionless-data">Frictionless Data forum</a>.</p>
Serah Rono
Data Package Pipelines
2017-02-27T00:00:00+00:00
http://okfnlabs.org/blog/2017/02/27/datapackage-pipelines
<p><em><a href="https://github.com/frictionlessdata/datapackage-pipelines">datapackage-pipelines</a> is the newest part of the
<a href="http://frictionlessdata.io/">Frictionless Data</a> toolchain. Originally developed through work
on OpenSpending, it is a framework for defining data processing steps
to generate self-describing Data Packages.</em></p>
<p><a href="http://next.openspending.org/">OpenSpending</a> is an open database for uploading fiscal data for
countries or municipalities to better understand how governments spend
public money. In this project, we’re often presented with requests to
upload large amounts of potentially messy budget data, often CSV or
Excel files, to the platform. We looked for existing ETL (extract,
transform, load) solutions for extracting data from these different
sources, transforming them into a format that OpenSpending supports
(the <a href="http://specs.frictionlessdata.io/fiscal-data-package">Open Fiscal Data Package</a>) and loading them into the
platform. A few powerful solutions already exist, but none suited
our needs. Most were optimised for a use case in which you have a few
different data sources, on which a large dependency graph can be built
out of complex processing nodes. The OpenSpending use case is
radically different. Not only do we have <em>many</em> data sources, but our
processing flows are <em>independent</em> (i.e. not an intricate dependency
graph) and mostly quite <em>similar</em> (i.e. built from the same building
blocks).</p>
<p><img src="/img/posts/dpp-openspending.png" alt="OpenSpending image" /></p>
<p>We also found that typical ETL solutions were intended to be used by
data scientists and developers with processing pipelines defined in
code. While this is very convenient for coders, it is less so for the
kind of non-techies (e.g. government officials) we want to use the
platform. Writing processing nodes in code gives developers a lot of
flexibility but also provides very few assurances about the
computational resources the code will use. This creates problems when
having to make decisions regarding deployment or concurrency.</p>
<h2 id="pipelines-for-data-packages">Pipelines for Data Packages</h2>
<p>Based on these observations, we implemented a new ETL library,
<a href="https://github.com/frictionlessdata/datapackage-pipelines">datapackage-pipelines</a>, with a different set of assumptions and
use cases.</p>
<p><img src="/img/posts/dpp-pipelines.jpg" alt="Pipelines for something other than data" /></p>
<p><em><a href="https://www.flickr.com/photos/loopkid/359144645">Pipelines</a> for
something other than data -
<a href="https://www.flickr.com/photos/loopkid/">Stefan Schmidt</a> -
<a href="https://creativecommons.org/licenses/by-nc/2.0/">CC BY-NC 2.0</a></em></p>
<p>datapackage-pipelines assumptions and use cases:</p>
<ol>
<li>
<p><strong>Processing flows (or ‘pipelines’) are defined in a configuration
file and not code.</strong></p>
<p>This allows non-techies to write pipeline definitions, and enables
other possibilities, such as strict validation of definition files.</p>
<p>Writing custom processing code is possible, but the framework
encourages small, simple processing nodes and not processing
behemoths. This creates better design and easier-to-understand
pipelines.</p>
</li>
<li>
<p><strong>Input and output works through streaming data.</strong></p>
<p>While this means processing nodes have limited flexibility, it
also means they must adhere to strict use of computing
resources. This constraint allows us to deploy processing flows
more easily, without having to worry about a processing node
taking too much memory or disk space.</p>
</li>
<li>
<p><strong>We are based on the Data Package, like OpenSpending.</strong></p>
<p>All pipelines process and produce valid
<a href="http://specs.frictionlessdata.io/data-package">Data Packages</a>. This
means that metadata (both descriptive and structural) and data
validation are built into the framework. The resulting files can
then be seamlessly used with any <a href="http://frictionlessdata.io/software/">compliant tool or library</a>, which
makes the produced data extremely portable and
machine-processable.</p>
</li>
</ol>
<h2 id="quick-start">Quick Start</h2>
<p>To start using <a href="https://github.com/frictionlessdata/datapackage-pipelines">datapackage-pipelines</a>, you must first create a
<code class="language-plaintext highlighter-rouge">pipeline-spec.yaml</code> file in your current directory. Here’s an
example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>worldbank-co2-emissions:
pipeline:
-
run: add_metadata
parameters:
name: 'co2-emissions'
title: 'CO2 emissions (metric tons per capita)'
homepage: 'http://worldbank.org/'
-
run: add_resource
parameters:
name: 'global-data'
url: "http://api.worldbank.org/v2/en/indicator/EN.ATM.CO2E.PC?downloadformat=excel"
format: xls
headers: 4
-
run: stream_remote_resources
cache: True
-
run: set_types
parameters:
resources: global-data
types:
"[12][0-9]{3}":
type: number
-
run: dump.to_path
parameters:
out-path: co2-emissions
</code></pre></div></div>
<p>Running a pipeline from the command line is done using the <code class="language-plaintext highlighter-rouge">dpp</code>
tool. Install the latest version of <a href="https://github.com/frictionlessdata/datapackage-pipelines">datapackage-pipelines</a> from
PyPI (Requirements: Python 3.5 or higher):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ pip install datapackage-pipelines
</code></pre></div></div>
<p>At this point, running <code class="language-plaintext highlighter-rouge">dpp</code> will show the list of available pipelines
by scanning the current directory and its subdirectories, searching
for <code class="language-plaintext highlighter-rouge">pipeline-spec.yaml</code> files. (You can ignore the “:Skipping redis
connection, host:None, port:6379” warning for now.)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ dpp
Available Pipelines:
- ./worldbank-co2-emissions (*)
</code></pre></div></div>
<p>Each pipeline has an identifier, composed of the path to the
<code class="language-plaintext highlighter-rouge">pipeline-spec.yaml</code> file and the name of the pipeline, as defined
within that description file. In this case, the identifier is
<code class="language-plaintext highlighter-rouge">./worldbank-co2-emissions</code>.</p>
<p>In order to run a pipeline, you use <code class="language-plaintext highlighter-rouge">dpp run <pipeline-id></code>. You can
also use <code class="language-plaintext highlighter-rouge">dpp run all</code> to run all pipelines and <code class="language-plaintext highlighter-rouge">dpp run dirty</code> to
run just the dirty pipelines (more on that in the
<a href="https://github.com/frictionlessdata/datapackage-pipelines/blob/master/README.md">README</a>).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ dpp run ./worldbank-co2-emissions
INFO :Main:RUNNING ./worldbank-co2-emissions
INFO :Main:- lib/add_metadata.py
INFO :Main:- lib/add_resource.py
INFO :Main:- lib/stream_remote_resources.py
INFO :Main:- lib/dump/to_zip.py
INFO :Main:DONE lib/add_metadata.py
INFO :Main:DONE lib/add_resource.py
INFO :Main:stream_remote_resources: OPENING http://api.worldbank.org/v2/en/indicator/EN.ATM.CO2E.PC?downloadformat=excel
INFO :Main:stream_remote_resources: TOTAL 264 rows
INFO :Main:stream_remote_resources: Processed 264 rows
INFO :Main:DONE lib/stream_remote_resources.py
INFO :Main:dump.to_zip: INFO :Main:Processed 264 rows
INFO :Main:DONE lib/dump/to_zip.py
INFO :Main:RESULTS:
INFO :Main:SUCCESS: ./worldbank-co2-emissions
{'dataset-name': 'co2-emissions', 'total_row_count': 264}
</code></pre></div></div>
<p>At the end of this, you should have a new directory <code class="language-plaintext highlighter-rouge">co2-emissions</code>
with a <code class="language-plaintext highlighter-rouge">/data</code> directory and a <code class="language-plaintext highlighter-rouge">datapackage.json</code> file. This is a
<a href="http://specs.frictionlessdata.io/data-package">Data Package</a>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tree
.
├── co2-emissions
│ ├── data
│ │ └── EN.ATM.CO2E.csv
│ └── datapackage.json
└── pipeline-spec.yaml
2 directories, 3 files
</code></pre></div></div>
<p>So what exactly happened? Let’s explore what a pipeline actually is,
and what it does.</p>
<h2 id="the-pipeline">The Pipeline</h2>
<p>The basic concept in this framework is the pipeline. A pipeline has a
list of processing steps, and it generates a single Data Package as
its output. Each step is executed in a processor and consists of the
following stages:</p>
<ul>
<li><strong>Modify the Data Package descriptor file</strong> (<code class="language-plaintext highlighter-rouge">datapackage.json</code>) - For
example: add metadata, add or remove resources, change resources’
data schema etc. For valid elements, see the
<a href="http://specs.frictionlessdata.io/data-package">spec</a>.</li>
<li><strong>Process resources</strong> - Each row of each resource is processed
sequentially. The processor can drop rows, add new ones, or modify
their contents.</li>
<li><strong>Return stats</strong> - If necessary, the processor can report a
dictionary of data which will be returned to the user when the
pipeline execution terminates. This can be used, for example, for
calculating quality measures for the processed data.</li>
</ul>
<p>Not every processor needs to do all of these. In fact, you would often
find each processing step doing only one of these.</p>
<h3 id="pipeline-specyaml-file">pipeline-spec.yaml file</h3>
<p>Pipelines are defined in a declarative way, and not in code. One or
more pipelines can be defined in a <code class="language-plaintext highlighter-rouge">pipeline-spec.yaml</code> file. This
file specifies the list of processors (referenced by name) and the
execution parameters for each of the processors.</p>
<p>In the above example we see one pipeline called
<code class="language-plaintext highlighter-rouge">worldbank-co2-emissions</code>. Its pipeline consists of 4 steps:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">metadata</code>: This processor (see the <a href="https://github.com/frictionlessdata/datapackage-pipelines#the-standard-processor-library">repo</a> for more), which
modifies the Data Package’s descriptor (in our case: the initial,
empty descriptor) - adding name, title, and other properties to the
<code class="language-plaintext highlighter-rouge">datapackage.json</code>.</li>
<li><code class="language-plaintext highlighter-rouge">add_resource</code>: This processor adds a single resource to the Data
Package. This resource has a <code class="language-plaintext highlighter-rouge">name</code> and a <code class="language-plaintext highlighter-rouge">url</code>, pointing to the
remote location of the data.</li>
<li><code class="language-plaintext highlighter-rouge">stream_remote_resources</code>: This processor converts remote resources
(like the one we defined in the previous step) to local resources,
streaming the data to processors further down the pipeline (see
“Mechanics” below).</li>
<li><code class="language-plaintext highlighter-rouge">set_types</code>: This processor assigns data types to fields in the
data. In this example, field headers looking like years will be
assigned the number type.</li>
<li><code class="language-plaintext highlighter-rouge">dump.to_path</code>: Create a validated Data Package in the provided path
<code class="language-plaintext highlighter-rouge">co2-emissions-wb</code></li>
</ul>
<h3 id="mechanics">Mechanics</h3>
<p>An important aspect of how the pipelines are run is the fact that data
is passed in streams from one processor to another. Each processor is
run in its own dedicated process, where the Data Package is read from
its STDIN and output to its STDOUT. No processor holds the entire data
set at any point.</p>
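<p>To make this concrete, here is a minimal sketch of a custom processor, assuming the <code class="language-plaintext highlighter-rouge">ingest</code>/<code class="language-plaintext highlighter-rouge">spew</code> wrapper API described in the repo; the field it touches is hypothetical.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Minimal custom processor sketch: read the Data Package and its row
# streams from STDIN via ingest(), transform rows lazily, and write the
# descriptor plus rows back to STDOUT via spew().
from datapackage_pipelines.wrapper import ingest, spew

parameters, datapackage, resource_iterator = ingest()


def process(resources):
    for resource in resources:
        def rows(resource_rows):
            for row in resource_rows:
                # Rows arrive as dicts keyed by field name; 'country_name'
                # is a hypothetical field used here for illustration.
                row['country_name'] = row['country_name'].strip()
                yield row
        yield rows(resource)


spew(datapackage, process(resource_iterator))
</code></pre></div></div>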
<h2 id="dirty-tasks-and-keeping-state">Dirty tasks and keeping state</h2>
<p>As you modify and re-run your pipeline, you can also avoid
unnecessarily repeating steps. By setting the <code class="language-plaintext highlighter-rouge">cache:</code> property on a
specific pipeline step to <code class="language-plaintext highlighter-rouge">True</code>, this step’s output will be stored on
disk (in the <code class="language-plaintext highlighter-rouge">.cache</code> directory, in the same location as the
<code class="language-plaintext highlighter-rouge">pipeline-spec.yaml</code> file). Re-running the pipeline will make use of
that cache, thus avoiding the execution of the cached step and its
precursors.</p>
<p>The cache hash is also used to determine whether a pipeline is “dirty”. When
a pipeline completes executing successfully, <code class="language-plaintext highlighter-rouge">dpp</code> stores the cache
hash along with the pipeline id. If the stored hash is different than
the currently calculated hash, it means that either the code or the
execution parameters were modified, and that the pipeline needs to be
re-run.</p>
<h2 id="validating">Validating</h2>
<ul>
<li>The Data Package metadata is always validated before being passed to
a processor, so there’s no possibility for a processor to modify a
Data Package in a way that renders it invalid.</li>
<li>The data itself is not validated against its respective Table
Schema, unless explicitly requested by setting the <code class="language-plaintext highlighter-rouge">validate</code> flag
to <code class="language-plaintext highlighter-rouge">True</code> in the step’s properties. This is done for two main
reasons:
<ul>
<li>Performance: validating the data in every step is very CPU-intensive</li>
<li>In some cases, you modify the schema in one step and the data in
another, so you would only like to validate the data once all
the changes were made</li>
</ul>
</li>
<li>In any case, all the <code class="language-plaintext highlighter-rouge">dump.to_*</code> (<code class="language-plaintext highlighter-rouge">dump.to_path</code>, <code class="language-plaintext highlighter-rouge">dump.to_sql</code>,
<code class="language-plaintext highlighter-rouge">dump.to_zip</code>) standard processors validate their input data
regardless of the <code class="language-plaintext highlighter-rouge">validate</code> flag - so in case you’re using them,
your data validity is covered 👍🏽.</li>
</ul>
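<p>As a sketch, opting in to validation on a single step might look like this, assuming the <code class="language-plaintext highlighter-rouge">validate</code> flag sits alongside <code class="language-plaintext highlighter-rouge">run</code>, like the <code class="language-plaintext highlighter-rouge">cache</code> property in the quick start example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-
  run: set_types
  validate: True
  parameters:
    resources: global-data
    types:
      "[12][0-9]{3}":
        type: number
</code></pre></div></div>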
<h2 id="try-it-out">Try it out</h2>
<p>This all adds up to a highly modular, configurable, and
resource-considerate framework for processing and packaging tabular
data. Once you have created a Data Package, you can publish it
anywhere on the web, comfortable in the knowledge that its embedded
metadata will make it much easier to document and use. Developers can
process Data Packages using our <a href="http://frictionlessdata.io/guides/using-data-packages-in-python/">Python</a> and
<a href="https://github.com/frictionlessdata/datapackage-js">JavaScript</a> libraries. Data analysts can use the
<a href="http://okfnlabs.org/blog/2016/07/14/using-data-packages-with-r.html">R library for Data Packages</a> or our
<a href="http://okfnlabs.org/blog/2016/08/01/using-data-packages-with-pandas.html">Python Pandas</a> library to load the data.</p>
<p>For more information about <a href="https://github.com/frictionlessdata/datapackage-pipelines">datapackage-pipelines</a>, including
running pipelines on a schedule, using the dashboard, configuring the
standard processors, and information on how to write your own
processor, visit the <a href="https://github.com/frictionlessdata/datapackage-pipelines">GitHub repo</a>.</p>
Adam Kariv
Case Studies for Frictionless Data
2016-11-30T00:00:00+00:00
http://okfnlabs.org/blog/2016/11/30/case-studies-for-frictionless-data
<p>For our <a href="http://frictionlessdata.io/">Frictionless Data</a> project, we
were curious to learn about some of the common issues users face when
working with data. To that end, we started a
<a href="http://frictionlessdata.io/case-studies/">Case Study series</a> to
highlight projects and organizations working with the Frictionless
Data specifications and tooling in interesting and innovative ways.</p>
<p>Through several interviews over the past several months, we now have
three case studies published on a range of topics: data science in the
browser, the provision of clean power system data for energy
researchers, and re-usable components for cloud-based data-intensive
workflows.</p>
<h2 id="dataship">Dataship</h2>
<p><a href="http://frictionlessdata.io/case-studies/dataship/">http://frictionlessdata.io/case-studies/dataship/</a></p>
<p>We interviewed Waylon Flinn of <a href="https://dataship.io/">Dataship</a> to
learn more about how he uses the Data Package specifications.
Dataship is a way to share data and analysis, from simple charts to
complex machine learning, with anyone in the world easily and for
free. The
<a href="http://specs.frictionlessdata.io/data-package/">Data Package</a> acts
as the base format for Dataship notebooks.</p>
<p><a href="http://frictionlessdata.io/case-studies/dataship/"><img src="/img/posts/dataship.png" alt="Dataship" /></a></p>
<hr />
<h3 id="open-power-system-data">Open Power System Data</h3>
<p><a href="http://frictionlessdata.io/case-studies/open-power-system-data/">http://frictionlessdata.io/case-studies/open-power-system-data/</a></p>
<p>We spoke to Lion Hirth and Ingmar Schlecht about
<a href="http://open-power-system-data.org/">Open Power System Data</a>, a
free-of-charge and open platform providing clean, high quality data
needed for power system analysis and modeling. The project is aimed
at resolving some of the most common, persistent issues energy
researchers face when working with data.</p>
<p><a href="http://frictionlessdata.io/case-studies/open-power-system-data/"><img src="/img/posts/opsd.png" alt="Open Power System Data" /></a></p>
<hr />
<h2 id="tesera">Tesera</h2>
<p><a href="http://frictionlessdata.io/case-studies/tesera/">http://frictionlessdata.io/case-studies/tesera/</a></p>
<p>Spencer Cox of <a href="http://tesera.com">Tesera Systems, Inc.</a> shared how
his team is using the
<a href="http://specs.frictionlessdata.io/">Frictionless Data specifications</a>
across a range of purpose-built tools to power data-driven
applications in the cloud.</p>
<p><a href="http://frictionlessdata.io/case-studies/tesera/"><img src="/img/posts/tesera.png" alt="Tesera" /></a></p>
<hr />
<h2 id="reach-out-to-us">Reach out to us</h2>
<p>If you are using any of the Frictionless Data
specifications—<a href="http://specs.frictionlessdata.io/table-schema/">JSON Table Schema</a>,
<a href="http://specs.frictionlessdata.io/data-package/">Data Packages</a>—for
your project, big or small, reach out to us. We can work together on
developing a case study to share your project with the world!</p>
Dan Fowler
Frictionless Data Specs Working Group
2016-10-17T00:00:00+00:00
http://okfnlabs.org/blog/2016/10/17/specs-working-group
<p>Last month, we had the first call of the <strong>Frictionless Data
Specifications Working Group</strong>, starting a new chapter in the project.
The call covered the status of the specifications to date, current
adoption, upcoming technical pilots and partnerships, and how work
will be organized going forward. In this post, I will lay out the
purpose for this initiative, who is participating, and how you can get
involved.</p>
<p><a href="http://frictionlessdata.io/"><img src="/img/posts/frictionlessdata-logo.png" alt="Frictionless Data Logo" /></a></p>
<h2 id="overview">Overview</h2>
<p><a href="http://frictionlessdata.io/">Frictionless Data</a> is a project
encompassing a set of tooling and specifications to ease the transport
and reuse of data. The specifications have grown out of a long
engagement with issues around data interoperability, publication
workflows, and analysis. For most of the history of this project, the
specifications were curated by Rufus Pollock as one of several “Data
Protocols” with input and assistance from individuals from Open
Knowledge International and other organizations. As a result, the
specifications have steadily gained traction across various projects
and software developed by, among others, the
<a href="http://theodi.org/">Open Data Institute (ODI)</a>,
<a href="http://tesera.com/">Tesera Systems, Inc.</a>,
<a href="https://dataship.io/">Dataship</a>, and
<a href="http://open-power-system-data.org/">Open Power System Data</a>.</p>
<p>This adoption validates the approach we’ve taken: creating a minimum
viable set of specifications to significantly improve transport of
data. In reaching out to <em>new</em> users, we would like to make sure that
we have resolved some of the outstanding edge cases to ensure that
Data Packages can serve as a solid foundation for many more types of
data-intensive applications. This work is all the more important as
“core” libraries in
<a href="https://github.com/frictionlessdata/datapackage-py">Python</a>,
<a href="https://github.com/frictionlessdata/datapackage-js">Javascript</a>, and
<a href="https://github.com/theodi/datapackage.rb">Ruby</a> are currently being
refined, and newer libraries, like
<a href="https://github.com/frictionlessdata/datapackage-r">R</a>, are being
developed. With that in mind, we have organized a working group with
a specific goal: to deliver a first, complete version of the
specifications by end of this year.</p>
<h2 id="working-group">Working Group</h2>
<p>Members of the working group currently include:</p>
<ul>
<li><a href="https://twitter.com/_pwalsh">Paul Walsh</a> (Open Knowledge International)</li>
<li><a href="https://twitter.com/rufuspollock">Rufus Pollock</a> (Open Knowledge International)</li>
<li><a href="https://twitter.com/danfowler">Dan Fowler</a> (Open Knowledge International)</li>
<li><a href="https://twitter.com/domoritz">Dominik Moritz</a> (<a href="http://www.cs.washington.edu/">University of Washington</a>)</li>
<li><a href="https://twitter.com/starl3n">Steven De Costa</a> (<a href="http://linkdigital.com.au/">Link Digital</a>)</li>
<li><a href="https://twitter.com/mckinneyjames">James McKinney</a> (<a href="http://www.opennorth.ca/">Open North</a>)</li>
<li><a href="https://twitter.com/okdistribute">Karissa McKelvey</a> (<a href="http://dat-data.com/">Dat Data</a></li>
<li><a href="https://twitter.com/TheSpencerCox">Spencer Cox</a> (<a href="http://tesera.com/">Tesera Systems, Inc.</a>)</li>
</ul>
<p>Work will continue to happen asynchronously, in the open,
without excessive rules around voting. Rather, we will listen to
feedback and act in favor of consensus (without requiring it). Rufus
Pollock, having led this work for many years with a strong focus on
keeping it simple, will remain the curator; decisions of what stays or
goes from the specs will rest with him. Having more eyes on the specs,
with a variety of different perspectives, will allow us to solidify the
specs, remove ambiguous statements, eliminate unnecessary repetition and
logical errors, and, hopefully, achieve a minimal 1.0 by the end of 2016.
Beyond the core Data Package specifications, open topics might include
defining further custom “profiles”
(e.g. <a href="http://specs.frictionlessdata.io/fiscal-data-package/">Fiscal Data Package</a>),
as well as potential extensions, including specifications for
<a href="https://discuss.okfn.org/t/data-packages-views-graphs-maps-tables-etc/2667">visualizations</a>,
statistics, and quality metrics for data.</p>
<h2 id="feedback-needed">Feedback Needed</h2>
<p>Are you currently using or considering using the Frictionless Data
specifications for your data or application? If so, please let us
know!</p>
<p>Work is managed via an
<a href="https://github.com/frictionlessdata/specs/issues">issue tracker</a> on
GitHub, which is the best way to raise specific questions. If you
would like to specifically flag an issue for the Working Group,
mention <strong>@frictionlessdata/specs-working-group</strong> in the comment. For
general commentary on any aspect of Frictionless Data, you can leave a
comment on the <a href="https://discuss.okfn.org/c/frictionless-data">forum</a>.</p>
<ul>
<li>Current Specifications: <a href="http://specs.frictionlessdata.io/">http://specs.frictionlessdata.io/</a>
<ul>
<li>JSON Schema (for validation): <a href="https://github.com/frictionlessdata/schemas">https://github.com/frictionlessdata/schemas</a></li>
</ul>
</li>
<li>Specs Issue Tracker: <a href="https://github.com/frictionlessdata/specs/issues">https://github.com/frictionlessdata/specs/issues</a>
<ul>
<li>Current Milestone: <a href="https://github.com/frictionlessdata/specs/milestone/1">https://github.com/frictionlessdata/specs/milestone/1</a></li>
</ul>
</li>
<li>Forum: <a href="https://discuss.okfn.org/c/frictionless-data">https://discuss.okfn.org/c/frictionless-data</a></li>
</ul>
<hr />
<p><em>Thanks to Paul Walsh, who provided the motivating text that served as
the basis for this post, and Jo Barratt, who did much of the organizing
necessary to make it happen.</em></p>
Dan Fowler
Building 2030-watch.de: measuring progress towards the sustainable development goals (SDGs)
2016-10-13T00:00:00+00:00
http://okfnlabs.org/blog/2016/10/13/2030-watch
<p>For the last 15 months the Open Knowledge Foundation Germany has been working on a prototype to monitor progress towards the sustainable development goals (SDGs) from an independent, civil society-led perspective. There’s a detailed blog post on why such independent monitoring is necessary at <a href="https://www.2030-watch.de/en/blog/2016/10/06/blog/">our blog</a>. To give a quick example, the UN Commission agreed to measure the tax revenue generated by low-income countries but doesn’t propose an indicator to measure the financial secrecy of European countries, which, for example, encourages tax evasion. At 2030-watch we have the Tax Justice Network as our data partner, providing an indicator on this topic. The Tax Justice Network also collaborates with Open Knowledge on <a href="http://datafortaxjustice.net/">tax justice</a>.</p>
<p>Due to “cherry picking” of indicators, there is a high risk that the ambition of the 2030 Agenda is watered down at the monitoring stage. This is why we have created 2030-Watch: a tool that focuses on high-income countries and uses a visualisation to highlight which countries are doing well at achieving which goals, drawing on over 60 indicators built from data from official sources like Eurostat and the OECD as well as from civil society organisations. The results might surprise you, <a href="https://www.2030-watch.de/en/">so head on over</a> and take a look!</p>
<p>It’s been my pleasure to lead development work on the project the last few months: creating a workflow for uploading indicator data, reworking the site to showcase indicator “sponsors” and allowing multilingual texts. Last week we were very proud to launch the English version: <a href="https://www.2030-watch.de/en/">2030-watch.de/en/</a>. The site is generated using the static site generator <a href="https://jekyllrb.com/">Jekyll</a>. Recent Jekyll versions have a wonderful ability <a href="https://jekyllrb.com/docs/datafiles/">to ingest JSON data and make it available to templates</a>. We’ve used this facility to make a JSON database of all indicators available to various small visualisation web applications written in AngularJS. We have recently moved from direct editing of JSON files (one per indicator) via GitHub to reading in data from standardized Google Sheets and automatically outputting JSON files as updates to the GitHub repository. This change was made to make indicator sponsorship easier for external parties. It has the side benefit of allowing conversion to CSV, ODS and XLSX formats using the Google Drive API. <a href="https://github.com/okfde/2030-watch.de">Source code for the website</a> is of course open and you can take a look at batch processing Google Drive sheets at <a href="https://github.com/okfde/2030-watch-dataprocessing">our data processing repository</a>.</p>
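<p>For a rough sense of what that conversion step involves, here is a minimal Python sketch. Note that the project itself reads the sheets via the Google Drive API; this sketch instead assumes a sheet published to the web as CSV, and the URL, column name, and output paths are all placeholders:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import csv
import io
import json
import urllib.request

# Hypothetical URL of a Google Sheet published to the web as CSV
SHEET_CSV_URL = 'https://docs.google.com/spreadsheets/d/SHEET_ID/export?format=csv'

with urllib.request.urlopen(SHEET_CSV_URL) as response:
    rows = list(csv.DictReader(io.TextIOWrapper(response, encoding='utf-8')))

# Write one JSON file per indicator so Jekyll can pick the data up
# from its data directory (the file layout here is illustrative)
for row in rows:
    with open('_data/indicators/{}.json'.format(row['id']), 'w') as target:
        json.dump(row, target, indent=2, ensure_ascii=False)</code></pre></figure>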
<p><img src="/img/posts/2030watchsample.png" alt="The country comparison tool shows multiple countries’ performance across all indicators in a simple color-coded fashion as well as allowing comparisons to be made between the countries for each indicator" />
<em>The country comparison tool shows multiple countries’ performance across all indicators in a simple color-coded fashion as well as allowing comparisons to be made between the countries for each indicator</em></p>
<p>2030 Watch is still a prototype, developed with scarce resources and a lot of voluntary work. We still have a lot to do and are looking for help in many areas, technical and non-technical. For example, on the technical side we would love to see ideas for how the site can become even more user-friendly and informative, how the visualisations could work with up to 90 indicators, or how we could adapt the data tool for mobile use. For further details on how to get involved, <a href="mailto:info@2030-watch.de">contact us at info@2030-watch.de</a>. We are also raising financial contributions at <a href="https://www.betterplace.org/en/projects/25565-2030-watch-de-germany-on-the-path-to-sustainability">betterplace.org</a>. Despite the remaining challenges, we feel that 2030 Watch already demonstrates that civil society monitoring is possible.</p>
<p><strong>Disclaimer and acknowledgements:</strong> This post has reused some of Claudia’s <a href="https://www.2030-watch.de/en/blog/2016/10/06/blog/">longer post at 2030-watch.de</a>. I am responsible for the development and data preparation effort carried out since September 2016. Special thanks go to fellow labs member <a href="http://okfnlabs.org/members/markbrough/">Mark Brough</a> who has reworked the visuals wonderfully in the last months, to <a href="http://katjadittrich.com/">Katja Dittrich</a> who created the data visualisations and to <a href="https://www.xing.com/profile/Christian_Pape18">Christian Pape</a> who developed the first version of the site in 2015.</p>
Matt Fullerton
Embulk at csv,conf,v2
2016-08-04T00:00:00+00:00
http://okfnlabs.org/blog/2016/08/04/embulk
<p>Having co-organized csv,conf,v2 this past May, a few of us from Open
Knowledge International had the awesome opportunity to travel to
Berlin and sit in on a range of fascinating talks on the current
state-of-the-art on wrangling messy data. Previously, I posted about
Comma Chameleon by Stuart Harrison<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. Another such talk was given
by Sadayuki Furuhashi of Treasure Data<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> who presented on a tool he is
developing called <strong>Embulk</strong>. Embulk is an open-source tool for
moving messy data.</p>
<p><img src="/img/posts/embulk-sadayuki.jpg" alt="Sadayuki Furuhashi" /></p>
<h2 id="friction-in-data-transport">Friction in Data Transport</h2>
<p>In his talk, Sadayuki talked about the <em>friction</em> commonly experienced
in moving large amounts of data from one system to another. He gives
a relatively simple example of trying to push a 10GB CSV file to
PostgreSQL and encountering a series of issues—broken or missing
records, unsupported time formats—that can typically be dealt with
only through <em>trial and error</em>. Multiply these kinds of issues across
the various and growing number of backends and file formats, and it
quickly becomes clear that there’s not enough time in the day for data
wranglers to write and optimize their scripts to move data flexibly
and efficiently. Enter Embulk.</p>
<h2 id="embulk">Embulk</h2>
<p><a href="http://www.embulk.org/">Embulk</a> is open-source tool for transporting massive, messy
datasets—in parallel—from one system to another. In this context,
“system” can refer to any number of endpoints including Amazon S3, an
SQL database, or even a CSV file on your local computer. Embulk
attempts to solve the issues above by creating a plugin-based
framework that supports various data transport tasks, including file
type and format guessing, processing, filtering, and encryption.</p>
<p><img src="/img/posts/embulk-logo.png" alt="Sadayuki Furuhashi" /></p>
<p>Specialized connectors for supporting different storage
engines—RDBMSs, cloud services, etc.—as well as various file types
—CSV, XML, JSON, HDF5, etc.—can be created by the community as
<a href="http://www.embulk.org/plugins/">plugins</a>. The core of Embulk is actually quite small
and gains most of its power from this plugin architecture.</p>
<h2 id="frictionless-data">Frictionless Data</h2>
<p>While watching his presentation, I realized that there is a lot of
opportunity for collaboration between the work Embulk is doing and the
ecosystem we’re trying to build through our <a href="http://frictionlessdata.io/">Frictionless Data</a>
project. In our project, we’re looking to support easy and efficient
transport of data primarily through the development and promotion of
the <a href="http://specs.frictionlessdata.io">Frictionless Data specifications</a> and the development of
various <a href="http://frictionlessdata.io/software/">libraries, tools, and integrations</a>. For instance,
our Python library for reading and working with
<a href="http://frictionlessdata.io/guides/table-schema/">JSON Table Schema</a> also supports a plugin-architecture for
reading and storing data in a variety of backends. Currently, we have
support for <a href="https://github.com/frictionlessdata/jsontableschema-pandas-py">Pandas</a>, <a href="https://github.com/frictionlessdata/jsontableschema-bigquery-py">BigQuery</a>, and
<a href="https://github.com/frictionlessdata/jsontableschema-sql-py">SQL</a> (visit our <a href="http://frictionlessdata.io/user-stories/">User Stories</a> page to vote for and
comment on what you’d like to see next).</p>
<p><img src="/img/posts/embulk-presentation.jpg" alt="Embulk Presentation" /></p>
<h3 id="embulk-guess-and-data-packages">Embulk Guess and Data Packages</h3>
<p>As an example of the potential overlap, we can demonstrate the
<code class="language-plaintext highlighter-rouge">schema</code> and <code class="language-plaintext highlighter-rouge">dialect</code> guessing that Embulk employs to load data. To
support loading CSV data into a variety of backends, Embulk needs a
good idea of the types of records (<code class="language-plaintext highlighter-rouge">schema</code>) and also the rules by
which these values are separated in the file (<code class="language-plaintext highlighter-rouge">dialect</code>). Embulk’s
<code class="language-plaintext highlighter-rouge">guess</code> function (<code class="language-plaintext highlighter-rouge">embulk guess</code>) makes guesses about the file
structure and outputs something similar to the <code class="language-plaintext highlighter-rouge">datapackage.json</code> (see
our <a href="http://specs.frictionlessdata.io">specifications</a> for more details). To demonstrate, we can
use Embulk’s convenient <code class="language-plaintext highlighter-rouge">example</code> function which creates an example
CSV. It looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id,account,time,purchase,comment
1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby
3,27559,2015-01-28 02:20:02,20150128,"Embulk ""csv"" parser plugin"
4,11270,2015-01-29 11:54:36,20150129,NULL
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">guess</code> function reads the data and generates a configuration file
used by Embulk that looks like the following. Of particular interest
is the <code class="language-plaintext highlighter-rouge">parser</code> section.</p>
<figure class="highlight"><pre><code class="language-yaml" data-lang="yaml"><span class="na">in</span><span class="pi">:</span>
<span class="na">type</span><span class="pi">:</span> <span class="s">file</span>
<span class="na">path_prefix</span><span class="pi">:</span> <span class="s">/Users/dan/Desktop/demo/csv/sample_</span>
<span class="na">decoders</span><span class="pi">:</span>
<span class="pi">-</span> <span class="pi">{</span><span class="nv">type</span><span class="pi">:</span> <span class="nv">gzip</span><span class="pi">}</span>
<span class="na">parser</span><span class="pi">:</span>
<span class="na">charset</span><span class="pi">:</span> <span class="s">UTF-8</span>
<span class="na">newline</span><span class="pi">:</span> <span class="s">CRLF</span>
<span class="na">type</span><span class="pi">:</span> <span class="s">csv</span>
<span class="na">delimiter</span><span class="pi">:</span> <span class="s1">'</span><span class="s">,'</span>
<span class="na">quote</span><span class="pi">:</span> <span class="s1">'</span><span class="s">"'</span>
<span class="na">escape</span><span class="pi">:</span> <span class="s1">'</span><span class="s">"'</span>
<span class="na">null_string</span><span class="pi">:</span> <span class="s1">'</span><span class="s">NULL'</span>
<span class="na">trim_if_not_quoted</span><span class="pi">:</span> <span class="no">false</span>
<span class="na">skip_header_lines</span><span class="pi">:</span> <span class="m">1</span>
<span class="na">allow_extra_columns</span><span class="pi">:</span> <span class="no">false</span>
<span class="na">allow_optional_columns</span><span class="pi">:</span> <span class="no">false</span>
<span class="na">columns</span><span class="pi">:</span>
<span class="pi">-</span> <span class="pi">{</span><span class="nv">name</span><span class="pi">:</span> <span class="nv">id</span><span class="pi">,</span> <span class="nv">type</span><span class="pi">:</span> <span class="nv">long</span><span class="pi">}</span>
<span class="pi">-</span> <span class="pi">{</span><span class="nv">name</span><span class="pi">:</span> <span class="nv">account</span><span class="pi">,</span> <span class="nv">type</span><span class="pi">:</span> <span class="nv">long</span><span class="pi">}</span>
<span class="pi">-</span> <span class="pi">{</span><span class="nv">name</span><span class="pi">:</span> <span class="nv">time</span><span class="pi">,</span> <span class="nv">type</span><span class="pi">:</span> <span class="nv">timestamp</span><span class="pi">,</span> <span class="nv">format</span><span class="pi">:</span> <span class="s1">'</span><span class="s">%Y-%m-%d</span><span class="nv"> </span><span class="s">%H:%M:%S'</span><span class="pi">}</span>
<span class="pi">-</span> <span class="pi">{</span><span class="nv">name</span><span class="pi">:</span> <span class="nv">purchase</span><span class="pi">,</span> <span class="nv">type</span><span class="pi">:</span> <span class="nv">timestamp</span><span class="pi">,</span> <span class="nv">format</span><span class="pi">:</span> <span class="s1">'</span><span class="s">%Y%m%d'</span><span class="pi">}</span>
<span class="pi">-</span> <span class="pi">{</span><span class="nv">name</span><span class="pi">:</span> <span class="nv">comment</span><span class="pi">,</span> <span class="nv">type</span><span class="pi">:</span> <span class="nv">string</span><span class="pi">}</span>
<span class="na">out</span><span class="pi">:</span> <span class="pi">{</span><span class="nv">type</span><span class="pi">:</span> <span class="nv">stdout</span><span class="pi">}</span></code></pre></figure>
<p>As you can see, <code class="language-plaintext highlighter-rouge">embulk guess</code> goes a bit further than
similar type-guessing functions: it guesses not only that a column is
a date, but also the expected date format.</p>
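<p>On the Frictionless Data side, the <a href="https://github.com/frictionlessdata/jsontableschema-py">jsontableschema</a> Python library offers type guessing of its own through its <code class="language-plaintext highlighter-rouge">infer</code> function. A minimal sketch, re-typing the example rows above by hand:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">from jsontableschema import infer  # pip install jsontableschema

headers = ['id', 'account', 'time', 'purchase', 'comment']
rows = [
    ['1', '32864', '2015-01-27 19:23:49', '20150127', 'embulk'],
    ['2', '14824', '2015-01-27 19:01:23', '20150127', 'embulk jruby'],
]

# infer() returns a JSON Table Schema descriptor guessed from the rows
schema = infer(headers, rows)
print(schema['fields'])</code></pre></figure>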
<p>Many of the fields in the <code class="language-plaintext highlighter-rouge">parser</code> section can be represented in a
Data Package following the <a href="http://frictionlessdata.io/guides/table-schema/">JSON Table Schema</a> and
<a href="http://specs.frictionlessdata.io/csv-dialect/">CSV Dialect Description Format</a> specifications, which are
part of the <a href="http://frictionlessdata.io/guides/data-package/">Data Package</a> specifications. Here’s what the
equivalent <code class="language-plaintext highlighter-rouge">datapackage.json</code> file would look like:</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="p">{</span>
<span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">sample_01</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">resources</span><span class="dl">"</span><span class="p">:</span> <span class="p">[</span>
<span class="p">{</span>
<span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">sample-01</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">path</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">sample_01.csv</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">format</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">csv</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">encoding</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">UTF-8</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">dialect</span><span class="dl">"</span><span class="p">:</span> <span class="p">{</span>
<span class="dl">"</span><span class="s2">lineTerminator</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="se">\r\n</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">delimiter</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">,</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">quoteChar</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="se">\"</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">escapeChar</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="se">\"</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">header</span><span class="dl">"</span><span class="p">:</span> <span class="kc">true</span>
<span class="p">}</span>
<span class="dl">"</span><span class="s2">schema</span><span class="dl">"</span><span class="p">:</span> <span class="p">{</span>
<span class="dl">"</span><span class="s2">fields</span><span class="dl">"</span><span class="p">:</span> <span class="p">[</span>
<span class="p">{</span> <span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">id</span><span class="dl">"</span><span class="p">,</span> <span class="dl">"</span><span class="s2">type</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">integer</span><span class="dl">"</span> <span class="p">},</span>
<span class="p">{</span> <span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">account</span><span class="dl">"</span><span class="p">,</span> <span class="dl">"</span><span class="s2">type</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">integer</span><span class="dl">"</span> <span class="p">},</span>
<span class="p">{</span> <span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">time</span><span class="dl">"</span><span class="p">,</span> <span class="dl">"</span><span class="s2">type</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">datetime</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">format</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">fmt:%Y-%m-%d %H:%M:%S</span><span class="dl">"</span> <span class="p">},</span>
<span class="p">{</span> <span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">purchase</span><span class="dl">"</span><span class="p">,</span> <span class="dl">"</span><span class="s2">type</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">date</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">format</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">fmt:%Y%m%d</span><span class="dl">"</span> <span class="p">},</span>
<span class="p">{</span> <span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">comment</span><span class="dl">"</span><span class="p">,</span> <span class="dl">"</span><span class="s2">type</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">string</span><span class="dl">"</span> <span class="p">}</span>
<span class="p">]</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">]</span>
<span class="p">}</span></code></pre></figure>
<h3 id="the-potential-future">The Potential Future</h3>
<p>Embulk looks like a really powerful tool that can definitely be part
of the Frictionless Data ecosystem we envision. I would love to see an
input plugin that reads a <code class="language-plaintext highlighter-rouge">datapackage.json</code> to give Embulk’s CSV parser
what it needs. It would also be great to see an output plugin that
can produce a valid <code class="language-plaintext highlighter-rouge">datapackage.json</code> file with your data. Once
you’ve generated a schema using Embulk’s powerful guessing
functionality, publishing the schema with your data in a standard
format like the <a href="http://frictionlessdata.io/guides/data-package/">Data Package</a> is an excellent step towards making
your data more <em>findable</em> and <em>reusable</em>. The Frictionless Data
project is, at its heart, about highlighting the benefits of adopting
just such a standardized <a href="http://frictionlessdata.io/about/#data-containerization">containerization</a> approach
to data. Of course, even without this, Embulk is a really powerful
tool for solving some of the problems of data transport today. Give
it a try!</p>
<ul>
<li>Download Embulk: <a href="https://github.com/embulk/embulk">https://github.com/embulk/embulk</a></li>
<li>Follow Treasure Data on Twitter: <a href="https://twitter.com/TreasureData">https://twitter.com/TreasureData</a></li>
<li>Follow Sadayuki Furuhashi on Twitter: <a href="https://twitter.com/frsyuki/">https://twitter.com/frsyuki/</a></li>
<li>Follow Open Knowledge Labs on Twitter: <a href="https://twitter.com/okfnlabs">https://twitter.com/okfnlabs</a></li>
<li>See the full range of speakers from csv,conf,v2: <a href="http://csvconf.com">http://csvconf.com</a></li>
</ul>
<p>See Sadayuki’s full talk:</p>
<iframe width="576px" height="360px" src="https://www.youtube.com/embed/RuA_SL5-sXY" frameborder="0" allowfullscreen=""></iframe>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Comma Chameleon: <a href="/blog/2016/07/18/comma-chameleon.html">/blog/2016/07/18/comma-chameleon.html</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Treasure Data: <a href="https://www.treasuredata.com/">https://www.treasuredata.com/</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Dan Fowler
Using Data Packages with Pandas
2016-08-01T00:00:00+00:00
http://okfnlabs.org/blog/2016/08/01/using-data-packages-with-pandas
<p>Frictionless Data is about making it effortless to transport high
quality data among different tools and platforms for further analysis.
We obviously ♥ data science, and pandas is one of the most
popular Python libraries for advanced <em>data analysis and modeling</em>.
This post highlights our most recent community
contribution<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>—pandas integration for Data Packages—what it
means, and how you can contribute.</p>
<h2 id="pandas">Pandas</h2>
<p><a href="http://pandas.pydata.org/"><img src="/img/posts/pandas_logo.png" alt="Pandas" /></a></p>
<p>From the
<a href="http://pandas.pydata.org/pandas-docs/stable/">pandas documentation</a>:</p>
<blockquote>
<p>pandas is a Python package providing fast, flexible, and expressive
data structures designed to make working with “relational” or
“labeled” data both easy and intuitive. It aims to be the
fundamental high-level building block for doing practical, real
world data analysis in Python.</p>
</blockquote>
<p>One of the primary data structures in pandas is the <strong>DataFrame</strong>. The
DataFrame, similar to <a href="https://www.r-project.org/">R</a>’s <strong>data
frame</strong>, stores the kind of 2-dimensional, tabular data common across
various data analysis use cases. While pandas has extremely powerful
tools for importing, exporting, and manipulating data, the process of
loading data from, say, a single CSV file, often requires some trial
and error to do optimally. For instance, one might need to manually
specify CSV dialect parameters, index columns, datetime fields, etc.
Pandas has automatic type and encoding guessing, but guessing often
fails, requiring manual intervention to accurately describe and load
your data. (See
<a href="/blog/2016/07/14/using-data-packages-with-r.html">my recent post on R</a>
for an example of this.)</p>
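<p>To make that concrete, here is a sketch of the kind of boilerplate a single CSV can require; the file name and column names are placeholders:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import pandas as pd

# Without a schema, dialect and type details are spelled out by hand
df = pd.read_csv(
    'data.csv',              # placeholder file name
    sep=',',                 # CSV dialect: delimiter
    encoding='utf-8',        # file encoding
    parse_dates=['Date'],    # columns to parse as datetimes
    index_col='Date',        # column to use as the index
)</code></pre></figure>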
<p>A Tabular Data Package consists of one or more CSV resources, each
containing a <em>schema</em> (indicating type, constraints, and other
metadata useful for validation and analysis) and, optionally, a
<em>dialect</em> (specifying characters for separating or quoting values).
See our
<a href="http://frictionlessdata.io/guides/table-schema/">JSON Table Schema guide</a>
and the <a href="http://dataprotocols.org/csv-dialect/">CSVDDF</a> specification
for more information. Given that a single Tabular Data Package can
consist of multiple tables, pandas integration means loading multiple
DataFrames—with appropriately set types, encodings, indexes and
dialects—at once. And once you have Tabular Data Packages in a
pandas DataFrame, you now get all the power provided by Pandas to
reshape, explore and visualise data as well as access to Pandas’
<a href="http://pandas.pydata.org/pandas-docs/stable/io.html">variety of export formats</a>.</p>
<h2 id="jsontableschema-pandas">jsontableschema-pandas</h2>
<p>The newly developed
<a href="https://github.com/frictionlessdata/jsontableschema-pandas-py">Pandas plugin</a>
allows users to generate and load Pandas DataFrames based on JSON
Table Schema descriptors. In order to use it, you first need to
install the <code class="language-plaintext highlighter-rouge">datapackage</code> and <code class="language-plaintext highlighter-rouge">jsontableschema-pandas</code> libraries.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">pip <span class="nb">install </span>datapackage
pip <span class="nb">install </span>jsontableschema-pandas</code></pre></figure>
<p>You can load a Data Package into your environment by using the
<code class="language-plaintext highlighter-rouge">datapackage.push_datapackage</code> function. We pass the URL of the
descriptor file (<code class="language-plaintext highlighter-rouge">datapackage.json</code>) and choose <code class="language-plaintext highlighter-rouge">pandas</code> as
our backend:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">datapackage</span>
<span class="kn">import</span> <span class="nn">pandas</span>
<span class="n">url</span> <span class="o">=</span> <span class="s">'https://raw.githubusercontent.com/frictionlessdata/example-data-packages/master/cpi/datapackage.json'</span>
<span class="n">storage</span> <span class="o">=</span> <span class="n">datapackage</span><span class="p">.</span><span class="n">push_datapackage</span><span class="p">(</span><span class="n">descriptor</span><span class="o">=</span><span class="n">url</span><span class="p">,</span><span class="n">backend</span><span class="o">=</span><span class="s">'pandas'</span><span class="p">)</span></code></pre></figure>
<p>Once loaded into memory, the <code class="language-plaintext highlighter-rouge">tables</code> attribute lists the names of the
tables stored in the Data Package.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">storage</span><span class="p">.</span><span class="n">tables</span></code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">['data__cpi']</code></p>
<p>In this case, we have a single table, <code class="language-plaintext highlighter-rouge">data__cpi</code>, which we can take a
peek at using the Pandas <code class="language-plaintext highlighter-rouge">head()</code> method.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">storage</span><span class="p">[</span><span class="s">'data__cpi'</span><span class="p">].</span><span class="n">head</span><span class="p">()</span></code></pre></figure>
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Country Name</th>
<th>Country Code</th>
<th>Year</th>
<th>CPI</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Afghanistan</td>
<td>AFG</td>
<td>2004-01-01</td>
<td>63.131893</td>
</tr>
<tr>
<th>1</th>
<td>Afghanistan</td>
<td>AFG</td>
<td>2005-01-01</td>
<td>71.140974</td>
</tr>
<tr>
<th>2</th>
<td>Afghanistan</td>
<td>AFG</td>
<td>2006-01-01</td>
<td>76.302178</td>
</tr>
<tr>
<th>3</th>
<td>Afghanistan</td>
<td>AFG</td>
<td>2007-01-01</td>
<td>82.774807</td>
</tr>
<tr>
<th>4</th>
<td>Afghanistan</td>
<td>AFG</td>
<td>2008-01-01</td>
<td>108.066600</td>
</tr>
</tbody>
</table>
</div>
<p>At this point, you can treat <code class="language-plaintext highlighter-rouge">storage['data__cpi']</code> as you would any
other DataFrame in Pandas. For more detail on how to interact with
the library and where to go from here, please visit the links below:</p>
<ul>
<li>Package on PyPI: <a href="https://pypi.python.org/pypi/jsontableschema-pandas">https://pypi.python.org/pypi/jsontableschema-pandas</a></li>
<li>Source on GitHub: <a href="https://github.com/frictionlessdata/jsontableschema-pandas-py">https://github.com/frictionlessdata/jsontableschema-pandas-py</a></li>
</ul>
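<p>As a quick illustration, here is a short sketch that filters and exports the CPI table with ordinary pandas operations; the output file name is arbitrary:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">df = storage['data__cpi']

# Filter to a single country and summarise the CPI column
afghanistan = df[df['Country Code'] == 'AFG']
print(afghanistan['CPI'].describe())

# Export with one of pandas' many writers
afghanistan.to_csv('afghanistan-cpi.csv', index=False)</code></pre></figure>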
<h2 id="contributing">Contributing</h2>
<p>The Python library
<a href="https://github.com/frictionlessdata/jsontableschema-py">jsontableschema-py</a>
provides the core set of utilities for working with Tabular Data
Package tables, and it implements a plugin-based system for adding
different
<a href="https://github.com/frictionlessdata/jsontableschema-py#storage">storage</a>
backends. In a
<a href="http://okfnlabs.org/blog/2016/03/11/frictionless-data-transport-in-python.html">recent post</a>,
I highlighted the first two of these storage integrations:
<a href="https://github.com/frictionlessdata/jsontableschema-sql-py">SQL</a> and
<a href="https://github.com/frictionlessdata/jsontableschema-bigquery-py">BigQuery</a>.
These libraries, and the Pandas library, were written as drivers
implementing the <code class="language-plaintext highlighter-rouge">jsontableschema.storage.Storage</code>
<a href="https://github.com/frictionlessdata/jsontableschema-py#storage">interface</a>.
If you have another storage backend you’d like to use with Data
Packages in Python, consider writing a
<a href="https://github.com/frictionlessdata/jsontableschema-py#plugins">plugin</a>.</p>
<p><img src="http://okfnlabs.org/img/posts/tabular-storage-diagram.png" alt="Plugins" /></p>
<p>We’re also looking to support other integrations beyond Python. You
can find user stories we’re looking to support on the
<a href="http://frictionlessdata.io/user-stories/">User Stories</a> section of
the Frictionless Data site. Do you have a library, tool, or platform
that you’d like to see support importing and exporting Data Packages?
Let us know by voting and commenting on what you’d like to see! If
you have any questions about how to contribute, jump into the
<a href="https://gitter.im/frictionlessdata/chat">Frictionless Data chat</a> or
<a href="https://discuss.okfn.org/c/frictionless-data">post in the forum</a>.</p>
<p>To see the code used in this post, visit its
<a href="https://github.com/okfn/okfn.github.com/blob/master/resources/using-data-packages-with-pandas.ipynb">Jupyter Notebook</a>.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Thanks @sirex for the contribution! <a href="http://sirex.lt">http://sirex.lt</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Dan Fowler
Publish Data Packages to DataHub (CKAN)
2016-07-25T00:00:00+00:00
http://okfnlabs.org/blog/2016/07/25/publish-data-packages-to-datahub-ckan
<p>Back in March, I wrote about a CKAN extension for publishing and
exporting Data Packages<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. This extension, <code class="language-plaintext highlighter-rouge">datapackager</code>, has been
updated and is now <strong>live</strong> on our very own CKAN instance,
<strong>DataHub</strong>. DataHub users can now import and export Data Packages
via the CKAN UI and API. This post will show you how.</p>
<p><a href="https://datahub.io"><img src="/img/posts/datahub.png" alt="DataHub" /></a></p>
<h2 id="datahub-and-data-packages">DataHub and Data Packages</h2>
<p><a href="https://datahub.io">DataHub</a> is a free, powerful data management platform hosted by
Open Knowledge International. It is powered by <a href="http://ckan.org/">CKAN</a>, the
leading open-source data management system used by governments and
civic organizations
<a href="http://ckan.org/instances/#">around the world</a>—including
<a href="http://www.data.gov/">Data.gov</a> and
<a href="https://data.gov.uk/">data.gov.uk</a>. In this post, I describe how to
load “Data Packages” onto DataHub to take advantage of CKAN’s powerful
visualization and analytics features.</p>
<p>A <a href="http://frictionlessdata.io/guides/data-package">Data Package</a> is a coherent collection of data, metadata, and
other assets. Open Knowledge International is currently working on
<a href="http://frictionlessdata.io/">Frictionless Data</a>, a project aimed at creating an ecosystem for
<em>frictionless</em> data transport by defining the Data Package standard
and designing the tools and integrations that support them. Given its
ubiquity as a data publishing platform, CKAN support is an important
part of this strategy.</p>
<h2 id="importing-a-data-package-into-datahubio">Importing a Data Package into DataHub.io</h2>
<ol>
<li><strong>Register on DataHub</strong>: If you’re not already a DataHub user, you
will need to <a href="https://datahub.io/user/register">register for an account</a>. Once registered,
you will also need to <a href="https://discuss.okfn.org/t/creating-a-dataset-on-the-datahub/1627">request an “organization”</a> via our
forum. New datasets can only be loaded on DataHub if they are
associated with an organization.</li>
<li><strong>Create Your Data Package</strong>: If you don’t have your data in a Data
Package already, you can visit this
<a href="http://datapackagist.okfnlabs.org/">online Data Package creator</a> or
<a href="http://frictionlessdata.io/guides/creating-tabular-data-packages-in-python/">create a Data Package programmatically in Python</a>. If you
are just interested in trying out this demo, you should be able to
visit the <a href="https://github.com/datasets/">datasets organization</a> on GitHub and download any
of the repos as a zip file.</li>
<li><strong>Zip your Data Package</strong>: If you created your Data
Package in the previous step, create a new zip file from the Data
Package folder with the <code class="language-plaintext highlighter-rouge">datapackage.json</code> at the root. If you are on
a Unix-type machine, you can usually run <code class="language-plaintext highlighter-rouge">zip -r
my-datapackage-to-import.zip <data package directory></code> (a Python
equivalent is sketched after this list). <strong>Note</strong>:
make sure your packaged data, unzipped, is <strong>less than 100MB</strong>, as
this is the current size limit on DataHub.</li>
<li><strong>Import your Data Package</strong>: While signed in, click on
“Import Data Package” on the page of the
organization you created in Step 1, and upload the zipped Data Package
you created in the previous step.</li>
</ol>
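<p>As promised above, here is a minimal Python equivalent of the zip step; the directory and archive names are placeholders:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import shutil

# Archive the Data Package directory so that datapackage.json ends up
# at the root of the resulting zip file; paths are placeholders
shutil.make_archive(
    'my-datapackage-to-import',        # produces my-datapackage-to-import.zip
    'zip',
    root_dir='path/to/data-package',   # directory containing datapackage.json
)</code></pre></figure>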
<p>Once your Data Package has been successfully imported, you should be
able to use the dataset as you would any dataset on DataHub. This
includes adding or editing any of your dataset’s metadata, or
accessing the dataset using the <a href="http://docs.ckan.org/en/latest/api/">CKAN API</a>.</p>
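<p>For example, here is a minimal sketch of fetching the imported dataset’s metadata through the API using the third-party <code class="language-plaintext highlighter-rouge">ckanapi</code> package (<code class="language-plaintext highlighter-rouge">pip install ckanapi</code>); the dataset name is a placeholder:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">from ckanapi import RemoteCKAN

datahub = RemoteCKAN('https://datahub.io')

# package_show returns the dataset's metadata as a dictionary
dataset = datahub.action.package_show(id='my-imported-datapackage')
print(dataset['title'])
for resource in dataset['resources']:
    print(resource['name'], resource['url'])</code></pre></figure>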
<h3 id="screencast">Screencast</h3>
<p>This screencast walks through the import steps outlined above.</p>
<p><img src="https://github.com/ckan/ckanext-datapackager/raw/master/doc/images/ckanext-datapackager-import-demo.gif" alt="Screencast" /></p>
<h3 id="exporting-a-data-package-from-datahubio">Exporting a Data Package from DataHub.io</h3>
<p>Exporting a Data Package from DataHub is even easier. Just navigate
to the dataset you’d like to export, click on “Download Data Package”,
and a <code class="language-plaintext highlighter-rouge">datapackage.json</code> file will be downloaded to your computer.
The JSON file will contain the Data Package representation of the
metadata stored on DataHub as well as links to the resources stored on
DataHub.</p>
<h3 id="ckan-data-packager-and-other-extensions">CKAN Data Packager and Other Extensions</h3>
<p>For information on importing and exporting data via the
<a href="http://docs.ckan.org/en/latest/api/">CKAN API</a>, or if you are interested in adding <code class="language-plaintext highlighter-rouge">datapackager</code> to
your own CKAN instance, you can read more in the extension
<a href="https://github.com/ckan/ckanext-datapackager">repository</a>.</p>
<p>Of course, CKAN is not the only data repository software we are
looking to support. A major aim of Frictionless Data is to create
integrations with the many different types of tools and platforms
people already use for working with data. Visit our
<a href="http://frictionlessdata.io/user-stories/">User Stories</a> page to learn about the kinds of use cases and data
workflows we’re looking to support. Let us know how you store your
data and what you would like to see next!</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p><a href="/blog/2016/03/11/frictionless-data-transport-in-python.html">Frictionless Data Transport in Python: 11 March 2016</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Dan Fowler
Comma Chameleon at csv,conf,v2
2016-07-18T00:00:00+00:00
http://okfnlabs.org/blog/2016/07/18/comma-chameleon
<p>Having co-organized csv,conf,v2 this past May, a few of us from Open
Knowledge International had the awesome opportunity to travel to
Berlin and sit in on a range of fascinating talks on the current
state-of-the-art on wrangling messy data. One such talk was given by
Stuart Harrison of the <strong>Open Data Institute</strong> (ODI), who presented on
a tool he is developing called Comma Chameleon. Comma Chameleon is a
desktop CSV editor with <em>validation magic</em> 🌟.</p>
<p><img src="/img/posts/comma-chameleon-small.jpg" alt="Stuart Harrison" /></p>
<h2 id="comma-chameleon">Comma Chameleon</h2>
<p>CSV is a great, simple format that is easy to publish and use by
technical and non-technical users alike. But CSV can also be <em>abused</em>
(see <a href="http://okfnlabs.org/bad-data/">Bad Data</a> for examples), leading Stuart and his team
at ODI Labs—the R&D team at ODI—to develop <a href="https://github.com/theodi/comma-chameleon">Comma Chameleon</a>.
Comma Chameleon is a desktop tool that uses <a href="http://csvlint.io">CSVLint</a> under
the hood to validate CSVs for their structural integrity as well as
their adherence to a schema specified in <a href="http://frictionlessdata.io/guides/table-schema/">JSON Table Schema</a>
(<a href="http://w3c.github.io/csvw/">CSV on the Web</a> support in progress).</p>
<p><img src="/img/posts/comma-chameleon-period-table.png" alt="Comma Chameleon" /></p>
<p>The point of Comma Chameleon is to give non-technical users the
ability to create and edit CSV files in a more appropriate tool than
Excel, software designed for manipulating spreadsheets first and
foremost. The app allows users to fix errors in their data in place
<em>before</em> publishing using the handy validation functions described
above. Comma Chameleon also allows users to add useful metadata—for
instance, a title, description, and a license—and export it all as a
zipped <a href="http://frictionlessdata.io/guides/data-package/">Data Package</a>.</p>
<h2 id="frictionless-data">Frictionless Data</h2>
<p>Comma Chameleon—built with <a href="http://electron.atom.io/">Electron</a>—is an excellent
example of the kind of tool that can provide the foundation for real
advances in data quality thanks to adherence to a few simple, open
standards. At Open Knowledge International, we are currently working
hard on <a href="http://frictionlessdata.io/">Frictionless Data</a>, an initiative to define and promote
just such tools and standards. We are <em>delighted</em> to be partnering
with the ODI in the coming months on this and other initiatives around
Frictionless Data.</p>
<ul>
<li>Download Comma Chameleon: <a href="https://github.com/theodi/comma-chameleon">https://github.com/theodi/comma-chameleon</a></li>
<li>Follow ODILabs on Twitter: <a href="https://twitter.com/odilabs">https://twitter.com/odilabs</a></li>
<li>Follow Stuart Harrison on Twitter: <a href="https://twitter.com/pezholio">https://twitter.com/pezholio</a></li>
<li>Follow Open Knowledge Labs on Twitter: <a href="https://twitter.com/okfnlabs">https://twitter.com/okfnlabs</a></li>
<li>See the full range of speakers from csv,conf,v2: <a href="http://csvconf.com">http://csvconf.com</a></li>
</ul>
<p>See Stuart’s full talk:</p>
<iframe width="576px" height="360px" src="https://www.youtube.com/embed/wIIw0cTeUG0" frameborder="0" allowfullscreen=""></iframe>
Dan Fowler
Using Data Packages with R
2016-07-14T00:00:00+00:00
http://okfnlabs.org/blog/2016/07/14/using-data-packages-with-r
<p>R is a popular open-source programming language and platform for data
analysis. <em>Frictionless Data</em> is an Open Knowledge International
project aimed at making it easy to publish and load <em>high-quality
data</em> into tools like R through the creation of a standard wrapper
format called the Data Package.</p>
<p>In this post, I will demonstrate an in-progress version of
<strong>datapkg</strong>, an R package that makes it easy to load Data Packages
into your R environment by automating otherwise manual import steps
using information provided in the Data Package descriptor file
<code class="language-plaintext highlighter-rouge">datapackage.json</code>. datapkg was developed through a collaboration
between <a href="https://okfn.org/">Open Knowledge International</a> and <a href="https://ropensci.org/">rOpenSci</a>,
an organization that specializes in creating open-source tools using R
for advancing open science.</p>
<h2 id="loading-tabular-data-in-r">Loading Tabular Data in R</h2>
<p><img src="/img/posts/rlogo.png" alt="R Logo" /></p>
<p>R’s core strengths as a data analysis framework lie in its support for
a wide array of statistical tests, its straightforward, powerful
options for static visualization, and the ease with which its
functionality can be extended. For these reasons, R enjoys a vibrant
online community who contribute daily to thousands of packages on
<a href="https://cran.r-project.org/">CRAN</a>. For this post, we will avoid going deep into what makes
R so powerful, and instead focus on the typical first step in any data
analysis project: loading source data. <em>This post assumes you have a
fairly basic understanding of R and a working R environment on your
machine.</em></p>
<p>When loading tabular data from a file into an R environment, it is
common to use the functions <code class="language-plaintext highlighter-rouge">read.csv</code> or <code class="language-plaintext highlighter-rouge">read.delim</code>. These are
wrappers for the more generic <code class="language-plaintext highlighter-rouge">read.table</code> function that provide sane
defaults for reading from commonly formatted
<a href="http://frictionlessdata.io/guides/csv/">CSV</a> and tab-delimited files,
respectively. These commands read data into what’s called a “data
frame”, R’s basic data structure for storing data tables. In this
structure, each column (“vector”) in the original tabular data file
may be assigned a different type (e.g. string, integer, date).</p>
<p>As a simple example, let’s load a CSV file containing the
<a href="https://en.wikipedia.org/wiki/VIX">CBOE Volatility Index</a> using <code class="language-plaintext highlighter-rouge">read.csv()</code>. This dataset can be
found on our <a href="https://github.com/frictionlessdata/example-data-packages">example Data Packages repo</a> in
the subdirectory “finance-vix”. Once downloaded, we can set R’s
working directory to where the data is stored and take a peek at the
files within its <code class="language-plaintext highlighter-rouge">data</code> subdirectory:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">setwd</span><span class="p">(</span><span class="s1">'/Users/dan/Downloads/example-data-packages-master/finance-vix'</span><span class="p">)</span><span class="w">
</span><span class="n">list.files</span><span class="p">(</span><span class="s2">"data"</span><span class="p">)</span></code></pre></figure>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>'vix-daily.csv'
</code></pre></div></div>
<p>We can read this single CSV, <code class="language-plaintext highlighter-rouge">vix-daily</code>, using R’s <code class="language-plaintext highlighter-rouge">read.csv()</code>
function and assign its output to a data frame called
<code class="language-plaintext highlighter-rouge">volatility_raw</code>. Afterwards, we can get a sample of the data by
viewing the first few rows of the file using the <code class="language-plaintext highlighter-rouge">head()</code> function.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">volatility_raw</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read.csv</span><span class="p">(</span><span class="s2">"data/vix-daily.csv"</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">volatility_raw</span><span class="p">)</span></code></pre></figure>
<table>
<thead><tr><th></th><th scope="col">Date</th><th scope="col">VIXOpen</th><th scope="col">VIXHigh</th><th scope="col">VIXLow</th><th scope="col">VIXClose</th></tr></thead>
<tbody>
<tr><th scope="row">1</th><td>1/2/2004</td><td>17.96 </td><td>18.68 </td><td>17.54 </td><td>18.22 </td></tr>
<tr><th scope="row">2</th><td>1/5/2004</td><td>18.45 </td><td>18.49 </td><td>17.44 </td><td>17.49 </td></tr>
<tr><th scope="row">3</th><td>1/6/2004</td><td>17.66 </td><td>17.67 </td><td>16.19 </td><td>16.73 </td></tr>
<tr><th scope="row">4</th><td>1/7/2004</td><td>16.72 </td><td>16.75 </td><td>15.5 </td><td>15.5 </td></tr>
<tr><th scope="row">5</th><td>1/8/2004</td><td>15.42 </td><td>15.68 </td><td>15.32 </td><td>15.61 </td></tr>
<tr><th scope="row">6</th><td>1/9/2004</td><td>16.15 </td><td>16.88 </td><td>15.57 </td><td>16.75 </td></tr>
</tbody>
</table>
<p>In the process of loading this data into a data frame, R made an
educated guess as to the types of data found in each column. We can
display those types by looking at the “structure” of an R object
using the <code class="language-plaintext highlighter-rouge">str</code> command.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">str</span><span class="p">(</span><span class="n">volatility_raw</span><span class="p">)</span></code></pre></figure>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>'data.frame': 3122 obs. of 5 variables:
$ Date : Factor w/ 3122 levels "01/02/2014","01/02/2015",..: 543 644 652 659 666 672 493 501 508 515 ...
$ VIXOpen : num 18 18.4 17.7 16.7 15.4 ...
$ VIXHigh : num 18.7 18.5 17.7 16.8 15.7 ...
$ VIXLow : num 17.5 17.4 16.2 15.5 15.3 ...
$ VIXClose: num 18.2 17.5 16.7 15.5 15.6 ...
</code></pre></div></div>
<p>We can see that while R has correctly guessed the types of “VIXOpen”,
“VIXHigh”, “VIXLow”, and “VIXClose” to be <code class="language-plaintext highlighter-rouge">num</code>, it has incorrectly
guessed the type of the “Date” to be <code class="language-plaintext highlighter-rouge">Factor</code> when R has a much more
appropriate type for the kind of data in this column called,
predictably, <code class="language-plaintext highlighter-rouge">Date</code>. This is a problem easily demonstrable by
attempting to plot the data.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">volatility_raw</span><span class="o">$</span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="n">volatility_raw</span><span class="o">$</span><span class="n">VIXOpen</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s1">'l'</span><span class="p">)</span></code></pre></figure>
<p><img src="/img/posts/r-vix-bad-type.png" alt="Bad Type" /></p>
<p>What should be a steadily increasing Date on the X axis is, instead,
out of order because the Date column has not been assigned its
correct type.  In this very simple case, there is a straightforward
fix: manually re-assign the Date column (in our data frame
represented as <code class="language-plaintext highlighter-rouge">volatility_raw$Date</code>) to type <code class="language-plaintext highlighter-rouge">Date</code>, passing the
format <code class="language-plaintext highlighter-rouge">%m/%d/%Y</code>, which we found out by previewing the data.
After this, we can revisit its structure using the <code class="language-plaintext highlighter-rouge">str()</code> command.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">volatility_raw</span><span class="o">$</span><span class="n">Date</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.Date</span><span class="p">(</span><span class="n">volatility_raw</span><span class="o">$</span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="s2">"%m/%d/%Y"</span><span class="p">)</span><span class="w">
</span><span class="n">str</span><span class="p">(</span><span class="n">volatility_raw</span><span class="p">)</span></code></pre></figure>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>'data.frame': 3122 obs. of 5 variables:
$ Date : Date, format: "2004-01-02" "2004-01-05" ...
$ VIXOpen : num 18 18.4 17.7 16.7 15.4 ...
$ VIXHigh : num 18.7 18.5 17.7 16.8 15.7 ...
$ VIXLow : num 17.5 17.4 16.2 15.5 15.3 ...
$ VIXClose: num 18.2 17.5 16.7 15.5 15.6 ...
</code></pre></div></div>
<p>We have successfully given the Date column a <code class="language-plaintext highlighter-rouge">Date</code> type, and we
should be able to run the same <code class="language-plaintext highlighter-rouge">plot()</code> function above and get a
better result. While this is a good solution for this single dataset
with a single incorrectly guessed column, it doesn’t scale well to
multiple incorrectly guessed columns across multiple datasets. In
addition, it only represents one type of manual task to be performed
on a new set of data. We designed the Data Package format to
obviate this and other kinds of tedious “data wrangling” tasks. In
the next section, we will perform the same task using the
<code class="language-plaintext highlighter-rouge">datapkg</code> library.</p>
<h2 id="loading-tabular-data-packages-in-r">Loading Tabular Data Packages in R</h2>
<p>A Data Package is a <a href="http://frictionlessdata.io/data-packages/">specification</a> for creating a
“<a href="http://frictionlessdata.io/about/#data-containerization">container</a>” for transporting data by saving useful
metadata in a specially formatted file. This file is called
<code class="language-plaintext highlighter-rouge">datapackage.json</code>, and it is stored in the root of a directory
containing a given dataset. When loading a Data Package,
<a href="https://github.com/frictionlessdata/datapackage-r">datapkg</a>—the new R Data Package library developed by
<a href="https://ropensci.org/">rOpenSci</a>—reads this extra metadata in order to
conveniently load high quality, well formatted data into your R
environment.</p>
<h3 id="installing-datapkg">Installing datapkg</h3>
<p><em>Note: the Data Package library for R is still in testing and subject
to change. For this reason, it is not yet on CRAN and must be
installed from its <a href="https://github.com/frictionlessdata/datapackage-r">GitHub repository</a> using the
<a href="https://github.com/hadley/devtools">devtools</a> package.</em></p>
<p>To install, start your R environment and run the following commands:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">install.packages</span><span class="p">(</span><span class="s2">"devtools"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">devtools</span><span class="p">)</span><span class="w">
</span><span class="n">install_github</span><span class="p">(</span><span class="s2">"hadley/readr"</span><span class="p">)</span><span class="w">
</span><span class="n">install_github</span><span class="p">(</span><span class="s2">"ropenscilabs/jsonvalidate"</span><span class="p">)</span><span class="w">
</span><span class="n">install_github</span><span class="p">(</span><span class="s2">"frictionlessdata/datapackage-r"</span><span class="p">)</span></code></pre></figure>
<h3 id="reading-data">Reading Data</h3>
<p>Revisiting our data directory, we can examine the files in the root
using the <code class="language-plaintext highlighter-rouge">list.files()</code> function:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>'data' 'datapackage.json'
</code></pre></div></div>
<p>The presence of the <code class="language-plaintext highlighter-rouge">datapackage.json</code> file indicates that our current R
working directory points to a Data Package, so we can load the
<code class="language-plaintext highlighter-rouge">datapkg</code> library and use the <code class="language-plaintext highlighter-rouge">datapkg_read()</code> function to read our
Data Package (note: we can also pass a path or URL to this function).</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">datapkg</span><span class="p">)</span><span class="w">
</span><span class="n">volatility</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">datapkg_read</span><span class="p">()</span></code></pre></figure>
<p>The <code class="language-plaintext highlighter-rouge">datapkg_read()</code> function reads not only the data in the dataset,
but also the metadata stored with it. This metadata includes high
level information like the author, source, and license of the dataset.
We can inspect this information by reading various variables stored on
this object. For instance, to get a fuller, human-readable title, we
can access <code class="language-plaintext highlighter-rouge">volatility$title</code> or, if the Data Package has a “homepage”
variable set, we can access it using <code class="language-plaintext highlighter-rouge">volatility$homepage</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>'VIX - CBOE Volatility Index'
'http://www.cboe.com/micro/VIX/'
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">datapkg_read()</code> also uses <em>schema</em> information stored in the
<code class="language-plaintext highlighter-rouge">datapackage.json</code> to facilitate the loading of data. As shown above,
one misstep we encountered when loading a new dataset into R was
neglecting to correct an incorrectly guessed column type. What the
Data Package format provides is a simple, standard way to store that
information with a dataset to automate this and other steps. The
following snippet shows how the <code class="language-plaintext highlighter-rouge">datapackage.json</code> describes this
information:</p>
<figure class="highlight"><pre><code class="language-js" data-lang="js"> <span class="dl">"</span><span class="s2">schema</span><span class="dl">"</span><span class="p">:</span> <span class="p">{</span>
<span class="dl">"</span><span class="s2">fields</span><span class="dl">"</span><span class="p">:</span> <span class="p">[</span>
<span class="p">{</span>
<span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">Date</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">type</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">date</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">format</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">fmt:%m/%d/%Y</span><span class="dl">"</span>
<span class="p">},</span>
<span class="p">{</span>
<span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">VIXOpen</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">type</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">number</span><span class="dl">"</span>
<span class="p">},</span>
<span class="p">{</span>
<span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">VIXHigh</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">type</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">number</span><span class="dl">"</span>
<span class="p">},</span>
<span class="p">{</span>
<span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">VIXLow</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">type</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">number</span><span class="dl">"</span>
<span class="p">},</span>
<span class="p">{</span>
<span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">VIXClose</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">type</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">number</span><span class="dl">"</span>
<span class="p">}</span>
<span class="p">]</span>
<span class="p">}</span></code></pre></figure>
<p>As above, we can verify that <code class="language-plaintext highlighter-rouge">datapkg_read()</code> used this information to
construct its data frame by calling the <code class="language-plaintext highlighter-rouge">str()</code> function. The <code class="language-plaintext highlighter-rouge">data</code>
variable on the <code class="language-plaintext highlighter-rouge">volatility</code> object created by <code class="language-plaintext highlighter-rouge">datapkg_read()</code> points
to a list of files (“resources”) in the dataset; <code class="language-plaintext highlighter-rouge">vix-daily</code> is the
name of the resource—expressed as a data frame—we want.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">str</span><span class="p">(</span><span class="n">volatility</span><span class="o">$</span><span class="n">data</span><span class="o">$</span><span class="n">`vix-daily`</span><span class="p">)</span></code></pre></figure>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 3122 obs. of 5 variables:
$ Date : Date, format: "2004-01-02" "2004-01-05" ...
$ VIXOpen : num 18 18.4 17.7 16.7 15.4 ...
$ VIXHigh : num 18.7 18.5 17.7 16.8 15.7 ...
$ VIXLow : num 17.5 17.4 16.2 15.5 15.3 ...
$ VIXClose: num 18.2 17.5 16.7 15.5 15.6 ...
</code></pre></div></div>
<p>The output shows that the Date column has been assigned the correct
type, so we can immediately plot the data.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">vix.daily</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">volatility</span><span class="o">$</span><span class="n">data</span><span class="o">$</span><span class="n">`vix-daily`</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">vix.daily</span><span class="o">$</span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="n">vix.daily</span><span class="o">$</span><span class="n">VIXOpen</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s1">'l'</span><span class="p">)</span></code></pre></figure>
<p><img src="/img/posts/r-vix-good-type.png" alt="Good Type" /></p>
<h2 id="going-forward">Going Forward</h2>
<p>This has been a very small example of the basic functionality of the R
library. This software is still in testing, so if you are an R user
and would like to use Data Packages to help manage your data in R,
please let us know. You can leave a comment here on the
<a href="https://discuss.okfn.org/t/using-data-packages-with-r/3271">forum</a>.</p>
<p>To see the code used in this post, visit its <a href="https://github.com/okfn/okfn.github.com/blob/master/resources/using-data-packages-with-r.ipynb">Jupyter Notebook</a>.</p>
Dan Fowler
'Continuous Processing' with Data Packages
2016-07-13T00:00:00+00:00
http://okfnlabs.org/blog/2016/07/13/continuous-processing-with-data-packages
<p>When storing your data in Data Packages, it is considered good
practice to store scripts for updating, processing, or analyzing your
data in a directory called <code class="language-plaintext highlighter-rouge">scripts/</code> placed at the root of your Data
Package. I’ve written a tutorial to show how to achieve <strong>continuous
processing</strong>: <em>that is, the delivery of updated data every time
something changes, either in the source data or the processing code</em>.</p>
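<p>As a concrete (and purely illustrative) example, a minimal <code class="language-plaintext highlighter-rouge">scripts/update.py</code> could look like the sketch below; the source URL and output path are made up, and a real script would usually also clean or reshape the data:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import urllib.request

# Hypothetical upstream source for the packaged data
SOURCE_URL = 'http://example.com/source.csv'

def main():
    # Download the source and refresh the resource under data/
    with urllib.request.urlopen(SOURCE_URL) as response:
        raw = response.read()
    with open('data/data.csv', 'wb') as out:
        out.write(raw)

if __name__ == '__main__':
    main()</code></pre></figure>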
<p>Depending on the timeliness of your dataset, you’ll want to
periodically run update scripts stored in your <code class="language-plaintext highlighter-rouge">scripts/</code> directory,
but what if you don’t want to run the update script of your Data
Package by yourself? Instead, why not let
<a href="https://travis-ci.org">Travis CI</a> do it for you?</p>
<p>If your Data Package already…</p>
<ol>
<li>has scripts that download the source data, cleans it or reformats it into a nice interoperable format</li>
<li><a href="http://okfnlabs.org/blog/2016/03/25/make-vs-tuttle.html">relies on <code class="language-plaintext highlighter-rouge">make</code></a> to run the scripts</li>
<li><a href="http://okfnlabs.org/blog/2016/05/17/automated-data-validation.html">has tests</a> to validate the data</li>
</ol>
<p>…then you’re ready to go to the next level of automation! Here’s a
<a href="https://github.com/lexman/ex-continuous-processing">tutorial</a> to
enable regular updates of the data with Travis CI.</p>
<p>This approach is very well suited to small data (less than 300 MB) and
short processing steps (i.e. less than 10 minutes), which makes the
workflow perfect for Data Packages!</p>
<p>Read the
<a href="https://github.com/lexman/ex-continuous-processing">tutorial</a> to find
out more!</p>
Alexandre Bonnasseau
Automated Data Validation with Data Packages
2016-05-17T00:00:00+00:00
http://okfnlabs.org/blog/2016/05/17/automated-data-validation
<p>Much of the open data on the web is published in CSV or Excel format.
Unfortunately, it is often messy and can require significant
manipulation to actually be usable. In this post, I walk through a
workflow for automating data validation on every update to a shared
repository inspired by existing practices in software development and
enabled by <em>Frictionless Data</em> standards and tooling.</p>
<p>Software projects have long benefited from Continuous Integration
services like Travis CI and others for ensuring and maintaining
<strong>code</strong> quality. Continuous integration is a process where all tests
are automatically run and a report is generated on every update
(“commit”) to a project’s shared repository. This allows developers to
find and resolve errors quickly and reliably. In addition, by
displaying the “build status”, those outside the project can
clearly see the state of the project’s test compliance.</p>
<p><img src="/img/posts/build-passing.png" alt="Build Passing" /></p>
<p>As with software, datasets are often collaboratively created, edited,
and updated over time, sometimes introducing subtle (or not so subtle)
structural and schematic errors (see
<a href="http://okfnlabs.org/bad-data/">Bad Data</a> for examples). Much of the
“friction” in using the data comes from the time and effort needed to
identify and address these errors before analyzing in a given tool.
Automatically flagging <strong>data</strong> quality issues at upload time in a
repository can go a long way in making data more useful and have
significant follow-on effects in the data ecosystem, both open and
closed.</p>
<h2 id="continuous-data-integration">Continuous Data Integration</h2>
<p>As the <a href="http://frictionlessdata.io/">Frictionless Data</a> tooling and
standards ecosystem continues to grow, we now have the elements
necessary to provide data managers with the same type of service for
tabular data (e.g. Excel and CSV). In less than one hour, a few of us
at Open Knowledge booted a small demo to show what <em>continuous data
integration</em> could look like. On each commit to
<a href="https://github.com/frictionlessdata/ex-continuous-data-integration">our example repository</a>,
a set of validation tests are run on the data, raising an exception if
the data is invalid. If a user adds “bad” data, the “build” fails and
issues a report indicating what went wrong.</p>
<p><a href="https://github.com/frictionlessdata/ex-continuous-data-integration"><img src="/img/posts/data_ci_travis.png" alt="Data CI" /></a></p>
<p>As an example, the following CSV has a few issues with its values. In
the schema defined in the <code class="language-plaintext highlighter-rouge">datapackage.json</code> file below (i.e. the
object that the <code class="language-plaintext highlighter-rouge">schema</code> key points to), we set the “Number” column
type to <code class="language-plaintext highlighter-rouge">number</code> and the “Date” column type to <code class="language-plaintext highlighter-rouge">date</code>. However, the CSV
contains invalid values for those types: “x23.5” and “2015-02”,
respectively.</p>
<h3 id="csv">CSV</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Date,Country,Number
2015-01-01,3,20.3
2015-02-01,United States,23.5
2015-02,United States,x23.5
</code></pre></div></div>
<h3 id="datapackagejson">datapackage.json</h3>
<figure class="highlight"><pre><code class="language-json" data-lang="json"><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"fd-continuous-data-integration"</span><span class="p">,</span><span class="w">
</span><span class="nl">"title"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
</span><span class="nl">"resources"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"data"</span><span class="p">,</span><span class="w">
</span><span class="nl">"path"</span><span class="p">:</span><span class="w"> </span><span class="s2">"data/data.csv"</span><span class="p">,</span><span class="w">
</span><span class="nl">"format"</span><span class="p">:</span><span class="w"> </span><span class="s2">"csv"</span><span class="p">,</span><span class="w">
</span><span class="nl">"mediatype"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text/csv"</span><span class="p">,</span><span class="w">
</span><span class="nl">"schema"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"fields"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Date"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"date"</span><span class="p">,</span><span class="w">
</span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Country"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"string"</span><span class="p">,</span><span class="w">
</span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Number"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"number"</span><span class="p">,</span><span class="w">
</span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>When we try to add this invalid data to the repository, the following
report is generated:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+----------------+------------------------------------------------------------+
| result_name | result_message |
+================+============================================================+
| Incorrect Type | The value "2015-02" in column "Date" is not a valid Date. |
+----------------+------------------------------------------------------------+
| Incorrect Type | The value "x23.5" in column "Number" is not a valid Number.|
+----------------+------------------------------------------------------------+
</code></pre></div></div>
<h2 id="how-it-works">How It Works</h2>
<p>The Data Package descriptor file,
<a href="http://dataprotocols.org/data-packages/">datapackage.json</a>, provides
both high-level metadata as well as a
<a href="http://frictionlessdata.io/guides/table-schema/">schema</a> for
tabular data. We use the Python library
<a href="http://github.com/frictionlessdata/datapackage-py">datapackage-py</a> to
create a high-level model of the Data Package that allows us to
inspect and work with the data inside. The real work is accomplished using
<a href="http://goodtables.okfnlabs.org/">GoodTables</a>.</p>
<p>We
<a href="http://okfnlabs.org/blog/2015/03/06/goodtables-web-service.html">previously blogged about using Good Tables</a>
to validate our tabular data. On every update, two small test
functions use the <code class="language-plaintext highlighter-rouge">datapackage.json</code> to locate and validate the tabular
data contained therein, checking both its structure and its adherence to a
<a href="http://frictionlessdata.io/guides/table-schema/">schema</a>.
Here’s the first:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">test_schema</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="c1"># We heart CSV :)
</span>
<span class="n">data_format</span> <span class="o">=</span> <span class="s">'csv'</span>
<span class="c1"># Load our Data Package path and schema
</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">dp</span><span class="p">.</span><span class="n">metadata</span><span class="p">[</span><span class="s">'resources'</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="s">'path'</span><span class="p">]</span>
<span class="n">schema</span> <span class="o">=</span> <span class="n">dp</span><span class="p">.</span><span class="n">metadata</span><span class="p">[</span><span class="s">'resources'</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="s">'schema'</span><span class="p">]</span>
<span class="c1"># We use the "schema" processor to test the data against its
</span> <span class="c1"># expected schema. There is also a "structure" processor.
</span>
<span class="n">processor</span> <span class="o">=</span> <span class="n">processors</span><span class="p">.</span><span class="n">SchemaProcessor</span><span class="p">(</span><span class="n">schema</span><span class="o">=</span><span class="n">schema</span><span class="p">,</span>
<span class="nb">format</span><span class="o">=</span><span class="n">data_format</span><span class="p">,</span>
<span class="n">row_limit</span><span class="o">=</span><span class="n">row_limit</span><span class="p">,</span>
<span class="n">report_limit</span><span class="o">=</span><span class="n">report_limit</span><span class="p">)</span>
<span class="n">valid</span><span class="p">,</span> <span class="n">report</span><span class="p">,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">processor</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="c1"># Various formatting options for our report follow.
</span>
<span class="n">output_format</span> <span class="o">=</span> <span class="s">'txt'</span>
<span class="n">exclude</span> <span class="o">=</span> <span class="p">[</span><span class="s">'result_context'</span><span class="p">,</span> <span class="s">'processor'</span><span class="p">,</span> <span class="s">'row_name'</span><span class="p">,</span>
<span class="s">'result_category'</span><span class="p">,</span> <span class="s">'column_name'</span><span class="p">,</span> <span class="s">'result_id'</span><span class="p">,</span>
<span class="s">'result_level'</span><span class="p">]</span>
<span class="c1"># And here's our report!
</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">report</span><span class="p">.</span><span class="n">generate</span><span class="p">(</span><span class="n">output_format</span><span class="p">,</span> <span class="n">exclude</span><span class="o">=</span><span class="n">exclude</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">assertTrue</span><span class="p">(</span><span class="n">valid</span><span class="p">,</span> <span class="n">out</span><span class="p">)</span></code></pre></figure>
<p>For more information, read the guide on
<a href="http://frictionlessdata.io/">frictionlessdata.io</a> about
<a href="http://frictionlessdata.io/guides/validating-data/">validating data</a>.
Behind the scenes, this is just a normal Travis CI configuration (see
the
<a href="https://github.com/frictionlessdata/ex-continuous-data-integration/blob/master/.travis.yml">.travis.yml</a>).</p>
<h2 id="try-it-yourself">Try It Yourself</h2>
<p>Our example relies on
<a href="http://blog.okfn.org/2013/07/02/git-and-github-for-data/">GitHub as a data storage mechanism</a>
and <a href="http://travis-ci.org/">Travis CI</a> as a host for the actual
validation. However, this approach is broadly applicable to any
storage and processing backend with some extra tweaking (e.g. using
AWS <a href="https://aws.amazon.com/lambda/">Lambda</a> and
<a href="https://aws.amazon.com/s3/">S3</a>).</p>
<p>Check out the
<a href="https://github.com/frictionlessdata/ex-continuous-data-integration">ex-continuous-data-integration</a>
repository on our
<a href="https://github.com/frictionlessdata">frictionlessdata</a> organization
on GitHub to see how you can try this out with your own data! Let us
know how it works on our <a href="/contact/">chat channel</a>.</p>
Dan Fowler
Tools for Extracting Data and Text from PDFs - A Review
2016-04-19T00:00:00+00:00
http://okfnlabs.org/blog/2016/04/19/pdf-tools-extract-text-and-data-from-pdfs
<p>Extracting data from PDFs remains, unfortunately, a common data wrangling task. This post reviews various tools and services for doing this with a focus on free (and preferably open source) options.</p>
<p>The tools we can consider fall into three categories:</p>
<ul>
<li>Extracting text from PDF</li>
<li>Extracting tables from PDF</li>
<li>Extracting data (text or otherwise) from PDFs where the content is not text but is images (for example, scans)</li>
</ul>
<p>The last case is really a situation for OCR (optical character recognition), so we’re going to ignore it here. We may do a follow-up post on this.</p>
<p><img src="/img/posts/pdf-tools-climate-treaty-paris-pdf.png" alt="Climate Treaty PDF" style="width: 70%; margin: auto; display: block;" /></p>
<div style="text-align: center;">
<p><em>The Paris Climate Agreement text was <a href="http://unfccc.int/resource/docs/2015/cop21/eng/l09r01.pdf">published as PDF</a>. Some of the tools described here – plus the usual blood, sweat and tears – were used to turn it back into usable HTML for our <a href="http://cop21.okfnlabs.org/">Paris COP21 Climate Treaty Texts site</a></em></p>
</div>
<p><img src="/img/posts/pdf-tools-senate-report-pdf.png" alt="Example PDF" style="width: 70%; margin: auto; display: block;" /></p>
<div style="text-align: center;">
<p><em>A classic example of an important government report published as PDF only</em></p>
</div>
<h2 id="generic-pdf-to-text">Generic (PDF to text)</h2>
<ul>
<li><a href="http://www.unixuser.org/~euske/python/pdfminer/">PDFMiner</a> - PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes than text analysis.
<ul>
<li>Pure python</li>
<li>In our trials, PDFMiner has performed excellently, and we rate it as one of the best tools out there (see the short sketch after this list).</li>
</ul>
</li>
<li><a href="http://pdftohtml.sourceforge.net/">pdftohtml</a> - pdftohtml is a utility which converts PDF files into HTML and XML formats. Based on xpdf. One of the better for tables but have found PDFMiner somewhat better for a while. Command-line Linux</li>
<li><a href="http://pdftoxml.sourceforge.net/">pdftoxml</a> - command line utility to convert PDF to XML built on poppler.</li>
<li><a href="http://documentcloud.github.io/docsplit/">docsplit</a> - part of DocumentCloud. Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages…)</li>
<li><a href="https://github.com/zejn/pypdf2xml">pypdf2xml</a> - convert PDF to XML. Built on pdfminer. Started as an alternative to poppler’s pdftoxml, which didn’t properly decode CID Type2 fonts in PDFs.</li>
<li><a href="http://coolwanglu.github.io/pdf2htmlEX/">pdf2htmlEX</a> - Convert PDF to HTML without losing text or format. C++. Fast. Primarily focused on producing HTML that exactly resembles the original PDF. Limited use for straightforward text extraction as it generates css-heavy HTML that replicates the exact look of a PDF document.</li>
<li><a href="http://mozilla.github.io/pdf.js/">pdf.js</a> - you probably want a fork like <a href="https://github.com/modesty/pdf2json">pdf2json</a> or <a href="https://github.com/jviereck/node-pdfreader">node-pdfreader</a> that integrates this better with node. Not tried this on tables though …
<ul>
<li>Max Ogden has this list of Node libraries and tools for working with PDFs: <a href="https://gist.github.com/maxogden/5842859">https://gist.github.com/maxogden/5842859</a></li>
<li>Here’s a gist showing how to use pdf2json: <a href="https://gist.github.com/rgrp/5944247">https://gist.github.com/rgrp/5944247</a></li>
</ul>
</li>
<li><a href="https://tika.apache.org/">Apache Tika</a> - Java library for extracting metadata and content from all types of document types including PDF.</li>
<li><a href="https://pdfbox.apache.org/">Apache PDFBox</a> - Java library specifically for creating, manipulating and getting content from PDFs.</li>
</ul>
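<p>To give a flavour of the generic text-extraction workflow (see the PDFMiner note above), here is a minimal sketch that shells out to <code class="language-plaintext highlighter-rouge">pdf2txt.py</code>, the converter bundled with PDFMiner; we assume it is on your PATH, and <code class="language-plaintext highlighter-rouge">report.pdf</code> is a stand-in for your own file:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import subprocess

# pdf2txt.py ships with PDFMiner; -t picks the output format
# (text, html or xml). 'report.pdf' is a placeholder input file.
text = subprocess.check_output(['pdf2txt.py', '-t', 'text', 'report.pdf'])

with open('report.txt', 'wb') as out:
    out.write(text)</code></pre></figure>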
<h3 id="tables-from-pdf">Tables from PDF</h3>
<ul>
<li><a href="http://tabula.technology/">Tabula</a> - open-source, designed specifically for tabular data. Now easy to install. Ruby-based.</li>
<li><a href="https://github.com/okfn/pdftables">https://github.com/okfn/pdftables</a> - open-source. Created by Scraperwiki but now closed-source and powering <a href="https://pdftables.com/">PDFTables</a> so here is a fork.</li>
<li><a href="http://pdftohtml.sourceforge.net/">pdftohtml</a> - one of the better for tables but have not used for a while</li>
<li><a href="https://github.com/liberit/scraptils/blob/master/scraptils/tools/pdf2csv.py">https://github.com/liberit/scraptils/blob/master/scraptils/tools/pdf2csv.py</a> AGPLv3+, python, scraptils has other useful tools as well, pdf2csv needs pdfminer==20110515</li>
<li>Using scraperwiki + pdftoxml - see this recent tutorial <a href="http://schoolofdata.org/2013/06/18/get-started-with-scraping-extracting-simple-tables-from-pdf-documents/">Get Started With Scraping – Extracting Simple Tables from PDF Documents</a></li>
</ul>
<h3 id="existing-open-services">Existing open services</h3>
<ul>
<li><a href="http://givemetext.okfnlabs.org/">http://givemetext.okfnlabs.org/</a> - Give me Text is a free, easy to use open source web service that extracts text from PDFs and other documents using Apache Tika (and built by <a href="http://okfnlabs.org/members/mattfullerton/">Labs member Matt Fullerton</a>)</li>
<li><a href="http://pdfx.cs.man.ac.uk/">http://pdfx.cs.man.ac.uk/</a> - has a nice command line interface
<ul>
<li>Is this open? It says at the <a href="http://pdfx.cs.man.ac.uk/usage">bottom of the usage page</a> that it is powered by http://www.utopiadocs.com/</li>
<li>Note that as of 2016 this seems more focused on conversion to structured XML for scientific articles but may still be useful</li>
</ul>
</li>
<li><del>Scraperwiki - https://views.scraperwiki.com/run/pdf-to-html-preview-1/ and <a href="http://blog.scraperwiki.com/2010/12/17/scraping-pdfs-now-26-less-unpleasant-with-scraperwiki/">this tutorial</a></del> - no longer working as of 2016</li>
</ul>
<h3 id="existing-proprietary-free-or-paid-for-services">Existing proprietary free or paid-for services</h3>
<p>There are many online – just do a search – so we do not propose a comprehensive list. Two that we have tried and that seem promising are:</p>
<ul>
<li><a href="http://www.newocr.com/">http://www.newocr.com/</a> - free, with an API, very bare bones site but quite good results based on our limiting testing</li>
<li><a href="https://pdftables.com/">https://pdftables.com/</a> - pay-per-page service focused on tabular data extraction from the folks at ScraperWiki</li>
</ul>
<p>We also note that Google App Engine <a href="http://developers.google.com/appengine/docs/python/conversion/overview">used to do this</a>, but unfortunately it seems to have been discontinued.</p>
<h2 id="other-good-intros">Other good intros</h2>
<ul>
<li><a href="http://okfnlabs.org/blog/2013/12/25/parsing-pdfs.html">Thomas Levine on Parsing PDFs</a></li>
<li><a href="http://schoolofdata.org/handbook/courses/extracting-data-from-pdf/">Extracting Data from PDFs - School of Data</a></li>
</ul>
Rufus Pollock
Tools for Data Packages: Make vs. Tuttle
2016-03-25T00:00:00+00:00
http://okfnlabs.org/blog/2016/03/25/make-vs-tuttle
<p>When crafting data from other data, as when packaging public data, using good tools
can really ease the development process and improve the reliability of the data.</p>
<p>The venerable <code class="language-plaintext highlighter-rouge">make</code>, which has been used for decades to build software, is a very good option, as advocated by Mike Bostock in his <a href="https://bost.ocks.org/mike/make/">blog</a>.</p>
<h2 id="a-state-of-the-art-makefile">A state-of-the-art Makefile</h2>
<p>Let’s take an example: crafting the <a href="http://github.com/datasets/geo-countries">geo-countries</a> datapackage. We need to download data from Natural Earth, extract the zip, convert it to JSON with ogr (the “swiss-army knife” of maps), and rename a column. Following Mike Bostock’s instructions, here’s an appropriate <code class="language-plaintext highlighter-rouge">Makefile</code> (which should lie in the <code class="language-plaintext highlighter-rouge">scripts</code> folder of the project):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>all: ../data/countries.geojson
ne_10m_admin_0_countries.zip:
wget http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_0_countries.zip
ne_10m_admin_0_countries.README.html ne_10m_admin_0_countries.VERSION.txt ne_10m_admin_0_countries.dbf ne_10m_admin_0_countries.prj ne_10m_admin_0_countries.shp ne_10m_admin_0_countries.shx: ne_10m_admin_0_countries.zip
unzip ne_10m_admin_0_countries.zip
ne_10m_admin_0_countries.geojson: ne_10m_admin_0_countries.dbf ne_10m_admin_0_countries.prj ne_10m_admin_0_countries.shp ne_10m_admin_0_countries.shx
ogr2ogr -select admin,iso_a3 -f geojson ne_10m_admin_0_countries.geojson ne_10m_admin_0_countries.shp
../data:
mkdir ../data
../data/countries.geojson: ne_10m_admin_0_countries.geojson ../data
# Change the name of the fields after conversion
cat ne_10m_admin_0_countries.geojson | sed 's/"admin": /"name": /g' | sed 's/"iso_a3": /"ISO3166-1-Alpha-3": /g' > ../data/countries.geojson
</code></pre></div></div>
<p>If you’re not familiar with Makefiles, the last section reads: “When both files <code class="language-plaintext highlighter-rouge">ne_10m_admin_0_countries.geojson</code> and <code class="language-plaintext highlighter-rouge">../data</code> are available, you can run the command <code class="language-plaintext highlighter-rouge">cat ne_10m_admin_0_countries.geojson | sed 's/"admin": /"name": /g' | sed 's/"iso_a3": /"ISO3166-1-Alpha-3": /g' > ../data/countries.geojson</code>
and it will produce the file <code class="language-plaintext highlighter-rouge">../data/countries.geojson</code>”. <code class="language-plaintext highlighter-rouge">Make</code> deduces the commands to be run, starting with the ones for which everything is available, until it produces the <em>target</em> <code class="language-plaintext highlighter-rouge">all</code>.</p>
<p>We achieve two very important goals with this <code class="language-plaintext highlighter-rouge">Makefile</code>:</p>
<ul>
<li>it covers the whole process, even the download part. It’s so easy to forget whether we downloaded <code class="language-plaintext highlighter-rouge">ne_10m_admin_0_countries.zip</code> or <code class="language-plaintext highlighter-rouge">ne_110m_admin_0_countries.zip</code> when it is done by hand. But now everything is written down, so we can keep track of it in our source repository (like git), even if we change our mind.</li>
<li>Running <code class="language-plaintext highlighter-rouge">make</code> checks the date consistency of the files. Suppose Scotland had gone independent in 2015: that would have created a new country, which Natural Earth would have added. You could then download the updated version of <code class="language-plaintext highlighter-rouge">ne_10m_admin_0_countries.zip</code>. When running <code class="language-plaintext highlighter-rouge">make</code> again, it would notice that the unzipped files like <code class="language-plaintext highlighter-rouge">ne_10m_admin_0_countries.dbf</code> are older than their source, so the <code class="language-plaintext highlighter-rouge">unzip</code> command has to be run again, and so on, because <code class="language-plaintext highlighter-rouge">ne_10m_admin_0_countries.geojson</code> would not be up to date until every dependent file is updated.</li>
</ul>
<p>Even though this is a great improvement over <em>running-all-the-commands-manually-and-not-remembering-them</em>, as well as over a <em>custom-script-that-must-start-from-scratch-every-time</em>, it is not enough for a fluid and reliable development experience.</p>
<h2 id="improve-collaboration-with-tuttle">Improve collaboration with <code class="language-plaintext highlighter-rouge">tuttle</code></h2>
<p>Before we look at two major improvements in detail, let’s see the same workflow written in a <code class="language-plaintext highlighter-rouge">tuttlefile</code> (still in the <code class="language-plaintext highlighter-rouge">scripts</code> folder):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>file://ne_10m_admin_0_countries.zip <- http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_0_countries.zip
wget http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_0_countries.zip
file://ne_10m_admin_0_countries.README.html, file://ne_10m_admin_0_countries.VERSION.txt, file://ne_10m_admin_0_countries.dbf, file://ne_10m_admin_0_countries.prj, file://ne_10m_admin_0_countries.shp, file://ne_10m_admin_0_countries.shx <- file://ne_10m_admin_0_countries.zip
unzip ne_10m_admin_0_countries.zip
file://ne_10m_admin_0_countries.geojson <- file://ne_10m_admin_0_countries.dbf, file://ne_10m_admin_0_countries.prj, file://ne_10m_admin_0_countries.shp, file://ne_10m_admin_0_countries.shx
ogr2ogr -select admin,iso_a3 -f geojson ne_10m_admin_0_countries.geojson ne_10m_admin_0_countries.shp
file://../data <-
cd ..
mkdir data
file://../data/countries.geojson <- file://ne_10m_admin_0_countries.geojson, file://../data
# Change the name of the fields after conversion
cat ne_10m_admin_0_countries.geojson | sed 's/"admin": /"name": /g' | sed 's/"iso_a3": /"ISO3166-1-Alpha-3": /g' > ../data/countries.geojson
</code></pre></div></div>
<p>Look familiar?</p>
<p>It is very close to the Makefile, except for the URLs everywhere: <code class="language-plaintext highlighter-rouge">tuttle</code> aims at giving a URL to every bit of data in order to link them together.</p>
<p>You can see that the first section of the tuttlefile clearly states the dependency of the file <code class="language-plaintext highlighter-rouge">ne_10m_admin_0_countries.zip</code> on the URL <code class="language-plaintext highlighter-rouge">http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_0_countries.zip</code>.
This means that when the online list of countries changes, no unusual action is required. You just have to execute <code class="language-plaintext highlighter-rouge">tuttle run</code> as <em>if you were building the data for the first time</em>. It will notice the source URL has changed and will reprocess the dependencies accordingly.</p>
<p>The other difference from <code class="language-plaintext highlighter-rouge">make</code> is not in the syntax; it’s in how it deals with changes in the <code class="language-plaintext highlighter-rouge">tuttlefile</code>. If you have ever worked with the <code class="language-plaintext highlighter-rouge">ogr2ogr</code> command line tool, you know it’s impossible to get it right the first time. But if you change the command in a <code class="language-plaintext highlighter-rouge">Makefile</code>, unfortunately running <code class="language-plaintext highlighter-rouge">make</code> again won’t update the data, because the timestamp of the file <code class="language-plaintext highlighter-rouge">ne_10m_admin_0_countries.geojson</code> still looks consistent.</p>
<p>To improve on this, <code class="language-plaintext highlighter-rouge">tuttle</code> reacts to changes in every command. When you run it, it will first roll back the previous command as if it had never run, by deleting whatever data it produced. Then it will run the updated ogr2ogr command. That’s very handy when prototyping, because you want to focus on your code without side effects from leftover data.</p>
<p>This feature also proves really useful when working in a team. With <code class="language-plaintext highlighter-rouge">make</code>, if you change the makefile, you need to send an email to your whole team with instructions on how to clean the workspace (e.g. “Please remove the file ../data/countries.geojson because I have changed the ogr2ogr command”), and hope nobody misses it, because that would lead to undebuggable behaviour. <code class="language-plaintext highlighter-rouge">tuttle</code>, on the other hand, guarantees that the data corresponds exactly to the <code class="language-plaintext highlighter-rouge">tuttlefile</code>, so you can safely share or merge changes with your fellow contributors.</p>
<h2 id="conclusion">Conclusion</h2>
<p>If you put both improvements over <code class="language-plaintext highlighter-rouge">make</code> together (remote dependencies, and reliably reprocessing what has changed), you can set up a system that automatically updates datapackages when either the original data changes or someone modifies the source code. Pretty cool, huh?</p>
<p>I hope I’ve convinced you of the advantages of tuttle for collectively crafting data. If you’re interested, the best way to learn more about inline languages, URLs for databases, or online resources is to read the <a href="https://github.com/lexman/tuttle/master/doc/tuttorial">main tutorial</a>.</p>
<p>And one more thing about the syntactic sugar you can expect… You could simplify the first section of the <code class="language-plaintext highlighter-rouge">tuttlefile</code> into only one line:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>file://ne_10m_admin_0_countries.zip <- http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_0_countries.zip ! download
</code></pre></div></div>
Alexandre Bonnasseau
Frictionless Data Transport in Python
2016-03-11T00:00:00+00:00
http://okfnlabs.org/blog/2016/03/11/frictionless-data-transport-in-python
<p>Tool and platform integrations for “Data Packages” are key elements of
our <a href="http://datapackages.org/">Frictionless Data Initiative</a> at
<a href="https://okfn.org/">Open Knowledge International</a>. We recently posted
on the <a href="http://blog.okfn.org">main blog</a> about some
<a href="http://blog.okfn.org/2016/02/01/google-funds-frictionless-data-initiative-at-open-knowledge/">integration work</a>
funded by our friends at Google. We’ve built useful Python libraries
for working with Tabular Data Packages in some of the most popular
tools in use today by data wranglers and developers. These
integrations allow for easily getting data into and out of your tool
of choice for further manipulation while reducing the tedious
<em>wrangling</em> sometimes needed. In this post, I will give some more
details of the work done on adding support for these open standards
within <a href="http://ckan.org/">CKAN</a>, Google’s
<a href="http://bigquery.cloud.google.com/">BigQuery</a>, and common SQL database
software. But first, here is an introduction to the format for those
who are unfamiliar.</p>
<h2 id="tabular-data-package">Tabular Data Package</h2>
<p><img src="/img/posts/tabular-data-package.png" alt="" /></p>
<p>Tabular Data Package is a simple structure for publishing and sharing
tabular data in
<a href="http://datapackages.org/doc/tabular-data-package#csv">CSV</a> format.
You can find more information about the standards
<a href="http://datapackages.org/standards">here</a>, but here are the key
features:</p>
<ul>
<li>
<p>Your dataset is stored as a collection of flat files.</p>
</li>
<li>
<p>Useful information about this dataset is stored in a specially
formatted JSON file, <code class="language-plaintext highlighter-rouge">datapackage.json</code>, stored with your
data. For tabular data, this information is a combination of
<em>general metadata</em> and <em>schema</em> information.</p>
<ul>
<li>
<p>General metadata (e.g. name, title, sources) are stored as
top-level attributes of the file</p>
</li>
<li>
<p>The exact schema (e.g. type, constraint information per
column, and relations between resources) for the tabular
data is stored in a resources attribute. For each resource,
a schema is specified using the
<a href="http://dataprotocols.org/table-schema/">JSON Table Schema</a>
standard.</p>
</li>
</ul>
</li>
</ul>
<p>As an example, for the following <code class="language-plaintext highlighter-rouge">data.csv</code> file…</p>
<table>
<thead>
<tr>
<th>date</th>
<th>price</th>
</tr>
</thead>
<tbody>
<tr>
<td>2014-01-01</td>
<td>1243.068</td>
</tr>
<tr>
<td>2014-02-01</td>
<td>1298.713</td>
</tr>
<tr>
<td>2014-03-01</td>
<td>1336.560</td>
</tr>
<tr>
<td>2014-04-01</td>
<td>1299.175</td>
</tr>
</tbody>
</table>
<p>…we can define the associated <code class="language-plaintext highlighter-rouge">datapackage.json</code> file describing it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"name": "gold-prices",
"title": "Gold Prices (Monthly in USD)",
"resources": [
{
"path": "data.csv",
"format": "csv",
"schema": {
"fields": [
{
"name":"date",
"type":"date"
},
{
"name":"price",
"type":"number",
"constraints": {
"minimum": 0.0
}
}
]
}
}
]
}
</code></pre></div></div>
<p>By providing a simple, easy-to-use standard for packaging data and
building a suite of integrations to easily and losslessly import and
export packaged data using existing software, we foresee a radical
improvement in the quality and speed of data-driven analysis.</p>
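<p>As a minimal sketch of what that looks like in practice (assuming the two files above sit in the current directory and the <code class="language-plaintext highlighter-rouge">datapackage</code> Python library is installed), loading the package and walking its schema takes only a few lines:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import datapackage

# Build a model of the package from its descriptor
dp = datapackage.DataPackage('datapackage.json')

print(dp.metadata['title'])  # Gold Prices (Monthly in USD)

# Walk the declared resources and their schemas
for resource in dp.metadata['resources']:
    fields = resource['schema']['fields']
    print(resource['path'], [(f['name'], f['type']) for f in fields])</code></pre></figure>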
<p>So, without further ado, let’s look at some of the actual tooling built :).</p>
<h2><img src="/img/posts/ckan-logo-s.png" alt="CKAN Logo" /></h2>
<p><a href="http://ckan.org/">CKAN</a>, originally developed by Open Knowledge, is
the leading open-source data management system used by governments and
civic organizations around the world. It allows organizations and
ordinary users to streamline the publishing, sharing, and use of open
data. In the US and the UK, <a href="http://www.data.gov/">data.gov</a> and
<a href="https://data.gov.uk/">data.gov.uk</a> run on CKAN, and there are many
more <a href="http://ckan.org/instances/#">instances around the world</a>.</p>
<p>Given its ubiquity, CKAN was a natural target for supporting Data
Packages, so we built a
<a href="http://docs.ckan.org/en/latest/extensions/index.html">CKAN extension</a>
for importing and exporting Data Packages both via the UI and the API:
<a href="https://github.com/ckan/ckanext-datapackager">ckanext-datapackager</a>.
This work
<a href="https://github.com/ckan/ckanext-datapackager#where-is-the-old-open-knowledges-data-packager">replaces</a>
a previous implementation for an earlier version of CKAN.</p>
<h3 id="ckan-data-packager-extension">CKAN Data Packager Extension</h3>
<ul>
<li>Source and usage information: <a href="https://github.com/ckan/ckanext-datapackager">https://github.com/ckan/ckanext-datapackager</a></li>
<li>Screencast (UI): <a href="https://youtu.be/qEaAJB_GYmQ">https://youtu.be/qEaAJB_GYmQ</a></li>
<li>Screencast (API): <a href="https://asciinema.org/a/8jrpft2etpubte8jupfko8ci5">https://asciinema.org/a/8jrpft2etpubte8jupfko8ci5</a></li>
</ul>
<h2 id="bigquery-and-sql-integration">BigQuery and SQL Integration</h2>
<p><a href="https://developers.google.com/apps-script/advanced/bigquery">BigQuery</a>
is Google’s web service for querying massive datasets. By providing a
Python library, we can allow data wranglers to easily import and
export “big” Data Packages for analysis in the cloud. Likewise, by
supporting general SQL import and export for Data Packages, a wide
variety of software that depend on typical SQL databases can support
Data Packages natively. The library powering both implementations is
<a href="https://github.com/frictionlessdata/jsontableschema-py">jsontableschema-py</a>,
which provides a high level interface for importing and exporting
tabular data to and from
<a href="https://github.com/frictionlessdata/jsontableschema-py#storage">Tabular Storage</a>
objects based on
<a href="http://dataprotocols.org/table-schema/">JSON Table Schema</a>
descriptors.</p>
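<p>As a rough sketch of the Tabular Storage idea (the method names follow the Storage interface linked above, but treat the exact import path and signatures as assumptions and check the library READMEs):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">from datetime import date

from sqlalchemy import create_engine
from jsontableschema_sql import Storage  # import path is an assumption

# A JSON Table Schema and a couple of rows matching it
schema = {'fields': [{'name': 'date', 'type': 'date'},
                     {'name': 'price', 'type': 'number'}]}
rows = [[date(2014, 1, 1), 1243.068],
        [date(2014, 2, 1), 1298.713]]

# Any SQLAlchemy engine can back the Storage object
engine = create_engine('sqlite:///prices.db')
storage = Storage(engine=engine)

# Create a table ("bucket") from the schema, write the rows, read them back
storage.create('prices', schema)
storage.write('prices', rows)
print(list(storage.read('prices')))</code></pre></figure>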
<h3 id="bigquery">BigQuery</h3>
<ul>
<li>Source: <a href="https://github.com/frictionlessdata/jsontableschema-bigquery-py">jsontableschema-bigquery-py</a>.</li>
<li>Screencast: <a href="https://www.youtube.com/watch?v=i_YHSwl-7VU&feature=youtu.be">https://www.youtube.com/watch?v=i_YHSwl-7VU</a></li>
<li>Walkthrough: <a href="https://gist.github.com/vitorbaptista/998aed29097945aaccff">https://gist.github.com/vitorbaptista/998aed29097945aaccff</a></li>
</ul>
<h3 id="sql">SQL</h3>
<ul>
<li>Source and usage information: <a href="https://github.com/frictionlessdata/jsontableschema-sql-py">jsontableschema-sql-py</a>.</li>
<li>Screencast: <a href="https://asciinema.org/a/cyzd0lz0kqvcqmg4zneifohov">https://asciinema.org/a/cyzd0lz0kqvcqmg4zneifohov</a></li>
<li>Walkthrough: <a href="https://gist.github.com/vitorbaptista/19d476d99595584e9ad5">https://gist.github.com/vitorbaptista/19d476d99595584e9ad5</a></li>
</ul>
<h3 id="beyond">Beyond</h3>
<p>This modular approach allows us to easily build support across many
more tools and databases. We already have plans to support
<a href="https://www.mongodb.org/">MongoDB</a> and
<a href="https://github.com/maxogden/dat">DAT</a>. Of course, we need feedback
from <strong>you</strong> to pick the next libraries to focus on. What tool do you
think could benefit from Data Package integration? Tell us in the
<a href="https://discuss.okfn.org/c/frictionless-data">forum</a>.</p>
<p><img src="/img/posts/tabular-storage-diagram.png" alt="" /></p>
<p>For more information on Data Packages and our Frictionless Data
approach, please visit
<a href="http://frictionlessdata.io/">http://frictionlessdata.io/</a>.</p>
Dan Fowler
Submit your Newsletter ideas today!
2016-02-18T00:00:00+00:00
http://okfnlabs.org/blog/2016/02/18/submissions
<p>The first quarter of 2016 is almost through, which means that the OKFN Labs Newsletter is on its way! But we have a problem. We know that you have spent the last 3 months writing awesome code, founding disruptive new projects and basically changing the world. But we haven’t received your newsletter submissions, which we need in order to help spread the word about what you’ve been up to.</p>
<p>Submitting your ideas or work for inclusion in the newsletter is easy and only takes about a minute. Start by clicking the link to the Open Knowledge Labs issue tracker <a href="https://github.com/okfn/okfn.github.com/issues/new?title=[newsletter]">here</a>.</p>
<p>Here’s what your submission should include:</p>
<ul>
<li>A title that is succinct, and describes the item well</li>
<li>A link (or links) to the project/initiative you are submitting</li>
<li>A single paragraph that presents an introduction to the project/initiative</li>
</ul>
<p>And that’s it! It’s fast, it’s easy, and it helps get the word out about your latest project - so submit your newsletter ideas today.</p>
Josh Wieder
Labs newsletter: Q4 2015
2015-12-05T00:00:00+00:00
http://okfnlabs.org/blog/2015/12/05/newsletter
<p>Hey there hackers & hackettes! Welcome to the 4th quarter 2015 edition of the Open Knowledge Labs Newsletter: A Very Special Holiday Edition. We hope that all of our readers, volunteers, team members & contributors have a great holiday season. Labs is doing our part to keep things festive:</p>
<p><img src="https://raw.githubusercontent.com/okfn/okfn.github.com/master/img/newsletter/xmas-computer.jpg" alt="Holiday computer" /></p>
<p>Despite the hustle and bustle of the season, we are happy to report that Labs has made some serious progress with our existing projects and that we also have a few very cool tools to assist with your year-end data analysis.</p>
<h3 id="tuttle---language-platform--version-control-agnostic-tool-for-collaborating-on-complex-coding-projects">Tuttle - language, platform & version-control agnostic tool for collaborating on complex coding projects</h3>
<p>Our very own @lexman (Alexandre Bonnasseau of mappy.com) was kind enough to provide Labs with a tool called <a href="https://github.com/lexman/tuttle">tuttle</a> that should come in handy when submitting code for large projects.</p>
<p>@lexman does an excellent job describing the purpose of <a href="https://github.com/lexman/tuttle">tuttle</a> in a <a href="https://discuss.okfn.org/t/a-tool-for-collaborating-on-datapackages/1397">recent post to the Labs discussion site</a>:</p>
<blockquote>
<p>“When we write scripts to create data, we don’t make it right on the first time. How many times did you have to comment the beginning of a script, so that executions jumps directly to a bug fix? With tuttle, you won’t have to. First, it computes only what is necessary : for example if a file has already been downloaded, it won’t do it again. But also, when you change a line of code, tuttle knows exactly what data must be removed and what part of the code must be run instead.”</p>
</blockquote>
<p>Tuttle can be used to generate reports that map out workflows based on submission history and also highlight errors, as illustrated below (or in more detail <a href="http://stuff.lexman.org/s-and-p-500/scripts/.tuttle/report.html">here</a>):</p>
<p><img src="https://raw.githubusercontent.com/okfn/okfn.github.com/master/img/newsletter/tuttle-report-2.PNG" alt="Tuttle report 2" /></p>
<p>@lexman provides a detailed (and incredibly helpful) <a href="https://github.com/lexman/tuttle/blob/master/doc/tutorial_musketeers/tutorial.md">tutorial that helps acquaint new users with tuttle</a>. We highly recommend giving the tutorial a try and using tuttle for complex development projects.</p>
<h3 id="mira-turns-csv-files-into-an-http-api">Mira turns CSV files into an HTTP API</h3>
<p><a href="https://github.com/davbre/mira">Mira</a> is a new tool that comes to Labs from @davbre and is built using Ruby on Rails & relies on Postgres. <a href="https://github.com/davbre/mira">Mira</a> allows users to generate an API using <a href="http://dataprotocols.org/data-packages/">data packages</a>, a way to describe csv files using JSON - greatly simplifying what can often be a lengthy, tedious process. Here is how @davbre describes his utility:</p>
<blockquote>
<p>“This is a small application developed using Ruby-on-Rails. You upload a datapackage.json file to it along with the corresponding CSV files and it gives you a read-only HTTP API. It’s pretty simple - it uses the metadata in the datapackage.json file to import each CSV file into its own database table. Once imported, various API endpoints become available for metadata and data. You can perform simple queries on the data, controlling the ordering, paging and variable selection. It also talks to the DataTables jQuery plug-in.”</p>
</blockquote>
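<p>Consuming the generated API is then plain HTTP. The sketch below shows the general shape of such a client in Python; the base URL, endpoint paths and table name are hypothetical placeholders, so check Mira’s README for the real routes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import requests

# A locally running Mira instance (assumption).
BASE = "http://localhost:3000"

# Hypothetical routes for illustration only - Mira's actual
# endpoint names are documented in its README.
metadata = requests.get(BASE + "/datapackage").json()
rows = requests.get(BASE + "/tables/constituents/data",
                    params={"page": 1}).json()
print(metadata)
print(rows)
</code></pre></div></div>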
<h3 id="python-data-analysis-library-agate-has-reached-version-11">Python data analysis library Agate has reached version 1.1</h3>
<p>A new Python library has begun to come of age. <a href="https://github.com/onyxfish/agate">Agate</a> was built by @onyxfish (NPR data journalist <a href="https://source.opennews.org/en-US/people/christopher-groskopf/">Christopher Groskopf</a>) as an alternative to numpy and pandas. Whereas numpy and pandas were designed for scientists, Agate is designed with the needs of journalists in mind. Agate places a premium on ease of use and flexibility, even at the expense of the performance optimizations present in other libraries. As @onyxfish puts it in <a href="https://source.opennews.org/en-US/articles/introducing-agate/">a post</a> announcing the new version of Agate:</p>
<blockquote>
<p>“In greater depth, agate is a Python data analysis library in the vein of numpy or pandas, but with one crucial difference. Whereas those libraries optimize for the needs of scientists—namely, being incredibly fast when working with vast numerical datasets—agate instead optimizes for the performance of the human who is using it. That means stripping out those technical optimizations and instead focusing on designing code that is easy to learn, readable, and flexible enough to handle any weird data you throw at it.”</p>
</blockquote>
<p>Agate’s leap from version 0.11.0 to version 1.0.0 on October 22nd of this year marked the <a href="https://agate.readthedocs.io/en/1.1.0/changelog.html">first major release</a> for the up-and-coming library (version 1.1 was released November 4th). While Agate was fully functional at v0.11.0, the changes since then have been substantial. Among some of the more impressive additions:</p>
<ul>
<li>Agate can now be used as a drop-in replacement for Python’s csv module</li>
<li>Migrated csvkit’s unicode CSV reading/writing support into agate</li>
<li>100% test coverage reached</li>
<li>Added support for Python 3.5</li>
<li>Massive performance increases for joins</li>
<li>Dozens of other resolved issues …</li>
</ul>
<p>Agate has an impressive array of documentation for developers. Take a look at <a href="https://agate.readthedocs.io/en/1.1.0/">the manual</a>, the <a href="https://agate.readthedocs.io/en/1.1.0/tutorial.html">standard tutorials</a>, a tutorial for <a href="http://nbviewer.ipython.org/urls/gist.githubusercontent.com/onyxfish/36f459dab02545cbdce3/raw/534698388e5c404996a7b570a7228283344adbb1/example.py.ipynb">using Agate with Jupyter notebook</a>, the <a href="https://agate.readthedocs.io/en/1.1.0/cookbook.html">Agate Cookbook</a> and the <a href="https://agate.readthedocs.io/en/1.1.0/api.html">Agate API documentation</a>.</p>
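<p>To give a flavor of that readability, here is a minimal example using agate’s CSV loading and aggregation API (the file and column names are illustrative assumptions):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import agate

# Load a CSV; agate infers the column types for you.
# A "spending.csv" with "department" and "amount" columns is assumed.
table = agate.Table.from_csv("spending.csv")

# Group, sum, sort and print - no loops required.
totals = table.group_by("department").aggregate([
    ("total", agate.Sum("amount")),
])
totals.order_by("total", reverse=True).limit(10).print_table()
</code></pre></div></div>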
<p>Agate does indeed look promising, and there is an immense need for tools like it. Journalism is changing rapidly. With a flood of new information from Open Data advocates (like Open Knowledge) making its way to the newsroom, organizations that can effectively interpret that data will maintain a significant advantage over their competitors. Meanwhile, the public can only benefit from more accurate analysis of the larger sets of information that impact their lives. Clearly, analytical skills once required only at universities are now needed well beyond the Ivory Tower.</p>
<h3 id="webshot-improvements">Webshot improvements</h3>
<p><a href="http://webshot.okfnlabs.org/">Webshot</a> is a free, automated utility that allows for the generation of live screenshots. Screenshots serve an important role in demonstrating accountability (when content is removed, defaced or censored from the internet), but just as frequently are critical for troubleshooting & diagnostic services. There are many scenarios in which manually creating screenshots would not be feasible - because of routing issues, or because a screenshot needs to be generated at an exact time (Webshot can be called via an API).</p>
<p>The <a href="https://github.com/okfn/webshot/">Github page for Webshot</a> includes not just the Webshot source code, but also a node-based web server with a default Heroku configuration. This enables users to spin up a fully functional Webshot instance using Heroku in just a few minutes, starting from scratch. Install the node package manager, create your heroku instance and push the configuration and you are all set!</p>
<p>When calling screenshots that have been generated by Webshot, the URLs reference the source website of the screenshot and allow for resizing of the image. Here are some examples from the Webshot documentation:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> http://localhost:5000/api/generate?url=google.com&width=500
http://localhost:5000/api/generate?url=google.com&height=300
http://localhost:5000/api/generate?url=google.com&width=200&height=400
</code></pre></div></div>
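<p>The same calls are easy to script. A minimal sketch using Python’s requests library and the parameters documented above (it assumes a locally running instance and that the endpoint returns the image bytes directly):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import requests

# Request a 500px-wide screenshot of google.com, per the docs above.
resp = requests.get("http://localhost:5000/api/generate",
                    params={"url": "google.com", "width": 500})
resp.raise_for_status()

# Save the returned image bytes (PNG is an assumption).
with open("screenshot.png", "wb") as f:
    f.write(resp.content)
</code></pre></div></div>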
<p>Check out Webshot - and don’t forget to <a href="https://github.com/okfn/webshot/issues/">contribute to the project on Github</a>!</p>
<h3 id="new-tasks-need-your-help-within-the-core-datasets-issue-registry">New tasks need your help within the Core Datasets Issue Registry</h3>
<p><a href="http://data.okfn.org/roadmap/core-datasets">Open Knowledge’s Core Datasets</a> are a selection of commonly-used datasets on a variety of topics that can be put to use for a variety of different research topics. All of the tools are free and available on Github - being able to find so many diverse, reliable and useful datasets in one place can save those of us who rely on open data a lot of time and hassle. Because so many people rely on these tools, submitting code to these tools allows your submissions to make a real and significant difference to projects all over the world. Here are some examples of the types of packages that are included:</p>
<ul>
<li><a href="https://github.com/datasets/geoip2">geoip2</a>, a free IP geolocation database based on data from the <a href="http://dev.maxmind.com/geoip/geoip2/geolite2/">Geolite2 MaxMind databases</a></li>
<li><a href="https://github.com/datasets/imf-weo">imf-weo</a>, a copy of the <a href="http://www.imf.org/external/ns/cs.aspx?id=28">International Monetary Fund World Economic Outlook database</a></li>
<li><a href="https://github.com/datasets/clinical-trials-us">clinical-trials-us</a>, javascript-based tool listing official US clinical trial outcomes from the FDA, relies on data from <a href="http://clinicaltrials.gov/">clinicaltrials.gov</a></li>
<li><a href="https://github.com/datasets/crime-uk">crime-uk</a>, UK-specific crime data from multiple sources, including http://police.uk/data</li>
<li><a href="https://github.com/datasets/browser-stats">browser-stats</a>, a Python based tool that collects browser usage statistics trends, primarily gathered from <a href="http://www.w3schools.com/browsers/browsers_stats.asp">W3Schools log files</a></li>
<li>and much more …</li>
</ul>
<p>Thanks in large part to @pdehaye, the new Core Datasets Managing Curator, there has been a flurry of new activity and project additions on the Core Datasets Issue Registry. Take a look at <a href="https://github.com/datasets/registry/issues">the Issue Registry</a> for issues that you think you could help resolve and start tackling them! For example, <a href="https://github.com/datasets/registry/issues/122">a thread has been created</a> for the <a href="http://www.economist.com/content/big-mac-index">Big Mac Index Dataset</a>. Don’t know what a Big Mac Index is? Not a problem! You don’t need to be an economist to write a script that will poll <a href="http://infographics.economist.com/2015/databank/BMfile2000-Jul2015.xls">the correct datasets</a> (note: XLS file). If you want to help out but aren’t sure how to start, or you’re having trouble, browse the <a href="https://github.com/datasets/registry/issues?q=is%3Aopen+is%3Aissue+label%3A%22Difficulty%3A+easy%22">easier issues</a> and leave a comment on the relevant thread! Also, be sure to note in a thread that you are working on a specific project, to avoid duplicating effort. Now that you know all this, go get coding!</p>
<h3 id="labs-establishes-organizational-structure-open-positions-still-available">Labs establishes organizational structure, open positions still available</h3>
<p>Labs continues to expand and attract interest from talented developers and all manner of smarty-pantses. With more people and more projects there is more responsibility and more to get done. To that end, Labs has begun to develop an organizational structure so that all of our team members can focus on what they are best at, to prevent duplication of effort, and to make communication easier and more effective. So far, the assigned positions are:</p>
<ul>
<li>@danfowler
<ul>
<li>Team Lead</li>
</ul>
</li>
<li>@loleg
<ul>
<li>Team Lead</li>
</ul>
</li>
<li>@mattfullerton
<ul>
<li>Team Lead</li>
</ul>
</li>
<li>@pdehaye
<ul>
<li>Core Datasets Managing Curator</li>
</ul>
</li>
<li>@davbre
<ul>
<li>Advisory Group Member</li>
</ul>
</li>
<li>@davidmiller
<ul>
<li>Advisory Group Member</li>
</ul>
</li>
<li>@jgkim
<ul>
<li>Advisory Group Member</li>
</ul>
</li>
<li>@jwieder
<ul>
<li>Advisory Group Member</li>
</ul>
</li>
</ul>
<p>For more detailed information about each position, be sure to check out <a href="https://github.com/okfn/okfn.github.com/issues/367">this thread</a>. There are still positions available and a significant need for assistance from those with all sorts of different skills - leave a comment on the thread to let us know that you want to help step up to keep Labs growing!</p>
<h3 id="its-time-to-get-involved">It’s time to get involved</h3>
<p>The New Year is a time for reflection on the year gone by and an opportunity to resolve to engage in good deeds for the year ahead. This year, OKFN Labs urges you to forget about silly New Year’s resolutions like more exercise or fewer carbs. Do something important with 2016 and <strong>write more code</strong>! The first thing to do is to make sure that you are a part of the Labs team by <a href="http://okfnlabs.org/join/">signing up</a>. Once you have joined the Labs community, check out <a href="http://okfnlabs.org/ideas/">our Ideas page</a> or our <a href="http://okfnlabs.org/projects/">current Projects</a> and find something that you would be interested in collaborating on. Do you have a plan for something we haven’t thought of yet? Tell us about it <a href="https://twitter.com/intent/user?screen_name=OKFNLabs">on Twitter</a> or, better yet, jump on <a href="http://okfnlabs.org/contact/">the mailing list</a>.</p>
<p>For all of you already contributing to Labs: keep up the great work! Open Data is important, and your efforts continue to provide transparency for critical information. With your help, Labs will continue its success into 2016. See you then!</p>
Josh Wieder
Labs newsletter: Q2/Q3 2015
2015-09-28T00:00:00+00:00
http://okfnlabs.org/blog/2015/09/28/newsletter
<p>Welcome to the second Labs Newsletter of 2015! There has been
excellent progress on various open data tools and initiatives across
the Open Knowledge network since the last newsletter. Let’s take a
look:</p>
<h2 id="labs-still-3-discourse">Labs Still <3 Discourse</h2>
<p>Open Knowledge is in the process of centralizing community discussions
on our Discourse <a href="http://discuss.okfn.org">forums</a>. In order to do
this, we’ve been enabling many new features to support
mailing-list-style communication such as starting and replying to new
topics via email. There’s already a lot of discussion there, so,
check out the
<a href="https://discuss.okfn.org/c/open-knowledge-labs">Open Knowledge Labs</a>
category, sign up, and tell us about your favorite tools!</p>
<h2 id="jts-sql">JTS-SQL</h2>
<p>Friedrich Lindenberg, AKA <a href="http://okfnlabs.org/members/pudo/">pudo</a>,
has booted up <a href="https://github.com/okfn/jts-sql">JTS-SQL</a>, a Python
library that removes some of the friction in dealing with data by
automatically generating database table models based on
<a href="http://dataprotocols.org/table-schema/">JSON Table Schema</a> field
descriptors.</p>
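<p>In general terms, the trick is a straightforward mapping from JSON Table Schema field types to SQL column types. The sketch below illustrates that idea with SQLAlchemy - note this shows the concept, not JTS-SQL’s actual API:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import sqlalchemy as sa

# A minimal mapping from JSON Table Schema types to SQL column types.
TYPE_MAP = {"string": sa.Text, "integer": sa.Integer,
            "number": sa.Float, "date": sa.Date}

def table_from_schema(name, schema, metadata):
    """Build a SQLAlchemy Table from a JSON Table Schema dict."""
    columns = [sa.Column(field["name"],
                         TYPE_MAP.get(field.get("type", "string"), sa.Text))
               for field in schema["fields"]]
    return sa.Table(name, metadata, *columns)

metadata = sa.MetaData()
table_from_schema("countries", {"fields": [
    {"name": "code", "type": "string"},
    {"name": "population", "type": "integer"},
]}, metadata)
metadata.create_all(sa.create_engine("sqlite://"))  # in-memory SQLite
</code></pre></div></div>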
<h2 id="give-me-text">Give Me Text!</h2>
<p>The Labs web service booted by
<a href="/members/mattfullerton/">Matt Fullerton</a> for converting documents
(e.g. PDF) to text using OCR has been given a new name “Give Me Text!”
and <a href="http://givemetext.okfnlabs.org/">a nice URL</a>. Read the original
announcement
<a href="http://okfnlabs.org/blog/2015/02/21/documents-to-text.html">here</a>.</p>
<h2 id="core-data-curators">Core Data Curators</h2>
<p>We’re always looking for curators to help support our Core Datasets
project which is aimed at collecting and maintaining (curating)
important and commonly-used (“core”) datasets (e.g. GDP, ISO-codes) in
high-quality, standardized and easy-to-use form. Interested in
joining? Have questions? Visit us
<a href="https://discuss.okfn.org/c/open-knowledge-labs/core-datasets">here</a>.</p>
<h2 id="openspending-next">OpenSpending Next</h2>
<p>Work on the next iteration of
<a href="http://community.openspending.org/next">OpenSpending</a> is picking up
steam. We recently hosted an
<a href="https://discuss.okfn.org/t/tech-hangout-sept-2015/1046/12">OpenSpending Tech Hangout</a>
to demo some of our recent work. We can’t do it without you, so jump
into the
<a href="https://discuss.okfn.org/c/openspending">OpenSpending category</a> and
tell us how you’d like your government’s finances aggregated, stored,
and visualized :).</p>
<h2 id="building-a-100-open-source-based-open-data-platform">Building a 100% Open-Source-based Open Data platform</h2>
<p>Alex Corbi
<a href="http://www.open-steps.org/my-experience-building-a-100-open-source-based-open-data-platform/">shares his experience</a>
participating in the development of the
<a href="http://www.opendevelopmentmekong.net">Open Development Mekong</a>
project, a platform built using 100% open-source technologies whose
aim is to increase transparency in the Southeast Asian countries of the
Mekong basin region. The article focuses on its main component; a
knowledge base platform built using
<a href="http://ckan.org/">CKAN</a>. Additionally, Alex gives an overview on how
the development was coordinated on <a href="https://github.com/">GitHub</a> and
the set of additional tools developed around the project, including a
<a href="http://extensions.ckan.org/extension/wpckan/">plugin to connect Wordpress and CKAN</a>,
which can be reused and repurposed in future projects.</p>
<h2 id="open-data-companion">Open Data Companion</h2>
<p>Osahon Okungbowa let us know about a mobile app he has created,
<a href="http://odc.utopiasoftwareonline.com/">Open Data Companion (ODC)</a>,
which provides a unified access point to over 120 CKAN open data
portals and thousands of datasets from around the world, right from
your mobile device. Crafted with mobile-optimised features and design,
this is an easy and convenient way to find, access and share open
data.</p>
<h2 id="mexicos-new-open-data-portal">Mexico’s new Open Data Portal</h2>
<p>Juan Ortiz Freuler has pointed us to
<a href="https://es.scribd.com/doc/274622757/Supply-and-Demand-of-Data-Through-Mexico-s-Open-Data-Portal">this report</a>,
written in non-technical language, which provides the reader with
contextual data to understand the challenges the government of a
developing country is facing in the implementation of an Open Data
Portal. The report includes analysis of availability and quality of
key datasets, critical analysis of the existing normative framework,
an analysis of web traffic towards the portal, as well as insights
from interviews with over two dozen professional Mexican data users.</p>
<p><img src="https://cloud.githubusercontent.com/assets/14280123/9857467/9edcb930-5b12-11e5-8d2d-6cb89b3dd710.png" alt="graph" /></p>
<h2 id="a-model-for-frictionless-science">A Model for Frictionless Science?</h2>
<p>Steven De Costa has been doing work with open government data for the
last few years, mostly around the platform capability of CKAN, and has
been considering what frictionless science might look like:</p>
<blockquote>
<p>I’ve been thinking about how to publish the full set of research
artifacts needed to replicate and review work undertaken by labs, or
to swap out data and reconstitute the research in a new context. That
thinking, done only with little access to end users, has revealed the
following short list of what might be published as a ‘dataset’ listing
of ‘resources’…</p>
</blockquote>
<p>Check out the forums to contribute to the
<a href="https://discuss.okfn.org/t/is-there-a-model-for-frictionless-science/1203/1">discussion</a>!</p>
<h2 id="get-involved">Get Involved</h2>
<p>Anyone can join the Labs community and get involved! Read more about
how you can <a href="/join">join the community</a> and participate by coding,
wrangling data, or doing outreach and engagement. Also check out the
<a href="/ideas">ideas page</a> to see what’s cooking in the Labs, and the
<a href="/newsletter">newsletter page</a> if you have items to submit to the next
newsletter.</p>
Daniel Fowler
Open Data Companion (ODC) – Bringing Open Data to the Mobile Platform
2015-09-04T00:00:00+00:00
http://okfnlabs.org/blog/2015/09/04/bringing-open-data-to-mobile
<p>
As software developers, we are always looking for data to solve a problem or address a shortcoming.
It’s just how we’re wired. So, you heard of <a href="http://opendatahandbook.org/" target="_blank">open data [1]</a>,
and now you’re excited to go exploring and get the open data needed for the project.
That’s when you face the first major obstacle – open data with the right ‘open’ license can be so
difficult to locate on the Internet that you spend hours and hours searching before you find data you can actually use.
Luckily, <a href="http://data.okfn.org/vision" target="_blank">OKFN’s frictionless data project [2]</a>
and <a href="https://namara.io/#/" target="_blank">ThinkData Works [3]</a> are taking action to help solve this problem.
</p>
<p>
As a data journalist, researcher or open data enthusiast, have you ever wished you could just whip out your mobile
device and get instant access to a specific open dataset? Or maybe you want to know as soon as new
datasets are available from a portal you’re interested in, without having to manually poll or visit the portal
for regular checks?
</p>
<p>
Nearly everyone now has a mobile device.
Popular <a href="http://www.bloomberg.com/bw/articles/2014-11-19/we-now-spend-more-time-staring-at-phones-than-tvs" target="_blank">consensus [4]</a> and
<a href="http://www.geekwire.com/2014/flurry-report-mobile-phones-162-minutes/" target="_blank">reports [5]</a> show that mobile device
usage and time spent on mobile devices are rapidly increasing. This means that mobile devices are now one of the fastest
and easiest means of accessing data and information. Yet, as of now, open data lacks a strong mobile presence.
</p>
<img src="https://writeosahon.files.wordpress.com/2015/08/smartphone-survey.png" title="Motorola Smartphone Relationship Survey"
alt="Motorola Smartphone Relationship Survey" width="400" height="300"> [Image Source – Motorola Smartphone Relationship Survey]<br>
<p>
To tackle the highlighted challenges, Open Data Companion was developed to bridge the gap between open data portals and the
mobile platform.
</p>
<h2>What is Open Data Companion (ODC)?</h2>
<p>
ODC is a FREE Android productivity app which provides a unified access point to, and common repository for, over 120
<a href="http://ckan.org/">CKAN [6]</a> open data portals and thousands of datasets from around the world,
right from your mobile device. Crafted with mobile-optimised features and design,
it is an easy and convenient way to find, access and share open data.
</p>
<p>
Open Data Companion provides a framework for all private-sector, state, regional, national and worldwide CKAN open
data portals to deliver open data to all mobile users.
</p>
<img src="https://writeosahon.files.wordpress.com/2015/08/screen_shot11.png" title="ODC Screen Shot"
alt="ODC Screen Shot" width="200" height="400">
<b><a href="https://play.google.com/store/apps/details?id=writeosahon.utopiasoftware.odcompanion" target="_blank">
Download the ODC Android App here</a></b><br>
<h3>Key Features</h3>
<ul>
<li>
ODC is built on a distributed system of CKAN portals. This means the app (or its server) does not cache or
store harvested datasets. Rather, the app uses the powerful
<a href="http://docs.ckan.org/" target="_blank">CKAN API [7]</a> to retrieve live/current datasets from any
active CKAN portal on the Internet. Users are free to set up access to as many data portals as they want
(a raw call to the CKAN API is sketched after this list).<br>
<img src="https://writeosahon.files.wordpress.com/2015/08/screen4.png" title="Portal Access Setup on ODC"
alt="Portal Access Setup on ODC" width="200" height="400">
</li>
<li>
Browse datasets from any accessed portal using the categorisation/classification (i.e. organisations and groups)
provided by the portal. This category-based navigation, along with mobile gestures, allows users to
locate datasets more quickly.<br>
<img src="https://writeosahon.files.wordpress.com/2015/08/screen_shot3.png" title="Browsing Datasets by Categories on ODC"
alt="Browsing Datasets by Categories on ODC" width="200" height="400">
</li>
<li>
Receive push notifications on your mobile device whenever new datasets become available from any portal of your choosing.
This helps to keep track of new uploads on any portal without having to poll it manually for changes.<br>
<img src="https://writeosahon.files.wordpress.com/2015/08/screen_shot2.png" width="200" height="400">
</li>
<li>
Preview datasets and create in-app data visualisations.
The app uses the data viewer extensions installed on CKAN portals to enable mobile users to produce data visualisations.
This is a great feature for data journalists and data visualizers.<br>
<img src="https://writeosahon.files.wordpress.com/2015/08/preview_1.png" title="ODC Dataset Viewing & Data Viz"
alt="ODC Dataset Viewing & Data Viz" width="540" height="270"><br>
<img src="https://writeosahon.files.wordpress.com/2015/08/screen_shot5.png" title="ODC Dataset Viewing & Data Viz"
alt="ODC Dataset Viewing & Data Viz" width="200" height="400">
<img src="https://writeosahon.files.wordpress.com/2015/08/preview_3-e1440833607613.png" title="ODC Dataset Viewing & Data Viz"
alt="ODC Dataset Viewing & Data Viz" width="200" height="400">
</li>
<li>
Other key features - download datasets to your mobile device; bookmark datasets for later viewing;
share links to datasets on social media.
</li>
</ul>
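<p>Because every CKAN portal exposes the same Action API, the app can treat 120+ portals interchangeably. For reference, here is what a raw call to that API looks like from Python (demo.ckan.org is used purely for illustration; any active CKAN portal responds the same way):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import requests

# The CKAN Action API lives under /api/3/action on every CKAN portal.
base = "http://demo.ckan.org/api/3/action"

# Search for datasets matching a keyword.
response = requests.get(base + "/package_search",
                        params={"q": "budget", "rows": 5}).json()
for package in response["result"]["results"]:
    print(package["name"], "-", package.get("title"))
</code></pre></div></div>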
<p>
ODC complements and supplements the services of CKAN portals; it allows data producers and portal administrators to reach
mobile users at no added cost and with no additional configuration.
<a href="http://odc.utopiasoftwareonline.com/" target="_blank">Find out more about Open Data Companion (ODC) [8]</a>.
</p>
<p>
<a href="https://play.google.com/store/apps/details?id=writeosahon.utopiasoftware.odcompanion" target="_blank">
Use ODC and give feedback</a>. More features are actively being developed.
</p>
<!-- the entire section below contains the references for links used in the blog -->
<h3>
References
</h3>
<div>[1] Open Data Handbook http://opendatahandbook.org/</div>
<div>[2] OKFN’s Frictionless Data Project http://data.okfn.org/vision</div>
<div>[3] ThinkData Work’s Project Namara https://namara.io/#/</div>
<div>[4] We Now Spend More Time Staring at Phones Than TVs
http://www.bloomberg.com/bw/articles/2014-11-19/we-now-spend-more-time-staring-at-phones-than-tvs</div>
<div>[5] Study: Americans spend 162 minutes on their mobile device per day, mostly with apps
http://www.geekwire.com/2014/flurry-report-mobile-phones-162-minutes/</div>
<div>[6] CKAN http://ckan.org/</div>
<div>[7] CKAN API http://docs.ckan.org/</div>
<div>[8] Open Data Companion (ODC) http://odc.utopiasoftwareonline.com/</div>
Osahon Okungbowa
Document to text conversion web service gets a nice name, a nice URL and a web interface
2015-08-28T00:00:00+00:00
http://okfnlabs.org/blog/2015/08/28/give-me-text
<h2 id="give-me-text">Give Me Text!</h2>
<p>In a <a href="http://okfnlabs.org/blog/2015/02/21/documents-to-text.html">previous post</a>, I detailed a web service where you can throw documents of many kinds at it, and get text in return. We’ve now given this service a name, “Give Me Text!”, and a nice URL at <a href="http://givemetext.okfnlabs.org/">http://givemetext.okfnlabs.org/</a> for both the API, which lives at the subpath <a href="http://givemetext.okfnlabs.org/tika">/tika</a>, and a web interface for uploading documents to the service. The web service is based on <a href="https://github.com/tpalsulich/TikaExamples/tree/gh-pages">some nice work by Tyler Palsulich</a> who got in touch via <a href="https://github.com/okfn/ideas/issues/88#issuecomment-100107044">GitHub</a>. Thanks Tyler!</p>
Matt Fullerton
Improving the openness of health and social care data
2015-08-14T00:00:00+00:00
http://okfnlabs.org/blog/2015/08/14/improving-the-openness-of-health-and-social-care-data
<p>The <a href="http://www.hscic.gov.uk/">Health and Social Care Information Centre</a> (HSCIC) is responsible for publishing a large proportion of the official statistics related to health and care in England. Each year we release about 250 statistical publications, ranging from high-level summary data on hospital admissions, through to detail on prescriptions, and results from surveys on lifestyles and smoking, drinking and drug use habits. We publish a vast array of aggregated non-identifiable data, all under the Open Government Licence, and are working with an <a href="http://www.hscic.gov.uk/transparency">open data</a> mind-set to ensure that these data can be used to maximum effect.</p>
<p>Most of our statistical data is presented in formatted spreadsheets, providing context and detail in accordance with the Statistics Code of Conduct, but we are also making the data available for re-use in machine-readable, comma-separated values (CSV) format. We hope that this encourages our non-identifiable data to be consumed by a greater array of users, for more purposes. An example of this is the annual <a href="http://www.hscic.gov.uk/pubs/hes1314">Hospital Episode Statistics</a> (HES) publication for admitted patient care – the publication contains a raft of Excel tables of various statistics, but we have also now created a set of CSV files, which use a consistent structure.</p>
<p>To improve the discoverability of our non-identifiable data, as well as being published on our own site, our datasets are also available through data.gov.uk (DGU). We hope that organising datasets in this way makes it easier for users to find exactly the data they need. An area that we’ve recently worked on is our Clinical Indicators. The HSCIC is responsible for assuring the quality of health and care indicators, and publishes over 1700 indicators on our <a href="https://indicators.ic.nhs.uk/webview/">Indicator Portal</a>.</p>
<p>Until recently, the only way to search or access the data was from within the portal. Now, the aggregated datasets that support over 100 indicators on health outcomes can be accessed using DGU, thanks to its harvesting tool. Our portal makes the metadata available for each indicator using the DDI XML standard – so we have converted this into a <a href="https://github.com/hscic-open-data/indicator-portal">data.json</a> equivalent, which will be maintained in line with the ongoing release of indicators.</p>
<p>This means that these indicators, and more in future, can be found without the user needing to be within our own portal. Users can also benefit from DGU’s additional CKAN functionality (for example, the ‘Preview’ function) and of course, being a CKAN implementation, the aggregated datasets have their own API, allowing other portals to re-harvest the data. All of which will hopefully increase the different ways in which our data is used.</p>
<p>To begin with, the indicators that can be found on data.gov.uk are:</p>
<ul>
<li>
<p><a href="http://data.gov.uk/data/search?sort=title_string+asc&q=title%3A+%22nhsof%22&publisher=health-and-social-care-information-centre#search-sort-by">NHS Outcomes Framework</a> (‘NHSOF’; 50 indicators) – which sets out the outcomes and corresponding indicators used by the Secretary of State to hold NHS England to account for improvements in health outcomes</p>
</li>
<li>
<p><a href="http://data.gov.uk/data/search?sort=title_string+asc&q=title%3A+%22ccgois%22&publisher=health-and-social-care-information-centre#search-sort-by">Clinical Commissioning Group Outcomes Indicator Set</a> (‘CCGOIS’; 53 indicators) – which is an integral part of NHS England’s systematic approach to quality improvement.</p>
</li>
</ul>
<p>If you’re already using our open data to generate benefits, we’d love to learn more – it will help us to prioritise our efforts. Tweet us at <a href="http://twitter.com/HSCICOpenData">@HSCICOpenData</a>.</p>
<p>Chris Hutchins is Open Data Lead, Health and Social Care Information Centre (HSCIC)</p>
Chris Hutchins
Featured Core Datasets: Comprehensive Country Codes and Country List
2015-07-11T00:00:00+00:00
http://okfnlabs.org/blog/2015/07/11/country-list-and-codes
<p>Are you in need of a clean, well maintained list of all countries and
their associated international codes in CSV and JSON? If so, you
might consider the
<a href="http://data.okfn.org/data/core/country-codes">country-codes</a> and
<a href="http://data.okfn.org/data/core/country-list">country-list</a> data
packages available at <a href="http://data.okfn.org">data.okfn.org</a>.</p>
<ul>
<li>
<p><strong><a href="http://data.okfn.org/data/core/country-codes">Country Codes</a></strong>,
using source data from <a href="http://www.iso.org/iso/home.htm">ISO</a>, the
<a href="https://www.cia.gov/library/publications/the-world-factbook/">CIA World Factbook</a>,
and others, provides comprehensive information for countries in the
world, including their respective ISO 3166 codes, ITU dialing codes,
ISO 4217 currency codes, and many others.</p>
</li>
<li>
<p><strong><a href="http://data.okfn.org/data/core/country-list">Country List</a></strong>
provides a subset of the information found in
<a href="http://data.okfn.org/data/core/country-codes">country-codes</a>, for
use when you only need a country’s two-character ISO 3166 code and
its name in English.</p>
</li>
</ul>
<p><img src="/img/posts/country-codes.png" alt="Country Codes" /></p>
<p>Often, different databases that provide country-level information may
use different unique identifiers for each country, adding significant
friction to the process of using the data they provide. By linking
all standardized identifiers in one place,
<a href="http://data.okfn.org/data/core/country-codes">Country Codes</a> can
provide a useful table on which to join distinct datasets and much
more.</p>
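<p>For example, two datasets keyed on different country identifiers can be joined through Country Codes as a bridge table. A short, hedged sketch using pandas - the resource URL and the ISO column headers are assumptions, so check the data package page for the canonical names:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd

# Country Codes CSV (URL pattern assumed from data.okfn.org).
codes = pd.read_csv(
    "http://data.okfn.org/data/core/country-codes/r/country-codes.csv")

# Two toy datasets that use different country identifiers.
gdp = pd.DataFrame({"iso3": ["DEU", "FRA"], "gdp_tn": [3.8, 2.9]})
population = pd.DataFrame({"iso2": ["DE", "FR"], "pop_mn": [83, 67]})

# Use Country Codes as the bridge between the two key systems
# (column headers assumed; see datapackage.json for exact names).
bridge = codes[["ISO3166-1-Alpha-2", "ISO3166-1-Alpha-3"]]
merged = (gdp.merge(bridge, left_on="iso3", right_on="ISO3166-1-Alpha-3")
             .merge(population, left_on="ISO3166-1-Alpha-2",
                    right_on="iso2"))
print(merged[["iso3", "gdp_tn", "pop_mn"]])
</code></pre></div></div>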
<h2 id="core-datasets">Core Datasets</h2>
<p><img src="http://assets.okfn.org/p/data/img/icon-128.png" alt="Data Package" /></p>
<p>Due to the value they provide, these datasets have been designated
<a href="http://data.okfn.org/roadmap/core-datasets">“core” datasets</a> as part
of the <a href="/projects/frictionless-data/">Frictionless Data</a>
initiative. The Core Datasets project is about collecting and
maintaining (curating) important and commonly-used datasets in a
high-quality, standardized, and easy-to-use form: in particular, as
up-to-date, well-structured
<a href="http://dataprotocols.org/data-packages/">Data Packages</a>. These
datasets are available now for inspection on the site, and are
downloadable in either CSV or JSON format for use in your next
application, website, or spreadsheet.</p>
<p>We need help suggesting, preparing and maintaining a set of “core”
datasets as Data Packages. To get involved, see our previous
<a href="http://okfnlabs.org/blog/2015/01/03/data-curators-wanted-for-core-datasets.html">call for data curators</a>.
Also check out the
<a href="https://discuss.okfn.org/t/about-the-core-datasets-category/144">Core Datasets category</a>
on our new <a href="https://discuss.okfn.org">discussion forum</a>.</p>
Dan Fowler
Labs newsletter: Q1 2015
2015-05-11T00:00:00+00:00
http://okfnlabs.org/blog/2015/05/11/newsletter
<p>Welcome to the first Labs Newsletter of 2015! There has been some great activity around open data and tech in the Open Knowledge network over the first quarter of 2015. Let’s dive straight in!</p>
<h2 id="labs-3-discourse">Labs <3 Discourse</h2>
<p>In case you don’t know, <a href="http://www.discourse.org/">Discourse</a> is an open source forum/mailing list hybrid for communities. Open Knowledge runs a Discourse server, and of course, there is a home there for the Open Knowledge Labs community. We hope to move community discussion there going forward, so check out the <a href="https://discuss.okfn.org/c/open-knowledge-labs">Open Knowledge Labs</a> category, sign up, and set your digest preferences.</p>
<h2 id="labs-hangouts">Labs hangouts</h2>
<p>The first Open Knowledge Labs hangout for 2015 was held on April 16th to a full house, and the next one is currently scheduled for May 14. Check out the previous agenda, and planning for the next one, <a href="https://pad.okfn.org/p/labs-hangouts">here at okfnpad</a>.</p>
<h2 id="core-datasets">Core datasets</h2>
<p>Core datasets is a project for collecting and maintaining important and commonly-used (“core”) datasets in high-quality, standardized and easy-to-use form. There has been quite some activity here, with a <a href="http://okfnlabs.org/blog/2015/01/03/data-curators-wanted-for-core-datasets.html">call for data curators</a> (jump in if you are interested!). Currently, 35+ volunteers are contributing, with leadership from super contributor @sxren.</p>
<p>Most action takes place <a href="https://github.com/datasets/registry/issues">here</a> with datasets then appearing on the <a href="http://data.okfn.org/data/">frictionless data site</a>.</p>
<p>Some notable recent contributions include:</p>
<ul>
<li><a href="http://data.okfn.org/data/core/media-types">Media Types</a> (@bluechi and @sxren)</li>
<li><a href="http://data.okfn.org/data/core/membership-to-copyright-treaties">Membership to Copyright Treaties</a> (@bluechi)</li>
<li><a href="http://data.okfn.org/data/core/corruption-perceptions-index">Transparency International - Corruptions Perceptions Index</a></li>
<li><a href="http://data.okfn.org/data/core/top-level-domain-names">Top Level Domain Names</a></li>
<li><a href="http://data.okfn.org/data/core/geo-nuts-administrative-boundaries">NUTS administrative boundaries</a></li>
</ul>
<h2 id="data-package-libraries">Data Package libraries</h2>
<p><a href="http://data.okfn.org/doc/data-package">Data Packages</a> are a simple set of specifications for packaging data. Some great libraries have recently been released (and updated) for working with the Data Package format and related specs such as <a href="http://dataprotocols.org/table-schema/">JSON Table Schema</a>.</p>
<h3 id="dpmr-data-package-management-in-r">dpmr: Data Package management in R</h3>
<p>dpmr is for working with Data Packages in R. <a href="http://christophergandrud.github.io/dpmr/">Check it out here</a>.</p>
<h3 id="datapak-data-package-management-in-ruby">DataPak: Data Package management in Ruby</h3>
<p>DataPak is for working with Data Packages in Ruby, and provides some really nice extras like managing your packages locally, SQL integration and more. Read the announcement <a href="http://okfnlabs.org/blog/2015/04/26/datapak">on the Labs blog</a>, and check out the code <a href="https://github.com/textkit/datapak">here</a>.</p>
<h3 id="data-package-data-package-management-in-python">Data Package: Data Package management in Python</h3>
<p>Data Package, and Budget Data Package, are Python packages for working with Data Packages. These libraries have been around for a while, but recently were updated to add Python 3 support. Check out Data Package <a href="https://github.com/tryggvib/datapackage">here</a>, and Budget Data Package <a href="https://github.com/tryggvib/budgetdatapackage">here</a>.</p>
<h3 id="jtskit-working-with-json-table-schema-in-python">JTSKit: Working with JSON Table Schema in Python</h3>
<p><a href="http://dataprotocols.org/table-schema/">JSON Table Schema</a> is a specification for declaring schemas for data, and is used within Data Packages. JTSKit is a Python library for working with JSON Table Schema, providing interfaces for validating schemas, inferring schema from data, and a schema model class for easy use in Python code. Check it out <a href="https://github.com/okfn/jtskit-py">here</a>.</p>
<h2 id="ocr-pdf-to-text">OCR PDF to Text</h2>
<p>A new web service is available via Labs for converting documents (eg: PDF) to text using OCR. Read the announcement <a href="http://okfnlabs.org/blog/2015/02/21/documents-to-text.html">here</a>, and check out the code <a href="https://github.com/mattfullerton/tika-tesseract-docker">here</a>.</p>
<h2 id="goodtables">GoodTables</h2>
<p>GoodTables is a web service (and Python library/CLI) for validating tabular data. Read more about it in the announcement <a href="http://okfnlabs.org/blog/2015/03/06/goodtables-web-service.html">here</a>, check out the web service <a href="http://goodtables.okfnlabs.org/">here</a>, and the library <a href="https://github.com/okfn/goodtables">here</a>.</p>
<h2 id="databaker">Databaker</h2>
<p>ScraperWiki have released a new library for getting data out of spreadsheets. Read the announcement <a href="https://blog.scraperwiki.com/2015/03/databaker-making-spreadsheets-usable/">here</a>, and check out the code <a href="https://github.com/scraperwiki/databaker">here</a>.</p>
<h2 id="council-data-visualisations-and-standards">Council data visualisations and standards</h2>
<p>Steve Bennett of Open Knowledge Australia has been doing some awesome work standardising and visualising council data in Victoria, Australia. He’s hoping to gain wider adoption of the standards that are emerging, in Australia and beyond. The standardisation work is happening <a href="https://github.com/OKFNau/open-council-data">here</a>, on the OKFNAU repository on GitHub. See some of the data visualised on the <a href="http://openbinmap.org/">Open Bin Map</a> and <a href="http://opentrees.org/">Open Trees</a>.</p>
<h2 id="new-data-portal-for-washington-dc">New data portal for Washington DC</h2>
<p>Washington, DC’s <a href="http://opendata.dc.gov/">data catalog has a new home</a>. It operates on the ArcGIS Open Data platform and houses data relevant to city services in a variety of formats and with built-in APIs. The service is run out of the DC Office of the Chief Technology Officer, who have been quite responsive to issues and requests. You can give them a shout on Twitter as @opendatadc. Old datasets are still accessible <a href="http://legacy.data.dc.gov/">here</a> as they transition to the new site.</p>
<h2 id="remote-data-access-wrapper-for-the-nomis-api">Remote data access wrapper for the Nomis API</h2>
<p>Here’s an <a href="http://blog.ouseful.info/2015/03/09/sketching-out-a-python-pandas-remote-data-access-wrapper-for-the-nomis-api/">interesting blog post</a> detailing work in Python/Pandas over the Nomis API, coming out of work Tony Hirst is doing teaching data wrangling for the UK Cabinet Office.</p>
<h2 id="get-involved">Get involved</h2>
<p>Anyone can join the Labs community and get involved! Read more about how you can <a href="http://okfnlabs.org/join/">join the community</a> and participate by coding, wrangling data, or doing outreach and engagement. Also check out the <a href="http://okfnlabs.org/ideas/">ideas page</a> to see what’s cooking in the Labs, and the <a href="http://okfnlabs.org/newsletter.html">newsletter page</a> if you have items to submit to the next newsletter.</p>
Paul Walsh
Introducing datapak - Work with Tabular Data Packages using Ruby and ActiveRecord
2015-04-26T00:00:00+00:00
http://okfnlabs.org/blog/2015/04/26/datapak
<p><a href="http://data.okfn.org/doc/tabular-data-package">Tabular data packages</a>
are a pragmatic way of both publishing your own data and consuming the
data that others share with the world. The newly published
<a href="https://rubygems.org/gems/datapak">datapak</a> is a Ruby library that
lets you work with tabular data packages using ActiveRecord
and, thus, your SQL database of choice (by default the library
uses an in-memory SQLite database).</p>
<h2 id="using-datapak">Using datapak</h2>
<p>Let’s try using the datapak gem in a simple example that pulls a
<a href="http://data.okfn.org/data/core/s-and-p-500-companies">list of S&P 500 companies</a>
from the Frictionless Data <a href="http://data.okfn.org/data">dataset registry</a>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>require 'datapak'
Datapak.import(
's-and-p-500-companies'
)
</code></pre></div></div>
<p>Using <code class="language-plaintext highlighter-rouge">Datapak.import</code> will:</p>
<p>1) download all data packages to the <code class="language-plaintext highlighter-rouge">./pak</code> folder</p>
<p>2) (auto-)add all tables to an in-memory SQLite database using SQL <code class="language-plaintext highlighter-rouge">create_table</code>
commands via <code class="language-plaintext highlighter-rouge">ActiveRecord</code> migrations e.g.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>create_table :constituents_financials do |t|
t.string :symbol # Symbol (string)
t.string :name # Name (string)
t.string :sector # Sector (string)
t.float :price # Price (number)
t.float :dividend_yield # Dividend Yield (number)
t.float :price_earnings # Price/Earnings (number)
t.float :earnings_share # Earnings/Share (number)
t.float :book_value # Book Value (number)
t.float :_52_week_low # 52 week low (number)
t.float :_52_week_high # 52 week high (number)
t.float :market_cap # Market Cap (number)
t.float :ebitda # EBITDA (number)
t.float :price_sales # Price/Sales (number)
t.float :price_book # Price/Book (number)
t.string :sec_filings # SEC Filings (string)
end
</code></pre></div></div>
<p>3) (auto-)import all records using SQL inserts e.g.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>INSERT INTO constituents_financials
(symbol,
name,
sector,
price,
dividend_yield,
price_earnings,
earnings_share,
book_value,
_52_week_low,
_52_week_high,
market_cap,
ebitda,
price_sales,
price_book,
sec_filings)
VALUES
('MMM',
'3M Co',
'Industrials',
162.27,
2.11,
22.28,
7.284,
25.238,
123.61,
162.92,
104.0,
8.467,
3.28,
6.43,
'http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=MMM')
</code></pre></div></div>
<p>4) (auto-)add <code class="language-plaintext highlighter-rouge">ActiveRecord</code> models for all tables.</p>
<p>Now you can use all the “magic” of <code class="language-plaintext highlighter-rouge">ActiveRecord</code> to work
with the datasets. Example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class Constituent < ActiveRecord::Base
end
pp Constituent.count
# SELECT COUNT(*) FROM "constituents"
# => 496
pp Constituent.first
# SELECT "constituents".* FROM "constituents" ORDER BY "constituents"."id" ASC LIMIT 1
# => #<Constituent:0x9f8cb78
id: 1,
symbol: "MMM",
name: "3M Co",
sector: "Industrials">
pp Constituent.find_by!( symbol: 'MMM' )
# SELECT "constituents".*
FROM "constituents"
WHERE "constituents"."symbol" = "MMM"
LIMIT 1
# => #<Constituent:0x9f8cb78
id: 1,
symbol: "MMM",
name: "3M Co",
sector: "Industrials">
pp Constituent.find_by!( name: '3M Co' )
# SELECT "constituents".*
FROM "constituents"
WHERE "constituents"."name" = "3M Co"
LIMIT 1
# => #<Constituent:0x9f8cb78
id: 1,
symbol: "MMM",
name: "3M Co",
sector: "Industrials">
pp Constituent.where( sector: 'Industrials' ).count
# SELECT COUNT(*) FROM "constituents"
WHERE "constituents"."sector" = "Industrials"
# => 63
pp Constituent.where( sector: 'Industrials' ).all
# SELECT "constituents".*
FROM "constituents"
WHERE "constituents"."sector" = "Industrials"
# => [#<Constituent:0x9f8cb78
id: 1,
symbol: "MMM",
name: "3M Co",
sector: "Industrials">,
#<Constituent:0xa2a4180
id: 8,
symbol: "ADT",
name: "ADT Corp (The)",
sector: "Industrials">,...]
</code></pre></div></div>
<h3 id="how-to-manually-download-a-data-package">How to manually download a data package</h3>
<p>Use the <code class="language-plaintext highlighter-rouge">Datapak::Downloader</code> class to download a data package
to your disk (by default data packages get stored in <code class="language-plaintext highlighter-rouge">./pak</code>).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dl = Datapak::Downloader.new
dl.fetch( 'language-codes' )
dl.fetch( 's-and-p-500-companies' )
dl.fetch( 'un-locode' )
</code></pre></div></div>
<p>Will result in:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- pak
|-- language-codes
| |-- data
| | |-- language-codes-3b2.csv
| | |-- language-codes.csv
| | `-- language-codes-full.csv
| `-- datapackage.json
|-- s-and-p-500-companies
| |-- data
| | |-- constituents.csv
| | `-- constituents-financials.csv
| `-- datapackage.json
`-- un-locode
|-- data
| |-- code-list.csv
| |-- country-codes.csv
| |-- function-classifiers.csv
| |-- status-indicators.csv
| `-- subdivision-codes.csv
`-- datapackage.json
</code></pre></div></div>
<h3 id="how-to-manually-add-and-import-a-data-package">How to manually add and import a data package</h3>
<p>Use the <code class="language-plaintext highlighter-rouge">Datapak::Pak</code> class to read a data package and import it into
an SQL database.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pak = Datapak::Pak.new( './pak/un-locode/datapackage.json' )
pak.tables.each do |table|
table.up! # (auto-) add table using SQL create_table via ActiveRecord migration
table.import! # import all records using SQL inserts
end
</code></pre></div></div>
<p>That’s it.</p>
<h2 id="bonus-how-to-connect-to-a-different-sql-database">Bonus: How to connect to a different SQL database</h2>
<p>You can connect to any database supported by ActiveRecord. If you do not
establish a connection in your script, the default fallback
is an in-memory SQLite3 database.</p>
<h3 id="sqlite">SQLite</h3>
<p>For example, to create an SQLite3 database on disk, let’s say <code class="language-plaintext highlighter-rouge">datapak.db</code>,
use in your script (before the <code class="language-plaintext highlighter-rouge">Datapak.import</code> statement):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ActiveRecord::Base.establish_connection( adapter: 'sqlite3',
database: './datapak.db' )
</code></pre></div></div>
<h3 id="postgresql">PostgreSQL</h3>
<p>For example, to connect to a PostgreSQL database, use in your script
(before the <code class="language-plaintext highlighter-rouge">Datapak.import</code> statement):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>require 'pg' ## pull in PostgreSQL (pg) machinery
ActiveRecord::Base.establish_connection( adapter: 'postgresql',
username: 'ruby',
password: 'topsecret',
database: 'database' )
</code></pre></div></div>
<h2 id="find-out-more">Find Out More</h2>
<p>datapak</p>
<ul>
<li>home :: <a href="https://github.com/textkit/datapak">github.com/textkit/datapak</a></li>
<li>gem :: <a href="https://rubygems.org/gems/datapak">rubygems.org/gems/datapak</a></li>
<li>rdoc :: <a href="http://rubydoc.info/gems/datapak">rubydoc.info/gems/datapak</a></li>
</ul>
<p>Tabular Data Package</p>
<ul>
<li>spec :: <a href="http://dataprotocols.org/tabular-data-package">dataprotocols.org/tabular-data-package</a></li>
<li>datasets :: <a href="http://data.okfn.org/data">data.okfn.org/data</a></li>
</ul>
Gerald Bauer
The Good Tables web service
2015-03-06T00:00:00+00:00
http://okfnlabs.org/blog/2015/03/06/goodtables-web-service
<h1 id="introducing-the-good-tables-web-service">Introducing the Good Tables web service</h1>
<p>Good Tables is a free online service that helps you find out if your tabular data is actually good to use - it can check for structural problems (blank rows and columns) as well as ensure that data fits a specific schema.</p>
<p>Tabular data in CSV and Excel formats is one of the most common forms of data available on the web - especially when looking at <a href="http://okfn.org/opendata/">open data</a>. Unfortunately, much of that data is messy, with blank and incorrect rows, and unexpected values in some fields. (For example, date columns that do not feature well-formed dates. <a href="http://okfnlabs.org/bad-data/">See here for more examples of “bad data”</a>.)</p>
<p>That’s where Good Tables comes in: it checks your data for you, giving you quick and simple feedback on where your tabular data may not yet be quite perfect.</p>
<p>Good Tables uses the <a href="http://okfnlabs.org/blog/2015/02/20/introducing-tabular-validator.html">previously announced</a> <a href="https://github.com/okfn/goodtables">Good Tables Python library</a>, and is developed by <a href="https://okfn.org">Open Knowledge</a> with funding from the <a href="https://www.gov.uk/government/groups/open-data-user-group">Open Data User Group</a>.</p>
<p>Good Tables is currently an alpha release; we invite the community to start using and contributing to it to help us move towards v1.0.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/f1bTx6Zaotk" frameborder="0" allowfullscreen=""></iframe>
<h2 id="api">API</h2>
<p>The documentation for the API can be found <a href="http://goodtables.okfnlabs.org/api">here</a>.</p>
<p>Using the API is easy: POST or GET your data, and get back a JSON object containing the report.</p>
<p>For example:</p>
<pre><code>
# make a request
curl http://goodtables.okfnlabs.org/api/run --data "data=https://raw.githubusercontent.com/okfn/goodtables/master/examples/row_limit_structure.csv&schema=https://raw.githubusercontent.com/okfn/goodtables/master/examples/test_schema.json"
# the response will be like
{
"report": {
"summary": {
"bad_row_count": 1,
"total_row_count": 10,
...
},
"results": [
{
"result_id": "structure_001", # the ID of this result type
"result_level": "error", # the severity of this result type (info/warning/error)
"result_message": "Row 1 is defective: there are more cells than headers", # a message that describes the result
"result_name": "Defective Row", # a human-readable title for this result
"result_context": ['38', 'John', '', ''], # the row values from which this result triggered
"row_index": 1, # the idnex of the row
"row_name": "", # If the row has an id field, this is displayed, otherwise empty
"column_index": 4, # the index of the column
"column_name": "" # the name of the column (the header), if applicable
},
...
]
}
}
</code></pre>
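<p>The same request is a one-liner from Python - this mirrors the curl call above using the requests library:</p>
<pre><code>
import requests

# Same endpoint and parameters as the curl example above.
response = requests.post("http://goodtables.okfnlabs.org/api/run", data={
    "data": "https://raw.githubusercontent.com/okfn/goodtables/master/examples/row_limit_structure.csv",
    "schema": "https://raw.githubusercontent.com/okfn/goodtables/master/examples/test_schema.json",
})
summary = response.json()["report"]["summary"]
print(summary["bad_row_count"], "bad rows out of", summary["total_row_count"])
</code></pre>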
<p>For more details on the report response, see the <a href="https://goodtables.readthedocs.io/en/latest/reports.html">report section</a> of the <a href="https://goodtables.readthedocs.io/en/latest/index.html">Good Tables documentation</a>.</p>
<h2 id="ui">UI</h2>
<p>The web service also features a form for manual validation of data via a UI.</p>
<p>Let’s see it in action:</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/f1bTx6Zaotk" frameborder="0" allowfullscreen=""></iframe>
<iframe width="560" height="315" src="https://www.youtube.com/embed/hblUuIjobrc" frameborder="0" allowfullscreen=""></iframe>
<h2 id="contributions">Contributions</h2>
<p>We invite all contributions. Feel free to <a href="https://github.com/okfn/goodtables-web/issues">open an issue</a> if you encounter any problems, or just start hacking and send a pull request.</p>
Paul Walsh
A public web service for document to text conversion including OCR
2015-02-21T00:00:00+00:00
http://okfnlabs.org/blog/2015/02/21/documents-to-text
<h2 id="getting-text-out-of-documents">Getting text out of documents</h2>
<p>Last year I was working on <a href="http://beta.offenedaten.de">beta.offenedaten.de</a>, a catalog of data catalogs in Germany using the <a href="http://www.ckan.org/">CKAN</a> platform as the basis. Although the topic of <a href="https://lists.okfn.org/pipermail/ckan-dev/2014-September/008051.html">how to enable full-text search of documents in CKAN data catalogs</a> is somewhat open, I wanted to be able to collect the full text of open data resources for searching. We can’t assume that PDFs are always nice PDFs full of text: they can just as easily be scans of paper documents without any optical character recognition (OCR) having taken place. So when we extract text from documents, it would be nice to have an option to do OCR too. This is a need common to other projects we have at <a href="http://www.okfn.de">OKF Germany</a>, and, after discussion on the <a href="https://lists.okfn.org/pipermail/okfn-labs/2014-October/001491.html">Labs list</a>, apparently something people would like to have.</p>
<h2 id="lend-me-your-files-i-send-you-back-text">Lend me your files, I send you back text</h2>
<p>In short, there is now a web service available for converting <a href="http://tika.apache.org/1.8/formats.html">a multitude of document types</a> to simple text. It lives at:</p>
<p>http://beta.offenedaten.de:9998/tika</p>
<p>To test it, just throw some images with text in them at it. For example, on a terminal on Mac or Linux:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -T tiff_example.tif http://beta.offenedaten.de:9998/tika
</code></pre></div></div>
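<p>The same thing from Python, for anyone scripting against the service: curl’s <code class="language-plaintext highlighter-rouge">-T</code> flag performs an HTTP PUT of the raw file body, so we do the same with the requests library:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import requests

# curl -T is an HTTP PUT of the file contents; Accept asks for plain text.
with open("tiff_example.tif", "rb") as f:
    response = requests.put("http://beta.offenedaten.de:9998/tika",
                            data=f,
                            headers={"Accept": "text/plain"})
print(response.text)  # the extracted (possibly OCR'd) text
</code></pre></div></div>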
<h3 id="how-it-was-built">How it was built</h3>
<p>My involvement in the code for this project was zero. I just took the <a href="http://wiki.apache.org/tika/TikaJAXRS">web server part</a> of the developer version of the <a href="http://tika.apache.org/">Apache Tika Project</a> and put it on a server. OCR support using <a href="https://code.google.com/p/tesseract-ocr/">Tesseract</a> has <a href="http://wiki.apache.org/tika/TikaOCR">recently been added</a> to Tika.</p>
<h3 id="roll-your-own">Roll your own</h3>
<p>For intensive use of the service or to include it in your own infrastructure, you can use this <a href="https://registry.hub.docker.com/u/mattfullerton/tika-tesseract-docker/">Docker image</a>, built on this <a href="https://github.com/mattfullerton/tika-tesseract-docker">GitHub Repository</a>. In case you don’t know <a href="https://www.docker.com/whatisdocker/">what Docker is</a>, don’t ask me, as I won’t do a great job of explaining it to you. I’m sure there’s a few Docker experts out there who could improve the Dockerfile setup: pull requests <a href="https://github.com/mattfullerton/tika-tesseract-docker">on GitHub</a> are welcome!</p>
<h3 id="improvements">Improvements</h3>
<p>The big missing feature from this, and from Tika generally, is the ability to perform OCR on a PDF when little or no text comes back. There is <a href="https://github.com/okfn/ideas/issues/88#issuecomment-71388714">a trick to get the OCR on a PDF</a>, but your application will need to decide when to employ it, for example based on the non-OCR results.</p>
<h3 id="get-involved">Get involved</h3>
<p>A quick look at the <a href="https://github.com/okfn/ideas/issues/88">discussion on GitHub</a> shows how many ideas there are floating around to improve open document processing tooling on the web. This is just one tiny piece of that puzzle. More concretely, it would be great to get some Open Knowledge involvement in the Tika project going to support them, particularly with the “no text found in PDF” conundrum above. Just <a href="http://tika.apache.org/contribute.html">get in touch with them</a> directly or with me via the <a href="https://github.com/okfn/ideas/issues/88">GitHub issue</a> or <a href="mailto:matt.fullerton@gmail.com">old-fashioned email</a>.</p>
<h2 id="avoiding-the-ocr-problem-in-the-first-place">Avoiding the OCR problem in the first place</h2>
<p>I thought it might be worth mentioning to anyone involved in putting open data and open documents on the web that there <a href="http://computers.tutsplus.com/tutorials/how-to-ocr-text-in-pdf-and-image-files-in-adobe-acrobat--cms-20406">is a procedure for adding the text to a scan-based PDF</a>, using Adobe Acrobat. If anyone knows of an open source solution for this (i.e. embedding and attaching the OCR text in the images in the PDF), I would love to <a href="mailto:matt.fullerton@gmail.com">hear from you</a>.</p>
Matt Fullerton
Introducing Good Tables
2015-02-20T00:00:00+00:00
http://okfnlabs.org/blog/2015/02/20/introducing-goodtables
<h2 id="what-is-it">What is it?</h2>
<p><a href="https://github.com/okfn/goodtables">Good Tables</a> is a Python package for validating tabular data through a processing pipeline.</p>
<p>It is built by <a href="https://okfn.org">Open Knowledge</a>, with funding from the <a href="https://www.gov.uk/government/groups/open-data-user-group">Open Data User Group</a>. Good Tables is currently an <em>alpha release</em>.</p>
<p>Applications range from simple validation checks on CSV files, to integration with a larger ETL pipeline.</p>
<p>The codebase currently ships with two validators that can be used in a pipeline:</p>
<ul>
<li>The <a href="https://github.com/okfn/goodtables/blob/master/goodtables/processors/structure.py">StructureProcessor</a> checks for common structural errors</li>
<li>The <a href="https://github.com/okfn/tabular-validator/blob/master/goodtables/processors/schema.py">SchemaProcessor</a> checks for conformance to a JSON Table Schema</li>
</ul>
<p>There is a hook to add custom processors, and there are plans to include more processors in the core library.</p>
<p>Good Tables ships with <a href="https://goodtables.readthedocs.io/en/latest/">some documentation</a>, but it is not yet complete. You are welcome to <a href="https://github.com/okfn/goodtables">check out the code</a>, <a href="https://github.com/okfn/goodtables/blob/master/test.sh">run the tests</a> (or <a href="https://travis-ci.org/okfn/goodtables">check them on Travis</a>), <a href="https://github.com/okfn/goodtables/issues">open an issue</a>, or make a pull request to help us iterate to a version one release (<a href="https://github.com/okfn/goodtables/milestones/Backlog">here is the backlog</a>).</p>
<p>We have also released some packages that are used in Good Tables: <a href="https://github.com/okfn/goodtables-web">Good Tables Web</a>, <a href="https://github.com/okfn/jtskit-py">JTSKit</a>, and <a href="https://github.com/okfn/tellme">TellMe</a>. You can read more about each of these below.</p>
<h2 id="why">Why?</h2>
<p>The development of Good Tables has been driven by a real-world pain point: monitoring and validating government spending data in the United Kingdom (the dashboard for this project is under development <a href="https://github.com/okfn/spend-publishing-dashboard">here</a>). A brief overview of this use case can demonstrate the value proposition of Good Tables.</p>
<h3 id="the-problem">The Problem</h3>
<p>In the UK, various government departments publish spend data. This data is required to be accessible: that is, machine-readable and publicly available. Additionally, the data must conform to a schema.</p>
<p>Monitoring the publication of such data, and validating its well-formedness, is a difficult task. The data is produced under a wide variety of circumstances (e.g. with very different resources available), and the producers of this data have no tools at hand to confirm that their work is correct.</p>
<p>Considering that spend data is produced at regular periodic intervals, and departments are expected to publish in a timely manner, the problem of producing well-formed data is compounded.</p>
<h3 id="the-solution">The Solution</h3>
<p>Good Tables provides part of the solution with tooling to ensure data is machine readable and well formed. All spend data across the various government departments is collected and run through a Good Tables pipeline at regular intervals.</p>
<p>The validation pipeline for this data looks something like the following:</p>
<ul>
<li>Is the file readable as CSV?</li>
<li>Are there headers in the first line of the file?</li>
<li>Are there any empty headers, empty rows, or ragged rows?</li>
<li>Do all the values in the file conform with the expected schema (columns of numbers, dates, etc.)?</li>
</ul>
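<p>A rough sketch of what the structural part of such a pipeline looks for, in plain Python (standard library only, independent of Good Tables; every name below is ours, not the library’s):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustration only: hand-rolled versions of the structural checks above.
# Good Tables implements these (and more) in its StructureProcessor.
import csv

def check_structure(path):
    errors = []
    with open(path) as f:
        reader = csv.reader(f)
        try:
            headers = next(reader)
        except StopIteration:
            return ["file is empty: no header row"]
        if any(not h.strip() for h in headers):
            errors.append("empty header(s) in the first row")
        for lineno, row in enumerate(reader, start=2):
            if not any(cell.strip() for cell in row):
                errors.append("row %d is empty" % lineno)
            elif len(row) != len(headers):
                errors.append("row %d is ragged (%d cells, expected %d)"
                              % (lineno, len(row), len(headers)))
    return errors

print(check_structure("spend-data.csv"))
</code></pre></div></div>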
<p>Any errors detected while the pipeline is running are written into a report. When the pipeline finishes running, a user-facing report is generated, providing actionable data on what exactly is wrong with the file (so the data producers can take steps to fix such errors).</p>
<h2 id="how-can-i-use-it-now">How can I use it now?</h2>
<p>If you are running Python 2.7, 3.3 or 3.4, you can start using Good Tables today.</p>
<p>As mentioned above, this is an <em>alpha release</em>. Still, we have decent test coverage, and we are hoping to uncover bugs and weirdness through wider usage.</p>
<p>Here’s how you can use Good Tables right now:</p>
<h3 id="in-existing-code-bases">In existing code bases</h3>
<p>See some examples in the <a href="https://github.com/okfn/goodtables/tree/master/tests">test suite</a> to get a working idea of the API and how you could integrate a Good Tables pipeline, or stand-alone processor, into your existing workflow with tabular data.</p>
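<p>For orientation before diving into the tests, the basic shape of the API is roughly as follows. Treat the module path and signatures as assumptions to verify against the test suite, since this is an alpha release and the interface may shift:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Rough sketch of running a pipeline from code -- verify the exact
# names against the test suite before relying on them.
from goodtables.pipeline import Pipeline

pipeline = Pipeline('spend-data.csv', processors=('structure', 'schema'))
valid, report = pipeline.run()

if not valid:
    # the report carries actionable detail for data producers
    print(report)
</code></pre></div></div>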
<h3 id="as-a-cli">As a CLI</h3>
<p>If you are doing data wrangling in the terminal, Good Tables comes with a CLI called “goodtables”. See <a href="https://github.com/okfn/goodtables/blob/master/goodtables/cli/main.py">here</a> for the CLI interface. This is still very much a work in progress, and currently exposes a subset of the Good Tables pipeline interface.</p>
<h3 id="via-the-web">Via the web</h3>
<p>The <a href="https://github.com/okfn/goodtables-web">Good Tables Web</a> package provides both a Web API and a simple form UI over Good Tables. Read about the current API <a href="https://github.com/okfn/goodtables-web/blob/master/README.md">here</a>.</p>
<h2 id="extra-goodies">Extra goodies</h2>
<p>Good Tables has been developed as part of a larger project, and we are pulling functionality out into standalone packages where possible and practical.</p>
<p>Like the Good Tables package, these are all <em>alpha releases</em>, but each has a passing test suite on Python 2.7, 3.3, and 3.4.</p>
<h3 id="tellme">TellMe</h3>
<p><a href="https://github.com/okfn/tellme">TellMe</a> is a Python package for creating user-facing reports from things happening in code. It is a simple library that provides a logger-like interface to build reports, and then generate them in several output formats.</p>
<h3 id="jtskit">JTSKit</h3>
<p><a href="https://github.com/okfn/jtskit-py">JTSKit</a> is a Python package providing a set of utilities for working with JSON Table Schema.</p>
<h3 id="good-tables-web">Good Tables Web</h3>
<p><a href="https://github.com/okfn/goodtables-web">Good Tables Web</a> is a Flask application that provides a Web API over Good Tables, as well as a simple form UI.</p>
Paul Walsh
Wanted - Data Curators to Maintain Key Datasets in High-Quality, Easy-to-Use and Open Form
2015-01-03T00:00:00+00:00
http://okfnlabs.org/blog/2015/01/03/data-curators-wanted-for-core-datasets
<p>Wanted: volunteers to join a team of “Data Curators” maintaining <strong>“core” datasets</strong> (like GDP or ISO-codes) in <strong>high-quality, easy-to-use and open</strong> form.</p>
<ul>
<li><strong>What is the project about</strong>: Collecting and maintaining important and commonly-used (“core”) datasets in high-quality, standardized and easy-to-use form - in particular, as up-to-date, well-structured <a href="http://data.okfn.org/doc/data-package/">Data Packages</a>.<br />
The “Core Datasets” effort is part of the broader <a href="http://data.okfn.org/">Frictionless Data initiative</a>.</li>
<li><strong>What would you be doing</strong>: identifying and locating core (public) datasets, cleaning and standardizing the data and making sure the results are kept up to date and easy to use</li>
<li><strong>Who can participate</strong>: anyone can contribute. Details on the skills needed are below.</li>
<li><strong>Get involved</strong>: read more below or jump straight to <a href="#sign-up">the sign-up section</a>.</li>
</ul>
<p><img src="http://assets.okfn.org/p/data/img/icon-128.png" alt="" style="display: block; margin: auto;" /></p>
<h2 id="what-is-the-core-datasets-effort">What is the Core Datasets effort?</h2>
<p>Summary: Collect and maintain important and commonly-used (“core”) datasets in high-quality, reliable and easy-to-use form (as Data Packages).</p>
<p>Core = important and commonly-used datasets e.g. reference data (country codes) and indicators (inflation, GDP)</p>
<p>Curate = take existing data and provide it in high-quality, reliable, and easy-to-use form (standardized, structured, open)</p>
<ul>
<li><strong>Full details</strong>: including slide-deck at <a href="http://data.okfn.org/roadmap/core-datasets">http://data.okfn.org/roadmap/core-datasets</a>.</li>
<li><strong>Live examples</strong>: You can find already packaged core datasets at <a href="http://data.okfn.org/data/">http://data.okfn.org/data/</a> and in “raw” form on Github at <a href="https://github.com/datasets/">https://github.com/datasets/</a></li>
</ul>
<iframe src="https://docs.google.com/presentation/d/1-BLImNBv2RtEkFVq_DdWjy05baHfprWHHdXZiMrmihQ/embed?start=false&loop=false&delayms=3000" frameborder="0" width="480" height="389" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>
<h2 id="what-roles-and-skills-are-needed">What Roles and Skills are Needed</h2>
<p>We need a variety of roles from identifying new “core” datasets to packaging the data to performing quality control (checking metadata etc).</p>
<p><strong>Core Skills</strong> - at least one of these skills will be needed:</p>
<ul>
<li><strong>Data Wrangling Experience</strong>. Many of our source datasets are not complex (just an Excel file or similar) and can be “wrangled” in a Spreadsheet program. What we therefore recommend is at least one of:
<ul>
<li>Experience with a Spreadsheet application such as Excel or (preferably) Google Docs including use of formulas and (desirably) macros (you should at least know how you could quickly convert a cell containing ‘2014’ to ‘2014-01-01’ across 1000 rows)</li>
<li>Coding for data processing (especially scraping) in one or more of python, javascript, bash</li>
</ul>
</li>
<li><strong>Data sleuthing</strong> - the ability to dig up data on the web (specific desirable skills: you know how to search by filetype in google, you know where the developer tools are in chrome or firefox, you know how to find the URL a form posts to)</li>
</ul>
<p><strong>Desirable Skills</strong> (the more the better!):</p>
<ul>
<li>Data vs Metadata: know the difference between data and metadata</li>
<li>Familiarity with Git (and Github)</li>
<li>Familiarity with a command line (preferably bash)</li>
<li>Know what JSON is</li>
<li>Mac or Unix is your default operating system (will make access to relevant tools that much easier)</li>
<li>Knowledge of Web APIs and/or HTML</li>
<li>Use of curl or similar command line tool for accessing Web APIs or web pages</li>
<li>Scraping using a command line tool or (even better) by coding yourself</li>
<li>Know what a Data Package and a Tabular Data Package are</li>
<li>Know what a text editor is (e.g. notepad, textmate, vim, emacs, …) and know how to use it (useful for both working with data and for editing Data Package metadata)</li>
</ul>
<p><a name="sign-up" id="sign-up"></a></p>
<h2 id="get-involved---sign-up-now">Get Involved - Sign Up Now!</h2>
<p>We are looking for volunteer contributors to form a “curation team”.</p>
<ul>
<li><strong>Time commitment</strong>: Members of the team commit to at least 8-16h per month (though this will be an average - if you are especially busy with other things one month and do less that is fine)</li>
<li><strong>Schedule</strong>: There is no schedule, so you can contribute at any time that is good for you - evenings, weekends, lunch-times etc</li>
<li><strong>Location</strong>: all activity will be carried out online so you can be based anywhere in the world</li>
<li><strong>Skills</strong>: see above</li>
</ul>
<p>To register your interest fill in the following form. Any questions, please <a href="/contact/">get in touch directly</a>.</p>
<iframe src="https://docs.google.com/forms/d/1d9chMK0jU9CJs0_mnK_JQU9iIJocjm7AEp0ZM5eSiNg/viewform?embedded=true" width="620" height="1425" frameborder="0" marginheight="0" marginwidth="0">Loading...</iframe>
<h3 id="want-to-dive-straight-in">Want to Dive Straight In?</h3>
<p>Can’t wait to get started as a Data Curator? You can dive straight in and start packaging the already-selected (but not packaged) core datasets. Full instructions here:</p>
<p><a href="http://data.okfn.org/roadmap/core-datasets#contribute">http://data.okfn.org/roadmap/core-datasets#contribute</a></p>
Rufus Pollock
A Data API for Data Packages in Seconds Using CKAN and its DataStore
2014-09-11T00:00:00+00:00
http://okfnlabs.org/blog/2014/09/11/data-api-for-data-packages-with-dpm-and-ckan
<p><code class="language-plaintext highlighter-rouge">dpm</code> the command-line ‘data package manager’ now supports pushing <a href="http://data.okfn.org/standards">(Tabular)
Data Packages</a> straight into a <a href="http://ckan.org/">CKAN instance</a> (including
pushing all the data into the <a href="http://docs.ckan.org/en/latest/maintaining/datastore.html">CKAN DataStore</a>):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dpm ckan {ckan-instance-url}
</code></pre></div></div>
<p>This allows you, in seconds, to get a fully-featured web data API – including <a href="http://docs.ckan.org/en/latest/maintaining/datastore.html#ckanext.datastore.logic.action.datastore_search">JSON</a> and
<a href="http://docs.ckan.org/en/latest/maintaining/datastore.html#ckanext.datastore.logic.action.datastore_search_sql">SQL-based</a> query APIs:</p>
<p><img src="http://assets.okfnlabs.org/p/dpm/img/dpm-ckan.gif" alt="dpm ckan demo" /></p>
<p style="text-align: center; font-size: x-small"><a href="http://assets.okfnlabs.org/p/dpm/img/dpm-ckan.gif">View fullsize</a></p>
<p>Once you have a nice web data API like this we can very easily create data-driven applications and visualizations. As a simple demonstration, there’s the <a href="http://dev.rufuspollock.org/ckan-explorer/">CKAN Data Explorer</a> (<a href="http://dev.rufuspollock.org/ckan-explorer/?endpoint=http://datahub.io&resource=ea3926e3-43a8-46d0-832a-e53efd61ebb0">example with IMF data</a> - see below).</p>
<h2 id="where-can-i-find-a-ckan-instance-to-upload-to">Where Can I Find a CKAN instance to Upload to?</h2>
<p>If you’re looking for a CKAN site to upload your Data Packages to, we recommend
the <a href="http://datahub.io/">DataHub</a>, which is community-run and free. To upload to the DataHub
you’ll want to:</p>
<ol>
<li>
<p>Configure the DataHub CKAN instance in your <code class="language-plaintext highlighter-rouge">.dpmrc</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[ckan.datahub]
url = http://datahub.io/
apikey = your-api-key
</code></pre></div> </div>
</li>
<li>
<p>Upload your Data Package</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dpm ckan datahub --owner_org=your-organization
</code></pre></div> </div>
<p>You have to set the owner organization, as all datasets on the DataHub need an
owner organization.</p>
</li>
</ol>
<h2 id="one-i-did-earlier">One I Did Earlier</h2>
<p>Here’s a live example of one “I did earlier”:</p>
<ul>
<li>Here’s the source Data Package: <a href="http://data.okfn.org/data/core/imf-weo">IMF World Economic Outlook in data.okfn.org
registry</a> (<a href="https://github.com/datasets/imf-weo">Data Package on github
(source)</a>)</li>
<li>Get this on your local machine (<code class="language-plaintext highlighter-rouge">dpm install</code> or just clone the github repo)</li>
<li>Then I uploaded it: <code class="language-plaintext highlighter-rouge">dpm ckan http://datahub.io/ --owner_org=rufuspollock</code></li>
<li>Now it’s live on the DataHub: <a href="http://datahub.io/dataset/imf-weo">http://datahub.io/dataset/imf-weo</a>
<ul>
<li>Indicators: <a href="http://datahub.io/dataset/imf-weo/resource/ea3926e3-43a8-46d0-832a-e53efd61ebb0">http://datahub.io/dataset/imf-weo/resource/ea3926e3-43a8-46d0-832a-e53efd61ebb0</a></li>
<li>Values: <a href="http://datahub.io/dataset/imf-weo/resource/24cd8ebe-fa3f-4353-9ad9-d53bd88751a6">http://datahub.io/dataset/imf-weo/resource/24cd8ebe-fa3f-4353-9ad9-d53bd88751a6</a></li>
<li>Note this is a normalized dataset in which there are 2 tables (the
DataStore supports JOINS if we want to put them back together)</li>
</ul>
</li>
<li>Here’s a sample API query to get all indicators related to GDP: <a href="http://datahub.io/api/action/datastore_search?resource_id=ea3926e3-43a8-46d0-832a-e53efd61ebb0&limit=5&q=GDP">http://datahub.io/api/action/datastore_search?resource_id=ea3926e3-43a8-46d0-832a-e53efd61ebb0&limit=5&q=GDP</a></li>
<li>Now the data has a nice web Data API you can easily build data-driven apps or
visualizations. For example, the <a href="http://dev.rufuspollock.org/ckan-explorer/">CKAN Explorer</a> is a simple JS +
HTML app which allows you to explore CKAN DataStore data. Here’s the app
pre-loaded with the <a href="http://dev.rufuspollock.org/ckan-explorer/?endpoint=http://datahub.io&resource=ea3926e3-43a8-46d0-832a-e53efd61ebb0">DataStore indicator data</a></li>
</ul>
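<p>The sample API query above can also be exercised straight from a terminal; piping through <code class="language-plaintext highlighter-rouge">python -m json.tool</code> just pretty-prints the response:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl "http://datahub.io/api/action/datastore_search?resource_id=ea3926e3-43a8-46d0-832a-e53efd61ebb0&limit=5&q=GDP" | python -m json.tool
</code></pre></div></div>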
<p>Context: a big motivation (personally) for doing this is that I’d like to see a
nice web data API available for the <a href="http://data.okfn.org/data/">“Core” Data Packages</a> we’re creating
as part of the <a href="http://data.okfn.org/">Frictionless Data effort</a>. If you’re interested
in helping, <a href="http://discuss.okfn.org/t/data-packages-creating-finding-and-tooling/48">get in touch</a>.</p>
<h2 id="links">Links</h2>
<ul>
<li><a href="http://data.okfn.org/tools/">dpm Homepage</a></li>
<li><a href="https://github.com/okfn/dpm">dpm on Github</a></li>
<li><a href="https://github.com/okfn/datapackage-ckan">data package to ckan (node) library</a></li>
<li>IRC: freenode.net Channel: #okfn</li>
</ul>
Rufus Pollock
Bubbles: Python ETL Framework (prototype)
2014-09-01T00:00:00+00:00
http://okfnlabs.org/blog/2014/09/01/bubbles-python-etl
<h2 id="introduction-and-etl">Introduction and ETL</h2>
<p>The abbreviation <em>ETL</em> stands for <em>extract, transform and load</em>. What is it
good for? For everything between data sources and fancy visualisations. In a
data warehouse, data spend most of their time going through some kind of ETL
before they reach their final state. ETL is mostly automated and
reproducible, and should be designed so that it is not difficult to track
how the data move through the processing pipes.</p>
<p>A data warehouse stands and falls on its ETL.</p>
<h2 id="bubbles">Bubbles</h2>
<p><a href="http://bubbles.databrewery.org">Bubbles</a> is, or rather is meant to be, an
ETL framework written in Python, though not necessarily meant to be used from
Python only. Bubbles is based on metadata describing the data processing
pipeline (the ETL) rather than on a script-based description. The
principles of the framework can be summarized as follows:</p>
<ul>
<li>ETL is described as a data processing pipeline which is a <a href="https://en.wikipedia.org/wiki/Directed_graph">directed
graph</a></li>
<li>Processing operations are nodes in the graph, such as <em>aggregation</em>,
<em>filtering</em>, <em>dataset comparison (diff)</em>, <em>conversion</em>, …</li>
<li>Nodes might have multiple different inputs and a single output (there might
be multiple outgoing connections, but all of them are the same) – the inputs
are considered <em>operands</em> to the operation and the output is the operation
<em>result</em>.</li>
<li>Data do not flow unless it is necessary</li>
</ul>
<p>The pipeline is described in such a way that it is technology agnostic – the
ETL developer, the person who wants data to be processed, does not have to
care about how to access and work with data in a particular data store; they can
just focus on the task: delivering the data in the form it needs to be
delivered.</p>
<h3 id="data-objects-and-data-store">Data Objects and Data Store</h3>
<p>The core of Bubbles are <em>data objects</em> – an abstract concept of datasets which
might have multiple internal representations. What actually flows between the
nodes is not the data itself, but those virtual representations of data and their
compositions. Data are fetched only if really necessary – if there is no
other way to compose the data, such as a join between a database table
and a CSV file.</p>
<p>Here are a few objects with different representations:</p>
<p><img src="http://okfnlabs.org/img/posts/bubbles/bubbles-object_representations.png" alt="Object Representations" /></p>
<p>The objects are:</p>
<ul>
<li>an object which originates from a <em>CSV file</em>: it can be processed mainly using
Python iterators, but it retains its textual CSV nature, just in case some
of the nodes know how to work with it more efficiently, for example
row filtering without actually parsing the CSV into row objects</li>
<li>a <em>SQL object representing a table</em> – it can be composed into other SQL
statements or can be used directly as a Python iterable</li>
<li>a <em>MongoDB collection</em> – similar to the previous SQL table, it can be iterated as
a raw stream of documents</li>
<li>a <em>SQL statement</em> which might be the result of previous operations or our own
complex query. It can be used as a statement and composed with further
operations, or the data can be fetched and iterated over in Python. Since
this SQL object comes from a known database (PostgreSQL in this case) which
implements a <a href="http://www.postgresql.org/docs/current/static/sql-copy.html">COPY</a>
command that generates CSV output, we can treat the object as such and
provide the option to use a CSV representation as well</li>
<li>a <em>Twitter API object</em> – an example of a data object that does not actually
exist for us as a physical table; we do not even know how many
original tables Twitter is feeding us the data from, and we do not have to
care at all. We are just fine with having the impression of an iterable
dataset.</li>
</ul>
<p>To be more concrete, take simple filtering as an example. Say we have a sample
of tweets stored in a SQL database, in
<a href="http://docs.mongodb.org/manual/tutorial/query-documents/">MongoDB</a>
and, obviously, <a href="https://dev.twitter.com/docs/api/1/get/statuses/user_timeline">on Twitter</a>.
We want to get all tweets by OKFN. In SQL we use a SQL driver, connect to the
database and run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT * FROM tweets WHERE screen_name = 'okfn'
</code></pre></div></div>
<p>in Mongo we use a mongodb driver, connect to the database and do:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>db.tweets.find(
    { screen_name: 'okfn' }
)
</code></pre></div></div>
<p>and in Twitter we just issue the following HTTP request:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>https://api.twitter.com/1/statuses/user_timeline.json?screen_name=okfn
</code></pre></div></div>
<p>We asked for the same data object – <em>a tweet</em> – in three different data stores,
and we had to use three different approaches. That does not look too bad to us, “the
tech people”. But what if we could just write:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>p = Pipeline(...)
p.source("data", "tweets")
p.filter_value("screen_name", "okfn")
p.pretty_print()
p.run()
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">"data"</code> is the data store. We as ETL designers do not have to worry about what
kind of data store it is, how to talk to it, or how to get data from it.</p>
<p>Now we would like to count the tweets, so let us add the <code class="language-plaintext highlighter-rouge">aggregate()</code> operation,
which by default yields only the record count:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>p = Pipeline(...)
p.source("data", "tweets")
p.filter_value("screen_name", "okfn")
p.aggregate()
p.pretty_print()
p.run()
</code></pre></div></div>
<p>What happens here? For example, in the SQL case the <code class="language-plaintext highlighter-rouge">COUNT()</code> aggregation
function will be used. For Twitter, because our backend does not know better,
all the tweets will have to be pulled from the Twitter API and counted
one by one. This is sad, but good for our example: the objective was to
deliver the desired result, and that happened.</p>
<h3 id="context">Context</h3>
<p>One thing is missing in my examples above: <code class="language-plaintext highlighter-rouge">Pipeline(...)</code> – the pipeline
works in a context. We need to provide the description of data stores. For
example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>stores = { "data": {"type": "sql", "url": "postgresql://localhost/twitter" }}
</code></pre></div></div>
<p>Stores have an interface for getting datasets by name or creating new
datasets. A dataset might be:</p>
<ul>
<li><em>table</em> in a <em>SQL</em> store</li>
<li><em>collection</em> in a <em>MongoDB</em> store</li>
<li><em>CSV file</em> in a store represented by a directory of CSV files</li>
<li>a <em>newline-delimited JSON</em> file in a store represented by a directory of
such JSON files</li>
<li>resource collection over an API, such as the Twitter example above</li>
<li>dataset from a <a href="http://data.okfn.org/doc/data-package">datapackage</a></li>
</ul>
<p>The ETL designer should not have to care about the underlying implementation; they should
care only about having “a set of data that looks like a table”. A store object
responds to methods such as <code class="language-plaintext highlighter-rouge">object_names()</code> or <code class="language-plaintext highlighter-rouge">get_object(name)</code>.</p>
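<p>As a sketch of how that interface feels in use (the <code class="language-plaintext highlighter-rouge">open_store</code> helper and the surrounding scaffolding are assumptions for illustration – check the Bubbles documentation for the exact calls):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustration of "a set of data that looks like a table".
# object_names() and get_object(name) are the methods mentioned above;
# the rest is made up for the example.
from bubbles import open_store  # assumed import path

store = open_store("sql", url="postgresql://localhost/twitter")

print(store.object_names())          # e.g. ['tweets', 'users']

tweets = store.get_object("tweets")
for row in tweets:                   # every dataset is at least iterable
    print(row)
</code></pre></div></div>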
<h3 id="operations">Operations</h3>
<p>The ETL operations work on data objects provided as operands. An operation
returns another data object. As mentioned above, the flow of data is just
virtual. That means that when we are filtering the data, the framework might
actually be composing a SQL <code class="language-plaintext highlighter-rouge">WHERE</code> statement instead of just pulling the data
out of the database and filtering it row by row in Python.</p>
<p>It is similar with fields in the dataset – if we want to keep just certain columns,
why pass them all around in the first place? Why not ask only for those
that we actually need at the end? That is what Bubbles should do. Therefore
the <code class="language-plaintext highlighter-rouge">keep_fields()</code> operation just selects certain columns when used in the
SQL context.</p>
<p>There might be multiple implementations of the same operation. Which
implementation (function) is used is determined at the time of pipeline
execution. <code class="language-plaintext highlighter-rouge">aggregate()</code> might be in-python row-by-row aggregation using a
dictionary or it might be <code class="language-plaintext highlighter-rouge">SUM()</code> or <code class="language-plaintext highlighter-rouge">AVG()</code> with <code class="language-plaintext highlighter-rouge">GROUP BY</code> statement in SQL,
depending on which kind of object is passed to the operation.</p>
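<p>The mechanics of that choice can be pictured as a registry of implementations keyed by data representation. This is a conceptual sketch of the dispatch idea only, not Bubbles’ actual internals:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Conceptual sketch of representation-based dispatch -- not Bubbles'
# actual internals. One operation, one implementation per representation;
# the engine picks an implementation at execution time.
OPERATIONS = {}

def operation(name, representation):
    """Register an implementation of `name` for a representation."""
    def register(func):
        OPERATIONS[(name, representation)] = func
        return func
    return register

@operation("aggregate", "sql")
def aggregate_sql(obj):
    # compose a COUNT() over the object's SQL statement
    return "SELECT COUNT(*) FROM (%s) AS t" % obj["statement"]

@operation("aggregate", "rows")
def aggregate_rows(obj):
    # fall back to counting row by row in Python
    return sum(1 for _ in obj["rows"])

def apply_operation(name, obj):
    # prefer the object's most capable representation first
    for representation in obj["representations"]:
        impl = OPERATIONS.get((name, representation))
        if impl is not None:
            return impl(obj)
    raise TypeError("no implementation of %r for %r" % (name, obj))

sql_obj = {"representations": ["sql", "rows"],
           "statement": "SELECT * FROM tweets"}
print(apply_operation("aggregate", sql_obj))
</code></pre></div></div>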
<p>In the following image you can see how the most appropriate operation is
chosen for you depending on the data source. You can also see that for
certain representations the operations are combined to produce just a
single data query for the source system:</p>
<p><img src="http://okfnlabs.org/img/posts/bubbles/bubbles-operation_representations.png" alt="Operations and Object
Representations" /></p>
<h2 id="examples">Examples</h2>
<p><a href="https://gist.github.com/Stiivi/5937938">Here is an example</a> of the Bubbles
framework in action: “list customer details of customers who ordered something
between 2011 and 2013”. Note that the source is a directory of
CSV files. For comparison, in the SQL example we <code class="language-plaintext highlighter-rouge">create()</code> a table, so the
rest of the pipeline will happen as SQL, not in Python.</p>
<p><a href="https://gist.github.com/Stiivi/5907305">Another example</a> shows aggregation
and the joining of details.</p>
<p><a href="https://gist.github.com/Stiivi/9104719">An example</a> that uses a data package
(<a href="http://data.okfn.org/doc/data-package">according to spec</a>) as a data store:</p>
<p>The pipeline looks like this:</p>
<p><img src="http://okfnlabs.org/img/posts/bubbles/bubbles-join_example.png" alt="Pipeline Example" /></p>
<p>The Python source code for the pipeline:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Aggregate population per independence type for every year
# Sources: Population and Country Codes datasets
#
from bubbles import Pipeline

# List of stores with datasets. In this example we are using the "datapackage"
# store
stores = {
    "source": {"type": "datapackages", "url": "."}
}

p = Pipeline(stores=stores)

# Set the source dataset
p.source("source", "population")

# Prepare another dataset and keep just relevant fields
cc = p.fork(empty=True)
cc.source("source", "country-codes")
cc.keep_fields(["ISO3166-1-Alpha-3", "is_independent"])

# Join them – left inner join against the country-codes fork
p.join_details(cc, "Country Code", "ISO3166-1-Alpha-3")

# Aggregate Value by status and year
p.aggregate(["is_independent", "Year"],
            [["Value", "sum"]],
            include_count=True)

# Sort for nicer output...
p.sort(["is_independent", "Year"])

# Print pretty table.
p.pretty_print()

p.run()
</code></pre></div></div>
<h2 id="note-about-metadata">Note about Metadata</h2>
<p>I have been using Python as a scripting language to define my pipelines.
An observant reader might have noticed that all I did was compose
some messages, which is true. The <code class="language-plaintext highlighter-rouge">p</code> Pipeline object contains just a graph,
and the <code class="language-plaintext highlighter-rouge">run()</code> method uses an execution engine to resolve the graph and pick
the appropriate operations for the given task. That means my whole
processing pipeline does not need to be written in Python at all. It might be
described as JSON, for example, or even generated from some
graphical user interface for flow-based programming.</p>
<p>There is more to metadata in Bubbles than is mentioned in this blog post.
The framework understands higher-level metadata, such as analytical metadata – the role of
a field from a data analysis perspective. For example, the <code class="language-plaintext highlighter-rouge">aggregate()</code>
operation might by default aggregate all fields that are of analytical type
<code class="language-plaintext highlighter-rouge">measure</code>, and that information is passed on. This results in less writing and
less noise on the side of the pipeline designer.</p>
<h2 id="summary">Summary</h2>
<p>Why should someone who just wants to extract, transform and present data
care about the underlying technology and
query language? These days, when we are dealing with so many systems, it
is an unnecessary distraction. Moreover, many ETL blocks are generic and reusable –
why would we have to write the same code for every system we use?</p>
<p>Having an abstract ETL framework allows us to share transformations, cleaning
methods, quality checks and more, much more easily.</p>
<p>In addition, it leaves the optimization of the process to the operation
writers – the people with the technical skills, who know when it is good to move
data over the network and through the disks, and when we can just compose an
operation and issue a single statement that the source system understands.</p>
<h3 id="future">Future</h3>
<p>Bubbles is still just a prototype, for the brave ones. But I would love to
see it become a Python ETL/data integration framework. The short-term needs and
objectives are:</p>
<ul>
<li>Simpler pipeline definition interface, more functional programming oriented</li>
<li>Larger library of higher level reusable components, such as dimension
loaders (there is
<a href="https://en.wikipedia.org/wiki/Slowly_changing_dimension">more</a> to <code class="language-plaintext highlighter-rouge">UPSERT</code>
than many of us think, but that is another story)</li>
<li>Easier way to write operations.</li>
<li>Larger variety of supported backends and services</li>
</ul>
<p>If anyone is willing to help with the prototype, I will gladly guide them. Let us
build an open source Python data integration framework together. Extensible.
Understandable. Focused on usage, ways of thinking and the pipeline design
workflow.</p>
<p>Links:</p>
<ul>
<li><a href="http://bubbles.databrewery.org">Homepage</a></li>
<li><a href="https://github.com/stiivi/bubbles">Github</a></li>
<li>IRC: freenode.net Channel: #databrewery</li>
</ul>
Stefan Urbanek
Data Central: a static frontend for data package collections
2014-08-19T00:00:00+00:00
http://okfnlabs.org/blog/2014/08/19/datacentral
<p><a href="http://centraldedados.pt"><img src="/img/posts/datacentral.png" /></a></p>
<p>This post explains our issues at the Portuguese open data front when it
comes to providing bulk datasets in standard and easy-to-parse ways. It
also introduces <a href="https://github.com/centraldedados/datacentral">Data Central</a>,
our tentative solution to those issues: a Python tool to
generate static web frontends for your data packages.</p>
<h2 id="first-problem-have-a-common-format-for-storing-datasets">First problem: Have a common format for storing datasets</h2>
<p>At <a href="http://transparenciahackday.org">Transparência Hackday Portugal</a>, as with any other open data interest group,
we work with many datasets. An issue that has been slowing us down for a long
time is that we never had a centralized solution for storing datasets: some are
in Google Docs, others in Git repositories, others live on web servers.</p>
<p>Before that, another issue was the data format: we found ourselves lost among
CSV or JSON files, SQL database dumps, spreadsheets and plaintext files.
Converting these was something we’d do on an ad hoc basis, and the challenge of
finding (or devising) a common format usually stumbled into differing personal
preferences and the difficulty involved in mass-conversion of heterogeneous
data collections.</p>
<h2 id="solution-tabular-data-packages">Solution: Tabular data packages</h2>
<p>We stumbled almost accidentally into the <a href="http://data.okfn.org/standards">Data Package standards page</a>. It was a
revelation to see how elegant a solution this was to our format problems: using
the <a href="http://data.okfn.org/doc/tabular-data-package">Tabular Data Package</a> spec, we could go ahead and convert our datasets into
CSV, along with their metadata – which is fairly easy to generate and maintain
using the existing tools for the job. From there, we can also develop scripts
to re-fetch and update the datasets, as well as post-processing tools to
generate other formats from the data package.</p>
<p>There is already much information available on Data Packages:</p>
<ul>
<li>the <a href="http://data.okfn.org/vision">Frictionless Data vision</a>, which clearly lays out the problem and the
proposed workflow to deal with heterogeneous sets of data</li>
<li>the <a href="http://data.okfn.org/doc/data-package">Data Package</a> info page</li>
<li>the <a href="http://data.okfn.org/doc/tabular-data-package">Tabular Data Package</a> info page, which is the format we use</li>
<li>the comprehensive specifications for <a href="http://www.dataprotocols.org/data-packages/">Data Packages</a> and <a href="http://www.dataprotocols.org/simple-data-format/">Tabular Data Packages</a></li>
<li>many tools to manage and publish data packages at <a href="http://data.okfn.org/tools">data.okfn.org/tools</a></li>
</ul>
<p>So our common data format problem is now solved. We then faced another issue:
how to publish and distribute these datasets in an equally frictionless way.</p>
<h2 id="second-problem-simple-system-to-publish-data-packages">Second problem: Simple system to publish data packages</h2>
<p>Something that we’ve also been missing was a central point from which to
distribute the datasets we have. Having a site to aggregate all of our data
packages would be a necessary step for some requirements we had:</p>
<ul>
<li>It would make hosting data workshops easier, by providing a quick way to
access bulk data instead of fumbling around with USB sticks, Google documents
and Dropbox links.</li>
<li>It’d make our efforts more visible, by aggregating all our work that is
currently all over the place and presenting it in a simple manner.</li>
<li>More importantly, it gives us an easier way to present our work in gathering
and converting data, and a better argument to present to public entities for
publishing their data: instead of saying “Give us your data so we can convert
it and make it open”, we can simply say “Give us your data so it can be
available at OurGreatOpenDataPortal.pt”. Having a separate “brand” makes
things easier to explain – and open data matters are involved enough to be
able to hold people’s attention.</li>
</ul>
<p>There are existing solutions, such as <a href="http://thedatatank.com">DataTank</a> or, more prominently, <a href="http://ckan.org">CKAN</a>. So
why wouldn’t CKAN be an option?</p>
<p>CKAN is a brilliant framework for hosting, managing and dealing with groups of
heterogeneous datasets. However, installing CKAN is an <a href="http://docs.ckan.org/en/latest/maintaining/installing/index.html">involved process</a>, and its
power comes at the cost of maintaining a full web application: it requires a
carefully configured server, doing regular updates, and ensuring server
resources are not going above a reasonable level. And since we’re a small team,
we don’t require most of its advanced features (like permissions).</p>
<p>Finally, at Transparência Hackday we
already have to manage many web applications, and being all too familiar with that
experience made us look for a simpler application design.</p>
<h2 id="solution-data-central-a-static-site-generator-for-data-package-collections">Solution: Data Central, a static site generator for data package collections</h2>
<p>We set out to design a simple application that could meet our purposes. The main
design principles are:</p>
<ul>
<li>
<p><em>Enable access to bulk data sets</em>. Easy, straightforward access to
the actual files is the main driver behind the current implementation. This
differs from an API-driven approach which, while powerful, would require
significant additional complexity.</p>
</li>
<li>
<p><em>Generated static HTML site</em> – Publishing datasets doesn’t need a real-time
server-based application to query the data and show it. We would only need to
update the site daily, at most, and we could then skip the server-side logic.</p>
</li>
<li>
<p><em>Generate locally and upload</em> – The site generation ought to happen locally.
We decided to have one of our non-remote servers take care of the hard work of
generating the site, and then upload it with rsync to a hosted service.</p>
</li>
<li>
<p><em>Low hardware footprint</em> – Local generation means that our system spec
requirements are low. Not needing specialized hardware means that we can use
an old computer for this task. It’s actually what we do – the site
generation is being done on an old 2007 Sony Vaio laptop with a broken screen.</p>
</li>
<li>
<p><em>Separate the datasets from the site</em> – By hosting each data package on a
separate Git repository, the local generator could fetch it and re-generate
the site without having to host and manage a separate copy of the data
package and run the risk of both versions going out of sync. We found this
happens often when building a database-driven web application. By separating
the data packages and the web frontend, packagers and editors can work
independently on the data, while the site generator updates the live version
periodically.</p>
</li>
<li>
<p><em>Operated via the command line</em> – For the sake of simplicity and at the cost
of user-friendliness, we settled for a CLI-centered management workflow. We
realised that managing this kind of site should be a mostly automated process,
and an efficient way to do this would be to restrict the application to a set
of scripts that can be managed through Makefiles and run by cron jobs.</p>
</li>
</ul>
<p>There are some significant downsides to this direction, though.</p>
<ul>
<li>
<p>There is no API since it’s all just HTML. This might be the most evident
shortcoming of a static approach.</p>
</li>
<li>
<p>This also means there are no search capabilities. One could always consider
using a third-party search engine since the site is plain HTML that can be
scraped by Google, DuckDuckGo and other web crawlers.</p>
</li>
<li>
<p>There is no support for dynamic content, such as a site blog. Listing external
feeds could be done through widgets in JavaScript.</p>
</li>
<li>
<p>Since the application management is done locally through the command line,
there isn’t any web interface to make edits or changes inside the
browser.</p>
</li>
</ul>
<h2 id="how-it-works">How it works</h2>
<p>The workflow goes like this:</p>
<ol>
<li>Data packages are published and updated on individual repositories by
package maintainers.</li>
<li>The Datacentral application is configured to become aware of which
repositories it should track.</li>
<li>The first run of the application clones all repositories and generates the
HTML pages for each data package.</li>
<li>Individual HTML pages (About, Contact) are generated from local Markdown
files.</li>
<li>The generated output can then be pushed through FTP or rsync to a remote,
public web server.</li>
</ol>
<p>In practice, there is a <code class="language-plaintext highlighter-rouge">generate.py</code> script that inspects each data package
and uses <a href="http://jinja.pocoo.org">Jinja</a> to fill up a set of HTML template files. It saves the generated
HTML in an <code class="language-plaintext highlighter-rouge">_output</code> directory, which can then be inspected using a local
webserver or pushed into a live VPS. All actions, from installation to generation and upload, can be carried out by means of a Makefile.</p>
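<p>In spirit, the core of that script is small. A condensed sketch (the file names, directory layout and template variables below are illustrative, not Data Central’s actual code):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Condensed sketch of the generate-and-render step -- illustrative only.
# It reads each clone's datapackage.json and renders one page per package.
import json
import os
from jinja2 import Environment, FileSystemLoader

env = Environment(loader=FileSystemLoader("templates"))
template = env.get_template("package.html")  # assumed template name

for name in os.listdir("repos"):             # one Git clone per data package
    descriptor = os.path.join("repos", name, "datapackage.json")
    if not os.path.exists(descriptor):
        continue
    with open(descriptor) as f:
        package = json.load(f)
    outdir = os.path.join("_output", name)
    os.makedirs(outdir, exist_ok=True)
    with open(os.path.join(outdir, "index.html"), "w") as f:
        f.write(template.render(package=package))
</code></pre></div></div>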
<p>If you’re interested in reading more about Data Central and even trying it out (it’s simple!), <a href="https://github.com/centraldedados/datacentral">check out the project site</a>. We’d heartily welcome all possible feedback, so please let us know about any bugs, suggestions or feature requests at the Datacentral <a href="https://github.com/centraldedados/datacentral/issues">issue tracker</a>. Finally, you can see it in action in our (in development) Portuguese independent data hub, <a href="http://centraldedados.pt">Central de Dados</a>.</p>
Ricardo Lafuente
Labs newsletter: 5 June, 2014
2014-06-05T00:00:00+00:00
http://okfnlabs.org/blog/2014/06/05/newsletter
<p>Welcome back to the OKFN Labs! Members of the Labs have been building tools, visualizations, and even new data protocols—as well as setting up conferences and events. Read on to learn more.</p>
<p>If you’d like to suggest a piece of news for next month’s newsletter, leave a comment on its <a href="https://github.com/okfn/okfn.github.com/issues/215">GitHub issue</a>.</p>
<h2 id="commasearch">commasearch</h2>
<p><a href="http://okfnlabs.org/members/tlevine/">Thomas Levine</a> has been working on an innovative new approach to searching tabular data, <a href="https://github.com/tlevine/commasearch">commasearch</a>.</p>
<p>Unlike a normal search engine, where you submit words and get pages of words back, with commasearch, you submit spreadsheets and get spreadsheets in return.</p>
<p>What does that mean, and how does it work? Check out Thomas’s excellent blog post “<a href="http://dada.pink/dada/pagerank-for-spreadsheets/">Pagerank for Spreadsheets</a>” to learn more.</p>
<h2 id="github-diffs-for-csv-files">GitHub diffs for CSV files</h2>
<p><em>Submitted by <a href="http://okfnlabs.org/members/paulfitz/">Paul Fitzpatrick</a>.</em></p>
<p>GitHub has added CSV viewing support in their web interface, which is fantastic, but it still doesn’t handle changes well. If you use Chrome, and want lovely diffs, check out James Smith’s <a href="https://github.com/theodi/csvhub">CSVHub</a> extension (<a href="http://theodi.org/blog/csvhub-github-diffs-for-csv-files">blogpost and screenshot</a>). The diffs are produced using the <a href="http://paulfitz.github.io/daff/">daff</a> library, available in javascript, ruby, php, and python3.</p>
<h2 id="textus-wordpress-plugin">Textus Wordpress plugin</h2>
<p><em>Update from Iain Emsley.</em></p>
<p>The Open Literature project to provide a <a href="https://github.com/okfn/textus-wordpress">Wordpress plugin back-end for the Textus viewer</a> has made new progress.</p>
<p>This project’s goal was to keep the existing Textus frontend—which has been <a href="https://github.com/okfn/textus-viewer">split off as its own project</a> by Rufus Pollock—and replace the backend with a Wordpress plugin, to make it easier to deploy. A version of this plugin backend is now available.</p>
<p>The new plugin acts as a stand-alone module that can be enabled and disabled as required by the administrative user. It creates a new Wordpress post type called “Textus” which is available as part of the menu, giving the user a place to upload text and annotation files using the Media uploader.</p>
<p>If you are interested in the project, check out its <a href="https://github.com/okfn/textus-wordpress/issues">issues</a> and discussion on the <a href="https://lists.okfn.org/mailman/listinfo/open-humanities">Open Humanities list</a>.</p>
<h2 id="data-protocols-updates">Data protocols: updates</h2>
<p><a href="http://dataprotocols.org/">Data Protocols</a>, the Labs’s set of lightweight standards and patterns for open data, has had a couple of interesting developments.</p>
<p>The <a href="http://dataprotocols.org/table-schema/">JSON Table Schema</a> protocol has just added support for constraints (i.e. validation), thanks to <a href="http://www.ldodds.com/">Leigh Dodds</a>. This adds a <code class="language-plaintext highlighter-rouge">constraints</code> attribute containing requirements on the content of fields. See the full <a href="http://dataprotocols.org/table-schema/#field-constraints">list of valid constraints</a> on the JSON Table Schema site.</p>
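<p>For example, a field carrying constraints might look like the following (a small illustrative snippet; see the linked page for the full list of valid constraints):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "fields": [
    {
      "name": "amount",
      "type": "number",
      "constraints": {
        "required": true,
        "minimum": 0
      }
    }
  ]
}
</code></pre></div></div>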
<p>The <a href="https://github.com/okfn/dpm/">Data Package Manager</a> tool for Data Packages is shaping up nicely: the <code class="language-plaintext highlighter-rouge">install</code> and <code class="language-plaintext highlighter-rouge">init</code> commands have now been implemented. You can see an <a href="https://github.com/okfn/dpm/issues/3#issuecomment-43440812">animated GIF</a> of the former in the issue thread.</p>
<h2 id="annotatorjs-new-home">AnnotatorJS: new home</h2>
<p>Annotator is “an open-source JavaScript library to easily add annotation functionality to any webpage”.</p>
<p>The project now lives on its own domain at <a href="http://annotatorjs.org/">annotatorjs.org</a>. Check it out and see how easy it is to add comments and notes to your pages!</p>
<h2 id="csvconf">csv,conf</h2>
<p>Data makers everywhere will want to check out <a href="http://csvconf.com/">csv,conf</a>, a fringe event of <a href="http://2014.okfestival.org/">Open Knowledge Festival 2014</a> taking place in Berlin on 15 July.</p>
<p>csv,conf is a non-profit community conference that will “bring together data makers/doers/hackers from backgrounds like science, journalism, open government and the wider software industry to share tools and stories”.</p>
<p><a href="http://register.csvconf.com/">Tickets are $75, $50 with an OKFest ticket</a>. If you can make it to Berlin in July and you’re into “advancing the art of data collaboration”, come join in!</p>
Neil Ashton
First International Sport Hackdays Kick Off New OK Working Group
2014-06-04T00:00:00+00:00
http://okfnlabs.org/blog/2014/06/04/open-sport-kickoff
<p>At the first <a href="http://opendata.ch/2014/04/sports-hackdays/">International Sports Hackdays</a> in Basel, Sierre and Milan, over 120 developers and designers, journalists and scientists, professionals and amateurs came together to prototype new approaches to make creative use of sports data. They built new types of hardware, new interfaces for fitness equipment and spectator apps, analyzed Tour de France performances and FC Basel’s tactics, sport education policies, infrastructures and much more – and thus brought the spirit of open innovation and creative technology use to the field of sports. More <a href="https://blog.scraperwiki.com/2014/06/world-cup-hack-day-london-10th-june-a-teaser/">hackdays</a> are coming up, and a new international <a href="https://lists.okfn.org/mailman/listinfo/open-sports">OK Working Group</a> is being kicked off!</p>
<div style="width: 300px; float: right;"><a href="http://opendata.ch/files/2014/05/hacksports-desktop.jpg"><img src="http://opendata.ch/files/2014/05/hacksports-desktop.jpg" alt="Desktop of a participant, sketching and prototyping a Tour de France data visualization"></a><p><i>Desktop of a participant, sketching and prototyping a Tour de France data visualization.</i></p></div>
<div style="width: 300px;"><a href="http://make.opendata.ch/wiki/project:secondlamp"><img src="http://opendata.ch/files/2014/05/hacksports-hue.jpg" alt="Project visualizing the intensity and tendency of a football match with the color and brightness of two lightbulbs"></a><p><i><a href="http://make.opendata.ch/wiki/project:secondlamp">Project Secondlamp</a> is visualizing the intensity and tendency of a football match with the color and brightness of two light bulbs.</i></p></div>
<p>While open government data has become an established force for transparency, efficiency and innovation in the public sector, the world of sports stands at a beginning: even though there’s so much passion, even though there’s so much potential, sports data often remain in the closed coffers of functionaries. Last weekend, the International Sports Hackdays were just one successful play to change this game, just one step towards opening up sports to data, and sports data to the world. Therefore: Mr. Blatter, tear down this wall, make FIFA’s data available to all!</p>
<p><a href="https://twitter.com/sportmetrics">Martin Rumo</a>, Embedded Computer Scientist at Switzerland’s Federal Institute of Sports in Magglingen explained the situation: “In elite sport, we collect more and more data every day, but to make it used and useful, we must build bridges between developers, designers and data scientists and the world of sports.”</p>
<p>With experts for athletic data from leading companies such as <a href="http://www.deltatre.com/">Deltatre</a> or <a href="http://www.technogym.com/">Technogym</a> or from <a href="http://en.wikipedia.org/wiki/Federal_Office_of_Sport">Switzerland’s National Sports Centre</a>, with data visualization experts from companies such as <a href="http://tuxtax.it/">Tuxtax</a> or<a href="http://www.interactivethings.com/"> Interactive Things</a> as well as academics, hackers and makers from leading local tech firms, the event attracted talents hardly ever brought together, a set of interdisciplinary innovators that proved to be extraordinarily productive – in building bridges, but also in making actual progress.</p>
<div style="width: 300px; float: right;"><a href="http://repository.opendata.ch/sport/hackdays-14-tdf/"><img height="300" width="300" src="http://opendata.ch/files/2014/05/hacksports-tdf.png" alt="Plot helping to visualize and analyze the career of a road bicycle racer, using data from the Tour de France."></a><p><i><a href="http://repository.opendata.ch/sport/hackdays-14-tdf/">Plot</a> helping to visualize and analyze the career of a road bicycle racer, using data from the Tour de France.</i></p></div>
<div style="width: 300px;"><a href="http://make.opendata.ch/wiki/project:matchquote"><img height="300" width="300" src="http://opendata.ch/files/2014/05/hacksports-foosball.png"></a><p><i>Sensors connected to a foosball table, collecting data and correlating it to historic pro soccer matches – to make connections like <a href="http://make.opendata.ch/wiki/project:matchquote">“you’re playing like Liverpool-Basel today”</a>.</i></p></div>
<p>The fascinating projects developed by the creative industry volunteers included hardware projects, software projects and data visualizations. They concentrated mainly on football (<a href="http://make.opendata.ch/wiki/project:secondlamp">Secondlamp</a>, second-screen match app <a href="http://make.opendata.ch/wiki/project:blitzpoll">BlitzPoll</a>, ..), cycling (such as <a href="http://repository.opendata.ch/sport/hackdays-14-tdf/">deep</a>, <a href="http://make.opendata.ch/wiki/project:tour_de_france_history">historical</a> analyses of Tour de France performances). The prototypes covered both personal apps like <a href="http://make.opendata.ch/wiki/project:sportee">Sportee</a> or <a href="http://make.opendata.ch/wiki/project:beatit">BeatIt</a> and the organizational level, e.g for the financial analysis of state-funded sport promotion as done in <a href="http://make.opendata.ch/wiki/project:optisports">OptiSports</a> or, specifically for Switzerland, the tool called <a href="http://make.opendata.ch/wiki/project:spoertle">Spörtle</a>. In all cases, the approaches taken were extraordinarily inventive and of amazing quality: the creativity and innovation that could be experienced in these two days in Milan, Sierre and Basel was very impressive indeed.</p>
<p>The data sources used at the event are all <a href="http://datahub.io/organization/sport">available on datahub.io</a>. These datasets include crowdsourced data, data extracted from public websites as well as official data releases made available to a broader audience for the very first time.
To foster and facilitate the <a href="http://opendefinition.org/">open</a> publication and productive use of open sports data, <a href="http://okfn.org/">Open Knowledge</a> is currently incubating an official sports data working group. The group will cover a wide range of issues that can be tackled with sports data: from leisure sport performance data in professional sport to using data for financial transparency and governance in sports institutions. The group is already attracting top-notch experts as well as data-wrangling fans and data journalists, so it's time to <a href="https://lists.okfn.org/mailman/listinfo/open-sports">join the game</a> now!</p>
Hannes Gassert
Using open football data - Get ready for the World Cup in Brazil 2014
2014-05-06T00:00:00+00:00
http://okfnlabs.org/blog/2014/05/06/open-data-world-cup
<p>Football is the world’s most popular sport and
the World Cup in Brazil - kicking off next month in São Paulo on June 12th
(in 38 days 3 hours 15 minutes and counting) -
is the world’s biggest (sport) event with 32 national teams
from six continents competing in 64 matches in 12 cities for the championship title.</p>
<h2 id="wheres-the-open-football-data-lets-ask-the-intertubes">Where’s the open football data? Let’s ask the intertubes</h2>
<p>Now let’s say you want to build a world cup match day widget for
your site using a web service (HTTP JSON API) that gets
you all teams, groups, matches, players, and so on.</p>
<p>Example - HTTP JSON API:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GET /event/worldcup.2014/teams
{
"event": {
"key": "worldcup.2014", "title": "World Cup 2014"
},
"teams": [
{ "key": "gre", "title": "Greece", "code": "GRE" },
{ "key": "ned", "title": "Netherlands", "code": "NED" },
{ "key": "ger", "title": "Germany", "code": "GER" },
{ "key": "por", "title": "Portugal", "code": "POR" },
...
]
}
</code></pre></div></div>
<p>Ideally, there’s a free service using open football data from the world football federation,
from the world’s sport cable channels, from the world’s sport newspapers, and so on.
Let’s ask the intertubes to find out the state of open football data in the real world -
let’s google <a href="http://www.google.com/search?q=json+world+cup+brazil"><code class="language-plaintext highlighter-rouge">json world cup brazil</code></a>
or post a question on the open data stackexchange <a href="http://opendata.stackexchange.com/questions/1791/any-open-data-sets-for-the-football-world-cup-in-brazil-2014">‘Q: Any Open Data Sets for the (Football) World Cup (in Brazil 2014)?’</a>.</p>
<p>Nothing. Nada. Nichts. Niente. Zilch. Zero.
So what? Let’s build an open football data project.</p>
<h2 id="whats-footballdb">What’s <code class="language-plaintext highlighter-rouge">football.db?</code></h2>
<p>Let’s welcome <code class="language-plaintext highlighter-rouge">football.db</code> - an open football data project
offering - surprise, surprise - free, open, public domain
football data for the World Cup in Brazil 2014, and more.</p>
<p><img src="/img/posts/openfootball/worldcup2014-db-download.png" alt="" /></p>
<p>The open football project also sports a free self-hosted HTTP JSON API service
for football data. Get started in two steps:</p>
<ul>
<li>Step 1: Download the <code class="language-plaintext highlighter-rouge">worldcup2014.db</code> SQLite Database</li>
<li>Step 2: Serve up teams, rounds, matches, etc. via HTTP JSON API using the <code class="language-plaintext highlighter-rouge">sportdb</code> command line tool</li>
</ul>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sportdb serve
</code></pre></div></div>
<p>Services available include:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">/event/world.2014/teams</code> – List all teams</li>
<li><code class="language-plaintext highlighter-rouge">/event/world.2014/rounds</code> – List all rounds (matchdays)</li>
<li><code class="language-plaintext highlighter-rouge">/event/world.2014/round/20</code> – List all matches in a round e.g. - 20th Round (=> Final)</li>
</ul>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GET /event/world.2014/round/1
{
"event": { "key": "world.2014", "title": "World Cup 2014" },
"round": { "pos": 1, "title": "Matchday 1" },
"games": [
{
"team1_key": "bra",
"team1_title": "Brazil",
"team1_code": "BRA",
"team2_key": "cro",
"team2_title": "Croatia",
"team2_code": "CRO",
"play_at": "2014/06/12",
"score1": null,
"score2": null,
"score1ot": null,
"score2ot": null,
"score1p": null,
"score2p": null
}
]
}
</code></pre></div></div>
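<p>To round off the example, here is a minimal Python sketch of a client for such a self-hosted service. The base URL is an assumption - point it at wherever <code class="language-plaintext highlighter-rouge">sportdb serve</code> is actually listening:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json
import urllib.request

# Base URL of the self-hosted sportdb HTTP JSON API service (placeholder;
# adjust the host and port to your own setup).
BASE_URL = "http://localhost:3000"

with urllib.request.urlopen(BASE_URL + "/event/world.2014/round/1") as resp:
    data = json.loads(resp.read().decode("utf-8"))

# Print the fixtures of matchday 1, per the response shape shown above.
print(data["round"]["title"])
for game in data["games"]:
    print("%s - %s on %s" % (game["team1_title"], game["team2_title"], game["play_at"]))
</code></pre></div></div>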
<h2 id="how-does-it-work--distributed-is-the-new-centralized">How does it work? Distributed is the new centralized</h2>
<p>The open football data project collects public domain data sets
in plain old text files that you store on your hard disk
and that you can share via a distributed version tracker (that is, git repos)
with your friends or the world.
A free public domain command line tool (that is, <code class="language-plaintext highlighter-rouge">sportdb</code>)
lets you read the plain text data sets into your SQL database of choice
(for example, MySQL, PostgreSQL, SQLite, etc).</p>
<p><img src="/img/posts/openfootball/github-openfootball-worldcup.png" alt="" /></p>
<h3 id="example-europeteamstxt---comma-separated-values">Example: <code class="language-plaintext highlighter-rouge">europe/teams.txt</code> - Comma-separated values</h3>
<p>Let’s look at a plain text file for national teams in Europe, for example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>############
# UEFA (Union of European Football Associations)
# - 54 members
aut, Austria, AUT, at, fifa|uefa
bel, Belgium, BEL, be, fifa|uefa
cyp, Cyprus, CYP, cy, fifa|uefa
...
</code></pre></div></div>
<p>(Source: <a href="https://github.com/openfootball/national-teams/blob/master/europe/teams.txt">europe/teams.txt</a>)</p>
<p>The plain text file uses the comma-separated values (CSV) format
with some extras for comments, blank lines, etc.</p>
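<p>As a rough sketch (not the sportdb reader itself), such a file can be read with a few lines of Python - skip comments and blank lines, feed the rest to a CSV reader; the column meanings are inferred from the sample above and the path is a placeholder:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import csv

def read_teams(path):
    """Yield rows like ('aut', 'Austria', 'AUT', 'at', 'fifa|uefa'),
    skipping comment lines (starting with '#') and blank lines."""
    with open(path, encoding="utf-8") as f:
        rows = (line for line in f
                if line.strip() and not line.lstrip().startswith("#"))
        for row in csv.reader(rows, skipinitialspace=True):
            yield tuple(row)

for team in read_teams("europe/teams.txt"):   # path is a placeholder
    print(team)
</code></pre></div></div>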
<h3 id="example-worldcup2014cuptxt----mini-football-data-language">Example: <code class="language-plaintext highlighter-rouge">worldcup/2014/cup.txt</code> - Mini football data language</h3>
<p>For match schedules the open football project uses a new structured data format,
that is, a new domain-specific language (DSL).</p>
<p>Example - Open Football Match Schedule Language:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(1) Thu Jun/12 17:00 Brazil - Croatia @ Arena de São Paulo, São Paulo (UTC-3)
(2) Fri Jun/13 13:00 Mexico - Cameroon @ Estádio das Dunas, Natal (UTC-3)
</code></pre></div></div>
<p>(Source: <a href="https://github.com/openfootball/world-cup/blob/master/2014--brazil/cup.txt">world-cup/2014/cup.txt</a>)</p>
<p>Why invent yet another data format?
The new mini language for structured football match schedule data
offers you the best of both worlds, that is,
1) it looks and feels like free-form plain text - easy to read and easy to write -
2) but offers a 100% data accuracy guarantee (when loading into SQL tables, for example).</p>
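<p>To illustrate the “easy to read” claim, here is a rough regular-expression sketch in Python that parses a single schedule line of the exact shape shown above (the real sportdb parser is far more forgiving; this is just a toy):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re

# Matches lines like:
# "(1) Thu Jun/12 17:00 Brazil - Croatia @ Arena de São Paulo, São Paulo (UTC-3)"
LINE = re.compile(
    r"\((?P<num>\d+)\)\s+"
    r"(?P<day>\w{3})\s+(?P<date>\w{3}/\d{1,2})\s+(?P<time>\d{1,2}:\d{2})\s+"
    r"(?P<team1>.+?)\s+-\s+(?P<team2>.+?)\s+"
    r"@\s+(?P<ground>.+?)\s+\((?P<tz>UTC[+-]\d+)\)")

m = LINE.match("(1) Thu Jun/12 17:00 Brazil - Croatia @ Arena de São Paulo, São Paulo (UTC-3)")
if m:
    print(m.groupdict())
# {'num': '1', 'day': 'Thu', 'date': 'Jun/12', 'time': '17:00',
#  'team1': 'Brazil', 'team2': 'Croatia',
#  'ground': 'Arena de São Paulo, São Paulo', 'tz': 'UTC-3'}
</code></pre></div></div>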
<p>The mini language also includes
support for groups, matchdays, grounds, and more. Example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>############################
# World Cup 2014 Brazil
Group A | Brazil Croatia Mexico Cameroon
Group B | Spain Netherlands Chile Australia
Group C | Colombia Greece Côte d'Ivoire Japan
Group D | Uruguay Costa Rica England Italy
Group E | Switzerland Ecuador France Honduras
Group F | Argentina Bosnia-Herzegovina Iran Nigeria
Group G | Germany Portugal Ghana United States
Group H | Belgium Algeria Russia South Korea
Matchday 1 | Thu Jun/12
Matchday 2 | Fri Jun/13
Matchday 3 | Sat Jun/14
...
(16) Round of 16 | Sat Jun/28 - Tue Jul/1
(17) Quarter-finals | Fri Jul/4 - Sat Jul/5
(18) Semi-finals | Tue Jul/8 - Wed Jul/9
(19) Match for third place | Sat Jul/12
(20) Final | Sun Jul/13
Group A:
(1) Thu Jun/12 17:00 Brazil - Croatia @ Arena de São Paulo, São Paulo (UTC-3)
(2) Fri Jun/13 13:00 Mexico - Cameroon @ Estádio das Dunas, Natal (UTC-3)
(17) Tue Jun/17 16:00 Brazil - Mexico @ Estádio Castelão, Fortaleza (UTC-3)
(18) Wed Jun/18 18:00 Cameroon - Croatia @ Arena Amazônia, Manaus (UTC-4)
(33) Mon Jun/23 17:00 Cameroon - Brazil @ Brasília (UTC-3)
(34) Mon Jun/23 17:00 Croatia - Mexico @ Recife (UTC-3)
Group B:
(3) Fri Jun/13 16:00 Spain - Netherlands @ Arena Fonte Nova, Salvador (UTC-3)
(4) Fri Jun/13 18:00 Chile - Australia @ Arena Pantanal, Cuiabá (UTC-4)
(19) Wed Jun/18 16:00 Spain - Chile @ Estádio do Maracanã, Rio de Janeiro (UTC-3)
(20) Wed Jun/18 13:00 Australia - Netherlands @ Estádio Beira-Rio, Porto Alegre (UTC-3)
(35) Mon Jun/23 13:00 Australia - Spain @ Curitiba (UTC-3)
(36) Mon Jun/23 13:00 Netherlands - Chile @ São Paulo (UTC-3)
...
</code></pre></div></div>
<p>Interested? Find out more at the <a href="https://github.com/openfootball">project site</a>
or post your questions or comments to the <a href="http://groups.google.com/group/opensport">forum/mailing list</a>. Thanks.</p>
<h2 id="appendix-basics---whats-not-open-structured-data">Appendix: Basics - What’s (Not) Open (Structured) Data?</h2>
<p>What’s (Not) Open (Structured) Data?</p>
<p>Example 1:</p>
<ul>
<li>A Free One-Page Booklet (PDF) Download for the Match Schedule from <a href="http://fifa.com/worldcup/matches"><code class="language-plaintext highlighter-rouge">fifa.com</code></a>.
<ul>
<li>Copyright © FIFA 2014. All Rights Reserved.</li>
</ul>
</li>
</ul>
<p><img src="/img/posts/openfootball/fifa-match-schedule-download.png" alt="" /></p>
<p>Example 2a:</p>
<ul>
<li>Match Schedule on FIFA Website</li>
</ul>
<p><img src="/img/posts/openfootball/fifa-match-schedule.png" alt="" /></p>
<p>Example 2b:</p>
<ul>
<li>Match Schedule on FIFA Website (Source - Document Object Model Tree)</li>
</ul>
<p><img src="/img/posts/openfootball/fifa-match-schedule-inside.png" alt="" /></p>
<p>Example 3a:</p>
<ul>
<li>Match Schedule on Wikipedia</li>
</ul>
<p><img src="/img/posts/openfootball/wikipedia-worldcup.png" alt="" /></p>
<p>Example 3b:</p>
<ul>
<li>Match Schedule on Wikipedia (Source - Plain Text or Mediawiki Text)</li>
</ul>
<p>Cut-n-Paste Text:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>12 June 2014 17:00 Brazil Match 1 Croatia Arena de São Paulo, São Paulo
13 June 2014 13:00 Mexico Match 2 Cameroon Arena das Dunas, Natal
17 June 2014 16:00 Brazil Match 17 Mexico Estádio Castelão, Fortaleza
18 June 2014 19:00 Cameroon Match 18 Croatia Arena Amazônia, Manaus
23 June 2014 17:00 Cameroon Match 33 Brazil Estádio Nacional Mané Garrincha, Brasília
23 June 2014 17:00 Croatia Match 34 Mexico Arena Pernambuco, Recife
</code></pre></div></div>
<p>Wikipedia Source:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>===Group A===
\{\{\{\{main|2014 FIFA World Cup Group A}}}}
\{\{\{\{Fb cl2 header navbar}}}}
\{\{\{\{Fb cl2 team |t=\{\{\{\{fb|BRA}}}} |w=0 |d=0 |l=0 |gf=0 |ga=0 |bc=}}}}
\{\{\{\{Fb cl2 team |t=\{\{\{\{fb|CRO}}}} |w=0 |d=0 |l=0 |gf=0 |ga=0 |bc=|border=green}}}}
\{\{\{\{Fb cl2 team |t=\{\{\{\{fb|MEX}}}} |w=0 |d=0 |l=0 |gf=0 |ga=0 |bc=}}}}
\{\{\{\{Fb cl2 team |t=\{\{\{\{fb|CMR}}}} |w=0 |d=0 |l=0 |gf=0 |ga=0 |bc=}}}}
|}
\{\{\{\{Football box
|date=12 June 2014
|time=17:00
|team1=\{\{\{\{fb-rt|BRA}}}}
|score=[[2014 FIFA World Cup Group A#Brazil v Croatia|Match 1]]
|report=
|team2=\{\{\{\{fb|CRO}}}}
|goals1=
|goals2=
|stadium=[[Arena Corinthians|Arena de São Paulo]], [[São Paulo]]
|attendance=
|referee=
}}}}
\{\{\{\{Football box
|date=13 June 2014
|time=13:00
|team1=\{\{\{\{fb-rt|MEX}}}}
|score=[[2014 FIFA World Cup Group A#Mexico v Cameroon|Match 2]]
|report=
|team2=\{\{\{\{fb|CMR}}}}
|goals1=
|goals2=
|stadium=[[Arena das Dunas]], [[Natal, Rio Grande do Norte|Natal]]
|attendance=
|referee=
}}}}
</code></pre></div></div>
Gerald Bauer
CSV Conf 2014 - for Data Makers Everywhere
2014-05-05T00:00:00+00:00
http://okfnlabs.org/blog/2014/05/05/csv-conf-2014
<p>Announcing <a href="http://csvconf.com/">CSV,Conf - the conference for data makers everywhere</a> which
takes place on <strong>15 July 2014</strong> in <strong>Berlin</strong>.</p>
<p>This one day conference will focus on <strong>practical</strong>, <strong>real-world</strong> stories,
examples and techniques of how to <strong>scrape</strong>, <strong>wrangle</strong>, <strong>analyze</strong>, and
<strong>visualize</strong> data. Whether your data is big or small, tabular or spatial,
graphs or rows this event is for you.</p>
<h2 id="key-info">Key Info</h2>
<ul>
<li><strong>Where</strong>: Kalkscheune, Berlin, Germany</li>
<li><strong>When</strong>: 15 July 2014, all day</li>
<li><strong>Web</strong>: <a href="http://csvconf.com/">http://csvconf.com/</a></li>
<li><strong>Register</strong>: <a href="http://register.csvconf.com/">http://register.csvconf.com/</a></li>
<li><strong>Submit a talk</strong>: <a href="http://csvconf.com/#help">http://csvconf.com/#help</a> (Deadline: Noon GMT 31st May)</li>
</ul>
<p>CSV,Conf is run in conjunction with the week long <a href="http://okfestival.org">Open Knowledge Festival</a>.</p>
<h2 id="what-is-it-about">What Is It About?</h2>
<h3 id="building-community">Building Community</h3>
<p>We want to bring together data makers/doers/hackers from backgrounds like
science, journalism, open government and the wider software industry to share
tools and stories.</p>
<h3 id="for-those-who-love-data">For those who love data</h3>
<p>CSV Conf is a non-profit community conference run by some folks who really love
data and sharing knowledge. If you are as passionate as we are about data and its
application to society, then you should join us!</p>
<h3 id="big-and-small">Big and small</h3>
<p>This isn’t a conference just about spreadsheets. We are curating content about
advancing the art of data collaboration, from putting your CSV on GitHub to
producing meaningful insight by running large scale distributed processing.</p>
<h2 id="colophon-why-csv">Colophon: Why CSV?</h2>
<p>This conference isn’t just about <a href="http://data.okfn.org/doc/csv">CSV</a> data. But we chose to call it CSV
Conf because we think CSV embodies certain important qualities that set the
tone for the event:</p>
<ul>
<li><strong>Simplicity</strong>: CSV is incredibly simple - perhaps the simplest structured data
format there is</li>
<li><strong>Openness</strong>: the CSV ‘standard’ is well-known and open - free for anyone to use</li>
<li><strong>Easy to use</strong>: CSV is widely supported - practically every spreadsheet
program, relational database and programming language in existence can handle
CSV in some form or other</li>
<li><strong>Hackable</strong>: CSV is text-based and therefore amenable to manipulation and access
from a wide range of standard tools (including revision control systems such
as git, mercurial and subversion)</li>
<li><strong>Big or small</strong>: CSV files can range from under a kilobyte to gigabytes, and their
line-oriented structure means they can be incrementally processed – you do not
need to read an entire file to extract a single row (see the short sketch after this list).</li>
</ul>
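<p>To make the incremental-processing point concrete, a small Python sketch (the file name and lookup condition are placeholders):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import csv

# Stream a (potentially multi-gigabyte) CSV file row by row; only one
# row is ever held in memory, and we can stop as soon as we find ours.
with open("data.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        if row["id"] == "42":   # placeholder lookup condition
            print(row)
            break               # no need to read the rest of the file
</code></pre></div></div>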
<p>More informally:</p>
<blockquote>
<p>CSV is the data Kalashnikov: not pretty, but many [data] wars have been
fought with it and even kids can use it. <a href="http://pudo.org/">@pudo</a> (Friedrich Lindenberg)</p>
</blockquote>
<blockquote>
<p>CSV is the ultimate simple, standard data format - streamable, text-based, no
need for proprietary tools etc <a href="http://rufuspollock.org/">@rufuspollock</a> (Rufus Pollock)</p>
</blockquote>
<p>[The above is adapted from the <a href="http://dataprotocols.org/tabular-data-package/#why-csv">“Why CSV” section</a> of the Tabular Data
Package specification]</p>
Rufus Pollock
Morph, a scraper platform for hackers and would be hackers
2014-03-22T00:00:00+00:00
http://okfnlabs.org/blog/2014/03/22/morph
<h2 id="in-an-ideal-world">In an ideal world…</h2>
<p><strong>In an ideal world</strong> we would go in search of a piece of data by using our favorite search engine and we would land on a page with a big download button. It would give you a few options for formats. You pick the right one and off you go.</p>
<p>Unfortunately, we all know from bitter experience that this is not yet routinely how the real world operates. We’re getting there - there’s more data out there than ever before - but we still have a very, very long way to go.</p>
<p><strong>In the real world</strong> we hopefully find the data we’re after. Maybe it’s published on a government website somewhere. There’s no big download button, just a big html table.</p>
<p>If we just need to grab a snapshot copy of the data and it’s just on a single page we can copy and paste it with a bit of luck into a spreadsheet.</p>
<p>What if the data is spread over hundreds of pages or we need to keep it regularly updated? Well, we have to write a <a href="https://en.wikipedia.org/wiki/Web_scraping">scraper</a>.</p>
<p>If you know the basics of a language like PHP, Python or Ruby, writing a scraper isn’t very hard at all.</p>
<p>What is a pain is all the stuff around it. Where do I run it? How do I schedule it to run automatically? What if the website that I’m scraping changes? How do I check that the data is still coming in regularly? What do I do if I need an API so another application can regularly access the scraped data?</p>
<p>All of these things you can solve but why bother if something else can take care of that for you?</p>
<h2 id="introducing-morphio">Introducing Morph.io</h2>
<p><a href="https://morph.io"><img src="/img/posts/morph/logo.png" width="300" /></a></p>
<p>This is where <a href="https://morph.io">Morph.io</a> comes in. It’s a new free scraping platform made by the not-for-profit <a href="https://www.openaustraliafoundation.org.au">OpenAustralia Foundation</a>.</p>
<p>The basic idea is you write your scraper in PHP, Python or Ruby. You can do your scraping in pretty much whatever way you want. All that matters is that the final data is written to an SQLite database in your local directory. This gives you enormous power and flexibility while for the simple case it all stays nice and easy.</p>
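<p>To give a feel for how little is involved, here is a minimal sketch of such a scraper in Python using only the standard library (the URL, the parsing step, and the table name and columns are all placeholders; real scrapers typically pull in a parsing library such as lxml or BeautifulSoup):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import sqlite3
import urllib.request

# Fetch the page to scrape (placeholder URL).
html = urllib.request.urlopen("http://example.com/").read().decode("utf-8")

# ... parse `html` here with your tool of choice ...
records = [("2014-03-22", "example value")]   # placeholder parsed rows

# All that matters: write the final data to an SQLite database
# in the local directory.
conn = sqlite3.connect("data.sqlite")
conn.execute("CREATE TABLE IF NOT EXISTS data (date TEXT, value TEXT)")
conn.executemany("INSERT INTO data VALUES (?, ?)", records)
conn.commit()
conn.close()
</code></pre></div></div>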
<p>All your scraper code is stored in <a href="https://github.com">GitHub</a> under version control. People can fork your scraper and contribute fixes and everything on <a href="https://morph.io">Morph.io</a> integrates tightly with GitHub.</p>
<p>You can run that scraper from the commandline or manually via the web interface. You can also schedule it to run automatically every day.</p>
<p>You can then download the resulting data as CSVs or JSON or even do a custom SQL query against your SQLite database using the API.</p>
<p>You can also watch your scraper (or anyone else’s) and get notified via email if the scraper errors.</p>
<p>You can work on the command line with your scraper and when you’re happy push the changes to GitHub and then the next time your scraper runs on <a href="https://morph.io">Morph.io</a> (either manually or automatically) it will pick up your changes.</p>
<p><a href="https://morph.io/planningalerts-scrapers/blue-mountains"><img src="/img/posts/morph/screenshot.png" style="border: 1px solid;" /></a></p>
<p>Apart from making the straightforward use case of hosting and running a scraper easy, the focus of <a href="https://morph.io">Morph.io</a> is around collaboration. It’s easy to collaborate with someone else on developing and maintaining a scraper - you use GitHub in the way you know and there’s no mucking around with server deployments or the like.</p>
<h2 id="migrating-from-scraperwiki-classic">Migrating from ScraperWiki Classic</h2>
<p>If you’re a user of <a href="https://classic.scraperwiki.com/">ScraperWiki Classic</a> you probably have already received an email letting you know that the ScraperWiki Classic service is shutting down. Sad news as we’ve been long time users of ScraperWiki ourselves. However, we’ve gotten together with the ScraperWiki folks to make it super easy to migrate your existing ScraperWiki Classic scrapers over to <a href="https://morph.io">Morph.io</a>. It literally only requires two clicks.</p>
<h2 id="open-source">Open Source</h2>
<p><a href="https://morph.io">Morph.io</a> is also <a href="https://github.com/openaustralia/morph/">open source</a> licensed under the Affero GPL. So you can use <a href="https://morph.io">Morph.io</a> without any fear of vendor lock-in.</p>
<p>If your needs outgrow using <a href="https://morph.io">Morph.io</a> the service, then install your own private instance.</p>
<h2 id="get-started">Get started</h2>
<p>Hopefully this has given you enough motivation to give <a href="https://morph.io">Morph.io</a> a try.</p>
<p>Go to <a href="https://morph.io">Morph.io</a> and write a scraper!</p>
<p><a href="mailto:contact@oaf.org.au">Feedback</a> on how <a href="https://morph.io">Morph.io</a> is working for you is always appreciated.</p>
Matthew Landauer
Labs newsletter: 20 March, 2014
2014-03-20T00:00:00+00:00
http://okfnlabs.org/blog/2014/03/20/newsletter
<p>We’re back with a bumper crop of updates in this new edition of the now-monthly Labs newsletter!</p>
<h2 id="textus-viewer-refactoring">Textus Viewer refactoring</h2>
<p>The <a href="http://okfnlabs.org/textus-viewer/">TEXTUS Viewer</a> is an HTML + JS application for viewing texts in the format of <a href="http://okfnlabs.org/projects/textus/">TEXTUS</a>, Labs’s open source platform for collaborating around collections of texts. The viewer has now been <a href="https://github.com/okfn/textus-viewer/issues/5">stripped down</a> to its bare essentials, becoming a leaner and more streamlined beast that’s easier to integrate into your projects.</p>
<p>Check out <a href="http://okfnlabs.org/textus-viewer/">the demo</a> to see the new Viewer in action, and see the <a href="https://github.com/okfn/textus-viewer#usage">full usage instructions</a> in the repo.</p>
<h2 id="json-table-schema-foreign-key-support">JSON Table Schema: foreign key support</h2>
<p>The <a href="http://dataprotocols.org/table-schema/">JSON Table Schema</a>, Labs’s schema for tabular data, has just added an important new feature: support for <a href="http://dataprotocols.org/table-schema/#foreign-keys">foreign keys</a>. This means that the schema now provides a method for linking entries in a table to entries in a separate resource.</p>
<p>This update has been in the works for a long time, as you can see from <a href="https://github.com/dataprotocols/dataprotocols/issues/23">the discussion thread on GitHub</a>. Many thanks to everyone who participated in that year-long discussion, including <a href="http://trestletechnology.net">Jeff Allen</a>, <a href="http://deadpansincerity.com">David Miller</a>, <a href="https://github.com/gthb">Gunnlaugur Thor Briem</a>, <a href="http://standardanalytics.io">Sebastien Ballesteros</a>, <a href="http://opennorth.ca">James McKinney</a>, <a href="http://robotrebuilt.com/people/paulfitz/">Paul Fitzpatrick</a>, <a href="https://github.com/besquared">Josh Ferguson</a>, <a href="https://github.com/tryggvib">Tryggvi Björgvinsson</a>, and <a href="http://okfnlabs.org/members/rgrp">Rufus Pollock</a>.</p>
<h2 id="renaming-of-data-explorer">Renaming of Data Explorer</h2>
<p><a href="http://okfnlabs.org/projects/data-explorer/">Data Explorer</a> is Labs’s in-browser data cleaning and visualization app—and it’s about to get a name change.</p>
<p>For the past four months, <a href="https://github.com/okfn/dataexplorer/issues/150">discussion around the new name</a> has been bubbling. As of right now, <a href="http://okfnlabs.org/members/rgrp">Rufus Pollock</a> is proposing to go with the new name <em>DataDeck</em>.</p>
<p>What do you think? If you object, now’s your chance to jump in the thread and re-open the issue!</p>
<h2 id="on-the-blog-sec-edgar-database">On the blog: SEC EDGAR database</h2>
<p>Rufus has been doing some work with the <a href="http://okfnlabs.org/blog/2014/03/04/sec-edgar-database.html">Securities and Exchange Commission (SEC) EDGAR database</a>, “a rich source of data containing regulatory filings from publicly-traded US corporations including their annual and quarterly reports”. He has written up his initial findings on the blog and created a <a href="https://github.com/datasets/edgar">repo</a> for the extracted data.</p>
<p>This is an interesting example of working with XBRL, the popular XML framework for financial reporting. You can find several good Python libraries for working with XBRL in <a href="https://lists.okfn.org/pipermail/okfn-labs/2014-March/001337.html">Rufus’s message to the mailing list</a>.</p>
<h2 id="labs-hangout-today">Labs Hangout: today!</h2>
<p>Labs Hangouts are a fun and informal way for Labs members and friends to get together, discuss their work, and seek out new contributions—and the next one is happening today (20 March) at 1700-1800 GMT!</p>
<p>If you want to join in, <a href="http://pad.okfn.org/p/labs-hangouts">visit the hangout Etherpad</a> and record your name. The URL of the Hangout will be announced on the Labs mailing list as well as reported on the pad.</p>
<h2 id="get-involved">Get involved</h2>
<p>Want to join in Labs activities? There’s lots to do! Possibilities for contribution include:</p>
<ul>
<li><a href="https://github.com/okfn/data.okfn.org/issues/24">Google Spreadsheet imports</a> for <a href="http://data.okfn.org">data.okfn.org</a></li>
<li><a href="https://github.com/okfn/timemapper/issues/107#issuecomment-37631369">JSON and CSV import</a> for <a href="http://timemapper.okfnlabs.org/">TimeMapper</a></li>
<li><a href="https://github.com/okfn/datapipes/issues/107">developer documentation</a> for <a href="http://datapipes.okfnlabs.org">Data Pipes</a></li>
</ul>
<p>And much much more. Leave an idea on the <a href="http://okfnlabs.org/ideas/">Ideas Page</a>, or visit the Labs site to learn more about how you can <a href="http://okfnlabs.org/join/">join the community</a>.</p>
Neil Ashton
The SEC EDGAR Database
2014-03-04T00:00:00+00:00
http://okfnlabs.org/blog/2014/03/04/sec-edgar-database
<p>This post looks at the Securities and Exchange Commission (SEC) EDGAR database.
EDGAR is a rich source of data containing regulatory filings from
publicly-traded US corporations including their annual and quarterly reports:</p>
<blockquote>
<p>All companies, foreign and domestic, are required to file registration
statements, periodic reports, and other forms electronically through EDGAR.
Anyone can access and download this information for free. [from the <a href="http://www.sec.gov/edgar.shtml">SEC
website</a>]</p>
</blockquote>
<p>This post introduces the basic structure of the database, and how to get access
to filings via ftp. Subsequent posts will look at how to use the structured
information in the form of XBRL files.</p>
<div class="alert alert-success">
<strong>Note</strong>: an extended version of the notes here plus additional data and scripts
can be found in this <a href="https://github.com/datasets/edgar">SEC EDGAR Data
Package on Github</a>.
</div>
<h2 id="human-interface">Human Interface</h2>
<p>See <a href="http://www.sec.gov/edgar/searchedgar/companysearch.html">http://www.sec.gov/edgar/searchedgar/companysearch.html</a></p>
<p><img src="http://webshot.okfnlabs.org/api/generate?url=http%3A%2F%2Fwww.sec.gov%2Fedgar%2Fsearchedgar%2Fcompanysearch.html" /></p>
<h2 id="bulk-data">Bulk Data</h2>
<p>EDGAR provides bulk access via FTP: <a href="ftp://ftp.sec.gov/">ftp://ftp.sec.gov/</a> - <a href="https://www.sec.gov/edgar/searchedgar/ftpusers.htm">official
documentation</a>. We summarize here the main points.</p>
<p>Each company in EDGAR gets an identifier known as the CIK, which is a 10-digit
number. You can find the CIK by searching EDGAR using a company name or stock market
ticker.</p>
<p>For example, <a href="http://www.sec.gov/cgi-bin/browse-edgar?CIK=ibm&action=getcompany">searching for IBM by ticker</a> shows us that
the CIK is <code class="language-plaintext highlighter-rouge">0000051143</code>.</p>
<p>Note that leading zeroes are often omitted (e.g. in the ftp access) so this
would become <code class="language-plaintext highlighter-rouge">51143</code>.</p>
<p><img src="http://webshot.okfnlabs.org/api/generate?url=http%3A%2F%2Fwww.sec.gov%2Fcgi-bin%2Fbrowse-edgar%3FCIK%3Dibm%26action%3Dgetcompany&width=1024&height=768" /></p>
<p>Next each submission receives an ‘Accession Number’ (acc-no). For example,
IBM’s quarterly financial filing (form 10-Q) in October 2013 had accession
number: <code class="language-plaintext highlighter-rouge">0000051143-13-000007</code>.</p>
<h3 id="ftp-file-paths">FTP File Paths</h3>
<p>Given a company with CIK (company ID) XXX (omitting leading zeroes) and
document accession number YYY (acc-no on search results), file paths are of the form:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/edgar/data/XXX/YYY.txt
</code></pre></div></div>
<p>For example, for the IBM data above it would be:</p>
<p><a href="ftp://ftp.sec.gov/edgar/data/51143/0000051143-13-000007.txt">ftp://ftp.sec.gov/edgar/data/51143/0000051143-13-000007.txt</a></p>
<p>Note: if you are looking for a nice HTML version, you can find it in the
Archives section at a similar URL (just add -index.htm):</p>
<p><a href="http://www.sec.gov/Archives/edgar/data/51143/000005114313000007/0000051143-13-000007-index.htm">http://www.sec.gov/Archives/edgar/data/51143/000005114313000007/0000051143-13-000007-index.htm</a></p>
<h3 id="indices">Indices</h3>
<p>If you want to get a list of all filings you’ll want to grab an Index. As the help page explains:</p>
<blockquote>
<p>The EDGAR indices are a helpful resource for FTP retrieval, listing the
following information for each filing: Company Name, Form Type, CIK, Date
Filed, and File Name (including folder path).</p>
<p>Four types of indexes are available:</p>
<ul>
<li>company — sorted by company name</li>
<li>form — sorted by form type</li>
<li>master — sorted by CIK number</li>
<li>XBRL — list of submissions containing XBRL financial files, sorted by CIK
number; these include Voluntary Filer Program submissions</li>
</ul>
</blockquote>
<p>URLs are like:</p>
<p><a href="ftp://ftp.sec.gov/edgar/full-index/2008/QTR4/master.gz">ftp://ftp.sec.gov/edgar/full-index/2008/QTR4/master.gz</a></p>
<p>That is, they have the following general form:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ftp://ftp.sec.gov/edgar/full-index/{YYYY}/QTR{1-4}/{index-name}.[gz|zip]
</code></pre></div></div>
<p>So for XBRL in the 3rd quarter of 2010 we’d do:</p>
<p><a href="ftp://ftp.sec.gov/edgar/full-index/2010/QTR3/xbrl.gz">ftp://ftp.sec.gov/edgar/full-index/2010/QTR3/xbrl.gz</a></p>
<h3 id="cik-lists-and-lookup">CIK lists and lookup</h3>
<p>There’s a full list of all companies along with their CIK code here: <a href="http://www.sec.gov/edgar/NYU/cik.coleft.c">http://www.sec.gov/edgar/NYU/cik.coleft.c</a></p>
<p>If you want to look up a CIK or company by its ticker you can do the following query against the normal search system:</p>
<p><a href="http://www.sec.gov/cgi-bin/browse-edgar?CIK=ibm&Find=Search&owner=exclude&action=getcompany&output=atom">http://www.sec.gov/cgi-bin/browse-edgar?CIK=ibm&Find=Search&owner=exclude&action=getcompany&output=atom</a></p>
<p>Then parse the atom to grab the CIK. (If you prefer HTML output just omit output=atom).</p>
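<p>A small sketch of that lookup in Python. The regular expression assumes the feed carries the zero-padded CIK in a <code class="language-plaintext highlighter-rouge">cik</code> element, so treat the exact element name as an assumption to verify against the live feed:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re
import urllib.request

def cik_for_ticker(ticker):
    """Look up a CIK by ticker via the EDGAR company search (atom output)."""
    url = ("http://www.sec.gov/cgi-bin/browse-edgar?CIK=%s"
           "&Find=Search&owner=exclude&action=getcompany&output=atom" % ticker)
    atom = urllib.request.urlopen(url).read().decode("utf-8")
    # Element name assumed from the feed structure; matched case-insensitively.
    match = re.search(r"<cik>(\d+)</cik>", atom, re.IGNORECASE)
    return match.group(1) if match else None

print(cik_for_ticker("ibm"))   # expected: 0000051143
</code></pre></div></div>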
<p>There is also a full-text company name to CIK lookup here:</p>
<p><a href="http://www.sec.gov/edgar/searchedgar/cik.htmL">http://www.sec.gov/edgar/searchedgar/cik.htmL</a></p>
<p>(Note this does a POST to a ‘text’ API at <a href="http://www.sec.gov/cgi-bin/cik.pl.c">http://www.sec.gov/cgi-bin/cik.pl.c</a>)</p>
Rufus Pollock
Labs newsletter: 20 February, 2014
2014-02-20T00:00:00+00:00
http://okfnlabs.org/blog/2014/02/20/newsletter
<p>The past few weeks have seen major improvements to the Labs website, another Open Data Maker Night in London, updates to the TimeMapper project, and more.</p>
<h2 id="labs-hangout-today">Labs Hangout: today</h2>
<p>The next Labs online hangout is taking place today in just a few hours—now’s your chance to sign up on <a href="http://pad.okfn.org/p/labs-hangouts">the hangout’s Etherpad</a>!</p>
<p>Labs hangouts are informal online gatherings held on Google Hangout at which Labs members and friends get together to discuss their work and to set the agenda for Labs activities.</p>
<p>Today’s hangout will take place at 1700 - 1800 GMT. Check <a href="http://pad.okfn.org/p/labs-hangouts">the hangout pad</a> for more details, and watch the pad for notes from the meeting.</p>
<h2 id="crowdcrafting-at-citizen-cyberscience-summit-2014">Crowdcrafting at Citizen Cyberscience Summit 2014</h2>
<p>In today’s other news, Labs’s <a href="http://okfnlabs.org/members/teleyinex">Daniel Lombraña González</a> is presenting <a href="http://okfnlabs.org/projects/crowdcrafting-and-pybossa/">Crowdcrafting</a> at the <a href="http://lanyrd.com/2014/citizen-cyberscience-summit/">Citizen Cyberscience Summit 2014</a>. You can read more about his presentation <a href="http://lanyrd.com/2014/citizen-cyberscience-summit/sctxth/">here</a>.</p>
<p>Crowdcrafting is an open-source citizen science platform that “empowers citizens to become active players in scientific projects by donating their time in order to solve micro-task problems”. Crowdcrafting has been used by institutions including CERN, the United Nations, and the National Institute of Space Research of Brazil.</p>
<h2 id="labs-site-updates">Labs site updates</h2>
<p>Labs has been discussing improving the website for some time now, and the past weeks have seen many of those proposed improvements being put into action.</p>
<p>One of the biggest changes is a new <a href="http://okfnlabs.org/projects/">projects page</a>. Besides having a beautiful new layout, the new projects page implements <a href="https://github.com/okfn/okfn.github.com/issues/160">filtering</a> by tags, language, and more.</p>
<p>The site now also features a reciprocal linking of users and projects. The projects page now shows projects’ maintainers (n.b. plural!), and user pages now show which projects users contribute to (e.g. <a href="http://okfnlabs.org/members/andylolz/">Andy Lulham’s page</a> highlights his <a href="http://okfnlabs.org/projects/data-pipes/">Data Pipes</a> contributions).</p>
<h2 id="timemapper-improvements">TimeMapper improvements</h2>
<p><a href="http://timemapper.okfnlabs.org/">TimeMapper</a> is a Labs project allowing you to create elegant timelines with inline maps from Google Spreadsheets in a matter of seconds.</p>
<p>A number of improvements have been made to TimeMapper:</p>
<ul>
<li>support for <a href="https://github.com/okfn/timemapper/issues/121">different types of data views</a> (including simple timeline and map views)</li>
<li><a href="https://github.com/okfn/timemapper/issues/109">nice URLs for anonymous timemaps</a></li>
<li>bugfix: <a href="https://github.com/okfn/timemapper/issues/86">missing start date doesn’t cause a crash</a></li>
</ul>
<h2 id="open-data-maker-night-february">Open Data Maker Night February</h2>
<p>Two weeks ago today, the <a href="http://www.meetup.com/OpenKnowledgeFoundation/London-GB/1093152/">ninth Open Data Maker Night London</a> was hosted by <a href="http://okfnlabs.org/members/andylolz">Andy Lulham</a>. This edition was a <em>mapping special</em>, featuring <a href="http://www.openstreetmap.org/">OpenStreetMap</a> contributor <a href="http://harrywood.co.uk/">Harry Wood</a>.</p>
<p><a href="http://okfnlabs.org/events/open-data-maker/">Open Data Maker Nights</a> are informal, action-oriented get-togethers where <em>things get made</em> with open data. Visit the Labs website for <a href="http://okfnlabs.org/events/open-data-maker/">more information</a> on them, including info on how to host your own.</p>
<h2 id="datapackage--bubbles">DataPackage + Bubbles</h2>
<p>In last week’s newsletter, you heard about <a href="http://okfnlabs.org/members/Stiivi/">Štefan Urbánek</a>’s abstract data processing framework <a href="https://github.com/Stiivi/bubbles">Bubbles</a>. Štefan just notified the OKFN Labs list that he has created a <a href="https://gist.github.com/Stiivi/9104719">demo of Bubbles using Data Packages</a>, Labs’s simple standard for data publication.</p>
<p>“The example is artificial”, Štefan says, but it highlights the power of the Bubbles framework and the potential of the Data Package format.</p>
<h2 id="get-involved">Get involved</h2>
<p>We’re always looking for new contributions at the Labs. Read about <a href="http://okfnlabs.org/join/">how you can join</a>, and see the <a href="http://okfnlabs.org/ideas/">Ideas Page</a> to get in on the ground floor of a Labs project—or just join the <a href="http://lists.okfn.org/mailman/listinfo/okfn-labs">Labs mailing list</a> to participate by offering feedback.</p>
Neil Ashton
Labs newsletter: 30 January, 2014
2014-01-30T00:00:00+00:00
http://okfnlabs.org/blog/2014/01/30/newsletter
<p>From now on, the Labs newsletter will arrive through a special announce-only mailing list, <em>newsletter@okfnlabs.org</em>, more details on which can be found below.</p>
<p>Keep reading for other new developments including the fifth Labs Hangout, the launch of SayIt, and new developments in the vision of “Frictionless Data”.</p>
<h2 id="new-newsletter-format">New newsletter format</h2>
<p>Not everyone who wants to know about Labs activities wants or needs to observe those activities unfolding on the main Labs list. For friends of Labs who just want occasional updates, we’ve created a new, <a href="http://sendy.co/">Sendy</a>-based announce-only list that will bring you a Labs newsletter every two weeks.</p>
<p>Everyone currently subscribed to <em>okfn-labs@lists.okfn.org</em> has been added to the new list. To join the new announce list, see the <a href="http://okfnlabs.org/contact/">Labs Contact page</a>, where there’s a form.</p>
<h2 id="labs-hangout-no-5">Labs Hangout no. 5</h2>
<p>Last Thursday, <a href="http://okfnlabs.org/members/andylolz">Andy Lulham</a> hosted the fifth OKFN Labs Hangout. The Labs Hangouts are a way for people curious about Labs projects to informally get together, share their work, and talk about the future of Labs.</p>
<p>For full details, check out the <a href="http://pad.okfn.org/p/labs-hangouts">minutes from the hangout</a>. Highlights included:</p>
<ul>
<li>SayIt, a new publication platform for speeches & transcripts, introduced by <a href="http://twitter.com/steiny">Tom Steinberg</a> of <a href="http://t.co/KKNpVhbitu">mySociety</a> (see below for more!)</li>
<li>announcement of an <a href="http://humanities.okfn.org/open-literature-sprint-jan-2014/">Open Literature Sprint</a> this past Saturday</li>
<li>full coverage of PyBossa source code with <a href="https://coveralls.io/r/PyBossa/pybossa">unit tests</a></li>
<li><a href="http://twitter.com/tfmorris">Tom Morris</a>’s work parsing and importing e-publications from <a href="http://openlibrary.org">Open Library</a></li>
<li>updates to <a href="http://data.okfn.org/vision">Frictionless Data</a> (see below)</li>
</ul>
<h2 id="sayit">SayIt</h2>
<p><a href="http://sayit.mysociety.org/">SayIt</a>, an open-source tool for publishing and sharing transcripts, has just been launched by <a href="http://poplus.org/">Poplus</a>. At last week’s Labs Hangout, <a href="http://twitter.com/steiny">Tom Steinberg</a> of <a href="http://t.co/KKNpVhbitu">mySociety</a> (one half of Poplus, alongside <a href="http://www.ciudadanointeligente.org/?lang=en">Ciudadano Inteligente</a>) shared some of the motivations behind the creation of the tool, which was also discussed <a href="https://lists.okfn.org/pipermail/okfn-discuss/2014-January/010083.html">on the okfn-discuss mailing list</a>.</p>
<p>As Tom explained, mySociety’s <a href="http://www.theyworkforyou.com/">They Work For You</a> has proven the popularity of transcript data. But making the transcripts available in a nice way (e.g. with a decent API) has so far called for bespoke software development. SayIt is designed to encourage “nice” publication as the starting-point—and to serve as a pedagogical example of what a good data publication tool looks like.</p>
<h2 id="frictionless-data-vision-roadmap-composability">Frictionless data: vision, roadmap, composability</h2>
<p>We’ve heard about <a href="http://okfnlabs.org/members/rgrp">Rufus</a>’s vision for an ecosystem of “frictionless data” in the past. Now the discussion is starting to get serious. <a href="http://data.okfn.org/">data.okfn.org</a> now hosts two key documents generated through the conversation:</p>
<ul>
<li><a href="http://data.okfn.org/vision">the vision</a>: what will create a dynamic, productive, and attractive open data ecosystem?</li>
<li><a href="http://data.okfn.org/roadmap">the roadmap</a>: what has to happen to bring this vision to life?</li>
</ul>
<p>The new roadmap is a particularly lucid overview of how the frictionless data vision connects with concrete actions. Would-be creators of this new ecosystem should consult the roadmap to see where to join in.</p>
<p><a href="https://lists.okfn.org/pipermail/okfn-labs/2014-January/001260.html">Discussion on the Labs list</a> has also generated some interesting insights. <a href="http://t.co/pL0Yy7uNuf">Data Unity</a>’s Kev Kirkland discussed his work with Semantic Web formalization of composable data manipulation processes, and <a href="http://okfnlabs.org/members/Stiivi/">Štefan Urbánek</a> made a connection with his work on “abstracting datasets and operations” in the ETL framework <a href="https://github.com/Stiivi/bubbles">Bubbles</a>.</p>
<h2 id="on-the-blog-olap-part-two">On the blog: OLAP part two</h2>
<p>Last week, <a href="http://okfnlabs.org/members/Stiivi/">Štefan Urbánek</a> wrote us an <a href="http://okfnlabs.org/blog/2014/01/10/olap-introduction.html">introduction to Online Analytical Processing</a>. Shortly afterwards, he followed up with a second post taking a closer look at <a href="http://okfnlabs.org/blog/2014/01/20/olap-cubes-and-logical-model.html">how OLAP data is structured and why</a>.</p>
<p>Check out Štefan’s post to learn about how OLAP represents data as multidimensional “cubes” that users can slice and dice to explore the data along its many dimensions.</p>
<h2 id="timemapper-improvements">TimeMapper improvements</h2>
<p><a href="http://okfnlabs.org/members/andylolz">Andy Lulham</a> has started working on <a href="http://timemapper.okfnlabs.org">TimeMapper</a>, Labs’s easy-to-use tool for the creation of interactive timelines linked to geomaps.</p>
<p>Some of the improvements he has made so far have been bugfixes (e.g. <a href="https://github.com/okfn/timemapper/pull/119">preventing overflowing form controls</a>, <a href="https://github.com/okfn/timemapper/pull/118">fixing the template settings file</a>), but one of them is a new user feature: adding a way to <a href="http://timemapper.okfnlabs.org">change the starting event</a> on a timeline so that they don’t always have to start at the beginning.</p>
<h2 id="get-involved">Get involved</h2>
<p>Want to get involved with Labs’s projects? Now is a great time to join in! Check out the <a href="http://okfnlabs.org/ideas/">Ideas Page</a> to see some of the many things you can do once you <a href="http://okfnlabs.org/join/">join Labs</a>, or just jump on the <a href="http://lists.okfn.org/mailman/listinfo/okfn-labs">Labs mailing list</a> and take part in a conversation.</p>
Neil Ashton
OLAP Cubes and Logical Models
2014-01-20T00:00:00+00:00
http://okfnlabs.org/blog/2014/01/20/olap-cubes-and-logical-model
<p>Last time we talked about OLAP in general – what it is and why it is useful.
Today we are going to look at the data – how they are structured and why. What
are cubes? What does “multi-dimensional” mean?</p>
<h2 id="data-cubes-and-logical-model">Data Cubes and Logical Model</h2>
<p>Application data might be a mess from the user’s perspective. Not only that, the data
might be scattered all around the place in multiple systems. Even when the
data are put into one place called a “data warehouse”, they still have
their original form, which is not ready to answer our questions quickly.
The purpose of the logical model is to hide the physical structure of the data (how
applications use it) and provide a user-oriented view of the data (how the business
sees it).</p>
<p>“Answering questions quickly” does not depend only on database performance and
the amount of data. We might have the fastest database and computation engine in
the world and still not get the answer quickly, because it can take weeks
to properly translate the human (business) question into technical terms.
The challenges are:</p>
<ul>
<li>Where are the data stored? What table? Which column?</li>
<li>What are the categories and what can I summarize?</li>
<li>What are the relationships between columns?</li>
<li>Is this <code class="language-plaintext highlighter-rouge">category_id</code> column the same as this <code class="language-plaintext highlighter-rouge">pk_prod_cat</code>?</li>
<li>Does this column contain a key (which is unique) or is it a label (which
might not be, due to data evolution)?</li>
<li>How can I group the data?</li>
</ul>
<p>All this information is collected in <em>metadata</em> called the <em>logical model</em>.
Analysts or report writers do not have to know where the name of an organisation
or category is stored, nor do they have to care whether customer data are
stored in a single table or spread across multiple tables (customer, customer
types, …). They just ask for “customer name” or “category code”.</p>
<h2 id="cubes">Cubes</h2>
<p>The data structures used in the OLAP are multidimensional data cubes or <a href="http://en.wikipedia.org/wiki/OLAP_cube"><em>OLAP
cubes</em></a>:</p>
<p><img src="/img/posts/olap-data_cube.png" alt="" /></p>
<p>A cube is a data structure that can be imagined as a multi-dimensional
spreadsheet. How can we imagine it? Take a spreadsheet, put year on columns,
department on rows – that’s a two-dimensional cube. Now create multiple sheets
with data of the same structure, say one sheet per country. Now you have a
three-dimensional cube.</p>
<h2 id="facts-and-measures">Facts and Measures</h2>
<p>A fact is the most detailed piece of information that can be measured.</p>
<p><img src="/img/posts/olap-data_cube-fact.png" alt="" /></p>
<p>Examples of facts might be a contract, an item of spending, a phone call or a visit. We
can measure:</p>
<ul>
<li>contract: financial amount, discount, planned amount</li>
<li>spending: financial amount, quantity</li>
<li>phone call: duration, cost</li>
<li>visit: duration</li>
</ul>
<p>Those measurable properties, such as amount, discount or duration, are
called <em>measures</em>.</p>
<p>We are mostly interested in a summarized view: “what was the overall spending?”,
“what is the average call duration?” or “how many contracts are there?” Those
computed values are called <em>aggregates</em> or <em>aggregated measures</em>.</p>
<p>Facts might have multiple measures, or they might even have none. If there are
no measures, we can still at least answer questions of the type “how many?”.</p>
<p>Note: the terminology might differ slightly across literature and systems.
For example, Microsoft calls a <em>measure</em> a <em>measure group</em> and labels
<em>aggregates</em> as <em>measures</em>.</p>
<h2 id="dimensions">Dimensions</h2>
<p>OLAP is suitable mostly for data which can be categorized – grouped by
categories. The categorical view of the data should also be the main interest of
the data analysis. Examples of categories might be: color, department, location
or even a date.</p>
<p>The categories are called <em>dimensions</em>.</p>
<p><img src="/img/posts/olap-data_cube-dimensions.png" alt="" /></p>
<p>Dimensions provide <em>context for facts</em>:</p>
<ul>
<li>Where did that happen?</li>
<li>When was the contract signed?</li>
<li>What kind of goods or services was in the contract?</li>
</ul>
<p>Dimensions are used to <em>filter</em> queries:</p>
<ul>
<li>What was the spending last year?</li>
<li>How many contracts signed by the department of Health?</li>
</ul>
<p>They are used to control the <em>scope of aggregation</em> of facts (a small sketch follows the list below):</p>
<ul>
<li>What was the number of contracts by department?</li>
<li>What was the average visit duration per month?</li>
<li>What are the sales of each product?</li>
</ul>
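<p>A toy illustration of these two uses of dimensions - filtering and scoping aggregation - with pandas standing in for an OLAP browser (the fact table and its values are made up):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd

# Toy fact table: each row is one contract (a fact) with its measure
# (amount) and dimensions (year, country, department).
contracts = pd.DataFrame({
    "year":       [2010, 2010, 2011, 2011],
    "country":    ["Estonia", "Estonia", "Slovakia", "Estonia"],
    "department": ["Health", "IT", "Health", "IT"],
    "amount":     [100.0, 250.0, 80.0, 310.0],
})

# Filter by dimension values: "contracts in Estonia in 2010".
estonia_2010 = contracts[(contracts.country == "Estonia") & (contracts.year == 2010)]
print(len(estonia_2010), "contracts in the slice")

# A dimension controls the scope of aggregation: "number of contracts
# and total amount, by department".
print(contracts.groupby("department").agg(
    contract_count=("amount", "size"),
    total_amount=("amount", "sum")))
</code></pre></div></div>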
<h2 id="concept-hierarchies">Concept Hierarchies</h2>
<p>We might be interested in the amount per year, then per month for a particular
year; products can be grouped by categories and subcategories; a location might
be defined by country, and a country might have multiple cities… Those are concept
hierarchies of dimensions.</p>
<p>A hierarchy has multiple levels, and there might be various hierarchical views of
any dimension. For example, the date might be split by year, month and day. Or
it might be split by year, quarter, month and no day (because we have no daily
data), or by year and week (for weekly data).</p>
<p>From a technical perspective you might associate an attribute with a dimension.
Depending on the modelling method, a dimension might be composed of just one
attribute or of multiple attributes grouped into hierarchies.</p>
<p>Note: there are multiple approaches to concept hierarchies. The one described
here is: a dimension might be composed of multiple levels, and the levels are grouped into
hierarchies. Another approach might be “hierarchies are lists of dimensions”,
where a dimension represents just a single attribute.</p>
<h2 id="slicing-and-dicing">Slicing and Dicing</h2>
<p>We have a data cube full of facts; how can we explore the data? We slice the
cube! What does that mean?</p>
<p>Say we have a data cube of contracts with the dimensions <em>time</em>, <em>country</em> and
<em>type (of procured subject)</em>.</p>
<p>We might be interested in spending in 2010:</p>
<p><img src="/img/posts/olap-slice_and_dice-time.png" alt="" /></p>
<p>… or contracts in Estonia:</p>
<p><img src="/img/posts/olap-slice_and_dice-country.png" alt="" /></p>
<p>… or contracts in Estonia in 2010:</p>
<p><img src="/img/posts/olap-slice_and_dice-country_time.png" alt="" /></p>
<p>… or just IT contracts in general:</p>
<p><img src="/img/posts/olap-slice_and_dice-type.png" alt="" /></p>
<p>… or IT contracts in Estonia in 2010:</p>
<p><img src="/img/posts/olap-slice_and_dice-all_dimensions.png" alt="" /></p>
<p>Some OLAP systems might have this information readily available in a
pre-computed (pre-aggregated) form, therefore we might get the answer very quickly
despite the huge amount of original data. Even if the system does not store the
pre-aggregated data cells, it might use some other transparent tricks to
achieve fast responses.</p>
<p>Slicing and dicing is an operation that filters the data cells of a cube and
narrows our focus from a broader view:</p>
<p><img src="/img/posts/olap-slice_and_dice-overview.png" alt="" /></p>
<h2 id="drilling-down">Drilling down</h2>
<p><em>How many contracts per year?</em> or <em>Which type of products was most wanted in
2012?</em> are the kinds of questions that are answered by “drilling down” through the
data. Drilling down means changing our focus to more detailed data.</p>
<p>Drilling down can be done along concept hierarchies – for example going from a year
summary to a month summary to daily sales, or by going from the country level to the
regional level.</p>
<p>The opposite operation is called “roll-up” – for example going from a monthly
view to a yearly view.</p>
<h2 id="try-it">Try It</h2>
<p>You might try OLAP with the light-weight Python framework
<a href="http://cubes.databrewery.org">Cubes</a>. I’ll be talking about the framework in
more detail in the future; meanwhile, here are the main features (a short sketch follows the list):</p>
<ul>
<li>ROLAP – OLAP on top of relational database</li>
<li>quick prototyping on top of existing database schemas</li>
<li>metadata driven with user-oriented metadata</li>
<li>localizable</li>
<li>OLAP API with HTTP <a href="http://pythonhosted.org/cubes/server.html">server</a></li>
<li>no need to know Python</li>
</ul>
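<p>As a starting point, here is a sketch in the style of the Cubes tutorial - it assumes a SQLite database and a <code class="language-plaintext highlighter-rouge">model.json</code> logical model already exist, the cube and dimension names are placeholders, and the exact API varies between Cubes releases:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from cubes import Workspace

# Placeholder store and model names; both come from your own setup.
workspace = Workspace()
workspace.register_default_store("sql", url="sqlite:///data.sqlite")
workspace.import_model("model.json")

browser = workspace.browser("contracts")   # cube name from the model

# Aggregate over the whole cube ...
result = browser.aggregate()
print(result.summary)

# ... then drill down along a dimension for per-member summaries.
for record in browser.aggregate(drilldown=["department"]):
    print(record)
</code></pre></div></div>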
<p>The <a href="https://github.com/Stiivi/cubes">development version</a> includes pluggable
datawarehouse (cubes from external sources) and many new backends such as
MongoDB.</p>
<p>For reporting and data exploration you might use
<a href="http://jjmontesl.github.io/cubesviewer/">CubesViewer</a>. More visualisation
software is being developed.</p>
<h2 id="summary">Summary</h2>
<p>The concept of OLAP cubes and multidimensional modeling brings more understandable
and usable data to end-users. It is very easy and straightforward to
translate business questions into multidimensional queries.</p>
<p>The OLAP systems, thanks to the nature of multi-dimensional data cubes, can
prepare data by aggregating them up-front to provide answers faster.</p>
<p>Moreover, explicit metadata (the logical model) allows not only more flexible data
navigation but also easy transformation of the data for use in various
reporting software. Some OLAP tools can work with certain database schemas
immediately.</p>
<p>To sum it up in a few words, the multidimensional modeling of OLAP cubes brings:
understandability, better usability, speed and logical data reusability.</p>
<p>Next time we will look at the <a href="http://cubes.databrewery.org">Cubes – Lightweight Python
framework</a> – how to have an OLAP server running
“in 15 minutes”.</p>
Stefan Urbanek
Labs newsletter: 16 January, 2014
2014-01-16T00:00:00+00:00
http://okfnlabs.org/blog/2014/01/16/newsletter
<p>Welcome back from the holidays! A new year of Labs activities is well underway, with long-discussed improvements to the Labs projects page, many new PyBossa developments, a forthcoming community hangout, and more.</p>
<h2 id="labs-projects-page">Labs projects page</h2>
<p><a href="https://github.com/okfn/okfn.github.com/issues/46">Getting the Labs project page organized better</a> has been high on the agenda for some time now. In the past little while, significant progress has been made. New improvements to the project page include:</p>
<ul>
<li><a href="https://github.com/okfn/okfn.github.com/pull/168">a custom filter menu</a></li>
<li><a href="https://github.com/okfn/okfn.github.com/pull/165">individual project lightbox</a></li>
<li><a href="https://github.com/okfn/okfn.github.com/pull/159">attributes for projects</a></li>
</ul>
<p><a href="http://okfnlabs.org/members/loleg/">Oleg Lavrosky</a>, <a href="http://okfnlabs.org/members/teleyinex/">Daniel Lombraña González</a>, and <a href="http://okfnlabs.org/members/andylolz/">Andy Lulham</a> have all contributed to this development—and work is still ongoing, with <a href="https://github.com/okfn/okfn.github.com/issues/161">further enhancements to attributes</a> and <a href="https://github.com/okfn/okfn.github.com/issues/160">more work on the UI</a> still to come.</p>
<h2 id="lots-of-pybossa-milestones">Lots of PyBossa milestones</h2>
<p>PyBossa has achieved so many milestones since the last newsletter that it’s hard to know where to begin.</p>
<p>PyBossa v0.2.1 was released by <a href="http://okfnlabs.org/members/teleyinex/">Daniel Lombraña González</a>, becoming a more robust service through the inclusion of a new rate-limiting feature for API calls. Alongside rate limits, the new PyBossa has improved security through the addition of a secure cookie-based solution for posting task runs. Full details can be found <a href="http://docs.pybossa.com/en/latest/api.html#rate-limiting">in the documentation</a>.</p>
<p>Daniel also released <a href="http://daniellombrana.es/taggingpictures.html">a new PyBossa template</a> for annotating pictures. The template, which incorporates the <a href="http://annotorious.github.io/">Annotorious.JS</a> JavaScript library, “allow[s] anyone to extract structured information from pictures or photos in a very simple way”.</p>
<p>The <a href="https://github.com/PyBossa/enki">Enki</a> package for analyzing PyBossa applications was also released over the break. Enki makes it possible to download completed PyBossa tasks and associated task runs, analyze them with <a href="http://pandas.pydata.org/">Pandas</a>, and share the result as an <a href="http://ipython.org/notebook.html">IPython Notebook</a>. Check out Daniel’s <a href="http://daniellombrana.es/blog/2013/12/16/pybossa-enki.html">blog post</a> on Enki to see what it’s about.</p>
<h2 id="new-on-the-blog">New on the blog</h2>
<p>We’ve had a couple of great new contributions on the <a href="http://okfnlabs.org/blog/">Labs blog</a> since the last newsletter.</p>
<p><a href="http://okfnlabs.org/members/tlevine/">Thomas Levine</a> has written about <a href="http://okfnlabs.org/blog/2013/12/25/parsing-pdfs.html">how he parses PDF files</a>, lovingly exploring a problem that all data wranglers will encounter and gnash their teeth over at least a few times in their lives.</p>
<p><a href="http://okfnlabs.org/members/Stiivi/">Stefan Urbanek</a>, meanwhile, has written an <a href="http://okfnlabs.org/blog/2014/01/10/olap-introduction.html">introduction to OLAP</a>, “an approach to answering multi-dimensional analytical queries swiftly”, explaining what that means and why we should take notice.</p>
<h2 id="dānabox">Dānabox</h2>
<p>Labs friend <a href="http://okfn.org/members/darwin/">Darwin Peltan</a> reached out to the list to point out that his friend’s project <a href="http://danabox.io/">Dānabox</a> is looking for testers and general feedback. Labs members are invited to pitch in by finding bugs and breaking it.</p>
<p>Dānabox is “Heroku but with public payment pages”, crowdsourcing the payment for an app’s hosting costs. Dānabox is <a href="https://github.com/danabox">open source</a> and built on the <a href="http://deis.io/">Deis platform</a>.</p>
<h2 id="community-hangout">Community hangout</h2>
<p>It’s almost time for the Labs community hangout. The Labs hangout is the regular event where Labs members meet up online to discuss their work, find ways to collaborate, and set the agenda for the weeks to come.</p>
<p>When will the hangout take place? <a href="http://okfnlabs.org/members/rgrp">Rufus</a> proposes <a href="https://github.com/okfn/okfn.github.com/issues/167">moving the hangout from the 21st to the 23rd</a>. If you want to participate, leave a comment on the thread to let Labs know what time would work for you.</p>
<h2 id="get-involved">Get involved</h2>
<p>Labs is the Labs community, no more and no less, and you’re invited to become a part of it! <a href="http://okfnlabs.org/join/">Join the community</a> by coding, blogging, kicking around ideas on the <a href="http://okfnlabs.org/ideas/">Ideas Page</a>, or joining the conversation on the <a href="http://lists.okfn.org/mailman/listinfo/okfn-labs">Labs mailing list</a>.</p>
Neil Ashton
Introduction to OLAP
2014-01-10T00:00:00+00:00
http://okfnlabs.org/blog/2014/01/10/olap-introduction
<h2 id="what-is-olap">What is OLAP?</h2>
<p><em>“Online Analytical Processing – OLAP is an approach to answering
multi-dimensional analytical queries swiftly”</em> says
<a href="http://en.wikipedia.org/wiki/Online_analytical_processing">Wikipedia</a>. What
does that mean? What are multi-dimensional analytical queries? Why this
approach? We will learn all this in a short blog series.</p>
<p>The term OLAP is becoming a bit less appropriate. It comes from
traditional data warehousing, from times when “big data” would fit on your
current laptop and even that little amount was time-consuming to process by
today’s standards. Nowadays the majority of analytical
processing can be considered online. A more appropriate term is
“multidimensional data processing”, as we will see later. For now we will stick
with the original name of the approach.</p>
<p>The basic concepts of OLAP are:</p>
<ul>
<li><em>data cubes</em> – multi-dimensional approach to data</li>
<li>fast aggregation or pre-aggregation</li>
</ul>
<h2 id="why-olap">Why OLAP?</h2>
<p>There are two sides to data: data used by application systems
(transactional, operational) and data used by humans for decision
making. The two kinds of data are in many cases very different, as
applications have different needs than humans do.</p>
<p><img src="/img/posts/olap-overview.png" alt="" /></p>
<p>Applications require and store more detailed data. They require, for example,
efficiency and integrity at the transaction (operation) level. The data might be
stored in different places, depending on how they are used by the systems. On
the other hand, decision makers want to see the data in one place and in a
form that reflects their view of the world. They don’t care how long it
takes to store a million transactions; they want to know when those million
transactions happened and where.</p>
<p>Why OLAP then? Decision makers, analysts, or any other curious people would
like to have their questions answered quickly. The data in operational database
systems are not stored in a way that allows those questions to be answered
easily. OLAP is the technical and semantic bridge between the two ways of
using the data.</p>
<p>Why can an OLAP system <em>answer the analytical queries swiftly</em>, or at least
faster than an operational system? The data in the analytical system are already
modeled and prepared in a form closer to the typical questions:</p>
<p><em>Pre-aggregation</em> – data are aggregated at different levels of granularity,
such as monthly totals, and stored. Questions like “what were the sales in
January 2010?” then require no computation, just a direct fetch of the
answer. The data might be pre-aggregated at all levels and combinations of
their properties (dimensions).</p>
<p><em>Multi-dimensional databases</em> – data are stored in alternative kinds of
faster structures.</p>
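<p>To make pre-aggregation concrete, here is a minimal sketch using <a href="http://pandas.pydata.org/">pandas</a> (an illustration added here, with made-up column names, not part of the original text): detail rows are rolled up to monthly totals once, so the January 2010 question becomes a lookup rather than a scan.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd

# illustrative transactional data: one row per sale
sales = pd.DataFrame({
    'date': pd.to_datetime(['2010-01-05', '2010-01-20', '2010-02-03']),
    'amount': [120.0, 80.0, 200.0],
})

# pre-aggregate once: monthly totals stored alongside the detail data
monthly = sales.set_index('date').resample('M')['amount'].sum()

# "what were the sales in January 2010?" is now a direct fetch,
# not a scan over the transaction table
print(monthly.loc['2010-01'])
</code></pre></div></div>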
<h2 id="different-approach">Different Approach</h2>
<p>In operational systems, updates of information are permitted, even required.
Keeping a change history might be impractical for events with high frequency. From
an analytical point of view this can be undesirable, as we would like to know the
historical evolution of a state. For example: while it is sufficient for an
application to know the current account balance or the available budget of some
kind, analysts would like to see how the amount changed over time and how it
might have been influenced by other events.</p>
<p>One of the differences that I consider crucial is the system requirements:
for operational systems the requirements for data design are well known up
front. It is very unlikely that an application will generate an unexpected
query, so the system is built around the application’s
behavior. The requirements of analysts, on the other hand, are ad hoc. The
analytical system has to be designed in a way that allows quick answers
to business questions.</p>
<p>Redundancy: in applications, redundancy can introduce quite a lot of errors,
mostly through data inconsistency. In analytical systems, redundancy is often
desired: any information that is readily available close to
the data being queried makes responses faster. The design of
analytical systems and the presence of historical data allow the redundant
information to be reconstructed, so any inconsistencies can be corrected.
Examples of redundancy in analytical systems:</p>
<ul>
<li>multiple copies of the same data in different contexts</li>
<li>denormalized data</li>
</ul>
<p>One more difference I will mention here is the amount of data being
processed at a single time. In operational systems, only a small amount of data is
required to complete the desired operation. For example: a change of a client’s address
(the client’s identification and new address) or a budget expense (the budget line and the
expended amount). In an analytical system, a large amount of data has to be
“touched” to answer an analyst’s question such as “what was the spending by country?”</p>
<p>Here are the differences between analytical and operational data, summarized:</p>
<ul>
<li>subject oriented vs. application oriented</li>
<li>summarized vs. detailed</li>
<li>analysis driven vs. transaction driven</li>
<li>read-only vs. updateable</li>
<li>unknown processing requirements vs. well known processing requirements</li>
<li>redundancy allowed vs. redundancy undesired</li>
<li>large amount of data per operation vs. tiny amount of data per operation</li>
</ul>
<p>The two systems can be physically separate, but with modern tools the
analytical system can also be integrated with the transactional one on a
single database platform.</p>
<h2 id="software">Software</h2>
<p>There are many commercial databases and applications for multi-dimensional
data modelling and OLAP, in the form of Business Intelligence suites from
historically big names such as Oracle, Microsoft, SAS and others. Search for
“OLAP documentation” together with a company name to get an idea of their
approach, capabilities and features.</p>
<p>A newer trend is to offer OLAP or OLAP-like systems as a service, where one
imports data, the software transforms the data into data cubes, and a
reporting interface is provided on top.</p>
<p>There are very few open-source OLAP packages, though, and even fewer
general-purpose ones. Just to mention two:</p>
<ul>
<li><a href="http://cubes.databrewery.org">Cubes</a> – lightweight OLAP framework (written
in Python, <a href="https://github.com/Stiivi/cubes">Github</a>); a short usage sketch follows this list</li>
<li><a href="http://www.pentaho.com">Pentaho</a> – full-featured Business Intelligence
suite (written in Java)</li>
</ul>
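<p>To give a taste of the first one, here is a minimal, hedged sketch based on the Cubes tutorial. It assumes a <code class="language-plaintext highlighter-rouge">slicer.ini</code> configuration describing a hypothetical “sales” cube with a “date” dimension, and the exact API may differ between Cubes versions.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from cubes import Workspace

# load the workspace from a configuration file (assumed to exist)
workspace = Workspace("slicer.ini")

# get an aggregation browser for the hypothetical "sales" cube
browser = workspace.browser("sales")

# aggregate measures over the whole cube
result = browser.aggregate()
print(result.summary)

# drill down by a dimension assumed to be in the model
for record in browser.aggregate(drilldown=["date"]):
    print(record)
</code></pre></div></div>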
<h2 id="summary">Summary</h2>
<p>OLAP is a way of making transactional data usable and understandable for
decision making.</p>
<p>Further reading:</p>
<ul>
<li><a href="http://en.wikipedia.org/wiki/Data_warehousing">Data Warehousing</a></li>
<li><a href="http://en.wikipedia.org/wiki/Business_intelligence">Business Intelligence</a></li>
</ul>
<p>Next time we will look at multi-dimensional modeling.</p>
Stefan Urbanek
How I parse PDF files
2013-12-25T00:00:00+00:00
http://okfnlabs.org/blog/2013/12/25/parsing-pdfs
<p>Much of the world’s data are stored in portable document format (PDF) files.
This is not my preferred storage or presentation format, so I often convert
such files into databases, graphs, or <a href="http://csvsoundsystem.com">spreadsheets</a>.
I sort of follow this decision process.</p>
<ol>
<li>Do we need to read the file contents at all?</li>
<li>Do we only need to extract the text and/or images?</li>
<li>Do we care about the layout of the file?</li>
</ol>
<h2 id="example-pdfs">Example PDFs</h2>
<p>I’ll show a few different approaches to parsing and analyzing
<a href="https://github.com/tlevine/scott-documents">these</a> PDF files.
Different approaches make sense depending on the question you ask.</p>
<p>These files are public notices of applications for permits to dredge or fill
wetlands. The Army Corps of Engineers posts these notices so that the public
may comment on the notices before the Corps approves them; people are thus
able to voice concerns about whether these permits would fall within the rules
about what sorts of construction is permissible.</p>
<p>These files are
<a href="https://github.com/tlevine/scott/tree/master/reader">downloaded daily</a>
from the <a href="http://www2.mvn.usace.army.mil/ops/regulatory/publicnotices.asp?ShowLocationOrder=False">New Orleans Army Corps of Engineers website</a>
and renamed according to the permit application and the date of download.
They once fed into the Gulf Restoration Network’s efforts to
protect the wetlands from reckless destruction.</p>
<h2 id="if-i-dont-need-the-file-contents">If I don’t need the file contents</h2>
<p>Basic things like file size, file name and modification date might be useful
in some contexts. In the case of PDFs, file size will give you an idea of how
many/much of the PDFs are text and how many/much are images.</p>
<p>Let’s <a href="https://github.com/dzerbino/ascii_plots/blob/master/hist">plot a histogram</a>
of the file sizes. I’m running this from the root of the documents repository,
and I cleaned up the output a tiny bit.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ls --block-size=K -Hs */public_notice.pdf | sed 's/[^0-9 ].*//' | hist 5
15 | 2 | **
20 | 55 | ********************************************************************************
25 | 4 | *****
30 | 4 | *****
35 | 11 | ****************
40 | 4 | *****
45 | 2 | **
50 | 2 | **
60 | 1 | *
75 | 1 | *
80 | 1 | *
95 | 1 | *
100 | 2 | **
120 | 1 | *
125 | 2 | **
135 | 1 | *
145 | 3 | ****
150 | 6 | ********
155 | 4 | *****
160 | 8 | ***********
165 | 3 | ****
170 | 6 | ********
175 | 7 | **********
180 | 24 | **********************************
185 | 11 | ****************
190 | 6 | ********
195 | 4 | *****
200 | 23 | *********************************
205 | 7 | **********
210 | 7 | **********
215 | 3 | ****
220 | 3 | ****
225 | 1 | *
230 | 1 | *
235 | 1 | *
240 | 2 | **
245 | 2 | **
250 | 1 | *
255 | 3 | ****
265 | 1 | *
280 | 1 | *
460 | 1 | *
545 | 1 | *
585 | 1 | *
740 | 1 | *
860 | 2 | **
885 | 1 | *
915 | 1 | *
920 | 1 | *
945 | 1 | *
950 | 1 | *
980 | 1 | *
2000 | 1 | *
2240 | 1 | *
2335 | 1 | *
7420 | 1 | *
TOTAL| 248 |
</code></pre></div></div>
<p>The histogram shows us two modes. The smaller mode, around 20 kb, corresponds to
files with no images (PDF export from Microsoft Word), and the larger mode
corresponds to files with images (scans of print-outs of the Microsoft Word
documents). It looks like about 80 are just text and the other 170 are scans.</p>
<p>This isn’t a real histogram, but if we’d used a real one with an interval scale,
the outliers would be more obvious. Let’s cut off the distribution at 400 kb
and look more closely at the unusually large documents that are above that
cutoff.</p>
<p>What’s in that 7 mb file? Well, let’s find it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ls --block-size=K -Hs */public_notice.pdf | grep '742.K'
7424K MVN-2010-1080-WLL_MVN-2010-1032-WLLB/public_notice.pdf
</code></pre></div></div>
<p>You can see it <a href="https://github.com/tlevine/scott-documents/raw/master/MVN-2010-1080-WLL_MVN-2010-1032-WLLB/public_notice-2012-08-09.pdf">here</a>.
It’s not a typical public notice; rather, it is a series of scanned documents
related to a permit transfer request. Interesting.</p>
<p>Next, how are two large files within 5 kb of each other?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ls --block-size=K -Hs */public_notice.pdf | grep 860K
860K MVN-2012-006152-WII/public_notice.pdf
860K MVN-2012-1797-CU/public_notice.pdf
</code></pre></div></div>
<p>Those are here</p>
<ul>
<li><a href="https://github.com/tlevine/scott-documents/raw/master/MVN-2012-006152-WII/public_notice-2012-11-20.pdf">MVN-2012-006152-WII</a></li>
<li><a href="https://github.com/tlevine/scott-documents/raw/master/MVN-2012-1797-CU/public_notice-2012-10-02.pdf">MVN-2012-1797-CU</a></li>
</ul>
<p>Hmm. Nothing special about those. People see patterns in randomness.</p>
<p>Now let’s look at some basic properties of the pdf files. This will give us a
basic overview of one file.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ pdfinfo MVN-2013-00026-WKK/public_notice.pdf
Creator: FUJITSU fi-4010CU
Producer: Adobe Acrobat 9.52 Paper Capture Plug-in
CreationDate: Fri Jan 25 09:45:08 2013
ModDate: Fri Jan 25 09:46:16 2013
Tagged: yes
Form: none
Pages: 3
Encrypted: no
Page size: 606.1 x 792 pts
Page rot: 0
File size: 199251 bytes
Optimized: yes
PDF version: 1.6
</code></pre></div></div>
<p>Let’s run it on all of the files.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ for file in */public_notice.pdf; do pdfinfo $file && echo; done
# Lots of output here
</code></pre></div></div>
<p>What was used to produce these files?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ for file in */public_notice.pdf; do pdfinfo $file|sed -n 's/Creator: *//p' ; done|sort|uniq -c
33 Acrobat PDFMaker 10.1 for Word
48 Acrobat PDFMaker 9.1 for Word
10 FUJITSU fi-4010CU
135 HardCopy
7 HP Digital Sending Device
2 Oracle9iAS Reports Services
6 PScript5.dll Version 5.2.2
4 Writer
</code></pre></div></div>
<p>When were they created?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ for file in */public_notice.pdf; do pdfinfo $file|grep CreationDate: > /dev/null && date -d "$(pdfinfo $file|sed -n 's/CreationDate: *//p')" --rfc-3339 date ; done
2012-07-03
2012-07-06
2012-07-06
2012-07-06
# ...
</code></pre></div></div>
<p>How many pages do they have?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ for file in */public_notice.pdf; do pdfinfo $file|sed -n 's/Pages: *//p' ; done | hist 1
1 | 1 |
2 | 27 | **********
3 | 198 | ********************************************************************************
4 | 16 | ******
5 | 1 |
8 | 2 |
10 | 1 |
31 | 1 |
40 | 1 |
TOTAL | 248 |
</code></pre></div></div>
<p>It might actually be fun to relate these variables to each other. For
example, when did the Corps upgrade from PDFMaker 9.1 to PDFMaker 10.1?</p>
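<p>A rough sketch of how one might check, shelling out to <code class="language-plaintext highlighter-rouge">pdfinfo</code> for each notice and collecting creation dates per creator string (illustrative only, not something from the original analysis):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import glob
import re
import subprocess
from collections import defaultdict

# collect the CreationDate string for every Creator seen in the notices
dates_by_creator = defaultdict(list)
for path in glob.glob('*/public_notice.pdf'):
    info = subprocess.check_output(['pdfinfo', path]).decode('utf-8', 'replace')
    creator = re.search(r'^Creator:\s*(.+)$', info, re.MULTILINE)
    created = re.search(r'^CreationDate:\s*(.+)$', info, re.MULTILINE)
    if creator and created:
        dates_by_creator[creator.group(1)].append(created.group(1))

# rough eyeball: lexicographic min/max of the raw date strings;
# parse the dates properly for a real analysis
for creator, dates in sorted(dates_by_creator.items()):
    print(creator, min(dates), max(dates))
</code></pre></div></div>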
<p>Anyway, we got somewhere interesting without looking at the files. Now let’s
look at them.</p>
<h2 id="if-messy-raw-file-contents-are-fine">If messy, raw file contents are fine</h2>
<p>The main automatic processing that I run on the PDFs is a search for a few
identification numbers. The Army Corps of Engineers uses a number that starts
with “MVN”, but other agencies use different numbers. I also search for two
key paragraphs.</p>
<p><a href="https://github.com/tlevine/scott/blob/master/reader/bin/translate">My approach</a>
is pretty crude. For the PDFs that aren’t scans, I just use <code class="language-plaintext highlighter-rouge">pdftotext</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># translate
pdftotext "$FILE" "$FILE"
</code></pre></div></div>
<p>Then I just use regular expressions to search the resulting text file.</p>
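<p>As an illustration, a minimal version of that search might look like this in Python (not the actual script; the pattern is a guess at the permit-number format):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re

# hypothetical pattern for permit numbers such as MVN-2013-00026-WKK
MVN_PATTERN = re.compile(r'MVN-\d{4}-[\w-]+')

# search the text file produced by pdftotext
with open('public_notice.txt') as f:
    print(MVN_PATTERN.findall(f.read()))
</code></pre></div></div>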
<p><code class="language-plaintext highlighter-rouge">pdftotext</code> normally screws up the layout of PDF files, especially when they
have multiple columns, but it’s fine for what I’m doing because I only need to
find small chunks of text rather than a whole table or a specific line on
multiple pages.</p>
<p>As we saw earlier, most of the files contain images, so I need to run OCR.
Like <code class="language-plaintext highlighter-rouge">pdftotext</code>, OCR programs often mess up the page layout, but I don’t
care because I’m using regular expressions to look for small chunks.</p>
<p>I don’t even care whether the images are in order; I just use <code class="language-plaintext highlighter-rouge">pdfimages</code>
to pull out the images and then <code class="language-plaintext highlighter-rouge">tesseract</code> to OCR each image and add that
to the text file. (This is all in the
<a href="https://github.com/tlevine/scott/blob/master/reader/bin/translate"><code class="language-plaintext highlighter-rouge">translate</code></a>
script that I linked above.)</p>
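<p>Sketched in Python, that extract-and-OCR step might look roughly like this (the real work happens in the shell script linked above; the file names here are illustrative):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import glob
import subprocess

# pull the embedded images out of the PDF (img-000.ppm, img-001.ppm, ...)
subprocess.check_call(['pdfimages', 'public_notice.pdf', 'img'])

# OCR each image with tesseract and append the text to the text file
with open('public_notice.txt', 'a') as out:
    for image in sorted(glob.glob('img-*')):
        subprocess.check_call(['tesseract', image, 'ocr_page'])
        with open('ocr_page.txt') as page:
            out.write(page.read())
</code></pre></div></div>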
<h2 id="if-i-care-about-the-layout-of-the-page">If I care about the layout of the page</h2>
<p>If I care about the layout of the page, <code class="language-plaintext highlighter-rouge">pdftotext</code> probably won’t work.
Instead, I use <code class="language-plaintext highlighter-rouge">pdftohtml</code> or <code class="language-plaintext highlighter-rouge">inkscape</code>. I’ve never needed to go deeper,
but if I did, I’d use something like
<a href="http://www.unixuser.org/~euske/python/pdfminer/">PDFMiner</a>.</p>
<h3 id="pdftohtml">pdftohtml</h3>
<p><code class="language-plaintext highlighter-rouge">pdftohtml</code> is useful because of its <code class="language-plaintext highlighter-rouge">-xml</code> flag.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ pdftohtml -xml MVN-2013-00180-ETT/public_notice.pdf
Page-1
Page-2
Page-3
$ head MVN-2013-00180-ETT/public_notice.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml producer="poppler" version="0.22.0">
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="37" family="Times" color="#000000"/>
<fontspec id="1" size="21" family="Times" color="#000000"/>
<fontspec id="2" size="16" family="Times" color="#000000"/>
<fontspec id="3" size="13" family="Times" color="#000000"/>
<fontspec id="4" size="16" family="Times" color="#000000"/>
</code></pre></div></div>
<p>Open that with an XML parser like lxml.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># This is python
import lxml.etree
pdf2xml = lxml.etree.parse('MVN-2013-00180-ETT/public_notice.xml')
</code></pre></div></div>
<p>One of the things that I try to extract is the “CHARACTER OF WORK” section.
I do this with regular expressions, but we could also do this with the XML.
Here are some XPath selectors that get us somewhere.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># This is python
print pdf2xml.xpath('//text/b[text()="CHARACTER OF WORK"]/../text()')
print pdf2xml.xpath('//text/b[text()="CHARACTER OF WORK"]/../following-sibling::text/text()')
</code></pre></div></div>
<h3 id="inkscape">Inkscape</h3>
<p>Inkscape can convert a PDF page to an SVG file. I have a
<a href="https://github.com/scraperwiki/pdf2svg">little script</a> that runs this across
all pages within a PDF file.</p>
<p>Once you’ve converted the PDF file to a bunch of SVG files, you can open it
with an XML parser just like you could with the <code class="language-plaintext highlighter-rouge">pdftohtml</code> output, except
this time much more of the layout is preserved, including the groupings of
elements on the page.</p>
<p>Here’s a snippet from one project where I used Inkscape to parse PDF files.
I created a crazy system for receiving a very messy PDF table over email and
converting it into a spreadsheet that is hosted on a website.</p>
<p>This function contains all of the parsing logic for a specific page of
the PDF file once it has been converted to SVG. It takes an
<code class="language-plaintext highlighter-rouge">lxml.etree._ElementTree</code> object like the one we get from <code class="language-plaintext highlighter-rouge">lxml.etree.parse</code>,
along with some metadata. It runs a crazy XPath selector (determined only after
much test-driven development) to pick out the table rows, and then runs a bunch
of functions (not included) to pick out the cells within the rows.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def page(svg, file_name, page_number):
    'I turn a svg tree into a list of dictionaries.'
    # County name
    county = unicode(svg.xpath(
        '//svg:g/svg:path[position()=1]/following-sibling::svg:text/svg:tspan/text()',
        namespaces = { 'svg': 'http://www.w3.org/2000/svg' }
    )[0])
    rows = _page_tspans(svg)
    def skip(reason):
        print 'Skipped a row on %s page %d because %s.' % (file_name, page_number, reason)
    data = []
    for _row in rows:
        row_text = [text.xpath('string()') for text in _row]
        try:
            if row_text == []:
                skip('the row is empty')
                print row_text
            elif _is_header(row_text):
                skip('it appears to be a header.')
                print row_text
            # ...
</code></pre></div></div>
<p>I’d like to point out the <code class="language-plaintext highlighter-rouge">string()</code> XPath function. It converts the current
node and its descendants into plain text; it’s particularly nice for
inconsistently structured files like this one.</p>
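<p>A tiny illustration with a made-up fragment:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import lxml.etree

# a hypothetical fragment with text split across child elements
node = lxml.etree.fromstring('<text><b>CHARACTER</b> OF <b>WORK</b></text>')

# string() flattens the node and all its descendants into plain text
print(node.xpath('string()'))   # prints "CHARACTER OF WORK"
</code></pre></div></div>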
<h2 id="optical-character-recognition">Optical character recognition</h2>
<p>People often think that optical character recognition (OCR) is going to be
a hard part. It might be, but it doesn’t really change this decision process.
If I care about where the images are positioned on the page, I’d probably
use Inkscape. If I don’t, I’d probably use <code class="language-plaintext highlighter-rouge">pdfimages</code>, as I did here.</p>
<h2 id="review">Review</h2>
<p>When I’m parsing PDFs, I use some combination of these tools.</p>
<ol>
<li>Basic file analysis tools (<code class="language-plaintext highlighter-rouge">ls</code> or another language’s equivalent)</li>
<li>PDF metadata tools (<code class="language-plaintext highlighter-rouge">pdfinfo</code> or an equivalent)</li>
<li><code class="language-plaintext highlighter-rouge">pdftotext</code></li>
<li><code class="language-plaintext highlighter-rouge">pdftohtml -xml</code></li>
<li>Inkscape via <a href="https://github.com/scraperwiki/pdf2svg"><code class="language-plaintext highlighter-rouge">pdf2svg</code></a></li>
<li><a href="http://www.unixuser.org/~euske/python/pdfminer/">PDFMiner</a></li>
</ol>
<p>I prefer the
ones earlier in the list when the parsing is less involved because the tools
do more of the work for me. I prefer the ones towards the end as the job gets
more complex because these tools give me more control.</p>
<p>If I need OCR, I use <code class="language-plaintext highlighter-rouge">pdfimages</code> to pull out the images and <code class="language-plaintext highlighter-rouge">tesseract</code> to run
OCR. If I needed to run OCR and know more about the layout, I might convert the
PDFs to SVG with Inkscape and then take the images out of the SVG in order
to know more precisely where they are in the page’s structure.</p>
<p><em>This article was originally posted <a href="http://thomaslevine.com/!/parsing-pdfs">on Thomas Levine’s site</a>.</em></p>
Thomas Levine
Convert data between formats with Data Converters
2013-12-17T00:00:00+00:00
http://okfnlabs.org/blog/2013/12/17/convert-data-between-formats-data-converters
<p><a href="http://okfnlabs.org/dataconverters/">Data Converters</a> is a command line tool and Python library making routine data conversion tasks easier. It helps data wranglers with everyday tasks like moving between tabular data formats—for example, converting an Excel spreadsheet to a CSV or a CSV to a JSON object.</p>
<p>The current release of Data Converters can convert between Excel spreadsheets, CSV data, and JSON tables, as well as some geodata formats (with additional requirements).</p>
<p>Its smart parser can guess the types of data, correctly recognizing dates, numbers, strings, and so on. It works as easily with URLs as with local files, and it is designed to handle very large files (bigger than memory) as easily as small ones.</p>
<p><img src="http://i.imgur.com/kDDrgPW.png" alt="Data Converters homepage" /></p>
<h2 id="converting-data">Converting data</h2>
<p>Converting an Excel spreadsheet to a CSV or a <a href="http://dataprotocols.org/en/latest/table-schema.html">JSON table</a> with the Data Converters command line tool is easy. Data Converters is able to read XLS(X) and CSV files and to write CSV and JSON, and input files can be either local or remote.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dataconvert simple.xls out.csv
dataconvert out.csv out.json
# URLs also work
dataconvert https://github.com/okfn/dataconverters/raw/master/testdata/xls/simple.xls out.csv
</code></pre></div></div>
<p>Data Converters will try to guess the format of your input data, but you can also specify it manually.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dataconvert --format=xls input.spreadsheet out.csv
</code></pre></div></div>
<p>Instead of writing the converted output to a file, you can also send it to <em>stdout</em> (and then pipe it to other command-line utilities).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dataconvert simple.xls _.json # JSON table to stdout
dataconvert simple.xls _.csv # CSV to stdout
</code></pre></div></div>
<p>Converting data files can also be done within Python using the Data Converters library. The <code class="language-plaintext highlighter-rouge">dataconvert</code> convenience function shares the <code class="language-plaintext highlighter-rouge">dataconvert</code> command line utility’s file reading and writing functionality.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataconverters import dataconvert
dataconvert('simple.xls', 'out.csv')
dataconvert('out.csv', 'out.json')
dataconvert('input.spreadsheet', 'out.csv', format='xls')
</code></pre></div></div>
<h2 id="parsing-data">Parsing data</h2>
<p>Data Converters can do more than just convert data files. It can also parse tabular data into Python objects that capture the semantics of the source data.</p>
<p>Data Converters’ various <code class="language-plaintext highlighter-rouge">parse</code> functions each return an iterator over the records of the source data along with a metadata dictionary containing information about the data. The records returned by <code class="language-plaintext highlighter-rouge">parse</code> are not just (e.g.) split strings: they’re hash representations of the contents of the row, with column names and data types auto-detected.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import dataconverters.xls as xls
with open('simple.xls') as f:
    records, metadata = xls.parse(f)
    print metadata
    print [r for r in records]
=> {'fields': [{'type': 'DateTime', 'id': u'date'}, {'type': 'Integer', 'id': u'temperature'}, {'type': 'String', 'id': u'place'}]}
=> [{u'date': datetime.datetime(2011, 1, 1, 0, 0), u'place': u'Galway', u'temperature': 1.0}, {u'date': datetime.datetime(2011, 1, 2, 0, 0), u'place': u'Galway', u'temperature': -1.0}, {u'date': datetime.datetime(2011, 1, 3, 0, 0), u'place': u'Galway', u'temperature': 0.0}, {u'date': datetime.datetime(2011, 1, 1, 0, 0), u'place': u'Berkeley', u'temperature': 6.0}, {u'date': datetime.datetime(2011, 1, 2, 0, 0), u'place': u'Berkeley', u'temperature': 8.0}, {u'date': datetime.datetime(2011, 1, 3, 0, 0), u'place': u'Berkeley', u'temperature': 5.0}]
</code></pre></div></div>
<h2 id="whats-next">What’s next?</h2>
<p>Excel spreadsheets and CSVs aren’t the only kinds of data that need converting.</p>
<p>Data Converters also supports geodata conversion, including converting between <a href="https://developers.google.com/kml/documentation/">KML</a> (the format for geographical data used in Google Maps and Google Earth), <a href="http://geojson.org/">GeoJSON</a>, and <a href="http://www.esri.com/library/whitepapers/pdfs/shapefile.pdf">ESRI Shapefiles</a>.</p>
<p>Data Converters’ ability to convert between tabular data may also grow, adding JSON support on the input side and XLS(X) support on the output side—as well as new conversions for <a href="https://github.com/okfn/dataconverters/issues/15">XML</a>, <a href="https://github.com/okfn/dataconverters/issues/11">SQL dumps</a>, and <a href="https://github.com/okfn/dataconverters/issues/7">SPSS</a>.</p>
<p>Visit the <a href="http://okfnlabs.org/dataconverters/">Data Converters home page</a> to learn how to install Data Converters and its dependencies, and check out <a href="https://github.com/okfn/dataconverters">Data Converters on GitHub</a> to see how you can contribute to the project.</p>
Neil Ashton
Labs newsletter: 12 December, 2013
2013-12-12T00:00:00+00:00
http://okfnlabs.org/blog/2013/12/12/newsletter
<p>We’re back after taking a break last week with a bumper crop of updates. A few things have changed: Labs activities are now coordinated entirely through GitHub. Meanwhile, there have been some updates around the <a href="http://nomenklatura.okfnlabs.org">Nomenklatura</a>, <a href="http://okfnlabs.org/annotator/">Annotator</a>, and <a href="http://www.dataprotocols.org">Data Protocols</a> projects and some new posts on the <a href="http://okfnlabs.org/blog/">Labs blog</a>.</p>
<h2 id="migration-from-trello-to-github">Migration from Trello to GitHub</h2>
<p>For some time now, Labs activities requiring coordination have been organized on <a href="http://trello.com">Trello</a>—but those days are now over. Labs has moved its organizational setup over to <a href="http://github.com">GitHub</a>, coordinating actions and making plans by means of GitHub issues. This change comes as a big relief to the many Labs members who already use GitHub as their main platform for collaboration.</p>
<p>General Labs-related activities are now tracked on the <a href="https://github.com/okfn/okfn.github.com/issues/">Labs site’s issues</a>, and activities around individual projects are managed (as before!) through those projects’ own issues.</p>
<h2 id="new-bad-data">New Bad Data</h2>
<p>New examples of <a href="http://okfnlabs.org/bad-data/">bad data</a> continue to roll in—and we invite even more <a href="http://okfnlabs.org/bad-data/add/">new submissions</a>.</p>
<p>Bad datasets added since last newsletter include the <a href="http://okfnlabs.org/bad-data/ex/gla-spending/">UK’s Greater London Authority spend data</a> (65+ files with 25+ different structures!), <a href="http://okfnlabs.org/bad-data/ex/nature-magazine-supplementary/">Nature Magazine’s supplementary data</a> (an awful PDF jumble), and more.</p>
<h2 id="nomenklatura-new-alpha">Nomenklatura: new alpha</h2>
<p>As we’ve previously noted, Labs member <a href="http://pudo.org/">Friedrich Lindenberg</a> has been thinking about producing “a fairly radical re-framing” of the <a href="http://nomenklatura.okfnlabs.org/">Nomenklatura</a> data reconciliation service.</p>
<p>Friedrich has now released an alpha version of a new release of Nomenklatura at <a href="http://nk-dev.pudo.org/">nk-dev.pudo.org</a>. The major changes with this alpha include:</p>
<ul>
<li>A fully JavaScript-driven frontend</li>
<li>String matching now happens inside the PostgreSQL database</li>
<li>Better introductory text explaining what Nomenklatura does</li>
<li>“entity” and “alias” domain objects have been merged into “entity”</li>
</ul>
<p>Friedrich is keen to hear what people think about this prototype—so jump in, give it a try, and leave your comments at the <a href="https://github.com/pudo/nomenklatura">Nomenklatura repo</a>.</p>
<h2 id="annotator-v129">Annotator v1.2.9</h2>
<p>A new maintenance release of <a href="http://okfnlabs.org/annotator/">Annotator</a> came out ten days ago. This new version is intended to be one of the last in the v1.2.x series—indeed, v1.2.8 itself was intended to be the last, but that version had some significant issues that this new release corrects.</p>
<p>Fixes in this version include:</p>
<ul>
<li>Fixed a major packaging error in v1.2.8. Annotator no longer exports an excessive number of tokens to the page namespace.</li>
<li>Notification display bugfixes. Notification levels are now correctly removed after notifications are hidden.</li>
</ul>
<p>The new Annotator is available, as always, <a href="https://github.com/okfn/annotator/releases/tag/v1.2.9">from GitHub</a>.</p>
<h2 id="data-protocols-updates">Data Protocols updates</h2>
<p><a href="http://dataprotocols.org/">Data Protocols</a> is a project to develop simple protocols and formats for working with open data. <a href="http://okfnlabs.org/members/rgrp/">Rufus Pollock</a> wrote a <a href="https://lists.okfn.org/pipermail/okfn-labs/2013-December/001185.html">cross-post to the list</a> about several new developments with Data Protocols of interest to Labs. These included:</p>
<ul>
<li>Close to final agreement on a spec for adding “primary keys” to the <a href="http://dataprotocols.org/table-schema/">JSON Table Schema</a> (<a href="https://github.com/dataprotocols/dataprotocols/issues/21">discussion</a>)</li>
<li>Close to consensus on spec for “foreign keys” (<a href="https://github.com/dataprotocols/dataprotocols/issues/23">discussion</a>)</li>
<li>Proposal for a JSON spec for views of data, e.g. graphs or maps (<a href="https://github.com/dataprotocols/dataprotocols/issues/77">discussion</a>)</li>
</ul>
<p>For more, check out <a href="https://lists.okfn.org/pipermail/okfn-labs/2013-December/001185.html">Rufus’s message</a> and the <a href="https://github.com/dataprotocols/dataprotocols/issues">Data Protocols issues</a>.</p>
<h2 id="on-the-blog">On the blog</h2>
<p>Labs members have added a couple new posts to the blog since the last newsletter. Yours truly (with extensive help from Rufus) posted on <a href="http://okfnlabs.org/blog/2013/12/05/view-csv-with-data-pipes.html">using Data Pipes to view a CSV</a>. <a href="http://okfnlabs.org/members/mihi">Michael Bauer</a>, meanwhile, wrote about the new <a href="http://okfnlabs.org/blog/2013/12/06/Introducing-Reconcile-csv.html">Reconcile-CSV service</a> he developed while working on education data in Tanzania. Look to the <a href="http://okfnlabs.org/blog/">Labs blog</a> for the full scoop.</p>
<h2 id="get-involved">Get involved</h2>
<p>If you have some spare time this holiday season, why not spend it helping out with Labs? We’re always always looking for new people to <a href="http://okfnlabs.org/join/">join the community</a>—visit the <a href="https://github.com/okfn/okfn.github.com/issues/">Labs issues</a> and the <a href="http://okfnlabs.org/ideas/">Ideas Page</a> to get some ideas for how you can join in.</p>
Neil Ashton
Introducing Reconcile-CSV
2013-12-06T00:00:00+00:00
http://okfnlabs.org/blog/2013/12/06/Introducing-Reconcile-csv
<p>Recently I spent a week in Tanzania working on education data with the
ministry of education (<a href="http://schoolofdata.org/2013/12/06/a-deep-dive-into-fuzzy-matching-in-tanzania/">blog post
here</a>).
One of the problems we faced there was spreadsheets we wanted to merge
without having any unique IDs. I quickly realized we could do this through
reconciliation services in <a href="http://openrefine.org">OpenRefine</a>. The API and
some projects implementing the reconciliation service are described in the
<a href="https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-API">OpenRefine
wiki</a>.
Nevertheless, most of the projects had a ton of requirements or needed a
database.</p>
<p>I wanted a service that is:</p>
<ul>
<li>easy to install and run</li>
<li>works on top of a CSV file</li>
</ul>
<p>Because I love <a href="http://clojure.org">Clojure</a> and I already had a
<a href="https://github.com/mihi-tr/fuzzy-string">fuzzy-matching library</a> at hand,
I chose to go down that route. Clojure has the great advantage of being
able to generate .jar files that include all the dependencies, so running
the service is a matter of executing a single .jar file.</p>
<p>All that was left was implementing the reconciliation API around it. I am
proudly introducing <a href="http://okfnlabs.org/reconcile-csv">reconcile-csv</a>: a
reconciliation service that runs on your machine without much hassle. See
<a href="http://okfnlabs.org/reconcile-csv">http://okfnlabs.org/reconcile-csv</a> for
more details and instructions.</p>
<p>While this might not be the first reconciliation service to be written, I
do think it’s by now the easiest to use: you’ll only need Java, a CSV file
and the .jar provided.</p>
Michael Bauer
View a CSV (Comma Separated Values) in Your Browser
2013-12-05T00:00:00+00:00
http://okfnlabs.org/blog/2013/12/05/view-csv-with-data-pipes
<p>This post introduces one of the handiest features of <a href="http://datapipes.okfnlabs.org/">Data Pipes</a>: <strong><a href="http://datapipes.okfnlabs.org/html/">fast (pre) viewing of CSV files in your browser</a></strong> (and you can share the result by just copying a URL).</p>
<p><a href="http://datapipes.okfnlabs.org/html/"><img src="http://i.imgur.com/LKjphLo.png" alt="" /></a></p>
<h2 id="the-raw-csv">The Raw CSV</h2>
<p><a href="http://data.okfn.org/standards/csv/">CSV files are frequently used for storing tabular data</a> and are widely supported by spreadsheets and databases. However, you can’t usually look at a CSV file in your browser - most browsers will simply download the file instead. And even if you <em>could</em> look at a CSV file, it is <a href="http://datapipes.okfnlabs.org/csv/?url=https://raw.github.com/okfn/datapipes/master/test/data/gla.csv">not very pleasant to look at</a>:</p>
<p><a href="http://datapipes.okfnlabs.org/csv/?url=https://raw.github.com/okfn/datapipes/master/test/data/gla.csv">
<img src="http://i.imgur.com/zVGW1zD.png" alt="Raw CSV" />
</a></p>
<h2 id="the-result">The Result</h2>
<p>But using the <a href="http://datapipes.okfnlabs.org/html/">Data Pipes <code class="language-plaintext highlighter-rouge">html</code> feature</a>, you can turn an online CSV into a pretty HTML table in a few seconds. For example, the CSV you’ve just seen would become <a href="http://datapipes.okfnlabs.org/csv/html/?url=https://raw.github.com/okfn/datapipes/master/test/data/gla.csv">this pretty table</a>:</p>
<p><a href="http://datapipes.okfnlabs.org/csv/html/?url=https://raw.github.com/okfn/datapipes/master/test/data/gla.csv"><img src="http://i.imgur.com/fbR8DvX.png" alt="CSV, HTML view" /></a></p>
<h2 id="using-it">Using it</h2>
<p>To use this service, just visit <a href="http://datapipes.okfnlabs.org/html/">http://datapipes.okfnlabs.org/html/</a> and paste your CSV’s URL into the form.</p>
<p>For power users (or for use from the command line or API), you can just append your CSV url to:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>http://datapipes.okfnlabs.org/csv/html/?url=
</code></pre></div></div>
<h3 id="previewing-just-the-first-part-of-a-csv-file">Previewing Just the First Part of a CSV File</h3>
<p>You can also extend this basic previewing using other datapipes features. For example, suppose you have a big CSV file (say with more than a few thousand rows). If you tried to turn this into an HTML table and then view it in your browser, it would probably crash your browser.</p>
<p>So what if you could just see a part of the file? After all, you may well only be interested in seeing what that CSV file looks like, not every row. Fortunately, <strong>Data Pipes supports only showing the first 10 lines of a CSV file</strong> using a <a href="http://datapipes.okfnlabs.org/head"><code class="language-plaintext highlighter-rouge">head</code> operation</a>. To demonstrate, let’s just extend our example above to use <code class="language-plaintext highlighter-rouge">head</code>. This gives us the following URL (click to see the live result):</p>
<p><code><a href="http://datapipes.okfnlabs.org/csv/head/html/?url=https://raw.github.com/okfn/datapipes/master/test/data/gla.csv">http://datapipes.okfnlabs.org/csv/<strong>head</strong>/html/?url=https://raw.github.com/okfn/datapipes/master/test/data/gla.csv</a></code></p>
<h2 id="colophon">Colophon</h2>
<p>Data Pipes is a free and open service run by <a href="http://okfnlabs.org/">Open Knowledge Foundation Labs</a>. You can find the source code on GitHub at: <a href="https://github.com/okfn/datapipes">https://github.com/okfn/datapipes</a>. It is also available as a <a href="http://nodejs.org/">Node</a> library and command line tool.</p>
<p>If you like previewing CSV files in your browser, you might also be interested in the <a href="https://chrome.google.com/webstore/detail/recline-csv-viewer/ibfcfelnbfhlbpelldnngdcklnndhael">Recline CSV Viewer</a>, a Chrome plugin that automatically turns CSVs into searchable HTML tables in your browser.</p>
Neil Ashton
Labs newsletter: 28 November, 2013
2013-11-28T00:00:00+00:00
http://okfnlabs.org/blog/2013/11/28/newsletter
<p>Another busy week at the Labs! We’ve had lots of discussion around the idea of “bad data”, a blog post about Mark’s aid tracker, new PyBossa developments, and a call for help with a couple of projects. Next week we can look forward to another <a href="http://okfnlabs.org/events/open-data-maker/">Open Data Maker Night</a> in London.</p>
<h2 id="bad-data">Bad Data</h2>
<p>Last Friday, Rufus announced <a href="http://okfnlabs.org/bad-data/">Bad Data</a>, a new educational mini-project that highlights real-world examples of how data <em>shouldn’t</em> be published.</p>
<p>This announcement was greeted with glee and with contributions of new examples. Open government activist <a href="http://about.me/IvanBegtin">Ivan Begtin</a> chimed in with the Russian Ministry of the Interior’s <a href="http://mvd.ru/opendata/od1">list of regional offices</a> and the Russian government’s <a href="http://nalog.ru/ru/opendata/p9/">tax rates for municipalities</a>. Labs member <a href="http://okfnlabs.org/members/pudo/">Friedrich Lindenberg</a> added the <a href="http://www.bundeshaushalt-info.de/download.html">German finance ministry</a>’s new open data initiative. As <a href="http://okfnlabs.org/members/andylolz/">Andy Lulham</a> said, “bad data” will be very useful for testing the new <a href="http://datapipes.okfnlabs.org/">Data Pipes</a> operators.</p>
<p>You can follow the whole discussion thread <a href="http://lists.okfn.org/pipermail/okfn-labs/2013-November/001165.html">in the list archive</a>.</p>
<h2 id="blog-post-looking-at-aid-in-the-philippines">Blog post: Looking at aid in the Philippines</h2>
<p>At last week’s hangout, you heard about <a href="http://okfnlabs.org/members/markbrough/">Mark Brough</a>’s new project, a <a href="http://pwyf.github.io/philippines/">browser for aid projects in the Philippines</a> generated from <a href="http://iatiregistry.org/">IATI data</a>.</p>
<p>Now you can read more about Mark’s project <a href="http://okfnlabs.org/blog/2013/11/25/philippines.html">on the blog</a>, learning about where the data comes from, how the site is generated from the data (interestingly, it uses the Python-based static site generator <a href="http://packages.python.org/Frozen-Flask/">Frozen-Flask</a>), and what Mark plans to do next.</p>
<h2 id="new-pybossa-cache-system">New PyBossa cache system</h2>
<p>Labs member and citizen science expert <a href="http://okfnlabs.org/members/teleyinex/">Daniel Lombraña González</a> has been “working really hard to add a new cache system to <a href="https://github.com/PyBossa/pybossa">PyBossa</a>”, the open source crowdsourcing platform.</p>
<p>As Daniel has discovered, the <a href="http://redis.io/">Redis</a> key-value store meets all his requirements for a load-balanced, high-availability, persistent cache. As he put it: “Redis is <em>amazing</em>. Let me repeat it: amazing.”</p>
<p>Read the <a href="http://daniellombrana.es/blog/2013/11/26/pybossa-cache.html">blog post</a> to learn more about the new Redis-based PyBossa setup and its benefits.</p>
<h2 id="contributions-needed-ios-and-python-development">Contributions needed: iOS and Python development</h2>
<p>Philippe Plagnol of <a href="http://www.product-open-data.com">Product Open Data</a> needs a few good developers to help with some projects.</p>
<p>Firstly, the <a href="https://play.google.com/store/apps/details?id=org.okfn.pod">Product Open Data Android app</a> has been out for a while (<a href="https://github.com/okfn/product-browser-android">source code</a>), and it’s high time there was a port for Apple devices. If you’re interested in contributing to the port, leave a comment at <a href="https://github.com/okfn/product-browser-ios/issues/1">this GitHub issue</a>.</p>
<p>Secondly, work is now underway on a brand repository which will assign a Brand Standard Identifier Number (BSIN) to each brand worldwide, making it possible to integrate products in the product repository. Python developers are needed to help make this happen. If you want to help out, join in <a href="https://github.com/okfn/brand-manager/issues/9">this GitHub thread</a>. (Lots of people have already signed up!)</p>
<h2 id="next-week-open-data-maker-night-london-7">Next week: Open Data Maker Night London #7</h2>
<p>On the 4th of December next week, the <a href="http://www.meetup.com/OpenKnowledgeFoundation/London-GB/1062882/">seventh London Open Data Maker Night</a> is taking place. Anyone interested in building tools or insights from data is invited to drop in at any time after 6:30 and join the fun. (Please note that the event will take place on Wednesday rather than the usual Tuesday.)</p>
<p>What is an Open Data Maker Night? <a href="http://okfnlabs.org/events/open-data-maker/">Read more about them here</a>.</p>
<h2 id="get-involved">Get involved</h2>
<p>Labs is always looking for new contributors. Read more about how you can <a href="http://okfnlabs.org/join/">join the community</a>, whether you’re a coder, a data wrangler, or a communicator, and check out the <a href="http://okfnlabs.org/ideas/">Ideas Page</a> to see what else is brewing.</p>
Neil Ashton
Looking at aid in the Philippines
2013-11-25T00:00:00+00:00
http://okfnlabs.org/blog/2013/11/25/philippines
<p><em>See also: “<a href="http://www.publishwhatyoufund.org/updates/by-topic/techfortransparency/closer-look-aid-philippines/">A closer look at aid in the Philippines</a>”</em></p>
<p>Since Typhoon Yolanda/Haiyan struck the Philippines on 8th November there has been some discussion around the availability of information to help coordinate activities effectively in the disaster response phase.</p>
<p>To see what data was already available, I put together <a href="http://pwyf.github.io/philippines/">a quick projects browser</a>, which generates a static site for all projects currently available in the Philippines that have been published to the <a href="http://iatiregistry.org">IATI Registry</a>, a CKAN instance for sharing aid information in the standard <a href="http://iatistandard.org">IATI format</a>.</p>
<p><a href="http://pwyf.github.io/philippines/"><img src="http://publishwhatyoufund.org/files/philippines-front-page.png" alt="Philippines projects browser" /></a></p>
<p><em>Philippines Projects browser</em></p>
<h2 id="where-the-data-comes-from">Where the data comes from</h2>
<p>IATI data is available in a standard XML schema, which makes it relatively easy to pull together quickly. However, with almost 200 publishers and over 3000 individual packages now published, searching for data for a single country would be complicated.</p>
<p>The <a href="http://iati-datastore.herokuapp.com">IATI datastore</a> simplifies this task by pulling together all of the data published to the IATI Registry each night, and provides a queryable API for requesting results in CSV, XML or JSON format.</p>
<p>It was therefore possible to get all data for the Philippines by querying:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>http://iati-datastore.herokuapp.com/api/1/access/activity.json?recipient-country=PH&limit=50&offset=%s
</code></pre></div></div>
<p>… and then paging through the results (by increasing the offset by 50 each time).</p>
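<p>In Python, that paging loop might look like the following sketch (illustrative, not the project’s code; it uses the <code class="language-plaintext highlighter-rouge">requests</code> library, and the top-level <code class="language-plaintext highlighter-rouge">results</code> key in the JSON response is an assumption):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import requests

URL = 'http://iati-datastore.herokuapp.com/api/1/access/activity.json'

activities = []
offset, limit = 0, 50
while True:
    resp = requests.get(URL, params={
        'recipient-country': 'PH',
        'limit': limit,
        'offset': offset,
    })
    resp.raise_for_status()
    page = resp.json().get('results', [])  # assumed response key
    if not page:
        break
    activities.extend(page)
    offset += limit

print(len(activities))
</code></pre></div></div>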
<p><a href="http://iati-datastore.herokuapp.com"><img src="http://publishwhatyoufund.org/files/iati-datastore-front-page.png" alt="IATI Datastore" /></a></p>
<p><em>The IATI Datastore</em></p>
<p><strong>NB:</strong> IATI was originally designed with a focus on traditional development aid, which is why the number of projects that specifically relate to the typhoon are limited. But the key concepts are mostly the same, which is why you do see some humanitarian aid in there.</p>
<h2 id="creating-a-static-site">Creating a static site</h2>
<p>I decided to create a static site so that:</p>
<ol>
<li>I would not have to think about creating a database, or modelling the data;</li>
<li>I could deploy the site to Github pages, so I wouldn’t have to think about server setup;</li>
<li>The site would run fast.</li>
</ol>
<p>Ruby has some nice modules for creating static sites, such as <a href="http://jekyllrb.com/">Jekyll</a> and particularly <a href="http://middlemanapp.com/">Middleman</a> in this sort of context.</p>
<p>However, I wanted to write this one in Python, and unfortunately all of the <a href="http://gistpages.com/2013/08/12/complete_list_of_static_site_generators_for_python">static site generators available in Python</a> come with a lot of assumptions about the structure of your site (basically, they all think it should look like a blog).</p>
<p>One exception (discovered via <a href="https://nicolas.perriault.net/code/2012/dead-easy-yet-powerful-static-website-generator-with-flask/">this blogpost</a>) is <a href="http://packages.python.org/Frozen-Flask/">Frozen-Flask</a>. This is great because <a href="http://flask.pocoo.org/">Flask</a> is super simple and easy to work with, and provides a lot of the stuff you need out of the box.</p>
<p>In this instance, all I had to do was add the lines:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>app.config['FREEZER_RELATIVE_URLS'] = True
freezer = Freezer(app)
</code></pre></div></div>
<p>and then to generate a static site, which is output to <code class="language-plaintext highlighter-rouge">/build</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>freezer.freeze()
</code></pre></div></div>
<p>Frozen-Flask wants to set all of the URLs as absolute and beginning with <code class="language-plaintext highlighter-rouge">/</code>. Because I’m deploying to Github pages, I need to be able to have relative URLs. Frozen-Flask provides <a href="http://pythonhosted.org/Frozen-Flask/#configuration">an option to do that</a> with <code class="language-plaintext highlighter-rouge">FREEZER_RELATIVE_URLS</code>, but the weird side effect of this is that all the URLs have to end with <code class="language-plaintext highlighter-rouge">index.html</code>.</p>
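<p>Putting those pieces together, a minimal Frozen-Flask app might look like the following sketch (a hedged illustration, not the project’s actual code; the route and page content are placeholders):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from flask import Flask
from flask_frozen import Freezer

app = Flask(__name__)
app.config['FREEZER_RELATIVE_URLS'] = True   # relative links for GitHub Pages

@app.route('/')
def index():
    # placeholder page; the real site renders templates from IATI data
    return '<h1>Philippines aid projects</h1>'

freezer = Freezer(app)

if __name__ == '__main__':
    freezer.freeze()   # writes the static site to ./build
</code></pre></div></div>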
<h2 id="whats-next">What’s next?</h2>
<p>Firstly, <strong>it would be great to get more data in there</strong>. The projects browser can be updated automatically each night by pulling the data in from the datastore – something I’m going to try and get set up in the next couple of days.</p>
<p>However, with a couple of exceptions (particularly GlobalGiving and UNOCHA FTS), in general IATI publishers update their data a maximum of once a month. The projects browser suggests that may not be frequent enough in this sort of situation. Additionally, it is somewhat difficult to see which projects are related to this crisis as opposed to previous earthquakes and typhoons in the region (or alternatively, development rather than humanitarian activities). Some discussions have begun about adding an extension to IATI to capture additional fields that might be relevant to humanitarian actors, for example to tag a project as related to the Haiyan response.</p>
<p>Secondly, <strong>the interface could be improved somewhat</strong>. Some other ways of filtering and searching through the data would be useful, and improving the performance of the existing filter would be sensible.</p>
<h2 id="let-me-know-what-you-think">Let me know what you think!</h2>
<ul>
<li>Email: <a href="mailto:mark.brough@publishwhatyoufund.org">mark.brough@publishwhatyoufund.org</a></li>
<li>Tweet: <a href="http://twitter.com/mark_brough">@mark_brough</a> or <a href="http://twitter.com/aidtransparency">@aidtransparency</a></li>
<li>Source code: <a href="http://github.com/pwyf/philippines">http://github.com/pwyf/philippines</a></li>
<li>Live site: <a href="http://pwyf.github.io/philippines">http://pwyf.github.io/philippines</a></li>
</ul>
Mark Brough
Labs newsletter: 21 November, 2013
2013-11-21T00:00:00+00:00
http://okfnlabs.org/blog/2013/11/21/newsletter
<p>This week, Labs members gathered in an online hangout to discuss what they’ve been up to and what’s next for Labs. This special edition of the newsletter recaps that hangout for those who weren’t there (or who want a reminder).</p>
<h2 id="data-pipes-update">Data Pipes update</h2>
<p>Last week you heard about <a href="http://okfnlabs.org/members/andylolz">Andy Lulham</a>’s improvements to <a href="http://datapipes.okfnlabs.org">Data Pipes</a>, the online streaming data transformations service. He didn’t stop there, and in this week’s hangout, Andy described some of the new features he has been adding:</p>
<ul>
<li>parse and render are now <em>streaming</em> operations</li>
<li>option parsing now uses <a href="https://github.com/substack/node-optimist">optimist</a></li>
<li>a basic command-line interface</li>
<li>… and much, much more</li>
</ul>
<p>Coming up next: <a href="https://github.com/okfn/datapipes/issues/21">map & filter with arbitrary functions</a>!</p>
<h2 id="crowdcrafting-progress-and-projects">Crowdcrafting: progress and projects</h2>
<p>New <a href="http://www.shuttleworthfoundation.org/fellows/daniel-lombrana/">Shuttleworth fellow</a> <a href="http://okfnlabs.org/members/teleyinex">Daniel Lombraña González</a> reported on progress with <a href="http://crowdcrafting.org/">CrowdCrafting</a>, the citizen science platform built with <a href="http://dev.pybossa.com/">PyBossa</a>.</p>
<p>CrowdCrafting now has more than 3,500 users (though Daniel cautions that this doesn’t mean much in terms of participation), and the site now has more answers than tasks.</p>
<p>Last week, the team at <a href="http://micromappers.com/">MicroMappers</a> used CrowdCrafting to classify tweets about the typhoon disaster in the Philippines. Digital mapping activists <a href="http://skytruth.org/">SkyTruth</a>, meanwhile, have used CrowdCrafting to <a href="http://crowdcrafting.org/app/frackfinder_tadpole/">map and track fracking sites</a> in the northeast United States. Daniel has also been in contact with <a href="http://www.epicollect.net/">EpiCollect</a> about a project on trash collection in Spain.</p>
<h2 id="open-data-button">Open Data Button</h2>
<p>Labs member <a href="http://okfnlabs.org/members/loleg">Oleg Lavrovsky</a> discussed the <a href="http://button.datalets.ch/">Open Data Button</a>, an interesting fork of the recently-launched <a href="https://www.openaccessbutton.org/">Open Access Button</a>.</p>
<p>The Open Access Button, an idea of the Open Science working group at <a href="http://okcon.org">OKCon 2013</a>, is a bookmarklet that allows users to report their experiences of having their research blocked by paywalls. The Open Data Button applies this same idea to Open Data: users can use it to report their problems with legal and technical restrictions on data. (As Rufus pointed out, this ties in nicely with the <a href="https://github.com/okfn/ideas/issues/41">IsItOpenData</a> project.)</p>
<h2 id="queremos-saber">Queremos Saber</h2>
<p>Labs ally <a href="http://vitorbaptista.com/">Vítor Baptista</a> reported on a new development with <a href="http://www.queremossaber.org.br/">Queremos Saber</a>, the Brazilian FOI request portal.</p>
<p>Changes in the way the Brazilian federal government accepts FOI requests have caused Queremos Saber problems. The federal government no longer accepts requests by email, forcing the use of a specialized FOI system which they are now promoting for local governments as well. This limits the number of places that will accept requests from Queremos Saber.</p>
<p>A solution to this problem is underway: an <em>email-based API</em> that will take emails received at certain addresses (e.g. <em>ministryofhealthcare@queremossaber.org.br</em>) and turn them into instructions for a web crawler to create an FOI request in the appropriate system. An interesting side effect of this would be the creation of an <em>anonymization layer</em>, allowing users to bypass the legal requirement that FOI requests not be placed anonymously.</p>
<h2 id="philippines-projects">Philippines Projects</h2>
<p>Labs data wrangler <a href="http://okfnlabs.org/members/markbrough">Mark Brough</a> showed off a test project collecting <a href="http://markbrough.github.io/philippines/">data on aid activities in the Philippines</a>. Mark’s small static site, updated each night, collects <a href="http://iatistandard.org">IATI</a> aid data on projects in the Philippines and republishes it in a more browsable form.</p>
<p>Mark also discussed another data-mashup project, still in the planning stage, that would combine budget and aid data for Tanzania (or any other developing country)—similar to Publish What You Fund’s old <a href="http://publishwhatyoufund.org/uganda/">Uganda project</a> but based on a non-static dataset.</p>
<h2 id="global-economic-map">Global Economic Map</h2>
<p>Alex Peek discussed his initiative to create the <a href="http://meta.wikimedia.org/wiki/Global_Economic_Map">Global Economic Map</a>, “a collection of standardized data set of economic statistics that can be applied to every country, region and city in the world”.</p>
<p>The GEM will draw data from sources like government publications and SEC filings and will cover <a href="https://meta.wikimedia.org/wiki/Grants:IdeaLab/Global_Economic_Map#Format_and_economic_statistics">eleven statistics</a> that touch on GDP, employment, corporations, and budgets. The GEM aims to be <a href="https://meta.wikimedia.org/wiki/Grants:IdeaLab/Global_Economic_Map#Wikidata_integration">fully integrated with Wikidata</a>.</p>
<h2 id="frictionless-data">Frictionless data</h2>
<p>Finally, <a href="http://okfnlabs.org/members/rgrp">Rufus Pollock</a> discussed <a href="http://data.okfn.org">data.okfn.org</a> and the mission of “frictionless data”: making it “as simple as possible to get the data you want into the tool of your choice.”</p>
<p>data.okfn.org aims to help achieve this goal by promoting, among other things, <a href="http://data.okfn.org/standards">simple data standards</a> and the tooling to support them. As reported in last week’s newsletter, this now includes a <a href="https://github.com/okfn/dpm">Data Package Manager</a> based on <a href="https://npmjs.org/">npm</a>, now working at a very basic level. It also includes the data.okfn.org <a href="http://data.okfn.org/tools/view">Data Package Viewer</a>, which provides a nice view on data packages hosted on GitHub, S3, or wherever else.</p>
<h2 id="improving-the-labs-site">Improving the Labs site</h2>
<p>The hangout wrapped up with a discussion of how to improve the Labs site. Besides some discussion of the possibility of a <a href="https://github.com/okfn/okfn.github.com/issues/134">one-click creation system for Open Data Maker Nights</a>, talk focused on <a href="https://github.com/okfn/okfn.github.com/issues/46">improving the projects page</a>.</p>
<p>Oleg, who has volunteered to take the lead in reforming the projects page, highlighted the need for a way to differentiate projects by their activity level and their need for more contributors. Mark agreed, suggesting also that it would be nice to be able to filter projects by the languages and technologies they use. Both ideas were proposed as a way to fill out <a href="http://www.todrobbins.com/">Tod Robbins</a>’s suggestion that the projects page needs <em>categories</em>.</p>
<p>See the <a href="http://pad.okfn.org/p/labs-hangouts">Labs hangout notes</a> for the full details of this discussion.</p>
<h2 id="get-involved">Get involved</h2>
<p>As always, Labs wants you to join in and get involved! Read more about how you can <a href="http://okfnlabs.org/join/">join the community</a> and participate by coding, wrangling data, or doing outreach and engagement, and have a look at the <a href="http://okfnlabs.org/ideas/">Ideas Page</a> to see what other members have been thinking.</p>
Neil Ashton
Bad Data: real-world examples of how not to do data
2013-11-19T00:00:00+00:00
http://okfnlabs.org/blog/2013/11/19/bad-data-examples-how-not-to-do-data
<p>We’ve just started a mini-project called <a href="http://okfnlabs.org/bad-data/">Bad Data</a>. Bad Data provides real-world examples of how <em>not</em> to publish data. It showcases the poorly structured, the mis-formatted, and the just plain ugly.</p>
<p>This isn’t about being critical but about <strong>educating</strong>—providing examples of how <strong>not</strong> to do something may be one of the best ways of showing how to do it right. It also provides a source of good practice material for budding data wranglers!</p>
<p><img src="http://i.imgur.com/FNBf3aR.png" alt="Bad Data: ASCII spreadsheet" /></p>
<p>Each “bad” dataset gets a simple page on the site that shows what’s wrong along with a preview or screenshot.</p>
<p>We’ve started to stock the site with some of the better examples of bad data that we’ve come across over the years. This includes machine-<em>un</em>readable <a href="http://okfnlabs.org/bad-data/ex/tfl-passenger-numbers/">Transport for London passenger numbers from the London Datastore</a> and a classic “<a href="http://okfnlabs.org/bad-data/ex/bls-us-employment/">ASCII spreadsheet</a>” from the US Bureau of Labor Statistics.</p>
<p><strong>We welcome contributions of new examples! <a href="http://okfnlabs.org/bad-data/add">Submit them here</a>.</strong></p>
Rufus Pollock
Labs newsletter: 14 November, 2013
2013-11-14T00:00:00+00:00
http://okfnlabs.org/blog/2013/11/14/newsletter
<p>Labs was bristling with discussion and creation this week, with major improvements to two projects, interesting conversations around a few others, and an awesome new blog post.</p>
<h2 id="data-pipes-lots-of-improvements">Data Pipes: lots of improvements</h2>
<p><a href="http://datapipes.okfnlabs.org/">Data Pipes</a> is a Labs project that provides a web API for a set of simple data-transforming operations that can be chained together in the style of Unix pipes.</p>
<p>This past week, <a href="http://okfnlabs.org/members/andylolz">Andy Lulham</a> has made a <em>huge</em> number of improvements to Data Pipes. Just a few of the new features and fixes:</p>
<ul>
<li>new operations: <code class="language-plaintext highlighter-rouge">strip</code> (removes empty rows), <code class="language-plaintext highlighter-rouge">tail</code> (truncate dataset to its last rows)</li>
<li>new features: a <code class="language-plaintext highlighter-rouge">range</code> function and a “complement” switch for <code class="language-plaintext highlighter-rouge">cut</code>; options for <code class="language-plaintext highlighter-rouge">grep</code></li>
<li>all operations in pipeline are now trimmed for whitespace</li>
<li>basic tests have been added</li>
</ul>
<p>Have a look at the <a href="https://github.com/okfn/datapipes/issues?page=1&state=closed">closed issues</a> to see more of what Andy has been up to.</p>
<h2 id="webshot-new-homepage-and-feature">Webshot: new homepage and feature</h2>
<p>Last week we introduced you to <a href="http://webshot.okfnlabs.org/">Webshot</a>, a web API for screenshots of web pages.</p>
<p>Back then, Webshot’s home page was just a screenshot of GitHub. Now Webshot has a <a href="http://webshot.okfnlabs.org/">proper home page</a> with a form interface to the API.</p>
<p>Webshot has also added support for <em>full page</em> screenshots. Now you can capture the whole page rather than just its visible portion.</p>
<h2 id="on-the-blog-natural-language-processing-with-python">On the blog: natural language processing with Python</h2>
<p>Labs member <a href="http://okfnlabs.org/members/tamr/">Tarek Amr</a> has contributed an awesome post on <a href="http://okfnlabs.org/blog/2013/11/11/python-nlp.html">Python natural language processing</a> with the NLTK toolkit to the Labs blog.</p>
<p>“The beauty of NLP,” Tarek says, “is that it enables computers to extract knowledge from unstructured data inside textual documents.” Read his post to learn how to do text normalization, frequency analysis, and text classification with Python.</p>
<h2 id="data-packages-workflow-à-la-node">Data Packages workflow à la Node</h2>
<p>Wouldn’t it be nice to be able to initialize new <a href="http://data.okfn.org/standards/data-package">Data Packages</a> as easily as you can initialize a Node module with <code class="language-plaintext highlighter-rouge">npm init</code>?</p>
<p><a href="http://www.maxogden.com">Max Ogden</a> started a <a href="https://github.com/okfn/datapackage.js/issues/3">discussion thread</a> around this enticing idea, eventually leading to <a href="http://okfnlabs.org/members/rgrp">Rufus Pollock</a> booting a new repo for <a href="https://github.com/okfn/dpm">dpm</a>, the Data Package Manager. Check out <a href="https://github.com/okfn/dpm/issues">dpm’s Issues</a> to see what needs to happen next with this project.</p>
<h2 id="nomenklatura-looking-forward">Nomenklatura: looking forward</h2>
<p><a href="http://nomenklatura.okfnlabs.org/">Nomenklatura</a> is a Labs project that does data reconciliation, making it possible “to maintain a canonical list of entities such as persons, companies or event streets and to match messy input, such as their names, against that canonical list”.</p>
<p><a href="http://okfnlabs.org/members/pudo">Friedrich Lindenberg</a> has noted on the Labs mailing list that <a href="http://lists.okfn.org/pipermail/okfn-labs/2013-November/001138.html">Nomenklatura has some serious problems</a>, and he has proposed “a fairly radical re-framing of the service”.</p>
<p>The conversation around what this re-framing should look like is still underway—check out <a href="http://lists.okfn.org/pipermail/okfn-labs/2013-November/001138.html">the discussion thread</a> and jump in with your ideas.</p>
<h2 id="data-issues-following-issues">Data Issues: following issues</h2>
<p>Last week, the idea of <a href="http://okfnlabs.org/blog/2013/11/06/tracking-data-issues.html">Data Issues</a> was floated: using GitHub Issues to track problems with public datasets. The idea has generated a few comments, and we’d love to hear more.</p>
<p>Discussion on the Labs list highlighted another benefit of using GitHub. <a href="https://github.com/aliounedia">Alioune Dia</a> suggested that Data Issues should let users register to be notified when a particular issue is fixed. But <a href="http://feedmechocolate.com/">Chris Mear</a> pointed out that GitHub already makes this possible: “Any GitHub user can ‘follow’ a specific issue by using the notification button at the bottom of the issue page.”</p>
<h2 id="get-involved">Get involved</h2>
<p>Anyone can join the Labs community and get involved! Read more about how you can <a href="http://okfnlabs.org/join/">join the community</a> and participate by coding, wrangling data, or doing outreach and engagement. Also check out the <a href="http://okfnlabs.org/ideas/">Ideas Page</a> to see what’s cooking in the Labs.</p>
Neil Ashton
Natural Language Processing using Python
2013-11-11T00:00:00+00:00
http://okfnlabs.org/blog/2013/11/11/python-nlp
<p>This weekend the <a href="http://www.gdgcairo.org/" title="GDG Cairo">Google Developer Group in Cairo</a> arranged two days of workshops followed by a hackathon. During this event, I organized a workshop about <a href="http://nltk.org/" title="Natural Language Toolkit">NLTK</a> and the use of Python in Natural Language Processing (NLP). The session’s slides can be found <a href="http://tarekamr.appspot.com/slides/pynlp" title="Python NLP Slides">here</a>. The beauty of NLP is that it enables computers to extract knowledge from unstructured data inside textual documents. Websites like Zite use NLP to deliver custom news to readers based on their taste. NLP enables Google to extract times and dates from email messages so that Gmail users can automatically add events mentioned in their emails to their calendars. The same technology enables us to translate text and predict the language of a tweet. Data journalists can also use NLP to analyse transcripts of the speeches of politicians and MPs to find newsworthy information that would not be feasible to find without such technology.</p>
<h2 id="normalization-and-tokenization">Normalization and Tokenization.</h2>
<p>Two initial steps should take place before dealing with text. Words can take various forms according to their context. For example, the same word is capitalized when it is placed at the beginning of a sentence and written in lower case elsewhere. Plural words have different endings from their singular counterparts, while conjugation changes the endings of verbs.</p>
<p>The word “free” appears twice in the following sentence, “Free hosting and free domain”, but for a computer to know that it is the same word regardless of its case, we may need to convert the whole sentence to lower case. This is simply done in Python as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>“Free hosting and free domain”.lower()
</code></pre></div></div>
<p>Nevertheless, in some cases we might need to take the case of the words into consideration, especially when it carries information about their meaning. Consider the following example: “The CEO of Apple gave me an apple”.</p>
<p>Additionally, stemming is used to make sure that plural and singular words become the same. It also normalizes adjectives, adverbs and verbs given their various conjugations:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import nltk
stemmer = nltk.PorterStemmer()
stemmer.stem('running') # => run
stemmer.stem('shoes') # => shoe
stemmer.stem('expensive') # => expens
</code></pre></div></div>
<p>Another useful command in NLTK is <code class="language-plaintext highlighter-rouge">clean_html()</code>, which is capable of removing all HTML tags from a given text.</p>
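<p>As a rough sketch of how that looks (assuming NLTK 2.x, where clean_html() is available, and a placeholder file name):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import nltk

html = open('page.html').read()  # placeholder: any HTML file
text = nltk.clean_html(html)     # the same content with the tags stripped
print text
</code></pre></div></div>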
<p>After normalizing our text, we usually need to divide it into sentences and words. The split() method is capable of converting the following string, “We sell you finger-licking fries.”, into this list of words:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"We sell you finger-licking fries.".split()
['We', 'sell', 'you', 'finger-licking', 'fries.']
</code></pre></div></div>
<p>One problem with the previous command is that it does not deal well with the hyphen and the full stop. An alternative is the wordpunct_tokenize() method provided by NLTK:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from nltk.tokenize import wordpunct_tokenize
wordpunct_tokenize('We sell you finger-licking fries.')
['We', 'sell', 'you', 'finger', '-', 'licking', 'fries', '.']
</code></pre></div></div>
<h2 id="text-analysis">Text Analysis</h2>
<p>NLTK allows us to find the frequency of each word in our textual data. In the demo <a href="https://github.com/gr33ndata/NLP_GDGCairo2013" title="GNU GPL Demo">‘gnugpl.py’</a>, you can see how to use nltk.Text() to list the top n words in the GPL3 license. Similarly, we can get the frequency distribution of characters in text, rather than words. We will show later on how we can detect the language of some text using the frequency distribution of its characters.</p>
<p><img src="http://i.imgur.com/DxbwkGrl.png" alt="Frequency distribution of characters" /></p>
<p><em>Frequency distribution of characters in both English and Arabizi</em></p>
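<p>As a minimal sketch of the word-frequency side of this (the file name is a placeholder, and we sort explicitly so the snippet doesn’t depend on a particular NLTK version):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import nltk

text = open('gpl3.txt').read()  # placeholder: any plain-text file
words = nltk.wordpunct_tokenize(text.lower())
fdist = nltk.FreqDist(words)
# Sort by count and print the ten most frequent words
for word, count in sorted(fdist.items(), key=lambda p: p[1], reverse=True)[:10]:
    print word, count
</code></pre></div></div>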
<p>One problem with word frequencies is that a big percentage of the top n words are stop-words. Stop-words are common words in a language that are not related to the topic of the document, such as “the”, “of”, “and”, etc. In the demo <a href="https://github.com/gr33ndata/NLP_GDGCairo2013" title="Wikipedia pages for Egypt, Tunisia and Lebanon">wikianalysis.py</a>, we grabbed the text of the Wikipedia pages for Egypt, Tunisia and Lebanon. The top n words from each page were put in a table <a href="https://docs.google.com/spreadsheet/ccc?key=0AmbldjoHWBdZdGpvWDFBcjBneDBlY05ScHZ2dU8yU3c" title="Wikipedia Analysis">here</a>. One way to deal with stop-words is to re-weight terms: words that appear in one page but not in the others should be given higher weight than words common to all three pages, even if the latter have higher frequencies. Thus, we divided the count of each word in a page by its total count across the three pages. The results were put in the next tab, where we can see that the words marked in green are the ones related to each country. Additionally, you can use the collocations() method in NLTK to find word pairs that frequently appear together.</p>
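<p>A simple complementary approach, sketched below, is to drop stop-words before counting. This assumes NLTK’s stopwords corpus has been downloaded (e.g. via nltk.download()), and the file name is again a placeholder:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import nltk
from nltk.corpus import stopwords  # requires the stopwords corpus

words = nltk.wordpunct_tokenize(open('egypt.txt').read().lower())
english_stops = set(stopwords.words('english'))
# Keep only alphabetic tokens that are not stop-words
content_words = [w for w in words if w.isalpha() and w not in english_stops]
</code></pre></div></div>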
<h2 id="text-classification">Text Classification</h2>
<p>People sometimes find it easier to write Arabic words in Latin letters on social media websites. This way of writing is commonly known as Arabizi or Francoarab. We needed a way to tell whether a text is written in English or in Francoarab. Initially, we noticed that the letter distribution varies between the two, and the distribution of consecutive letter pairs varies as well. Thus, in the demo <a href="https://github.com/gr33ndata/NLP_GDGCairo2013" title="Text Language Classification">franco.py</a>, we used a Naive Bayes classifier to predict the language of a given text. We trained the classifier on the distribution of character pairs (bigrams) in our training set, “corpus/franco.txt”.</p>
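<p>The demo has the full details, but the general shape of such a classifier is roughly the following; the two training examples here are invented stand-ins for the real training set in “corpus/franco.txt”:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import nltk

def bigram_features(text):
    # Represent a text by the character pairs (bigrams) it contains
    text = text.lower()
    return dict((text[i:i+2], True) for i in range(len(text) - 1))

# Invented toy training data; the real demo trains on corpus/franco.txt
labelled_texts = [('how are you doing today', 'english'),
                  ('ezayak 3amel eh ya basha', 'arabizi')]
train_set = [(bigram_features(t), label) for t, label in labelled_texts]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print classifier.classify(bigram_features('see you tomorrow'))
</code></pre></div></div>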
<p>In addition to classifying a whole document, one might also need to predict the category of each word in the document. These categories are better known as Part of Speech (PoS) tags. For example, the word “book” is a noun in “I am reading a book”, while in the phrase “I am going to book my train ticket” it happens to be a verb. PoS tagging is thus used to determine whether a word is a noun, adjective, verb, etc. There are built-in PoS taggers in NLTK; however, in our demo, <a href="https://github.com/gr33ndata/NLP_GDGCairo2013" title="#CairoTraffic Analysis">cairotraffic.py</a>, we wanted to have our own set of PoS tags.</p>
<p>People normally use the hashtag #CairoTraffic on Twitter to update their friends about the status of the traffic in the different streets of Cairo. Although such tweets are easily understood by humans, it is hard for a computer to extract structured data from them. For example, the following two phrases carry the same meaning despite their different wording:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>The road to Zamalek from Tahrir is blocked.
Traffic was totally blocked from Tahrir in the direction of Zamalek.
</code></pre></div></div>
<p>Thus, in “cairotraffic.py” we needed to create PoS tags for the “FROM” and “TO” locations in each tweet. We used NLTK’s UnigramTagger() and BigramTagger(), in addition to a Naive Bayes classifier, to extract the words reflecting the “FROM” and “TO” locations from each tweet. It was clear from our demo that the Naive Bayes classifier outperformed the unigram and bigram taggers on unseen words and on variations in sentence structure.</p>
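<p>As a rough sketch of the tagger side of this (the training sentence and the FROM/TO/O tag set below are invented for illustration):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import nltk

# Invented training data: each sentence is a list of (word, tag) pairs
train_sents = [[('road', 'O'), ('from', 'O'), ('Tahrir', 'FROM'),
                ('to', 'O'), ('Zamalek', 'TO'), ('blocked', 'O')]]
unigram = nltk.UnigramTagger(train_sents)
bigram = nltk.BigramTagger(train_sents, backoff=unigram)
print bigram.tag(['traffic', 'from', 'Tahrir', 'to', 'Zamalek'])
</code></pre></div></div>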
<p>In addition to PoS tagging, we also applied a simple Naive Bayes classifier to tell whether a road is blocked (za7ma) or alright (7alawa) given the words used in a tweet.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Python and NLTK make it easy to carry out complex natural language processing tasks in a few lines of code. We have seen here that the toolkit provides suitable methods for text tokenization, analysis and classification. You can also read more about the concepts discussed here and the other capabilities of NLTK in this <a href="http://nltk.org/book/" title="Natural Language Processing with Python">free book</a> by Steven Bird, Ewan Klein, and Edward Loper.</p>
<p>If any of the topics discussed here is unclear, please feel free to ask me about it. Also feel free to <a href="http://tarekamr.appspot.com/" title="Tarek Amr Homepage">contact me</a> if you have any comments about the code or would like help using it in any of your projects.</p>
Tarek Amr
Labs newsletter: 7 November, 2013
2013-11-07T00:00:00+00:00
http://okfnlabs.org/blog/2013/11/07/newsletter
<p>There was lots of interesting activity around Labs this week, with two launched projects, a new initiative in the works, and an Open Data Maker Night in London.</p>
<h2 id="webshot-online-screenshot-service">Webshot: online screenshot service</h2>
<p><a href="http://webshot.okfnlabs.org/">webshot.okfnlabs.org</a>, an online service for taking screenshots of websites, is now live, thanks to <a href="https://github.com/opsb">Oliver Searle-Barnes</a> and <a href="https://github.com/simong">Simon Gaeremynck</a>.</p>
<p>Try it out with an API call like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>http://webshot.okfnlabs.org/?url=http://okfnlabs.org&width=640&height=480
</code></pre></div></div>
<p>Read more about the development behind the service <a href="https://github.com/okfn/ideas/issues/63">here</a>.</p>
<h2 id="product-open-data-android-app">Product Open Data Android app</h2>
<p>The first version of the <a href="https://play.google.com/store/apps/details?id=org.okfn.pod">Android app</a> for <a href="http://www.product-open-data.com/">Product Open Data</a> has launched, allowing you to conveniently look up open data associated with a product on your phone.</p>
<p>The <a href="https://github.com/okfn/product-browser-android">source code</a> for the app is available on GitHub.</p>
<h2 id="crowdcrafting-for-public-bodies">Crowdcrafting for Public Bodies</h2>
<p><a href="http://publicbodies.org">PublicBodies.org</a> aims to provide “a URL for every part of government”. Many entries in the database lack good description text, though, making them harder to use effectively. Fixing this would be <a href="http://lists.okfn.org/pipermail/okfn-labs/2013-October/001117.html">a good use of CrowdCrafting.org</a>, the crowd-sourcing platform powered by <a href="http://dev.pybossa.com/">PyBossa</a>.</p>
<p>Rufus suggests this start small and begin with EU public bodies. It should be <a href="http://lists.okfn.org/pipermail/okfn-labs/2013-October/001120.html">easy to build a CrowdCrafting app</a> to cover those, says Daniel Lombraña González. Friedrich Lindenberg thinks this approach <a href="http://lists.okfn.org/pipermail/okfn-labs/2013-November/001125.html">could work for other datasets</a> as well.</p>
<p>Discussion of this idea is still happening on the list, so jump in and say what you think—or help build the app!</p>
<h2 id="open-data-maker-night-6">Open Data Maker Night #6</h2>
<p>The sixth <a href="http://okfnlabs.org/events/open-data-maker/">Open Data Maker Night</a> took place this past Tuesday in London. Open Data Maker Nights are informal events where people make things with open data, whether apps or insights.</p>
<p>This night’s focus was on adding more UK and London data to <a href="http://openspending.org">OpenSpending</a>, and it featured special guest <a href="http://maxogden.com/">Max Ogden</a>. It was hosted by the Centre for Creative Collaboration.</p>
<p>Our next Open Data Maker Night will happen in early December. If you want to organize your own, though, it’s super easy: just see the <a href="http://okfnlabs.org/events/open-data-maker/">Open Data Maker Night website</a> for help booting, promoting, and running the event.</p>
Neil Ashton
Tracking Issues with Data the Simple Way
2013-11-06T00:00:00+00:00
http://okfnlabs.org/blog/2013/11/06/tracking-data-issues
<p><a href="https://github.com/datasets/issues">Data Issues</a> is a prototype initiative to track “issues” with data using a simple bug tracker—in this case, GitHub Issues.</p>
<p>We’ve all come across “issues” with data, whether it’s “data” that turns out to be provided as a PDF, the many ways to badly format tabular data (<a href="http://okfnlabs.org/bad-data/ex/tfl-passenger-numbers/">empty rows, empty columns</a>, inlined metadata …), “<a href="http://okfnlabs.org/bad-data/ex/bls-us-employment/">ASCII spreadsheets</a>”, or simply erroneous data.</p>
<p>Key to starting to improve data quality is a way to report and record these issues.</p>
<p>We’ve thought about ways to address this for <a href="http://blog.okfn.org/2011/03/31/building-the-open-data-ecosystem/">quite some time</a> and, led by <a href="http://okfnlabs.org/members/pudo/">Labs member Friedrich Lindenberg</a>, even experimented with building our <a href="http://okfnlabs.org/blog/2012/07/10/dataissues.html">own service</a>. But recently, thanks to a comment from <a href="http://okfnlabs.org/members/david/">Labs member David Miller</a>, we were hit with a blinding insight: why not do the simplest thing possible and just use an <strong>existing bug tracker tool</strong>? And so was born the current version of <a href="https://github.com/datasets/issues">Data Issues based on a github issue tracker</a>!</p>
<p><img src="http://i.imgur.com/lyIJYGo.png" alt="Data Issues" /></p>
<p><em>Aside: Before you decide we were completely crazy not to see this in the first place, it should be said that doing data issues “properly” (in the medium term) probably does require something a bit more than a normal bug tracker. For example, it would be nice to be able to both pinpoint an issue precisely (e.g. the date in column 5 on line 3751 is invalid) and group similar issues (e.g. all amounts in column 7 have commas in them). Doing this would require a tracker that was customized for data. The solution described in this post, however, seems like a great way to get started.</em></p>
<h2 id="introducing-data-issues">Introducing Data Issues</h2>
<p>Given the existence of so many excellent issue-tracking systems, we thought the best way to start is to reuse one—in the simplest possible way.</p>
<p>With <a href="https://github.com/datasets/issues">Data Issues</a>, we’re using GitHub Issues to track issues with datasets. Data Issues is essentially just a GitHub repository whose Issues are used to report problems on open datasets. Any problem with any dataset can be reported on Data Issues.</p>
<p>To report an issue with some data, just <a href="https://github.com/datasets/issues/issues/new">open an issue in the tracker</a>, add relevant info on the data (its URL, who’s responsible for it, the line number of the bug, etc.), and explain the problem. You can add labels to group related issues—for example, if multiple datasets from the same site have problems, you can add a label that identifies the dataset’s site of origin.</p>
<p>Straightaway, the issue you raise becomes a <em>public notice</em> of the problem with the dataset. Everyone interested in the dataset has access to the issue. The issue is also <em>actionable</em>: each issue contains a thread of comments that can be used to track the issue’s status, and the issue can be <em>closed</em> when it has been fixed. All issues submitted to Data Issues are visible in a central list, which can be filtered by keyword or label to zoom in on relevant issues. All of these great features come <em>for free</em> because we’re using GitHub Issues.</p>
<h2 id="get-involved">Get Involved</h2>
<p>For Data Issues to work, people need to use it. If civic hackers, journalists, and other data wranglers learn about Data Issues and start using it to track their work on datasets, we might find that the problem of tracking issues with datasets has already been solved.</p>
<p>You can also contribute by helping develop the project into something richer than a simple Issues page. One limitation of Data Issues is that raising an issue does not actually contact the parties responsible for the data. Our next goal is to automate sending along feedback from Data Issues, making it a more effective bug tracker.</p>
<p>If you want to discuss new directions for Data Issues or point out something you’ve built that contributes to the project, get in touch via the <a href="http://lists.okfn.org/mailman/listinfo/okfn-labs">Labs mailing list</a>.</p>
Rufus Pollock
A Python guide for open data file formats
2013-10-17T00:00:00+00:00
http://okfnlabs.org/blog/2013/10/17/python-guide-for-file-formats
<p>If you are an open data researcher you will need to handle a lot of different file formats from datasets. Sadly, most of the time you don’t have the opportunity to choose which file format is best for your project; you have to cope with all of them to be sure that you won’t hit a dead end. There’s always someone who knows the solution to your problem, but that doesn’t mean that answers come easy. Here is a guide to each file format from the <a href="http://opendatahandbook.org/">open data handbook</a>, with a suggested Python library for each.</p>
<p><strong>JSON</strong> is a simple file format that is very easy for any programming language to read. Its simplicity means that it is generally easier for computers to process than other formats, such as XML. Working with JSON in Python is almost the same as working with a Python dictionary. You will need the json library, which is preinstalled with every Python from 2.6 onwards.</p>
<pre>import json

# Load a JSON file into Python data structures (dicts and lists)
with open("file.json") as json_file:
    data = json.load(json_file)</pre>
<p>Then data["key"] returns the value stored under that key.</p>
<p><strong>XML</strong> is a widely used format for data exchange because it preserves the structure of the data and the way files are built, and it allows developers to write parts of the documentation in with the data without interfering with reading it. Parsing it is pretty easy in Python as well. You will need the minidom library, which is also preinstalled.</p>
<pre>from xml.dom import minidom

# Parse an XML file and collect every element with the "name" tag
xmldoc = minidom.parse("file.xml")
itemlist = xmldoc.getElementsByTagName("name")</pre>
<p>This returns a list of all the elements with the “name” tag.</p>
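<p>Continuing that example, you can walk the list to pull out actual values; the “id” attribute here is purely illustrative, and reading firstChild.data assumes the element directly contains text:</p>
<pre>for item in itemlist:
    print item.getAttribute("id")   # an illustrative attribute
    print item.firstChild.data      # the text inside the element</pre>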
<p><strong>RDF</strong> is a W3C-recommended format and makes it possible to represent data in a form that makes it easier to combine data from multiple sources. RDF data can be stored in XML and JSON, among other serializations. RDF encourages the use of URLs as identifiers, which provides a convenient way to directly interconnect existing open data initiatives on the Web. RDF is still not widespread, but it has been a trend among Open Government initiatives, including the British and Spanish Government Linked Open Data projects. The inventor of the Web, Tim Berners-Lee, has recently proposed a five-star scheme that includes linked RDF data as a goal to be sought for open data initiatives. I use rdflib for this file format. Here is an example.</p>
<pre>from rdflib.graph import Graph

g = Graph()
g.parse("file.rdf", format="xml")  # or "n3", "nt", ... to match the serialization
for stmt in g:
    print(stmt)</pre>
<p>With RDF you can also run queries and return only the data you want, but this isn’t as easy as parsing. You can find a tutorial <a href="http://code.alcidesfonseca.com/docs/rdflib/gettingstarted.html#run-a-query">here</a>.</p>
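<p>For a taste of querying, here is a sketch that runs a SPARQL query against the graph g built above (this assumes a version of rdflib with SPARQL support; older releases needed the rdfextras plugin):</p>
<pre># List ten triples from the graph built above
results = g.query("""
    SELECT ?subject ?predicate ?object
    WHERE { ?subject ?predicate ?object }
    LIMIT 10
""")
for row in results:
    print row</pre>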
<p><strong>Spreadsheets</strong>. Many authorities have information stored in spreadsheets, for example Microsoft Excel. This data can often be used immediately, given correct descriptions of what the different columns mean. However, in some cases there can be macros and formulas in spreadsheets, which may be somewhat more cumbersome to handle. It is therefore advisable to document such calculations next to the spreadsheet, since that is generally more accessible for users to read. I prefer to use a tool like xls2csv and then use the output file as a CSV. But if you want for any reason to work with an xls file, the best source I have found is <a href="http://www.python-excel.org/">www.python-excel.org</a>. The most popular library is the first one, xlrd. There is also another library, <a href="http://pythonhosted.org/openpyxl/">openpyxl</a>, which lets you work with xlsx files.</p>
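<p>A minimal sketch of reading an xls file with xlrd (the file name is a placeholder):</p>
<pre>import xlrd

book = xlrd.open_workbook("file.xls")
sheet = book.sheet_by_index(0)        # the first worksheet
for rownum in range(sheet.nrows):
    print sheet.row_values(rownum)    # each row as a list of cell values</pre>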
<p><strong>Comma Separated Values (CSV)</strong> files can be a very useful format because they are compact and thus suitable for transferring large sets of data with the same structure. However, the format is so spartan that data is often useless without documentation, since it can be almost impossible to guess the significance of the different columns. It is therefore particularly important for comma-separated formats that the documentation of the individual fields is accurate. Furthermore, it is essential that the structure of the file is respected, as a single omitted field may disturb the reading of all remaining data in the file without any real opportunity to rectify it, because it cannot be determined how the remaining data should be interpreted. You can use the csv Python library. Here is an example.</p>
<pre>import csv

with open('eggs.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in reader:
        print ', '.join(row)</pre>
<p><strong>Plain Text (txt)</strong> files are very easy for computers to read. However, they generally exclude structural metadata from inside the document, meaning that developers will need to create a parser that can interpret each document as it appears. Some problems can be caused by switching plain text files between operating systems: MS Windows, Mac OS X and other Unix variants each have their own way of telling the computer that the end of a line has been reached. You can load a txt file easily, but how you use it after that depends on the data format.</p>
<pre>text_file = open("file.txt", "r")
text = text_file.read()   # the whole file as a single string
text_file.close()</pre>
<p>This example returns the whole text as a single string.</p>
<p><strong>PDF</strong>. Here is the biggest problem among open data file formats. Many datasets have their data in PDF and unfortunately it isn’t easy to read, let alone edit, them: PDF is really presentation-oriented, not content-oriented. But you can use <a href="https://pypi.python.org/pypi/pdfminer/">PDFMiner</a> to work with it. I won’t include an example here since it isn’t a trivial one, but you can find everything you need in their documentation.</p>
<p><strong>HTML</strong>. Nowadays much data is available in HTML format on various sites. This may well be sufficient if the data is very stable and limited in scope. In some cases, it could be preferable to have the data in a form that is easier to download and manipulate, but as it is cheap and easy to refer to a page on a website, it might be a good starting point for displaying data. Typically, it is most appropriate to use tables in HTML documents to hold data, and it is then important that the various data fields are displayed and given IDs which make it easy to find and manipulate the data. Yahoo has developed a tool, <a href="http://developer.yahoo.com/yql/">yql</a>, that can extract structured information from a website, and such tools can do much more with the data if it is carefully tagged. I have used a Python library called Beautiful Soup many times in my projects.</p>
<pre>from bs4 import BeautifulSoup

soup = BeautifulSoup(html_file)
soup.title                 # the title element
soup.title.name            # its tag name
soup.title.string          # its text content
soup.title.parent.name     # the name of its parent tag
soup.p                     # the first paragraph element
soup.p['class']            # that element's class attribute
soup.a                     # the first link element
soup.find_all('a')         # every link element
soup.find(id="link3")      # the element with id="link3"</pre>
<p>Those are only a few of the things you can do with this library. Accessing a tag returns its content. You can find more in their <a href="http://www.crummy.com/software/BeautifulSoup/bs4/doc/">documentation</a>.</p>
<p><strong>Scanned image</strong>. Yes, it is true. This is probably the least suitable form for most data, but both TIFF and JPEG-2000 can at least be marked up with documentation of what is in the picture - right up to the full text content of a scanned document. If the images are clean, containing only text and without any noise, you can use a library called pytesser. You will need the PIL library to use it. Here is an example.</p>
<pre>from pytesser import *
image = Image.open('fnord.tif') # Open image object using PIL
print image_to_string(image)</pre>
<p><strong>Proprietary formats</strong>. Last but not least, some dedicated systems have their own data formats that they can save or export data in. It can sometimes be enough to expose data in such a format, especially if it is expected that further use would be in a similar system to the one it came from. Where further information on these proprietary formats can be found should always be indicated, for example by providing a link to the supplier’s website. Generally it is recommended to publish data in non-proprietary formats where feasible. I suggest googling whether there is a library specific to the format in question.</p>
<p><strong>Additional Info</strong>. You may also find the <a href="http://pandas.pydata.org/pandas-docs/stable/io.html">Pandas</a> library useful; its I/O capabilities integrate and unify access to and from most of these formats: CSV, Excel, HDF, SQL, JSON, HTML, Pickle.</p>
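<p>For example, a minimal sketch of loading a CSV with Pandas (the file name is a placeholder; similar read functions exist for several of the other formats):</p>
<pre>import pandas as pd

df = pd.read_csv("file.csv")
print df.head()   # the first few rows, nicely formatted</pre>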
Anastasios Ventouris
Introducing TimeMapper - Create Elegant TimeMaps in Seconds
2013-10-11T00:00:00+00:00
http://okfnlabs.org/blog/2013/10/11/timemapper
<p><a href="http://timemapper.okfnlabs.org">TimeMapper</a> lets you create elegant and embeddable timemaps quickly and easily from a simple spreadsheet.</p>
<p><a href="http://timemapper.okfnlabs.org/okfn/medieval-philosophers"><img src="http://i.imgur.com/FmPTZlr.png" alt="Medieval philosophers timemap" /></a></p>
<p>A timemap is an interactive timeline whose items connect to a geomap. Creating a timemap with TimeMapper is as easy as filling in a spreadsheet template and copying its URL.</p>
<p>In this quick walkthrough, we’ll learn how to recreate the <a href="http://timemapper.okfnlabs.org/okfn/medieval-philosophers">timemap of medieval philosophers</a> shown above using TimeMapper.</p>
<h2 id="getting-started-with-timemapper">Getting started with TimeMapper</h2>
<p>To get started, go to the <a href="http://timemapper.okfnlabs.org/">TimeMapper website</a> and sign in using your <a href="http://twitter.com">Twitter</a> account. Then click <strong>Create a new Timeline or TimeMap</strong> to start a new project. As you’ll see, it really is as easy as 1-2-3.</p>
<p>TimeMapper projects are generated from <a href="http://docs.google.com">Google Sheets</a> spreadsheets. Each item on the timemap – an event, an individual, or anything else associated with a date (or two, for the start and end of a period) – is a spreadsheet row.</p>
<p>What can you put in the spreadsheet? Check out the <a href="https://docs.google.com/a/okfn.org/spreadsheet/ccc?key=0AqR8dXc6Ji4JdFRNOTVYYTRqTmh6TUNNd3U2X2pKMGc#gid=0">TimeMapper template</a>. It contains all of the columns that TimeMapper understands, plus a row of cells explaining what each of them means. Your timemap doesn’t have to use all of these columns, though—it just requires a <em>Start</em> date, a <em>Title</em>, and a <em>Description</em> for each item, plus geographical coordinates for the map.</p>
<p>So you’ve put your data in a Google spreadsheet—how can you make it into a timemap? Easy! From Google Sheets, go to <strong>File -> Publish to the web</strong> and hit <strong>Start publishing</strong>. Then click on your sheet’s <strong>Share</strong> button and set the sheet’s visibility to <em>Anyone who has the link can <strong>view</strong></em>. You can either copy the URL from <em>Link to share</em> and paste that URL into the box in Step 2 of the TimeMapper creation process or click on <strong>Select from Your Google Drive</strong> to just browse to the sheet. Whichever you do, then hit <strong>Connect and Publish</strong>—and voilà!</p>
<p><img src="http://i.imgur.com/5SLOURu.png" alt="Share your spreadsheet" /></p>
<p>Embedding your new timemap is just as easy as creating it. Click on <strong>Embed</strong> in the top right corner. It will pop up a snippet of HTML which you can paste into your webpage to embed the timemap. And that’s all it takes!</p>
<p><img src="http://i.imgur.com/3KWL6p6.png" alt="Embed your timemap" /></p>
<h2 id="coming-next">Coming next</h2>
<p>We have big plans for TimeMapper, including:</p>
<ul>
<li>Support for indicating size and time on the map</li>
<li>Quickly create TimeMaps using information from Wikipedia</li>
<li>Connect markers in maps to form a route</li>
<li>Options for timeline- and map-only project layouts</li>
<li><a href="http://disqus.com">Disqus</a>-based comments</li>
<li>Core JS library, <strong>timemapper.js</strong>, so you can build your own apps with timemaps</li>
</ul>
<p>Check out the <a href="https://github.com/okfn/timemapper/issues">TimeMapper issues list</a> to see what ideas we’ve got and to leave suggestions.</p>
<h2 id="code">Code</h2>
<p>In terms of the internals, the app is a simple node.js app with storage in S3. The timemap visualization is pure JS, built using KnightLab’s excellent <a href="http://timeline.knightlab.com/">Timeline.js</a> for the timeline and <a href="http://leafletjs.com/">Leaflet</a> (with OSM) for the maps. For those interested, the code can be found at: <a href="https://github.com/okfn/timemapper/">https://github.com/okfn/timemapper/</a></p>
<h2 id="history-and-credits">History and credits</h2>
<p>TimeMapper is made possible by awesome open source libraries like <a href="http://timeline.verite.co">TimelineJS</a>, <a href="http://backbonejs.org">Backbone</a>, and <a href="http://leafletjs.com">Leaflet</a>, not to mention open data from <a href="http://www.openstreetmap.org">OpenStreetMap</a>. When we first built a TimeMapper-style site in 2007 under the title “Weaving History”, it was a real struggle over many months to build a responsive JavaScript-heavy app. Today, thanks to libraries like these and advances in browsers, it’s now a matter of weeks.</p>
Neil Ashton
Datapackageproxy - work with datapackages in your browser
2013-10-11T00:00:00+00:00
http://okfnlabs.org/blog/2013/10/11/Datapackageproxy-work-with-datapackages-in-your-browser
<p><a href="http://data.okfn.org/standards/data-package">Datapackages</a> are a neat idea
along the “using data like we use code” way. While Tryggvi has created a
nice <a href="https://github.com/tryggvib/datapackage">python module to handle datapackages</a> - there is a problem
using datapackages in javascript.</p>
<p>In an ideal world I’d just call something like <code class="language-plaintext highlighter-rouge">d3.csv()</code> on any csv file on the web. Browser restrictions, however, don’t allow loading arbitrary files from arbitrary websites (for good reasons). To allow it anyway, the serving site needs to opt in explicitly (read more on <a href="http://cors-enable.org">CORS</a>).</p>
<p>Datapackages are hosted by a variety of hosters and many don’t support CORS - thus we’ll need to proxy them through a system that understands the format and is CORS-enabled: datapackageproxy.</p>
<p>Using datapackageproxy is simple. To access a resource of a package, use:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>http://datapackageproxy.appspot.com/resource?url=http://data.okfn.org/data/bond-yields-uk-10y
</code></pre></div></div>
<p>The optional id parameter allows specifying a particular resource (though this is not implemented yet). The proxy returns the data as CSV, so you can use it in d3.csv(). To get the metadata (the package definition) of a datapackage, use:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>http://datapackageproxy.appspot.com/metadata?url=http://data.okfn.org/data/bond-yields-uk-10y
</code></pre></div></div>
<p>The datapackageproxy is built on appspot, so if there is very heavy load it might go over usage limits. (If this happens, I’ll try to either move it or figure something else out…)</p>
<p>Find the code on <a href="https://github.com/mihi-tr/datapackageproxy">github</a> -
and the proxy itself on <a href="http://datapackageproxy.appspot.com">datapackageproxy.appspot.com</a></p>
Michael Bauer
PublicBodies.org - Update no. 2
2013-10-07T00:00:00+00:00
http://okfnlabs.org/blog/2013/10/07/publicbodies.org-update-no-2
<p>Herewith is a report on recent improvements to <a href="http://publicbodies.org/">PublicBodies.org</a>, our Open Knowledge Foundation Labs project to provide “a URL (and information) for every public body” - that’s every government-funded agency, department or organization.</p>
<p><a href="http://publicbodies.org/"><img src="http://farm6.staticflickr.com/5349/10141929423_d84e45764d_z.jpg" alt="" /></a></p>
<h2 id="new-data">New data</h2>
<p>New data contributed over the last couple of months is now validated and live - this includes new data for <strong>Switzerland, Greece, Brazil and the US</strong>. Huge thank-you to contributors here including Hannes, Charalampos, Augusto and Todd.</p>
<p>We also have pending data for Italy and China to get in once it has been reviewed, and we have data in progress for Canada!</p>
<p>We’d love to have more data - if you’re interested in contributing see <a href="https://github.com/okfn/publicbodies#contribute-data">https://github.com/okfn/publicbodies#contribute-data</a></p>
<h2 id="updated-schema-for-data">Updated Schema for Data</h2>
<p>Thanks to input from James McKinney and others we’ve <a href="https://github.com/okfn/publicbodies/issues/29">reworked the schema</a> quite extensively to match up as much as possible with the Popolo spec. You can see the new schema in the <a href="https://github.com/okfn/publicbodies/blob/master/datapackage.json#L13-L105">datapackage.json</a>.</p>
<p>Or, if you don’t like raw JSON, there is a prettier HTML version at: <a href="http://data.okfn.org/community/okfn/publicbodies">http://data.okfn.org/community/okfn/publicbodies</a></p>
<h2 id="search-support">Search support</h2>
<p>We now have basic search support via google custom search: <a href="http://publicbodies.org/search">http://publicbodies.org/search</a></p>
<h2 id="get-involved">Get Involved</h2>
<p>As always we’d love help! There is a <a href="https://github.com/okfn/publicbodies/issues">full list of issues here</a> and example items:</p>
<ul>
<li><a href="https://github.com/okfn/publicbodies/issues/35">Getting Descriptions for 130 EU Public Bodies</a></li>
<li><a href="https://github.com/okfn/publicbodies/issues/8">Support for sending corrections / additions on the website</a></li>
<li><a href="https://github.com/okfn/publicbodies/issues/2">Support for reconciliation (e.g. via nomenklatura)</a></li>
</ul>
Rufus Pollock
Data as Code Deja-Vu
2013-10-04T00:00:00+00:00
http://okfnlabs.org/blog/2013/10/04/data-as-code-dejavu
<p>Someone just pointed me at <a href="http://ben.balter.com/2013/09/16/treat-data-as-code/">this post from Ben Balter about Data as Code</a> in which he emphasizes the analogies between data and code (and especially open data and open-source – e.g. “data is where code was 2 decades ago” …).</p>
<p>I was delighted to see this post as it makes many points I deeply agree with - and have for some time. In fact, reading it gave me something of a sense of (very positive) déjà-vu, since it made similar points to several posts I and others had written several years ago - suggesting that perhaps we’re now getting close to the critical mass we need to create a real distributed and collaborative <a href="http://blog.okfn.org/2011/03/31/building-the-open-data-ecosystem/">open data ecosystem</a>!</p>
<p>It also suggested it was worth dusting off and recapping some of this earlier material as much of it was written more than 6 years ago, a period which, in tech terms, can seem like the stone age.</p>
<h2 id="previous-thinking">Previous Thinking</h2>
<p>For example, there is this essay from 2007 on <a href="http://blog.okfn.org/writings/componentization/">Componentization and Open Data</a> that Jo Walsh and I wrote for our XTech talk that year on CKAN. It emphasized analogies with code and the importance of componentization and packaging.</p>
<p>This, in turn, was based on <a href="http://blog.okfn.org/2006/05/09/the-four-principles-of-open-knowledge-development/">Four principles for Open Knowledge Development</a> and <a href="http://blog.okfn.org/2007/04/30/what-do-we-mean-by-componentization-for-knowledge/">What do we mean by componentization for knowledge</a>. We also emphasized the importance of “version control” in facilitating distributed collaboration, for example in <a href="http://blog.okfn.org/2007/02/20/collaborative-development-of-data/">Collaborative Development of Data (2006/2007)</a> and, more recently, in <a href="http://blog.okfn.org/2010/07/12/we-need-distributed-revisionversion-control-for-data/">Distributed Revision Control for Data (2010)</a> and this year in <a href="http://blog.okfn.org/2013/07/02/git-and-github-for-data/">Git (and GitHub) for Data</a>.</p>
<h2 id="package-managers-and-ckan">Package Managers and CKAN</h2>
<p>This also brings me to a point relevant both to Ben’s post and Michal’s comment: the original purpose (and design) of CKAN was <em>precisely</em> to be a package manager a la rubygems, pypi, debian etc. It has evolved a lot from that into more of a “wordpress for data” - i.e. a platform for publishing, managing (and storing) data because of user demand. (Note that in early CKAN “datasets” were called packages in both the interface and code - a poor UX decision ;-) that illustrated we were definitely ahead of our time - or just wrong!)</p>
<p>Some sense of what was intended is evidenced by the fact that in 2007 we were writing a command-line tool called datapkg (since renamed to dpm, for data package manager) to act as the command-line equivalent of gem / pip / apt-get - see <a href="http://blog.okfn.org/2010/02/23/introducing-datapkg/">this Introducing DataPkg post</a>, which included this diagram illustrating how things were supposed to work.</p>
<p><img src="http://m.okfn.org/files/talks/media/debian_of_data.png" alt="" /></p>
<h2 id="recent-developments">Recent Developments</h2>
<p>As CKAN has evolved into a more general-purpose tool – with less of a focus on just being a registry supporting automated access – we’ve continued to develop those ideas. For example:</p>
<ul>
<li>The basic “package” idea from CKAN has evolved into the <a href="http://data.okfn.org/standards/data-package">Data Package spec</a> - and <a href="http://data.okfn.org/standards/simple-data-format">Simple Data Format</a></li>
<li>We’ve <a href="http://blog.okfn.org/2013/07/02/git-and-github-for-data/">explored storing data using code tools like git</a> - with a dedicated <a href="http://github.com/datasets">datasets organization on Github</a></li>
<li>We’ve re-booted the idea of a simple registry and storage mechanism in the form of <a href="http://data.okfn.org/">http://data.okfn.org/</a> - with data stored in simple data format in git repos on github, displayed in a very simple registry with good tool integration, and curated by a dedicated group of maintainers</li>
<li>We’ve booted the <a href="http://blog.okfn.org/2013/04/24/frictionless-data-making-it-radically-easier-to-get-stuff-done-with-data/">“Frictionless Data” initiative</a> as a way to bring together these different activities in one coherent vision of how we can do something simple to make progress</li>
</ul>
<p><a href="http://data.okfn.org/standards"><img src="http://assets.okfn.org/p/data.okfn.org/img/the-idea.png" alt="" /></a></p>
<p class="caption">Data Packages and Frictionless Data - from <a href="http://data.okfn.org/about">data.okfn.org</a></p>
Rufus Pollock
Full stack datavis - scraperwiki, d3 and github.
2013-09-24T00:00:00+00:00
http://okfnlabs.org/blog/2013/09/24/Full-stack-datavis-using-scraperwiki-d3-and-github
<p>The city of Vienna started releasing waiting times for some of its service offices recently. I followed my usual hunch and just wrote a <a href="https://scraperwiki.com/dataset/guvh44q">small script on scraperwiki</a> that stows away the JSON released by the city, not knowing yet what to do with it.</p>
<p>Weeks later <a href="http://hackshackers.at">Hacks/Hackers Vienna</a> decided to host a hackathon. I couldn’t make it (I thought I might), but I had the idea of developing the data into a visualization. I sat down later that week and published a <a href="http://wannaufsamt.tentacleriot.eu">visualization of waiting times</a>.</p>
<p><img src="http://wannaufsamt.tentacleriot.eu/waa.png" alt="Wann aufs amt?" /></p>
<h3 id="so-why-am-i-rambling-on-about-this">So why am I rambling on about this?</h3>
<p>I realized a couple of things while doing this:</p>
<p>One or two years back, facing a problem like this I would have: made space on a server, written an extensive scraper in python, set up a database to store the data, and written a backend web application to generate graphics and spit them out.</p>
<p>Today: I have my scraper and backend run as a service by
<a href="http://scraperwiki.com">scraperwiki</a>, use <a href="http://d3js.org">d3</a> to
generate graphics on the client (much better looking ones) and host the
whole thing for free on <a href="http://github.com">github</a> - because I don’t need
a backend anymore.</p>
<p>This is made possible by:</p>
<ul>
<li>More and more things offered as a service (often for free)</li>
<li>Amazing frameworks in modern languages, that make development easier</li>
<li>Fantastic resources to exchange knowledge</li>
</ul>
<p>Developing a small data-driven application used to be a lot of work - not anymore. While it takes a while to get used to the intricate ways of some frameworks (<a href="http://d3js.org">d3</a> has a quite unique way of doing things), once you’re over the hump things get a lot easier. This leaves you, in the end, thinking about the visualization or application you’re building - not worrying about server security, costs and setup.</p>
<p>This also means that full-stack data visualization has become easier. You used to need a team of specialists (sysadmins, backend developers, designers) to do a decent dataviz; now you just learn the missing parts and you’re able to pull it off.</p>
Michael Bauer
Using d3 as user input
2013-09-16T00:00:00+00:00
http://okfnlabs.org/blog/2013/09/16/using-d3-as-user-input
<p>Recently, I was at <a href="http://chicaspoderosas.org/">Chicas Poderosas</a> in Bogota - the three-day event featured talks on two days and a hackday on the last. During the event I was approached by <a href="http://cuyabracadabra.wordpress.com/">Natalia</a>, an industrial designer, who introduced a project of hers: <a href="http://cuyabracadabra.wordpress.com/electrocardiograma-%C2%B7-%C2%B7-%C2%B7/">Electrocardiogr_ama</a>. She wanted to build an app with similar features and pitched it on the hackday. I ended up working with Natalia and Knight/Mozilla OpenNews Fellow Sonya Song on the project.</p>
<p>Using <a href="http://d3js.org">D3</a> for visualizing the output was quite straigt
forward. But then, we wanted to have some easy to use user input - we
graded mood on a scale, but how to represent it best? Numbers from 1-x as
they are often used didn’t seem very intuitive (is 1 best or 10 best?).
After thinking about it for a while we had an idea of using a smiley as a
slider - the smiley would smile if happy and look sad if dragged to a sad
status.</p>
<p>See it working here (try dragging it up and down):</p>
<iframe src="http://sonya2song.github.io/moodlog/input.html" width="250" height="350" frameborder="0"></iframe>
<p>To read its value we use the following code:</p>
<pre><code class="javascript">
function sbmt() {
smilescale=d3.scale.linear()
.domain([50,250])
.range([1,10])
note=document.getElementById("note").value;
d3.select("svg > g#smiley").each(function(d) {
score=smilescale(d.y);
// XHTTP Post Request follows here
})
}
</code></pre>
<p>If you want to see it in action: try out the
<a href="http://moodlogr.appspot.com">Moodlog</a> app, or check out the
<a href="https://github.com/sonya2song/moodlog">github repo</a>.</p>
<p>User inputs are often not very intuitive, let’s make them better!</p>
Michael Bauer
Data Pipes - streaming online data transformations
2013-09-11T00:00:00+00:00
http://okfnlabs.org/blog/2013/09/11/datapipes
<p><strong><a href="http://datapipes.okfnlabs.org/">Data Pipes</a></strong> provides an online service built in NodeJS to do <strong>simple data transformations</strong> – deleting rows and columns, find and replace, filtering, viewing as HTML – and, furthermore, to <strong>connect these transformations together</strong> <em>Unix pipes style</em> to make more complex transformations. Because Data Pipes is a web service, data transformation with Data Pipes takes place entirely online and the results <strong>and</strong> process are completely shareable simply by sharing the URL.</p>
<h2 id="an-example">An example</h2>
<p>This takes the <a href="https://raw.github.com/okfn/datapipes/master/test/data/gla.csv">input data</a> (sourced from this <a href="http://static.london.gov.uk/gla/expenditure/docs/2012-13-P12-250.csv">original Greater London Authority financial data</a>), slices out the first 50 rows (head), deletes the first column (its blank!) (cut), deletes rows 1 through 7 (delete) and finally renders the result as HTML (html).</p>
<p><a href="http://datapipes.okfnlabs.org/csv/head%20-n%2050/cut%200/delete%201:7/html?url=https://raw.github.com/okfn/datapipes/master/test/data/gla.csv"><code>http://datapipes.okfnlabs.org/csv/head -n 50/cut 0/delete 1:7/html?url=https://raw.github.com/okfn/datapipes/master/test/data/gla.csv</code></a></p>
<h3 id="before">Before</h3>
<p><a href="http://datapipes.okfnlabs.org/csv/html?url=https://raw.github.com/okfn/datapipes/master/test/data/gla.csv">
<img src="http://farm3.staticflickr.com/2827/9726020844_0301af2ded.jpg" width="500" height="213" alt="Data pipes: GLA data, HTML view" />
</a></p>
<h3 id="after">After</h3>
<p><a href="http://datapipes.okfnlabs.org/csv/head%20-n%2050/cut%200/delete%201:7/html?url=https://raw.github.com/okfn/datapipes/master/test/data/gla.csv">
<img src="http://farm4.staticflickr.com/3728/9726020800_ff01da582e.jpg" width="500" height="177" alt="Data pipes: GLA data, trimmed" />
</a></p>
<h2 id="motivation---data-wrangling-pipes-nodejs-and-the-unix-philosophy">Motivation - Data Wrangling, Pipes, NodeJS and the Unix Philosophy</h2>
<p>When you find data in the wild you usually need to poke around in it and then do some cleaning for it to be usable.</p>
<p>Much of the inspiration for Data Pipes comes from our experience using Unix command-line tools like <code class="language-plaintext highlighter-rouge">grep</code>, <code class="language-plaintext highlighter-rouge">sed</code>, and <code class="language-plaintext highlighter-rouge">head</code> to do this kind of work. These tools are a powerful way to operate on <em>streams</em> of text (or more precisely streams of lines of text, since Unix tools process text files line by line). By using streams, they can scale to large files easily (they don’t load the whole file but process it bit by bit) and, more importantly, allow “piping” – that is, direct connection of the output of one command with the input of another.</p>
<p>This already provides quite a powerful way to do data wrangling (see <a href="https://github.com/rgrp/command-line-data-wrangling">here</a> for more). But there are limits: data isn’t always line-oriented, plus command line tools aren’t online, so it’s difficult to share and repeat what you are doing. Inspired by a combination of Unix pipes and the possibilities of <a href="http://nodejs.org/">NodeJS</a>’s great streaming capabilities, we wanted to take the pipes online for data processing – and so Data Pipes was born.</p>
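<p>To make the streaming idea concrete, here is a minimal sketch - not Data Pipes’ actual code - of the style it builds on: a Node Transform stream that drops the first <code>n</code> lines, in the spirit of the <code>delete</code> operation, processing the input chunk by chunk rather than loading the whole file:</p>
<pre><code class="javascript">
// A line-oriented Transform stream: drop the first n lines of the input.
var Transform = require('stream').Transform;

function deleteRows(n) {
  var t = new Transform();
  var seen = 0, buf = '';
  t._transform = function(chunk, enc, done) {
    buf += chunk.toString();
    var lines = buf.split('\n');
    buf = lines.pop(); // keep any partial trailing line for the next chunk
    lines.forEach(function(line) {
      if (seen++ >= n) this.push(line + '\n');
    }, this);
    done();
  };
  t._flush = function(done) {
    if (buf !== '' && seen++ >= n) this.push(buf + '\n');
    done();
  };
  return t;
}

// Unix style: pipe stdin through the transform and straight out again.
process.stdin.pipe(deleteRows(7)).pipe(process.stdout);
</code></pre>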
<p>We wanted to use the <a href="http://www.faqs.org/docs/artu/ch01s06.html">Unix philosophy</a> that teaches us to solve problems with cascades of simple, composable operations that manipulate streams, an approach which has proven almost <em>universally</em> effective.</p>
<p>Data Pipes brings the Unix philosophy and the Unix pipes style to online data. Any <a href="http://data.okfn.org/standards/csv">CSV</a> data can be piped through a cascade of transformations to produce a modified dataset, without ever downloading the data and with no need for your own backend. Being online means that the operations are immediately shareable and linkable.</p>
<h2 id="more-examples">More Examples</h2>
<p>Take, for example, this copy of a set of <a href="https://raw.github.com/okfn/datapipes/master/test/data/gla.csv">Greater London Authority financial data</a>. It’s unusable for most purposes, simply because it doesn’t abide by the CSV convention that the first line should contain the headers of the table. The header is preceded by six lines of useless commentary. Another problem is that the first column is totally empty.</p>
<p><img src="http://farm4.staticflickr.com/3824/9726020908_bb2d26b694.jpg" width="500" height="363" alt="Data pipes: Greater London Authority financial data, in the raw" /></p>
<p>First of all, let’s use the Data Pipes <code class="language-plaintext highlighter-rouge">html</code> operation to get a nicer-looking view of the table.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GET /csv/html/?url=http://static.london.gov.uk/gla/expenditure/docs/2012-13-P12-250.csv
</code></pre></div></div>
<p><img src="http://farm3.staticflickr.com/2827/9726020844_0301af2ded.jpg" width="500" height="213" alt="Data pipes: GLA data, HTML view" /></p>
<p>Now let’s get rid of those first six lines and the empty column. We can do this by chaining together the <code class="language-plaintext highlighter-rouge">delete</code> operation and the <code class="language-plaintext highlighter-rouge">cut</code> operation:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GET /csv/delete 0:6/cut 0/html/?url=http://static.london.gov.uk/gla/expenditure/docs/2012-13-P12-250.csv
</code></pre></div></div>
<p>And just like that, we’ve got a well-formed CSV!</p>
<p><img src="http://farm4.staticflickr.com/3728/9726020800_ff01da582e.jpg" width="500" height="177" alt="Data pipes: GLA data, trimmed" /></p>
<p>But why stop there? Why not take the output of that transformation and, say, search it for the string “LONDON” with the <code class="language-plaintext highlighter-rouge">grep</code> transform, then take just the first 20 entries with <code class="language-plaintext highlighter-rouge">head</code>?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GET /csv/delete 0:6/cut 0/grep LONDON/head -n 20/html/?url=http://static.london.gov.uk/gla/expenditure/docs/2012-13-P12-250.csv
</code></pre></div></div>
<p><img src="http://farm6.staticflickr.com/5505/9726020732_c5ca38c10a.jpg" width="500" height="370" alt="Data pipes: GLA data, final view" /></p>
<p>Awesome!</p>
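<p>And since a whole pipeline is just a URL, Data Pipes is easy to drive programmatically too. A minimal sketch in Node, reusing the exact pipeline above:</p>
<pre><code class="javascript">
// Build the pipeline URL from a list of operations and stream the result.
var http = require('http');

var ops = ['delete 0:6', 'cut 0', 'grep LONDON', 'head -n 20', 'html'];
var source = 'http://static.london.gov.uk/gla/expenditure/docs/2012-13-P12-250.csv';
var url = 'http://datapipes.okfnlabs.org/csv/' +
          ops.map(encodeURIComponent).join('/') +
          '?url=' + encodeURIComponent(source);

http.get(url, function(res) {
  res.pipe(process.stdout); // the transformed output, fresh from the web
});
</code></pre>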
<h2 id="whats-next">What’s next?</h2>
<p>Data Pipes already supports a useful collection of operations, but it’s still in development, and more are yet to come, including a find-and-replace operation (<code class="language-plaintext highlighter-rouge">sed</code>) plus support for <a href="https://github.com/okfn/datapipes/issues/21">arbitrary map and filter functions</a>.</p>
<p>You can see the full list <a href="http://datapipes.okfnlabs.org/">on the Data Pipes site</a>, and you can suggest more transforms to implement by <a href="https://github.com/okfn/datapipes/issues/new">raising an issue</a>.</p>
<p>Data Pipes needs more operations for its toolkit. That means its developers need to know what you do with data – and to think about how it can be broken down in the grand old Unix fashion. To join in, check out <a href="https://github.com/okfn/datapipes">Data Pipes on GitHub</a> and let us know what you think.</p>
Rufus Pollock
Miga, a new app generator for structured data
2013-08-27T00:00:00+00:00
http://okfnlabs.org/blog/2013/08/27/miga
<p>I’m pleased to announce the <a href="http://migadv.com/">Miga Data Viewer</a>, or Miga, an open source tool I created that lets you create a web/mobile app nearly automatically from a set of CSV data.</p>
<p>There are already various applications/frameworks that provide a JavaScript-enabled front-end for structured data - and not just structured data in general, but CSV data in particular. These include two published by the Open Knowledge Foundation itself, <a href="http://okfnlabs.org/recline/">Recline.js</a> and the related <a href="http://explorer.okfnlabs.org">Data Explorer</a>.</p>
<p>Miga is different from these and (I think) other tools in a few ways. Instead of presenting an aggregate view of the data, its interface is more similar to that of an app, where each entity/row has its own page. (There are also ways to view the data in aggregate, with maps and schedules and the like.)</p>
<p>There are some other features that Miga has that I don’t believe other data-browsing tools have at the moment, including the ability to browse multiple, linked tables (like having one file about countries and another about cities, where a column in the latter connects the two). But the most important difference is that Miga provides offline viewing: once a site, i.e. a dataset, has been accessed, it can be viewed even if the internet connection is lost. The offline capability is provided through two very useful technologies: Web SQL and Application Cache. (The use of Web SQL unfortunately means that Miga can’t be used on Firefox or Internet Explorer, though it works fine on all other major browsers, including mobile browsers.)</p>
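<p>For the curious, the Web SQL side boils down to the standard <code>openDatabase</code> API. A hypothetical sketch - not Miga’s actual code - of the cache-then-query pattern it enables:</p>
<pre><code class="javascript">
// Open (or create) a client-side database; rows stored here remain
// readable even after the network connection is lost.
var db = openDatabase('miga-demo', '1.0', 'Cached dataset', 2 * 1024 * 1024);

// Cache a row locally...
db.transaction(function(tx) {
  tx.executeSql('CREATE TABLE IF NOT EXISTS cities (name TEXT, country TEXT)');
  tx.executeSql('INSERT INTO cities (name, country) VALUES (?, ?)',
                ['Bogota', 'Colombia']);
});

// ...and query it back, online or offline.
db.readTransaction(function(tx) {
  tx.executeSql('SELECT name FROM cities WHERE country = ?', ['Colombia'],
                function(tx, results) {
                  for (var i = 0; i < results.rows.length; i++) {
                    console.log(results.rows.item(i).name);
                  }
                });
});
</code></pre>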
<p>I think Miga could potentially be useful for publishing a lot of different kinds of data - for both regular browsing and mobile app-style functionality. I would encourage anyone to try it out for themselves - if you have a set of data in CSV format, or can create it, the rest of the setup is not that hard.</p>
Yaron Koren
How a bad experience made an OKFN labs project
2013-08-22T00:00:00+00:00
http://okfnlabs.org/blog/2013/08/22/how-a-bad-experience-made-an-okfn-labs-project
<h2 id="from-theory-to-experimentation">From theory to experimentation</h2>
<p>Back in November 2010, I faced a problem while teaching my students about the Semantic Web. I wanted to convey the idea that Semantic Web technologies can break down the barriers between dataset silos on the Web and simplify the publication and consumption of open data. This idealistic idea was suddenly undermined when we moved from theory to practice. My exercise used open data to answer the following question: which films may have been biased by a partnership relation between the film director and a member of the cast? This question, when written in the SPARQL Semantic Web querying language, can be executed on the DBPedia SPARQL Endpoint (if you are curious about the results, visit <a href="http://bit.ly/14didjd">http://bit.ly/14didjd</a>). Based on this query I built a whole exercise for my students to discover the potential of the SPARQL language.</p>
<p>When the day came for my students to experiment with the amazing capabilities of Semantic Web technology, the DBPedia server (hosting the SPARQL Endpoint service used by this exercise) was down, leaving me to field awkward remarks like “it was too beautiful to be true”. As a result of that experience I made two decisions: 1) instead of using one exercise and one endpoint, I would provide three exercises using three different endpoints, to maximise my chances of having at least one running; and 2) I would develop an application that gives the real picture of one aspect of the Semantic Web architecture: the availability of SPARQL Endpoints.</p>
<h2 id="sparql-endpoint-status">SPARQL Endpoint Status</h2>
<p><img src="https://sites.google.com/site/pierreyvesvandenbussche/resources/SES.png" alt="SPARQL Endpoint Status logo" />
The SPARQL Endpoint Status application has been monitoring the publicly available SPARQL Endpoints listed on <a href="http://datahub.io/">datahub.io</a> for two and a half years. From this study, we can see in the following figure:</p>
<p><img src="https://sites.google.com/site/pierreyvesvandenbussche/resources/sparqles_fig1.png" alt="Evolution of the average endpoint availability between February 2011 and April 2013" /></p>
<p>that the mean endpoint availability has been decreasing over time; however, the mean trend is not followed by all endpoints. For instance, the DBPedia endpoint has always been above 90% availability, whereas the mean is dragged down by a growing number of offline endpoints, exemplified by the Kasabi NASA endpoint.</p>
<p>The variance of endpoint-availability profiles is illustrated by the distribution</p>
<p><img src="https://sites.google.com/site/pierreyvesvandenbussche/resources/sparqles_fig2.png" alt="Evolution of Endpoints number per availability rate between February 2011 and April 2013" /></p>
<p>where many endpoints fall into one of two extremes: 24.3% of endpoints are always down, whereas 31% of endpoints have an availability rate higher than 95%. The apparent overall decline in endpoint availability is possibly an effect of maturation. SPARQL is currently moving away from experimentation, leaving permanently offline endpoints in its wake (e.g. Kasabi endpoints) with fewer new <em>experimental</em> endpoints being reported. However, other endpoints (such as data.gov) are supported by well-established stakeholders (here, the U.S. government), and are part of a sustainable policy to deliver a high quality of service to end-user applications.</p>
<h2 id="an-okfn-labs-project">An OKFN Labs Project</h2>
<p>Thanks to a growing community interest and the support of the Open Knowledge Foundation, our project scope is now being extended to further monitor the performance, the discoverability and the interoperability of SPARQL Endpoints. A new version of the tool will soon be hosted by OKFN and will be presented during the ISWC 2013 conference.</p>
<p>As soon as the tool is up and running, we will announce it on the OKFN Labs blog, so stay tuned!</p>
Pierre-Yves Vandenbussche
ropenspending - accessing the OpenSpending API through R
2013-08-14T00:00:00+00:00
http://okfnlabs.org/blog/2013/08/14/R-openspending-accessing-the-openspending-API-in-R
<p>Tonight a couple of us were having a discussion on the
<a href="http://openspending.org">OpenSpending</a> IRC channel about how we can promote
and better document the usage of the API. Tony had already begun to work on
OpenSpending using <a href="http://r-project.org">R</a>. I had previously done so as
well. This prompted me to set out and create an R package:
<code class="language-plaintext highlighter-rouge">ropenspending</code>.</p>
<p><code class="language-plaintext highlighter-rouge">ropenspending</code> aims at making it easier to access the OpenSpending API
from within R. It provides access to certain bits of the API - most
importantly the aggregate function (through <code class="language-plaintext highlighter-rouge">openspending.aggregate</code>).
While it is still in its infancy, it is functional and can be obtained on
<a href="http://github.com/mihi-tr/r-openspending">github</a>. I’ll work to push this
to CRAN as well (as soon as I figure out how to do that).</p>
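<p>Under the hood, an aggregate query is just an HTTP GET against the OpenSpending API. A rough sketch of that raw call in Node - the endpoint path, parameters, dataset name and response fields are written from memory of the API docs, so treat the details as assumptions:</p>
<pre><code class="javascript">
// Fetch aggregated spending figures straight from the API (assumed endpoint).
var http = require('http');

var url = 'http://openspending.org/api/2/aggregate' +
          '?dataset=ukgov-finances-cra&drilldown=year';

http.get(url, function(res) {
  var body = '';
  res.on('data', function(chunk) { body += chunk; });
  res.on('end', function() {
    var result = JSON.parse(body);
    console.log(result.summary); // assumed field holding the totals
  });
});
</code></pre>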
Michael Bauer
Diffing and patching tabular data
2013-08-08T00:00:00+00:00
http://okfnlabs.org/blog/2013/08/08/diffing-and-patching-data
<p>A few years ago at the Eastern Conference for
Workplace Democracy in New Hampshire, a bunch of friends
chatting on a grassy knoll
realized they were all working on overlapping directories of their
communities, and decided to pool their efforts.
They tracked down some techies (I was one) and set them
to work building a directory website. Someone should have warned
them about techies.</p>
<p><img src="http://imgs.xkcd.com/comics/the_general_problem.png" alt="From http://xkcd.com/974" title="From http://xkcd.com/974" /></p>
<p>Eight years later, okay, there is a <a href="http://find.coop">directory website</a>, but the project has
morphed into something a lot more ambitious:</p>
<ul>
<li>A full-blown co-op to deal with the cultural and legal side of data sharing. This is the <a href="http://datacommons.find.coop">Data Commons Co-op</a>.</li>
<li>A growing toolbox to deal with the technological side of data sharing, specifically how to have fun (rather than get depressed) collaborating on data projects. This is the <a href="https://github.com/paulfitz/coopy">Coopy Toolbox</a>.</li>
</ul>
<p>We, like <a href="http://blog.okfn.org/2010/07/12/we-need-distributed-revisionversion-control-for-data/">others</a> in the Open Data world, have been asking: where’s the
git(hub) for data? More fundamentally, where’s the <code class="language-plaintext highlighter-rouge">diff</code> and <code class="language-plaintext highlighter-rouge">patch</code>
programs for data? Where’s something like <code class="language-plaintext highlighter-rouge">diff3</code> for doing 3-way
merges? Can we bring the whole free and open toolchain of diffing,
patching, merging, and version control to the world of data?</p>
<h2 id="the-toolchain-is-there-already-for-some">The toolchain is there already, for some</h2>
<p>Fun data collaboration is possible today by inventive use of
existing tools, as
Rufus Pollock <a href="http://blog.okfn.org/2013/07/02/git-and-github-for-data/">has noted</a>.
Here’s an example
of a pull request found in the wild, made to a repository on github
that tracks some bus routes in Iceland in regular CSV files:</p>
<blockquote>
<p><a href="https://github.com/gudmundur/straeto-data/pull/4">https://github.com/gudmundur/straeto-data/pull/4</a><br />
“Fixed stops on the wrong side”</p>
</blockquote>
<p><img src="/img/coopy-bus-stop.png" alt="patching a bus schedule" /></p>
<p>I’m a bit embarrassed to remember how excited I was to
stumble across this real live data-oriented pull request. I ran around showing people, saying
“this!” “this!” (One of those moments when
you realize: you really are a nerd). Some asked, why is this better than,
for example, a shared spreadsheet online, edited live? For the same reason that <code class="language-plaintext highlighter-rouge">git</code> and <code class="language-plaintext highlighter-rouge">hg</code> were so much more
exciting than <code class="language-plaintext highlighter-rouge">svn</code> and <code class="language-plaintext highlighter-rouge">cvs</code>. The awful artificial problem of
who to bestow write-access upon and who to keep outside the clique
just evaporates,
and the equivalent of social coding kicks in.</p>
<p>There are definitely some technical drawbacks to working this way today though. For example: what happens when things go wrong? A poor merge
or a merge conflict in a text file still results in a text file, which
a user can edit as usual to fix up. Text file in, text file out. But a poor
merge or a conflict in a CSV file can leave you with an invalid CSV
file with missing/surplus columns on some rows, or with conflict
markers inserted. This bumps the user out of
Gnumeric/LibreOffice/Excel/Sqlite/… or whatever they are using to edit the
table, and leaves them staring at randomly garbled text. Another problem: column changes look awful in line-oriented diffs, which isn’t
a deal-breaker but which is certainly a pity. We can do better.</p>
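<p>For instance, a conflicted CSV can come back looking like this, with the standard conflict markers dropped mid-table - no longer valid CSV, and baffling inside a spreadsheet program (the rows here are made up from the example further below):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Name,Address,City
Block 11,11 Bow St,Somerville
&lt;&lt;&lt;&lt;&lt;&lt;&lt; HEAD
Borderlands Café,870 Valencia St,San Francisco
=======
Bordelands Cafe,870 Valencia St,San Francisco
&gt;&gt;&gt;&gt;&gt;&gt;&gt; their-branch
Cartel Coffee Lab,225 W University Dr,Tempe
</code></pre></div></div>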
<h2 id="a-diff-for-tabular-data">A diff for (tabular) data</h2>
<p>There didn’t seem to be any neutral format for comparing tables out
there when I went looking. SQL could be abused to serve (DELETE
clauses to express removed rows, INSERTs for added rows, UPDATEs for
modified cells, etc.) but I couldn’t see that leading anywhere happy.
My first idea was to express diffs as tables in CSV form, as a list
of operations. <a href="http://share.find.coop/doc/patch_format_csv_v_0_2.html">This was ugly</a>. Then Joe Panico of <a href="http://www.diffkit.org">http://www.diffkit.org</a> and I hammered out something we
called <a href="http://share.find.coop/doc/patch_format_tdiff.html">TDIFF</a>,
“tabular diff format”, very much inspired by classic diffs, with added
awareness of columns. This was better, but still felt a bit clunky.
Finally, I settled on what now seems obvious:
a “<a href="http://share.find.coop/doc/spec_hilite.html">highlighter</a>” format
that is just the original table with stylized editing marks to show
changes (and large chunks of unchanged material removed).</p>
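<p>Since the action markers live in an ordinary leading column, the same diff can also be written out as plain CSV. A simplified sketch of the example below, trimmed to two data columns, might look roughly like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@@,Name,City
...,...,...
→,Bordelands Cafe→Borderlands Café,San Francisco
---,Five Elephant,Berlin
+++,Sandwich Theory,Montclair
</code></pre></div></div>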
<p>For example, here is a highlighter data diff against Jessica Lord’s
<a href="http://jlord.github.io/hack-spots/">Hack Spots</a>
<a href="https://docs.google.com/spreadsheet/ccc?key=0Ao5u1U6KYND7dFVkcnJRNUtHWUNKamxoRGg4ZzNiT3c#gid=0">spreadsheet</a> at the time of writing (Hack Spots is a list
of hacking-friendly coffee shops and the like, demoing <a href="http://jlord.github.io/sheetsee.js/">sheetsee.js</a>):</p>
<table style="border-collapse:collapse">
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">@@</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Contributer's Twitter</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Name</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Address</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">City</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">State</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">long</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">lat</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Country</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Wifi Password</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Outlets</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Couch</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Large Table</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Brewing</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Outdoor Seating</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">hexcolor</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">cwmma</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Block 11</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">11 Bow St</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Somerville</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">MA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">-71.096974</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">42.380881</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Intelligentsia</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; background-color:#7f7fff" bgcolor="#7f7fff">→</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">thomaslevine</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7f7fff" bgcolor="#7f7fff">Bordelands Cafe→Borderlands Café</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">870 Valencia St</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">San Francisco</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">CA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">-122.42151</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">37.759031</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">open</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">?</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">?</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">coffee</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">lukekarrys</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Cartel Coffee Lab</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">225 W University Dr</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Tempe</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">AZ</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">-111.942978</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">33.421907</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">espresso</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">In-house</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">thomaslevine</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">El Beit</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">158 Bedford Ave</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Brooklyn</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">NY</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">-73.956847</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">40.718529</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">brooklyn</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">few</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">?</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">---</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">hij1nx</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">Five Elephant</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">Reichenberger Straße 101, 10999</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">Berlin</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">Berlin</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">13.43829</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">52.493365</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">DE</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f; color:#888" bgcolor="#ff7f7f"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">5 Elephant</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">uhduh</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Gangplank</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">260 S Arizona Ave</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Chandler</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">AZ</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">-111.841302</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">33.244008</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">walktheplank</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">sfrdmn</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Noisebridge</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">2169 Mission St</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">San Francisco</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">CA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">-122.419161</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">37.762372</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Open</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">BYOC</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">possibly</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">+++</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">fitzyfitzyfitzy</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">Sandwich Theory</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">590 Valley Rd</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f"> Montclair</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f"> NJ</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">-74.208086</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">40.840497</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">sandwich</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">coffee</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">#B9FCFC</td>
</tr>
</table>
<p>This was generated by taking the Hack Spots spreadsheet, editing it
in <code class="language-plaintext highlighter-rouge">gnumeric</code>, then comparing with the original using <a href="https://npmjs.org/package/coopyhx">coopyhx</a>.
I corrected a typo, added an entry for Sandwich Theory in my
neighborhood, and - completely accidentally - deleted the
entry for Five Elephant.</p>
<p>Rather than editing the Hack Spots spreadsheet directly in Google
Docs, in my ideal world I’d send a pull request (and someone would catch
my Five Elephant goof).</p>
<p>So far, the diff we have is just row-based.
Suppose I also added a column for the location’s website, and deleted
the password column (rather missing the whole point) - the
highlighter data diff would now look like this:</p>
<table style="border-collapse:collapse">
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; background-color:#aaa" bgcolor="#aaa">!</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa; color:#888" bgcolor="#aaa"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa; color:#888" bgcolor="#aaa"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa" bgcolor="#aaa">+++</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa; color:#888" bgcolor="#aaa"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa; color:#888" bgcolor="#aaa"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa; color:#888" bgcolor="#aaa"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa; color:#888" bgcolor="#aaa"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa; color:#888" bgcolor="#aaa"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa; color:#888" bgcolor="#aaa"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa" bgcolor="#aaa">---</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa; color:#888" bgcolor="#aaa"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa; color:#888" bgcolor="#aaa"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa; color:#888" bgcolor="#aaa"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa; color:#888" bgcolor="#aaa"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa; color:#888" bgcolor="#aaa"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaa; color:#888" bgcolor="#aaa"></td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">@@</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Contributer's Twitter</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Name</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Website</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Address</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">City</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">State</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">long</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">lat</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Country</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Wifi Password</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Outlets</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Couch</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Large Table</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Brewing</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Outdoor Seating</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">hexcolor</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">cwmma</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Block 11</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f; color:#888" bgcolor="#7fff7f"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">11 Bow St</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Somerville</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">MA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">-71.096974</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">42.380881</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f; color:#888" bgcolor="#ff7f7f"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Intelligentsia</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; background-color:#7f7fff" bgcolor="#7f7fff">→</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">thomaslevine</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7f7fff" bgcolor="#7f7fff">Bordelands Cafe→Borderlands Café</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f; color:#888" bgcolor="#7fff7f"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">870 Valencia St</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">San Francisco</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">CA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">-122.42151</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">37.759031</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">open</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">?</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">?</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">coffee</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">lukekarrys</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Cartel Coffee Lab</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f; color:#888" bgcolor="#7fff7f"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">225 W University Dr</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Tempe</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">AZ</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">-111.942978</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">33.421907</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">espresso</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">In-house</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">thomaslevine</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">El Beit</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f; color:#888" bgcolor="#7fff7f"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">158 Bedford Ave</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Brooklyn</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">NY</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">-73.956847</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">40.718529</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">brooklyn</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">few</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">?</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">---</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">hij1nx</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">Five Elephant</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f; color:#888" bgcolor="#ff7f7f"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">Reichenberger Straße 101, 10999</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">Berlin</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">Berlin</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">13.43829</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">52.493365</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">DE</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f; color:#888" bgcolor="#ff7f7f"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">5 Elephant</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">uhduh</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Gangplank</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f; color:#888" bgcolor="#7fff7f"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">260 S Arizona Ave</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Chandler</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">AZ</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">-111.841302</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">33.244008</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">walktheplank</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">sfrdmn</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Noisebridge</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f; color:#888" bgcolor="#7fff7f"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">2169 Mission St</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">San Francisco</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">CA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">-122.419161</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">37.762372</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f" bgcolor="#ff7f7f">Open</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">BYOC</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">possibly</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">+++</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">fitzyfitzyfitzy</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">Sandwich Theory</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">www.sandwichtheory.com</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">590 Valley Rd</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">Montclair</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">NJ</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">-74.208086</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">40.840497</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#ff7f7f; color:#888" bgcolor="#ff7f7f"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">coffee</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#7fff7f" bgcolor="#7fff7f">#B9FCFC</td>
</tr>
</table>
<p>The highlighter format is tabular, and designed to be as simple as I
could make it without introducing ambiguity. The first column in a
highlighter diff is called the “action” column, containing marks
meaning “inserted row”, “deleted row”, “modified row”, etc. Remaining
columns are drawn from either or both of the tables being compared.
If there are column differences to note, as there are here, an extra row called the
“schema row” is inserted, which has marks for the inserted, deleted,
or otherwise modified columns.
The whole diff can be transmitted
safely in CSV, then optionally formatted for prettiness using some
mechanical rules.</p>
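<p>To make that concrete, here is a minimal sketch - in Python, entirely separate from Coopy’s actual code - of how two versions of a keyed table could be turned into action-column rows using the marks above. The function name, the choice of key, and the demo data are all illustrative.</p>
<pre><code class="language-python"># Toy "highlighter diff": compare two versions of a table, keyed on one
# column, and emit rows whose first cell is the action column. Only a
# sketch of the idea described above; the real format also has schema
# rows, context ("...") rows, and handles column/row reordering.

def highlighter_diff(old_rows, new_rows, key, columns):
    old = {r[key]: r for r in old_rows}
    new = {r[key]: r for r in new_rows}
    diff = [["@@"] + columns]                      # header row
    for k in sorted(set(old) | set(new)):
        if k not in new:                           # deleted row
            diff.append(["---"] + [old[k][c] for c in columns])
        elif k not in old:                         # inserted row
            diff.append(["+++"] + [new[k][c] for c in columns])
        elif old[k] != new[k]:                     # modified row
            row = ["→"]
            for c in columns:
                a, b = old[k][c], new[k][c]
                row.append(a if a == b else a + "→" + b)
            diff.append(row)
    return diff                                    # unchanged rows omitted

old_rows = [{"Name": "Five Elephant", "City": "Berlin", "Wifi": "yes"}]
new_rows = [{"Name": "Five Elephant", "City": "Berlin", "Wifi": "no"},
            {"Name": "Gangplank", "City": "Chandler", "Wifi": "yes"}]
for row in highlighter_diff(old_rows, new_rows, "Name", ["Name", "City", "Wifi"]):
    print(",".join(row))       # the whole diff is itself valid CSV
</code></pre>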
<p>There are plenty of other details, but that is the basic flavor of the
Coopy highlighter diff format today. You can <a href="http://share.find.coop/doc/spec_hilite.html">read more about
it</a> or
try out two different implementations live, a
<a href="http://paulfitz.github.io/coopyhx/">Javascript implementation</a>
and a
<a href="http://share.find.coop/">C++ implementation</a>.
You can also get a feel for using this kind of diff in a workflow
at <a href="http://growrows.com/">GrowRows.com</a>. Please send bug reports, or
ideas for better alternatives!</p>
<h2 id="dealing-with-conflict">Dealing with conflict</h2>
<p>What happens if two people make conflicting changes to a table?
A regular text-based merge would stick in <code class="language-plaintext highlighter-rouge">>>>>>>></code> <code class="language-plaintext highlighter-rouge">========</code> <code class="language-plaintext highlighter-rouge"><<<<<<<</code>
blocks, which would destroy our table’s structure. I’ve played
with a few ways to do better. The method I’m happiest with
so far is to report on conflicts in an extension of the
highlighter diff format that shows the alternate updates
possible. Imagine if, as I fixed the spelling of “Bordelands Cafe”
to “Borderlands Café”,
someone else had already corrected it to
the slightly different “Borderlands Cafe”. So the diff would be:</p>
<table style="border-collapse:collapse">
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">@@</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Contributer's Twitter</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Name</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Address</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">City</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">State</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">long</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">lat</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Country</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Wifi Password</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Outlets</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Couch</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Large Table</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Brewing</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">Outdoor Seating</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#aaf; font-weight:bold; padding-bottom:4px; padding-top:5px; text-align:left" bgcolor="#aaf" align="left">hexcolor</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">cwmma</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Block 11</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">11 Bow St</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Somerville</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">MA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">-71.096974</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">42.380881</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Intelligentsia</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; background-color:#f00" bgcolor="#f00">→</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">thomaslevine</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show; background-color:#f00" bgcolor="#f00">Bordelands Cafe→Borderlands Cafe→Borderlands Café</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">870 Valencia St</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">San Francisco</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">CA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">-122.42151</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">37.759031</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">open</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">?</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">?</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">coffee</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show; color:#888"></td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">lukekarrys</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Cartel Coffee Lab</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">225 W University Dr</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">Tempe</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">AZ</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">-111.942978</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">33.421907</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">USA</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">espresso</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">yes</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">In-house</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">no</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">#B9FCFC</td>
</tr>
<tr style=":first-child td{border-top:1px solid #2D4068}">
<td style="border:1px solid #2D4068; padding:3px 7px 2px; border-left:1px solid #2D4068; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
<td style="border:1px solid #2D4068; padding:3px 7px 2px; empty-cells:show">...</td>
</tr>
</table>
<p>Resolving the conflict amounts to just editing the diff, deleting the parts you
don’t want and keeping the parts you do.
You can get a sense for how this works by testing on
<a href="http://paulfitz.github.io/coopyhx/">http://paulfitz.github.io/coopyhx/</a>. Be sure to select
“Use 3-way comparison” option, which will set up two
versions of a table with a shared “common ancestor”.
Double-click on cells in the diff to view their plain “CSV”
representation, and edit them.</p>
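<p>Here is a minimal sketch of the cell-level rule behind that conflict display. The merge logic follows the usual 3-way convention; the ordering of the two updates inside the conflict cell is my guess at the format’s convention.</p>
<pre><code class="language-python"># Toy 3-way merge for a single cell. Agreeing or one-sided edits merge
# cleanly; a genuine conflict is encoded as "ancestor→theirs→mine" in
# the cell, for hand-resolution later. The ordering inside the
# conflict cell is an assumption, not taken from the Coopy spec.

def merge_cell(ancestor, theirs, mine):
    if theirs == mine:        # both sides agree (or neither changed)
        return mine
    if theirs == ancestor:    # only my side changed the cell
        return mine
    if mine == ancestor:      # only their side changed the cell
        return theirs
    return "→".join([ancestor, theirs, mine])      # conflict cell

print(merge_cell("Bordelands Cafe", "Borderlands Cafe", "Borderlands Café"))
# Bordelands Cafe→Borderlands Cafe→Borderlands Café
</code></pre>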
<h2 id="full-blown-revision-control">Full-blown revision control</h2>
<p>I’ve tried two methods to build revision control with all this:</p>
<ul>
<li>Modifying <a href="http://www.fossil-scm.org"><code class="language-plaintext highlighter-rouge">fossil</code></a>, a distributed revision
control system with beautifully compact and hackable source code,
to use tabular diffs and merges natively. The result is <code class="language-plaintext highlighter-rouge">ssfossil</code>
(“spreadsheet fossil”) in the Coopy toolbox.</li>
<li>Using custom diff and merge drivers with <code class="language-plaintext highlighter-rouge">git</code>, to achieve a similar
result. A tutorial for doing this is at <a href="http://share.find.coop/doc/tutorial_git.html">http://share.find.coop/doc/tutorial_git.html</a>.</li>
</ul>
<p>Both approaches share the same features:</p>
<ul>
<li>No change to how the SCM stores data internally. For example,
<code class="language-plaintext highlighter-rouge">fossil</code> will continue using
its <a href="http://www.fossil-scm.org/xfer/doc/trunk/www/delta_format.wiki">delta encoding</a>, likewise <code class="language-plaintext highlighter-rouge">git</code> (technically in pack files only).</li>
<li>The <em>visualization</em> of diffs changes, and how merges happen. This is good, since changes that would conflict in text-file world may well <em>not</em> conflict in tabular world, and we are guaranteed to always have valid tables.</li>
</ul>
<p>Until we make more radical changes to the SCM itself, it makes
sense to store tables in a text format. Formats I’ve experimented
with are:</p>
<ul>
<li>CSV. Simple, globally understood. But just a table.</li>
<li>CSVS. I made this up. It is an extension to CSV with multiple tables,
an unambiguous spot for column names, and a place for table names. <a href="https://github.com/paulfitz/coopy/blob/master/tests/fold/contacts.csvs">Looks like this</a>.</li>
<li>Sqlitext, pronounced “Sqlite Text”. I made this up. This is a text dump
of an Sqlite database, with consistent ordering of rows. With
careful use of <code class="language-plaintext highlighter-rouge">clean</code> and <code class="language-plaintext highlighter-rouge">smudge</code> filters, a “live” Sqlite
database can be stored in <code class="language-plaintext highlighter-rouge">git</code>, using this format as
an intermediate. This has the nice property of storing more
meta-data (keys, references, etc.). A rough sketch of the idea follows this list.</li>
<li>SocialCalc. A text format for representing spreadsheets used
by <a href="https://github.com/DanBricklin/socialcalc">SocialCalc</a> and
inherited by Audrey Tang’s <a href="http://ethercalc.org">http://ethercalc.org</a>. Stores table
formatting and other good stuff.</li>
</ul>
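<p>The consistent-ordering idea behind Sqlitext is easy to approximate. Here is a rough sketch using Python’s standard <code class="language-plaintext highlighter-rouge">sqlite3</code> module - not the real Sqlitext format (the quoting is naive and no keys or references are recorded), but it shows why such a dump diffs well as text.</p>
<pre><code class="language-python"># Rough approximation of the "Sqlitext" idea: dump an Sqlite database
# as text with a consistent ordering of rows, so that line-based diffs
# of the dump stay meaningful. The real format records more metadata,
# and repr() is a stand-in for proper SQL quoting.
import sqlite3

def dump_ordered(path):
    con = sqlite3.connect(path)
    lines = []
    tables = con.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type='table' ORDER BY name").fetchall()
    for name, sql in tables:
        lines.append(sql + ";")
        cols = [r[1] for r in con.execute("PRAGMA table_info(%s)" % name)]
        order = ", ".join(cols)                 # order by every column
        for row in con.execute("SELECT * FROM %s ORDER BY %s" % (name, order)):
            values = ", ".join(repr(v) for v in row)
            lines.append("INSERT INTO %s VALUES (%s);" % (name, values))
    con.close()
    return "\n".join(lines)
</code></pre>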
<p>Conspicuously absent from this list are common formats like those
of Excel. We need one more trick to deal with those.</p>
<h2 id="the-last-mile">The last mile</h2>
<p>Complicated spreadsheets are not great candidates for version control
as I’ve imagined it so far, since we don’t have a way to diff/merge
non-data features. So arbitrary spreadsheets in Gnumeric,
LibreOffice, and other programs (for simplicity I’m going
to call all these programs “Excel” from now on, forgive me)
with charts and formulae
aren’t really in our scope. But simple spreadsheets, just storing data
without anything fancy, can be very useful. And Excel is certainly a
convenient, familiar editor for tables.</p>
<p>Putting an Excel file in a git/fossil repository won’t lead anywhere good.
But what we can do is this:</p>
<ul>
<li>We use git/fossil/… to do version control on data in a
version-control-friendly format.</li>
<li>We keep that data in sync with an Excel file
using a merge method that preserves formatting as much as possible. The Excel
file is never regenerated from scratch (except perhaps once,
on initial cloning), but instead incrementally patched - see the sketch after this list.</li>
</ul>
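<p>The incremental patching step might look something like the sketch below. I’m using the <code class="language-plaintext highlighter-rouge">openpyxl</code> library purely for illustration - Coopy itself is C++ and handles more formats - and the point is simply that only the changed cells are written back, so the workbook’s formatting survives.</p>
<pre><code class="language-python"># Sketch of incrementally patching a spreadsheet instead of
# regenerating it: load the existing workbook, overwrite only the
# cells a diff says have changed, and save. Formatting attached to
# untouched cells (column widths, fonts, ...) is left alone.
# openpyxl and the file/sheet names are illustrative choices.
from openpyxl import load_workbook

def patch_workbook(path, changes):
    """changes: list of (sheet, row, column, new_value), 1-based."""
    wb = load_workbook(path)
    for sheet, row, column, value in changes:
        wb[sheet].cell(row=row, column=column, value=value)
    wb.save(path)

patch_workbook("numbers.xlsx", [("numbers", 2, 3, 42)])
</code></pre>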
<p>In principle a modified SCM could collapse these steps but we’re
definitely not there yet. So what I’ve written is a program called
(perhaps confusingly) <code class="language-plaintext highlighter-rouge">Coopy</code> that handles the end-to-end work
of versioning Excel (and similar) files. Here is Coopy cloning
a repository with a single table in it called “numbers” (the user
needed a URL for the repository in order to do this; see the
manual at <a href="http://share.find.coop/doc/CoopyGuide.pdf">http://share.find.coop/doc/CoopyGuide.pdf</a> for a complete
walkthrough):</p>
<p><img src="/img/coopy-clone.png" alt="x" /></p>
<p>I won’t win any awards for UI design, I know. At this point,
under the hood, the repository is checked out on the user’s machine,
with data in a neutral format. The list of tables is shown, in this
example just a table called “numbers”. When the user selects that
table for the very first time, they are prompted to save it:</p>
<p><img src="/img/coopy-save-table.png" alt="x" /></p>
<p>They can choose the format to save the data, for example in
an Excel-compatible format. The appropriate conversion
happens, and the file opens in an appropriate editor (<code class="language-plaintext highlighter-rouge">gnumeric</code> for
me):</p>
<p><img src="/img/coopy-save-xls.png" alt="x" /></p>
<p>We can now go ahead and edit the table at will. When we’re ready,
in Coopy, we click “push out”. We’ll be prompted for a commit
message describing the changes, and (the first time) where to actually
push to:</p>
<p><img src="/img/coopy-commit.png" alt="x" /></p>
<p>From then on, “pulling in” and “pushing out” will act as if they are
operating on the spreadsheet, with local formatting being preserved
even if no format information is in fact being stored in the neutral
repository format. It is perhaps hard to see why that is important,
but imagine how annoying it would be if, for example, the column sizes
of a spreadsheet kept getting reset every time you pulled in a
collaborator’s changes.</p>
<p>There’s a lot more to say, but a key point is that we could now have
one person editing a table in Excel, another in Gnumeric, another
tweaking it using Sqlite, and the whole thing being periodically
sync’d to a MySQL database on a webserver. Fun!</p>
<h2 id="the-power-of-patching">The power of patching</h2>
<p>Stepping back from full-on revision control, I’d like to mention
something nice that popped out of this that I hadn’t anticipated.
Once you have diff + patch, you can play games like this:</p>
<ul>
<li>Store data in some form optimized for machine access, e.g. a MySQL database
with carefully chosen keys, indexes, cross-references etc.</li>
<li>Export part of data to some easy-to-edit form, e.g. a spreadsheet.</li>
<li>Make changes in the spreadsheet.</li>
<li>Generate a diff for that spreadsheet.</li>
<li>Apply that diff as a patch to the original data store e.g. in MySQL.</li>
</ul>
<p>The export step here will usually blow away all sorts of meta-data
vital to the database. It may also scramble stuff due to type mismatches
or other muddles. But remember, the patch will get applied with all the original meta-data
available, so things work out just fine more often than I expected. I’m
excited to push forward on reducing the irreversibility of data exports. Today, as
soon as a format conversion happens, fixes to the converted data are much
less likely to ever make it back to the original source. This is sad, and
can’t be allowed to continue.</p>
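<p>The last step - turning a tabular diff back into SQL, which is what <code class="language-plaintext highlighter-rouge">ssrediff</code> does (see the appendix) - can be sketched as follows. The diff structure here (table name, primary-key value, changed cells) is hypothetical and only covers updates, not inserts, deletes, or schema changes.</p>
<pre><code class="language-python"># Minimal sketch of applying a cell-level diff back to a database, in
# the spirit of ssrediff's "diff to SQL" conversion. The diff shape is
# hypothetical; values are bound as parameters rather than pasted in.
import sqlite3

def apply_patch(con, table, key_column, updates):
    """updates: dict mapping a primary-key value to {column: new_value}."""
    for key, cells in updates.items():
        for column, value in cells.items():
            con.execute(
                "UPDATE %s SET %s = ? WHERE %s = ?" % (table, column, key_column),
                (value, key))
    con.commit()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cafes (name TEXT PRIMARY KEY, wifi TEXT)")
con.execute("INSERT INTO cafes VALUES ('Borderlands Cafe', 'open')")
apply_patch(con, "cafes", "name", {"Borderlands Cafe": {"wifi": "closed"}})
</code></pre>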
<h2 id="next-steps">Next steps</h2>
<p>There’s so much to do it is hard to summarize. And this is all just
one piece of the puzzle, for one kind of data (the carefully curated
kind listing your neighborhood wifi hotspots, not the
gigabyte-per-minute stream from a set of temperature sensors). Here
are just some of the things that need doing:</p>
<ul>
<li>Nail down the diff format or formats. There are wobbly areas such as
column/row reordering. It’d also be exciting to
use cross-references between tables when known (e.g. in Sqlite, and
maybe <a href="http://data.okfn.org/standards/simple-data-format">Simple Data Format</a> in the future)
to produce more meaningful diffs on relational data.</li>
<li>Support more data formats, more completely. I’ve just been scratching
my own random itches so far.</li>
<li>Get really really solid community-tested implementations of diffing, patching, merging etc.</li>
<li>Get nice repository hosting in place. In fact, <code class="language-plaintext highlighter-rouge">fossil</code> is interesting
for that since it is self-hosting, but it is a bit geeky for
non-programmers. I’m hoping to push <a href="http://GrowRows.com">http://GrowRows.com</a> in this
direction. And of course, GitHub works, and would work even better
if they supported data diffs.</li>
<li>Think about how systematic data transformations might be
handled better, if they can be handled at all. Personally,
I’m waiting for <a href="https://github.com/maxogden/dat"><code class="language-plaintext highlighter-rouge">dat</code></a>.</li>
</ul>
<p>This is all frankly way too much for me. Help!</p>
<h2 id="appendix-a-list-of-the-main-coopy-tools">Appendix: a list of the main Coopy tools</h2>
<p>The Coopy toolbox (<a href="http://share.find.coop">website</a>, <a href="https://github.com/paulfitz/coopy">repository</a>, <a href="http://share.find.coop/doc/CoopyGuide.pdf">manual</a>) contains the following utilities:</p>
<ul>
<li><a href="http://share.find.coop/doc/ssdiff.html"><code class="language-plaintext highlighter-rouge">ssdiff</code></a>: Show the difference between two tables/databases/spreadsheets.</li>
<li><a href="http://share.find.coop/doc/sspatch.html"><code class="language-plaintext highlighter-rouge">sspatch</code></a>: Modify a table/database/spreadsheet to apply the changes described in a pre-computed difference.</li>
<li><a href="http://share.find.coop/doc/ssmerge.html"><code class="language-plaintext highlighter-rouge">ssmerge</code></a>: Integrate changes in table/database/spreadsheets that have a common ancestor.</li>
<li><a href="http://share.find.coop/doc/ssresolve.html"><code class="language-plaintext highlighter-rouge">ssresolve</code></a>: Select a particular resolution to a merge conflict.</li>
<li><a href="http://share.find.coop/doc/ssformat.html"><code class="language-plaintext highlighter-rouge">ssformat</code></a>: Convert tables/databases/spreadsheets from one format to another.</li>
<li><a href="http://share.find.coop/doc/ssrediff.html"><code class="language-plaintext highlighter-rouge">ssrediff</code></a>: Convert a diff from one format to another (for example,
from highlighter format to a sequence of SQL instructions).</li>
<li><a href="http://share.find.coop/doc/ssfossil.html"><code class="language-plaintext highlighter-rouge">ssfossil</code></a>: A lightly modified version of <a href="http://www.fossil-scm.org"><code class="language-plaintext highlighter-rouge">fossil</code></a> to use ssmerge’s 3-way
merge algorithm on data.</li>
<li><a href="http://share.find.coop/doc/coopy.html"><code class="language-plaintext highlighter-rouge">Coopy</code></a>: A first pass at a user interface for versioning Excel and other
non-textual formats.</li>
</ul>
<p>The toolbox is written in C++. Recently I’ve ported some of the core
parts of the toolbox to a JavaScript (via Haxe) implementation.
This port is called coopyhx (<a href="http://paulfitz.github.io/coopyhx/">website</a>, <a href="https://github.com/paulfitz/coopyhx">repository</a>). The reimplementation
is better in several respects than the original (need to merge them!), but
supports far fewer formats. The port contains:</p>
<ul>
<li>The <a href="https://npmjs.org/package/coopyhx"><code class="language-plaintext highlighter-rouge">coopyhx</code></a> program, which is a stripped down
version of <code class="language-plaintext highlighter-rouge">ssdiff</code> and <code class="language-plaintext highlighter-rouge">sspatch</code>, operating only on basic CSV/JSON tables
and the highlighter diff format.</li>
<li>A JavaScript library for diffing and patching, suitable for in-browser use.</li>
<li>A render function for converting highlighter diffs in CSV format into
pretty HTML.</li>
<li>A render function for <a href="http://handsontable.com/">handsontable</a> to allow online
editing of diffs in a pretty format.</li>
</ul>
<p>Awkwardly, there’s also an entirely separate Ruby implementation (<a href="https://github.com/paulfitz/coopy/tree/master/rb_coopy">source</a>, <a href="https://rubygems.org/gems/coopy">gem</a>), strictly limited to Sqlite, that was written for use on <a href="https://scraperwiki.com/">ScraperWiki</a> (classic).</p>
<p>Related websites:</p>
<ul>
<li>
<p><a href="http://growrows.com">http://growrows.com</a>, a start at a service for crowd-sourcing tables, using diffs and patches (without calling them that).</p>
</li>
<li>
<p><a href="http://datacommons.find.coop/vision">http://datacommons.find.coop/vision</a>, the Data Commons Co-op, incorporated in July 2012 in Massachusetts.
The co-op has 20 member organizations, mostly in the US, a couple in Canada, plus one recently in the UK.
This co-op specializes in archiving, correlating, and disseminating data about alternative economic activity,
and needs lots of software that doesn’t quite exist yet!</p>
</li>
</ul>
Paul Fitzpatrick
Mapping Antimatter tracks with CrowdCrafting.org
2013-08-06T00:00:00+00:00
http://okfnlabs.org/blog/2013/08/06/mapping-antimatter-with-crowdcrafting
<p>This last weekend, CERN hosted a very special event: <a href="http://www.citizencyberscience.net/wiki/index.php?title=Main_Page">the 2nd CERN Summer Student Webfest</a> organized by the <a href="http://www.citizencyberscience.net/">Citizen Cyberscience Centre</a>.</p>
<p><img src="http://www.citizencyberscience.net/wiki/images/1/1b/Cernwebfest.png" alt="CERN Summer Student Webfest logo" /></p>
<p>The Webfest invites CERN summer students to participate in a 48-hour marathon hacking new applications, tools, games, etc. about physics. This year, I participated and worked on a very interesting one: <a href="http://crowdcrafting.org/app/antimatter/">the Antimatter project</a>.</p>
<p>With a team of around 8 people, we divided the work into different areas and learned about the project and the goals of the CrowdCrafting application.</p>
<p>Michael Doser, a CERN physicist and the spokesperson of the <a href="http://aegis.web.cern.ch/aegis/">AEgIS experiment</a>, is studying antimatter.</p>
<iframe width="640" height="360" src="//www.youtube-nocookie.com/embed/8PXSQjjsPUo?rel=0" frameborder="0" allowfullscreen=""></iframe>
<h2 id="but-what-is-antimatter">But, what is antimatter?</h2>
<p>The observable Universe is composed almost entirely of matter but we can produce stuff called antimatter in the lab. Antimatter is material composed of antiparticles. So for example, a positron (the antiparticle of an electron) combines with an antiproton to form an antihydrogen atom.</p>
<p>Antiparticles have the same mass as normal matter particles but the opposite charge. When an antiparticle collides with an ordinary matter particle, the two destroy each other, emitting radiation and some other particles - this is called annihilation.</p>
<p>Because of Einstein’s weak equivalence principle (gravity doesn’t depend on composition), antiparticles should interact gravitationally just like particles of ordinary matter - and that’s what scientists expect to observe - but if they don’t, then Einstein was wrong…</p>
<h2 id="whats-the-experiment">What’s the experiment?</h2>
<p>The Antihydrogen Experiment: Gravity, Interferometry, Spectroscopy (AEgIS) experiment at CERN shoots antihydrogen atoms horizontally, whereupon they fly (and drop) until they hit a wall made of matter - any matter will do: silicon, silver, paper… - and annihilate there.</p>
<p>On hitting the wall, the antihydrogen annihilates with a nucleus of the wall to produce mostly pions and some other particles - which we’ll call a starburst.</p>
<p>The starburst travels through a special gel called an emulsion, where we can see its tracks. If we trace these tracks to their point of origin, then we know exactly where the annihilation occurred.</p>
<p>Then, since we know the starting position of the antiparticles, the distance they travelled to the point of annihilation, and where they landed, we can work out how far each antiparticle fell during its journey.</p>
<p>Then we can figure out how antimatter interacts gravitationally.</p>
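<p>As a back-of-the-envelope illustration of that calculation (the numbers below are invented, not AEgIS parameters): with horizontal velocity v and distance d to the wall, the flight time is t = d/v, and under ordinary gravity the drop would be gt²/2.</p>
<pre><code class="language-python"># Back-of-the-envelope drop of a horizontally launched (anti)atom
# under ordinary gravity. The velocity and distance are invented for
# illustration; they are not AEgIS parameters.
g = 9.81      # m/s^2, ordinary gravitational acceleration
v = 500.0     # m/s, horizontal beam velocity (made up)
d = 1.0       # m, distance to the wall (made up)

t = d / v                  # flight time: 2 milliseconds
drop = 0.5 * g * t * t     # fall during the flight
print("drop = %.2e m" % drop)   # about 2e-05 m, i.e. some 20 microns
</code></pre>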
<p><img src="http://i.imgur.com/uVVjKzD.jpg" alt="AEgIS Experiment Installation" /></p>
<p>Michael Doser gave us access to a set of 99 areas photographed with a microscope, which allows us to see the tracks and the starbursts. Each of the areas has 40 pictures. These pictures cover the same area but at a different depth.</p>
<p>As we discussed the project, we decided to create a “movie style” task, where the CrowdCrafting application plays all the images for an area in a loop. Volunteers can then map the tracks using their mouse, as in any image-editing software. The coordinates of each track’s starting and ending points are saved, and we use those points to render a 3D model of the tracks in real time thanks to WebGL.</p>
<p>We divided the work between different groups, and we worked together on the following areas:</p>
<ul>
<li>Creating tasks based on the data</li>
<li>2D movie style using HTML5 canvas feature</li>
<li>3D model of tracks using HTML5 WebGL</li>
<li>Physics description of the problem and tutorial.</li>
</ul>
<p>For the 2D canvas solution we decided to use the popular <a href="http://www.kineticjs.com/">Kinetic.JS</a> library. This library is very versatile: you can not only render images on the 2D canvas, but also paint lines.</p>
<p>For the 3D model we decided to use the popular <a href="http://threejs.org/">Three.JS library</a>. We created a 3D area using the Tron color palette to draw the tracks reported by the users.</p>
<p>Another group worked really hard on explaining the physics of the experiment and writing the tutorial. We even created a <a href="https://juanracasti.makes.org/popcorn/1adt">Mozilla Webmaker project</a> about it.</p>
<p>At the end of Sunday we had a <a href="http://crowdcrafting.org/app/antimatter">fully operational prototype</a> that allows you to actually track antimatter in CrowdCrafting:</p>
<p><img src="https://github-camo.global.ssl.fastly.net/9a7c3a33b5470bf0c42f19f74a7443adf0e116ef/687474703a2f2f692e696d6775722e636f6d2f716b32393067352e706e67" alt="Screenshot" /></p>
<p>From here I would like to thank all the team members, because they truly loved the project and pushed it to the next level. These efforts will help other CrowdCrafting/PyBossa developers use the new HTML5 Canvas and WebGL features developed for this application, as the source code is already available on GitHub and can be used as a template for any CrowdCrafting/PyBossa application.</p>
<p>If you want, you can follow the project’s development in the <a href="https://github.com/CERNSummerWebfest/antimatter">GitHub repository</a>.</p>
Daniel Lombraña González
data.okfn.org - update no. 2
2013-08-06T00:00:00+00:00
http://okfnlabs.org/blog/2013/08/06/data-update-2
<p><a href="http://data.okfn.org">data.okfn.org</a> is the Labs’ repository of high-quality, easy-to-use <a href="http://opendefinition.org/">open data</a>. This update summarizes some of the improvements to data.okfn.org that have taken place over the past two months.</p>
<h2 id="new-tools">New tools</h2>
<p>Several tools which make it easier to use the <a href="http://data.okfn.org/standards/data-package">Data Package standard</a> are now operational. These include a <a href="http://data.okfn.org/tools/create">Data Package creator</a>, a <a href="http://data.okfn.org/tools/view">Data Package viewer</a>, and there’s progress on a <a href="https://github.com/okfn/data.okfn.org/issues/27">validator for Data Packages</a>.</p>
<h3 id="data-package-creator">Data Package Creator</h3>
<p>Turning a CSV into a Data Package means creating a file, <code class="language-plaintext highlighter-rouge">datapackage.json</code>, which houses the metadata associated with the CSV. The <a href="http://data.okfn.org/tools/create">Data Package Creator</a> simplifies this process.</p>
<p>Provide the Creator with the URL of a CSV and it will return a well-formed JSON object with the required fields, as well as a raw JSON URL (the JSON URL serves as a basic machine-accessible API).</p>
<p><img src="http://farm8.staticflickr.com/7362/9449152387_962624e792.jpg" alt="Data Package Creator in action" /></p>
<h3 id="data-package-viewer">Data Package Viewer</h3>
<p>The metadata included with Data Packages makes it possible to construct a simple view of the data. We now provide an online <a href="http://data.okfn.org/tools/view">Data Package Viewer</a> to do this for you.</p>
<p>Just provide the link to your Data Package and the Viewer generates a user-friendly description, a graph of the data, and a summary of the data fields. Here, for example, is the Viewer’s display of <a href="http://data.okfn.org/tools/view?url=https://raw.github.com/rgrp/wheat-us/master/datapackage.json">US wheat production data</a>.</p>
<p><img src="http://farm6.staticflickr.com/5340/9449152367_13b33222df.jpg" alt="Data Package Viewer in action" /></p>
<h2 id="new-datasets">New datasets</h2>
<p>The biggest data news was having our first ‘out-of-the-blue’ contribution of an ‘official’ dataset! <a href="https://github.com/ewheeler">Evan Wheeler</a> pinged us to offer a comprehensive collection of <a href="http://data.okfn.org/data/country-codes-comprehensive">country codes</a> for the world’s countries in <a href="http://data.okfn.org/standards/simple-data-format">Simple Data Format</a>. Here it is:</p>
<ul>
<li><a href="http://data.okfn.org/data/country-codes-comprehensive">Comprehensive Country Codes dataset on data.okfn.org</a></li>
<li><a href="https://github.com/datasets/country-codes-comprehensive">Associated GitHub repo</a> for the dataset</li>
</ul>
<p><img src="http://farm8.staticflickr.com/7324/9451935968_32719167a7.jpg" alt="Country codes data, table view" /></p>
<p>Also new:</p>
<ul>
<li><a href="http://data.okfn.org/data/s-and-p-500">Standard and Poor’s 500 Index Data including Dividend, Earnings, and P/E Ratio</a> (<a href="https://github.com/datasets/s-and-p-500">GitHub</a>)</li>
<li><a href="http://data.okfn.org/data/cpi-us">US Consumer Price Index and Inflation monthly time series from January 1913</a> (<a href="https://github.com/datasets/cpi-us">GitHub</a>)</li>
</ul>
<p>If you want to contribute a new dataset, check out the <a href="http://data.okfn.org/about/contribute#data">instructions</a> and the <a href="https://github.com/datasets/registry/issues">outstanding requests</a>.</p>
<h2 id="new-standards-pages">New standards pages</h2>
<p>Among data.okfn.org’s chief purposes is promoting simple <a href="http://data.okfn.org/standards">standards for data transport</a> in the form of Data Package and Simple Data Format - helping to create a world of <a href="http://blog.okfn.org/2013/04/24/frictionless-data-making-it-radically-easier-to-get-stuff-done-with-data/">frictionless data</a>.</p>
<p>Key here is providing simple, easy-to-understand information, so we’ve <a href="http://data.okfn.org/standards">revamped the standards page</a> and created two new pages dedicated to providing a simple introduction and overview for Data Package and Simple Data Format:</p>
<ul>
<li><a href="http://data.okfn.org/standards/data-package">Data Package Overview and Introduction</a></li>
<li><a href="http://data.okfn.org/standards/simple-data-format">Simple Data Format Overview and Introduction</a></li>
</ul>
<h2 id="get-involved">Get involved</h2>
<p>Anyone can contribute, and it’s easy – if you can use a spreadsheet, you can help!</p>
<p>Instructions for <a href="http://data.okfn.org/about/contribute">getting involved can be found here</a>.</p>
Neil Ashton
Analyzing Icelandic conviction rates with CrowdCrafting.org
2013-07-31T00:00:00+00:00
http://okfnlabs.org/blog/2013/07/31/crodcrafting-data-journalism
<p><a href="http://crowdcrafting.org">CrowdCrafting.org</a> hosts a wide variety of applications that range from <a href="http://crowdcrafting.org/app/airquality/">science</a> to
<a href="http://crowdcrafting.org/app/bardomatic/">humanities</a>. Since the official launch of <a href="http://crowdcrafting.org">CrowdCrafting.org</a>, <a href="http://crowdcrafting.org/app/category/featured/">lots of applications have been created</a>
, but one of them has done a really impressive job: <a href="http://crowdcrafting.org/app/heradsdomar/">Héraðsdómar -
sýknað eða sakfellt</a>.</p>
<p><strong>Héraðsdómar - sýknað eða sakfellt</strong> is an application developed by <a href="http://gogn.in/">Páll Hilmarsson</a> (<a href="https://twitter.com/pallih">@pallih</a>, <a href="https://github.com/pallih">GitHub</a>). The application was one of the most popular and active
applications on CrowdCrafting.org when it was published (300 volunteers helped!),
so I wanted to interview the author and ask him some questions about it: why he
created the application, what the result was, and so on.</p>
<p>Páll told me that he created the application after reading <a href="http://www.visir.is/simon-sigvaldason-sakfellir-naer-alltaf/article/2012121229180">an article</a> published on an Icelandic news website.</p>
<p><img src="http://i.imgur.com/6GlMJ1p.png" alt="Application UI translated using Google Translate" /></p>
<p>The article analyzed the <strong>conviction rates of a named judge in the Reykjavik district court</strong>,
stating that the conviction rate for cases where he presided as a judge was 99%.
Páll found it interesting, but also “biased”, as the reporter had only analyzed one judge.</p>
<p>After the publication of the story, some bloggers and readers of the post
questioned why only one judge had been analyzed, and reported this back to the author.
The journalist addressed all the questions and comments by answering that
calculating the conviction rates for every case would take too long.</p>
<p>Páll was not happy with this answer, so he decided to show him, and other reporters, that
this could easily be done by crowdsourcing the job, and that it would not take too long.</p>
<p><strong>Páll uploaded around 4,700 rulings as tasks, and the volunteers analyzed them in 7 days!</strong> Each ruling
went to at least three different users, totaling 14,208 assessments. In the end more than
17,000 assessments were made by over 300 users! (You can check the stats <a href="http://crowdcrafting.org/app/heradsdomar/stats">here</a>.)</p>
<p>But here comes the best part: <strong>Páll only spent 10 hours on this project</strong> (including
the time to scrape the rulings, set up the tasks on CrowdCrafting.org, and display
the <a href="http://gogn.in/heradsdomar/">results on his blog</a>). Amazing!</p>
Daniel Lombraña González
Making puzzles out of Shapefiles - bringing Open Data to the physical world
2013-07-28T00:00:00+00:00
http://okfnlabs.org/blog/2013/07/28/making-puzzles-from-shapefiles
<p>For a while I’ve been thinking about how to make Open Data more tangible.
Even with great visualizations, it tends to remain stuck in computers and
smartphones. Recently, I had the idea of taking geodata released by
cities and making it into physical things. These are the first steps
and prototypes: making a puzzle out of district borders.</p>
<p><img src="http://farm6.staticflickr.com/5469/9380801297_dd91e4b99e_z_d.jpg" alt="Shapefile Puzzle" /></p>
<p>Thanks to the <a href="http://openscience.alpine-geckos.at/events/open-week-graz-3/">Open Week</a>
organized by <a href="http://at.okfn.org">OKFN Austria</a> member <a href="https://twitter.com/stefankasberger">Stefan Kasberger</a>
I finally got the chance to put this idea into action. Here’s what we did:</p>
<ul>
<li>First download the city boundaries as a Shapefile</li>
<li>Make sure it’s in WGS 84 (EPSG:4326)</li>
<li>Convert it to SVG using <a href="http://kartograph.org/about/kartograph.py/">kartograph.py</a> (see the sketch after this list)</li>
<li>Convert it to PDF - so the lasercutter can understand it</li>
<li>Adapt the PDF for lasercutting (set line widths to hairlines…)</li>
<li>Cut!</li>
<li>Try to assemble the puzzle</li>
</ul>
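<p>The kartograph.py step looked roughly like the sketch below. The config keys follow my recollection of Kartograph’s layer configuration and the file names are placeholders, so check the kartograph.py documentation before relying on it.</p>
<pre><code class="language-python"># Rough sketch of the Shapefile-to-SVG step with kartograph.py. The
# config keys reflect my recollection of Kartograph's layer config;
# the file names are placeholders.
from kartograph import Kartograph

config = {
    "layers": [
        {"src": "districts.shp"}    # WGS 84 / EPSG:4326 boundaries
    ]
}

K = Kartograph()
K.generate(config, outfile="districts.svg")
</code></pre>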
<p>To make the process a little easier, Stefan created a <a href="https://github.com/skasberger/lazzzor-puzzle">script for converting Shapefiles to SVG</a>
along with the data.</p>
<p><img src="http://farm6.staticflickr.com/5490/9383586970_99d359e228_z_d.jpg" alt="finished puzzle" /></p>
<p>We created puzzles for both Vienna and Graz using district boundaries
released as Open Data. Once done, we noticed that solving a district boundary puzzle is not as
easy as it seems… (even though the number of pieces is limited)</p>
Michael Bauer
Apps using DBpedia Wikipedia from Open Knowledge Foundation Greece
2013-07-23T00:00:00+00:00
http://okfnlabs.org/blog/apps/2013/07/23/DBpedia-apps
<p>Having developed the Greek DBpedia, the first Internationalized DBpedia, OKFN Greece is now contributing to OKFN Labs by introducing three applications that use DBpedia.</p>
<h2 id="1-dbpedia-spotlight">1. DBpedia Spotlight</h2>
<p>DBpedia Spotlight is an application that automatically spots and disambiguates words or phrases in text documents that may refer to DBpedia resources, and annotates them with DBpedia URIs.</p>
<p>DBpedia Spotlight implements the Aho-Corasick string matching algorithm in the spotting stage described above, along with the use of <a href="http://lucene.apache.org/">Apache Lucene</a> over the index built in the offline training/configuration stage. For the disambiguation of the spotted words/phrases, a vector space model (VSM) representation of the DBpedia resources is used, along with a variant of the TF-IDF technique for determining the weight of words based on their ability to distinguish between the candidates for a given term.</p>
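<p>To give a feel for the disambiguation step, here is a toy illustration - emphatically not Spotlight’s actual code - of scoring candidate resources with TF-IDF weights and cosine similarity in a vector space model:</p>
<pre><code class="language-python"># Toy VSM + TF-IDF disambiguation: score each candidate DBpedia
# resource by the cosine similarity between its context vector and
# the words around the spotted phrase. Illustration only; Spotlight's
# real index, weighting, and candidate data are far richer.
import math
from collections import Counter

def tfidf_vector(words, doc_freq, n_docs):
    tf = Counter(words)
    return {w: tf[w] * math.log(n_docs / doc_freq[w])
            for w in tf if w in doc_freq}

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def disambiguate(context_words, candidates, doc_freq, n_docs):
    """candidates: dict mapping a resource URI to its context words."""
    query = tfidf_vector(context_words, doc_freq, n_docs)
    scored = {uri: cosine(query, tfidf_vector(words, doc_freq, n_docs))
              for uri, words in candidates.items()}
    return max(scored, key=scored.get)

df = {"coffee": 2, "bank": 2, "river": 1, "money": 1}   # toy corpus stats
candidates = {"dbpedia:Bank_(geography)": ["river", "bank"],
              "dbpedia:Bank": ["money", "bank"]}
print(disambiguate(["money", "bank"], candidates, df, n_docs=4))  # dbpedia:Bank
</code></pre>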
<p>The Greek DBpedia Spotlight, fully compatible with Greek characters encoded in UTF-8, was implemented by graduate student Ioannis Avraam under the supervision of Dr. Charalampos Bratsas, coordinator of OKFN Greece. The project was organised by OKFN Greece in coordination with the Web Science Master Program and the Semantic Web Unit of the Aristotle University of Thessaloniki. The Greek DBpedia Spotlight is deployed as a Web service and features a user interface at <a href="http://dbpedia-spotlight.okfn.gr/">http://dbpedia-spotlight.okfn.gr/</a>. The source code is open and available under the Apache License v2 at <a href="https://github.com/iavraam/dbpedia-spotlight.git">https://github.com/iavraam/dbpedia-spotlight.git</a> (dbpediaSpotlight_el branch).</p>
<h2 id="2--day-like-today">2. Day Like Today</h2>
<p>Day Like Today (<a href="http://el.dbpedia.org/apps/DayLikeToday/">http://el.dbpedia.org/apps/DayLikeToday/</a>) is the second application that uses DBpedia, informing the user about what happened on a day like today in the past. Similar existing applications use data that their authors have bundled into the application. This application differs in that the data displayed has been extracted from Wikipedia with DBpedia queries, and the results are visualized in a timeline using the OKFN Labs timeliner.</p>
<p>At the back end of the application, the user can choose which DBpedia will be queried, as well as the queries themselves. The queries are submitted, the data is analyzed, exported into JSON format, forwarded to the frontend, and illustrated in a pie chart.</p>
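<p>Querying a DBpedia endpoint from code is straightforward; here is a sketch using the SPARQLWrapper Python library. The query shown is deliberately simple and hypothetical - the application’s real “day like today” queries live in its repository.</p>
<pre><code class="language-python"># Sketch of querying a DBpedia SPARQL endpoint and getting JSON back
# via the SPARQLWrapper library. The query is a simple, hypothetical
# example, not one of the application's real "day like today" queries.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: &lt;http://dbpedia.org/ontology/&gt;
    SELECT ?person ?birth WHERE {
        ?person a dbo:Person ;
                dbo:birthDate ?birth .
    } LIMIT 5
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for b in results["results"]["bindings"]:
    print(b["person"]["value"], b["birth"]["value"])
</code></pre>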
<p>Queries return data such as:</p>
<ul>
<li>
<p>Title of the fact</p>
</li>
<li>
<p>Date of the fact</p>
</li>
<li>
<p>A small description</p>
</li>
<li>
<p>A small thumbnail</p>
</li>
<li>
<p>A large picture</p>
</li>
<li>
<p>Link to the related article in Wikipedia</p>
</li>
<li>
<p>Link to the corresponding DBpedia which we got the data.</p>
</li>
</ul>
<p>The source code is open and available under the Apache License v2 at <a href="https://github.com/okfngr/DayLikeToday">https://github.com/okfngr/DayLikeToday</a>.</p>
<p>Here are some statistics on the amount of information that was exported:</p>
<p><img src="http://farm8.staticflickr.com/7294/9347989761_29b1227361_o.jpg" alt="" /></p>
<p>From Greek DBpedia</p>
<p><img src="http://farm3.staticflickr.com/2845/9347989793_f25bdb1d37.jpg" width="300" height="180" alt="english_dbpedia-300x180" /></p>
<p>From English DBpedia</p>
<p><img src="http://farm3.staticflickr.com/2884/9350771082_6efe745d5e.jpg" width="300" height="180" alt="german_dbpedia-300x180" /></p>
<p>From German DBpedia</p>
<h2 id="3dbpedia-game">3.DBpedia Game</h2>
<p>The third application is DBpedia <strong>Game</strong> (<a href="http://wiki.el.dbpedia.org/apps/ssg/dist/ssg.html">http://wiki.el.dbpedia.org/apps/ssg/dist/ssg.html</a>), an entertaining and educational tool for producing and evaluating knowledge. It consists of a series of games such as multiple choice, anagram, hangman, and matching. The novelty of the DBpedia <a href="http://wiki.el.dbpedia.org/apps/ssg/dist/ssg.html">Game</a> is the immediate and automatic generation of games from Wikipedia’s data using Semantic Web technologies on Greek Linked Data (DBpedia). The DBpedia data is accessed with the SPARQL Query Language. A total of 31 SPARQL queries were created that retrieve facts belonging to 8 general categories (geography, history, athletics, astronomy, general, chemistry, politics, economy). These facts form the basis of the next processing steps of the application. A sample query is depicted in Listing 1, retrieving Animal labels along with their depictions and feedback links for the DBpedia and Wikipedia pages.</p>
<p><img src="http://farm8.staticflickr.com/7281/9347989751_c6f71f30ca.jpg" width="500" height="228" alt="sparql" /></p>
<p><img src="http://farm3.staticflickr.com/2836/9347989725_1687807f5c.jpg" width="367" height="500" alt="table" /></p>
<p>Even with this relatively small number of queries (around 30), we managed to produce a total of 12,000 sets of results from the Greek DBpedia (cf. Table 1).</p>
<p>The present version runs as a Java applet and is a rapid prototype to be improved (<a href="https://github.com/okfngr/DBpedia-Game">https://github.com/okfngr/DBpedia-Game</a>).</p>
Charalampos Bratsas
Spanish Party Financing Scandals - CrowdSourcing Data Extraction with CrowdCrafting
2013-07-11T00:00:00+00:00
http://okfnlabs.org/blog/2013/07/11/transcribing-corruption
<p>Spanish society has been bombarded recently with a flurry of news stories about possible cases of corruption in the major political parties like the <a href="http://www.elmundo.es/elmundo/2013/07/02/andalucia/1372764547.html">Partido Socialista Obrero Español</a> and the <a href="http://politica.elpais.com/politica/2013/07/11/actualidad/1373542957_537498.html">Partido Popular</a>.</p>
<p>In January of 2013 the party that rules the country, Partido Popular (PP), was featured on the front page of the newspaper <a href="http://politica.elpais.com/politica/2013/01/30/actualidad/1359583204_085918.html">El País</a> with a new story about a possible financial scandal in the party. The story disclosed several scanned copies of the party’s accounting book – with donations from companies to members of the party – allegedly handwritten by the official treasurer of the PP, Luis Bárcenas, as well as several accounts in Swiss banks.</p>
<p><img src="http://i.imgur.com/7lv6Mjw.png" alt="Screenshot of the application" /></p>
<p>Since then, there have been many news stories and press conferences; however, the decision of 28 June brought a very interesting twist. The <a href="http://www.elmundo.es/elmundo/2013/06/27/espana/1372330113.html">judge ordered Luis Bárcenas to be sent to prison</a> after the financial anticorruption district attorney requested it.</p>
<p>After this action, the Spanish mass media started to ask questions about this imprisonment and the consequences it could have for the party that rules the country.</p>
<p>This last Sunday, 7 July, one of the main Spanish newspapers <a href="http://www.elmundo.es/elmundo/2013/07/07/espana/1373186360.html">published an interview with the suspect, Luis Bárcenas</a>. The interesting part is that, until now, Bárcenas had denied all the accusations; however, now that he is in jail he has started to admit that the donations were made by Spanish development companies and other enterprises, that the money was in some cases delivered in plastic bags, and that those companies usually obtained contracts with the administrations governed by the party.</p>
<p>The next day, the public face of the party, María Dolores de Cospedal, gave <a href="http://politica.elpais.com/politica/2013/07/08/actualidad/1373269187_324066.html">a press conference about these accusations saying that everything is completely false</a>.</p>
<p>A few minutes after the press conference was over, <a href="http://www.eldiario.es/politica/Anonymous-filtra-cuentas-Partido-Popular_0_151535327.html">the Anonymous hacker group disclosed the last 20 years of the party’s accounting information on bayfiles.net</a> and distributed the links on Twitter, in newspapers, etc.</p>
<p><img src="http://images.lainformacion.com/cms/anonymous-publica-las-cuentas-del-partido-popular/2013_7_8_HzMQI0BrlpxBRTPyBJPom1-d85e8a83579bec67db0c71db28a306bd-1373300234-77.jpg?width=645&height=645&type=flat&id=HzMQI0BrlpxBRTPyBJPom1&time=1373300238&project=lainformacion" alt="Screenshot courtesy of Lainformacion.com" /></p>
<p>The links spread like wildfire and people started to coordinate under the hashtag <a href="https://twitter.com/search?q=CuentasDelPP">#cuentasdelpp</a> to analyze the data. They found that most of the documents were PDF scans of handwritten notes, some reports from different fiscal accounting packages, etc., so they asked for help. As the data format is not machine readable, someone suggested using <a href="http://crowdcrafting.org">CrowdCrafting.org</a> to do the transcription, taking as inspiration the sample <a href="http://crowdcrafting.org/app/pdftranscribe">PDF transcription app</a>. A few hours later the application was up and running:</p>
<p><img src="http://i.imgur.com/7lv6Mjw.png" alt="Screenshot of the application" /></p>
<p>This is a great example of how citizens can coordinate to analyze a problem in their society using open tools like <a href="http://crowdcrafting.org">CrowdCrafting</a> and PyBossa. Unfortunately, due to legal threats regarding the leaked data, the author has, as of today, felt obliged to take the app down (we hope temporarily). I have contacted the author, and as soon as we know whether the app can be re-opened, we will let you know!</p>
Daniel Lombraña González
PublicBodies.org progress
2013-07-09T00:00:00+00:00
http://okfnlabs.org/blog/2013/07/09/publicbodies-progress
<p>There have been many new developments with <a href="http://publicbodies.org">PublicBodies.org</a>, the Labs project which aims to provide “a URL for every part of government”, since <a href="http://okfnlabs.org/blog/2013/05/01/publicbodies.org-an-update.html">the last update</a> on the Labs blog.</p>
<p>The news includes: a new and improved backend; a push for integration with <a href="http://nomenklatura.okfnlabs.org/">Nomenklatura</a>; discussion of a revamp of the PublicBodies schema; lots of new data waiting to be integrated; and a new idea for how PublicBodies might be useful.</p>
<h2 id="publicbodies-now-much-shinier">PublicBodies: now much shinier</h2>
<p>Thanks to the hard work of Labs member <a href="http://okfnlabs.org/members/wombleton/">Rowan Crawford</a>, PublicBodies is now a proper webapp. It’s now a <a href="https://github.com/okfn/publicbodies">Node.js app</a> running on <a href="https://www.heroku.com/">Heroku</a>, and its interface is much nicer than before. Let’s all give Rowan a hand!</p>
<p>Development of the PublicBodies website is ongoing. The next task for improving the site will be <a href="https://github.com/okfn/publicbodies/issues/3">adding search</a>.</p>
<h2 id="nomenklatura-integration">Nomenklatura integration</h2>
<p>Entity reconciliation is crucial for a service like PublicBodies. Luckily, the Labs has another project that simplifies reconciliation, namely <a href="http://nomenklatura.okfnlabs.org/">Nomenklatura</a>. The obvious step is to start pushing PublicBodies data to Nomenklatura and pulling it when it gets updated. This idea is discussed more fully in <a href="https://github.com/okfn/publicbodies/issues/2">an issue</a>.</p>
<p>Contributor <a href="https://github.com/davidread">David Read</a> has got the ball rolling with Nomenklatura integration by pushing <a href="http://nomenklatura.okfnlabs.org/uk-public-bodies">UK public bodies data</a>. This is a great start – but we want to automate this and start automatically pushing CSVs across to Nomenklatura. Volunteers to build this functionality, please step up!</p>
<h2 id="popolo-schema-integration">Popolo schema integration</h2>
<p><a href="http://popoloproject.com/">Popolo</a> is a project with a goal very relevant to PublicBodies: the creation of “international open government data specifications relating to the legislative branch of government”. These include a data specification for <a href="http://popoloproject.com/specs/organization.html">organizations</a>.</p>
<p>We’re considering reworking the PublicBodies schema to follow the Popolo organization spec. The changes would be nontrivial but wouldn’t involve any massive reorganization of the data. Please help us think this through by joining in the discussion <a href="https://github.com/okfn/publicbodies/issues/29">in the issues</a>.</p>
<h2 id="lots-of-new-data">Lots of new data</h2>
<p>Once the matter of revamping the schema is resolved, we can start integrating the heaps of new data which has been contributed. The new data includes public bodies from the US, Germany, China, Quebec, Italy, and Slovenia. You can see it all <a href="https://github.com/okfn/publicbodies/issues?direction=desc&labels=Data&page=1&sort=updated&state=open">here</a>. Thanks to the contributors who have brought this data together.</p>
<p>The sooner we come to a decision about the Popolo schema, the sooner we can start incorporating all of this new material – so please let us know what you think!</p>
<h2 id="discussion-organization-identifiers">Discussion: organization identifiers</h2>
<p>Contributor <a href="https://github.com/markbrough">Mark Brough</a> has come up with an interesting idea for how PublicBodies might be useful: it could be used to generate organisation identifiers usable in situations calling for unique identifiers, such as IATI data publication. As Mark observes, public organizations often lack these identifiers, which makes publishing data a struggle.</p>
<p>Read the details of Mark’s proposal <a href="https://github.com/okfn/publicbodies/issues/41">in the issues</a>, and let him know what you think.</p>
Neil Ashton
Open Data Maker Night London No 3 - Tuesday 16th July
2013-07-08T00:00:00+00:00
http://okfnlabs.org/blog/events/2013/07/08/open-data-maker-night-london-3
<p>The next <strong>Open Data Maker Night London</strong> will be on <strong>Tuesday 16th July 6-9pm</strong> (you can drop in any time during the evening). Like the last two it is kindly hosted by the wonderful <a href="http://creative-collaboration.net/about/contact/">Centre for Creative Collaboration, 16 Acton Street, London</a>.</p>
<ul>
<li>When: Tuesday 16th July 2013</li>
<li>Where: <a href="http://creative-collaboration.net/about/contact/">Centre for Creative Collaboration, 16 Acton Street, London</a>.</li>
<li>Signup: <a href="http://www.meetup.com/OpenKnowledgeFoundation/London-GB/984832/">on Meetup page</a> (optional but nice to know numbers!)</li>
</ul>
<p>Look forward to seeing folks there!</p>
<p><img src="http://farm9.staticflickr.com/8524/8500104205_4e209ef952.jpg" alt="" /></p>
<h3 id="what">What</h3>
<p>Open Data Maker Nights are informal events focused on “making” with open data – whether that’s creating apps or insights. They aren’t a general meetup – if you come, expect to get pulled into actually building something, though we won’t force you!</p>
<h3 id="who">Who</h3>
<p>The events usually have short introductory talks about specific projects and suggestions for things to work on – it’s absolutely fine to turn up knowing nothing about data or openness or tech, as there’ll be an activity for you to help with and someone to guide you in contributing!</p>
<h3 id="organize-your-own">Organize your own!</h3>
<p>Not in London? Why not <a href="/events/open-data-maker/">organize your own Open Data Maker night</a> in your city? Anyone can and it’s easy to do – <a href="/events/open-data-maker/">find out more »</a></p>
Rufus Pollock
Open Data QA - the Aid Transparency Tracker
2013-07-08T00:00:00+00:00
http://okfnlabs.org/blog/2013/07/08/aid-transparency-tracker
<p>Back in April, I <a href="http://blog.okfn.org/2013/03/13/launching-the-aid-transparency-tracker/">wrote on the Open Knowledge Foundation main blog</a> to launch the <a href="http://tracker.publishwhatyoufund.org/plan/">first component of our Aid Transparency Tracker</a>, a tool to analyse aid donors’ commitments to publish more open data about their aid activities.</p>
<p>At the end of that post, I pointed to our future plans to also monitor the quality of publication. It is possible to do this programmatically because donors have agreed to publish their data according to the <a href="http://iatistandard.org">IATI Standard</a>.</p>
<p><a href="http://publishwhatyoufund.org/files/tracker-frontpage.png"><img src="http://publishwhatyoufund.org/files/tracker-frontpage-small.png" alt="Aid Transparency Tracker - data quality" /></a></p>
<p>Over the last six months we’ve spent a lot of time building a framework for testing the quality of aid donors’ <a href="http://iatiregistry.org">IATI data</a>, as well as a survey tool to capture data not available in the IATI format. We launched this to donors last month.</p>
<p>We will be releasing the results as part of our 2013 <a href="http://publishwhatyoufund.org/index">Aid Transparency Index</a> in October. In the meantime, I wanted to give a sneak peek of some of the things the Tracker can now do. All the <a href="https://github.com/markbrough/IATI-Data-Quality">source code is on Github</a>.</p>
<h2 id="automatic-testing-framework">Automatic testing framework</h2>
<p><a href="http://publishwhatyoufund.org/files/tracker-iati.png"><img src="http://publishwhatyoufund.org/files/tracker-iati-small.png" alt="One donor's IATI data" /></a></p>
<p>The biggest part of this tool is the automated data quality analysis. This works as follows:</p>
<ol>
<li><strong>Tests</strong>: A series of tests is written in <a href="https://github.com/mk270/foxpath-tools">FoXPath</a> (“a cunning version of XPath”), a language we created for this purpose. The idea was to make the tests a bit more readable for non-programmers, agnostic about the language used to run them, and structured so that they could be implemented with regular expressions in whatever language is required.</li>
<li><strong>Registry</strong>: The <a href="http://iatiregistry.org">IATI Registry</a> (a CKAN instance) is queried to check for any changes to the data. The Registry uses <a href="https://github.com/okfn/ckanext-archiver">CKAN Archiver</a> to create a hash of each package every night.</li>
<li><strong>Testing</strong>: all of the tests are run against each package found for testing. This is run as a background process, with RabbitMQ used for the queue.</li>
<li><strong>Results</strong>: Each result is stored as a pass, fail, or error, alongside the package id, the test id, the publishing organisation’s id, and the <a href="http://iatistandard.org/activities-standard/iati-identifier">activity identifier</a> (if applicable and available).</li>
<li><strong>Aggregations</strong>: results are then aggregated up to create a percentage of passes for each test for each package.</li>
<li><strong>Indicators</strong>: when presenting the results in the user interface, tests are grouped into indicators to make the information more readable. At the moment, there is only one set of indicators (our 2013 Index indicators), but some fairly small changes would make it possible to add other sets of indicators - for example, indicators that show whether data is good for an Aid Information Management System, or for making maps with, or for results data.</li>
</ol>
<h3 id="remaining-hurdles">Remaining hurdles</h3>
<ol>
<li><strong>Improving tests</strong>: The tests need to become more expressive, adding more conditions to when they should or shouldn’t be run. Some of these expressions are supported within the existing version of FoXPath.</li>
<li><strong>Changed packages</strong>: Refreshing packages currently has to be done manually, because of the way the IATI Registry records changes to files. This should be fixed within the next couple of weeks.</li>
<li><strong>Space</strong>: The data quality tool currently stores the result for each test alongside each activity. This uses <strong>15GB</strong> of space each time it runs. I’m considering dropping this data, and only storing the aggregates, because such detailed data doesn’t appear to be as useful as I originally thought it might be.</li>
<li><strong>Speed</strong>: testing is quite slow at the moment, and aggregation takes a particularly long time. I’m going to revisit that section of the code (and in fact the aggregation architecture as a whole) to optimise it. However, in the medium term, some more substantial changes might be needed, possibly including re-writing this component in a compiled language.</li>
</ol>
<h2 id="survey-component">Survey component</h2>
<p><a href="http://publishwhatyoufund.org/files/tracker-other.png"><img src="http://publishwhatyoufund.org/files/tracker-other-small.png" alt="An example of non-IATI data" /></a></p>
<p>In previous years, we used Global Integrity’s <a href="http://getindaba.org">Indaba platform</a> for the survey. However, because of the quite different way this year’s Index is constructed, we decided to build our own bespoke survey tool.</p>
<p>Many of the donors we include in the Index have not yet begun publishing data to IATI, and none of them are yet publishing all of the fields. We need to capture this information in the Index, while encouraging donors to publish as much as possible in their IATI data.</p>
<h3 id="what-it-does">What it does</h3>
<ol>
<li><strong>Donor-specific indicators</strong>: if information is found in the donor’s IATI data, then there’s no need to ask again where that information can be found. If it’s not found at all in the IATI data, then we look to see where else we can find that information on a donor’s website.</li>
<li><strong>Format matters</strong>: more accessible formats are scored higher in this year’s Index. We’re encouraging donors to move as much information out of PDFs into websites, then into CSV, Excel, or some other machine readable format, and then into the IATI-XML format. Obviously, it’s great if they can jump steps and go straight to IATI-XML - we’re seeing that from several donors this year.</li>
<li><strong>Retaining an audit trail</strong>: Each of the steps in the survey is recorded and will be published, so that if there are disagreements between us, the donor, or the independent reviewer, readers can see that correspondence and reach their own conclusions.
<a href="http://publishwhatyoufund.org/files/tracker-survey.png"><img src="http://publishwhatyoufund.org/files/tracker-survey-small.png" alt="Example of several stages in the survey" /></a></li>
</ol>
<h3 id="how-it-works">How it works</h3>
<ol>
<li>When a new survey is created, indicators are only created if they score 0 in the IATI data quality assessment.</li>
<li>When a user has finished responding to the survey, they submit the form and a simple linear workflow moves the survey to the next step.</li>
<li>Users have access to specific parts of the workflow for a specific organisation, depending on whether they’re a donor or an independent reviewer, and whether they should have edit permissions or read-only permissions.</li>
</ol>
<h2 id="what-were-aiming-to-achieve">What we’re aiming to achieve</h2>
<p>Finally, it’s worth emphasising what we’re trying to achieve from all of this, and looking at the extent to which we’re doing that already.</p>
<ol>
<li><strong>Non-IATI publishers begin publishing to IATI</strong>: the incentives in the Index are very clearly structured this year: more points are awarded for publishing in more open formats, with the internationally comparable IATI standard format scored highest. Several donors are trying to start publishing by the July 31st deadline, which is when automated data collection will end and we’ll begin writing our analysis.</li>
<li><strong>IATI publishers publish more fields, and improve their data</strong>: several donors are working to add more fields into their data where they have that information to hand.</li>
<li><strong>Donors can use the data quality tool to improve their own publication</strong>: several donors are using the tool to flag areas where there could be improvements in their data. We want this tool to be useful on an ongoing basis to donors, but that will require both that tests run nightly, and also that donors can test unpublished data. We’ll be working on those features over the next month.</li>
</ol>
<h2 id="whats-next">What’s next</h2>
<p>We’ll be presenting the Aid Transparency Tracker at <a href="http://okcon.org">OKCon</a> in Geneva in September, and talking about how it could be used as a basis for monitoring the quality of data in other open data spending standards.</p>
<p>We would also very much welcome any feedback. Please get in touch:</p>
<ul>
<li>Email: mark.brough@publishwhatyoufund.org</li>
<li>
<p>Twitter: <a href="http://twitter.com/mark_brough">@mark_brough</a></p>
</li>
</ul>
Mark Brough
Querying ElasticSearch - A Tutorial and Guide
2013-07-01T00:00:00+00:00
http://okfnlabs.org/blog/2013/07/01/elasticsearch-query-tutorial
<p>ElasticSearch is a great open-source search tool that’s built on Lucene (like
SOLR) but is natively JSON + RESTful. It’s been used quite a bit at the <a href="http://okfn.org/">Open
Knowledge Foundation</a> over the last few years. Plus, as it’s easy to
<a href="http://www.elasticsearch.org/guide/reference/setup/">set up locally</a>, it’s an attractive option for digging into data on your
local machine.</p>
<p>While its general interface is pretty natural, I must confess I’ve sometimes
struggled to find my way around ElasticSearch’s powerful, but also quite
complex, query system and the associated JSON-based “<a href="http://www.elasticsearch.org/guide/reference/query-dsl/">query DSL</a>” (domain
specific language).</p>
<p>This post therefore provides a simple introduction and guide to querying
ElasticSearch that provides a short overview of how it all works together with
a good set of examples of some of the most standard queries.</p>
<div class="alert alert-success">
Note: here at Open Knowledge Foundation Labs we have several open-source
ElasticSearch-related projects, including an <strong><a href="/projects/elasticsearch-js/">easy-to-use Javascript Library for
ElasticSearch</a></strong> and the <strong><a href="/projects/reclinejs/">Recline suite of JS Data Components</a></strong>,
which make it easy and fast to build powerful JS+HTML-based interfaces to
ElasticSearch.
</div>
<p><strong>Table of Contents</strong></p>
<ul id="markdown-toc">
<li><a href="#terminology-and-urls" id="markdown-toc-terminology-and-urls">Terminology and URLs</a></li>
<li><a href="#quickstart" id="markdown-toc-quickstart">Quickstart</a> <ul>
<li><a href="#curl-or-browser" id="markdown-toc-curl-or-browser">cURL (or Browser)</a></li>
<li><a href="#javascript" id="markdown-toc-javascript">Javascript</a></li>
<li><a href="#python" id="markdown-toc-python">Python</a></li>
</ul>
</li>
<li><a href="#querying" id="markdown-toc-querying">Querying</a> <ul>
<li><a href="#basic-queries-using-only-the-query-string" id="markdown-toc-basic-queries-using-only-the-query-string">Basic Queries Using Only the Query String</a></li>
<li><a href="#full-query-api" id="markdown-toc-full-query-api">Full Query API</a></li>
</ul>
</li>
<li><a href="#query-language" id="markdown-toc-query-language">Query Language</a> <ul>
<li><a href="#query-dsl-overview" id="markdown-toc-query-dsl-overview">Query DSL: Overview</a></li>
<li><a href="#examples" id="markdown-toc-examples">Examples</a> <ul>
<li><a href="#match-all--find-everything" id="markdown-toc-match-all--find-everything">Match all / Find Everything</a></li>
<li><a href="#classic-search-box-style-full-text-query" id="markdown-toc-classic-search-box-style-full-text-query">Classic Search-Box Style Full-Text Query</a></li>
<li><a href="#filter-on-one-field" id="markdown-toc-filter-on-one-field">Filter on One Field</a></li>
<li><a href="#find-all-documents-with-value-in-a-range" id="markdown-toc-find-all-documents-with-value-in-a-range">Find all documents with value in a range</a></li>
<li><a href="#full-text-query-plus-filter-on-a-field" id="markdown-toc-full-text-query-plus-filter-on-a-field">Full-Text Query plus Filter on a Field</a></li>
<li><a href="#filter-on-two-fields" id="markdown-toc-filter-on-two-fields">Filter on two fields</a></li>
<li><a href="#geospatial-query-to-find-results-near-a-given-point" id="markdown-toc-geospatial-query-to-find-results-near-a-given-point">Geospatial Query to find results near a given point</a></li>
</ul>
</li>
<li><a href="#facets" id="markdown-toc-facets">Facets</a></li>
</ul>
</li>
<li><a href="#appendix" id="markdown-toc-appendix">Appendix</a> <ul>
<li><a href="#adding-updating-and-deleting-data" id="markdown-toc-adding-updating-and-deleting-data">Adding, Updating and Deleting Data</a></li>
<li><a href="#schema-mapping" id="markdown-toc-schema-mapping">Schema Mapping</a></li>
<li><a href="#jsonp-support" id="markdown-toc-jsonp-support">JSONP support</a></li>
</ul>
</li>
</ul>
<h2 id="terminology-and-urls">Terminology and URLs</h2>
<p>Throughout <code class="language-plaintext highlighter-rouge">{endpoint}</code> refers to the ElasticSearch <a href="http://www.elasticsearch.org/guide/reference/glossary/#type">index type</a> (aka
table). Note that ElasticSearch often lets you run the same queries on both
“<a href="http://www.elasticsearch.org/guide/reference/glossary/#index">indexes</a>” (aka database) and types.</p>
<p>If you were just using ElasticSearch standalone, an example of an endpoint would be:
http://localhost:9200/gold-prices/monthly-price-table.</p>
<p>Key urls:</p>
<ul>
<li>
<p>Query: <code class="language-plaintext highlighter-rouge">{endpoint}/_search</code> (in ElasticSearch < 0.19 this will return an
error if visited without a query parameter)</p>
<ul>
<li>Query example: <code class="language-plaintext highlighter-rouge">{endpoint}/_search?size=5&pretty=true</code></li>
</ul>
</li>
<li>
<p>Schema (Mapping): <code class="language-plaintext highlighter-rouge">{endpoint}/_mapping</code></p>
</li>
</ul>
<h2 id="quickstart">Quickstart</h2>
<h3 id="curl-or-browser">cURL (or Browser)</h3>
<p>The following examples utilize the <a href="http://curl.haxx.se/">cURL</a> command line utility. If you prefer,
you can just open the relevant URLs in your browser:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"> <span class="c"># query for documents / rows with title field containing 'jones'</span>
<span class="c"># added pretty=true to get the json results pretty printed</span>
curl <span class="o">{</span>endpoint<span class="o">}</span>/_search?q<span class="o">=</span>title:jones&size<span class="o">=</span>5&pretty<span class="o">=</span><span class="nb">true</span></code></pre></figure>
<p>Adding some data:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"> <span class="c"># Data (argument to -d) should be a JSON document</span>
curl <span class="nt">-X</span> POST <span class="o">{</span>endpoint<span class="o">}</span> <span class="nt">-d</span> <span class="s1">'{
"title": "jones",
"amount": 5.7
}'</span></code></pre></figure>
<h3 id="javascript">Javascript</h3>
<p>A simple ajax (JSONP) request to the data API using jQuery:</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"> <span class="kd">var</span> <span class="nx">data</span> <span class="o">=</span> <span class="p">{</span>
<span class="na">size</span><span class="p">:</span> <span class="mi">5</span> <span class="c1">// get 5 results</span>
<span class="na">q</span><span class="p">:</span> <span class="dl">'</span><span class="s1">title:jones</span><span class="dl">'</span> <span class="c1">// query on the title field for 'jones'</span>
<span class="p">};</span>
<span class="nx">$</span><span class="p">.</span><span class="nx">ajax</span><span class="p">({</span>
<span class="na">url</span><span class="p">:</span> <span class="p">{</span><span class="nx">endpoint</span><span class="p">}</span><span class="sr">/_search</span><span class="err">,
</span> <span class="na">dataType</span><span class="p">:</span> <span class="dl">'</span><span class="s1">jsonp</span><span class="dl">'</span><span class="p">,</span>
<span class="na">success</span><span class="p">:</span> <span class="kd">function</span><span class="p">(</span><span class="nx">data</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">alert</span><span class="p">(</span><span class="dl">'</span><span class="s1">Total results found: </span><span class="dl">'</span> <span class="o">+</span> <span class="nx">data</span><span class="p">.</span><span class="nx">hits</span><span class="p">.</span><span class="nx">total</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">});</span></code></pre></figure>
<p><em>Note: we’ve written a simple <a href="https://github.com/okfn/elasticsearch.js">JS library for ElasticSearch</a> which makes working with ElasticSearch much easier. Here’s a sample:</em></p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="c1">// Your ElasticSearch instance is running at http://localhost:9200/</span>
<span class="c1">// We are using index 'twitter' and type (table) 'tweet'</span>
<span class="kd">var</span> <span class="nx">endpoint</span> <span class="o">=</span> <span class="dl">'</span><span class="s1">http://localhost:9200/twitter/tweet</span><span class="dl">'</span><span class="p">;</span>
<span class="c1">// Table = an ElasticSearch Type (aka Table)</span>
<span class="c1">// http://www.elasticsearch.org/guide/reference/glossary/#type</span>
<span class="kd">var</span> <span class="nx">table</span> <span class="o">=</span> <span class="nx">ES</span><span class="p">.</span><span class="nx">Table</span><span class="p">(</span><span class="nx">endpoint</span><span class="p">);</span>
<span class="c1">// Create some data</span>
<span class="nx">table</span><span class="p">.</span><span class="nx">upsert</span><span class="p">({</span>
<span class="na">id</span><span class="p">:</span> <span class="dl">'</span><span class="s1">123</span><span class="dl">'</span><span class="p">,</span>
<span class="na">title</span><span class="p">:</span> <span class="dl">'</span><span class="s1">My new tweet</span><span class="dl">'</span>
<span class="p">}).</span><span class="nx">done</span><span class="p">(</span><span class="kd">function</span><span class="p">()</span> <span class="p">{</span>
<span class="c1">// now get it</span>
<span class="nx">table</span><span class="p">.</span><span class="kd">get</span><span class="p">(</span><span class="dl">'</span><span class="s1">123</span><span class="dl">'</span><span class="p">).</span><span class="nx">done</span><span class="p">(</span><span class="kd">function</span><span class="p">(</span><span class="nx">doc</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">doc</span><span class="p">);</span>
<span class="p">});</span>
<span class="p">});</span>
<span class="c1">// Query for data</span>
<span class="c1">// Queries follow Recline Query spec -</span>
<span class="c1">// http://okfnlabs.org/recline/docs/models.html#query-structure</span>
<span class="c1">// (very similar to ES)</span>
<span class="nx">table</span><span class="p">.</span><span class="nx">query</span><span class="p">({</span>
<span class="na">q</span><span class="p">:</span> <span class="dl">'</span><span class="s1">hello</span><span class="dl">'</span>
<span class="na">filters</span><span class="p">:</span> <span class="p">[</span>
<span class="p">{</span> <span class="na">term</span><span class="p">:</span> <span class="p">{</span> <span class="dl">'</span><span class="s1">owner</span><span class="dl">'</span><span class="p">:</span> <span class="dl">'</span><span class="s1">jones</span><span class="dl">'</span> <span class="p">}</span> <span class="p">}</span>
<span class="p">]</span>
<span class="p">}).</span><span class="nx">done</span><span class="p">(</span><span class="kd">function</span><span class="p">(</span><span class="nx">out</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">out</span><span class="p">);</span>
<span class="p">});</span></code></pre></figure>
<h3 id="python">Python</h3>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">urllib2</span>
<span class="kn">import</span> <span class="nn">json</span>
<span class="c1"># =================================
# Store some data
</span>
<span class="n">url</span> <span class="o">=</span> <span class="s">'{endpoint}'</span>
<span class="n">data</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">'title'</span><span class="p">:</span> <span class="s">'jones'</span><span class="p">,</span>
<span class="s">'amount'</span><span class="p">:</span> <span class="mf">5.7</span>
<span class="p">}</span>
<span class="c1"># have to send the data as JSON
</span><span class="n">data</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="n">req</span> <span class="o">=</span> <span class="n">urllib2</span><span class="p">.</span><span class="n">Request</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span> <span class="n">headers</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">urllib2</span><span class="p">.</span><span class="n">urlopen</span><span class="p">(</span><span class="n">req</span><span class="p">)</span>
<span class="k">print</span> <span class="n">out</span><span class="p">.</span><span class="n">read</span><span class="p">()</span>
<span class="c1"># =================================
# Query the resulting "table"
</span>
<span class="n">url</span> <span class="o">=</span> <span class="s">'{endpoint}/_search?q=title:jones&size=5'</span>
<span class="n">req</span> <span class="o">=</span> <span class="n">urllib2</span><span class="p">.</span><span class="n">Request</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">urllib2</span><span class="p">.</span><span class="n">urlopen</span><span class="p">(</span><span class="n">req</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">out</span><span class="p">.</span><span class="n">read</span><span class="p">()</span>
<span class="k">print</span> <span class="n">data</span>
<span class="c1"># returned data is JSON
</span><span class="n">data</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">loads</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="c1"># total number of results
</span><span class="k">print</span> <span class="n">data</span><span class="p">[</span><span class="s">'hits'</span><span class="p">][</span><span class="s">'total'</span><span class="p">]</span></code></pre></figure>
<h2 id="querying">Querying</h2>
<h3 id="basic-queries-using-only-the-query-string">Basic Queries Using Only the Query String</h3>
<p>Basic queries can be done using only query string parameters in the URL. For
example, the following searches for text ‘hello’ in any field in any document
and returns at most 5 results:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{endpoint}/_search?q=hello&size=5
</code></pre></div></div>
<p>Basic queries like this have the advantage that they only involve accessing a
URL and thus, for example, can be performed just using any web browser.
However, this method is limited and does not give you access to most of the
more powerful query features.</p>
<p>Basic queries use the <code class="language-plaintext highlighter-rouge">q</code> query string parameter which supports the <a href="http://lucene.apache.org/core/old_versioned_docs/versions/3_0_0/queryparsersyntax.html">Lucene
query parser syntax</a> and hence filters on specific fields (e.g.
<code class="language-plaintext highlighter-rouge">fieldname:value</code>), wildcards (e.g. <code class="language-plaintext highlighter-rouge">abc*</code>) and more.</p>
<p>There are a variety of other options (e.g. size, from etc) that you can also
specify to customize the query and its results. Full details can be found in
the <a href="http://www.elasticsearch.org/guide/reference/api/search/uri-request.html">ElasticSearch URI request docs</a>.</p>
<h3 id="full-query-api">Full Query API</h3>
<p>More powerful and complex queries, including those that involve faceting and
statistical operations, should use the full ElasticSearch query language and API.</p>
<p>In the query language, queries are written as a JSON structure which is then sent
to the query endpoint (details of the query language below). There are two
options for how a query is sent to the search endpoint:</p>
<ol>
<li>
<p>Either as the value of a source query parameter e.g.:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> {endpoint}/_search?source={Query-as-JSON}
</code></pre></div> </div>
</li>
<li>
<p>Or in the request body, e.g.:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> curl -XGET {endpoint}/_search -d 'Query-as-JSON'
</code></pre></div> </div>
<p>For example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> curl -XGET {endpoint}/_search -d '{
"query" : {
"term" : { "user": "kimchy" }
}
}'
</code></pre></div> </div>
</li>
</ol>
<h2 id="query-language">Query Language</h2>
<p>Queries are JSON objects with the following structure (each of the main
sections has more detail below):</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"> <span class="p">{</span>
<span class="nl">size</span><span class="p">:</span> <span class="err">#</span> <span class="nx">number</span> <span class="k">of</span> <span class="nx">results</span> <span class="nx">to</span> <span class="k">return</span> <span class="p">(</span><span class="nx">defaults</span> <span class="nx">to</span> <span class="mi">10</span><span class="p">)</span>
<span class="k">from</span><span class="p">:</span> <span class="err">#</span> <span class="nx">offset</span> <span class="nx">into</span> <span class="nx">results</span> <span class="p">(</span><span class="nx">defaults</span> <span class="nx">to</span> <span class="mi">0</span><span class="p">)</span>
<span class="nx">fields</span><span class="p">:</span> <span class="err">#</span> <span class="nx">list</span> <span class="k">of</span> <span class="nb">document</span> <span class="nx">fields</span> <span class="nx">that</span> <span class="nx">should</span> <span class="nx">be</span> <span class="nx">returned</span> <span class="o">-</span> <span class="nx">http</span><span class="p">:</span><span class="c1">//elasticsearch.org/guide/reference/api/search/fields.html</span>
<span class="nx">sort</span><span class="p">:</span> <span class="err">#</span> <span class="nx">define</span> <span class="nx">sort</span> <span class="nx">order</span> <span class="o">-</span> <span class="nx">see</span> <span class="nx">http</span><span class="p">:</span><span class="c1">//elasticsearch.org/guide/reference/api/search/sort.html</span>
<span class="nx">query</span><span class="p">:</span> <span class="p">{</span>
<span class="err">#</span> <span class="dl">"</span><span class="s2">query</span><span class="dl">"</span> <span class="nx">object</span> <span class="nx">following</span> <span class="nx">the</span> <span class="nx">Query</span> <span class="nx">DSL</span><span class="p">:</span> <span class="nx">http</span><span class="p">:</span><span class="c1">//elasticsearch.org/guide/reference/query-dsl/</span>
<span class="err">#</span> <span class="nx">details</span> <span class="nx">below</span>
<span class="p">},</span>
<span class="nx">facets</span><span class="p">:</span> <span class="p">{</span>
<span class="err">#</span> <span class="nx">facets</span> <span class="nx">specifications</span>
<span class="err">#</span> <span class="nx">Facets</span> <span class="nx">provide</span> <span class="nx">summary</span> <span class="nx">information</span> <span class="nx">about</span> <span class="nx">a</span> <span class="nx">particular</span> <span class="nx">field</span> <span class="nx">or</span> <span class="nx">fields</span> <span class="k">in</span> <span class="nx">the</span> <span class="nx">data</span>
<span class="p">}</span>
<span class="err">#</span> <span class="nx">special</span> <span class="k">case</span> <span class="k">for</span> <span class="nx">situations</span> <span class="nx">where</span> <span class="nx">you</span> <span class="nx">want</span> <span class="nx">to</span> <span class="nx">apply</span> <span class="nx">filter</span><span class="o">/</span><span class="nx">query</span> <span class="nx">to</span> <span class="nx">results</span> <span class="nx">but</span> <span class="o">*</span><span class="nx">not</span><span class="o">*</span> <span class="nx">to</span> <span class="nx">facets</span>
<span class="nx">filter</span><span class="p">:</span> <span class="p">{</span>
<span class="err">#</span> <span class="nx">filter</span> <span class="nx">objects</span>
<span class="err">#</span> <span class="nx">a</span> <span class="nx">filter</span> <span class="nx">is</span> <span class="nx">a</span> <span class="nx">simple</span> <span class="dl">"</span><span class="s2">filter</span><span class="dl">"</span> <span class="p">(</span><span class="nx">query</span><span class="p">)</span> <span class="nx">on</span> <span class="nx">a</span> <span class="nx">specific</span> <span class="nx">field</span><span class="p">.</span>
<span class="err">#</span> <span class="nx">Simple</span> <span class="nx">means</span> <span class="nx">e</span><span class="p">.</span><span class="nx">g</span><span class="p">.</span> <span class="nx">checking</span> <span class="nx">against</span> <span class="nx">a</span> <span class="nx">specific</span> <span class="nx">value</span> <span class="nx">or</span> <span class="nx">range</span> <span class="k">of</span> <span class="nx">values</span>
<span class="p">},</span>
<span class="p">}</span></code></pre></figure>
<p>Query results look like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
# some info about the query (which shards it used, how long it took etc)
...
# the results
hits: {
total: # total number of matching documents
hits: [
# list of "hits" returned
{
_id: # id of document
score: # the search index score
_source: {
# document 'source' (i.e. the original JSON document you sent to the index
}
}
]
}
# facets if these were requested
facets: {
...
}
}
</code></pre></div></div>
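<p>As a quick sketch of consuming this structure in Javascript – assuming <code class="language-plaintext highlighter-rouge">response</code> holds the parsed JSON result of a search – the hits can be walked like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// response is the parsed JSON returned from {endpoint}/_search
console.log('total matching documents: ' + response.hits.total);
response.hits.hits.forEach(function (hit) {
  // hit._source is the original JSON document you sent to the index
  console.log(hit._id, hit._source);
});
</code></pre></div></div>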
<h3 id="query-dsl-overview">Query DSL: Overview</h3>
<p>Query objects are built up of sub-components. These sub-components are either
basic or compound. Compound sub-components may contain other sub-components
while basic ones may not. Example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"query": {
# compound component
"bool": {
# compound component
"must": {
# basic component
"term": {
"user": "jones"
}
}
# compound component
"must_not": {
# basic component
"range" : {
"age" : {
"from" : 10,
"to" : 20
}
}
}
}
}
}
</code></pre></div></div>
<p>In addition, and somewhat confusingly, ElasticSearch distinguishes between
sub-components that are “queries” and those that are “filters”. Filters are
really a special kind of query: they are mostly basic (though boolean
compounding is allowed), limited to one field or operation and, as such,
especially performant.</p>
<p>Examples of filters are (full list on the RHS at the bottom of the <a href="http://elasticsearch.org/guide/reference/query-dsl/">query-dsl</a> page):</p>
<ul>
<li>term: filter on a value for a field</li>
<li>range: filter for a field having a range of values (>=, <= etc)</li>
<li>geo_bbox: geo bounding box</li>
<li>geo_distance: geo distance</li>
</ul>
<p>Rather than attempting to set out all the constraints and options of the
query-dsl we now offer a variety of examples.</p>
<h3 id="examples">Examples</h3>
<h4 id="match-all--find-everything">Match all / Find Everything</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"query": {
"match_all": {}
}
}
</code></pre></div></div>
<h4 id="classic-search-box-style-full-text-query">Classic Search-Box Style Full-Text Query</h4>
<p>This will perform a full-text style query across all fields. The query string
supports the <a href="http://lucene.apache.org/core/old_versioned_docs/versions/3_0_0/queryparsersyntax.html">Lucene query parser syntax</a> and hence filters on specific fields
(e.g. <code class="language-plaintext highlighter-rouge">fieldname:value</code>), wildcards (e.g. <code class="language-plaintext highlighter-rouge">abc*</code>) as well as a variety of
options. For full details see the <a href="http://elasticsearch.org/guide/reference/query-dsl/query-string-query.html">query-string</a> documentation.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"query": {
"query_string": {
"query": {query string}
}
}
}
</code></pre></div></div>
<h4 id="filter-on-one-field">Filter on One Field</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"query": {
"term": {
{field-name}: {value}
}
}
}
</code></pre></div></div>
<p>High performance equivalent using filters:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"query": {
"constant_score": {
"filter": {
"term": {
# note that value should be *lower-cased*
{field-name}: {value}
}
}
}
}
</code></pre></div></div>
<h4 id="find-all-documents-with-value-in-a-range">Find all documents with value in a range</h4>
<p>This can be used for text ranges (e.g. A to Z), numeric ranges (10-20) and
for dates (ElasticSearch converts dates to ISO 8601 format, so you can
search e.g. from 1900-01-01 to 1920-02-03).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"query": {
"constant_score": {
"filter": {
"range": {
{field-name}: {
"from": {lower-value}
"to": {upper-value}
}
}
}
}
}
}
</code></pre></div></div>
<p>For more details see <a href="http://elasticsearch.org/guide/reference/query-dsl/range-filter.html">range filters</a>.</p>
<h4 id="full-text-query-plus-filter-on-a-field">Full-Text Query plus Filter on a Field</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"query": {
"query_string": {
"query": {query string}
},
"term": {
{field}: {value}
}
}
}
</code></pre></div></div>
<h4 id="filter-on-two-fields">Filter on two fields</h4>
<p>Note that you cannot, unfortunately, have a simple <code class="language-plaintext highlighter-rouge">and</code> query by adding two
filters inside the query element. Instead you need an ‘and’ clause in a filter
(which in turn requires nesting in ‘filtered’). You could also achieve the same
result here using a <a href="http://elasticsearch.org/guide/reference/query-dsl/bool-query.html">bool query</a>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"and": [
{
"range" : {
"b" : {
"from" : 4,
"to" : "8"
}
},
},
{
"term": {
"a": "john"
}
}
]
}
}
}
}
</code></pre></div></div>
<h4 id="geospatial-query-to-find-results-near-a-given-point">Geospatial Query to find results near a given point</h4>
<p>This uses the <a href="http://www.elasticsearch.org/guide/reference/query-dsl/geo-distance-filter.html">Geo Distance filter</a>. It requires that indexed documents have
a field of <a href="http://www.elasticsearch.org/guide/reference/mapping/geo-point-type.html">geo point type</a>.</p>
<p>Source data (a point in San Francisco!):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># This should be in lat,lon order
{
...
"Location": "37.7809035011582, -122.412119695795"
}
</code></pre></div></div>
<p>There are alternative formats to provide lon/lat locations e.g. (see
ElasticSearch documentation for more):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Note this must have lon,lat order (opposite of previous example!)
{
"Location":[-122.414753390488, 37.7762147914147]
}
# or ...
{
"Location": {
"lon": -122.414753390488,
"lat": 37.7762147914147
}
}
</code></pre></div></div>
<p>We also need a mapping to specify that the Location field is of type geo_point, as this will not usually be guessed from the data (see below for more on mappings):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"properties": {
"Location": {
"type": "geo_point"
}
...
}
</code></pre></div></div>
<p>Now the actual query:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"query": {
"filtered" : {
"query" : {
"match_all" : {}
},
"filter" : {
"geo_distance" : {
"distance" : "20km",
"Location" : {
"lat" : 37.776,
"lon" : -122.41
}
}
}
}
}
}
</code></pre></div></div>
<p>Note that you can specify the query using specific lat, lon attributes even
though the original data did not have this structure (you can also use a query
similar to the original structure if you wish - see <a href="http://www.elasticsearch.org/guide/reference/query-dsl/geo-distance-filter.html">Geo distance filter</a> for
more information).</p>
<h3 id="facets">Facets</h3>
<p>Facets provide a way to get summary information about the data in an
ElasticSearch table, for example counts of distinct values.</p>
<p>ElasticSearch (and hence the Data API) provides <a href="http://www.elasticsearch.org/guide/reference/api/search/facets/">rich faceting
capabilities</a>. The <a href="http://www.elasticsearch.org/guide/reference/api/search/facets/">ES facet docs</a> do a great job of listing the various
kinds of facets available and their structure, so I won’t repeat it all here.
Here is a list of some of the most important (full list on the facets page):</p>
<ul>
<li><a href="http://www.elasticsearch.org/guide/reference/api/search/facets/terms-facet.html">Terms</a> - counts by distinct terms (values) in a field</li>
<li><a href="http://www.elasticsearch.org/guide/reference/api/search/facets/range-facet.html">Range</a> - counts for a given set of ranges in a field</li>
<li><a href="http://www.elasticsearch.org/guide/reference/api/search/facets/histogram-facet.html">Histogram</a> and <a href="http://www.elasticsearch.org/guide/reference/api/search/facets/date-histogram-facet.html">Date Histogram</a> - counts by constant interval ranges</li>
<li><a href="http://www.elasticsearch.org/guide/reference/api/search/facets/statistical-facet.html">Statistical</a> - statistical summary of a field (mean, sum etc)</li>
<li><a href="http://www.elasticsearch.org/guide/reference/api/search/facets/terms-stats-facet.html">Terms Stats</a> - statistical summary on one field (stats field) for distinct
terms in another field. For example, spending stats per department or per
region.</li>
<li><a href="http://www.elasticsearch.org/guide/reference/api/search/facets/geo-distance-facet.html">Geo Distance</a>: counts by distance ranges from a given point</li>
</ul>
<p>Note that you can apply multiple facets per query.</p>
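<p>As a sketch of what this looks like in practice – reusing the owner and amount fields from the earlier examples – the following query requests a terms facet and a statistical facet alongside a match-all query:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "query": {
    "match_all": {}
  },
  "facets": {
    # counts of documents per distinct value of the owner field
    "owners": {
      "terms": { "field": "owner" }
    },
    # mean, min, max, sum etc of the amount field
    "amount_stats": {
      "statistical": { "field": "amount" }
    }
  }
}
</code></pre></div></div>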
<h2 id="appendix">Appendix</h2>
<h3 id="adding-updating-and-deleting-data">Adding, Updating and Deleting Data</h3>
<p>ElasticSearch, and hence the Data API, have a standard RESTful API. Thus:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>POST     {endpoint}      : INSERT
PUT/POST {endpoint}/{id} : UPDATE (or INSERT)
DELETE   {endpoint}/{id} : DELETE
</code></pre></div></div>
<p>For more on INSERT and UPDATE see the <a href="http://elasticsearch.org/guide/reference/api/index_.html">Index API</a> documentation.</p>
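<p>For instance, with cURL (the document id 123 is just an illustration):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># insert a new document (ElasticSearch assigns an id)
curl -XPOST {endpoint} -d '{"title": "jones", "amount": 5.7}'

# insert or update the document with id 123
curl -XPUT {endpoint}/123 -d '{"title": "jones", "amount": 6.2}'

# delete the document with id 123
curl -XDELETE {endpoint}/123
</code></pre></div></div>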
<p>There is also support for bulk inserts and updates via the <a href="http://elasticsearch.org/guide/reference/api/bulk.html">Bulk API</a>.</p>
<h3 id="schema-mapping">Schema Mapping</h3>
<p>As the ElasticSearch documentation states:</p>
<blockquote>
<p>Mapping is the process of defining how a document should be mapped to the
Search Engine, including its searchable characteristics such as which fields
are searchable and if/how they are tokenized. In ElasticSearch, an index may
store documents of different “mapping types”. ElasticSearch allows one to
associate multiple mapping definitions for each mapping type.</p>
</blockquote>
<blockquote>
<p>Explicit mapping is defined on an index/type level. By default, there isn’t a
need to define an explicit mapping, since one is automatically created and
registered when a new type or new field is introduced (with no performance
overhead) and have sensible defaults. Only when the defaults need to be
overridden must a mapping definition be provided.</p>
</blockquote>
<p>Relevant docs: <a href="http://elasticsearch.org/guide/reference/mapping/">http://elasticsearch.org/guide/reference/mapping/</a>.</p>
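<p>For example, to declare the Location field from the geospatial example above as a geo point before indexing any documents, you could PUT a mapping like the following sketch (substitute your own type name for {type-name}):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -XPUT {endpoint}/_mapping -d '{
  "{type-name}": {
    "properties": {
      "Location": {
        "type": "geo_point"
      }
    }
  }
}'
</code></pre></div></div>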
<h3 id="jsonp-support">JSONP support</h3>
<p>JSONP support is available on any request via a simple callback query string parameter:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>?callback=my_callback_name
</code></pre></div></div>
Rufus Pollock
Basic data cleaning with Data Explorer
2013-06-28T00:00:00+00:00
http://okfnlabs.org/blog/2013/06/28/basic-cleaning-with-data-explorer
<p><a href="http://explorer.okfnlabs.org/">Data Explorer</a> is a client-side web application for
data processing and visualization. With Data Explorer, you can import data,
transform it with JavaScript code, and visualize it on a graph or a map – all
fully within the browser and with your data and code nicely persisted to
<a href="https://gist.github.com/">gists</a>.</p>
<p>This tutorial will get you started using Data Explorer by walking you through
the cleaning of a messy data set. It introduces some of the basic concepts of
the <a href="http://okfnlabs.org/recline/">Recline</a> library which provides Data Explorer’s model of data
and highlights why it’s nice to be able to use JavaScript to wrangle data.</p>
<h2 id="getting-started">Getting started</h2>
<p>For this tutorial, we’re going to use a set of <a href="http://static.london.gov.uk/gla/expenditure/docs/2012-13-P12-250.csv">Greater London Authority (GLA) financial data</a>,
a report of payments of more than £250 made by the GLA over a one-month
period in 2013. Conveniently for our purposes, this dataset is a little buggy.</p>
<p>To get started, go to the <a href="http://explorer.okfnlabs.org/">Data Explorer</a> website, and click <strong>Get
started with your own data</strong> to proceed to the <em>Import</em> page. From there, you will be
able to load the data in a number of ways: uploading a local file, pasting in a URL, or
pasting the data itself into a text box. Choose your preferred method and hit the appropriate
<strong>Load</strong> button, which will take you to the <em>Preview & Save</em> page.</p>
<p>The <em>Preview</em> shows what the data will look like as a grid. Already some
fiddling is necessary to get things started. The row containing the names of fields is six rows
down, and the fields are all nameless – except for one with an erroneous name! To fix this,
change the <strong>Skip initial rows</strong> value to 6.</p>
<p>You can also see that there is a blank column, but you can’t do anything about this yet.
Just choose a name for the dataset, click <strong>Save</strong>, and move on to actually working with the data.</p>
<h2 id="the-grid-and-the-graph">The grid and the graph</h2>
<p>Once the data has been loaded and named, you are taken to the Data Explorer proper. Your
first view of the data will be the <strong>Grid</strong>, a tabular display identical to what was already
shown in the <em>Preview</em> screen.</p>
<p><img src="http://i.imgur.com/B48sGc9.png" alt="The initial grid" /></p>
<p>Data visualizations are constructed with the <strong>Graph</strong>. Let’s try to make a graph
of the data. Click <strong>Graph</strong> to go to the graph screen, which will ask you to choose
which of the data’s fields to bind to the two axes, Axis 1 (= x) and Axis 2 (= y). First
change the Graph Type to <strong>Points</strong>. Then, for Group Column, choose “Clearing Date”, and
for Series A, choose “Amount”. You should get a graph that looks like this.</p>
<p><img src="http://i.imgur.com/NDPFkLN.png" alt="First graph" /></p>
<p>This graph is useless. There are no points with an Amount higher than about £990.
A quick look at the grid will tell you that in fact many points have Amounts
running into the millions of pounds. Also note that the x axis is completely unlabeled. If you scan
your cursor over the data points, which displays their underlying value, you’ll see that
their horizontal arrangement is meaningless.</p>
<p>The problem is that the dataset is formatted badly. All of the values in the Amounts field
that run higher than £999.99 include a comma, which prevents them from being parsed as
numbers. The dates, too, are not being treated as dates but just as ordinary strings,
making it impossible to put them on a scale.</p>
<p>To fix these problems, we’ll write some code. Roll up your sleeves and get ready.</p>
<h2 id="basics-of-de-coding">Basics of DE coding</h2>
<p>To pull up Data Explorer’s tool for editing and running JavaScript code, click <strong>Code</strong>
at the top right of the page. This will cause the <strong>JavaScript console</strong> to drop down.
This console consists of a panel for editing code and, beneath it, an area where
messages are printed.</p>
<p>A bit of code is included in the edit panel by default:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>loadDataset("current", function (error, dataset) {
// error will be null unless there is an error
// dataset is a Recline memory store (http://reclinejs.com//docs/src/backend.memory.html).
console.log(dataset);
});
</code></pre></div></div>
<p>This code loads up the current dataset by calling the function <code class="language-plaintext highlighter-rouge">loadDataset</code>
with the string <code class="language-plaintext highlighter-rouge">"current"</code> (the name of the current dataset) and an anonymous
callback function which binds a representation of the dataset to the name <code class="language-plaintext highlighter-rouge">dataset</code>.
The callback function, as defined, prints the dataset to the console’s
message area by calling <code class="language-plaintext highlighter-rouge">console.log</code> on <code class="language-plaintext highlighter-rouge">dataset</code>.
Watch this code work by clicking <strong>Run the Code</strong>.</p>
<p>The console output might surprise you. The dataset is represented as a JavaScript
object with attributes <code class="language-plaintext highlighter-rouge">"records"</code> and <code class="language-plaintext highlighter-rouge">"fields"</code>: the former is an array
of objects, one per row, each with an attribute for every field listed in <code class="language-plaintext highlighter-rouge">"fields"</code>. This is
an instance of the <a href="http://reclinejs.com//docs/src/backend.memory.html">Recline memory store</a>. A dataset is a collection
of records, and a record is just an object.</p>
<p>If you understand that, you’re ready to clean the dataset.</p>
<h2 id="cleaning-with-javascript">Cleaning with JavaScript</h2>
<p>The full gamut of JavaScript tools and tricks is available to you when you
clean data in Data Explorer. Besides handy core JavaScript functionality like
regular expressions, Data Explorer makes the <a href="http://underscorejs.org">Underscore.js</a>
suite of functional programming utilities available for data cleaning.</p>
<p>To clean a dataset, write code inside a <code class="language-plaintext highlighter-rouge">loadDataset</code> call that modifies the dataset
in the appropriate way, and finish by calling <code class="language-plaintext highlighter-rouge">saveDataset</code> on the modified dataset.
All code presented in this section is to be placed inside the curly brackets of the <code class="language-plaintext highlighter-rouge">loadDataset</code>
callback function.</p>
<p>Let’s start by getting rid of that annoying blank column we noticed earlier. To do this,
we have to delete <code class="language-plaintext highlighter-rouge">_noname_</code> from the dataset’s fields. We must also drop the <code class="language-plaintext highlighter-rouge">_noname_</code>
attribute from every record in the dataset.</p>
<p>To get rid of the bad field, set the dataset’s fields attribute to be the old value
of the attribute minus the field named <code class="language-plaintext highlighter-rouge">"_noname_"</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dataset.fields =
_.reject(dataset.fields,
function (f) {
return f.id === "_noname_";
}) ;
</code></pre></div></div>
<p>Erasing the bad field from each record can be done with an application of <code class="language-plaintext highlighter-rouge">each</code>,
which calls a function with side effects on each item in a collection.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>_.each(dataset.records,
function (r) {
delete r._noname_ ;
}) ;
</code></pre></div></div>
<p>Now let’s look at the next problem: the unparsed Amounts with commas. To fix these,
we need to eliminate the commas and then parse the resulting string as a float.
Since we’re already iterating over every record in the dataset, we can add to the
anonymous function in the <code class="language-plaintext highlighter-rouge">each</code> call:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (typeof r.Amount === "string") {
r.Amount = parseFloat(r.Amount.replace(/,/g, "")) ;
}
</code></pre></div></div>
<p>Finally, we can fix the dates. There are two problems with these. The first is that
the Recline dataset object needs to know that the type of their field is <em>date</em>.
The second is that the dates haven’t been parsed. To fix the first problem, add
to the <code class="language-plaintext highlighter-rouge">loadDataset</code> callback function:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>_.find(dataset.fields,
function (f) {
return f.id === "Clearing Date" ;
})
.type = "date" ;
</code></pre></div></div>
<p>Next, add another bit to the anonymous function in <code class="language-plaintext highlighter-rouge">each</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (typeof r["Clearing Date"] === "string") {
r["Clearing Date"] = new Date(r["Clearing Date"]) ;
}
</code></pre></div></div>
<p>That’s it! All that remains is to save the modified dataset. At the bottom of
the <code class="language-plaintext highlighter-rouge">loadDataset</code> callback function, add a line to save the data:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>saveDataset(dataset) ;
</code></pre></div></div>
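<p>For reference, here is the whole cleaning script assembled from the pieces above:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>loadDataset("current", function (error, dataset) {
  // drop the blank column from the field list
  dataset.fields = _.reject(dataset.fields, function (f) {
    return f.id === "_noname_";
  });

  // tell Recline that the Clearing Date field holds dates
  _.find(dataset.fields, function (f) {
    return f.id === "Clearing Date";
  }).type = "date";

  // fix up every record
  _.each(dataset.records, function (r) {
    delete r._noname_;
    if (typeof r.Amount === "string") {
      r.Amount = parseFloat(r.Amount.replace(/,/g, ""));
    }
    if (typeof r["Clearing Date"] === "string") {
      r["Clearing Date"] = new Date(r["Clearing Date"]);
    }
  });

  saveDataset(dataset);
});
</code></pre></div></div>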
<p>Click <strong>Run the Code</strong> and watch the data transform before your very eyes.
The new graph view of the data is now meaningful, correct, and fully consistent
with your expectations. Awesome!</p>
<p><img src="http://i.imgur.com/Bl1cxL8.png" alt="Fixed graph" /></p>
<p>You can also have another look at the grid, which will show you exactly what
has changed in your data.</p>
<p><img src="http://i.imgur.com/WfQxGdV.png" alt="Final grid" /></p>
<p>If you have logged in to GitHub, you will be able to save the result of your
work. To share the work, simply copy the URL of your project. An example of
a project constructed according to the instructions above can be found <a href="http://explorer.okfnlabs.org/#nmashton/e4f4ab6a21471e1aa1b8/view/graph">here</a>.</p>
<h2 id="conclusion">Conclusion</h2>
<p>With Data Explorer, the full power of arbitrary JavaScript code (enhanced
with Underscore functionality) can be brought to bear on tough data cleaning
problems. The cleaning script’s effects are immediately visible in the grid
and graph views of the data, which enables an easy, interactive style of data
cleaning. And it is all done without a backend, in memory and in the browser.</p>
Neil Ashton
data.okfn.org - update no. 1
2013-05-28T00:00:00+00:00
http://okfnlabs.org/blog/2013/05/28/data-okfn-org-update-no-1
<p>This is the first of regular updates on Labs project <a href="http://data.okfn.org/">http://data.okfn.org/</a>
and summarizes some of the changes and improvements over the last few weeks.</p>
<h3 id="1-refactor-of-site-layout-and-focus">1. Refactor of site layout and focus.</h3>
<p>We’ve done a refactor of the site to have a stronger focus on the data. The front-page tagline is now:</p>
<blockquote>
<p>We’re providing key datasets in high quality, easy-to-use and open form</p>
</blockquote>
<p>Tools and standards are there in a clear supporting role. Thanks for all the suggestions and feedback on this so far - more is welcome, as we’re still iterating.</p>
<h3 id="2-pull-request-data-workflow">2. Pull request data workflow</h3>
<p>There was a nice example of the pull request data workflow being used (by a complete stranger!): <a href="https://github.com/datasets/house-prices-uk/pull/1">https://github.com/datasets/house-prices-uk/pull/1</a></p>
<h3 id="3-new-datasets">3. New datasets</h3>
<p>For example:</p>
<ul>
<li>US house prices <a href="http://data.okfn.org/data/house-prices-us">http://data.okfn.org/data/house-prices-us</a></li>
<li>Annual consumer price index <a href="http://data.okfn.org/data/cpi">http://data.okfn.org/data/cpi</a></li>
</ul>
<p>Looking to contribute data? Check out the instructions <a href="http://data.okfn.org/about/contribute#data">http://data.okfn.org/about/contribute#data</a> and the outstanding requests: <a href="https://github.com/datasets/registry/issues">https://github.com/datasets/registry/issues</a></p>
<h3 id="4-tooling">4. Tooling</h3>
<ul>
<li>We have a DataPackage.JSON creator tool in progress at http://data.okfn.org/tools/dp/create (<a href="https://github.com/okfn/data.okfn.org/issues/28">here’s the relevant github issue</a>)</li>
<li>We have a new <a href="http://data.okfn.org/tools">data package viewer created by James Smith</a></li>
</ul>
<h3 id="5-feedback-on-standards">5. Feedback on standards</h3>
<p>There’s been a lot of valuable feedback on the <a href="http://data.okfn.org/standards">data package and json table schema standards</a> including some quite major suggestions (e.g. substantial change to JSON Table Schema to align more closely with JSON Schema - thx to jpmckinney)</p>
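<p>For readers who haven’t met the standard yet, a datapackage.json is a small JSON descriptor along roughly these lines (a minimal sketch of the draft spec - treat the exact field names as illustrative, not normative):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "name": "house-prices-uk",
  "title": "UK House Prices",
  "resources": [
    {
      "path": "data/data.csv",
      "schema": {
        "fields": [
          {"id": "Date", "type": "date"},
          {"id": "Price", "type": "number"}
        ]
      }
    }
  ]
}
</code></pre></div></div>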
<h3 id="next-steps">Next steps</h3>
<p>There’s plenty more coming up soon in terms of data and the site and tools.</p>
<ul>
<li>Complete the <a href="https://github.com/okfn/data.okfn.org/issues/28">datapackage.json generator</a> (support for gdocs especially)</li>
<li>Complete the <a href="https://github.com/okfn/data.okfn.org/issues/27">datapackage.json validator</a></li>
<li>More <a href="http://data.okfn.org/about/contribute#data">datasets especially key indices</a></li>
</ul>
<h3 id="get-involved">Get Involved</h3>
<p>Anyone can contribute and it’s easy – if you can use a spreadsheet you can help!</p>
<p>Instructions for getting involved here: <a href="http://data.okfn.org/about/contribute">http://data.okfn.org/about/contribute</a></p>
Rufus Pollock
Open Humanities Hangout - Open Correspondence and the Letter Net
2013-05-21T00:00:00+00:00
http://okfnlabs.org/blog/2013/05/21/humanities-hangout-open-correspondence-letter-net
<p>Our next Open Humanities Hangout will take place next <strong>Tuesday, 28th May</strong>. This is the latest in the series of regular hangouts we’ve been organizing over the past few months with people interested in tapping in to the growing amount of <strong>open cultural data and content</strong>.</p>
<ul>
<li><strong>What:</strong> Open Humanities Hangout looking at opening up historical correspondence and mapping the “letter net” – e.g. did Dickens write to George Eliot and did she write back? Come help us find out! <a href="#more">Read more below</a></li>
<li><strong>When:</strong> Tuesday 28th May 2013 at 1700 BST, 12:00 EDT, 1800 CET</li>
<li><strong>Where:</strong> Online via Google Hangout and <a href="/contact">IRC</a> – we’ll publish the hangout url nearer the time</li>
<li><strong>Who:</strong> anyone who loves the humanities and wants to see the great works of our past accessible and re-usable by anyone regardless of background or location.</li>
<li><strong>Signup:</strong> please <a href="https://docs.google.com/a/okfn.org/document/d/1WIzi7n3D5_c7QtaGKQAFbm7bMGmi-u_vjmI5NX8MWJA/edit#">sign up here</a> or email sam.leon@okfn.org. Note you can always just drop in on the day but it helps us if we have a sense of numbers!</li>
</ul>
<h2 id="about-the-hangouts">About the Hangouts</h2>
<p>The <a href="/events/hangouts/">Humanities Hangouts</a> are an informal virtual get together to build apps and insights using open cultural material. Among other things participants have put together an app that helps you to get to know Shakespeare better called <a href="http://crowdcrafting.org/app/bardomatic/">Bardomatic</a>, hacked on an annotation tool for public domain texts called <a href="http://textusproject.org">TEXTUS</a> and created interactive <a href="http://timeliner.okfnlabs.org/">timelines of the great Western medieval philosophers</a> (helping to improve and de-bug the <a href="http://timeliner.okfnlabs.org/">Timeliner tool</a> in the process).</p>
<p><img src="http://farm6.staticflickr.com/5323/8768093210_3343870b2a_c.jpg" alt="Screen Shot 2013-05-16 at 13.26.10" width="1272" height="768" class="aligncenter size-full wp-image-2185" /></p>
<h2 id="more">The Challenge: Mapping Networks of Correspondence</h2>
<p>We want to construct a workflow that will enable <em>anyone</em> to take a published set of letters and turn it into open data and content that we can explore and visualize. Ultimately we want the network of correspondence – the “letter net”.</p>
<h3 id="suggested-process">Suggested process</h3>
<p>This is something to discuss on the hangout, but we think the effort will involve at least 3 steps:</p>
<ol>
<li>Locate published collection of letters
<ul>
<li>Great if these are already digitized on Project Gutenberg</li>
</ul>
</li>
<li>Extract structured data like author, recipient, date, location
<ul>
<li>Geo-code all those locations</li>
<li>If the texts are not digitized start thinking about that!</li>
</ul>
</li>
<li>Visualise the results</li>
</ol>
<p>We’ve already done work on steps 1 and even 2 in the <a href="https://github.com/okfn/openletters">case of Dickens</a>. For geocoding there’s already a simple <a href="http://schoolofdata.org/2013/02/19/geocoding-part-i-introduction-to-geocoding/">geocoding guide on the School of Data</a>. For visualization there are plenty of options that we’ll explore on the hangout. (And if we want to start scanning and OCRing there are <a href="http://www.diybookscanner.org/">manuals on how to build your own scanner</a>).</p>
<h3 id="our-goal">Our Goal</h3>
<p>Our basic goal is a set of beautiful and insightful visualisations about the correspondence of key cultural figures.</p>
<p>Longer term we would love to see a database of correspondence that is open to everyone to use and add to.</p>
Sam Leon
Nomenklatura - Data Matching and Reconciliation Made Easy
2013-05-16T00:00:00+00:00
http://okfnlabs.org/blog/2013/05/16/nomenklatura-matching-service-reconciliation-made-easy
<p><a href="http://nomenklatura.okfnlabs.org/">Nomenklatura</a> is a simple service that makes it easy to maintain a canonical list of entities such as persons, companies or event streets and to match messy input, such as their names against that canonical list – for example, matching Acme Widgets, Acme Widgets Inc and Acme Widgets Incorporated to the canonical “Acme Widgets”.</p>
<p>With Nomenklatura it’s a matter of minutes to set up your own set of master data to match against, and it provides a simple user interface and <a href="http://nomenklatura.okfnlabs.org/about">API</a> which you can then use to do matching (the API is compatible with Open Refine’s reconciliation function).</p>
<p><a href="http://nomenklatura.okfnlabs.org/">Nomenklatura</a> can not only store the master set of entities you want to match against but also will learn and record the various aliases for a given entity - such as a person, organisation or place - may have in various datasets.</p>
<p><a href="http://nomenklatura.okfnlabs.org/"><img src="http://i.imgur.com/h9411NU.jpg" /></a></p>
<p>As such, Nomenklatura chooses a design halfway between an entity database (such as OpenCorporates, PopIt or similar services) and automated de-duplication software (such as dedupe or SILK).</p>
<p>Nomenklatura has been battle-tested with real-world usage, for example to de-duplicate the names of <a href="http://nomenklatura.okfnlabs.org/offenesparlament">German parliamentarians</a>, <a href="http://nomenklatura.okfnlabs.org/uk25k-departments">UK government departments</a> and <a href="http://nomenklatura.okfnlabs.org/openinterests-entities">spending data schemas and EU lobbyists</a>.</p>
<p>Typically, a data extraction process will check all the entity names it discovers in the source data against Nomenklatura’s API. If Nomenklatura does not recognize a name, a new alias record is stored as a placeholder. This alias can then be matched to an entity by the user through a simple-to-use reconciliation user interface.</p>
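<p>As a rough sketch of what such a lookup could look like against the Refine-compatible API (the endpoint path and response fields here are my assumptions - check the <a href="http://nomenklatura.okfnlabs.org/about">API docs</a> for the real ones):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>var request = require('request');

// hypothetical reconciliation endpoint for a dataset named "my-companies"
var endpoint = 'http://nomenklatura.okfnlabs.org/my-companies/reconcile';

request.get({ url: endpoint, qs: { query: 'Acme Widgets Inc' }, json: true },
  function (err, resp, body) {
    if (err) throw err;
    // Refine-style responses carry a ranked list of candidate entities
    body.result.forEach(function (candidate) {
      console.log(candidate.name, candidate.score, candidate.match);
    });
  });
</code></pre></div></div>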
<p>To kickstart such a process, data can be uploaded via CSV - but new entities can be created dynamically as well. The advantage of a manual approach is that it minimizes the risk of false matches – this level of quality assurance can be crucial if, for example, the output will be displayed in an application that is intended to hold government to account.</p>
<h2 id="this-release">This Release</h2>
<p>This latest release of Nomenklatura includes a number of important changes:</p>
<ul>
<li>
<p>The domain model was refactored to use a clearer naming scheme, canonical values are now called “entities”, and their alternative spellings are now “aliases”.</p>
</li>
<li>
<p>CSV upload support allows users to submit a list of entities, aliases or fully executed mappings.</p>
</li>
<li>
<p>Support for the Open Refine API was added, so that each Nomenklatura dataset can be added as a reconciliation service and used to clean data from inside Refine.</p>
</li>
<li>
<p>Keyboard shortcuts were added to the reconciliation tool, so that matches can be identified without using a mouse - a fast user can now match a few hundred records an hour.</p>
</li>
<li>
<p>The Python client library has been refactored and submitted to PyPi, it can be installed via “pip install pynomenklatura”.</p>
</li>
</ul>
<h2 id="credits-and-links">Credits and Links</h2>
<p>Nomenklatura was developed by <a href="/members/pudo/">Labs Member Friedrich Lindenberg</a> with contributions from other folks including fellow Labs member Michael Bauer.</p>
<p><a href="https://github.com/pudo/nomenklatura">Nomenklatura source code on GitHub</a></p>
Friedrich Lindenberg
Update on PublicBodies.org - a URL for every part of Government
2013-05-01T00:00:00+00:00
http://okfnlabs.org/blog/2013/05/01/publicbodies.org-an-update
<p>This is an update on <a href="http://publicbodies.org/">PublicBodies.org</a> - a Labs project whose aim is to provide a “URL for every part of Government”: <a href="http://publicbodies.org/">http://publicbodies.org/</a></p>
<p>PublicBodies.org is a database and website of “Public Bodies” – that is Government-run or controlled organizations (which may or may not have distinct corporate existence). Examples would include government ministries or departments, state-run organizations such as libraries, police and fire departments and more.</p>
<p><a href="http://publicbodies.org/"><img src="http://i.imgur.com/2AbIjSu.png" alt="" style="margin-top: 15px; margin-bottom: 15px;" /></a></p>
<p>We run into public bodies all the time in projects like OpenSpending (either as spenders or recipients). Back in 2011 as part of the “Organizations” data workshop at OGD Camp 2011, Labs member Friedrich Lindenberg scraped together a first database and site of “public bodies” from various sources (primarily FoI sites like WhatDoTheyKnow, FragDenStaat and AskTheEU).</p>
<p>We’ve recently redone the site, converting the sqlite DB to simple flat CSV files:</p>
<ul>
<li>Main github repo: <a href="https://github.com/okfn/publicbodies">https://github.com/okfn/publicbodies</a></li>
<li>Example raw CSV: <a href="https://raw.github.com/okfn/publicbodies/master/data/gb.csv">https://raw.github.com/okfn/publicbodies/master/data/gb.csv</a></li>
</ul>
<p>The site itself is now super-simple: flat files hosted on S3 (<a href="https://github.com/okfn/publicbodies/tree/master/site">build code here</a>). Here’s an example of the output:</p>
<ul>
<li>European Parliament: <a href="http://publicbodies.org/eu/european-parliament.html">http://publicbodies.org/eu/european-parliament.html</a></li>
<li>Associated JSON API (with CORS!) <a href="http://publicbodies.org/eu/european-parliament.json">http://publicbodies.org/eu/european-parliament.json</a></li>
</ul>
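<p>Because the JSON API sends CORS headers, it can be queried straight from browser-side JavaScript on another domain - a minimal sketch using jQuery (inspect the output to see which fields are available):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// works cross-origin thanks to the CORS headers on the JSON API
$.getJSON('http://publicbodies.org/eu/european-parliament.json', function (body) {
  console.log(body); // dump the record to see which fields are available
});
</code></pre></div></div>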
<p>The simplicity of CSV for the data plus simple templating to flat files is very attractive. There are some drawbacks (e.g. a change to the primary template means a full rebuild and upload of ~6k files), so, especially as the data grows, we may want to look into something a bit nicer, but for the time being this works well.</p>
<h2 id="next-steps">Next Steps</h2>
<p>There’s plenty that could be improved e.g.</p>
<ul>
<li>More data - other jurisdictions (we only cover EU, UK and Germany) + descriptions for the bodies (this could be a nice crowdcrafting app)</li>
<li>Search and Reconciliation (via nomenklatura)</li>
<li>Making it easier to submit corrections or additions</li>
</ul>
<p>The full list of issues is on github here: <a href="https://github.com/okfn/publicbodies/issues">https://github.com/okfn/publicbodies/issues</a></p>
<p>Help is most definitely wanted! Just grab one of the issues or <a href="http://okfnlabs.org/contact/">get in touch</a> …</p>
Rufus Pollock
Quick and Dirty Analysis on Large CSVs
2013-04-11T00:00:00+00:00
http://okfnlabs.org/blog/2013/04/11/quick-and-dirty-analysis-on-large-csv
<p>I’m playing around with some large(ish) CSV files as part of an <a href="http://openspending.org/">OpenSpending</a>-related data investigation to look at UK government spending last year – example question: which companies were the top 10 recipients of government money? (More details can be
found in <a href="https://github.com/openspending/thingstodo/issues/5">this issue on OpenSpending’s things-to-do repo</a>).</p>
<p>The dataset I’m working with is the consolidated spending (over £25k) by all UK government departments. Thanks to the efforts of OpenSpending folks (and specifically Friedrich Lindenberg) this data is already nicely ETL’d from thousands of individual CSV (and xls) files into one big 3.7 GB file (see below for links and details).</p>
<p>My question is what is the best way to do quick and dirty analysis on this?</p>
<p>Examples of the kinds of options I was considering were:</p>
<ul>
<li>Simple scripting (python, perl etc)</li>
<li>Postgresql - load, build indexes and then sum, avg etc</li>
<li>Elastic MapReduce (AWS Hadoop)</li>
<li>Google BigQuery</li>
</ul>
<p>Love to hear what folks think and if there are tools or approaches they would specifically recommend.</p>
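<p>For a baseline comparison, here is a minimal streaming pass in Node that sums spending per supplier. The column positions below are placeholders (the real ones are described in the Data Package file linked under “The Data”), and a proper CSV parser would be needed to handle quoted commas:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>var fs = require('fs');
var readline = require('readline');

// column positions are placeholders - the real ones are described in datapackage.json
var SUPPLIER_COL = 5, AMOUNT_COL = 7;
var totals = {};

var rl = readline.createInterface({ input: fs.createReadStream('spending-latest.csv') });
rl.on('line', function (line) {
  // naive split: quoted fields containing commas need a real CSV parser
  var cells = line.split(',');
  var amount = parseFloat(cells[AMOUNT_COL]);
  if (!isNaN(amount)) {
    var supplier = cells[SUPPLIER_COL];
    totals[supplier] = (totals[supplier] || 0) + amount;
  }
});
rl.on('close', function () {
  // print the ten biggest recipients
  Object.keys(totals)
    .sort(function (a, b) { return totals[b] - totals[a]; })
    .slice(0, 10)
    .forEach(function (supplier) { console.log(supplier, totals[supplier]); });
});
</code></pre></div></div>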
<h3 id="the-data">The Data</h3>
<ul>
<li>Here’s the <a href="http://data.etl.openspending.org/uk25k/spending-latest.csv">3.7 GB CSV</a></li>
<li>A <a href="http://www.dataprotocols.org/en/latest/data-packages.html">Data Package file</a> for the data describing the fields: <a href="https://raw.github.com/openspending/dpkg-uk25k/master/datapackage.json">https://raw.github.com/openspending/dpkg-uk25k/master/datapackage.json</a></li>
</ul>
Rufus Pollock
Cleaning up Greater London Authority Spending (for OpenSpending)
2013-04-03T00:00:00+00:00
http://okfnlabs.org/blog/2013/04/03/greater-london-authority-spending
<p>I’ve been working to get Greater London Authority spending data cleaned up and
into <a href="http://openspending.org/">OpenSpending</a>. Primary motivation comes from this question:</p>
<blockquote>
<p><strong>Which companies got paid the most (and for doing what)?</strong> (see this <a href="https://github.com/openspending/thingstodo/issues/5">issue for
more</a>)</p>
</blockquote>
<p>I wanted to share where I’m up to and some of the experience so far, as I think
these can inform our wider efforts - and illustrate the challenges of just getting
and cleaning up data. I note that the <a href="https://github.com/rgrp/dataset-gla#readme">code and README</a> for this
ongoing work is in a repo on github: <a href="https://github.com/rgrp/dataset-gla">https://github.com/rgrp/dataset-gla</a></p>
<p><a href="http://openspending.org/gb-local-gla"><img src="http://awesomeness.openphoto.me/custom/201307/gla-spend-function-476af3_870x870.jpg" alt="" /></a></p>
<h2 id="data-quality-issues">Data Quality Issues</h2>
<p>There are 61 CSV files as of March 2013 (a list can be found in <a href="https://github.com/rgrp/dataset-gla/blob/master/scrape.json">scrape.json</a>).</p>
<p>Unfortunately the “format” varies substantially across files (even though they
are all CSV!), which makes using this data a real pain. Some examples:</p>
<ul>
<li>the number of fields and their names vary across files (e.g. SAP Document no vs
Document no)</li>
<li>the number of blank columns and blank lines varies (some files have none
(good!), while many have blank lines plus some metadata etc etc)</li>
<li>There is also at least one “bad” file which looks to be an Excel file saved
as CSV</li>
<li>Amounts are frequently formatted with “,” making them appear as strings to
computers.</li>
<li>Dates vary substantially in format e.g. “16 Mar 2011”, “21.01.2011” etc</li>
<li>No unique transaction number (though the document number may serve as one)</li>
</ul>
<p>They also switched from monthly reporting to period reporting (where there are 13 periods of approx 28d each).</p>
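<p>Whatever tooling ends up doing the load, the amount and date handling boils down to something like the following sketch (function names are mine - the real logic lives in the clean-up script linked below):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>function cleanAmount(value) {
  // strip thousands separators, so "1,234,567.89" becomes 1234567.89
  return parseFloat(String(value).replace(/,/g, ''));
}

function cleanDate(value) {
  // "21.01.2011" style (dd.mm.yyyy) needs explicit handling ...
  var m = /^(\d{1,2})\.(\d{1,2})\.(\d{4})$/.exec(value);
  if (m) return new Date(+m[3], +m[2] - 1, +m[1]);
  // ... while "16 Mar 2011" style parses natively
  return new Date(value);
}
</code></pre></div></div>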
<h2 id="progress-so-far">Progress so far</h2>
<p>I do have one month loaded (Jan 2013) with a nice breakdown by “Expenditure
Account”:</p>
<p><a href="http://openspending.org/gb-local-gla">http://openspending.org/gb-local-gla</a></p>
<p>Interestingly, after some fairly standard grants to other bodies, <a href="http://openspending.org/gb-local-gla/expenditure-account/542420">“Claim Settlements”</a>
comes in as the biggest item at £2.3m.</p>
<ul>
<li>Data getting archived at <a href="http://data.openspending.org/datasets/gb-local-gla/">http://data.openspending.org/datasets/gb-local-gla/</a></li>
<li>Clean up script: <a href="https://github.com/rgrp/dataset-gla/blob/master/scripts/process.js">https://github.com/rgrp/dataset-gla/blob/master/scripts/process.js</a></li>
</ul>
Rufus Pollock
sqlaload, an ETL wrapper for SQLAlchemy
2013-03-30T00:00:00+00:00
http://okfnlabs.org/blog/2013/03/30/sqlaload
<p><a href="https://github.com/okfn/sqlaload">sqlaload</a> is a small library that I use to handle databases in Python data processing. In many projects, your process starts with very messy data (something you’ve scraped or loaded from a hand-prepared Excel sheet). In subsequent stages, you gradually add cleaned values in new columns or new tables. Managing a full SQL schema for such operations can be a hassle, you really want something close to <a href="http://www.mongodb.org/">MongoDB</a>: a NoSQL data store you can throw fairly random data at and get it back later.</p>
<p>With sqlaload, the idea is to combine some of the schema flexibility, while still keeping things in a structured database in the background:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">sqlaload</span> <span class="k">as</span> <span class="n">sl</span>
<span class="n">engine</span> <span class="o">=</span> <span class="n">sl</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="s">'sqlite:///test.db'</span><span class="p">)</span>
<span class="c1"># add some data:
</span><span class="n">sl</span><span class="p">.</span><span class="n">add_row</span><span class="p">(</span><span class="n">engine</span><span class="p">,</span> <span class="s">'mytable'</span><span class="p">,</span> <span class="p">{</span><span class="s">'name'</span><span class="p">:</span> <span class="s">'Foo'</span><span class="p">,</span> <span class="s">'has_this'</span><span class="p">:</span> <span class="bp">True</span><span class="p">})</span>
<span class="n">sl</span><span class="p">.</span><span class="n">add_row</span><span class="p">(</span><span class="n">engine</span><span class="p">,</span> <span class="s">'mytable'</span><span class="p">,</span> <span class="p">{</span><span class="s">'name'</span><span class="p">:</span> <span class="s">'Bar'</span><span class="p">,</span> <span class="s">'has_other'</span><span class="p">:</span> <span class="bp">True</span><span class="p">})</span>
<span class="c1"># Look up a record
</span><span class="n">row</span> <span class="o">=</span> <span class="n">sl</span><span class="p">.</span><span class="n">find_one</span><span class="p">(</span><span class="n">engine</span><span class="p">,</span> <span class="s">'mytable'</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'Foo'</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">row</span><span class="p">[</span><span class="s">'has_this'</span><span class="p">]</span><span class="o">==</span><span class="bp">True</span>
<span class="c1"># Update a record:
</span><span class="n">sl</span><span class="p">.</span><span class="n">upsert</span><span class="p">(</span><span class="n">engine</span><span class="p">,</span> <span class="s">'mytable'</span><span class="p">,</span> <span class="p">{</span><span class="s">'name'</span><span class="p">:</span> <span class="s">'Foo'</span><span class="p">,</span> <span class="s">'location'</span><span class="p">:</span> <span class="s">'Atlantis'</span><span class="p">},</span> <span class="p">[</span><span class="s">'name'</span><span class="p">])</span>
<span class="c1"># Or create one:
</span><span class="n">sl</span><span class="p">.</span><span class="n">upsert</span><span class="p">(</span><span class="n">engine</span><span class="p">,</span> <span class="s">'mytable'</span><span class="p">,</span> <span class="p">{</span><span class="s">'name'</span><span class="p">:</span> <span class="s">'Qux'</span><span class="p">,</span> <span class="s">'location'</span><span class="p">:</span> <span class="s">'Elsewhere'</span><span class="p">},</span> <span class="p">[</span><span class="s">'name'</span><span class="p">])</span></code></pre></figure>
<p>I first saw this type of SQL schema generation implemented in <a href="http://scraperwiki.com">ScraperWiki</a>: they have a couple of <a href="https://scraperwiki.com/docs/python/python_help_documentation/">high-level SQLite wrappers</a> that expand your database as you feed them data. We later adopted that concept for the joint CKAN/ScraperWiki <a href="https://github.com/okfn/webstore">webstore</a>, which neither project ended up using.</p>
<p>Still, webstore had become an essential part of many of my data projects as an <a href="http://en.wikipedia.org/wiki/Operational_data_store">operational data store</a>. Eventually, I decided to kick out the networking aspect: data access via HTTP is terribly slow and I wanted to have my data in Postgres, not SQLite. The webstore code went into sqlaload, and became a thin wrapper on top of <a href="http://docs.sqlalchemy.org/en/rel_0_8/">SQLAlchemy core</a> (the non-ORM database abstraction part of SQLAlchemy).</p>
<p>Running on top of SQLAlchemy also means that all of its functionality - for example the <a href="http://docs.sqlalchemy.org/en/rel_0_8/core/expression_api.html">query expression language</a> - are available and can be used to call up more advanced functionality.</p>
<p>If you want to try it out, sqlaload is now on <a href="https://pypi.python.org/pypi/sqlaload">PyPI</a> and the <a href="https://github.com/okfn/sqlaload/blob/master/README.md">README</a> has a lot of detailed documentation on the library.</p>
Friedrich Lindenberg
Next Steps for Textus
2013-03-27T00:00:00+00:00
http://okfnlabs.org/blog/2013/03/27/next-steps-for-textus
<p>At the Culture Labs hangout yesterday we wrote up the plans for the next steps for Textus that we have been discussing over the last few months.</p>
<p>The result is this slide deck overview. It both introduces Textus and outlines next steps (slide 12 onwards).</p>
<iframe src="https://docs.google.com/presentation/d/1OlXIaGgntenmBLNMu0tZYTdrP09TvzZ-R5bpJAgznF4/embed?start=false&loop=false&delayms=3000" frameborder="0" width="580" height="464" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>
<h2 id="key-points">Key Points</h2>
<p>We want to:</p>
<ul>
<li>Maximize simplicity</li>
<li>Connect with a CMS (people always want other content than just the texts)</li>
</ul>
<p>Implications are:</p>
<ul>
<li>Componentize “Textus” and separate text preparation / import from presentation</li>
<li>Create a plugin to make “Textus” style functionality one-click install into Wordpress</li>
<li>Eliminate dependencies on ElasticSearch & NodeJS (texts & markup stored in plain files online or in WP …)</li>
</ul>
<p>Specifically, we plan to break Textus into 3 components:</p>
<ul>
<li><a href="https://github.com/CultureLabs/textus-formatter">textus-formatter</a> - nodejs app/command line tool for formatting texts</li>
<li>textus-viewer - JS-only viewer</li>
<li>textus-wordpress - wordpress integration</li>
</ul>
<p><img src="https://docs.google.com/drawings/d/1S9Hv98LWdcfuG3KjF1qELsZBp-RQ08Ylo3gxaO6tyQg/pub?w=960&h=720" alt="" title="New Architecture" /></p>
Sam Leon
Progress on the Data Explorer
2013-03-18T00:00:00+00:00
http://okfnlabs.org/blog/2013/03/18/progress-with-data-explorer
<p>This is an update on progress with the <a href="http://explorer.okfnlabs.org/">Data Explorer</a> (aka Data Transformer).</p>
<p>Progress is best seen from this <a href="http://explorer.okfnlabs.org/#rgrp/e3e0b0f18dfe151f9f7e">demo which takes you on a tour of house prices and the difference between real and nominal values</a>.</p>
<p>More information on recent developments can be found below. Feedback is <em>very welcome</em> - either here or the issues <a href="https://github.com/okfn/dataexplorer">https://github.com/okfn/dataexplorer</a>.</p>
<p><a href="http://explorer.okfnlabs.org/#rgrp/e3e0b0f18dfe151f9f7e"><img src="http://i.imgur.com/WeDO0vK.png" alt="House prices tutorial" /></a></p>
<h2 id="what-is-the-data-explorer">What is the Data Explorer</h2>
<p>For those not familiar, the <a href="http://explorer.okfnlabs.org/">Data Explorer is an HTML+JS app</a> to view, visualize and process data <em>just in the browser</em> (no backend!). It draws heavily on the <a href="http://okfnlabs.org/recline/">Recline library</a> and features now include:</p>
<ul>
<li>Importing data from various sources (the UX of this could be much improved!)</li>
<li>Viewing and visualizing using Recline to create grids, graphs and maps</li>
<li>Cleaning and transforming data using a scripting component that allows you to write and run javascript</li>
<li>Saving and sharing: everything you create (scripts, graphs etc) can be saved and then shared via public URL.</li>
</ul>
<p>Note that persistence (for sharing) is to Gists (here’s the <a href="https://gist.github.com/rgrp/e3e0b0f18dfe151f9f7e">gist for the House Prices demo linked above</a>). This has some nice benefits such as versioning; offline editing (clone the gist, edit and push); and bl.ocks.org-style ability to create a gist and have it result in publicly viewable output (though with substantial differences vs blocks …).</p>
<h2 id="whats-next">What’s Next</h2>
<p>There are many areas that could be worked on – a full list of <a href="https://github.com/okfn/dataexplorer/issues">issues is in github</a>. The most important I think at the moment are:</p>
<ul>
<li><a href="https://github.com/okfn/dataexplorer/issues/88">Storing the data “locally” in the data project</a>. At present, data is always loaded from an “external” source. This probably involves extending the current Recline datastore to back on to IndexedDB.</li>
<li>A <a href="https://github.com/okfn/dataexplorer/issues/60">better project creation & data import process</a> - I think we could learn a lot from Refine here</li>
<li><a href="https://github.com/okfn/dataexplorer/issues/84">“Fork” support</a></li>
<li>More <a href="https://github.com/okfn/dataexplorer/issues/52">documentation and tutorials especially for scripting</a></li>
<li>Getting rid of the many rough edges especially on the UX side of things!</li>
</ul>
<p>I’d be very interested in people’s thoughts on the app so far and on what should be done next, and code contributions are also very welcome (the app has already benefitted from the efforts of many people including the likes of <a href="http://mk.ucant.org/">Martin Keegan</a> and <a href="https://github.com/michael">Michael Aufreiter</a> to the app itself; and from folks like <a href="http://maxogden.com/">Max Ogden</a>, <a href="http://pudo.org/">Friedrich Lindenberg</a>, <a href="http://casbon.me/">James Casbon</a>, <a href="http://driven-by-data.net/">Gregor Aisch</a>, <a href="http://nigelb.me/">Nigel Babu</a> (and many more) in the form of ideas, feedback, work on Recline etc).</p>
Rufus Pollock
Recline JS - Componentization and a Smaller Core
2013-02-26T00:00:00+00:00
http://okfnlabs.org/blog/2013/02/26/recline-js-componentization-and-a-smaller-core
<p>Over time <a href="http://okfnlabs.org/recline/">Recline JS</a> has grown. In particular, since the first <a href="http://blog.okfn.org/2012/07/05/announcing-recline-js-a-javascript-library-for-building-data-applications-in-the-browser/">public
announcement of Recline</a> last summer we’ve had several people producing
new backends and views (e.g. <a href="https://github.com/okfn/recline/wiki/Extensions">backends for Couch, a view for d3, a map view
based on Ordnance Survey’s tiles etc etc</a>).</p>
<p>As <a href="http://lists.okfn.org/pipermail/okfn-labs/2013-February/000638.html">I wrote to the labs list recently</a>, continually adding these to
core Recline runs the risk of bloat. Instead, we think it’s better to keep the
core lean and move more of these “extensions” out of core with a clear listing
and curation process - the design of Recline means that <a href="http://okfnlabs.org/recline/docs/backends.html">new backends</a> and
<a href="http://okfnlabs.org/recline/docs/views.html">views</a> can extend the core easily and without any complex dependencies.</p>
<p>This approach is useful in other ways. For example, Recline backends are
designed to support standalone use as well as use with Recline core (they have
no dependency on <em>any</em> other part of Recline - <em>including core</em>) but this is
not very obvious as it stands (where the backend is bundled with Recline). To
take a concrete example, the Google Docs backend is a useful wrapper for the
Google Spreadsheets API in its own right. That usefulness is hard to see while the
code lives in the main Recline repository; splitting it out into its own
repo with its own README makes this much clearer.</p>
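<p>For instance, standalone use of a split-out backend might look something like this (a sketch only: it assumes the repo keeps the recline.Backend.GDocs namespace and the fetch() contract from the backend spec linked above):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// assumes the split-out repo keeps the recline.Backend.GDocs namespace and the
// fetch() contract from the backend spec
var dataset = { url: 'https://docs.google.com/spreadsheet/ccc?key=YOUR-SPREADSHEET-KEY' };

recline.Backend.GDocs.fetch(dataset).done(function (result) {
  console.log(result.fields);  // column definitions
  console.log(result.records); // row objects
});
</code></pre></div></div>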
<h2 id="so-the-plan-is-">So the plan is …</h2>
<ul>
<li>Announce this approach of a leaner core and more “Extensions”
<ul>
<li>Link to the specifications for <a href="http://okfnlabs.org/recline/docs/backends.html">Backends</a> and <a href="http://okfnlabs.org/recline/docs/views.html">Views</a></li>
<li>Create an official <a href="https://github.com/okfn/recline/wiki/Extensions">Recline Extensions page</a></li>
</ul>
</li>
<li>Identify first items to split out from core - see <a href="https://github.com/okfn/recline/issues/314">this issue</a></li>
<li>Identify what components <em>should</em> remain in core? (I’m thinking Dataset +
Memory DataStore plus one Grid, Graph and Map)</li>
</ul>
<p>So far I’ve already started the process of factoring out some backends (and
soon views) into standalone repos, e.g. here’s GDocs:</p>
<p><a href="https://github.com/okfn/recline.backend.gdocs">https://github.com/okfn/recline.backend.gdocs</a></p>
<p>Any thoughts are very welcome, and if you already have Recline extensions lurking in
your repos please add them to the <a href="https://github.com/okfn/recline/wiki/Extensions">wiki page</a>.</p>
Rufus Pollock
Exporting PyBossa data to CSV or JSON with one click
2013-02-20T00:00:00+00:00
http://okfnlabs.org/blog/2013/02/20/exporting-pybossa-data-to-csv-with-one-click
<p>I’m really happy to announce that today we have finally added a feature that
will allow you to <a href="http://docs.pybossa.com/en/latest/user/tutorial.html#exporting-the-obtained-results">export your data</a> into CSV format with just one click
(we also support the same for JSON).</p>
<p><img src="http://i.imgur.com/zqPkMST.png" alt="" /></p>
<p>For this purpose, all the applications in PyBossa now feature a new URI:</p>
<blockquote>
<p>http://PYBOSSA-SERVER/app/slug/export</p>
</blockquote>
<p>There you will find several options to export the tasks or task runs (the answers)
to different formats. In the case of the CSV format, you will get a CSV file
that can be downloaded to your computer and loaded later into any spreadsheet
software :-)</p>
<p><img src="http://i.imgur.com/zVZCYW8.png" alt="" /></p>
<p><strong>NOTE</strong>: bear in mind that CSV is a flat format, so nested JSON objects will
be “dumped” as they are. For example, if you are using GeoJSON to store
a location, the CSV file will contain the JSON object as a string.
You can see <a href="http://crowdcrafting.org/app/urbanpark/export?type=task&format=csv">an example of this issue in the Urban Parks application</a>, as this
demo application uses the <a href="http://www.geojson.org/">GeoJSON</a> format for storing the location of the parks.</p>
<p>If you prefer JSON, just click any of the buttons and save the generated file!</p>
<p><img src="http://i.imgur.com/vBDWLeb.png" alt="" /></p>
<p>If you want to try the new feature, just go ahead and check it in <a href="http://crowdcrafting.org">CrowdCrafting.org</a></p>
Daniel Lombraña González
Mozilla FirefoxOS App Days & Crowdcrafting.org
2013-01-29T00:00:00+00:00
http://okfnlabs.org/blog/2013/01/29/firefoxappday
<p><img class="pull-left" src="https://hacks.mozilla.org/wp-content/uploads/2012/12/firefoxOS-app-days_graphic_RGB.png" />
Last Saturday, the 26th of January, <a href="https://hacks.mozilla.org/2013/01/join-us-for-firefox-os-app-days/">Mozilla held a hack day in parallel in 25 cities all over the world</a>, the <a href="https://twitter.com/search?q=%23firefoxosappdays&src=tyah">#FirefoxOSAppDay</a>, about creating new web applications for their new <a href="http://www.mozilla.org/en-US/firefoxos/">FirefoxOS mobile OS</a> and the desktop web browser (both are still in alpha/beta!).</p>
<p>One of the events was held in Madrid, Spain, organized by the <a href="http://www.mozilla-hispano.org/">Mozilla Hispano Community</a>, so I had the chance to spend some time with the Mozilla community and play with the <a href="https://developer.mozilla.org/en/docs/Mozilla/Firefox_OS">new APIs and developer tools for their new platform</a>.</p>
<p><img class="pull-right" src="http://mozorg.cdn.mozilla.net/media/img/firefoxos/firefox-phone.png" /></p>
<p>In the morning we attended several talks by experts on the new APIs that
Mozilla is developing to integrate mobile capabilities: for example the <a href="http://www.w3.org/TR/battery-status/">battery
API</a>, which allows you to check the
device battery status (right now integrated in the W3C standards), or the <a href="https://wiki.mozilla.org/WebAPI/AlarmAPI">Alarm
API</a>, which you can use to schedule a
notification, or for an application to be started, at a specific time.</p>
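<p>To give a flavour, reading the battery status from JavaScript looked roughly like this at the time (a sketch - the prefix handling is my assumption about the builds then shipping):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Gecko shipped the API behind a moz prefix at the time; the fallback is an assumption
var battery = navigator.battery || navigator.mozBattery;
if (battery) {
  console.log('charging: ' + battery.charging + ', level: ' + (battery.level * 100) + '%');
  // fires whenever the charge level changes
  battery.addEventListener('levelchange', function () {
    console.log('battery level is now ' + (battery.level * 100) + '%');
  });
}
</code></pre></div></div>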
<p>Mozilla is working really hard to standardize and integrate most of <a href="https://wiki.mozilla.org/WebAPI">these APIs</a> into the W3C in order to make them available in any web browser. Some of the APIs are actually now accepted in the W3C as for example the <a href="http://dvcs.w3.org/hg/dap/raw-file/tip/battery/Overview.html">Battery Status API</a>, <a href="http://dvcs.w3.org/hg/dap/raw-file/tip/network-api/index.html">Network Information API</a>, <a href="http://www.w3.org/TR/ambient-light/">Ambient light sensor</a> or the <a href="http://www.w3.org/TR/2012/WD-proximity-20120712/">Proximity sensor</a>.</p>
<p>Mozilla also presented their efforts in making it as easy as possible to create an
application from scratch by re-using several <a href="http://buildingfirefoxos.com/">building-blocks</a>
they have created for their new platform. Basically, they have created <a href="http://buildingfirefoxos.com/">a web page</a> where you can copy and paste code snippets to re-use in your own application,
keeping the look and feel of the platform.</p>
<p>After the talks, all the participants had a better idea of what we could
develop with the platform: a web application that could use the hardware of the
new mobile phone devices, as well as Android phones out of the box!</p>
<p>As the goal of the day was to create an app for the FirefoxOS, my idea was to create an application that could help to track when a new scientific application has been added to <a href="http://crowdcrafting.org"><strong>crowd</strong>crafting</a> so you could help with some tasks in the new application.</p>
<p>The web application basically lets you know which apps are new since the last time you checked it out.</p>
<p>The application works in any web browser (even Chrome) but if you want to feel how it will be in the new OS you can try it on your phone if you have an Android device. You will need to install the <a href="http://nightly.mozilla.org/">latest Firefox nightly</a> (<strong>note: </strong><em>this is an experimental build, so it may crash on your phone!</em>) and then type this URL:</p>
<p><a href="http://daniellombrana.es/crowdcrafting-app">http://daniellombrana.es/crowdcrafting-app</a></p>
<p>You will be able to install it in your phone and run it whenever you want directly from your home screen. If you don’t want to install the browser, just open the link with a modern web browser and you should see it running (the install button will only work in <a href="http://www.mozilla.org/en-US/firefox/channel/">Firefox Beta</a> and <a href="http://www.mozilla.org/en-US/firefox/channel/#aurora">Aurora builds</a>).</p>
<p><img src="http://i.imgur.com/xjljFcc.png" alt="FirefoxOS Crowdcrafting app" /></p>
Daniel Lombraña González
PyBossa.JS or how you can easily create new PyBossa applications
2013-01-28T00:00:00+00:00
http://okfnlabs.org/blog/2013/01/28/pybossa-js
<p>In the last weeks we have been working hard to make it easier to develop new PyBossa applications. For this reason, we are happy to announce a new version of PyBossa.JS. This new version introduces several improvements:</p>
<ul>
<li><strong>Creating an app is much easier!</strong> You only have to override two methods, pybossa.taskLoaded and pybossa.presentTask, to fit your app, and call pybossa.run(‘your-app-slug’) - see the sketch after this list.</li>
<li><strong>Pre-loading tasks by default!</strong> Now your app can improve its performance, as the next task for the user will be loaded in the background for you while the user is still solving the first one!</li>
<li><strong>Automatically update the task URL</strong>. The library will change the browser’s URL to the current task automatically, so using services like Disqus for comments is really simple (check the updated version of <a href="http://crowdcrafting.org/app/flickrperson">Flickr Person Finder</a> for more details!).</li>
</ul>
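<p>Put together, a minimal app skeleton looks something like the sketch below (the pybossa.saveTask call for persisting answers is my assumption - check the docs for the exact call):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pybossa.taskLoaded(function (task, deferred) {
  // pre-load anything the task needs (e.g. an image), then hand the task on
  deferred.resolve(task);
});

pybossa.presentTask(function (task, deferred) {
  // render task.info into the page, collect the user's answer,
  // save it (pybossa.saveTask is assumed here - check the docs),
  // then resolve to move on to the next task
  deferred.resolve();
});

pybossa.run('your-app-slug');
</code></pre></div></div>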
<p>As a result of this new version, there are at least two applications using the new PyBossa.JS version:</p>
<ul>
<li><a href="http://crowdcrafting.org/app/flickrperson">Flickr Person Finder</a> has been updating, using this new set of features. If you try the application you will see that loading the next task (in this case an image which is usually 1024x1024px big) is almost instantly. Additionally, the app shows how you can use the Disqus service to allow your users to add comments for each task, but only loading them when the user wants.</li>
<li><a href="http://crowdcrafting.org/app/thefacewemake">The Face We Make</a> is a new application where you have to guess the emoticon that a person is representing in a photo. This app is a joint effort with the official <a href="http://thefacewemake.org/about/">The Face We Make</a> project by <a href="http://dxtr.com/">Dexter Miranda</a> and <a href="http://daniellombrana.es">Daniel Lombraña González</a>. The app has been updated for using the new pre-loading of tasks, and once you have completed all of them (only 10 photos!) show you your results, in other words, how many of your guesses are right/wrong.</li>
</ul>
<p>Finally, we have also added the “missing features” that allow you to create an application without using the API. Right now, you can create an application using only the web forms:
<a rel="lightbox" title="Web form for creating an application" href="/img/pybossa-create-app.png"><img src="/img/pybossa-create-app.png" alt="Web form for creating an app" /></a></p>
<p>You can also add and work on the task presenter (we have included the <a href="http://codemirror.net">CodeMirror plugin</a>, so you can see how your code looks as you type it!):
<a rel="lightbox" title="Web form for editing the task presenter" href="/img/pybossa-task-presenter-editor.png"><img src="/img/pybossa-task-presenter-editor.png" alt="Web form for editing the task presenter" /></a></p>
<p>As well as importing the tasks via a CSV file importer (you can even import the CSV file from a Google Spreadsheet!):</p>
<p><a rel="lightbox" title="Web form for importing tasks from a CSV file" href="/img/pybossa-csv-import.png"><img src="/img/pybossa-csv-import.png" alt="Web form for importing tasks from a CSV file" /></a></p>
<p>The documentation has been updated to reflect these new features, and as a result you should be able to write an application really fast. However, we are far from perfect, so any feedback you can give us will be really valuable. Please leave your feedback in the comments or send us an e-mail to info@pybossa.com. We will be more than happy to hear your thoughts on PyBossa!</p>
Daniel Lombraña González
Journoid, data notifications
2013-01-25T00:00:00+00:00
http://okfnlabs.org/blog/2013/01/25/journoid
<p>At the <a href="http://okfnlabs.org/events/hackdays/lobbying.html">Open Interests</a> hackday in November, a discussion with <a href="http://www.martinstabe.com/">Martin Stabe</a> from the <a href="http://www.ft.com/intl/interactive">FT’s interactive desk</a> led to a prototype of <a href="https://github.com/pudo/journoid">Journoid</a>. The idea is to monitor changing on-line datasets for remarkable information, like <a href="http://datadesk.latimes.com/">earthquakes</a>, procurement in a particular industry or a close parliamentary vote. While we’d discussed alerting in the context of <a href="http://openspending.org/">OpenSpending</a> before, Martin had a pretty specific list of wishes that neither <a href="http://pandaproject.net/">PANDA</a> nor <a href="http://ifttt.com/">IFTTT</a> can handle:</p>
<ul>
<li>
<p>Search not just for a single keyword or query, but compare the incoming data to a table of matches, such as a list of famous people, well-known companies or any other set of items that you may be interested in.</p>
</li>
<li>
<p>Use Google Docs for configuration. The FT uses Google Apps internally and it’s an interface that their reporters already understand - just add a “Config” sheet to your keyword document, and store all relevant settings - like the source URL and recipient email - in there.</p>
</li>
</ul>
<p>The <a href="https://github.com/pudo/journoid">Journoid</a> prototype from the hackday only fulfills the first of those requirements - and I’m still struggling with #2, as it’s surprisingly hard to find a good Google Docs client library for Python.</p>
<p>Still, the hack was a nice demo: sift through a <a href="http://data.etl.openspending.org/uk25k/">data dump from the UK departmental spending</a>, check the supplier information against a list of companies of interest and finally send a message if there is a hit.</p>
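<p>Journoid itself is written in Python, but the core of that demo is simple enough to sketch in a few lines of JavaScript (the watchlist names and the row field are illustrative only):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// the table of matches - names and the row field below are illustrative only
var watchlist = ['ACME WIDGETS LTD', 'EXAMPLE HOLDINGS PLC'];

function checkRow(row) {
  var supplier = (row.supplier || '').toUpperCase();
  watchlist.forEach(function (name) {
    if (supplier.indexOf(name) !== -1) {
      // a real notifier would send an email here rather than log
      console.log('hit: ' + supplier + ' matched ' + name);
    }
  });
}
</code></pre></div></div>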
<p>As a further experiment, I was able to use <a href="http://opencorporates.com/">OpenCorporates</a> to check the supplier’s company status, answering a simple but interesting question: does the government do business with insolvent (or even dissolved) companies? It’s interesting to think what other matches can be made when the comparison list is actually an API.</p>
<p>What’s next? It’s time to clean up the <a href="https://github.com/pudo/journoid/tree/master/journoid">messy hackday code</a>, to finish up GDocs configuration, to build some hosted solution and possibly to add a few other input formats.</p>
<p>This will also probably be my last post to OKFN Labs - early next month,
I’ll join <a href="http://mozillaopennews.org">Knight-Mozilla OpenNews</a> at
<a href="http://spiegel.de">Spiegel Online</a> to spend ten months working on tools
like this, assisting journalists in telling more compelling stories on
the web. I hope that by continuing to cooperate with my friends in the
<a href="http://spendingstories.org">Spending Stories</a> project on Journoid and
similar efforts we can bring open (and some non-open) data into the
media, making a difference.</p>
<p><em>Photo credit: Mike Tigas, <a href="http://www.flickr.com/photos/madmannova/8384618902/sizes/l/in/set-72157632527677275/">If this then news demo</a> (similar project we’ve
started at OpenNews)</em></p>
Friedrich Lindenberg
Web Scraping with CSS Selectors in Node using JSDOM or Cheerio
2013-01-15T00:00:00+00:00
http://okfnlabs.org/blog/2013/01/15/web-scraping-with-node-css-selectors
<p>I’ve traditionally used python for web scraping but I’d been increasingly thinking about using Node given that it is pure JS and therefore could be a more natural fit when getting info out of <em>web</em> pages.</p>
<p>In particular, my first step when looking to extract information from a website is to open up the Chrome Developer tools (or Firebug in Firefox) and try to extract information by inspecting the page and playing around in the console - the latter is especially attractive if jQuery is available.</p>
<p>What I often end up with from this is a few lines of jQuery selectors. My desire here was to find a way to reuse these same css selectors from my browser experimentation directly in the scraping script. Now, things like <a href="http://packages.python.org/pyquery/">pyquery</a> do exist in python (and there is some css selector support in the brilliant BeautifulSoup) but a connection with something like Node seems even more natural - it is, after all, the JS engine from a browser!</p>
<h2 id="uk-crime-data">UK Crime Data</h2>
<p>My immediate motivation for this work was wanting to play around with the <a href="http://police.uk/data">UK Crime data</a> (all <a href="http://opendefinition.org/">open data</a> now!).</p>
<p>To do this I needed to:</p>
<ol>
<li>Get the data in consolidated form by scraping the file list and data files from <a href="http://police.uk/data/">http://police.uk/data/</a> - while they commendably provide the data in bulk there is no single file to download; instead there is one file per force per month.</li>
<li>Do data cleaning and analysis - this included some fun geo-conversion and csv parsing</li>
</ol>
<p>I’m just going to talk about the first part in what follows - though I hope to cover the second part in a follow-up post.</p>
<p>I should also note that all the code used for scraping and working with this data can be found in the <a href="https://github.com/datasets/crime-uk">UK Crime dataset data package on GitHub</a> - the <a href="https://github.com/datasets/crime-uk/blob/master/scripts/scrape.js">scrape.js file is here</a>. You can also see some of the ongoing results of these data experiments in an experimental <a href="http://okfnlabs.org/crime/">UK crime “dashboard” here</a>.</p>
<h2 id="scraping-using-css-selectors-in-node">Scraping using CSS Selectors in Node</h2>
<p>Two options present themselves when doing simple scraping using css selectors in node.js:</p>
<ul>
<li>Using <a href="https://github.com/tmpvar/jsdom">jsdom</a> (+ jquery)</li>
<li>Using <a href="https://github.com/MatthewMueller/cheerio">cheerio</a> (which provides jquery like access to html) + something to retrieve html (my preference is <a href="https://github.com/mikeal/request">request</a> but you can just uses <a href="http://nodejs.org/docs/v0.6.11/api/http.html#http.request">node’s built in http request</a>)</li>
</ul>
<p>For the UK crime work I used jsdom but I’ve subsequently used cheerio as it is substantially faster, so I’ll cover both here (I didn’t discover cheerio until I’d started on the crime work!).</p>
<p>Here’s an excerpted code example (full example in the <a href="https://github.com/datasets/crime-uk/blob/master/scripts/scrape.js">source file</a>):</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="kd">var</span> <span class="nx">url</span> <span class="o">=</span> <span class="dl">'</span><span class="s1">http://police.uk/data</span><span class="dl">'</span><span class="p">;</span>
<span class="c1">// holder for results</span>
<span class="kd">var</span> <span class="nx">out</span> <span class="o">=</span> <span class="p">{</span>
<span class="dl">'</span><span class="s1">streets</span><span class="dl">'</span><span class="p">:</span> <span class="p">[]</span>
<span class="p">}</span>
<span class="nx">jsdom</span><span class="p">.</span><span class="nx">env</span><span class="p">({</span>
<span class="na">html</span><span class="p">:</span> <span class="nx">url</span><span class="p">,</span>
<span class="na">scripts</span><span class="p">:</span> <span class="p">[</span>
<span class="dl">'</span><span class="s1">http://code.jquery.com/jquery.js</span><span class="dl">'</span>
<span class="p">],</span>
<span class="na">done</span><span class="p">:</span> <span class="kd">function</span><span class="p">(</span><span class="nx">errors</span><span class="p">,</span> <span class="nb">window</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">$</span> <span class="o">=</span> <span class="nb">window</span><span class="p">.</span><span class="nx">$</span><span class="p">;</span>
<span class="c1">// find all the html links to the street zip files</span>
<span class="nx">$</span><span class="p">(</span><span class="dl">'</span><span class="s1">#downloads .months table tr td:nth-child(2) a</span><span class="dl">'</span><span class="p">).</span><span class="nx">each</span><span class="p">(</span><span class="kd">function</span><span class="p">(</span><span class="nx">idx</span><span class="p">,</span> <span class="nx">elem</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// push the url (href attribute) onto the list</span>
<span class="nx">out</span><span class="p">[</span><span class="dl">'</span><span class="s1">streets</span><span class="dl">'</span><span class="p">].</span><span class="nx">push</span><span class="p">(</span> <span class="nx">$</span><span class="p">(</span><span class="nx">elem</span><span class="p">).</span><span class="nx">attr</span><span class="p">(</span><span class="dl">'</span><span class="s1">href</span><span class="dl">'</span><span class="p">)</span> <span class="p">);</span>
<span class="p">});</span>
<span class="p">}</span>
<span class="p">});</span></code></pre></figure>
<p>As an example of Cheerio scraping, here’s an excerpt from work <a href="https://github.com/datasets/opented">scraping info from the EU’s TED database</a> (sample <a href="http://files.opented.org.s3.amazonaws.com/scraped/100120-2011/summary.html">html file</a>):</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="c1">// request and cheerio must be required first (npm install request cheerio)</span>
<span class="kd">var</span> <span class="nx">request</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">request</span><span class="dl">'</span><span class="p">);</span>
<span class="kd">var</span> <span class="nx">cheerio</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">cheerio</span><span class="dl">'</span><span class="p">);</span>
<span class="kd">var</span> <span class="nx">url</span> <span class="o">=</span> <span class="dl">'</span><span class="s1">http://files.opented.org.s3.amazonaws.com/scraped/100120-2011/summary.html</span><span class="dl">'</span><span class="p">;</span>
<span class="c1">// place to store results</span>
<span class="kd">var</span> <span class="nx">data</span> <span class="o">=</span> <span class="p">{};</span>
<span class="c1">// do the request using the request library</span>
<span class="nx">request</span><span class="p">(</span><span class="nx">url</span><span class="p">,</span> <span class="kd">function</span><span class="p">(</span><span class="nx">err</span><span class="p">,</span> <span class="nx">resp</span><span class="p">,</span> <span class="nx">body</span><span class="p">){</span>
<span class="nx">$</span> <span class="o">=</span> <span class="nx">cheerio</span><span class="p">.</span><span class="nx">load</span><span class="p">(</span><span class="nx">body</span><span class="p">);</span>
<span class="nx">data</span><span class="p">.</span><span class="nx">winnerDetails</span> <span class="o">=</span> <span class="nx">$</span><span class="p">(</span><span class="dl">'</span><span class="s1">.txtmark .addr</span><span class="dl">'</span><span class="p">).</span><span class="nx">html</span><span class="p">();</span>
<span class="nx">$</span><span class="p">(</span><span class="dl">'</span><span class="s1">.mlioccur .txtmark</span><span class="dl">'</span><span class="p">).</span><span class="nx">each</span><span class="p">(</span><span class="kd">function</span><span class="p">(</span><span class="nx">i</span><span class="p">,</span> <span class="nx">html</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">spans</span> <span class="o">=</span> <span class="nx">$</span><span class="p">(</span><span class="nx">html</span><span class="p">).</span><span class="nx">find</span><span class="p">(</span><span class="dl">'</span><span class="s1">span</span><span class="dl">'</span><span class="p">);</span>
<span class="kd">var</span> <span class="nx">span0</span> <span class="o">=</span> <span class="nx">$</span><span class="p">(</span><span class="nx">spans</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">span0</span><span class="p">.</span><span class="nx">text</span><span class="p">()</span> <span class="o">==</span> <span class="dl">'</span><span class="s1">Initial estimated total value of the contract </span><span class="dl">'</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">amount</span> <span class="o">=</span> <span class="nx">$</span><span class="p">(</span><span class="nx">spans</span><span class="p">[</span><span class="mi">4</span><span class="p">]).</span><span class="nx">text</span><span class="p">()</span>
<span class="nx">data</span><span class="p">.</span><span class="nx">finalamount</span> <span class="o">=</span> <span class="nx">cleanAmount</span><span class="p">(</span><span class="nx">amount</span><span class="p">);</span>
<span class="nx">data</span><span class="p">.</span><span class="nx">initialamount</span> <span class="o">=</span> <span class="nx">cleanAmount</span><span class="p">(</span><span class="nx">$</span><span class="p">(</span><span class="nx">spans</span><span class="p">[</span><span class="mi">1</span><span class="p">]).</span><span class="nx">text</span><span class="p">());</span>
<span class="p">}</span>
<span class="p">});</span>
<span class="p">});</span></code></pre></figure>
Rufus Pollock
Archiving Twitter the Hacky Way
2013-01-08T00:00:00+00:00
http://okfnlabs.org/blog/2013/01/08/archiving-twitter-feeds-the-hacky-way
<p>There are many circumstances where you want to archive tweets - maybe just from your own account or perhaps for a hashtag for an event or topic.</p>
<p>Unfortunately Twitter search queries do not give data more than 7 days old and for a given account you can only get approximately the last 3200 of your tweets and 800 items from your timeline. [Update: People have pointed out that <a href="http://blog.twitter.com/2012/12/your-twitter-archive.html">Twitter released a feature to download an archive of your personal tweets at the end of December</a> - this, of course, still doesn’t help with queries or hashtags]</p>
<p>Thus, if you want to archive Twitter you’ll need to come up with another solution (or pay them, or a reseller, a bunch of money - see Appendix below!). Sadly, most of the online solutions have tended to disappear or be acquired over time (e.g. twapperkeeper). So a DIY solution would be attractive. After reading various proposals on the web I’ve found the following to work pretty well (but see also this <a href="http://mashe.hawksey.info/2012/01/twitter-archive-tagsv3/">excellent Google Spreadsheet-based solution</a>).</p>
<p>The proposed process involves 3 steps:</p>
<ol>
<li>Locate the Twitter Atom Feed for your Search</li>
<li>Use Google Reader as your Archiver</li>
<li>Get your data out of Google Reader (a 1000 items at a time!)</li>
</ol>
<p>One current drawback of this solution is that each stage has to be done by hand. It could be possible to automate more of this, and especially the important third step, if I could work out how to do more with the <a href="http://undoc.in/">Google Reader API</a>. Contributions or suggestions here would be very welcome!</p>
<p><strong><em>Note that the above method will become obsolete as of March 5 2013 when <a href="https://dev.twitter.com/docs/api/1.1/overview#New_Twitter_client_policies">Twitter close down RSS and Atom feeds</a> - continuing their long march to becoming a <del>fully</del> more closed and controlled ecosystem.</em></strong></p>
<p><strong><em>As you struggle, like me, to get precious archival information out of Twitter it may be worth reflecting on just how much information you’ve given to Twitter that you are now unable to retrieve (at least without paying) …</em></strong></p>
<h2 id="twitter-atom-feed">Twitter Atom Feed</h2>
<p>Twitter still have Atom feeds for their search queries:</p>
<p><a href="http://search.twitter.com/search.atom?q=my_search">http://search.twitter.com/search.atom?q=my_search</a></p>
<p>Note that if you want to search for a hash tag like #OpenData or a user e.g. @someone you’ll need to escape the symbols:</p>
<p><a href="http://search.twitter.com/search.atom?q=%23OpenData">http://search.twitter.com/search.atom?q=%23OpenData</a></p>
<p>Unfortunately Twitter Atom queries are limited to only a few items (around 20), so we’ll need to continuously archive that feed to get full coverage.</p>
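<p>Nothing stops you scripting that polling step yourself, of course, if you have a machine that stays online. Here is a minimal sketch of what a DIY poller could look like, assuming Node with the <code class="language-plaintext highlighter-rouge">request</code> module installed; the <code class="language-plaintext highlighter-rouge">archive.json</code> filename and the 10-minute interval are just illustrative choices of mine:</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript">var request = require('request');
var fs = require('fs');

var FEED = 'http://search.twitter.com/search.atom?q=%23OpenData';
var ARCHIVE = 'archive.json'; // my choice of filename

function poll() {
  request(FEED, function(err, resp, body) {
    if (err) { return console.error(err); }
    // load whatever we have archived so far
    var archive = fs.existsSync(ARCHIVE) ?
      JSON.parse(fs.readFileSync(ARCHIVE, 'utf8')) : [];
    var seen = {};
    archive.forEach(function(e) { seen[e.id] = true; });
    // crude Atom parsing: pull out each &lt;entry&gt; block and its &lt;id&gt;
    (body.match(/&lt;entry&gt;[\s\S]*?&lt;\/entry&gt;/g) || []).forEach(function(entry) {
      var id = (entry.match(/&lt;id&gt;(.*?)&lt;\/id&gt;/) || [])[1];
      if (id &amp;&amp; !seen[id]) {
        archive.push({ id: id, raw: entry });
        seen[id] = true;
      }
    });
    fs.writeFileSync(ARCHIVE, JSON.stringify(archive, null, 2));
  });
}

poll();
// re-poll every 10 minutes - often enough for a roughly 20-item feed
setInterval(poll, 10 * 60 * 1000);</code></pre></figure>
<p>The obvious drawback is that your machine has to stay online the whole time; Google Reader, described next, does the polling for you.</p>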
<h2 id="archiving-in-google-reader">Archiving in Google Reader</h2>
<p>Just add the previous feed URL in your Google Reader account. It will then start archiving.</p>
<p>Aside: because the Twitter Atom feed is limited to a small number of items and the check in Google Reader only happens every 3 hours (1h if someone else is archiving the same feed), you can miss a lot of tweets. One option could be to use Topsy’s RSS feeds, e.g. <a href="http://otter.topsy.com/searchdate.rss?q=%23okfn">http://otter.topsy.com/searchdate.rss?q=%23okfn</a> (though it’s not clear how to get more items from this feed either!)</p>
<h2 id="gettting-data-out-of-google-reader">Getting Data out of Google Reader</h2>
<p>Google Reader offers a decent (though still beta) API. Unofficial docs for it can be found here: <a href="http://undoc.in/">http://undoc.in/</a></p>
<p>The key URL we need is:</p>
<p><a href="http://www.google.com/reader/atom/feed/[feed_address]?n=1000">http://www.google.com/reader/atom/feed/[feed_address]?n=1000</a></p>
<p>Note that the feed is limited to a maximum of 1000 items and you can only access it for your account if you are logged in. This means:</p>
<ul>
<li>If you have more than 1000 items you need to find the continuation token in each set of results and then append &c={continuation-token} to your query (see the sketch below the example).</li>
<li>Because you need to be logged in in your browser, you need to do this by hand :-( (it may be possible to automate via the API but I couldn’t get anything to work - any tips much appreciated!)</li>
</ul>
<p>Here’s a concrete example (note, as you need to be logged in this won’t work for you):</p>
<p><a href="http://www.google.com/reader/atom/feed/http://search.twitter.com/search.atom%3Fq%3D%2523OpenData?n=1000">http://www.google.com/reader/atom/feed/http://search.twitter.com/search.atom%3Fq%3D%2523OpenData?n=1000</a></p>
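<p>For completeness, here is roughly what that continuation loop might look like if you <em>could</em> drive it from Node - a sketch only, assuming you have somehow copied an authenticated session cookie out of your browser (the <code class="language-plaintext highlighter-rouge">READER_COOKIES</code> environment variable is my own convention, the continuation token appears - if I recall the feed format correctly - in a <code class="language-plaintext highlighter-rouge">gr:continuation</code> element, and as noted above I haven’t managed to make this work myself):</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript">var request = require('request');

var BASE = 'http://www.google.com/reader/atom/feed/' +
  encodeURIComponent('http://search.twitter.com/search.atom?q=%23OpenData') +
  '?n=1000';

function fetchPage(continuation, done) {
  var url = continuation ? BASE + '&amp;c=' + continuation : BASE;
  request({
    url: url,
    // paste your browser's session cookies into this environment variable
    headers: { 'Cookie': process.env.READER_COOKIES }
  }, function(err, resp, body) {
    if (err) { return done(err); }
    // each page embeds a token pointing at the next page of results
    var m = body.match(/&lt;gr:continuation&gt;(.*?)&lt;\/gr:continuation&gt;/);
    done(null, body, m &amp;&amp; m[1]);
  });
}

// walk the pages until no continuation token comes back
function fetchAll(continuation) {
  fetchPage(continuation, function(err, page, next) {
    if (err) { return console.error(err); }
    console.log('got a page of %d characters', page.length);
    if (next) { fetchAll(next); }
  });
}
fetchAll();</code></pre></figure>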
<p>And that’s it! You should now have a local archive of all your tweets!</p>
<h2 id="appendix">Appendix</h2>
<p>Increasingly, Twitter is selling access to the full Twitter archive and there are a variety of 3rd-party services (such as Gnip, DataSift, Topsy <a href="https://dev.twitter.com/programs/twitter-certified-products/products#Data">and possibly more</a>) offering full or partial access for a fee.</p>
Rufus Pollock
Bundes-Git – German Laws on GitHub
2012-12-13T00:00:00+00:00
http://okfnlabs.org/blog/2012/12/13/bundesgit-german-laws-on-github
<p>If you compare software code and legislation you can find many similarities: both are big bodies of text spread over multiple units (laws/files). The total amount of text inevitably grows bigger over time with many small changes to existing parts while most of the corpus stays the same.</p>
<p>However, the tooling and editing process for these domains is very different: while developers are in the fortunate position that they can build and improve their own tools, legislators are stuck with proprietary tools like MS Word that are simply not built to collaboratively work on a big corpus of text.</p>
<p>But if source code and laws have a similar information structure, why not apply the tools used in software development to the legislative process? That is what Bundes-Git (“Federal Git”) is currently trying out in Germany.</p>
<p><a href="https://github.com/bundestag/gesetze">Bundes-Git</a> is a Git version control repository of all German Federal Laws and Regulations as Markdown. The goal was to come up with the simplest solution to handle laws that could possibly work and integrate it well into the existing developer ecosystem.</p>
<p>The idea has been well received with <a href="http://www.wired.com/wiredenterprise/2012/08/bundestag/">an article on Wired.com</a> and articles on German IT news sites <a href="http://www.heise.de/open/meldung/Entwicklungshistorie-von-Gesetzen-mit-Git-verfolgen-1662758.html">Heise</a> and <a href="http://www.golem.de/news/bundesgit-ein-git-repository-fuer-deutsche-gesetze-1208-93709.html">Golem</a>.</p>
<p>The popularity can surely also be attributed to our marvelous Bundes-Git mascot, dubbed octo eagle, thought up by myself and designed by <a href="https://kkaefer.com/">Konstantin Käfer</a> released under <a href="https://creativecommons.org/publicdomain/zero/1.0/">CC0</a> (please go this way if you are <a href="http://bundesgit.spreadshirt.de/">interested in a t-shirt or hoodie</a>).</p>
<h3 id="design-decisions-explained">Design decisions explained</h3>
<p>All other law storage formats use XML. But to me XML is neither human readable nor human writable. Let me get into the details of some of the design decisions:</p>
<ul>
<li><strong>Git</strong> because it’s the most popular distributed version control system right now.</li>
<li><strong>GitHub</strong> because it’s the most popular Git host right now and comes with some nice perks like Pull Request and GitHub Pages.</li>
<li><strong>Markdown</strong> because any more structure like XML or JSON would make it harder for humans to read or write the format and diffs would be difficult to read.</li>
<li>Naming files <code class="language-plaintext highlighter-rouge">index.md</code> because it works nicely with <strong>Jekyll and GitHub Pages</strong>, which render all laws into a currently very simple page.</li>
<li><strong>YAML Front Matter</strong> is necessary for Jekyll but also serves as a nice metadata store for laws.</li>
<li>Committing from branches with non-fast-forward merges because… uhmm. This is really up for discussion. I want to keep track of where changes originate, and branches are created for each law publication, but this heavily diverges from the clean-commit-history philosophy that e.g. the Linux kernel lives by.</li>
</ul>
<p>There are some more software development concepts that can be applied to the legislation process. Here are some fun things I’d like to try:</p>
<ul>
<li>A <a href="http://prose.io/">prose.io</a>-like editor to easily create law proposals and make a pull request.</li>
<li>Measuring the complexity of corpus/laws/paragraphs and using Travis CI to test pull requests if they make the complexity worse. <a href="http://www.clips.ua.ac.be/pages/pattern">Pattern</a> is a Python NLP library and they recently released a <a href="http://www.clips.ua.ac.be/pages/pattern-de">German module</a> which I want to try on our laws.</li>
<li>Testing foreign key integrity: are all referenced paragraphs still available? (See the sketch after this list.)</li>
<li>Create an informative visualization out of the Git log automatically like <a href="http://blog.openingparliament.org/post/37650393621/what-opening-parliamentary-information-can-tell-us">Gregor Aisch did by hand for the German political party law</a>.</li>
<li>Let the German president sign off on commits to master.</li>
</ul>
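<p>To make the referential-integrity idea from the list above concrete, here is a naive Node sketch - not something that exists in the repository. The file path and the <code class="language-plaintext highlighter-rouge">## § 1</code> heading convention are illustrative assumptions about the Markdown layout:</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript">var fs = require('fs');

// illustrative path - adjust to wherever the law's index.md actually lives
var text = fs.readFileSync('gesetze/b/bgb/index.md', 'utf8');

// collect every section that is actually defined as a heading
var defined = {};
(text.match(/^#+ § \d+[a-z]*/gm) || []).forEach(function(heading) {
  defined[heading.replace(/^#+ /, '')] = true;
});

// flag every in-text reference that points at a missing section
(text.match(/§ \d+[a-z]*/g) || []).forEach(function(ref) {
  if (!defined[ref]) {
    console.log('Dangling reference:', ref);
  }
});</code></pre></figure>
<p>A real check would of course have to handle cross-law references as well, which this single-file sketch would wrongly flag.</p>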
<p>The design decisions around Bundes-Git fit nicely into the Git/GitHub ecosystem but they are not set in stone. They also create some problems and annoyances that need to be fixed or circumvented. While I believe the general philosophy and the freshness of the approach is the right direction, we clearly need more discussion.</p>
<h3 id="future-happenings-around-bundes-git">Future happenings around Bundes-Git:</h3>
<ul>
<li>We applied for funding at <a href="http://innovation.globalintegrity.org/idea-submissions/2012/12/10/applying-version-control-to-the-legislative-process">Testing 123 Global Integrity Innovation Fund</a>. Bundes-Git definitely fits their criteria of brand new, innovative and high-risk. The decision will be made later this month, fingers crossed!</li>
<li>I will talk at the <a href="http://events.ccc.de/congress/2012/Fahrplan/events/5263.en.html">29th Chaos Communication Congress about Bundes-Git</a>.</li>
<li>There will be Bundes-Git Hacker Meetup in mid January. If you are interested, <a href="https://terminplaner.dfn.de/foodle.php?id=hhndrdx742az60wf">sign up here</a>.</li>
</ul>
<p>We decided that the language of discussion on GitHub will be German, but feel free to start a conversation on the <a href="http://lists.okfn.org/mailman/listinfo/open-legislation">OKF Open Legislation mailing list</a>.</p>
<p><strong>Also be sure to follow <a href="https://twitter.com/bundesgit">@bundesgit on Twitter</a>!</strong></p>
Stefan Wehrmeyer
Speeding Up Your PyBossa App
2012-12-12T00:00:00+00:00
http://okfnlabs.org/blog/2012/12/12/speeding-up-pybossa-apps
<p>Thanks to the free <a href="http://crowdcrafting.org">crowd-crafting</a> tool <a href="http://dev.pybossa.com/">PyBossa</a>, nowadays the biggest challenge for successful crowd-sourcing is engaging users to participate in tasks, and keeping that motivation at a high level over time. Therefore, the user experience of crowd-sourcing apps plays a crucial role.</p>
<p>After participating in quite a few tasks myself, I found that the loading time in between two tasks was the most annoying thing. Doing crowd-sourcing tasks often feels like doing something stupid, and you really want to get things done as fast as possible. Sometimes it needs just a single click to solve a task, but then it takes seconds to load the next one.</p>
<p>This is because all existing apps were designed in a synchronous fashion. The client requests a new task and presents it to the user as soon as it has been loaded. <em>After</em> the user has solved the task, the result is submitted and <em>after</em> the result has been stored a new task is requested and so on.</p>
<p><a rel="lightbox" title="Process flow in current PyBossa apps" href="/img/pybossa-workflow-old.png"><img src="/img/pybossa-workflow-old.png" alt="current workflow" /></a></p>
<p>(click to enlarge)</p>
<p>Some apps even need to load additional information, such as images or data coming from external APIs. This loading time accumulates quickly, and will most probably lower the motivation of your users!</p>
<h2 id="pre-loading-subsequent-tasks--magic">Pre-loading subsequent tasks == magic</h2>
<p>The idea for reducing the loading time is actually pretty simple: We let the app load the next task <em>while</em> the user is solving the current one. This results in a parallel process as described in the following chart:</p>
<p><a rel="lightbox" title="Proposed process flow for PyBossa apps" href="/img/pybossa-workflow-new.png"><img src="/img/pybossa-workflow-new.png" alt="proposed workflow" /></a></p>
<p>To implement this in PyBossa, we needed to change the PyBossa API a little bit (thanks @<a href="https://github.com/PyBossa/pybossa/commit/4f5bdd4698a1ac21f3021347cd9ec08e68f18bdc">teleyinex</a>). Before that change consecutive calls to the <a href="https://pybossa.readthedocs.io/en/latest/model.html#requesting-a-new-task-for-current-user">newtask endpoint</a> would return the same task again and again, until the user has solved it. Now with the newly introduced parameter <strong>offset</strong> you can request the next tasks in line.</p>
<p>Another requirement for pre-loading of tasks is to keep the entire app on one page as otherwise the cached task would be lost. The rest of this post describes a smart way to implement this using <a href="http://api.jquery.com/category/deferred-object/">jQuery.Deferred</a>.</p>
<h2 id="smart-implementation-using-jquerydeferred">Smart implementation using jQuery.Deferred</h2>
<p>Looking from our PyBossa app, the pre-loading of the next task and the user solving the current one are two asynchronous actions running in parallel. We have to wait until both are completed before we can proceed to the next task.</p>
<p><a href="http://eng.wealthfront.com/2012/12/jquerydeferred-is-most-important-client.html">This article</a> reminded me of a smart way to implement this using jQuery.Deferred. The following function shows everything we need for our main loop.</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="kd">function</span> <span class="nx">run</span><span class="p">(</span><span class="nx">task</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">nextLoaded</span> <span class="o">=</span> <span class="nx">loadTask</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span>
<span class="nx">taskSolved</span> <span class="o">=</span> <span class="nx">presentTask</span><span class="p">(</span><span class="nx">task</span><span class="p">);</span>
<span class="nx">$</span><span class="p">.</span><span class="nx">when</span><span class="p">(</span><span class="nx">nextLoaded</span><span class="p">,</span> <span class="nx">taskSolved</span><span class="p">).</span><span class="nx">done</span><span class="p">(</span><span class="nx">run</span><span class="p">);</span>
<span class="p">}</span></code></pre></figure>
<p>To start the loop, we need to load the first task and pass it to run.</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="nx">loadTask</span><span class="p">().</span><span class="nx">done</span><span class="p">(</span><span class="nx">run</span><span class="p">);</span></code></pre></figure>
<p>Now let’s take a look at <code class="language-plaintext highlighter-rouge">loadTask()</code>. The parameter offset is passed to the API. After the task and everything else we might need is loaded, we mark the deferred as resolved and pass the task to the done handler. Finally we return a ‘locked’ version of the deferred object.</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="kd">function</span> <span class="nx">loadTask</span><span class="p">(</span><span class="nx">offset</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">offset</span> <span class="o">=</span> <span class="nx">offset</span> <span class="o">||</span> <span class="mi">0</span><span class="p">;</span>
<span class="kd">var</span> <span class="nx">taskLoaded</span> <span class="o">=</span> <span class="nx">$</span><span class="p">.</span><span class="nx">Deferred</span><span class="p">();</span>
<span class="nx">$</span><span class="p">.</span><span class="nx">getJSON</span><span class="p">(</span><span class="dl">'</span><span class="s1">/api/app/</span><span class="dl">'</span><span class="o">+</span><span class="nx">appid</span><span class="o">+</span><span class="dl">'</span><span class="s1">/newtask?offset=</span><span class="dl">'</span> <span class="o">+</span> <span class="nx">offset</span><span class="p">,</span> <span class="kd">function</span><span class="p">(</span><span class="nx">task</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// load more data if you need</span>
<span class="c1">// and then, resolve Deferred</span>
<span class="nx">taskLoaded</span><span class="p">.</span><span class="nx">resolve</span><span class="p">(</span><span class="nx">task</span><span class="p">);</span>
<span class="p">});</span>
<span class="k">return</span> <span class="nx">taskLoaded</span><span class="p">.</span><span class="nx">promise</span><span class="p">();</span>
<span class="p">}</span></code></pre></figure>
<p>We can use exactly the same method to model the user action. Therefore <code class="language-plaintext highlighter-rouge">presentTask()</code> will return a deferred object, too. It gets resolved as soon as the user has solved the task and the answer is correctly submitted to PyBossa.</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="kd">function</span> <span class="nx">presentTask</span><span class="p">(</span><span class="nx">task</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">taskSolved</span> <span class="o">=</span> <span class="nx">$</span><span class="p">.</span><span class="nx">Deferred</span><span class="p">();</span>
<span class="c1">// update presenter html</span>
<span class="nx">$</span><span class="p">(</span><span class="dl">'</span><span class="s1">.question</span><span class="dl">'</span><span class="p">).</span><span class="nx">html</span><span class="p">(</span><span class="nx">task</span><span class="p">.</span><span class="nx">question</span><span class="p">);</span>
<span class="c1">// wait for user action</span>
<span class="nx">$</span><span class="p">(</span><span class="dl">'</span><span class="s1">button.submit</span><span class="dl">'</span><span class="p">).</span><span class="nx">off</span><span class="p">(</span><span class="dl">'</span><span class="s1">click</span><span class="dl">'</span><span class="p">).</span><span class="nx">on</span><span class="p">(</span><span class="dl">'</span><span class="s1">click</span><span class="dl">'</span><span class="p">,</span> <span class="kd">function</span><span class="p">()</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">answer</span> <span class="o">=</span> <span class="p">{</span> <span class="na">foo</span><span class="p">:</span> <span class="dl">"</span><span class="s2">Bar</span><span class="dl">"</span> <span class="p">};</span> <span class="c1">// fetch answer from UI</span>
<span class="nx">pybossa</span><span class="p">.</span><span class="nx">saveTask</span><span class="p">(</span><span class="nx">task</span><span class="p">.</span><span class="nx">id</span><span class="p">,</span> <span class="nx">answer</span><span class="p">).</span><span class="nx">done</span><span class="p">(</span><span class="kd">function</span><span class="p">()</span> <span class="p">{</span>
<span class="nx">taskSolved</span><span class="p">.</span><span class="nx">resolve</span><span class="p">();</span>
<span class="p">});</span>
<span class="p">});</span>
<span class="k">return</span> <span class="nx">taskSolved</span><span class="p">.</span><span class="nx">promise</span><span class="p">();</span>
<span class="p">}</span></code></pre></figure>
<p>And that’s it.</p>
<p>This method will significantly speed up your PyBossa app, especially if you need to fetch data from third party APIs. Remind yourself that even a speedup of a few seconds is a huge benefit for your voluntary users, as they are likely to go through this process quite often. And you really don’t want to waste their time, do you?</p>
<p><em>Update:</em> Why not try the <a href="http://crowdcrafting.org/app/flickrperson2/newtask">FlickrPerson demo app the speedy way</a>?</p>
Gregor Aisch
Javascript Timeline Libraries - A Review
2012-12-04T00:00:00+00:00
http://okfnlabs.org/blog/2012/12/04/javascript-timeline-libaries-a-review
<p>This post is a rough and ready overview of various javascript timeline libraries that arose from research in creating a timeline view for <a href="http://reclinejs.com/">Recline JS</a>. Note this material hung around on my hard disk for a few months so some of it may already be a little bit out of date!</p>
<div class="alert alert-info">
<strong>October 2013</strong>: We have released <strong><a href="http://timemapper.okfnlabs.org/">TimeMapper</a></strong> a new online app for creating <strong>Timelines and TimeMaps</strong> quickly and easily. Check it out at <strong><a href="http://timemapper.okfnlabs.org/">http://timemapper.okfnlabs.org/</a></strong>
</div>
<p>I want to start with a general comment. Timeline libraries consist of various components:</p>
<ul>
<li>Data loading
<ul>
<li>Date parsing</li>
</ul>
</li>
<li>Band (timeline) rendering</li>
<li>Showing render info on individual items</li>
</ul>
<p>For me a timeline visualization library needs only to be the second of these, but most that I’ve come across do more.</p>
<p>In fact a major issue in my opinion with most libraries is that they are <em>under-componentized</em> - they don’t separate cleanly into these different components and end up doing everything.</p>
<p>To take one example, the Verite timeline (in my view one of the best libraries out there) has a whole bunch of its own custom date parsing built in inside an internal utility library which is hard to override or replace, and also has a large chunk of code just for loading from google docs and other data sources. (You can of course somewhat solve this – as I do in Recline – by parsing the dates directly and then submitting in a standardized form.)</p>
<p>In my view, even if library authors do want to include these sorts of things, it would be good to do it in a way that allowed for a clean separation so that you could just use the parts you wanted (and/or over-ride parts more cleanly).</p>
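<p>To make the point concrete, here is the kind of thing I mean - a sketch (not any particular library’s API) where date parsing is done up front and the timeline component only ever sees standardized ISO 8601 strings:</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript">// normalize the many date forms found in real data to ISO 8601
function normalizeDate(value) {
  var d = (value instanceof Date) ? value : new Date(value);
  if (isNaN(d.getTime())) {
    throw new Error('Unparseable date: ' + value);
  }
  return d.toISOString();
}

// parse first, then hand the timeline pre-standardized values
var rawItems = [{ title: 'Simile Timeline released', start: 'Jun 15, 2006' }];
var items = rawItems.map(function(item) {
  return { title: item.title, start: normalizeDate(item.start) };
});</code></pre></figure>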
<h2 id="propublica-timeline-setter">Propublica Timeline Setter</h2>
<ul>
<li><a href="http://propublica.github.com/timeline-setter/">http://propublica.github.com/timeline-setter/</a></li>
<li>HTML + JS
<ul>
<li>But requires a build step (using Ruby)</li>
</ul>
</li>
<li>Very simple and compact design (nice!)</li>
</ul>
<h2 id="verite-timeline">Verite Timeline</h2>
<ul>
<li><a href="http://timeline.verite.co/">http://timeline.verite.co/</a></li>
<li>Very elegant frontend design</li>
<li>2 bands in timeline segment and tight integration of item display</li>
<li>Includes much more than Timeline (e.g. sourcing data from google docs etc)</li>
<li>Mozilla Public License (was GPL)</li>
</ul>
<h2 id="simile-timeline">Simile Timeline</h2>
<ul>
<li><a href="http://www.simile-widgets.org/timeline/">http://www.simile-widgets.org/timeline/</a></li>
<li>The original open-source JS timeline but less regularly updated and maintained today: “As of Spring 2012, Exhibit is the only Simile widget seeing active development.” and the timeline control has not been updated since 2009 (see this <a href="http://stackoverflow.com/questions/4700419/alternative-to-simile-timeline-for-timeline-visualization">stackoverflow question for more</a>)</li>
</ul>
<h2 id="chronoline">Chronoline</h2>
<ul>
<li><a href="http://stoicloofah.github.com/chronoline.js/">http://stoicloofah.github.com/chronoline.js/</a></li>
<li>Recently developed and updated</li>
<li>MIT licensed</li>
</ul>
<h2 id="timeglider">Timeglider</h2>
<ul>
<li><a href="https://github.com/timeglider/jquery_widget">https://github.com/timeglider/jquery_widget</a></li>
<li>Non-open license (but was MIT licensed <a href="https://github.com/timeglider/jquery_widget/tree/345442fa3dc7c66b23c36031a6569693ecf309bd">earlier on</a>)</li>
</ul>
<h2 id="chaps-timeline">CHAPS Timeline</h2>
<ul>
<li><a href="http://almende.github.com/chap-links-library/timeline.html">http://almende.github.com/chap-links-library/timeline.html</a></li>
<li>Looks pretty nice though CSS is not quite as elegant (probably fixable!)</li>
<li>Not clear whether it supports multiple bands</li>
</ul>
Rufus Pollock
Following Money and Influence in the EU - the Open Interests Hackathon
2012-11-29T00:00:00+00:00
http://okfnlabs.org/blog/2012/11/29/openinterests-review
<p>
Making sense of massive datasets that document
the processes of lobbying and public procurement at European Union level
is not an easy task. Yet a group of 25 journalists, developers, graphic
designers and activists worked together at the <a href="http://okfnlabs.org/events/hackdays/lobbying.html">Open Interests
Europe</a> hackathon last weekend to create tools and maps that make it
easier for citizens and journalists to see how lobbyists try to
influence European policies and to understand how governments award
contracts for public services. The hackathon was organised by the
European Journalism Centre and the Open Knowledge Foundation with
support from Knight-Mozilla OpenNews.</p>
<p>
At the Google Campus Cafe in London, one group dived into European
lobbying data made available via an API: <a href="http://api.lobbyfacts.eu/">api.lobbyfacts.eu</a>. Created by a
group of five NGOs: Corporate Europe Observatory, Friends of the Earth
Europe, Lobby Control, Tactical Tech and the Open Knowledge Foundation,
the API gives access to up-to-date, structured information about persons
and organisations registered as lobbyists in the <a href="http://europa.eu/transparency-register/">EU Transparency
Register</a>. The API is part of lobbyfacts.eu, a website that aims
to make it easy for anyone to track lobbyists and their influence at
European Union level, due to launch in January 2013.</p>
<p>
One of the projects created with the lobby register data is a map
showing the locations of the offices of lobby firms based on their
turnover. The size of the bubbles on the map corresponds to the turnover
of the firm. Built by <a href="https://twitter.com/pudo">Friedrich
Lindenberg</a>, the map is an overlay of a Stamen Design map with
Leafletjs.</p>
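<p>For readers who want to try something similar, a bubble map along these lines can be put together in a few lines of Leaflet. The sketch below is mine, not Friedrich’s code; it assumes Leaflet and jQuery are on the page, and the API endpoint and field names (<code class="language-plaintext highlighter-rouge">lat</code>, <code class="language-plaintext highlighter-rouge">lng</code>, <code class="language-plaintext highlighter-rouge">turnover</code>, <code class="language-plaintext highlighter-rouge">name</code>) are illustrative assumptions:</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript">// assumes the Leaflet and jQuery scripts are already loaded on the page
var map = L.map('map').setView([50.85, 4.35], 4); // centred on Brussels

// Stamen Design's toner tiles as the base layer
L.tileLayer('http://tile.stamen.com/toner/{z}/{x}/{y}.png', {
  attribution: 'Map tiles by Stamen Design'
}).addTo(map);

// illustrative endpoint and fields - adapt to the real API's output
$.getJSON('http://api.lobbyfacts.eu/api/entity?format=json', function(firms) {
  firms.forEach(function(firm) {
    L.circleMarker([firm.lat, firm.lng], {
      // scale bubble *area* (not radius) with turnover
      radius: Math.sqrt(firm.turnover / 100000)
    }).bindPopup(firm.name).addTo(map);
  });
});</code></pre></figure>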
<p style="text-align: center">
<img alt="" src="https://lh4.googleusercontent.com/Gz7dg2T1mfSb2U7uDfotj2_giiIj8-gSIa5GEpw0SoB7negarpQpeHEW13-QmxOF5YkC_vHg7fyQNeFGU65iyfYdx_cmzxf8nfLYVigKXBamuD8Roe0C" style="height: 320px;width: 600px" /></p>
<p style="text-align: center">
<em>Screenshot of <a href="http://api.lobbyfacts.eu/map">api.lobbyfacts.eu/map</a> showing
locations of lobbying firms across Europe</em></p>
<p>
Other teams focused on data analysis, comparing the data from the EU
Transparency Register with that of the <a href="http://www.google.com/url?q=http%3A%2F%2Fec.europa.eu%2Ftransparency%2Fregexpert%2F&sa=D&sntz=1&usg=AFQjCNE2JbDkGcyojnufFa8-lw8sMFEpyA">Register
of Expert Groups</a>. Interesting leads for possible further
investigative work resulted from the comparison of the figures reported
by lobby firms in the Transparency Register with those collected by the
<a href="http://www.google.com/url?q=http%3A%2F%2Fwww.nbb.be%2Fpub%2Fhome.htm&sa=D&sntz=1&usg=AFQjCNEOiiu39BbbE6C8eJF7FI_8J1vT9Q">National
Bank of Belgium</a>. “Some companies underreported massively to
the National Bank of Belgium and some of them were making themselves
look bigger in the Transparency Register,” said Eric Wesselius,
leader of the lobby transparency challenge and co-founder of <a href="http://corporateeurope.org/">Corporate Europe Observatory</a>.
Wesselius’ organisation will continue investigations in this
area.</p>
<p>
A second group of journalists and graphic designers led by Jack
Thurston, an activist involved in <a href="http://fishsubsidy.org/">Fishsubsidy.org</a>, discussed how fish
subsidy data could be used for finding journalistic stories and explored
various ways in which the unintended consequences of the EU fish
subsidies programme, such as overfishing, could be compellingly
presented to the general public. </p>
<p style="text-align: center">
<img alt="" src="https://lh3.googleusercontent.com/aRSFEijY87FeGF1vDcWwVJBYQlvNV1uordwuc7kVcjheSV6uDBvmRyKn9e4R5GgtFjTuA1-lh_1m2sAp-3S6qKb7QPW1ASFV3WIWWv_2ff9YX7gEWA0" style="width: 400px;height: 300px" /></p>
<p style="text-align: center">
<em>Sketch for interactive graphic showing fishing vessels, their
trajectory and the subsidies they receive, made by graphic designer <a href="http://helenesears.carbonmade.com/">Helene Sears</a></em></p>
<p>
A third group looked into European public procurement data.
“Public procurement is an area that is underreported by
journalists,” said data journalist Anders Pedersen, founder of <a href="http://opented.org/">OpenTED</a>. “9-25% of the GDP in the
EU is procurement - highest in the Netherlands where it is around 35%.
It’s a real issue in times of austerity who provides our
services,” he added.</p>
<p>
Several <a href="http://www.google.com/url?q=https%3A%2F%2Fgithub.com%2Fmiha-stopar%2Fsandbox&sa=D&sntz=1&usg=AFQjCNEPCecCTO1CWVEDufnaAtGGR4Q4Tw">scrapers</a>
were built to access the data relating to winners of contracts and the
values of these contracts from the EU publication <a href="http://ted.europa.eu/TED/main/HomePage.do">TED</a> (Tenders
Electronic Daily). A map of public procurement contracts by awarding
city was created using Google Fusion Tables by geocoding the original
CSV file, enriched with OpenStreetMap.</p>
<p style="text-align: center">
<img src="https://lh5.googleusercontent.com/oJnD9EYVOLshaLA4j3dsMHf4JxU3tzTHiQQcnjF8XFY20Psfm4Z4xlgWBOSePQzwE4SplYfyc_b_W19eCVtKMQgl00eDlDQDMxMjkkM2ghgmGYV6_AZc" style="height: 321px;width: 600px" /></p>
<p style="text-align: center">
<em>Screenshot of <a href="https://www.google.com/fusiontables/data?docid=1Cq8cKQ2r739is5gXegmX-fkI6ASAi5OOe9mepIo&pli=1#map:id=3">map
of public procurement contracts</a> by Benjamin Simatos and Martin
Stabe</em></p>
<p>
Pedersen’s long term goal is to create an interface and an API for
EU public procurement data and to publish some more visualisations.
“A lot of the work that got done here [at the hackathon] we would
not have gotten done in the next months maybe. It really helped us push
far ahead in terms of ideas and in terms of getting stuff
done.”</p>
<p>
This blog post is cross-posted from the <a href="http://datadrivenjournalism.net/news_and_analysis/Following_Money_and_Influence_in_the_EU_the_Open_Interests_Europe_Hackday">Data-driven
Journalism Blog</a>.
</p>
<p><em>Photo of participants at the hackathon by <a href="http://www.flickr.com/photos/fred2baro/">Mehdi Guiraud</a>.</em></p>
Liliana Bounegru
Scraping Data Behind a CAPTCHA
2012-11-13T00:00:00+00:00
http://okfnlabs.org/blog/2012/11/13/scrapping-data-behind-a-captcha
<p>How much does the highest-paid person in the Brazilian Federal Senate earn?
That’s the question I asked myself a few weeks ago, and one that should be
easy to answer. In Brazil, every public body must publish its employees’
salaries online, but some do so in a terrible way. The Federal Senate is
one of these.</p>
<p>To access its data you have to not only fill in your personal info, but also
solve a CAPTCHA for each salary you want to see. With no other tricks, it would
take ages to answer my question. I needed a way to gather all salaries and
compare them. But how to scrape a page that’s “protected” behind a CAPTCHA?</p>
<p><img src="/img/res/senado-gov-br-captcha.jpg" style="margin: 0 auto; display: block;" alt="senado.gov.br CAPTCHA" /></p>
<p><a href="http://decaptcher.com">Decaptcher</a> is a company that sells CAPTCHA-solving
services. They provide an API to which you can send an image and get back the
contained text. It’s really cheap (US$ 1.38 per 1,000 CAPTCHAs), and works well,
albeit a bit slow (30~40 secs). They promise a success rate of over 95%, but I got
only 43% in my tests, probably because the CAPTCHAs I’m sending are really hard to read.</p>
<p><a href="http://decaptcher.org/api">Their API</a> is simple to implement, with only 3
actions (upload, refund, and balance). There are examples in C# and PHP, and
I’ve hacked together <a href="https://gist.github.com/4063793">one in Ruby</a>. For a
bit more than US$ 5.92, I was able to access and publish the salaries of
4,487 public servants in <a href="http://senado.cc">http://senado.cc</a>.</p>
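<p>For a flavour of what such a call looks like, here is a hedged Node sketch (my working version is the Ruby gist linked above). The endpoint URL and form field names below are placeholders, <em>not</em> the actual Decaptcher API - check <a href="http://decaptcher.org/api">their docs</a> for the real parameters:</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript">var fs = require('fs');
var request = require('request');

function solveCaptcha(imagePath, done) {
  request.post({
    url: 'http://api.decaptcher.example/solve', // placeholder, not the real endpoint
    formData: {
      // placeholder field names - see the API docs for the real ones
      username: process.env.DC_USER,
      password: process.env.DC_PASS,
      image: fs.createReadStream(imagePath)
    }
  }, function(err, resp, body) {
    if (err) { return done(err); }
    done(null, body.trim()); // the solved text, 30-40 seconds later
  });
}

solveCaptcha('captcha.jpg', function(err, text) {
  if (err) { return console.error(err); }
  console.log('CAPTCHA says:', text);
});</code></pre></figure>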
<p>There are many other companies that offer the same service, like
<a href="http://deathbycaptcha.com">Death by CAPTCHA</a>, <a href="http://bypasscaptcha.com/">Bypass CAPTCHA</a>,
<a href="http://www.beatcaptchas.com/">Beat CAPTCHA</a>, and <a href="http://antigate.com/">Antigate</a>.
These services allow us to access public data that would be unreachable otherwise,
but they might be considered illegal in some countries. As we’re not breaking the
CAPTCHA, but paying people to solve them, we should be fine. But don’t take my word
for it: ask a lawyer.</p>
Vitor Baptista
Recline JS Search Demo
2012-11-01T00:00:00+00:00
http://okfnlabs.org/blog/2012/11/01/recline-js-search-demo
<p><a href="http://reclinejs.com/"><img src="http://assets.okfn.org/p/recline/img/logo.png" style="float: right; height: 100px;" alt="Recline JS" /></a></p>
<p>We’ve recently finished a demo for ReclineJS showing how it can be used to build
JS-based (ajax-style) search interfaces in minutes (or even seconds!):
<a href="http://reclinejs.com/demos/search/">http://reclinejs.com/demos/search/</a></p>
<p>Because of Recline’s <a href="http://reclinejs.com/docs/backends.html">pluggable backends</a> you get out of the box
support for data sources such as SOLR, Google Spreadsheet, ElasticSearch, or
plain old JSON or CSV – see examples below for live examples of using
different backends.</p>
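<p>For instance, pointing the dataset at a different source is essentially a one-liner - here is a minimal sketch (using a made-up CSV URL) along the lines of what the demo does:</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript">// wire a dataset to a backend - swap 'csv' for e.g. 'solr' or 'gdocs'
var dataset = new recline.Model.Dataset({
  url: 'http://example.com/data.csv', // made-up URL
  backend: 'csv'
});

// fetch returns a promise; once resolved the records collection is populated
dataset.fetch().done(function() {
  console.log(dataset.recordCount + ' records found');
});</code></pre></figure>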
<p>Interested in using this yourself? The <a href="http://reclinejs.com//docs/src/demo.search.app.html">(prettified) source JS for the demo is
available</a> (plus the <a href="http://reclinejs.com/demos/search/demo.search.app.js">raw version</a>) and it shows how simple
it is to build an app like this using Recline – plus it has tips on how
to customize and extend.</p>
<p><a href="http://reclinejs.com/demos/search/"><img src="http://i.imgur.com/Ja8SV.png" alt="demo" style="width: 100%" /></a></p>
<h2 id="more-examples">More Examples</h2>
<p>In addition to the simple example with local data there are several other
examples showing how one can use this with other data sources including Google
Docs and SOLR:</p>
<ol>
<li>
<p>A <a href="http://reclinejs.com/demos/search/?backend=gdocs&url=https://docs.google.com/spreadsheet/ccc?key=0Aon3JiuouxLUdExXSTl2Y01xZEszOTBFZjVzcGtzVVE">search example using a google docs listing Shell Oil spills in the Niger
delta</a></p>
</li>
<li>
<p>A <a href="http://reclinejs.com/demos/search/?backend=solr&url=http://openspending.org/api/search">search example running of OpenSpending SOLR
API</a>
– we suggest searching for something interesting like “Drugs” or “Nuclear
power”!</p>
</li>
</ol>
<h2 id="code">Code</h2>
<p>The full <a href="http://reclinejs.com//docs/src/demo.search.app.html">(prettified) source JS for the demo is available</a>
(plus the <a href="http://reclinejs.com/demos/search/demo.search.app.js">raw version</a>) but here’s a key code sample to give a flavour:</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="c1">// ## Simple Search View</span>
<span class="c1">//</span>
<span class="c1">// This is a simple bespoke Backbone view for the Search. It Pulls together</span>
<span class="c1">// various Recline UI components and the central Dataset and Query (state)</span>
<span class="c1">// object</span>
<span class="c1">//</span>
<span class="c1">// It also provides simple support for customization e.g. of template for list of results</span>
<span class="c1">//</span>
<span class="c1">// var view = new SearchView({</span>
<span class="c1">// el: $('some-element'),</span>
<span class="c1">// model: dataset</span>
<span class="c1">// // EITHER a mustache template (passed a JSON version of recline.Model.Record</span>
<span class="c1">// // OR a function which receives a record in JSON form and returns html</span>
<span class="c1">// template: mustache-template-or-function</span>
<span class="c1">// });</span>
<span class="kd">var</span> <span class="nx">SearchView</span> <span class="o">=</span> <span class="nx">Backbone</span><span class="p">.</span><span class="nx">View</span><span class="p">.</span><span class="nx">extend</span><span class="p">({</span>
<span class="na">initialize</span><span class="p">:</span> <span class="kd">function</span><span class="p">(</span><span class="nx">options</span><span class="p">)</span> <span class="p">{</span>
<span class="k">this</span><span class="p">.</span><span class="nx">el</span> <span class="o">=</span> <span class="nx">$</span><span class="p">(</span><span class="k">this</span><span class="p">.</span><span class="nx">el</span><span class="p">);</span>
<span class="nx">_</span><span class="p">.</span><span class="nx">bindAll</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="dl">'</span><span class="s1">render</span><span class="dl">'</span><span class="p">);</span>
<span class="k">this</span><span class="p">.</span><span class="nx">recordTemplate</span> <span class="o">=</span> <span class="nx">options</span><span class="p">.</span><span class="nx">template</span><span class="p">;</span>
<span class="c1">// Every time we do a search the recline.Dataset.records Backbone</span>
<span class="c1">// collection will get reset. We want to re-render each time!</span>
<span class="k">this</span><span class="p">.</span><span class="nx">model</span><span class="p">.</span><span class="nx">records</span><span class="p">.</span><span class="nx">bind</span><span class="p">(</span><span class="dl">'</span><span class="s1">reset</span><span class="dl">'</span><span class="p">,</span> <span class="k">this</span><span class="p">.</span><span class="nx">render</span><span class="p">);</span>
<span class="k">this</span><span class="p">.</span><span class="nx">templateResults</span> <span class="o">=</span> <span class="nx">options</span><span class="p">.</span><span class="nx">template</span><span class="p">;</span>
<span class="p">},</span>
<span class="c1">// overall template for this view</span>
<span class="na">template</span><span class="p">:</span> <span class="dl">'</span><span class="s1"> </span><span class="se">\</span><span class="s1">
<div class="controls"> </span><span class="se">\</span><span class="s1">
<div class="query-here"></div> </span><span class="se">\</span><span class="s1">
</div> </span><span class="se">\</span><span class="s1">
<div class="total"><h2><span></span> records found</h2></div> </span><span class="se">\</span><span class="s1">
<div class="body"> </span><span class="se">\</span><span class="s1">
<div class="sidebar"></div> </span><span class="se">\</span><span class="s1">
<div class="results"> </span><span class="se">\</span><span class="s1">
{{{results}}} </span><span class="se">\</span><span class="s1">
</div> </span><span class="se">\</span><span class="s1">
</div> </span><span class="se">\</span><span class="s1">
<div class="pager-here"></div> </span><span class="se">\</span><span class="s1">
</span><span class="dl">'</span><span class="p">,</span>
<span class="c1">// render the view</span>
<span class="na">render</span><span class="p">:</span> <span class="kd">function</span><span class="p">()</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">results</span> <span class="o">=</span> <span class="dl">''</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">_</span><span class="p">.</span><span class="nx">isFunction</span><span class="p">(</span><span class="k">this</span><span class="p">.</span><span class="nx">templateResults</span><span class="p">))</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">results</span> <span class="o">=</span> <span class="nx">_</span><span class="p">.</span><span class="nx">map</span><span class="p">(</span><span class="k">this</span><span class="p">.</span><span class="nx">model</span><span class="p">.</span><span class="nx">records</span><span class="p">.</span><span class="nx">toJSON</span><span class="p">(),</span> <span class="k">this</span><span class="p">.</span><span class="nx">templateResults</span><span class="p">).</span><span class="nx">join</span><span class="p">(</span><span class="dl">'</span><span class="se">\n</span><span class="dl">'</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="c1">// templateResults is just for one result ...</span>
<span class="kd">var</span> <span class="nx">tmpl</span> <span class="o">=</span> <span class="dl">'</span><span class="s1">{{#records}}</span><span class="dl">'</span> <span class="o">+</span> <span class="k">this</span><span class="p">.</span><span class="nx">templateResults</span> <span class="o">+</span> <span class="dl">'</span><span class="s1">{{/records}}</span><span class="dl">'</span><span class="p">;</span>
<span class="kd">var</span> <span class="nx">results</span> <span class="o">=</span> <span class="nx">Mustache</span><span class="p">.</span><span class="nx">render</span><span class="p">(</span><span class="nx">tmpl</span><span class="p">,</span> <span class="p">{</span>
<span class="na">records</span><span class="p">:</span> <span class="k">this</span><span class="p">.</span><span class="nx">model</span><span class="p">.</span><span class="nx">records</span><span class="p">.</span><span class="nx">toJSON</span><span class="p">()</span>
<span class="p">});</span>
<span class="p">}</span>
<span class="kd">var</span> <span class="nx">html</span> <span class="o">=</span> <span class="nx">Mustache</span><span class="p">.</span><span class="nx">render</span><span class="p">(</span><span class="k">this</span><span class="p">.</span><span class="nx">template</span><span class="p">,</span> <span class="p">{</span>
<span class="na">results</span><span class="p">:</span> <span class="nx">results</span>
<span class="p">});</span>
<span class="k">this</span><span class="p">.</span><span class="nx">el</span><span class="p">.</span><span class="nx">html</span><span class="p">(</span><span class="nx">html</span><span class="p">);</span>
<span class="c1">// Set the total records found info</span>
<span class="k">this</span><span class="p">.</span><span class="nx">el</span><span class="p">.</span><span class="nx">find</span><span class="p">(</span><span class="dl">'</span><span class="s1">.total span</span><span class="dl">'</span><span class="p">).</span><span class="nx">text</span><span class="p">(</span><span class="k">this</span><span class="p">.</span><span class="nx">model</span><span class="p">.</span><span class="nx">recordCount</span><span class="p">);</span>
<span class="c1">// ### Now setup all the extra mini-widgets</span>
<span class="c1">//</span>
<span class="c1">// Facets, Pager, QueryEditor etc</span>
<span class="kd">var</span> <span class="nx">view</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">recline</span><span class="p">.</span><span class="nx">View</span><span class="p">.</span><span class="nx">FacetViewer</span><span class="p">({</span>
<span class="na">model</span><span class="p">:</span> <span class="k">this</span><span class="p">.</span><span class="nx">model</span>
<span class="p">});</span>
<span class="nx">view</span><span class="p">.</span><span class="nx">render</span><span class="p">();</span>
<span class="k">this</span><span class="p">.</span><span class="nx">el</span><span class="p">.</span><span class="nx">find</span><span class="p">(</span><span class="dl">'</span><span class="s1">.sidebar</span><span class="dl">'</span><span class="p">).</span><span class="nx">append</span><span class="p">(</span><span class="nx">view</span><span class="p">.</span><span class="nx">el</span><span class="p">);</span>
<span class="kd">var</span> <span class="nx">pager</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">recline</span><span class="p">.</span><span class="nx">View</span><span class="p">.</span><span class="nx">Pager</span><span class="p">({</span>
<span class="na">model</span><span class="p">:</span> <span class="k">this</span><span class="p">.</span><span class="nx">model</span><span class="p">.</span><span class="nx">queryState</span>
<span class="p">});</span>
<span class="k">this</span><span class="p">.</span><span class="nx">el</span><span class="p">.</span><span class="nx">find</span><span class="p">(</span><span class="dl">'</span><span class="s1">.pager-here</span><span class="dl">'</span><span class="p">).</span><span class="nx">append</span><span class="p">(</span><span class="nx">pager</span><span class="p">.</span><span class="nx">el</span><span class="p">);</span>
<span class="kd">var</span> <span class="nx">queryEditor</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">recline</span><span class="p">.</span><span class="nx">View</span><span class="p">.</span><span class="nx">QueryEditor</span><span class="p">({</span>
<span class="na">model</span><span class="p">:</span> <span class="k">this</span><span class="p">.</span><span class="nx">model</span><span class="p">.</span><span class="nx">queryState</span>
<span class="p">});</span>
<span class="k">this</span><span class="p">.</span><span class="nx">el</span><span class="p">.</span><span class="nx">find</span><span class="p">(</span><span class="dl">'</span><span class="s1">.query-here</span><span class="dl">'</span><span class="p">).</span><span class="nx">append</span><span class="p">(</span><span class="nx">queryEditor</span><span class="p">.</span><span class="nx">el</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">});</span></code></pre></figure>
Rufus Pollock
Labs Show and Tell - 26th October!
2012-10-23T00:00:00+00:00
http://okfnlabs.org/blog/2012/10/23/show-and-tell
<p><img src="http://assets.okfn.org/p/labs/img/tent.png" style="margin-left: 30px; float: right;" /></p>
<p>We’re having the next Show and Tell on Friday, <a href="http://www.timeanddate.com/worldclock/fixedtime.html?iso=20121026T1430&p1=136">26 October at 2:30 pm BST</a> via Google Hangout on Air. As usual, the URL will be posted on <a href="https://plus.google.com/108417336285743833546/posts">OKFN Labs’ G+ Page</a>.</p>
<p>If you’d like to present, add your name to <a href="http://okfnpad.org/show-and-tell-Oct-26">the list</a>. Remember, <a href="http://webchat.freenode.net/?channels=okfn">#okfn on irc.freenode.net</a> will be the backchannel for discussion and questions, so don’t forget to hang out there.</p>
<h3 id="whats-show-and-tell">What’s Show and Tell?</h3>
<p>Have you built some cool tech you want to show everyone? Played around with some data? The Labs Show and Tell is your chance to share it with the OKFN Labs community! You get 2 to 5 minutes to show us what you built!</p>
<h3 id="missed-the-last-one">Missed the last one?</h3>
<p>On Oct 12, 2012, we had the first Labs Show and Tell. Here’s what we talked about:</p>
<h4 id="scientific-promiscuity---michael-bauer"><a href="http://promiscuity.tentacleriot.eu/">Scientific Promiscuity</a> - Michael Bauer</h4>
<p>Scientific papers are rarely written by a single person. Usually many authors come together to work on a specific issue. This visualization uses data obtained from Pubmed to show collaboration between authors.</p>
<p><img src="/img/dashboard.png" style="margin-left: 30px; float: right;" /></p>
<h4 id="activity-api---code---tom-rees"><a href="http://activityapi.herokuapp.com/">Activity API</a> - <a href="https://github.com/okfn/activityapi">Code</a> - Tom Rees</h4>
<p>Activity API scrapes through multiple data sources and creates one single PostgreSQL database with all the data. It scrapes GitHub, Twitter, and mailing list posts.</p>
<h4 id="dashboard---code---tom-rees"><a href="http://okfnlabs.org/dashboard/#project/labs">Dashboard</a> - <a href="https://github.com/okfn/dashboard">Code</a> - Tom Rees</h4>
<p>The OKFN Community Dashboard provides an overview of community activity. We have a flourishing and diverse set of activities and it can be hard, even for people ‘inside’, to see what is going on. The Dashboard helps us quickly see what is happening.</p>
<h4 id="nomenklatura---code---friedrich"><a href="http://nomenklatura.okfnlabs.org/">nomenklatura</a> - <a href="https://github.com/pudo/nomenklatura">Code</a> - Friedrich</h4>
<p>A lot of time in data wrangling is spent making mappings of variant names to a canonical form. This app provides an easy-to-use, web-based method for creating such mappings, to allow for a more managed data cleansing pipeline.</p>
<h4 id="messy-tables---friedrich"><a href="https://github.com/okfn/messytables">Messy Tables</a> - Friedrich</h4>
<p>A library for dealing with messy tabular data in several formats, guessing types and detecting headers.</p>
<h4 id="froide---stefanw"><a href="https://github.com/stefanw/froide">froide</a> - stefanw</h4>
<p>Froide is a Freedom Of Information tracker. The name comes from Freedom of Information (de). Also Froide sounds like Freude which is German for joy.</p>
<h4 id="pybossa-on-travis-ci---nigel"><a href="https://travis-ci.org/#!/PyBossa/pybossa">PyBossa on Travis CI</a> - Nigel</h4>
<p>PyBossa now uses Travis CI for continuous integration. Makes reviewing pull requests easier since we can see test status right away.</p>
Nigel Babu
Wrangling dirty data with messytables.
2012-10-22T00:00:00+00:00
http://okfnlabs.org/blog/2012/10/22/messytables
<p>One of the largest data collection projects we have done so far
has been the <a href="http://openspending.org/resources/gb-spending/">consolidation of the UK’s departmental expenditure</a>.
Over 370 different government entities have published a total
of more than 7000 spreadsheets. Many of those have obviously
been hand-crafted or at least manually processed. Our goal was to
consolidate the contained information into a single
spreadsheet, discarding all the eccentricities included by the individual
publishers.</p>
<p><a href="https://github.com/okfn/messytables">messytables</a> is a simple
Python library that tries to extract tabular contents from
spreadsheet documents created by human editors. Often, even files
released as CSV or Excel are still not easy to parse
programmatically. Some people like to start off spreadsheets with
a title column or some metadata, while others use inappropriate
formats to represent numbers or dates.</p>
<p>The tool offers a set of functions that help make parsing this data
easier; a short usage sketch follows the list:</p>
<ul>
<li>
<p>A <strong>headers detector</strong> tries to determine which row in a spreadsheet
contains the actual header definitions (as opposed to any surrounding
titles or notes).</p>
</li>
<li>
<p><strong>type detection</strong> attempts to guess the data type for each column,
including a wide range of commonly used date formats.</p>
</li>
<li>
<p>support for <strong>streaming data</strong>, so that extremely large tables can
be processed without loading the entire dataset into memory.</p>
</li>
<li>
<p>and, of course, it supports a <strong>range of spreadsheet types</strong> - from
trusty CSV to Excel and even OpenOffice formats.</p>
</li>
</ul>
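<p>Putting these pieces together, here is a minimal example in Python, following the messytables documentation; <code>messy.csv</code> is a placeholder for whatever file you are wrangling:</p>
<pre><code>from messytables import (CSVTableSet, headers_guess, headers_processor,
                         offset_processor, type_guess, types_processor)

# a table set is a collection of tables; a CSV file contains just one
fh = open('messy.csv', 'rb')
table_set = CSVTableSet(fh)
row_set = table_set.tables[0]

# guess the header names and the offset of the header row
offset, headers = headers_guess(row_set.sample)
row_set.register_processor(headers_processor(headers))

# skip ahead so that iteration begins with content, not the header
row_set.register_processor(offset_processor(offset + 1))

# guess the column types and apply them while iterating
types = type_guess(row_set.sample, strict=True)
row_set.register_processor(types_processor(types))

for row in row_set:
    print(row)
</code></pre>
<p>Each row comes back as a list of typed cells rather than raw strings, so downstream code can rely on the guessed types.</p>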
<p>We’ve since also started using messytables to load data into the
<a href="http://ckan.org/2012/10/22/ckan-1-8-released/">data API of CKAN</a>,
where it serves as the ETL for the datastore and related
<a href="http://reclinejs.com/">ReclineJS</a> previews.</p>
<p>If you’re interested, check out the <a href="https://messytables.readthedocs.io/en/latest/index.html">messytables documentation</a>
and the <a href="https://github.com/openspending/dpkg-uk25k/blob/master/extract.py">uk25k scripts</a>
which use it to gather UK government finance data.</p>
<p>Of course, messytables is not a cure-all, and it is only useful for
reading data. Other tools cover the rest:</p>
<ul>
<li><a href="http://docs.python-tablib.org/en/latest/">tablib</a>, for example, has
a fantastic API that makes writing, analyzing and converting data a
breeze.</li>
<li><a href="https://csvkit.readthedocs.io/en/latest/index.html">csvkit</a> has a
set of command line utilities that should be pre-installed on any
computer.</li>
</ul>
<p>But when it comes to tables that are a complete mess, give messytables a try!</p>
Friedrich Lindenberg
Open Interests Hackathon in London, 24-25 November
2012-10-15T00:00:00+00:00
http://okfnlabs.org/blog/2012/10/15/openinterests
<p>
The <a href="http://ejc.net">European Journalism Centre</a> and the Open Knowledge Foundation,
sponsored by <a href="http://mozillaopennews.org/">Knight-Mozilla
OpenNews</a>, invite you to the <a href="/events/hackdays/lobbying.html">Open Interests
Hackathon</a> to track the interests and money flows which shape European policy.
</p>
<p>
<strong>When</strong>: 24-25 November
</p>
<p>
<strong>Where</strong>: Google Campus Cafe, 4-5 Bonhill Street, EC2A 4BX London
</p>
<p>
How EU money is spent is an issue that concerns everyone who pays taxes to the EU. As the influence of Brussels lobbyists grows, it is increasingly important to draw the connections between lobbying, policy-making and funding. Journalists and activists need browsable databases, tools and platforms to investigate lobbyists’ influence and where the money goes in the EU. Join us and help build these tools!
</p>
<p>
Open Interests Europe brings together developers, designers, activists, journalists and other geeks for two days of collaboration, learning, fun, intense hacking and app building.
</p>
<div class="teaser boxed">
<a href="/events/hackdays/lobbying.html">Visit the event page to learn
more</a>
</div>
Velichka Dimitrova
Labs Show and Tell - All Welcome!
2012-10-10T00:00:00+00:00
http://okfnlabs.org/blog/2012/10/10/show-and-tell
<p><img src="http://assets.okfn.org/p/labs/img/tent.png" style="margin-left: 30px; float: right;" /></p>
<p><strong>Built an app or tool you want to show people? Played around with some
interesting data? Know of a new development people should know about? Want to
find out what others are doing?</strong></p>
<p>Come to the <strong>Show and Tell this Friday</strong> and share what you are up to with the
community!</p>
<h3 id="sign-up">Sign up</h3>
<p>Want to participate? Just add your name to <a href="http://okfnpad.org/show-and-tell-Oct-12">the list on the etherpad</a>! If
you want to present just add a brief title and/or short description.</p>
<p>Remember, <a href="http://webchat.freenode.net/?channels=okfn">#okfn on irc.freenode.net</a> will be the backchannel for
discussion and questions, so feel free to jump in there if you have questions or
queries or just want to shoot the breeze.</p>
<h3 id="when">When?</h3>
<p>Friday, <a href="http://www.timeanddate.com/worldclock/fixedtime.html?iso=20121012T1430&p1=136">12 October at 2:30 pm BST - that’s 10:30am EST, 3:30pm CET etc</a>.
The session will last <strong>30m with presentation slots of 2-5m</strong>.</p>
<h3 id="where">Where?</h3>
<p>Google Hangout on Air and <a href="http://webchat.freenode.net/?channels=okfn">#okfn on irc.freenode.net</a>. We’ll post the on-air
URL on <a href="https://plus.google.com/108417336285743833546/posts">OKFN Labs’ G+ Page</a>, here and on <a href="http://twitter.com/okfnlabs">OKFN Labs twitter</a>.</p>
Nigel Babu
Data Catalogues are People!
2012-09-25T00:00:00+00:00
http://okfnlabs.org/blog/2012/09/25/datacatalogues
<p>Last week, <a href="https://twitter.com/matejkurian">Matej Kurian</a> published
a message on the <a href="http://lists.okfn.org/mailman/listinfo/okfn-labs">okfn-labs mailing</a>
list, <a href="http://lists.okfn.org/pipermail/okfn-labs/2012-September/000376.html">describing</a> the various sources he had discovered for
machine-readable excerpts of the EU’s joint procurement system, TED.
What struck me about this message was that, apparently, this polite
and brilliant policy wonk had turned into something strange: into a
data catalogue.</p>
<p>While not quite a Kafka-grade transformation, it’s an odd turn to
take for a researcher. But Matej is not the only one: the team of
<a href="http://farmsubsidy.org/">FarmSubsidies.org</a> has experienced a similar re-definition, as did
the ERDF researchers at the <a href="http://www.thebureauinvestigates.com/">Bureau of Investigative Journalists</a>.</p>
<p>The best data catalogues today are well-informed people.</p>
<p>When I talk to journalists about data acquisition, they seem to know
this already: it’s often not just about where to look; it’s even more
important to know who to talk to. But why does this observation from a
telephone-and-filofax world hold true even in digital space, where
every bit of knowledge is supposed to be only a click away?</p>
<p>I believe that some blame goes to the simplistic model underlying our
efforts to catalogue data: the question of where to find a dataset is
certainly important, but for those actually working with the data it’s
just not enough. Once you dig into data, other questions rise to the
foreground:</p>
<ul>
<li>
<p>How do the different available datasets interact and integrate? Does
the data I am looking for even make sense on its own - or do I need
to combine several sources? Take, for example, the UK’s <em>Whole of
Government Accounts</em>: while data.gov.uk <a href="http://data.gov.uk/dataset/coins">lists</a> a few gigabytes worth of
downloads for this dataset, it is completely impossible to interpret
the data without also fetching Excel files (and PDF guidance) off the
Treasury web site, the Department of Communities and Local Government
site and - bonus points - emailing the Treasury for their internal
toolkit.</p>
</li>
<li>
<p>How complete and up-to-date is the data? What technical and political
constraints apply to the publication? Again, FarmSubsidy provides a
nice example, as a 2010 European Court of Justice verdict has severely
restricted the availability of the data - leading to an oddly limited
dataset today.</p>
</li>
<li>
<p>Who else is working with this data and what are they doing? Are there
derivative datasets that I should use instead of the source material?
It may be worth knowing, for example, that as well as browsing the
6000-odd departmental spending spreadsheets, journalists can also search
across a consolidated version of this data on OpenSpending.org.</p>
</li>
</ul>
<p>But why are current data portals so bad at capturing such information?
Certainly, adding a few comment boxes and an app gallery can do a good
job glossing over the problem, but the real problems seem to lie deeper in
the technology:</p>
<ul>
<li>
<p>Datasets are a useless unit. A while ago, <a href="http://richard.cyganiak.de/">Richard Cyganiak</a> defined a
dataset as “a set of data” - which I assume is a computer scientist’s
way of telling you to get lost. And while I’m not normally a big fan
of LOD-clouds, they got this right: all the interesting stuff is
happening in between datasets. Whether it’s about reconstructing a
process across several datasets or finding out about geographical and
temporal coverage - datasets are at best building blocks, more often
they are just arbitrary. So maybe it’s time to think about other
mechanisms to represent data sources: what about policy maps and
government wiring plans?</p>
</li>
<li>
<p>Even worse, the metadata we keep about datasets is mostly based on a
bureaucratic mindset: they’re library-inspired, static index
cards that hope to represent datasets, while data are really subject
to complex processes both within and outside the institutions that
produce them. For anyone using the data, activity metadata is
the interesting part. We’ve already figured this out for software,
where directories like FreshMeat and SourceForge have been replaced by
activity-driven platforms like GitHub. The key aspect here is that
GitHub doesn’t require me to explicitly make metadata - the relevant
narrative is simply summarized from my working pattern.</p>
<p>Of course, all of this is just a long way of saying that the best
metadata is in the data itself. So unless you’re working on the LHC
stuff there really isn’t much of a reason to separate the two any
longer: let’s make public, audit-trailed databases that report on
themselves. This, of course, is easier said than done as it implies
that all data will fit into one storage mechanism. In the real
world (i.e. outside Linked Data land), this is unlikely to be true
of structured data any time soon.</p>
</li>
</ul>
<p>Still, even after fixing our model of how we talk about datasets on the
web, I think we would still find that the best way to ensure that people
collaborate around data is community-building: creating networks that
garden the commons. Perhaps we should start cataloguing those.</p>
Friedrich Lindenberg
WikipediaJS - accessing Wikipedia article data through Javascript
2012-09-10T00:00:00+00:00
http://okfnlabs.org/blog/2012/09/10/wikipediajs-a-javascript-library-for-accessing-wikipedia-article-information
<p><a href="http://okfnlabs.org/wikipediajs/">WikipediaJS</a> is a simple JS library for accessing information in Wikipedia articles such as dates, places, abstracts etc.</p>
<p>The library is the work of Labs member <a href="http://rufuspollock.org/">Rufus
Pollock</a>. In essence, it is a small wrapper around the data and <a href="http://dbpedia.org/sparql/">APIs</a> of the <a href="http://dbpedia.org/">DBPedia project</a> and it is they who have done all
the heavy lifting of extracting structured data from Wikipedia - huge credit
and thanks to DBPedia folks!</p>
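<p>WikipediaJS itself is written in JavaScript, but the underlying idea is easy to show in a few lines of Python: fetch DBPedia’s structured extract for an article and pick out the fields you want. This is only a rough sketch of what the library does behind the scenes, and the <code>/data/</code> endpoint and property URIs below come from DBPedia’s public interface rather than from WikipediaJS:</p>
<pre><code># Rough sketch of what WikipediaJS does under the hood: fetch
# DBPedia's structured data for a Wikipedia article and extract
# the English abstract. The endpoint and property URIs are
# assumptions based on DBPedia's public interface.
import json
import urllib.request

article = "Normandy_landings"  # the last segment of the Wikipedia URL
url = "http://dbpedia.org/data/%s.json" % article

with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

# the response is keyed by entity URI; pick out the article's resource
resource = data.get("http://dbpedia.org/resource/" + article, {})

# abstracts come in many languages; keep only the English one
for entry in resource.get("http://dbpedia.org/ontology/abstract", []):
    if entry.get("lang") == "en":
        print(entry["value"])
</code></pre>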
<p><a href="http://okfnlabs.org/wikipediajs/"><img src="http://farm9.staticflickr.com/8029/7961793920_7436dba276_c.jpg" style="display: block; margin: auto; width: 80%; border: #ccc 5px solid; margin-top: 20px; margin-bottom: 20px;" /></a></p>
<h3 id="demo-and-examples">Demo and Examples</h3>
<p>A demo is included and you can see some examples of the library in action at the following links:</p>
<ul>
<li><a href="http://okfnlabs.org/wikipediajs/?url=http://en.wikipedia.org/wiki/Normandy_landings">http://okfnlabs.org/wikipediajs/?url=http://en.wikipedia.org/wiki/Normandy_landings</a></li>
<li><a href="http://okfnlabs.org/wikipediajs/?url=?url=http://en.wikipedia.org/wiki/Securitas_AB">http://okfnlabs.org/wikipediajs/?url=?url=http://en.wikipedia.org/wiki/Securitas_AB</a></li>
<li><a href="http://okfnlabs.org/wikipediajs/?url=http://en.wikipedia.org/wiki/Richard_I_of_England">http://okfnlabs.org/wikipediajs/?url=http://en.wikipedia.org/wiki/Richard_I_of_England</a></li>
<li><a href="http://okfnlabs.org/wikipediajs/?url=http://en.wikipedia.org/wiki/CERN">http://okfnlabs.org/wikipediajs/?url=http://en.wikipedia.org/wiki/CERN</a></li>
</ul>
<h3 id="colophon">Colophon</h3>
<ul>
<li><a href="https://github.com/okfn/wikipediajs">WikipediaJS source code is on github</a></li>
</ul>
<p>One of the reasons for creating WikipediaJS is that we think it can be
useful in <a href="http://timeliner.reclinejs.com/">Timeliner</a> and other apps as a
way to quickly add new items to your timeline.</p>
Rufus Pollock
Timeliner - Make Nice Timelines Fast
2012-08-08T00:00:00+00:00
http://okfnlabs.org/blog/2012/08/08/timeliner-make-nice-timelines-fast
<p>As part of the <a href="http://reclinejs.com/">Recline</a> launch I quickly put together some very simple demo apps, one of which was called Timeliner:</p>
<p><a href="http://timeliner.reclinejs.com/">http://timeliner.reclinejs.com/</a></p>
<p>This uses the Recline timeline component (which itself is a relatively thin wrapper around the <em>excellent</em> <a href="http://timeline.verite.co/">Verite timeline</a>) plus the Recline Google docs backend to provide an easy way for people to make timelines backed by a Google Docs spreadsheet.</p>
<p>As an example of use, I started work on a <a href="http://timeliner.reclinejs.com/?backend=gdocs&url=https://docs.google.com/spreadsheet/ccc?key=0Aon3JiuouxLUdDQ3QlJhOHJnS2x0NkxibUp1YnYwR1E%23gid=0#explorer">“spending stories” timeline about the bankruptcy of US cities (esp in California)</a> as a result of the “Great Recession” (<a href="https://docs.google.com/spreadsheet/ccc?key=0Aon3JiuouxLUdDQ3QlJhOHJnS2x0NkxibUp1YnYwR1E#gid=0">source spreadsheet</a>). I’ve also created an example <a href="http://timeliner.reclinejs.com/?backend=gdocs&url=https://docs.google.com/spreadsheet/ccc?key=0Aon3JiuouxLUdDQ3QlJhOHJnS2x0NkxibUp1YnYwR1E%23gid=0#explorer">timeline of major wars</a>, a screenshot of which I’ve inlined:</p>
<p><img src="http://farm9.staticflickr.com/8285/7508403206_420de3ce5e_b.jpg" style="width: 600px;; margin: auto; display: block; margin-top: 20px;" /></p>
<h3 id="code">Code</h3>
<p>Source code for the Timeliner is here: <a href="https://github.com/okfn/timeliner">https://github.com/okfn/timeliner</a></p>
<p>If you have suggestions for improvements, want to see the ones that already exist, or, <em>gasp</em>, find a bug, please see the issue tracker: <a href="https://github.com/okfn/timeliner/issues">https://github.com/okfn/timeliner/issues</a></p>
Rufus Pollock
The Data Transformer - Cleaning Up Data in the Browser
2012-07-31T00:00:00+00:00
http://okfnlabs.org/blog/2012/07/31/data-transformer-cleaning-up-data-in-the-browser
<p>This is a brief post to announce an alpha prototype version of the Data Transformer, an app that lets you clean up data in the browser using JavaScript:</p>
<p><a href="http://transformer.datahub.io/">http://transformer.datahub.io/</a></p>
<h3 id="2m-overview-video">2m overview video:</h3>
<iframe width="560" height="315" src="http://www.youtube.com/embed/zM1USNaEcVQ" frameborder="0" allowfullscreen="1" style="margin-bottom: 30px;"> </iframe>
<h3 id="what-does-this-app-do">What does this app do?</h3>
<ol>
<li>You load a CSV file from github (fixed at the moment but soon to be customizable)</li>
<li>You write simple javascript to edit this file (uses ReclineJS transform and grid views + CSV backends – here’s the <a href="http://reclinejs.com/demos/multiview/?currentView=transform">original ReclineJS transform demo</a>)</li>
<li>You save this updated file back to github (via oauth login - this utilizes Michael’s great work in Prose!)</li>
</ol>
<p>This prototype was hacked together a couple of weeks ago, when I was fortunate enough to spend an afternoon with Michael Aufreiter, Chris Herwig, Mike Morris and others at the Development Seed offices. It builds on ReclineJS + oauth / github connectors borrowed from Prose.</p>
<p>It’s part of an ongoing plan to create a “Data Orchestra” of lightweight data services that play nicely with each
other and connect to things like the DataHub (or GitHub …): <a href="http://notebook.okfn.org/2012/06/22/datahub-small-pieces-loosely-joined/">http://notebook.okfn.org/2012/06/22/datahub-small-pieces-loosely-joined/</a></p>
Rufus Pollock
Displaying PyBossa Urban Parks Data on a 3D Globe
2012-07-14T00:00:00+00:00
http://okfnlabs.org/blog/2012/07/14/pybossa-urban-parks-data-on-3d-globe
<p>Labs member <a href="http://twitter.com/teleyinex">Daniel Lombraña González</a> has built a <a href="http://teleyinex.github.com/pybossa-urbanpark-globe/">3D globe showing the locations of urban parks around the world</a> as located by volunteers using the <a href="http://pybossa.com/app/urbanpark">Pybossa Urban Park geocoding app</a>:</p>
<p><strong><a href="http://teleyinex.github.com/pybossa-urbanpark-globe/">http://teleyinex.github.com/pybossa-urbanpark-globe/</a></strong> — (<a href="https://github.com/teleyinex/pybossa-urbanpark-globe">Source code</a>)</p>
<p><img src="https://p.twimg.com/AxxDoY9CIAET_0L.png:large" alt="screenshot" /></p>
<h3 id="background">Background</h3>
<p>The Urban Parks geo-coding application is a micro-tasking app running on <a href="http://pybossa.com">PyBossa</a>. In the app, volunteers are asked to find an urban park in cities around the world. The volunteers use a web map to browse a city and then submit an answer: either the coordinates of an urban park, given by placing a marker on the map, or a report that they could not find any park.</p>
<p>More details about PyBossa can be found on the official site <a href="http://pybossa.com">http://pybossa.com</a> and also in the <a href="http://docs.pybossa.com">online documentation</a>.</p>
Daniel Lombraña Gonzalez
dataissues.org - public issue tracking for data defects
2012-07-10T00:00:00+00:00
http://okfnlabs.org/blog/2012/07/10/dataissues
<p><em>On June 21st, the Knight News Challenge Round on Data ended. The day before,
<a href="http://rufuspollock.org/">Rufus</a>, <a href="https://twitter.com/rossjones">Ross</a> and
I sat down to write out some ideas that we’d been discussing for a while. While
we submitted proposals for <a href="/blog/2012/07/09/grano.html">Grano</a> and <a href="http://newschallenge.tumblr.com/post/25576949597/data-protocols-rough-consensus-running-code-and">DataProtocols</a>, we decided to hold back on this idea for another round. Still, sharing is caring.</em></p>
<p><strong>1. What do you propose to do? [20 words]</strong></p>
<p>We’ll create a web service where data wranglers and consumers can log errors arising from processing, viewing or using data.</p>
<p><strong>2. How will your project make data more useful? [50 words]</strong></p>
<p>All data has errors. While data quality is often talked about, the best most data apps manage is half a paragraph on the ‘about’ page. We want to build a service that is useful to data wranglers, but can also serve as documentation for end-users and as a basis for further discussion.</p>
<p><strong>3. How is your project different from what already exists? [30 words]</strong></p>
<p>Error reporting for software is either done as task tickets (e.g. github.com) or by capturing raw application output (e.g. exceptional.io). For data, we want to combine these two approaches to let users group recurring errors into issues that can then be discussed and fixed.</p>
<p><strong>4. Why will it work? [100 words]</strong></p>
<p>While all data processing workflows differ from dataset to dataset, the types of errors that occur are often quite similar and can be stored in a shared service. This is immediately useful when doing data work - especially in scheduled, unsupervised processes - but it also serves as an activity log for other people to see.</p>
<p>We’ll create both an easy-to-use online validation tool to check spreadsheets against a given schema and an API with client libraries that can be integrated into existing processing pipelines. The reported issues can be outright errors, but also probes that highlight implausible values.</p>
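<p>As the service only exists as a basic prototype, there is no real client library to show; purely to illustrate the idea, integrating issue reporting into a pipeline might look roughly like the sketch below. Every name in it - the endpoint, the function, the payload fields - is hypothetical:</p>
<pre><code># Hypothetical sketch only: dataissues has no published client
# library, so the endpoint and payload here are invented purely
# to illustrate logging data defects to a shared service.
import json
import urllib.request

DATAISSUES_URL = "http://dataissues.example.org/api/issues"  # invented

def report_issue(dataset, row, message, kind="error"):
    # kind would be "error" for outright failures and "probe" for
    # implausible-but-parseable values that deserve a second look
    payload = json.dumps({
        "dataset": dataset,
        "row": row,
        "kind": kind,
        "message": message,
    }).encode("utf-8")
    req = urllib.request.Request(
        DATAISSUES_URL, data=payload,
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

# in a processing pipeline, e.g. when a spending amount is negative:
# report_issue("gb-spending", row_number,
#              "negative transaction amount", kind="probe")
</code></pre>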
<p><strong>5. Who is working on it? [100 words]</strong></p>
<p>The Open Knowledge Foundation is…</p>
<p><strong>6. What part of the project have you already built? [100 words]</strong></p>
<p>We’ve got extensive experience working with dataset metadata from DataHub.io and have produced a number of complex data processing pipelines (e.g. for UK spending data, merging over 5000 spreadsheets in different formats). These clearly show the need for better reporting; we have built several ad-hoc solutions, but we know this is a major area that is inadequately addressed in our work and that of others. We already have a basic prototype and can build a first increment quickly.</p>
<p><strong>7. How would you use News Challenge funds? [50 words]</strong></p>
<p>We’ll build it! We’ll develop a full version of this service iteratively, then test and promote it. We plan to work with civic data projects as early adopters to get quick feedback and adapt the service to suit their needs.</p>
<p><strong>8. How would you sustain the project after the funding expires? [50 words]</strong></p>
<p>This will be perfectly suited to a SaaS freemium model, in which heavy and/or professional users who need to report large numbers of errors and generate complex reports pay a subscription fee. In addition, as open-source software, the project can be re-used and extended by others.</p>
<p><strong>If you think this is a good idea, <a href="http://github.com/okfn/dataissues">help hack on and contribute patches to the dataissues repository</a>!</strong></p>
Friedrich Lindenberg
Grano - social network analysis for advocates and journalists.
2012-07-09T00:00:00+00:00
http://okfnlabs.org/blog/2012/07/09/grano
<p><em>On June 21st, the Knight News Challenge Round on Data ended. The day before,
<a href="http://rufuspollock.org/">Rufus</a>, <a href="https://twitter.com/rossjones">Ross</a> and
I sat down to write out some ideas that we’d been discussing for a while. The
first idea I want to repost here is a proposal for Grano, which I’ve <a href="http://pudo.org/2011/12/19/sna.html">discussed
in this blog before</a>.</em></p>
<p><strong>1. What do you propose to do? [20 words]</strong></p>
<p>We’ll make a powerful tool for journalists and advocates to keep track of actors and their relationships in complex environments.</p>
<p><img src="http://pudo.org/images/grano.png" alt="Grano" /></p>
<p><strong>2. How will your project make data more useful? [50 words]</strong></p>
<p>It’ll enable users to manage research in a structured way, helping them to link raw data to the actors, events and organisations they’re already investigating and to find those that they may have missed before. We’ll help users do their job more thoroughly, while creating a structure that can be re-used later.</p>
<p><strong>3. How is your project different from what already exists? [30 words]</strong></p>
<p>Network analysis means many things to people: it’s graph algorithms to coders, network diagrams to designers and CRM to business. Journalists and advocates need evidence gathering and information linkage to be at the core of these things.</p>
<p><strong>4. Why will it work? [100 words]</strong></p>
<p>We want to focus on four functions that will make this a practical tool instead of a gimmick:</p>
<p>a) allowing users to easily integrate bulk data to complement manually entered information,</p>
<p>b) helping them to keep track of the source for each fact that is entered and keeping a full version history,</p>
<p>c) providing easy access control so that users can choose which information to keep private and which links to publish with others and</p>
<p>d) text snippets, so that researchers can combine structured analysis and narrative fragments in which the tool will detect references to the network’s entities.</p>
<p><strong>5. Who is working on it? [100 words]</strong></p>
<p>The Open Knowledge Foundation wants to cooperate with investigative networks around the world to develop this project. We’ve already been pioneering data collection and presentation tools, such as DataHub.io and OpenSpending, as well as efforts like the Data Journalism Handbook and the School of Data to widen data literacy.</p>
<ul>
<li>Friedrich Lindenberg (OKFN) has worked on several data projects and data-journalism training and will lead this project.</li>
<li>Ross Jones (OKFN) will contribute as a software architect.</li>
<li>Stefan Candea (2011 Nieman Fellow at Harvard, Director of the Romanian Center for Investigative Journalism) has offered to advise us.</li>
</ul>
<p><strong>6. What part of the project have you already built? [100 words]</strong></p>
<p>We’ve already built Grano, a REST backend that can store network information, generate custom reports about nodes and relations, and run full-text search. Because we think that meaningful network analysis is hard, we are conservative in our choice of technology so that we can focus on outcomes. To force that, we decided to base our tool on a concrete use case. The software is now seeing its first use in an unannounced project that tracks lobbying in the EU, powering a special-purpose, JavaScript-only site. Unfortunately, this means the current prototype has neither a stand-alone web interface nor the serious data integration capabilities we think it needs.</p>
<p><strong>7. How would you use News Challenge funds? [50 words]</strong></p>
<p>We want to develop Grano to give investigative journalists and civic hackers a (re-usable) web interface to design their network structure, manually enter data, integrate bulk data sets and to explore the resulting network, make notes, calculate key metrics and to export reports, rankings and network visualizations.</p>
<p><strong>8. How would you sustain the project after the funding expires? [50 words]</strong></p>
<p>While the service is going to be of immediate use, we believe that advocacy groups and newsrooms will also deploy it as a backend to their features and campaign sites. We aim to make Grano into a thriving open source project, supported through custom services for power users.</p>
<p><strong>If you like this idea, please vote for it on the <a href="http://newschallenge.tumblr.com/post/25572174408/grano">proposal page</a>.</strong></p>
Friedrich Lindenberg