Labs newsletter: 14 November, 2013

Labs was bristling with discussion and creation this week, with major improvements to two projects, interesting conversations around a few others, and an awesome new blog post.

Data Pipes: lots of improvements

Data Pipes is a Labs project that provides a web API for a set of simple data-transforming operations that can be chained together in the style of Unix pipes.

This past week, Andy Lulham has made a huge number of improvements to Data Pipes. Just a few of the new features and fixes:

  • new operations: strip (removes empty rows), tail (truncate dataset to its last rows)
  • new features: a range function and a “complement” switch for cut; options for grep
  • all operations in pipeline are now trimmed for whitespace
  • basic tests have been added

Have a look at the closed issues to see more of what Andy has been up to.

Webshot: new homepage and feature

Last week we introduced you to Webshot, a web API for screenshots of web pages.

Back then, Webshot’s home page was just a screenshot of GitHub. Now Webshot has a proper home page with a form interface to the API.

Webshot has also added support for full page screenshots. Now you can capture the whole page rather than just its visible portion.

On the blog: natural language processing with Python

Labs member Tarek Amr has contributed an awesome post on Python natural language processing with the NLTK toolkit to the Labs blog.

“The beauty of NLP,” Tarek says, “is that it enables computers to extract knowledge from unstructured data inside textual documents.” Read his post to learn how to do text normalization, frequency analysis, and text classification with Python.

Data Packages workflow à la Node

Wouldn’t it be nice to be able to initialize new Data Packages as easily as you can initialize a Node module with npm init?

Max Ogden started a discussion thread around this enticing idea, eventually leading to Rufus Pollock booting a new repo for dpm, the Data Package Manager. Check out dpm’s Issues to see what needs to happen next with this project.

Nomenklatura: looking forward

Nomenklatura is a Labs project that does data reconciliation, making it possible “to maintain a canonical list of entities such as persons, companies or event streets and to match messy input, such as their names, against that canonical list”.

Friedrich Lindenberg has noted on the Labs mailing list that Nomenklatura has some serious problems, and he has proposed “a fairly radical re-framing of the service”.

The conversation around what this re-framing should look like is still underway—check out the discussion thread and jump in with your ideas.

Data Issues: following issues

Last week, the idea of Data Issues was floated: using GitHub Issues to track problems with public datasets. The idea has generated a few comments, and we’d love to hear more.

Discussion on the Labs list highlighted another benefit of using GitHub. Alioune Dia suggested that Data Issues should let users register to be notified when a particular issue is fixed. But Chris Mear pointed out that GitHub already makes this possible: “Any GitHub user can ‘follow’ a specific issue by using the notification button at the bottom of the issue page.”

Get involved

Anyone can join the Labs community and get involved! Read more about how you can join the community and participate by coding, wrangling data, or doing outreach and engagement. Also check out the Ideas Page to see what’s cooking in the Labs.