This is a web service available for converting a multitude of document types to simple text. It is a public facing instance of the Apache Tika server (developer version). It lives at:
To test it, just throw some images with text in them at it. For example, on a terminal on Mac or Linux:
curl -T tiff_example.tif http://beta.offenedaten.de:9998/tika
More details at http://okfnlabs.org/blog/2015/02/21/documents-to-text.html. Please note that Matt is not the author of the software, just the developer for the Dockerfile that makes setting up an instance of what is quite a large piece of software very straightforward.
DataPortals.org is the most comprehensive list of open data portals in the world. It is curated by a group of leading open data experts from around the world - including representatives from local, regional and national governments, international organisations such as the World Bank, and numerous NGOs.
The goal of this project is to build the largest open product database in the world. Such a database would allow consumers to get information about a given product in real time by simply scanning its barcode with their mobile phone. It would also allow consumers to publish their opinions about products in an easily accessible way.
We are currently rebooting this project and are looking for contributors. An Android client (source) is available, and work has started on an iOS client. If you’d like to contribute or find out more, you can join the working group mailing list, follow the project on Twitter, and join the discussion on the Open Knowledge forum.
Webshot is a Node.js-based webservice provided by Open Knowledge Labs for taking automated screenshots of webpages using node-webshot. It is available for anyone to use. As always, contributions are welcome.
PublicBodies.org is a website hosting a database of so-called public bodies—that is government-run or -controlled organizations (which may or may not have a distinct corporate existence). Examples include government ministries or departments, as well as state-run organizations such as libraries, police and fire departments.
BibServer is an open-source RESTful bibliographic data server. BibServer makes it easy to create and manage collections of bibliographic records such as reading lists, publication lists and even complete library catalogs. FacetView is included to provide a rich interface for complex search queries. BibServer supports bibtex, MARC, RIS, BibJSON, RDF and other bibliographic formats.
BibJSON is a convention for representing bibliographic metadata in JSON. It is intended as a specification for developers of web applications that generate or accept bibliographic metadata. It is currently used in the Open Knowledge Labs BibServer project.
Bad Data is a site detailing real-world examples of how not to prepare or provide data. It showcases poorly structured, misformatted, or just plain ugly datasets and what they get wrong.
While its primary purpose is to serve as an educational tool for governments and other organizations, it is also a good source of practice material for budding data wranglers—those tasked with cleaning and transforming data. The project began as a place to keep practice data for Data Explorer, another project of Open Knowledge Labs.
dpm (data package manager) is a library and
command-line tool for installing and
packages. Inspired by software package management tools
apt for Debian,
to reduce the friction of
sharing and working with data.
A FireFox extension which allows you to view, search, graph and map CSV files in the browser (built using Recline). This is an port of the great Rufus Pollock's chrome-csv-viewer on FireFox.
The Open Knowledge Labs website (i.e. the site you’re looking at right now) is itself a collaborative project of Open Knowledge. It is built using Jekyll, a static site generator, and hosted on GitHub Pages. You can contribute by addressing some of the items on the project’s GitHub issue queue (hint: you can start with the easy ones).
Head Start is intended for scholars who want to get an overview of a research field. They could be young PhDs getting into a new field, or established scholars who venture into a neighboring field. The idea is that you can see the main areas and papers in a field at a glance without having to do weeks of searching and reading. A prototypical implementation for the field of educational technology can be found on Mendeley Labs. The visualization is also used in Conference Navigator 3 and in the Organic Edunet portal.
A collection of the works of Ivo, bishop of Chartres.
This site has four elements
- draft texts and some concordances for the three collections traditionally
associated with Ivo of Chartres:
- the Collectio Tripartita
- the Decretum and
- the Panormia
- a draft list of manuscripts which contain a significant number of Ivo’s letters.
Originally developed as part of ReclineJS but now fully standalone.
Data Pipes is a service to provide streaming, "pipe-like" data transformations on the web – things like deleting rows or columns, find and replace, head, grep etc.
Reconcile-CSV is an OpenRefine reconciliation service running on top of a CSV file. It uses fuzzy matching to find the most likely candidates for matching. If you ever needed to join two datasets that didn't have unique identifiers and where things are written slightly different: this is a way to go.
Open Correspondence is an attempt to explore the letters network of the nineteenth century. At the moment, the project contains some of the letters of Charles Dickens, but we're working to expand to it include many other authors such as Jane Austen, George Eliot and Byron.
It also provides a backend interface to ElasticSearch suitable for use with the Recline suite of data libraries.
There’s too much friction working with data - friction getting data, friction processing data, friction sharing data.
This friction stops people doing stuff: stops them creating, sharing, collaborating, and using data - especially amongst more distributed communities. It kills the cycles of find, improve, share that would make for a dynamic, productive and attractive (open) data ecosystem.
We need to make an ecosystem that, like open-source for software, is useful and attractive to those without any principled interest, the vast majority who simply want the best tool for the job, the easiest route to their goal.
We think that by getting a few key pieces in place we can reduce friction enough to revolutionize how the (open) data ecosystem operates with massively improved data quality, utilization and sharing.
- Standards: A small set of lightweight ‘data package’ standards and patterns providing a base structure on which tooling and integration can build.
- Tooling and Integration: Making it easy to use and publish data packages from your existing apps and workflows whether that’s Excel, R, or Hadoop!
- Outreach and Community: Engaging and evangelizing around the concepts, standards and tooling and building a community of users and contributors.
The BubbleTree can be used to display hierarchical (spending) data in an interactive visualization. The setup is easy and independent from the OpenSpending platform. However, there is an optional integration module to connect with data from the OpenSpending API.
Nomenklatura de-duplicates and integrates different names for entities - people, organisations or public bodies - to help you clean up messy data and to find links between different datasets. The service will create references for all entities mentioned in a source dataset. It then helps you to define which of these entities are duplicates and what the canonical name for a given entity should be. This information is available in data cleaning tools like OpenRefine or in custom data processing scripts, so that you can automatically apply existing mappings in the future. The focus of nomenklatura is on data integration, it does not provide further functionality with regards to the people and organisations that it helps to keep track of.
Nomenklatura is a simple service that makes it easy to maintain a canonical list of entities such as persons, companies or event streets and to match messy input, such as their names against that canonical list – for example, matching Acme Widgets, Acme Widgets Inc and Acme Widgets Incorporated to the canonical "Acme Widgets".
With Nomenklatura its a matters of minutes to set up your own set of master data to match against and it provides a simple user interface and API which you can then use do matching (the API is compatible with Open Refine's reconciliation function).
Nomenklatura can not only store the master set of entities you want to match against but also will learn and record the various aliases for a given entity - such as a person, organisation or place - may have in various datasets.
CrowdCrafting is a free, open-source crowd-sourcing and micro-tasking platform powered by the PyBossa software. This platform enables people to create and run projects that utilize on-line assistance in performing tasks that require human cognition such as image classification, transcription, geocoding and more. CrowdCrafting is there to help researchers, civic hackers and developers to create projects where anyone around the world with some time, interest and an Internet connection can contribute.
Froide is a Freedom of Information portal written in Python using the Django Web framework. It manages contactable entities, requests and much more. Users can send emails to these entities and receive public answers via the platform.
It was developed to power Frag den Staat – the German Freedom of Information Portal, but is internationalized, localized and themable and has deployed in several different countries.
Germany Freedom of Information portal powered by the Froide platform. Responsible for managing over a 1/3 of all Freedom of Information request in Germany.
Visualization on violence against women in the brazilian state Rio Grande do Sul.
In a nutshell it is an open source platform for working with collections of texts. It enables students, researchers and teachers to share and collaborate around texts using a simple and intuitive interface.
Library for working with IATI data and converting it into a relational database. http://blog.okfn.org/2012/06/05/from-xml-to-visualisations-iati/
A chrome extension which allows you to view, search, graph and map CSV files in the browser (built using Recline)
Kartograph is a simple and lightweight framework for building interactive map applications without Google Maps or any other mapping service. It was created with the needs of designers and data journalists in mind.
Tools for parsing messy tabular data. http://okfnlabs.org/blog/2012/10/22/messytables.html
Python library and command line tool for converting data from one format to another. It builds on messytables, GDAL and many more great open-source libraries for processing data, and provides one easy to use standard API.
WikipediaJS is a simple JS library for accessing information in Wikipedia articles such as dates, places, abstracts etc. The library is the work of Labs member Rufus Pollock. In essence, it is a small wrapper around the data and APIs of the DBPedia project and it is they who have done all the heavy lifting of extracting structured data from Wikipedia - huge credit and thanks to DBPedia folks!
OffenesParlement is a site (and open-source codebase) for gathering and presenting information about the work of the Bundestag and Bundesrat.