Glossary

Open Data

Open data is data that can be used, reused and redistributed freely by anyone for any purpose. More details can be found at opendefinition.org.

Machine-readable

Machine-readable formats are those from which computer programs can easily extract data. PDF documents are not machine-readable: computers can display the text nicely, but have great difficulty understanding the context that surrounds it. Common machine-readable file formats are CSV and Excel files.

Readme

A file (usually named README or README.txt) that explains to new users what the current directory or set of files is about. Readme files are very common in open source software projects, and including one with other publications (such as datasets) is considered good practice. The file usually contains a short description of what to expect.

BitTorrent

BitTorrent is a protocol that spreads the bandwidth needed to transfer very large files across the computers participating in the transfer. Rather than downloading a file from a single source, BitTorrent allows peers to download from each other.

JSON

JavaScript Object Notation. A common format for exchanging data. Although it is derived from JavaScript, libraries to parse JSON data exist for many programming languages. Its compact style and ease of use have made it widespread. To make viewing JSON in a browser easier, you can install a plugin such as JSONView, which is available for both Chrome and Firefox.
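
As a rough illustration of parsing JSON in one of those languages, the sketch below uses Python's standard json module. The record and its field names are invented for the example.

```python
import json

# Hypothetical record, invented for illustration only.
text = '{"name": "Example City", "population": 1000, "tags": ["sample", "demo"]}'

record = json.loads(text)   # parse JSON text into Python dicts, lists, strings, numbers
print(record["name"])       # fields are accessed by key
print(record["tags"][0])    # arrays become Python lists

# Serialise back to JSON text, indented so it is easier to read.
pretty = json.dumps(record, indent=2)
print(pretty)
```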

GDP

Gross domestic product (GDP) is the market value of all officially recognized goods and services produced within a country in a given period of time. GDP per capita is often considered an indicator of a country’s standard of living. (Source: Wikipedia.)

GeoJSON

GeoJSON is a format for encoding a variety of geographic data structures. It is based on the JSON specification. More documentation can be found at http://www.geojson.org.
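
A minimal sketch of what such a structure can look like, built as a Python dictionary and serialised with the standard json module. The place name and coordinates are made up; note that GeoJSON positions are written as longitude first, then latitude.

```python
import json

# A single point feature. The name and coordinates are illustrative assumptions.
feature = {
    "type": "Feature",
    "geometry": {
        "type": "Point",
        "coordinates": [13.4, 52.5],  # [longitude, latitude]
    },
    "properties": {"name": "Example place"},
}

# Features are usually grouped into a FeatureCollection.
collection = {"type": "FeatureCollection", "features": [feature]}

geojson_text = json.dumps(collection, indent=2)
print(geojson_text)
```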

Geocoding

From Geographical Coding. Describes the practice of attaching geographical coordinates to items.

Geocode

See Geocoding.

CSV

Comma Separated Values. A very simple, open format for tabular data which can be exported and imported by virtually all spreadsheet applications and is easily manipulated with command-line tools.
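
To give a feel for how simple the format is to process, here is a small sketch using Python's standard csv module. The column names and rows are invented sample data, read from a string rather than a file.

```python
import csv
import io

# Invented sample data: a header line followed by two rows.
csv_text = "city,visitors\nAlpha,120\nBeta,80\n"

# DictReader uses the first line as column names.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# All CSV values arrive as text, so convert before doing arithmetic.
total = sum(int(row["visitors"]) for row in rows)

for row in rows:
    print(row["city"], row["visitors"])
print("total:", total)
```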

Comma-separated Values

See CSV.

curl

http://curl.haxx.se/ - a command-line tool for transferring data to and from online systems over standard internet protocols including FTP and HTTP. Very powerful and great for working with Web APIs from the command line.

DAP

See Data Access Protocol.

Data Access Protocol

A system that allows outsiders to be granted access to databases without overloading either system.

etherpad

A piece of software for collaborative real-time editing of text. See http://etherpad.org/.

Attribution Licence

A licence that requires attributing the original source of the licensed material.

API

See Application Programming Interface.

Application Programming Interface

A way computer programmes talk to one another. Can be understood in terms of how a programmer sends instructions between programmes.

Web API

An API that is designed to work over the Internet.

Share-alike Licence

A licence that requires users of a work to provide the content under the same or similar conditions as the original.

Public domain

No copyright exists over the work. Does not exist in all jurisdictions.

Open standards

Generally understood as technical standards which are free from licensing restrictions. Can also be interpreted to mean standards which are developed in a vendor-neutral manner.

Anonymisation

The process of treating data such that it cannot be used for the identification of individuals.

IP rights

See Intellectual property rights.

Intellectual property rights

Monopolies granted to individuals for intellectual creations.

Tab-separated values

Tab-separated values (TSV) is a very common text file format for sharing tabular data. The format is extremely simple and highly machine-readable.
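
One possible way to read TSV in Python is to reuse the standard csv module with a tab delimiter, as sketched below. The sample data is invented, and the last step shows converting the same rows to CSV.

```python
import csv
import io

# Invented sample data; columns are separated by tab characters.
tsv_text = "city\tvisitors\nAlpha\t120\nBeta\t80\n"

# The csv module handles TSV once the delimiter is set to a tab.
reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
header, *records = list(reader)
print(header)
print(records)

# Write the same rows back out as comma-separated text.
out = io.StringIO()
csv.writer(out).writerows([header, *records])
csv_text = out.getvalue()
print(csv_text)
```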

Taxonomy

Classification. Taxonomy refers to the hierarchical classification of things. One of the best known is the Linnaean classification of species - still used today to classify all living beings.

Qualitative Data

Qualitative data is data telling you something about qualities, e.g. descriptions, colors etc. Interviews count as qualitative data.

Quantitative Data

Quantitative data tells you something about a measure or quantification, such as the quantity of things you have, the size (if measured) etc.

Crowdsourcing

A portmanteau of crowd and outsourcing: having a lot of people each do simple tasks that together complete the whole work.

Choropleth Map

A choropleth map is a map where values are encoded onto regions using color mapping. Each region is colored as a whole according to its underlying value.
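
The color mapping step can be sketched as follows. This is only one simple scheme (equal-width classes between the lowest and highest value); the function name, palette and region values are all hypothetical, and real mapping tools offer other classification methods.

```python
def colour_for(value, low, high, palette):
    """Pick a palette colour by splitting [low, high] into equal-width classes."""
    if high == low:
        return palette[0]
    position = (value - low) / (high - low)             # 0.0 .. 1.0
    index = min(int(position * len(palette)), len(palette) - 1)
    return palette[index]

# Hypothetical palette, ordered from light to dark.
palette = ["#fee5d9", "#fcae91", "#fb6a4a", "#cb181d"]

# Hypothetical values per region.
rates = {"Region A": 2.0, "Region B": 5.5, "Region C": 10.0}

low, high = min(rates.values()), max(rates.values())
colours = {region: colour_for(rate, low, high, palette)
           for region, rate in rates.items()}
print(colours)
```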

Mean

The arithmetic mean of a set of values. Calculated by summing up all values and then dividing by the number of values.
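
The calculation described above, written out as a short Python sketch with made-up values. The standard statistics module gives the same result.

```python
import statistics

values = [2, 4, 4, 10]  # made-up sample values

# Sum all values, then divide by how many there are.
mean = sum(values) / len(values)
print(mean)

# The standard library offers the same calculation.
library_mean = statistics.mean(values)
print(library_mean)
```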

Normal Distribution

The normal (or Gaussian) distribution is a continuous probability distribution with a bell shaped curve.
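
As a small illustration of the bell shape, the sketch below uses NormalDist from Python's standard statistics module (Python 3.8 or later) to estimate how much of a standard normal distribution lies close to its centre. The choice of mean 0 and standard deviation 1 is just a convenient assumption.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1 (chosen for convenience).
dist = NormalDist(mu=0, sigma=1)

# Share of values within one and within two standard deviations of the mean.
within_one = dist.cdf(1) - dist.cdf(-1)
within_two = dist.cdf(2) - dist.cdf(-2)

print(round(within_one, 4))
print(round(within_two, 4))

# The curve is highest at the mean and symmetric around it.
print(dist.pdf(0) > dist.pdf(1), dist.pdf(1) == dist.pdf(-1))
```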

Median

The median is the value that splits a set of values in half: 50% of the values lie below it and 50% lie above it.
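
A brief sketch with invented numbers, using Python's standard statistics module. It also shows that, unlike the mean, the median is barely affected by a single extreme value.

```python
import statistics

# Invented values with one extreme outlier.
incomes = [20, 22, 25, 27, 400]

middle = statistics.median(incomes)   # the value in the middle of the sorted list
average = statistics.mean(incomes)    # pulled upwards by the outlier
print(middle, average)

# With an even number of values, the median is the mean of the two middle ones.
even_middle = statistics.median([1, 3, 5, 7])
print(even_middle)
```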

Quartiles

Quartiles are the three values below which 25%, 50% and 75% of the values in a set lie.
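
Quartiles can be computed with statistics.quantiles from Python's standard library (Python 3.8 or later), as in this sketch with made-up data. Several conventions exist for interpolating between values; the "inclusive" method used here is one of them and is an assumption of the example.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up values

# n=4 asks for the three cut points that split the data into four parts.
quartiles = statistics.quantiles(data, n=4, method="inclusive")
first, second, third = quartiles
print(quartiles)

# The second quartile is the median.
print(second == statistics.median(data))
```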

Percentiles

A percentile is the value below which n% of the values in a set lie, e.g. the 5th percentile: 5 percent of values are lower than this value.
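
The 5th percentile example can be checked with a short sketch, again using statistics.quantiles (Python 3.8 or later) with the "inclusive" interpolation method, which is an assumption of this example. The data is simply the numbers 1 to 100.

```python
import statistics

data = list(range(1, 101))  # the numbers 1..100, chosen for illustration

# n=100 returns the 99 cut points between percentiles; index 4 is the 5th percentile.
cut_points = statistics.quantiles(data, n=100, method="inclusive")
p5 = cut_points[4]
print(p5)

# Check the definition: what share of the values lies below that cut point?
share_below = sum(1 for value in data if value < p5) / len(data)
print(share_below)
```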

Scraping

The process of extracting data in machine-readable formats from sources that are not pure data, e.g. webpages or PDF documents. Often prefixed with the source (web-scraping, PDF-scraping).
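
A minimal web-scraping sketch using only Python's standard html.parser module: it pulls the cells of an HTML table into a list of rows. The HTML snippet and the CellCollector class are invented for the example; real pages are usually messier, and dedicated scraping libraries are often used instead.

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Hypothetical helper: collects the text of table cells, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")      # start a new, empty cell

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()

# Invented page fragment standing in for a downloaded webpage.
html = ("<table><tr><th>City</th><th>Visitors</th></tr>"
        "<tr><td>Alpha</td><td>120</td></tr></table>")

parser = CellCollector()
parser.feed(html)
print(parser.rows)
```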

Categorical Data

Data that helps put things into categories, e.g. country names, groups, conditions, tags.

Discrete Data

Numerical data that, if you plot all possible values, has gaps in it, e.g. the count of things (there are no 1.5 children). Compare to Continuous Data.

Continuous Data

Numerical data that, if you plot all possible values, has no gaps, e.g. sizes (you can be 155.55 or 155.56 cm tall, etc.). Compare to Discrete Data.

Boolean logic

A form of algebra in which all values are reduced to either TRUE or FALSE.
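
As an illustration, the sketch below prints a small truth table for the basic operations AND, OR and NOT using Python's built-in True and False values, and checks one well-known identity (De Morgan's law) for every combination of inputs.

```python
from itertools import product

# Every combination of two TRUE/FALSE inputs.
combinations = list(product([True, False], repeat=2))

print("a      b      a AND b  a OR b  NOT a")
for a, b in combinations:
    print(f"{a!s:6} {b!s:6} {(a and b)!s:8} {(a or b)!s:7} {(not a)!s}")

# De Morgan's law: NOT (a AND b) equals (NOT a) OR (NOT b).
de_morgan_holds = all((not (a and b)) == ((not a) or (not b))
                      for a, b in combinations)
print(de_morgan_holds)
```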