Data Wrangling Walk-through
This guide will introduce you to key routines we usually do, before publishing any data package. Nevertheless, these are basic tasks that require data wrangling skills.
There are many sources of data. You should always look for the most reliable ones.
For this example, lets say we want to get data about the Historical GDP, like we did, here.
The source of that indicator is a private spreadsheet and we preferred that because it contains data from the longest time we could find.
Preparing the Data
The final outcome we want is to have a
JSON file. In that case, the first step is to find that data in the pretended
.CSV file and you can do that by simply building a Python script to remove the many white and blank spaces we do not need. Personally, if you are not too comfortable, it might make sense to simply remove those spaces using Excel/Spreadsheets. However, this dataset in particular is very human readable, but not machine readable. This table is a what we call pivot table. In summary, we need to “unpivot” it.
Assuming you have deleted all the blank spaces manually, and stored all sheets into separate
.CSV files, using Python’s
pandas package, you can wrangle this table by running the following chunk of code:
import pandas as pd def main(): df_pop = pd.read_csv('source/population.csv') df_gdp = pd.read_csv('source/gdp.csv') df_gdp_grow = pd.read_csv('source/gdp-growth.csv') df_gdp_pc = pd.read_csv('source/gdp-per-capita.csv') df_gdp_pc_growth = pd.read_csv('source/gdp-per-capita-growth.csv') df_pop_growth = pd.read_csv('source/population-growth.csv') le_pop = pd.melt(df_pop, id_vars=['Country'], var_name="Year", value_name="Value") le_popgr = pd.melt(df_pop_growth, id_vars=['Country'], var_name="Year", value_name="Value") le_gdp = pd.melt(df_gdp, id_vars=['Country'], var_name="Year", value_name="Value") le_gdpgr = pd.melt(df_gdp_grow, id_vars=['Country'], var_name="Year", value_name="Value") le_gdppc = pd.melt(df_gdp_pc, id_vars=['Country'], var_name="Year", value_name="Value") le_gdppcgr = pd.melt(df_gdp_pc_growth, id_vars=['Country'], var_name="Year", value_name="Value") # Sort by country name - convenience only le_pop2 = le_pop.sort_values(['Country']) le_pop2gr = le_popgr.sort_values(['Country']) le_gdp2 = le_gdp.sort_values(['Country']) le_gdpgr2 = le_gdpgr.sort_values(['Country']) le_gdppc2 = le_gdppc.sort_values(['Country']) le_gdppcgr2 = le_gdppcgr.sort_values(['Country']) # Write each df to csv le_pop2.dropna().to_csv('data/population.csv', sep=',', index=False) le_pop2gr.dropna().to_csv('data/population-growth.csv', sep=',', index=False) le_gdp2.dropna().to_csv('data/gdp.csv', sep=',', index=False) le_gdpgr2.dropna().to_csv('data/gdp-growth.csv', sep=',', index=False) le_gdppc2.dropna().to_csv('data/gdp-per-capita.csv', sep=',', index=False) le_gdppcgr2.dropna().to_csv('data/gdp-per-capita-growth.csv', sep=',', index=False) if __name__ == '__main__': main()
In summary, what this chunk of code does it loading the many
.CSV files, removes the pivot format by running the
pandas.melt function, sorts it by country (alphabetically), and then it stores all files as new
.CSV files, later to be used in the datapackage.
Setting up a Git Repository
We strongly advise the usage of GitHub as a great way to keep your code open and free of access. Do not worry if you do not know what Git or GitHub is. We have a small guide here.
In order to keep everything organized and as universal as possible, your directory should look similar to:
dir data id-name.csv archive source.csv scripts README.md py.py requirements.txt - optional README.md datapackage.json Makefile
In summary, you should fit under
data the final CSV that you will use for your data package and the source data into the
archive folder. If you need to perform any script to clean and wrangle any bit of the dataset, you have to post it under
scripts, preferably with the name
process.py, but this is not a convention. The
dir/README.md should contain information about the package, source and licenses (if it applies). On the other hand,
scripts/README.md should talk about the script and any particular information about it.
Note for Python users: Do not forget to create the
requirements.txt if you use any special Python package.
As for the Makefile, which allows us to automate the process of maintaining a data package, it should contain a piece of code similar to:
version="0.1.0" DATADIR=data SCRIPTDIR=scripts all: data data: python $(SCRIPTDIR)/process.py clean: rm -f $(DATADIR)/* .PHONY: all data clean
This is telling Python where are the base directories to search for the script and to search for the data and it also tells Python to erase the previous outcomes and store new ones with the new data.
At this stage, you should have everything pretty much covered in this simple walk-through, but we advice you to double check your data package by looking for details that make it not machine readable:
- The common structure of these packages is
COUNTRY,YEAR,VALUE. This is not static though. The World Bank of Data usually provides such structure that we generally use
COUNTRY,COUNTRY CODE, YEAR, VALUE.
- We prefer using
,to separate decimal values. We also want to avoid using certain symbols such as
%, &, #, ;, :,and a few others that can interfere with the data package.
- If data is not available, you either make that cell as
- The data package name should be
your-package.csv. We rather use
-(underscores) to refer to spaces.
By now you can understand why we use programming skills here, since you can basically automate most of these processes.
From an API
Another common method to get data is from an API. The process is very similar, although can load the data directly to your Python script. In other words, if you recurr to an API, you can actually fasten this process even more.
Lets take another example and assume we want to retrieve data about the GINI Index over time, like we did here.
There are two ways you can get data from the World Bank:
Downloading the CSV file manually, in the indicator page - in the case of the GINI, that would be http://data.worldbank.org/indicator/SI.POV.GINI. Or
Use their API.
The first method requires that you download the
.CSV as the picture bellow demonstrates and then you have to repeat the process as described above.
The second method, speeds a lot of this process.
To build an API query, you will have to tell Python the base URL, then the special part of the URL you want and the format. In the case of the World Bank of Data and the GINI Index, you will to have a similar script:
import pandas as pd apiBase = "http://api.worldbank.org/indicator/" apiIndicator = "SI.POV.GINI" # This can be changed to any other indicator FILE_NAME = 'gini-index.csv' source = apiBase+apiIndicator+"?format=csv" def main(): giniIndex = pd.read_csv(source) giniIndex.to_csv('archive/gini-index.csv', sep=",", index_col=0, index=False)
What this does is fetches the
.CSV file from the API, the
pandas package reads it as a dataframe and the script then stores it as a
.CSV file. However, this is not the end case, as you will need to revise the data and perform the same tasks to ensure quality on these packages.