Data Wrangling Walk-through
This guide introduces the key routines we usually run through before publishing any data package. They are basic tasks, but they require some data wrangling skills.
Getting data
There are many sources of data. You should always look for the most reliable ones.
For this example, let's say we want to get data about historical GDP, like we did here.
The source of that indicator is a private spreadsheet, which we preferred because it covers the longest time span we could find.
Preparing the Data
The final outcome we want is a .CSV file and a JSON file. The first step is to get the data into the intended .CSV shape, and you can do that by building a Python script to remove the whitespace and blank cells we do not need. If you are not comfortable with scripting, it might make sense to simply remove those spaces in Excel/Spreadsheets. However, this dataset in particular is very human readable but not machine readable: the table is what we call a pivot table, and, in summary, we need to “unpivot” it.
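To make the idea of “unpivoting” concrete, here is a minimal sketch (with made-up numbers) of what pandas.melt does to a small pivoted table:

import pandas as pd

# A tiny, made-up example of the pivoted ("wide") layout:
# one row per country, one column per year.
wide = pd.DataFrame({
    'Country': ['Afghanistan', 'Albania'],
    '2015': [19.1, 11.4],
    '2016': [19.5, 11.9],
})

# "Unpivoting" produces the long layout: one row per (Country, Year) pair.
long = pd.melt(wide, id_vars=['Country'], var_name='Year', value_name='Value')
print(long)
#        Country  Year  Value
# 0  Afghanistan  2015   19.1
# 1      Albania  2015   11.4
# 2  Afghanistan  2016   19.5
# 3      Albania  2016   11.9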
Assuming you have deleted all the blank spaces manually and stored each sheet as a separate .CSV file, you can wrangle this table with Python's pandas package by running the following chunk of code:
import pandas as pd

def main():
    df_pop = pd.read_csv('source/population.csv')
    df_gdp = pd.read_csv('source/gdp.csv')
    df_gdp_grow = pd.read_csv('source/gdp-growth.csv')
    df_gdp_pc = pd.read_csv('source/gdp-per-capita.csv')
    df_gdp_pc_growth = pd.read_csv('source/gdp-per-capita-growth.csv')
    df_pop_growth = pd.read_csv('source/population-growth.csv')

    le_pop = pd.melt(df_pop, id_vars=['Country'], var_name="Year", value_name="Value")
    le_popgr = pd.melt(df_pop_growth, id_vars=['Country'], var_name="Year", value_name="Value")
    le_gdp = pd.melt(df_gdp, id_vars=['Country'], var_name="Year", value_name="Value")
    le_gdpgr = pd.melt(df_gdp_grow, id_vars=['Country'], var_name="Year", value_name="Value")
    le_gdppc = pd.melt(df_gdp_pc, id_vars=['Country'], var_name="Year", value_name="Value")
    le_gdppcgr = pd.melt(df_gdp_pc_growth, id_vars=['Country'], var_name="Year", value_name="Value")

    # Sort by country name - convenience only
    le_pop2 = le_pop.sort_values(['Country'])
    le_pop2gr = le_popgr.sort_values(['Country'])
    le_gdp2 = le_gdp.sort_values(['Country'])
    le_gdpgr2 = le_gdpgr.sort_values(['Country'])
    le_gdppc2 = le_gdppc.sort_values(['Country'])
    le_gdppcgr2 = le_gdppcgr.sort_values(['Country'])

    # Write each df to csv
    le_pop2.dropna().to_csv('data/population.csv', sep=',', index=False)
    le_pop2gr.dropna().to_csv('data/population-growth.csv', sep=',', index=False)
    le_gdp2.dropna().to_csv('data/gdp.csv', sep=',', index=False)
    le_gdpgr2.dropna().to_csv('data/gdp-growth.csv', sep=',', index=False)
    le_gdppc2.dropna().to_csv('data/gdp-per-capita.csv', sep=',', index=False)
    le_gdppcgr2.dropna().to_csv('data/gdp-per-capita-growth.csv', sep=',', index=False)

if __name__ == '__main__':
    main()
In summary, this chunk of code loads the various .CSV files, removes the pivot format with the pandas.melt function, sorts the rows by country (alphabetically), and stores the results as new .CSV files, later to be used in the data package.
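If you save this script as scripts/process.py (the name suggested later in this guide) and run it from the package root with python scripts/process.py, it will read the files in source/ and write the unpivoted ones into data/.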
Setting up a Git Repository
We strongly advise using GitHub as a great way to keep your code open and freely accessible. Do not worry if you do not know what Git or GitHub is. We have a small guide here.
Directory structure
In order to keep everything organized and as universal as possible, your directory should look similar to:
dir/
    data/
        id-name.csv
    archive/
        source.csv
    scripts/
        README.md
        py.py
        requirements.txt (optional)
    README.md
    datapackage.json
    Makefile
In summary, the final CSV you will use for your data package goes under data, and the source data goes into the archive folder. If you need a script to clean and wrangle any part of the dataset, put it under scripts, preferably named process.py, although this is not a strict convention. The dir/README.md should contain information about the package, its source and licenses (if applicable), while scripts/README.md should describe the script and any particulars about it.
Note for Python users: do not forget to create the requirements.txt if you use any special Python package.
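For the script above, a minimal requirements.txt could look like the following (the version pin is only illustrative; pin whatever version you actually used):

# scripts/requirements.txt - packages needed by the wrangling script
pandas>=0.20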
As for the Makefile, which allows us to automate the process of maintaining a data package, it should contain a piece of code similar to:
version="0.1.0"
DATADIR=data
SCRIPTDIR=scripts
all: data
data:
python $(SCRIPTDIR)/process.py
clean:
rm -f $(DATADIR)/*
.PHONY: all data clean
This tells make where the base directories for the script and the data are, and it defines targets to erase the previous output and to regenerate and store new data.
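With that in place, and assuming the layout above, refreshing the data package comes down to two commands (remember that Makefile recipe lines must start with a tab character, not spaces):

make clean   # remove the previously generated CSVs from data/
make         # run scripts/process.py and write fresh CSVs into data/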
Quality Assurance
At this stage, you should have everything pretty much covered in this simple walk-through, but we advise you to double-check your data package, looking for details that could make it not machine readable:
- The common structure of these packages is COUNTRY,YEAR,VALUE. This is not static, though: the World Bank, for example, usually provides COUNTRY,COUNTRY CODE,YEAR,VALUE, which we also use.
- We prefer using . rather than , to separate decimal values. We also want to avoid certain symbols such as %, &, #, ;, : and a few others that can interfere with the data package.
- If data is not available, mark that cell as 0 or NaN.
- The data package name should be your-package.csv. We use hyphens (-) rather than spaces.
By now you can understand why we use programming skills here, since you can basically automate most of these processes.
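Since these checks are mechanical, you can script them too. Below is a small, hypothetical validation sketch (the file path and column names are just examples) that catches the most common problems before publishing:

import pandas as pd

def check_package(path, expected_columns=('Country', 'Year', 'Value')):
    """Basic machine-readability checks for a final CSV."""
    df = pd.read_csv(path)

    # 1. The columns should match the agreed structure.
    assert list(df.columns) == list(expected_columns), list(df.columns)

    # 2. Values must be numeric: decimals use '.' and there are no stray
    #    symbols such as '%', '&' or '#'. This raises if anything is off.
    value_col = expected_columns[-1]
    pd.to_numeric(df[value_col], errors='raise')

    # 3. Report missing cells so they can be set to 0 or NaN deliberately.
    print(path, '- missing values:', int(df[value_col].isnull().sum()))

if __name__ == '__main__':
    check_package('data/gdp.csv')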
From an API
Another common method of getting data is from an API. The process is very similar, although you can load the data directly into your Python script. In other words, if you resort to an API, you can speed this process up even more.
Let's take another example and assume we want to retrieve data about the GINI Index over time, like we did here.
There are two ways you can get data from the World Bank:
- Downloading the CSV file manually from the indicator page - in the case of the GINI Index, that would be http://data.worldbank.org/indicator/SI.POV.GINI; or
- Using their API.
The first method requires that you download the .CSV file, as the picture below demonstrates, and then repeat the process described above.
The second method speeds this process up a lot.
To build an API query, you tell Python the base URL, the indicator-specific part of the URL, and the format you want. In the case of the World Bank and the GINI Index, you would have a script similar to:
import pandas as pd

apiBase = "http://api.worldbank.org/indicator/"
apiIndicator = "SI.POV.GINI"  # This can be changed to any other indicator
FILE_NAME = 'gini-index.csv'
source = apiBase + apiIndicator + "?format=csv"

def main():
    # Read the CSV returned by the API and archive it locally.
    giniIndex = pd.read_csv(source)
    giniIndex.to_csv('archive/' + FILE_NAME, sep=",", index=False)

if __name__ == '__main__':
    main()
What this does is fetch the .CSV file from the API: the pandas package reads it as a dataframe, and the script then stores it as a .CSV file. However, this is not the end of the process, as you will still need to revise the data and perform the same tasks as before to ensure the quality of the package.
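For example, assuming the downloaded file has the same wide, one-column-per-year layout as before (and a 'Country' column - check the actual headers first), the clean-up is the same melt, sort and write sequence used earlier:

import pandas as pd

# Hypothetical follow-up: reshape the archived GINI file into Country,Year,Value.
df = pd.read_csv('archive/gini-index.csv')
long = pd.melt(df, id_vars=['Country'], var_name='Year', value_name='Value')
long.sort_values(['Country']).dropna().to_csv('data/gini-index.csv', index=False)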