Pandas Numerical Operations
UNESCO Example
As mentioned in the instructions, all materials can be opened in Colab as Jupyter notebooks, so users can run the code in the cloud. It is highly recommended to follow the tutorials in order.
This tutorial is a follow-up to the previous Pandas tutorial, which introduced functions for working with text. Numerical operations in Pandas are equally essential knowledge when it comes to computing statistics on data. Here we use a dataset of UNESCO heritage sites to demonstrate how to work with numbers and datetimes in Pandas. The tutorial also covers some basics of plotting with Matplotlib and Pandas.
Presumptions: same as the previous notebook.
import pandas as pd # pd.read_csv() can read directly from a URL, so no other imports are needed here
# read data
url = 'https://examples.opendatasoft.com/explore/dataset/world-heritage-unesco-list/download/?format=csv&timezone=Europe/Berlin&lang=en&use_labels_for_header=true&csv_separator=%3B'
df = pd.read_csv(url, sep=";")
df.head(2) # head() is used for viewing the first few rows of data
df["Country (EN)"] # selecting one column
By appending .values, we can convert a column of the Pandas data frame to a NumPy array.
country_arr = df["Country (EN)"].values # to numpy
country_arr
import numpy as np
unique_name = np.unique(country_arr) # the list of countries; np.unique() returns each value only once, no matter how many times it appears
unique_name[:3] # check the first three countries only
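As a side note, Pandas offers equivalents that avoid the round trip through NumPy: Series.unique() and Series.nunique(). A minimal sketch on made-up data (not the UNESCO file):

```python
import pandas as pd

# Toy data with repeated country names
countries = pd.Series(["China", "France", "China", "Italy", "France"])

unique_values = countries.unique()   # unique values, in order of first appearance
n_unique = countries.nunique()       # number of distinct values

print(list(unique_values))  # ['China', 'France', 'Italy']
print(n_unique)             # 3
```

Unlike np.unique(), Series.unique() does not sort the result; it preserves the order of first appearance.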
df.filter(items=["Name (EN)","Category","Country (EN)"]) # filter() keeps only the listed columns
What we can also do is query. For example, we can select heritage sites that are from China and belong to the Cultural category. Remember that any operation not assigned back to the data frame is only temporary (unless inplace=True is used). With query(), we pass a query string; column names containing spaces must be enclosed in backticks (`), and we must alternate single and double quotation marks carefully to avoid confusion.
For example, "
`Country (EN)` == 'China' & Category == 'Cultural'"
will be fine, but "
`Country (EN)` == "China" & Category == "Cultural""
will run into errors, because the inner double quotes terminate the string early.
df.query("`Country (EN)` == 'China' & Category == 'Cultural'")
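If the filter values live in Python variables, query() can reference them with the @ prefix instead of hard-coding them into the string. A small sketch on a toy frame (the column names mirror the UNESCO data, but the rows are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "Country (EN)": ["China", "China", "France"],
    "Category": ["Cultural", "Natural", "Cultural"],
})

country = "China"
cat = "Cultural"

# @variable interpolates a Python variable into the query string,
# sidestepping the nested-quotation-mark problem entirely
result = toy.query("`Country (EN)` == @country & Category == @cat")
print(result)
```

This keeps the query string free of literal values, which is handy when the filter is built dynamically.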
To avoid that confusion, another option is to subset the data frame without a query string. We can do it simply using []. However, we then need parentheses () to group the conditions in the right order.
For example:
df["Country (EN)"] == "China" & df["Category"] == "Cultural"
In the line above, & binds more tightly than ==, so Python first tries to evaluate:
"China" & df["Category"]
which will run into errors.
What we need to do is group the conditions:
(df["Country (EN)"] == "China") & (df["Category"] == "Cultural")
so we make sure it is interpreted as:
(col-A == a) & (col-B == b)
df[(df["Country (EN)"] == "China") & (df["Category"] == "Cultural")]
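The precedence pitfall above can be demonstrated directly on a toy frame: without parentheses the expression raises a TypeError, while the parenthesized version produces the intended boolean mask.

```python
import pandas as pd

toy = pd.DataFrame({"country": ["China", "France"], "type": ["Cultural", "Natural"]})

raised = False
try:
    # & is evaluated before ==, so Python first tries "China" & toy["type"]
    toy[toy["country"] == "China" & toy["type"] == "Cultural"]
except TypeError:
    raised = True
print("unparenthesized version raised:", raised)

# Parentheses make each comparison complete before & combines them
mask = (toy["country"] == "China") & (toy["type"] == "Cultural")
print(mask.tolist())  # [True, False]
```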
So now, let's filter all Chinese sites regardless of heritage types.
china_site = df[df["Country (EN)"] == "China"]
china_site.head(1)
Check the number of sites we have in the data frame:
china_site['Name (EN)'].count()
As the column names are a bit confusing with their spaces and capital letters, we will rename the columns:
china_site = china_site[["Name (EN)","Date inscribed","Category"]] # select multiple columns in a list []
china_site = china_site.rename(columns={"Name (EN)": "name", "Date inscribed": "date", "Category": "type"}) # rename the columns for easy reading
china_site.head(1) # check the updates
china_site['date'] = pd.to_datetime(china_site["date"])
By converting the column to datetime objects, we can perform various operations inside the data frame, such as extracting only the year information via .year. As the year is more relevant for us than the month and day, we will remove the original column and add a year column.
china_site['year'] = pd.DatetimeIndex(china_site['date']).year # set up a new year column
china_site = china_site.drop(columns=['date']) # remove the original date column
china_site.head(1) # check the first row
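As an aside, the .dt accessor on a datetime Series is an equivalent way to pull out the year, without constructing a DatetimeIndex. A toy sketch (the dates below are made up, not the real inscription dates):

```python
import pandas as pd

toy = pd.DataFrame({"date": ["1987-12-11", "2000-11-30"]})
toy["date"] = pd.to_datetime(toy["date"])

# .dt.year is equivalent to pd.DatetimeIndex(col).year
toy["year"] = toy["date"].dt.year
print(toy["year"].tolist())  # [1987, 2000]
```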
UNESCO Sites
Using the china_site variable, we can check the number of sites for each year using groupby(). After groupby(), we keep only the name column and count the number of rows using count(). However, this returns a Pandas Series (pandas.core.series.Series), not a data frame. We need to convert it back to a data frame using reset_index(), passing name="count" to tell Pandas that the new column should be called "count".
count = china_site.groupby("year")["name"].count()
count
count_df = count.reset_index(name="count")
count_df
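A shorter route to the same per-year counts is value_counts(), shown here on toy data. Note that value_counts() sorts by count by default, so we re-sort by year to match the groupby output:

```python
import pandas as pd

toy = pd.DataFrame({"year": [1987, 1987, 1990], "name": ["a", "b", "c"]})

# groupby route, as in the tutorial
via_groupby = toy.groupby("year")["name"].count().reset_index(name="count")
print(via_groupby["count"].tolist())  # [2, 1]

# value_counts route: counts per year, then re-sorted by year
counts_by_year = toy["year"].value_counts().sort_index()
print(counts_by_year.tolist())        # [2, 1]
```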
Cumulative totals of the heritage sites
The table above displays the number of sites inscribed in China each year. But what if we want to know the total number of heritage sites in China up to every year? We can use cumsum(), which stands for cumulative summation. Let's put the running total into a new column called "total".
count_df["total"] = count_df["count"].cumsum()
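To see cumsum() in isolation, a tiny example with made-up numbers:

```python
import pandas as pd

counts = pd.Series([2, 1, 3])

# Each element becomes the sum of itself and all elements before it
running_total = counts.cumsum()
print(running_total.tolist())  # [2, 3, 6]
```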
Set Index
Let's set the year column as the index.
count_df = count_df.set_index("year")
count_df.head()
import matplotlib.pyplot as plt # import library
plt.figure(figsize=(15,5)) # optional, define figure size
count_df.total.plot(color="red") # plot, add color argument
plt.xlabel("Year") # x label
plt.ylabel("Number of UNESCO sites") # y label
plt.title("Increasing number of UNESCO sites in China") # title
We can also plot a table instead.
data = {"Year": count_df.index.values, "Total Sites": count_df.total.values}
table_df = pd.DataFrame(data) # use a new name so the original df stays intact for later sections
fig, ax = plt.subplots(1, 1)
# Hide axes
ax.xaxis.set_visible(False)
ax.yaxis.set_visible(False)
ax.axis('tight')
ax.axis('off')
ax.table(cellText=table_df.values, colLabels=table_df.keys(), loc='center')
plt.show()
To export the data for further use, we can save it as a CSV file.
from google.colab import files
table_df.to_csv('UNESCO.csv', encoding='utf_8_sig', index=False)
files.download('UNESCO.csv')
Or print it as LaTeX.
print(table_df.to_latex(index=False))
Plotting
Let's do some plotting using our data.
Say we are interested in the progress of UNESCO site applications from different countries. We want to create a plot showing which countries hold the largest share of sites and how the trend develops over time. Creating the plot is typically the final step of data visualization; a few preparation steps come before it.
Typical Workflow for Visualization:
1) Get data: grab it online or offline
2) Clean data: get rid of missing data, remove irrelevant information, group parameters, etc.
3) Decide on the type of visualization: what type of chart (depending on your message, purpose, and the nature of the data)? By what parameters (e.g. by year, by country, or focusing on one country only)?
4) Prepare the data in the format fitting your visualization type (wide or long format?)
Data Preparation
The first thing we need to do is collect the top 10 countries with the most UNESCO sites.
df = df[['Country (EN)',"Name (EN)","Date inscribed","Category"]] # select multiple columns in a list []
df = df.rename(columns={"Country (EN)": "country","Name (EN)": "name", "Date inscribed": "date", "Category": "type"}) # rename the columns for easy reading
top_10 = df.groupby(df["country"]).count().sort_values(by=['name'], ascending=False).head(10)
top_10
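The same top-10 ranking can be written more compactly with value_counts(), which already sorts in descending order. A toy sketch (made-up rows, top 2 instead of top 10):

```python
import pandas as pd

toy = pd.DataFrame({"country": ["China", "China", "Italy", "China", "Italy", "Spain"]})

# value_counts() sorts descending by count, so head(n) gives the top n
top_2 = toy["country"].value_counts().head(2)

print(top_2.index.tolist())   # ['China', 'Italy']
print(top_2.tolist())         # [3, 2]
```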
Then we convert the country names to a NumPy array and save it to a variable sub_cnty.
sub_cnty = top_10.index.values
sub_cnty
top_df is the data frame containing only the top 10 countries. As we aim to plot the number of heritage sites per year for each country, we first filter the rows to the top 10 countries using df['country'].isin(sub_cnty), then call groupby() on both country and date. After groupby() we apply count() and select only the name column (country and date are also kept, as they are the keys used to group by). To get a data frame as output instead of a Series, we call reset_index().
top_df = df[df['country'].isin(sub_cnty)].groupby(['country','date']).count()['name'].reset_index()
top_df.head(5)
pivot() is a Pandas function for converting a data frame from long to wide format. In this case, it turns every unique item in the country column into a separate column.
pivot = top_df.pivot(index='date', columns='country', values='name')
pivot = pivot.fillna(0)
pivot.head(10)
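A hedged note: pivot_table() can do the counting, the reshape, and the zero-filling in one call, via its aggfunc and fill_value parameters. A sketch on toy long-format data shaped like top_df:

```python
import pandas as pd

long_df = pd.DataFrame({
    "country": ["China", "China", "Italy"],
    "date":    [1987, 1990, 1987],
    "name":    ["a", "b", "c"],
})

# pivot_table counts rows per (date, country) and fills missing cells with 0,
# replacing the separate groupby/count, pivot, and fillna steps
wide = pd.pivot_table(long_df, index="date", columns="country",
                      values="name", aggfunc="count", fill_value=0)
print(wide)
```

This is equivalent to the groupby-then-pivot route for this data; pivot() alone, by contrast, requires pre-aggregated input with no duplicate index/column pairs.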
pivot = pivot.cumsum()
pivot.head(10)
import matplotlib.pyplot as plt # import library
pivot.plot(figsize=(20,6)) # define plot size
plt.xlabel("Year", fontsize=14) # x label
plt.ylabel("Count", fontsize=14) # y label
plt.title("UNESCO Sites", fontsize=20) # title
import seaborn as sns # import library
custom_params = {"axes.spines.right": False, "axes.spines.top": False} # define axes parameters
sns.set_theme(style="ticks", rc=custom_params, context="talk") # define theme
plt.style.use("dark_background") # define background color (here we try with a dark theme)
import matplotlib.dates as mdates # import mdates for plotting datetime data
plt.figure(figsize=(20, 6))
sns.lineplot(data=pivot) # call lineplot using seaborn library
plt.title("UNESCO Sites", fontsize=28)
plt.xlabel("Year")
plt.ylabel("Count")
plt.legend(bbox_to_anchor=(1.02, 0.9), loc=2, borderaxespad=0.) # display a legend at the position out of our plot
plt.tight_layout() # adjust spacings of elements
Stacked Area Chart
Apart from the line chart, we can also create a stacked area plot for more visual impact. The principle is similar, but this time we will use stackplot together with the "sym" baseline (symmetric around zero, sometimes called a 'ThemeRiver'). If you are particularly interested in different plot types, stick with these tutorials, as we will discuss more in the next chapters.
Also, feel free to check out a catalogue of data visualizations made with Python.
fig, ax = plt.subplots(figsize=(15, 6)) # figure size
palette = sns.color_palette("Spectral", len(pivot.columns)) # one color per country
ax.stackplot(pivot.index.values, [pivot[name].values for name in pivot], baseline="sym", colors=palette, labels=pivot.columns) # stacked area; labels follow the pivot column order
ax.axhline(0, color="red", ls="--", linewidth=.8) # red line in the middle
plt.title("UNESCO Sites", fontsize=25) # title
plt.xlabel("Year") # labels
plt.ylabel("Count")
plt.legend(bbox_to_anchor=(1.02, 0.9), loc=2, borderaxespad=0.) # legend
plt.tight_layout() # adjust spacings
plt.figure(figsize=(24,6))
plt.hlines(y=pivot.index, xmin=0, xmax=pivot['China'], color="#FCC700")
plt.plot(pivot['China'], pivot.index, "o", markersize=8, color="#FC4900")
plt.title("UNESCO Sites in China", fontsize=26)
plt.xlabel("Count")
plt.ylabel("Year")
Additional information
This notebook is provided for educational purposes; feel free to report any issues on GitHub.
Author: Ka Hei, Chow
License: The code in this notebook is licensed under the Creative Commons Attribution 4.0 license.
Last modified: December 2021
References:
https://examples.opendatasoft.com/explore/dataset/world-heritage-unesco-list/table/