Bubble Timeline using plotly.express
Interactive data visualization
- plotly.express
- Data Cleaning
- Data Visualization
- Almost Done!
- Customization
- Previous Lesson: Simple Bubble Chart
- Next Lesson: Coming soon...
- Additional information
- References:
As mentioned in the instructions, all materials can be opened in Colab as Jupyter notebooks, so users can run the code in the cloud. It is highly recommended to follow the tutorials in the right order.
plotly.express
Hi there! In the last tutorial, we began to explore the potential of plotly.express, a wrapper for Plotly.py that allows more interaction in our graphics. Last time we made a simple scatter plot/bubble chart. This time we will continue with a variation of the bubble chart to show how the number of UNESCO inscriptions developed over time in different countries. This timeline places bubbles of varying size along the x-axis and can be used to contrast temporal trends across multiple categories.
In order to create the timeline, we first have to import the libraries we need and read the data into a pandas data frame.
import pandas as pd
# read data
url = 'https://examples.opendatasoft.com/explore/dataset/world-heritage-unesco-list/download/?format=csv&timezone=Europe/Berlin&lang=en&use_labels_for_header=true&csv_separator=%3B'
df = pd.read_csv(url, sep=";")
df.head()
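If the download is not reachable, the same read_csv() call can be tried offline. The following is a minimal sketch, assuming a tiny semicolon-separated sample; the rows and the date format are made up for illustration and are not taken from the real data set.

```python
import io
import pandas as pd

# Hypothetical sample that mimics the export layout: ';' is the separator
sample_csv = (
    "Name (EN);Date inscribed;Country (EN)\n"
    "Site A;1987-01-01;China\n"
    "Site B;1997-01-01;Italy\n"
)

# io.StringIO lets read_csv treat a string like a file
sample_df = pd.read_csv(io.StringIO(sample_csv), sep=";")
print(sample_df.shape)
```

Without sep=";" pandas would look for commas and read each line as a single column.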
Data Cleaning
We need to start with some preprocessing and data cleaning. We first subset and rename the columns, then count the UNESCO sites per country using groupby() and keep the "top 10 countries" (we do not need this data frame for the plot, only the list of the top 10 countries). We sort the values using sort_values(by=['name'], ascending=False) to order the countries from the most to the fewest UNESCO sites.
df = df[["Name (EN)","Date inscribed","Category","Country (EN)","Continent (EN)"]] # select multiple columns in a list []
df = df.rename(columns={"Name (EN)": "name", "Date inscribed": "date", "Category": "type", "Country (EN)": "country", "Continent (EN)": "continent"}) # rename the columns for easy reading
top_10 = df.groupby(df["country"]).count().sort_values(by=['name'], ascending=False).head(10)
top_10
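The ranking step can be checked on a small example. The sketch below uses made-up rows (one row per site) and keeps the top 2 instead of the top 10; it also shows value_counts(), which should give the same ranking in a single call.

```python
import pandas as pd

# Made-up rows: one row per site
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f"],
    "country": ["Italy", "China", "Italy", "Spain", "Italy", "China"],
})

# Same pattern as above: count rows per country, largest first, keep the top 2
top_2 = toy.groupby("country").count().sort_values(by="name", ascending=False).head(2)
print(top_2)

# value_counts() counts and sorts in one step
ranking = toy["country"].value_counts().head(2)
print(ranking)
```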
Get the top 10 countries as a numpy array.
sub_cnty = top_10.index.values
sub_cnty
With the list of the top 10 countries, we can now drop all the rows from other countries using isin(sub_cnty). We then group the rows by country and date, count the rows for every country and every date, and reset the index.
top_df = df[df['country'].isin(sub_cnty)].groupby(['country','date']).count()['name'].reset_index()
top_df.head(5)
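To see what the filter and the grouped count do, here is a small sketch on made-up rows. The list keep is a hypothetical stand-in for sub_cnty.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "country": ["Italy", "Italy", "Spain", "Peru"],
    "date": ["1997-01-01", "1997-01-01", "1984-01-01", "1983-01-01"],
})
keep = ["Italy", "Spain"]  # hypothetical stand-in for sub_cnty

# keep only the wanted countries, then count rows per country and date
counts = (
    toy[toy["country"].isin(keep)]
    .groupby(["country", "date"])["name"]
    .count()
    .reset_index()
)
print(counts)
```

The row for Peru is filtered out before grouping, and the two Italian rows with the same date collapse into one row with a count of 2.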
As we only need the year, not the full date, we will create a new column year. We can extract the year by first interpreting the date column as datetime and then taking the year values (simply with .year).
top_df['year'] = pd.DatetimeIndex(top_df['date']).year # set up a new year column
top_df.head()
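An equivalent way to get the year is pd.to_datetime() with the .dt accessor. A minimal sketch, assuming the dates are ISO-style strings (the values below are made up):

```python
import pandas as pd

dates = pd.Series(["1987-01-01", "1997-01-01", "2019-01-01"])  # made-up values

# to_datetime parses the strings; the .dt accessor exposes the year
years = pd.to_datetime(dates).dt.year
print(years.tolist())
```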
Now we group again, this time by country and year, and take the sum (the number of inscriptions per year).
group_df = top_df.groupby(["country","year"]).sum()
import numpy as np
country_list = np.array(group_df.index.get_level_values(0))
year_list = np.array(group_df.index.get_level_values(1))
As we need country and year as regular columns, not only as index levels, we assign them as columns again.
group_df['country'] = country_list
group_df['year'] = year_list
Rename the name column to count.
group_df = group_df.rename(columns={"name": "count"})
group_df.head()
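As an alternative to copying the index levels back into columns, groupby() accepts as_index=False, which keeps the grouping keys as ordinary columns. This is only a sketch on made-up numbers, not the approach used in this tutorial.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Italy", "Italy", "Spain"],
    "year": [1997, 1997, 1984],
    "name": [2, 1, 1],  # made-up counts per date
})

# as_index=False keeps country and year as ordinary columns
summary = (
    toy.groupby(["country", "year"], as_index=False)["name"]
    .sum()
    .rename(columns={"name": "count"})
)
print(summary)
```

With this variant there is no need for numpy or for a later reset_index() before merging.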
To improve the visuals, we will shorten the name of the UK.
group_df['country'] = group_df['country'].str.replace('United Kingdom of Great Britain and Northern Ireland','United Kingdom')
Data Visualization
To make an interactive scatter plot in plotly.express, we only need px.scatter(). It works well with pandas, so we can pass in a pandas data frame and specify x and y (as well as the optional size and color) by column name.
Any change to the layout can be made with update_layout(). All the options can be found in the Plotly documentation.
import plotly.express as px
fig = px.scatter(group_df, x="year", y="country", size="count", color="country")
fig.update_layout(showlegend=False)
fig.show()
Almost Done!
Good job! Let's look at our plot. It is interactive, so you can pan around and zoom in/out. If you hover over a bubble, you will also see information such as the country name and the count for that year. This is Plotly's default behaviour.
However, we can also control what information goes into the hover labels, as well as their layout (the font, font size and so on). Wouldn't it be much cooler to show the names of all UNESCO sites instead of the count?!
Also, we can display hover labels for a whole x-axis position instead of an individual bubble, which means we can display all UNESCO sites inscribed in a year! Let's say we also want a vertical line that follows the cursor along the x-axis.
Let's do all the adjustments mentioned above.
Customization
df.head()
Adjust Data Frame
As we need the UNESCO site names this time, we go back to df, make a subset for the top 10 countries and then merge it with our group_df. Let's do some cleaning on df first. We add the year column to df too, then group by country and year and apply a transformation.
It is a bit tricky. The transformation takes all the rows with the same country and year and joins their values from ['name'], separated by a comma (,). It is only applied to the top 10 countries, df[df['country'].isin(sub_cnty)]. As the joined string is written to every row of a group, the same value is repeated across rows, so we will remove the duplicates.
df['year'] = pd.DatetimeIndex(df['date']).year
# join the site names
df['site'] = df[df['country'].isin(sub_cnty)].groupby(['country','year'])['name'].transform(lambda x: ', '.join(x))
# remove duplicates (drop_duplicates returns a new data frame, so assign the result)
df = df.drop_duplicates()
# look at the rows for China
df[df["country"] == "China"].head(5)
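The behaviour of transform() is easier to see on a handful of rows. A minimal sketch with made-up site names: every row of a group receives the same joined string, and the rows only become identical once the per-site name column is left out.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Italy", "Italy", "China"],
    "year": [1997, 1997, 1987],
    "name": ["Site A", "Site B", "Site C"],  # made-up names
})

# transform returns one value per original row, so both Italian rows
# receive the same joined string
toy["site"] = toy.groupby(["country", "year"])["name"].transform(lambda x: ", ".join(x))

# without the per-site column, the repeated rows are identical and can be dropped
per_year = toy[["country", "year", "site"]].drop_duplicates()
print(per_year)
```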
Make sure only the top 10 countries are included.
df_sub = df[df['country'].isin(sub_cnty)]
df_sub.head(1)
group_df.head(1)
group_df.reset_index(drop=True, inplace=True)
Now we have the name information in df_sub. We can merge it with our group_df data frame using the keys "country" and "year". We select only the relevant columns [["country","year","site","count"]], drop the rows that repeat the same country and year, and call the new data frame final.
final = df_sub.merge(group_df, on=["country","year"])
final = final[["country","year","site","count"]].drop_duplicates()
final.head()
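The merge can be tried on two small, made-up frames. In this sketch, on=[...] is used because both frames share the key names, and drop_duplicates() is shown as one optional way to keep a single row per country and year.

```python
import pandas as pd

# Made-up frames: one row per site on the left, one row per country and year on the right
sites = pd.DataFrame({
    "country": ["Italy", "Italy", "China"],
    "year": [1997, 1997, 1987],
    "site": ["Site A, Site B", "Site A, Site B", "Site C"],
})
counts = pd.DataFrame({
    "country": ["Italy", "China"],
    "year": [1997, 1987],
    "count": [2, 1],
})

# every left row picks up the count of its (country, year) pair
merged = sites.merge(counts, on=["country", "year"])

# keep one row per country and year
merged = merged[["country", "year", "site", "count"]].drop_duplicates()
print(merged)
```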
Great! Almost everything is ready. We only need to replace the commas with <br> so that every item is put on a new line in the hover labels.
final.site = final.site.apply(lambda x: x.replace(', ', '<br>'))
final.site.head()
fig = px.scatter(final, x="year", y="country", size="count", color="country",
                 custom_data=['year', 'site'])
# remove legend
fig.update_layout(showlegend=False)
# show labels for whole x axis
fig.update_layout(hovermode='x')
# change layout for hover labels
fig.update_layout(
    hoverlabel=dict(
        bgcolor="white",
        font_size=12,
        font_family="Rockwell"
    )
)
# control info for hover labels using the custom_data we specified above in px.scatter()
# join items with a new line <br>
fig.update_traces(
    hovertemplate="<br>".join([
        "%{y}",
        "Site: %{customdata[1]}"
    ])
)
# add title, x- and y-axis labels, and a moving line along the x-axis
# change font styles for the text inside the plot (y ticks and so on)
fig.update_layout(
    title={
        'text': "Timeline of UNESCO Inscriptions",
        'y': 0.95,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    xaxis_title="Year of Inscription",
    yaxis_title="Top 10 Countries",
    xaxis={'showspikes': True,
           'spikemode': 'across',
           'spikesnap': 'cursor',
           'showline': True,
           'showgrid': True},
    font=dict(
        family="Rockwell",
        size=15,
        color="black"
    )
)
# display our plot
fig.show()
Cool! That's it!
Now we have an interactive plot with enhanced visuals and all information we need in the labels. Not only can we clearly see the trends of inscriptions in different countries, we can also clearly see the "inscription peak" of some countries (such as 1997 in Italy). We can tell, for example, countries like Russia and China are late players in the field.
Previous Lesson: Simple Bubble Chart
Next Lesson: Coming soon...
Additional information
This notebook is provided for educational purposes. Feel free to report any issues on GitHub.
Author: Ka Hei, Chow
License: The code in this notebook is licensed under the Creative Commons Attribution 4.0 license.
Last modified: December 2021