As mentioned in the instructions, all materials can be opened in Colab as Jupyter notebooks, so users can run the code in the cloud. It is highly recommended to follow the tutorials in the right order.

plotly.express

Hi there! In the last tutorial, we began to explore the potential of plotly.express, which is a wrapper for Plotly.py that allows more interaction in our graphics. Last time we made a simple scatter plot/bubble chart. This time we will continue with a variation of the bubble chart to represent the temporal development of UNESCO inscriptions in different countries. This timeline places bubbles of varying size along the x-axis and can be used to contrast the temporal trends of multiple categories.

In order to create the timeline, we first have to import the needed libraries and read the data into a pandas data frame.

import io
import pandas as pd
import requests

# read data
url = 'https://examples.opendatasoft.com/explore/dataset/world-heritage-unesco-list/download/?format=csv&timezone=Europe/Berlin&lang=en&use_labels_for_header=true&csv_separator=%3B'

df = pd.read_csv(url, sep=";")
df.head()
Name (EN) Name (FR) Short description (EN) Short Description (FR) Justification (EN) Justification (FR) Date inscribed Danger list Longitude Latitude Area hectares Category Country (EN) Country (FR) Continent (EN) Continent (FR) Geographical coordinates
0 Architectural, Residential and Cultural Comple... Ensemble architectural, résidentiel et culture... The Architectural, Residential and Cultural Co... L’ensemble architectural, résidentiel et cultu... Criterion (ii): The architectural, residential... Critère (ii) : L’ensemble architectural, résid... 2005-01-01 NaN 26.691390 53.222780 0.00 Cultural Belarus Bélarus Europe and North America Europe et Amérique du nord 53.22278,26.69139
1 Rock Paintings of the Sierra de San Francisco Peintures rupestres de la Sierra de San Francisco From c. 100 B.C. to A.D. 1300, the Sierra de S... Dans la réserve d'El Vizcaíno, en Basse-Califo... NaN NaN 1993-01-01 NaN -112.916110 27.655560 182600.00 Cultural Mexico Mexique Latin America and the Caribbean Amérique latine et Caraïbes 27.65556,-112.91611
2 Monastery of Horezu Monastère de Horezu Founded in 1690 by Prince Constantine Brancova... Fondé en 1690 par le prince Constantin Brancov... NaN NaN 1993-01-01 NaN 24.016667 45.183333 22.48 Cultural Romania Roumanie Europe and North America Europe et Amérique du nord 45.18333333,24.01666667
3 Mount Etna Mont Etna Mount Etna is an iconic site encompassing 19,2... Ce site emblématique recouvre une zone inhabit... NaN NaN 2013-01-01 NaN 14.996667 37.756111 19237.00 Natural Italy Italie Europe and North America Europe et Amérique du nord 37.7561111111,14.9966666667
4 Belfries of Belgium and France Beffrois de Belgique et de France Twenty-three belfries in the north of France a... Vingt-trois beffrois, situés dans le nord de l... NaN NaN 1999-01-01 NaN 3.231390 50.174440 0.00 Cultural Belgium,France Belgique,France Europe and North America Europe et Amérique du nord 50.17444,3.23139

Data Cleaning

We need to start with some preprocessing and data cleaning. We begin by subsetting and renaming the columns, then count the UNESCO sites per country using groupby() to find the "top 10 countries" (we do not need this data frame for the plot, only the list of the top 10 countries). We sort the values using sort_values(by=['name'], ascending=False) to order the countries from the most to the fewest UNESCO sites.

df = df[["Name (EN)","Date inscribed","Category","Country (EN)","Continent (EN)"]] # select multiple columns in a list []
df = df.rename(columns={"Name (EN)": "name", "Date inscribed": "date", "Category": "type", "Country (EN)": "country", "Continent (EN)": "continent"}) # rename the columns for easy reading
top_10 = df.groupby(df["country"]).count().sort_values(by=['name'], ascending=False).head(10)
top_10
name date type continent
country
China 49 49 49 49
Italy 47 47 47 47
Spain 41 41 41 41
France 38 38 38 38
Germany 35 35 35 35
Mexico 34 34 34 34
India 33 33 33 33
United Kingdom of Great Britain and Northern Ireland 27 27 27 27
Russian Federation 21 21 21 21
Iran (Islamic Republic of) 21 21 21 21
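
As a side note, the same ranking can probably be obtained more directly with value_counts(), which counts and sorts in one step. The sketch below uses a small made-up data frame (the names and countries are placeholders, not the UNESCO data) to compare both approaches.

```python
import pandas as pd

# Toy data standing in for the UNESCO table (values are made up for illustration)
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E", "F"],
    "country": ["Italy", "China", "Italy", "Spain", "China", "Italy"],
})

# Ranking via groupby + count + sort, as in the tutorial
by_groupby = (
    toy.groupby("country")["name"]
    .count()
    .sort_values(ascending=False)
)

# Same ranking via value_counts(), which counts and sorts in one step
by_value_counts = toy["country"].value_counts()

# Keep only the names of the two most frequent countries
top_2 = by_value_counts.head(2).index.tolist()
print(top_2)
```

Both variants give the same counts here; value_counts() simply saves the explicit sort.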

Get the top 10 countries as a numpy array.

sub_cnty = top_10.index.values
sub_cnty
array(['China', 'Italy', 'Spain', 'France', 'Germany', 'Mexico', 'India',
       'United Kingdom of Great Britain and Northern Ireland',
       'Russian Federation', 'Iran (Islamic Republic of)'], dtype=object)

With the information of the top 10 countries, we can now delete all the rows from other countries using isin(sub_cnty). We will then group the rows by country and date and count the rows for every country and every year. We will then reset the index.

top_df = df[df['country'].isin(sub_cnty)].groupby(['country','date']).count()['name'].reset_index()
top_df.head(5)
country date name
0 China 1987-01-01 6
1 China 1990-01-01 1
2 China 1992-01-01 3
3 China 1994-01-01 4
4 China 1996-01-01 2
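
To see what the filter and the grouped count do, here is a minimal sketch on made-up data. It uses size() instead of count()['name'], which is an alternative we assume is acceptable here because we only need the number of rows per group; the keep list is a hypothetical stand-in for sub_cnty.

```python
import pandas as pd

# Toy data (made up): four sites in three countries
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "country": ["China", "China", "Italy", "Peru"],
    "date": ["1987-01-01", "1987-01-01", "1990-01-01", "1990-01-01"],
})
keep = ["China", "Italy"]  # hypothetical stand-in for sub_cnty

counts = (
    toy[toy["country"].isin(keep)]   # drop rows from all other countries
    .groupby(["country", "date"])    # one group per country and date
    .size()                          # number of rows in each group
    .reset_index(name="name")        # back to a flat data frame
)
print(counts)
```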

As we only need the year, not the full date, we will create a new column year. We can extract the year by first interpreting the date column as a datetime, then taking the year values (simply with .year).

top_df['year'] = pd.DatetimeIndex(top_df['date']).year # set up a new year column

top_df.head()
country date name year
0 China 1987-01-01 6 1987
1 China 1990-01-01 1 1990
2 China 1992-01-01 3 1992
3 China 1994-01-01 4 1994
4 China 1996-01-01 2 1996
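
The year extraction can be tried out on a few made-up dates. The sketch below shows the DatetimeIndex approach next to pd.to_datetime() with the .dt accessor, an equivalent alternative that is not used in this tutorial.

```python
import pandas as pd

# Toy dates in the same format as the UNESCO data
toy = pd.DataFrame({"date": ["1987-01-01", "1990-01-01", "2013-01-01"]})

# Approach used in the tutorial
years_index = pd.DatetimeIndex(toy["date"]).year

# Alternative: parse to datetime, then use the .dt accessor
toy["year"] = pd.to_datetime(toy["date"]).dt.year

print(toy)
```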

Now we group by country and year again and take the sum, which gives the count of inscriptions per country and year.

group_df = top_df.groupby(["country","year"])[["name"]].sum() # sum only the count column, not the date strings
import numpy as np
country_list = np.array(group_df.index.get_level_values(0))
year_list = np.array(group_df.index.get_level_values(1))

As we need country and year as regular columns and not only as the index, we will assign them as columns again.

group_df['country'] = country_list
group_df['year'] = year_list

Finally, we rename the name column to count.

group_df = group_df.rename(columns={"name": "count"})
group_df.head()
count country year
country year
China 1987 6 China 1987
1990 1 China 1990
1992 3 China 1992
1994 4 China 1994
1996 2 China 1996
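
If the index is not needed at all, a shorter route is possible. The sketch below, on made-up numbers, uses as_index=False so that country and year stay ordinary columns; this is an alternative under that assumption, not the approach taken in this tutorial.

```python
import pandas as pd

# Toy yearly counts (made up); China has two rows for 1987
toy = pd.DataFrame({
    "country": ["China", "China", "Italy"],
    "year": [1987, 1987, 1990],
    "name": [4, 2, 1],
})

flat = (
    toy.groupby(["country", "year"], as_index=False)["name"]
    .sum()                              # total per country and year
    .rename(columns={"name": "count"})  # same renaming as in the tutorial
)
print(flat)
```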

To improve the visuals, we will simplify the name of the UK.

group_df['country'] = group_df['country'].str.replace('United Kingdom of Great Britain and Northern Ireland','United Kingdom')

Data Visualization

To make an interactive scatter plot in plotly.express, we only need to use px.scatter(). It is highly compatible with pandas, so we can pass in a pandas data frame and specify x and y (as well as size and color, which are optional) with the column names.

Any change to the layout can be made using update_layout(). All the options can be found in the Plotly documentation.

import plotly.express as px
fig = px.scatter(group_df, x="year", y="country", size="count", color="country")
fig.update_layout(showlegend=False)
fig.show()

Almost Done!

Good job! Let's look at our plot. It is interactive, so you can pan around and zoom in/out. If you hover over the bubbles, you will also get information such as the country name and the count for a specific year. This is the default Plotly behaviour.

However, we can also control what information goes into the hover labels, as well as their layout (like the font, font size and so on). Wouldn't it be much cooler if we could show the names of all UNESCO sites instead of the count?!

Also, we can display hover labels for a whole position on the x-axis instead of an individual bubble, which means we can display all UNESCO sites inscribed in a year! Let's say we also want a vertical line that follows the cursor along the x-axis.

Let's do all the adjustments mentioned above.


Customization

df.head()
name date type country continent
0 Architectural, Residential and Cultural Comple... 2005-01-01 Cultural Belarus Europe and North America
1 Rock Paintings of the Sierra de San Francisco 1993-01-01 Cultural Mexico Latin America and the Caribbean
2 Monastery of Horezu 1993-01-01 Cultural Romania Europe and North America
3 Mount Etna 2013-01-01 Natural Italy Europe and North America
4 Belfries of Belgium and France 1999-01-01 Cultural Belgium,France Europe and North America

Adjust Data Frame

As we need the UNESCO site names this time, we go back to df, make a subset for the top 10 countries, and merge it with our group_df. Let's do some cleaning first. We add the year column to df too. Then we group by country and year and apply a transformation.

It is a bit tricky. The transformation takes all the rows with the same country and year and joins all the values from ['name'], separated by a comma (,). It is only applied to the top 10 countries, df[df['country'].isin(sub_cnty)]. Because the joined string is written to every row of a group, we end up with repeated values, so we will remove duplicated rows.

df['year'] = pd.DatetimeIndex(df['date']).year

# join the site names
df['site'] = df[df['country'].isin(sub_cnty)].groupby(['country','year'])['name'].transform(lambda x: ', '.join(x))

# remove exact duplicate rows (drop_duplicates() returns a new data frame, so assign it back)
df = df.drop_duplicates()

# look at the rows for China
df[df["country"] == "China"].head(5)
name date type country continent year site
5 Sichuan Giant Panda Sanctuaries - Wolong, Mt S... 2006-01-01 Natural China Asia and the Pacific 2006 Sichuan Giant Panda Sanctuaries - Wolong, Mt S...
32 Tusi Sites 2015-01-01 Cultural China Asia and the Pacific 2015 Tusi Sites
38 The Great Wall 1987-01-01 Cultural China Asia and the Pacific 1987 The Great Wall, Mausoleum of the First Qin Emp...
68 Mausoleum of the First Qin Emperor 1987-01-01 Cultural China Asia and the Pacific 1987 The Great Wall, Mausoleum of the First Qin Emp...
72 Chengjiang Fossil Site 2012-01-01 Natural China Asia and the Pacific 2012 Chengjiang Fossil Site, Site of Xanadu
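
The behaviour of transform() is easier to see on a small example. The sketch below uses made-up site names and compares transform(), which returns one value per original row, with agg(), which returns one value per group; the agg() variant is an alternative that is not used in this tutorial.

```python
import pandas as pd

# Toy data (made up): two sites share the same country and year
toy = pd.DataFrame({
    "name": ["Site A", "Site B", "Site C"],
    "country": ["China", "China", "Italy"],
    "year": [1987, 1987, 1997],
})

# transform(): the joined string is broadcast back to every row of the group
toy["site"] = (
    toy.groupby(["country", "year"])["name"]
    .transform(lambda x: ", ".join(x))
)

# agg(): one row per group, so no repeated strings
per_group = (
    toy.groupby(["country", "year"])["name"]
    .agg(", ".join)
    .reset_index(name="site")
)

print(toy)
print(per_group)
```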

Make sure only top 10 countries are included.

df_sub = df[df['country'].isin(sub_cnty)]
df_sub.head(1)
name date type country continent year site
1 Rock Paintings of the Sierra de San Francisco 1993-01-01 Cultural Mexico Latin America and the Caribbean 1993 Rock Paintings of the Sierra de San Francisco,...
group_df.head(1)
count country year
country year
China 1987 6 China 1987
group_df.reset_index(drop=True, inplace=True)

Now, we have the name information from df_sub. We can merge it to our group_df data frame using the keys "country" and "year". We select only the relevant columns [["country","year","site","count"]], and call the new data frame final.

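# use the same short name for the UK as in group_df so that the merge keys match
df_sub = df_sub.assign(country=df_sub["country"].str.replace('United Kingdom of Great Britain and Northern Ireland', 'United Kingdom', regex=False))
final = df_sub.merge(group_df, on=["country","year"])

final = final[["country","year","site","count"]]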
final.head()
country year site count
0 Mexico 1993 Rock Paintings of the Sierra de San Francisco,... 3
1 Mexico 1993 Rock Paintings of the Sierra de San Francisco,... 3
2 Mexico 1993 Rock Paintings of the Sierra de San Francisco,... 3
3 Italy 2013 Mount Etna, Medici Villas and Gardens in Tuscany 2
4 Italy 2013 Mount Etna, Medici Villas and Gardens in Tuscany 2
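
The preview above still contains one row per site, so each country and year appears several times with identical values. If one bubble per country and year is wanted, the repeated rows could be dropped on the key columns. This is a minimal sketch on a made-up frame with the same columns as final; the site strings are placeholders.

```python
import pandas as pd

# Toy frame with the same columns as `final` (placeholder values)
toy_final = pd.DataFrame({
    "country": ["Mexico", "Mexico", "Mexico", "Italy", "Italy"],
    "year": [1993, 1993, 1993, 2013, 2013],
    "site": ["X, Y, Z"] * 3 + ["P, Q"] * 2,
    "count": [3, 3, 3, 2, 2],
})

# Keep the first row of every (country, year) pair
deduped = (
    toy_final.drop_duplicates(subset=["country", "year"])
    .reset_index(drop=True)
)
print(deduped)
```

Note that drop_duplicates() returns a new data frame, so the result has to be assigned to a variable.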

Great! Almost everything is ready. We only need to replace the comma with a <br> to make sure every item will be put in a new line in the hover labels.

final.site = final.site.apply(lambda x: x.replace(', ', '<br>'))
final.site.head()
0    Rock Paintings of the Sierra de San Francisco<...
1    Rock Paintings of the Sierra de San Francisco<...
2    Rock Paintings of the Sierra de San Francisco<...
3    Mount Etna<br>Medici Villas and Gardens in Tus...
4    Mount Etna<br>Medici Villas and Gardens in Tus...
Name: site, dtype: object
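
One caveat: replacing ', ' also splits site names that contain a comma themselves. The sketch below, with made-up names, compares the replace approach with joining the names directly with <br>; the direct join is a suggestion, not what this tutorial does.

```python
import pandas as pd

# Toy names (made up); the first one contains a comma itself
names = pd.Series(["Reserve - North, South", "Old Town"])

# Approach from the tutorial: join with ', ' and replace afterwards
replaced = ", ".join(names).replace(", ", "<br>")

# Alternative: join with '<br>' right away, leaving inner commas intact
direct = "<br>".join(names)

print(replaced)
print(direct)
```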

Plotting

Now, let's do our plot again using px.scatter().

fig = px.scatter(final, x="year", y="country", size="count", color="country",
                 custom_data=['year', 'site'])
# remove legend
fig.update_layout(showlegend=False)

# show labels for whole x axis
fig.update_layout(hovermode='x')

# change layout for hover labels
fig.update_layout(
    hoverlabel=dict(
        bgcolor="white",
        font_size=12,
        font_family="Rockwell"
    )
)

# control info for hover labels using custom_data we specified above in px.scatter()
# join items with new line <br>
fig.update_traces(
    hovertemplate="<br>".join([
        "%{y}",
        "Site: %{customdata[1]}"
    ])
)

# add title, x- and y- labels, and a moving line along x axis
# change font styles for the texts inside plot (y ticks and so on)
fig.update_layout(
    title={
        'text': "Timeline of UNESCO Inscriptions",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    xaxis_title="Year of Inscription",
    yaxis_title="Top 10 Countries",
    xaxis={'showspikes': True,
        'spikemode': 'across',
        'spikesnap': 'cursor',
        'showline': True,
        'showgrid': True},
    font=dict(
        family="Rockwell",
        size=15,
        color="black"
    )
)

# display our plot
fig.show()

Cool! That's it!

Now we have an interactive plot with enhanced visuals and all the information we need in the labels. We can clearly see the trends of inscriptions in different countries, as well as the "inscription peak" of some countries (such as 1997 in Italy). We can tell, for example, that countries like Russia and China are late players in the field.

Previous Lesson: Simple Bubble Chart

Next Lesson: Coming soon...




Additional information

This notebook is provided for educational purposes. Feel free to report any issues on GitHub.


Author: Ka Hei, Chow

License: The code in this notebook is licensed under the Creative Commons Attribution 4.0 license.

Last modified: December 2021




References:

Plotly