As mentioned in the instructions, all materials can be opened in Colab as Jupyter notebooks, so users can run the code in the cloud. It is highly recommended to follow the tutorials in the right order.


This notebook aims to provide some useful tips for readers before they begin their first data visualization.


Presumptions:

Not applicable.


Data visualization is essentially the process of translating data, either qualitative or quantitative, into charts, graphs and other visuals in order to communicate messages or insights. It can reduce large amounts of data to more interpretable information, or even map spatial and temporal domains into 2D (or higher-dimensional) space for more intuitive communication (think about how counterintuitive it is to read a list of events in random order, even when each one has a time label). Humans are highly visual creatures. We have learned to interact with the world using visual cues. Hence, by representing textual or numerical information graphically, we help readers absorb and react to information more effectively.

Nonetheless, many factors determine whether a graphic is effective. If graphics or charts are not created properly, we cannot leverage the potential of data visualization to deliver our messages, and might even cause confusion. Since there are already many online resources explaining the ideas behind data visualization, this lesson mainly directs readers to some existing resources and then summarizes the key points using a Chinese historical example.

What are the goals and common methods for data visualization? Some Harvard scholars have categorized the approaches into four directions:

  • Distribution

  • Comparison

  • Relationship

  • Composition

Read more about them in the resource below.

>>> Ultimate resource for understanding & creating data visualization

Now you understand some basic approaches to data visualization. The next step is to think about the structure of a plot. Do you know how a plot is built and what elements it contains? The following paper, which builds on the grammar of graphics for R (ggplot2), offers some nice insights into the components of a good graphic. Although we will not learn ggplot2 in this blog, the article still nicely describes the basic structure of many Python plotting libraries, which require users to construct different graphical elements and customize their styles. You do not need to go into all the details: the most important thing is to get the big picture of the basic elements in a plotting library, such as parameterization, scales, markers, coordinate systems and layers.

>>> A Layered Grammar of Graphics
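To make these elements concrete, here is a minimal sketch of how they might map onto matplotlib calls. The choice of matplotlib, the function name `draw_layered_plot` and the data values are assumptions made for illustration only; other plotting libraries expose the same ideas under different names.

```python
# Made-up values, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 3, 6, 5]


def draw_layered_plot():
    """Build one figure step by step; each step maps to a grammar element."""
    import matplotlib.pyplot as plt  # imported lazily; requires matplotlib

    # Coordinate system: a Cartesian Axes.
    fig, ax = plt.subplots()

    # Layer 1: a line geometry mapped to the x and y variables.
    ax.plot(x, y, color="gray", label="trend")

    # Layer 2: point markers drawn on top of the line, same mapping.
    ax.scatter(x, y, marker="o", color="black", label="observations")

    # Scales: how data values translate to positions on each axis.
    ax.set_xscale("linear")
    ax.set_ylim(0, max(y) + 1)

    # Guides: labels, title and legend let readers decode the mapping.
    ax.set_xlabel("x variable")
    ax.set_ylabel("y variable")
    ax.set_title("A plot built from layers")
    ax.legend()
    return fig, ax
```

Calling `draw_layered_plot()` in a notebook cell renders the figure. Each commented step can be changed independently, which is the main idea behind a layered grammar.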

It should be clear by now why we need data visualization and how the theory is implemented in many plotting libraries. Next, we need to learn about good practice. For example, what are the criteria for selecting a type of data visualization? How should we choose the colors and the markers? How many visual elements do we need in a graphic? Let's learn about them by reading the article below.

These are the key messages quoted from the article below:

  • Data visualizations should be audience-specific with a clear requirement

  • Choose the right visualization for your data

  • Keep your visualizations simple

  • Label your data visualizations

  • Understand the importance of text in charts

  • Use colors effectively in data visualizations

  • Avoid deceiving with your visualizations

  • Make interpretable data visualizations

>>> Data Visualization Tips to Improve Data Stories

Now let's check what we have learnt by looking at a simple example from the paper THE SHORT-LIVED CHINESE EMPERORS. This paper reports statistics on the age at death of Chinese emperors, Buddhist monks, and traditional doctors. What kind of data visualization would fit this example?


| Age at Death | Emperor (n=241) | Buddhist Monk (n=140) | Traditional Doctor (n=181) |
|---|---|---|---|
| <20, n (%) | 28 (11.6) | 2 (1.4) | 0 (0) |
| 20–29, n (%) | 46 (19.1) | 7 (5.0) | 0 (0) |
| 30–39, n (%) | 47 (19.5) | 12 (8.6) | 3 (1.7) |
| 40–49, n (%) | 38 (15.8) | 5 (3.6) | 3 (1.7) |
| 50–59, n (%) | 42 (17.4) | 19 (13.6) | 20 (11.0) |
| 60–69, n (%) | 29 (12.0) | 27 (19.3) | 34 (18.8) |
| 70–79, n (%) | 7 (2.9) | 39 (27.9) | 56 (30.9) |
| 80–89, n (%) | 4 (1.7) | 14 (10) | 42 (23.2) |
| 90–99, n (%) | 0 (0) | 8 (5.7) | 16 (8.8) |
| ≥100, n (%) | 0 (0) | 7 (5.0) | 7 (3.9) |
| Range | 2–89 | 17–120 | 32–109 |
| Mean ± standard deviation | 41.3 ± 17.9 | 66.9 ± 20.7 | 75.1 ± 13.4 |

Table retrieved from Zhao, H. L., Zhu, X., & Sui, Y. (2006).
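Before plotting, it helps to have the table in a form Python can work with. The sketch below transcribes the counts into plain lists and recomputes the group sizes and percentages as a sanity check. The variable and function names (`AGE_BANDS`, `COUNTS`, `to_percentages`) are our own choices, not part of the paper.

```python
# Counts per age band, transcribed from the table above (Zhao et al., 2006).
AGE_BANDS = ["<20", "20–29", "30–39", "40–49", "50–59",
             "60–69", "70–79", "80–89", "90–99", "≥100"]

COUNTS = {
    "Emperor":            [28, 46, 47, 38, 42, 29, 7, 4, 0, 0],
    "Buddhist Monk":      [2, 7, 12, 5, 19, 27, 39, 14, 8, 7],
    "Traditional Doctor": [0, 0, 3, 3, 20, 34, 56, 42, 16, 7],
}


def to_percentages(counts):
    """Convert raw counts to percentages of the group total (1 decimal)."""
    total = sum(counts)
    return [round(100 * c / total, 1) for c in counts]


PERCENTAGES = {group: to_percentages(c) for group, c in COUNTS.items()}

# Print the group size and the share of each age band for a quick check
# against the n and % values reported in the table.
for group, counts in COUNTS.items():
    print(f"{group}: n={sum(counts)}, % per band={PERCENTAGES[group]}")
```

If the recomputed group sizes match the n values in the table header, the transcription is likely correct and the data is ready for plotting.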

1. One Key Message Per Graphic

First, it is recommended to deliver one message per graphic. In this case, it means we might not want to emphasize both the emperors' causes of death and the longevity comparison between groups in a single graphic. For example, we could focus on the longevity differences between emperors and the other groups by placing multiple histograms or boxplots in one graph.

2. Avoid Redundant Styling

We also need to pay attention to styling. Although multiple colors or marker symbols can be eye-catching, they create unnecessary confusion if they do not carry meaning. For example, we should avoid using different colors for the same group. Likewise, we should distinguish different groups by either color or marker symbol, but not both.

In this example, we can use different colors for emperor, monk, and traditional doctor in our chart.

3. Be Careful with Your Color Scheme

Besides, we need to pay attention to the color scheme. Remember that many visual elements carry intuitive meanings, so it is best to follow the expected patterns. For example, if we want to show how frequently wars occurred in different periods, it is better to use red for more frequent and blue for less frequent wars than the reverse. The same holds for other styling elements. For example, it would not be sensible to represent a different category with the same marker symbol at a larger size, because readers tend to read size as magnitude.

We also need to think about the color palette. Sequential and diverging palettes, which vary in saturation or lightness, are used to describe continuous data, while qualitative palettes are designed for categorical data. We should always make sure the selected colors are visually distinguishable (even for color-blind audiences) and appropriate for the nature of the data.

Here in this example, we have three categories that cannot be ordered. So we can use a categorical palette.
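As a sketch of this choice, the code below assigns each group one fixed color from the Okabe-Ito palette, a qualitative palette designed to stay distinguishable under common forms of color blindness. Picking this particular palette is our assumption; any qualitative palette with clearly distinct colors would serve. The grouped bar chart is used here only to show the colors in action, and the names `GROUP_COLORS` and `plot_grouped_bars` are illustrative.

```python
# Okabe-Ito qualitative palette (color-blind friendly).
OKABE_ITO = ["#E69F00", "#56B4E9", "#009E73", "#F0E442",
             "#0072B2", "#D55E00", "#CC79A7"]

GROUPS = ["Emperor", "Buddhist Monk", "Traditional Doctor"]

# One fixed color per group, reused wherever that group appears.
GROUP_COLORS = dict(zip(GROUPS, OKABE_ITO))

# Percentages per age band, copied from the table (Zhao et al., 2006).
AGE_BANDS = ["<20", "20–29", "30–39", "40–49", "50–59",
             "60–69", "70–79", "80–89", "90–99", "≥100"]
PERCENT = {
    "Emperor":            [11.6, 19.1, 19.5, 15.8, 17.4, 12.0, 2.9, 1.7, 0, 0],
    "Buddhist Monk":      [1.4, 5.0, 8.6, 3.6, 13.6, 19.3, 27.9, 10.0, 5.7, 5.0],
    "Traditional Doctor": [0, 0, 1.7, 1.7, 11.0, 18.8, 30.9, 23.2, 8.8, 3.9],
}


def plot_grouped_bars():
    """Draw the share of each age band per group (needs matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily; requires matplotlib

    fig, ax = plt.subplots(figsize=(9, 4))
    width = 0.8 / len(GROUPS)
    bands = range(len(AGE_BANDS))
    for i, group in enumerate(GROUPS):
        # Shift each group's bars sideways so they sit next to each other.
        positions = [band + i * width for band in bands]
        ax.bar(positions, PERCENT[group], width=width,
               color=GROUP_COLORS[group], label=group)

    # Put one tick label under the center of each cluster of bars.
    centers = [band + width * (len(GROUPS) - 1) / 2 for band in bands]
    ax.set_xticks(centers)
    ax.set_xticklabels(AGE_BANDS)
    ax.set_xlabel("Age at death")
    ax.set_ylabel("Share of group (%)")
    ax.legend()
    return fig, ax
```

Because color alone identifies the group, the bars need no extra hatching or marker differences, which keeps the styling free of redundancy.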

4. Select the Type of Graphic Based on the Nature of Your Data

We also need to learn the advantages and disadvantages of different chart types. For example, a circle packing chart provides attractive visuals but fails to show precise comparisons. A grouped bar chart displays decent comparisons, but it can look messy if we have many groups. A polar bar chart emphasizes the dominant groups, but does not work well with data showing development over time (e.g. time series).

In this example, a boxplot fits better than a pie chart. It also fits better than a bar chart if we want to emphasize the general distribution across all age groups rather than distinct characteristics of certain age groups.
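The table only gives binned counts, so a boxplot cannot be computed from raw ages. One possible workaround, sketched below, is to estimate the quartiles by linear interpolation within the age bands and pass the summaries to matplotlib's `Axes.bxp`, which draws boxes from precomputed statistics. This rests on assumptions that are ours, not the paper's: ages are treated as evenly spread within each band, the open-ended bands are closed with the reported range, and the whiskers span that full range. The helper names are illustrative.

```python
# Counts per age band, transcribed from the table (Zhao et al., 2006).
COUNTS = {
    "Emperor":            [28, 46, 47, 38, 42, 29, 7, 4, 0, 0],
    "Buddhist Monk":      [2, 7, 12, 5, 19, 27, 39, 14, 8, 7],
    "Traditional Doctor": [0, 0, 3, 3, 20, 34, 56, 42, 16, 7],
}
AGE_RANGE = {  # youngest and oldest reported age at death per group
    "Emperor": (2, 89),
    "Buddhist Monk": (17, 120),
    "Traditional Doctor": (32, 109),
}
INNER_EDGES = [20, 30, 40, 50, 60, 70, 80, 90, 100]


def band_edges(lowest, highest):
    """Close the open-ended '<20' and '≥100' bands with the reported range."""
    return [min(lowest, 20)] + INNER_EDGES + [max(highest, 100)]


def grouped_quantile(edges, counts, q):
    """Estimate a quantile from binned counts by linear interpolation.

    Assumption: ages are spread evenly within each band.
    """
    target = q * sum(counts)
    cumulative = 0
    for low, high, count in zip(edges[:-1], edges[1:], counts):
        if count and cumulative + count >= target:
            return low + (target - cumulative) / count * (high - low)
        cumulative += count
    return edges[-1]


def box_stats(group):
    """Build the summary dict that matplotlib's Axes.bxp expects."""
    low, high = AGE_RANGE[group]
    edges = band_edges(low, high)
    counts = COUNTS[group]
    return {
        "label": group,
        "whislo": low,   # whiskers span the full reported range here
        "q1": grouped_quantile(edges, counts, 0.25),
        "med": grouped_quantile(edges, counts, 0.50),
        "q3": grouped_quantile(edges, counts, 0.75),
        "whishi": high,
        "fliers": [],    # individual ages are not available
    }


STATS = [box_stats(group) for group in COUNTS]
for s in STATS:
    print(f"{s['label']}: Q1≈{s['q1']:.1f}, median≈{s['med']:.1f}, Q3≈{s['q3']:.1f}")


def plot_boxes():
    """Draw one box per group from the estimated summaries."""
    import matplotlib.pyplot as plt  # imported lazily; requires matplotlib

    fig, ax = plt.subplots()
    ax.bxp(STATS, showfliers=False)
    ax.set_ylabel("Age at death (years)")
    ax.set_title("Estimated age-at-death distribution by group")
    return fig, ax
```

The estimated medians come out at roughly 40, 69 and 75 years for emperors, monks and doctors, in the same order as the means reported in the table. These are approximations from grouped data, so the resulting boxes should be read as a rough picture rather than exact statistics.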

5. Pay Attention to Your Axes

Finally, your axes are important too. If the graphic is a map, make sure that north is at the top. If it is a scatter plot, put the dependent variable on the y-axis. If the chart represents any temporal development, put the time dimension on the x-axis. In this example, the boxplot can be either vertical or horizontal.

Previous Lesson: Pandas Numerical Operation

Next Lesson: Coming soon...




Additional information

This notebook is provided for educational purposes. Feel free to report any issues on GitHub.


Author: Ka Hei, Chow

License: The code in this notebook is licensed under the Creative Commons Attribution 4.0 (CC BY 4.0) license.

Last modified: December 2021




References:

Zhao, H. L., Zhu, X., & Sui, Y. (2006). THE SHORT‐LIVED CHINESE EMPERORS. Journal of the American Geriatrics Society, 54(8), 1295-1296.