As mentioned in the instructions, all materials can be opened in Colab as Jupyter notebooks, so users can run the code in the cloud. It is highly recommended to follow the tutorials in the right order.

Presumptions:

The examples used here are Chinese texts from this link.

List

This section reviews Python basics and presumes familiarity with the w3schools Python tutorial, from Python Intro to Python Arrays. The examples use a subset of titles from 孟浩然诗全集 卷一百五十九.

Strings (text inside "") are used to save the titles in variables. Note that capitalization and spacing matter in Python: for example, "CHINA" is not equal to "China". True and False are called booleans in Python; they express a binary (yes/no) result.

"China" == "China" # == means "is it equal to?" (the opposite is !=)
True
"CHINA" == "China" # This is how comments are written. They do not affect the code itself
False
first = "从张丞相游南纪城猎,戏赠裴迪张参军"

second = "登江中孤屿,赠白云先生王迥"

third = "晚春卧病寄张八"

fourth = "秋登兰山寄张五"

fifth = "入峡寄弟"

We can also put them in a list, which stores multiple items in a single variable. There are many more ways to store information for later retrieval, but a list is the simplest. Also, any text without "" is interpreted by Python as a name (such as a variable), and it raises an error if no such name has been defined.

Be careful: some keywords are reserved and cannot be used as variable names. To learn more about keywords: https://www.w3schools.com/python/python_ref_keywords.asp
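
Whether a name is reserved can be checked with Python's standard keyword module. The snippet below is a minimal sketch; the names tested are only illustrations.

```python
import keyword

# keyword.iskeyword() reports whether a word is reserved by Python
is_for_reserved = keyword.iskeyword("for")      # "for" is a reserved keyword
is_list_reserved = keyword.iskeyword("list")    # "list" is a built-in name, not a keyword
is_list__reserved = keyword.iskeyword("list_")  # an ordinary variable name

print(is_for_reserved, is_list_reserved, is_list__reserved)
```

Although list is not a keyword, assigning to it would hide the built-in list type, which is presumably why this tutorial names its variable list_ with a trailing underscore.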

list_ = [first, second, third, fourth, fifth]

Then we can print them out.

list_
['从张丞相游南纪城猎,戏赠裴迪张参军', '登江中孤屿,赠白云先生王迥', '晚春卧病寄张八', '秋登兰山寄张五', '入峡寄弟']

Slicing and Basic Manipulation

Sometimes we only wish to retrieve selected items, and slicing gives easy access to them. Python counts from 0, so [0] always means the first item. We can also use negative values, which count from the end of the list. The end index of a slice is excluded (e.g. [0:2] returns the first two items and excludes the 3rd).

print(list_[0]) # first element only
从张丞相游南纪城猎,戏赠裴迪张参军
print(list_[0:2]) # first two elements only
['从张丞相游南纪城猎,戏赠裴迪张参军', '登江中孤屿,赠白云先生王迥']
print(list_[-1]) # last item
入峡寄弟
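
A few more slice forms, sketched on a fresh copy of the five titles (the variable name titles is only for this illustration):

```python
titles = [
    "从张丞相游南纪城猎,戏赠裴迪张参军",
    "登江中孤屿,赠白云先生王迥",
    "晚春卧病寄张八",
    "秋登兰山寄张五",
    "入峡寄弟",
]

last_two = titles[-2:]      # from the second-to-last item to the end
all_but_first = titles[1:]  # leaving out the end index means "until the end"
every_other = titles[::2]   # a third value sets the step: items 0, 2, 4

print(last_two)
print(all_but_first)
print(every_other)
```

Slicing always returns a new list, so the original titles list is left unchanged.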

Some operations that work on numbers (e.g. division) cannot be applied to text (inside ""). However, there are multiple ways to manipulate text.

For example, we can add strings together.

add = list_[-1] + "," + list_[-1]
add
'入峡寄弟,入峡寄弟'

We can subtract text (in an indirect way).

add.replace(",入峡寄弟","") # replace ",入峡寄弟" with nothing ""
'入峡寄弟'
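
Note that replace() removes every match by default. The sketch below uses a hypothetical string with three repetitions to show the optional third argument, which limits the number of replacements.

```python
repeated = "入峡寄弟,入峡寄弟,入峡寄弟"

remove_all = repeated.replace(",入峡寄弟", "")     # every match is replaced
remove_one = repeated.replace(",入峡寄弟", "", 1)  # only the first match is replaced

print(remove_all)
print(remove_one)
print(repeated)  # strings are immutable, so the original is unchanged
```

Because replace() returns a new string, the result has to be stored in a variable if it is needed later.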

We can also repeat them.

list_[-1] * 10 # * ten times
'入峡寄弟入峡寄弟入峡寄弟入峡寄弟入峡寄弟入峡寄弟入峡寄弟入峡寄弟入峡寄弟入峡寄弟'

We can also insert more items into the list. Let's put the 6th title at index 5 (Python counts from 0).

list_.insert(5, '湖中旅泊,寄阎九司户防')
list_
['从张丞相游南纪城猎,戏赠裴迪张参军',
 '登江中孤屿,赠白云先生王迥',
 '晚春卧病寄张八',
 '秋登兰山寄张五',
 '入峡寄弟',
 '湖中旅泊,寄阎九司户防']
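
Besides insert(), lists offer append() for adding at the end and pop() for removing. A minimal sketch on a shorter, separate list (titles is a name used only here):

```python
titles = ["晚春卧病寄张八", "秋登兰山寄张五", "入峡寄弟"]

titles.append("湖中旅泊,寄阎九司户防")        # add to the end of the list
titles.insert(0, "登江中孤屿,赠白云先生王迥")  # add at index 0, shifting the rest
removed = titles.pop()                        # remove and return the last item

print(titles)
print(removed)
```

Unlike the string methods shown earlier, these list methods change the list in place instead of returning a new one.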

NumPy Array

We can also put them in a NumPy array, which makes many manipulations easier and faster, but first we need to import the library. The same applies to every functionality that is not built into Python itself.

The numpy library is imported under the alias np, so when we later need to call a function from the library, e.g. min(), we type np.min(). The items we put inside () are called arguments; they are the inputs used to compute the outputs. When we use a function, we have to check which arguments it needs: some are compulsory, while others are optional and fall back to a default value when omitted.
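
The difference between compulsory and optional arguments can be seen without any library, using the built-in sorted(). This is only an illustrative sketch; the three titles and variable names are chosen for the example.

```python
titles = ["晚春卧病寄张八", "入峡寄弟", "登江中孤屿,赠白云先生王迥"]

# Only the compulsory argument: the collection to sort
default_order = sorted(titles)

# Optional keyword arguments replace the defaults (key=None, reverse=False)
by_length = sorted(titles, key=len)
longest_first = sorted(titles, key=len, reverse=True)

# Leaving out the compulsory argument raises a TypeError
message = None
try:
    sorted()
except TypeError as error:
    message = str(error)

print(by_length)
print(longest_first)
print(message)
```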

import numpy as np # import the library; in real code, imports are typically grouped at the top of the file
arr = np.array(list_)
arr
array(['从张丞相游南纪城猎,戏赠裴迪张参军', '登江中孤屿,赠白云先生王迥', '晚春卧病寄张八', '秋登兰山寄张五', '入峡寄弟'],
      dtype='<U17')

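Do you see dtype='<U17'? It is the data type of the elements in the NumPy array. The U indicates that the elements are Unicode strings (Unicode is the standard Python uses to represent strings), the 17 is the maximum number of characters per element, and the < marks little-endian byte order.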
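
One practical consequence of the fixed width is worth knowing: a longer string assigned into such an array is silently cut to fit. A small sketch, using a two-title array created only for this illustration:

```python
import numpy as np

short = np.array(["入峡寄弟", "晚春卧病寄张八"])
print(short.dtype)  # a Unicode dtype sized for the longest string (7 characters)

# Assigning a 17-character title into a 7-character-wide array truncates it
short[0] = "从张丞相游南纪城猎,戏赠裴迪张参军"
print(short[0])
print(len(short[0]))
```

If longer strings may be added later, it is safer to build the array once all items are known, as the tutorial does with np.array(list_).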


Data types matter when we write code, because whether an operation can be performed depends on the types of the data involved. To better understand data types: https://realpython.com/python-data-types/
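
As a minimal illustration of this dependence, adding a string to an integer fails until one of them is converted. The variable names below are hypothetical.

```python
count_text = "5"   # a string that happens to contain a digit
count_number = 5   # an integer

# Mixing the two types in an addition raises a TypeError
failed = False
try:
    count_text + count_number
except TypeError:
    failed = True

total = int(count_text) + count_number  # convert to int: numeric addition
label = count_text + str(count_number)  # convert to str: concatenation

print(failed, total, label)
```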


Data Type

If we want to check the data type of any object, we can use type().

type(100) # integer
int
type("list_") # string
str
type(list_) # list
list
type(np.array(list_)) # array
numpy.ndarray

Other Operations using NumPy Array

We can manipulate the text, for example, by getting the title with the most characters.

sort_arr = sorted(list_,key=len,reverse=False) # first we sort them by length (key = len), default in ascending order

sort_arr
['入峡寄弟',
 '晚春卧病寄张八',
 '秋登兰山寄张五',
 '湖中旅泊,寄阎九司户防',
 '登江中孤屿,赠白云先生王迥',
 '从张丞相游南纪城猎,戏赠裴迪张参军']
sort_arr[-1]
'从张丞相游南纪城猎,戏赠裴迪张参军'
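
When only the longest (or shortest) title is needed, sorting can be skipped: the built-in max() and min() accept the same key argument. A sketch on a fresh copy of five titles:

```python
titles = [
    "从张丞相游南纪城猎,戏赠裴迪张参军",
    "登江中孤屿,赠白云先生王迥",
    "晚春卧病寄张八",
    "秋登兰山寄张五",
    "入峡寄弟",
]

longest = max(titles, key=len)   # the item with the largest len()
shortest = min(titles, key=len)  # the item with the smallest len()

print(longest)
print(shortest)
```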

We can, for example, count the number of titles using len().

len(arr)
5

We can also apply a condition to an array. For example, let's get the items with more than 5 characters.

# To review list comprehension: https://www.w3schools.com/python/python_lists_comprehension.asp

arr_len = [len(i) for i in arr]
arr_len
[17, 13, 7, 7, 4]
arr_len = np.array(arr_len)

Get the items of arr whose corresponding value in arr_len is larger than 5.

To better understand the basic operators: https://www.tutorialspoint.com/python/python_basic_operators.htm

arr[arr_len > 5] # boolean indexing: keep the items where the condition is True
array(['从张丞相游南纪城猎,戏赠裴迪张参军', '登江中孤屿,赠白云先生王迥', '晚春卧病寄张八', '秋登兰山寄张五',
       '湖中旅泊,寄阎九司户防'], dtype='<U17')
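
It may help to look at the condition on its own: comparing an array with a number yields an array of True/False values (a mask), and indexing with that mask keeps the True positions. A self-contained sketch, assuming the five original titles:

```python
import numpy as np

titles = np.array([
    "从张丞相游南纪城猎,戏赠裴迪张参军",
    "登江中孤屿,赠白云先生王迥",
    "晚春卧病寄张八",
    "秋登兰山寄张五",
    "入峡寄弟",
])
lengths = np.array([len(t) for t in titles])

mask = lengths > 5      # one True/False value per title
selected = titles[mask] # keeps only the positions where mask is True

# Conditions can be combined with & (and) or | (or); the parentheses are required
medium = titles[(lengths > 5) & (lengths < 15)]

print(mask)
print(selected)
print(medium)
```

The mask and the array being indexed must have the same length, otherwise NumPy raises an IndexError.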

Let's look at the minimum length of titles.

np.min(arr_len)
4

Or maximum.

np.max(arr_len)
17
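
To find which title has the maximum length, rather than the length itself, np.argmax() returns the position of the largest value. The sketch below also computes the mean length and assumes the five original titles.

```python
import numpy as np

titles = np.array([
    "从张丞相游南纪城猎,戏赠裴迪张参军",
    "登江中孤屿,赠白云先生王迥",
    "晚春卧病寄张八",
    "秋登兰山寄张五",
    "入峡寄弟",
])
lengths = np.array([len(t) for t in titles])

longest_index = int(np.argmax(lengths))  # position of the largest length
longest_title = titles[longest_index]    # look the title up by that position
mean_length = float(np.mean(lengths))    # average title length

print(longest_index, longest_title, mean_length)
```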

Basic Plotting

We can also do some basic plotting using matplotlib. But we first need to import the library.

Understanding more about matplotlib: https://www.youtube.com/watch?v=qErBw-R2Ybk

Histogram

import matplotlib.pyplot as plt

# Histogram
plt.hist(arr_len, bins=[2, 5, 10, 20], width=1)
(array([1., 2., 3.]), array([ 2,  5, 10, 20]), <a list of 3 Patch objects>)
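
The counts in the first array can be reproduced without drawing anything by calling np.histogram() with the same bins. This sketch assumes the lengths of all six titles, including the inserted one (11 characters).

```python
import numpy as np

# Lengths of the six titles; an assumption for this illustration
lengths = np.array([17, 13, 7, 7, 4, 11])

# Bins are [2, 5), [5, 10) and [10, 20]; only the last bin includes its right edge
counts, edges = np.histogram(lengths, bins=[2, 5, 10, 20])

print(counts)  # number of titles falling in each bin
print(edges)   # the bin boundaries
```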

It might not be very informative with so few titles, but the principle is the same if we have thousands of items.

This can be demonstrated by creating some random numbers.

lengths = np.arange(0,30)

lengths
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])
import random

# draw 1000 random samples (with replacement) twice and multiply them element-wise
list2_ = np.array(random.choices(lengths, k=1000))*np.array(random.choices(lengths, k=1000))
list2_
import matplotlib.pyplot as plt

plt.hist(list2_)
(array([347., 182., 129.,  89.,  78.,  60.,  42.,  43.,  17.,  13.]),
 array([  0. ,  84.1, 168.2, 252.3, 336.4, 420.5, 504.6, 588.7, 672.8,
        756.9, 841. ]),
 <a list of 10 Patch objects>)
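
Because the samples are random, the counts above change on every run. One way to make such an experiment repeatable is a seeded NumPy generator. This is a sketch of an alternative to random.choices, not the method used in this notebook, and the seed value 42 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=42)    # the same seed gives the same numbers
first = rng.integers(0, 30, size=1000)  # 1000 integers from 0 to 29
second = rng.integers(0, 30, size=1000)
products = first * second               # element-wise multiplication

# A second generator with the same seed reproduces the first draw exactly
rerun = np.random.default_rng(seed=42)
same_first = rerun.integers(0, 30, size=1000)

print(products[:10])
print(np.array_equal(first, same_first))
```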

Bar Chart

We can improve the layout of the plot, for example by adding an xlabel, a ylabel and a title, or by changing colors.

plt.figure(figsize=(12,5)) # define the size of the plot
plt.hist(list2_, color="green", orientation='horizontal') # color sets the bar color; orientation='horizontal' draws the bars sideways
plt.grid(color='r', linestyle='--', linewidth=0.25, alpha=0.8) # add grid lines
plt.ylabel("Numbers")
plt.xlabel("Frequency")
plt.title("Title", fontsize=18)
Text(0.5, 1.0, 'Title')

There are also many other plot types. For example:

plt.bar([1,2,3,4,5], arr_len, width=0.5)
plt.xlabel("ID")
plt.ylabel("Lengths")
Text(0, 0.5, 'Lengths')
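
plt.bar() needs as many positions as values, so a hard-coded [1,2,3,4,5] stops working once the number of titles changes. A possible variation derives the positions from the data; it uses the figure/axes interface of matplotlib, and the lengths are assumed to be those of the five original titles.

```python
import matplotlib.pyplot as plt

lengths = [17, 13, 7, 7, 4]  # assumed title lengths for this illustration

# One position per value, starting at 1, whatever the number of titles
positions = list(range(1, len(lengths) + 1))

fig, ax = plt.subplots(figsize=(6, 3))
bars = ax.bar(positions, lengths, width=0.5)
ax.set_xlabel("ID")
ax.set_ylabel("Lengths")
```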

Line Chart

Year = [1920,1930,1940,1950,1960,1970,1980,1990,2000,2010]
Rate = [9.8,12,8,7.2,6.9,7,6.5,6.2,5.5,6.3]

plt.plot(Year, Rate, color="red")
plt.xlabel("Year")
plt.ylabel("Rate")
plt.grid(alpha=0.5) # alpha sets the opacity (0 to 1); higher values are more visible

Table

fig, ax = plt.subplots(figsize=(20,6))

# Hide axes
ax.xaxis.set_visible(False) 
ax.yaxis.set_visible(False)
ax.axis('tight')
ax.axis('off')

# Table 
data = np.random.random((5,3))
label=("1997", "1998", "1999")
ax.table(cellText=data,colLabels=label,loc='center')
<matplotlib.table.Table at 0x7f72c6afccd0>



Previous Lesson: JupyterNotebook Colab Basics

Next Lesson: Coding Practice Basics


Additional information

This notebook is provided for educational purposes. Feel free to report any issues on GitHub.


Author: Ka Hei, Chow

License: The code in this notebook is licensed under the Creative Commons Attribution 4.0 license.

Last modified: December 2021