Python Pandas Library
Basic Data Manipulation
- Pandas Series
- Data Manipulation
- Data Analysis
- Basic Data Visualization
- Additional information
- References:
As mentioned in the instructions, all materials can be opened in Colab as Jupyter notebooks, so users can run the code in the cloud. It is highly recommended to follow the tutorials in the given order.
This notebook aims to introduce users to the Pandas library, a useful tool for tabular data manipulation. Its capabilities are similar to Excel, but it is much more flexible and can manage huge datasets efficiently.
Prerequisites:
Pandas Series
We have learnt about lists, which are a simple way to handle information. Pandas, however, includes many additional features for handling data, such as dealing with missing values and indexing objects with text. The Pandas counterpart of a list is the Pandas Series, which can also be understood as a single column of a Pandas DataFrame. A Pandas Series can be created using pd.Series().
import pandas as pd
names = pd.Series(['登江中孤屿,赠白云先生王迥', '秋登兰山寄张五',' ','入峡寄弟'])
names
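A Pandas Series can also carry a text index and missing values, which plain lists cannot handle as easily. Here is a minimal sketch (the index labels below are made up purely for illustration):

# a Series with text labels as the index and one missing entry (None is treated as a missing value)
poems = pd.Series(['登江中孤屿,赠白云先生王迥', '秋登兰山寄张五', None],
                  index=['poem_1', 'poem_2', 'poem_3'])
poems['poem_2'] # select by label instead of position
poems.isna() # True for the missing entry
poems.dropna() # drop the missing entry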
Nearly all of Python's built-in string methods are mirrored by a Pandas vectorized string method. Here is a list of Pandas str methods:
len() | lower() | translate() | islower() |
ljust() | upper() | startswith() | isupper() |
rjust() | find() | endswith() | isnumeric() |
center() | rfind() | isalnum() | isdecimal() |
zfill() | index() | isalpha() | split() |
strip() | rindex() | isdigit() | rsplit() |
rstrip() | capitalize() | isspace() | partition() |
lstrip() | swapcase() | istitle() | rpartition() |
Although many of these string methods are not applicable to Chinese text, some of them can still be really helpful.
For example,
names.str.startswith('秋') # check whether each item starts with a given character
names.str.isspace() # check whether each item consists only of whitespace
names.str.find('秋') # find the index at which a character first occurs (-1 if not found)
names.str.split(pat="") # split each item into individual characters (behaviour can vary between pandas versions)
names.str.extract('([A-Za-z]+)', expand=False) # look for Latin letters (this is a regular expression, you will learn about it later)
names.str.findall(r'^[秋].*$') # find items that start with a given character
There are also some methods that allow convenient operations:
Method | Description |
---|---|
get() | Index each element |
slice() | Slice each element |
slice_replace() | Replace slice in each element with passed value |
cat() | Concatenate strings |
repeat() | Repeat values |
normalize() | Return Unicode form of string |
pad() | Add whitespace to left, right, or both sides of strings |
wrap() | Split long strings into lines with length less than a given width |
join() | Join strings in each element of the Series with passed separator |
get_dummies() | Extract dummy variables as a DataFrame |
names.str[0:1] # get the first character only
names.str.split(pat=',').str.get(0) # get the first clause (split on the fullwidth comma used in the titles)
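The table above also lists a few methods that we do not use in this notebook. As a small sketch of two of them, applied to the names Series from above: pad() adds whitespace around each string and cat() joins all items into one string.

names.str.pad(15) # pad each string with whitespace (on the left by default) up to a width of 15
names.str.cat(sep=' / ') # concatenate all items into a single string, separated by ' / '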
We can even get some statistics about the length of our text using describe():
names.str.len().describe()
We can also create a Pandas DataFrame from a dictionary: the keys will be used as the column names, and the values will be used as the data (rows).
Let's build a data frame using the capitals of the Qin and Han dynasties as an example.
dictionary = {
'Time': ['','– 677 BC','677 BC –','– 383 BC','383 BC – 250 BC','350 BC – 207 BC','202 BC','202 BC – 200 BC','200 BC – 8 BC'],
'Dynasty': ['Qin','Qin','Qin','Qin','Qin','Qin','Han','Han','Han'],
'Capital': ['Xiquanqiu','Pingyang','Yong','Jingyang','Yueyang','Xianyang','Luoyang','Yueyang','Changan']
} # remember that all columns in the data frame have to have the same length
df = pd.DataFrame(data=dictionary) # use pd.DataFrame() and put the dict as an argument
df
df = df.set_index('Time') # we can also set text as index
df
To view only the first row:
df.head(1)
To view only the last row:
df.tail(1)
In order to get a better understanding of the group characteristics, we can also use the groupby function. It is used as groupby() followed by a method(). For example, we can use groupby("Dynasty").count() to count the number of rows (i.e. the number of capitals) in each dynasty.
Remark: the index cannot be passed to groupby() as if it were a column.
df.groupby("Dynasty").count()
Data Manipulation
After understanding what a Pandas Series and a Pandas DataFrame are, we can start with some basic manipulation of a data frame. Let us start with an example: building a Pandas DataFrame from a text file.
First, we upload a file and select titles.txt (it can be found in the data folder).
from google.colab import files
uploaded = files.upload()
for f in uploaded.keys():
    with open(f, 'r') as file: # open the uploaded file and close it automatically afterwards
        titles = file.read()
Then, we can split the text into paragraphs (separated by two new lines) and sentences (separated by one new line).
titles = titles.split("\n\n") # two new lines
titles = [lines.split('\n') for lines in titles] # one new line, done in a list comprehension
titles[0:2] # check the first two items we have: 卷159_1 and 卷159_2
We can also check how many titles (卷) there are:
len(titles)
Then we construct our dataframe:
import pandas as pd # In case library is not imported
df = pd.DataFrame({"content": titles}) # construct a data frame using a dictionary
df.head()
We realize that the content column is quite messy, as it contains different pieces of information such as the title, the author name and the text itself. So we want to set up new columns:
1) title: The title
2) content: The text
3) index: Use the text ID (e.g. 159_1) as index
We can set up new columns by simply writing
df["name of the new column"] = [what we plan to put in]
What we plan to put in can be, for example, a NumPy array of the same length, or just a single number (in this case all rows will get the same value). For example,
import numpy as np
df["test"] = 1
df.head()
Or this:
df["test"] = np.arange(0,269)
df.head()
We can also add a new row:
df.append({'content': np.nan, 'test': np.nan}, ignore_index=True) # This is temporary only and will not change the data frame itself
# np.nan means missing values
df = df.append({'content': np.nan, 'test': np.nan}, ignore_index=True) # df = <- now the df is replaced
df.tail(1)
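A note on compatibility: DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0. If the cells above raise an error in a newer pandas version, pd.concat() achieves the same result; a minimal sketch:

new_row = pd.DataFrame([{'content': np.nan, 'test': np.nan}]) # a one-row data frame holding the missing values
df = pd.concat([df, new_row], ignore_index=True) # same effect as df.append(..., ignore_index=True)
df.tail(1)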
We can then drop that row (which contains only missing values) and the test column again.
df = df.dropna(how="any",axis="index") # drop all rows that contain any missing value
df.tail(1)
df = df.drop(columns="test") # this is for dropping the column named "test"
df.head()
Now we can finally start with our new columns.
Let's look at one of our rows: the first item is the ID, title and author name, and the remaining items are the text. So let us set the first item as the title, and the rest as the content.
df["content"][0]
df["title"] = df["content"].str[0] # the first item
df["content"] = df["content"].str[1:] # second to last item
df["content"] = df["content"].str.get(0) # get rid of the [] by extracting the first item from list
df.head()
df = df.replace(r'\n',' ', regex=True) # replace the newline character '\n' with a space
df.head()
type(df.title[0]) # title is type of string
Now we can set up a new column "id" and then use it as our index using set_index().
As we learnt in the last notebook, (\d+_\d+) means a number, followed by an underscore, followed by another number. This is the pattern we use to extract the ID from the title column.
df["id"] = df.title.str.extract('(\d+_\d+)')
df = df.set_index("id")
df.head()
Now our data frame looks better. But we still want to get rid of the ID, the author name, and the 「」 brackets in the title. To remove them, we replace them with an empty string ("") and save the result back to the title column.
df['title'] = df['title'].str.replace(r'卷(\d+_\d+)', '', regex=True).str.replace('孟浩然', '') # regex=True because the first pattern is a regular expression
df['title'] = df['title'].str.replace('「', '').str.replace('」', '')
Let's now create df2 by copying "title" and "content" from df, and set its index using the index from df.
df2 = (df[['title','content']]).set_index(df.index)
df2.head()
Finally, it is useful to know that converting a Pandas column to a NumPy array is very simple: we select the column and then call .values, which gives us the array.
For example, we can try to convert our title column to array:
df2.title.values[:5] # first 5 items
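In recent pandas versions, .to_numpy() is the recommended way to do the same conversion, and it is what we will use in the next section. For example:

df2.title.to_numpy()[:5] # equivalent to df2.title.values[:5]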
Data Analysis
After performing some basic processing of our data, let us try some analysis based on what we have. Say we want to understand how the season keywords have been used in the text: how many times have they been used, and how are they distributed? At the end we want to use the results to make a bar chart and a dispersion plot using matplotlib. We will learn much more about visualization later, but for now we will stick to simple plots.
In case we still have rows with missing values, we use dropna() again to clean our data frame.
df2 = df2.dropna(how="any",axis="index") # drop all rows that contain any missing value
Then we calculate the word offset. This is done by getting the length of the strings in the content column (.len()) and calculating its cumulative sum (.cumsum()). The value in each row is then not the length of a single title, but the word offset counted from the first character of the first title. We use it as our word offset for the plot later. With .to_numpy() we convert the result to a NumPy array.
word_count = df2.content.str.len().cumsum().to_numpy()
word_count[:5]
Now we want to look for the keyword of each season. We do this using the find() method (which returns -1 when the character is not found) and convert the result to a NumPy array. The same is done for every season.
spring_count = df2.content.str.find('春').to_numpy() # for spring
spring_count[:5]
summer_count = df2.content.str.find('夏').to_numpy() # for summer
summer_count[:5]
autumn_count = df2.content.str.find('秋').to_numpy() # for autumn
autumn_count[:5]
winter_count = df2.content.str.find('冬').to_numpy() # for winter
winter_count[:5]
Then, we use a list comprehension and enumerate() to loop through all items and their indices in the array, and save (value + word offset) for every value that is not equal to -1, i.e. wherever the character was found.
The resulting value can be understood as the total word offset of that character counted from the first title. For example, from the next cell we can tell that "春" appears at the 82nd, 198th, 660th, ... character.
spring_occur = np.array([v + word_count[i] for i, v in enumerate(spring_count) if v != -1]) # for spring
spring_occur
summer_occur = np.array([v + word_count[i] for i, v in enumerate(summer_count) if v != -1]) # for summer
summer_occur
autumn_occur = np.array([v + word_count[i] for i, v in enumerate(autumn_count) if v != -1]) # for autumn
autumn_occur
winter_occur = np.array([v + word_count[i] for i, v in enumerate(winter_count) if v != -1]) # for winter
winter_occur
Afterwards, we have our arrays which store information about the occurrences of the season keywords. We can make a plot out of them using matplotlib. The dispersion plot we are making is based on a scatter plot, so we will do a scatter plot for every season with custom marker styles.
import matplotlib.pyplot as plt # import library
fig, ax = plt.subplots(figsize=(15,3)) # create an empty plot with defined size
# in the scatter plot function (plt.scatter()), we need X for 1st argument and Y for 2nd argument.
# for X we will put the word offset values for keyword occurrence, for Y we will put a constant value from 1 to 4
# because we want the same season in the same row
# np.ones() creates 1 with defined shape, in this case the shape is [length of X,1]
spring_y = 1*np.ones([len(spring_occur),1]) # all 1 (1*1)
summer_y = 2*np.ones([len(summer_occur),1]) # all 2 (2*1)
autumn_y = 3*np.ones([len(autumn_occur),1]) # all 3 (3*1)
winter_y = 4*np.ones([len(winter_occur),1]) # all 4 (4*1)
# scatter plot
plt.scatter(spring_occur,spring_y, marker=5, s=100, alpha=0.8) # first scatter plot for spring, alpha is transparency of the markers
plt.scatter(summer_occur,summer_y, marker=5, s=100, alpha=0.8) # for summer
plt.scatter(autumn_occur,autumn_y, marker=5, s=100, alpha=0.8) # for autumn
plt.scatter(winter_occur,winter_y, marker=5, s=100, alpha=0.8) # for winter
# set y limits
plt.ylim(0.8, 4.5)
# we want our y labels as text, not number. so here we define them.
plt.yticks(np.arange(0,5,1))
labels = ['','Spring', 'Summer', 'Autumn', 'Winter']
ax.set_yticklabels(labels, fontsize=12)
# no plot frame is needed
plt.box(False)
# labels and title
plt.xlabel("Word Offset", fontsize=14)
plt.ylabel("Season Keywords", fontsize=14)
plt.title("Lexical Dispersion", fontsize=16)
On the other hand, we can also make a bar chart simply showing the occurrence frequency of the keywords.
We use str.count() on the content column of df2 to count the occurrences per row, and add sum() to sum the counts over all rows rather than a single row.
spring = int(df2.content.str.count('春').sum()) # spring
spring
summer = int(df2.content.str.count('夏').sum()) # summer
summer
autumn = int(df2.content.str.count('秋').sum()) # autumn
autumn
winter = int(df2.content.str.count('冬').sum()) # winter
winter
Now, we can use our results to make another data frame. We do this because having the results in a separate data frame makes visualization easier.
count = {'season': ["spring", "summer", "autumn", "winter"],'count': [spring, summer, autumn, winter]}
season = pd.DataFrame.from_dict(count)
# set season as index
season = season.set_index("season")
season.head()
In order to have Chinese characters shown in our plot, we need to download a font file and change the font setting in matplotlib. Please just use this code:
!wget -O TaipeiSansTCBeta-Regular.ttf "https://drive.google.com/uc?id=1eGAsTN1HBpJAkeVM57_C7ccp7hbgSz3_&export=download"
# import library
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.font_manager import fontManager
# change font setting
fontManager.addfont('TaipeiSansTCBeta-Regular.ttf')
mpl.rc('font', family='Taipei Sans TC Beta')
Having a Pandas data frame makes visualization simple. We can basically call .plot on the data frame, followed by the type of plot we want. For example, a bar chart is (name of dataframe).plot.bar().
# bar chart (figsize is passed to plot.bar() directly, since pandas may otherwise create its own figure and ignore plt.figure())
season.plot.bar(rot=0, color="#830045", figsize=(18,6))
# labels and title
plt.xlabel("Season", fontsize=15)
plt.ylabel("Occurence", fontsize=15)
plt.title("Description of season in \n孟浩然诗 卷一百五十九 and 卷一百六十?", fontsize=15)
# remove legend and add keywords in the x-axis
ax = plt.gca()
ax.set_xticklabels(['春','夏','秋','冬'], fontsize=16)
ax.get_legend().remove()
# adjust spacing in plot
plt.tight_layout()
This notebook is provided for educational purposes only. Feel free to report any issues on GitHub.
Author: Ka Hei, Chow
License: The code in this notebook is licensed under the Creative Commons by Attribution 4.0 license.
Last modified: December 2021