As mentioned in the instructions, all materials can be opened in Colab as Jupyter notebooks, so users can run the code in the cloud. It is highly recommended to follow the tutorials in order.

Presumption:

List Comprehension

Below we use a short list of local gazetteers from LoGaRT as our example.

gazetteer = ["天下一統志(明)","大明一統志(明)","嘉善縣志(明)","雍大記南畿志(明)","上海縣志(明)","全遼志(明)","喬三石耀州志(明)","宣府鎮志(明)","雲南通志(明)","大明一統志輯錄(明)","雲中郡志(清)"]
import re # regular expressions are covered in the regular expression notebook, so you do not need to understand everything yet

def str_check(text): # a function starts with def, followed by its name and (), with or without argument(s) inside
  pattern = re.compile(r'明') # every line inside the function must be indented consistently (here with two spaces)
  match = any(pattern.findall(text)) # compute the desired output: True if 明 occurs anywhere in text
  return match # and return it (typically as a variable)

Suppose we need only the local gazetteers from the Ming dynasty (明). One way is to build a function such as str_check() above, which checks whether a title contains the character 明 (there is an easier option using Pandas, but for now we stick with this function). We can then print all items that fit the condition.


The structure of a list comprehension is:

[ (name1) for (name2) in (list) if (condition) ]

[ ] <- the square brackets wrap the key components, because the result is a list

(name1) for (name2) in <- for and in are keywords; (name2) is the variable name assigned to each item in the loop, and (name1) is the output you want, usually expressed in terms of (name2)

(list) <- the variable that stores the items, i.e. the list to be looped through

if (condition) <- a condition that filters the items; this part is optional


Remarks: (name1) and (name2) can be the same, but do not need to be.

[x for x in gazetteer if str_check(x)] # if str_check(x) is the same as if str_check(x) == True
['天下一統志(明)',
 '大明一統志(明)',
 '嘉善縣志(明)',
 '雍大記南畿志(明)',
 '上海縣志(明)',
 '全遼志(明)',
 '喬三石耀州志(明)',
 '宣府鎮志(明)',
 '雲南通志(明)',
 '大明一統志輯錄(明)']
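
To see what the comprehension does step by step, it can help to write it out as an ordinary for loop. The sketch below uses a shortened sample of the gazetteer list (an assumption made to keep the example small) and the same str_check() function as above; the names ming_comprehension and ming_loop are only illustrative.

```python
import re

# A shortened sample of the gazetteer list, for illustration only
gazetteer = ["上海縣志(明)", "全遼志(明)", "雲中郡志(清)"]

def str_check(text):
    """Return True if the character 明 occurs in text."""
    pattern = re.compile(r'明')
    return any(pattern.findall(text))

# The list comprehension ...
ming_comprehension = [x for x in gazetteer if str_check(x)]

# ... builds the same list as this explicit for loop
ming_loop = []
for x in gazetteer:
    if str_check(x):         # the optional "if (condition)" part
        ming_loop.append(x)  # the "(name1)" output part

print(ming_comprehension == ming_loop)  # True
```

The comprehension packs the loop, the condition and the append() call into a single expression.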

We can also save the list to another new list.

ming_gazetteer = [x for x in gazetteer if str_check(x)]
ming_gazetteer
['天下一統志(明)',
 '大明一統志(明)',
 '嘉善縣志(明)',
 '雍大記南畿志(明)',
 '上海縣志(明)',
 '全遼志(明)',
 '喬三石耀州志(明)',
 '宣府鎮志(明)',
 '雲南通志(明)',
 '大明一統志輯錄(明)']
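
The condition does not have to be a function call; any expression that evaluates to True or False works. As a rough sketch, the two variants below filter without a helper function. The sample list is shortened, and its last entry is a hypothetical mislabelled title invented for this illustration (it is not taken from LoGaRT), added to show that checking for 明 anywhere in the string is looser than checking the dynasty label at the end.

```python
# Shortened sample; the last entry is hypothetical and only for illustration
sample = ["上海縣志(明)", "雲中郡志(清)", "大明一統志(清)"]

# Membership test: keeps every title that contains 明 anywhere
contains_ming = [x for x in sample if "明" in x]

# Stricter test: keeps only titles whose label at the end is (明)
labelled_ming = [x for x in sample if x.endswith("(明)")]

print(contains_ming)  # ['上海縣志(明)', '大明一統志(清)']
print(labelled_ming)  # ['上海縣志(明)']
```

Which condition is appropriate depends on the data: in the gazetteer list above both give the same result, because every title containing 明 is also labelled (明).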

We can also change the outputs: for example, instead of the item itself, we want to get the lengths of the strings which fit the same condition:

[len(x) for x in gazetteer if str_check(x)]
[8, 8, 7, 9, 7, 6, 9, 7, 7, 10]
[x for x in gazetteer if str_check(x)]
['天下一統志(明)',
 '大明一統志(明)',
 '嘉善縣志(明)',
 '雍大記南畿志(明)',
 '上海縣志(明)',
 '全遼志(明)',
 '喬三石耀州志(明)',
 '宣府鎮志(明)',
 '雲南通志(明)',
 '大明一統志輯錄(明)']

We can also combine list comprehension with functions from other libraries. For example, suppose we want the pinyin of the items in our list. We can use the pinyin library for that.

! pip install pinyin # install library
import pinyin # import library
[pinyin.get(x, format="strip", delimiter=" ") for x in gazetteer] # get the pinyin for all items
['tian xia yi tong zhi ( ming )',
 'da ming yi tong zhi ( ming )',
 'jia shan xian zhi ( ming )',
 'yong da ji nan ji zhi ( ming )',
 'shang hai xian zhi ( ming )',
 'quan liao zhi ( ming )',
 'qiao san shi yao zhou zhi ( ming )',
 'xuan fu zhen zhi ( ming )',
 'yun nan tong zhi ( ming )',
 'da ming yi tong zhi ji lu ( ming )',
 'yun zhong jun zhi ( qing )']
[pinyin.get(x, format="strip", delimiter=" ") for x in gazetteer if str_check(x)] # get the pinyin for all items from ming
['tian xia yi tong zhi ( ming )',
 'da ming yi tong zhi ( ming )',
 'jia shan xian zhi ( ming )',
 'yong da ji nan ji zhi ( ming )',
 'shang hai xian zhi ( ming )',
 'quan liao zhi ( ming )',
 'qiao san shi yao zhou zhi ( ming )',
 'xuan fu zhen zhi ( ming )',
 'yun nan tong zhi ( ming )',
 'da ming yi tong zhi ji lu ( ming )']

Apart from that, we can also include the index using enumerate(), which we learnt in the previous lesson.

Remember that enumerate() returns two values (the index and the item), so we need two names between the keywords "for" and "in"!

[print(index,pinyin.get(item, format="strip", delimiter=" ")) for index, item in enumerate(gazetteer)] # get the pinyin with indices for all items
0 tian xia yi tong zhi ( ming )
1 da ming yi tong zhi ( ming )
2 jia shan xian zhi ( ming )
3 yong da ji nan ji zhi ( ming )
4 shang hai xian zhi ( ming )
5 quan liao zhi ( ming )
6 qiao san shi yao zhou zhi ( ming )
7 xuan fu zhen zhi ( ming )
8 yun nan tong zhi ( ming )
9 da ming yi tong zhi ji lu ( ming )
10 yun zhong jun zhi ( qing )
[None, None, None, None, None, None, None, None, None, None, None]
[print(index,pinyin.get(item, format="strip", delimiter=" ")) for index, item in enumerate(gazetteer) if index in [1,3]]
1 da ming yi tong zhi ( ming )
3 yong da ji nan ji zhi ( ming )
[None, None]
[print(index,pinyin.get(item, format="strip", delimiter=" ")) for index, item in enumerate(gazetteer) if index not in [0]]
1 da ming yi tong zhi ( ming )
2 jia shan xian zhi ( ming )
3 yong da ji nan ji zhi ( ming )
4 shang hai xian zhi ( ming )
5 quan liao zhi ( ming )
6 qiao san shi yao zhou zhi ( ming )
7 xuan fu zhen zhi ( ming )
8 yun nan tong zhi ( ming )
9 da ming yi tong zhi ji lu ( ming )
10 yun zhong jun zhi ( qing )
[None, None, None, None, None, None, None, None, None, None]
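
The lists of None at the end of the outputs above appear because print() returns None, and the comprehension collects those return values. One possible alternative, sketched below on a shortened sample list and without the pinyin library (both simplifications are assumptions made to keep the example self-contained), is to let the comprehension build the strings and print them afterwards.

```python
# A shortened sample of the gazetteer list, for illustration only
gazetteer = ["上海縣志(明)", "全遼志(明)", "雲中郡志(清)"]

# Build formatted strings instead of calling print() inside the comprehension
numbered = [f"{index} {item}" for index, item in enumerate(gazetteer)]

# Print them with an ordinary loop, so no list of None values is created
for line in numbered:
    print(line)

# The condition can still use the index, as in the examples above
selected = [item for index, item in enumerate(gazetteer) if index in [0, 2]]
print(selected)  # ['上海縣志(明)', '雲中郡志(清)']
```

This keeps the comprehension focused on producing a list, which can then be reused or printed as needed.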

There are numerous things we can do with list comprehension. Another example demonstrating its functionality is a word frequency count: finding the keywords that appear most frequently.

First, we join all the items into a single string, remove the parentheses using replace(), and then split the string into individual Chinese characters using list().

string = ''.join(gazetteer) # join all items in list to a single string
string = string.replace("(", "").replace(")", "") # replace "(" and ")" with an empty string ""

wordlist = list(string) # split all Chinese characters
wordlist[:5] # print first 5 characters
['天', '下', '一', '統', '志']

Now we use list comprehension to count all characters.

wordfreq = [wordlist.count(w) for w in wordlist] # list comprehension to count all characters

# print the single string
print("String\n" + string +"\n") # \n means new line

# print a list of all characters
print("List\n" + str(wordlist) + "\n")

# print frequencies
print("Frequencies\n" + str(wordfreq) + "\n")

# print pairs by zipping characters with their occurrences
print("Pairs\n" + str(list(zip(wordlist, wordfreq))))
String
天下一統志明大明一統志明嘉善縣志明雍大記南畿志明上海縣志明全遼志明喬三石耀州志明宣府鎮志明雲南通志明大明一統志輯錄明雲中郡志清

List
['天', '下', '一', '統', '志', '明', '大', '明', '一', '統', '志', '明', '嘉', '善', '縣', '志', '明', '雍', '大', '記', '南', '畿', '志', '明', '上', '海', '縣', '志', '明', '全', '遼', '志', '明', '喬', '三', '石', '耀', '州', '志', '明', '宣', '府', '鎮', '志', '明', '雲', '南', '通', '志', '明', '大', '明', '一', '統', '志', '輯', '錄', '明', '雲', '中', '郡', '志', '清']

Frequencies
[1, 1, 3, 3, 11, 12, 3, 12, 3, 3, 11, 12, 1, 1, 2, 11, 12, 1, 3, 1, 2, 1, 11, 12, 1, 1, 2, 11, 12, 1, 1, 11, 12, 1, 1, 1, 1, 1, 11, 12, 1, 1, 1, 11, 12, 2, 2, 1, 11, 12, 3, 12, 3, 3, 11, 1, 1, 12, 2, 1, 1, 11, 1]

Pairs
[('天', 1), ('下', 1), ('一', 3), ('統', 3), ('志', 11), ('明', 12), ('大', 3), ('明', 12), ('一', 3), ('統', 3), ('志', 11), ('明', 12), ('嘉', 1), ('善', 1), ('縣', 2), ('志', 11), ('明', 12), ('雍', 1), ('大', 3), ('記', 1), ('南', 2), ('畿', 1), ('志', 11), ('明', 12), ('上', 1), ('海', 1), ('縣', 2), ('志', 11), ('明', 12), ('全', 1), ('遼', 1), ('志', 11), ('明', 12), ('喬', 1), ('三', 1), ('石', 1), ('耀', 1), ('州', 1), ('志', 11), ('明', 12), ('宣', 1), ('府', 1), ('鎮', 1), ('志', 11), ('明', 12), ('雲', 2), ('南', 2), ('通', 1), ('志', 11), ('明', 12), ('大', 3), ('明', 12), ('一', 3), ('統', 3), ('志', 11), ('輯', 1), ('錄', 1), ('明', 12), ('雲', 2), ('中', 1), ('郡', 1), ('志', 11), ('清', 1)]

To produce output that is easier to inspect, we can combine functions and list comprehensions to build a dictionary. See the previous lesson (Functions and Loops Basics) if you need to review how to build a function.

def wordListToFreqDict(wordlist):
    """
    This function converts a word list to a dictionary of character frequencies
    """
    wordfreq = [wordlist.count(p) for p in wordlist] # same as what we did above
    return dict(list(zip(wordlist,wordfreq))) # return a dictionary built from the (character, frequency) pairs

word_count = wordListToFreqDict(wordlist)

Let's look at our dictionary word_count.

word_count
{'一': 3,
 '三': 1,
 '上': 1,
 '下': 1,
 '中': 1,
 '全': 1,
 '南': 2,
 '善': 1,
 '喬': 1,
 '嘉': 1,
 '大': 3,
 '天': 1,
 '宣': 1,
 '州': 1,
 '府': 1,
 '志': 11,
 '明': 12,
 '海': 1,
 '清': 1,
 '畿': 1,
 '石': 1,
 '統': 3,
 '縣': 2,
 '耀': 1,
 '記': 1,
 '輯': 1,
 '通': 1,
 '遼': 1,
 '郡': 1,
 '錄': 1,
 '鎮': 1,
 '雍': 1,
 '雲': 2}
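
Python also offers a dictionary comprehension, which follows the same pattern as a list comprehension but uses curly brackets and a key: value pair as the output. The sketch below is one possible alternative to the function above, shown on a shortened character list (an assumption made to keep the output small); looping over set(wordlist) counts each distinct character only once.

```python
# Shortened sample of characters, for illustration only
wordlist = list("大明一統志明雲南通志明")

# Dictionary comprehension: one key per distinct character
word_count = {w: wordlist.count(w) for w in set(wordlist)}

print(word_count["明"])  # 3
print(word_count["志"])  # 2
```

The order of the keys in this dictionary is not guaranteed, because a set has no fixed order.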

However, as we can see, the data seems to be a bit messy. It would be much easier to read if we order the keywords by frequency. We can do this by building another small function to sort the values.

def sortFreqDict(freqdict):
    """
    This function sorts a dictionary by keyword frequency
    """
    aux = [(freqdict[key], key) for key in freqdict] # convert dictionary back to a list of (frequency, key) tuples
    aux.sort() # sort the tuples in ascending order
    aux.reverse() # reverse the list so the items are in descending order
    return aux # return sorted list

sortFreqDict(word_count)[:5] # apply our function and print the first 5 items in list
[(12, '明'), (11, '志'), (3, '統'), (3, '大'), (3, '一')]
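
For comparison, Python's standard library provides collections.Counter, which counts items and ranks them in one step. The sketch below is an alternative to the two functions above, not the approach used in this notebook, and it runs on a shortened character list (an assumption for brevity). Note that it returns (character, frequency) pairs, the reverse of the tuple order used by sortFreqDict().

```python
from collections import Counter

# Shortened sample of characters, for illustration only
wordlist = list("大明一統志明雲南通志明")

counts = Counter(wordlist)    # count every character in one pass
top2 = counts.most_common(2)  # the two most frequent characters
print(top2)  # [('明', 3), ('志', 2)]

# The same ranking with sorted() and a key function on the frequency
pairs = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
print(pairs[:2])  # [('明', 3), ('志', 2)]
```

Counter avoids calling count() once per character, which may matter for longer texts.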



Previous Lesson: Functions and Loops Basics

Next Lesson: Regular Expression


Additional information

This notebook is provided for educational purposes. Feel free to report any issues on GitHub.


Author: Ka Hei, Chow

License: The code in this notebook is licensed under the Creative Commons Attribution 4.0 (CC BY 4.0) license.

Last modified: December 2021




References:

https://gist.github.com/acrymble/1065661