List Comprehension
Python basics
- Previous Lesson: Functions and Loops Basics
- Next Lesson: Regular Expression
- Additional information
- References:
As mentioned in the instructions, all materials can be opened in Colab
as Jupyter notebooks, so users can run the code in the cloud. It is highly recommended to follow the tutorials in the right order.
Presumption:
Below we use a short list of local gazetteers from LoGaRT as our example.
gazetteer = ["天下一統志(明)","大明一統志(明)","嘉善縣志(明)","雍大記南畿志(明)","上海縣志(明)","全遼志(明)","喬三石耀州志(明)","宣府鎮志(明)","雲南通志(明)","大明一統志輯錄(明)","雲中郡志(清)"]
import re # you will learn this in the regular expression notebook, so you do not need to understand everything for now
def str_check(text): # a function starts with def, then the name of the function, then () with or without argument(s) inside
    pattern = re.compile(r'明') # every line inside the function must be indented (e.g. with a tab)
    match = any(pattern.findall(text)) # compute the desired output: True if 明 occurs anywhere in text
    return match # return the result (typically stored in a variable)
Suppose we need only the local gazetteers from the Ming dynasty (明). One way is to build a function that checks each title, such as str_check above (there is an easier option using Pandas, but for now we stick with this function). We can then list all results which fit the condition.
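To see what the check returns, here is a minimal sketch that calls str_check on two titles from the list. The last string is a made-up title (not from LoGaRT) used only to illustrate a caveat: the pattern matches 明 anywhere in the text, so a stricter check on "(明)" is shown as one possible alternative, not as the method used in this lesson.

```python
import re

def str_check(text):
    """Return True if the character 明 occurs anywhere in text."""
    pattern = re.compile(r'明')
    return any(pattern.findall(text))

print(str_check("上海縣志(明)"))  # a Ming gazetteer from the list
print(str_check("雲中郡志(清)"))  # a Qing gazetteer from the list

# Made-up title for illustration only: 明 is in the name, but the dynasty is 清
made_up = "明鏡志(清)"
print(str_check(made_up))   # the regex check matches the 明 in the name
print("(明)" in made_up)    # a stricter check on the dynasty label does not
```

For the sample list in this lesson both checks give the same result, because no non-Ming title contains 明.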
The structure of a list comprehension is:
[(name1) for (name2) in (list) if (condition)]
- [ ]: the square brackets put the results in a list
- (name1) for (name2) in: the key components. for and in are keywords, (name2) is the variable name assigned to each item in the loop, and (name1) is the output you want, which should be expressed in terms of (name2)
- (list): the variable which stores all the items, or the list to loop through
- if (condition): adds a condition; this part is optional
Remarks: (name1) and (name2) can be the same but do not need to be
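As a minimal sketch of this structure, the example below uses a small list of numbers (an assumption for illustration, not part of the gazetteer data). Here (name2) is n, (name1) is n * n, and the condition keeps only even numbers.

```python
numbers = [1, 2, 3, 4, 5, 6]

# (name1) = n * n, (name2) = n, (list) = numbers, (condition) = n is even
squares_of_even = [n * n for n in numbers if n % 2 == 0]
print(squares_of_even)  # [4, 16, 36]

# The condition is optional: without it, every item is used
doubled = [n * 2 for n in numbers]
print(doubled)  # [2, 4, 6, 8, 10, 12]
```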
[x for x in gazetteer if str_check(x)] # "if str_check(x)" is the same as "if str_check(x) == True"
We can also save the result in a new list.
ming_gazetteer = [x for x in gazetteer if str_check(x)]
ming_gazetteer
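It may help to compare the comprehension with the for loop from the previous lesson. The sketch below uses a shortened sample of the list and a hypothetical helper is_ming (standing in for str_check) so that it runs on its own.

```python
gazetteer_sample = ["嘉善縣志(明)", "上海縣志(明)", "雲中郡志(清)"]  # shortened sample

def is_ming(title):
    """Hypothetical stand-in for str_check: True if 明 occurs in title."""
    return "明" in title

# The for-loop version: create an empty list and append matching items
ming_loop = []
for x in gazetteer_sample:
    if is_ming(x):
        ming_loop.append(x)

# The list comprehension does the same in one line
ming_comprehension = [x for x in gazetteer_sample if is_ming(x)]

print(ming_loop == ming_comprehension)  # True
```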
We can also change the outputs: for example, instead of the item itself, we want to get the lengths of the strings which fit the same condition:
[len(x) for x in gazetteer if str_check(x)]
[x for x in gazetteer if str_check(x)]
We can also combine list comprehension with functions from other libraries. For example, suppose we want the pinyin of the items in our list. We can use the pinyin library for that.
! pip install pinyin # install library
import pinyin # import library
[pinyin.get(x, format="strip", delimiter=" ") for x in gazetteer] # get the pinyin for all items
[pinyin.get(x, format="strip", delimiter=" ") for x in gazetteer if str_check(x)] # get the pinyin for all items from ming
Apart from that, we can also include the index using enumerate() (we learnt it in the previous lesson).
Remember that enumerate() yields two values for each item (the index and the item itself), so we need two names between the keywords "for" and "in"!
[print(index,pinyin.get(item, format="strip", delimiter=" ")) for index, item in enumerate(gazetteer)] # get the pinyin with indices for all items
[print(index,pinyin.get(item, format="strip", delimiter=" ")) for index, item in enumerate(gazetteer) if index in [1,3]]
[print(index,pinyin.get(item, format="strip", delimiter=" ")) for index, item in enumerate(gazetteer) if index not in [0]]
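Because print() returns None, the comprehensions above also build a list of None values as a side effect. A possible alternative, sketched here on the first four titles and without the pinyin library so that it runs on its own, is to collect (index, item) pairs in the comprehension and print them in a separate loop.

```python
gazetteer_sample = ["天下一統志(明)", "大明一統志(明)", "嘉善縣志(明)", "雍大記南畿志(明)"]

# Build a list of (index, item) tuples instead of calling print() inside
indexed = [(index, item) for index, item in enumerate(gazetteer_sample) if index in [1, 3]]

# Print the pairs afterwards
for index, item in indexed:
    print(index, item)
```

Keeping the result as a list of tuples also means it can be reused later, for example for filtering or sorting.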
There are numerous things we can do with list comprehension. Another example demonstrating its functionality is a word frequency count: searching for the keywords which appear most frequently.
First, we join all the items into a single string, remove ( and ) using replace(), and then split the string into individual Chinese characters using list().
string = ''.join(gazetteer) # join all items in list to a single string
string = string.replace("(", "").replace(")", "") # replace "(" and ")" to nothing ""
wordlist = list(string) # split all Chinese characters
wordlist[:5] # print first 5 characters
Now we use list comprehension to count all characters.
wordfreq = [wordlist.count(w) for w in wordlist] # list comprehension to count all characters
# print the single string
print("String\n" + string +"\n") # \n means new line
# print a list of all characters
print("List\n" + str(wordlist) + "\n")
# print frequencies
print("Frequencies\n" + str(wordfreq) + "\n")
# print pairs of characters and occurrences, combined with zip()
print("Pairs\n" + str(list(zip(wordlist, wordfreq))))
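Calling wordlist.count(w) for every character scans the whole list each time, which is fine for a short list but slow for long texts. As a hedged alternative (not the approach used in this lesson), Counter from Python's standard collections module counts all characters in a single pass. The sketch uses a short sample string so that it runs on its own.

```python
from collections import Counter

sample_wordlist = list("嘉善縣志上海縣志")  # short sample, not the full list

# The approach from this lesson: one count() call per character
sample_wordfreq = [sample_wordlist.count(w) for w in sample_wordlist]

# Alternative: Counter builds a character -> frequency mapping in one pass
counts = Counter(sample_wordlist)

print(sample_wordfreq)
print(counts["縣"])  # 2
```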
In order to produce more useful outputs to inspect the keywords, we can combine functions and list comprehensions to output a dictionary. See the previous lesson (Functions and Loops Basics) if you need to review how to build a function.
def wordListToFreqDict(wordlist):
    """
    This function converts a word list to a dictionary of character frequencies
    """
    wordfreq = [wordlist.count(p) for p in wordlist] # same as what we did above
    return dict(zip(wordlist, wordfreq)) # return a dictionary built from the (character, frequency) pairs
word_count = wordListToFreqDict(wordlist)
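The same idea can also be written with a dictionary comprehension, which uses curly braces and a key: value pair in place of (name1). This is a minimal sketch on a short sample string, offered as an optional variation and not as the method of this lesson.

```python
sample_wordlist = list("嘉善縣志上海縣志")  # short sample, not the full list

# {key: value for item in iterable}; set() avoids counting a character twice
sample_count = {w: sample_wordlist.count(w) for w in set(sample_wordlist)}

print(sample_count["志"])  # 2
```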
Let's look at our dictionary word_count.
word_count
However, as we can see, the data seems to be a bit messy. It would be much easier to read if we order the keywords by frequency. We can do this by building another small function to sort the values.
def sortFreqDict(freqdict):
    """
    This function sorts a dictionary by keyword frequency
    """
    aux = [(freqdict[key], key) for key in freqdict] # convert dictionary back to a list of (frequency, key) tuples
    aux.sort() # sort the tuples in ascending order
    aux.reverse() # reverse the list so the items are in descending order
    return aux # return sorted list
sortFreqDict(word_count)[:5] # apply our function and print the first 5 items in list
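The sort-then-reverse steps can also be expressed with the built-in sorted() and reverse=True. The sketch below repeats sortFreqDict so that it runs on its own and compares the two on a small dictionary with made-up counts (not the real frequencies from the gazetteer list).

```python
def sortFreqDict(freqdict):
    """Sort a dictionary by frequency, as in this lesson."""
    aux = [(freqdict[key], key) for key in freqdict]
    aux.sort()
    aux.reverse()
    return aux

sample_count = {"志": 11, "明": 12, "一": 3, "統": 3}  # made-up counts for illustration

# Alternative: build the tuples with a list comprehension and sort in one call
by_sorted = sorted([(count, key) for key, count in sample_count.items()], reverse=True)

print(by_sorted[0])                             # (12, '明')
print(by_sorted == sortFreqDict(sample_count))  # True
```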
This notebook is provided for educational purposes. Feel free to report any issues on GitHub.
Author: Ka Hei, Chow
License: The code in this notebook is licensed under the Creative Commons by Attribution 4.0 license.
Last modified: December 2021