As mentioned in the instructions, all materials can be opened in Colab as Jupyter notebooks, so users can run the code in the cloud. It is highly recommended to follow the tutorials in the right order.

When we perform a larger text analysis, the workflow can grow longer and longer, which makes it unwieldy and prone to mistakes. Both functions and loops are ways to "ask for repeated computation", either by looping through what you have (the loop) or by referring back to code you already wrote (the function).

This notebook introduces how to use functions and loops in Python, using basic text examples from the humanities. It aims to give users the basic skills to automate small chunks of text analysis on text resources.

Presumption:

Functions

Loops

Enumerate

It is also recommended that the user already has some basic understanding of Python.

## Functions

In Python, a function is a group of related statements that performs a specific task.

Functions break our tasks into smaller chunks and make them more manageable. Furthermore, they avoid repetition and make the code reusable. This is very helpful when a workflow (task) needs to be repeated many times (for example, getting information from a long list of documents or webpages), because the user does not need to specify everything again each time the task is executed. It is also less prone to mistakes, as users do not need to rewrite the code every time.

A function definition starts with the keyword def. Once the function is defined, it can be called by typing the function name followed by the appropriate arguments in parentheses.

Remark:

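Text after # is a comment and is ignored by Python. A """ """ string placed at the top of a function is a docstring: it documents the function and has no effect on what the function computes.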

Example of a function:

import datetime

# define function
def getTime(n): # name and input
  """
  This function returns the date and time n days from now
  """
  day = datetime.datetime.now() + datetime.timedelta(days=n) # action
  return str(day) # output

# call function
getTime(1)

'2021-12-10 20:06:05.499284'
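To illustrate the point about reusability, here is a minimal sketch that calls one function several times with different arguments. The name `days_from` and the fixed start date are assumptions made only for this illustration (a fixed date keeps the result reproducible); they are not part of the original example.

```python
import datetime

def days_from(start, n):  # hypothetical helper, not from the notebook
  """Return the date n days after start, formatted as a string."""
  return str(start + datetime.timedelta(days=n))

# assumed fixed start date, chosen only so the output is reproducible
start = datetime.date(2021, 12, 9)

# the same function is reused with different inputs
tomorrow = days_from(start, 1)
next_week = days_from(start, 7)
print(tomorrow)   # 2021-12-10
print(next_week)  # 2021-12-16
```

Because the logic lives in one place, a change to the function is picked up by every call.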

婦女雜誌廣告 (Advertisements from Chinese Women's Magazine)

This is a subset of the advertisements published in 婦女雜誌 (1915, issue 1).

Let's say we want to know whether an advertisement was published by a 公司 (company).

ad = ["商務印書館發行書目介紹","女界寶、非洲樹皮丸、助肺呼吸香膠、家普魚肝油、清血解毒海波藥、納佛補天汁、良丹(五洲大藥房)","泰豐罐頭食品有限公司製造廠攝影","武進莊苣史女士菊花寒菜寫生","中華眼鏡公司","中將湯(東亞公司經理批發)"]

import re # covered in the regular expression notebook, so you do not need to understand everything for now

def ad_check(text): # a function starts with def, then the function name, then () with or without argument(s) inside
  pattern = re.compile(r'公司') # every line inside the function must be indented consistently
  match = any(pattern.findall(text)) # compute the desired result: True if the pattern occurs at least once
  return match # return the result (typically stored in a variable)

Now we can apply the function to our ad list. The function takes a single string, so we need to pick one item from the list by its index. For example, we can check the first advertisement in the list.

ad_check(ad[0])

False

It returns False, a boolean meaning the word "公司" cannot be found. Now we check on the third item.

ad_check(ad[2])

True

True is returned, meaning the keyword is found. Now we do not need to type out the whole check every time we want to test an ad.
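As a possible next step, the keyword itself could be passed in as a second argument, so that one function serves several searches. The sketch below is an assumption for illustration: `keyword_check` is a hypothetical name, and it uses a plain substring test instead of a regular expression.

```python
def keyword_check(text, keyword):  # hypothetical generalisation of ad_check
  """Return True if keyword occurs anywhere in text."""
  return keyword in text  # plain substring test, no regular expression needed

sample = ["中華眼鏡公司", "商務印書館發行書目介紹"]  # two ads taken from the list above

print(keyword_check(sample[0], "公司"))  # True
print(keyword_check(sample[1], "公司"))  # False
print(keyword_check(sample[1], "書"))    # True
```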

## Loops

In fact, we can even automate the check for all ads using a loop. A loop can be very simple, but it can get complicated when several sequences are looped over in parallel or when loops are nested (a loop inside another loop).

Here we use a simple for loop: it starts with "for (something) in (something):", followed by the line(s) containing the operations to repeat. All lines inside the loop need to be indented.

It basically tells Python:

For the item (i) in my list (ad),

apply the function using input item (i)

and print the result before the next round (item)

for i in ad: # i is the item you name, ad is our list
  print(ad_check(i)) # action

False
False
True
False
True
True
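The nested case mentioned earlier (a loop inside another loop) can be sketched as follows. The shortened sample and the second keyword "藥房" are assumptions chosen for illustration; the notebook itself only searches for "公司".

```python
ads = ["中華眼鏡公司", "良丹(五洲大藥房)"]  # shortened sample, not the full list
keywords = ["公司", "藥房"]  # hypothetical keywords

found = []
for ad_text in ads:  # outer loop: one ad at a time
  for kw in keywords:  # inner loop: every keyword for the current ad
    if kw in ad_text:
      found.append((ad_text, kw))  # remember which keyword matched which ad

print(found)
```

The inner loop runs in full for every round of the outer loop, so two ads and two keywords give four checks in total.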

Sometimes we do not want to loop over the items themselves but over their indices (for example, say we want to print the index of each company ad found). We can do it in the following way:

len(ad) # first we find the length of our list

6

import numpy as np # numpy must be imported before np.arange can be used
np.arange(len(ad)) # then we generate a sequence with the same length

array([0, 1, 2, 3, 4, 5])

This is then the index we loop through instead of the item.

Remark:

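Writing ad_check(ad[i]) in a condition is equivalent to writing ad_check(ad[i]) == True, because ad_check already returns a boolean. The first form is shorter and is the idiomatic way to write it in Python. We can verify this, for instance with the third ad:

ad_check(ad[2]) == (ad_check(ad[2]) == True)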

True

for i in np.arange(len(ad)): # loop through the indices
  if ad_check(ad[i]): # provide a condition: only print when it is true
    print(i,". ",ad[i]) # print the index and the item

2 .  泰豐罐頭食品有限公司製造廠攝影
4 .  中華眼鏡公司
5 .  中將湯(東亞公司經理批發)
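An index loop like the one above can also be written with Python's built-in range(), which avoids the NumPy dependency. This is a minimal sketch on a shortened sample; the substring test stands in for ad_check so that the block runs on its own.

```python
sample_ads = ["商務印書館發行書目介紹", "泰豐罐頭食品有限公司製造廠攝影", "中華眼鏡公司"]

hits = []
for i in range(len(sample_ads)):  # range(3) yields 0, 1, 2
  if "公司" in sample_ads[i]:  # stand-in for ad_check
    hits.append(i)  # keep the index of each match
    print(i, ". ", sample_ads[i])

print(hits)  # [1, 2]
```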

Nonetheless, printing the results is not very helpful, because they are not stored in a variable and we cannot retrieve them later. A better way is to store them in another list.

Before we can store them in our list, we first need to define it as an empty list.

check_list = [] # define our output list

for i in ad: # for loop again
  check_list.append(ad_check(i)) # append adds an element to the end of the list, so each round stores one result

check_list

[False, False, True, False, True, True]

Now we know the keyword is found in the 3rd, 5th and 6th ads. We can print out the company ads. This is easier with NumPy arrays, so we will first convert both lists to arrays.

import numpy as np # a library must be imported before it is used

ad = np.array(ad) # convert using np.array()
check_list = np.array(check_list)

ad[check_list] # select items of ad with a boolean mask; this only works when check_list is a boolean array (True and False) of the same length as ad

array(['泰豐罐頭食品有限公司製造廠攝影', '中華眼鏡公司', '中將湯(東亞公司經理批發)'], dtype='<U46')
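The positions of the matching ads can also be recovered from the boolean array itself. The sketch below uses np.flatnonzero on a mask with the same values as check_list; the variable names are assumptions for illustration.

```python
import numpy as np

mask = np.array([False, False, True, False, True, True])  # same values as check_list

positions = np.flatnonzero(mask)  # indices where the mask is True
print(positions.tolist())        # [2, 4, 5]
print((positions + 1).tolist())  # [3, 5, 6], counting from 1 instead of 0
```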

We also need to be aware that explicit loops are often slower in Python than built-in or vectorized operations, so if the job can be done without a loop, that is usually the better choice.
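One way to run the keyword check without an explicit loop is NumPy's np.char.find, which searches every string of an array in a single call. This is a sketch on a shortened sample and it replaces the regular expression with a plain substring search; for a list this small the speed difference is negligible.

```python
import numpy as np

ads = np.array(["商務印書館發行書目介紹", "中華眼鏡公司", "中將湯(東亞公司經理批發)"])

# np.char.find returns the position of the substring, or -1 when it is absent
mask = np.char.find(ads, "公司") >= 0

print(mask.tolist())  # [False, True, True]
print(ads[mask])      # only the ads that contain the keyword
```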

## Enumerate

Sometimes we want to loop not only through the items but also through their indices, so that we can do some further operations. In the above example we used a loop over the indices to print all the items with "公司".

2 . 泰豐罐頭食品有限公司製造廠攝影
4 . 中華眼鏡公司
5 . 中將湯(東亞公司經理批發)

However, we can use enumerate() instead, which is a more convenient approach. It gives us two values in each round: the first is the index, starting from 0, and the second is the item itself.

for count, value in enumerate(ad):
  print(count, value)

0 商務印書館發行書目介紹
1 女界寶、非洲樹皮丸、助肺呼吸香膠、家普魚肝油、清血解毒海波藥、納佛補天汁、良丹(五洲大藥房)
2 泰豐罐頭食品有限公司製造廠攝影
3 武進莊苣史女士菊花寒菜寫生
4 中華眼鏡公司
5 中將湯(東亞公司經理批發)

The following code gives us exactly the same result as the index-based loop above. In this small example the two versions hardly differ, but in other cases enumerate() allows us to write much shorter and more readable code.

for count, value in enumerate(ad):
  if ad_check(value):
    print(count, '.', value)

2 . 泰豐罐頭食品有限公司製造廠攝影
4 . 中華眼鏡公司
5 . 中將湯(東亞公司經理批發)
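enumerate() also accepts a start argument, which is handy when the numbering should begin at 1 for human readers. The sketch below additionally stores the results in a list instead of printing them; the sample list and the substring test (a stand-in for ad_check) are assumptions that keep the block self-contained.

```python
sample_ads = ["商務印書館發行書目介紹", "泰豐罐頭食品有限公司製造廠攝影", "中華眼鏡公司"]

matches = []
for count, value in enumerate(sample_ads, start=1):  # count begins at 1, not 0
  if "公司" in value:  # stand-in for ad_check
    matches.append((count, value))  # store the position together with the ad

print(matches)
# [(2, '泰豐罐頭食品有限公司製造廠攝影'), (3, '中華眼鏡公司')]
```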

Previous Lesson: Debugging and Understanding Errors Basics

Next Lesson: List Comprehension

Additional information

This notebook is provided for educational purposes. Feel free to report any issues on GitHub.

Author: Ka Hei, Chow

License: The code in this notebook is licensed under the Creative Commons Attribution 4.0 (CC BY 4.0) license.

Last modified: December 2021

References:

https://mhdb.mh.sinica.edu.tw/fnzz/view.php?book=1501&str=%E5%A9%A6%E5%A5%B3