As mentioned in the instructions, all materials can be opened in Colab as Jupyter notebooks, so users can run the code in the cloud. It is highly recommended to follow the tutorials in the right order.

This notebook introduces how to use regular expressions in Python to extract useful information from text, which may come from documents or websites.

Prerequisites:

Before starting this tutorial, please watch this video so that you already understand:

1) What is the group method in a regular expression?

2) What is a raw string?

3) How to create a character set?

4) What is the function of quantifiers?



Review

Here are the summary tables from the video:

Syntax    Meaning
.         Any character except newline
\d        Digit (0-9)
\D        Not a digit (0-9)
\w        Word character (a-z, A-Z, 0-9, _)
\W        Not a word character
\s        Whitespace (space, tab, newline)
\S        Not whitespace (space, tab, newline)

Syntax    Meaning
\b        Word boundary
\B        Not a word boundary
^         Beginning of a string
$         End of a string
[]        Matches characters in brackets
[^ ]      Matches characters NOT in brackets
|         Either or
( )       Group

Quantifiers    Meaning
*              0 or more
+              1 or more
?              0 or one
{3}            Exact number
{3,4}          Range of numbers (minimum, maximum)
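As a quick refresher, the short sketch below exercises a few entries from the tables: digit classes, quantifiers, word boundaries and a character set. The sample sentence is made up for illustration and does not come from the video.

```python
import re

# Hypothetical sample sentence, used only to exercise the syntax above.
sample = "Order 66 shipped on 2021-12-05 to room B12."

# \d{4} is exactly four digits, \d{2} exactly two digits
dates = re.findall(r'\d{4}-\d{2}-\d{2}', sample)

# \b is a word boundary, so only digit runs that form a whole word match
numbers = re.findall(r'\b\d+\b', sample)

# [A-Z] is a character set, \d+ is one or more digits
rooms = re.findall(r'[A-Z]\d+', sample)

print(dates)    # ['2021-12-05']
print(numbers)  # ['66', '2021', '12', '05']
print(rooms)    # ['B12']
```

Note that the 12 in B12 is not in numbers, because there is no word boundary between B and 1.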

Information Retrieval

re

Before we analyse any text, the relevant information first needs to be extracted so that all irrelevant information is excluded. This is not always straightforward, since the text might be mixed with other information, particularly when it is mined from online sources.

Below is an example of an entry extracted from Historical GIS for Japan. The information is spread over multiple rows, with each row giving a different piece of information. If we only need one piece of information, it is easy to copy it from a single entry, but the task becomes challenging once we have thousands of entries. This is where text mining can save us time and effort.


First of all, we have to import the library.

import re
lord_entry = """
name:	abemasaharu\n
vernacular name definition	kanji:	阿部正春\n
alternate vernacular name definition	hiragana:	あべまさはる\n
feature type definition	feature type:	feudal lord 大名 daimyo\n
date range definition	date range:	1664 to 1664\n
time slice definition	valid as:	time slice 年份\n
present location definition	present location:	岩槻市 iwatsukishi\n
point id definition	point id:	jp_dmy_40\n
data source definition	data source:	JP_CHGIS\n
feature type definition	coordinate type:	centroid\n
feature type definition	latitude:	35.93\n
feature type definition	longitude:	139.70\n
admin hierarchy definition	admin hierarchy: 武蔵国 musashi no kuni
"""

Name

Here we can try to get the kanji name of the entry.

From what we have learnt, we can use groups: the first group matches kanji: starting at a word boundary (\b) and followed by one whitespace character (\s); the second group, (.*), captures everything after it up to the end of the line, whatever its length (. does not match a newline). With pattern1, the name we need is therefore in the second group.

We will use re.compile() to compile our pattern (faster if the pattern is frequently used), then use findall() to look for all matches.

pattern1 = re.compile(r'(\bkanji:\s)(.*)')

match1 = pattern1.findall(lord_entry) # get all matches
match1 # print them out
[('kanji:\t', '阿部正春')]

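findall() returns a list of tuples, one tuple per match and one tuple item per group. We can access the first element of the list with [0] (there is only one match) and then the second element of that tuple with [1].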

match1[0][1]
'阿部正春'
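When only the first match is needed, search() with group() is a possible alternative to indexing into the findall() result. The sketch below uses a single line copied from lord_entry so that it runs on its own; the variable names are chosen for illustration.

```python
import re

# One line in the same "key:<tab>value" layout as lord_entry.
line = "vernacular name definition\tkanji:\t阿部正春"

pattern = re.compile(r'(\bkanji:\s)(.*)')
m = pattern.search(line)  # first match only, or None if nothing matches

if m is not None:
    label = m.group(1)       # first group: the label and the tab after it
    kanji_name = m.group(2)  # second group: the value after the label
    print(kanji_name)        # 阿部正春
```

Checking for None before calling group() avoids an AttributeError on entries that lack the field.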

Alternative: Lookaround

Alternatively, we can use a lookaround assertion, which re supports. This lets us use kanji: to locate what we are searching for (the text after the keyword) without including kanji: itself in the match, since we do not need it.


Be careful: whitespace may not be obvious, but it also counts as a character, so the pattern always needs to account for it.



Given the string foobarbarfoo:


bar(?=bar) finds the 1st bar ("bar" which has "bar" after it)

bar(?!bar) finds the 2nd bar ("bar" which does not have "bar" after it)

(?<=foo)bar finds the 1st bar ("bar" which has "foo" before it)

(?<!foo)bar finds the 2nd bar ("bar" which does not have "foo" before it)


They can also be combined:

(?<=foo)bar(?=bar) finds the 1st bar ("bar" with "foo" before it and "bar" after it)
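The five cases above can be checked directly. This small sketch reports the start index of each match in foobarbarfoo, where the 1st bar starts at index 3 and the 2nd at index 6.

```python
import re

s = "foobarbarfoo"

patterns = [
    r'bar(?=bar)',         # positive lookahead
    r'bar(?!bar)',         # negative lookahead
    r'(?<=foo)bar',        # positive lookbehind
    r'(?<!foo)bar',        # negative lookbehind
    r'(?<=foo)bar(?=bar)', # lookbehind and lookahead combined
]

# Start index of the first match for each pattern
starts = [re.search(p, s).start() for p in patterns]

for p, start in zip(patterns, starts):
    print(p, "->", start)
# starts is [3, 6, 3, 6, 3]
```

In every case the matched text is just bar; the lookaround parts only constrain where the match may occur.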


Here we use (?<=text1)text2 to match text2 only where text1 comes immediately before it; text1 identifies the position but is not part of the match.

pattern2 = re.compile(r'(?<=kanji:\s).*')

match2 = pattern2.findall(lord_entry)
match2
['阿部正春']
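One caveat worth knowing: Python's built-in re module only accepts lookbehinds of fixed width, so kanji:\s works but kanji:\s+ does not. The sketch below shows the error and a capturing-group fallback for cases where the amount of whitespace varies; the sample line with extra spaces is made up for illustration.

```python
import re

# A lookbehind containing \s+ has no fixed length, so re rejects it.
try:
    re.compile(r'(?<=kanji:\s+).*')
    variable_width_ok = True
except re.error as err:
    variable_width_ok = False
    print("rejected:", err)

# Fallback: match the label normally and capture only the value.
line = "kanji:  \t阿部正春"  # hypothetical line with irregular whitespace
name = re.search(r'kanji:\s+(.*)', line).group(1)
print(name)  # 阿部正春
```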

Coordinates

Now, we can try to get the latitude and longitude from the lord_entry (for example, to make a map in GIS). Since we have already learnt the principle, the code we need is indeed very similar.

Latitude
lat_pattern = re.compile(r'(?<=latitude:\s).*')

match = lat_pattern.findall(lord_entry)
match
['35.93']

We need to be careful here. When we think of coordinates, we normally expect a floating-point number, but what we get here (match) is a list containing a string. Passing the list directly to any geospatial operation later will cause errors, so always check the type.

type(match) # it is a list
list
type(match[0]) # we can get the first item of the list to remove [], now it is a string
str

We need to further convert the string into float using float().

type(float(match[0]))
float
lat = float(match[0]) # save the final result to lat
lat
35.93

Now we have what we need! Let's do the same for longitude.

Longitude
lon_pattern = re.compile(r'(?<=longitude:\s).*')

match = lon_pattern.findall(lord_entry)
match # list
['139.70']
lon = float(match[0])
lon # float
139.7
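Since the latitude and longitude steps are identical apart from the keyword, they could be wrapped in a small helper when there are many fields or many entries. get_field below is a hypothetical function written for this sketch, not part of re, and the entry is a shortened copy of lord_entry.

```python
import re

def get_field(entry, key):
    """Return the text after 'key:' on its line, or None if the key is absent.

    Hypothetical helper for illustration only.
    """
    # re.escape protects any special characters in the key
    m = re.search(re.escape(key) + r':\s(.*)', entry)
    return m.group(1).strip() if m else None

# Shortened copy of the entry above, with the same tab-separated layout
entry = (
    "feature type definition\tlatitude:\t35.93\n"
    "feature type definition\tlongitude:\t139.70\n"
)

lat = float(get_field(entry, "latitude"))
lon = float(get_field(entry, "longitude"))
missing = get_field(entry, "altitude")  # not in the entry

print(lat, lon, missing)  # 35.93 139.7 None
```

Returning None for a missing field lets the caller decide how to handle incomplete entries instead of failing with an IndexError.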

Chinese Characters

Here is another short text, from 韓愈 (Han Yu). For Chinese characters, we can use Unicode ranges to select a specific type of character.


The ranges of Unicode characters which are routinely used for Chinese and Japanese text are:

  • U+3040 - U+30FF: hiragana and katakana (Japanese only)

  • U+3400 - U+4DBF: CJK unified ideographs extension A (Chinese, Japanese, and Korean)

  • U+4E00 - U+9FFF: CJK unified ideographs (Chinese, Japanese, and Korean)

  • U+F900 - U+FAFF: CJK compatibility ideographs (Chinese, Japanese, and Korean)

  • U+FF66 - U+FF9F: half-width katakana (Japanese only)
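As a rough illustration of how these ranges can be used, the sketch below separates kana from ideographs in a short mixed string. The sample string reuses values from lord_entry, and the pattern names are chosen for this example.

```python
import re

sample = "阿部正春 あべまさはる daimyo 大名"

# Hiragana and katakana block
kana = re.compile(r'[\u3040-\u30ff]+')

# Several ranges can be combined inside one character set:
# extension A, unified ideographs and compatibility ideographs
ideographs = re.compile(r'[\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff]+')

kana_runs = kana.findall(sample)
ideograph_runs = ideographs.findall(sample)

print(kana_runs)       # ['あべまさはる']
print(ideograph_runs)  # ['阿部正春', '大名']
```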

text = "或問諫議大夫陽城於愈:可以為有道之士乎哉?學廣而聞多,不求聞於人也,行古人之道,居於晉之鄙,晉之鄙人薰其德而善良者幾千人。大臣聞而薦之,天子以為諫議大夫。人皆以為華,陽子不色喜。居於位,五年矣,視其德如在野,彼豈以富貴移易其心哉!"
pattern = re.compile(r'[\u4e00-\u9fff]+')

match = pattern.findall(text)
match
['或問諫議大夫陽城於愈',
 '可以為有道之士乎哉',
 '學廣而聞多',
 '不求聞於人也',
 '行古人之道',
 '居於晉之鄙',
 '晉之鄙人薰其德而善良者幾千人',
 '大臣聞而薦之',
 '天子以為諫議大夫',
 '人皆以為華',
 '陽子不色喜',
 '居於位',
 '五年矣',
 '視其德如在野',
 '彼豈以富貴移易其心哉']

We can also match each character individually by dropping the + quantifier:

pattern = re.compile(r'[\u4e00-\u9fff]')

match = pattern.findall(text)
match[:5] # print first 5 characters only
['或', '問', '諫', '議', '大']

Here is another example, an entry from 清代檔案 (Qing dynasty archives). Let's say we want to extract the time (時間) from the document.

text = """
撥給各種工匠銀乾隆01年8月
--內務府奏銷檔
第1筆

事由:撥給各種工匠銀

內文:雍正十三年四月起至 乾隆 元年五月給發匠役工價所用大制錢數目
郎中永保等文開恭畫坤寧宮神像需用外僱畫匠畫短工九十五工每工錢一百三十四文領去大制錢十二串七百三三十文
銀庫郎中邁格等據掌儀司郎中謨爾德等文開恭造坤寧宮祭祀所用鏨花銀香碟八個爵盤二個漏子一個格漏一個箸一雙匙三張小碟二十個鍾十一個大碗五個壺一把大小盤二十四個鑲銀裹楠木肉槽四個三鑲烏木箸二雙畫像上用掛釣三分亭子上用銀面葉一分需用外僱鏨花匠大器匠做短工七百九十一工四分五厘每工錢一百三十四文領去大制錢一百六串五十四文
...

時間:乾隆01年8月

官司:

官員:

微捲頁數:173-194

冊數:194

資料庫:內務府奏銷檔案
"""

We can also perform a quick retrieval using what we have just learnt.

pattern = re.compile(r'(?<=時間.).*')

match = pattern.findall(text)
match
['乾隆01年8月']
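If the date is needed in a structured form, the extracted string can be split further with named groups. This is only a sketch: it assumes the era name is followed by a year and month written in Arabic digits, as in the 時間 field above, and it would not handle dates written with Chinese numerals such as 元年五月.

```python
import re

date_text = "乾隆01年8月"  # the value extracted by the lookbehind above

# era: one or more ideographs; year and month: Arabic digits
date_pattern = re.compile(
    r'(?P<era>[\u4e00-\u9fff]+)(?P<year>\d+)年(?P<month>\d+)月'
)

m = date_pattern.match(date_text)
if m is not None:
    era = m.group('era')
    year = int(m.group('year'))    # '01' becomes 1
    month = int(m.group('month'))
    print(era, year, month)        # 乾隆 1 8
```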

Combined with web scraping, which we will learn later, this lets us easily get the information required for text analysis.




Previous Lesson: List Comprehension

Next Lesson: Pandas Text Analysis


Additional information

This notebook is provided for educational purposes only. Feel free to report any issues on GitHub.


Author: Ka Hei, Chow

License: The code in this notebook is licensed under the Creative Commons by Attribution 4.0 license.

Last modified: December 2021




References:

https://github.com/CoreyMSchafer/code_snippets/blob/master/Python-Regular-Expressions/snippets.txt

https://stackoverflow.com/questions/2973436/regex-lookahead-lookbehind-and-atomic-groups

https://stackoverflow.com/questions/43418812/check-whether-a-string-contains-japanese-chinese-characters