Regular Expression
Python RegEx
- Review
- Information Retrieval
- Name
- Alternative: Lookaround
- Coordinates
- Chinese Characters
- Additional information
- References:
As mentioned in the instructions, all materials can be opened in Colab as Jupyter notebooks, so that users can run the code in the cloud. It is highly recommended to follow the tutorials in the right order.
This notebook introduces how to use regular expressions in Python to extract useful information from text, which may come from documents or websites.
Prerequisites:
Before starting this tutorial, please watch this video so that you already understand:
1) What is the group method in a regular expression?
2) What is a raw string?
3) How to create a character set?
4) What is the function of quantifiers?
Syntax | Meaning |
---|---|
\b | Word boundary |
\B | Not a word boundary |
^ | Beginning of a string |
$ | End of a string |
[] | Matches characters in brackets |
[^ ] | Matches characters NOT in brackets |
\| | Either or |
( ) | Group |
Quantifiers | Meaning |
---|---|
* | 0 or more |
+ | 1 or more |
? | 0 or one |
{3} | Exact number |
{3,4} | Range of numbers (minimum, maximum) |
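The syntax and quantifiers in the two tables above can be tried out on a short string. The following is a minimal sketch; the sample string and the variable names are made up for illustration and are not part of the original notebook.

```python
import re

# Hypothetical sample text, chosen only to exercise the syntax in the tables
sample = "cat cats concat 2021-12 12345"

# \b marks a word boundary, so only the standalone word "cat" matches
whole_word = re.findall(r'\bcat\b', sample)        # ['cat']

# ? makes the preceding character optional: "cat" or "cats"
optional_s = re.findall(r'\bcats?\b', sample)      # ['cat', 'cats']

# {4} is an exact count, {1,2} is a range (minimum, maximum)
year_month = re.findall(r'\b\d{4}-\d{1,2}\b', sample)  # ['2021-12']

# | means either/or and ( ) groups the alternatives
either = re.findall(r'\b(cat|concat)\b', sample)   # ['cat', 'concat']

# ^ anchors at the beginning of the string, $ at the end; + means 1 or more
at_start = re.findall(r'^cat', sample)             # ['cat']
at_end = re.findall(r'\d+$', sample)               # ['12345']
```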
Information Retrieval
- ## re
Before we analyse any text, the relevant information first needs to be extracted so that all irrelevant information is excluded. Sometimes this is not straightforward, since the text might be mixed with other information, particularly when it is mined from online sources.
Below we can look at an example of an entry extracted from Historical GIS for Japan. The information is spread over multiple rows, with each row giving a different piece of information. If we only need one piece of information, it is easy to copy it from a single entry, but the task becomes challenging once we have thousands of entries. This is why text mining can save us time and effort.
First of all, we have to import the library.
import re
lord_entry = """
name: abemasaharu\n
vernacular name definition kanji: 阿部正春\n
alternate vernacular name definition hiragana: あべまさはる\n
feature type definition feature type: feudal lord 大名 daimyo\n
date range definition date range: 1664 to 1664\n
time slice definition valid as: time slice 年份\n
present location definition present location: 岩槻市 iwatsukishi\n
point id definition point id: jp_dmy_40\n
data source definition data source: JP_CHGIS\n
feature type definition coordinate type: centroid\n
feature type definition latitude: 35.93\n
feature type definition longitude: 139.70\n
admin hierarchy definition admin hierarchy: 武蔵国 musashi no kuni
"""
Name
Here we can try to get the kanji name of the entry.
From what we have learnt, we can use groups: the first group matches kanji: at a word boundary (\b) followed by a whitespace character (\s), and the second group (.*) captures everything after it, regardless of length, up to the end of the line. Using pattern1, the name we need is in the second group.
We will use re.compile() to compile our pattern (faster if the pattern is frequently used), then use findall() to look for all matches.
pattern1 = re.compile(r'(\bkanji:\s)(.*)')
match1 = pattern1.findall(lord_entry) # get all matches
match1 # print them out
We can then access the first element of the list with [0] (there is only one element) and the second element of the tuple with [1].
match1[0][1]
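When only the first match is needed, search() with group() is an alternative to indexing into the list returned by findall(). The sketch below is self-contained and uses a single line copied from the entry above; the variable names are illustrative.

```python
import re

# One line taken from the example entry, repeated here so the cell runs on its own
line = "vernacular name definition kanji: 阿部正春"

pattern = re.compile(r'(\bkanji:\s)(.*)')

# findall() returns a list with one tuple per match, one item per group
all_matches = pattern.findall(line)   # [('kanji: ', '阿部正春')]

# search() returns a match object for the first match, or None if there is none
m = pattern.search(line)
if m is not None:
    label = m.group(1)   # first group: the keyword and the space
    name = m.group(2)    # second group: the name we want
```

Checking for None before calling group() avoids an AttributeError when an entry has no kanji field.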
Alternative: Lookaround
However, we can also use the lookaround syntax supported by re. This means we use kanji: to locate what we are searching for (the text after the keyword), but we do not include kanji: itself in the match because it is not important for us.
Be careful: a space may not be obvious, but it also counts as a character, so we always need to account for it in the pattern.
Given the string foobarbarfoo:
- bar(?=bar) finds the 1st bar ("bar" which has "bar" after it)
- bar(?!bar) finds the 2nd bar ("bar" which does not have "bar" after it)
- (?<=foo)bar finds the 1st bar ("bar" which has "foo" before it)
- (?<!foo)bar finds the 2nd bar ("bar" which does not have "foo" before it)

They can also be combined:
- (?<=foo)bar(?=bar) finds the 1st bar ("bar" with "foo" before it and "bar" after it)
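The lookaround behaviour described above can be checked by looking at where each match starts in the string. This is a small sketch; the helper name starts is hypothetical and only used here.

```python
import re

s = "foobarbarfoo"   # the 1st "bar" starts at index 3, the 2nd at index 6

def starts(pattern):
    """Return the start index of every match of pattern in s."""
    return [m.start() for m in re.finditer(pattern, s)]

lookahead = starts(r'bar(?=bar)')            # [3]: "bar" followed by "bar"
neg_lookahead = starts(r'bar(?!bar)')        # [6]: "bar" not followed by "bar"
lookbehind = starts(r'(?<=foo)bar')          # [3]: "bar" preceded by "foo"
neg_lookbehind = starts(r'(?<!foo)bar')      # [6]: "bar" not preceded by "foo"
combined = starts(r'(?<=foo)bar(?=bar)')     # [3]: both conditions together
```

In every case the matched text is just "bar"; the lookaround parts only test the surroundings and are not included in the match.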
Here we use (?<=text1)text2 to select text2 by identifying text1, where text1 comes immediately before text2 in the text.
pattern2 = re.compile(r'(?<=kanji:\s).*')
match2 = pattern2.findall(lord_entry)
match2
- #### Latitude
lat_pattern = re.compile(r'(?<=latitude:\s).*')
match = lat_pattern.findall(lord_entry)
match
We need to be careful here. Normally when we think of coordinates, we expect a floating-point number. But what we get here (match) is a list of strings. It will cause errors if we later use the list directly in any geospatial operations, so always check the type.
type(match) # it is a list
type(match[0]) # we can get the first item of the list to remove [], now it is a string
We need to further convert the string into float using float().
type(float(match[0]))
lat = float(match[0]) # save the final result to lat
lat
Now we have what we need! Let's do the same for longitude.
- #### Longitude
lon_pattern = re.compile(r'(?<=longitude:\s).*')
match = lon_pattern.findall(lord_entry)
match # list
lon = float(match[0])
lon # float
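Since the latitude and longitude steps are identical apart from the keyword, they could be wrapped in one small function. The sketch below is one possible way to do this; the function name extract_float and the shortened entry are assumptions made for illustration, and the pattern assumes the value is a plain decimal number.

```python
import re

# Shortened copy of the entry so that this cell runs on its own
entry = """
feature type definition latitude: 35.93
feature type definition longitude: 139.70
"""

def extract_float(label, text):
    """Return the number following 'label: ' as a float, or None if it is absent."""
    # The lookbehind must have a fixed width in Python's re, which holds for a fixed label
    pattern = re.compile(r'(?<=' + re.escape(label) + r':\s)[-+]?\d+(?:\.\d+)?')
    m = pattern.search(text)
    return float(m.group()) if m else None

lat = extract_float('latitude', entry)     # 35.93
lon = extract_float('longitude', entry)    # 139.7
missing = extract_float('altitude', entry) # None, there is no such field
```

Matching only digits instead of .* means that trailing text after the number would not break the float() conversion.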
Chinese Characters
Here is another short text, from 韓愈. For Chinese characters, we can use Unicode ranges to select specific types of characters.
The ranges of Unicode characters which are routinely used for Chinese and Japanese text are:
- U+3040 - U+30FF: hiragana and katakana (Japanese only)
- U+3400 - U+4DBF: CJK unified ideographs extension A (Chinese, Japanese, and Korean)
- U+4E00 - U+9FFF: CJK unified ideographs (Chinese, Japanese, and Korean)
- U+F900 - U+FAFF: CJK compatibility ideographs (Chinese, Japanese, and Korean)
- U+FF66 - U+FF9F: half-width katakana (Japanese only)
text = "或問諫議大夫陽城於愈:可以為有道之士乎哉?學廣而聞多,不求聞於人也,行古人之道,居於晉之鄙,晉之鄙人薰其德而善良者幾千人。大臣聞而薦之,天子以為諫議大夫。人皆以為華,陽子不色喜。居於位,五年矣,視其德如在野,彼豈以富貴移易其心哉!"
pattern = re.compile(r'[\u4e00-\u9fff]+')
match = pattern.findall(text)
match
We can also look for every character instead:
pattern = re.compile(r'[\u4e00-\u9fff]')
match = pattern.findall(text)
match[:5] # print first 5 characters only
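Several of the Unicode ranges listed above can be placed in the same character set, which is useful for text that mixes scripts. The following sketch uses a made-up string built from words in the Japanese entry earlier in this notebook; the variable names are illustrative.

```python
import re

# Hypothetical mixed-script string assembled from the earlier entry
mixed = "阿部正春 あべまさはる daimyo 大名"

# CJK unified ideographs only (kanji / Chinese characters)
kanji = re.findall(r'[\u4e00-\u9fff]+', mixed)    # ['阿部正春', '大名']

# Hiragana and katakana only
kana = re.findall(r'[\u3040-\u30ff]+', mixed)     # ['あべまさはる']

# Several ranges in one character set: kana, extension A and unified ideographs
cjk_any = re.findall(r'[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff]+', mixed)
# ['阿部正春', 'あべまさはる', '大名']
```

The Latin word daimyo is skipped in all three cases because its characters fall outside every listed range.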
Here is another example entry, from 清代檔案. Let's say we want to extract the time (時間) from the document.
text = """
撥給各種工匠銀乾隆01年8月
--內務府奏銷檔
第1筆
事由:撥給各種工匠銀
內文:雍正十三年四月起至 乾隆 元年五月給發匠役工價所用大制錢數目
郎中永保等文開恭畫坤寧宮神像需用外僱畫匠畫短工九十五工每工錢一百三十四文領去大制錢十二串七百三三十文
銀庫郎中邁格等據掌儀司郎中謨爾德等文開恭造坤寧宮祭祀所用鏨花銀香碟八個爵盤二個漏子一個格漏一個箸一雙匙三張小碟二十個鍾十一個大碗五個壺一把大小盤二十四個鑲銀裹楠木肉槽四個三鑲烏木箸二雙畫像上用掛釣三分亭子上用銀面葉一分需用外僱鏨花匠大器匠做短工七百九十一工四分五厘每工錢一百三十四文領去大制錢一百六串五十四文
...
時間:乾隆01年8月
官司:
官員:
微捲頁數:173-194
冊數:194
資料庫:內務府奏銷檔案
"""
We can also perform a quick retrieval using what we have just learnt.
pattern = re.compile(r'(?<=時間.).*')
match = pattern.findall(text)
match
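Groups and Unicode ranges can also be combined to split the time field into its parts. This is only a sketch: it assumes the field always has the form reign name, digits, 年, digits, 月, as in this single entry, and the variable names are made up for illustration.

```python
import re

# Two lines copied from the entry above so that this cell runs on its own
record = "時間:乾隆01年8月\n冊數:194"

# Group 1: reign name (Chinese characters), group 2: year, group 3: month
time_pattern = re.compile(r'時間:([\u4e00-\u9fff]+)(\d+)年(\d+)月')

m = time_pattern.search(record)
if m is not None:
    reign = m.group(1)        # '乾隆'
    year = int(m.group(2))    # 1, the leading zero is dropped by int()
    month = int(m.group(3))   # 8
```

Note that the colon after 時間 is the full-width character :, not the ASCII colon, so it has to be typed exactly as it appears in the source text.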
Combined with web scraping, which we will learn later, this lets us easily get the required information for text analysis.
Additional information
This notebook is provided for educational purposes only. Feel free to report any issues on GitHub.
Author: Ka Hei, Chow
License: The code in this notebook is licensed under the Creative Commons Attribution 4.0 license.
Last modified: December 2021
References:
https://github.com/CoreyMSchafer/code_snippets/blob/master/Python-Regular-Expressions/snippets.txt
https://stackoverflow.com/questions/2973436/regex-lookahead-lookbehind-and-atomic-groups