Combining Web Scraping with Keyword Extraction
Chinese Keyword Extraction using Jieba (II)
- Set Up Environment 🌲
- Writing a Function
- 🤔 Put a URL here ⬇️
- 🎉 Great!
Keyword extraction is one of the most popular techniques in Natural Language Processing (NLP) and text analysis. Last time we learnt how to extract keywords from Chinese text using Jieba; this time we will learn how to extract keywords directly from the web using web scraping techniques. This can be achieved with BeautifulSoup, a Python library for pulling data out of HTML and XML files. What is web scraping? Web scraping is an automated process that downloads a page (fetching) and copies data from it, for example a table or a list of book titles from a website.
In this lesson, we will download the Chinese blog 时差播客︱宗教学:信仰,魔法,身份,权力 from 澎湃新闻 and extract keywords from the content. You will also learn how to do it with any website you want.
IMPORTANT: As mentioned in the instructions, you can click on the "Open in Colab" icon to open the script in a Jupyter notebook and run the code. It is highly recommended to follow the tutorials in order.
Set Up Environment 🌲
First, we have to set up our cloud environment in Colab.
! pip install jieba
- Import Libraries
We will then import Jieba, BeautifulSoup and other libraries we need. 📚
from __future__ import unicode_literals
import sys
sys.path.append("../")
# Jieba for tokenization and keyword extraction
import jieba
import jieba.posseg
import jieba.analyse
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from skimage import filters
import time
# Open URL
from urllib.request import urlopen, Request
import ssl
# Web scraping
from bs4 import BeautifulSoup
from google.colab import drive
drive.mount('/content/drive/')
Download Resources using wget
In this lesson, there are two materials we need to download from the web. The first one is the Chinese font which we need to display characters in the plot. The second one is a list of Chinese stopwords which we need for tokenization. We can access both of them using wget.
- Download Chinese Font
!wget -O TaipeiSansTCBeta-Regular.ttf "https://drive.google.com/uc?id=1eGAsTN1HBpJAkeVM57_C7ccp7hbgSz3_&export=download"
# after download, we have to add the font into the plotting library
# we need matplotlib.font_manager for that
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.font_manager import fontManager
fontManager.addfont('TaipeiSansTCBeta-Regular.ttf')
mpl.rc('font', family='Taipei Sans TC Beta')
- Download Chinese Stopword List
From the GitHub repository stopwords-iso, we can access the list of stopwords. Note that we need the raw file, not the HTML page of the repository:
! wget https://raw.githubusercontent.com/stopwords-iso/stopwords-zh/master/stopwords-zh.txt -P /content/drive/MyDrive/
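A stopword list is simply a plain text file with one word per line, encoded in UTF-8. The sketch below previews such a list; it creates a tiny sample file (the path and entries are invented for illustration) so the snippet runs anywhere, but in Colab you would point `path` at the downloaded stopwords-zh.txt instead.

```python
# Write a tiny sample stopword file (one word per line, UTF-8)
path = "stopwords_sample.txt"
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(["的", "了", "和", "是", "在"]))

# Read the list back and preview the first entries
with open(path, encoding="utf-8") as f:
    stopwords_preview = [line.strip() for line in f if line.strip()]

print(stopwords_preview[:3])  # → ['的', '了', '和']
```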
Web Scraping Basics 🌱
We have learnt how to extract keywords from strings. What if this time we do not want to copy the whole text, but get the text directly from the web? This can be done by scraping the text from URLs using BeautifulSoup 🥣.
The BeautifulSoup function from the library parses HTML code into Python objects. Data parsing is a process in which a string of data is converted from one format to another. To start, we pass the web address to the variable url, then open url using urllib.request and parse the result with "html.parser". We get the soup at the end.
In order to extract only the text for our analysis, we will remove the script and style elements using extract(), followed by get_text(). To further exclude irrelevant text from the headers, we can choose to select only the blog content by specifying the class of the content using find(). To find out the class, we can go to the webpage, open the browser's developer tools and use the inspector to click on the blog content. We will then find out that the class we need is called "newsdetail_content".
To guide you through each step of the process, we will first get the text without filtering the data.
Step 1: Open the URL
url = "https://m.thepaper.cn/newsDetail_forward_13762466"
# this line is needed to avoid running into HTTP error 403 (access denial because of security)
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
html = urlopen(req).read()
print(html[500:1000])
Step 2: Parse the HTML into a soup
soup = BeautifulSoup(html, features="html.parser")
type(soup)
soup_string = str(soup)
print(soup_string[:1500])
Step 3: Get the text without class selection
for script in soup(["script", "style"]):
    script.extract()  # rip it out
# get text
text = soup.get_text()
print(text[500:1500])
Step 4: Get the text with tag removal and class selection
We can remove all irrelevant sections of the website by specifying the class. We can filter for a specific class using find(); the class name is identified using the developer tools in the browser. Please pay attention: find() only returns the first item with this class. Another option is find_all(), which returns a list of all matches. The output is the clean text we were expecting. 🌟
text = soup.find("div", {"class": "newsdetail_content"}).get_text()
text[:1000]
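To see the difference between find() and find_all() without fetching anything, here is a minimal sketch on a small made-up HTML snippet (the divs and their contents are invented for illustration, mirroring the blog's "newsdetail_content" class):

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet with two divs of the same class
html_snippet = """
<div class="newsdetail_content"><p>first post</p></div>
<div class="newsdetail_content"><p>second post</p></div>
"""
soup_demo = BeautifulSoup(html_snippet, features="html.parser")

first = soup_demo.find("div", {"class": "newsdetail_content"})        # first match only
matches = soup_demo.find_all("div", {"class": "newsdetail_content"})  # list of all matches

print(first.get_text())  # → first post
print(len(matches))      # → 2
```

If a page repeats the content class (for example, one div per paragraph), find_all() with a join over the results is the safer choice.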
Step 5: Keyword Extraction
The final step is exactly what we did in the last tutorial: we set the stopwords and extract 10 keywords from the text.
stopwords = r"/content/drive/MyDrive/stopwords-zh.txt"
jieba.analyse.set_stop_words(stopwords)
tags = jieba.analyse.extract_tags(text, topK=10, withWeight=True)
tags
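With withWeight=True, extract_tags returns (keyword, weight) tuples rather than plain strings. The loop below shows how to unpack them, using a hypothetical result list in the same shape (the words and weights are invented for illustration):

```python
# Hypothetical result in the shape returned by extract_tags(..., withWeight=True):
# a list of (keyword, weight) tuples, sorted by descending weight.
tags_demo = [("宗教", 0.42), ("信仰", 0.31), ("权力", 0.18)]

# Print one keyword per line with its weight
for word, weight in tags_demo:
    print(f"{word}\t{weight:.2f}")

# Keep just the words if you only need the keywords themselves
words_only = [word for word, weight in tags_demo]
print(words_only)  # → ['宗教', '信仰', '权力']
```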
Writing a Function
To simplify the steps, we can condense everything into a short function. If you do not yet know how to write a function, check out a Python functions tutorial first. This function takes a URL, the number of keywords and a stopword list, and returns the keywords in a list.
def extract_keywords(url, n, stopwords, withWeight=False):
    """
    Extract a number of keywords from a webpage after excluding the stopwords.
    url: str
        the webpage
    n: int
        number of keywords extracted
    stopwords: str
        path to the stopword text file
    returns: list
        list of keywords extracted from the webpage
    """
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    html = urlopen(req).read()
    soup = BeautifulSoup(html, features="html.parser")
    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()  # rip it out
    # get text from the blog content only
    text = soup.find("div", {"class": "newsdetail_content"}).get_text()
    # exclude stopwords
    jieba.analyse.set_stop_words(stopwords)
    # get keywords
    tags = jieba.analyse.extract_tags(text, topK=n, withWeight=withWeight)
    return tags
By applying the function, we get a list of 10 keywords: '研究', '诗歌', '中国', '蔡宗齐', '学者', '澎湃', '文学', '诗境', '语法', '汉诗'
stopwords = r"/content/drive/MyDrive/stopwords-zh.txt"
n = 10
url = "https://m.thepaper.cn/newsDetail_forward_16254733"
extract_keywords(url=url, n=n, stopwords=stopwords)
The class "newsdetail_content" is specific to 澎湃新闻. To work with any website, we can write a more general version of the function that takes the text of the whole page instead of selecting a class:
def extract_keywords_general(url, n, stopwords, withWeight=False):
    """
    Extract a number of keywords from a webpage after excluding the stopwords.
    url: str
        the webpage
    n: int
        number of keywords extracted
    stopwords: str
        path to the stopword text file
    returns: list
        list of keywords extracted from the webpage
    """
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    html = urlopen(req).read()
    soup = BeautifulSoup(html, features="html.parser")
    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()  # rip it out
    # get text from the whole page (no class selection)
    text = soup.get_text()
    # exclude stopwords
    jieba.analyse.set_stop_words(stopwords)
    # get keywords
    tags = jieba.analyse.extract_tags(text, topK=n, withWeight=withWeight)
    return tags
stopwords = r"/content/drive/MyDrive/stopwords-zh.txt"
myKeywords = extract_keywords_general(url="https://ctext.org/zh", n=10, stopwords=stopwords)  # Put in any URL you want
print(myKeywords)
Extract Keywords from Multiple Blogs
Until now, we can only extract keywords from one text at a time. To further automate what we did, we can loop through multiple articles. If you do not yet know how to build a loop, check out a Python loops tutorial first. Please pay attention: as we are going through the webpages with Python, the server might be overloaded by too many requests in a very short time. To avoid potential errors, we can catch them using try and except, and pause between requests with time.sleep() from the time library. We will let the program sleep for 5 seconds before scraping each web address.
stopwords = r"/content/drive/MyDrive/stopwords-zh.txt"
n = 10
keyword_list = []
for page in range(10000000, 10000010):
    url = "https://m.thepaper.cn/newsDetail_forward_{}".format(page)
    print(url)
    time.sleep(5)
    try:
        keywords = extract_keywords(url=url, n=n, stopwords=stopwords)
    except Exception:
        print("Interrupted")
        continue  # skip pages that fail, instead of re-appending old keywords
    keyword_list.append(keywords)
keyword_list
We can also put the list in a Pandas data frame and export it to a csv file. If you want to learn more, check out the Pandas documentation.
df = pd.DataFrame(keyword_list)
Save each keyword into separate columns as strings.
df = df.fillna(value=np.nan).astype(str)
df
We can also put all keywords together in a single column. This is done by applying join() along axis 1 of our DataFrame.
df_join = pd.DataFrame()
df_join["keywords"] = df.apply(lambda row: ','.join(row.values.astype(str)), axis=1)
df_join
To download the data frame as a csv, we can use to_csv() and files.download().
from google.colab import files
df_join.to_csv('keywords.csv')
files.download('keywords.csv')
🎉 Great!
We have just learnt how to extract keywords from a webpage using Jieba and BeautifulSoup. The web scraping techniques we used cover only the basics for working with a simple webpage. To better understand the potential of BeautifulSoup, I recommend searching for further BeautifulSoup tutorials on YouTube.
Next time we will learn how to perform some basic data visualization based on the extracted keywords. Stay tuned!
Additional information
This notebook is provided for educational purposes; feel free to report any issues on GitHub.
Author: Ka Hei, Chow
License: The code in this notebook is licensed under the Creative Commons Attribution 4.0 license.
Last modified: February 2022