Keyword extraction is one of the most popular techniques in Natural Language Processing (NLP) and text analysis. Last time we learnt how to extract keywords from Chinese text using Jieba. This time we will learn how to extract keywords directly from the web using web scraping. This can be achieved with BeautifulSoup, a Python library for pulling data out of HTML and XML files. What is web scraping? Web scraping is an automated process that downloads a web page (fetching) and copies data from it. Examples include copying a table or a list of book titles from a website.

In this lesson, we will download the Chinese blog post 时差播客︱宗教学:信仰,魔法,身份,权力 from 澎湃新闻 (The Paper) and extract keywords from its content. You will also learn how to do the same with any website you want.

IMPORTANT: As mentioned in the instructions, you can click on the icon "open in Colab" to open the script in a Jupyter notebook and run the code. It is highly recommended to follow the tutorials in the correct order.

Set Up Environment 🌲

First, we have to set up our cloud environment in Colab.

Python Library

  • Download Library

We need to download Jieba using pip.

! pip install jieba
Requirement already satisfied: jieba in /usr/local/lib/python3.7/dist-packages (0.42.1)
  • Import Libraries

We will then import Jieba, BeautifulSoup and other libraries we need. 📚

from __future__ import unicode_literals
import sys

# Jieba for tokenization and keyword extraction
import jieba
import jieba.posseg
import jieba.analyse

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from skimage import filters
import time

# Open URL
from urllib.request import urlopen, Request
import ssl

# Web scraping
from bs4 import BeautifulSoup

Google Drive

  • Connect to Google Drive

To access resources in your own Google Drive, we need to grant Colab permission by running the following code.

from google.colab import drive
drive.mount("/content/drive/")
Mounted at /content/drive/

Download Resources using wget

In this lesson, there are two materials we need to download from the web. The first one is the Chinese font which we need to display characters in the plot. The second one is a list of Chinese stopwords which we need for tokenization. We can access both of them using wget.

  • Download Chinese Font
!wget -O TaipeiSansTCBeta-Regular.ttf

# after download, we have to add the font into the plotting library
# we need matplotlib.font_manager for that
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.font_manager import fontManager

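# register the downloaded font file, then set it as the default font
fontManager.addfont('TaipeiSansTCBeta-Regular.ttf')
mpl.rc('font', family='Taipei Sans TC Beta')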
  • Download Chinese Stopword List

From the GitHub link, we can download the list of stopwords.

! wget -P /content/drive/MyDrive/

Web Scraping Basics 🌱

We have learnt how to extract keywords from strings. What if, this time, we do not want to copy the whole text by hand but get it directly from the web? This can be done by scraping the text from a URL using BeautifulSoup 🥣.

BeautifulSoup parses HTML code into Python objects. Parsing is the process of converting a string of data from one format into another. To start with, we assign the web address to the variable url, open it using urllib.request and parse the downloaded code with "html.parser". What we get at the end is the soup.

To extract only the text for our analysis, we first remove the script and style elements using extract() and then strip the remaining HTML tags using get_text(). To further exclude irrelevant text such as the headers, we can select only the blog content by specifying its class using find(). To find out the class, we can go to the webpage, open the developer tools and use the inspector to click on the blog content. We will then find that the class we need is called "newsdetail_content".
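The parsing steps described above can be tried offline on a small HTML string before touching a real page. The sketch below assumes a page whose body text sits in a div with the class "newsdetail_content" (the class named above); the HTML itself is invented for illustration.

```python
from bs4 import BeautifulSoup

# A tiny, invented page that imitates the structure described above
html = """
<html>
  <head><title>Demo</title><style>p {color: red;}</style></head>
  <body>
    <div class="header">Site header</div>
    <div class="newsdetail_content"><p>宗教学是一门学科。</p></div>
    <script>console.log("tracking");</script>
  </body>
</html>
"""

# Parse the HTML string into a tree of Python objects
soup = BeautifulSoup(html, features="html.parser")

# Remove <script> and <style> elements so their code does not end up in the text
for element in soup(["script", "style"]):
    element.extract()

# Text of the whole page: still contains the header
full_text = soup.get_text()

# Text of the content block only
content_text = soup.find("div", {"class": "newsdetail_content"}).get_text()

print(content_text)
```

Comparing full_text with content_text shows why the class selection matters: only the second one is free of the header.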

To guide you through each step of the process, we will first get the text without filtering the data.

Step 1: Get the HTML without parsing.

First, we will only read the HTML code from the URL. The result includes not only text but also HTML code, and the Chinese characters are displayed as UTF-8 encoded bytes.

url = ""

# this line is needed to avoid running into HTTP error 403 (access denial because of security)
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
html = urlopen(req).read()
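As a small offline illustration of Step 1, the sketch below builds the same kind of request without sending it and shows why the downloaded page looks unreadable at first: read() returns bytes, and the Chinese characters only become readable after decoding. The URL is a placeholder, and the sample bytes stand in for a real response.

```python
from urllib.request import Request

# Placeholder address (an assumption for illustration); nothing is downloaded here
demo_url = "https://example.com/article"

# Attach a browser-like User-Agent header, as in the step above
req = Request(demo_url, headers={"User-Agent": "Mozilla/5.0"})
user_agent = req.get_header("User-agent")  # urllib stores header names capitalized like this

# urlopen(req).read() would return bytes; these sample bytes stand in for a response
raw = "<p>宗教学</p>".encode("utf-8")
decoded = raw.decode("utf-8")

print(user_agent)
print(raw)      # escaped UTF-8 bytes
print(decoded)  # readable text
```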

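Step 2: Parse the HTML

Next, we pass the HTML to BeautifulSoup with "html.parser" to get the soup.

soup = BeautifulSoup(html, features="html.parser")

Step 3: Get the text with tag removal and without class selection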

To clean it up, we remove the script and style elements using extract() and pull out the text using get_text(). Now things look much better! Nonetheless, we still need to remove the header text.

for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()




2021-07-30 12:29

Step 4: Get the text with tag removal and class selection

We can remove all irrelevant sections of the website by specifying the class. We filter for a specific class using find(). The class is identified using the developer tools in the browser. Please note that find() returns only the first element of this class. Another option is find_all(), which returns a list of all matches. The output is the clean text we were expecting. 🌟

text = soup.find("div", {"class": "newsdetail_content"}).get_text()
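The difference between find() and find_all() is easiest to see on a page where the class occurs more than once. The following sketch uses invented HTML with two such blocks. It also shows that find() returns None when nothing matches, which is worth checking before calling get_text().

```python
from bs4 import BeautifulSoup

# Invented HTML with two blocks that share one class
html = """
<div class="newsdetail_content">First block</div>
<div class="newsdetail_content">Second block</div>
"""
soup = BeautifulSoup(html, features="html.parser")

# find() returns only the first matching element (or None if nothing matches)
first = soup.find("div", {"class": "newsdetail_content"})

# find_all() returns every matching element in document order
blocks = soup.find_all("div", {"class": "newsdetail_content"})
all_text = "\n".join(block.get_text() for block in blocks)

# A class that does not exist on the page yields None, so check before calling get_text()
missing = soup.find("div", {"class": "no_such_class"})
missing_text = missing.get_text() if missing is not None else ""

print(first.get_text())
print(all_text)
```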


Step 5: Keyword Extraction

The final step is exactly what we did in the last tutorial! We define the stopwords and extract 10 keywords from the text.

stopwords= r"/content/drive/MyDrive/NLP/stopwords.txt"
url = ""

# exclude stopwords
jieba.analyse.set_stop_words(stopwords)

tags = jieba.analyse.extract_tags(text, topK=10, withWeight=True)


Let's look at our result.

[('宗教', 0.2162035113456979),
 ('研究', 0.1056884665277731),
 ('宗教学', 0.07345268363971678),
 ('基督教', 0.061966732508025965),
 ('女性', 0.05076592702813553),
 ('神学院', 0.04378034740564733),
 ('天主教', 0.040147481422144304),
 ('传统', 0.033391922322156105),
 ('现在', 0.032659074433310856),
 ('社会', 0.031786899511790284)]
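The weights in the output above come from TF-IDF ranking, the method behind jieba.analyse.extract_tags. The sketch below illustrates the idea in plain Python on a toy text that has already been segmented. The IDF values, the stopword set and the function name rank_keywords are made up for illustration and do not reproduce Jieba's internal dictionary or its exact numbers.

```python
from collections import Counter

def rank_keywords(tokens, idf, stopwords, top_k=3, default_idf=1.0):
    """Rank tokens by term frequency times inverse document frequency (toy version)."""
    # Drop stopwords and single-character tokens before counting
    kept = [t for t in tokens if t not in stopwords and len(t) > 1]
    counts = Counter(kept)
    total = sum(counts.values())
    # Weight = relative frequency in this text * how rare the word is in general
    weights = {w: (c / total) * idf.get(w, default_idf) for w, c in counts.items()}
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)[:top_k]

# Toy input: a text that has already been split into words
tokens = ["宗教", "研究", "的", "宗教", "社会", "的", "宗教", "研究", "现在"]
toy_idf = {"宗教": 6.0, "研究": 4.0, "社会": 3.0, "现在": 2.0}  # invented values
toy_stopwords = {"的", "现在"}

ranking = rank_keywords(tokens, toy_idf, toy_stopwords)
print(ranking)
```

A word scores highly when it is frequent in this text and rare elsewhere, which is why a stopword list helps: very common words are removed before they can compete.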

Writing a Function

To simplify the steps, we can condense everything into a short function. If you do not yet know how to write a function, it is worth reading up on that first. This function takes a URL, the number of keywords and the path to a stopword list. It then returns the keywords as a list.

def extract_keywords(url, n, stopwords, withWeight=False):
  """
  This function extracts a number of keywords from a webpage after excluding the stopwords
  url: str
    the webpage
  n: int
    number of keywords extracted
  stopwords: str
    a path to the stopword text file
  withWeight: bool
    whether to return the weight of each keyword
  returns: list
    list of keywords extracted from the webpage
  """
  req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
  html = urlopen(req).read()
  soup = BeautifulSoup(html, features="html.parser")

  # kill all script and style elements
  for script in soup(["script", "style"]):
    script.extract()    # rip it out

  # get text
  text = soup.find("div", {"class": "newsdetail_content"}).get_text()

  # exclude stopwords
  jieba.analyse.set_stop_words(stopwords)

  # get keywords
  tags = jieba.analyse.extract_tags(text, topK=n, withWeight=withWeight)
  return tags

By applying the function, we get a list of 10 keywords: '研究', '诗歌', '中国', '蔡宗齐', '学者', '澎湃', '文学', '诗境', '语法', '汉诗'

stopwords= r"/content/drive/MyDrive/NLP/stopwords.txt"
n = 10
url = ""

extract_keywords(url=url, n=n, stopwords=stopwords)

['研究', '诗歌', '中国', '蔡宗齐', '学者', '澎湃', '文学', '诗境', '语法', '汉诗']

Try it out yourself 🧐

def extract_keywords_general(url, n, stopwords, withWeight=False):
  """
  This function extracts a number of keywords from a webpage after excluding the stopwords
  url: str
    the webpage
  n: int
    number of keywords extracted
  stopwords: str
    a path to the stopword text file
  withWeight: bool
    whether to return the weight of each keyword
  returns: list
    list of keywords extracted from the webpage
  """
  req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
  html = urlopen(req).read()
  soup = BeautifulSoup(html, features="html.parser")

  # kill all script and style elements
  for script in soup(["script", "style"]):
    script.extract()    # rip it out

  # get text of the whole page
  text = soup.get_text()

  # exclude stopwords
  jieba.analyse.set_stop_words(stopwords)

  # get keywords
  tags = jieba.analyse.extract_tags(text, topK=n, withWeight=withWeight)
  return tags
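The only difference between the two functions is whether the text comes from one element or from the whole page. One possible way to merge them, sketched below under that assumption, is a helper that takes already downloaded HTML and an optional class name, and falls back to the full page text when the class is missing. The helper name page_text and the sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup

def page_text(html, content_class=None):
    """Return the text of one div (if content_class is given and found) or of the whole page."""
    soup = BeautifulSoup(html, features="html.parser")

    # Remove script and style elements before collecting text
    for element in soup(["script", "style"]):
        element.extract()

    if content_class is not None:
        block = soup.find("div", {"class": content_class})
        if block is not None:
            return block.get_text(strip=True)

    # No class requested, or the class was not found: use the full page
    return soup.get_text(separator=" ", strip=True)

sample = '<div class="menu">Home</div><div class="newsdetail_content">诗歌研究</div>'

only_content = page_text(sample, content_class="newsdetail_content")
whole_page = page_text(sample)
fallback = page_text(sample, content_class="no_such_class")

print(only_content)
print(whole_page)
```

The returned text can then be passed to jieba.analyse.extract_tags in the same way as in the functions above.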

🤔 Put a URL here ⬇️

stopwords= r"/content/drive/MyDrive/NLP/stopwords.txt"

myKeywords = extract_keywords_general(url="",n=10,stopwords=stopwords) # Put in any URL you want
['資料', '顯示', '來源', '文獻', '字體', '中國', '圖書館', '這些', '本站', '算經']

Extract Keywords from Multiple Blogs

Until now, we could only extract keywords from one text at a time. To automate this further, we can loop through multiple articles. If you do not yet know how to write a loop, it is worth reading up on that first. Please note: because we visit the webpages with Python, the server might be overloaded by too many requests in a very short time. To avoid potential errors, we catch them using try and except, and pause between requests using time.sleep() from the time library. We let the program sleep for 5 seconds after scraping each web address.

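stopwords= r"/content/drive/MyDrive/NLP/stopwords.txt"
n = 10
keyword_list = []

for page in range(10000000,10000010):
  url = "{}".format(page)

  try:
    keywords = extract_keywords(url=url,n=n,stopwords=stopwords)
    keyword_list.append(keywords)
  except Exception:
    # skip pages that cannot be downloaded or parsed
    print("Could not extract keywords from", url)

  # wait 5 seconds before the next request
  time.sleep(5)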
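The pattern of the loop (try the request, record the result, skip failures, then pause) can be rehearsed without any network access. In the sketch below, fake_extract is a stand-in invented for this example that fails on one page number, and the pause is shortened so the demo finishes quickly. In the real loop the pause would be the 5 seconds mentioned above.

```python
import time

def fake_extract(page):
    """Stand-in for extract_keywords(): fails for one page to imitate a broken link."""
    if page == 3:
        raise ValueError("page {} could not be parsed".format(page))
    return ["keyword{}a".format(page), "keyword{}b".format(page)]

pause_seconds = 0.01   # shortened for the demo
results = []
failed_pages = []

for page in range(1, 5):
    try:
        results.append(fake_extract(page))
    except Exception as error:
        # Record the failure and carry on with the next page
        failed_pages.append(page)
        print("Skipped page", page, "->", error)
    # Pause after every request, successful or not
    time.sleep(pause_seconds)

print(results)
```

Keeping a list of failed pages makes it easy to retry them later instead of losing them silently.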

Export Keyword List

We appended the keywords of each article to the list keyword_list while looping through the articles. Now we can access all keywords by looking into our list.

[['拔萝卜', '萝卜', '采摘', '孩子', '菜园', '收获', '周末版', '实践', '热爱劳动', '终觉'],
 ['绿色', '循环', '湖南省', '10', '环保', '生态', '活动', '兑换', '垃圾', '分类'],
 ['吴邮邮', '三轮车', '孩子', '小孙子', '事迹', '归仁', '2019', '12', '高文娟', '微信'],
 ['浙大', '浙江大学', '新人', '记者团', '缘定', '星河', '母校', '2020', '123', '李兰娟'],
 ['济南', '公安', '报告', '原文', '交警', '标题', '阅读'],
 ['年会', '环保', '湖南省', '社会', '行动者', '2020', '生态', '组织', '绿色', '环境治理'],
 ['斩肉', '海安', '炸制', '白斩', '炖煮', '--', '葱姜', '猴急', '麻虾', '黄毛'],
 ['高杰', '执法', '学法', '公安', '公安机关', '多面手', '全市', '法治', '复议', '民警'],
 ['栗子', '好钰', '炒栗子', '虹口', '小虹', '海宁路', '好好', '00', '野栗', '板栗'],
 ['海安', '博物馆', '陶瓷', '鸣谢', '匠心独运', '美轮美奂', '林裕翔', '邰颖', '喜欢', '光辉灿烂']]

We can also put the list in a Pandas DataFrame and export it to a CSV file. If you want to learn more about Pandas, check it out here.

df = pd.DataFrame(keyword_list)

Save each keyword into separate columns as strings.

df = df.fillna(value=np.nan).astype(str)
0 1 2 3 4 5 6 7 8 9
0 拔萝卜 萝卜 采摘 孩子 菜园 收获 周末版 实践 热爱劳动 终觉
1 绿色 循环 湖南省 10 环保 生态 活动 兑换 垃圾 分类
2 吴邮邮 三轮车 孩子 小孙子 事迹 归仁 2019 12 高文娟 微信
3 浙大 浙江大学 新人 记者团 缘定 星河 母校 2020 123 李兰娟
4 济南 公安 报告 原文 交警 标题 阅读 nan nan nan
5 年会 环保 湖南省 社会 行动者 2020 生态 组织 绿色 环境治理
6 斩肉 海安 炸制 白斩 炖煮 -- 葱姜 猴急 麻虾 黄毛
7 高杰 执法 学法 公安 公安机关 多面手 全市 法治 复议 民警
8 栗子 好钰 炒栗子 虹口 小虹 海宁路 好好 00 野栗 板栗
9 海安 博物馆 陶瓷 鸣谢 匠心独运 美轮美奂 林裕翔 邰颖 喜欢 光辉灿烂

We can also put all keywords of an article together in a single column. This is done by applying join() along axis 1 of our DataFrame.

df_join = pd.DataFrame()
df_join["keywords"] = df.apply(lambda row: ','.join(row.values.astype(str)), axis=1)
0 拔萝卜,萝卜,采摘,孩子,菜园,收获,周末版,实践,热爱劳动,终觉
1 绿色,循环,湖南省,10,环保,生态,活动,兑换,垃圾,分类
2 吴邮邮,三轮车,孩子,小孙子,事迹,归仁,2019,12,高文娟,微信
3 浙大,浙江大学,新人,记者团,缘定,星河,母校,2020,123,李兰娟
4 济南,公安,报告,原文,交警,标题,阅读,nan,nan,nan
5 年会,环保,湖南省,社会,行动者,2020,生态,组织,绿色,环境治理
6 斩肉,海安,炸制,白斩,炖煮,--,葱姜,猴急,麻虾,黄毛
7 高杰,执法,学法,公安,公安机关,多面手,全市,法治,复议,民警
8 栗子,好钰,炒栗子,虹口,小虹,海宁路,好好,00,野栗,板栗
9 海安,博物馆,陶瓷,鸣谢,匠心独运,美轮美奂,林裕翔,邰颖,喜欢,光辉灿烂
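Row 4 above ends in nan,nan,nan because that article yielded fewer than ten keywords and the empty cells were converted to the string "nan". If that padding is unwanted, one option is to join each keyword list before it goes into the DataFrame. The sketch uses a shortened sample of keyword_list for illustration.

```python
import pandas as pd

# Shortened sample: the second article has fewer keywords than the first
keyword_list = [
    ["拔萝卜", "萝卜", "采摘"],
    ["济南", "公安"],
]

# Join each list on its own, so rows of different lengths need no padding
df_join = pd.DataFrame({"keywords": [",".join(keywords) for keywords in keyword_list]})

print(df_join)
```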

To download the data frame as a CSV file, we can use to_csv() and files.download() from google.colab.

from google.colab import files
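A minimal sketch of the export step, assuming the file name keywords.csv (hypothetical; the original file name is not shown). to_csv() writes the file to the working directory, and files.download() from google.colab then sends it to the browser. The download call is wrapped in a try block so that the sketch also runs outside Colab.

```python
import pandas as pd

df = pd.DataFrame({"keywords": ["拔萝卜,萝卜,采摘", "济南,公安"]})

# Hypothetical file name; utf-8-sig adds a byte order mark, which helps Excel show Chinese text
csv_path = "keywords.csv"
df.to_csv(csv_path, index=False, encoding="utf-8-sig")

try:
    # Only available inside Google Colab
    from google.colab import files
    files.download(csv_path)
except ImportError:
    print("Not running in Colab; the file was saved as", csv_path)

# Read the file back to confirm the content survived the round trip
check = pd.read_csv(csv_path, encoding="utf-8-sig")
print(check)
```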


🎉 Great!

We have just learnt how to extract keywords from a webpage using Jieba and BeautifulSoup. The web scraping techniques we used are only the basics needed to work with a simple webpage. To better understand the potential of BeautifulSoup, I recommend searching for further BeautifulSoup tutorials on YouTube.

Next time we will learn how to perform some basic data visualization based on the extracted keywords. Stay tuned!

Additional information

This notebook is provided for educational purposes. Feel free to report any issues on GitHub.

Author: Ka Hei, Chow

License: The code in this notebook is licensed under the Creative Commons Attribution 4.0 license.

Last modified: February 2022