Keyword extraction is one of the most popular techniques in Natural Language Processing (NLP) and text analysis. Last time we learnt how to extract keywords from Chinese text using Jieba; this time we will learn how to extract keywords directly from the web using web scraping. We will do this with BeautifulSoup, a Python library for pulling data out of HTML and XML files. What is web scraping? Web scraping is an automated process of downloading web pages (fetching) and extracting data from them, for example copying a table or a list of book titles from a website.

In this lesson, we will download the Chinese blog post 时差播客︱宗教学:信仰,魔法,身份,权力 from 澎湃新闻 and extract keywords from its content. You will also learn how to do the same with any website you want.

IMPORTANT: As mentioned in the instructions, you can click on the "Open in Colab" icon to open the script as a Jupyter notebook and run the code. It is highly recommended to follow the tutorials in order.

Set Up Environment 🌲

First, we have to set up our cloud environment in Colab.

Python Library

  • Download Library

We need to install Jieba using pip.

! pip install jieba
Requirement already satisfied: jieba in /usr/local/lib/python3.7/dist-packages (0.42.1)
  • Import Libraries

We will then import Jieba, BeautifulSoup and other libraries we need. 📚

from __future__ import unicode_literals
import sys
sys.path.append("../")

# Jieba for tokenization and keyword extraction
import jieba
import jieba.posseg
import jieba.analyse

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from skimage import filters
import time

# Open URL
from urllib.request import urlopen, Request
import ssl

# Web scraping
from bs4 import BeautifulSoup

Google Drive

  • Connect to Google Drive

To access resources in your own Google Drive, we need to grant permission by running the following code.

from google.colab import drive
drive.mount('/content/drive/')
Mounted at /content/drive/

Download Resources using wget

In this lesson, there are two resources we need to download from the web. The first is a Chinese font, which we need to display Chinese characters in plots. The second is a list of Chinese stopwords, which we need to filter out common words during keyword extraction. We can download both of them using wget.

  • Download Chinese Font
!wget -O TaipeiSansTCBeta-Regular.ttf "https://drive.google.com/uc?id=1eGAsTN1HBpJAkeVM57_C7ccp7hbgSz3_&export=download"

# after download, we have to add the font into the plotting library
# we need matplotlib.font_manager for that
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.font_manager import fontManager

fontManager.addfont('TaipeiSansTCBeta-Regular.ttf')
mpl.rc('font', family='Taipei Sans TC Beta')
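To check that the font registration worked, you can render a quick test plot with a Chinese title. This is a minimal sketch; if the setup is correct, the characters appear instead of empty boxes.

# quick sanity check: the Chinese characters should render instead of empty boxes
fig, ax = plt.subplots(figsize=(4, 2))
ax.text(0.5, 0.5, "中文字体测试", ha="center", va="center", fontsize=16)
ax.axis("off")
plt.show()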
  • Download Chinese Stopword List

From the GitHub link, we can download the list of stopwords. Note that the cells further below point to a stopword file stored in the author's own Drive (/content/drive/MyDrive/NLP/stopwords.txt); you can equally point them to the stopwords-zh.txt file downloaded here.

! wget https://raw.githubusercontent.com/stopwords-iso/stopwords-zh/master/stopwords-zh.txt -P /content/drive/MyDrive/
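Assuming the download succeeded and the file landed at /content/drive/MyDrive/stopwords-zh.txt, we can peek at the first few entries to confirm it really is the stopword list:

# print the first ten stopwords as a quick check of the downloaded file
with open("/content/drive/MyDrive/stopwords-zh.txt", encoding="utf-8") as f:
    print(f.read().split()[:10])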

Web Scraping Basics 🌱

We have learnt how to extract keywords from strings. What if, this time, we do not want to copy and paste the whole text, but instead get the text directly from the web? This can be done by scraping the text from a URL using BeautifulSoup 🥣.

The BeautifulSoup class from the library parses HTML code into Python objects. Data parsing is the process of converting data from one format to another. To start, we assign the web address to the variable url, open it using urllib.request and parse the result with "html.parser". We end up with the soup.

In order to extract only the text for our analysis, we will remove the script and style elements using extract(), followed by get_text(). To further exclude irrelevant text such as the page headers, we can select only the blog content by specifying its class with find(). To find the class, we can go to the webpage, open the browser's developer tools and use the inspector to click on the blog content. We will find that the class we need is called "newsdetail_content".

To guide you through each step of the process, we will first get the text without filtering the data.

Step 1: Get the HTML without parsing.

First, we will simply read the raw HTML from the URL. The result contains not only text but also HTML code, and all the Chinese characters are shown as UTF-8 byte sequences.

url = "https://m.thepaper.cn/newsDetail_forward_13762466"

# this line is needed to avoid running into HTTP error 403 (access denial because of security)
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
html = urlopen(req).read()

print(html[500:1000])
b'-touch-fullscreen"/>\n<meta name="Keywords" content="\xe6\xbe\x8e\xe6\xb9\x83\xef\xbc\x8cPaper\xef\xbc\x8cThe Paper\xef\xbc\x8c\xe7\x83\xad\xe9\x97\xae\xe7\xad\x94\xef\xbc\x8c\xe6\x96\xb0\xe9\x97\xbb\xe8\xb7\x9f\xe8\xb8\xaa\xef\xbc\x8c\xe6\x94\xbf\xe6\xb2\xbb\xef\xbc\x8c\xe6\x97\xb6\xe6\x94\xbf\xef\xbc\x8c\xe6\x94\xbf\xe7\xbb\x8f\xef\xbc\x8c\xe6\xbe\x8e\xe6\xb9\x83\xe6\x96\xb0\xe9\x97\xbb\xef\xbc\x8c\xe6\x96\xb0\xe9\x97\xbb\xef\xbc\x8c\xe6\x80\x9d\xe6\x83\xb3\xef\xbc\x8c\xe5\x8e\x9f\xe5\x88\x9b\xe6\x96\xb0\xe9\x97\xbb\xef\xbc\x8c\xe7\xaa\x81\xe5\x8f\x91\xe6\x96\xb0\xe9\x97\xbb\xef\xbc\x8c\xe7\x8b\xac\xe5\xae\xb6\xe6\x8a\xa5\xe9\x81\x93\xef\xbc\x8c\xe4\xb8\x8a\xe6\xb5\xb7\xe6\x8a\xa5\xe4\xb8\x9a\xef\xbc\x8c\xe4\xb8\x9c\xe6\x96\xb9\xe6\x97\xa9\xe6\x8a\xa5\xef\xbc\x8c\xe4\xb8\x9c\xe6\x96\xb9\xe6\x8a\xa5\xe4\xb8\x9a\xef\xbc\x8c\xe4\xb8\x8a\xe6\xb5\xb7\xe4\xb8\x9c\xe6\x96\xb9\xe6\x8a\xa5\xe4\xb8\x9a" />\n<meta name="Description" content="\xe6\xbe\x8e\xe6\xb9\x83\xef\xbc\x8c\xe6\xbe\x8e\xe6\xb9\x83\xe6\x96\xb0\xe9\x97\xbb\xef\xbc\x8c\xe6\xbe\x8e\xe6\xb9\x83\xe6\x96\xb0\xe9\x97\xbb\xe7\xbd\x91\xef\xbc\x8c\xe6\x96\xb0\xe9\x97\xbb\xe4\xb8\x8e\xe6\x80\x9d\xe6\x83\xb3\xef\xbc\x8c\xe6\xbe\x8e\xe6\xb9\x83\xe6\x98\xaf\xe6\xa4\x8d\xe6\xa0\xb9\xe4\xba\x8e\xe4\xb8\xad\xe5\x9b\xbd\xe4\xb8\x8a\xe6\xb5\xb7\xe7\x9a\x84\xe6\x97\xb6\xe6\x94\xbf\xe6\x80\x9d\xe6\x83\xb3\xe7\xb1\xbb\xe4\xba\x92\xe8\x81\x94\xe7\xbd\x91\xe5\xb9\xb3\xe5\x8f\xb0\xef\xbc\x8c\xe4\xbb\xa5\xe6\x9c\x80\xe6\xb4\xbb\xe8\xb7\x83\xe7\x9a\x84\xe5\x8e\x9f\xe5\x88\x9b\xe6\x96\xb0\xe9\x97\xbb\xe4\xb8\x8e\xe6\x9c\x80\xe5\x86\xb7\xe9\x9d\x99\xe7\x9a\x84\xe6\x80\x9d\xe6\x83\xb3\xe5\x88\x86\xe6\x9e\x90\xe4\xb8\xba\xe4\xb8'
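The \x.. sequences are simply the UTF-8 bytes of the Chinese characters. If you are curious, you can decode the whole byte string and slice it to see the characters themselves; this step is not needed for the rest of the tutorial.

# decode the raw bytes as UTF-8 to reveal the Chinese characters
print(html.decode("utf-8")[500:1000])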

Step 2: Parse the HTML.

After parsing the HTML, the output becomes easier to read. We begin to recognize the Chinese characters, but there is still a lot of markup.

soup = BeautifulSoup(html, features="html.parser")

type(soup)
bs4.BeautifulSoup
soup_string = str(soup)
print(soup_string[:1500])
<!DOCTYPE html>

<html lang="cn">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="zh-CN" http-equiv="content-language"/>
<meta content="initial-scale=1.0,minimum-scale=1.0,maximum-scale=1.0,user-scalable=no,viewport-fit=cover" name="viewport"/>
<meta content="no" name="apple-mobile-web-app-capable"/>
<meta content="black" name="apple-mobile-web-app-status-bar-style">
<meta content="telephone=no" name="format-detection"/>
<meta content="yes" name="apple-touch-fullscreen">
<meta content="澎湃,Paper,The Paper,热问答,新闻跟踪,政治,时政,政经,澎湃新闻,新闻,思想,原创新闻,突发新闻,独家报道,上海报业,东方早报,东方报业,上海东方报业" name="Keywords"/>
<meta content="澎湃,澎湃新闻,澎湃新闻网,新闻与思想,澎湃是植根于中国上海的时政思想类互联网平台,以最活跃的原创新闻与最冷静的思想分析为两翼,是互联网技术创新与新闻价值传承的结合体,致力于问答式新闻与新闻追踪功能的实践。" name="Description"/>
<meta content="max-age=1700" http-equiv="Cache-control"/>
<meta content="on" http-equiv="cleartype"/>
<title>时差播客︱宗教学:信仰,魔法,身份,权力</title>
<link href="https://file.thepaper.cn/wap/v6/css/reset.css?v=2.1.5" rel="stylesheet" type="text/css"/>
<link href="https://file.thepaper.cn/wap/v6/css/swiper-bundle.min.css" rel="stylesheet" type="text/css"/>
<link href="https://file.thepaper.cn/wap/v6/css/base_v6.css?v=2.1.5" rel="stylesheet" type="text/css"/>
<link href="https://file.thepaper.cn/wap/v6/css/homepage_v6.css?v=2.1.5" rel="stylesheet" type="text/css"/>
<link href="https://file.thepaper.cn/wap/v6/css/newsdetail_v6.css?v=2.1.5" rel="stylesheet" type="text/css"/>
<script src="//7b71.t4m.cn/applink.js" type="text/jav
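Besides clicking around with the browser's developer tools, you can also list the class names used by the page's div elements directly from the soup. This is just a sketch for orientation; the exact classes depend on the page layout at the time of scraping.

# collect all class names attached to <div> elements on the page
div_classes = set()
for div in soup.find_all("div", class_=True):
    div_classes.update(div.get("class", []))
print(sorted(div_classes))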

Step 3: Get the text with tag removal but without class selection

To clean it up, we remove the script and style elements using extract() and then call get_text(). Now things look much better! Nonetheless, we still need to remove the header text.

for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()
print(text[500:1500])
私家地理
非常品
楼市
生活方式
澎湃联播
视界
亲子学堂
北京冬奥
汽车圈











思想市场

去APP听


时差播客︱宗教学:信仰,魔法,身份,权力
时差播客

                2021-07-30 12:29 
                
                    
                        来源:澎湃新闻
                    
                    
                
            







            {{newsTimeline.name}}
        







                        {{item.occurrenceDay}}
                    





{{content.occurrenceTime}}

 {{content.name}}

查看详情







全部展开
收起时间线



本期《时差》播客,主持人多伦多大学助理教授郭婷邀请到了来自宾夕法尼亚大学的程晓文教授、香港大学的李纪教授、弗吉尼亚理工大学的倪湛舸教授以及芝加哥大学神学院的神务硕士、医院宗教师郑利昕,以“宗教学:信仰,魔法,身份,权力”为题展开讨论。宗教并不外在于日常生活,而是弥散在社会、历史、文化、政治中的点滴;宗教学帮助我们反思历史,同时理解今天的世界。本文为时差播客与澎湃新闻合作刊发的文字稿,由澎湃新闻(www.thepaper.cn)记者龚思量整理。观音老母洞郭婷:在今天节目开始之前,我想先表达一下沉痛的悼念,前几天有一位宗教学界的前辈,台大的林富士老师(1960-2021)去世了。我本来并不是研究中国宗教的,也不研究传统的中国文史哲,所以并没有和林老师见过面。但我一直从他的研究中得到灵感,所以非常感谢他。这两天也在脸书上,看到很多他过去的同事和学生对他的纪念。虽然学术界很多时候是一个有失公正的地方,但还是有一些地方让人觉得温暖,就好像点亮了一盏灯,而那盏灯一直会亮下去。这一期我们来谈宗教学,不只是谈学界,也谈它的实践。在座几位虽然是跨学科的研究者或实践者,但也是宗教学出身。那我相信,大家在和别人介绍说自己研究宗教学的时候,通常会听到几个问题:

Step 4: Get the text with tag removal and class selection

We can remove all irrelevant sections of the website by specifying the class. We can filter for a specific class using find(); the class is identified with the developer tools in the browser. Please note: find() returns only the first element with this class. Another option is find_all(), which returns a list of all matches (see the sketch after the output below). The result is the clean text we were expecting. 🌟

text = soup.find("div", {"class": "newsdetail_content"}).get_text()

text[:1000]
'本期《时差》播客,主持人多伦多大学助理教授郭婷邀请到了来自宾夕法尼亚大学的程晓文教授、香港大学的李纪教授、弗吉尼亚理工大学的倪湛舸教授以及芝加哥大学神学院的神务硕士、医院宗教师郑利昕,以“宗教学:信仰,魔法,身份,权力”为题展开讨论。宗教并不外在于日常生活,而是弥散在社会、历史、文化、政治中的点滴;宗教学帮助我们反思历史,同时理解今天的世界。本文为时差播客与澎湃新闻合作刊发的文字稿,由澎湃新闻(www.thepaper.cn)记者龚思量整理。观音老母洞郭婷:在今天节目开始之前,我想先表达一下沉痛的悼念,前几天有一位宗教学界的前辈,台大的林富士老师(1960-2021)去世了。我本来并不是研究中国宗教的,也不研究传统的中国文史哲,所以并没有和林老师见过面。但我一直从他的研究中得到灵感,所以非常感谢他。这两天也在脸书上,看到很多他过去的同事和学生对他的纪念。虽然学术界很多时候是一个有失公正的地方,但还是有一些地方让人觉得温暖,就好像点亮了一盏灯,而那盏灯一直会亮下去。这一期我们来谈宗教学,不只是谈学界,也谈它的实践。在座几位虽然是跨学科的研究者或实践者,但也是宗教学出身。那我相信,大家在和别人介绍说自己研究宗教学的时候,通常会听到几个问题:一个是那你有没有宗教信仰?或者你研究哪一种宗教?以前还会听到的一个问题是,那你毕业之后做什么,是不是准备出家等等。我以前会开玩笑说,对,以后出家给人算命。其实不只是学界之外,包括学界之内,不同学科对宗教学领域都会有一些陌生,因为它确实是一个比较特殊的学科。就我自己而言,我博士的训练在爱丁堡大学的神学院。爱大神学院作为一个新兴科系,比较有抗争精神和创新精神。它设立之初就是为了和传统的神学或者是和宗教有关的学科对抗,所以它非常讲究世俗化和社会科学方法。我记得大部分宗教系的学者不论男女都打扮得非常不羁。在开会的时候,美国宗教学、尤其是圣经研究的学者尤其男性会打扮得非常闪亮,头发焗过、穿西装、带领带、鞋子都擦得很亮,但是英国宗教系的老师就穿得很随便。而宗教学学科的训练讲究宗教和社会的关系、宗教和当下社会的关系。虽然我当时的研究是从AI人工智能切入,但其实是研究是英国的世俗化的情况。当然,在神学院也会碰到其他科系的同学,比如有旧约研究、新约研究,神学研究,然后也有一些道学博士或者是教牧学的学位。那想请几位聊聊,你们的研究背景是怎么样的,也可以跟'
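As mentioned above, find() returns only the first element with this class. If the content were split over several containers with the same class, you could use find_all() and join the text of all matches; on this page the result should be the same. A small sketch:

# find_all() returns a list of all matching elements; join their text
blocks = soup.find_all("div", {"class": "newsdetail_content"})
text_all = "".join(block.get_text() for block in blocks)
print(len(blocks), len(text_all))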

Step 5: Keyword Extraction

The final step is exactly what we did in the last tutorial: we set the stopword list and extract the top 10 keywords from the text.

stopwords= r"/content/drive/MyDrive/NLP/stopwords.txt"
url = "https://m.thepaper.cn/newsDetail_forward_16254733"

jieba.analyse.set_stop_words(stopwords)
tags = jieba.analyse.extract_tags(text, topK=10, withWeight=True)

Keywords:

Let's look at our result.

tags
[('宗教', 0.2162035113456979),
 ('研究', 0.1056884665277731),
 ('宗教学', 0.07345268363971678),
 ('基督教', 0.061966732508025965),
 ('女性', 0.05076592702813553),
 ('神学院', 0.04378034740564733),
 ('天主教', 0.040147481422144304),
 ('传统', 0.033391922322156105),
 ('现在', 0.032659074433310856),
 ('社会', 0.031786899511790284)]
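extract_tags() ranks keywords by TF-IDF. Jieba also ships a TextRank-based extractor, jieba.analyse.textrank(), which you can run on the same text for comparison; the two methods usually give overlapping but not identical keyword lists.

# alternative ranking with TextRank instead of TF-IDF
tags_textrank = jieba.analyse.textrank(text, topK=10, withWeight=True)
tags_textrank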

Writing a Function

To simplify the steps, we can condense everything into a short function. If you do not know yet how to build a function, check it out. The function takes a URL, the number of keywords and a stopword list, and returns the extracted keywords as a list.

def extract_keywords(url,n,stopwords, withWeight=False):
  """
  This function extracts a number of keywords from a webpage after excluding the stopwords
  url: str
    the web address of the page
  n: int
    number of keywords to extract
  stopwords: str
    path to the stopword text file
  withWeight: bool
    if True, return (keyword, weight) tuples instead of plain keywords
  returns: list
    list of keywords extracted from the webpage
  """
  req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
  html = urlopen(req).read()
  soup = BeautifulSoup(html, features="html.parser")

  # kill all script and style elements
  for script in soup(["script", "style"]):
      script.extract()    # rip it out

  # get text
  text = soup.find("div", {"class": "newsdetail_content"}).get_text()

  # exclude stopwords
  jieba.analyse.set_stop_words(stopwords)

  # get keywords
  tags = jieba.analyse.extract_tags(text, topK=n, withWeight=withWeight)
  return tags

Applying the function to another 澎湃新闻 article, we get a list of 10 keywords: '研究', '诗歌', '中国', '蔡宗齐', '学者', '澎湃', '文学', '诗境', '语法', '汉诗'

stopwords= r"/content/drive/MyDrive/NLP/stopwords.txt"
n = 10
url = "https://m.thepaper.cn/newsDetail_forward_16254733"

extract_keywords(url=url,n=n,stopwords=stopwords)
['研究', '诗歌', '中国', '蔡宗齐', '学者', '澎湃', '文学', '诗境', '语法', '汉诗']
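Because the function passes withWeight straight through to Jieba, you can also ask for the TF-IDF weights alongside the keywords, for example:

# return (keyword, weight) tuples instead of plain strings
extract_keywords(url=url, n=5, stopwords=stopwords, withWeight=True)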

Try it out yourself 🧐

The function below is a more general version of extract_keywords: instead of selecting the "newsdetail_content" class, which only exists on 澎湃新闻 pages, it keeps the full page text, so it works on any webpage (at the cost of including menus and other boilerplate).

def extract_keywords_general(url,n,stopwords, withWeight=False):
  """
  This function extracts a number of keywords from the full text of a webpage after excluding the stopwords
  url: str
    the webpage
  n: int
    number of keywords extracted
  stopwords: str
    a path to the stopword text file
  returns: list
    list of keywords extracted from the webpage
  """
  req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
  html = urlopen(req).read()
  soup = BeautifulSoup(html, features="html.parser")

  # kill all script and style elements
  for script in soup(["script", "style"]):
      script.extract()    # rip it out

  # get the full page text (no class selection, so this works on any webpage)
  text = soup.get_text()

  # exclude stopwords
  jieba.analyse.set_stop_words(stopwords)

  # get keywords
  tags = jieba.analyse.extract_tags(text, topK=n, withWeight=withWeight)
  return tags

🤔 Put a URL here ⬇️

stopwords= r"/content/drive/MyDrive/NLP/stopwords.txt"

myKeywords = extract_keywords_general(url="https://ctext.org/zh",n=10,stopwords=stopwords) # Put in any URL you want
print(myKeywords)
['資料', '顯示', '來源', '文獻', '字體', '中國', '圖書館', '這些', '本站', '算經']

Extract Keywords from Multiple Blogs

Until now, we have only extracted keywords from one text at a time. To automate this further, we can loop through multiple articles. If you do not know yet how to build a loop, check it out. Please pay attention: when we request many webpages from Python in a very short time, we risk overloading the server. To make the loop more robust, we catch errors with try and except, and pause between requests with time.sleep() from the time library. Here we let the program sleep for 5 seconds after scraping each web address.

stopwords= r"/content/drive/MyDrive/NLP/stopwords.txt"
n = 10
keyword_list = []

for page in range(10000000,10000010):
  url = "https://m.thepaper.cn/newsDetail_forward_{}".format(page)
  print(url)
  time.sleep(5)

  try:
    keywords = extract_keywords(url=url,n=n,stopwords=stopwords)
  except Exceptions:
      print("Interrupted")

  keyword_list.append(keywords)
https://m.thepaper.cn/newsDetail_forward_10000000
Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 1.096 seconds.
Prefix dict has been built successfully.
https://m.thepaper.cn/newsDetail_forward_10000001
https://m.thepaper.cn/newsDetail_forward_10000002
https://m.thepaper.cn/newsDetail_forward_10000003
https://m.thepaper.cn/newsDetail_forward_10000004
https://m.thepaper.cn/newsDetail_forward_10000005
https://m.thepaper.cn/newsDetail_forward_10000006
https://m.thepaper.cn/newsDetail_forward_10000007
https://m.thepaper.cn/newsDetail_forward_10000008
https://m.thepaper.cn/newsDetail_forward_10000009

Export Keyword List

We appended the keywords to the list keyword_list while looping through the articles. Now we can access all the keywords by looking at the list.

keyword_list
[['拔萝卜', '萝卜', '采摘', '孩子', '菜园', '收获', '周末版', '实践', '热爱劳动', '终觉'],
 ['绿色', '循环', '湖南省', '10', '环保', '生态', '活动', '兑换', '垃圾', '分类'],
 ['吴邮邮', '三轮车', '孩子', '小孙子', '事迹', '归仁', '2019', '12', '高文娟', '微信'],
 ['浙大', '浙江大学', '新人', '记者团', '缘定', '星河', '母校', '2020', '123', '李兰娟'],
 ['济南', '公安', '报告', '原文', '交警', '标题', '阅读'],
 ['年会', '环保', '湖南省', '社会', '行动者', '2020', '生态', '组织', '绿色', '环境治理'],
 ['斩肉', '海安', '炸制', '白斩', '炖煮', '--', '葱姜', '猴急', '麻虾', '黄毛'],
 ['高杰', '执法', '学法', '公安', '公安机关', '多面手', '全市', '法治', '复议', '民警'],
 ['栗子', '好钰', '炒栗子', '虹口', '小虹', '海宁路', '好好', '00', '野栗', '板栗'],
 ['海安', '博物馆', '陶瓷', '鸣谢', '匠心独运', '美轮美奂', '林裕翔', '邰颖', '喜欢', '光辉灿烂']]

We can also choose to put the list in a Pandas data frame and export it to a csv file. If you want to learn more about Pandas, check it out here.

df = pd.DataFrame(keyword_list)

Each keyword goes into its own column; missing values are filled with NaN and all values are converted to strings.

df = df.fillna(value=np.nan).astype(str)
df
0 1 2 3 4 5 6 7 8 9
0 拔萝卜 萝卜 采摘 孩子 菜园 收获 周末版 实践 热爱劳动 终觉
1 绿色 循环 湖南省 10 环保 生态 活动 兑换 垃圾 分类
2 吴邮邮 三轮车 孩子 小孙子 事迹 归仁 2019 12 高文娟 微信
3 浙大 浙江大学 新人 记者团 缘定 星河 母校 2020 123 李兰娟
4 济南 公安 报告 原文 交警 标题 阅读 nan nan nan
5 年会 环保 湖南省 社会 行动者 2020 生态 组织 绿色 环境治理
6 斩肉 海安 炸制 白斩 炖煮 -- 葱姜 猴急 麻虾 黄毛
7 高杰 执法 学法 公安 公安机关 多面手 全市 法治 复议 民警
8 栗子 好钰 炒栗子 虹口 小虹 海宁路 好好 00 野栗 板栗
9 海安 博物馆 陶瓷 鸣谢 匠心独运 美轮美奂 林裕翔 邰颖 喜欢 光辉灿烂

We can also put all keywords together in a single column. This is done by joining the values of each row with join(), applied along axis 1 of our DataFrame.

df_join = pd.DataFrame()
df_join["keywords"] = df.apply(lambda row: ','.join(row.values.astype(str)), axis=1)
df_join
keywords
0 拔萝卜,萝卜,采摘,孩子,菜园,收获,周末版,实践,热爱劳动,终觉
1 绿色,循环,湖南省,10,环保,生态,活动,兑换,垃圾,分类
2 吴邮邮,三轮车,孩子,小孙子,事迹,归仁,2019,12,高文娟,微信
3 浙大,浙江大学,新人,记者团,缘定,星河,母校,2020,123,李兰娟
4 济南,公安,报告,原文,交警,标题,阅读,nan,nan,nan
5 年会,环保,湖南省,社会,行动者,2020,生态,组织,绿色,环境治理
6 斩肉,海安,炸制,白斩,炖煮,--,葱姜,猴急,麻虾,黄毛
7 高杰,执法,学法,公安,公安机关,多面手,全市,法治,复议,民警
8 栗子,好钰,炒栗子,虹口,小虹,海宁路,好好,00,野栗,板栗
9 海安,博物馆,陶瓷,鸣谢,匠心独运,美轮美奂,林裕翔,邰颖,喜欢,光辉灿烂

To download the data frame as a csv, we can use to_csv() and files.download().

from google.colab import files

df_join.to_csv('keywords.csv')
files.download('keywords.csv')
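Since Google Drive is already mounted, you can alternatively write the csv straight into your Drive instead of downloading it through the browser; the path below is just an example, adjust it to your own folder structure.

# write the keyword table directly into Google Drive (example path)
df_join.to_csv('/content/drive/MyDrive/keywords.csv')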

🎉 Great!

We have just learnt how to extract keywords from a webpage using Jieba and BeautifulSoup. The web scraping techniques we used here are only the basics for working with a simple webpage. To better understand the potential of BeautifulSoup, I recommend searching for BeautifulSoup tutorials on YouTube.

Next time we will learn how to perform some basic data visualization based on the extracted keywords. Stay tuned!




Additional information

This notebook is provided for educational purposes; feel free to report any issues on GitHub.


Author: Ka Hei, Chow

License: The code in this notebook is licensed under the Creative Commons Attribution 4.0 license.

Last modified: February 2022