As mentioned in the instructions, all materials can be opened in Colab as Jupyter notebooks, so you can run the code in the cloud. It is highly recommended to follow the tutorials in order.

The importance of Python is often discussed from the perspective of computer science and data science. The research needs of historians, however, are essentially different, and so are the criteria for selecting a digital tool. Here, we explain why Python is still highly relevant for historians.


1) Python can Process Text in any Digital Format 📜

In this digital age, research often requires working with different media formats, regardless of your research interests. Even if you are not working with a large number of data sources, you may need to work with long texts or with sources that are only offered in digital formats. You may also want to publish your research output in a digital format.

Whether you want to write a blog about your projects or need to process text in PDF format, Python is a useful tool to READ, PROCESS, and CONVERT between different digital formats. Sometimes you might download historical data in the Stata Data File Format (.dta); other times what you have is a large CSV file. When you try to do a specific task, you might search for online tools, which are often not free or not convenient (for example, you might need to process files one by one). Learning Python enables you to work with many data formats, including but not limited to txt, XML, csv, xlsx, json, geojson, shp, pdf, and dta. Reading those formats is the first step for everything that comes after, such as cleaning or filtering them.
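As a minimal sketch of this idea, the snippet below reads a small CSV with Pandas; the names and numbers are invented for illustration, and the same `read_*` pattern covers the other formats mentioned above:

```python
import io
import pandas as pd

# A tiny CSV sample standing in for a downloaded historical dataset
# (column names and values are invented for illustration).
raw = io.StringIO("name,birth_year\nWang Duanshu,1621\nGu Taiqing,1799\n")

df = pd.read_csv(raw)   # the same pattern works for read_excel,
                        # read_json, read_stata (.dta), and more

print(df.shape)                # (2, 2)
print(df["birth_year"].min())  # 1621
```

In practice you would pass a file path or URL instead of the in-memory `io.StringIO` object.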

One of the most popular libraries for working with tabular data is Pandas. In contrast to Excel, Pandas lets you work with CSV files with millions of rows with ease. You do not need to manually select cells or handle missing values one at a time. Like Excel, Pandas can also create graphics from a table, but with far more control over the layout and the type of visualization.
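For instance, handling missing values takes one line each way; the table below is a made-up example:

```python
import pandas as pd

# Hypothetical table with gaps, as often found in digitized records.
df = pd.DataFrame({
    "year": [1850, 1851, 1852, 1853],
    "letters_sent": [12, None, 7, None],
})

cleaned = df.dropna(subset=["letters_sent"])  # drop rows with missing counts
filled = df.fillna({"letters_sent": 0})       # or treat missing as zero

print(len(cleaned))                  # 2
print(filled["letters_sent"].sum())  # 19.0
```

Which option is right depends on what a missing value means in your source, a judgment only the historian can make.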

Python has one of the most readable syntaxes among programming languages and one of the largest open-source communities to support and contribute to it. Programming is not an individual task; it relies on resources from others, whether to fix a bug, find a realistic solution, or use libraries or code written by someone else. That is why a large community makes it easier to get the job done.

2) Python gives you Access to Digital Data 🔑

With the growth of digital humanities and digital history, more and more historical sources have been stored and published in digital formats, such as 明清妇女著作 (Ming Qing Women's Writings) and 中国历代人物传记资料库 (China Biographical Database, CBDB). Depending on the website and the nature of the data, the download format varies (.mdb, csv, etc.). Sometimes the data you are looking for is not systematically stored, which might require some web scraping, such as grabbing tables from a webpage (in this case you can use beautifulsoup or selenium). No matter how and in what format your data is stored, you can always access it using Python.
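Grabbing a table with beautifulsoup can look like the sketch below; the HTML fragment and names are invented, and in practice you would first download the page (for example with the `requests` library) instead of hard-coding it:

```python
from bs4 import BeautifulSoup

# A fragment of HTML standing in for a page you have downloaded
# (the table contents are invented for illustration).
html = """
<table>
  <tr><th>Name</th><th>Dynasty</th></tr>
  <tr><td>Li Qingzhao</td><td>Song</td></tr>
  <tr><td>Yuan Mei</td><td>Qing</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Collect the text of every cell, row by row.
rows = [[cell.get_text() for cell in tr.find_all(["th", "td"])]
        for tr in soup.find_all("tr")]
print(rows[1])   # ['Li Qingzhao', 'Song']
```

For pages that only render their tables with JavaScript, selenium can drive a real browser and hand the resulting HTML to the same parsing code.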

Data access does not simply mean copying the resources; it also means having them in a structured form so you can use them effectively. By reading the data in a structured way, you can remove irrelevant information and filter the items that fit your interests without any manual effort. For example, you can drop all the empty rows, or convert your table from long to wide format. Python is even more advantageous when you work with time series or spatial data, owing to its powerful libraries Pandas and GeoPandas. With the geospatial libraries, you can easily perform geospatial operations and make maps just as in a geographic information system (GIS).
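The long-to-wide conversion mentioned above is a single Pandas call; the regions and population figures here are invented:

```python
import pandas as pd

# Long format: one row per (region, year) observation (invented data).
long_df = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "year":   [1900, 1910, 1900, 1910],
    "population": [120, 135, 80, 95],
})

# Wide format: one row per region, one column per year.
wide_df = long_df.pivot(index="region", columns="year", values="population")
print(wide_df.loc["South", 1910])   # 95
```

GeoPandas extends the same DataFrame idea with a geometry column, so the skills transfer directly to spatial data.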

3) It offers a Big Picture of the Data Collection 🖼️

Amid the ongoing information explosion, we not only have much more information about the present but also much more digitized data from the past, thanks to emerging digital scholarship projects. Although big data offers more material for research, it is often not realistic to read through a large corpus, for instance, thousands of novels or publications. We may only want to filter a small amount of information from the data pool or get an overview of all materials. We can achieve this by performing topic modelling, counting the frequency of keywords, or computing statistics.
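Keyword counting needs nothing beyond the standard library; the three sentences below stand in for a corpus of thousands of texts:

```python
import re
from collections import Counter

# A toy corpus standing in for thousands of digitized texts.
texts = [
    "The famine spread through the province in 1877.",
    "Officials reported the famine to the court.",
    "Trade recovered after the famine ended.",
]

words = Counter()
for text in texts:
    # Lowercase and split on letters only; a real pipeline would
    # also remove stopwords and handle the source language properly.
    words.update(re.findall(r"[a-z]+", text.lower()))

print(words["famine"])        # 3
print(words.most_common(1))   # [('the', 5)]
```

Even this crude overview already tells you which terms dominate a collection before you commit to close reading.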

Using machine learning, we can also perform unsupervised clustering on the texts for categorization. These approaches require different skills in text mining, big data analysis, and NLP. With Python, we can smoothly combine multiple domains in a single script. Learning a new tool always takes time, so it is best to select a tool that can reasonably perform most tasks. With the aid of Python, you can easily save time on repetitive tasks.
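A minimal clustering sketch with scikit-learn is shown below; the six "documents" are invented and deliberately fall into two themes, so k-means on TF-IDF vectors should separate them:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Six tiny documents on two obvious themes (invented for illustration).
docs = [
    "grain harvest famine drought",
    "drought famine grain shortage",
    "famine relief grain prices",
    "railway steam engine travel",
    "steam railway locomotive travel",
    "railway travel timetable engine",
]

# Turn each document into a TF-IDF vector, then cluster the vectors.
X = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # e.g. [0 0 0 1 1 1] -- the two themes separate
```

On real historical corpora the clusters are rarely this clean, and choosing the number of clusters is itself an interpretive decision.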

4) It helps you Understand Patterns in a Systematic way 🕸️

Machines read differently than the human eye does, which is why distant reading and natural language processing (NLP) can aid historical text analysis. By aggregating and analyzing massive amounts of data, we can observe patterns in texts on a large scale, for example, formal aspects of literature. The aim is a more abstract view of the texts, achieved by visualizing their global features. One such technique is Term Frequency–Inverse Document Frequency (TF-IDF), used to classify text resources. Other techniques include sentiment analysis and text similarity metrics. These can also be combined with other elements, such as geospatial maps, to visualize geographical information. All of this can be implemented in Python and tailored to your research needs.
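To make TF-IDF concrete, here is the formula computed by hand on three toy documents (invented words, standard log formulation: term frequency times the log of documents over document frequency):

```python
import math

# Three toy documents; TF-IDF scores the word "famine" highly in doc 0
# because it is frequent there but rare across the collection.
docs = [
    ["famine", "famine", "relief"],
    ["railway", "travel"],
    ["railway", "timetable"],
]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)              # frequency within the document
    df = sum(1 for d in corpus if term in d)     # documents containing the term
    idf = math.log(len(corpus) / df)             # rarer terms get a higher weight
    return tf * idf

print(round(tf_idf("famine", docs[0], docs), 3))   # 0.732 -- distinctive term
print(round(tf_idf("railway", docs[1], docs), 3))  # 0.203 -- common, so downweighted
```

Libraries such as scikit-learn implement the same idea at scale (with slightly different smoothing), but the intuition is exactly this small function.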

5) You can Visualize Data, from Infographics to Maps 🗺️

Research using historical data does not end with text analysis. Often, particularly if you are the expert on your data, you need to communicate with an audience, whether coworkers, students, fellow researchers, or funding agencies. To do so, you can create charts for your publications, infographics for social media, or a simple web application with interactive figures and maps that invites more involvement from the audience.

Python provides multiple plotting libraries, from Matplotlib and Seaborn to Plotly. Not only can they produce publication-quality figures, they also support many types of data visualization. You can decide whether you want an animation or an interactive dashboard. You can even build a web application with embedded text, graphics, and web maps.
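A basic Matplotlib chart saved to a file takes only a few lines; the publication counts below are invented, and the `Agg` backend is chosen so the script also runs without a display:

```python
import matplotlib
matplotlib.use("Agg")           # render off-screen, no display needed
import matplotlib.pyplot as plt

# Invented time series: number of pamphlets per decade.
decades = [1800, 1810, 1820, 1830]
counts = [14, 22, 31, 27]

fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(decades, counts, width=8)
ax.set_xlabel("Decade")
ax.set_ylabel("Pamphlets published")
ax.set_title("A toy publication count")
fig.savefig("pamphlets.png", dpi=150)   # writes a PNG next to the script
```

From here, swapping Matplotlib for Plotly turns the same data into an interactive figure you can embed in a web page.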

6) You can Customize Tools to your Research Needs 🛠️

The specific tasks in historical research are diverse, and often there is not yet a tool that does exactly what you need. It is painful to realize that an online tool deviates slightly from what you expected, as there is often no way to customize it further. For instance, you might try to create a word cloud online but be dissatisfied with the colors or the font, find that the tool does not support the language you use, or need to filter or clean your data in a way that is not supported.

Using Python, however, there are vast choices for customization as long as you are willing to get your hands dirty. This applies not only to data visualization but also to data collection and text analysis. Besides, thanks to Python's large open-source community, tools are available even for relatively small niches, such as performing NLP on classical Chinese. As Python is free and open, once you learn how to use a library, you can utilize it fully without worrying about licenses or subscriptions. You can also save your script, reuse it on another text, and freely share your approach with others without any barriers.

Summary

Commercial tools are mostly specialised in a certain domain, and the need for a license makes them difficult to use for reproduction and to share with others. Besides, if your workflow chains one task after another, for example, from web scraping to generating word clouds to making a web application for presentation, you may have to jump from one tool to another. Using an open-source programming language, however, provides the flexibility for customization and integration that lets you perform all tasks seamlessly.

R is another programming language that is very popular among researchers, as it is designed for statistical computing. Similar to Python, it has a large community and can perform similar tasks: data wrangling, data visualization, feature selection, web scraping, app building, and so on. The two languages share many tools; for example, spacyr provides a convenient R wrapper around the Python library spaCy for distant reading, and plotnine is a Python implementation of R's ggplot2. Many useful libraries also support both languages, for example, Plotly for interactive plotting and Jieba for Chinese NLP. In fact, R and Python can even be bridged together using rpy2 and reticulate.

It should be said that both R and Python have their strengths and weaknesses. Unlike R, Python is a general-purpose language with a focus on deployment and production. Hence, while R is well suited to ad-hoc analysis, Python can be a better choice when it comes to building a product from machine learning, or for the text processing that is often needed before any data analysis can be done.




Additional information

This notebook is provided for educational purposes; feel free to report any issues on GitHub.


Author: Ka Hei, Chow

License: The code in this notebook is licensed under the Creative Commons Attribution 4.0 license.

Last modified: January 2022