
Background photo courtesy of Ed Robertson via Unsplash

Computational Text Analyses Bootcamp

Do you have a textual analysis project that you are trying to get off the ground? Or are you simply interested in learning more about how to analyze texts with computers? Join us for an intensive - but fun! - bootcamp at the Sherman Centre with opportunities to work on your own documents or a sample corpus if you just want to practice the techniques.

Through hands-on exercises, we will introduce you to the fundamental concepts, processes, and methodological approaches for preparing and analyzing text computationally. We’ll show you how to use tools like OpenRefine and Python for text preparation and introduce analytic techniques including named entity recognition (NER), topic modeling, sentiment analysis, and stylometry. Participants are not expected to have any prerequisite knowledge of text preparation and analysis, but experience with Python is an asset. Participants will be given an opportunity to complete exercises in advance of the workshop to build basic competency.

This is an in-person event and open to all who are able to travel to the Sherman Centre, which is accessibly located on the first floor of Mills Library at McMaster University.

Workshop Preparation

In this workshop, we will use the following tools and platforms:

Google Colab, which requires a Google account. If this poses a challenge, please reach out to the Sherman Centre for alternative arrangements.
OpenRefine: Download and install prior to the first session.
(optional) Constellate.org, which is available to all McMaster members, as well as members of other institutions. If you do not have access, please contact the Sherman Centre for alternative arrangements.

Facilitator Bios

Devon Mordell is an Educational Developer at The MacPherson Institute for Teaching and Learning. Devon draws on her experience in media art, hobbyist programming and instructional design to teach workshops for the Sherman Centre. Her areas of interest in digital scholarship include data visualization, computational analyses of texts, sonification and critical digital humanities. Her research practice explores the algorithmic culture industry and platform psychogeography.

Jay Brodeur (he/him) is the Director of Digital Scholarship Infrastructure & Services and the Administrative Director of the Sherman Centre for Digital Scholarship. Jay has years of experience working with data in a wide variety of formats and interdisciplinary contexts. A scientist by training with a PhD in Earth and Environmental Sciences, he’s comfortable working and advising on all kinds of data-related activities, ranging from data wrangling and integration to analysis and mapping to research data management. Jay’s also keenly interested in the application of digital approaches to support experiential learning opportunities within and outside of the classroom.

Subhanya Sivajothy (she/her) brings a background of research in data justice, science and technology studies, and environmental humanities. She is currently thinking through participatory data design that allows for visualizations that empower the end user. She also has experience in Research Data Management, particularly data cleaning and curation. Do not hesitate to reach out to her if you would like to talk more about data analysis and visualization as they evolve throughout the research process. Contact Subhanya at sivajos@mcmaster.ca.

Workshop Materials and Preparation

All files for the bootcamp are available in this shared Google Drive folder (u.mcmaster.ca/cta-bootcamp). Download the contents to your local computer and unzip them, AND copy the contents into your own Google Drive before beginning the exercises.

Day 1

Time: 0930 - 1600
The Jupyter notebook name for each exercise is indicated below.
View/download slides

Segment	Time Allotted	Key Topics / Activities
Introductory remarks	20 minutes	Introduction to text preparation and analysis; overview of concepts and methods
Text preparation	120 minutes	Text prep with OpenRefine; building workflows with Python (`CTA-Bootcamp-2024-python-prep.ipynb`)
Lunch (1200 - 1300)	60 minutes	Lunch
Text Analysis	180 minutes	Named Entity Recognition [45 mins] (`CTA-Bootcamp-2024-NER.ipynb`); Sentiment Analysis [45 mins] (`CTA-Bootcamp-2024-SA.ipynb`); Topic Modeling [45 mins] (`CTA-Bootcamp-2024-TM.ipynb`); Stylometry [45 mins] (`CTA_Bootcamp_2024_stylometry.ipynb`)
Wrap up	10 minutes	Recap & thinking about day 2 projects

Day 2

Time: 0930 - 1600
View/download slides

Segment	Time Allotted	Key Topics / Activities
Corpora Selection	30 minutes	Sources and types; key considerations for different source materials and analyses; case studies
Visualization for Dissemination	75 minutes	Core concepts; visualization types; hands-on exercises
Working Period	75 minutes	Work on your own data or a pre-selected project
Lunch (1230 - 1330)	60 minutes	Lunch
Working Period	120 minutes	Continue project work
Share Back, Closing Comments	30 minutes	Share your work; questions and wrap-up

Links and Resources

Here are a variety of helpful resources to explore and learn more.

Natural Language Processing Training and Resources

Constellate (NLP training and analysis)

Constellate is a text analysis learning and analysis platform supported by JSTOR Labs and ITHAKA. McMaster members can access tutorials, digitized materials, and an integrated Python notebook environment by registering with their McMaster email address.

Constellate provides:

A comprehensive set of interactive Jupyter Notebook-based tutorials for text analysis, shared via GitHub under a CC-BY license.
Analytical access to content from 35+ million articles, books, and newspapers from JSTOR, Portico, Chronicling America, etc.
A computational platform to develop notebooks and collect, create, analyze, and store data (available to members of McMaster and other subscribing institutions).
Access to advanced support (available to members of McMaster and other subscribing institutions).

To access the features of the pedagogy package, McMaster members must pair their Constellate accounts with the institution at least once per year. This can be done by logging in while connected to the McMaster network (via wired or Wi-Fi connection), or by connecting via the Library’s off-campus access service.

When your account is paired with McMaster, you will see the message Full access provided by McMaster University after logging in.

To sign up and pair your account to the institution:

From on campus, navigate to https://constellate.org/register. From off campus, log in via the Library’s off-campus access service.
Follow the instructions to verify your account.
Log in via https://constellate.org/login. If you are off campus and aren’t recognized as a McMaster member, log out and back in via McMaster Library off-campus access.

HathiTrust Research Center

The HathiTrust Research Center (HTRC) enables computational analysis of works in the HathiTrust Digital Library (HTDL) to facilitate non-profit research and educational uses of the collection. McMaster members have access to the HTRC services, materials, and training through the institution’s membership in the HathiTrust.

To access the HTRC:

Sign in at https://analytics.hathitrust.org/.
Select McMaster University as your institution and choose Continue.
Follow through McMaster’s Single Sign On process, if prompted.
When prompted, check your email for a confirmation link and confirm your account.

OpenRefine

Library Carpentry lesson on OpenRefine
University of Toronto Libraries OpenRefine tutorials
OpenRefine Manual on Regular Expressions
Using regular expressions in OpenRefine: Tutorial by Peter Green, includes non-Latin script.

Regular Expressions

Regular expression testers
- https://www.regular-expressions.info/
- https://regex101.com/
- Regexr: Interactive regular expression (regex) coder and explainer
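
Once a pattern behaves as expected in one of the testers above, it can be carried into a Python text-preparation workflow. The following is a minimal sketch using Python's built-in `re` module; the function names `clean_text` and `find_years` and the sample sentence are illustrative assumptions, not taken from the bootcamp notebooks.

```python
import re


def clean_text(raw: str) -> str:
    """Rejoin words hyphenated across line breaks, then collapse whitespace."""
    # Rejoin words split across a line break, e.g. "sur-\nvey" -> "survey".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def find_years(text: str) -> list[str]:
    """Return four-digit years between 1500 and 2099 found in the text."""
    return re.findall(r"\b(?:1[5-9]\d{2}|20\d{2})\b", text)


# Hypothetical sample resembling OCR output with broken lines.
sample = "The  sur-\nvey of 1851 was re-\n printed in 1923."
cleaned = clean_text(sample)
print(cleaned)
print(find_years(cleaned))
```

Note that this simple rejoining rule would also merge genuinely hyphenated compounds that happen to break across lines (for example, "well-\nknown" would become "wellknown"), so it is worth spot-checking the output against your own corpus.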

Python

Python Integrated Development Environments

There are many, many different Python IDEs. Find which one is best for you. Jay is partial to Pyzo.

Python packages for text prep and Natural Language Processing

PyTesseract: Simple Python Optical Character Recognition
spaCy NLP library and documentation
NLTK NLP library and documentation
natas: Library for processing historical English corpora, especially for studying neologisms
Python phonetics package, which includes methods for matching and clustering words by phonetic similarity
pyspellchecker: A simple Python-based spell checking algorithm
BookNLP: A natural language processing pipeline that scales to books and other long documents (in English).