Background photo courtesy of Ed Robertson via Unsplash
Computational Text Analyses Bootcamp
Do you have a textual analysis project that you are trying to get off the ground? Or are you simply interested in learning more about how to analyze texts with computers? Join us for an intensive - but fun! - bootcamp at the Sherman Centre with opportunities to work on your own documents or a sample corpus if you just want to practice the techniques.
Through hands-on exercises, we will introduce you to the fundamental concepts, processes, and methodological approaches for preparing and analyzing text computationally. We’ll show you how to use tools like OpenRefine and Python for text preparation and introduce analytic techniques including named entity recognition (NER), topic modeling, sentiment analysis and stylometry. Participants are not expected to have any prerequisite knowledge of text preparation and analysis, but experience with Python is an asset. Participants will be given an opportunity to complete exercises in advance of the workshop to build basic competency.
This is an in-person event and open to all who are able to travel to the Sherman Centre, which is accessibly located on the first floor of Mills Library at McMaster University.
Workshop Preparation
In this workshop, we will use the following tools and platforms:
- Google Colab, which requires a Google account. If this poses a challenge, please reach out to the Sherman Centre for alternative arrangements.
- OpenRefine: Download and install prior to the first session.
- (optional) Constellate.org, which is available to all McMaster members, as well as members of other institutions. If you do not have access, please contact the Sherman Centre for alternative arrangements.
Facilitator Bios
Devon Mordell is an Educational Developer at The MacPherson Institute for Teaching and Learning. Devon draws on her experience in media art, hobbyist programming and instructional design to teach workshops for the Sherman Centre. Her areas of interest in digital scholarship include data visualization, computational analyses of texts, sonification and critical digital humanities. Her research practice explores the algorithmic culture industry and platform psychogeography.
Jay Brodeur (he/him) is the Director of Digital Scholarship Infrastructure & Services and the Administrative Director of the Sherman Centre for Digital Scholarship. Jay has years of experience working with data in a wide variety of formats and interdisciplinary contexts. A scientist by training with a PhD in Earth and Environmental Sciences, he’s comfortable working and advising on all kinds of data-related activities, ranging from data wrangling and integration to analysis and mapping to research data management. Jay’s also keenly interested in the application of digital approaches to support experiential learning opportunities within and outside of the classroom.
Subhanya Sivajothy (she/her) brings a background of research in data justice, science and technology studies, and environmental humanities. She is currently thinking through participatory data design that allows for visualizations that empower the end user. She also has experience in Research Data Management, particularly data cleaning and curation. Do not hesitate to reach out to her if you would like to talk more about data analysis and visualization as they evolve throughout the research process. Contact Subhanya at sivajos@mcmaster.ca.
Workshop Materials and Preparation
All files for the bootcamp are available in this shared Google Drive folder (u.mcmaster.ca/cta-bootcamp). Download the contents to your local computer and unzip them, AND copy the contents into your own Google Drive before beginning the exercises.
Contents
Day 1
Time: 0930 - 1600
The Jupyter notebook name for each exercise is indicated below.
View/download slides
Segment | Time Allotted | Key Topics / Activities |
---|---|---|
Introductory remarks | 20 minutes | Introduction to text preparation and analysis; overview of concepts and methods |
Text preparation | 120 minutes | Text prep with OpenRefine; building workflows with Python (CTA-Bootcamp-2024-python-prep.ipynb) |
Lunch (1200 - 1300) | 60 minutes | Lunch |
Text Analysis | 180 minutes | Named Entity Recognition [45 mins] (CTA-Bootcamp-2024-NER.ipynb); Sentiment Analysis [45 mins] (CTA-Bootcamp-2024-SA.ipynb); Topic Modeling [45 mins] (CTA-Bootcamp-2024-TM.ipynb); Stylometry [45 mins] (CTA_Bootcamp_2024_stylometry.ipynb) |
Wrap up | 10 minutes | Recap & thinking about day 2 projects |
Day 2
Time: 0930 - 1600
View/download slides
Segment | Time Allotted | Key Topics / Activities |
---|---|---|
Corpora Selection | 30 minutes | Sources and types; key considerations for different source materials and analyses; case studies |
Visualization for Dissemination | 75 minutes | Core concepts; visualization types; hands-on exercises |
Working Period | 75 minutes | Work on your own data or a pre-selected project |
Lunch (1230 - 1330) | 60 minutes | Lunch |
Working Period | 120 minutes | Continue project work |
Share Back, Closing Comments | 30 minutes | Share your work; questions and wrap-up |
Links and Resources
Here are a variety of helpful resources to explore and learn more.
Natural Language Processing Training and Resources
Constellate (NLP training and analysis)
Constellate is a platform for learning and performing text analysis, supported by JSTOR Labs and ITHAKA. McMaster members can access tutorials, digitized materials, and an integrated Python notebook environment by registering with their McMaster email address.
Constellate provides:
- A comprehensive set of interactive Jupyter Notebook-based tutorials for text analysis, shared via GitHub under a CC-BY license.
- Analytical access to content from 35+ million articles, books, and newspapers from JSTOR, Portico, Chronicling America, etc.
- A computational platform to develop notebooks and collect, create, analyze, and store data (available to members of McMaster and other subscribing institutions).
- Access to advanced support (available to members of McMaster and other subscribing institutions).
To access the features of the pedagogy package, McMaster members must pair their Constellate accounts with the institution at least once per year. This can be done by logging in while connected to the McMaster network (via a wired or Wi-Fi connection), or by connecting via the Library’s off-campus access service.
When your account is paired with McMaster, you will see the message Full access provided by McMaster University after logging in.
To sign up and pair your account to the institution:
- From on campus, navigate to https://constellate.org/register. From off campus, log in via the Library’s off-campus access service.
- Follow the instructions to verify your account.
- Log in via https://constellate.org/login. If you are off campus and aren’t recognized as a McMaster member, log out and back in via McMaster Library off-campus access.
HathiTrust Research Center
The HathiTrust Research Center (HTRC) enables computational analysis of works in the HathiTrust Digital Library (HTDL) to facilitate non-profit research and educational uses of the collection. McMaster members have access to the HTRC services, materials, and training through the institution’s membership in the HathiTrust.
To access the HTRC:
- Sign in at https://analytics.hathitrust.org/.
- Select McMaster University as your institution and choose Continue.
- Follow through McMaster’s Single Sign On process, if prompted.
- When prompted, check your email for a confirmation link and confirm your account.
Other tutorials and resources
- Check out Devon Mordell’s two excellent text prep and analysis modules shared through the SCDS: Pre-Processing Digitized Texts and Named Entity Recognition.
- How to Clean Text for Machine Learning with Python. An excellent step-by-step walkthrough of the fundamentals of text prep with Python.
- Python Regex (Regular Expressions) for Data Scientists
- Cleaning OCR’d text with Regular Expressions by Laura Turner O’Hara for The Programming Historian.
- Natural Language Processing With Python’s NLTK Package: An excellent end-to-end tutorial using the nltk package
- Natural Language Processing with Python: Introduction. This is an excellent step-by-step introduction to basic pre-processing steps (though no clustering or error find/replace)
- Using Binder to connect GitHub repositories to Jupyter Notebooks
OpenRefine
- Library Carpentry lesson on OpenRefine
- University of Toronto Libraries OpenRefine tutorials
- OpenRefine Manual on Regular Expressions
- Using regular expressions in OpenRefine: Tutorial by Peter Green, includes non-Latin script.
Regular Expressions
- Regular expression testers
- https://www.regular-expressions.info/
- https://regex101.com/
- Regexr: Interactive regular expression (regex) coder and explainer
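To give a sense of what regular expressions do in text preparation, here is a minimal Python sketch that uses only the standard `re` module. It is an illustration, not taken from the bootcamp notebooks: the function name `clean_ocr_text` and the sample string are hypothetical, and the three clean-up steps are just common first passes on OCR'd text. Try each pattern in one of the testers above before applying it to your own corpus.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Apply a few common regex clean-up steps to OCR'd text (illustrative only)."""
    # 1. Rejoin words split by a hyphen at a line break: "analy-\nsis" -> "analysis".
    #    Caution: this also joins genuinely hyphenated words such as "well-\nknown".
    text = re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", raw)
    # 2. Turn single line breaks into spaces, leaving blank lines
    #    (paragraph breaks) untouched.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # 3. Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


sample = "Computa-\ntional   text analy-\nsis is\nfun.\n\nPage 2"
print(clean_ocr_text(sample))
```

The order of the steps matters: hyphenated line breaks are rejoined before the remaining line breaks become spaces, otherwise "analy- sis" would be left behind.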
Python
Python Integrated Development Environments
- There are many, many different Python IDEs. Find which one is best for you. Jay is partial to Pyzo.
Python packages for text prep and Natural Language Processing
- PyTesseract: Simple Python Optical Character Recognition
- spaCy NLP library and documentation
- NLTK NLP library and documentation
- natas: Library for processing historical English corpora, especially for studying neologisms
- Python phonetics package, which includes methods for matching and clustering words by phonetic similarity
- pyspellchecker: A simple Python-based spell checking algorithm
- BookNLP: A natural language processing pipeline that scales to books and other long documents (in English).