Congratulations
Congratulations! You’ve just finished this workshop.
You should now be able to:
- Perform initial data analysis on OCR text output
- Explain the importance of data provenance
- Apply computational techniques to correct common OCR errors
- Identify an appropriate data pre-processing approach
Additional Resources
To learn more about any particular topic, take a look at the links below.
OpenRefine
Because we use OpenRefine for our pre-processing tasks, a better grasp of the tool will help your error correction efforts. Numerous tutorials are available; the Library Carpentry workshop on OpenRefine reinforces what you have learned here and goes into greater depth on some topics (though in a more general context, i.e. not specific to OCR error correction).
You may also wish to refer to the OpenRefine documentation to really dive into what’s possible with the tool.
Regular Expressions
Likewise, since one of the major OCR error correction strategies discussed involves using regular expressions (RegEx), a strong grasp of RegEx will help you make the most of OpenRefine. In addition to the resources listed on the “Correcting OCR Errors with OpenRefine: Strategies” page:
- Library Carpentry has also developed a workshop on RegEx,
- the RexEgg website offers both cheatsheets and comprehensive tutorials,
- Peter Green of Princeton University Library has developed a RegEx cheatsheet tailored to OpenRefine,
- another helpful resource, of course, is the OpenRefine documentation.
You can also dynamically test your RegEx patterns with Regular Expressions 101 or RegExr.
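If you prefer to try patterns in a script rather than in a browser, the following is a minimal sketch in Python. The sample string and the function names `rejoin_hyphenated` and `collapse_spaces` are hypothetical and are not taken from the workshop materials; they only illustrate how a RegEx pattern can be checked against a small piece of OCR-like text.

```python
import re

# Hypothetical OCR output: a word split across a line break and stray extra spaces.
sample = "The manu-\nscript was  digitised   in 1998."

def rejoin_hyphenated(text):
    # Join a word broken as "manu-<newline>script" back into "manuscript".
    return re.sub(r"(\w+)-\s*\n\s*(\w+)", r"\1\2", text)

def collapse_spaces(text):
    # Replace runs of two or more spaces or tabs with a single space.
    return re.sub(r"[ \t]{2,}", " ", text)

cleaned = collapse_spaces(rejoin_hyphenated(sample))
print(cleaned)
```

Note that the first pattern would also join words that are legitimately hyphenated at a line end (for example "well-" followed by "known"), so results like these should still be reviewed by hand.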
Critical Data Studies
If “Behind the Interface” piqued your curiosity about how language frames our understanding of data, you may be interested in the following texts:
- Benjamin, Ruha. Race after technology: Abolitionist tools for the new Jim code. Polity, 2019.
- Boyd, Danah, and Kate Crawford. “Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon.” Information, Communication & Society 15.5 (2012): 662-679.
- Fordyce, Robbie, and Suneel Jethani. “Critical data provenance as a methodology for studying how language conceals data ethics.” Continuum 35.5 (2021): 775-787.
- Gitelman, Lisa, ed. “Raw data” is an oxymoron. MIT Press, 2013.