Text Correction for Mining Historical Documents

Applying advanced deep-learning techniques to improve the quality of poor OCR in the British Library Newspapers collection.

This project addresses the problem of correcting noisy OCR output for historical documents, focusing on the British Library Newspapers (BLN) collection. BLN is a major corpus of scanned British newspapers spanning over 200 years and more than 240 titles, with textual data, visual data, and metadata available. The scanned newspaper images have undergone OCR (optical character recognition) processing, but the degradation of the original documents has resulted in inaccurate transcriptions. The project aims to employ advanced deep-learning techniques to improve the quality of these transcriptions. The final outputs will be high-quality corrected transcriptions of BLN and open-source code for OCR text correction, both of which will serve as valuable resources for humanities researchers.

Since the early 2000s, significant digitisation efforts have been undertaken to preserve and make accessible historical primary sources such as newspapers, early printed books, and handwritten documents. While these efforts have been instrumental in advancing humanities research, the low quality of OCR transcriptions remains a significant barrier to discovering new historical insights. The successful completion of this project promises both short-term and long-term benefits. In the short term, it will significantly enhance the transcription quality of BLN, enabling accurate and efficient searching within the collection and unlocking the potential for text mining, which was previously impractical due to low transcription quality. In the long term, the project's success could transform research on other large collections of historical documents, allowing researchers to track content changes, language evolution, and shifts in thought across different time periods. Because the approach is language-independent, its impact could extend to historical documents worldwide, advancing research on a global scale.

Project Website

Project Team

  • Professor Robert Gaizauskas (Project Director — Professor of Computer Science, Co-Director of CDT in Speech and Language Technologies, and Member of the Natural Language Processing (NLP) Research Group)
  • Alan Thomas (AI Research Engineer — Department of Computer Science)
  • Professor Robert Shoemaker (Professor of Eighteenth-Century British History — Department of History)
  • Michael Pidd (Director of the Digital Humanities Institute)
  • Dr Valeria Vitale (Lecturer in Digital Humanities — Digital Humanities Institute)
  • Haiping Lu (Head of AI Research Engineering, Professor of Machine Learning, and Turing Academic Lead — Department of Computer Science)