Discovering the Unknown Unknowns: What NLP Reveals About Historical Datasets

Analysing the natural language in historical sources presents several particular challenges, arising not only from the nature of the documents and the differing forms of language used but also from the varying quality of the digital versions of these documents. These challenges become more acute when attempting to extract meaning from large, disparate datasets. This paper will consider these challenges in relation to three Humanities Research Institute projects.

Connected Histories currently brings together twenty-two digital datasets related to early modern and nineteenth-century Britain through a single federated search that allows sophisticated searching of names, places and dates. Manuscripts Online, its sister site, enables users to search twenty online primary resources relating to written and early printed culture in Britain during the period 1000 to 1500. Digital Panopticon is an ongoing project that attempts to bring together existing and new genealogical, biometric and criminal justice datasets to explore the impact of different types of punishment, particularly transportation, on the lives of 66,000 people sentenced at the Old Bailey between 1780 and 1875.

This paper will consider the feedback loop inherent in NLP approaches: how our failures can improve not only our processing techniques but also our datasets. It will discuss how NLP and associated techniques not only add an interpretative layer to our datasets but can also raise research questions about the assumptions that we make about historical sources.