Session 3 — Text Analytics 1: Between numbers and words

Friday 09:30 - 11:00

High Tor 2

Chair: James O'Sullivan

How not to read texts: giving context to big data

  • Iona Hine

University of Sheffield

Computational techniques enable humans to seek out patterns in collections of texts that exceed what one human can read. This permits the identification of textual and linguistic phenomena that may otherwise defy human recognition. Such analysis first requires texts in a suitable digital format, texts that are “machine readable”. However, the use of the verb “read” to describe the discrete activities of human and machine can mask considerable differences between the two audiences’ needs.

Whether output as keywords-in-context, as lists of associated terms, or mediated by visualisations, many of the mechanisms that make large language datasets manageable for human exploration simultaneously limit engagement with the words’ contexts. If words are known by the company they keep, computational resources (and perhaps the user’s patience) set unseen boundaries around the company interrogated.
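The point about unseen boundaries can be illustrated with a minimal keywords-in-context sketch. It is not drawn from the project's own tooling: the function name `kwic`, the `window` parameter, and the sample sentence are all assumptions made for illustration. The window size decides which neighbouring words a reader ever gets to see.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each hit.

    `window` is the number of tokens kept on each side; anything
    outside it is silently excluded from the reader's view.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, tok, right))
    return hits


# Invented sample text, for illustration only.
text = "a word is known by the company it keeps and the company changes".split()

narrow = kwic(text, "company", window=2)
wide = kwic(text, "company", window=5)

for left, key, right in narrow:
    print(" ".join(left), f"[{key}]", " ".join(right))
```

With `window=2` the word "known" never appears beside the first hit; with `window=5` it does. The same text yields different "company" depending on a setting the reader may never inspect.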

Emerging from a collaborative project that seeks to identify and trace the movement of paradigmatic terms in early modern English, this paper will consider different ways of moving from the products of machine reading to the work of human reading (and back again), weighing up their strengths and weaknesses in the context of this work. The paper will respond to questions such as:


• What may be gained, lost, or simply hidden when historical texts are prepared for computational analysis?

• What do the different audiences (computers, humanities scholars) not read?

• How can humanities scholars test claims about collections of texts that are too big to read?

• How does one systematise close reading? What checks and balances may be employed when investigating large datasets?

Reflections will be grounded in recent work with early modern English text collections, including Early English Books Online (EEBO) and Eighteenth Century Collections Online (ECCO), making reference to corpus linguistics, distributional semantics, and literary and historical studies.

Quantitative analysis and textual interpretation in Caxton

  • Rosie Shute

University of Sheffield

Is it possible to isolate the language of an individual within printed material from the fifteenth century? How can specialised digital approaches (such as vector space modelling) illuminate our understanding of a book's material production? This paper discusses the theoretical and practical challenges of compiling and interpreting quantitative data when studying the work of compositors (type-setters) working for the premier English publisher, William Caxton (c.1422–c.1491). The study contributes to a wider scholarly debate about spelling variation in early modern English and the constraints of a compositor's work.


The data has been selected on the basis of two factors: firstly, texts for which a transcription was already in existence (courtesy of EEBO-TCP), and secondly, texts for which breaks between compositors could be determined on the basis of bibliographical evidence. For each compositorial section, a wordlist was created comprising variant spellings and their frequencies. These frequencies form vectors, which are then used in statistical similarity testing to show how similar the spelling systems of different compositors are.
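The step from wordlists to vectors can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's method: the abstract does not name its similarity measure, so cosine similarity is swapped in here as one common choice, and the variant spellings, token lists, and function names are all invented for the example.

```python
from collections import Counter
from math import sqrt


def spelling_vector(tokens, variants):
    """Count how often each variant spelling occurs in one section."""
    counts = Counter(tokens)
    return [counts[v] for v in variants]


def cosine_similarity(a, b):
    """Cosine of the angle between two frequency vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical variant spellings and invented sections, for illustration only.
VARIANTS = ["hym", "him", "whiche", "which"]
section_a = ["hym", "hym", "whiche", "him", "whiche"]
section_b = ["him", "him", "which", "hym", "which"]

vec_a = spelling_vector(section_a, VARIANTS)  # [2, 1, 2, 0]
vec_b = spelling_vector(section_b, VARIANTS)  # [1, 2, 0, 2]
similarity = cosine_similarity(vec_a, vec_b)
print(f"similarity = {similarity:.3f}")
```

A value near 1 would suggest that two sections favour the same spellings in similar proportions; on its own it says nothing about why, which is the interpretive problem the paper goes on to address.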


This paper discusses the processes involved in taking the results of statistical analyses and interpreting them as they apply to the text itself, which is a material artefact. In doing so, I consider the challenges of extracting from a series of vectors the basis for drawing conclusions about language use. Given the myriad causes behind spelling variation, I discuss the value of statistical analysis in negotiating that variation. Finally, I consider the extent to which digital methods are viable in the pursuit of research questions such as the one in this case study.

Comparing like with like? Tools for exploring families of corpora

  • Harri Siirtola
  • Terttu Nevalainen
  • Tanja Säily

The number of families of digital corpora has increased dramatically in recent years, including the extension of the Brown Corpus family over time and the assembly of the International Corpus of English (ICE). Access to such structurally comparable materials is indeed a prerequisite for the study of linguistic change and regional variation in a global language like English.


Using the same sampling frame optimizes corpus comparisons over time and space. However, past research has shown that genre comparability is not necessarily easy to achieve, and even the “same” genres can vary considerably. The aim of our paper is not to evaluate the reasons behind this variation but rather to provide tools for exploring how well corpora match and for spotting such differences.


One way of comparing corpora is keyword analysis. As a complementary approach, we introduce a new version of our easy-to-use interactive visualization tool, Text Variation Explorer (TVE). TVE includes three helpful diagnostics for genre variation: type/token ratio, average word length, and the proportion of hapax legomena. Furthermore, TVE can cluster text samples according to a user-given set of words by applying principal component analysis. TVE 2.0 will provide enhanced access to corpus metadata, making it easy to explore variation according to social categories such as gender and social rank.
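The three diagnostics named above can be sketched in a few lines. This is an illustration under stated assumptions, not TVE's implementation: the function name `diagnostics` and the sample sentence are invented, types are counted case-insensitively, and the hapax proportion is taken relative to the number of tokens (the abstract does not say whether tokens or types form the denominator).

```python
from collections import Counter


def diagnostics(tokens):
    """Compute three simple text diagnostics for a list of word tokens."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    counts = Counter(t.lower() for t in tokens)  # assumption: case-insensitive types
    n_tokens = len(tokens)
    n_types = len(counts)
    hapaxes = sum(1 for c in counts.values() if c == 1)  # words occurring exactly once
    return {
        "type_token_ratio": n_types / n_tokens,
        "avg_word_length": sum(len(t) for t in tokens) / n_tokens,
        # Assumption: proportion of hapax legomena measured against tokens.
        "hapax_proportion": hapaxes / n_tokens,
    }


# Invented sample text, for illustration only.
sample = "the cat sat on the mat".split()
result = diagnostics(sample)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

All three measures are sensitive to sample length, so comparisons are most meaningful between samples of equal size.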

Using ICE and Brown as examples, we will show how TVE can provide a quick overview of similarities and differences across corpora, highlighting sections that require more careful analysis. Our third example showcases the new metadata features of TVE by exploring gender differences in the Corpus of Early English Correspondence (CEEC). We argue that exploratory, highly interactive techniques can usefully complement traditional statistical analysis, especially when the goal is to generate insights rather than test a well-defined hypothesis.