Comparing like with like? Tools for exploring families of corpora

The number of families of digital corpora has increased dramatically in recent years, with developments including the extension of the Brown Corpus family over time and the assembly of the International Corpus of English (ICE). Access to such structurally comparable materials is a prerequisite for the study of linguistic change and regional variation in a global language like English.


Using the same sampling frame optimizes corpus comparisons over time and space. However, past research has shown that genre comparability is not necessarily easy to achieve, and even the “same” genres can vary considerably. The aim of our paper is not to evaluate the reasons behind this variation but rather to provide tools for exploring how well corpora match and for spotting such differences.


One way of comparing corpora is keyword analysis. As a complementary approach, we introduce a new version of our easy-to-use interactive visualization tool, Text Variation Explorer (TVE). TVE includes three helpful diagnostics for genre variation: type/token ratio, average word length, and the proportion of hapax legomena. Furthermore, TVE can cluster text samples according to a user-given set of words by applying principal component analysis. TVE 2.0 will provide enhanced access to corpus metadata, making it easy to explore variation according to social categories such as gender and social rank.
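The three diagnostics named above can be illustrated with a minimal Python sketch. It is not TVE's implementation: the function name `diagnostics`, the simple regular-expression tokenizer, and the choice to express hapax legomena as a share of all tokens (rather than of types) are assumptions made here for illustration only.

```python
import re
from collections import Counter


def diagnostics(text):
    """Return (type/token ratio, average word length, hapax proportion).

    Hypothetical helper for illustration; not part of TVE.
    """
    # Assumed tokenizer: lowercase alphabetic runs, with an optional
    # internal apostrophe (e.g. "don't"). Real corpora need more care.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not tokens:
        raise ValueError("no tokens found in text")

    counts = Counter(tokens)

    # Type/token ratio: distinct word forms divided by running words.
    ttr = len(counts) / len(tokens)

    # Average word length in characters, over running words.
    avg_len = sum(len(t) for t in tokens) / len(tokens)

    # Hapax legomena: word forms occurring exactly once in the sample.
    # Assumption: the proportion is taken relative to the token count.
    hapaxes = sum(1 for c in counts.values() if c == 1)
    hapax_prop = hapaxes / len(tokens)

    return ttr, avg_len, hapax_prop


if __name__ == "__main__":
    ttr, avg_len, hapax_prop = diagnostics("the cat saw the dog")
    print(ttr, avg_len, hapax_prop)
```

Because the type/token ratio and the hapax proportion both depend on sample length, such figures are only meaningful when compared across text samples of equal size.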

Using ICE and Brown as examples, we will show how TVE can provide a quick overview of similarities and differences across corpora, highlighting sections that require more careful analysis. Our third example showcases the new metadata features of TVE by exploring gender differences in the Corpus of Early English Correspondence (CEEC). We argue that exploratory, highly interactive techniques can usefully complement traditional statistical analysis, especially when the goal is to generate insights rather than test a well-defined hypothesis.