The Utility of Count-based Models for the Digital Humanities

Discussing the history of collaboration between “humanities computing” and the computational linguistics community, Susan Hockey noted that despite efforts to bring the fields closer together in the late 1980s, “there was limited communication between these communities, and humanities computing did not benefit as much as it could have done from computational linguistics techniques” (Hockey 2004, p. 13). Today, this appears to have changed, at least with respect to models of distributional semantics—computational models that attend to statistical patterns of association between words in large text corpora. For example, an entire issue of the Journal of Digital Humanities was devoted to topic models in 2012, and another class of distributional models known as word embeddings has recently been making waves (Schmidt 2015; Bjerva & Praet 2015; Heuser 2016). Such methods hold particular promise for humanists interested in identifying groups of words that appear in similar contexts, mapping the semantic fields of corpora too large to read in their entirety (Newman & Block 2006), or tracing changes in concept use across time (Goldstone & Underwood 2012; Wevers, Kenter, & Huijnen 2015).


However, topic models and word embeddings often produce relatively opaque mathematical representations, making it difficult for researchers to use them to draw conclusions about the use of particular words. This paper will argue for the utility of count-based distributional models, which have received little attention in the humanities despite their long history in computational linguistics and recent articles arguing in their favour (Lebret & Collobert 2015; Levy, Goldberg, & Dagan 2015). Because each component of a vector in a count-based model records a (possibly weighted) count of co-occurrence with a specific context word, its interpretation is far clearer than that of a dimension in an embedding model, and questions about why the model treats particular words as highly associated can be investigated directly and rigorously. By applying a count-based model to tasks that have recently been highlighted as use cases for word embeddings in the digital humanities, the present paper will show that such models can uncover insights that word embeddings or topic models alone could not.
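
To make this interpretability concrete, the sketch below (a minimal illustration, not the model used in this paper) builds co-occurrence count vectors over a toy three-sentence corpus, weights them with positive pointwise mutual information, and then decomposes the cosine similarity between two target words into the named context dimensions that produce it. The toy corpus, the two-word window, and the PPMI weighting are assumptions chosen purely for illustration.

from collections import Counter, defaultdict
from math import log, sqrt

corpus = [
    "the king ruled the realm with wisdom".split(),
    "the queen ruled the realm with grace".split(),
    "the peasant worked the land with care".split(),
]

WINDOW = 2  # take context words from two positions on either side of each target

# counts[target][context] = number of times context appears near target
counts = defaultdict(Counter)
for sentence in corpus:
    for i, target in enumerate(sentence):
        lo, hi = max(0, i - WINDOW), min(len(sentence), i + WINDOW + 1)
        for j in range(lo, hi):
            if j != i:
                counts[target][sentence[j]] += 1

# Positive pointwise mutual information (PPMI) weighting of the raw counts
total = sum(sum(c.values()) for c in counts.values())
row_sums = {w: sum(c.values()) for w, c in counts.items()}
col_sums = Counter()
for c in counts.values():
    col_sums.update(c)

def ppmi(target, context):
    joint = counts[target][context] / total
    if joint == 0.0:
        return 0.0
    expected = (row_sums[target] / total) * (col_sums[context] / total)
    return max(0.0, log(joint / expected))

def vector(word):
    # Each dimension of the vector is a named context word
    return {ctx: ppmi(word, ctx) for ctx in counts[word]}

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

king, queen = vector("king"), vector("queen")
print("cosine(king, queen) =", round(cosine(king, queen), 3))

# Decompose the similarity: which shared context words contribute most?
contributions = {ctx: king[ctx] * queen[ctx] for ctx in king if ctx in queen}
for ctx, value in sorted(contributions.items(), key=lambda kv: -kv[1]):
    print(ctx, round(value, 3))

Every term printed by the final loop is tied to a specific context word, so the question of why two words come out as similar can be answered by pointing at the shared contexts themselves, an accounting that the dense dimensions of an embedding model do not offer directly.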



References


Bjerva, Johannes, and Raf Praet. "Word Embeddings Pointing the Way for Late Antiquity." LaTeCH 2015 (2015): 53.


Goldstone, Andrew, and Ted Underwood. "What Can Topic Models of PMLA Teach Us About the History of Literary Scholarship?" Journal of Digital Humanities 2.1 (2012): 40-49.


Heuser, Ryan. "Word Vectors in the Eighteenth Century, Episode 1: Concepts." Adventures of the Virtual. 14 Apr. 2016. Web. 14 May 2016.


Hockey, Susan. “The History of Humanities Computing.” A Companion to Digital Humanities. Eds. Susan Schreibman, Ray Siemens, & John Unsworth. Oxford: Blackwell, 2004. 3-19.


Lebret, Rémi, and Ronan Collobert. "Rehabilitation of Count-based Models for Word Vector Representations." Computational Linguistics and Intelligent Text Processing. Springer International Publishing, 2015. 417-429.


Levy, Omer, Yoav Goldberg, and Ido Dagan. "Improving Distributional Similarity with Lessons Learned from Word Embeddings." Transactions of the Association for Computational Linguistics 3 (2015): 211-225.


Newman, David J., and Sharon Block. "Probabilistic Topic Decomposition of an Eighteenth‐Century American Newspaper." Journal of the American Society for Information Science and Technology 57.6 (2006): 753-767.


Schmidt, Ben. "Word Embeddings for the Digital Humanities." Ben's Bookworm Blog. 25 Oct. 2015. Web. 14 May 2016.


Wevers, Melvin, Tom Kenter, and Pim Huijnen. "Concepts Through Time: Tracing Concepts in Dutch Newspaper Discourse (1890-1990) Using Word Embeddings." DH2015, Sydney, Australia, 2015.