Modelling Eighteenth-Century Epistolarity: Unsupervised Classification of the Voltaire Correspondence

Letters, by their very nature, should be almost ideal candidates for unsupervised classification using the Latent Dirichlet Allocation (LDA) topic modelling algorithm. Specifically, given the size of the documents and the wide array of topics discussed therein, LDA should be able to provide a general overview of the topics in the 22,000-letter correspondence of Voltaire, a collection that has yet to be fully indexed. However, perhaps owing to the formal nature of eighteenth-century letter-writing, which in French includes a preponderance of formules de politesse and other formulaic expressions, the topics detected are often either too general (i.e., concerned with the act of letter-writing itself) or, conversely, too limited in their overall distribution. Indeed, the trade-off between large general topics (e.g., ‘health and wellbeing’, ‘politics’, ‘religion’) and sparser, more content-specific ones (e.g., ‘the Calas affair’ or ‘the Seven Years’ War’) remains a challenge for LDA.

To address these issues, we evaluate LDA’s fitness for the task by comparing its output with that of several unsupervised algorithms that are lesser known (at least in terms of digital humanities coverage), in order to gauge whether these approaches might be better suited to capturing the complexity of eighteenth-century epistolary collections. These algorithms include Non-Negative Matrix Factorization (NMF), Latent Semantic Analysis (LSA), Probabilistic LSA (pLSA), and an implementation of LDA and the word2vec algorithm. By modelling Voltaire’s ‘epistolarity’ we aim, on the one hand, to gain a better understanding of the discursive makeup of his massive correspondence, its most important topics, and their distribution and evolution over a span of more than 70 years. On the other hand, by bringing different algorithms into productive conversation with our texts and methods, we offer a necessary critique of LDA’s prominence as the de facto unsupervised learning method in the digital humanities today.