Linguistic Bede

The Bede supercomputer will be used to explore the viability of machine learning approaches for interpreting billions of rows of Linguistic DNA data.

This N8 CIR internship, funded by EPSRC, will use Bede, the N8 Universities’ supercomputer, to explore the viability of machine learning approaches for replacing and improving upon the existing computational linguistic methods used in the AHRC project Linguistic DNA.

The problem with Linguistic DNA

Funded from 2015 to 2018, Linguistic DNA developed a process for modelling the semantic and conceptual changes which occurred in English discourse (c.1500-c.1800). The process is intended to help answer research questions such as: what are all the concepts and ideas in early modern thought, how do they evolve over time, and how are they characterised lexically in 250,000 representative printed texts from this period?

The project developed a new methodology within linguistics based on generating trios – groups of three words that frequently co-occur within windows of 100 adjacent words. The project created a data analysis process using the Apache Hadoop framework with MapReduce and Apache Hive to analyse every word in our input dataset. The output dataset contained many billions of rows of words and accompanying metrics – too many for us to analyse at once (the trio data can be downloaded from https://www.dhi.ac.uk/data/linguisticdna).
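To make the trio idea concrete, here is a minimal Python sketch of counting three-word combinations that co-occur within a 100-word sliding window. The function name, the sliding-window implementation and the toy input are illustrative assumptions; the project’s actual pipeline runs in Hadoop MapReduce and Hive, not in Python.

    from collections import Counter
    from itertools import combinations

    def count_trios(tokens, window=100):
        """Count unordered trios of words that co-occur within a sliding window.

        A toy stand-in for the Hadoop/Hive pipeline, which performs this
        analysis for every word across more than a billion words of input.
        """
        counts = Counter()
        for start in range(max(len(tokens) - window + 1, 1)):
            # Use only the distinct words in each window so that a repeated
            # word does not inflate trio counts.
            window_words = sorted(set(tokens[start:start + window]))
            for trio in combinations(window_words, 3):
                counts[trio] += 1
        return counts

    # Toy usage; a real input would be one of the 250,000 printed texts.
    tokens = "the king and parliament debated law and liberty and law".split()
    print(count_trios(tokens, window=5).most_common(3))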

More recently the project has improved the methodology for detecting semantic and conceptual change by generating quads instead of trios – groups of four frequently co-occurring words that contain the same keyword. However, the size of the output data has increased significantly: quads for 83 keywords produce more than 2.3 billion rows, whereas our input dataset contains more than one billion words. A demonstrator showing these quads can be seen at https://www.linguisticdna.org/cmd/
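The quad extension can be sketched in the same way: the keyword is fixed, only windows containing it contribute, and each such window yields combinations of the keyword plus three other words. Again this is a hedged illustration in plain Python rather than the project’s implementation, and the example keyword and window size are assumptions.

    from collections import Counter
    from itertools import combinations

    def count_quads(tokens, keyword, window=100):
        """Count quads: the keyword plus three other words that co-occur with
        it in the same window. Illustrative only; the real pipeline did this
        for 83 keywords and produced more than 2.3 billion rows."""
        counts = Counter()
        for start in range(max(len(tokens) - window + 1, 1)):
            window_words = set(tokens[start:start + window])
            if keyword not in window_words:
                continue
            for triple in combinations(sorted(window_words - {keyword}), 3):
                counts[(keyword,) + triple] += 1
        return counts

    # Toy usage with an assumed keyword and a small window.
    tokens = "grace and faith and works and salvation by grace alone".split()
    print(count_quads(tokens, "grace", window=6).most_common(2))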

The size of the output data means that we have to apply a range of thresholds to make it manageable for presentation through the demonstrator, and we are therefore limited in our ability to see patterns of semantic change at scale (synchronically and diachronically). Ideally we would be able to implement some form of clustering process.
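For a sense of what thresholding involves, the snippet below filters a toy quad table on a minimum frequency and association score. The column names, scores and cut-off values are hypothetical, invented for illustration; in practice this filtering is applied to billions of rows in Hive, not pandas.

    import pandas as pd

    # Hypothetical columns and values; the real quad tables hold billions of rows.
    quads = pd.DataFrame({
        "keyword":   ["liberty", "liberty", "grace"],
        "quad":      ["liberty law king parliament",
                      "liberty law right subject",
                      "grace faith works salvation"],
        "frequency": [1520, 12, 980],
        "score":     [4.2, 0.3, 5.1],
    })

    # Thresholds of the kind applied before presentation through the
    # demonstrator (the cut-off values here are illustrative, not the project's).
    filtered = quads[(quads["frequency"] >= 100) & (quads["score"] >= 1.0)]
    print(filtered)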

This new project seeks to explore Machine Learning as an alternative approach to identifying quads and summarising the output in order to make subsequent analysis more viable.

What is Bede and how might it help Linguistic DNA?

Bede is the N8 Universities supercomputer, housed at Durham University. It comprises 32 IBM POWER9 dual-CPU nodes, each with four NVIDIA V100 GPUs and a high-performance interconnect. This is the same architecture as the US government’s Summit and Sierra supercomputers, which occupied the top two places in a recently published list of the world’s fastest supercomputers.

The Linguistic DNA project has been hampered by a series of problems relating to the nature of its data analysis task, which the Bede project sets out to address:

  • The quantity of word-by-word comparisons is computationally intensive. We had originally sought a data-driven approach whereby every word in our corpus of 250,000 texts would be compared against every other word in order to generate the quads – but this was unfeasible. As a result, we had to resort to selecting keywords and generating quads only for these keywords. An HPC approach could potentially facilitate a data-driven approach, enabling us to discover concepts that we do not even know exist.
  • The size of the output data is vast – far larger than the input data. This means that we can never produce and then view all the data at once. Instead, we are forced to apply strict thresholds to the results in order to make their size more manageable, but this potentially removes significant information. Ideally, a Machine Learning process using HPC infrastructure would generate quads for every word in our corpus and then analyse the results to identify frequently occurring patterns. The results of this pattern analysis would then be reported to us (textually, visually, etc.). At present, we have to manually analyse snapshots of the data in order to establish what the patterns might be, whereas an ML and HPC approach should be able to conduct this task at scale, comprehensively and consistently (see the sketch after this list).
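As one hedged sketch of the kind of pattern analysis envisaged above, the snippet below represents each word by its co-occurrence profile and clusters those profiles, so that recurring groupings can be reported rather than browsed manually. The toy windows, the cluster count and the choice of scikit-learn are assumptions; on Bede a GPU-accelerated equivalent (for example in PyTorch or RAPIDS) would be needed to work at corpus scale.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.cluster import MiniBatchKMeans

    # Toy windows standing in for 100-word windows drawn from the corpus.
    windows = [
        "king parliament law liberty",
        "king crown law parliament",
        "grace faith salvation scripture",
        "faith scripture grace salvation",
    ]

    # Build a window-by-word count matrix, then a word-by-word co-occurrence matrix.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(windows)           # (n_windows, n_words)
    cooc = (X.T @ X).toarray().astype(float)        # (n_words, n_words)
    np.fill_diagonal(cooc, 0.0)

    # Cluster words by their co-occurrence profiles; the cluster count is illustrative.
    km = MiniBatchKMeans(n_clusters=2, n_init=10, random_state=0)
    labels = km.fit_predict(cooc)

    for word, label in sorted(zip(vectorizer.get_feature_names_out(), labels),
                              key=lambda pair: pair[1]):
        print(label, word)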

Related Links

Links to new data arising from this project will be posted here in due course.

Project Team

  • Jonathan Clayton (Research Fellow, Digital Humanities Institute)
  • Matthew Groves (Senior Research Software Engineer, Digital Humanities Institute)
  • Michael Pidd (Principal Investigator, Digital Humanities Institute)
  • Dr Seth Mehl (Co-Investigator, Digital Humanities Institute)