The AHRC-funded Linguistic DNA project developed a process called Linguistic Concept Modelling, which identifies and extracts concepts from large text corpora. Concepts (abstract ideas) are represented as frequently recurring relationships between words in discourse. The process is data-driven: every word in a corpus is compared with every other word within windows of 100 tokens, using measures such as frequency, our enhanced PMI (pointwise mutual information), and chi-square scoring. The resulting concept models are trios of strongly co-occurring terms. With current thresholds, the concept models for EEBO-TCP and ECCO-TCP total 10 billion rows of data.
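To make the windowed comparison concrete, here is a minimal sketch of scoring word pairs by co-occurrence within a token window. It uses plain, textbook PMI rather than the project's enhanced PMI (whose details are not given here), scores pairs rather than trios, and the function name `window_pmi` and its parameters are illustrative assumptions, not part of the project's codebase.

```python
import math
from collections import Counter


def window_pmi(tokens, window=100, min_count=1):
    """Score unordered word pairs by plain PMI over a sliding token window.

    Assumption: this is standard PMI, log2(p(a, b) / (p(a) * p(b))),
    not the project's enhanced PMI.
    """
    word_counts = Counter(tokens)
    pair_counts = Counter()
    for i, left in enumerate(tokens):
        # Pair each token with the tokens that follow it inside the window.
        for right in tokens[i + 1 : i + window]:
            if left != right:
                pair_counts[tuple(sorted((left, right)))] += 1

    total_tokens = len(tokens)
    total_pairs = sum(pair_counts.values())
    scores = {}
    for (a, b), n in pair_counts.items():
        if n < min_count:
            continue  # drop pairs below the frequency threshold
        p_pair = n / total_pairs
        p_a = word_counts[a] / total_tokens
        p_b = word_counts[b] / total_tokens
        scores[(a, b)] = math.log2(p_pair / (p_a * p_b))
    return scores


if __name__ == "__main__":
    toy = "air earth water fire air earth water".split()
    for pair, score in sorted(window_pmi(toy, window=3).items()):
        print(pair, round(score, 3))
```

Extending this idea from pairs to trios, and applying frequency and chi-square thresholds alongside PMI, would be one way to arrive at the concept trios described above.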
This project emerges from a need, identified at the Oxford English Dictionary (OED), to improve the established method of manually selecting example sentences for headwords. Current practice at the OED is for lexicographers to search a range of historical text resources systematically, narrowing down the examples of polysemous words by context. They also manually compare example sentences with lists of collocational evidence drawn separately from existing corpora, and rely a great deal on researchers’ own intuitions. We aim to support and expedite this process, and to offer new insights into the nuances of word senses and subsenses and their change over time.
Our Linguistic Concept Modelling will enable OED staff to: access a lookup list of concepts that match the current headword (i.e. the headword appears as a term within the concept trio); select one or more concept trios that embody the meaning of the headword, in the desired historical time frame; run a search against EEBO-TCP and ECCO-TCP to retrieve a ranked, dated list of sentences in which the concept(s) occur; then review the results for use in the headword definition.
- The user submits a lemma as a search term, such as air.
- The interface returns a list of trios which contain the lemma, such as air-earth-water, air-spirit-animal, and air-mien-demeanour.
- The user selects a trio which is closest to their intended sense, such as air-spirit-animal.
- The interface returns all EEBO-TCP and ECCO-TCP occurrences of the trio air-spirit-animal as a hit list.
- The user can navigate the hit list, viewing text snippets from EEBO-TCP and ECCO-TCP in which the trio is highlighted.
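The steps above can be sketched roughly as follows. This is an illustrative toy, not the project's interface: the function names `trios_for_lemma` and `find_hits`, the `(source, year, tokens)` passage format, and the sample passages are all invented for the example. It assumes the text has already been lemmatised, and because the ranking criterion is not specified here, hits are simply ordered by date.

```python
def trios_for_lemma(lemma, trios):
    """Return every concept trio that contains the lemma as one of its terms."""
    return [trio for trio in trios if lemma in trio]


def find_hits(trio, passages):
    """Return (year, source, snippet) for passages containing all three terms.

    Terms belonging to the trio are highlighted with square brackets.
    Hits are sorted by year (an assumption; the real ranking may differ).
    """
    wanted = set(trio)
    hits = []
    for source, year, tokens in passages:
        if wanted <= set(tokens):
            snippet = " ".join(f"[{t}]" if t in wanted else t for t in tokens)
            hits.append((year, source, snippet))
    return sorted(hits)


# Trios taken from the example above; passages are made-up toy data.
TRIOS = [
    ("air", "earth", "water"),
    ("air", "spirit", "animal"),
    ("air", "mien", "demeanour"),
]
PASSAGES = [
    ("toy-B", 1700, "the animal spirit move through air".split()),
    ("toy-A", 1650, "air and water and earth".split()),
    ("toy-C", 1620, "the spirit of every animal need air".split()),
]

if __name__ == "__main__":
    print(trios_for_lemma("air", TRIOS))
    for year, source, snippet in find_hits(("air", "spirit", "animal"), PASSAGES):
        print(year, source, snippet)
```

At the scale described earlier, a linear scan like this would be impractical; a production system would presumably query an index of precomputed trio occurrences instead.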
- Dr Seth Mehl (Project Leader – The Digital Humanities Institute)
- Matthew Groves (Developer – The Digital Humanities Institute)
- Tamara Bowler (Product Manager – Dictionaries Division, OUP)
- James McCracken (Development Team Lead – Dictionaries Division, OUP)
- Philip Durkin (Deputy Chief Editor – Oxford English Dictionary, OUP)
- Michael Proffitt (Chief Editor – Oxford English Dictionary, OUP)