“The value of this collection is substantial – glancing through any of the scripts reveals long-forgotten stories, and the nature of the hour by hour story development offers a ‘real time’ perspective on history, rather than the ‘after the event’ summary that tends to end up in history books” (Jake Berger — Executive Product Manager, BBC Archive Editorial).
The AHRC-funded Linguistic DNA project developed a process called Linguistic Concept Modelling, which identifies and extracts concepts (abstract ideas) from large text corpora. Concepts are represented as frequently recurring relationships between words in discourse. For example, the concept of democracy might appear in a text as democracy-freedom-election, democracy-war-fascism, and democracy-Athens-history.
The process of identifying concepts is data-driven: it compares every word in a corpus with every other word within windows of 100 tokens, using measures such as frequency, our enhanced PMI (pointwise mutual information), and chi-square scoring. The resulting ‘concept models’ can then underpin text discovery services, for example by improving search queries so that results are more semantically accurate and meaningful for the end user.
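The windowed comparison described above can be illustrated with a minimal sketch. It uses standard PMI rather than the project's enhanced PMI, whose details are not given here. The function name `windowed_pmi`, its parameters, and the sample tokens are assumptions made for illustration only, not the project's code.

```python
import math
from collections import Counter


def windowed_pmi(tokens, window=100, min_count=1):
    """Score unordered word pairs by standard PMI (an assumption; the
    project's enhanced PMI is not specified in this description)."""
    word_counts = Counter(tokens)
    pair_counts = Counter()
    for i, left in enumerate(tokens):
        # Look ahead only, so each pair of positions is counted once.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                pair_counts[tuple(sorted((left, right)))] += 1

    total_words = len(tokens)
    total_pairs = sum(pair_counts.values())
    scores = {}
    for (a, b), n_ab in pair_counts.items():
        if n_ab < min_count:
            continue  # skip pairs too rare to score reliably
        p_ab = n_ab / total_pairs
        p_a = word_counts[a] / total_words
        p_b = word_counts[b] / total_words
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


# Hypothetical toy input; a real corpus would be tokenised script text.
sample = "democracy freedom election democracy war fascism".split()
for pair, score in sorted(windowed_pmi(sample, window=3).items()):
    print(pair, round(score, 3))
```

On a real corpus, the frequency and chi-square measures mentioned above would be computed from the same pair counts. A higher `min_count` helps here, because PMI tends to overrate rare pairs.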
Funded by the University of Sheffield’s Knowledge Exchange Fund, this project aims to process the BBC’s Radio News Scripts using our Linguistic Concept Modelling process. The BBC can use the resulting concept models as a search index within a Radio News Scripts search interface, or as descriptive metadata in any subsequent formal cataloguing of the collection.
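As a rough sketch of how concept models could serve as a search index, the example below maps each concept to the scripts in which it was found. The script identifiers, the concept data, and the function names `build_concept_index` and `search` are all hypothetical. The actual index format used with the BBC is not described here.

```python
from collections import defaultdict


def build_concept_index(script_concepts):
    """Map each concept (a frozenset of words) to the script IDs containing it."""
    index = defaultdict(set)
    for script_id, concepts in script_concepts.items():
        for concept in concepts:
            index[frozenset(concept)].add(script_id)
    return index


def search(index, query_word):
    """Return sorted IDs of scripts with a concept that includes query_word."""
    hits = set()
    for concept, script_ids in index.items():
        if query_word in concept:
            hits |= script_ids
    return sorted(hits)


# Hypothetical data: script IDs and concepts invented for illustration.
script_concepts = {
    "script-001": [("democracy", "freedom", "election")],
    "script-002": [("democracy", "war", "fascism"),
                   ("harvest", "weather", "farming")],
}
index = build_concept_index(script_concepts)
print(search(index, "democracy"))
```

Because a query word is matched against whole concepts, the same structure could also suggest related terms, such as "election" or "fascism" for "democracy", to refine a search.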
- Dr Seth Mehl (Project Leader – The Digital Humanities Institute)
- Matthew Groves (Developer – The Digital Humanities Institute)
- Jake Berger (Executive Product Manager – BBC Archive Editorial)
- Andrew Armstrong (Principal Software Engineer – BBC Archive Development, BBC Design & Engineering)