The last two years have seen the creation of the British Telecom Correspondence Corpus at Coventry University. Working with material from the vast public archive of BT, I have extracted, transcribed and marked up in TEI-compliant XML just over six hundred business letters written by nearly four hundred authors on a wide variety of topics, spanning the years 1853-1982. Though this is a crucial era in the development of business correspondence, it is currently under-represented in correspondence and business English corpora.
The wealth of authentic language data available in the public archive of BT makes it a very promising subject for linguistic study. However, several aspects of the way archive material is collected and organised make archives problematic as sources of corpus material. Many of these issues are relevant to the creation of digital resources in general. One major challenge is the identification of suitable material at item level. Archive collections are typically organised by ‘series’, meaning that types of records and documents relating to particular events are grouped together, but finding individual text types and individual items can prove difficult.
In this talk I will discuss the way in which the public archives I worked with are organised and catalogued, and the general implications that this has for the digitisation of that material. I will also talk about the specific challenges of working with the BT archive to construct the BT Correspondence Corpus, looking at letter identification, letter selection, and metadata extraction. Finally, I will discuss how, despite these challenges, the creation of digital resources can enhance physical archive collections.