Developing the Historical Thesaurus Semantic Tagger

As an ever-growing quantity of corpus data becomes available for humanities research, it has become increasingly important to develop tools and methods for automatically identifying the precise meaning in context of a word with multiple possible meanings. In the SAMUELS Project [i], we are addressing this issue by developing a semantic annotation system capable of tagging words and phrases using the fine-grained, comprehensive semantic structure developed for the Historical Thesaurus of English [ii]. The system is intended to support accurate conceptual searches whose results can be automatically aggregated at a range of levels of precision, particularly in big data contexts.

Our work draws upon the lexical knowledge resource of the Historical Thesaurus of English (HT), compiled at the University of Glasgow, and a software toolkit developed at Lancaster University, including the CLAWS part-of-speech tagger, the USAS semantic annotation system and the VARD variant spelling normaliser [iii]. The HT provides a high-quality semantic lexical database containing approximately 800,000 entries classified into around 236,000 semantic categories organised in a hierarchical structure. A key challenge is to scale up the semantic disambiguation, currently based on a smaller semantic field taxonomy of 232 tags for modern English, to that of the HT, whose much finer-grained distinctions are more suitable for annotating texts from Early Modern English onwards. In this paper we provide an overview of our integration of existing word sense disambiguation techniques with novel date-based disambiguation methods that exploit the complex hierarchy of meanings in the HT. Figure 1 illustrates the outline of the system [iv]. During the project, this system will be made available to the academic community as part of a SAMUELS web-based service.

Fig. 1: Outline of semantic annotation system
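
To make the idea of date-based disambiguation concrete, the following is a minimal sketch in Python rather than a description of the SAMUELS implementation. It assumes that each candidate sense of a word carries a first and (optionally) last attested year, and that senses not in use at the date of the text can be discarded before any further disambiguation. The `Sense` record, the `senses_in_use` function, the category labels and the dates are all hypothetical and invented for illustration; they are not taken from the HT.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Sense:
    """A candidate sense of a word (hypothetical record layout)."""
    category: str                        # illustrative label, not a real HT code
    first_attested: int                  # first year the sense is recorded
    last_attested: Optional[int] = None  # None means still in current use


def senses_in_use(candidates: List[Sense], text_year: int) -> List[Sense]:
    """Keep only the senses whose attested date range covers text_year."""
    return [
        s for s in candidates
        if s.first_attested <= text_year
        and (s.last_attested is None or text_year <= s.last_attested)
    ]


# Invented example data for a single ambiguous word.
candidates = [
    Sense("sense-A", first_attested=1300, last_attested=1650),
    Sense("sense-B", first_attested=1580),
    Sense("sense-C", first_attested=1920),
]

# For a text dated 1600, sense-C is excluded because it is attested too late.
remaining = senses_in_use(candidates, text_year=1600)
print([s.category for s in remaining])
```

A filter of this kind only narrows the candidate set; choosing among the remaining senses would still rely on contextual disambiguation techniques such as those mentioned above.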

This system will be applied to and tested in a number of sub-projects, including the distant reading of a large corpus of historical parliamentary speeches (Hansard 1803-2003), and a study which uses the concept of aggression to explore the nuances of shifting genre-specific meanings in the annotated EEBO-TCP corpus.

