Session 18 — Modelling Meaning
Saturday 11:30–13:00
Chair: Michael Pidd
King’s College London
Ontologies are formal, structured representations of knowledge within a domain, modelled as the types of concepts or objects which exist within that domain, together with their properties and relationships. They are widely used in knowledge management systems, for functions as varied as automated reasoning, recommendation systems, and faceted searching and browsing, and they form an integral part of the Semantic Web. Ontologies are most effective in knowledge domains with an agreed common terminology and a shared understanding of the semantic and conceptual structures of the domain.
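The structure described above can be sketched in miniature. The following Python fragment models an ontology as subject-predicate-object triples and shows one simple form of automated reasoning, the transitive closure of an "is_a" hierarchy; all class and relation names are invented for illustration, not drawn from any real ontology.

```python
# A toy ontology as a set of (subject, predicate, object) triples.
# The names below are illustrative only.
triples = {
    ("Sonnet", "is_a", "Poem"),
    ("Poem", "is_a", "LiteraryWork"),
    ("LiteraryWork", "has_property", "author"),
}

def ancestors(cls):
    """All classes reachable from cls via is_a links (a minimal
    example of the automated reasoning an ontology supports)."""
    found = set()
    frontier = {cls}
    while frontier:
        nxt = {o for (s, p, o) in triples if p == "is_a" and s in frontier}
        frontier = nxt - found
        found |= frontier
    return found

# A Sonnet is transitively a Poem and a LiteraryWork, so it can
# inherit LiteraryWork's properties (e.g. author) for faceted browsing.
print(sorted(ancestors("Sonnet")))  # ['LiteraryWork', 'Poem']
```

Real systems express the same idea in RDF/OWL rather than Python sets, but the reasoning step is the same: inferences follow the declared relationships, which is why the shared conceptual agreement discussed above matters so much.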
This paper will examine issues related to the deployment of ontologies in digital infrastructure for the humanities. Processes for naming, describing and categorizing phenomena are fundamental to research in these disciplines, while interpretation and framing are also central. Ontologies are therefore potentially very valuable for structuring discourse and analysis. But the use of ontologies in the humanities raises a number of important difficulties.
Some of these are primarily linguistic in nature. The same concept or phenomenon may be described using different terms by different researchers. The same term may be capable of referring to different concepts or phenomena, depending on the context. The understanding of a knowledge domain is likely to have changed dramatically over time, along with the vocabulary used. Much humanities research involves languages other than English, either for the subject of the research or for the research discourse itself.
There are important conceptual and semantic differences between the various humanities disciplines, as well as between researchers within a single discipline. This is reflected in the terminologies, vocabularies and intellectual models they use. While a common ground is essential for interdisciplinary communication, it is difficult for traditional ontologies to provide such a bridge. Most upper-level interdisciplinary ontologies are too generalized, while domain-specific ontologies are too narrow.
Does this mean that ontologies are ineffective for interdisciplinary research in the humanities? Are specialized vocabularies the most we can aim for? Do we have to fall back on purely linguistic associations discovered through keyword and phrase searches? Is it inherently impractical to build knowledge management systems of the kind developed for scientific disciplines?
This paper will discuss these questions, drawing particularly on experiences gained during the Humanities Networked Infrastructure (HuNI) project. It will offer some suggestions for the use of ontology-like approaches in the humanities, aimed at building a network of semantic assertions, which embody a range of different perspectives on knowledge and provide a foundation for future interdisciplinary research and development.
As an ever-growing quantity of corpus data becomes available for humanities research, it has become increasingly important to develop tools and methods for automatically identifying in texts the precise meaning of a word with multiple possible meanings. In the SAMUELS Project[i], we are addressing this issue by developing a semantic annotation system capable of tagging words and phrases using the fine-grained, comprehensive semantic structure developed for the Historical Thesaurus of English[ii]. This system is designed to support accurate conceptual searches whose results can be automatically aggregated at a range of levels of precision, particularly in big data contexts.
Our work draws upon the lexical knowledge resource of the Historical Thesaurus of English (HT), compiled at the University of Glasgow, and a software toolkit developed at Lancaster University comprising the CLAWS part-of-speech tagger, the USAS semantic annotation system and the VARD variant spelling normaliser[iii]. The HT provides a high-quality semantic lexical database containing approximately 800,000 entries classified into around 236,000 semantic categories organised in a hierarchical structure. A key challenge is to scale up the semantic disambiguation, currently based on a smaller semantic field taxonomy of 232 tags for modern English, to the HT's much finer-grained distinctions, which are better suited to annotating texts from Early Modern English onwards. In this paper we provide an overview of our integration of existing word sense disambiguation techniques with novel date-based disambiguation methods that exploit the complex hierarchy of meanings in the HT. Figure 1 illustrates the outline of the system[iv]. During the project, this system will be made available to the academic community as part of a SAMUELS web-based service.
Fig. 1: Outline of semantic annotation system
This system will be applied to and tested in a number of sub-projects, including the distant reading of a large corpus of historical parliamentary speeches (Hansard 1803-2003), and a study which uses the concept of aggression to explore the nuances of shifting genre-specific meanings in the annotated EEBO-TCP corpus.
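The date-based disambiguation mentioned above can be illustrated with a minimal, hypothetical sketch: each candidate HT sense is assumed to carry a period of currency, and senses not attested at the date of the text are filtered out before any further disambiguation. The sense records, dates and category labels below are invented for illustration, not taken from the Historical Thesaurus.

```python
# Hypothetical sense inventory: each sense of a word carries an
# (invented) period of currency, mimicking the attestation dates
# that the Historical Thesaurus records for each meaning.
senses = {
    "gay": [
        {"tag": "02.02.12 Joy", "first": 1325, "last": 1900},
        {"tag": "01.05.05 Sexuality", "first": 1935, "last": 9999},
    ],
}

def candidate_senses(word, text_year):
    """Keep only the senses attested at the date of the text,
    narrowing the search space before other disambiguation steps."""
    return [s["tag"] for s in senses.get(word, [])
            if s["first"] <= text_year <= s["last"]]

# For an 1850 text, only the sense current in 1850 survives.
print(candidate_senses("gay", 1850))  # ['02.02.12 Joy']
```

In a historical corpus such as Hansard, where the date of each speech is known, this kind of filter is a natural complement to context-based word sense disambiguation.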
[i] For further details, see http://www.gla.ac.uk/schools/critical/research/fundedresearchprojects/samuels.
Computational analyses of texts are often based on prior quantification of low-level linguistic features, such as the most frequent words or occurrences of specific grammatical constructions. Arguably, analyses based on such data are intellectually remote from traditional forms of literary scholarship, which generally focus on the description and interpretation of aspects such as metre, figures of speech, imagery or themes. This presentation reports the results of a study that contributes to aligning traditional scholarly practices with data-driven methods through the quantification of a wide range of literary devices.
The study focused on poetry. Software was built to recognise various forms of rhyme, alliteration, enjambment, onomatopoeia, refrains and imagery. The resulting annotations were recorded using the data model proposed by the Open Annotation Collaboration. In addition, a number of techniques were developed for visualising these annotations. These visualisation techniques can be used, firstly, to expose patterns within the corpus in its entirety, allowing for a form of distant reading; the graphic abstractions derived from data on individual poems can also support close reading.
The algorithms have been tested extensively on machine-readable versions of the poetry of Louis MacNeice. The software enables scholars to explore correlations between, for instance, specific figures of speech and imagery, or to identify noteworthy uses of literary devices within specific parts of the corpus. Such forms of analysis help to bridge the gap between the essentially quantitative and realist inclinations of the toolset on the one hand, and the largely interpretative and qualitative approach of the discipline in which these methods are adopted on the other.
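As a rough illustration of the kind of device-recognition software described above, the sketch below detects end-rhyme with a crude orthographic heuristic. The study's actual algorithms are not documented here, and a serious implementation would work from phonetic transcriptions rather than spelling; the sample stanza is invented.

```python
import re

def rhyme_part(word):
    """Portion of the word from its last vowel cluster onward,
    a crude orthographic stand-in for phonetic rhyme matching."""
    word = word.lower()
    m = re.search(r"[aeiouy]+[^aeiouy]*$", word)
    return m.group() if m else word

def end_rhymes(lines):
    """Pairs of line indices whose final words share a rhyme part."""
    finals = [re.findall(r"[A-Za-z']+", ln)[-1] for ln in lines]
    parts = [rhyme_part(w) for w in finals]
    return [(i, j) for i in range(len(parts))
            for j in range(i + 1, len(parts)) if parts[i] == parts[j]]

# Invented stanza with an ABCB-like pattern on lines 0 and 2.
stanza = ["The night was dark and cold,",
          "The lamp burned low and dim,",
          "A tale of old was told,",
          "A quiet evening hymn,"]
print(end_rhymes(stanza))  # [(0, 2)]
```

Annotations produced this way (a rhyme link between two line positions) map naturally onto the target-plus-body structure of the Open Annotation data model, which is what makes them reusable for the visualisations described above.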