Document annotation is a central method for many Digital humanities projects. Depending on the complexity of the information to be annotated, it can be done in an automatic or manual manner. Current annotation systems have a major drawback when used on documents from humanities: one can state only if a certain feature /information is present or not in the material: thus a typical 1/0 discrimination. Although it is true that through digitization the “continuum” of any humanities object gest lost, such annotation strategy leads not only to quite dramatic discretization of the initial object but can lead to wrong interpretations. Especially annotation of historical facts and documents must consider and integrate degrees of vagueness in the annotation.
In the project HerCoRe –Hemeneutic and Computer-based Analysis of Reliability, Consistency and Vagueness in historical texts (financed by the Volkswagen Stiftung), we investigate:
Which is the influence linguistic vagueness (e.g. vague adjectives, non intersectives, hedges, subjunctives) on the presented historical facts
The degree to which the author is making use by purpose of vague expressions
How reliable are expressions of “certainty” compared with quoted historical sources
We distinguish between several vagueness layers: linguistic, editorial (markers by editors /translators) and factual (omission, wrong quotation).
The object of our analysis are the texts of Dimitrie Cantemir, prince of Moldavia, whose works from the 18th century about the Ottoman Empire and Moldavia were already translated at that time in many languages like English, German and French.
We propose a framework for vagueness annotation, which allows the user:
to navigate any time between different annotation levels,
to mark-up discontinuous assertions and
to introduce concurrent mark-up
Linguistic vagueness is annotated in a semi-automatic manner, editorial vagueness is inferred partially from the text, factual vagueness is annotated manually. A fuzzy knowledge base allows also the annotation of named entities (person, geographical information, dates).