by Toby Burrows
Over the last twenty years there has been a proliferation of digital data relating to medieval and Renaissance manuscripts, in the form of catalogues, databases, digital editions and digital images. But there is little in the way of interoperable digital infrastructure to link these disparate sources together, and the evidence base for manuscript research is, for the most part, fragmented and scattered. As a result, large-scale research questions remain very difficult, if not impossible, to answer.
The Mapping Manuscript Migrations (“MMM”) project, funded by the Trans-Atlantic Platform under its Digging into Data Challenge for 2017-2019, is aiming to address these problems. It is led by the University of Oxford, in partnership with the University of Pennsylvania, Aalto University in Helsinki, and the Institut de recherche et d’histoire des textes (IRHT) in Paris. The project is building a coherent framework to link manuscript data from various disparate sources, with the aim of enabling searchable and browsable semantic access to aggregated evidence about medieval and Renaissance manuscripts.
This paper reports on the first twelve months of the MMM project. It also compares this project with earlier work of a related kind, undertaken during a Marie Curie International Incoming Fellowship at King’s College London (2014-2016). The earlier project (the “Phillipps Project”) focused on a single, very large manuscript collection, that of Sir Thomas Phillipps (1792-1872). It adopted a different approach to data modelling and matching, and used a different software platform.
2. Scope and Research Questions
Provenance research in the field of cultural heritage studies deals with the origin and ownership histories of objects such as artworks, sculptures, books, and manuscripts. For the history of art, the main focus is on authentication – proving that a painting can be attributed to a specific artist by tracing the chain of ownership from the present day back to the creation of the work (Yeide et al. 2001). For manuscripts, on the other hand, a much wider range of research questions is investigated (Pearson 1998). They might include:
- The owners and origins of a specific manuscript;
- The assembling and dispersal of collections of manuscripts;
- The movement of manuscripts between different locations at different times; and,
- The relationship between manuscript collections and wider cultural and social trends.
The Phillipps Project focused on a vast collection of manuscripts assembled by one man in the nineteenth century, but subsequently dispersed, and looked at where those manuscripts are now and who owned them after Phillipps himself. It also covered the earlier histories of those manuscripts: what their origins were, how and when Phillipps acquired them, and where they had been in the intervening centuries.
Mapping Manuscript Migrations has a much broader scope, reflecting the breadth of the data contained in its source datasets. This project covers the histories of medieval and Renaissance manuscripts from Western Europe since their creation, and aims to address a broad set of research questions across the whole body of data. These include:
- A general analysis of the history and movement of medieval and Renaissance manuscripts over the centuries: how many manuscripts have survived; where they are now; and which people and institutions have been involved in their history;
- A more specific analysis of the mobility of manuscripts and its traceability, within and from French space, since the Wars of Religion;
- A comparison of the respective significance of institutional and individual manuscript collectors in the 19th and 20th centuries.
The MMM Project is also interested in more specific questions relating to these manuscripts, such as the following:
- Tracing the history and movement across time and space of a single manuscript;
- Tracing the history and movement across time and space of a group of manuscripts, with reference to one or more specific criteria: a similar place of origin; the same current location; the same language; a similar subject; a similar text; the same period of origin; or the same current or previous owner(s);
- Finding the connections between specific actors, and also between specific manuscripts or groups of manuscripts.
A representative example of such questions would be the history of a specific manuscript collection, like that of the Collège de Sorbonne.
3. Sourcing the Data
At the heart of the MMM Project is the ingestion of data from three existing digital sources: the Schoenberg Database of Manuscripts, the catalogue of Medieval Manuscripts in Oxford Libraries, and the IRHT database Bibale (Wijsman 2017). Two of these (Schoenberg and Bibale) focus on evidence relating to manuscript provenance, while the Oxford catalogue consists of more general manuscript records. Between them they contain about 260,000 entries relating to manuscripts, as well as numerous records for persons, places, and organizations. Another IRHT database, Medium, is expected to be added during 2019.
Schoenberg and Bibale are relational databases, which can be transformed into a raw RDF output, prior to mapping to the unified RDF data model devised by the project. The Oxford database, on the other hand, consists of TEI-encoded XML documents. To convert these to RDF required a more elaborate pipeline, in which a defined subset of elements and attributes from each of the TEI files was extracted and combined into a single XML document. This document was then mapped against the project’s unified data model to produce RDF representations of the relationships involved. All three RDF representations were then loaded to the same server, where they could be inspected and debugged, prior to matching on names, places, and other Linked Data identifiers.
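The TEI extraction step described above can be illustrated with a minimal Python sketch. It is not the project's actual pipeline: the sample TEI document, the choice of elements, and the "ex:" predicate names are all assumptions made for illustration, and the output is a list of simple subject-predicate-object tuples rather than RDF conforming to the MMM data model.

```python
import xml.etree.ElementTree as ET

TEI_URI = "http://www.tei-c.org/ns/1.0"
TEI_NS = {"tei": TEI_URI}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical, heavily simplified TEI manuscript description.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><sourceDesc>
    <msDesc xml:id="manuscript_1">
      <msIdentifier>
        <settlement>Oxford</settlement>
        <repository>Bodleian Library</repository>
        <idno>MS. Example 1</idno>
      </msIdentifier>
      <history>
        <origin><origPlace>France</origPlace></origin>
      </history>
    </msDesc>
  </sourceDesc></fileDesc></teiHeader>
</TEI>"""

# Subset of TEI elements to keep, each paired with a placeholder predicate.
# The "ex:" predicates are illustrative only, not terms from the MMM model.
FIELD_MAP = {
    "tei:msIdentifier/tei:idno": "ex:shelfmark",
    "tei:msIdentifier/tei:repository": "ex:heldBy",
    "tei:history/tei:origin/tei:origPlace": "ex:placeOfOrigin",
}


def tei_to_triples(tei_text):
    """Extract the selected elements of each msDesc as (s, p, o) tuples."""
    root = ET.fromstring(tei_text)
    triples = []
    for ms in root.iter("{%s}msDesc" % TEI_URI):
        subject = "ex:" + ms.get(XML_ID, "unknown")
        for path, predicate in FIELD_MAP.items():
            for element in ms.findall(path, TEI_NS):
                value = (element.text or "").strip()
                if value:
                    triples.append((subject, predicate, value))
    return triples


for triple in tei_to_triples(SAMPLE_TEI):
    print(triple)
```

Keeping the element selection in a single mapping table mirrors the idea of a "defined subset of elements and attributes": the subset can be reviewed and changed without touching the extraction logic.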
For the Phillipps Project, a much more selective and manual process was used for creating and ingesting data. This partly reflected the capabilities of the nodegoat software used in the project (discussed in section 6 below), but also reflected the nature of the evidence. There is no data source devoted exclusively to Phillipps manuscripts, so relevant records had to be extracted from a variety of different sources. In some cases, exporting a selection of records is relatively straightforward; the Schoenberg Database of Manuscripts allows for CSV exports of any number of records, including the entire database. In other cases – notably some kinds of library catalogues – it simply was not possible to identify a selection of records and package them for download as a CSV file.
As well as these digital sources, there were numerous non-digital sources of evidence. The most important of these were the annotated copies of the printed Phillipps Catalogus originally compiled by A. N. L. Munby and distributed, after his death in 1974, to the Cambridge University Library and the Bodleian Library in Oxford (Phillipps and Munby 1968). Munby’s handwritten (and often heavily abbreviated) notes on the sales and ownership of individual manuscripts were subsequently added to, over at least thirty years, by librarians in the institutions concerned. These entries cannot be directly digitized and transformed into database entries; they have to be manually transcribed into spreadsheets before they can be loaded to nodegoat. A similar situation applies to the many drawers of index cards in the British Library’s Manuscripts Reading Room, which contain similar handwritten and abbreviated information.
Batch uploads of data in nodegoat are carried out through CSV files. For the ingest process, the columns in a CSV file must be mapped to the appropriate element in a nodegoat object or sub-object record. The ingest process allows for various choices relating to the automatic creation of new records and mapping to existing names, places, and other entities.
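The column-mapping step can be sketched in Python as follows. This is an illustration of the general idea only and does not use or reproduce nodegoat's own import interface: the column headings, the target names ("object", "sub_object:sale"), the field names, and the sample row are all hypothetical.

```python
import csv
import io

# Hypothetical spreadsheet columns, each mapped to a (target, field) pair.
# "object" stands for the manuscript record; "sub_object:sale" stands for
# a sale event attached to it. None of these names come from nodegoat.
COLUMN_MAP = {
    "Phillipps no.": ("object", "phillipps_number"),
    "Title": ("object", "title"),
    "Sale date": ("sub_object:sale", "date"),
    "Buyer": ("sub_object:sale", "buyer"),
}

# Invented sample data, standing in for a manually transcribed spreadsheet.
SAMPLE_CSV = """Phillipps no.,Title,Sale date,Buyer
1234,Example psalter,1946-07-01,Example Dealer
5678,Example chronicle,,
"""


def map_rows(csv_text):
    """Turn each CSV row into a nested record keyed by target, then field."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {}
        for column, (target, field) in COLUMN_MAP.items():
            value = (row.get(column) or "").strip()
            if value:  # skip empty cells rather than storing blanks
                record.setdefault(target, {})[field] = value
        records.append(record)
    return records


for record in map_rows(SAMPLE_CSV):
    print(record)
```

In this sketch a row with no sale information simply produces no sale sub-object, which is one possible answer to the question of when new records should be created automatically.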
4. Data Models
The two projects have taken different approaches to data modelling. For the MMM Project, the data models used by the three main data sources were analysed and compared, and a unified data model was derived from that comparison. This process was complicated by the fact that two of the data sources (the Oxford catalogue and Bibale) take the manuscript as their fundamental unit, around which a range of descriptive and event-based elements are constructed. The Schoenberg Database, on the other hand, puts a provenance event (described as an “observation”) at the centre of its data model. Provenance events (sales, ownership, gifts, and so on) relating to the same manuscript may be linked to a “manuscript” entity, but this does not have the same breadth of descriptive attributes and relationships that the manuscript entities in the other two databases have. For the Schoenberg Database these descriptive attributes and relationships are attached, in the main, to the provenance events.
The MMM unified data model is expressed using elements from the CIDOC-CRM and FRBRoo ontologies (Ore et al. 2015; Bekiari et al. 2016). These were chosen for two main reasons: they are standards for data harmonization in museums and libraries respectively, and they support event-based modelling, which suits the event-based nature of provenance data and research.
For the Phillipps Project, on the other hand, a customized data model was developed which was structured in accordance with the requirements of the nodegoat software. The central object type was the manuscript, to which various descriptive metadata elements were attached. The events in the history of a manuscript were handled as types of “sub-objects”, in nodegoat terminology. These included sale, creation, ownership, and donation – as well as time-based descriptions, e.g., in sale catalogues or collection inventories. The TEI Manuscript Description Guidelines (TEI Consortium 2019) and the CIDOC-CRM ontology were both consulted in developing the nodegoat data model, with the result that it can be mapped to both of these sources.
Although the data models from the two projects were developed using two different technical approaches, the essential structure is similar. At the centre of each model is the manuscript as a physical object, together with various descriptive properties. Associated with it are a series of different types of provenance events, through which the manuscript is linked to other classes of entities: persons, organizations, places, and sources.
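The shared structure of the two models can be sketched with a few Python dataclasses. This is a simplified illustration, not either project's schema: the class names, field names, and sample values are assumptions, and the real models (CIDOC-CRM/FRBRoo classes in one case, nodegoat objects and sub-objects in the other) are considerably richer.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Actor:
    name: str
    kind: str  # "person" or "organization"


@dataclass
class ProvenanceEvent:
    event_type: str                 # e.g. "production", "sale", "ownership"
    year: Optional[int] = None
    place: Optional[str] = None
    actors: List[Actor] = field(default_factory=list)
    source: Optional[str] = None    # e.g. a sale catalogue or inventory


@dataclass
class Manuscript:
    label: str
    language: Optional[str] = None  # one of several descriptive properties
    events: List[ProvenanceEvent] = field(default_factory=list)

    def linked_actors(self):
        """Actors reachable from the manuscript through its events."""
        seen = []
        for event in self.events:
            for actor in event.actors:
                if actor not in seen:
                    seen.append(actor)
        return seen


# Invented example: the manuscript is linked to a collector only via an event.
collector = Actor("Example collector", "person")
ms = Manuscript(
    label="Example manuscript",
    language="Latin",
    events=[
        ProvenanceEvent("production", year=1450, place="Paris"),
        ProvenanceEvent("sale", year=1850, place="London", actors=[collector]),
        ProvenanceEvent("ownership", year=1851, actors=[collector]),
    ],
)
print([actor.name for actor in ms.linked_actors()])
```

The point of the sketch is that persons, organizations, and places are never attached to the manuscript directly; they are reached through the events in its history.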
5. Data Matching and Reconciliation
A vital step in aggregating and combining data from the different source datasets in the MMM Project has been to identify references in each dataset to the same person, place, organization, work, or manuscript. Fortunately, all the sources had already implemented their own programme of linking to standard identifiers, including several core Linked Data vocabularies. Pointers to the same identifier from two different datasets have been taken to mean that the same place or person is being referred to. This process has been most effective for places, where the Getty Thesaurus of Geographic Names (TGN) has been used systematically by three of the datasets, and persons, where the Virtual International Authority File (VIAF) has been similarly used.
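Matching on shared identifiers can be illustrated with a short Python sketch. It assumes each source record has already been reduced to a dataset name, a local identifier, a label, and an optional authority URI; the record layout, the local identifiers, and the example.org URIs are placeholders, not real TGN or VIAF identifiers.

```python
from collections import defaultdict

# Invented records; the authority URIs are placeholders.
RECORDS = [
    {"dataset": "schoenberg", "local_id": "name-1", "label": "Paris",
     "authority_uri": "https://example.org/place/1"},
    {"dataset": "bibale", "local_id": "lieu-7", "label": "Paris (France)",
     "authority_uri": "https://example.org/place/1"},
    {"dataset": "oxford", "local_id": "place-3", "label": "London",
     "authority_uri": "https://example.org/place/2"},
    {"dataset": "oxford", "local_id": "place-9", "label": "Unknown",
     "authority_uri": None},
]


def match_by_identifier(records):
    """Group records by authority URI; keep URIs shared across datasets."""
    groups = defaultdict(list)
    for record in records:
        uri = record.get("authority_uri")
        if uri:  # records without an identifier cannot be matched this way
            groups[uri].append(record)
    return {
        uri: [(r["dataset"], r["local_id"]) for r in recs]
        for uri, recs in groups.items()
        if len({r["dataset"] for r in recs}) > 1
    }


print(match_by_identifier(RECORDS))
```

Records that carry no external identifier fall outside this approach entirely, which is why entity types with sparse identifier coverage still call for manual matching.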
For works, the Bodleian data contain some identifiers for Linked Data vocabularies. Out of more than 12,000 works with unique Bodleian identifiers, more than 500 are linked to either Pinakes (for Greek works) or Mirabile (for Latin works). Like the identifiers for other entity types, these can be seen in the relevant files on the Bodleian Library’s GitHub site. The other source datasets either contain little information about works or do not yet provide external identifiers. For manuscripts, an International Standard Manuscript Identifier is under development (Cassin 2018), but the best current option for matching appears to be the identifiers in Medium, which are already re-used by Bibale. Medium also contains links to manuscript records in other IRHT databases, such as Pinakes. The MMM Project is currently focusing on manual matching of manuscript records, with a view to developing algorithms for semi-automated identification of candidates for matching.
For the Phillipps Project, in contrast, automated data matching and reconciliation have not been a major requirement, since named entities have primarily been matched and reconciled by hand. nodegoat does, however, enable the lookup of SPARQL endpoints on the Web, including vocabularies like DBpedia, Wikidata, and VIAF (nodegoat 2015). The Phillipps Project has included some work on incorporating name identifiers from these sources. The personalized hosted version of nodegoat defaults to GeoNames identifiers for contemporary place names, so these are embedded in place entity records.
6. Software Environments
The MMM Project is designing its own search and discovery interface based around the six main classes of aggregated data: Manuscripts, Places, Persons, Organizations, Works, and Events. At present, the Manuscripts perspective has been implemented, and the other five perspectives will be based on it. The Manuscripts perspective provides a list of all manuscripts, which can be filtered by data source and sorted by place of production; it can also be filtered and searched by author and/or place of production. Each entry in the list is linked to its source record in the original dataset.
Two initial visualizations of the data have been tested, both of which are grounded in an OpenStreetMap base layer. The first mapped the places of production, while the second mapped the migrations between places of production and the most recently observed location – calculated as the place connected with the “source” agent of the most recent acquisition or observation (i.e., a sale or auction or institutional catalogue).
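The calculation behind the second visualization can be sketched in Python as follows. This is an interpretation of the description above rather than the project's code: the dictionary keys ("place_of_production", "events", "year", "source_agent_place") and the sample data are assumptions, and events without a date or a place are simply ignored.

```python
def latest_observed_place(events):
    """Place of the source agent of the most recent dated event, if any."""
    usable = [
        e for e in events
        if e.get("year") is not None and e.get("source_agent_place")
    ]
    if not usable:
        return None
    return max(usable, key=lambda e: e["year"])["source_agent_place"]


def migration(manuscript):
    """Return (place of production, last observed place), or None."""
    origin = manuscript.get("place_of_production")
    destination = latest_observed_place(manuscript.get("events", []))
    if origin and destination:
        return (origin, destination)
    return None


# Invented example: two dated observations and one undated one.
MANUSCRIPT = {
    "place_of_production": "Paris",
    "events": [
        {"year": 1836, "source_agent_place": "London"},
        {"year": 1972, "source_agent_place": "New York"},
        {"year": None, "source_agent_place": "Oxford"},
    ],
}

print(migration(MANUSCRIPT))
```

Each resulting pair corresponds to one line on the map, drawn from the place of production to the last place at which the manuscript was recorded.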
The MMM system is being built by the Semantic Computing Group at Aalto University, using components already developed for services like BiographySampo. The backend is built in Node.js, and the frontend in React and Redux. The RDF triples are stored in the Linked Data Finland platform, which consists of Fuseki SPARQL servers and a Varnish Cache application for routing URIs and negotiating content.
The Phillipps Project uses a personal instance of the nodegoat software developed by Lab1100 in the Netherlands. nodegoat is a Web-based data management, analysis and visualisation environment (van Bree and Kessels 2015). It is built around an SQL relational database and runs on top of a Web application framework called 1100CC, which provides a front-end web and communication interface as well as a back-end management system. The design of nodegoat is influenced by actor-network theory, inasmuch as it implements three main levels for constructing an assemblage of objects: (1) the identification of all objects and their own possible definitions, relations and associations (cross-referencing); (2) the identification of all definitions, relations and associations related to each object (cross-referenced); and (3) the identification of objects which are associated relationally, both in space and in time. An open source version of nodegoat became available in early 2019.
Two main visualizations are produced out-of-the-box by nodegoat. The first is map-based, and shows the geographical location associated with each sub-object (i.e., the places involved in the production, sales, and ownership events of every manuscript). Lines between places connected with the same manuscript enable that manuscript’s migrations over time to be visualized. A time-slider enables the visualizations to be limited to a specific time period. The second visualization is a network diagram showing connections between entities (defined as nodegoat “objects”), with the lines connecting them representing events or sub-objects. A time-slider here too enables the state of the network to be shown for a specific time period.
7. Analysis and Evaluation
The MMM Project’s digital service is intended to be used as the basis for large-scale analysis of the history and movement of Western medieval and Renaissance manuscripts over the centuries, as well as for more specific research focusing on particular collectors and collections. It has not yet been released beyond the project team, but two types of evaluation are planned during the second year of the project: (1) the effectiveness of the service in answering specific research questions, and (2) the functionality of the user interface. Some benchmark information about user requirements was gathered at the start of the project from a focus group of medieval and Renaissance manuscript researchers, held in the Bodleian Library at the University of Oxford.
A set of specific research questions has been developed and is currently being used to evaluate the effectiveness of the data model and its implementation in the initial aggregation of the source datasets. These include such topics as:
- How many manuscripts were produced in London in the 15th century?
- What was the most popular text by a medieval author in France in the 17th century?
- Which manuscripts did French collectors acquire from dissolved English monasteries?
The initial testing of these questions is being carried out using SPARQL queries against the MMM Project’s triple store. The results will subsequently be assessed by project researchers familiar with the contents of the source datasets.
The user interface for the MMM service was still under development at the end of the first year of the project and had not yet undergone any formal evaluation. It is expected that a similar set of research questions will be used in the testing process, which will incorporate the use of the visualization functionality planned for the interface as well as its searching and browsing functions.
The digital environment for the Phillipps Project was developed primarily as a personal resource. Though a public version of the nodegoat site has been made available, and some informal feedback about it has been received, its user interface and content have not been subject to any formal evaluation. It has been used, for the most part, as the basis for analysing specific aspects of the Phillipps Collection and its subsequent dispersal. Two examples of this kind of analysis are an assessment of the legacy of Thomas Phillipps in Australia and New Zealand (Burrows 2018) and an overview of Phillipps manuscripts in North America (Burrows 2017).
The MMM Project and the Phillipps Project are both concerned with assembling and making use of data relating to the history and provenance of Western medieval and Renaissance manuscripts. Each project has assembled a series of research questions relevant to this field, which can be used to test the effectiveness of their data models and user interfaces. But they are doing this at rather different scales, and for rather different purposes, so any direct comparisons must be made with some caution.
The digital environment developed for the Phillipps Project was fit for purpose inasmuch as it met the requirements of a single researcher looking for an all-in-one solution for gathering, ingesting, analysing, and visualizing data. It was not designed specifically with data sharing in mind, but its data model is compatible with ontologies like CIDOC-CRM and FRBRoo, and its use of identifiers makes it possible to envisage exposing its data as Linked Data in the future.
The MMM Project is explicitly intended to exist in a Linked Data world, both in the ontologies which feed into its data model and in the way in which identifiers are central to its matching and reconciliation processes. As well as building and testing pipelines for aggregating and combining manuscript-related data from relational databases and TEI-XML documents into an RDF triple store, and providing a user interface for exploring manuscript history and provenance on a significantly large scale, the MMM service is also making a contribution towards the future development of a broader Linked Data environment for medieval and Renaissance studies.
Bekiari, C & Doerr, M & Le Bœuf, P & Riva, P 2016, Definition of FRBRoo: a Conceptual Model for Bibliographic Information in Object-Oriented Formalism, version 2.4, viewed 28 February 2019 <https://www.ifla.org/files/assets/cataloguing/FRBRoo/frbroo_v_2.4.pdf>.
Burrows, T 2017, ‘Manuscripts from the Collection of Sir Thomas Phillipps in North American Institutional Collections’, Manuscript Studies vol. 1, no. 2, pp. 307-327.
Burrows, T 2018, ‘The Legacy of Sir Thomas Phillipps in Australia and New Zealand’, Script & Print vol. 42, no. 2, pp. 94-116.
Cassin, M 2018, ‘ISMI: International Standard Manuscript Identifier: Project of unique and stable identifiers for Manuscripts’, viewed 28 February 2019 <https://www.manuscript-cultures.uni-hamburg.de/files/mss_cataloguing_2018/Cassin_pres.pdf>.
nodegoat 2015, ‘Linked Data versus Curation Island’, nodegoat.net blog, viewed 27 February 2019 <https://nodegoat.net/blog.p/82.m/12/linked-data-vs-curation-island>.
Ore, CE & Doerr, M & Le Bœuf, P & Stead, S 2015, Definition of the CIDOC Conceptual Reference Model, version 6.2.1, viewed 28 February 2019 <http://www.cidoc-crm.org/Version/version-6.2.1>.
Pearson, D 1998, Provenance Research in Book History: a Handbook, British Library, London.
Phillipps, T & Munby, ANL 1968, The Phillipps Manuscripts: Catalogus Librorum Manuscriptorum in Bibliotheca D. Thomæ Phillipps, Bt.: Impressum Typis Medio-montanis 1837-1871, Holland Press, London.
TEI Consortium 2019, ‘10 Manuscript Description’, TEI P5: Guidelines for Electronic Text Encoding and Interchange, version 3.5.0, viewed 28 February 2019 <https://www.tei-c.org/release/doc/tei-p5-doc/en/html/MS.html>.
van Bree, P & Kessels, G 2015, ‘Mapping Memory Landscapes in nodegoat’, Social Informatics: SocInfo 2014 (Lecture Notes in Computer Science, vol. 8852), Springer, Cham, pp. 274-278.
Wijsman, H 2017, ‘The Bibale Database at the IRHT: a Digital Tool for Researching Manuscript Provenance’, Manuscript Studies vol. 1, no. 2, pp. 328-341.
Yeide, NH & Walsh, A & Akinsha, K 2001, The AAM Guide to Provenance Research, American Association of Museums, Washington, D.C.