Decoding the Past: Leveraging Text Encoding Initiative (TEI) Markup for Data Analysis in Early Modern Documents

By Deborah Leem, Julianne Nyhan and Antonis Bikakis

1. Introduction

Sir Hans Sloane, a renowned physician, naturalist, and collector, bequeathed his collection of 71,000 items to the British Nation upon his death in 1753 (Sir Hans Sloane, no date), laying the groundwork for three of the UK’s national memory institutions: the British Museum, the British Library, and the Natural History Museum in London. Sloane’s assiduous efforts in compiling, organising, and cataloguing these collections reflect his intellectual legacy (Delbourgo, 2017; Ortolja-Baird et al., 2019). The highly complex information architecture of Sloane’s catalogues, along with the breadth and depth of these collections, make them an extraordinary resource for computational analysis. Sloane’s cataloguing practices not only offer a structured and detailed description of a diverse range of objects but also present an exceptional opportunity for employing computational methods to uncover new layers of understanding from these vast collections. The potential of computational approaches to illuminate hidden patterns and connections within the catalogues demonstrates the value of Sloane’s collections as a rich data source for research and scholarship.

The project Enlightenment Architectures: Sir Hans Sloane’s catalogues of his collections (2016–21), funded by the Leverhulme Trust, was a collaborative initiative between UCL and the British Museum that endeavoured to identify and interrogate the highly complex information architecture of Sloane’s catalogues and their intellectual legacies (see Ortolja-Baird et al., 2019). With the overarching aim of contributing to ongoing conversations in historical, curatorial, and museum studies and in the digital humanities, the project published new research and methodologies to further decode Sloane’s organisational practices (Ortolja-Baird and Nyhan, 2021). One significant strand of this work was the encoding of five volumes of Sloane’s manuscript catalogues in alignment with a project-modified schema of the Text Encoding Initiative (TEI) Guidelines.

As part of the Enlightenment Architectures project, the case study presented in this paper primarily examines the catalogue entitled Miscellanea, a bound volume encompassing seven distinct catalogues: ‘Miscellanies’; ‘Antiquities’; ‘Seals’; ‘Pictures’; ‘Mathematical Instruments’; ‘Agate Handles’; and ‘Agate Cups, Bottles, Spoons’. It also incorporates indices to the ‘Seals’ and ‘Mathematical Instruments’. These catalogues, filled with descriptions of objects, present a rich source of data for analysis.

This study elucidates the potential of leveraging computational methods for the analysis of early modern documents, specifically through the lens of the Collections as Data perspective. The primary objective is to shed light on the benefits and complexities involved in managing collections as data, adhering to the Collections as Data principles. These principles emphasise usability, interoperability, and access to digital cultural heritage collections, while also considering the ethical implications of collections as data work (see Padilla et al., 2019). In the context of this study, we have applied these principles to transform Sloane’s catalogues into a structured, machine-readable dataset, enriching their accessibility and research potential.

Using the digitised and encoded catalogues from Miscellanea as a case study, we aim to demonstrate how computational techniques can be harnessed to transform these historical documents into structured, machine-readable datasets. While TEI XML encoding does make the catalogues accessible to computational processes, the irregular structure and historical idiosyncrasies within the catalogues can make it challenging for algorithms to interpret the data effectively. By utilising Python scripts, which implement computational techniques for text processing and data extraction, we can identify person and place names and transform these catalogues into a more structured, machine-readable format. This transformation makes engaging with the material more efficient: it reduces the need for manual perusal and allows for quick searches, filtering, and cross-referencing. More importantly, it opens up new avenues for more detailed investigations into the early modern period. The machine-readable format facilitates advanced computational techniques such as data mining, network analysis, and machine learning, which can reveal hidden patterns and relationships within the data. Furthermore, it supports the integration of this data with other datasets for comparative and interdisciplinary studies, ultimately enriching our understanding of the early modern period in innovative ways.

By shifting our focus from viewing collections merely as repositories of information to considering them as data-rich resources, we open up new possibilities for research and scholarship. This study highlights how the application of computational methods and data modelling techniques can reveal patterns and connections that might remain hidden in traditional, non-data-driven studies. In doing so, we aim to advance the discourse on best practices in managing digital collections as data, potentially contributing to the evolution of methodologies in the cultural heritage sector.

The process of data extraction, however, is not without its challenges. These challenges, which include the irregular structure of the catalogues, variations in the spelling of person and place names, and the need for meticulous manual list compilation, demand both an in-depth understanding of the TEI XML document and careful consideration of anomalies and inconsistencies within the catalogues. These obstacles do not detract from the potential value of the approach. Rather, they provide valuable insights into the intricate process of encoding historical documents and the considerations that must be taken into account when doing so. The challenges, their implications, and the strategies we have employed to address them are discussed in detail in the later sections of this paper.

2. Methodology

In this study, we selected the Miscellanea manuscript catalogue – an aggregation of seven distinct catalogues – due to its comprehensive encoding and the diversity of cataloguing styles it exhibits. Our research methodology primarily involved two tasks: extracting person and place names from the catalogue to form new data outputs, and assigning unique IDs to each extracted name to account for variations in spelling.
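The first of these tasks can be illustrated with a minimal sketch. It is not the project's actual script: it assumes that each catalogue entry is wrapped in an `<item>` element carrying an `n` attribute (a hypothetical choice, since the project-modified schema is not reproduced here), and the sample entries are invented. Only the TEI namespace and the `<persName>` and `<placeName>` element names come from the TEI Guidelines themselves.

```python
import xml.etree.ElementTree as ET

# The TEI namespace is standard; the <item n="..."> entry wrapper is an
# assumption made for this sketch, not the project's documented schema.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample entries (not taken from Miscellanea).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><list>
    <item n="1">A medal given by <persName>Mr. Example</persName>
      from <placeName>Exampleton</placeName>.</item>
    <item n="2">A shell from <placeName>Exampleton</placeName>.</item>
    <item n="3">A stone with no tagged name.</item>
  </list></body></text>
</TEI>"""


def clean(element):
    """Join all text inside an element and collapse runs of whitespace."""
    return " ".join("".join(element.itertext()).split())


def extract_records(xml_text):
    """Return one record per entry that contains a person or place name."""
    root = ET.fromstring(xml_text)
    records = []
    for item in root.iterfind(".//tei:item", NS):
        persons = [clean(p) for p in item.iterfind(".//tei:persName", NS)]
        places = [clean(p) for p in item.iterfind(".//tei:placeName", NS)]
        if persons or places:  # skip entries with no encoded names
            records.append(
                {"entry": item.get("n"), "persons": persons, "places": places}
            )
    return records


records = extract_records(SAMPLE)
for record in records:
    print(record)
```

Keeping one record per entry, rather than one flat list of names, preserves the link between each name and the catalogue entry in which it occurs.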

The analysis of the documents constituted an iterative process, necessitating several rounds of review and fine-tuning during both the XML editing and Python script development phases. A crucial element in this process was the domain expertise provided by Kim Sloan, a former curator at the British Museum. Her expert knowledge and insights substantially augmented our understanding of the catalogues, thereby facilitating a more precise refinement of our data extraction methods. This in turn enhanced the accuracy of our subsequent data outputs.

3. Results and Discussion

The outcomes of this case study combine fresh insights with novel challenges. The transformation of Sloane’s manuscript catalogues into machine-readable formats not only illuminates the potential of digitised collections but also spotlights the intricacies of adapting historical documents for modern computational systems (see Ortolja-Baird et al., 2019). Transforming these catalogues through data modelling has highlighted the latent complexities within the collections (ibid.). The task of disambiguating and encoding various elements within the catalogues, such as catalogue entries and changes of hand, required a deep understanding of the material and careful attention to detail. This process was further complicated by the need to maintain, as far as possible, a historically accurate representation of the information, a task that often came into conflict with the perspectives of information implicit in 21st-century encoding specifications (see Ortolja-Baird and Nyhan, 2021). The approach of viewing the catalogues as ‘bifocal data’ has allowed the Enlightenment Architectures project to consider the documents both in their own right and as windows into the wider socio-historical landscape of the early modern period (ibid.). This dual perspective has facilitated a deeper appreciation of the catalogues, not only as repositories of historical information but also as dynamic and interconnected networks of knowledge and culture. However, striving for historical accuracy in the encoding process comes with its own set of challenges (ibid.). While modern computational methods are designed to interact with standardised, structured data, early modern documents such as Sloane’s catalogues often defy these norms. Their content, steeped in the idiosyncrasies of the time, often presents an irregular structure and organisation that prove challenging for automated data extraction.

Following rigorous analysis and editing, we were able to extract a total of 1390 catalogue records from the Miscellanea catalogue, all of which included encoded person or place names. Among these records, 1044 were identified as person names, and 970 as place names. This dataset forms a robust foundation for future research, offering unique opportunities to examine the networks of scientific and cultural exchange embodied within the catalogue. By examining these encoded person and place names, one can start to delineate the intricate social and geographical networks that were integral to the early modern period. The data extracted from these catalogues presents a multi-faceted view of the people and locations Sloane interacted with or was influenced by, providing invaluable insights into the transmission of knowledge, ideas, and objects during this time. Further, the geographical locations tagged within these records can help us understand the spatial distribution of these exchanges and the wider context of Sloane’s network, highlighting the connections between different regions and their significance in the broader historical narrative (Ortolja-Baird and Nyhan, 2021). In line with the observations made by Ortolja-Baird and Nyhan, this dataset’s significance extends beyond the mere presence of names and places. It allows researchers to critically evaluate the discrepancies and gaps within these records, which can shed light on social disparities, potential biases, or unrecorded actors within the early modern period (ibid.).

However, the extraction of these data was not without its complexities. The catalogues presented numerous challenges. One such challenge was dealing with variations in the spelling of person and place names, a common characteristic of early modern documents. To ensure accuracy, we compiled a manual list of all person and place names, along with their variations, and assigned each name a unique ID with a ref attribute as shown in Figure 1 and Figure 2. This approach enabled us to link each name to the correct person or place, even when different spellings were used.

Figure 1 Showing parts of the TEI-XML markup of <listPerson> in the <teiHeader>, Sloane’s Manuscript Catalogue Miscellanea

Figure 2 Parts of the TEI-XML markup showing ref attribute with a unique ID for all person and place names in Sloane’s Manuscript Catalogue Miscellanea
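The linking mechanism described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the project's code: it assumes that each `<person>` in `<listPerson>` carries an `xml:id` and a canonical `<persName>`, and that each name in the text points to it with `ref="#id"`, as is conventional in TEI. The identifiers and names in the sample are invented.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented sample: one person recorded under two spellings.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <listPerson>
      <person xml:id="pers001"><persName>John Example</persName></person>
    </listPerson>
  </teiHeader>
  <text><body>
    <p>From <persName ref="#pers001">Mr. Exemple</persName> and later
       <persName ref="#pers001">J. Example</persName>.</p>
  </body></text>
</TEI>"""


def clean(element):
    """Join all text inside an element and collapse runs of whitespace."""
    return " ".join("".join(element.itertext()).split())


def build_person_index(root):
    """Map each xml:id in <listPerson> to its canonical name."""
    index = {}
    for person in root.iterfind(".//tei:listPerson/tei:person", NS):
        name = person.find("tei:persName", NS)
        if name is not None:
            index[person.get(XML_ID)] = clean(name)
    return index


def group_variants(root):
    """Group the spellings found in the text by the ID they point to."""
    variants = {}
    body = root.find("tei:text", NS)
    for name in body.iterfind(".//tei:persName", NS):
        ref = name.get("ref", "").lstrip("#")
        if ref:
            variants.setdefault(ref, set()).add(clean(name))
    return variants


root = ET.fromstring(SAMPLE)
index = build_person_index(root)
variants = group_variants(root)
for person_id, spellings in variants.items():
    print(person_id, index.get(person_id), sorted(spellings))
```

The same pattern would apply to place names, with a place list and `<placeName>` elements in place of `<listPerson>` and `<persName>`.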

The catalogues presented several anomalies and inconsistencies, some of which originated from errors dating back 300 years. Instances of duplicated catalogue numbers, as demonstrated by number 1492, were detected within the catalogue (see Figure 3). Additionally, space constraints often led Sloane or his assistants to extend their writing onto the opposite page. They employed symbols, such as a plus sign, to indicate the continuation of the text, as observed with catalogue number 447 (see Figure 4). Such practices posed significant challenges during the data extraction phase using Python scripts. As these continuation signs were neither encoded nor clearly marked, ensuring the accurate linkage and inclusion of catalogue entries that extended onto the opposite page was a substantial task.

Figure 3 Sloane’s Manuscript Catalogue Miscellanea showing the catalogue number 1492 entries in Miscellanies, fol. 135

Figure 4 Sloane’s Manuscript Catalogue Miscellanea showing the catalogue number 447 entry in Antiquities, fol. 192 and fol. 191v
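Duplicated catalogue numbers of the kind shown in Figure 3 can be flagged automatically before extraction. The sketch below is a hedged illustration: it assumes that each entry is an `<item>` whose catalogue number sits in a child `<num>` element, which is a hypothetical structure chosen for the example. The number 1492 is taken from the case above; the entry texts are invented.

```python
from collections import Counter
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical structure: <item> entries with the catalogue number in <num>.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><list>
    <item><num>1491</num> First invented entry.</item>
    <item><num>1492</num> Second invented entry.</item>
    <item><num>1492</num> Third invented entry, reusing the number.</item>
    <item><num>1493</num> Fourth invented entry.</item>
  </list></body></text>
</TEI>"""


def duplicated_numbers(xml_text):
    """Return catalogue numbers that occur more than once, with their counts."""
    root = ET.fromstring(xml_text)
    numbers = [
        # itertext() also picks up text held in elements nested inside <num>
        " ".join("".join(num.itertext()).split())
        for num in root.iterfind(".//tei:item/tei:num", NS)
    ]
    counts = Counter(numbers)
    return {number: count for number, count in counts.items() if count > 1}


duplicates = duplicated_numbers(SAMPLE)
print(duplicates)
```

A report of this kind only locates the duplicates. Deciding whether they are historical errors to be preserved or encoding slips to be corrected remains a matter for manual inspection.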

A lack of documentation compounded the complexity of the data and highlighted numerous ambiguous and inconsistent cases in the catalogue. The extraction process highlighted the need for more thorough manual inspection of the data and the importance of providing clear rationales for encoding and modelling decisions to enhance data reusability. For example, inconsistencies in the cataloguing and encoding of Sloane’s collections became evident in the inconsistent assignment of catalogue numbers and their corresponding XML markup. A divergence from the usual pattern is seen in catalogue number 1933, which, unusually, includes sub-numbers (see Figure 5). Moreover, the XML encoding for these catalogue numbers deviates from the typical structure by incorporating additional elements nested within the catalogue number element. These anomalies underline the importance of comprehensive documentation in understanding the encoding rationale, thereby facilitating the data wrangling process.

Figure 5 Sloane’s Manuscript Catalogue Miscellanea showing the catalogue number 1933 entries in Miscellanies, fol. 9v, followed by its TEI-XML representation showing catalogue numbers

In dealing with early modern documents, we are often confronted with unique challenges such as variations in spellings and anomalies in cataloguing practices. These peculiarities necessitate scholarly expertise and scrutiny to accurately decode the inherent data. Nonetheless, our research underlines the potential of harnessing these complexities to yield profound historical revelations. Utilising computational methodologies, we can transform the convoluted raw data, inherent in such early modern documents, into structured datasets amenable to computational analysis. This process enables the discernment of patterns, relations, and narratives which might otherwise remain obfuscated. This investigation signifies an advancement in comprehending the advantages and challenges integral to managing and viewing digital collections as data. We advocate a reconceptualisation of these collections; not as inert repositories of information, but as active resources to be deployed for data-driven analysis. This shift in perspective not only enriches our understanding of the collections themselves but also broadens our understanding of the early modern period more generally. Our findings and methodologies contribute meaningfully to the ongoing scholarly discourse in the cultural heritage sector regarding best practices for managing early modern digital collections as data.

In a broader sense, the project not only deepens our understanding of early modern cataloguing practices but also paves the way for other researchers to further explore the intricate networks of scientific and cultural exchange and collaboration in the early modern period. The data extracted from Sloane’s catalogues, now cleaned and structured, forms a rich foundation for further studies, unlocking the potential for a deeper exploration of the early modern period.

4. Challenges and Lessons Learned

The first significant challenge arose from the structure of the TEI XML document. Because of the particular way the catalogue entries are encoded, approaching them as data was not straightforward. For instance, elements like <persName> contain sub-elements that capture additional information about how a name is written in the catalogue. Elements such as <hi rend> and <add rend> are used to denote underlining and superscript, respectively. These nested elements posed a significant challenge at the start of the case study and required the development of Python scripts capable of handling them. Detailed document analysis was crucial to understanding and navigating the data structure accurately.
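To illustrate why nested markup complicates extraction, consider the following minimal sketch. The `rend` values (`underline`, `superscript`) and the sample name are assumptions made for the example, not values confirmed from the project's files. Reading only the element's own text returns nothing here, whereas walking all descendant text nodes recovers the full name.

```python
import xml.etree.ElementTree as ET

# Invented fragment: a name whose text is split across nested elements.
SAMPLE = (
    '<persName xmlns="http://www.tei-c.org/ns/1.0" ref="#pers001">'
    '<hi rend="underline">M<add rend="superscript">r</add> Example</hi>'
    "</persName>"
)


def flat_text(element):
    """Concatenate every text node inside the element, in document order."""
    return " ".join("".join(element.itertext()).split())


def rendered_parts(element):
    """List (tag, rend, text) for each descendant that carries a rend value."""
    parts = []
    for child in element.iter():
        rend = child.get("rend")
        if rend:
            local_name = child.tag.split("}")[-1]  # drop the namespace prefix
            parts.append((local_name, rend, flat_text(child)))
    return parts


name = ET.fromstring(SAMPLE)
print(repr(name.text))       # the element's own text, before any child
print(flat_text(name))       # full name recovered from nested nodes
print(rendered_parts(name))  # the rendering information, kept separately
```

Separating the flattened name from the rendering details allows the name to be used as data while the evidence of how it was written remains available.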

Deciphering spelling variations presented another obstacle. Early modern documents, such as those in our study, often feature inconsistent spellings. Our particular interest lay in person and place names, making the task of deciphering these variations crucial to our work. We compiled a comprehensive manual list of these names and their variations, assigning each a unique ID, ensuring that the correct entities were linked irrespective of spelling variations.

Moreover, the catalogues encompassed an array of anomalies and ambiguities. Some of these inconsistencies were legacy errors dating back centuries. One such anomaly was the continuation of text onto the opposite page when space was insufficient, marked by symbols such as a plus sign. These unencoded signs posed considerable challenges during the data extraction process.

These challenges were further exacerbated by a shortfall in comprehensive documentation on the catalogues, which amplified the complexity of data interpretation. Enhanced information on the encoding rationale and modelling decisions is paramount for future data usability.

Our work was also impacted by irregularities in data encoding. The encoding process, having been performed by multiple contributors, was marked by deviations from a consistent rule set, resulting in slight yet significant discrepancies in the encoded data. These inconsistencies, albeit not substantial, were significant enough to generate misleading data outputs during the extraction and analysis processes. Where possible, it was necessary to rectify these discrepancies, which underscored the importance of a more thorough review and validation of the data. This situation highlighted the value of consistent encoding standards, quality control, and thorough training in projects involving the digitisation and encoding of historical documents to ensure accurate data outcomes.
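Some discrepancies of this kind can be surfaced by simple consistency checks run before extraction. The sketch below is one possible check, offered as an assumption about what such validation might look like rather than a description of the project's workflow: it reports names in the text that lack a `ref` attribute and `ref` values that point to no `xml:id` defined in the header. All identifiers and names are invented.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented sample with one correct link, one missing ref, one dangling ref.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <listPerson>
      <person xml:id="pers001"><persName>John Example</persName></person>
    </listPerson>
    <listPlace>
      <place xml:id="plc001"><placeName>Exampleton</placeName></place>
    </listPlace>
  </teiHeader>
  <text><body>
    <p><persName ref="#pers001">J. Example</persName>,
       <persName>Mr Unlinked</persName>,
       <placeName ref="#plc999">Nowhere</placeName>.</p>
  </body></text>
</TEI>"""


def clean(element):
    """Join all text inside an element and collapse runs of whitespace."""
    return " ".join("".join(element.itertext()).split())


def check_refs(xml_text):
    """Return (names without a ref, names whose ref matches no header ID)."""
    root = ET.fromstring(xml_text)
    header = root.find("tei:teiHeader", NS)
    body = root.find("tei:text", NS)
    defined = {el.get(XML_ID) for el in header.iter() if el.get(XML_ID)}
    missing, dangling = [], []
    for tag in ("persName", "placeName"):
        for el in body.iter(f"{{{TEI}}}{tag}"):
            ref = el.get("ref")
            if not ref:
                missing.append(clean(el))
            elif ref.lstrip("#") not in defined:
                dangling.append((clean(el), ref))
    return missing, dangling


missing, dangling = check_refs(SAMPLE)
print("No ref attribute:", missing)
print("Ref with no matching ID:", dangling)
```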

Finally, we recognised the importance of extensive manual inspection of the data. Although time-consuming, this inspection was necessitated by the complexities of data extraction and proved pivotal in ensuring that the resulting data were accurate and meaningful.

5. Contributions

Our work on Sloane’s catalogues contributes to the transformation of historical research through data-driven analysis. By transforming Sloane’s catalogues into machine-readable datasets, we have unlocked new potential for historical research. This approach allows for the identification and mapping of early modern scientific and cultural exchange networks, as well as for addressing absences and biases in historical documents, thereby enriching our understanding of knowledge production, exchange, and collaboration during the Enlightenment period. The potential for integrating this data with other databases allows for complex computational analysis, revealing patterns and relationships that may otherwise remain hidden. This project highlights the potential of collections as active resources for research, thereby enhancing their reusability for future studies.

Moreover, we significantly improved accessibility by converting Sloane’s catalogues to a structured, machine-readable dataset. This project highlights the value of computational methodologies in interrogating early modern historical documents. For researchers, the transition from digitised and encoded manuscript catalogues to a well-structured dataset facilitates a more effective interaction with the material. Enhanced search capabilities and the application of various computational analyses can result in more robust, in-depth studies, unveiling previously hidden aspects of the early modern period captured in Sloane’s catalogues. Furthermore, the ability to integrate this data with external resources such as GeoNames and the Virtual International Authority File (VIAF) not only enriches the understanding of the data but also provides insights into the relationships and networks within the catalogues. This linkage extends the research scope and lends additional depth and context, demonstrating a promising future trajectory for historical research.

The challenges encountered in this project provide critical insights for developing best practices for managing early modern documents as data within the cultural heritage sector. By navigating the complexities of digital collections as data, particularly in the context of the early modern period, the project provides a roadmap for transforming the management and utilisation of such valuable resources. The lessons learned, such as the need for meticulous data review, validation, and comprehensive documentation, can be used to inform future projects dealing with similar datasets. By demonstrating the benefits of linking collections data to external resources, this project can serve as a model for other institutions aiming to enhance the interoperability and usability of their digital collections.

6. Conclusion

In the face of considerable challenges, this study has reinforced the fundamental importance of comprehensive documentation in enhancing data reusability, contributing substantially to our appreciation of digital collections’ potential for scholarly research. Despite complexities inherent in extracting targeted data from Sloane’s catalogues, the value of the process has proven irreplaceable. It highlights the necessity for thorough documentation and domain expertise when manoeuvring through datasets that are as complex and historically layered as those assembled by Sloane during the Enlightenment.

Navigating the intricate encoding structures and anomalies of these early modern documents was not without its hurdles, but each challenge encountered has offered a unique opportunity for learning and adaptation. From the development of custom Python scripts to handle nested TEI XML structures, to the creation of a comprehensive list to account for spelling variations, each approach employed not only helped to surmount these obstacles but also contributed to the formulation of effective practices for dealing with similar digital collections.

Furthermore, the project has illuminated the transformative potential of data-driven analysis in historical research. The transformation of Sloane’s catalogues into structured, machine-readable datasets has not only enriched our understanding of knowledge production, exchange, and collaboration during the Enlightenment period but also enhanced the accessibility of these valuable resources for both the academic community and the wider public.

In doing so, the project sets a promising precedent for the management and utilisation of early modern digital collections as data within the cultural heritage sector. It has demonstrated that while the digital transformation of historical documents can be a challenging undertaking, the rewards, in terms of research potential and improved accessibility, are worth the effort. Thus, despite the inevitable challenges, this project highlights the significant value that a carefully curated and well-documented digital collection can offer to the field of digital humanities research and beyond.

7. Acknowledgements

The authors extend their sincere appreciation to Karen Stepanyan for his significant contribution to the editing and proofreading of this paper. His keen attention to detail and rigorous editorial review have been pivotal in enhancing the clarity and coherence of this document. The authors maintain full responsibility for any remaining errors or oversights.

8. References

Delbourgo, J. (2017) Collecting the world: the life and curiosity of Hans Sloane. London, UK: Allen Lane, an imprint of Penguin Books.

Ortolja-Baird, A. et al. (2019) ‘Digital Humanities in the Memory Institution: The Challenges of Encoding Sir Hans Sloane’s Early Modern Catalogues of His Collections’, Open Library of Humanities, 5(1), p. 44.

Ortolja-Baird, A. and Nyhan, J. (2021) ‘Encoding the haunting of an object catalogue: on the potential of digital technologies to perpetuate or subvert the silence and bias of the early-modern archive’, Digital Scholarship in the Humanities, p. fqab065.

Padilla, T. et al. (2019) Final Report — Always Already Computational: Collections as Data (Version 1).

Sir Hans Sloane (no date) The British Museum.