From Individual Solutions to Generic Tools

Digitisation at the Max Planck Society

by Andrea Kulas and Lu Yu

Max Planck Institutes are increasingly digitising their library holdings to improve accessibility and support research. Although most institutes have similar needs in these respects, they often run their own individual projects to digitise their specific objects, thereby forgoing efficiency and reusability. In 2011 the Digitization Lifecycle (DLC) project was initiated by the Max Planck Digital Library and four Max Planck Institutes from the humanities and social sciences. Its aim was to create synergies by developing a generic application to support digitisation projects at the Max Planck Society. All project activities were financed by the Max Planck Innovation Fund for two years.

A generic web application for editing, presenting and searching digital content was considered the best tool to support digitisation projects once the digitisation itself has been completed. In this context, “generic” means fulfilling the most important requirements of many and being easily extendable, although “many” presently denotes only four institutes. The hope is to engage more users and enlarge the community in the future. For the institutes involved, it was essential that the web application offer the following functions: indexing and searching, editing and publishing digitised works, and browsing, with a strong focus on semantic content enrichment. Besides synergy effects, we also hoped to give digitised works more visibility by making them fully searchable across all collections and institutions. Furthermore, the project focused on developing guidelines for MPG digitisation projects.

The major challenge in DLC lay in creating a generic service. Replacing isolated solutions that the institutes had built around their individual requirements can be a cumbersome endeavour. Requirements from different partners are often very heterogeneous and difficult to merge. For example, in individual projects, bibliographic metadata can be stored in many different formats and at different levels of detail. Full texts available in the TEI format [1] are often tailored to very specific needs, such as different genres, and to digitisation workflows that depend on the objects to be digitised. The goal was not to create an all-in-one solution, like a Swiss Army knife. Two years would have been too short for such a challenge, and the outcome uncertain. The primary task for a generic solution was to define common formats for bibliographic data and full text, while still allowing different genres to be uploaded and creating a system that is easily extendable.

The outcome of this two-year project is now available at http://dlc.mpdl.mpg.de/ and is worth a look.

Figure 1: DLC View Page showing the scan of a book, the structure and full text.

During the project, functions such as upload and update procedures for self-ingest were implemented, and an online editor for structural data now allows users to produce a rudimentary TEI file, which can be exported and even re-imported. Once the data specifying the structure of the book has been entered and saved, the DLC application can interpret the information. The resulting table of contents considerably improves online navigation within the digitised book, as the selected section or page can be retrieved directly. The online editor itself is a powerful instrument for paginating with Roman or Arabic numerals and recto/verso pagination, as well as for defining headings (such as a title page and chapters). More information can then be added over time, and if a full text in TEI P5 format is provided, the more rudimentary information entered previously can be overridden. The enrichment of the digital material in an online environment is an important focus of DLC.
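The three pagination schemes mentioned above can be illustrated with a short sketch. It is not taken from the DLC code base: the function names `to_roman` and `page_labels`, the style keywords and the `1r`/`1v` label form are assumptions made for illustration only.

```python
from typing import List


def to_roman(n: int) -> str:
    """Convert an integer between 1 and 3999 to a lowercase Roman numeral."""
    if not 0 < n < 4000:
        raise ValueError("Roman numerals are only defined here for 1..3999")
    pairs = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    parts = []
    for value, symbol in pairs:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)


def page_labels(count: int, style: str = "arabic", start: int = 1) -> List[str]:
    """Return one label per scanned image (hypothetical helper, not DLC's API).

    style is "arabic", "roman" or "recto_verso"; start is the first page
    (or leaf) number.
    """
    labels = []
    for i in range(count):
        if style == "arabic":
            labels.append(str(start + i))
        elif style == "roman":
            labels.append(to_roman(start + i))
        elif style == "recto_verso":
            # Two consecutive scans share one leaf: front (r), then back (v).
            leaf = start + i // 2
            side = "r" if i % 2 == 0 else "v"
            labels.append(f"{leaf}{side}")
        else:
            raise ValueError(f"unknown pagination style: {style}")
    return labels


# Front matter in Roman numerals, followed by the main text in Arabic numerals.
print(page_labels(4, "roman") + page_labels(3, "arabic"))
# prints ['i', 'ii', 'iii', 'iv', '1', '2', '3']
print(page_labels(4, "recto_verso"))
# prints ['1r', '1v', '2r', '2v']
```

Generating labels per scan in this way keeps the mapping between image and page label explicit, which is what makes direct retrieval of a given page possible.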

Figure 2: Editing in DLC.

Besides the upload of individual books, a batch ingest for a series of books is also available via the File Transfer Protocol (FTP). A fine-grained error log is provided for the ingest process, which allows a user to check the status of the current batch job. Bibliographic information can be entered manually or uploaded as bibliographic metadata files, either individually or as a batch ingest.

As a generic system, DLC needs to accept many different formats. Naturally, there are limits to this. At present, all bibliographic metadata needs to be converted to the MAB-XML format [2] to be integrated into DLC, while images can be uploaded as JPEG, PNG or TIFF. There are also limits on the acceptance of full texts in the TEI format, as the range of possible variations among TEI documents is considerable and it is difficult to display something which has not been anticipated. A TEI full text will be uploaded and displayed if the file validates against the DLC-TEI-Schema [3]. In all other cases, an error message appears. The minimal requirements for a TEI file are that a TEI header exists and that divisions, page breaks and headings are marked. Besides import, export and interfaces are also relevant for generic applications. DLC offers the formats PDF, METS/MODS and TEI for export, as well as an OAI-PMH interface for harvesting the metadata.
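The minimal requirements named above can be checked before upload with a few lines of code. The following is a minimal sketch, assuming that the four requirements correspond to the TEI P5 elements `teiHeader`, `div`, `pb` and `head`; the function name `missing_tei_parts` and the sample document are hypothetical. The sketch only tests for the presence of these elements and does not replace validation against the DLC-TEI-Schema.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace of TEI P5 documents.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Assumed mapping from each minimal requirement to a TEI element name.
REQUIRED = {
    "teiHeader": "TEI header",
    "div": "division",
    "pb": "page break",
    "head": "heading",
}


def missing_tei_parts(xml_text: str) -> List[str]:
    """Return the required parts that do not occur anywhere in the document."""
    root = ET.fromstring(xml_text)
    missing = []
    for tag, label in REQUIRED.items():
        # ".//" searches all descendants of the root element.
        if root.find(f".//{{{TEI_NS}}}{tag}") is None:
            missing.append(label)
    return missing


# Hypothetical, deliberately small sample document.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <div><pb n="1"/><head>Chapter 1</head><p>Text.</p></div>
  </body></text>
</TEI>"""

print(missing_tei_parts(SAMPLE))
# prints []
```

A file for which this function returns a non-empty list would certainly be rejected; an empty list is only a necessary condition, since the schema imposes further constraints.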

With DLC, a publication platform for digital works and collections from Max Planck libraries has been created. Search across all collections from all contributors facilitates the scholarly use of digitised copies. The online platform, currently hosted at the Max Planck Digital Library, is steadily coming to life as the project partners fill it with digitised copies. Some works date back to the 16th century. All partners are libraries of Max Planck Institutes: the Bibliotheca Hertziana in Rome [4], the Kunsthistorisches Institut in Florenz [5], the Max Planck Institute for Human Development in Berlin and the Max Planck Institute for European Legal History in Frankfurt [6]. These institutes form the current user community. The user group is intended as an open circle for users and open source developers of the DLC software.

Further information about the project results and the DLC application, currently available only in German, can be found at http://dlcproject.wordpress.com. Instructions for downloading and installing the open source DLC software will be published on that website in the near future.

  1. The Text Encoding Initiative (TEI) is an international consortium which collectively develops and maintains a standard for the representation of full texts in digital form. Further information is available at http://www.tei-c.org/index.xml.
  2. MAB-XML is the XML form of MAB (Maschinelles Austauschformat für Bibliotheken), a machine-readable exchange format for library data used in Germany.
  3. http://dlc-tei.net/p5/DLC-TEI.rng
  4. http://dlc.mpdl.mpg.de/dlc/ou/escidoc:1003
  5. http://dlc.mpdl.mpg.de/dlc/ou/escidoc:1004
  6. http://dlc.mpdl.mpg.de/dlc/ou/escidoc:1001