by Jamie McLaughlin
In July 2016, a project website built and hosted by the Digital Humanities Institute in Sheffield was subject to a successful SQL injection attack. The site was about six years old and was periodically maintained and tested, but evidently at least one vulnerability had remained. Following the attack, 291 usernames, email addresses and MD5-hashed passwords belonging to members of the public were anonymously posted on the internet (Figure 1). All digital humanities websites hosted at Sheffield (about 80) were immediately taken offline by the computing services department. An incident had to be reported to the Information Commissioner’s Office. Over the following months, every site had to be extensively audited before it could be put back online. Often, this meant working through thousands of lines of code, some of it written decades ago by people who had long since left the University. We estimate that the process took between three hundred and five hundred person-hours.
Figure 1: The internet forum where the stolen data was first detected by University of Sheffield staff
The process of reviewing, patching and restoring these sites provided an opportunity to re-evaluate them in light of the passage of time. Some of them were fifteen years old; in terms of the web they belonged to another age. If we could go back in time and re-engineer them, what would we do differently? That quickly became a wider question: not just “how would we re-engineer these sites?”, but “how would we re-do these projects?”. Could they be managed in such a way as to produce more maintainable digital outputs?
Arts and humanities funding applications must usually specify their outputs, such as monographs and articles. Websites are often listed among these. But a website differs from a monograph or an article because it requires maintenance beyond its initial publication. Any live website is really more of a service than an output. It needs to be hosted and maintained. Websites are perhaps more like an exhibition, or even a conference. They cannot be put on a shelf and forgotten about. Without personnel to maintain them, websites eventually stop working correctly.
Specifying an explicit website closure date on a funding application would be bold. We do not know how funders and peer reviewers would react to it. The DHI in Sheffield still promises to maintain sites for seven years beyond the end of a project’s funded period. In practice we maintain them indefinitely. But this is unusual, and the challenge of sustainability in digital humanities has been recognised for some time. [1] In reality, digital humanities websites do close, and this is usually an unplanned and unmanaged event.
In the commercial world, the closure of a website might have little impact. Even in the case of a popular site, the march of technology means that any gap in provision of services is quickly filled. That is not the case in digital humanities. When the DHI was forced to take its sites offline, it became apparent that most were still regularly used in research and teaching. In this context they were irreplaceable. The temporary closure of almost every site, no matter how old or obscure, prompted complaints. Some sites were registering almost no traffic, but that traffic often turned out to be a high school class, or an undergraduate course in another country, that had based a project or its teaching around the site. Sometimes PhD projects were relying on fifteen-year-old websites, the hosting of which now posed serious security risks.
I will propose two approaches to the maintainability problem. The first involves engineering websites to maximise their lifespan after project funding ends. The second involves explicitly planning for what happens when a project website can no longer be maintained in its original form.
1. Maximising Website Lifespan
Keeping a website running means paying staff to perform security updates and conduct other essential maintenance. Therefore, maintainability depends on the availability and costs of those staff. The more developers who possess the skills to maintain a website, the easier and cheaper it is to recruit them. As a result, a website built using more common and generic technology will be cheaper to maintain than one using obscure or bespoke technology.
A Web Framework (sometimes abbreviated to WF) is a suite of software which constitutes a standardised way to build, deploy and maintain a website. Web Frameworks are de rigueur in commercial web development, but historically uncommon in digital humanities. The DHI only began to use a framework in 2012, and about 50 of the sites which we needed to restore were not built using a framework. Auditing and updating still had to be performed on the framework sites, but rather than examining thousands of lines of bespoke code, we were applying updates and patches. This took much less time and specialised expertise. We were doing broadly the same thing to every site, and we were not alone – other people had done this work before us and we had Google and Stack Overflow to help us.
In commercial web development, frameworks are often favoured because they can help create a functional website more quickly. [2] But for digital humanists, their most important benefit is better long-term maintainability. The principle is safety in numbers. When a person writes software completely from scratch, they become the leading authority on how to maintain it. They might completely forget how it works. They might leave their institution. They might make mistakes in their code. On the other hand, if they use a framework, any serious vulnerabilities which come to light may be fixed by the framework’s wider community. Crucially, by using a framework, the skills to maintain a website are more common. Developers possessing those skills are likely to be easier and cheaper to recruit.
Choosing a framework is like choosing an investment; there will be conflicting advice, and whatever is chosen will always be open to dispute. Frameworks wax and wane in popularity. Some suffer declines in usage or disappear completely. Rest assured that whatever is chosen will prove more maintainable than bespoke code. As of September 2018, Laravel, written in PHP, and Django, written in Python, are two of the best-supported frameworks. [3] Even that statement is contentious through omission, because there are competing frameworks of similar popularity. There is no definitive measure of framework popularity. You can look at the number of GitHub stars given to a particular framework, or the number of Stack Overflow questions about it, but neither is a perfect indicator of popularity. Popularity in turn is no guarantee of longevity, but it may be the best we can do. One site which uses a number of approaches to estimate framework popularity as of February 2019 is Hotframeworks. In Sheffield we presently use Symfony. It is a well-supported PHP framework, first published in 2007. Laravel is based upon some of its components. The DHI currently hosts about forty projects implemented using Symfony.
Website maintenance becomes much more critical if that site hosts personal data. This is because, over time, security vulnerabilities are discovered in software. If a website is not maintained, it will use older versions of software which are more likely to contain vulnerabilities. These vulnerabilities can be exploited to download, for example, all of the email addresses and passwords which the website has stored. An incident such as this can have legal ramifications for the hosting institution. Even if personal data is not downloaded in an unauthorised way, storing personal data carries new legal responsibilities following the introduction in Europe of the GDPR. Websites which hold personal data need to be maintained much more conscientiously than websites which do not. This translates into increased costs and risks for the hosting institution. In effect, websites which hold personal data are much less maintainable than websites which do not.
Following our data breach, we discovered that many DHI sites stored personal data unnecessarily. The most common culprits were superfluous user account systems. None of the sites that we audited really needed a user account system. The common use-case was to allow users to save searches or collections of documents to a personal workspace. It is important when developing a website to weigh the necessity of features against future maintainability. I would argue that a public user account system should not be added to a website unless it is absolutely vital to the research questions being investigated. Storing personal data is a huge legal and technical burden. Such features cannot be maintained permanently without an accompanying permanent revenue stream.
Of course, there are occasions where a public user account system really is necessary to answer a project’s research questions. A good example of this would be a project with a crowdsourcing component. Luckily, there are open standards available which allow a user account system to be implemented without having to store passwords or personally identifiable information. One such system is OAuth. OAuth is an open standard for access delegation. In short, it allows users to identify themselves to third party websites and grant those sites access to their personal information, without registering a new account or setting a new password. You may have noticed more websites allowing a user to log in using a Facebook or Google account. Facebook and Google would be classified as OAuth providers, and the site asking for access would be the OAuth client. In the commercial world, these systems provide convenience for the user by not asking them to remember another password. They also allow closer integration with the OAuth provider and can be used to ask for (and store) further personal information. But OAuth does not have to be used like this. The purpose in digital humanities would be purely to identify the user, so that all a site has to store is the identity token issued by the provider. This is usually numerical, not known to the user themselves, and not personally identifiable. The digital humanities site would also not have to store a password. This means that if the site were to suffer a data breach, it would not actually possess personally identifiable data, so it could not leak it.
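The storage side of this idea can be sketched in a few lines of Python using the standard library's sqlite3 module. The sketch assumes that the OAuth exchange has already been completed elsewhere and has yielded a provider name and an opaque identifier for the user; the table layout and the `open_store` and `get_or_create_user` helpers are hypothetical and serve only to show what the site's own database would hold.

```python
import sqlite3

def open_store(path=":memory:"):
    """Create a user table that holds no passwords, names or email addresses."""
    conn = sqlite3.connect(path)
    # "provider" is a short label for the OAuth provider; "subject" is the
    # opaque identifier that provider issued for the user (both illustrative).
    conn.execute(
        """CREATE TABLE IF NOT EXISTS users (
               id INTEGER PRIMARY KEY,
               provider TEXT NOT NULL,
               subject TEXT NOT NULL,
               UNIQUE (provider, subject)
           )"""
    )
    return conn

def get_or_create_user(conn, provider, subject):
    """Return the local user id for a provider-issued identifier."""
    # Bound parameters (?) keep the values out of the SQL text itself.
    conn.execute(
        "INSERT OR IGNORE INTO users (provider, subject) VALUES (?, ?)",
        (provider, subject),
    )
    conn.commit()
    row = conn.execute(
        "SELECT id FROM users WHERE provider = ? AND subject = ?",
        (provider, subject),
    ).fetchone()
    return row[0]
```

Saved searches or document collections could then be keyed on the local `id`, so that a returning user is recognised without the site ever holding a credential of its own.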
Unnecessary login systems were not the only maintenance challenge we encountered. Many digital humanities projects feature novel, bespoke components. This is normal and desirable; we are, after all, conducting academic research. The drawback of these components is that they are less maintainable than generic, standard components. This is simply because the expertise to maintain them is scarcer. For example, the DHI auditing team encountered a project which used a dedicated MapServer to power its GIS components. This had been hugely ahead of its time, but now the skills to maintain it were close to extinct. One project had left behind an OpenSimulator server. OpenSimulator is an Open Source alternative to Second Life. This server had been quietly running for four years on a desktop computer behind a desk (Figure 2). This was a security vulnerability, and it would be difficult and expensive to recruit somebody to audit and secure such a relatively obscure system. Both of these projects were hugely ambitious, and enthusiastically took up the challenge of forward-looking humanities research. But in terms of long-term maintainability, they were being punished for that. Their services would certainly have to be terminated earlier than other projects. Are digital humanists destined to always be victims of their own innovation? Or is it possible to develop a methodology which mitigates the maintenance difficulties of ingenious digital resources?
Figure 2: The desktop machine which was running the OpenSim server
2. Planning for Web Retirement
The OpenSimulator server had a companion website. It comprised an introduction to the project’s objectives, methods, and instructions for accessing the OpenSimulator server. To a user discovering the project for the first time, it was useful and accessible. In an ideal world it would have been expanded further, to include images and video of the virtual world when it was in operation. It could have explained the conclusions drawn from the research, and even offered data harvested from the virtual world: usage patterns, chat transcripts (with user consent) and 3D models.
The companion site was maintainable because it had few dynamic components. It was mostly HTML and CSS, serviceable by Apache or NGINX without any further infrastructure. But that did not make the information it presented any less informative. Making sites maintainable means removing features which require server side infrastructure, while still offering a digital output which is useful.
For once, digital humanists are not alone here. There is mainstream web technology to do this. The practice of rendering dynamic websites as static HTML and CSS is often called ‘static site generation’. Projects which create their sites using a framework are at another maintainability advantage here, because there are sometimes framework-specific tools to automate this process. A common static site generation tool for PHP frameworks is Sculpin. There is similar software available for Drupal and Django. A website created without a framework might require a bespoke script to be written using wget, which is a command line tool for batch downloading web content.
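As an illustration of the wget approach, the sketch below assembles a mirroring command from Python. The flags are standard GNU wget options, but the function name, the example address and the output directory are placeholders, and a real site would probably need further options (for instance, to exclude search forms or login pages).

```python
import shlex

def build_mirror_command(site_url, output_dir):
    """Assemble a wget command that saves a rendered copy of a site as static files."""
    return [
        "wget",
        "--mirror",            # recursive download with timestamping
        "--page-requisites",   # also fetch the CSS, images and scripts each page needs
        "--convert-links",     # rewrite links so the copy works without the original server
        "--adjust-extension",  # save pages with an .html suffix where appropriate
        "--no-parent",         # never ascend above the starting URL
        f"--directory-prefix={output_dir}",
        site_url,
    ]

if __name__ == "__main__":
    # Hypothetical address; replace with the dynamic site to be archived.
    command = build_mirror_command("https://project.example.org/", "static-copy")
    # Print the command for review rather than running it straight away.
    print(shlex.join(command))
```

Once reviewed, the same list can be passed to `subprocess.run(command, check=True)`, and the resulting directory served by Apache or NGINX with no further infrastructure.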
The balance between dynamic features and maintainability is a compromise; one increases as the other declines. A static site such as that described above would sit somewhere in the middle of this compromise. At the extreme of maintainability, one can simply deposit project data in an institutional repository and maintain no live web presence whatsoever. As long as the data is available to download in standard, open formats such as TSV, XML or JSON, future researchers who know about the project and want to reuse its data are catered for. This has been good practice in digital humanities for decades.
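Producing such a deposit need not be elaborate. The following sketch, using only the Python standard library, serialises the same records as both TSV and JSON; the field names and the sample record in the usage note are invented for illustration, and a real project would export from its own database.

```python
import csv
import io
import json

def to_tsv(records, fieldnames):
    """Serialise a list of dicts as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

def to_json(records):
    """Serialise the same records as indented JSON, keeping non-ASCII text readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)
```

For example, `to_tsv([{"id": "1", "title": "Letter to a friend", "year": "1795"}], ["id", "title", "year"])` yields a header line followed by one tab-separated row, ready to be written to a file alongside the JSON version.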
But there is another important audience: people who are still to discover the project’s material. This audience will want to assess whether the project is relevant and useful to them without downloading and unpacking TSV or XML. It is these people who are targeted by a live website. A live site is always going to be more discoverable than a data file stored in a repository. But perhaps we can go further than this. With good planning, and within a framework, we can decide on an individual basis what to switch off on a website and what to leave available. From least to most maintainable, one could broadly grade the maintainability of site components as follows:
1. Forums. Crowdsourcing. Anything where users can upload content. Features such as these require active moderation. They cannot be maintained when staff are no longer available to do this.
2. User account features. Even using OAuth, these will occasionally require user support, which requires staff time.
3. Search. This requires a database server or equivalent.
4. Dynamic HTML / CSS content. This requires server-side processing, but can generally be converted into 5.
5. Static HTML / CSS content. Requires only the most basic web server technology to serve.
6. Data in a repository.
It is up to individual projects to decide where on this list their web offering will fall at each point in the project’s lifecycle. This piece seeks only to suggest that this should be a conscious decision, planned and costed from the outset.
Returning to discoverability, there is a case for maintaining a basic keyword search function on a site if resources allow. Keyword search has become the primary way we engage with material on the internet, and more importantly, the way in which we most often quickly assess whether a given resource is useful to us. Implemented using a framework, a basic keyword search need not be overly burdensome, or present a serious security risk.
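Much of the safety of a keyword search comes from never splicing the user's text into the SQL statement itself. The sketch below shows the idea with Python's sqlite3 module and bound parameters; the `documents` table and its columns are invented for illustration, and a framework's own query layer would normally perform this binding on the developer's behalf.

```python
import sqlite3

def escape_like(term, escape_char="\\"):
    """Neutralise LIKE wildcards so the user's text is matched literally."""
    for ch in (escape_char, "%", "_"):
        term = term.replace(ch, escape_char + ch)
    return term

def keyword_search(conn, term, limit=20):
    """Return (id, title) rows whose body contains the term."""
    pattern = "%" + escape_like(term) + "%"
    # The term travels as a bound parameter (?), never as part of the SQL text.
    return conn.execute(
        "SELECT id, title FROM documents "
        "WHERE body LIKE ? ESCAPE '\\' "
        "ORDER BY title LIMIT ?",
        (pattern, limit),
    ).fetchall()
```

Because the search term is only ever treated as data, input such as `' OR '1'='1` is simply looked for as a literal string and matches nothing unless a document happens to contain it.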
If keyword search cannot be accommodated, truly excellent static navigation pages can be a substitute. It is up to a scholar to recognise the most logical ways into their project’s information, be that date order, alphabetical order, division by topic, or others. Indexes such as these can exist as static pages, creating no maintainability burden. They can be thought of like indexes and finding aids in a book. Well executed, they can heavily mitigate the need for keyword search.
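Indexes of this kind can be generated once, at the point of retirement, and then served as ordinary files. A minimal sketch in Python follows, assuming the project can supply a list of (title, URL) pairs; the function name and the sample entries in the usage note are hypothetical.

```python
from collections import defaultdict
from html import escape

def build_alphabetical_index(entries):
    """Render (title, url) pairs as a static HTML list grouped by first letter."""
    groups = defaultdict(list)
    for title, url in entries:
        groups[title[:1].upper()].append((title, url))

    parts = ["<ul>"]
    for letter in sorted(groups):
        parts.append(f"  <li>{escape(letter)}<ul>")
        # Sort case-insensitively within each letter group.
        for title, url in sorted(groups[letter], key=lambda e: e[0].lower()):
            parts.append(
                f'    <li><a href="{escape(url, quote=True)}">{escape(title)}</a></li>'
            )
        parts.append("  </ul></li>")
    parts.append("</ul>")
    return "\n".join(parts)
```

Calling `build_alphabetical_index([("Beowulf", "beowulf.html"), ("Aeneid", "aeneid.html")])` returns a nested list with the A entries before the B entries. The same pattern could be adapted to date order or division by topic by changing the grouping key.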
One can perhaps think of the final, static or nearly static version of a digital resource as much more like a book, or an article – safe to be preserved at low cost and consulted in the future, without the maintenance costs and security risks of a complex dynamic website.
To conclude, digital humanities projects can mitigate the relatively short lifespan of innovative and experimental web outputs by consciously planning to showcase their results using more maintainable technology. We should be more conscious of doing the innovative digital work while a project is actively funded, and accept that more idiosyncratic features will eventually be retired. Maintainability can be maximised via a number of technical strategies including use of web frameworks and avoidance of storing personal data. There is a case to be made for avoidance of unnecessary features, but this must not come at the expense of innovation.
For years digital humanists have discussed sustainability while still creating hundreds of unmaintainable websites. Data repositories have made sure that research data is not usually lost, but they cannot provide the same discoverability as a live website, especially one with a keyword search, excellent indexing, or both. As digital humanities itself becomes middle-aged, we need to think seriously about embracing a project lifecycle which accepts the limited lifespan of innovative research tools and plans for their eventual succession by a maintainable and discoverable account of their findings.
1. Denbo, Haskins and Robey 2008, Sustainability of Digital Outputs from AHRC Resource Enhancement Projects, viewed 21 February 2019, <http://www.ahrcict.rdg.ac.uk/activities/review/sustainability08.pdf>
2. Multiple (wiki), Web Application Framework, Docforge, viewed 26 February 2019 via the web archive, <https://web.archive.org/web/20150723163302/http://docforge.com/wiki/Web_application_framework>
3. An imperfect estimation of a framework’s popularity is the number of ‘stars’ it has received on GitHub. Stars can be used to indicate support for a project, or to bookmark it. A list of frameworks ranked by number of GitHub stars can be seen here: <https://github.com/topics/framework>