Integrating Language Technology into Scholarly Research Workflows

by Wim Peters, Louisa Parks and Mitchell Lennan

1. Introduction

The overall tasks of identifying, extracting, and formalizing knowledge contained in larger volumes of text as part of a scholarly interpretation process are highly specialized and both knowledge and labour-intensive activities, especially when performed manually. The size of textual material to be interpreted creates a significant bottleneck for the exhaustive scholarly understanding of the semantic content of the textual source material and the domain it stems from, expressed in natural language.

Because language technology (LT) can provide access to information from larger amounts of text, its inclusion in scholarly interpretation methodologies is warranted, so that scholars can be assisted with results from automatic analysis.

LT’s main function is to turn source text into structured data. This extracted information is then, in general, linked back to the source texts using metadata in the form of annotations or external knowledge structures such as thesauri and ontologies. This information is reusable and informs subsequent steps in the scholarly research flow.

In our approach, LT workflow elements are customized to specific scholars’ research questions, involving a set of specific automated analysis tasks, such as concept acquisition and linguistic analysis, as a customized part of the overall scholarly workflow. The LT tasks are not a one-step solution by themselves, but an integral part of scholarly research. They incrementally and flexibly provide textually derived information and structured knowledge in service of the scholarly research questions. This knowledge makes relevant scholarly content explicit and accessible for further exploration and interpretation by experts, leading to further information requirements, which are then incrementally provided by the LT tasks.

2. The Role of Language Technology in Scholarly Research

Incorporating automated techniques into humanities research is still a contentious issue. If unchecked, the result of automatic analysis can easily overload and confuse the scholar and obfuscate scholarly research targets. A completely automatic analysis does not conform to scholarly requirements of quality and rigour because it presupposes that scholars are comfortable with setting wider tolerances for error. Across disciplines, it is recognised that technical fixes will achieve little unless they are embedded in a broader understanding of the rationale and assumptions behind qualitative research (e.g. Zelik et al., 2007). It was noted quite some time before computers became omnipresent in academic practice that there is a need for ‘critical and self-conscious scholarly engagement with computers’ (Mumford, 1962).

Within Digital Humanities, scholars rightfully object to regarding automated analysis as a one-stop solution that drives the scholarly research process (see, among others, Hitchcock, 2013).

Turning our attention specifically to LT techniques, also called natural language processing (NLP) techniques, we observe that they provide valuable information derived from texts and knowledge bases for a wide variety of analyses (e.g. Azzopardi et al., 2016, for the legal domain). NLP methodologies use techniques such as linguistic analysis, named entity recognition, term extraction and relation extraction, which provide useful information for scholarly analysis and interpretation in various stages of the research workflow. They form a bridge between the linguistic surface structure and the underlying conceptual content of textual resources because they enable the computer-based, automatic acquisition of content. When used in the right way, they are useful tools for conceptual exploration, analysis and knowledge management.

The main methodological caveat is that, in order to address the knowledge acquisition bottleneck and allow scholars to perform their research, natural language processing techniques should be applied in a way that supports the scholarly workflow (Peters and Wyner, 2016). This requires flexible and targeted collaboration between scholarly analysis and LT. Instead of allowing technological analysis tools to dictate the research process, these should rather serve the research interests in an informative way that satisfies scholarly requirements. Therefore, instead of dominating the research agenda, LT needs to methodologically fit into the scholarly research methods and be able to provide a targeted contribution to addressing the research questions. Its results should inform research at the point in the workflow where they are supplied, and they should be presented in a way that is understandable and useful for scholarly interpretation. Scholars must be enabled to make the most of LT’s ability to enrich the research methodology in order to develop models, provide data, make the means of analysis explicit, and explore hypotheses. In this way LT can help to unlock large amounts of data for focused manual analysis by experts.

The inclusion of language and information technology into interpretation workflows customised to scholars’ research questions entails a set of automated analysis tasks that assist scholars, and a feedback procedure between automated results and manual interpretation. Knowledge is acquired semi-automatically through a collaborative effort involving language technology and scholarly expertise for interpretation and modelling.

The acquisition is incremental in the sense that each LT task contributes to a scholar driven knowledge discovery in line with the research questions defined at the start of the collaboration (see next section). The automatic analysis tasks are completely geared towards providing relevant information for the interpretative steps involved in the methodology of tackling the scholars’ research questions.

3. Workflow

In this paper we present a scholarly research workflow that makes dynamic use of language technology tasks for qualitative research. These consist of strategic and flexible choices of LT tasks within the research workflow. The concrete LT tasks will differ per domain, flexibly and dynamically customised to the scholarly research requirements.

The overall methodology we adopted in this paper, including the contribution from LT, is mixed-method (Creswell, 2009): quantitative and qualitative research can be combined at different stages of the research process, such as the formulation of research questions, sampling, data collection and data analysis (Bryman, 2006).

LT can apply methods that are positioned towards the qualitative and the quantitative ends of the analysis scale (Burdick et al., 2012). It produces qualitative (e.g. syntactic analysis) and quantitative (e.g. frequency of word occurrences) results, which ultimately feed into qualitative scholarly research by focusing close reading. Whereas quantitative analysis applies techniques to derive generalisations from large amounts of data, qualitative analysis is characterised by work to identify specific information in data of smaller scale.

The overall organization of this general cyclical methodological workflow is illustrated by the informal metamodel in Figure 1 below. To start the scholarly research process, the formulation of research questions drives the scope of the workflow and involves considerations such as: What concrete information do we need from the text? What knowledge will form the basis for further research in subsequent cycles within the workflow? How are we going to use this information for our research purposes?

Then, the next step involves the application of manual text analysis in the form of close reading, and the production and evaluation of automatically extracted information. These tasks are continuously tailored to the research needs and give the researchers the opportunity to constantly re-appraise the research questions in the light of newly acquired knowledge. This enables a cyclical answering, adjustment and extension of the research questions, where both manual and (semi-)automatic LT analysis provide customised outcomes. Each cycle forms an incremental progression towards the resolution of the research questions.

The LT tasks are a fully integrated part of the workflow, customised to the researcher’s informational needs in each cycle. An interactive feedback procedure between automated results and manual legal interpretation enables a dynamic workflow with collaborative incremental acquisition of relevant knowledge until, if required, the acquired knowledge and research findings can be formalized into a data structure such as a database or ontology.

Figure 1: Collaborative automatic and manual knowledge acquisition workflow

This integration of manual and automatic analysis of textual material aims at maximizing the acquisition and exploration of conceptual structure. The LT results will therefore not form a one-step solution, nor dictate the scope of the research process. LT tasks perform a fully ancillary role. They will incrementally and flexibly provide textually derived information and structured knowledge within the collaborative workflow, in order to cyclically support scholarly interpretation, evaluation and definition of new analysis tasks in subsequent stages of the workflow.

4. Application Domain: Legal Text Analysis

This work was undertaken under the auspices of the five-year (2013-2018) European Research Council funded BENELEX project. BENELEX in this scenario acts as a ‘case study’ for the research workflow presented in section 3. The project aimed to investigate both the conceptual and practical dimensions of the legal concept of fair and equitable benefit sharing (Morgera, 2016). This included its ‘role and limitations in ensuring fairness and equity in the identification and allocation among different stakeholders of the advantages arising from environmental protection, the sustainable use of natural resources, and the production of knowledge.’

The project also sought to better understand the progressive development of benefit-sharing obligations in different areas of international environmental law (e.g. biodiversity conservation, oceans, climate change, etc.). An overarching question for the BENELEX project was ‘How does benefit-sharing develop and operate at the intersection of international, transnational and national law and the customary law of indigenous peoples and local communities (IPLCs)?’ Further, the BENELEX project attempted to ‘identify challenges for IPLCs in fairly and equitably sharing benefits in different sectors and regions of the world, and the interface between local, national, regional and international law in this connection.’

One of the challenges identified during the project was the participation of IPLCs within relevant decision-making systems, for example natural resource management and protected areas. IPLC participation within the international arena, particularly within the scope of the United Nations Convention on Biological Diversity (CBD), was highlighted by the BENELEX researchers as particularly problematic. This issue was the driving force behind the research undertaken in this paper.

The aim was to establish how issues related to IPLCs and their participation in the CBD arena are approached within different authoritative textual resources over time. This was done by detailed analysis of their textual content as described in the following section.

The research questions the study sought to answer were as follows:

  1. Is there a distinct discourse around IPLCs in comparison to other actors in the United Nations Convention on Biological Diversity international negotiating arena?
  2. What are the most common, or important phrases or keywords that are associated with IPLCs?
  3. Has this discourse changed over time? How (and importantly, why) has the language used towards IPLCs evolved?

Three documents were analysed for the purposes of this study. These are international guidelines relating to IPLC issues, negotiated and agreed by the 196 Parties to the CBD:

  • Akwé: Kon Voluntary Guidelines (CBD, 2004): These provide guidance on prior and social impact assessments on proposed developments that are likely to affect IPLCs (8300 words);
  • The Tkarihwaié:ri Code of Ethical Conduct (CBD, 2010): This provides guidance regarding activities and interactions with IPLCs (including research) and for the development of national, regional, or local codes of ethical conduct which should be followed during such interactions (3600 words);
  • Mo’otz Kuxtal Voluntary Guidelines (CBD, 2016): These provide guidance on consent and benefit-sharing from use of IPLC traditional knowledge, including guidance on prior informed consent (3500 words).

These three documents provide concise summaries of lengthy debates about the role and participation of these groups within the CBD. Their exploration and comparison were intended both as an evaluation of the methodology and as a way of generating insight into common or important keywords associated with IPLCs and into the evolution of these keywords over time.

We chose them for our pilot because the results could be checked against the in-depth knowledge of other members of the BENELEX research team and the outcomes of previous qualitative analysis (Parks, 2018; Parks and Schröder, 2018). If this ‘test’ was passed, the integrated LT techniques could be used to generate avenues for research and initial hypotheses to guide the interpretation of results from comparisons using larger bodies of text, and to inform the study regarding the presence of any distinct discourse on IPLCs.

The presentation of initial results from this comparison, described in Section 6, thus seeks to show how the analysis allowed us to think about the general evolution of talk about IPLCs over time. The methodological assumption was that any change in frequency and meaning of contextual elements of the analysed texts found during the comparative analysis in this study reflects changing attitudes towards, and perspectives on, IPLCs and the nature of their participation within the CBD arena.

5. Methodology

The methodology adopted in BENELEX was a specification of the workflow metamodel in Figure 1 above. In defining this specification, there are various concrete methodological aspects to consider pertaining to the close interaction between scholarly manual research and language technology.

In our work we distinguish several phases that are the results of iterative manual and automatic analyses. Each phase contains a number of concrete methodological steps.

Overall, the BENELEX methodology consists of two main phases. We will go into greater detail in the subsections below.

5.1 Phase 1: Domain Exploration

The aim of this first phase is to form a conceptual representation of our legal domain that enables us to adopt relevant terminological vocabulary that is informative for the research and to use this in our interpretation of the domain. In order to determine which concepts form part of the semantic scope of the textual material, we adopt an LT-assisted domain exploration, defined as the conceptual characterization of the domain in terms of concepts and their organization. From a methodological perspective this involves a mixed model approach to conceptual acquisition and modelling, applying both automatic and manual extraction, and manual evaluation of information obtained from the three documents described in the previous section. The results are a manually approved list of terms and lookup lists of actors such as persons and organizations. These have been acquired through repeated cycles of production, evaluation and extension. Figure 2 below gives a graphical overview of the tasks in phase 1, where specific tasks have now replaced the general ones from Figure 1.

  • Step 1: The automatic extraction of relevant keywords (terminology) and their frequency. These keywords capture the conceptual spectrum of the domain.
  • Step 2: Manual close reading by scholars for the expert evaluation and selection of terms/keywords and further domain exploration. For this purpose, we make use of the AntConc tool (Anthony, 2016) to enable the scholars to examine keywords in context (KWIC).
  • Step 3: The identification of entities that act as actors and stakeholders, as well as their typology, both manually and automatically (named entity recognition, which provides a list of persons, organizations and locations). Further, relevant actors are selected from the automatically extracted candidate terms and continuously sourced from the text during close reading.

Figure 2: BENELEX workflow phase 1
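The keyword-in-context view used in step 2 can be illustrated with a minimal sketch. This is not AntConc's implementation; the function name `kwic`, the context width and the sample sentences are all assumptions made up for illustration.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per match of `keyword` in `text`."""
    lines = []
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        # Take up to `width` characters on either side of the match.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines

# Invented sample text, not taken from the CBD guidelines.
sample = ("Public consultation should take place early. "
          "The consultation process should involve the affected community.")
for line in kwic(sample, "consultation"):
    print(line)
```

Aligning the keyword in a fixed column is what lets a reader scan many contexts of the same term quickly during close reading.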

The automatic acquisition of concepts that are salient in the domain (step 1) was performed using the TermRaider tool, which is part of the General Architecture for Text Engineering (GATE) (Cunningham et al., 2002), a platform that enables the creation and running of LT pipelines, and the visualization of their results.

The tool combines linguistic analysis, such as the addition of part-of-speech information to words, with grammars that define the possible combinations of parts of speech into phrases. In this way it distinguishes term candidates such as “biodiversity”, “social benefit”, and “delivery of environmental benefit”.
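As a rough illustration of how part-of-speech patterns can yield such candidates, the sketch below matches one simple noun-phrase pattern over hand-tagged tokens. The pattern, the tag names and the function `term_candidates` are assumptions for illustration; TermRaider's actual grammars are not described here and are likely richer.

```python
import re

def term_candidates(tagged_tokens):
    """Extract phrases matching (ADJ|NOUN)* NOUN, optionally chained with 'of'."""
    # Encode each token as one character so a regex can run over the tag sequence:
    # N = noun, A = adjective, P = the preposition "of", O = anything else.
    codes = "".join(
        "P" if word.lower() == "of" else {"NOUN": "N", "ADJ": "A"}.get(tag, "O")
        for word, tag in tagged_tokens
    )
    pattern = re.compile(r"[AN]*N(?:P[AN]*N)*")
    # Map each matched span of codes back to the corresponding words.
    return [
        " ".join(word for word, _ in tagged_tokens[m.start():m.end()])
        for m in pattern.finditer(codes)
    ]

# Hand-tagged example sentence; a real pipeline would obtain tags from a tagger.
tagged = [
    ("the", "DET"), ("delivery", "NOUN"), ("of", "ADP"),
    ("environmental", "ADJ"), ("benefit", "NOUN"), ("supports", "VERB"),
    ("biodiversity", "NOUN"), ("and", "CCONJ"),
    ("social", "ADJ"), ("benefit", "NOUN"),
]
print(term_candidates(tagged))
```

On this invented sentence the pattern recovers the three kinds of candidate mentioned above: a single noun, an adjective-noun pair, and a noun phrase joined by "of".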

Then, a termhood score is computed by applying the statistical measures TF-IDF (Salton and Buckley, 1988) and domain relevance (Bouma and Vossen, 2010). Term candidates are offered to the scholars for assessment in order of decreasing termhood score, using keyword-in-context information in AntConc when necessary. The advantage of this method is that the scholars see a variety of terminology in context, which deepens their insight into the semantics of the texts.
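The TF-IDF part of such a score can be sketched as follows. This uses one common variant (raw term count multiplied by the natural logarithm of the inverse document frequency); the exact weighting used by TermRaider is not specified here, the domain relevance measure is not shown, and the document names and term lists are invented.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: {name: [term candidates]} -> {name: {term: tf-idf score}}."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))
    scores = {}
    for name, terms in docs.items():
        counts = Counter(terms)
        scores[name] = {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores

def ranked(doc_scores):
    """Sort terms by decreasing score, breaking ties alphabetically."""
    return sorted(doc_scores.items(), key=lambda item: (-item[1], item[0]))

# Invented toy data: three documents with their extracted term candidates.
docs = {
    "doc_a": ["impact assessment", "impact assessment", "consultation",
              "traditional knowledge"],
    "doc_b": ["benefit-sharing", "benefit-sharing", "traditional knowledge",
              "consent"],
    "doc_c": ["ethical conduct", "traditional knowledge", "respect"],
}
scores = tf_idf(docs)
for term, score in ranked(scores["doc_a"]):
    print(f"{term}: {score:.3f}")
```

A term that occurs in every document, such as "traditional knowledge" in this toy data, receives a score of zero under this variant, which is why TF-IDF favours vocabulary that distinguishes one document from the others.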

The manually approved terminology and domain specific actors are linked to the texts as textual metadata in the form of annotations. This allows their further use in text analysis and the provision of focused research material for further scholarly research. For the creation, manipulation and graphical representation of text annotations that capture extracted knowledge we used the GATE platform. Figure 3 below illustrates an annotated text in the GATE graphical user interface. Each annotation type has its own colour, with which annotated text spans are highlighted. Purple text represents instances of IPLC actors, and pink text shows relevant terminology.
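Linking approved terms and actors back to the text can be pictured as stand-off annotation, where each annotation records character offsets and a type rather than altering the text. The sketch below is a simplified illustration and not GATE's API; the `Annotation` class, the type labels and the sample sentence are assumptions.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    start: int   # character offset where the span begins
    end: int     # character offset just past the end of the span
    kind: str    # annotation type, e.g. "Term" or "IPLC_Actor" (hypothetical labels)

def annotate(text, lookup):
    """lookup: {surface string: annotation type}; matching is case-insensitive."""
    annotations = []
    for surface, kind in lookup.items():
        for match in re.finditer(re.escape(surface), text, re.IGNORECASE):
            annotations.append(Annotation(match.start(), match.end(), kind))
    return sorted(annotations, key=lambda a: (a.start, a.end))

# Invented sentence and lookup list for illustration.
text = "Indigenous and local communities should be involved in benefit-sharing."
lookup = {
    "indigenous and local communities": "IPLC_Actor",
    "benefit-sharing": "Term",
}
for ann in annotate(text, lookup):
    print(ann.kind, repr(text[ann.start:ann.end]))
```

Because the source text is left untouched, several annotation layers (terms, actors, later analysis results) can coexist over the same document and be queried together.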

Figure 3: Annotations in GATE

5.2 Phase 2: Contrastive Comparison

Once we have built up our domain characterization in the form of lists of terms and actors, we then perform contrastive comparison between documents and actor types in the second phase.

The terminology contained within a particular document provides it with its own semantic signature as illustrated by the term clouds in figures 4 and 5.

Figure 4: Term cloud Akwé: Kon

Figure 5: Term cloud Mo’otz Kuxtal

Using this signature vocabulary of terms and actors, contrastive comparison between documents can highlight differences in perspectives on actors between documents. The postulation is that different documents make different statements about different actors.

Because the research questions focused on IPLCs, our methodology concentrates in its contrastive comparison on the different ways in which the documents talk about IPLCs. Through text and its annotation with terminology and actor information we can now derive the co-occurrences of IPLCs with actors and terms within each paragraph. Co-occurrence within paragraphs presupposes a certain level of thematic relatedness, and the inference is that co-occurring terms modulate the discourse around the IPLC actor type. The postulation behind subsequent qualitative scholarly investigation is that the difference in terminological contexts of IPLCs across documents is informative regarding the ways in which each document speaks about these actors. Pairwise contrastive comparison of the documents (Akwé: Kon – Mo’otz Kuxtal; Akwé: Kon – Tkarihwaié:ri and Tkarihwaié:ri – Mo’otz Kuxtal) highlights the conceptual similarities and differences between them by examining which co-occurring terms they share when mentioning IPLCs and which are unique to them. Figure 6 illustrates terminological contrast and overlap between two documents with some examples.
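The paragraph-level co-occurrence step can be sketched as below. For brevity the sketch uses plain substring matching on invented paragraphs, whereas the workflow described here derives co-occurrences from annotations; the function name and the sample data are assumptions.

```python
from collections import Counter

def cooccurring_terms(paragraphs, actor_mentions, terms):
    """Count, per term, the actor-mentioning paragraphs in which it appears."""
    counts = Counter()
    for paragraph in paragraphs:
        lowered = paragraph.lower()
        # Only paragraphs that mention the target actor type contribute.
        if any(mention in lowered for mention in actor_mentions):
            counts.update(term for term in terms if term in lowered)
    return counts

# Invented paragraphs, actor mentions and approved terms.
paragraphs = [
    "Indigenous and local communities should be invited to a public "
    "consultation on the impact assessment.",
    "The proponent should publish the impact assessment.",
    "Local communities should give consent to benefit-sharing arrangements.",
]
actor_mentions = ["indigenous and local communities", "local communities"]
terms = ["public consultation", "impact assessment", "consent", "benefit-sharing"]

context = cooccurring_terms(paragraphs, actor_mentions, terms)
print(sorted(context.items()))
```

In this toy data the second paragraph is ignored because it does not mention the actor, so "impact assessment" is counted once rather than twice.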

Figure 6: Contrastive term context analysis: differences and overlap between Akwé: Kon and Mo’otz Kuxtal
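The contrast and overlap between two term contexts reduce to set operations, as in the following sketch. The document labels and term sets are invented placeholders and do not reproduce the study's data.

```python
from itertools import combinations

def contrast(context_a, context_b):
    """Split two term contexts into unique and shared vocabulary."""
    a, b = set(context_a), set(context_b)
    return {
        "only_a": sorted(a - b),   # terms unique to the first document
        "shared": sorted(a & b),   # terms both documents use near the actor
        "only_b": sorted(b - a),   # terms unique to the second document
    }

# Invented term contexts for three hypothetical documents.
contexts = {
    "doc_a": {"consultation", "stakeholder", "traditional knowledge"},
    "doc_b": {"ethical conduct", "respect", "traditional knowledge"},
    "doc_c": {"benefit-sharing", "consent", "traditional knowledge"},
}

# Pairwise comparison of every document with every other document.
pairwise = {
    (x, y): contrast(contexts[x], contexts[y])
    for x, y in combinations(sorted(contexts), 2)
}
for pair, result in pairwise.items():
    print(pair, result)
```

The unique terms in each pair are the candidates handed to the scholars for qualitative interpretation, while the shared terms indicate common ground between the documents.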

6. Results and Observations

This section offers some examples of findings from our initial study on the evolution of language in our three source texts. The overall picture is one where the language used about IPLCs appears to be evolving from a more instrumental view in Akwé: Kon towards a view of IPLCs as a more independent group of rights-holders in the Mo’otz Kuxtal text. The Akwé: Kon guidelines concern procedures that should be undertaken whenever an actor (such as a private business) wants to carry out projects on IPLCs’ lands. These procedures take the form of a range of impact assessments (environmental, social, and cultural), which should precede any development and actively involve local stakeholders. This is a practical and functional view of the issue of how to allow these groups to participate effectively in projects with immediate local impacts. By the time we reach the Mo’otz Kuxtal guidelines, this rather more bureaucratic view of participation has shifted to a different view of involvement, signalled by a language that emphasises questions of ethics and recognises issues linked to colonial histories, such as the repatriation and recovery of traditional knowledge.

The results of the analysis suggest this overall picture in various ways. One example is the term ‘involvement’, which appears 5 times in Akwé: Kon but 21 times in Mo’otz Kuxtal. This indicates more discussion of what involvement should entail for IPLCs in Mo’otz Kuxtal as opposed to Akwé: Kon.
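Since the two documents differ in length, it can help to put such counts on a common scale. The sketch below normalises the counts reported above to occurrences per 1,000 words, using the approximate word counts given in Section 4 (8300 and 3500 words); the helper name `per_thousand` is an assumption, and this normalisation is offered as an illustration rather than as a step of the reported analysis.

```python
def per_thousand(count, total_words):
    """Relative frequency expressed as occurrences per 1,000 words."""
    return 1000 * count / total_words

# Counts of 'involvement' and approximate document lengths as stated in the text.
akwe_kon = per_thousand(5, 8300)
mootz_kuxtal = per_thousand(21, 3500)

print(f"Akwé: Kon:     {akwe_kon:.2f} per 1,000 words")
print(f"Mo’otz Kuxtal: {mootz_kuxtal:.2f} per 1,000 words")
print(f"Ratio:         {mootz_kuxtal / akwe_kon:.1f}")
```

On these figures the relative frequency rises from roughly 0.6 to 6.0 occurrences per 1,000 words, so the contrast is stronger once document length is taken into account than the raw counts alone suggest.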

In addition, pairwise comparisons of the different texts painted a more detailed picture of this overall evolution. The unique terms linked to participation are telling in this respect. In Akwé: Kon the common terms include ‘consultation’, ‘public consultation’, and ‘stakeholder’. In Tkarihwaié:ri these terms evolve towards a different type of language linked more clearly to the question of why participation is needed: ‘ethical conduct’, ‘respect’, ‘sacred sites and species’, etc. In Mo’otz Kuxtal the emphasis in terms seems to shift to the theme of making participation ‘effective’, also invoking ‘trust’. Overall, this signals an evolution from a rather bureaucratic view of IPLC involvement as ‘consultation’, which indicates listening but not being bound to actually do anything about the views of others, through an emphasis on ethics, accompanied by language recognising the faults of colonialism (repatriation, recovery of traditional knowledge), to the current situation where the effort seems to be on giving real meaning to IPLC participation as distinct from mere consultation.

A final point in this direction is that the unique language of Akwé: Kon suggests that, as long as an impact assessment is carried out, a development will and should take place. The terms in the document indicate safeguards, grievance measures and the like, but nothing clearly denotes the possibility of denying permission. This seems to contrast with the recognition of fault in Tkarihwaié:ri, but as detailed above that recognition of wrongdoing seems firmly planted in a colonial past, and few terms appear to recognise the continuing post- or neo-colonial present. This does seem to come more to the fore in the Mo’otz Kuxtal guidelines, with their unique terms constructing a more concrete view of the current rights and grievances of IPLCs.

The view of a wider debate about the content of ‘participation’ by IPLCs is also indicated by other findings. References to benefit-sharing are a good example. The ‘equitable sharing of benefits’ is referred to 11 times in the Mo’otz Kuxtal guidelines, but only once in Akwé: Kon. ‘Benefit-sharing’ occurs 12 times in Mo’otz Kuxtal, but not at all in Akwé: Kon. Beyond simple considerations of the focus of each document, the fact remains that this discussion of the concrete consequences that should flow from the use of IPLCs’ resources and knowledge indicates a shift towards discussing the content of participation. Further support comes from the inclusion of terms that nod to an even wider scope for debate in Mo’otz Kuxtal that are missing from Akwé: Kon such as ‘unlawful appropriation’, ‘technology transfer’, ‘compliance’ and ‘rights of indigenous peoples and local communities’. All of these imply an evolution of the debate that casts IPLCs as rights holders beyond their role as actors in impact assessment procedures.

The results of this initial analysis also find some confirmation from the results of earlier qualitative discourse and frame analyses of CBD COP decisions. A discourse analysis based on texts referring to IPLCs drawn from COP decisions suggested that the bulk of talk about these groups is concentrated on the themes of participation and recognition (Parks, 2018). Both themes are deepened by the results garnered from this analysis. A frame analysis of COP decision texts referring to participation by IPLCs also bolsters the view of a discourse evolving from a more bureaucratic and formal view towards a wider debate. Specifically, the frame analysis indicated that the content of ‘participation’ is very far from settled in the CBD – language is often vague and fails to identify specific roles for specific actors. Nevertheless, in more recent decisions new frames around themes including ‘participation for respect’, which refers to texts that see the role of participation as contributing to the actorhood of IPLCs as rights holders, have emerged (Parks and Schröder, 2018).

7. Conclusion

In this paper, we presented a collaborative workflow metamodel involving scholars and language technologists. The general postulation is that a combination of quantitative and qualitative analysis methods, involving LT and scholarly analysis and dynamically customized to scholarly requirements, enables a research workflow that produces an incrementally growing body of knowledge in the form of qualitatively assessed metadata.

This general workflow model is applicable across scholarly domains and requires research-specific specification concerning the dynamic customization of individual LT microtasks within each step of the workflow according to the scholarly research needs. LT is fully ancillary to scholars in this process.

This workflow structure encourages the modularization of analysis steps and supports progress on the arduous path towards reproducibility and reusability of results, repeatability of method and standardization of the representation of acquired knowledge.

Applying the workflow metamodel we performed a pilot study in the legal domain by specifying the metamodel into an in-depth case study based on both distant and close reading analysis of documents linked to the CBD. The research interest was to investigate and compare discourses relating to indigenous peoples and local communities in three documents that are key to understanding the evolution of discourses about the participation of indigenous peoples and local communities.

Comparing the three documents we found that the LT derived information informed the research questions, and that we were able to draw conclusions about the overall evolution of the discourse about IPLCs and their participation. The insights acquired from the LT workflow were confirmed by pre-existing expert knowledge and the findings from previous qualitative analysis. The conclusion is that this methodology benefits scholarly research and will allow the generation of initial hypotheses and avenues for probing comparison results based on much larger bodies of text on an empirically-informed basis.

This work was undertaken under the auspices of the five-year (2013-2018) European Research Council funded BENELEX project.

8. References

Anthony, L. 2018, AntConc (Version 3.5.7), computer software, Tokyo, downloaded June 2018.

Azzopardi, S., Gatt, A. and Pace, G.J. 2016, Integrating Natural Language and Formal Analysis for Legal Documents, Conference on Language Technologies & Digital Humanities, Ljubljana, 2016.

Bouma, W. and Vossen, P. 2010, Bootstrapping Language-Neutral Term Extraction, Proceedings of the International Conference on Language Resources and Evaluation, LREC 2010, 17-23 May 2010, Valletta, Malta.

Brüning, J. and Gogolla, M. 2011, UML metamodel-based workflow modelling and execution, in EDOC, 2011, pp. 97-106.

Bryman, A. 2006, Integrating quantitative and qualitative research: How is it done?, Qualitative Research 6(1), 97-113.

Burdick, A., Drucker, J., Lunenfeld, P., Presner, T. and Schnapp, J. 2012, Digital Humanities, Cambridge, Mass.: MIT Press.

Convention on Biological Diversity 2004, Akwé: Kon Voluntary Guidelines, Convention on Biological Diversity, viewed July 2018.

Convention on Biological Diversity 2010, Tkarihwaié:ri Code of Ethical Conduct, Convention on Biological Diversity, viewed July 2018.

Convention on Biological Diversity 2016, Mo’otz Kuxtal Voluntary Guidelines, Convention on Biological Diversity, viewed July 2018.

Creswell, J.W. 2009, Research Design: Qualitative, Quantitative, and Mixed Methods Approaches (3rd edition), Thousand Oaks, CA: Sage.

Cunningham, H., Maynard, D., Bontcheva, K. and Tablan, V. 2002, GATE: an Architecture for Development of Robust HLT Applications, In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 7–12 July 2002, ACL ’02, pages 168–175, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics.

Hitchcock, T. 2013, Confronting the Digital, Cultural and Social History, 10:1, 9-23.

Mumford, L. 1962, The Sky Line: “Mother Jacob’s Home Remedies”, New Yorker, 1 December 1962, p. 148.

Morgera, E. 2016, The Need for an International Legal Concept of Fair and Equitable Benefit-sharing, European Journal of International Law 27(2), pp. 353-383.

Parks, L. 2018, Spaces for local voices? A discourse analysis of the decisions of the Convention on Biological Diversity, Journal of Human Rights and the Environment 9(2), 141-170.

Parks, L. and Schröder, M. 2018, What we talk about when we talk about ‘local’ participation in international biodiversity law. The changing scope of Indigenous peoples and local communities’ participation under the Convention on Biological Diversity, Participation and Conflict 11(3), 743-785.

Peters, W. and Wyner, A. 2016, Legal Text Interpretation: Identifying Hohfeldian Relations from Text, Proceedings of LREC 2016.

Salton, G. and Buckley, C. 1988, Term-weighting approaches in automatic text retrieval, Information Processing & Management 24(5).

Zelik, D., Patterson, E.S. and Woods, D.D. 2007, Understanding rigor in information analysis, 8th International Conference on Naturalistic Decision Making, Pacific Grove, CA.