Creating a machine-assisted spoken version of Chaucer’s Pardoner’s Prologue and Tale using Amazon Polly
1. Introduction
This chapter documents the methodological process of working with machine learning models (ChatGPT-4, Allosaurus, Amazon Polly) to produce a spoken version of Chaucer’s Pardoner’s Prologue and Tale (henceforth, PPT), including both speech-to-text and text-to-speech processes. The primary purpose of this chapter is to evaluate the success of each of these models in producing an output suitable for teaching a Chaucerian text. In particular, we examine whether such an approach would be useful for university students, in assisting their understanding of the tale, and whether the phonological and prosodic outputs are appropriate for doing so. We also investigate the level of editorial intervention required at each stage of the process, and thus whether the use of machine-assisted methods can lead to a faster and more streamlined process in producing pronunciations of the tale without any loss of quality or authority.
In user focus groups, students discussed their requirements for an edition of the Pardoner’s Prologue and Tale for the classroom, or for independent study. One of the main features arising from the discussions was the ability to listen to the tale and practice their pronunciation. Students find that activities where they can speak the text aloud – either individually, in small groups, or as a whole class – can assist with learning not only the pronunciation of Middle English sounds, but the underlying meaning of the text. Some students referred to the ability to understand satire, double entendres and puns in medieval texts once they had listened to the tale being read to them, or if they practiced reading themselves. Yet, one of the main requirements of a spoken version of the tale must be that it is engaging—students state that the tone and expression of the reader is pertinent to their experience of understanding the role of different characters, comedy and poetic devices (see Section 6 for more detail on students’ views). Thus, the following sections aim to critique the production of an AI version of a Chaucerian tale, prior to its testing by users, and understand how useful such outputs would be for a university classroom.
2. The Middle English pronunciation debate
One of the main challenges medieval researchers face when producing spoken versions of Middle English texts is the number of ongoing debates surrounding pronunciation. While there is agreement concerning the pronunciation of some aspects of Chaucer’s English, there remain some grey areas, due to the large amount of spelling and structural variation present in Middle English texts. For example, it is generally agreed that Chaucer would have pronounced the final unstressed syllable of each line, -e (or the final schwa [ə]) (see Burrow 1971). This unstressed syllable is potentially a vestige of an Old English inflectional ending, and might be used to highlight alternating stress. The final schwa eventually fell out of use in the Early Modern English period.
There is also disagreement concerning Chaucer’s use of iambic pentameter, i.e. the degree to which lines consisted of five metrical feet and ten syllables. Solopova critiques such metrical consistency in Middle English works, specifically those of Chaucer, stating that “poetry where every line is perfect in the sense that it follows exactly the requirements of a single pattern and is therefore identical with every other line could not exist: its regularity would make it uninteresting” (1997: 143).
The purpose of the methods and analysis discussed here is not to reproduce a pronunciation of Chaucer that follows proposed Middle English conventions perfectly – or even ‘accurately’ – given the inconsistency with which Chaucer adhered to such conventions himself. Instead, it is to determine the extent to which different Middle English and Chaucerian phonological and prosodic features could be recreated using machine-assisted methods, particularly if there is no Middle English expert available to assist in a reading of the tale for a digital resource. This analysis also contributes to the wider discussion surrounding AI models and their capacity to generate ‘new’ outputs, thus assessing whether the resulting pronunciation comes close to a general medieval English sound that can be used for the purposes of teaching.1
3. Using ChatGPT to produce Middle English phonemes
After testing ChatGPT-3.5’s translation of Chaucer’s PPT, we prompted the upgraded version of ChatGPT (version 4) to produce International Phonetic Alphabet (IPA) outputs of the tale via OpenAI’s API (newer versions of ChatGPT have since been released, as generative AI continues to evolve). The following prompt was used as the initial step in the text-to-speech process:
prompt = f"""
Your task is to process The Pardoner's Tale from Chaucer's The Canterbury Tales derived from the Ellesmere Manuscript. This is in the original language of Middle English. \
The text has been split into lines and is provided in JSON format, delimited by triple backticks. \
for each line: \
step 1 - create an IPA pronunciation from the text, based on Middle English, substitute any double quotes in the output with backticks. (ipa_original)\
Provide output in JSON format with the following keys: ln, en_gb_rhyme. \,
Only process the supplied lines that have an 'ln' key. \
An example of the output is provided in (1), which also shows the editorial changes that needed to be made:
(1) Example IPA output from ChatGPT-4:
3.1. Challenges of using ChatGPT in the process of text-to-speech
ChatGPT generally produced the modern English IPA equivalent of the Middle English sound when undertaking the prompted task. First, GPT generally failed to include the line-final unstressed syllable on any line within PPT. For instance, welle ‘well’ and swelle ‘swell’ at the end of lines 27-28 were transcribed to [wɛl] and [swɛl], without the final unstressed schwa. The same could be said for the rhyming words ystonge ‘stung’ and tonge ‘tongue’ (lines 29-30), which could be pronounced with final schwa, as well as the plosive [g] following the nasal [ŋ] in Middle English (e.g. see A Guide to Chaucer’s Pronunciation by Kökeritz 1978: 8). This final syllable may have been used to highlight alternating stress patterns (i.e. the rhythmic ‘da DUM’ of Chaucerian verse and iambic pentameter).
In addition, Chaucer’s works came before the ‘Great Vowel Shift’ (or GVS, e.g. see van Gelderen 2014: 22 for further explanation), which affected the pronunciation of long vowels. Consequently, the vowel sounds in words such as ‘stung’ and ‘tongue’, transcribed as the vowel [ʊ] by GPT, were more likely pronounced with the vowel [ɒ] (as in the word ‘orange’). It is evident that GPT often relied on Modern English pronunciation of vowels to produce IPA for the tale.
Lastly, one of the main errors GPT made was the production of Modern English IPA for the entire word. For instance, the words draughte and taughte (lines 37-38) were transcribed using its knowledge of the Modern English words ‘draught’ and ‘taught’, which do not rhyme in the present day. In ME, these words rhyme, and would have been pronounced similarly to [drɔxtə] and [tɔxtə] or [drɒxtə] and [tɒxtə], with <gh> pronounced as the velar fricative [x], as in German ‘Bach’. GPT instead used the unrhyming pair [drɑːft] and [tɔːt]. Again, GPT appears to be drawing on its training on Modern English to produce IPA here.
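The missing line-final schwa can be screened for mechanically before any manual correction. The following is a minimal sketch only, not the workflow used in this chapter: the function name `missing_final_schwa` and the rule it applies (a line whose last word is spelled with final -e should end in [ə]) are illustrative assumptions, and the rule will over-flag wherever a final -e was silent or elided.

```python
def missing_final_schwa(lines):
    """Return line numbers whose last word is spelled with final -e
    but whose IPA transcription does not end in schwa."""
    flagged = []
    for ln, spelling, ipa in lines:
        last_word = spelling.split()[-1].strip(".,;:!?").lower()
        last_ipa = ipa.split()[-1].strip(".,;:!?")
        if last_word.endswith("e") and not last_ipa.endswith("ə"):
            flagged.append(ln)
    return flagged


# Rhyme words discussed above: two GPT outputs and one expected ME form.
sample = [
    (27, "welle", "wɛl"),
    (28, "swelle", "swɛl"),
    (37, "draughte", "drɔxtə"),
]
print(missing_final_schwa(sample))
```

A check of this kind only locates candidate lines; the decision on whether a schwa belongs there remains editorial.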
To avoid some of these errors, the editor might specify additional requirements in the prompt. Given GPT’s lack of training on Middle English rhyming and metre conventions, these may need to be spelled out clearly in the input, pointing toward issues such as the pronunciation of final -e, the use of iambic pentameter (and the inconsistency of its use) and sound changes in the English language such as the Great Vowel Shift. However, editors themselves must decide whether the process of specifying detailed prompts for an LLM would be time-consuming and a hindrance, given there is a high likelihood that these models may still produce falsehoods.
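As a rough illustration of what such additional requirements might look like in practice, the sketch below assembles a prompt with explicit pronunciation conventions. It is not the prompt used in this chapter: the rule wording, the function name `build_prompt` and the `text` key are hypothetical, and only the `ln` key comes from the prompt quoted above.

```python
import json

# Hypothetical rule wording, paraphrasing the issues discussed above.
EXTRA_RULES = [
    "Pronounce line-final -e as schwa.",
    "Use pre-Great Vowel Shift values for long vowels.",
    "Transcribe <gh> as a fricative, not as silent.",
]


def build_prompt(lines, rules=EXTRA_RULES):
    """Build an IPA prompt that spells out Middle English conventions."""
    payload = json.dumps(lines, ensure_ascii=False)
    rule_text = "\n".join(f"- {rule}" for rule in rules)
    fence = "`" * 3  # triple-backtick delimiter around the JSON payload
    return (
        "Create a Middle English IPA pronunciation for each line.\n"
        "Follow these conventions:\n"
        f"{rule_text}\n"
        "Only process lines that have an 'ln' key.\n"
        f"{fence}{payload}{fence}"
    )


print(build_prompt([{"ln": 27, "text": "welle"}]))
```

Keeping the rules in a list makes it easier to test which individual instruction changes the model's output, which helps an editor judge whether the extra prompting effort is worthwhile.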
Instead, if the editor or researcher’s sole interest is to test the possibilities of machine-assisted approaches in producing Middle English for the use of an edition, we recommend the editor uses some sort of guide in the process of producing Middle English IPA. We used a recording of Chaucer’s PPT from The Chaucer Studio by Joseph Gallagher (Gallagher, Sales and Thomas 2003), to produce Middle English IPA and act as a guide for pronunciation of Chaucer’s PPT. While this process involved moving from speech-to-text, and subsequently text-to-speech using AWS’ Amazon Polly, it did allow us to test whether machines were capable of 1) producing Middle English IPA from a sound recording, and 2) producing sound from Middle English IPA. The following sections are therefore dedicated to describing this process and the benefits and challenges that come with it.
4. Chaucer Studio and Allosaurus as guides for text-to-speech2
We used Allosaurus, a program trained to recognise phones from over 2000 languages, to automatically transcribe the Chaucer Studio recording. The input is a sound recording, and the output is a list of timestamps with corresponding phonemes for each sound. We set the output to recognise pauses of 0.22 seconds or more in the recording, so that a whole string of sound could be parsed rather than individual phonemes, to speed up the editorial process. The output was then edited to include word boundaries, ready to be processed through Amazon Polly where it was hoped that the software would convert the IPA to speech in a ‘verse-like’ manner.
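The grouping of phones into pause-delimited strings can be sketched as follows. This is an assumption-laden illustration rather than the script we ran: it assumes each output row holds a start time, a duration and a phone separated by spaces, and it treats a pause as the gap between the end of one phone and the start of the next. The function name `group_by_pause` and the sample rows are hypothetical.

```python
def group_by_pause(timestamped, min_pause=0.22):
    """Join phones into strings, splitting wherever the silence between
    two consecutive phones is at least min_pause seconds."""
    groups, current, prev_end = [], [], None
    for row in timestamped.strip().splitlines():
        start, duration, phone = row.split()
        start, duration = float(start), float(duration)
        if current and start - prev_end >= min_pause:
            groups.append("".join(current))
            current = []
        current.append(phone)
        prev_end = start + duration
    if current:
        groups.append("".join(current))
    return groups


# Invented rows in the assumed "start duration phone" layout.
sample_rows = """
0.10 0.05 w
0.20 0.05 ɛ
0.30 0.05 l
0.80 0.05 s
0.90 0.05 w
"""
print(group_by_pause(sample_rows))
```

The resulting strings still need word boundaries and line breaks added by hand, as described above.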
Allosaurus recognised several sounds in the Chaucer Studio recording, and produced a fairly consistent output in terms of the vowels and consonants used. The narrowest (i.e. the most detailed) transcription we received as an output was the use of aspiration, and the output primarily consisted of phonemes. An example of part of the output is provided below in (2), along with the editorial changes made (see Appendix A for the entire output and all editorial changes).
(2) Example output from Allosaurus, along with editorial changes:
Overall, the total number of edits made to the Allosaurus output, relative to the number of phonemes in the tale, was quite substantial. A total of 4,825 edits were made to an output of 22,438 phonemes, so the number of edits amounted to 21.5% of the phonemes in the output.
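The edit rate follows directly from the two totals reported above, and can be checked with a short calculation (the variable names are ours):

```python
total_edits = 4825       # edits made to the Allosaurus output
total_phonemes = 22438   # phonemes in the Allosaurus output

edit_rate = 100 * total_edits / total_phonemes
print(f"{edit_rate:.1f}%")  # prints 21.5%
```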
4.1. Challenges of using Allosaurus for recreating Middle English phonemes
Word boundaries and lines were first formed from Allosaurus’ output strings. The IPA was then edited to match the pronunciation of the recording, and punctuation was added (using Benson’s 2008 Riverside Chaucer edition) so that Polly would apply pauses in the correct environments, to encourage the model to produce an output close to medieval verse. One of the issues with the compatibility of Allosaurus outputs of Chaucer’s English, and the requirements of Polly, was the difference between modern English and medieval English IPA. Polly’s modern ‘British English’ language does not include the full range of phonemes that existed in the inventory of Middle English. A large number of the Middle English sounds do exist in, for example, modern German or Dutch language inventories in Polly, but Polly does not allow for the tagging of other languages when the model is set to read a particular language, without extensive markup. Even then, the output comes with its challenges, in that it begins to pronounce Middle English words inaccurately (more on language tags in Section 5.2 below).
Despite the high number of editorial changes, the Allosaurus output resembled sounds closer in articulation to Middle English, compared to prompting via GPT-4. This is likely because of the human Chaucer Studio guide. For example, Allosaurus recognised the wide variation in vowel sounds which have changed in the present-day, such as monophthongal /i/, /e/ and /o/, rather than diphthongal /aɪ/, /eɪ/ and /əʊ/, prior to the Great Vowel Shift. The noun dronkenesse ‘drunkenness’ (line 224), was transcribed as /dɹoŋkənɛsə/, with the Middle English vowel /o/ rather than present-day /ʌ/. Consonants such as the velar fricative /x/ and the alveolar trill /r/ were also recognised, despite their non-occurrence in Modern English. These were evident in spelling combinations such as word-final <gh> in though /θɔx/ (line 229), and <r> in lecherous /lɛtʃʊrəs/ (line 224). Aspiration was, on the whole, used in appropriate places, after voiceless plosives /p/, /t/ and /k/, as in word-final <t> in yet /jɛtʰ/ (line 230), but this was not always applied consistently.
Rather than prescribing a British English phoneme inventory for the recording, Allosaurus recognised the sounds from the 2000 languages on which it was trained, to be able to produce a (fairly) faithful transcription of Middle English vowels and consonants that were present in the recording. However, there are still a number of edits to be made to each line (equating to around 7.5 edits per line). A typical line in PPT had 8-9 words (bearing in mind the number of sounds in each word). Yet, the use of auto-transcription prior to inputting sounds into Polly was more accurate than asking a machine – untrained on the conventions of Middle English or Chaucerian pronunciation – to transcribe Middle English verse to IPA.
The most common error made by Allosaurus was the inaccuracy of the place/manner of articulation of plosives and whether they were voiced. For instance, in line 227 of (2), the velar plosive /k/ was used instead of the dental fricative /θ/ in the word <breeth> ‘breath’, and in line 230, the voiced plosive /d/ in <god> and <drank> was transcribed as the voiceless plosive /t/. Sounds were also inserted where they were not required (e.g. line 225: the additional /t/ at the end of <wrecchednesse> ‘wretchedness’), or were not included when required (e.g. line 227: the lack of schwa /ə/ on <embrace>, to rhyme with the word at the end of the preceding line <face>). Some of these errors may have been a result of the tonal quality of the sound in isolation, or the accent of the speaker, especially if Allosaurus could not register the sound as part of the 2000+ languages in the training data. The quality of the sound was particularly evident when the speaker shifted to a new character with a different pitch or timbre.
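The three error types described here (substitution, insertion and omission) can be counted automatically by aligning a raw transcription with its corrected form. The sketch below uses Python's standard `difflib` module and is an illustration only: the function name `count_edits` is hypothetical, the example strings are simplified stand-ins rather than forms from Appendix A, and each Unicode character is treated as one symbol, so a diacritic such as aspiration counts separately.

```python
from collections import Counter
from difflib import SequenceMatcher


def count_edits(raw, corrected):
    """Count the symbols replaced, deleted or inserted when turning the
    raw transcription into the corrected one."""
    counts = Counter()
    matcher = SequenceMatcher(None, raw, corrected, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            counts[tag] += max(i2 - i1, j2 - j1)
    return counts


print(count_edits("ɡɔt", "ɡɔd"))      # devoiced plosive corrected
print(count_edits("nɛsət", "nɛsə"))   # superfluous final /t/ removed
print(count_edits("fas", "fasə"))     # missing final schwa added
```

Here "delete" corresponds to a sound Allosaurus inserted and "insert" to a sound it omitted, since the comparison runs from the raw output to the edited version.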
There is clearly an attempt made by Allosaurus to produce connected speech. For instance, in the section of line 224 which states wyn and dronkenesse ‘wine and drunkenness’, Allosaurus produces the velar nasal /ŋ/ at the end of <wine> and before <and>, in an attempt to reconstruct the connected speech produced by the recording (i.e. it was perhaps easier and more natural for the speaker, reproducing a Middle English/Chaucerian accent, to use a velar rather than alveolar sound following the high /i/ vowel in ‘wine’). While IPA including connected speech produces more natural-sounding outputs, Amazon Polly required word boundaries to be able to pronounce the word accurately. If the velar sound was retained, Polly would not have reproduced the sound subtly as in natural speech, and therefore the listener may not have been able to parse the word.
Editors should finally be aware that the recording they are using as a guide may be based on a different manuscript variant to the one they are working with. Further work is sometimes required to determine whether the words match up (often in relation to determiners, e.g. ‘the’ vs ‘a’) or whether entire lines are included. For instance, the line ‘Here ended the Pardoner’s Tale’ was not included in the recording, yet was required based on the Ellesmere manuscript we worked with. We encountered few issues with possible manuscript variants, as the Ellesmere Chaucer is commonly used amongst the 80 manuscript witnesses of Chaucer’s Canterbury Tales.
5. Amazon Polly
Once the IPA had been fully reproduced and edited, Amazon Polly (henceforth, Polly) was introduced for the text-to-speech process.3 Polly uses Speech Synthesis Markup Language (SSML) to “deploy high-quality, natural-sounding human voices in dozens of languages”, and is primarily used for ‘conversational user experiences’ and quick responses, particularly for business purposes or web design. On its homepage, Polly promises that users can undertake the following tasks:
- Customize and control speech output that supports lexicons and Speech Synthesis Markup Language (SSML) tags.
- Store and redistribute speech in standard formats like MP3 and OGG.
- Quickly deliver lifelike voices and conversational user experiences in consistently fast response times.
(Amazon Web Services, Inc. 2024)
We acknowledge that the primary task of Amazon’s Polly is not to reconstruct what an older language might have sounded like, nor is it trained to recognise medieval English sounds. However, given that there are no commonly known programmes trained to produce medieval speech (as far as we are aware), we were intrigued to test whether Polly could be successful in producing something medievalists and/or their students might recognise as Middle English, perhaps for a classroom activity, and determine whether it could feasibly be included in a digital teaching edition. As shown above, it is possible for Polly to create ‘lifelike’ and ‘conversational’ outputs, yet we also wanted to test whether Polly would recognise the use of punctuation to form verse (i.e. with commas, semi colons, and full stops at the ends of lines). We also included further SSML tags within sections of the input, to test whether phonemes, languages, rate/pitch/timbre, and emphasis, could be altered.
5.1. Polly’s production of phonemes and verse
In the following sections I compare some of the outputs produced by Polly to a human recording of the tale, to highlight the differences in pronunciation and prosody.
In (3a), I provide the first two lines of the text (including IPA), as edited from Allosaurus, and in (3b) I provide the relevant SSML which was inputted into Polly. (3c) is the output itself, and (3d) is a recording by Professor Jeremy Smith (Honorary Senior Research Fellow at the University of Glasgow), whom we thank for kindly agreeing to record the tale for us.
(3a) Example of an input for Polly, arising from Allosaurus (lines 3-4):
Line 3: “loɹdɪŋiːs,” kwɑd i:, “ɪn tʃɪɹtʃəs wɛn i pɹætʃə,
Lordynges quod He in chirches whan I preche
Line 4: i pejn mej tɔ hɑn ɑn hɔwtejn spætʃə,
I peyne me to han an hauteyn speche
(3b) SSML tags for lines 3-4:
<mark name="line_3" />
<prosody rate="85%" xml:space="preserve">
<phoneme alphabet="ipa" ph="“loɹdɪŋiːs"/>,
<phoneme alphabet="ipa" ph="”"/>
<phoneme alphabet="ipa" ph="kwɑd"/>
<phoneme alphabet="ipa" ph="i"/>:
<phoneme alphabet="ipa" ph=""/>,
<phoneme alphabet="ipa" ph=""/>
<phoneme alphabet="ipa" ph="“ɪn"/>
<phoneme alphabet="ipa" ph="tʃɪɹtʃəs"/>
<phoneme alphabet="ipa" ph="wɛn"/>
<phoneme alphabet="ipa" ph="i"/>
<phoneme alphabet="ipa" ph="pɹætʃə"/>,</prosody>
<mark name="line_4" />
<prosody rate="85%" xml:space="preserve">
<phoneme alphabet="ipa" ph="i"/>
<phoneme alphabet="ipa" ph="pejn"/>
<phoneme alphabet="ipa" ph="mej"/>
<phoneme alphabet="ipa" ph="tɔ"/>
<phoneme alphabet="ipa" ph="hɑn"/>
<phoneme alphabet="ipa" ph="ɑn"/>
<phoneme alphabet="ipa" ph="hɔwtejn"/>
<phoneme alphabet="ipa" ph="spætʃə"/>,</prosody>
(3b) is an example of how phonemes were incorporated into SSML, using the tag <phoneme alphabet="ipa" ph="word"/>. The value "ipa" was specified to indicate that the International Phonetic Alphabet should be used, and the attribute ph="word" was used to indicate the specific sounds for each word in the line. Punctuation (e.g. commas, semicolons, full stops, etc.) was included at the relevant points, usually between the phoneme tags, and was recognised by Polly.
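Markup of this shape is repetitive enough to generate programmatically. The following sketch shows one possible way of doing so, assuming each line has already been split into (IPA word, trailing punctuation) pairs; the function name `line_to_ssml` and that input format are our own illustrative choices, not a description of the tooling behind (3b).

```python
from xml.sax.saxutils import quoteattr


def line_to_ssml(line_no, tokens, rate="85%"):
    """Wrap one verse line in mark, prosody and phoneme tags.

    tokens is a list of (ipa_word, trailing_punctuation) pairs.
    """
    parts = [
        f'<mark name="line_{line_no}" />',
        f'<prosody rate="{rate}" xml:space="preserve">',
    ]
    for ipa, punct in tokens:
        # quoteattr escapes the IPA string and adds the surrounding quotes.
        parts.append(f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}/>{punct}')
    parts[-1] += "</prosody>"
    return "\n".join(parts)


line_4 = [("i", ""), ("pejn", ""), ("mej", ""), ("tɔ", ""), ("hɑn", ""),
          ("ɑn", ""), ("hɔwtejn", ""), ("spætʃə", ",")]
print(line_to_ssml(4, line_4))
```

Generating the tags from a word list keeps the IPA itself in a form that is easier to proofread than raw SSML.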
A challenge for including punctuation in the tale is that the pauses tend to be slightly longer when inputting them directly into Polly from an already existing edition. For example, a comma is upgraded to a ‘sentence-length pause’, and a full stop is upgraded to a ‘paragraph-length’ pause. If editors would like to alter the input, the <break> tag can be used based on the strength required (e.g. a ‘weak’ pause is similar to that of a comma, and a ‘strong’ pause has the same duration as a pause after a sentence). Lines were also labelled in order to isolate these in the edition, should users want to hear the output for one specific line. Prosody rate was also marked up, which I discuss in Section 5.2.
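One way of taking control of pause length is to swap punctuation marks for explicit <break> tags before the text reaches Polly. The sketch below is a minimal illustration under stated assumptions: the function name `punct_to_break` is hypothetical, and the default mapping covers only the two strengths described above (‘weak’ for a comma-length pause, ‘strong’ for a sentence-length pause), leaving all other marks unchanged.

```python
def punct_to_break(punct, strengths=None):
    """Return an SSML break tag for a punctuation mark, or the mark
    itself if no strength has been assigned to it."""
    strengths = strengths or {",": "weak", ".": "strong"}
    if punct in strengths:
        return f'<break strength="{strengths[punct]}"/>'
    return punct


print(punct_to_break(","))
print(punct_to_break("."))
print(punct_to_break(";"))
```

An editor could pass a different `strengths` dictionary to experiment with how semicolons and line-final stops should sound in verse.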
(3c) Polly output of lines 3-4:
(3d) Recording of lines 3-4, by Professor Jeremy Smith:
There are several differences between the samples in (3c) and (3d). These relate to: the rhythm and metre of verse and the ‘naturalness’ of the voice; the difference in pronunciation of vowels and consonants, which might differ depending on the preference of the speaker; and the use of emphasis and character voice to bring the tale to life. We put together an activity which encourages students to compare the outputs, either independently or in a classroom environment, and identify some of the challenges with AI in reproducing human speech and the medieval oral tradition. This type of critical assessment can be explored further within a digital scholarly edition which promotes further engagement with different issues in Chaucer and medieval studies.
5.2. British versus German voices and the <lang> tag
The next question relates to whether Polly could recognise and produce sounds not available under the ‘English (British) (en-GB)’ supported language. Appendix B lists the available vowels and consonants under the language. As shown, both monophthongal and diphthongal vowels are recognised by Polly (e.g. /ɔː/ and /ɔɪ/), although diphthongs were not common in Middle English given they were introduced after the Great Vowel Shift. The three main vowels not supported by this inventory, but that likely occurred in Middle English (and were produced by the Chaucer Studio recording), are /a/, /e/, and /o/. There are also two consonants, /x/ and /r/, which were not recognised by Polly’s British inventory. In (3a), the sounds /e/ and /o/ could therefore not be produced in the desired way. In the following section I discuss tests implemented to attempt to rectify the production of these sounds.
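Before sending IPA to Polly, it can be useful to flag words containing symbols that fall outside the chosen voice's inventory. The sketch below is illustrative only: the set contains just the five symbols named above as unsupported by the British inventory, not the full contents of Appendix B, and the function name `unsupported_symbols` is hypothetical.

```python
# The five symbols reported above as missing from the en-GB inventory.
UNSUPPORTED_EN_GB = {"a", "e", "o", "x", "r"}


def unsupported_symbols(ipa, unsupported=UNSUPPORTED_EN_GB):
    """Return the sorted set of symbols in an IPA string that the
    chosen inventory is assumed not to support."""
    return sorted({ch for ch in ipa if ch in unsupported})


for word in ["dɹoŋkənɛsə", "dɹɔxtə", "kɑrəksjun", "spætʃə"]:
    print(word, unsupported_symbols(word))
```

Because the comparison is character by character, visually similar symbols such as /ɑ/ and /a/ or /ɹ/ and /r/ are kept apart, but sequences that behave as a unit (such as a vowel followed by /j/) are not recognised as such.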
The German (de-DE) inventory of IPA phonemes appears to include a number of the required sounds, including /a/, /e/, /o/, and /x/. It does not, however, incorporate the sounds typical of a present-day English RP accent (which were also incorporated into the Chaucer Studio recording), in particular /æ/, /ʌ/, /ɹ/, /w/, /ʒ/. Nevertheless, the SSML was processed through a German voice to test whether the output was more similar to Middle English than the original British voice. German ‘Daniel’ was used, and even though some of the ME sounds could be produced, there appeared to be a German accent incorporated into the output, along with the omission of a number of the sounds, as listed above. An example of the German voice reading the tale is provided in (4a).
(4a) A sample of Polly’s German ‘Daniel’ voice:
We attempted a workaround to try to incorporate a <lang> tag into the SSML, to explore whether IPA from the German inventory could be superimposed onto the British voice. The test involved investigating whether the /a/ sound could be included in the British voice ‘Brian’, in the string <pardoners tale> /pɑɹdənærs tajlə/. (4b-c) highlight the different SSML tests with and without the tag <lang xml:lang="de-DE">.
(4b) SSML without the <lang> tag, using Allosaurus IPA:
<prosody rate="85%">English tale combined<break time="500ms"/>
<phoneme alphabet="ipa" ph="pɑɹdənærs" />
<phoneme alphabet="ipa" ph="tajlə" /></prosody><break time="500ms"/>
(4c) SSML with the German “de-DE” <lang> tag:
<prosody rate="85%">English tale split with german a<break time="500ms"/>
<phoneme alphabet="ipa" ph="pɑɹdənærs" />
<phoneme alphabet="ipa" ph="t" /><lang xml:lang="de-DE">
<phoneme alphabet="ipa" ph="a" /></lang><phoneme alphabet="ipa" ph="jlə" /></prosody><break time="500ms"/>
(4b) is the SSML for the standard English output of the word ‘tale’, without the German <lang> tag. Generally, the presence of the sound /j/ following /a/ meant that the pronunciation was close to the required output, as Polly treated it as a diphthong (equivalent to /aɪ/ in the word ‘price’).
The German tag was used in (4c) to determine whether the quality of the Middle English sound would come through further, with the vowel /a/ isolated from the surrounding phonemes (<phoneme alphabet="ipa" ph="t" /><lang xml:lang="de-DE"><phoneme alphabet="ipa" ph="a" /></lang><phoneme alphabet="ipa" ph="jlə" />). In this output, the tagged German /a/ vowel, sandwiched between the default ‘English’ phonemes, was ignored by Polly, which pronounced <tale> as /tjlə/, as can be heard in (4d).
(4d) Example of the English ‘Brian’ voice with German <lang> tag:
Instead, we maintained the original Middle English phonemes, finding that the vowels /a/, /e/ and /o/ had a close pronunciation to ME. Polly’s knowledge of the orthographic <a>, <e> and <o> versions of the phonemes and how they should be pronounced in various contexts was sufficient, without needing to use a different accent. For instance, the vowel /a/ in <tale> above occurred alongside the consonant /j/, which Polly recognises as the diphthong /aɪ/. The same could be said for the vowel /e/ when it occurred alongside /j/ (e.g. in ‘me’, pronounced /mej/), in that it is similar to the modern English diphthong /eɪ/. However, words such as ‘be’ would be pronounced like present-day ‘be’ (/biː/), as opposed to Middle English /beː/. In addition, the use of /o/ (e.g. in the words <oh> /ow/ and <dronkenesse> /dɹoŋkənɛsə/) would be recognised as orthographic <o> in modern English, and pronounced as the diphthong /əʊ/ or monophthong /ɒ/ rather than Middle English /oː/. Lastly, the velar fricative /x/ (e.g. in <draughte> /dɹɔxtə/) was not pronounced at all, and the alveolar trill /r/ (e.g. in <correcioun> /kɑrəksjun/) was pronounced as the approximant /ɹ/ instead. Even though there were some substitutions made to the Middle English sounds, Polly appeared to pronounce each individual phoneme, on the whole, with accuracy.4
5.3. Emphasis, prosodic rate, pitch and timbre
The voice selected for the overall PPT output was British ‘Brian’, in neural format. The neural voices (Amazon Polly’s Neural TTS (NTTS) system) are said to be higher quality than the standard voices, the latter of which uses ‘concatenative synthesis’—a method of stringing together phonemes and segment waveforms. On the other hand, the neural system “converts a sequence of phonemes […] into a sequence of spectrograms, which are snapshots of the energy levels in different frequency bands”, that are then converted into an audio stream using a ‘vocoder’ (see the Neural Developer guide by AWS). The string of phonemes is more seamless in a neural voice, and is designed to sound more like natural speech.
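For readers who wish to reproduce the comparison between engines, the sketch below shows how a request to Polly might be assembled with the AWS SDK for Python (boto3). It is a hedged illustration rather than our production script: the helper names `polly_request` and `synthesize` are hypothetical, and running `synthesize` assumes that boto3 is installed and that AWS credentials are configured. The SSML body is wrapped in the <speak> root element that SSML input requires.

```python
def polly_request(ssml_body, voice="Brian", engine="neural"):
    """Build the keyword arguments for a Polly synthesize_speech call."""
    return {
        "Text": f"<speak>{ssml_body}</speak>",
        "TextType": "ssml",
        "VoiceId": voice,
        "Engine": engine,          # "neural" or "standard"
        "OutputFormat": "mp3",
    }


def synthesize(ssml_body, out_path, **kwargs):
    """Send the request to Polly and save the audio (needs AWS access)."""
    import boto3  # imported lazily so the helper above works offline

    client = boto3.client("polly")
    response = client.synthesize_speech(**polly_request(ssml_body, **kwargs))
    with open(out_path, "wb") as fh:
        fh.write(response["AudioStream"].read())


request = polly_request('<phoneme alphabet="ipa" ph="tajlə"/>')
print(request["Engine"], request["VoiceId"])
```

Switching `engine` to "standard" in the same call is enough to produce a matching sample for side-by-side listening.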
However, the neural voices do not allow for markup of emphasis, prosodic rate, pitch and timbre—tags which could be used to mimic natural variations in speech (e.g. pitch and character/voice changes), or the reading aloud of verse, including emphasis on certain syllables. Thus, we added further markup to the standard voices to determine whether the Chaucerian voice would sound more natural, and mimic the ebbs and flows of speech. It would mean, however, the neural method of stringing together phonemes via spectrograms could not be used in conjunction with emphasis and prosody markup. Below is an example of a test for emphasis, as well as pitch and timbre (5a-d).
(5a) Example of IPA from PPT (line 14):
14: mej tow dɪstɔɹb ʌv kɹistəs ɑli wɛɹk.
Me to destourbe of Cristes hooly werk
(5b) SSML of line 14, with emphasis tag:
<mark name="line_14"/>
<prosody rate="85%">
<phoneme alphabet="ipa" ph="mej"/>
<phoneme alphabet="ipa" ph="tow"/>
<phoneme alphabet="ipa" ph="dɪstɔɹbə"/>
<phoneme alphabet="ipa" ph="ʌv"/>
<emphasis level="moderate"><phoneme alphabet="ipa" ph="kɹistəs"/></emphasis>
<phoneme alphabet="ipa" ph="ɑli"/>
<phoneme alphabet="ipa" ph="wɛɹk"/>.</prosody>
Unfortunately, the use of emphasis and pitch/timbre markup did not provide a natural-sounding output, especially alongside the use of Middle English phonemes. (5b) shows how the word Cristes ‘Christ’s’ on line 14 of PPT was emphasised within the SSML (based on emphasis found in the Chaucer Studio recording). The emphasis was set to ‘moderate’, which involved an increase in the volume and a slowing down of the speaking rate. Alongside the use of a standard voice rather than neural, the markup did not mimic the subtle emphasis used when speaking verse aloud. Another possible level was ‘reduced’, yet this decreases the volume and speeds up the speech, which was not suitable for our purposes. The audio file is provided in (5c).
(5c) Line 14 audio file, with emphasis tag:
(5d) Example of IPA from PPT (lines 358-359):
358: bɛð ɹɛdi fɔɹ tʰo mejt hɪm ɛvrəmar;
Beth redy for to meete hym eueremoore
359: ðows, tawxtə mej mi damə; i sej namar.
Thus taughte me my dame I sey namoore
(5e) SSML of lines 358-359, with timbre (vocal tract length) and prosodic pitch tags:
<mark name="line_358" />
<prosody rate="85%" xml:space="preserve">
<amazon:effect vocal-tract-length="-15%"><prosody pitch="+20%">
<phoneme alphabet="ipa" ph="bɛð"/>
<phoneme alphabet="ipa" ph="ɹɛdi"/>
<phoneme alphabet="ipa" ph="fɔɹ"/>
<phoneme alphabet="ipa" ph="tʰo"/>
<phoneme alphabet="ipa" ph="mejt"/>
<phoneme alphabet="ipa" ph="hɪm"/>
<phoneme alphabet="ipa" ph="ɛvrəmar"/>;</prosody></amazon:effect></prosody>
<mark name="line_359" />
<prosody rate="85%" xml:space="preserve">
<amazon:effect vocal-tract-length="-15%"><prosody pitch="+20%">
<phoneme alphabet="ipa" ph="ðows"/>,
<phoneme alphabet="ipa" ph=""/>
<phoneme alphabet="ipa" ph="tawxtə"/>
<phoneme alphabet="ipa" ph="mej"/>
<phoneme alphabet="ipa" ph="mi"/>
<phoneme alphabet="ipa" ph="damə"/>;
<phoneme alphabet="ipa" ph=""/>
<phoneme alphabet="ipa" ph="i"/>
<phoneme alphabet="ipa" ph="sej"/>
<phoneme alphabet="ipa" ph="namar"/>.
<phoneme alphabet="ipa" ph="”"/></prosody></amazon:effect></prosody>
(5e) highlights how the pitch and timbre tags were used to mark a shift in the character speaking. In this extract, the young boy who speaks to the taverners is introduced. We aimed to shorten the vocal tract length (timbre) and increase the pitch, by 15 and 20% respectively, to mimic a young boy’s voice. While these could be altered more subtly, due to the ability to set the value of the pitch and timbre of the voice, the use of standard voice did reduce the ability to understand Middle English, especially when altering the pitch and vocal tract length. An example is provided in (7).
(7) Example of pitch and timbre change, from the narrator to the young boy:
The issues with the standard voices, in that they do not produce the natural flow of speech, are evident from the audio clips here. The pronunciation of specific phonemes is also not as accurate as in the neural voice, and individual words are sometimes not identifiable.
Lastly, for the prosodic rate of the tale – the speed at which the tale was spoken – we settled on 85% (marked up by <prosody rate="85%"> for each line). This was a speed which was slow enough to understand what was being said, given modern audiences may not be familiar with Middle English.
We therefore chose to maintain the neural voice, and avoided altering the SSML input drastically based on individual characters and the amount of emphasis required on specific words. This also reduces the workload of the editor in producing a Middle English output via text-to-speech.
6. Conclusions on the usefulness of Amazon Polly for Chaucerian verse
The process of using Allosaurus and Polly to produce and process IPA was enlightening, and a number of sounds mirrored what might be expected from a Middle English accent. However, there are some shortfalls to using the output in a teaching edition of Chaucer’s Pardoner’s Prologue and Tale (PPT).
First, it is not a sufficient recording for students reading the tale for the first time. After conducting focus groups to determine what students would like to see from a teaching edition of PPT, most students – from the UK and the US – stated that they welcomed opportunities to read aloud or listen to recordings of the tale, either on their own or as part of a classroom activity. For example, a student from the UK believed that the meaning of the text can be lost when reading independently:
“Once it’s gone around the whole class, and everyone has done it, I feel like you get that sense that you’re a lot more involved. […] The whole mouth sounds of it is an aspect of the text that is often lost in it being read either through a contemporary version or just through a screen or through paper. It’s a whole dimension of it that disappears”.
ID: Student 3; Fourth year, undergraduate, English Literature, UK
He also felt that, until he read the tale aloud, he did not understand most of the double entendres and puns in Chaucer’s Squire’s Tale, or when each of the characters begins speaking:
“Until I actually had to hear the different deliveries that the characters were doing themselves, you get that someone’s being cut off. […] All of a sudden you get the sense that it genuinely is a polyphony—there are all these voices across the text constantly talking over each other.”
ID: Student 3; Fourth year, undergraduate, English Literature, UK
Furthermore, one student mentioned that listening to the tale being spoken aloud allows the tale to come alive, an important aspect of understanding the uncanniness of the medieval period and human behaviour:
“If something is read out loud, […] it comes to life in a way. […] If you hear somebody that has a really good voice tone or contributes to making the text alive, I think it even helps to both catch your attention and understand the text in greater detail.”
ID: Student 4; Third year, undergraduate, English Language and Linguistics, UK
Student 9 also found that the tale was easier to understand when it was read aloud, and she often lost the ‘flow’ of reading when reading alone:
“I get so caught up in figuring out what the word means, that I lose the semantics, I lose the way it’s meant to be read, I lose the flow. So I find that having someone read it to me, and understanding it as I go along, incredibly helpful in that regard.”
ID: Student 9; First year, undergraduate, English Language and Linguistics, UK
Like the underlying messages of The Squire’s Tale, PPT is built primarily on satire. The way in which the Pardoner preaches his prologue and narrates his tale – where he outlines how medieval people sin, references legendary figures and their own sin, and details the sin committed by each of the rioters – is embedded in irony as he transgresses against the church. The Pardoner commits nearly all of the seven deadly sins himself despite his apparent allegiances to the church. As Student 3 mentions, some of the irony and satire underpinning the tale might be lost if one reads the tale for the first time in one’s head. In addition, Student 4 implied that the tale should be read by someone knowledgeable in the pronunciation of Middle English. The person narrating the tale is therefore of utmost importance, and it is not certain whether someone would be able to acquire the same understanding of the tale if listening to an AI’s recreation of Middle English. For Student 9, the flow is disrupted when a reader gets caught up in working out the meaning of individual words; neither the flow nor the meaning may be grasped when listening to AI.
Given the requirements of students, who specify that the reader of the tale should speak clearly, with an engaging voice and an ability to ‘bring the tale to life’, Amazon Polly may not be able to provide a suitable spoken output for its users. The neural voice was built to produce a natural output, mimicking human-like speech by producing natural frequency levels via spectrograms. Polly also recognises punctuation, allowing for the output to be understood as verse. However, without any appropriate intervention by the editor in terms of the prosodic markup, the output remains monotonous and would not instil in students enthusiasm for, or confidence in, Middle English. Additionally, when an editor attempts to alter the output via prosodic markup (e.g. altering the pitch, timbre and emphasis throughout the tale), it results in the opposite of the desired effect: a disjointed production of speech which shifts drastically from one pitch or volume to another. The production of specific sounds from late medieval English is not supported by the inventories created by Amazon, and there is no easy way to draw on sounds that occur in other languages.
There are clear and obvious benefits to using Polly for the purposes of illustrating a product, or for general conversational AI. Yet it is suitable neither for retelling a Chaucerian tale nor for speeding up the editor’s process of recording the tale. Allosaurus and Amazon Polly combine to take over parts of the role of the editor in transcribing Middle English sounds, yet more in-depth editorial work is required to ensure the output is as accurate and engaging as possible.
7. References
- Amazon Web Services, Inc. 2024. “Amazon Polly.” https://aws.amazon.com/polly/ [accessed 27 June 2024].
- Gallagher, Joseph, Troy Sales & Paul Thomas. 2003. “The Pardoner’s Tale.” Chaucer Studio CD-ROM, BYU Creative Works. https://creativeworks.byu.edu/CreativeworksStore/ProductViewDetail?ProductId=21&SiteID=20.
- Kökeritz, Helge. 1978. A Guide to Chaucer’s Pronunciation. Toronto: University of Toronto Press.
- Solopova, Elizabeth. 1997. “Chaucer’s Metre and Scribal Editing in the Early Manuscripts of The Canterbury Tales.” The Canterbury Tales Project: Occasional Papers 2: 143-164.
- van Gelderen, Elly. 2014. A History of the English Language, 2nd ed. Amsterdam: John Benjamins.
8. Data
8.1. Data: Allosaurus IPA output with editorial changes
8.2. Data: Polly’s English (British) (en-GB) inventory
A table showing Polly’s English (British) inventory, which was drawn upon for the IPA in the SSML.
8.3. Data: A guide to Chaucerian pronunciation
A guide to Chaucerian pronunciation by Kökeritz (1978).
- The IPA used for the outputs was produced using the Chaucer Studio recording of The Pardoner’s Prologue and Tale by Joseph Gallagher as a guide (see Section 4), as well as Kökeritz’s (1978) guide to Chaucer’s pronunciation.
- See Appendix C for a guide to Chaucerian pronunciation (adapted from Kökeritz 1978), which is relevant for this section.
- Where there is discussion of the SSML tags incorporated into the input for Polly, I generally refer to the documentation provided by AWS, available at: https://docs.aws.amazon.com/polly/latest/dg/supportedtags.html.
- While Polly does outline that accented bilingual voices can be created, for example, to code-switch between English and French in one sentence, the phonemes are based on the native voice of the selected language. The documentation at https://docs.aws.amazon.com/polly/latest/dg/bilingual-voices.html provides the SSML example ‘<speak> Why didn't she just say, <lang xml:lang="fr-FR">'Je ne parle pas français?'</lang>. </speak>’ and states that “because Joanna is not a native French voice, pronunciation is based on her native language, US English. For instance, although perfect French pronunciation features an uvular trill /R/ in the word français, Joanna’s US English voice pronounces this phoneme as the corresponding sound /r/”. It is therefore understandable that the German vowel /a/ was omitted from the British voice speaking with Middle English phonemes. Polly does appear to have some bilingual voices, for example, ‘Indian English’ and ‘Hindi’. There may be more bilingual voices to trial in the future, but for now, editors can only use the phoneme inventory of the selected voice/language.