The Sheffield Corpus of Chinese (SCC) grew out of a pilot project whose long-term aim was to provide an extensive digital resource of marked-up historical Chinese texts, covering different text types and genres and arranged by time period, to facilitate study of the development and varieties of the language.
The pilot project was essentially a feasibility study based on three Chinese texts from the Song (960-1279), Ming (1368-1644) and Qing (1644-1911) dynasties. The texts, amounting to about 18,000 words, are word-segmented and part-of-speech tagged using a mark-up scheme developed in XML (eXtensible Markup Language). In its initial form, at the completion of the pilot project, the SCC had a tag set of 21 word classes with 49 categories, together with a full-text retrieval and search system that can locate user-specified words and produce frequency tables for them, both character by character and by word category. Parallel English translations have been added where practicable to broaden the accessibility of the corpus and to support contrastive study of English and Chinese in translation research. The application of XML to Chinese was still at an early stage, so the establishment of the SCC made a significant contribution to applying this technology to the language. As the SCC developed and expanded, it addressed the lack of diachronic corpora of fully marked-up Chinese texts in this field, and it has both promoted and facilitated a wide range of diachronic linguistic and other studies.
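The kind of frequency table described above can be illustrated with a minimal Python sketch. It is not the SCC's own software: the element name `w`, the attribute `pos`, the tag values and the sample sentence are all assumptions made for illustration, and the real SCC mark-up scheme and tag set may differ.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical mark-up: <w> elements with a "pos" attribute are an
# illustrative assumption, not the actual SCC scheme or tag set.
SAMPLE = """<text>
  <s><w pos="v">學</w><w pos="conj">而</w><w pos="adv">時</w><w pos="v">習</w><w pos="pron">之</w></s>
  <s><w pos="v">習</w><w pos="pron">之</w></s>
</text>"""


def frequency_tables(xml_string):
    """Count tagged words, word categories and single characters."""
    root = ET.fromstring(xml_string)
    words, categories, characters = Counter(), Counter(), Counter()
    for w in root.iter("w"):
        token = (w.text or "").strip()
        if not token:
            continue  # skip empty word elements
        words[token] += 1
        categories[w.get("pos", "unknown")] += 1
        characters.update(token)  # one count per character in the word
    return words, categories, characters


def find_by_category(xml_string, pos):
    """Return every word whose hypothetical "pos" attribute equals pos."""
    root = ET.fromstring(xml_string)
    return [w.text for w in root.iter("w") if w.get("pos") == pos]


words, categories, characters = frequency_tables(SAMPLE)
print(words.most_common())
print(categories.most_common())
```

Because each word carries its category as an attribute, the same pass over the text yields counts by word, by category and by character, which is what makes searching on either basis straightforward.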
- Dr Xiaoling Hu (School of East Asian Studies, University of Sheffield)
- Jamie McLaughlin (Developer – The Digital Humanities Institute)