Ghent Database of Medieval Chinese


Initiated in 2014, the Ghent Database of Medieval Chinese (GDMC) is a collaborative project with the Dharma Drum Institute of Liberal Arts (Taiwan). The project has two main goals:

  • To produce high-quality marked-up digital editions of important medieval Chinese Dūnhuáng texts;
  • To build a database for analyzing the language of Late Medieval Chinese, a period crucial for Chinese historical linguistics and for our understanding of the early stages of development of Modern Mandarin and the Chinese dialects.

During the initial phases, which were financed by the Research Foundation Flanders (FWO) and Ghent University (BOF), the necessary know-how and the database infrastructure were created. At the beginning of 2017, the entire technical infrastructure of the DB was rebuilt in order to enhance its functionality, and funding from the Tianzhu Foundation supported programming work on the database.

The texts

The texts to be digitized and marked up are carefully selected from several genres, all of them characterized by varying degrees of vernacular intrusion and thus reflecting features of the colloquial language of Late Medieval Chinese (ca. 700-1000 CE). The texts, selected from ca. 50 manuscripts, include the following genres:

  • So-called “Transformation Texts” (變文)
  • Sūtra Lecture texts (講經文)
  • Dūnhuáng Avadāna (因緣)
  • Early Chán texts among the Dūnhuáng corpus
  • (Chán) Buddhist appraisals
  • Other vernacular texts, most importantly the Zǔtáng jí 祖堂集

The texts are digitized and carefully marked up in collaboration with researchers at the Dharma Drum Institute of Liberal Arts. This is highly specialized and difficult work, which demands great familiarity with traditional Chinese literature, a deep understanding of the paleographic features of Dūnhuáng manuscripts, and a thorough knowledge of TEI manuscript mark-up conventions. For every manuscript, a digitized and marked-up XML version is produced, which is rendered in two HTML views:

  • A “diplomatic” transcription (directly reflecting manuscript features such as variant characters, many of them not yet registered anywhere else, and indicating damaged, blurred, deleted, or substituted characters)
  • A “regularized” transcription (this is a “clean” version of the text, with characters regularized, abbreviations resolved, erroneous characters corrected, etc.; this version will serve as “raw material” for further – e.g. linguistic – mark-up)
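
The two views can be derived from a single marked-up source. The following is a minimal sketch in Python, assuming the XML uses standard TEI elements such as `<choice>`, `<orig>`, `<reg>`, `<del>`, and `<unclear>`; the project's actual schema and rendering pipeline are not described here, and the sample line, the `render` function, and the bracket and question-mark conventions are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-style fragment (not taken from any manuscript):
# a variant character with its regularized form, a deleted character,
# and a character that is hard to read.
SAMPLE = (
    "<ab>佛"
    "<choice><orig>无</orig><reg>無</reg></choice>"
    "<del>有</del>"
    "<unclear>言</unclear>"
    "說</ab>"
)

# Elements hidden in each view (assumed convention for this sketch).
HIDDEN = {
    "diplomatic": {"reg", "corr", "expan"},
    "regularized": {"orig", "sic", "abbr", "del"},
}


def render(elem, mode):
    """Return the text of `elem` for mode 'diplomatic' or 'regularized'."""
    if elem.tag in HIDDEN[mode]:
        return ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(render(child, mode))
        # Text following a child belongs to the parent, so it is always kept.
        parts.append(child.tail or "")
    body = "".join(parts)
    if mode == "diplomatic":
        if elem.tag == "del":
            body = "[-" + body + "-]"   # show deletions explicitly
        elif elem.tag == "unclear":
            body = body + "?"           # flag uncertain readings
    return body


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(render(root, "diplomatic"))
    print(render(root, "regularized"))
```

Keeping both readings inside one `<choice>` element means the diplomatic and regularized views cannot drift apart, since each is just a different filter over the same file.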

The growing number of marked-up texts will be integrated in the DB and become the basis for further study and analysis. In addition to supplying the research community with high-quality digital editions, the project focuses on the syntax of Medieval Chinese and on features of medieval manuscripts that have so far not received sufficient scholarly attention, in particular the systematic study of Dūnhuáng variant characters and of phonetic loan characters (i.e., research on the degree of phonetization of colloquial Medieval Chinese, where characters represent sound rather than meaning).

The Database structure

The structure of the DB draws on current technology in Digital Humanities. The initial architecture was based on eXist, whereas the DB has since been rebuilt in a MySQL environment, using technologies such as the InnoDB and MyISAM storage engines, object-oriented PHP with PDO, HTML5, and JavaScript, in order to ensure a more stable and flexible environment. More generally, the DB is built to accommodate and adapt to a variety of research needs, including modules for data storage (e.g., the dictionary module of function words and the module for registering character variants) and data analysis (e.g., Syntax Tree generators and a tool developed for sentence parsing), in addition to integrating the marked-up digital texts. A variety of online input masks have been developed in order to register data.
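
To illustrate what a module for registering character variants might store, here is a minimal sketch. It is not the project's schema: the DB itself runs on MySQL with PHP, whereas this sketch uses Python's built-in `sqlite3` with an in-memory database as a stand-in, and the table name, column names, helper functions, and manuscript labels (`MS-A`, `MS-B`) are all hypothetical.

```python
import sqlite3


def open_db():
    """Create an in-memory database with a hypothetical variant table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE variant (
            id            INTEGER PRIMARY KEY,
            variant_form  TEXT NOT NULL,   -- character as written in the manuscript
            standard_form TEXT NOT NULL,   -- regularized character
            manuscript    TEXT NOT NULL,   -- label of the manuscript witness
            UNIQUE (variant_form, standard_form, manuscript)
        )
        """
    )
    return conn


def register_variant(conn, variant_form, standard_form, manuscript):
    """Record one attestation; repeated registrations are ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO variant (variant_form, standard_form, manuscript) "
        "VALUES (?, ?, ?)",
        (variant_form, standard_form, manuscript),
    )
    conn.commit()


def variants_of(conn, standard_form):
    """List (variant_form, manuscript) pairs attested for a standard character."""
    rows = conn.execute(
        "SELECT variant_form, manuscript FROM variant "
        "WHERE standard_form = ? ORDER BY manuscript, variant_form",
        (standard_form,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = open_db()
    register_variant(conn, "无", "無", "MS-A")
    register_variant(conn, "无", "無", "MS-B")
    print(variants_of(conn, "無"))
```

The parameterized queries play the same role that prepared statements do in PDO: values entered through an input mask are passed separately from the SQL text.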

Team and international collaboration

In order to ensure the high quality of the material produced, specialists from a variety of fields are participating in the project, including specialists in digitization and TEI, linguists, and Medieval Chinese text experts. International collaboration is therefore one of the main concerns of this project. Concretely, the texts are first digitized and meticulously marked up in one of the leading East Asian centers of Digital Humanities, the Library and Information Center of the Dharma Drum Institute of Liberal Arts (DILA). As a second step, the texts are integrated in the Ghent Database of Medieval Chinese, where information on grammatical markers, paleography, and phonetic loan characters is harvested and stored. As a third step, the material is further analyzed (e.g., the syntactic markers are described in context, and example sentences are analyzed in the Syntax Tree generator and Sentence Parser). Christoph Anderl is the director of the project; Prof. Hung Jen-Jou (DILA) and Prof. Marcus Bingenheimer (Temple University) are co-directors. The main encoders and researchers on the digitization and markup are Lin Ching-hui 林靜慧 (GDMC texts) and Zhang Bo-yong 張伯雍 (Early Chán texts, contributed by Bingenheimer’s project).


The GDMC aims to develop into a flexible tool that assists researchers and students in their study of medieval Chinese texts. It provides high-quality editions of a selection of key Late Medieval texts, and will in the future include several “dictionary modules” and analytical tools. The texts produced will serve as the basis of new editions and annotated translations, and as an important reference tool for research on medieval Chinese culture (based on the work on the digitization and markup, a book with a new edition of the Pomobian 破魔變 with annotated translations into Modern Chinese and English is currently in print). As added value, the DB will also serve as a didactic tool in the teaching of master's and PhD students and in internships. Conversely, the flexible nature of the DB allows the integration of research material produced in the context of current PhD projects (e.g., on phonetic loan characters and on the rhetorical structure of the Zǔtáng jí).