Technological and scientific basis for the linguistic annotation of Old Lithuanian Corpus (SLIEKKAS)

The Old Lithuanian Corpus (Lith. Senosios lietuvių kalbos tekstynas; acronym SLIEKKAS, cf. Lith. sliekas “earthworm”) is intended to be a comprehensive, deeply annotated diachronic reference corpus of Old Lithuanian (1500–1800, ca. 10 million text words). It is being developed in cooperation between the Goethe University of Frankfurt am Main (Germany), the Institute of Lithuanian Language (Vilnius, Lithuania), the University of Vilnius (Lithuania), and the University of Pisa (Italy). The aim is to create a multimodal (facsimile with annotated text), annotated (header information; hierarchical structural, palaeographic, textological, and grammatical annotations) reference corpus. The ultimate goal is to develop a high-quality multilevel electronic retrieval engine for many-sided linguistic research on Old Lithuanian, which will lead to reliable results for diachronic Lithuanian language studies. It should ultimately make it possible to realise the two biggest desiderata of Baltic linguistics: a grammar of Old Lithuanian and a historical dictionary of Lithuanian.

The aim of the project Technological and scientific basis for the linguistic annotation of Old Lithuanian Corpus was to develop the linguistic and text-technological basis for the creation of a comprehensive, deeply annotated reference corpus of Old Lithuanian and to test it on an exemplary corpus comprising ca. 350 000 Old Lithuanian words. The decision to start with a test corpus in a pilot project was driven by the need to establish the complex multilayered structures that a diachronic corpus requires, and to apply them gradually.

A basic XML structure was established as the foundation for further annotation in the Toolbox program (SIL) and in the annotation software ELAN (Max Planck Institute for Psycholinguistics, Nijmegen). A detailed lexical and grammatical (morphosyntactic) annotation was begun in Toolbox. During the annotation process, all the semi-automatically prepared data (lists of word forms; dictionaries of word forms, lemmata, and glosses) were corrected and supplemented manually. The standards of the Old Lithuanian Corpus were coordinated with those of the Old German Corpus, which also comprises the annotation of Latin texts. ImAnTo, a program for aligning annotated texts with facsimile reproductions of the originals, created at the University of Frankfurt/Main, was successfully applied to the Old Lithuanian texts.
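
The semi-automatic preparation of word-form lists mentioned above can be illustrated with a minimal Python sketch. It is not the project's actual tooling: the function name word_form_list, the tokenisation rule, and the sample string (a few modern Lithuanian words used as a placeholder, not corpus data) are all assumptions made for illustration only.

```python
import re
import unicodedata
from collections import Counter


def word_form_list(text):
    """Return a frequency list of lower-cased word forms (hypothetical helper).

    Assumption: tokens are maximal runs of letters/digits. Real Old Lithuanian
    texts would need project-specific rules for abbreviations, combining
    diacritics that have no precomposed form, and historical characters.
    """
    # Compose base letters and diacritics where a precomposed form exists,
    # so that the same word form is not counted under two encodings.
    normalized = unicodedata.normalize("NFC", text)
    tokens = re.findall(r"\w+", normalized.lower())
    return Counter(tokens)


# Placeholder input, not taken from the corpus.
sample = "Senosios kalbos tekstynas. Kalbos tekstynas."
forms = word_form_list(sample)

# Print the list alphabetically, ready for manual correction.
for form, count in sorted(forms.items()):
    print(form, count)
```

A list produced this way is only a starting point; as described above, such automatically generated data still has to be corrected and supplemented by hand.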

