Primary Data Encoding of a Bilingual Corpus

Johann Gamper and Paolo Dongilli
European Academy Bolzano/Bozen
Weggensteinstr. 12/A, 39100 Bolzano/Bozen, Italy

This paper discusses the building of a bilingual corpus of legal and administrative texts, focusing on the encoding of documentation and structural information according to the Corpus Encoding Standard. The corpus is one module in an ongoing research project about (semi-)automatic terminology acquisition at the European Academy Bolzano and will serve as a basis for applying term extraction programs. We will discuss the pieces of information to be annotated as well as lessons learned in this process.

Due to the equal status of the Italian and the German language in South Tyrol, legal and administrative documents have to be written in both languages. A prerequisite for high-quality translations is a consistent and comprehensive bilingual terminology, which also forms the basis for an independent German legal language reflecting the Italian legislation. The first systematic effort in this direction was initiated a few years ago at the European Academy Bolzano/Bozen with the goal of compiling an Italian/German legal and administrative terminology for South Tyrol. A few years of experience have shown that the manual acquisition of terminological data from texts is a very work-intensive and error-prone task. Recent advances in automatic corpus analysis have favored a modern form of terminology acquisition, in which a corpus is a collection of language material in machine-readable form and computer programs help scan the corpus for terminologically relevant information, generating lists of term candidates which are then post-edited by humans. This new form of terminology acquisition will be applied in the CATEx (Computer Assisted Terminology Extraction) project, which emerged from the need to support and improve, both qualitatively and quantitatively, the manual acquisition of terminological data at the European Academy Bolzano/Bozen. The main objective of CATEx is thus the development of a computational framework for (semi-)automatic terminology acquisition, which consists of four modules:

  1. a parallel text corpus;
  2. term-extraction programs;
  3. a term bank linked to the text corpus;
  4. a user-interface for browsing the corpus and the term bank.

Currently, we are building the parallel text corpus. This comprises the following tasks: corpus design, preprocessing, encoding primary data, and encoding linguistic information.

Corpus design selects the collection of texts to be included in the corpus. In its current form, our corpus contains only one type of text, namely the bilingual version of the most important Italian law codes. A particular feature of our corpus, which contains both German and Italian translations, is the structural equivalence of the original texts and their translations down to the sentence level. To our knowledge, this corpus is one of the largest special-language corpora: it contains around 5 million words and 35,898 (66,934) different Italian (German) word forms.

In the preprocessing phase we correct errors (mainly OCR errors) in the raw text material and produce a unified electronic version in order to simplify the creation of programs for the subsequent annotation steps.

Corpus encoding enriches the raw text material with explicitly encoded information. We apply the Corpus Encoding Standard (CES) [4], which is an application of SGML [3] and defines a set of guidelines for corpus annotation especially tailored for language engineering. So-called document type definitions are provided, which specify pieces of information that should be encoded in the corpus. CES distinguishes primary data (raw text material in machine-readable form) and linguistic annotation (information resulting from linguistic analyses of the raw texts).

Primary data encoding comprises the mark-up of documentation and structural information. Documentation information includes global information about the text, e.g. bibliographic information (author, publisher, edition, etc.) and information concerning the distribution of the electronic corpus (institution, address, etc.). Structural annotation covers the mark-up of relevant structural elements in the raw text material. Gross structural mark-up and sub-paragraph mark-up are distinguished. The gross structure of a text consists of elements such as large divisions (chapters, sections, etc.) down to the paragraph level, titles, lists, tables, etc. Sub-paragraph structures include elements like sentences, abbreviations, dates, quotations, references, etc. Each text is encoded as a <cesDoc> element which consists of a header and a body. The header (<cesHeader> element) contains the documentation information and the body contains the raw text material and the mark-up for structural information.
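As an illustration, a minimal CES document could be encoded along the following lines. The element names follow the CES guidelines, but the bibliographic content is invented for this example and most attributes and header elements are omitted for brevity:

```sgml
<cesDoc>
  <cesHeader>
    <fileDesc>
      <titleStmt><h.title>Codice Civile / Zivilgesetzbuch</h.title></titleStmt>
      <publicationStmt>
        <distributor>European Academy Bolzano/Bozen</distributor>
      </publicationStmt>
    </fileDesc>
  </cesHeader>
  <body>
    <div type="book">
      <head>Libro Primo</head>
      <p><s>Text of the first sentence.</s></p>
    </div>
  </body>
</cesDoc>
```

The `<div>` elements nest to represent the gross structure (books, titles, chapters, articles), while sub-paragraph elements such as `<s>` mark sentences inside paragraphs.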

The annotation of documentation and structural information serves several purposes. First of all, these pieces of information are necessary to automatically extract the source of terms, e.g. ``Codice Civile, art. 320''. Second, structural information is important for the development of a sophisticated user interface for browsing the corpus. This matters in our case, since we intend to disseminate the corpus prior to the completion of terminology extraction; a bilingual, sentence-aligned corpus is a valuable resource for translators. Moreover, at a later stage the corpus will be linked to the terminological database, hence user-friendly browsing of the corpus becomes important. Finally, documentation information helps to maintain the text corpus.

The general approach we adopted in the preprocessing phase and for the structural annotation was to scan the raw texts with a sequence of filters. Each filter adds a small piece of new information and writes a log file in cases of doubt. The output and the log file are in turn used to improve the filter programs, in order to minimize manual post-editing. This modular bootstrapping approach has advantages over a single, heavily parametrizable program: the filters are fairly simple and can be partially reused or easily adapted for texts with different formats. The filters have been implemented in Perl, a general-purpose interpreted language whose extensive support for regular-expression matching makes it well suited to such applications.
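The filter idea can be illustrated with a simplified sketch. This is Python rather than our actual Perl, and the heading pattern is invented for the example; a real filter in the chain works the same way, adding one kind of mark-up and logging doubtful lines for the next bootstrapping round:

```python
import re

def mark_titles(lines, log):
    """One filter in the chain: wrap lines that look like article
    headings (e.g. "Art. 320") in <head> tags; log doubtful lines
    for manual inspection instead of guessing."""
    out = []
    for n, line in enumerate(lines, 1):
        stripped = line.strip()
        if re.fullmatch(r"Art\.\s*\d+.*", stripped):
            out.append("<head>" + stripped + "</head>")   # confident match
        elif stripped.startswith("Art"):
            # looks like a heading but does not match the pattern:
            # record the doubt in the log file and pass the line through
            log.append(f"line {n}: possible heading? {stripped!r}")
            out.append(line)
        else:
            out.append(line)
    return out

log = []
print(mark_titles(["Art. 320", "Artt. 320-322", "Some body text."], log))
print(log)
```

Each such filter reads the output of the previous one, so the chain can be rerun from any point after a filter has been improved.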

Most of the gross structural elements can be recognized by analysing the mark-up for text-formatting information in the raw text material. For the recognition of sub-paragraph elements we used the MULTEXT tokenizer MtSeg [1]. Tokenization detects structural features (paragraph and sentence boundaries) and particular tokens (abbreviations, dates, digits, compound names, etc.), pieces of information that will also help us in completing the structural annotation. MtSeg is composed of a series of sub-tools, each devoted to solving a single, specific problem. The sub-tools perform processes such as splitting text at spaces, isolating punctuation, identifying abbreviations, recombining compounds, etc. The rules determining how to treat the different tokens are provided as data to the appropriate sub-tool via a set of language-specific, user-defined resource files and are thus entirely customizable.

We had to add new items to the resource files for German and Italian--information we extracted from our texts using ad hoc Perl scripts. The resource files we modified are those containing abbreviations, compound names, and clitics (tokens with an apostrophe). The segmenter's resource files can be created in a bootstrapping process: by checking the output of the segmenter we verified the entries we already had and identified others that had to be added to the files. After a few bootstrapping sessions on the Civil Code (both the German and the Italian version) we were able to tune the resource files to a very satisfactory level. Finally, we ran MtSeg on a 10% chunk (approximately 28,000 words) of the Civil Code; the segmented output was scrupulously compared with the original text and checked for errors by a group of linguists. We found only one type of error in both the German and the Italian version, and these errors turned out to be unavoidable. The structural equivalence of our parallel texts allows us to easily detect such segmentation errors.
A more detailed analysis of the tokenization process will be given in the full version of the paper.
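The role of the abbreviation resource files can be sketched with a simplified sentence splitter. This Python sketch shows only the general idea; MtSeg's actual rules and resource format are considerably richer, and the abbreviation list below mixes a few invented Italian and German entries:

```python
import re

# Illustrative entries of the kind collected in the language-specific
# resource files (lower-cased, final period included).
ABBREVIATIONS = {"art.", "artt.", "cod.", "civ.", "n.", "abs.", "z.b."}

def split_sentences(text):
    """Split at '.' followed by whitespace and an uppercase letter,
    unless the token ending in '.' is a known abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r"\.\s+(?=[A-ZÄÖÜ])", text):
        token = text[start:m.start() + 1].split()[-1].lower()
        if token in ABBREVIATIONS:
            continue          # e.g. "Cod." does not end a sentence
        sentences.append(text[start:m.start() + 1])
        start = m.end()
    sentences.append(text[start:])
    return sentences

text = ("La potestà è regolata dall'art. 316 Cod. Civ. "
        "Il giudice decide. La norma si applica.")
print(split_sentences(text))
```

Without the abbreviation entries, the periods after ``Cod.'' and ``Civ.'' would produce exactly the kind of false sentence breaks that the bootstrapping sessions on the Civil Code eliminated.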

Future work will include the linguistic annotation, which enriches the primary data with information resulting from linguistic analyses of these data. We will consider the assignment and disambiguation of lemmas and part-of-speech (POS) tags as well as the alignment of the parallel texts, first on the sentence level and later on the word level. For the lemmatization and morpho-syntactic annotation we will use the MULTEXT tool MtLex [1]. MtLex is equipped with an Italian and a German lexicon, which contain 138,823 and 51,010 different word forms, respectively. Currently, we are extending these lexicons to include the 15,013 (58,217) new Italian (German) word forms found in our corpus; the creation of the Italian lexicon took about two person-months. The MULTEXT tagger MtTag [1] will be used for the disambiguation of POS tags. Word alignment still requires the study of various approaches, e.g. [2,5]. Finally, we are also working on a sophisticated interface for navigating through parallel documents.
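The kind of lookup involved in the lemmatization step can be sketched as follows. The entries below are invented for illustration; MtLex's actual lexicon format and tagset differ:

```python
# Invented sample entries: each word form maps to one or more
# (lemma, POS) analyses; a tagger such as MtTag later selects the
# contextually correct analysis.
LEXICON = {
    "giudice": [("giudice", "NOUN")],
    "decide":  [("decidere", "VERB")],
    "la":      [("la", "DET"), ("la", "PRON")],   # ambiguous form
}

def analyse(tokens):
    """Look up each token; unknown forms are flagged so that they can
    be collected and added when extending the lexicon."""
    return [(t, LEXICON.get(t.lower(), [(t, "UNKNOWN")])) for t in tokens]

print(analyse(["La", "giudice", "norma"]))
```

Flagging unknown forms in this way is exactly how the new word forms in our corpus are identified for the ongoing lexicon extension.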


[1] Susan Armstrong. MULTEXT: Multilingual text tools and corpora. In Arbeitspapiere zum Workshop Lexikon und Text: Wiederverwendbare Methoden und Ressourcen für die linguistische Erschließung des Deutschen, Lexicographica, pages 107-119. Max Niemeyer Verlag, Tübingen, 1996.
[2] Ido Dagan, Kenneth W. Church, and William A. Gale. Robust bilingual word alignment for machine aided translation. In Proceedings of the Workshop on Very Large Corpora: Academic and Industrial Perspectives, pages 1-8, 1993.
[3] Charles F. Goldfarb. The SGML Handbook. Oxford University Press, 1990.
[4] Nancy Ide, Greg Priest-Dorman, and Jean Véronis. Corpus Encoding Standard, 1996.
[5] I. Dan Melamed. A portable algorithm for mapping bitext correspondence. In Proceedings of ACL/EACL-97, pages 302-312, 1997.