Language-specific encoding in multilingual corpora

Requirements and solutions




1. Example of multilingual arrangement of text:
Mt. 6,9 in Old Georgian, Armenian, Greek, Syriac, synoptically arranged

NINO.JPG

2. Avestan manuscript (Aogəmadaēca): Avestan (in Avestan script) mixed with Pahlavī (Middle Persian in Middle Persian script), Pāzend (Middle Persian in Avestan script), and Sanskrit (in Devanāgarī script)

aogemad0.JPG
(Image source: Facsimile of p. 1 of Cod.Iran.42 (Copenhagen) as printed in Aogəmadaēca, A Zoroastrian Liturgy, ed. K.M. JamaspAsa, Wien 1982).
2.1.

aogemad1.JPG
2.2.

aogemad2.JPG

3. Marking of languages for language-specific retrieval, marking of scripts and script directions for original-like rendering

3.1. WordCruncher solution (as used in the TITUS project):
3.1.1. Marking of textual units for both the script and the language they contain:
<Pright><Tsye16>j" hk0L TDm8W j9wW ^b8k8W {0r yd@ m?" mtbc" j/8W `dj" T|"j8n09Z<Tn16>
<Tsyn16>j$ hk0L TDm8W j9wW ^b8k8W {0r yd@ m?$ mtbc$ j/8W `dj$ T|$j8n09Z<Tn16>
<Tsys16>1 hk0L TDm8W j9wW ^b8k8W {0r yd@ m?" mtbc" j/8W `d1 T|"j8n09Z<Tn16>

<Pnormal>
|p9      <Tdge16>esret ilocevdi tkuen. mamao ─ueno romeli xar cata Ωina. ┬mida iτavn saxeli Ωeni:<Tn16>
      <Tcgei16>esret ilocevdi tkuen. mamao ─ueno romeli xar cata Ωina. ┬mida iτavn saxeli Ωeni:<Tn16>
      <Tmge16>xolo tkuen esret ilocevdit: mamao ─ueno, romeli xar cata Ωina, ┬mida iτavn saxeli Ωeni,<Tn16>
      <Tcgei16>xolo tkuen esret ilocevdit: mamao ─ueno, romeli xar cata Ωina, ┬mida iτavn saxeli Ωeni,<Tn16>
      <Tgr16>O¡Ω∩@ o╔Σ τΦoΘ█úφ█Θ▐█ ⁿπ█ç@_ / ╛Ω█Φ {π║Σ √ ╨Σ Ωoç@ o╙Φ╫Σoç@, / °┘▀╫Θ▐<Ω∩ Ωò ÉΣoπ Θoδ,<Tn16>
      <Thy16>Ew ard aysp╔s ka├╔╫ dow╫ ya┘π±s. Hayr mer or yerkins, sowrb e┘i├i anown ╫o:<Tn16>
      <Thyti16>Ew ard aysp╔s ka├╔╫ dow╫ ya┘π±s. Hayr mer or yerkins, sowrb e┘i├i anown ╫o:<Tn16>
      <Tcesk16>hkn& hkyl φlw &ntwn &bwn dbΩmy& ntqdΩ Ωmk<Tn16>

<Pright><Tsye16>hk?" hk0L ~j8 ^ntwW ^b8W Db}.0" ntqd\ |.6<Tn16>
<Tsyn16>hk?$ hk0L ~j8 ^ntwW ^b8W Db}.0$ ntqd\ |.6<Tn16>
<Tsys16>hk?" hk0L ~j8 ^ntwW ^b8W Db}.0" ntqd\ |.6<Tn16>

<Pnormal>|p10      <Tdge16>movedin supevay Ωeni. iτavn nebay Ωeni vitarca cata Ωinagrca ku!^anasa zeda:<Tn16>
      <Tcgei16>movedin supevay Ωeni. iτavn nebay Ωeni vitarca cata Ωinagrca ku!^anasa zeda:<Tn16>
      <Tmge16>movedin supeva Ωeni, iτavn neba Ωeni, vitarca cata Ωina, egreca kueτan-sa zvda.<Tn16>
      <Tcgei16>movedin supeva Ωeni, iτavn neba Ωeni, vitarca cata Ωina, egreca kueτan-sa zvda.<Tn16>
      <Tgr16>╨Γ▐éΩ∩ { ß╫Θ▀Γ█í╫ Θoδ, / ┘█Σ▌▐<Ω∩ Ωò ▐éΓ▌π Θoδ, / ╜@ ╨Σ o╙Φ╫Σª α╫ì ╨τì ┘~@.<Tn16>
      <Thy16>Ekes├╔ ar╫ayow±iwn ╫o: E┘i├in kam╫ ╫o orp╔s yerkins ew yerkri:<Tn16>
      <Thyti16>Ekes├╔ ar╫ayow±iwn ╫o: E┘i├in kam╫ ╫o orp╔s yerkins ew yerkri:<Tn16>
      <Tcesk16>t&t& mlkwtk nhw& φbynk &ykn& dbΩmy& &p b&r%&<Tn16>
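
The marking scheme above can be read mechanically. The following is a minimal sketch, assuming that `<T...>` switches to the text style named after the `T`, that `<Tn16>` returns to a neutral style, and that `<P...>` only sets the paragraph style; these readings are inferred from the sample, not taken from WordCruncher documentation, and the function name `styled_runs` is hypothetical.

```python
import re

# A tag is "<", then P (paragraph style) or T (text style), then a style name.
TAG = re.compile(r"<([PT])([A-Za-z0-9]+)>")

def styled_runs(markup, neutral="n16"):
    """Return (text_style, text) pairs for every run not in the neutral style."""
    runs = []
    style = neutral
    pos = 0
    for match in TAG.finditer(markup):
        text = markup[pos:match.start()]
        if text.strip() and style != neutral:
            runs.append((style, text))
        if match.group(1) == "T":       # only T tags change the text style
            style = match.group(2)
        pos = match.end()
    tail = markup[pos:]
    if tail.strip() and style != neutral:
        runs.append((style, tail))
    return runs

# Two shortened runs modelled on the verse 10 sample above.
sample = "<Pnormal>|p10      <Tdge16>movedin supevay<Tn16>\n      <Tcesk16>t&t& mlkwtk<Tn16>"
for style, text in styled_runs(sample):
    print(style, text.split())
```

Each run keeps its style name, so the words it contains can later be indexed under the language and script that the style stands for.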

3.1.2. Structure of the files required for indexing (LST-File, ETX-File, SIF-File):

LSTFILE.jpG
ETXFILE.jpG
BC=<
EC=>
PARAGRAPH_STYLE normal
RULER normal
JUSTIFICATION left
END
PARAGRAPH_STYLE center
RULER center
JUSTIFICATION center
END
...

...
TEXT_STYLE title
FONT orient
CHARACTER_STYLE bold
FONT_POINTSIZE 24
TEXT_COLOR [0 0 128][255 255 255]
END
TEXT_STYLE cgei16
FONT georgisch-trs.
CHARACTER_STYLE italic
INDEX_FLAG on
FONT_POINTSIZE 15
TEXT_COLOR [255 0 0][255 255 255]
END
....
FONT syriac-nest.
FONT_NAME Titus SyriacNestorian
FONT_FAMILY roman
CHAR_SET ansi
PITCH proportional
DIRECTION right-to-left
FONT_TYPE TrueType
LANGUAGE Syriac-Nestorian
END
...
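
The block structure of the ETX excerpt above (a header line, `KEY value` lines, `END`) can be read with a few lines of code. This is only a sketch under the assumption that every block follows the pattern visible in the excerpt; the function name `parse_etx` is hypothetical and the real file format may contain constructs not covered here.

```python
def parse_etx(text):
    """Collect the KEY value lines of each block, keyed by (block type, block name)."""
    blocks = {}
    current = None                      # settings of the block being read
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, _, value = line.partition(" ")
        if key == "END":
            current = None
        elif current is None:           # first line of a block: type and name
            current = blocks.setdefault((key, value), {})
        else:
            current[key] = value
    return blocks

# Two blocks copied from the excerpt above.
SAMPLE = """\
TEXT_STYLE cgei16
FONT georgisch-trs.
CHARACTER_STYLE italic
INDEX_FLAG on
FONT_POINTSIZE 15
TEXT_COLOR [255 0 0][255 255 255]
END
FONT syriac-nest.
FONT_NAME Titus SyriacNestorian
FONT_FAMILY roman
CHAR_SET ansi
PITCH proportional
DIRECTION right-to-left
FONT_TYPE TrueType
LANGUAGE Syriac-Nestorian
END
"""

blocks = parse_etx(SAMPLE)
font = blocks[("FONT", "syriac-nest.")]
print(font["LANGUAGE"], font["DIRECTION"])
```

The point of the sketch is that language and writing direction are properties of the font block, so a text style inherits them only through the font it names.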



3.1.3. Results to be obtained: language-specific word wheel (including language-specific spelling and sorting rules etc.)
Georgian:

NINOM.JPG
Armenian:

NINOA.JPG
Greek:

NINOG.JPG
Syriac transliterated:

NINOT.JPG
Original Syriac (Serto) (right-to-left):

NINOS.JPG
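
What a language-specific word wheel amounts to can be sketched as follows: words are collected per language and sorted by that language's own alphabetical order. The alphabet string used for Georgian is a toy, abridged ordering chosen for illustration (it only reflects that m precedes c and x in the Georgian alphabet); it is not the TITUS or WordCruncher sorting table, and the names `sort_key` and `word_wheel` are hypothetical.

```python
from collections import Counter

def sort_key(word, alphabet):
    """Rank letters by their position in the alphabet; unknown letters sort last."""
    rank = {ch: i for i, ch in enumerate(alphabet)}
    return [rank.get(ch, len(rank) + ord(ch)) for ch in word]

def word_wheel(runs, alphabets):
    """Map each language to its (word, count) list in language-specific order."""
    counts = {}
    for lang, text in runs:
        counts.setdefault(lang, Counter()).update(text.split())
    return {
        lang: sorted(c.items(), key=lambda item: sort_key(item[0], alphabets.get(lang, "")))
        for lang, c in counts.items()
    }

# Toy ordering for transliterated Georgian (assumption, abridged).
ALPHABETS = {"georgian": "abgdevzilmnorsucx"}
runs = [
    ("georgian", "mamao romeli xar cata"),
    ("no-rules", "mamao romeli xar cata"),   # falls back to code point order
]
wheel = word_wheel(runs, ALPHABETS)
print([w for w, _ in wheel["georgian"]])
print([w for w, _ in wheel["no-rules"]])
```

The same four words come out in a different order for the two entries, which is the effect the language-specific word wheels illustrate. Punctuation stripping and spelling rules are left out of the sketch.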


3.2. Disadvantages:

3.2.1. Language tagging joined inseparably with script tagging

3.2.2. No unique encoding of characters (font mapping)
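
The font-mapping problem can be illustrated with a small sketch: the same stored character stands for different letters depending on the font applied to it, so the text cannot be interpreted without its style tag. The tables below are hypothetical two-letter mappings to present-day Unicode code points, made up for illustration; they do not reproduce the layout of any TITUS font.

```python
# Hypothetical per-font tables: stored character -> intended letter.
FONT_TABLES = {
    "georgian": {"a": "\u10D0", "b": "\u10D1"},   # Georgian an, ban
    "armenian": {"a": "\u0561", "b": "\u0562"},   # Armenian ayb, ben
    "greek":    {"a": "\u03B1", "b": "\u03B2"},   # Greek alpha, beta
}

def to_unicode(text, font):
    """Translate font-encoded text into unique code points via the font's table."""
    table = FONT_TABLES[font]
    return "".join(table.get(ch, ch) for ch in text)

stored = "ab"                                     # identical bytes in every case
for font in FONT_TABLES:
    print(font, to_unicode(stored, font))
```

Only after such a conversion does each letter have an encoding that is independent of the font.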



4. Unicode as a way out?


4.1. Not easily for the Avestan example: the Avestan (and Middle Persian) scripts will not be adopted into the 16-bit Unicode standard (while Devanāgarī is); > "surrogate area" (not yet usable in any environment); Devanāgarī requires a special "rendering engine", cf. Unicode block 0900 - 097F.
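
What the "surrogate area" means for a 16-bit encoding can be sketched as follows: a code point beyond FFFF is split into two 16-bit units taken from reserved ranges. The code point U+10B00 is used as an illustration (it lies in the block where Avestan was encoded in a later Unicode version); the function name `utf16_surrogates` is hypothetical.

```python
def utf16_surrogates(code_point):
    """Split a code point above U+FFFF into its high and low surrogate."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)      # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # lower 10 bits
    return high, low

high, low = utf16_surrogates(0x10B00)
print(f"{high:04X} {low:04X}")

# A Devanagari letter (U+0915) fits into one 16-bit unit, the example above needs two.
print(len(chr(0x0915).encode("utf-16-be")) // 2, len(chr(0x10B00).encode("utf-16-be")) // 2)
```

Software that assumes one 16-bit unit per character will therefore treat such a letter as two separate items.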

4.2. Transcriptional solution: possible, but with shortcomings:

4.2.1. No inherent differentiation of languages possible: separate language marking required

4.2.2. Many combinations with diacritics required, cp. Y. 9,3:

HOMYAST.JPG


4.2.3. Shortcomings:

4.2.3.1. No unique method of encoding diacritic combinations: ä = a+̈, š = s+̌, ṣ̌ = s+̌+̣ = s+̣+̌; > special "rendering engines" required for processing

4.2.3.2. General preferability of "precomposed characters" vs. secondary combinations depends on the function of the diacritics: ṣ̌ as opposed to š+̣ if the "dot below" is intended to mean "badly preserved character"

4.2.3.3. Definitions independent of script, but Latin-oriented: one "diaeresis" (̈) for , , ε̈ and ӭ, but also ע̈ (Pahlavī y)? Cp. synoptic arrangement of words taken from several Slavic languages (courtesy of M. Endres)

4.2.3.4. Present set of diacritics not based on thorough investigation of usage, cf. short collection of diacritics missing in Unicode, established within the TITUS project.
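
The encoding ambiguity noted under 4.2.3.1 can be made concrete with a short sketch. It assumes Unicode normalization as implemented in Python's standard `unicodedata` module, which is one possible way of comparing such sequences and not a method described in this outline; the helper name `equivalent` is hypothetical.

```python
import unicodedata

CARON, DOT_BELOW, DIAERESIS = "\u030C", "\u0323", "\u0308"

def equivalent(a, b):
    """True if two strings are the same text after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Precomposed and combining spellings of the same letters.
print(equivalent("a" + DIAERESIS, "\u00E4"))      # a + diaeresis vs. precomposed
print(equivalent("s" + CARON, "\u0161"))          # s + caron vs. precomposed

# Two orders of the same diacritics: different code sequences, same text.
one = "s" + CARON + DOT_BELOW
two = "s" + DOT_BELOW + CARON
print(one == two, equivalent(one, two))

# Normalized form: precomposed s with dot below, followed by a combining caron.
print([f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFC", one)])
```

Since normalization treats all these spellings alike, the order of the code points cannot carry the functional difference mentioned under 4.2.3.2; a dot below that marks a badly preserved character would have to be distinguished by other means, such as separate markup.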



4.2.4. Approach of the TITUS project with respect to diacritic combinations used in transcriptional systems:

4.2.4.1. Providing an interactive database of the usage of diacritics

4.2.4.2. Using the database to prepare a proposal for a Unicode extension (comprising about 2000 characters to date, including precomposed Latin characters as well as combinations based on other scripts)

4.2.4.3. Using the Unicode "private use area" for documentation and interim usage, as in the examples shown.

4.2.4.4. Providing a special font to match these requirements.
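
Texts that rely on the "private use area" in the interim need to be checked before they are exchanged, because such code points are meaningful only together with the matching font. A minimal sketch, assuming Python's standard `unicodedata` module (private-use characters have the general category "Co"); the function name and the sample code point U+E000 are hypothetical and do not reflect actual TITUS assignments.

```python
import unicodedata

def private_use_chars(text):
    """Return the distinct private-use characters of a text, in code point order."""
    return sorted({ch for ch in text if unicodedata.category(ch) == "Co"})

sample = "ha\uE000oma"                 # one interim character inside a word
for ch in private_use_chars(sample):
    print(f"U+{ord(ch):04X}")
```

A list of this kind shows which characters would have to be replaced once the corresponding combinations become available in the standard.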


Copyright
Jost Gippert Frankfurt a/M 1997. No parts of this document may be republished in any form without prior permission by the copyright holder.