The Encoding of Language Corpora: The TEI Recommendations in Principle and Practice

Since their first publication in 1994, the Recommendations of the Text Encoding Initiative have had an extraordinary influence on the diverse communities of people creating, using, and curating digital resources of all kinds, serving as an important reference point even for projects which have not adopted them. Some indication of the breadth and variety of the community of TEI users is given by the TEI applications web page at http://www-tei.uic.edu/orgs/tei/, which lists applications in digital library creation, language corpus construction, language engineering, document production, and text-centred humanistic research of all kinds, on both sides of the Atlantic and beyond. In Europe the TEI has had a major impact on the emergence of standards for the creation of language resources, in particular for the markup of linguistic corpora and related language engineering products such as lexica.

This talk will first review the theoretical bases of the TEI encoding scheme, in particular its attempts to harmonize the widely divergent practices of computer-aided research, which now crosses many political, linguistic, temporal, and disciplinary boundaries. Secondly, and more specifically, it will review the application of the scheme in fields related to language engineering.

The approach taken will be first to describe the TEI architecture, focussing on its modular and extensible nature. This will be followed by a review of the requirements of those building and distributing language corpora today, and of the relevant parts of the TEI, in an attempt to show where the latter can most usefully be applied to meet the needs of the former, and also to assess where modification or development of the TEI Guidelines might be beneficial in the light of experience.

A language corpus may be defined as a body of naturally occurring language data assembled for some specific purpose.
Typically, the purpose will be to facilitate automatic linguistic analysis, either of the corpus itself or of some other material for which the corpus is intended to provide a comparative basis, but there are many cases of corpora which are constructed simply because of the inherent interest or importance of the language data which they contain. Historical language corpora, such as the Corpus of Old English or the Corpus of Historical Spanish, are a special case of this type. Whatever their intended application, however, corpora are easily distinguishable from simple assemblages of texts or electronic collections, in that the components of a corpus are intended to be used together as a single unit, most if not all of the time. For this to be feasible, at least the following are prerequisites:

- a uniform structural encoding, and hence a uniform reference system;
- a consistent editorial policy;
- an explicit and automatically verifiable scheme for representing any linguistic or other analytic information included;
- detailed contextual information.

When describing the components of written texts (other than words), it is necessary to indicate the boundaries of chapters, sections, paragraphs, sentences, etc., and the specialized roles of headings, lists, notes, citations, captions, references, etc. Many of these components serve a dual function: they mark a particular type of discourse within the text, but they also serve to identify locations within it, forming the basis of a reference system which may be used to localize occurrences of tokens within a specific context. In the same way, for spoken texts, indications of the beginnings and ends of individual utterances are essential, as is an indication of the speaker of each. The TEI recommendations for handling these general issues of text structure will be presented in detail, with examples from a variety of sources.
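The dual function of structural markup described above can be illustrated with a minimal sketch. It assumes a small, hypothetical TEI-style fragment invented for this illustration (no header, no namespace, and numbering carried by the n attribute); it is not taken from any real corpus, nor is it a complete TEI document. The helper names sentence_references and utterances_by_speaker are likewise assumptions. The same elements that mark divisions, paragraphs, and sentences are used to derive a reference for each sentence, and each utterance is attributed to its speaker.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified TEI-style fragment (assumption: no namespace,
# no header, numbering supplied explicitly through the n attribute).
SAMPLE = """
<text>
  <body>
    <div n="1" type="chapter">
      <head>Arrival</head>
      <p n="1"><s n="1">The train was late.</s> <s n="2">Nobody minded.</s></p>
      <p n="2"><s n="1">Rain fell on the platform.</s></p>
    </div>
    <div n="2" type="transcript">
      <u who="A">did you see the train</u>
      <u who="B">no it was late</u>
    </div>
  </body>
</text>
"""


def sentence_references(root):
    """Yield (reference, text) pairs built from division, paragraph, and sentence numbers."""
    for div in root.iter("div"):
        for p in div.findall("p"):
            for s in p.findall("s"):
                ref = ".".join([div.get("n"), p.get("n"), s.get("n")])
                yield ref, (s.text or "").strip()


def utterances_by_speaker(root):
    """Group the text of each <u> element under the value of its who attribute."""
    grouped = {}
    for u in root.iter("u"):
        grouped.setdefault(u.get("who"), []).append((u.text or "").strip())
    return grouped


root = ET.fromstring(SAMPLE)
refs = dict(sentence_references(root))
speakers = utterances_by_speaker(root)

for ref, text in refs.items():
    print(ref, text)
for who, lines in speakers.items():
    print(who, lines)
```

Because every text in such a corpus would follow the same structural encoding, one pair of functions of this kind could serve all of its components, which is the practical point of requiring a uniform reference system.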
In both spoken and written texts, it is helpful to include editorial information about the status of the electronic text itself (for example, to mark corrections or conjectures by the transcriber or editor): transcription is not an exact science even for printed materials, and still less so for spoken texts. Even where entirely automatic procedures have been adopted, subsequent users of corpora need to be informed of the nature of the algorithms etc. applied. These issues are also addressed by a variety of mechanisms defined by the TEI, whose relative merits will be assessed.

Finally, it is essential to record descriptive information about the social or cultural context in which the text was produced or classified. Such meta-information may often be of crucial importance to the corpus analyst, whether or not a contrastive study (for example, between the speech of men and women, or between texts aimed at the young and the old) is involved. This is the third major area in which the TEI's recommendations will be detailed.

Unlike some early monolithic applications of SGML, the TEI scheme was designed from the first to be modular. With the advent of XML, and the concomitant widespread take-up of the basic principles of SGML (descriptive markup, reusable encodings, application-specific tagging, etc.), the benefits of this approach are becoming increasingly apparent. I will describe the process by which application-specific views of the TEI scheme may be constructed, and their benefits in facilitating the distribution and conservation of digital information. Emphasis will be placed on the ways in which the Guidelines offer potential for immediate application of new technologies such as XML and its hyperlinking facility, XLL, which is derived largely from the TEI's extended pointer scheme.
Facilities permitting, some TEI-aware tools and software will also be demonstrated.

Lou Burnard is European editor of the TEI Guidelines and Manager of the Humanities Computing Unit at Oxford University Computing Services. He was educated at Balliol College Oxford, from which he graduated with a first class degree in English in 1968. He has worked in computing applications in the Humanities since 1974, with extensive experience in database systems, information retrieval, and text encoding on a wide variety of computer systems. In 1976, he set up the Oxford Text Archive, an early version of what we now know as the digital library; in 1987 he was appointed European editor of the TEI Guidelines. He has played a leading role in the creation of several major digital initiatives in the UK, including the Arts and Humanities Data Service and the British National Corpus, as well as various European initiatives; his most recent publication is "The BNC Handbook", published by Edinburgh University Press in 1998.

Contact information:

Lou Burnard
OUCS/Humanities Computing Unit
13 Banbury Road
Oxford OX2 6NN
tel +44 1865 273221
fax +44 1865 273275
lou.burnard@oucs.ox.ac.uk

----------------------------------------------------------------
Lou Burnard http://users.ox.ac.uk/~lou
----------------------------------------------------------------