Ann Lawson
----------

Improving Dictionary Coverage: Conclusions drawn from a corpus-based validation study

Cyril Belica and Ann Lawson
Institut für deutsche Sprache, Mannheim, Germany

This paper describes a recent corpus-based validation of dictionary material and examines its methodology, its results and the conclusions to be drawn for future work. The paper comprises three parts: a description of the practical study undertaken, an evaluation of the results with a discussion of possible alterations or improvements, and an examination of the implications of the study for future work.

The original study was commissioned by the Klett publishing house as a corpus-based validation of the German part of the database from which their bilingual dictionaries are generated. The original lemma list contained around 60,000 entries. The corpus resources and tools used in the study are based at the IDS, the Institut für deutsche Sprache, and include part of the COSMAS corpus resources (around 200 million words for the purposes of this study) and the IDS-Toolbox of cognitive- and statistically-based programs and algorithms.

The study examined chronological frequencies, "missing" headwords (found in the corpus but not in the dictionary) and "redundant" headwords (headwords for which no examples could be found in the corpus). The resulting lemmas were extensively documented, with typical contexts, concordances, relative frequencies and collocations. The relative chronological frequency of each headword was determined using a diachronic reference corpus of 40 million words compiled for the purpose.

Since the lexicographers involved have yet to complete an exhaustive linguistic evaluation of the results and a description of how the results will be used in future dictionary editions, we undertake an initial survey and assessment, with special reference to the collocational information extracted for selected lemmas.
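The notions of "missing" and "redundant" headwords and of relative chronological frequency can be illustrated with a minimal Python sketch. It is not the IDS-Toolbox or COSMAS implementation: the function names, the `min_freq` threshold, the per-million normalisation and the assumption that the corpus is already lemmatised into a flat list of tokens are all hypothetical choices made for illustration.

```python
from collections import Counter


def validate_headwords(headwords, corpus_lemmas, min_freq=1):
    """Compare a dictionary lemma list with lemma counts from a corpus.

    Hypothetical helper: assumes corpus_lemmas is an already lemmatised
    sequence of tokens. min_freq is an assumed cut-off for "missing".
    """
    counts = Counter(corpus_lemmas)
    headword_set = set(headwords)
    # "Missing": attested at least min_freq times in the corpus,
    # but absent from the dictionary.
    missing = sorted(
        lemma for lemma, n in counts.items()
        if n >= min_freq and lemma not in headword_set
    )
    # "Redundant": listed in the dictionary, but with no corpus examples.
    redundant = sorted(h for h in headword_set if counts[h] == 0)
    return missing, redundant


def relative_frequencies(lemma, slices):
    """Occurrences of lemma per million tokens in each time slice.

    slices maps a period label to the list of lemmas in that period
    (an assumed layout for a diachronic reference corpus).
    """
    result = {}
    for label, tokens in slices.items():
        total = len(tokens)
        result[label] = 1_000_000 * tokens.count(lemma) / total if total else 0.0
    return result
```

With a toy dictionary `["Haus", "Baum", "Droschke"]` and a toy corpus in which "Handy" occurs twice and "Droschke" never, `validate_headwords(..., min_freq=2)` would report "Handy" as missing and "Droschke" as redundant. A real study would need a frequency threshold and further filtering (proper names, misspellings, inflected forms) before treating a candidate as a genuine gap.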
The significance of word class and other linguistic factors (such as extremely high or low frequency) and their influence on the results is investigated. In particular, perceived shortcomings or omissions in the collocational information are explored in order to uncover their origin.

Methods of improving the algorithms for future work are outlined. We plan to demonstrate how fine-tuning the algorithm in specific contexts has led to improvements in the recall and relevance of the data extracted. As a case in point, the influence of varying the span according to word class and context is investigated. We will also report briefly on the ongoing investigation of the implications of this work in the language-independent sphere.

**********************************************************************
Dr Ann Lawson                  Multilinguale Forschung
TELRI-II/SIMPLE/DHYDRO         Abteilung LEXIK
lawson@ids-mannheim.de         Institut für deutsche Sprache
Tel: +49 621 1581 427          R5, 6-13
Fax: +49 621 1581 415          D-68161 Mannheim
**********************************************************************