Tools for Constructing Arabic Language Corpus

Ibrahim A. Al-Kharashi, Ph.D.
Computer and Electronics Research Institute
King Abdulaziz City for Science and Technology
P. O. Box 6086, Riyadh 11442
kharashi@kacst.edu.sa

Abstract:

Corpora are an essential part of any process for studying the characteristics of languages. They are used in a wide variety of language studies, covering speech as well as lexical, syntactic and semantic aspects. The contents of a corpus are carefully selected so that they represent the language or language variety. Computer-assisted tools have been developed to help in constructing corpora and in manipulating and analyzing their contents. Such tools have become faster, more powerful, and more user-friendly. Until quite recently, the overwhelming majority of corpora and their tools were built for English and some other European languages. So far, activities relating to Arabic linguistic studies using corpora have been small in scale and fragmented, with no tools to assist researchers in that field.

Arabic, one of the Semitic languages, has its own characteristics regarding its character set and its lexical, syntactic and semantic issues. Arabic is written from right to left. Its alphabet has twenty-eight consonant letters, three of which are also considered long vowels. Optionally, one of three short vowels can be placed after a character where ambiguities (in pronunciation and/or meaning) might arise. In fully vowelized Arabic text, the absence of a vowel can be indicated by the sokon (silence) symbol. In certain cases, a doubled letter can be replaced by a single letter with the tashdeed (strengthening) sign placed over it. Even in very simple written Arabic, each letter occurs in up to four presentation forms (shapes) depending on its position within the word (initial, medial, final or isolated).
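The optional marks and the positional shapes described above can be illustrated with a short sketch. It assumes the text is encoded in Unicode, which the abstract does not specify; the function names `strip_vowel_marks` and `presentation_forms` are hypothetical and are not part of the proposed tools. It relies only on Python's standard `unicodedata` module.

```python
import unicodedata

# Assumption: Unicode encoding. Code points U+064B..U+0652 hold the
# tanween marks, the three short vowels, tashdeed (shadda) and sokon (sukun).
VOWEL_MARKS = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_vowel_marks(text):
    """Remove vowelization marks so vowelized and plain spellings compare equal."""
    return "".join(ch for ch in text if ch not in VOWEL_MARKS)


def presentation_forms(letter):
    """Map form name -> glyph character for one base letter.

    Scans the Arabic Presentation Forms-B block (U+FE70..U+FEFF) and keeps
    the code points whose compatibility decomposition is exactly `letter`,
    e.g. '<initial> 0628' for the initial form of BEH.
    """
    target = f"{ord(letter):04X}"
    forms = {}
    for cp in range(0xFE70, 0xFF00):
        parts = unicodedata.decomposition(chr(cp)).split()
        # Unassigned code points yield []; ligatures yield more than 2 parts.
        if len(parts) == 2 and parts[1] == target:
            forms[parts[0].strip("<>")] = chr(cp)
    return forms


if __name__ == "__main__":
    beh = "\u0628"
    for name, glyph in presentation_forms(beh).items():
        print(name, f"U+{ord(glyph):04X}")
```

For a letter such as BEH the lookup yields all four forms, while a letter such as ALEF, which does not join to the following letter, yields only the isolated and final forms.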
A contextual analysis algorithm is usually used to determine the shapes of printed or displayed Arabic text.

Morphologically, Arabic is very rich and is based on a root-pattern structure. Most Arabic words are generated from a finite set of roots (about 7000), which are transformed into stems using one or more patterns (about 120). In theory, a single Arabic root can generate hundreds of words (nouns, verbs, ...). An Arabic word may appear in a hundred forms in normal text through the addition of certain suffixes and prefixes (most of which correspond to function words or stop words in English). Stripping out affixes and normalizing words is an essential part of any natural language processing, information retrieval or search engine system. Normalization is done through a stemming algorithm. Linguistically, normalization of an Arabic word goes through a time-consuming process known as morphological analysis. The process has two distinct stages. In the first stage, the analyzer strips out all prefixes and suffixes and reduces the word to its singular form. In the second stage, the analyzer produces the root of the word. In most practical cases, to increase the number of retrieved records without decreasing their quality, it is preferable to use the stem of the word, rather than the root, for indexing and searching.

This paper presents the initial stage of constructing tools that help researchers in linguistic studies to construct, manage and manipulate an Arabic language corpus, so that they do not have to deal with the issues of sampling, collection and encoding. The tools allow the contents of a corpus to be added, deleted, updated, searched, tagged and displayed. Through training, the suggested tools should provide a mechanism to automatically detect the different parts of speech in the corpus and suggest appropriate tags. The linguist may then accept the suggested tag, reject it or modify it.
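The affix-stripping part of the first normalization stage described above can be sketched as a simple "light stemmer". This is only an illustration under stated assumptions: the prefix and suffix lists are a small hypothetical sample chosen for the example, the `min_len` threshold is arbitrary, and neither the reduction to singular form nor the root extraction of the second stage is attempted. It is not the analyzer the paper proposes.

```python
# Hypothetical, deliberately small affix lists (longest entries first so
# that the longest matching affix is removed).
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ات", "ون", "ين", "ان", "ها", "ية", "ة", "ه", "ي")


def light_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, keeping at least `min_len` letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


def build_index(tokens):
    """Group surface forms under their light stem, as a stem-based index would."""
    index = {}
    for token in tokens:
        index.setdefault(light_stem(token), set()).add(token)
    return index


if __name__ == "__main__":
    for stem, forms in build_index(["والكتاب", "الكتاب", "المعلمون"]).items():
        print(stem, sorted(forms))
```

Indexing by stem in this way groups the different surface forms of a word under one key, which is the retrieval benefit the paragraph on stem-based indexing refers to. A real system would need far larger affix lists and linguistic checks to avoid removing letters that belong to the word itself.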
The tools are complemented by a range of statistical functions that provide information about the relative frequency of a letter, word or string and its distribution across text types. Searching for single-word or multiword strings, for a root, stem or affix, by word class, or by usage of a certain pattern is also considered.

---
Regards
Ibrahim A. Al-Kharashi
Computer and Electronics Research Institute (CERI)
King Abdulaziz City for Science and Technology (KACST)
P. O. Box 6086, Riyadh 11442
Phone: 481-3273 - Fax: 481-3764
e-mail: kharashi@kacst.edu.sa