Overcoming the Language Barriers in the Web: The UNL-Approach


In this paper we introduce the Universal Networking Language (UNL)-Project and discuss how existing linguistic resources can be adapted to the current purpose, i.e. the development of a UNL-Enconverter for German. The UNL-Project is a long-term project concerned with the storage, retrieval, exchange and presentation of information throughout the Internet. The core of the system is the UNL, a natural-language-independent meta-language designed for the adequate representation of information conveyed by natural language. For each natural language there exists a so-called enconverter, which translates a web page written in that language into UNL-expressions, and a deconverter, which generates natural language expressions from UNL representations. In the future, the enconverters and deconverters are expected to be provided as plug-in software for web browsers, so that users will have natural access to information in foreign languages while enconverting and deconverting are performed in the background.

As a backbone of this project, a world-wide network of universities and research institutes has been set up under the guidance of the Institute of Advanced Studies (IAS) at the United Nations University (UNU) in Tokyo. The network currently includes partners in Japan, China, France, Brazil, Russia, Indonesia, Mongolia, Latvia, Italy, Spain, Egypt, Jordan, Tanzania, and Germany. In Germany, the Institute for Applied Information Science (IAI) is the official project partner, with the Center for Artificial Intelligence (DFKI) as its sub-contractor. The German-Enconverter is being developed and tested at the IAI, while the German-Deconverter is being developed at the DFKI. The IAI also maintains the so-called Universal Word Lexicon Server, which the DFKI uses via the Internet for German generation. Test versions of the German-Enconverter and the Lexicon Server are accessible on the Internet; however, access is strictly restricted to UNL-members to ensure the smooth continuation of the project.

The UNL-System is made up of four components.
Firstly, the artificial language UNL is intended to express any kind of information that can be conveyed in a natural language. The UNL was developed at the UNU and is currently being validated by the UNL network. The meaning of a natural language sentence is expressed in the UNL as a set of binary relations, each carrying a relation label. There is a closed set of well-defined relation labels, which characterize the relationships between the concepts participating in the events or states that a natural language sentence may denote.
Secondly, a web page in a natural language, say German, can be automatically translated into the corresponding UNL-Text by the UNL-Enconverter. This "enconverting" process is similar to the analysis part of the traditional MT model, so we believe that most existing MT systems can easily be adapted into UNL-Enconverters. Thanks to the UNL-Enconverter, a writer does not need specific knowledge of the UNL grammar.
Thirdly, a natural language text is generated by a UNL-Deconverter from the UNL-expressions.
Lastly, the Universal Words (UWs), which are supposed to denote "concepts", are hierarchically organized in the Knowledge Base (KB). The strings of the UWs are identical to those of English words; however, UWs differ from 'normal' English words in that a concept for which there is no English word can be registered as a UW in the KB after an examination by the UNL-Center.
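
As an illustration of the first component, the following minimal Python sketch models a sentence meaning as a list of labelled binary relations between UWs. The class and function names, the rendering format label(uw1, uw2), and the example labels agt and obj are assumptions made for illustration; the text above does not list the relation inventory or the concrete syntax.

```python
from dataclasses import dataclass

# Hypothetical closed set of relation labels, for illustration only.
# The actual inventory is defined by the UNL specification.
RELATION_LABELS = frozenset({"agt", "obj", "plc", "tim"})


@dataclass(frozen=True)
class Relation:
    """One labelled binary relation between two Universal Words."""
    label: str
    source: str
    target: str

    def __post_init__(self):
        # The label set is closed: reject anything outside it.
        if self.label not in RELATION_LABELS:
            raise ValueError(f"unknown relation label: {self.label}")

    def render(self) -> str:
        return f"{self.label}({self.source}, {self.target})"


def render_sentence(relations):
    """Render a sentence meaning as one relation per line."""
    return "\n".join(r.render() for r in relations)


# "John reads a book", expressed as two binary relations.
sentence = [
    Relation("agt", "read", "John"),
    Relation("obj", "read", "book"),
]
print(render_sentence(sentence))
```

Validating the label at construction time mirrors the idea of a closed label set: an expression with an unknown label cannot be built at all.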

The general strategy followed in constructing the German UNL-Enconverter is to make use of existing modules as much as possible. The UNL-Enconverter integrates several submodules, such as the morphological analyzer MPRO, the post-morphological disambiguator KURD, and the Example-Based MT component EDGAR for identifying proper names.
The kernel of the UNL-Enconverter is the CAT2 MT-System, which was developed at the IAI for multilingual MT. During the analysis of a German input sentence we avoid time-consuming complex analysis at a deep level and instead try to build a simple syntactic structure with a limited number of construction-specific syntactic rules, i.e. rules for particular types of linguistic constructions such as the passive, modal-verb constructions, relative clauses and so on.
Once a syntactic structure is built, it is transferred to the UNL. In generating the UNL-expressions, the most important task is to pick the correct UWs for the German words. This choice is guided by statistical information that depends on the subject field of the sentence. For example, the German word 'tor' is much more likely to be translated into the UW 'goal' in the 'sport' context than into, e.g., 'gate'. Such statistical information is collected by the IAI Translation Service Server, to which service users send texts together with specific subject-field information. In this way we obtain quite rich and useful information.
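
The subject-field-driven choice of UWs described above could be sketched as follows. This is a minimal Python illustration, not the IAI implementation: the class UWSelector, its methods, the fallback to field-independent counts, and all counts in the usage example are hypothetical.

```python
from collections import Counter, defaultdict


class UWSelector:
    """Pick the most frequent UW for a word, given a subject field."""

    def __init__(self):
        # Observation counts per (word, subject field) and per word overall.
        self._by_field = defaultdict(Counter)
        self._overall = defaultdict(Counter)

    def observe(self, word, field, uw, count=1):
        """Record that `word` was rendered as `uw` in texts of `field`."""
        self._by_field[(word, field)][uw] += count
        self._overall[word][uw] += count

    def select(self, word, field=None):
        """Return the most frequent UW, or None if the word is unknown."""
        counts = self._by_field.get((word, field))
        if not counts:
            # Assumed fallback: no data for this field, use overall counts.
            counts = self._overall.get(word)
        if not counts:
            return None
        return counts.most_common(1)[0][0]


# Made-up counts, chosen only to reproduce the 'tor' example from the text.
selector = UWSelector()
selector.observe("tor", "sport", "goal", 40)
selector.observe("tor", "sport", "gate", 2)
selector.observe("tor", "architecture", "gate", 25)
selector.observe("tor", "architecture", "goal", 1)

print(selector.select("tor", "sport"))         # goal
print(selector.select("tor", "architecture"))  # gate
```

Keeping a second, field-independent counter is one simple way to handle words seen only in other subject fields; the source does not say how such cases are treated.
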
Functional words or morphemes in German, such as determiners, tense morphemes and plural morphemes, can be expressed in the UNL as attributes attached to the UW. The semantic features of German words must also be transferred to the UNL. However, as the semantic features employed in the CAT2 formalism differ from those of the UNL, a few f-rules had to be written to adapt the CAT2 semantic features to the UNL semantic features, which are organized in the KB.
Once a UNL-Tree is built, a small Perl program runs to collect the UNL-related features in the nodes. The program suppresses the nodes that contain no UNL-features and transforms the UNL-features into a well-formed UNL-expression.
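
The collection step can be pictured with the following minimal sketch. It is written in Python rather than Perl and is not the program used in the project: the Node structure, the assumption that each node's relation links it to its nearest UW-bearing ancestor, the attribute names, and the output syntax are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A node of a hypothetical UNL-Tree."""
    name: str
    uw: Optional[str] = None        # UW assigned during transfer, if any
    relation: Optional[str] = None  # label linking to nearest UW ancestor
    attributes: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def render_uw(node):
    """Render a UW with its attributes attached, e.g. read.@entry.@past."""
    return node.uw + "".join(f".@{a}" for a in node.attributes)


def collect(node, head=None, out=None):
    """Walk the tree and emit one relation per UW-bearing, labelled node.

    Nodes without a UW carry no UNL-features and are skipped, but their
    children are still visited and attach to the nearest UW ancestor.
    """
    if out is None:
        out = []
    if node.uw is not None:
        if head is not None and node.relation is not None:
            out.append(
                f"{node.relation}({render_uw(head)}, {render_uw(node)})"
            )
        head = node
    for child in node.children:
        collect(child, head, out)
    return out


# Illustrative tree for a sentence like "Der Mann las das Buch".
tree = Node("S", uw="read", attributes=["entry", "past"], children=[
    Node("NP", children=[
        Node("DET"),  # no UNL-features of its own: suppressed
        Node("N", uw="man", relation="agt", attributes=["def"]),
    ]),
    Node("NP", children=[
        Node("DET"),
        Node("N", uw="book", relation="obj", attributes=["def"]),
    ]),
])

for line in collect(tree):
    print(line)
```

In this sketch the determiner nodes disappear from the output, while their contribution survives as the def attribute on the noun's UW, which corresponds to the treatment of functional words described above.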

In this paper we have tried to show how existing NLP modules can be employed to construct a UNL-Enconverter for German. UNL-Enconverters share the analysis and transfer modules with conventional MT systems. For the German UNL-Enconverter, the CAT2 MT-System can be successfully employed.