An Amharic Stemmer: Reducing Words to their Citation Forms
2007 (English). In: Computational Approaches to Semitic Languages: Common Issues and Resources, 2007. Conference paper (Other academic)
Stemming is an important analysis step in a number of areas such as natural language processing (NLP), information retrieval (IR), machine translation (MT) and text classification. In this paper we present the development of a stemmer for Amharic that reduces words to their citation forms. Amharic is a Semitic language with rich and complex morphology. The intended application of the stemmer is dictionary-based cross-language IR, where the translation step requires looking up terms in a machine-readable dictionary (MRD). We apply a rule-based approach supplemented by occurrence statistics of words in an MRD and in a 3.1M-word news corpus. The main purpose of the statistical supplements is to resolve ambiguity between alternative segmentations. The stemmer is evaluated on Amharic text from two domains: news articles and a classic fiction text. It is shown to have an accuracy of 60% for the old-fashioned fiction text and 75% for the news articles.
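The general idea of rule-based affix stripping with statistical disambiguation can be illustrated with a minimal sketch. This is not the paper's implementation: the affix lists, the toy dictionary, the corpus counts and the function names (`candidate_stems`, `stem`) are all hypothetical placeholders invented for illustration, and the strings are not real Amharic morphology. The sketch generates every segmentation allowed by the affix rules, prefers stems found in the dictionary, and uses corpus frequency to choose between competing candidates.

```python
# Hypothetical placeholder affixes (invented; NOT real Amharic morphology).
PREFIXES = ["", "pa", "ko"]
SUFFIXES = ["", "im", "us"]

MIN_STEM_LEN = 2  # assumed lower bound so that stripping never empties the word


def candidate_stems(word):
    """Yield each stem left after stripping one listed prefix and one listed suffix."""
    for prefix in PREFIXES:
        for suffix in SUFFIXES:
            remaining = len(word) - len(prefix) - len(suffix)
            if (word.startswith(prefix) and word.endswith(suffix)
                    and remaining >= MIN_STEM_LEN):
                yield word[len(prefix):len(word) - len(suffix)]


def stem(word, dictionary, corpus_counts):
    """Pick one stem: dictionary entries first, then highest corpus frequency."""
    candidates = set(candidate_stems(word))
    if not candidates:
        return word  # too short to segment; leave unchanged
    in_dictionary = [c for c in candidates if c in dictionary]
    pool = in_dictionary or sorted(candidates)
    # Ties on frequency are broken by preferring the shorter stem, then
    # alphabetical order, so the result is deterministic.
    return max(pool, key=lambda c: (corpus_counts.get(c, 0), -len(c), c))


# Toy data, invented for the example.
toy_dictionary = {"tala", "patala"}
toy_counts = {"tala": 50, "patala": 3}

print(stem("patalaim", toy_dictionary, toy_counts))
```

In this toy run both "tala" and "patala" are valid dictionary entries reachable from "patalaim", and the corpus counts decide between them. A real stemmer for a Semitic language would also need rules for non-concatenative (root-and-pattern) morphology, which this sketch does not attempt.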
Identifiers: URN: urn:nbn:se:su:diva-12116; OAI: oai:DiVA.org:su-12116; DiVA: diva2:178636