Morphologically Annotated Amharic

This resource has been submitted as a resource paper at SIGIR 2021. It consists of 2 lexicons, 2 corpora and 1 Python program, which can be downloaded here and are detailed below:

Morphologically Annotated Amharic Text Corpora, Tilahun Yeshambel (IT PhD Program, Addis Ababa University, Ethiopia), Josiane Mothe (INSPE, UT2J, IRIT CNRS, Univ. de Toulouse, France), Yaregal Assabie (Department of Computer Science, Addis Ababa Ethiopia).

Lexicons

Root-based lexicon

Stem-based lexicon

A total of 171,070 unique words have been extracted from 6,069 full-text documents that are part of the 2AIRTC collection. They have been manually annotated using both stem-based and root-based morphological forms. Many Amharic surface words are constructed from more than one morphological segment, called a morpheme. Morphemes are divided into prefixes, suffixes, infixes, stems and roots. The morphological annotation was therefore performed by segmenting each word into its morphemes. The general structure of a morphologically annotated word W is:

[ p_ ]* w [ _s ]*

where p is a prefix morpheme, “_” is a morphological segment marker, w is the root or stem of W, s is a suffix morpheme, […] denotes optionality, and * denotes the possibility of multiple occurrences.

For example, the word ከልተስማማናቸውም/kəʔəltəsɨmamanatʃəwɨm/ is annotated as follows.

In this example, the Amharic word is split into 7 morphemes (3 prefixes, a stem and 3 suffixes): the preposition (ከ/kə ‘from’/), the negation prefix (አል/ʔəl ‘not’/), the passive form (ተ/tə/), the verbal stem (ስማም/sɨmam ‘comfort’/), the subject pronoun (አን/ʔən ‘we’/), the object pronoun (አቸው/ʔətʃəw ‘they’/), and the negation marker (ም/mɨ ‘not’/).
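
The structure [ p_ ]* w [ _s ]* can be sketched in Python by joining the morphemes with the segment marker. This is only an illustration: the helper name annotate is hypothetical and is not taken from the distributed script, and the string it builds is assembled from the morphemes listed in the example, not copied from the lexicon files.

```python
SEGMENT_MARKER = "_"


def annotate(prefixes, base, suffixes):
    """Join morphemes following [p_]* w [_s]* (hypothetical helper)."""
    return SEGMENT_MARKER.join([*prefixes, base, *suffixes])


# Morphemes of the example word, as listed in the text.
prefixes = ["ከ", "አል", "ተ"]
stem = "ስማም"
suffixes = ["አን", "አቸው", "ም"]

annotated = annotate(prefixes, stem, suffixes)
segments = annotated.split(SEGMENT_MARKER)

print(annotated)
print(len(segments))  # 7
```

Splitting on the marker recovers the segments, but not which one is the stem; that information has to come from the lexicon or from a list of known affixes.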

Annotated documents

Root-based annotated corpora

Stem-based annotated corpora

Transliterated root-based annotated corpora

Transliterated stem-based annotated corpora

From the lexicons, we rewrote the textual corpus, resulting in two annotated corpora in which each word of the initial corpus is replaced by its annotation from the appropriate lexicon. These corpora are monolingual and are provided as text files, both in UTF-8 encoded Amharic script and in International Phonetic Alphabet (IPA) transliteration.

Each corpus consists of 6,069 full-text documents (an extract from the 2AIRTC document collection) comprising 72,814 sentences or 1,592,351 morphologically annotated words. Each annotated word form contains its base and the full set of morphological features for its inflected and derived form. The grammatical features are separated from each other and from the stem or root by an underscore (_); the radicals of a root are separated from each other by a hyphen (-); and multiple annotations of a word are enclosed in [] with no space between them.
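
The separator conventions above could be read back with a small parser such as the following sketch. It rests on assumptions that should be checked against the actual files: each alternative annotation is taken to sit in its own pair of brackets, and any segment containing a hyphen is taken to be a root written as radicals. The function name parse_token is hypothetical, and the sample tokens are made-up Latin placeholders, not corpus data.

```python
import re


def parse_token(token):
    """Split one annotated token into alternatives, segments and radicals.

    Assumptions (not confirmed by the corpus documentation):
    - alternatives look like "[a_b][c_d]", one bracket pair each;
    - a segment containing "-" is a root given as radicals.
    """
    groups = re.findall(r"\[([^\[\]]*)\]", token)
    alternatives = groups if groups else [token]
    parsed = []
    for alternative in alternatives:
        segments = alternative.split("_")
        radicals = [seg.split("-") for seg in segments if "-" in seg]
        parsed.append({"segments": segments, "radicals": radicals})
    return parsed


# Placeholder tokens for illustration only.
single = parse_token("p_r-d-l_s")
multiple = parse_token("[p_x-y-z][x-y-z_s]")

print(single)
print(len(multiple))  # 2
```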

Rewriting a corpus using the lexicons

A Python script that automatically produces the final morphologically annotated corpora from the documents and the morphologically analyzed lexicons.
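
As a rough idea of what such a rewriting step involves, here is a minimal sketch; it is not the distributed script. It assumes the lexicon has already been loaded into a dictionary that maps each surface word to its annotation, that words are delimited by whitespace, and that words missing from the lexicon are left unchanged. The name rewrite_text and the one-entry toy lexicon (assembled from the morphemes of the example word) are hypothetical.

```python
def rewrite_text(text, lexicon):
    """Replace each whitespace-delimited word by its lexicon annotation.

    Words absent from the lexicon are kept as they are (an assumption).
    """
    return " ".join(lexicon.get(word, word) for word in text.split())


# Toy lexicon with a single entry, for illustration only.
surface = "ከልተስማማናቸውም"
annotation = "_".join(["ከ", "አል", "ተ", "ስማም", "አን", "አቸው", "ም"])
toy_lexicon = {surface: annotation}

rewritten = rewrite_text(surface + " xyz", toy_lexicon)
print(rewritten)
```

Running the same function once with the stem-based lexicon and once with the root-based lexicon would give the two annotated versions of a document.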