2AIRTC: The Amharic Adhoc Information Retrieval Test Collection

PAGE IN PROGRESS

When using this collection, please cite:

Yeshambel, T., Mothe, J., & Assabie, Y. (2020, September). 2AIRTC: The Amharic Adhoc Information Retrieval Test Collection. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 55-66). Springer, Cham.

@inproceedings{yeshambel20202airtc,
  title={2AIRTC: The Amharic Adhoc Information Retrieval Test Collection},
  author={Yeshambel, Tilahun and Mothe, Josiane and Assabie, Yaregal},
  booktitle={International Conference of the Cross-Language Evaluation Forum for European Languages},
  pages={55--66},
  year={2020},
  organization={Springer}
}

This collection consists of a query set, a document set, and query relevance judgments, which are described below and can be downloaded here.

Document set

Documents were collected from news agency websites (Walta Media and Communication Corporate, Fana Broadcasting Corporate, Amhara Mass Media Agency), social media (Facebook), historical documents from a blogger (Daniel Kibret), Amharic Wikipedia, Walta Information Center, and other sources. The documents are diverse in source, domain, and genre: news, religious documents, letters, opinions and reports, and social media. The subjects covered are business, sport, entertainment, education, religion, politics, technology, health, and culture.

Of the collected documents, 12,586 have been assessed for their relevance. Of these, 6,960 documents have been assessed as relevant for at least one topic.

The documents are full length, in plain text form, and have been processed to remove unnecessary parts such as tags and English letters. All the documents are stored in a single text file using a TREC-like format. Each document has a unique document identification number.

The content of each document is enclosed in <TEXT> and </TEXT> tags. Documents are delimited from one another by <DOC> and </DOC> tags. Documents are encoded in UTF-8.
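The layout described above can be read with a few regular expressions. The following is a minimal sketch rather than the official loader: the <DOC> and <TEXT> tags come from the description, but the <DOCNO> tag that holds the document identification number is an assumption borrowed from the usual TREC convention, and the sample string is made up for illustration.

```python
import re
from typing import Dict

# <DOC> and <TEXT> are described on this page; <DOCNO> is an ASSUMPTION
# (standard TREC naming) and may differ in the actual collection file.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>(.*?)</DOCNO>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def parse_documents(raw: str) -> Dict[str, str]:
    """Map each document ID to the text found between <TEXT> and </TEXT>."""
    docs: Dict[str, str] = {}
    for match in DOC_RE.finditer(raw):
        body = match.group(1)
        docno = DOCNO_RE.search(body)
        text = TEXT_RE.search(body)
        if docno is None or text is None:
            continue  # skip records that do not follow the expected layout
        docs[docno.group(1).strip()] = text.group(1).strip()
    return docs


# Made-up two-document sample in the assumed layout.
sample = (
    "<DOC>\n<DOCNO> 1 </DOCNO>\n<TEXT>\nየኢትዮጵያ ዘመን አቆጣጠር\n</TEXT>\n</DOC>\n"
    "<DOC>\n<DOCNO> 2 </DOCNO>\n<TEXT>\nአዲስ አበባ\n</TEXT>\n</DOC>\n"
)
docs = parse_documents(sample)
print(sorted(docs))
```

For the real collection, the file would be read with open(path, encoding="utf-8") before calling parse_documents, where path is whatever name the downloaded file has.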

See the Assessed Document collection and the Full Document collection.

Topic set

<top>	
<num>2</num>  
<title_A> የኢትዮጵያዊያን የዘመን አቆጣጠር </title_A>  
<title_E> Ethiopian calendar   </title_E> 
<desc_A> ስለኢትዮጵያ ዘመን አቆጣጠር ሥርአት የሚያትቱ ሰነዶችን መለየት፡፡ </desc_A> 
<desc_E> Identifying documents discussing on Ethiopian calendar system. </desc_E>  
<narr_A> ስለ ኢትዮጵያ የዘመን አቆጣጠር ታሪክና አመሰራረትት የሚያትቱ ሰነዶች ጥሩ የመረጃ ምንጮች ናቸው፡፡ ከዚህ በተጨማሪ   የበአላት ቀናት  እና የአቆጣጠር ስሌት የሚያትቱ ሰነዶች ጠቃሚ የመረጃ ምንጮች ናቸው፡፡ ይሁን እነጅ ስለአውሮጳውያን የዘመን አቆጣጠር ወይም ሌሎች ሀገሮች የቀን አቆጣጠር የሚገልጹ ሰነዶች ጠቃሚዎች አይደሉም፡፡ እንዲሁም ስለአዲስ አመት የሚያትቱ ሰነዶች ጠቃሚ የመረጃ ምንጮች አይደሉም፡፡</narr_A> 
<narr_E> Documents discussing the origin and history of Ethiopian calendar are good sources of information. In addition, documents explaining about holidays and methods for finding the dates and day in each year are relevant. However, documents discussing on Gregorian calendar or other calendars are not relevant. Moreover, documents discussing on new year are not relevant </narr_E> 
</top>

Figure 2: 2AIRTC topic number 2

The topics were built to reflect real-world information needs and to cover diverse issues. There are 240 topics, some of which concern specific entities (e.g., people, places, or events).

Each topic is written in Amharic and accompanied by its English translation.

Each topic has a unique integer identification number. The title field contains a few search words that describe the topic and could be a typical query submitted to a retrieval system. Topic titles vary in length and type; they include short titles, medium titles, and collocations. The description field describes the topic area, that is, the user's information need, in one or two sentences. The narrative field provides further explanation of each title to help decide which types of documents are relevant and which are not. It consists of more than two sentences. Assessors judge document relevance based on this field. Topics are encoded in UTF-8.
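As an illustration of how these fields could be extracted, here is a minimal sketch that relies only on the tag names visible in topic number 2 above (num, title_A, title_E, desc_A, desc_E, narr_A, narr_E). It assumes every topic in the file uses the same tags with matching closing tags; the function name parse_topics is hypothetical.

```python
import re
from typing import Dict, List

TOP_RE = re.compile(r"<top>(.*?)</top>", re.DOTALL)
# Field names are taken from the example topic; \1 forces the closing tag
# to match the opening one.
FIELD_RE = re.compile(
    r"<(num|title_A|title_E|desc_A|desc_E|narr_A|narr_E)>(.*?)</\1>",
    re.DOTALL,
)


def parse_topics(raw: str) -> List[Dict[str, str]]:
    """Return one dict per <top> block, keyed by field name."""
    topics: List[Dict[str, str]] = []
    for top in TOP_RE.finditer(raw):
        fields = {
            name: " ".join(value.split())  # collapse padding and line breaks
            for name, value in FIELD_RE.findall(top.group(1))
        }
        topics.append(fields)
    return topics


# Shortened copy of topic number 2 (description and narrative omitted).
sample_topic = """<top>
<num>2</num>
<title_A> የኢትዮጵያዊያን የዘመን አቆጣጠር </title_A>
<title_E> Ethiopian calendar   </title_E>
</top>"""
topics = parse_topics(sample_topic)
print(topics[0]["num"], topics[0]["title_E"])
```

Using title_A as the query and keeping narr_A or narr_E for assessors mirrors the roles of the fields described above.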

The topic file can be downloaded here.

Relevance judgments

The relevance judgments indicate the set of documents relevant to each topic. A document is marked as relevant based on the narrative field of the topic; thus, it should not simply contain words from the query but should fulfill the information need.

Each topic has at least 10 relevant documents. The 2AIRTC relevance file was produced using the TREC format, in which each line contains four fields:

topic ID, 0, document ID, relevance

The topic ID and document ID are the unique identification numbers of the topic and the document, respectively. The second field is the constant zero (0), which is the same for all topics and documents. The relevance field gives the relevance value of the document/topic pair: 1 if the document is relevant to the topic, 0 otherwise.
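A relevance file in this four-field layout can be loaded into a mapping from topic to relevant documents. The sketch below assumes the fields are separated by whitespace, as in standard TREC qrels files; the function name parse_qrels and the sample IDs are made up for illustration and are not taken from the collection.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def parse_qrels(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Collect, for each topic ID, the IDs of documents judged relevant."""
    relevant: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        parts = line.split()  # ASSUMPTION: whitespace-separated fields
        if len(parts) != 4:
            continue  # ignore blank or malformed lines
        topic_id, _zero, doc_id, relevance = parts
        if int(relevance) > 0:
            relevant[topic_id].add(doc_id)
    return relevant


# Hypothetical judgments: topic ID, 0, document ID, relevance.
sample_qrels = [
    "2 0 101 1",
    "2 0 102 0",
    "3 0 101 1",
]
qrels = parse_qrels(sample_qrels)
print({topic: sorted(doc_ids) for topic, doc_ids in qrels.items()})
```

With the real file, a check such as all(len(doc_ids) >= 10 for doc_ids in qrels.values()) would confirm the stated minimum of 10 relevant documents per topic.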

The QRELs file can be downloaded here.