MultiFarm

MultiFarm Homepage

This page describes the MultiFarm data set, a comprehensive data set for cross-lingual ontology matching. The data set can be downloaded and used for any scientific purpose.

The original MultiFarm data set is composed of a set of 7 ontologies of the Conference domain (Cmt, Conference, ConfOf, Edas, Ekaw, Iasted, Sigkdd), translated into 8 languages (+English) -- Chinese (cn), Czech (cz), Dutch (nl), French (fr), German (de), Portuguese (pt), Russian (ru), Spanish (es) -- and the corresponding cross-lingual alignments between them. This data set is based on the OntoFarm data set, which has been used successfully for several years in the OAEI Conference track.

The generation and structure of the data set are briefly explained on this web page; more details can be found in the following paper.

Christian Meilicke, Raúl García Castro, Fred Freitas, Willem Robert van Hage, Elena Montiel-Ponsoda, Ryan Ribeiro de Azevedo, Heiner Stuckenschmidt, Ondrej Svab-Zamazal, Vojtech Svatek, Andrei Tamilin, Cássia Trojahn, Shenghui Wang. MultiFarm: A Benchmark for Multilingual Ontology Matching. Web Semantics: Science, Services and Agents on the World Wide Web (15), Elsevier, Amsterdam, 2012. Download the authors' version of the paper

It would be nice if you could inform us (contact below) in case you use the data set in an experimental evaluation.

News

The following enumeration summarises the modifications that have been applied to the data set after its first publication.

Jul 2015: Bugs and translation issues have been fixed (see details).
Jul 2015: The page has been moved from http://web.informatik.uni-mannheim.de/multifarm/.
Jul 2015: Arabic translations have been provided (see details).
Jun 2015: The URL for accessing MultiFarm test suites on the SEALS test repository has been changed. Please use http://repositories.seals-project.eu/tdrs/.
Jan 2013: Due to some issues with the identifiers, a new version of the data set has been uploaded. It is the one used for OAEI 2012. The new version is available via a link below.
Oct 2011: Reference alignments for edas and ekaw have been filtered out from all language pairs. This allows around half of the reference alignments to be used for blind evaluation in OAEI or in other evaluation campaigns, while a good number of test cases remain freely available to improve current matching systems. The same changes have been applied to the data stored in the SEALS platform.
Oct 2011: The meta-description in the data stored in the SEALS platform has been changed. Now all test suites can be found by searching for Multifarm here (user account required). The id cn-cz-multifarm, for example, refers to the Chinese-Czech language pair.

Translations in raw format

The original data set has been generated by translating the existing OntoFarm data set. The results of this first step are available in simple tab-separated text files and can be downloaded below. As indicated above, some translation issues were fixed in 2015. In the tab-separated text files, the first column contains the original English ID of the ontology entity, the second column the original translation, and the third column the fixed/revised translation.

Note that all reference alignments involving edas and ekaw have been filtered out and are used for blind evaluation in OAEI. Hence, the raw translations for these ontologies are not available.
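As a minimal sketch of how the raw translation files could be read, the Python snippet below parses the three-column layout described above. The names `Translation` and `read_translations` are hypothetical, and the sample rows are made up for illustration. The sketch also assumes UTF-8 text and that the third column may be empty or absent when no revision was made; neither assumption is confirmed on this page.

```python
import csv
import io
from typing import NamedTuple, TextIO


class Translation(NamedTuple):
    entity_id: str  # original English ID of the ontology entity
    original: str   # original translation
    revised: str    # fixed/revised translation (assumption: falls back to the original)


def read_translations(handle: TextIO) -> list[Translation]:
    """Parse tab-separated rows: English ID, original translation, revised translation."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not fields or not fields[0].strip():
            continue  # skip blank lines
        entity_id = fields[0].strip()
        original = fields[1].strip() if len(fields) > 1 else ""
        # Assumption: an empty or missing third column means "no revision".
        has_revision = len(fields) > 2 and fields[2].strip() != ""
        revised = fields[2].strip() if has_revision else original
        rows.append(Translation(entity_id, original, revised))
    return rows


# Made-up sample rows for illustration only (not taken from the data set).
sample = "Paper\tArtikel\tBeitrag\nAuthor\tAutor\n"
translations = read_translations(io.StringIO(sample))
```

For a real file, the same function could be called with a handle opened as `open(path, encoding="utf-8", newline="")`, where `path` points to one of the extracted text files.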

Download

Zipped raw translations

Complete bundle with (open) ontologies and reference alignments

The results of the translation have been used to generate language-specific variants of the existing ontologies and reference alignments for all pairs of ontologies. These files are bundled in a single zip file. They can be downloaded and used in any kind of scenario or experiment.

The zip-file is structured as follows:

ont/ 
    cn/
       cmt-cn.owl
       conference-cn.owl
       [one file for each ontology cmt, conference, confOf, iasted, sigkdd]
    cz/ 
       cmt-cz.owl
       conference-cz.owl
       ...
    de/ 
       cmt-de.owl
       conference-de.owl
       ...
    [a directory for each language en, de, fr, ...]
ref/
    cn-cz/
          cmt-cmt-cn-cz.rdf
          cmt-conference-cn-cz.rdf
          cmt-conference-cz-cn.rdf
          cmt-confOf-cn-cz.rdf
          cmt-confOf-cz-cn.rdf
          ...
          conference-conference-cn-cz.rdf
          ...
          [25 files, one for each publicly available reference alignment]
    [a directory for each language pair cn-cz, cn-de, ...]
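The layout above can be turned into a small helper that lists the files expected for one language pair. The following Python sketch infers the naming pattern from the example file names in the listing; the function names are hypothetical, and the pattern (one file when both sides use the same ontology, two files with swapped language order otherwise) is an assumption that should be checked against the extracted bundle.

```python
from itertools import combinations

# Ontologies with publicly available reference alignments
# (edas and ekaw are withheld for blind evaluation).
OPEN_ONTOLOGIES = ["cmt", "conference", "confOf", "iasted", "sigkdd"]


def ontology_path(ontology: str, lang: str) -> str:
    """Path of a translated ontology inside the bundle, e.g. ont/cn/cmt-cn.owl."""
    return f"ont/{lang}/{ontology}-{lang}.owl"


def reference_paths(lang1: str, lang2: str) -> list[str]:
    """Paths of the reference alignments expected for one language pair."""
    folder = f"ref/{lang1}-{lang2}"
    paths = []
    # Same ontology on both sides: one file per ontology.
    for onto in OPEN_ONTOLOGIES:
        paths.append(f"{folder}/{onto}-{onto}-{lang1}-{lang2}.rdf")
    # Two different ontologies: one file per language order (assumed pattern).
    for onto1, onto2 in combinations(OPEN_ONTOLOGIES, 2):
        paths.append(f"{folder}/{onto1}-{onto2}-{lang1}-{lang2}.rdf")
        paths.append(f"{folder}/{onto1}-{onto2}-{lang2}-{lang1}.rdf")
    return paths


cn_cz = reference_paths("cn", "cz")
```

Under this assumed pattern, each language pair yields 5 same-ontology files plus 2 x 10 cross-ontology files. Comparing the generated list with the contents of a `ref/` subdirectory is a quick way to spot missing or unexpected files.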

Download

Zipped bundle v2 (bugs and translation issues fixed - see logs - and Arabic translations added)
Original zipped bundle v1 (changed ontology entity IDs, used in OAEI 2012, 2013 and 2014 - see logs)
Original zipped bundle v0 (2011 version - see logs)

SEALS Test suites

The data set can also be used via the SEALS platform, where we have prepared and stored a test suite for each language pair. You need an account for the SEALS platform to search and retrieve them from the test data repository. For accessing this repository, please refer to the OAEI instructions.

The corresponding MultiFarm test suites have the following identifiers:

MultiFarm identifiers (testing data set)

Repository: http://repositories.seals-project.eu/tdrs/
Suite-ID: [pair-language]
Version-ID: [pair-language]-[v1|v2]

The [pair-language] refers to one of the 45 different language pairs: ar-cn, ar-cz, ar-de, ar-en, ar-es, ar-fr, ar-nl, ar-pt, ar-ru, cn-cz, cn-de, cn-en, cn-es, cn-fr, cn-nl, cn-pt, cn-ru, cz-de, cz-en, cz-es, cz-fr, cz-nl, cz-pt, cz-ru, de-en, de-es, de-fr, de-nl, de-pt, de-ru, en-es, en-fr, en-nl, en-pt, en-ru, es-fr, es-nl, es-pt, es-ru, fr-nl, fr-pt, fr-ru, nl-pt, nl-ru, pt-ru. For instance, ar-cn refers to the test cases involving Arabic and Chinese, while cn-cz refers to the test cases involving Chinese and Czech. For each pair, 25 alignments involving the ontologies Cmt, Conference, ConfOf, Iasted and Sigkdd are available. As described above, the edas and ekaw ontologies are used for blind evaluation.
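The identifier scheme can be reproduced programmatically. The Python sketch below builds the suite identifiers from the ten language codes and composes a version identifier following the `[pair-language]-[v1|v2]` pattern; the function names are hypothetical, and the sketch does not check which versions actually exist for a given pair in the repository.

```python
from itertools import combinations

# Language codes as used on this page, in alphabetical order.
LANGUAGES = ["ar", "cn", "cz", "de", "en", "es", "fr", "nl", "pt", "ru"]
REPOSITORY = "http://repositories.seals-project.eu/tdrs/"


def suite_ids() -> list[str]:
    """All Suite-IDs: one per unordered pair of languages."""
    return [f"{first}-{second}" for first, second in combinations(LANGUAGES, 2)]


def version_id(suite_id: str, version: str) -> str:
    """Compose a Version-ID such as cn-cz-v2 from a Suite-ID and a version tag."""
    if version not in ("v1", "v2"):
        raise ValueError(f"unknown version: {version!r}")
    return f"{suite_id}-{version}"


all_suites = suite_ids()
example_version = version_id("cn-cz", "v2")
```

Because the codes are kept in alphabetical order, `combinations` produces the pairs in the same order as the list above, from ar-cn to pt-ru.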

Evaluation campaigns

The data set has been used in the following experiments:

Involved people

The data set has been generated by a collaborative initiative of the following people.

Arabic: Abderrahmane Khiat

with support from: Moussa Benaissa

Chinese: Shenghui Wang
Czech: Ondrej Zamazal, Vojtech Svatek (owners of OntoFarm data set)
Dutch: Willem Robert van Hage
French: Cassia Trojahn

with support from: Catherine Comparot

German: Christian Meilicke, Heiner Stuckenschmidt

with support from: Dominique Ritze and Jakob Huber

Italian: Davide Tomasi

with support from: Roger Granada

Portuguese: Fred Freitas, Ryan Ribeiro de Azevedo

with support from: Ícaro Medeiros, Fernando Lins, Eric Rommel, and Roberta Fernandes

Russian: Andrei Tamilin
Spanish: Elena Montiel-Ponsoda, Raul Garcia Castro

Contact

This data set is currently maintained by Cassia Trojahn. Please contact her for further information, questions, remarks, or feedback.

Corrected bugs and Updates

MultiFarm v2

Jul 2015: Minor translation issues have been fixed for English, German and Portuguese.
Jul 2015: French translations have been reviewed (some issues fixed and translations better homogenised across ontologies).
Jul 2015: Entities with null labels have been removed. These entities referred to the "owl:Thing" entity and had been incorrectly created when generating the translated ontologies ("thing" was not in the raw translation list, so "null" was written as the label instead).
Jul 2015: ISO language codes for Chinese and Czech have been corrected.
Jul 2015: Arabic translations have been added.

MultiFarm v1

Jan 2013: Due to some issues with the identifiers, a new version of the data set has been uploaded (v1). It is the one used for OAEI 2012.

Acknowledgements

We thank the users of the data set who have detected bugs and translation issues:

Konstantin Todorov (Sep 2014) reported some issues in the French translations.
Peter Geibel (Sep 2014) reported some issues in the German translations.
Heiko Paulheim reported (Feb 2012) that MultiFarm used incorrect ISO language codes for some languages. The bug affected Chinese (the correct code is zh, not cn) and Czech (the correct code is cs, not cz).

Colophon

The logo at the top of this page is a modified version of a logo often used to refer to the Semantic Web. We have added the Chinese characters for 'many' and 'language' to the original logo.