This page describes MultiFarm, a comprehensive data set for cross-lingual ontology matching. The data set can be downloaded and used for any scientific purpose.
The original MultiFarm data set is composed of a set of 7 ontologies of the Conference domain (Cmt, Conference, ConfOf, Edas, Ekaw, Iasted, Sigkdd), translated into 8 languages (+English) -- Chinese (cn), Czech (cz), Dutch (nl), French (fr), German (de), Portuguese (pt), Russian (ru), Spanish (es) -- and the corresponding cross-lingual alignments between them. This data set is based on the OntoFarm data set, which has been used successfully for several years in the OAEI Conference track.
The generation and structure of the data set are briefly explained on this web page; more details can be found in the following paper.
Christian Meilicke, Raúl García Castro, Fred Freitas, Willem Robert van Hage, Elena Montiel-Ponsoda, Ryan Ribeiro de Azevedo, Heiner Stuckenschmidt, Ondrej Svab-Zamazal, Vojtech Svatek, Andrei Tamilin, Cássia Trojahn, Shenghui Wang. MultiFarm: A Benchmark for Multilingual Ontology Matching. Web Semantics: Science, Services and Agents on the World Wide Web (15), Elsevier, Amsterdam, 2012. Download the authors' version of the paper
We would appreciate it if you could inform us (contact below) when you use the data set in an experimental evaluation.
The following enumeration summarises the modifications that have been applied to the data set after its first publication.
The original data set was generated by translating the existing OntoFarm data set. The results of this first step are available as simple tab-separated text files and can be downloaded below. As indicated above, some translation issues were fixed in 2015. In the tab-separated text files, the first column contains the original English ID of each ontology entity, the second column the original translation, and the third column the fixed/revised translation.
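The three-column layout described above can be read with a few lines of Python. This is only a minimal sketch: the function name `read_translations`, the sample rows, and the fallback to the original translation when the third column is empty or missing are assumptions made for illustration, not part of the data set documentation.

```python
import csv
import io


def read_translations(handle):
    """Yield (entity_id, original, revised) tuples from a tab-separated file.

    Assumption: a row with an empty or missing third column was not revised,
    so the original translation is reused as the revised one.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or not row[0].strip():
            continue  # skip blank lines
        row = row + [""] * (3 - len(row))  # pad short rows to three columns
        entity_id, original, revised = (cell.strip() for cell in row[:3])
        yield entity_id, original, revised or original


# Hypothetical sample rows, not taken from the actual files.
sample = io.StringIO("Paper\tpapier\tArtikel\nAuthor\tAutor\n")
for entity_id, original, revised in read_translations(sample):
    print(entity_id, original, revised)
```

For a real file, an open file handle (for example `open(path, encoding="utf-8", newline="")`) can be passed in place of the `io.StringIO` object; the encoding of the files is also an assumption.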
Note that all reference alignments involving edas and ekaw have been filtered out and are used for blind evaluation in OAEI. Hence, the raw translations for these ontologies are not available.
Zipped raw translations
The results of the translation have been used to generate language-specific variants of the existing ontologies and reference alignments for all pairs of ontologies. These files are bundled in a single zip file. They can be downloaded and used in any kind of scenario or experiment.
The zip file is structured as follows:
ont/
    cn/
        cmt-cn.owl
        conference-cn.owl
        [one file for each ontology cmt, conference, confOf, iasted, sigkdd]
    cz/
        cmt-cz.owl
        conference-cz.owl
        ...
    de/
        cmt-de.owl
        conference-de.owl
        ...
    [a directory for each language en, de, fr, ...]
ref/
    cn-cz/
        cmt-cmt-cn-cz.rdf
        cmt-conference-cn-cz.rdf
        cmt-conference-cz-cn.rdf
        cmt-confOf-cn-cz.rdf
        cmt-confOf-cz-cn.rdf
        ...
        conference-conference-cn-cz.rdf
        ...
        [24 files for each publicly available reference alignment]
    [a directory for each language pair cn-cz, cn-de, ...]
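The naming pattern in the listing above can be turned into small helpers for locating files inside the unpacked bundle. This is a sketch under stated assumptions: the helper names `ontology_path` and `reference_path` are hypothetical, and the rule that each directory under ref/ is named after the alphabetically sorted language pair is inferred from the listing rather than documented.

```python
from pathlib import PurePosixPath


def ontology_path(ontology, lang):
    """Relative path of one translated ontology, e.g. ont/cn/cmt-cn.owl."""
    return PurePosixPath("ont") / lang / f"{ontology}-{lang}.owl"


def reference_path(onto1, lang1, onto2, lang2):
    """Relative path of the reference alignment between onto1 (in lang1)
    and onto2 (in lang2).

    Assumption: the directory name is the alphabetically sorted language
    pair, while the file name keeps the order of the arguments.
    """
    pair_dir = "-".join(sorted((lang1, lang2)))
    file_name = f"{onto1}-{onto2}-{lang1}-{lang2}.rdf"
    return PurePosixPath("ref") / pair_dir / file_name


print(ontology_path("cmt", "cn"))                       # ont/cn/cmt-cn.owl
print(reference_path("cmt", "cz", "conference", "cn"))  # ref/cn-cz/cmt-conference-cz-cn.rdf
```

Both printed paths match entries shown in the listing; other combinations should be checked against the actual contents of the zip file.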
Zipped bundle v2 (bugs and translation issues fixed - see logs - and Arabic translations added)
Original zipped bundle v1 (ontology entity IDs changed; used in OAEI 2012, 2013 and 2014 - see logs)
Original zipped bundle v0 (2011 version - see logs)
The data set can also be used via the SEALS platform, where we have prepared and stored a test suite for each language pair. You need an account for the SEALS platform to search and retrieve them from the test data repository. For accessing this repository, please refer to the OAEI instructions.
The corresponding MultiFarm test suites have the following identifiers:
The [pair-language] refers to one of the 45 different language pairs: ar-cn, ar-cz, ar-de, ar-en, ar-es, ar-fr, ar-nl, ar-pt, ar-ru, cn-cz, cn-de, cn-en, cn-es, cn-fr, cn-nl, cn-pt, cn-ru, cz-de, cz-en, cz-es, cz-fr, cz-nl, cz-pt, cz-ru, de-en, de-es, de-fr, de-nl, de-pt, de-ru, en-es, en-fr, en-nl, en-pt, en-ru, es-fr, es-nl, es-pt, es-ru, fr-nl, fr-pt, fr-ru, nl-pt, nl-ru, pt-ru. For instance, ar-cn refers to the test cases involving Arabic and Chinese, while cn-cz refers to the test cases involving Chinese and Czech. For each pair, 25 alignments involving the ontologies Cmt, Conference, ConfOf, Iasted and Sigkdd are available. As described above, the edas and ekaw ontologies are used for blind evaluation.
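The 45 language pairs listed above are exactly the unordered pairs of the ten language codes, written in alphabetical order. A minimal sketch that enumerates them follows; the variable names are illustrative, and reading the 25 alignments per pair as all ordered combinations of the five public ontologies is an interpretation that matches the stated count, not a documented rule.

```python
from itertools import combinations, product

LANGUAGES = ["ar", "cn", "cz", "de", "en", "es", "fr", "nl", "pt", "ru"]
ONTOLOGIES = ["cmt", "conference", "confOf", "iasted", "sigkdd"]

# Unordered language pairs, each written with the codes in alphabetical order.
language_pairs = ["-".join(pair) for pair in combinations(sorted(LANGUAGES), 2)]

# Assumed reading of "25 alignments per pair": every ordered combination of
# the five public ontologies, including an ontology paired with itself.
ontology_combinations = list(product(ONTOLOGIES, repeat=2))

print(len(language_pairs))         # 45
print(language_pairs[0])           # ar-cn
print(language_pairs[-1])          # pt-ru
print(len(ontology_combinations))  # 25
```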
The data set has been used in the following experiments:
The data set has been generated by a collaborative initiative of the following people.
This data set is currently maintained by Cassia Trojahn. Please contact her for further information, questions, remarks, or feedback.
MultiFarm v2 :
We thank the users of the data set who have detected bugs and translation issues:
The logo at the top of this page is a modified version of a logo often used to refer to the Semantic Web. We have added the Chinese characters for 'many' and 'language' to the original logo.