Paper 3

An Adaptive Similarity Search in Massive Datasets

Authors: Trong Nhan Phan, Josef Küng, Tran Khanh Dang

Volume 23 (2015)

Abstract

Similarity search is an important task in many fields of study and application domains. The era of big data, however, poses challenges for existing information systems in general and for similarity search in particular. Aiming at large-scale data processing, we propose an adaptive similarity search in massive datasets with MapReduce. Our proposed scheme is both applicable and adaptable to popular similarity search cases such as pairwise similarity, search-by-example, range queries, and k-Nearest Neighbour queries. Moreover, we embed our collaborative refinements to effectively minimize irrelevant data objects as well as unnecessary computations. Furthermore, we evaluate our proposed methods with two different document models, known as shingles and terms. Finally, we conduct intensive empirical experiments on real datasets, not only to verify these methods themselves but also to compare them with previous related work. The results confirm the effectiveness of our proposed methods and show that they outperform the previous work in terms of query processing.