Paper 4

A Scalable Expressive Ensemble Learning using Random Prism: A MapReduce Approach

Authors: Frederic Stahl, David May, Hugo Mills, Max Bramer, and Mohamed Medhat Gaber

Volume 20 (2015)

Abstract

The induction of classification rules from previously unseen examples is one of the most important data mining tasks in science as well as commercial applications. In order to reduce the influence of noise in the data, ensemble learners are often applied. However, most ensemble learners are based on decision tree classifiers, which are affected by noise. The Random Prism classifier has recently been proposed as an alternative to the popular Random Forests classifier, which is based on decision trees. Random Prism is based on the Prism family of algorithms, which is more robust to noise. However, like most ensemble classification approaches, Random Prism also does not scale well on large training data. This paper presents a thorough discussion of Random Prism and a recently proposed parallel version of it called Parallel Random Prism. Parallel Random Prism is based on the MapReduce programming paradigm. The paper provides, for the first time, a novel theoretical analysis of the proposed technique and an in-depth experimental study that show that Parallel Random Prism scales well on a large number of training examples, a large number of data features and a large number of processors. The expressiveness of the decision rules that our technique produces makes it a natural choice for Big Data applications where informed decision making increases the user's trust in the system.
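
To make the MapReduce decomposition behind Parallel Random Prism concrete, the sketch below illustrates the general pattern in Python under stated assumptions: each "mapper" independently induces a base classifier on a bootstrap sample of the training data, and a "reducer" combines the base classifiers by majority vote. The toy one-rule learner, the map_task/reduce_task names and the use of Python's ProcessPoolExecutor are illustrative stand-ins only, not the authors' Hadoop-based implementation or the Prism algorithm itself.

```python
# Minimal conceptual sketch of a MapReduce-style ensemble (bagging):
# mappers train independent base classifiers on bootstrap samples,
# a reducer combines their predictions by majority vote.
import random
from collections import Counter
from concurrent.futures import ProcessPoolExecutor


def train_one_rule(sample, rng):
    """Toy stand-in for a Prism-family rule inducer: pick a random feature,
    split at its median, and predict the majority class on each side."""
    f = rng.randrange(len(sample[0][0]))
    values = sorted(x[f] for x, _ in sample)
    threshold = values[len(values) // 2]
    left = [y for x, y in sample if x[f] <= threshold]
    right = [y for x, y in sample if x[f] > threshold]
    majority = lambda ys: Counter(ys).most_common(1)[0][0]
    return (f, threshold, majority(left),
            majority(right) if right else majority(left))


def map_task(args):
    """Mapper: draw a bootstrap sample and induce one base classifier."""
    data, seed = args
    rng = random.Random(seed)
    sample = [rng.choice(data) for _ in range(len(data))]
    return train_one_rule(sample, rng)


def predict(rule, x):
    """Apply a single induced rule to one instance."""
    f, threshold, left_class, right_class = rule
    return left_class if x[f] <= threshold else right_class


def reduce_task(rules, x):
    """Reducer: majority vote over all base classifiers."""
    return Counter(predict(r, x) for r in rules).most_common(1)[0][0]


if __name__ == "__main__":
    data = [((0.1, 0.2), "yes"), ((0.2, 0.1), "yes"),
            ((0.9, 0.8), "no"), ((0.8, 0.9), "no")]
    with ProcessPoolExecutor() as pool:
        rules = list(pool.map(map_task, [(data, s) for s in range(8)]))
    print(reduce_task(rules, (0.15, 0.15)))  # most likely "yes"
```

Because the mappers share no state, the training work distributes naturally over processors, which is the property the paper's scalability analysis exploits; the paper itself builds each base classifier with a Prism-family rule inducer rather than the toy learner shown here.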