Paper 3

Efficient Online Aggregates in Dense-Region-Based Data Cube Representations

Authors: Kais Haddadin and Tobias Lauer

Volume 2 (2010)

Abstract

In-memory OLAP systems require a space-efficient representation of sparse data cubes in order to accommodate large data sets. On the other hand, many efficient online aggregation techniques, such as prefix sums, are built on dense array-based representations. These are often not applicable to real-world data because of the size of the arrays, which usually cannot be compressed well, as most sparsity is removed during pre-processing. A possible solution is to identify dense regions in a sparse cube and represent only those using arrays, while storing sparse data separately, e.g. in a spatial index structure. Previous dense-region-based approaches have concentrated mainly on the effectiveness of the dense-region detection (i.e. on the space-efficiency of the result). However, especially in higher-dimensional cubes, data is usually more cluttered, resulting in a potentially large number of small dense regions, which negatively affects query performance on such a structure. In this article, our focus is not only on space-efficiency but also on time-efficiency, both for the initial dense-region extraction and for queries carried out on the resulting hybrid data structure. We first describe a pre-aggregation method for representing dense sub-cubes which supports efficient online aggregate queries as well as cell updates, and then outline our sub-cube extraction approach in detail. Optimizations in our approach significantly reduce the time needed to build the initial data structure compared to former systems. We provide two methods to trade available memory for increased aggregate query performance. We also present a straightforward adaptation of our approach to multi-core or multi-processor architectures, which can further enhance query performance. Experiments with different real-world data sets show how various parameter settings can be used to adjust the efficiency and effectiveness of our algorithms.
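To make the hybrid idea in the abstract concrete, the following minimal two-dimensional sketch answers a range-sum query by combining a prefix-sum array for one dense region with a separate store for sparse cells. It is an illustration under stated assumptions, not the authors' method: the function names are hypothetical, a plain dictionary with a linear scan stands in for the spatial index, and ordinary prefix sums are used in place of the paper's pre-aggregation scheme (plain prefix sums do not support efficient cell updates).

```python
def build_prefix_sums(region):
    """Return P where P[i][j] is the sum of region[0..i-1][0..j-1].

    P has one extra row and column of zeros so queries need no edge cases.
    """
    rows, cols = len(region), len(region[0])
    prefix = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        row_total = 0
        for j in range(cols):
            row_total += region[i][j]
            prefix[i + 1][j + 1] = prefix[i][j + 1] + row_total
    return prefix


def dense_range_sum(prefix, r1, c1, r2, c2):
    """Sum of the inclusive box (r1..r2, c1..c2) in region-local coordinates."""
    return (prefix[r2 + 1][c2 + 1] - prefix[r1][c2 + 1]
            - prefix[r2 + 1][c1] + prefix[r1][c1])


def hybrid_range_sum(prefix, origin, sparse_cells, lo, hi):
    """Range sum over the inclusive box lo..hi in cube coordinates.

    prefix       -- prefix sums of a single dense region
    origin       -- cube coordinates of the region's cell (0, 0)
    sparse_cells -- dict {(row, col): value} for cells outside the region
                    (assumption: stands in for a spatial index)
    """
    r0, c0 = origin
    rows, cols = len(prefix) - 1, len(prefix[0]) - 1

    # Clip the query box to the dense region, in region-local coordinates.
    r1, c1 = max(lo[0] - r0, 0), max(lo[1] - c0, 0)
    r2, c2 = min(hi[0] - r0, rows - 1), min(hi[1] - c0, cols - 1)

    total = 0
    if r1 <= r2 and c1 <= c2:
        total += dense_range_sum(prefix, r1, c1, r2, c2)

    # Linear scan over the sparse cells; a real system would query an index.
    for (r, c), value in sparse_cells.items():
        if lo[0] <= r <= hi[0] and lo[1] <= c <= hi[1]:
            total += value
    return total


# Example: a 2x3 dense region placed at cube position (10, 10),
# plus two isolated sparse cells.
dense_prefix = build_prefix_sums([[1, 2, 3], [4, 5, 6]])
sparse = {(0, 0): 7, (50, 3): 9}
result = hybrid_range_sum(dense_prefix, (10, 10), sparse, (0, 0), (10, 11))
```

In this sketch the dense part of a query costs a constant number of array lookups per region, while the sparse part depends on the separate store. With many small dense regions, the per-region overhead adds up, which is the query-performance concern the abstract raises.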