Paper 3

Engineering Runtime Root Cause Analysis of Detected Anomalies

Authors: Zisis Flokas, Anastasios Gounaris

Volume 55 (2023)

Abstract

The main objective of this work is to provide a unified, easy-to-configure and extensible end-to-end system that applies root cause analysis (RCA) methods on top of anomaly detection (AD) methods in an online setting. AD-focused RCA for online settings has not been investigated so far; our work can therefore be seen as an initial approach to this end. Inspired by the solutions developed in the ThirdEye project, which is coupled with the Apache Pinot data warehousing system, we re-engineer ThirdEye’s RCA components and techniques so that they directly ingest input records from Apache Kafka and continuously compute aggregates at different levels of granularity in a principled manner, both to answer OLAP queries and to provide the baselines that RCA requires. To attain scalability, we build our solution on the Apache Flink stream processing engine. This work presents the main design choices made when applying ThirdEye’s concepts to data streams, together with indicative examples and scalability experiments. Our solution is available as open source.

Keywords

root cause analysis, anomaly detection, data streams, Flink, Kafka