Paper 4

ETL Processes in the Era of Variety

Authors: Nabila Berkani, Ladjel Bellatreche, Laurent Guittet

Volume 39 (2018)

Abstract

Nowadays, we live in an open and connected world, where small, medium, and large companies seek to integrate data from various sources to satisfy the requirements of new applications such as delivering real-time alerts, triggering automated actions, detecting complex system failures, spotting anomalies, etc. The process of getting these data from their sources to their home system in an efficient and correct manner is known as data ingestion, usually referred to as Extract, Transform, Load (ETL), and has been widely studied in data warehouses. In a context of rapidly changing technology and an explosion of data sources, ETL processes have to address two main issues: (a) the variety of data sources, which span traditional, XML, semantic, graph databases, etc., and (b) the variety of storage platforms, where the home system may comprise several stores (a configuration known as a polystore), each hosting a particular type of data. These issues directly impact the efficiency and deployment flexibility of ETL. In this paper, we deal with both issues. Firstly, thanks to Model Driven Engineering, we make the different types of data sources generic. This genericity allows the ETL operators to be overloaded for each type of source, and is illustrated on three of the most popular types of data sources: relational, semantic, and graph databases. Secondly, we show the impact of operator genericity on the ETL workflow, for which a Web-service-driven approach for orchestrating the ETL flows is given. Thirdly, the data extracted and merged by the ETL workflow are deployed to their most suitable stores. Finally, our findings are validated through a proof-of-concept tool using the LUBM semantic database and the Yago graph, deployed in Oracle RDF Semantic Graph 12c.
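
To make the overloading idea concrete, here is a minimal sketch of how a single ETL operator (Extract) could be specialized for relational, semantic, and graph sources, in the spirit of the genericity the abstract describes. It is not the authors' implementation; all class and method names (GenericSource, extract, run_extract, etc.) are hypothetical illustrations.

```python
# A minimal sketch of ETL operator overloading over generic source types.
# All names here are hypothetical, not the paper's actual API.
from abc import ABC, abstractmethod


class GenericSource(ABC):
    """A data source made generic: each subtype wraps one kind of store."""

    @abstractmethod
    def extract(self, entity: str) -> list[dict]:
        """EXTRACT operator, overloaded per source type."""


class RelationalSource(GenericSource):
    def __init__(self, connection):
        self.connection = connection  # e.g. a DB-API connection (sqlite3)

    def extract(self, entity: str) -> list[dict]:
        # Overload for relational sources: the entity maps to a table.
        cursor = self.connection.execute(f"SELECT * FROM {entity}")
        columns = [col[0] for col in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]


class SemanticSource(GenericSource):
    def __init__(self, triples):
        self.triples = triples  # (subject, predicate, object) RDF triples

    def extract(self, entity: str) -> list[dict]:
        # Overload for semantic sources: the entity maps to an rdf:type.
        return [
            {"subject": s, "predicate": p, "object": o}
            for (s, p, o) in self.triples
            if p == "rdf:type" and o == entity
        ]


class GraphSource(GenericSource):
    def __init__(self, nodes):
        self.nodes = nodes  # node id -> {"label": ..., "properties": {...}}

    def extract(self, entity: str) -> list[dict]:
        # Overload for graph sources: the entity maps to a node label.
        return [
            {"id": node_id, **attrs["properties"]}
            for node_id, attrs in self.nodes.items()
            if attrs["label"] == entity
        ]


def run_extract(sources: list[GenericSource], entity: str) -> list[dict]:
    """A workflow step invokes the same operator on every source;
    dynamic dispatch selects the right overload for each source type."""
    records = []
    for source in sources:
        records.extend(source.extract(entity))
    return records
```

Under this assumption, a workflow orchestrator (web-service-driven or otherwise) only ever sees the generic operator signature, so the same ETL flow can span relational, semantic, and graph sources without per-source branching.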