Abstract. Multivariate functional data refer to a population of multivariate functions generated by a system involving dynamic parameters depending on continuous variables (e.g., multivariate time series). Outlier detection in such a context is a challenging problem because both the individual behavior of the parameters and the dynamic correlation between them are important. To address this problem, recent work has focused on multivariate functional depth to identify the outliers in a given dataset. However, most previous approaches fail when the outlyingness manifests itself in curve shape rather than curve magnitude. In this paper, we propose identifying outliers in multivariate functional data by a method whereby different outlying features are captured based on mapping functions from differential geometry. In this regard, we extract shape features reflecting the outlyingness of a curve with a high degree of interpretability. We conduct an experimental study on real and synthetic data sets and compare the proposed method with functional-depth-based methods. The results demonstrate that the proposed method, combined with state-of-the-art outlier detection algorithms, can outperform the functional-depth-based methods. Moreover, in contrast with the baseline methods, it is efficient regardless of the proportion of outliers.
F. Ravat, J. Song, O. Teste, C. Trojahn Efficient querying of multidimensional RDF data with aggregates: comparing NoSQL, RDF and relational data stores International Journal of Information Management, Elsevier Science Publisher 2020
Abstract. This paper proposes an approach to tackle the problem of querying large volumes of statistical RDF data. Our approach relies on pre-aggregation strategies to better manage the analysis of this kind of data. Specifically, we define a conceptual model to represent original RDF data with aggregates in a multidimensional structure. A set of translation rules for converting a well-known multidimensional RDF modelling vocabulary into the proposed conceptual model is then proposed. We implement the conceptual model in six different data stores: two RDF triple stores (Jena TDB and Virtuoso), one graph-oriented NoSQL database (Neo4j), one column-oriented data store (Cassandra), and two relational databases (MySQL and PostgreSQL). We compare the querying performance, with and without aggregates, in these data stores. Experimental results, on real-world datasets containing 81.92 million triplets, show that pre-aggregation reduces query runtime in all data stores. Neo4j and the relational databases with aggregates outperform the triple stores, reducing query runtime by up to 99%.
O. Coustié, X. Baril, J. Mothe, O. Teste METING: A Robust Log Parser Based on Frequent n-Gram Mining IEEE International Conference on Web Services (ICWS’20), Beijing, China 2020
Abstract. Execution logs are a pervasive resource to monitor modern information systems. Due to the lack of structure in raw log datasets, log parsing methods are used to automatically retrieve the structure of logs and gather logs of common templates. Parametric log parsers are commonly preferred since they can modulate their behaviour to fit different types of datasets. These methods rely on strong syntactic assumptions about log structure, e.g., that all logs of a common template have the same number of words. Yet, some reference datasets do not comply with these assumptions and are still not effectively handled by any state-of-the-art log parser. We propose a new parametric log parser based on frequent n-gram mining: this soft, text-driven approach offers a more flexible syntactic representation of logs, which fits the great majority of log data, especially the challenging ones. Our comprehensive evaluations show that the approach is robust and clearly outperforms existing methods on these challenging datasets.
N. El Malki, R. Cugny, F. Ravat, O. Teste DECWA: Density-Based Clustering using Wasserstein Distance 29th International Conference on Information and Knowledge Management (CIKM’20), Galway, Ireland 2020
Abstract. Clustering is a data analysis method for extracting knowledge by discovering groups of data called clusters. Among these methods, state-of-the-art density-based clustering methods have proven to be effective for arbitrary-shaped clusters. Despite their encouraging results, they struggle with low-density clusters, with nearby clusters of similar densities, and with high-dimensional data. We propose a new characterization of clusters and a new clustering algorithm based on spatial density and a probabilistic approach. First, sub-clusters are built using spatial density, represented as the probability density function (p.d.f.) of pairwise distances between points. A method is then proposed to agglomerate similar sub-clusters by using both their density (p.d.f.) and their spatial distance. The key idea we propose is to use the Wasserstein metric, a powerful tool to measure the distance between the p.d.f. of sub-clusters. We show that our approach outperforms other state-of-the-art density-based clustering methods on a wide variety of datasets.
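To make the key idea concrete, here is a minimal sketch (not the DECWA implementation; the helper name and sample data are assumptions) of comparing two sub-clusters through the Wasserstein distance between their pairwise-distance distributions, using SciPy:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import wasserstein_distance

def subcluster_density_distance(a, b):
    # Represent each sub-cluster by the empirical distribution of its
    # pairwise point distances, then compare the two distributions
    # with the 1-D Wasserstein metric.
    return wasserstein_distance(pdist(a), pdist(b))

rng = np.random.default_rng(0)
dense = rng.normal(0.0, 0.1, size=(50, 2))   # tight sub-cluster
sparse = rng.normal(0.0, 1.0, size=(50, 2))  # diffuse sub-cluster

d_same = subcluster_density_distance(dense, dense)  # identical densities
d_diff = subcluster_density_distance(dense, sparse)
```

Two sub-clusters with similar density profiles yield a small distance and would be candidates for agglomeration.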
O. Coustié, X. Baril, J. Mothe, O. Teste Application Performance Anomaly Detection with LSTM on Temporal Irregularities in Logs 29th International Conference on Information and Knowledge Management (CIKM’20), Galway, Ireland 2020
Abstract. Performance anomalies are a core problem in modern information systems, affecting the execution of the hosted applications. The detection of these anomalies often relies on the analysis of the application execution logs. The current most effective approach is to detect samples that differ from a learnt nominal model. However, current methods often focus on detecting sequential anomalies in logs, neglecting the time elapsed between logs, which is a core component of performance anomaly detection. In this paper, we develop a new model for performance anomaly detection that captures temporal deviations from the nominal model by means of a sliding-window data representation. The nominal model is trained with a Long Short-Term Memory neural network, which is well suited to representing complex sequential dependencies. We assess the effectiveness of our model on both simulated and real datasets. We show that it is more robust to temporal variations than current state-of-the-art approaches, while remaining as effective.
O. El Rifai, M. Biotteau, X. Deboissezon, I. Medgiche, F. Ravat, O. Teste Blockchain-based Federated Learning in Medicine International Conference on Artificial Intelligence in Medicine (AIME'20), Minneapolis, USA 2020
Abstract. Worldwide epidemic events have confirmed the need for medical data processing tools while bringing issues of data privacy, transparency and usage consent to the fore. Federated learning and the blockchain are two technologies that tackle these challenges and have been shown to be beneficial in medical contexts, where data are often distributed and come from different sources. In this paper, we propose to integrate these two technologies for the first time in a medical setting. In particular, we propose an implementation of a coordinating server for a federated learning algorithm to share information for improved predictions while ensuring data transparency and usage consent. We illustrate the approach with a prediction decision-support tool applied to a diabetes dataset. The particular challenges of medical contexts are detailed, and a prototype implementation is presented to validate the solution.
I. Ben Kraiem, F. Ghozzi, A. Péninou, G. Roman-Jimenez, O. Teste Automatic Classification Rules for Anomaly Detection in Time-series 14th International Conference on Research Challenges in Information Science (RCIS’20), Limassol, Cyprus 2020
Abstract. Anomaly detection in time-series is an important issue in many applications. It is particularly hard to accurately detect multiple anomalies in a time-series. Pattern discovery and rule extraction are effective solutions for multiple anomaly detection. In this paper, we define a Composition-based Decision Tree algorithm that automatically discovers and generates human-understandable classification rules for multiple anomaly detection in time-series. To evaluate our solution, our algorithm is compared to other anomaly detection algorithms on real datasets and benchmarks.
O. El Rifai, M. Biotteau, X. Deboissezon, I. Medgiche, F. Ravat, O. Teste Blockchain-Based Personal Health Records for Patients’ Empowerment 14th International Conference on Research Challenges in Information Science (RCIS’20), Limassol, Cyprus 2020
Abstract. With the current trend of patient-centric health care, blockchain-based Personal Health Record (PHR) frameworks have been emerging. The adoption of these frameworks is still in its infancy and depends on a broad range of factors. In this paper, we look at some of the typical concerns raised by a centralized medical records solution such as the one deployed in France. Based on the state-of-the-art literature on Electronic Health Records (EHRs) and PHRs, we discuss the main implementation bottlenecks that can be encountered when deploying a blockchain solution and how to avoid them. In particular, we explore these bottlenecks in the context of the French PHR system and suggest some recommendations for a paradigm shift towards patients’ empowerment.
A. Yewgat, D. Busby, M. Chevalier, C. Lapeyre, O. Teste Deep-CRM: A New Deep Learning Approach For Capacitance Resistive Models 17th European Conference On The Mathematics Of Oil Recovery (ECMOR'20) 2020
Abstract. Data-driven models can represent a suitable alternative to classical reservoir modelling as they require much less computation time and fewer allocated resources. Among such models are Capacitance Resistive Models (CRMs), based on a set of coupled ordinary differential equations (ODEs) representing material balance. The aim of this work is to propose a complete approach to optimize the CRM parameters and forecast future production. This approach makes no assumptions about injections or about producers' Bottom Hole Pressure. To this end, we introduce a new approach, called Deep-CRM, based on a deep learning strategy: Physics-Informed Neural Networks (PINNs) for CRMs. Experiments are conducted to compare our approach to the nonlinear multivariate regression of the closed-form solution. These experiments are based on two datasets: the first is a synthetic dataset generated using ECLIPSE® and SISMAGE®, and the second is a real field dataset provided by one of our affiliates.
E. Maître, Z. Chemli, M. Chevalier, B. Dousset, J-P. Gitto, O. Teste Event detection and time series alignment to improve stock market forecasting Joint Conference of the Information Retrieval Communities in Europe (CIRCLE'20), Samatan, France 2020
Abstract. Buying commodities is a critical issue for multiple industries because variations in stock prices are induced not only by multiple economic parameters but also by external events. Raw-material buyers must keep track of information in numerous fields, which constitutes a major challenge considering the exponential growth of online data. To tackle this issue, we propose an event detection approach to assist them in their anticipation process. Indeed, a lot of contextual information is contained in text, and exploiting it can improve one's anticipation ability. Thus, we develop a framework for event detection and qualification, then quantify the impact of these events on the stock market to help buyers in their anticipation process. In this paper, we first introduce our context, then explain the scope of our work and our goals. After detailing the related work, we present our proposition, conclude and propose some directions for future work.
C. Lejeune, J. Mothe, O. Teste Outlier detection in multivariate functional data based on a geometric aggregation 23rd EDBT/ICDT Joint Conference, International Conference on Extending Database Technology (EDBT/ICDT’20), Copenhagen, Denmark 2020
Abstract. The increasing ubiquity of multivariate functional data (MFD) requires methods that can properly detect outliers within such data, where a sample corresponds to p > 1 parameters observed with respect to (w.r.t.) a continuous variable (e.g., time). We improve outlier detection in MFD by adopting a geometric view of the data space while combining the new data representation with state-of-the-art outlier detection algorithms. The geometric representation of MFD as paths in the p-dimensional Euclidean space makes it possible to implicitly take into account the correlation between the parameters w.r.t. the continuous variable. We experimentally show that our method is robust to various rates of outliers in the training set when fitting the outlier detection model and can detect outliers that are not detected by standard algorithms.
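As an illustration of this geometric view (a simplified sketch, not the paper's actual feature set; `path_length` and the sample curves are assumptions), a p-variate sample can be treated as a path in R^p and summarized by geometric quantities such as its length, which reacts to shape outlyingness that magnitude-based measures miss:

```python
import numpy as np

def path_length(curve):
    # curve: (n_timepoints, p) array; total length of the path
    # traced in R^p, i.e. the sum of Euclidean steps.
    return np.linalg.norm(np.diff(curve, axis=0), axis=1).sum()

t = np.linspace(0.0, 2.0 * np.pi, 200)
smooth = np.stack([np.cos(t), np.sin(t)], axis=1)   # unit circle, length ~ 2*pi
wiggly = smooth + 0.1 * np.sin(25.0 * t)[:, None]   # similar magnitude, different shape
```

The wiggly curve stays close to the smooth one pointwise, yet its path length is clearly larger, so a feature-space outlier detector can separate the two.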
N. El Malki, F. Ravat, O. Teste K-means: k estimation solution based on kd-tree in a massive data context 22nd International Workshop On Design, Optimization, Languages and Analytical Processing of Big Data, co-located with the 23rd EDBT/ICDT Joint Conference (DOLAP@EDBT/ICDT'20), Copenhagen, Denmark 2020
Abstract. K-means clustering is a popular unsupervised classification algorithm employed in several domains, e.g., imaging, segmentation, or compression. Nevertheless, the number of clusters k, fixed a priori, mainly affects the clustering quality. Current state-of-the-art k-means implementations can automatically set the number of clusters. However, they incur unreasonable processing times when classifying large volumes of data. In this paper, we propose a novel solution based on a kd-tree to determine the number of clusters k in the context of massive data, for preprocessing in data science projects or in near-real-time applications. We demonstrate how our solution outperforms current solutions in terms of clustering quality and processing time on massive data.
H. Ben Hamadou, F. Ghozzi, A. Péninou, O. Teste Schema-independent Querying for Heterogeneous Collections in NoSQL Document Stores Information Systems Journal, Elsevier Science Publisher, Vol. 85, p.48-67 2019
Abstract. NoSQL document stores are well-tailored to efficiently load and manage massive collections of heterogeneous documents without any prior structural validation. However, this flexibility becomes a serious challenge when querying heterogeneous documents, and hence the user has to build complex queries or reformulate existing queries whenever new schemas are introduced in a collection. In this paper we propose a novel approach, based on formal foundations, for building schema-independent queries which are designed to query multi-structured documents. We present a query enrichment mechanism that consults a pre-constructed dictionary. This dictionary binds each possible path in the documents to all its corresponding absolute paths in all the documents. We automate the process of query reformulation via a set of rules that reformulate most document store operators, such as select, project, unnest, aggregate and lookup. We then produce queries across multi-structured documents which are compatible with the native query engine of the underlying document store. To evaluate our approach, we conducted experiments on synthetic datasets. Our results show that the induced overhead can be acceptable when compared to the efforts needed to restructure the data or the time required to execute several queries corresponding to the different schemas inside the collection.
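The dictionary-based enrichment can be pictured with a small sketch (a hypothetical illustration in MongoDB-style syntax, not the paper's implementation; `path_dict`, `enrich_select`, and the paths are assumptions):

```python
# A dictionary binding a queried field to every absolute path under
# which it appears across the heterogeneous documents of a collection.
path_dict = {
    "title": ["title", "info.title", "meta.doc.title"],
}

def enrich_select(field, value):
    # Rewrite a select on one field into a disjunction over all the
    # absolute paths recorded for that field, so a single query
    # reaches every document schema.
    return {"$or": [{path: value} for path in path_dict[field]]}
```

A user query on `title` is thus answered across all document structures without manually reformulating it for each schema.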
A. Laadhar, F. Ghozzi, I. Megdiche, F. Ravat, O. Teste, F. Gargouri The Impact of Imbalanced Training Data on Local Matching Learning of Ontologies 22nd International Conference on Business Information Systems (BIS’19), Seville, Spain 2019
Abstract. Matching learning corresponds to the combination of ontology matching and machine learning techniques. This strategy has gained increasing attention in recent years. However, state-of-the-art approaches implementing matching learning strategies are not well-tailored to deal with imbalanced training sets. In this paper, we address the problem of the imbalanced training sets and their impacts on the performance of the matching learning in the context of aligning biomedical ontologies. Our approach is applied to local matching learning, which is a technique used to divide a large ontology matching task into a set of distinct local sub-matching tasks. A local matching task is based on a local classifier built using its balanced local training set. Thus, local classifiers discover the alignment of the local sub-matching tasks. To validate our approach, we propose an experimental study to analyze the impact of applying conventional resampling techniques on the quality of the local matching learning.
A. Laadhar, F. Ghozzi, I. Megdiche, F. Ravat, O. Teste, F. Gargouri Partitioning and Local Matching Learning of Biomedical Ontologies 34th ACM/SIGAPP Symposium On Applied Computing (SAC’19), Limassol, Cyprus 2019
Abstract. Conventional ontology matching systems are not well-tailored to ensure sufficient-quality alignments for large ontology matching tasks. In this paper, we propose a local matching learning strategy to align large and complex biomedical ontologies. We define a novel partitioning approach that breaks up a large ontology alignment task into a set of local sub-matching tasks. We apply a machine learning approach to each local sub-matching task, building a local machine learning model for each one without any user involvement. Each local matching learning model automatically provides adequate matching settings for its sub-matching task. Our results show that: (i) our partitioning approach outperforms existing techniques, (ii) local matching with a specific machine learning model for each sub-matching task yields promising results, and (iii) the combination of partitioning and machine learning improves the overall results.
Abstract. In this paper, we propose a novel approach to tackle the problem of querying large volumes of statistical RDF cubes. Our approach relies on combining pre-aggregation strategies and the performance of NoSQL engines to represent and manage statistical RDF data. Specifically, we define a conceptual modeling solution to represent original RDF data with aggregates in a multidimensional structure. We complete the conceptual modeling with a logical design process based on well-known multidimensional RDF graph and property-graph representations. We implement our proposed model in RDF triple stores and a property-graph NoSQL database, and we compare the querying performance with and without aggregates. Experimental results, on real-world datasets containing 81.92 million triplets, show that pre-aggregation reduces query runtime in both RDF triple stores and property-graph NoSQL databases. The Neo4j NoSQL database with aggregates outperforms the RDF Jena TDB2 and Virtuoso triple stores, reducing query runtime by up to 99%.
I. Ben Kraiem, F. Ghozzi, A. Peninou, O. Teste CoRP: A Pattern-Based Anomaly Detection in Time-Series Enterprise Information Systems, Revised Selected Papers, International Conference on Enterprise Information Systems (ICEIS’19), Lecture Notes in Business Information Processing (LNBIP), Vol. 241, Springer, ISBN 978-3-030-40782-7, p. 424-442 2019
Abstract. Monitoring and analyzing sensor networks is essential for exploring energy consumption in smart buildings or cities. However, the data generated by sensors are affected by various types of anomalies, which makes the analysis tasks more complex. Anomaly detection has been used to find anomalous observations in data. In this paper, we propose a pattern-based method for anomaly detection in sensor networks, entitled CoRP ("Composition of Remarkable Points"), to simultaneously detect different types of anomalies. Our method detects remarkable points in time series based on patterns. Then, it detects anomalies through pattern compositions. We compare our approach to methods from the literature and evaluate them through a series of experiments based on real data and data from a benchmark.
I. Ben Kraiem, F. Ghozzi, A. Peninou, O. Teste
Pattern-based method for anomaly detection in sensor networks
21st International Conference on Enterprise Information Systems (ICEIS’19), Heraklion, Crete, Greece
Best student paper award
Abstract. The detection of anomalies in real fluid distribution applications is a difficult task, especially when we seek to accurately detect different types of anomalies and possible sensor failures. Resolving this problem is increasingly important in building management and supervision applications. In this paper, we introduce CoRP ("Composition of Remarkable Points"), a configurable approach based on pattern modelling for the simultaneous detection of multiple anomalies. CoRP evaluates a set of patterns that are defined by users in order to tag remarkable points with labels, then detects the anomalies among them by composition of labels. Compared with algorithms from the literature, our approach is more robust and accurate at detecting all the types of anomalies observed in real deployments. Our experiments are based on real-world data and data from the literature.
Abstract. The k-means algorithm is one of the best-known clustering algorithms. k-means requires iterative, repeated accesses to the data, sometimes performing the same calculations several times on the same data. However, intermediate results, which are difficult to predict at the beginning of the k-means process, are not recorded, so some computations are redone in subsequent iterations. These repeated calculations can be costly, especially when clustering massive data. In this article, we propose to extend the k-means algorithm by introducing pre-aggregates. These aggregates can then be reused to avoid redundant calculations during successive iterations. We show the interest of the approach through several experiments, which show that the larger the data volume, the more the pre-aggregations speed up the algorithm.
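The pre-aggregation idea can be sketched as follows (a minimal illustration, not the article's implementation; the grouping and function names are assumptions): points are summarized once into per-group counts and vector sums, and later centroid updates reuse these aggregates instead of rescanning the raw points:

```python
import numpy as np

def centroid_from_aggregates(counts, sums):
    # Centroid of a cluster made of pre-aggregated micro-groups,
    # each summarized by (count, vector sum) sufficient statistics.
    return sums.sum(axis=0) / counts.sum()

points = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
# Two micro-groups of two points each, aggregated once up front.
counts = np.array([2.0, 2.0])
sums = np.array([points[:2].sum(axis=0), points[2:].sum(axis=0)])

centroid = centroid_from_aggregates(counts, sums)  # same as points.mean(axis=0)
```

Because the aggregates are exact sufficient statistics for the mean, the centroid computed from them is identical to the one computed from the raw points, while each iteration touches far fewer records.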
A. Laadhar, F. Ghozzi, I. Megdiche, F. Ravat, O. Teste, F. Gargouri POMap++ Results for OAEI 2019: Fully Automated Machine Learning Approach for Ontology Matching 14th International Workshop on Ontology Matching co-located with the 18th International Semantic Web Conference (OM@ISWC'19), Auckland, New Zealand 2019
Abstract. POMap++ is a novel ontology matching system based on a machine learning approach. This year is the second participation of POMap++ in the Ontology Alignment Evaluation Initiative (OAEI). POMap++ follows a fully automated local matching learning approach that breaks down a large ontology matching task into a set of independent local sub-matching tasks. This approach integrates a novel partitioning algorithm as well as a set of matching learning techniques. POMap++ provides an automated local matching learning for the biomedical tracks. In this paper, we present POMap++ as well as the obtained results for the Ontology Alignment Evaluation Initiative of 2019.