Olivier Teste

Data Management Systems, Machine Learning

Archives

2021

ScRep'21

F. Boizard, B. Buffin-Meyer, J. Aligon, O. Teste, J. P. Schanstra, J. Klein
PRYNT: a tool for prioritization of disease candidates from proteomics data using a combination of shortest-path and random walk algorithms
Scientific Reports, Open Access Journal, ISSN 2045-2322
2021

Abstract. The urinary proteome is a promising pool of biomarkers of kidney disease. However, the protein changes observed in urine only partially reflect the deregulated mechanisms within kidney tissue. In order to improve on the mechanistic insight based on the urinary protein changes, we developed a new prioritization strategy called PRYNT (PRioritization bY protein NeTwork) that employs a combination of two closeness-based algorithms, shortest-path and random walk, and a contextualized protein–protein interaction (PPI) network, mainly based on clique consolidation of the STRING network. To assess the performance of our approach, we evaluated both the precision and the specificity of PRYNT in prioritizing kidney disease candidates. Using four urinary proteome datasets, PRYNT prioritization performed better than other prioritization methods and tools available in the literature. Moreover, PRYNT performed to a similar, but complementary, extent compared to the upstream regulator analysis from the commercial Ingenuity Pathway Analysis software. In conclusion, PRYNT appears to be a valuable, freely accessible tool to predict key proteins indirectly from urinary proteome data. In the future, the PRYNT approach could be applied to other biofluids, molecular traits and diseases.
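As a rough illustration of how the two closeness-based algorithm families named above can be combined on a PPI network, here is a minimal, self-contained sketch (the toy graph, seed set, restart rate, and rank-combination rule are all hypothetical, not PRYNT's actual implementation):

```python
from collections import deque

# Toy PPI network as an adjacency list (hypothetical proteins).
PPI = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
SEEDS = ["A"]  # e.g. proteins found deregulated in urine

def bfs_distances(graph, source):
    """Unweighted shortest-path lengths from `source` (breadth-first search)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def random_walk_with_restart(graph, seeds, restart=0.3, iters=100):
    """Power iteration for random-walk-with-restart proximity scores."""
    p = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in graph}
    for _ in range(iters):
        nxt = {n: 0.0 for n in graph}
        for u, score in p.items():
            share = (1 - restart) * score / len(graph[u])
            for v in graph[u]:
                nxt[v] += share
        for s in seeds:
            nxt[s] += restart / len(seeds)
        p = nxt
    return p

# Combine both signals: small distance to seeds and high walk score rank first.
dist = bfs_distances(PPI, "A")
rwr = random_walk_with_restart(PPI, SEEDS)
ranking = sorted(PPI, key=lambda n: (dist[n], -rwr[n]))
print(ranking)
```

The combination rule (lexicographic on distance, then walk score) is only one of many possible ways to merge the two closeness measures.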

ICDM-W'21

T. Li, P. Goupil, J. Mothe, O. Teste
Early Detection of Atmospheric Turbulence for Civil Aircraft: A Data Driven Approach
21st IEEE International Conference on Data Mining Workshops (ICDM-W'21), Auckland, New Zealand
2021

Abstract. Atmospheric turbulence phenomena are the main cause of injuries in civil air transport and, due to climate change, the frequency and severity of turbulence are increasing. There is no precise turbulence prediction method. The state-of-the-art turbulence detection methods used on board commercial aircraft include pilot reports and online algorithms based on the in situ eddy dissipation rate. They provide turbulence observations but not predictions. Weather radar, on the other hand, only detects turbulence in wet air, without any precise announcement of the timing. Since aircraft are equipped with a large number of sensors from different systems, the flight variables (multivariate time series generated by these sensors) as well as their relationships may contain useful information indicating upcoming turbulence. Our approach represents raw time series as functions, which enables not only discovering the underlying function behind the raw measurements but also implicitly removing data noise. Functional geometry features, which can capture the dynamic relation between variables, are deduced from the multidimensional path in the functional representation. Based on the transformed geometry features, an outlier detection method is then deployed to detect specific behaviors indicating upcoming severe turbulence. Preliminary experimental results show that our approach reaches a 0.532 true positive rate while keeping a zero false positive rate, which meets the zero-false-alarm requirement for optimizing the passenger experience.

DaWaK'21

N. El Malki, R. Cugny, O. Teste, F. Ravat
A New Accurate Clustering Approach for Detecting Different Densities in High Dimensional Data
23rd International Conference on Big Data Analytics and Knowledge Discovery (DaWaK'21), virtual event
2021

Abstract. Clustering is a data analysis method for extracting knowledge by discovering groups of data called clusters. Density-based clustering methods have proven to be effective for arbitrary-shaped clusters, but they have difficulties finding low-density clusters, nearby clusters with similar densities, and clusters in high-dimensional data. Our proposal is a new clustering algorithm based on spatial density and a probabilistic approach. Sub-clusters are constituted using spatial density, represented as the probability density function (p.d.f.) of pairwise distances between points. To agglomerate similar sub-clusters, we combine spatial and probabilistic distances. We show that our approach outperforms the main state-of-the-art density-based clustering methods on a wide variety of datasets.

DEXA'21

Y. Yang, F. Abdelhédi, J. Darmont, F. Ravat, O. Teste
Internal Data Imputation in Data Warehouse Dimensions
32nd International Conference on Database and Expert Systems Applications (DEXA'21), virtual event
2021

Abstract. Missing data occur commonly in data warehouses and may compromise the usefulness of the data. It is thus essential to address missing data in order to carry out better analyses. There exist data imputation methods for missing data in fact tables, but not for dimension tables. Hence, we propose in this paper a data imputation method for data warehouse dimensions that is based on existing data and takes both intra- and inter-dimension relationships into account.
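A minimal sketch of the intra-dimension idea: a missing higher-level attribute can sometimes be filled using the functional dependency between hierarchy levels observed in complete rows of the same dimension (the toy table, attribute names, and dependency here are illustrative assumptions, not the paper's algorithm):

```python
# Toy Store dimension (hypothetical): city -> region is a hierarchy dependency.
rows = [
    {"store": "S1", "city": "Lyon", "region": "ARA"},
    {"store": "S2", "city": "Lyon", "region": None},   # missing value
    {"store": "S3", "city": "Paris", "region": "IDF"},
]

def impute_from_hierarchy(rows, lower, upper):
    """Fill a missing upper-level attribute using the functional
    dependency lower -> upper observed in complete rows."""
    mapping = {r[lower]: r[upper] for r in rows if r[upper] is not None}
    for r in rows:
        if r[upper] is None:
            # Stays None when the lower-level value was never seen complete.
            r[upper] = mapping.get(r[lower])
    return rows

impute_from_hierarchy(rows, "city", "region")
print(rows[1]["region"])
```

Inter-dimension relationships (through the fact table) would require joining other dimensions, which this toy deliberately omits.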

IDEAS'21

Y. Yang, J. Darmont, F. Ravat, O. Teste
An Automatic Schema-Instance Approach for Merging Multidimensional Data Warehouses
25th International Database Engineering & Applications Symposium (IDEAS'21), Montreal, QC, Canada
2021

Abstract. Using data warehouses to analyse multidimensional data is a significant task in company decision-making. The data warehouse merging process is composed of two steps: matching multidimensional components and then merging them. Current approaches do not take all the particularities of multidimensional data warehouses into account, e.g., only merging schemata, but not instances; or not exploiting hierarchies nor fact tables. Thus, in this paper, we propose an automatic merging approach for star schema-modeled data warehouses that works at both the schema and instance levels. We also provide algorithms for merging hierarchies, dimensions and facts. Eventually, we implement our merging algorithms and validate them with the use of both synthetic and benchmark datasets.

DSIT'21

G. Dorleon, N. Bricon-Souf, I. Megdiche, O. Teste
Absolute Redundancy Analysis Based on Features Selection
4th International Conference on Data Science and Information Technology (DSIT'21), Shanghai
2021

Abstract. The goal of feature selection (FS) in machine learning is to find the best subset of features to create efficient models for a learning task. Different FS methods are then used to assess feature relevancy. An efficient feature selection method should be able to select relevant and non-redundant features in order to improve learning performance and training efficiency on large data. However, in the case of non-independent features, existing feature selection methods inappropriately remove redundancy, which leads to performance loss. We propose in this article a new criterion for feature redundancy analysis. Using this criterion, we then design an efficient feature redundancy analysis method to eliminate redundant features and optimize the performance of a classifier. We experimentally compare the efficiency and performance of our method against other existing methods that may remove redundant features. The results obtained show that our method is effective in maximizing performance while reducing redundancy.

EDBT'21

I. Ben Kraiem, F. Ghozzi, A. Péninou, G. Roman-Jimenez, O. Teste
Human-Interpretable Rules for Anomaly Detection in Time-series
24th EDBT/ICDT Joint Conference, International Conference on Extending Database Technology (EDBT/ICDT’21), Nicosia, Cyprus
2021

Abstract. Anomaly detection in time series is a widely studied issue in many areas. Anomalies can be detected using rule-based approaches, and human-interpretable rules for anomaly detection refer to rules presented in a format that is intelligible to analysts. Learning these rules is a challenge, but only a few works address the issue of detecting different types of anomalies in time series. This paper presents an extended decision tree based on patterns to generate a minimized set of human-comprehensible rules for anomaly detection in univariate time series. The method uses Bayesian optimization to avoid manual tuning of hyper-parameters. We define a quality measure to evaluate both the accuracy and the intelligibility of the produced rules. Experiments show that our approach generates rules that outperform state-of-the-art anomaly detection techniques.

2020

KnoSyst'20

C. Lejeune, J. Mothe, A. Soubki, O. Teste
Shape-based outlier detection in multivariate functional data
International Journal of Knowledge-Based Systems, Elsevier Science Publisher
2020

Abstract. Multivariate functional data refer to a population of multivariate functions generated by a system involving dynamic parameters depending on continuous variables (e.g., multivariate time series). Outlier detection in such a context is a challenging problem because both the individual behavior of the parameters and the dynamic correlation between them are important. To address this problem, recent work has focused on multivariate functional depth to identify the outliers in a given dataset. However, most previous approaches fail when the outlyingness manifests itself in curve shape rather than curve magnitude. In this paper, we propose identifying outliers in multivariate functional data by a method whereby different outlying features are captured based on mapping functions from differential geometry. In this regard, we extract shape features reflecting the outlyingness of a curve with a high degree of interpretability. We conduct an experimental study on real and synthetic data sets and compare the proposed method with functional-depth-based methods. The results demonstrate that the proposed method, combined with state-of-the-art outlier detection algorithms, can outperform the functional-depth-based methods. Moreover, in contrast with the baseline methods, it is efficient regardless of the proportion of outliers.

InfMngt'20

F. Ravat, J. Song, O. Teste, C. Trojahn
Efficient querying of multidimensional RDF data with aggregates: comparing NoSQL, RDF and relational data stores
International Journal of Information Management, Elsevier Science Publisher
2020

Abstract. This paper proposes an approach to tackle the problem of querying large volumes of statistical RDF data. Our approach relies on pre-aggregation strategies to better manage the analysis of this kind of data. Specifically, we define a conceptual model to represent original RDF data with aggregates in a multidimensional structure. A set of translation rules for converting a well-known multidimensional RDF modelling vocabulary into the proposed conceptual model is then proposed. We implement the conceptual model in six different data stores: two RDF triple stores (Jena TDB and Virtuoso), one graph-oriented NoSQL database (Neo4j), one column-oriented data store (Cassandra), and two relational databases (MySQL and PostgreSQL). We compare the querying performance, with and without aggregates, in these data stores. Experimental results, on real-world datasets containing 81.92 million triplets, show that pre-aggregation allows for reducing query runtime in all data stores. The Neo4j NoSQL database and the relational databases with aggregates outperform the triple stores, speeding up query runtime by up to 99%.

ICWS’20

O. Coustié, X. Baril, J. Mothe, O. Teste
METING: A Robust Log Parser Based on Frequent n-Gram Mining
IEEE International Conference on Web Services (ICWS’20), Beijing, China
2020

Abstract. Execution logs are a pervasive resource for monitoring modern information systems. Due to the lack of structure in raw log datasets, log parsing methods are used to automatically retrieve the structure of logs and gather logs with common templates. Parametric log parsers are commonly preferred since they can modulate their behaviour to fit different types of datasets. These methods rely on strong syntactic assumptions about log structure, e.g., that all logs of a common template have the same number of words. Yet, some reference datasets do not comply with these assumptions and are still not effectively treated by any of the state-of-the-art log parsers. We propose a new parametric log parser based on frequent n-gram mining: this soft text-driven approach offers a more flexible syntactic representation of logs, which fits the great majority of log data, especially the challenging ones. Our comprehensive evaluations show that the approach is robust and clearly outperforms existing methods on these challenging datasets.
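To make "frequent n-gram mining" concrete, here is a toy sketch of the general idea: mine n-grams that recur across log lines and mask words not covered by any frequent n-gram as template parameters (the log lines, thresholds, and masking rule are illustrative assumptions, not METING's actual method):

```python
from collections import Counter

logs = [
    "connection from 10.0.0.1 closed",
    "connection from 10.0.0.2 closed",
    "user alice logged in",
]

def frequent_ngrams(lines, n=2, min_count=2):
    """Count word n-grams over all lines; keep those seen often enough."""
    counts = Counter()
    for line in lines:
        words = line.split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return {g for g, c in counts.items() if c >= min_count}

def template(line, frequent, n=2):
    """Keep words covered by a frequent n-gram, mask the rest as <*>."""
    words = line.split()
    keep = [False] * len(words)
    for i in range(len(words) - n + 1):
        if tuple(words[i:i + n]) in frequent:
            for j in range(i, i + n):
                keep[j] = True
    return " ".join(w if k else "<*>" for w, k in zip(words, keep))

freq = frequent_ngrams(logs)
print(template(logs[0], freq))  # connection from <*> <*>
```

Note that a word adjacent only to variable tokens (here "closed") gets masked too in this naive version; a real parser needs a subtler coverage rule.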

CIKM'20

N. El Malki, R. Cugny, F. Ravat, O. Teste
DECWA: Density-Based Clustering using Wasserstein Distance
29th International Conference on Information and Knowledge Management (CIKM'20), Galway, Ireland
2020

Abstract. Clustering is a data analysis method for extracting knowledge by discovering groups of data called clusters. Among these methods, state-of-the-art density-based clustering methods have proven to be effective for arbitrary-shaped clusters. Despite their encouraging results, they struggle to find low-density clusters, nearby clusters with similar densities, and clusters in high-dimensional data. Our proposals are a new characterization of clusters and a new clustering algorithm based on spatial density and a probabilistic approach. First, sub-clusters are built using spatial density, represented as the probability density function (p.d.f.) of pairwise distances between points. A method is then proposed to agglomerate similar sub-clusters by using both their density (p.d.f.) and their spatial distance. The key idea we propose is to use the Wasserstein metric, a powerful tool to measure the distance between the p.d.f. of sub-clusters. We show that our approach outperforms other state-of-the-art density-based clustering methods on a wide variety of datasets.
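The Wasserstein metric mentioned above has a simple closed form in one dimension: for two equal-size empirical samples it is the average gap between the sorted samples. A minimal sketch of why it separates sub-clusters of different densities (the sub-cluster distance samples are made-up toy values, and the equal-size restriction is a simplification of the general metric):

```python
def wasserstein_1d(xs, ys):
    """1-D Wasserstein distance between two equal-size empirical samples:
    the mean gap between the sorted samples (optimal transport on the line)."""
    xs, ys = sorted(xs), sorted(ys)
    assert len(xs) == len(ys), "this simplified form needs equal sizes"
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

# Pairwise distances observed inside three hypothetical sub-clusters:
dense  = [0.10, 0.20, 0.15, 0.12]  # tight sub-cluster, small distances
sparse = [1.00, 1.20, 0.90, 1.10]  # loose sub-cluster, larger distances
twin   = [0.11, 0.19, 0.16, 0.13]  # density similar to `dense`

# A dense sub-cluster is far (in Wasserstein terms) from a sparse one,
# but close to another dense one, so only the latter pair would be merged.
print(wasserstein_1d(dense, sparse), wasserstein_1d(dense, twin))
```

Libraries such as SciPy provide a general `wasserstein_distance` that lifts the equal-size restriction.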

CIKM'20

O. Coustié, X. Baril, J. Mothe, O. Teste
Application Performance Anomaly Detection with LSTM on Temporal Irregularities in Logs
29th International Conference on Information and Knowledge Management (CIKM'20), Galway, Ireland
2020

Abstract. Performance anomalies are a core problem in modern information systems, affecting the execution of the hosted applications. The detection of these anomalies often relies on the analysis of the application execution logs. The current most effective approach is to detect samples that differ from a learnt nominal model. However, current methods often focus on detecting sequential anomalies in logs, neglecting the time elapsed between logs, which is a core component of performance anomaly detection. In this paper, we develop a new model for performance anomaly detection that captures temporal deviations from the nominal model by means of a sliding-window data representation. This nominal model is trained by a Long Short-Term Memory neural network, which is appropriate for representing complex sequential dependencies. We assess the effectiveness of our model on both simulated and real datasets. We show that it is more robust to temporal variations than current state-of-the-art approaches, while remaining as effective.

AIME’20

O. El Rifai, M. Biotteau, X. Deboissezon, I. Megdiche, F. Ravat, O. Teste
Blockchain-based Federated Learning in Medicine
International Conference on Artificial Intelligence in Medicine (AIME'20), Minneapolis, USA
2020

Abstract. Worldwide epidemic events have confirmed the need for medical data processing tools while bringing issues of data privacy, transparency and usage consent to the front. Federated Learning and the blockchain are two technologies that tackle these challenges and have been shown to be beneficial in medical contexts, where data are often distributed and coming from different sources. In this paper, we propose to integrate these two technologies for the first time in a medical setting. In particular, we propose an implementation of a coordinating server for a federated learning algorithm to share information for improved predictions while ensuring data transparency and usage consent. We illustrate the approach with a prediction decision support tool applied to a diabetes dataset. The particular challenges of medical contexts are detailed and a prototype implementation is presented to validate the solution.

RCIS'20

I. Ben Kraiem, F. Ghozzi, A. Péninou, G. Roman-Jimenez, O. Teste
Automatic Classification Rules for Anomaly Detection in Time-series
14th International Conference on Research Challenges in Information Science (RCIS’20), Limassol, Cyprus
2020

Abstract. Anomaly detection in time-series is an important issue in many applications. It is particularly hard to accurately detect multiple anomalies in time-series. Pattern discovery and rule extraction are effective solutions for allowing multiple anomaly detection. In this paper, we define a Composition-based Decision Tree algorithm that automatically discovers and generates human-understandable classification rules for multiple anomaly detection in time-series. To evaluate our solution, our algorithm is compared to other anomaly detection algorithms on real datasets and benchmarks.

RCIS'20

O. El Rifai, M. Biotteau, X. Deboissezon, I. Megdiche, F. Ravat, O. Teste
Blockchain-Based Personal Health Records for Patients’ Empowerment
14th International Conference on Research Challenges in Information Science (RCIS’20), Limassol, Cyprus
2020

Abstract. With the current trend of patient-centric health care, blockchain-based Personal Health Records (PHRs) frameworks have been emerging. The adoption of these frameworks is still in its infancy and depends on a broad range of factors. In this paper, we look at some of the typical concerns raised by a centralized medical records solution such as the one deployed in France. Based on the state-of-the-art literature on Electronic Health Records (EHRs) and PHRs, we discuss the main implementation bottlenecks that can be encountered when deploying a blockchain solution and how to avoid them. In particular, we explore these bottlenecks in the context of the French PHR system and suggest some recommendations for a paradigm shift towards patients' empowerment.

ECMOR'20

A. Yewgat, D. Busby, M. Chevalier, C. Lapeyre, O. Teste
Deep-CRM: A New Deep Learning Approach For Capacitance Resistive Models
17th European Conference On The Mathematics Of Oil Recovery (ECMOR'20)
2020

Abstract. Data-driven models can represent a suitable alternative to classical reservoir modelling as they require much less computation time and fewer allocated resources. Among such models are Capacitance Resistive Models (CRMs), based on a set of coupled ordinary differential equations (ODEs) representing material balance. The aim of this work is to propose a complete approach to optimize the CRM parameters and forecast future production. This approach is not based on any assumptions about injections or the producers' Bottom Hole Pressure. To this end, we introduce a new approach, called Deep-CRM, based on a deep learning strategy: Physics-Informed Neural Networks (PINNs) for CRMs. Experiments are conducted to compare our approach to the nonlinear multivariate regression of the closed-form solution. These experiments are based on two datasets: the first is a synthetic dataset generated using ECLIPSE® and SISMAGE®, and the second is a real field dataset provided by one of our affiliates.

CIRCLE'20

E. Maître, Z. Chemli, M. Chevalier, B. Dousset, J-P. Gitto, O. Teste
Event detection and time series alignment to improve stock market forecasting
Joint Conference of the Information Retrieval Communities in Europe (CIRCLE'20), Samatan, France
2020

Abstract. Buying commodities is a critical issue for multiple industries because the variations of stock prices are induced not only by multiple economic parameters but also by external events. Raw material buyers must keep track of information in numerous fields, which constitutes a major challenge considering the exponential growth of online data. To tackle this issue, we propose an event detection approach in order to assist them in their anticipation process. Indeed, a lot of contextual information is contained in text, and exploiting it can allow buyers to improve their anticipation ability. Thus, we develop a framework for event detection and qualification, then we quantify the impact of these events on the stock market to help buyers in their anticipation process. In this paper, we first introduce our context, then explain the scope of our work and our goals. After detailing the related work, we present our proposition, conclude, and propose some directions for future work.

EDBT'20

C. Lejeune, J. Mothe, O. Teste
Outlier detection in multivariate functional data based on a geometric aggregation
23rd EDBT/ICDT Joint Conference, International Conference on Extending Database Technology (EDBT/ICDT'20), Copenhagen, Denmark
2020

Abstract. The increasing ubiquity of multivariate functional data (MFD) requires methods that can properly detect outliers within such data, where a sample corresponds to p > 1 parameters observed with respect to (w.r.t.) a continuous variable (e.g., time). We improve outlier detection in MFD by adopting a geometric view of the data space and combining the new data representation with state-of-the-art outlier detection algorithms. The geometric representation of MFD as paths in the p-dimensional Euclidean space makes it possible to implicitly take into account the correlation between the parameters w.r.t. the continuous variable. We experimentally show that our method is robust to various rates of outliers in the training set when fitting the outlier detection model, and that it can detect outliers that are not detected by standard algorithms.
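A toy sketch of the path view of MFD: treating one bivariate sample as a path in the plane, simple geometric features such as the speed and heading of each step can be derived by finite differences (the series, the feature choice, and the differencing scheme are illustrative assumptions, not the paper's actual mapping functions):

```python
import math

# One toy bivariate sample of MFD (p = 2), observed at 4 positions of
# the continuous variable; each tuple is a point on the path.
series = [(0.0, 0.0), (1.0, 0.5), (2.0, 2.0), (3.0, 4.5)]

def velocity(path):
    """Finite-difference velocity vectors along the p-dimensional path."""
    return [tuple(b - a for a, b in zip(p, q))
            for p, q in zip(path, path[1:])]

def speed_and_direction(vels):
    """Geometric features per step: speed (norm) and heading angle."""
    return [(math.hypot(dx, dy), math.atan2(dy, dx)) for dx, dy in vels]

feats = speed_and_direction(velocity(series))
print(feats)
```

Feature vectors like these can then be fed to any standard multivariate outlier detector, which is the combination strategy the abstract describes.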

DOLAP'20

N. El Malki, F. Ravat, O. Teste
K-means: k estimation solution based on kd-tree in a massive data context
22nd International Workshop On Design, Optimization, Languages and Analytical Processing of Big Data, co-located with the 23rd EDBT/ICDT Joint Conference (DOLAP@EDBT/ICDT'20), Copenhagen, Denmark
2020

Abstract. K-means clustering is a popular unsupervised classification algorithm employed in several domains, e.g., imaging, segmentation, or compression. Nevertheless, the number of clusters k, fixed a priori, mainly affects the clustering quality. Current state-of-the-art k-means implementations can automatically set the number of clusters. However, they result in unreasonable processing time while classifying large volumes of data. In this paper, we propose a novel solution based on a kd-tree to determine the number of clusters k in the context of massive data, for preprocessing in data science projects or in near-real-time applications. We demonstrate how our solution outperforms current solutions in terms of clustering quality and processing time on massive data.

2019

InfSyst'19

H. Ben Hamadou, F. Ghozzi, A. Péninou, O. Teste
Schema-independent Querying for Heterogeneous Collections in NoSQL Document Stores
Information Systems Journal, Elsevier Science Publisher, Vol. 85, p.48-67
2019

Abstract. NoSQL document stores are well-tailored to efficiently load and manage massive collections of heterogeneous documents without any prior structural validation. However, this flexibility becomes a serious challenge when querying heterogeneous documents, and hence the user has to build complex queries or reformulate existing queries whenever new schemas are introduced in a collection. In this paper we propose a novel approach, based on formal foundations, for building schema-independent queries which are designed to query multi-structured documents. We present a query enrichment mechanism that consults a pre-constructed dictionary. This dictionary binds each possible path in the documents to all its corresponding absolute paths in all the documents. We automate the process of query reformulation via a set of rules that reformulate most document store operators, such as select, project, unnest, aggregate and lookup. We then produce queries across multi-structured documents which are compatible with the native query engine of the underlying document store. To evaluate our approach, we conducted experiments on synthetic datasets. Our results show that the induced overhead can be acceptable when compared to the efforts needed to restructure the data or the time required to execute several queries corresponding to the different schemas inside the collection.
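A toy sketch of the dictionary construction described above: walk every document, record every absolute path, and bind each field name to all absolute paths where it occurs, so a schema-independent query on a field name can be rewritten as a disjunction over those paths (the documents and helper names are hypothetical; this is the general idea, not the paper's formal rules):

```python
# Two heterogeneous documents from the same collection (hypothetical).
docs = [
    {"title": "A", "venue": {"name": "DEXA"}},
    {"paper": {"title": "B"}, "venue": "EDBT"},
]

def absolute_paths(doc, prefix=()):
    """Yield every absolute dotted path occurring in a document."""
    for key, val in doc.items():
        path = prefix + (key,)
        yield ".".join(path)
        if isinstance(val, dict):
            yield from absolute_paths(val, path)

def build_dictionary(docs):
    """Bind each field name to all absolute paths ending with it."""
    d = {}
    for doc in docs:
        for p in absolute_paths(doc):
            d.setdefault(p.split(".")[-1], set()).add(p)
    return d

d = build_dictionary(docs)
print(sorted(d["title"]))
```

Query enrichment then replaces a user's reference to `title` with a disjunction over `title` and `paper.title` (e.g., an `$or` of both paths in a MongoDB-style store), which is what makes the queries schema-independent.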

BIS'19

A. Laadhar, F. Ghozzi, I. Megdiche, F. Ravat, O. Teste, F. Gargouri
The Impact of Imbalanced Training Data on Local Matching Learning of Ontologies
22nd International Conference on Business Information Systems (BIS’19), Seville, Spain
2019

Abstract. Matching learning corresponds to the combination of ontology matching and machine learning techniques. This strategy has gained increasing attention in recent years. However, state-of-the-art approaches implementing matching learning strategies are not well-tailored to deal with imbalanced training sets. In this paper, we address the problem of imbalanced training sets and their impact on the performance of matching learning in the context of aligning biomedical ontologies. Our approach is applied to local matching learning, which is a technique used to divide a large ontology matching task into a set of distinct local sub-matching tasks. A local matching task is based on a local classifier built using its balanced local training set. Thus, local classifiers discover the alignment of the local sub-matching tasks. To validate our approach, we propose an experimental study to analyze the impact of applying conventional resampling techniques on the quality of the local matching learning.

SAC'19

A. Laadhar, F. Ghozzi, I. Megdiche, F. Ravat, O. Teste, F. Gargouri
Partitioning and Local Matching Learning of Biomedical Ontologies
34th ACM/SIGAPP Symposium On Applied Computing (SAC’19), Limassol, Cyprus
2019

Abstract. Conventional ontology matching systems are not well-tailored to ensure sufficient-quality alignments for large ontology matching tasks. In this paper, we propose a local matching learning strategy to align large and complex biomedical ontologies. We define a novel partitioning approach that breaks up a large ontology alignment task into a set of local sub-matching tasks. We apply a machine learning approach to each local sub-matching task, building a local machine learning model for each one without any user involvement. Each local matching learning model automatically provides adequate matching settings for its local sub-matching task. Our results show that: (i) our partitioning approach outperforms existing techniques, (ii) local matching with a specific machine learning model for each sub-matching task yields promising results, and (iii) the combination of partitioning and machine learning improves the overall results.

SAC'19

F. Ravat, J. Song, O. Teste, C. Trojahn
Improving the performance of querying multidimensional RDF data using aggregates
34th ACM/SIGAPP Symposium On Applied Computing (SAC’19), Limassol, Cyprus
2019

Abstract. In this paper, we propose a novel approach to tackle the problem of querying large volumes of statistical RDF cubes. Our approach relies on combining pre-aggregation strategies and the performance of NoSQL engines to represent and manage statistical RDF data. Specifically, we define a conceptual modeling solution to represent original RDF data with aggregates in a multidimensional structure. We complete the conceptual modeling with a logical design process based on well-known multidimensional RDF graph and property-graph representations. We implement our proposed model in RDF triple stores and a property-graph NoSQL database, and we compare the querying performance, with and without aggregates. Experimental results, on real-world datasets containing 81.92 million triplets, show that pre-aggregation allows reducing query runtime in both RDF triple stores and property-graph NoSQL databases. The Neo4j NoSQL database with aggregates outperforms the RDF Jena TDB2 and Virtuoso triple stores, speeding up query runtime by up to 99%.

LNBIP'19

I. Ben Kraiem, F. Ghozzi, A. Peninou, O. Teste
CoRP: A Pattern-Based Anomaly Detection in Time-Series
Enterprise Information Systems, Revised Selected Papers, International Conference on Enterprise Information Systems (ICEIS’19), Lecture Notes in Business Information Processing (LNBIP), Vol. 241, Springer, ISBN 978-3-030-40782-7, p. 424-442
2019

Abstract. Monitoring and analyzing sensor networks is essential for exploring energy consumption in smart buildings or cities. However, the data generated by sensors are affected by various types of anomalies, which makes the analysis tasks more complex. Anomaly detection has been used to find anomalous observations in data. In this paper, we propose a pattern-based method for anomaly detection in sensor networks, entitled CoRP “Composition of Remarkable Point”, to simultaneously detect different types of anomalies. Our method detects remarkable points in time series based on patterns, then detects anomalies through pattern compositions. We compare our approach to methods from the literature and evaluate them through a series of experiments based on real data and data from a benchmark.

ICEIS'19

I. Ben Kraiem, F. Ghozzi, A. Peninou, O. Teste
Pattern-based method for anomaly detection in sensor networks
21st International Conference on Enterprise Information Systems (ICEIS’19), Heraklion, Crete, Greece
2019
Best student paper award

Abstract. The detection of anomalies in real fluid distribution applications is a difficult task, especially when we seek to accurately detect different types of anomalies and possible sensor failures. Resolving this problem is increasingly important in building management and supervision applications. In this paper, we introduce CoRP “Composition of Remarkable Points”, a configurable approach based on pattern modelling for the simultaneous detection of multiple anomalies. CoRP evaluates a set of patterns that are defined by users in order to tag remarkable points with labels, then detects anomalies among them by composition of labels. Compared with algorithms from the literature, our approach is more robust and accurate in detecting all types of anomalies observed in real deployments. Our experiments are based on real-world data and data from the literature.

ICEIS'19

N. El Malki, F. Ravat, O. Teste
K-means improvement by dynamic pre-aggregates
21st International Conference on Enterprise Information Systems (ICEIS’19), Heraklion, Crete, Greece
2019

Abstract. The k-means algorithm is one of the best-known clustering algorithms. k-means requires iterative and repeated accesses to the data, up to performing the same calculations several times on the same data. However, intermediate results, which are difficult to predict at the beginning of the k-means process, are not recorded to avoid recalculating some data in subsequent iterations. These repeated calculations can be costly, especially when clustering massive data. In this article, we propose to extend the k-means algorithm by introducing pre-aggregates. These aggregates can then be reused to avoid redundant calculations during successive iterations. We show the interest of the approach through several experiments, which show that the larger the data volume, the more the pre-aggregates speed up the algorithm.
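A minimal sketch of one kind of reusable aggregate in k-means: maintaining a per-cluster (sum, count) pair incrementally, so a centroid update never requires a full pass over the points (the toy data and the exact set of aggregates are illustrative assumptions, not the paper's pre-aggregation scheme):

```python
# Toy 1-D data and an initial assignment (hypothetical).
points = [1.0, 1.2, 0.8, 5.0, 5.2]
assign = [0, 0, 1, 1, 1]

# Pre-aggregates: per-cluster (sum, count), built once then kept up to date.
agg = {}
for x, c in zip(points, assign):
    s, n = agg.get(c, (0.0, 0))
    agg[c] = (s + x, n + 1)

def centroid(c):
    """Centroid from the aggregates alone: no pass over the data."""
    s, n = agg[c]
    return s / n

def move(i, new_c):
    """Reassign point i and update both clusters' aggregates in O(1)."""
    old = assign[i]
    s, n = agg[old]
    agg[old] = (s - points[i], n - 1)
    s, n = agg.get(new_c, (0.0, 0))
    agg[new_c] = (s + points[i], n + 1)
    assign[i] = new_c

move(2, 0)  # point 0.8 joins cluster 0
print(centroid(0))
```

Each k-means iteration then only touches the aggregates of the clusters whose membership actually changed, which is where the savings grow with data volume.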

OM@ISWC'19

A. Laadhar, F. Ghozzi, I. Megdiche, F. Ravat, O. Teste, F. Gargouri
POMap++ Results for OAEI 2019: Fully Automated Machine Learning Approach for Ontology Matching
14th International Workshop on Ontology Matching co-located with the 18th International Semantic Web Conference (OM@ISWC'19), Auckland, New Zealand
2019

Abstract. POMap++ is a novel ontology matching system based on a machine learning approach. This year is the second participation of POMap++ in the Ontology Alignment Evaluation Initiative (OAEI). POMap++ follows a fully automated local matching learning approach that breaks down a large ontology matching task into a set of independent local sub-matching tasks. This approach integrates a novel partitioning algorithm as well as a set of matching learning techniques. POMap++ provides an automated local matching learning for the biomedical tracks. In this paper, we present POMap++ as well as the obtained results for the Ontology Alignment Evaluation Initiative of 2019.