Paper 7

Database Support for Enabling Data-Discovery Queries over Semantically-Annotated Observational Data

Authors: Huiping Cao, Shawn Bowers, Mark P. Schildhauer

Volume 6 (2012)

Abstract

Observational data plays a critical role in many scientific disciplines, and scientists are increasingly interested in performing broad-scale analyses using observational data collected as part of many smaller scientific studies. However, while these data sets often contain similar types of information, they are typically represented using very different structures and with little semantic information about the data itself. This creates significant challenges for researchers who wish to discover existing data sets based on data semantics (observation and measurement types) and data content (the values of measurements within a data set). We present a formal framework to address these challenges. It consists of a semantic observational model (to uniformly represent observation and measurement types), a high-level semantic annotation language (to map tabular resources into the model), and a declarative query language that allows researchers to express data-discovery queries over heterogeneous (annotated) data sets. To demonstrate the feasibility of our framework, we also present implementation approaches for efficiently answering discovery queries over semantically annotated data sets. In particular, we propose two storage schemes (in-place databases, rdb, and materialized databases, mdb) to store the source data sets and their annotations. We also present two query schemes (ExeD and ExeH) to evaluate discovery queries, and we report the results of extensive experiments comparing their effectiveness.
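To make the idea of a data-discovery query concrete, the following is a minimal Python sketch and not the paper's annotation or query language. It assumes a simplified annotation that maps a table column to an observed entity type and a measured characteristic, and a hypothetical `discover` function that filters data sets by measurement type and, optionally, by measurement value. All class, function, and data set names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical annotation: maps one table column to a measurement type."""
    column: str          # column name in the source table
    entity: str          # observed entity type, e.g. "Tree"
    characteristic: str  # measured characteristic, e.g. "Height"


@dataclass
class Dataset:
    """A tabular data set plus its semantic annotations (illustrative only)."""
    name: str
    rows: list
    annotations: list


def discover(datasets, entity, characteristic, predicate=None):
    """Return names of data sets that measure `characteristic` of `entity`.

    If `predicate` is given, a data set matches only when at least one
    value in the annotated column satisfies it (a data-content condition).
    """
    matches = []
    for ds in datasets:
        for ann in ds.annotations:
            if (ann.entity, ann.characteristic) != (entity, characteristic):
                continue
            values = [row[ann.column] for row in ds.rows if ann.column in row]
            if predicate is None or any(predicate(v) for v in values):
                matches.append(ds.name)
                break  # one matching annotation is enough for this data set
    return matches


# Three toy data sets with different column names for similar information.
forest = Dataset("forest_plots", [{"ht": 12.5}, {"ht": 21.0}],
                 [Annotation("ht", "Tree", "Height")])
survey = Dataset("tree_survey", [{"height_m": 8.0}],
                 [Annotation("height_m", "Tree", "Height")])
soil = Dataset("soil_cores", [{"ph": 6.4}],
               [Annotation("ph", "Soil", "Acidity")])

all_sets = [forest, survey, soil]
by_type = discover(all_sets, "Tree", "Height")
by_value = discover(all_sets, "Tree", "Height", lambda v: v > 20)
```

In this sketch, `by_type` selects data sets purely on data semantics, while `by_value` adds a condition on data content. The query is written against the annotated types rather than the column names, which is what lets it span structurally different tables.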