Paper 9

Integrating Large and Distributed Life Sciences Resources for Systems Biology Research: Progress and New Challenges

Authors: Hasan Jamil

Volume 3 (2011)

Abstract

Researchers in Systems Biology routinely access vast collections of hidden web research resources freely available on the internet. These collections include online data repositories, online and downloadable data analysis tools, publications, text mining systems, visualization artifacts, etc. Almost always, these resources have complex data formats that are heterogeneous in representation, data type, interpretation and even identity. Researchers are therefore often forced to develop analysis pipelines and data management applications that involve extensive and prohibitive manual interactions. Such approaches act as a barrier to optimal use of these resources and thus impede the progress of research. In this paper, we discuss our experience of building a new middleware approach to data and application integration for Systems Biology that leverages recent developments in schema matching, wrapper generation, workflow management, and query language design. This approach advocates ad hoc integration of arbitrary resources and computational pipeline construction using a declarative language. We highlight the features and advantages of this new data management system, called LifeDB, and its query language BioFlow. Based on our experience, we highlight the new challenges it raises and potential solutions to these research issues, toward a viable platform for large-scale autonomous data integration. We believe the research issues we raise are of general interest to the autonomous data integration community and will be equally applicable to research unrelated to LifeDB.