Highlight

Research Focus on Data Analysis and Management

Achievement/Results

Our NSF-funded Program in Integrative Information, Computer and Application Sciences (PICASso) has added a new research focus: Data Analysis and Management. Many of our trainees and faculty members are working on research problems centered on the analysis of diverse, high-dimensional data. A number of publications by IGERT trainees and their advisors have resulted from this effort. Here we highlight the work of Chad Myers, Matt Hibbs, Curtis Huttenhower, and advisors Olga Troyanskaya and Kai Li, who published three papers in Bioinformatics, a leading computational biology journal.

1. Huttenhower C, Hibbs MA, Myers CL, Troyanskaya OG. A scalable method for integration and functional analysis of multiple microarray data sets. Bioinformatics 22:2890, 2006.

This publication builds on the group’s earlier work, extending Bayesian data integration to large-scale compendia of gene expression data. Microarrays (the most common type of gene expression data) are currently in widespread use in the biological sciences and represent a rich source of high-dimensional data. However, they must be carefully processed to handle noise and widely varying experimental characteristics.

2. Myers CL, Troyanskaya OG. Context-sensitive data integration and prediction of biological networks. Bioinformatics 23(17):2322-2330, 2007.

This work explores the presence of context-dependent variation in functional genomic data and expands the Bayesian approach for context-sensitive integration and query-based recovery of biological process-specific networks. This methodology can be applied to both microarray data and other sources of high-throughput genomic data.

3. Hibbs MA, Hess DC, Myers CL, Huttenhower C, Li K, Troyanskaya OG. Exploring the functional landscape of gene expression: directed search of large microarray compendia. Bioinformatics 23(20):2692-2699, 2007.

This paper analyzed the functional coverage of a S. cerevisiae gene expression microarray compendium spanning roughly 2400 experimental conditions, and designed a context-sensitive search algorithm for rapid exploration of that compendium. A researcher using the system provides a small set of query genes to establish a biological search context. Based on this query, each dataset is weighted by its relevance to the context, and within the weighted datasets the system identifies additional genes that are co-expressed with the query set.
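The query-driven search described above can be illustrated with a small sketch. This is not the authors' published algorithm: it assumes, purely for illustration, that a dataset's relevance is the mean pairwise Pearson correlation among the query genes (floored at zero) and that a candidate gene's score is its weighted average correlation with the query genes. The function names (`pearson`, `dataset_weight`, `search`) and the toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def dataset_weight(dataset, query):
    """Assumed relevance measure: mean correlation among query genes, floored at 0."""
    present = [g for g in query if g in dataset]
    pairs = list(combinations(present, 2))
    if not pairs:
        return 0.0
    mean_r = sum(pearson(dataset[a], dataset[b]) for a, b in pairs) / len(pairs)
    return max(mean_r, 0.0)


def search(datasets, query, top_k=5):
    """Weight each dataset by the query, then rank non-query genes by weighted co-expression."""
    query = sorted(set(query))
    weights = [dataset_weight(d, query) for d in datasets]
    total = sum(weights)
    if total == 0:
        return weights, []
    scores = {}
    for dataset, w in zip(datasets, weights):
        if w == 0:
            continue  # dataset judged irrelevant to this query context
        present = [g for g in query if g in dataset]
        for gene, profile in dataset.items():
            if gene in query:
                continue
            r = sum(pearson(profile, dataset[q]) for q in present) / len(present)
            scores[gene] = scores.get(gene, 0.0) + w * r
    ranked = sorted(
        ((gene, s / total) for gene, s in scores.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    return weights, ranked[:top_k]


# Toy compendium: each dataset maps gene name -> expression across conditions.
dataset_a = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [1, 2, 3, 5], "g4": [4, 3, 2, 1]}
dataset_b = {"g1": [1, 2, 3, 4], "g2": [4, 3, 2, 1], "g3": [4, 1, 3, 2], "g4": [1, 2, 3, 4]}

weights, ranked = search([dataset_a, dataset_b], query=["g1", "g2"])
print(weights)
print(ranked)
```

In this toy example the query genes move together in the first dataset and in opposite directions in the second, so only the first dataset contributes to the ranking. A real system would need a more careful relevance measure and normalization across datasets of different sizes.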

The work of these PICASso trainees and faculty advisors enables biological researchers to explore the totality of existing microarray data in a manner useful for drawing conclusions and formulating hypotheses. Not only can these methods mine new knowledge from large collections of existing experimental data, they can also guide experimental scientists toward promising areas of research to pursue in a laboratory setting. These advances were made possible only through an emphasis on interdisciplinary research and a truly integrated application of computational analysis and machine learning to experimental data from the life sciences.

Address Goals

Our research goals include taking advantage of the synergies between computer science and its applications in the natural sciences and engineering. When properly exploited, the underlying ties between these fields enable advances in many areas that could not be easily achieved otherwise. Many areas of science, engineering and information technology are producing tremendous amounts of data at dramatically higher rates, driven by innovations in observational equipment and by exponential growth in computational and storage capabilities. The resulting avalanche of data has transformed whole areas of science. It has created the opportunity for a revolution in scientific discovery through data analysis and management, which in turn play a crucial role in inventing better engineering designs and new information services. Sophisticated, promising approaches have been developed in areas like machine learning, information retrieval and image/vision processing, but they have not yet been applied to many key scientific problems. Happily, while successful methods depend on key characteristics of the data and problems, there is much overlap and learning to be shared across disciplines. Real progress requires that experts in relevant disciplines work together in tightly integrated, cross-disciplinary teams. Our faculty recognize both this need and the synergies across disciplines, and PICASso has started to extend its reach to this key area of computational science, pushing the research frontiers and developing educational methods and forums to train interdisciplinary scientists.