Datamining in Science: Mining Patterns in Protein Structures—Algorithms and Applications

With the data explosion under way in the sciences, tools that help analyze data efficiently are becoming increasingly important. This session of the 2005 Microsoft Research Faculty Summit describes the tools included with Microsoft SQL Server 2005 (code-named Yukon), and Wei Wang describes the MotifSpace project, a comprehensive database of candidate spatial protein motifs built with recently developed data-mining algorithms.

One of the next great frontiers in molecular biology is to understand and predict protein function. Proteins are simple linear chains of polymerized amino acids (residues) whose biological functions are determined by the three-dimensional shapes that they fold into. A popular approach to understanding proteins is to break them down into structural sub-components called motifs. Motifs are recurring structural and spatial units that are frequently correlated with specific protein functions.

Traditionally, the discovery of motifs has been a laborious task of scientific exploration. This talk reviews recent data-mining algorithms that we have developed for automatically identifying potential spatial motifs. Our methods automatically find frequently occurring substructures within graph-based representations of proteins. The complexity of protein structures and their corresponding graphs poses significant computational challenges. The kernel of our approach is an efficient subgraph-mining algorithm that detects all (maximal) frequent subgraphs in a graph database, that is, those occurring with at least a user-specified minimum frequency.
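
To make the notion of a frequent subgraph concrete, the following is a minimal sketch and not the speakers' algorithm. It assumes each protein is a small labeled graph whose nodes carry residue types and whose edges mark spatial contacts. The function names, the toy graphs, and the brute-force matching are all illustrative assumptions. The matching is exponential in pattern size, which is exactly the cost an efficient subgraph miner is designed to avoid.

```python
from itertools import permutations

# Assumed representation: a graph is (labels, edges), where labels maps
# node id -> residue type and edges is a set of frozenset({u, v}) contacts.
def make_graph(labels, edge_pairs):
    return labels, {frozenset(pair) for pair in edge_pairs}

def contains(graph, pattern):
    """Brute-force test: does pattern occur in graph as a label-preserving subgraph?"""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_nodes = list(p_labels)
    # Try every injective assignment of pattern nodes to graph nodes.
    for image in permutations(g_labels, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        if any(p_labels[n] != g_labels[mapping[n]] for n in p_nodes):
            continue
        if all(frozenset(mapping[n] for n in e) in g_edges for e in p_edges):
            return True
    return False

def support(database, pattern):
    """Number of graphs in the database that contain the pattern."""
    return sum(contains(g, pattern) for g in database)

def frequent_edge_patterns(database, min_support):
    """Single-edge patterns (sorted label pairs) found in >= min_support graphs."""
    counts = {}
    for labels, edges in database:
        # Count each label pair at most once per graph.
        seen = {tuple(sorted(labels[n] for n in e)) for e in edges}
        for pair in seen:
            counts[pair] = counts.get(pair, 0) + 1
    return {pair: c for pair, c in counts.items() if c >= min_support}

# Toy database of three hypothetical "proteins" (not real structures).
database = [
    make_graph({0: "HIS", 1: "ASP", 2: "SER", 3: "GLY"},
               [(0, 1), (1, 2), (0, 2), (2, 3)]),
    make_graph({0: "SER", 1: "HIS", 2: "ASP"},
               [(0, 1), (1, 2), (0, 2)]),
    make_graph({0: "HIS", 1: "ASP", 2: "GLY"},
               [(0, 1), (1, 2)]),
]

# Candidate motif: three mutually contacting residues.
triad = make_graph({"a": "HIS", "b": "ASP", "c": "SER"},
                   [("a", "b"), ("b", "c"), ("a", "c")])

triad_support = support(database, triad)
frequent_edges = frequent_edge_patterns(database, min_support=2)
```

With a minimum frequency of 2, the triangle pattern qualifies as frequent here because it occurs in two of the three toy graphs. A practical miner would grow such patterns incrementally from frequent edges and prune candidates whose support falls below the threshold, rather than testing each pattern exhaustively.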

Date:
Speakers:
Jamie MacLennan, Wei Wang, and Zhaohui Tang
Affiliation:
Microsoft; University of North Carolina at Chapel Hill; Microsoft