Common diseases such as cardiovascular disease, cancer, obesity, diabetes and psychiatric illnesses are caused by a combination of multiple genetic and environmental factors. Understanding how the genetic factors interact with each other and with the environment would allow better prevention, diagnosis and treatment of these diseases, including treatment tailored to the genetic make-up of individual patients. Almost all existing approaches for studying the genetic causes of disease examine the effects of a very small set of genes, and so cannot capture the subtle effects of many genes. At MSR Cambridge, we are collaborating with researchers at the Wellcome Trust Sanger Institute to perform genome-level analysis that integrates genetic and functional genomic data to study the effects of multiple genes jointly.
The goal of our joint project with the Sanger Institute is to model multiple sources of genomic data within a common statistical framework using large-scale machine learning tools. By combining the expertise of the Sanger Institute and MSR Cambridge, we aim to gain new insights into genetic networks and the pathogenesis, diagnosis and treatment of human disease, whilst also driving the development of machine learning tools usable by the wider scientific community. The project uses two primary data sources: genetic variation (haplotype) sequences from the International HapMap Project, and high-throughput gene expression data from the Sanger Institute’s Population and Comparative Genomics group, in the first instance obtained by microarray analysis of cell lines collected from the HapMap project. The gene expression data is useful because it acts as an intermediary between single nucleotide polymorphism (SNP) measurements and disease susceptibility. Almost all previous approaches have studied the genetic basis of disease using either genetic variation data alone or gene expression data alone, whereas our approach combines these two complementary genomic data sets to understand genetic variation. Indeed, a preliminary study from the Sanger Institute analysed co-variation between pairs of SNPs and gene expression probes and discovered significant relationships between them.
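To make the idea of pairwise co-variation between SNPs and expression probes concrete, here is a minimal Python sketch. It is not the analysis used in the Sanger Institute study. It assumes genotypes are coded as minor-allele counts (0, 1 or 2) and measures co-variation with a plain Pearson correlation. The function name `snp_expression_correlation` and all of the data are hypothetical; the data are synthetic, with one SNP deliberately made to drive one probe.

```python
import numpy as np


def snp_expression_correlation(genotypes, expression):
    """Pearson correlation between every SNP and every expression probe.

    genotypes:  (n_individuals, n_snps) array of minor-allele counts (0, 1, 2).
    expression: (n_individuals, n_probes) array of expression levels.
    Returns an (n_snps, n_probes) array of correlation coefficients.
    Assumes no column is constant (a constant column has zero variance).
    """
    g = np.asarray(genotypes, dtype=float)
    e = np.asarray(expression, dtype=float)
    # Standardise each column to zero mean and unit (population) variance.
    g = (g - g.mean(axis=0)) / g.std(axis=0)
    e = (e - e.mean(axis=0)) / e.std(axis=0)
    # The mean of the products of standardised values is the Pearson correlation.
    return g.T @ e / g.shape[0]


# Synthetic illustration only: these values are made up, not HapMap data.
rng = np.random.default_rng(0)
n_individuals = 200
genotypes = rng.integers(0, 3, size=(n_individuals, 3))
expression = rng.normal(size=(n_individuals, 2))
expression[:, 0] += 1.5 * genotypes[:, 1]  # probe 0 is driven by SNP 1

corr = snp_expression_correlation(genotypes, expression)
best_snp = int(np.argmax(np.abs(corr[:, 0])))  # SNP most associated with probe 0
```

A real analysis would also need a significance test with correction for the very large number of SNP and probe pairs, which this sketch leaves out.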
The figure above illustrates our approach for jointly modelling the haplotype data and the gene expression data. The haplotype model is a statistical model of correlations within the haplotype data. SNP variants that are closely linked in the genome are in non-random association with neighbouring SNPs, leading to a block-like structure in the haplotypes, where blocks are conserved between different individuals. We aim to use a haplotype model that captures this correlated structure in order to represent the entire observed haplotype sequences compactly. SNPs can affect the functionality of genes in two main ways: SNPs in the coding regions of a gene affect the protein that the gene codes for, whilst SNPs in the regulatory region affect how strongly the gene is expressed. Our framework separates out these two pathways so that direct and indirect effects on gene expression can be modelled in different ways. Genes are often co-regulated or co-expressed, leading to strong correlations in expression levels between different genes. Such correlations have been examined in previous gene expression models. As we have information from coding SNPs, we plan to extend such models to a richer interaction model that also models the intra-cellular relationship between levels of protein activity and levels of gene expression. The final stage of our analysis will be to identify correlations between our haplotype and interaction models and the phenotypes or diseases of the individuals that the samples come from. This will identify relationships between genetic variation and gene expression, and hence lead to improved understanding of the genetic causes of human disease.
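The block-like structure of haplotypes can be illustrated with a small Python sketch. It is not the project's haplotype model. It computes the squared correlation (r²) between pairs of biallelic SNPs, a standard measure of non-random association, on a tiny hand-made data set. The function name `ld_r2` and the haplotypes are hypothetical: SNPs 0 to 2 form one block and SNPs 3 to 5 form another.

```python
import numpy as np


def ld_r2(haplotypes):
    """Squared correlation (r^2) between every pair of biallelic SNPs.

    haplotypes: (n_haplotypes, n_snps) array of 0/1 alleles.
    Returns an (n_snps, n_snps) array. Assumes every SNP is polymorphic,
    since a monomorphic SNP has zero variance and r^2 is undefined.
    """
    h = np.asarray(haplotypes, dtype=float)
    p = h.mean(axis=0)                 # frequency of allele 1 at each SNP
    p_joint = h.T @ h / h.shape[0]     # frequency of carrying allele 1 at both SNPs
    d = p_joint - np.outer(p, p)       # disequilibrium coefficient D
    var = p * (1.0 - p)
    return d ** 2 / np.outer(var, var)


# Made-up haplotypes for illustration: two independent blocks of three SNPs.
haplotypes = np.array([
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
])

r2 = ld_r2(haplotypes)
within_block = r2[:3, :3]    # pairs of SNPs inside the first block
between_blocks = r2[:3, 3:]  # pairs of SNPs from different blocks
```

In this toy example SNPs within a block are perfectly correlated and SNPs in different blocks are uncorrelated. Real haplotype data show a softer version of the same pattern, which is the structure a compact haplotype model can exploit.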
This project is run by the Cambridge Bioinformatics group.
- Oliver Stegle, Leopold Parts, Matias Piipari, John Winn, and Richard Durbin, Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses, in Nature Protocols, vol. 7, pp. 500-507, Nature Publishing Group, February 2012.
- L. Parts, Å.K. Hedman, S. Keildson, A.J. Knights, C. Abreu-Goodger, M. van de Bunt, J.A. Guerra-Assunção, N. Bartonicek, S. van Dongen, R. Mägi, J. Nesbit, A. Barrett, M. Rantalainen, A. C. Nica, M. A. Quail, K. S. Small, D. Glass, A. J. Enright, J. Winn, P. Deloukas, E. T. Dermitzakis, M. I. McCarthy, T. D. Spector, R. Durbin, and C. M. Lindgren, Extent, Causes, and Consequences of Small RNA Expression Variation in Human Adipose Tissue, in PLoS Genetics, vol. 8, no. 5, pp. e1002704, Public Library of Science, 2012.
- Leopold Parts, Oliver Stegle, John Winn, and Richard Durbin, Joint Genetic Analysis of Gene Expression Data with Inferred Cellular Phenotypes, in PLoS Genetics, PLoS, January 2011.
- Oliver Stegle, Leopold Parts, Richard Durbin, and John Winn, A Bayesian Framework to Account for Complex Non-Genetic Factors in Gene Expression Levels Greatly Increases Power in eQTL Studies, in PLoS Computational Biology, Public Library of Science, 6 May 2010.
- Daniel Glass, Leopold Parts, David Knowles, Abraham Aviv, and Tim Spector, No Correlation Between Childhood Maltreatment and Telomere Length, in Biological Psychiatry, 2010.
- Magnus Rattray, Oliver Stegle, Kevin Sharp, and John Winn, Inference algorithms and learning theory for Bayesian sparse factor analysis, in International Workshop on Statistical-Mechanical Informatics 2009, Journal of Physics: Conference Series, 2009.
- Oliver Stegle, Anitha Kannan, Richard Durbin, and John M. Winn, Accounting for Non-genetic Factors Improves the Power of eQTL Studies, in International Conference on Research in Computational Molecular Biology, 2008.
- Jim C. Huang, Anitha Kannan, and John M. Winn, Bayesian association of haplotypes and non-genetic factors to regulatory and phenotypic variation in human populations, in International Conference on Intelligent Systems for Molecular Biology, 2007.