Computational Biology Seminar Series

An occasional forum for delivering academic computational biology talks. All talks are open to the public.

Upcoming Speakers and Events

TITLE: Learning the regulatory code of the accessible genome with deep convolutional neural networks

WHO: David Kelley

AFFILIATION: Harvard University

HOST: Jennifer Listgarten

WHEN: Friday, May 27th, 2016.

WHERE: Microsoft Conference Center located at One Memorial Drive, First Floor, Cambridge, MA

SCHEDULE: 4pm-5pm


The complex language of eukaryotic gene expression remains incompletely understood. Despite the importance suggested by many noncoding variants statistically associated with human disease, nearly all such variants have unknown mechanisms. I’ll address this challenge using an approach based on a recent machine learning advance: deep convolutional neural networks (CNNs). My colleagues and I developed an open source package, Basset, to apply CNNs to learn the functional activity of DNA sequences from genomics data. We trained Basset on a compendium of accessible genomic sites mapped in 164 cell types by DNaseI-seq and demonstrated far greater predictive accuracy than previous methods. Basset predictions for the change in accessibility between variant alleles were far greater for GWAS SNPs that are likely to be causal than for nearby SNPs in linkage disequilibrium with them. With Basset, a researcher can perform a single sequencing assay in their cell type of interest and simultaneously learn that cell's chromatin accessibility code and annotate every mutation in the genome with its influence on present accessibility and latent potential for accessibility. Thus, Basset offers a powerful computational approach to annotate and interpret the noncoding genome.
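As a rough illustration of the idea in this abstract, the sketch below one-hot encodes a DNA sequence, slides a single convolutional filter along it, and compares the strongest activation between two alleles. It is a toy under stated assumptions, not Basset's code or architecture: the filter weights, the example sequences, and every function name here are hypothetical, and a real model would learn many filters from data.

```python
# Toy illustration only: one-hot encode DNA and slide one convolutional
# filter along it. All names and values are hypothetical, not Basset's API.

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def conv1d_scan(encoded, filt):
    """Slide a (width x 4) filter over the encoded sequence, no padding."""
    width = len(filt)
    scores = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for offset in range(width):
            for channel in range(4):
                total += encoded[start + offset][channel] * filt[offset][channel]
        scores.append(total)
    return scores

def relu(values):
    """Rectified linear unit applied elementwise."""
    return [max(0.0, v) for v in values]

def max_activation(seq, filt):
    """Strongest filter response anywhere in the sequence (max pooling)."""
    return max(relu(conv1d_scan(one_hot(seq), filt)))

# Hand-made filter that rewards the motif "TATA": +1 per match, -1 per mismatch.
MOTIF = "TATA"
tata_filter = [[1.0 if b == m else -1.0 for b in BASES] for m in MOTIF]

reference = "GGCTATAAGC"   # contains TATA starting at index 3
alternate = "GGCTAGAAGC"   # single substitution T>G at index 5

activations = relu(conv1d_scan(one_hot(reference), tata_filter))
best_position = max(range(len(activations)), key=activations.__getitem__)

# Difference in the strongest response between the two alleles.
allele_delta = max_activation(alternate, tata_filter) - max_activation(reference, tata_filter)
```

Here the substitution weakens the motif match, so `allele_delta` is negative. A trained network would make the analogous comparison using its predicted accessibility for each allele rather than a single hand-made filter.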


David completed his PhD in the Center for Bioinformatics and Computational Biology at the University of Maryland, College Park, advised by Steven Salzberg, where he developed methods and software for genome assembly and gene prediction. In 2007, he joined John Rinn’s lab in the Stem Cell and Regenerative Biology department at Harvard, where he performed multiple analyses of the function and evolution of long noncoding RNAs. David also introduced an approach based on deep convolutional neural networks to predict functional activity of DNA sequences. He recently joined Calico Labs, where he’ll continue to apply machine learning approaches to genomics toward better understanding the aging process.

TITLE: Integrative, interpretable deep learning frameworks for regulatory genomics and epigenomics

WHO: Anshul Kundaje

AFFILIATION: Stanford University

HOST: Jennifer Listgarten

WHEN: Friday, April 29th, 2016.

WHERE: Microsoft Conference Center located at One Memorial Drive, First Floor, Cambridge, MA

SCHEDULE: 11am-12pm


We present generalizable and interpretable supervised deep learning frameworks to predict the regulatory and epigenetic state of putative functional genomic elements by integrating raw DNA sequence with diverse chromatin assays such as ATAC-seq, DNase-seq or MNase-seq. First, we develop multi-modal convolutional neural networks (CNNs) that can integrate haploid or diploid DNA sequence and chromatin accessibility profiles (DNase-seq or ATAC-seq) to predict in-vivo binding sites of a diverse set of transcription factors (TFs) across cell types with high accuracy. Our integrative models provide significant improvements over other state-of-the-art methods, including recently published deep learning TF binding models. Next, we train multi-task, multi-modal deep CNNs to simultaneously predict multiple histone modifications and combinatorial chromatin state at regulatory elements by integrating DNA sequence, RNA-seq and ATAC-seq or a combination of DNase-seq and MNase-seq. Our models achieve high prediction accuracy even across cell types, revealing a fundamental predictive relationship between chromatin architecture and histone modifications. Finally, we develop DeepLIFT (Deep Linear Importance Feature Tracker), a novel interpretation engine for extracting and ranking predictive and biologically meaningful patterns from deep neural networks (DNNs) for diverse genomic data types. We apply DeepLIFT to our models to obtain unified TF sequence affinity motifs, infer high-resolution point binding events of TFs, dissect regulatory sequence grammars involving homodimeric and heterodimeric binding with co-factors, learn predictive chromatin architectural features and unravel the sequence and architectural heterogeneity of regulatory elements.


Anshul Kundaje is an Assistant Professor of Genetics and Computer Science at Stanford University and a 2014 Alfred Sloan Fellow. His primary research interest is computational regulatory genomics. His lab develops statistical and machine learning methods for large-scale integrative analysis of diverse functional genomic data to decipher heterogeneity of regulatory elements, uncover their long-range interactions in the context of 3D genome organization, learn transcriptional regulatory network models and understand the regulatory impact of non-coding genetic variation. Anshul has led the computational analysis efforts of two of the largest functional genomics consortia - The Encyclopedia of DNA Elements (ENCODE) Project and the Roadmap Epigenomics Project.

TITLE: Learning Cross-Corpora Models of Disease Progression in Autism Spectrum Disorder

WHO: Finale Doshi-Velez

AFFILIATION: Harvard University

HOST: Jennifer Listgarten

WHEN: Tuesday, May 3rd, 2016.

WHERE: Microsoft Conference Center located at One Memorial Drive, First Floor, Cambridge, MA

SCHEDULE: 3pm-4pm


Patients with developmental disorders, such as autism spectrum disorder (ASD), present with symptoms that change with time even if the named diagnosis remains fixed. For example, language impairments may present as delayed speech in a toddler and difficulty reading in a school-age child. Characterizing these trajectories is important for early treatment. However, deriving these trajectories from observational sources is challenging: electronic health records only reflect observations of patients at irregular intervals and only record what factors are clinically relevant at the time of observation. Meanwhile, caretakers discuss daily developments and concerns on social media.

In this talk, I will present a fully unsupervised approach for learning disease trajectories from incomplete medical records and social media posts, including cases in which we have only a single observation of each patient. In particular, we use a dynamic topic model approach which embeds each disease trajectory as a path in R^D. A Pólya-gamma augmentation scheme is used to efficiently perform inference as well as incorporate multiple data sources. We learn disease trajectories from the electronic health records of 13,435 patients with ASD and the forum posts of 13,743 caretakers of children with ASD, deriving interesting clinical insights as well as good predictions. I'll end with broader questions about learning disease models from data.


Finale Doshi-Velez is an assistant professor in Computer Science at Harvard. She completed her Master's at the University of Cambridge, her PhD at MIT, and her postdoc at Harvard Medical School.

TITLE: Genetic Screens with CRISPR: A New Hope in Functional Genomics

WHO: John Doench

AFFILIATION: Broad Institute of MIT and Harvard

HOST: Jennifer Listgarten and Nicolo Fusi

WHEN: Wednesday, May 4th, 2016.

WHERE: Microsoft Conference Center located at One Memorial Drive, First Floor, Cambridge, MA

SCHEDULE: 4pm-5pm.


Functional genomics attempts to understand the genome by disrupting the flow of information from DNA to RNA to protein and then observing how the cell or organism changes in response. Both RNAi and CRISPR technologies are simply hacks of systems that originally evolved to silence viruses, reprogrammed to target genes we’re interested in studying, as decoding the function of genes is a critical step towards understanding how gene dysfunction leads to disease. Here we will discuss the development and optimization of CRISPR technology for genome-wide genetic screens and its application to multiple biological problems.


John Doench is the Associate Director of the Genetic Perturbation Platform at the Broad Institute. He develops and applies the latest approaches in functional genomics, including RNAi, ORF, and CRISPR technologies, to understand the function of genes and how gene dysfunction leads to disease. John collaborates with researchers across the community to develop faithful biological models and execute genetic screens. Prior to joining the Broad in 2009, John did his postdoctoral work at Harvard Medical School, received his PhD from the biology department at MIT, and majored in history at Hamilton College. John lives in Jamaica Plain, MA with his wife and daughter, where he enjoys coaching soccer, cheering on the Red Sox and Patriots, playing volleyball, running, and avoiding imminent death while navigating the streets of Boston on a bicycle.

Past Speakers

TITLE: Insight into the biology of common diseases using summary statistics of large genome-wide association studies

WHO: Hilary Finucane

AFFILIATION: Harvard School of Public Health and MIT Mathematics

HOST: Jennifer Listgarten

WHEN: Tues. February 16th, 2016

WHERE: Microsoft Conference Center located at One Memorial Drive, First Floor, Cambridge, MA

SCHEDULE: 11am-12pm


Datasets with genotype data for tens of thousands of individuals with and without a given disease contain valuable information about the genetic basis of the disease. However, for most common diseases, obtaining insights from these data is difficult because the signal is very diffuse: there are likely thousands or tens of thousands of genetic variants that each contribute a small amount to disease risk, and that are hidden among roughly a million variants in the dataset. Moreover, for many of the largest genotype datasets, no individual researcher has access to all of the genotype data; rather, the only data available are meta-analyzed marginal effect size estimates for each variant. I will describe a powerful approach to modeling these summary statistics that allows us, for example, to identify disease-relevant tissues or to quantify the degree to which two traits have a common genetic basis. The method, called LD score regression, is based on a commonly used model in genetics in which the effect of each variant on the disease is random. The parameters of this model provide information about the disease such as whether regions of the genome active in a given tissue (e.g., liver) tend to be more associated with disease than regions of the genome active in a second tissue (e.g., brain). The LD score regression method takes into account factors such as the correlational structure of the genome, potential confounding in the data, and the possibility that causal variants not in the dataset might be correlated with variants that are in the dataset.
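To make the regression idea concrete, here is a toy sketch and not the published software. It assumes the commonly stated LD score regression relationship, in which the expected chi-square statistic of variant j is (N * h2 / M) * ld_score_j plus an intercept, where N is the sample size, M the number of variants, and h2 the heritability. The data are synthetic and noiseless, all numeric values are made up, and plain least squares stands in for the weighted regression a real analysis would use.

```python
# Toy sketch of the LD score regression idea on synthetic, noiseless data.
# Assumed model: E[chi2_j] = (N * h2 / M) * ld_score_j + intercept.
# Every number below is invented for illustration.

def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

N = 50_000             # hypothetical GWAS sample size
M = 1_000_000          # hypothetical number of variants
TRUE_H2 = 0.4          # heritability used to generate the toy data
TRUE_INTERCEPT = 1.05  # values above 1 would suggest confounding

ld_scores = [10.0, 25.0, 50.0, 80.0, 120.0, 200.0]
chi2 = [N * TRUE_H2 / M * score + TRUE_INTERCEPT for score in ld_scores]

slope, intercept = fit_line(ld_scores, chi2)
h2_estimate = slope * M / N   # rescale the slope back to heritability
```

Because the toy data contain no noise, the fit recovers the heritability and intercept used to generate them. With real summary statistics the estimates carry sampling error, and the intercept is what separates confounding from true polygenic signal.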


Hilary Finucane is a graduate student in the MIT Mathematics department doing research in statistical genetics. Her advisor is Alkes Price, at the Harvard School of Public Health. As an undergraduate at Harvard, she majored in math and wrote her senior thesis on coding schemes for multilevel flash memory with Michael Mitzenmacher. She then completed an MSc in theoretical computer science at the Weizmann Institute of Science, working with Irit Dinur, followed by a year of research in probability theory and geometric group theory with Itai Benjamini, also at the Weizmann Institute of Science. She is supported by a Hertz Foundation Fellowship.

TITLE: Using Observed Controls to Infer the Effect of Unobserved Controls

WHO: Emily Oster

AFFILIATION: Brown University

HOST: Jennifer Listgarten

WHEN: Wed, February 3rd, 2016.

WHERE: Microsoft Conference Center located at One Memorial Drive, First Floor, Cambridge, MA

SCHEDULE: 1:30-2:30 PM


Omitted variable bias (equivalently, residual confounding) is a well-known issue in deriving causal effects from observational data. This issue is especially problematic when the confounding arises from variables unobserved to the researcher and when there is no random variation in treatment to rely on. I will discuss a methodology to evaluate the robustness of causal effects to unobserved confounders. The key assumption is that there is a correspondence between how the treatment variable relates to the observed controls and how it relates to the unobserved controls. I will discuss the theory – an extension of Altonji, Elder and Taber (2005) – and present evidence that this approach may perform well in some social science settings. I will discuss its application to both the economics literature and the medical literature.


Emily Oster is an associate professor of economics at Brown University. She is one of the leading experts on public health issues, especially in developing contexts, within economics. Her research primarily uses observational data to reach important and provocative causal conclusions on major public health issues, such as the causes and consequences of infant mortality and HIV/AIDS. Beyond her purely academic work, Emily has tried to communicate the findings of the best work in health to consumers through her book on pregnancy, Expecting Better, and her column “Ask Emily” in the Wall Street Journal.

Title: Computational Aspects of Biological Information 2015

Computational Aspects of Biological Information (CABI) 2015 is the third one-day workshop on challenges and successes in computational biology and will bring together experts in the Boston/Cambridge area to discuss computational solutions to problems in biology, including systems biology, genomics, and related areas.

The workshop is open to everyone and registration is free. Continental breakfast and lunch will be served.

Tuesday, December 1, 2015
Microsoft Research New England
Horace Mann Conference Room
First Floor Conference Center
One Memorial Drive, Cambridge, Mass.

Registration and breakfast begin at 9:00 a.m.


Registration is now open

Poster session

There will be a poster session in the afternoon. To submit a poster, please review the guidelines and submission information by November 1st. Space is limited, and accepted poster notifications will be sent by November 10th.

Workshop speakers

Confirmed speakers include:

  • Bonnie Berger (MIT CSAIL)
  • Arup Chakraborty (MIT Chemistry)
  • Michael Desai (Harvard Systems Biology)
  • Polina Golland (MIT CSAIL)
  • Rafael Irizarry (Harvard University)
  • Leonid Mirny (Harvard-MIT Division of Health Sciences and Technology, MIT)
  • Peter Park (Harvard Medical School)
  • David Sontag (NYU)
  • Shamil Sunyaev (Harvard Medical School)

Organizing committee

Nicolo Fusi (Microsoft Research)
Jennifer Listgarten (Microsoft Research)
James Zou (Microsoft Research)

For the most up-to-date information, please go to the CABI web site at:

Title: Recovering usable hidden structure using exploratory data analyses on genomic data

(This will be part of our MSR New England General Colloquium Series, intended for broad audiences of all backgrounds.)

Speaker: Barbara Engelhardt

Affiliation: Princeton

Host: Jennifer Listgarten

Date: Wed. November 4th, 2015

Time: 4pm - 5pm with reception to follow


Methods for exploratory data analysis have been the recent focus of much attention in "big data" applications because of their ability to quickly allow the user to explore structure in the underlying data in a controlled and interpretable way. In genomics, latent factor models are commonly used to identify population substructure, identify gene clusters, and control noise in large data sets. In this talk I will describe a series of statistical models for exploratory data analysis to illustrate the structure that they are able to identify in large genomic data sets. I will consider several downstream uses for the recovered latent structure: understanding technical noise in the data, developing undirected networks from the recovered structure, and using this latent structure to study genomic differences among people.


Barbara Engelhardt is an assistant professor in the Computer Science Department and the Center for Statistics and Machine Learning at Princeton University. Prior to that, she was at Duke University as an assistant professor in Biostatistics and Bioinformatics and Statistical Sciences. She graduated from Stanford University and received her Ph.D. from the University of California, Berkeley, advised by Professor Michael Jordan. She did postdoctoral research at the University of Chicago, working with Professor Matthew Stephens. Interspersed among her academic experiences, she spent two years working at the Jet Propulsion Laboratory, a summer at Google Research, and a year at 23andMe, a personal genomics company. Professor Engelhardt received an NSF Graduate Research Fellowship, the Google Anita Borg Memorial Scholarship, and the Walter M. Fitch Prize from the Society for Molecular Biology and Evolution. She also received the NIH NHGRI K99/R00 Pathway to Independence Award. Professor Engelhardt is currently a PI on the Genotype-Tissue Expression (GTEx) Consortium. Her research interests involve statistical models and methods for analysis of high-dimensional data, with a goal of understanding the underlying biological mechanisms of complex phenotypes and human diseases.


Title: Personalized Health with Gaussian Processes

(This will be part of our MSR New England General Colloquium Series, intended for broad audiences of all backgrounds.)

Speaker: Neil Lawrence

Affiliation: University of Sheffield

Host: Nicolo Fusi

Date: Wed, Aug 19

Time: 4pm - 5pm with reception to follow


Modern data connectivity gives us different views of the patient which need to be unified for truly personalized health care. I'll give a personal perspective on the types of methodological and social challenges we expect to arise in this domain and motivate Gaussian process models as one approach to dealing with the explosion of data.


Neil Lawrence received his bachelor's degree in Mechanical Engineering from the University of Southampton in 1994. Following a period as a field engineer on oil rigs in the North Sea, he returned to academia to complete his PhD in 2000 at the Computer Lab in Cambridge University. He spent a year at Microsoft Research in Cambridge before leaving to take up a Lectureship at the University of Sheffield, where he was subsequently appointed Senior Lecturer in 2005. In January 2007 he took up a post as a Senior Research Fellow at the School of Computer Science in the University of Manchester, where he worked in the Machine Learning and Optimisation research group. In August 2010 he returned to Sheffield to take up a collaborative Chair in Neuroscience and Computer Science.

Neil's main research interest is machine learning through probabilistic models. He focuses on both the algorithmic side of these models and their application. He has a particular focus on applications in personalized health and computational biology, but happily dabbles in other areas such as speech, vision and graphics.

Neil was Associate Editor in Chief for IEEE Transactions on Pattern Analysis and Machine Intelligence (from 2011-2013) and is an Action Editor for the Journal of Machine Learning Research. He was the founding editor of the JMLR Workshop and Conference Proceedings (2006) and is currently series editor. He was an area chair for the NIPS conference in 2005, 2006, 2012 and 2013, Workshops Chair in 2010 and Tutorials Chair in 2013. He was General Chair of AISTATS in 2010 and AISTATS Programme Chair in 2012. He was Program Chair of NIPS in 2014 and is General Chair for 2015.


Title: Modeling molecular heterogeneity between individuals and single cells

Speaker: Oliver Stegle

Affiliation: European Molecular Biology Laboratory European Bioinformatics Institute (EMBL-EBI)

Host: Jennifer Listgarten and Nicolo Fusi

Date: Monday, May 11th, 2015

Time: 2:00 PM - 3:00 PM


The analysis of large-scale expression datasets is often compromised by hidden structure between samples. In the context of genetic association studies, this structure can be linked to differences between individuals, which can reflect their genetic makeup (such as population structure) or be traced back to environmental and technical factors. In this talk, I will discuss statistical methods to reconstruct this structure from the observed data to account for it in genetic analyses. By incorporating principles from causal reasoning, we show that critical pitfalls of falsely explaining away true biological signals can be circumvented. In the second part of this talk, I will extend the introduced class of latent variable models to account for unwanted heterogeneity in single-cell transcriptome datasets. In applications to a T helper cell differentiation study, we show how this model allows for dissecting expression patterns of individual genes and reveals new substructure between cells that is linked to cell differentiation. I will finish with an outlook on modeling challenges and initial solutions that enable combining multiple omics layers that are profiled in the same set of single cells.


Oliver Stegle is a group leader at the European Molecular Biology Laboratory European Bioinformatics Institute (EMBL-EBI) in Cambridge, UK. His group develops statistical methods to analyse high-dimensional molecular traits both in the context of genetic association and single-cell biology. He received his Ph.D. from the University of Cambridge, UK, in physics in 2009, working with David MacKay. After a period as a postdoctoral researcher at the Max Planck Campus in Tübingen, Germany, he moved to the EMBL-EBI in November 2012 to establish his own research group.


Title: Mapping single cells: A geometric approach

(This will be part of our MSR New England General Colloquium Series, intended for broad audiences of all backgrounds.)

Speaker: Dana Pe'er

Affiliation: Departments of Biological Sciences and Systems Biology, Columbia University

Host: Jennifer Listgarten

Date: Wed. Nov 5th, 2014

Time: 4:00 PM - 5:00 PM  


High dimensional single cell technologies are on the rise, rapidly increasing in accuracy and throughput. These offer computational biology both a challenge and an opportunity. One of the big challenges with this data type is to understand regions of density in this multi-dimensional space, given millions of noisy measurements. Underlying many of our approaches is mapping this high-dimensional geometry into a nearest neighbor graph and characterizing single cell behavior using this graph structure. We will discuss a number of approaches: (1) An algorithm that harnesses the nearest neighbor graph to order cells according to their developmental maturity and its use to identify novel progenitor B-cell sub-populations. (2) Using reweighted density estimation to characterize cellular signal processing in T-cell activation. (3) New clustering and dimensionality reduction approaches to map heterogeneity between cells, with an application to characterizing tumor heterogeneity in Acute Myeloid Leukemia.


Dana Pe’er is an associate professor in the Departments of Biological Sciences and Systems Biology. Her team develops computational methods that integrate diverse high-throughput data to provide a holistic, systems-level view of molecular networks. Currently they have two key focuses: developing computational methods to interpret single cell data and understand cellular heterogeneity; and modeling how genetic and epigenetic variation alters regulatory network function and subsequently phenotype in health and disease. This path has led them to explore how systems biology approaches can be used to personalize cancer care. Dana is a recipient of the Burroughs Wellcome Fund Career Award, the NIH Director's New Innovator Award, an NSF CAREER award, a Stand Up To Cancer Innovative Research Grant, a Packard Fellowship in Science and Engineering, and very recently, the prestigious 2014 ISCB Overton Prize.


Title: Reconstructing tumour subpopulation genotypes and evolution from short-read sequencing of bulk tumour samples

Speaker: Quaid Morris

Affiliation: Donnelly Centre for Cellular and Biomolecular Research, University of Toronto

Host: Jennifer Listgarten

Date: Friday, September 12th, 2014

Time: 2:00 PM - 3:30 PM  


Tumours consist of genetically diverse subpopulations of cells that differ in their response to therapy and their metastatic potential. The short read sequencing used to characterize tumour heterogeneity only provides the allelic frequencies of the tumour somatic mutations, not full genotypes of individual cells. I will describe my lab’s efforts to recover these full genotypes by fitting subpopulation phylogenies to the allele frequency data. In some circumstances, a full, unique reconstruction is possible but often multiple phylogenies are consistent with the data. Our methods (PhyloSub, PhyloWGS, treeCRP) use Bayesian inference to distinguish ambiguous and unambiguous portions of the phylogeny thereby explicitly representing reconstruction uncertainty. Our methods incorporate simple somatic mutations (point mutations and indels) as well as copy number variations; have excellent results on real and simulated data; and can take as input allele frequencies from single or multiple tumour samples where these frequencies are estimated using either targeted or whole genome sequencing.


Quaid Morris is an associate professor in the Donnelly Centre at the University of Toronto in Canada. He is a multi-disciplinary researcher with cross-appointments in the Departments of Computer Science, Engineering, and Molecular Genetics. He founded his lab in 2005, after receiving his PhD from the Massachusetts Institute of Technology (MIT) in 2003. His doctoral training was in machine learning and computational neuroscience under the supervision of Peter Dayan at MIT and the Gatsby Unit at University College London. His lab uses statistical learning to make biological discoveries and develop new methodology for analysing large-scale biomedical datasets. He is currently interested in understanding cancer (and other complex diseases) using genomics; post-transcriptional regulation; text mining of medical records; and the automated prediction of gene function (see


Title: The Warped Linear Mixed Model: finding optimal phenotype transformations yields a substantial increase in signal in genetic analyses

Speaker: Nicolo Fusi

Affiliation: Microsoft Research, Los Angeles

Host: Jennifer Listgarten

Date: Wed. August 20th, 2014

Time: 2:00 PM - 3:30 PM  


Genome-wide association studies (GWAS), now routine, still face many open methodological problems. Among the most successful models for GWAS are linear mixed models (LMMs), also used in several other key areas of genetics, such as phenotype prediction and estimation of heritability. However, one of the fundamental assumptions of these models (that the data have a particular distribution, i.e., the noise is Gaussian-distributed) rarely holds in practice. As a result, standard approaches yield sub-optimal performance, resulting in significant losses in power for GWAS, increased bias in heritability estimation, and reduced accuracy for phenotype predictions. In this talk, I will discuss our solution to this important problem: a novel, robust and statistically principled method, the “Warped Linear Mixed Model”, which automatically learns an optimal “warping function” for the phenotype at the same time as it models the data. Our approach effectively searches through an infinite set of transformations, using the principles of statistical inference to determine an optimal one. In extensive experiments, we find up to twofold increases in GWAS power, significantly reduced bias in heritability estimation and significantly increased accuracy in phenotype prediction, as compared to the standard LMM.




Microsoft Research New England
First Floor Conference Center
One Memorial Drive, Cambridge, MA


Arrival Guidance

Upon arrival, be prepared to show a picture ID and sign the Building Visitor Log when approaching the Lobby Floor Security Desk. Alert them to the name of the event you are attending and ask them to direct you to the appropriate floor. Typically the talks are located in the First Floor Conference Center; however, the location may sometimes change.

Parking Information

Guests are allowed to park in our garage located at One Memorial Drive. Microsoft receptionists will not validate parking for any guests. All day parking is $27.00 on weekdays and $10.00 on weekends. Please note that these rates are subject to change.

*Hospitality Notice: Microsoft Research may provide hospitality at this event. Because different universities and legal jurisdictions have differing rules, we rely on you to know whether acceptance of this invitation would be inconsistent with those rules. Accordingly, by accepting our invitation, you confirm that this invitation is compliant with your institution's policies.

To subscribe to talk announcements for this series, send a message to and enter subscribe msrne-cb-announce in the body of the message. If you have any questions or concerns please send us an email.