Microsoft Computational Biology Tools
The following tools are available:


FaST-LMM (Factored Spectrally Transformed Linear Mixed Models) is a program for performing genome-wide association studies (GWAS) on large data sets.


Open source Python code for benchmarking and evaluating GWAS algorithms.


An open source Python library for reading and manipulating genetic data. It can, for example, efficiently read whole PLINK *.bed/bim/fam files or parts of those files. It can also efficiently manipulate ranges of integers using set operators such as union, intersection, and difference.
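The set operators on integer ranges mentioned above can be illustrated with a small, self-contained sketch. This is not the library's own API: the function names and the choice of inclusive `(start, stop)` tuples are assumptions made purely for illustration.

```python
def normalize(ranges):
    """Sort inclusive (start, stop) ranges and merge overlapping or adjacent ones."""
    merged = []
    for start, stop in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], stop))
        else:
            merged.append((start, stop))
    return merged


def union(a, b):
    """All integers that appear in a or in b."""
    return normalize(list(a) + list(b))


def intersection(a, b):
    """All integers that appear in both a and b."""
    a, b = normalize(a), normalize(b)
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo <= hi:
            out.append((lo, hi))
        # Advance whichever range ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def difference(a, b):
    """All integers that appear in a but not in b."""
    out = []
    b = normalize(b)
    for start, stop in normalize(a):
        cur = start
        for b_start, b_stop in b:
            if b_stop < cur:
                continue
            if b_start > stop:
                break
            if b_start > cur:
                out.append((cur, b_start - 1))
            cur = b_stop + 1
            if cur > stop:
                break
        if cur <= stop:
            out.append((cur, stop))
    return out
```

Storing ranges rather than individual integers keeps the operations proportional to the number of ranges, which matters when the ranges stand for long stretches of SNP indices.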

Epistasis GWAS for 7 common diseases

Results of a SNP-pair epistasis GWAS (genome-wide association study) on the Wellcome Trust data. P values are based on a likelihood ratio test that compares the likelihood of a model with a multiplicative SNP-pair term against that of an additive linear model. 63.5 billion SNP pairs were evaluated for seven common diseases (type I diabetes, type II diabetes, coronary artery disease, hypertension, rheumatoid arthritis, Crohn's disease, and bipolar disorder). If you need more results than can be provided through this link, please email a request.
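As a rough illustration of the likelihood ratio test described above, the two models for a SNP pair can be written as follows. The symbols are assumptions introduced here, not taken from the study: y is the phenotype, s_1 and s_2 are the genotypes of the two SNPs, and any covariates or random effects in the actual analysis are omitted.

```latex
\begin{align}
H_0 &: \; y = \beta_0 + \beta_1 s_1 + \beta_2 s_2 + \varepsilon \\
H_1 &: \; y = \beta_0 + \beta_1 s_1 + \beta_2 s_2 + \beta_3 s_1 s_2 + \varepsilon \\
\Lambda &= 2\left[\ln L(\hat{\theta}_1) - \ln L(\hat{\theta}_0)\right]
\end{align}
```

Because the alternative adds a single parameter (the multiplicative coefficient), the statistic is conventionally compared against a chi-square distribution with one degree of freedom to obtain a P value.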

This study makes use of data generated by the Wellcome Trust Case-Control Consortium. A full list of the investigators who contributed to the generation of the data is available from the consortium. Funding for the project was provided by the Wellcome Trust under awards 076113 and 085475.

Prediction of Cytosolic Stability of HIV-Derived Peptides

Prediction tool related to: E. Lazaro, C. Kadie, P. Stamegna, S. C. Zhang, P. Gourdain, N. Y. Lai, M. Zhang, S. A. Martinez, D. Heckerman, S. Le Gall. Variable HIV Peptide Stability in Human Cytosol Is Critical to Epitope Presentation and Immune Escape, J. Clin. Invest. 2011. The Stability Prediction tool takes a list of HIV-derived peptides, each 8 to 11 amino acids long, as input. The stability rate is calculated by non-linear regression (one-phase exponential decay) on the degradation profile over 30 minutes, averaged over 3 to 5 degradation experiments in cytosolic extracts from human primary cells (peripheral blood mononuclear cells) of different donors.
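The one-phase exponential decay fit mentioned above can be sketched in plain Python. This is a minimal illustration, not the tool's implementation: it assumes the model y(t) = y0 * exp(-k * t) with no plateau term, strictly positive measurements, and a hypothetical function name; the data in the usage note are synthetic.

```python
import math


def fit_one_phase_decay(times, remaining, iterations=50):
    """Least-squares fit of y(t) = y0 * exp(-k * t); returns (y0, k)."""
    # Initial guess from a straight-line fit to log(y) versus t.
    n = len(times)
    logs = [math.log(y) for y in remaining]
    t_mean = sum(times) / n
    l_mean = sum(logs) / n
    slope = (sum((t - t_mean) * (l - l_mean) for t, l in zip(times, logs))
             / sum((t - t_mean) ** 2 for t in times))
    y0, k = math.exp(l_mean - slope * t_mean), -slope

    # Gauss-Newton refinement on the untransformed residuals.
    for _ in range(iterations):
        a11 = a12 = a22 = g1 = g2 = 0.0
        for t, y in zip(times, remaining):
            e = math.exp(-k * t)
            r = y - y0 * e        # residual at this time point
            d_y0 = e              # derivative of the model with respect to y0
            d_k = -y0 * t * e     # derivative of the model with respect to k
            a11 += d_y0 * d_y0
            a12 += d_y0 * d_k
            a22 += d_k * d_k
            g1 += d_y0 * r
            g2 += d_k * r
        det = a11 * a22 - a12 * a12
        if abs(det) < 1e-300:
            break
        # Solve the 2x2 normal equations for the parameter update.
        step_y0 = (a22 * g1 - a12 * g2) / det
        step_k = (a11 * g2 - a12 * g1) / det
        y0 += step_y0
        k += step_k
        if abs(step_y0) < 1e-12 and abs(step_k) < 1e-12:
            break
    return y0, k
```

For example, `fit_one_phase_decay([0, 5, 10, 20, 30], profile)` takes the averaged percentage of peptide remaining at each time point in minutes and returns the fitted starting level and decay rate; a half-life, if wanted, follows as `math.log(2) / k`.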

Contig Ploidy and Allele Dosage Estimation (ConPADE)

ConPADE is a tool for contig ploidy estimation for genome assemblies of complex polyploid plant genomes from whole genome shotgun sequencing data. It also calls SNPs and provides estimates of allele dosages.

Correction for Hidden Confounders in the Genetic Analysis of Gene Expression

Software and associated materials to accompany the following paper: Correction for Hidden Confounders in the Genetic Analysis of Gene Expression, Jennifer Listgarten, Carl Kadie, Eric Schadt, David Heckerman, Proceedings of the National Academy of Sciences, in press. This software has two main functions: (1) to perform association scans (e.g., genome-wide association scans) using a linear mixed model, and (2) to learn an appropriate expression heterogeneity kernel for use with linear mixed models when looking for associations with gene expression data as the target.


Pathogens live and reproduce inside the human host, whose immune system continually tries to rid the body of them. This leads to a tug-of-war between the pathogen and the host: the pathogen adapts so as to "escape" the immune system, while the immune system learns to recognize and eliminate new foreign pathogens. Key players for the immune system are the HLA proteins, each of which can recognize specific short fragments of foreign (e.g., HIV) proteins, called epitopes, in infected cells and then alert the immune system to their presence. For rapidly evolving pathogens like HIV, a key defense mechanism is to evolve mutations that prevent the HLA proteins from recognizing these epitopes. This evolution takes place anew in each patient, because each patient has a different set of HLA proteins that recognize different epitopes.

PhyloD is a suite of statistical tools that can identify HIV mutations that defeat the function of the HLA proteins in certain patients, thereby allowing the virus to escape elimination by the immune system. By applying this tool to large studies of infected patients, researchers are now able to start decoding the complex rules that govern HIV mutations, in the hope of one day creating a vaccine to which the virus is unable to develop resistance. See also our GitHub page.

Epitope Predictor

This tool computes the probability that a given k-mer is a T-cell epitope restricted to a given HLA allele. The tool can scan for 8-, 9-, 10-, and 11-mer epitopes across all common HLA alleles.
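The scanning step can be pictured as a sliding window over a protein sequence. The sketch below only enumerates the candidate 8- to 11-mers; it does not model the tool's probability calculation, and the function name is hypothetical.

```python
def candidate_kmers(protein, lengths=(8, 9, 10, 11)):
    """Yield (start, kmer) for every window of each requested length."""
    for k in lengths:
        for start in range(len(protein) - k + 1):
            yield start, protein[start:start + k]
```

Each yielded k-mer would then be scored against each HLA allele of interest; a sequence of length n contributes n - k + 1 candidates for each length k that fits.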

HLA Completion

HLA sequence typing sometimes yields uncertain results. For example, an allele may be identified as A6801/6802 or simply A02. This tool takes the uncertain information and (probabilistically) expands it to four-digit alleles, using linkage disequilibrium to inform the expansion.

HLA Assignment

One way to find epitopes is to run lab studies such as ELISPOT. A problem with this approach is that, if you see a reaction in a patient, you do not know which of the patient's HLA genes is responsible for the reaction. This tool takes lab data from a series of patients and determines (probabilistically) which HLA genes are responsible for the reaction.

Create Epitome

This tool takes a weighted list of amino acid sequences as input and creates epitomes of all lengths.

False Discovery Rate for 2X2 Contingency Tables

The false discovery rate (FDR) estimates the proportion of false positives among the tests that are deemed significant. This tool computes the FDR for 2X2 contingency tables based on the Fisher exact test.
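The tool's own FDR procedure is not described here. As a generic illustration of how an FDR estimate is attached to a batch of test results, the sketch below applies the standard Benjamini-Hochberg adjustment to a list of P values; this is a swapped-in textbook method, not necessarily what the tool does, and the function name is an assumption.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values (q-values) in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so q-values are monotone in rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q
```

Calling a test significant when its q-value is at most 0.05 then aims to keep the expected proportion of false positives among the significant tests at or below 5%.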

Fisher Exact Test of Independence for 2X2 Contingency Tables

The Fisher exact test is a statistical significance test for categorical data that measures the association between two variables in a 2X2 contingency table.
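A minimal sketch of a two-sided Fisher exact test is shown below, assuming the common convention of summing the probabilities of all tables that are no more likely than the observed one. The function name and argument layout are illustrative and are not taken from the downloadable programs.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum every table at least as unlikely as the observed one; the small
    # tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= observed * (1 + 1e-9))
```

For the table [[3, 1], [1, 3]] the five possible tables have probabilities 1/70, 16/70, 36/70, 16/70, and 1/70, so the two-sided P value is 34/70.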

Windows source code and downloadable programs
