Please direct all questions and comments to Jonathan.
This tool builds on the PhyloD framework to allow the measurement of selection pressure exerted by one factor on another. Given a phylogeny, sequence data, and environmental data (e.g. HLA data, drug therapy data, demographic data), PhyloDOR builds a modified logistic regression model that models the odds of observing a given polymorphism (the target) in the presence or absence of some other variable (the predictor), conditioned on the phylogeny. This document explains the model in detail and describes how to use the tool. Although the tool is quite general, for clarity we will assume that you are interested in "HLA-mediated escape in HIV", which was the original purpose of this tool. Therefore, the following Help sections assume that the target variable is a specific HIV polymorphism, the phylogeny describes the evolutionary relatedness of HIV population sequences isolated from a number of HIV+ individuals, and the predictor variable of interest is HLA type.
If you find the results useful, please cite:
Jonathan M. Carlson, Jennifer Listgarten, Nico Pfeifer, Vincent Tan, Carl Kadie, Bruce D. Walker, Thumbi Ndung'u, Roger Shapiro, John Frater, Zabrina L. Brumme, Philip J. R. Goulder, David Heckerman. Widespread Impact of HLA Restriction on Immune Control and Escape Pathways in HIV-1. Journal of Virology, 86(9):5230-5243, May 2012.
The PhyloD framework is a general approach for building statistical models that condition on a phylogeny. The essence of the approach is that we build a generative model that assumes independent evolution of a target variable until that variable reaches the leaves of the phylogeny. At that point, the target variable is subjected to selection pressure. We can then define any model of selection that provides a probability of observing the target variable in a particular state, conditioned on the infecting sequence and some set of predictor variables. For more details, see our Phylogenetic Dependency Network paper.
The phylogenetically corrected odds ratio is measured by constructing a phylogenetically corrected logistic regression model, the details of which are available here. Briefly, we start with a traditional logistic regression model, then remove the bias term and replace it with a +1/-1 indicator variable that indicates whether an individual was transmitted the polymorphism at the time of infection. Thus, individuals who were transmitted the polymorphism have a built-in bias to still have that polymorphism by the time the sequence data were collected, and individuals who were not transmitted the polymorphism have the opposite bias. Because we don't know the true value of this indicator variable, we take a weighted average over both possibilities, where the weight is informed by the phylogeny. The basic model thus looks like
$$\log\frac{p}{1-p} = aX + bY + cT \qquad (1)$$
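The weighted average over the unknown transmission indicator can be sketched in a few lines of Python. This is only an illustration of the idea described above, not the tool's implementation: the function and parameter names (`prob_target`, `a`, `c`, `w_transmitted`, `covariate_term`) are assumptions introduced here, and the phylogeny-informed weight is taken as a given input rather than computed from a tree.

```python
import math


def sigmoid(z):
    """Logistic function mapping log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def prob_target(x, a, c, w_transmitted, covariate_term=0.0):
    """Probability of observing the target, averaged over the unknown
    +1/-1 transmission indicator.

    x              -- predictor value (e.g. 1 if the HLA is present, else 0)
    a              -- weight on the predictor (hypothetical name)
    c              -- weight on the +1/-1 indicator, replacing the bias term
    w_transmitted  -- probability, assumed to come from the phylogeny, that
                      the polymorphism was transmitted at infection
    covariate_term -- summed contribution of any additional covariates
    """
    linear = a * x + covariate_term
    p_if_transmitted = sigmoid(linear + c)      # indicator = +1
    p_if_not_transmitted = sigmoid(linear - c)  # indicator = -1
    return (w_transmitted * p_if_transmitted
            + (1.0 - w_transmitted) * p_if_not_transmitted)


# Illustrative values only: a positive predictor weight raises the probability.
print(prob_target(0, a=1.0, c=2.0, w_transmitted=0.7))
print(prob_target(1, a=1.0, c=2.0, w_transmitted=0.7))
```

When `w_transmitted` is 1 or 0 the average collapses to an ordinary logistic model with a fixed offset of `+c` or `-c`.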
P-values are computed using a likelihood ratio test that compares the likelihood under an alternative model to that under a nested null model. For consistency, we'll use equation (1) to represent the alternative model, and "log(p/[1-p]) = bY + cT" to represent the null model.
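As a rough illustration of the likelihood ratio test, the sketch below turns two maximized log-likelihoods into a p-value. It assumes that the alternative model has exactly one more free parameter than the null (a single predictor term) and that the usual asymptotic chi-square approximation with one degree of freedom applies; the tool itself may compute its p-values differently. The function name is hypothetical.

```python
import math


def likelihood_ratio_test(loglik_null, loglik_alt):
    """Return (statistic, p_value) for a nested comparison with 1 extra parameter.

    Assumes the statistic 2 * (loglik_alt - loglik_null) follows a chi-square
    distribution with one degree of freedom under the null model.
    """
    stat = 2.0 * (loglik_alt - loglik_null)
    stat = max(stat, 0.0)  # guard against tiny negative values from optimization noise
    # Survival function of chi-square with 1 degree of freedom:
    # P(Z^2 > stat) = erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Illustrative log-likelihoods only.
stat, p = likelihood_ratio_test(loglik_null=-105.0, loglik_alt=-101.5)
print(stat, p)
```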
Thus, the PhyloDOR is a phylogenetically corrected odds ratio, which represents the increase in conditional log-odds of observing the target variable in the presence of the predictor variable compared to the log-odds of observing the target in the absence of the predictor, integrating over the infecting sequence, as informed by the phylogeny.
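To make the link between the regression weight and the odds ratio explicit, here is a short worked step. The symbols are assumptions chosen to match the null model quoted above: X is a binary predictor with weight a, bY collects the covariates, and cT is the transmission indicator term. Holding Y and T fixed, the difference in log-odds between X = 1 and X = 0 is just a, so the conditional odds ratio is its exponential.

```latex
\begin{aligned}
\log\frac{p_{X=1}}{1-p_{X=1}} - \log\frac{p_{X=0}}{1-p_{X=0}}
  &= (a + bY + cT) - (bY + cT) = a, \\
\mathrm{OR} &= e^{a}.
\end{aligned}
```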
We currently support four different statistical tests, which are accessed by choosing one of four different tabs:
There are four shared data input boxes, as well as specific options for each of the four tests. In this section, we briefly describe the intent of the input boxes. File formats are described in detail below.
Data is uploaded via Data Input Boxes, of which there are four: HLA, DNA/AA, Tree, and Additional Covariates. Data can be pasted directly into
the boxes, or can be loaded from a file by clicking on the "..." button and selecting open file. The data will be loaded into the tool and validated when Check Format is selected
or on first use. Once loaded and validated, the indicator light will turn from gray to green.
There are two types of data sources used by PhyloDOR: a Phylogeny and a Data Matrix. The Phylogeny is loaded via the Tree box. The phylogeny must be in Newick format, as is commonly
output by most tree inference programs. The Data Matrix is computed by merging the data loaded from the HLA, DNA/AA and Additional Covariates data boxes.
For this reason, any data can be loaded into these boxes. Multiple boxes are provided as a convenience reflecting the fact that different data types typically come in varying formats.
Each test includes an Amino Acid field, which specifies the target variable used in the logistic regression model (see eq 1). This can be any variable from
the Data Matrix, but is typically a single polymorphism. Sequence variables are specified in the form PosAA.
The HLA fields specify the predictor variables used for the specific test, as described in Available Tests. They can be any variable from the Data Matrix.
The Covariates fields allow you to add additional covariates to the model. Each covariate is added as an additional linear feature to the underlying model. Any variable from the
Data Matrix can be added, or as a convenience, all variables specified in the Additional Covariates data input box can be added.
The Patient Group test has an input box that takes a list of patient IDs. The leaves of the phylogeny will be split into two groups: those in this box and those not. The HLA will then
be tested for a group effect against these groups. Thus, the box should contain a strict subset of the leaf names in the phylogeny, separated by comma (,), tab or newline characters.
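The splitting rule for the Patient Group box can be sketched as follows. This is a hypothetical helper, not the tool's own parser: it splits on commas, tabs and newlines, matches names case-sensitively, and additionally assumes the group must be non-empty so that both groups contain at least one leaf.

```python
import re


def parse_patient_group(text, leaf_names):
    """Split pasted patient IDs and partition the leaves into two groups.

    Returns (in_group, out_of_group) as sets of leaf names. Raises ValueError
    if an ID is not a leaf, or if the group is empty or covers every leaf.
    """
    ids = {tok.strip() for tok in re.split(r"[,\t\r\n]+", text) if tok.strip()}
    leaves = set(leaf_names)
    unknown = ids - leaves
    if unknown:
        raise ValueError(f"IDs not found in the phylogeny: {sorted(unknown)}")
    if not ids or ids == leaves:
        raise ValueError("group must be a non-empty strict subset of the leaves")
    return ids, leaves - ids


# Illustrative IDs only.
in_group, out_group = parse_patient_group("pt1, pt2\tpt3\npt4",
                                          ["pt1", "pt2", "pt3", "pt4", "pt5"])
print(sorted(in_group), sorted(out_group))
```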
Data are loaded via the Data Input Boxes. Data can be pasted directly into the boxes, or loaded from file. For examples, please use the Load Small Example button,
or the Load Large Example button, then press ... → Save As File.
There are two types of formats that we describe here:
The phylogeny must be in Newick format. Each branch and leaf should have a branch length, and each leaf should be named. Please note that leaf names are case sensitive and must exactly match names in the Data Matrix.
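For a quick sanity check before uploading, the sketch below pulls leaf names out of a simple Newick string and reports Data Matrix names that have no matching leaf. It is a minimal illustration under stated assumptions: every leaf has a branch length, and names are unquoted and contain no whitespace, parentheses, commas, colons or semicolons. The function names and the example tree are hypothetical, and full Newick parsing needs a dedicated library.

```python
import re


def newick_leaf_names(newick):
    """Return leaf names from a simple Newick string, in order of appearance.

    A leaf is taken to be a name that directly follows '(' or ',' and is
    followed by ':' and a branch length. Internal node labels, which follow
    ')', are not matched.
    """
    return re.findall(r"[(,]\s*([^\s():,;]+)\s*:", newick)


def names_missing_from_tree(matrix_names, newick):
    """Data Matrix names with no exactly matching (case-sensitive) leaf."""
    leaves = set(newick_leaf_names(newick))
    return sorted(set(matrix_names) - leaves)


tree = "((pt1:0.1,pt2:0.2):0.05,PT3:0.3);"
print(newick_leaf_names(tree))
print(names_missing_from_tree(["pt1", "pt2", "pt3"], tree))
```

In this example `pt3` is reported as missing because the tree names the leaf `PT3`, which shows why the case of leaf names matters.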
The Data Matrix is loaded by merging the results of the HLA, DNA/AA and Additional Covariates Data Input Boxes. Each accepts the same file formats. Only
individuals present in all three input boxes are kept (empty input boxes are ignored). Patient names are case sensitive and must match the leaf names. Names that are not in the tree (and vice versa)
will be ignored.
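The merging rule can be illustrated with a small sketch. It assumes each input box has already been parsed into a dictionary mapping patient ID to a dictionary of variable values; that representation, the function name and the variable names are hypothetical and say nothing about the tool's file formats.

```python
def merge_data_matrix(tree_leaves, *boxes):
    """Merge parsed input boxes into one Data Matrix.

    Keeps only individuals that appear in every non-empty box and in the
    tree. Empty boxes are ignored and names are matched case-sensitively.
    """
    non_empty = [box for box in boxes if box]
    if not non_empty:
        return {}
    keep = set(tree_leaves)
    for box in non_empty:
        keep &= set(box)  # intersect on patient IDs
    merged = {}
    for pid in sorted(keep):
        row = {}
        for box in non_empty:
            row.update(box[pid])  # combine variables from each box
        merged[pid] = row
    return merged


# Illustrative data only.
leaves = ["pt1", "pt2", "pt3"]
hla = {"pt1": {"hla_x": 1}, "pt2": {"hla_x": 0}, "pt4": {"hla_x": 1}}
seq = {"pt1": {"seq_y": 1}, "pt2": {"seq_y": 0}, "PT3": {"seq_y": 1}}
covariates = {}  # an empty box is ignored
print(merge_data_matrix(leaves, hla, seq, covariates))
```

Here `pt4` is dropped because it is not in the tree, and `PT3` is dropped because it matches neither the leaf `pt3` nor any ID in the HLA box.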
The Data Input Boxes can load three data types, each of which has several formats. The program will figure out which format you used. Examples for each can be seen by clicking the
Load Small Example button.
For a complete description of the PhyloDOR method, as well as the results when applied to a large cohort of HIV clade C-infected individuals from Southern Africa, see
- Widespread Impact of HLA Restriction on Immune Control and Escape Pathways in HIV-1. Jonathan M. Carlson, Jennifer Listgarten, Nico Pfeifer, Vincent Tan, Carl Kadie, Bruce D. Walker, Thumbi Ndung'u, Roger Shapiro, John Frater, Zabrina L. Brumme, Philip J. R. Goulder, David Heckerman. Journal of Virology, 86(9):5230-5243, May 2012.
For more information on the PhyloD framework, see
- Phylogenetic dependency networks: inferring patterns of CTL escape and codon covariation in HIV-1 Gag. Jonathan M. Carlson, Zabrina Brumme, Christine Rousseau, Chanson Brumme, Philippa Matthews, Carl Kadie, James Mullins, Bruce D. Walker, P. Richard Harrigan, Philip J.R. Goulder, David Heckerman. PLoS Computational Biology, 4(11): e1000225, November 2008.
For an argument on why phylogenetic correction is necessary, see
- Founder effects in the assessment of HIV polymorphisms and HLA allele associations. Tanmoy Bhattacharya, Marcus Daniels, David Heckerman, Brian Foley, Nicole Frahm, Carl Kadie, Jonathan M. Carlson, Karina Yusim, Ben McMahon, Brian Gashen, Simon Mallal, James I. Mullins, David C. Nickle, Joshua Herbeck, Christine Rousseau, Gerald H. Learn, Toshiyuki Miura, Christian Brander, Bruce Walker, Bette Korber. Science, 315(5818):1583-1586, March 2007.
- Leveraging hierarchical population structure in discrete association studies. Jonathan M. Carlson, Carl Kadie, Simon Mallal, David Heckerman. PLoS One, 2(7):e591, July 2007.
For a review of HLA-mediated escape, see
- HIV evolution in response to HLA-restricted CTL selection pressures: a population-based perspective. Jonathan M. Carlson, Zabrina L. Brumme. Microbes and Infection, 10(5):455-61, February 2008.
You can find many of the papers that have used PhyloD here.