PhyloDOR: A phylogenetically corrected odds ratio

Please direct all questions and comments to Jonathan.

Overview

This tool builds on the PhyloD framework to measure the selection pressure exerted by one factor on another. Given a phylogeny, sequence data, and environmental data (e.g. HLA data, drug therapy data, demographic data), PhyloDOR builds a modified logistic regression model of the odds of observing a given polymorphism (the target) in the presence or absence of some other variable (the predictor), conditioned on the phylogeny. This document explains the model in detail and describes how to use the tool. Although the tool is quite general, for clarity we will assume that you are interested in "HLA-mediated escape in HIV", which was the original purpose of this tool. Therefore, the following Help sections assume that the target variable is a specific HIV polymorphism, the phylogeny describes the evolutionary relatedness of HIV population sequences isolated from a number of HIV+ individuals, and the predictor variables of interest are HLA types.

If you find the results useful, please cite

Jonathan M. Carlson, Jennifer Listgarten, Nico Pfeifer, Vincent Tan, Carl Kadie, Bruce D. Walker, Thumbi Ndung'u, Roger Shapiro, John Frater, Zabrina L. Brumme, Philip J. R. Goulder, David Heckerman. Widespread Impact of HLA Restriction on Immune Control and Escape Pathways in HIV-1. Journal of Virology, 86(9):5230-5243, May 2012.

Model Details

The PhyloD framework is a general approach for building statistical models that condition on a phylogeny. The essence of the approach is that we build a generative model that assumes independent evolution of a target variable until that variable reaches the leaves of the phylogeny. At that point, the target variable is subjected to selection pressure. We can then define any model of selection that provides a probability of observing the target variable in a particular state, conditioned on the infecting sequence and some set of predictor variables. For more details, see our Phylogenetic Dependency Network paper.

The phylogenetically corrected odds ratio is measured by constructing a phylogenetically corrected logistic regression model, the details of which are available here. Briefly, we start with a traditional logistic regression model, then remove the bias term and replace it with a +1/-1 indicator variable that indicates whether the polymorphism was transmitted to the individual at the time of infection. Thus, individuals to whom the polymorphism was transmitted have a built-in bias toward still having that polymorphism by the time the sequence data were collected, and individuals to whom it was not transmitted have the opposite bias. Because we don't know the true value of this indicator variable, we take a weighted average over both possibilities, where the weight is informed by the phylogeny. The basic model thus looks like

log(p/[1-p])=aX + bY + cT  (1)

where p is the probability of observing a particular polymorphism, X and Y are two 0/1 predictor variables (e.g. HLA-B*58:01 and HLA-B*57:03), and T is the +1/-1 transmission indicator variable. The parameters a, b and c are learned by maximum likelihood and represent log-odds ratios. For example, a is the log-odds ratio of observing the polymorphism with versus without X, conditioned on Y and T. Interactions and comparisons can be tested by constructing X and Y in different ways.
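As a rough illustration of equation (1), the sketch below evaluates the model for one individual and averages over the two possible values of T. The function names and the weight argument w_transmitted are hypothetical: in the tool the weight is derived from the phylogeny, whereas here it is simply passed in as a number between 0 and 1.

```python
import math


def sigmoid(z):
    """Inverse of the log-odds transform: p = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def prob_polymorphism(a, b, c, x, y, w_transmitted):
    """Probability of observing the polymorphism under eq (1).

    x, y          -- 0/1 predictor values for one individual
    a, b, c       -- log-odds ratio parameters from eq (1)
    w_transmitted -- ASSUMED weight for T = +1 (supplied by the caller;
                     the real tool obtains it from the phylogeny)
    """
    p_if_transmitted = sigmoid(a * x + b * y + c * (+1))
    p_if_not_transmitted = sigmoid(a * x + b * y + c * (-1))
    # Weighted average over both possible values of the unknown T.
    return (w_transmitted * p_if_transmitted
            + (1.0 - w_transmitted) * p_if_not_transmitted)
```

When T is known (a weight of exactly 0 or 1), the ratio of the odds with x = 1 to the odds with x = 0 is exp(a), which is why a is read as a log-odds ratio.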

P-values are computed using a likelihood ratio test that compares the likelihood under an alternative model to that under a nested null model. For consistency, we'll use equation (1) to represent the alternative model, and "log(p/[1-p]) = bY + cT" to represent the null model.
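The likelihood ratio test can be sketched as follows. This is a minimal illustration, assuming the usual asymptotic chi-square reference distribution with one degree of freedom (the null model drops the single parameter a); the source does not state which reference distribution the tool uses, and the function name lrt_p_value is hypothetical.

```python
import math


def lrt_p_value(loglik_alt, loglik_null):
    """P-value for a likelihood ratio test between nested models.

    loglik_alt  -- maximized log-likelihood of the alternative model
    loglik_null -- maximized log-likelihood of the nested null model

    ASSUMPTION: the statistic is compared to a chi-square distribution
    with 1 degree of freedom (one parameter removed in the null).
    """
    # Guard against tiny negative values caused by optimizer noise.
    stat = max(2.0 * (loglik_alt - loglik_null), 0.0)
    # Survival function of chi-square with 1 df: P(Z^2 > stat).
    return math.erfc(math.sqrt(stat / 2.0))
```

A difference of about 1.92 in log-likelihood corresponds to a p-value of roughly 0.05 under this assumption.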

Thus, the PhyloDOR is a phylogenetically corrected odds ratio, which represents the increase in the conditional log-odds of observing the target variable in the presence of the predictor variable compared to the log-odds of observing the target in the absence of the predictor, integrating over the infecting sequence, as informed by the phylogeny.

Available Tests

We currently support three statistical tests, which are accessed through four tabs (the Compare groups tab reuses the Test context model):

  • Single HLA measures the odds of observing the polymorphism among individuals with the HLA compared to those without. This is simply the standard model (eq 1), where X is the HLA and Y is zero or more conditions (see below).
  • Compare Two HLAs tests two HLA alleles (or other predictor variables) to see if the odds of escape attributable to the two alleles are comparable. It uses the standard model (eq 1) with Y = "HLA 1 or HLA 2" and X = "HLA 1". We then test the null model that a = 0 (i.e., "HLA 1" adds no information beyond "HLA 1 or HLA 2").
  • Test context tests for an interaction between an HLA allele and some other variable. For example, is escape more likely if an individual also expresses a second specific allele? In this case, Y = "Primary HLA" and X = "Primary HLA and Modulating HLA". That is, the null model assumes that only the primary HLA predicts escape, while the alternative model allows the selection pressure induced by the HLA to be a function of whether or not an individual has the second HLA. Keep in mind that these predictors can be anything. Thus, this test could be used to test whether a particular escape is more or less common in men versus women, in chronically versus acutely infected individuals, or in those on versus off therapy.
  • Compare groups uses the same model as Test context, but provides a different mechanism for defining the context. Specifically, it creates a new variable based on the list of patient IDs provided in the "Patient Group" box, then plugs the result back into Test context.
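The way X and Y are built for Compare Two HLAs and Test context can be sketched as below. This is only an illustration of the variable definitions given above, not the tool's code; the function names are hypothetical and each input is assumed to be a list of 0/1 values, one per individual, with no missing data.

```python
def compare_two_design(hla1, hla2):
    """Compare Two HLAs: X = "HLA 1", Y = "HLA 1 or HLA 2"."""
    x = list(hla1)
    y = [int(bool(h1) or bool(h2)) for h1, h2 in zip(hla1, hla2)]
    return x, y


def context_design(primary, modulating):
    """Test context: Y = "Primary HLA", X = "Primary HLA and Modulating HLA"."""
    y = list(primary)
    x = [int(bool(p) and bool(m)) for p, m in zip(primary, modulating)]
    return x, y


# Four individuals; each position is one individual.
x, y = compare_two_design([1, 0, 0, 1], [0, 1, 0, 1])
```

In both cases the null model keeps only Y (and T), so the test asks whether X adds information beyond Y.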

Inputs

Data Input Boxes

There are four shared data input boxes, as well as specific options for each of the four tests. In this section, we briefly describe the intent of the input boxes. File formats are described in detail below.

Data is uploaded via Data Input Boxes, of which there are four: HLA, DNA/AA, Tree, and Additional Covariates. Data can be pasted directly into the boxes, or can be loaded from a file by clicking on the "..." button and selecting open file. The data will be loaded into the tool and validated when Check Format is selected or on first use. Once loaded and validated, the indicator light will turn from gray to green.

There are two types of data sources used by PhyloDOR: a Phylogeny and a Data Matrix. The Phylogeny is loaded via the Tree box. The phylogeny must be in Newick format, as is commonly output by most tree inference programs. The Data Matrix is computed by merging the data loaded from the HLA, DNA/AA and Additional Covariates data boxes. For this reason, any data can be loaded into any of these boxes. Multiple boxes are provided as a convenience, reflecting the fact that different data types typically come in different formats.

Options for each test

Each test includes an Amino Acid field, which specifies the target variable used in the logistic regression model (see eq 1). This can be any variable from the Data Matrix, but is typically a single polymorphism. Sequence variables are specified in the form PosAA.

The HLA fields specify the predictor variables used for the specific test, as described in Available Tests. They can be any variable from the Data Matrix.

The Covariates fields allow you to add additional covariates to the model. Each covariate is added as an additional linear feature to the underlying model. Any variable from the Data Matrix can be added, or as a convenience, all variables specified in the Additional Covariates data input box can be added.

The Patient Group test has an input box that takes a list of patient IDs. The leaves of the phylogeny will be split into two groups: those in this box and those not. The HLA will then be tested for a group effect against these groups. Thus, the box should contain a strict subset of the leaf names in the phylogeny, separated by comma (,), tab or newline characters.
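A minimal sketch of how such a list could be turned into a 0/1 group variable is shown below. The function names are hypothetical, and trimming spaces around each ID is an assumption of this sketch rather than documented behavior of the tool.

```python
import re


def parse_patient_group(text):
    """Split box contents on commas, tabs or newlines into a set of IDs."""
    tokens = re.split(r"[,\t\r\n]+", text)
    # ASSUMPTION: surrounding spaces are not part of the ID.
    return {tok.strip() for tok in tokens if tok.strip()}


def group_indicator(leaf_names, group_text):
    """Map each leaf name to 1 (in the group) or 0 (not in the group)."""
    leaves = set(leaf_names)
    group = parse_patient_group(group_text)
    unknown = group - leaves
    if unknown:
        raise ValueError(f"IDs not found in the phylogeny: {sorted(unknown)}")
    if not group or group == leaves:
        # A strict, non-empty subset is needed to form two groups.
        raise ValueError("group must be a non-empty strict subset of the leaves")
    # Matching is case sensitive, as for all names in the tool.
    return {name: int(name in group) for name in leaf_names}
```

The resulting indicator plays the role of the modulating variable in Test context.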

Data Formats

Data are loaded via the Data Input Boxes. Data can be pasted directly into the boxes, or loaded from file. For examples, please use the Load Small Example button, or the Load Large Example button, then press ... → Save As File.

There are two types of formats that we describe here:

Phylogeny

The phylogeny must be in Newick format. Each branch and leaf should have a branch length, and each leaf should be named. Please note that leaf names are case sensitive and must exactly match names in the Data Matrix.
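To check leaf names against the Data Matrix before uploading, something like the following may help. It is a deliberately simple sketch, assuming unquoted leaf names that contain no spaces, parentheses, commas, colons or semicolons; it is not a full Newick parser, and the function names are hypothetical.

```python
import re


def leaf_names(newick):
    """Return leaf names from a Newick string with unquoted names.

    A leaf name is taken to be the token that directly follows "(" or ","
    (internal node labels follow ")" and are therefore skipped).
    """
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


def name_mismatches(newick, matrix_names):
    """Names present on only one side (comparison is case sensitive)."""
    leaves = set(leaf_names(newick))
    matrix = set(matrix_names)
    return sorted(leaves - matrix), sorted(matrix - leaves)


# Hypothetical three-leaf tree with branch lengths on every branch.
tree = "((P1:0.10,P2:0.20):0.05,P3:0.30);"
only_in_tree, only_in_matrix = name_mismatches(tree, ["P1", "p2", "P3"])
```

Here "P2" and "p2" are reported as a mismatch because names differ in case.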

Data Matrix

The Data Matrix is loaded by merging the results of the HLA, DNA/AA and Additional Covariates Data Input Boxes. Each accepts the same file formats. Only individuals present in all three input boxes are kept (empty input boxes are ignored). Patient names are case sensitive and must match the leaf names. Names that are in the Data Matrix but not in the tree (and vice versa) will be ignored.
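The merging rule for individuals can be sketched as a set intersection. This is an illustration of the rule as described above, with a hypothetical function name; each box is represented only by the list of patient names it contains.

```python
def individuals_kept(tree_leaves, *boxes):
    """Names that appear in the tree and in every non-empty input box.

    tree_leaves -- leaf names from the phylogeny
    boxes       -- one collection of patient names per Data Input Box;
                   empty boxes are ignored
    """
    kept = set(tree_leaves)
    for names in boxes:
        if names:  # skip empty input boxes
            kept &= set(names)
    return kept


# HLA box, empty DNA/AA box, Additional Covariates box.
kept = individuals_kept(["A", "B", "C", "D"],
                        ["A", "B", "C", "X"], [], ["B", "C", "D"])
```

In this example "X" is dropped because it is not in the tree, and "A" and "D" are dropped because each is missing from one of the non-empty boxes.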

The Data Input Boxes can load three data types, each of which has several formats. The program will figure out which format you used. Examples for each can be seen by clicking the Load Small Example button.

  • Sequence Data can be loaded in one of several formats. DNA sequences will be automatically translated to amino acids, with gaps and mixtures arising from IUPAC ambiguity codes marked as missing data. If you don't like these settings, convert to a matrix, or email me and I'll think about exposing the options. Sequence formats can be one of
    • FASTA
    • PHYLIP, relaxed so that sequences start after 10 characters, or at the first non-space character after the 10th.
    • Two column, tab-delimited, with the first row specifying column names (doesn't matter what they are). The first column is the sequence name, the second is the sequence.
  • HLA Data can be loaded as a tab-delimited table. The first column is the patient name. The next six columns specify, in order, the two HLA-A alleles, the two HLA-B alleles and the two HLA-C alleles. The first row must be a header row (the column names can be anything). HLA alleles can be parsed in either the "new" or "old" standard format. The new format takes the form HLA-A*01:02:01:02N. The "HLA-" prefix is optional. Any level of specification can be provided (type, subtype, allele, etc.), but we will only use up to subtype (4-digit). The old format takes the form A*010201. In both cases, the A* can be dropped completely, in which case the locus will be inferred from the column in which the allele sits.

    When we convert the HLA table to a matrix, we will include Supertype, Type (2-digit) and Subtype (4-digit) levels of resolution, if possible from the data. Supertypes take the form B7_ST or B58_ST. They can be provided in the table, and will also be inferred from the type.

    Missing data is denoted with a '?', and ambiguous calls can be denoted using for example A*02/A*03 [note the '/'] or A*02:(01-03) [note the range].

    We do not currently support Serotypes. B57 will be converted to B*57 and B7 will result in an error. Serotype data can be used by loading the data directly as a data matrix (see below).

    HLA data can be "completed" using the HLACompletion tool. You can paste in the output, provided you remove the version information line (the first line must be the table header). In this case, we'll average over the possible completions, using the HLACompletion result as the prior.
  • Matrix Data can be loaded in one of several formats:
    • Dense format is a tab-delimited matrix, in which the columns are individuals and rows are variables. Each entry should be a double (typically 0 or 1, but continuous data is allowed as well). Missing data is denoted with '?'. The first line of the file must provide the patient names, corresponding to those in the phylogeny. The first column is the variable names.
    • Sparse format is a three-column, tab-delimited file. The header MUST be "var cid val". Each line includes a variable name, the patient, and the value. The value can be any continuous number but is typically 0/1. Missing data is denoted by the absence of an entry for the individual in question.
Both sequence and HLA data can be loaded directly as matrix data, which allows you to circumvent any automatic parsing that is not appropriate for your data.
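The relationship between the dense and sparse matrix formats can be sketched as a conversion from one to the other. The function name is hypothetical, and the sketch assumes that the first line of the dense file starts with a label cell for the variable-name column, followed by the patient names.

```python
def dense_to_sparse(dense_text):
    """Convert dense matrix text to the sparse "var cid val" format.

    ASSUMPTION: the header line is "<label>\\tpatient1\\tpatient2...",
    i.e. its first cell labels the variable-name column.
    """
    lines = [ln for ln in dense_text.splitlines() if ln.strip()]
    patients = lines[0].split("\t")[1:]
    out = ["var\tcid\tval"]
    for line in lines[1:]:
        fields = line.split("\t")
        var, values = fields[0], fields[1:]
        for cid, val in zip(patients, values):
            # Missing data ('?') is expressed in sparse format by
            # simply omitting the entry.
            if val != "?":
                out.append(f"{var}\t{cid}\t{val}")
    return "\n".join(out)


# Hypothetical two-variable, two-patient matrix.
dense = "var\tP1\tP2\nB*57\t1\t?\nM28K\t0\t1"
sparse = dense_to_sparse(dense)
```

Note that explicit 0 values are kept in the sparse output; only '?' entries are dropped, since an absent entry means missing rather than zero.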

Related publications

For a complete description of the PhyloDOR method, as well as the results when applied on a large cohort of HIV clade C infected individuals from Southern Africa, see

For more information on the PhyloD framework, see

For an argument on why phylogenetic correction is necessary, see

For a review of HLA-mediated escape, see

You can find many of the papers that have used PhyloD here.
