Research Interests
Theory:
Markov models on trees,
finite Markov chains,
interacting particle systems,
random graphs,
randomized algorithms.
Applications:
computational phylogenetics,
evolutionary game theory,
population genetics,
learning theory,
social network analysis.
Preprints
Shrinkage Effect in Ancestral Maximum Likelihood
Submitted, 2008.
With E. Mossel, M. Steel.
Ancestral maximum likelihood (AML) is a method that simultaneously
reconstructs a phylogenetic tree and ancestral sequences from extant
data (sequences at the leaves). The tree and ancestral sequences
maximize the probability of observing the given data under a Markov
model of sequence evolution, in which branch lengths are also optimized
but constrained to take the same value on any edge across all sequence
sites. AML differs from the more usual form of maximum likelihood (ML)
in phylogenetics because ML averages over all possible ancestral
sequences. ML has long been known to be statistically consistent --
that is, it converges on the correct tree with probability approaching
1 as the sequence length grows. However, the statistical consistency of
AML has not been formally determined, despite informal remarks in a
literature that dates back 20 years. In this short note we prove a general
result that implies that AML is statistically inconsistent. In particular
we show that AML can `shrink' short edges in a tree, resulting in a tree
that has no internal resolution as the sequence length grows. Our
results apply to any number of taxa.
Phylogenies without Branch Bounds: Contracting the Short, Pruning the Deep
Submitted, 2008.
With C. Daskalakis, E. Mossel.
We introduce a new phylogenetic reconstruction algorithm which,
unlike most previous rigorous inference techniques, does not rely
on assumptions regarding the branch lengths or the depth of the tree.
The algorithm returns a forest which is guaranteed to contain all edges
that are: 1) sufficiently long and 2) sufficiently close to the leaves.
How much of the true tree is recovered depends on the sequence length provided.
The algorithm is distance-based and runs in polynomial time.
Incomplete Lineage Sorting: Consistent Phylogeny Estimation from Multiple Loci
To appear in IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2008.
With E. Mossel.
We introduce a simple algorithm for reconstructing phylogenies
from multiple gene trees in the presence of incomplete lineage
sorting, that is, when the topology of the gene trees may differ
from that of the species tree. We show that our technique is
statistically consistent under standard stochastic assumptions,
that is, it returns the correct tree given sufficiently many
unlinked loci. We also show that it can tolerate moderate estimation errors.
Network Delay Inference from Additive Metrics
Submitted, 2007.
With S. Bhamidi and R. Rajagopal.
We demonstrate the use of computational phylogenetic techniques
to solve a central problem in inferential
network monitoring. More precisely, we design a novel algorithm
for multicast-based delay inference, i.e.
the problem of reconstructing the topology and delay characteristics
of a network from end-to-end delay
measurements on network paths. Our inference algorithm is based on
additive metric techniques widely used in phylogenetics. It runs in
polynomial time and requires a sample of size only $\poly(\log n)$.
Refereed Proceedings
On the Submodularity of Influence in Social Networks
Proceedings of ACM STOC 2007, 128-134.
With E. Mossel.
We prove and extend a conjecture of Kempe,
Kleinberg, and Tardos (KKT) on the spread of influence in social networks.
A social network can be represented by a directed graph
where the nodes are individuals and the edges indicate a form of social relationship.
A simple way to model the diffusion of ideas, innovative behavior, or ``word-of-mouth'' effects
on such a graph is to consider an increasing process of ``infected'' (or active) nodes:
each node becomes infected once an activation function of the set of its
infected neighbors crosses a certain threshold value.
Such a model was introduced
by KKT in~\cite{KeKlTa:03,KeKlTa:05} where the authors
also impose several natural assumptions: the threshold
values are (uniformly) random to account for our lack of knowledge
of the true values; and the activation functions
are monotone and submodular, i.e.~have ``diminishing returns.''
The monotonicity condition
indicates that a node is more likely to become active if more of its neighbors
are active, while the submodularity condition,
indicates that the marginal effect
of each neighbor is decreasing when the set of active
neighbors increases.
For an initial set of active nodes $S$, let
$\sigma(S)$
denote the expected number of active nodes at termination.
Here we prove a conjecture of KKT: we show that the function $\sigma(S)$ is
submodular under the assumptions above.
We prove the same result for the expected value
of any monotone, submodular function of the set of active nodes at termination.
In other words,
our results demonstrate that ``local'' submodularity is preserved
``globally'' under diffusion processes. This is of natural computational interest, as
many optimization problems have good approximation algorithms for
submodular functions.
In particular, our results coupled with an argument in~\cite{KeKlTa:03} imply that a greedy
algorithm gives an $(1-1/e-\eps)$-approximation algorithm for maximizing
$\sigma(S)$ among all sets $S$ of a given size. This result has important
practical implications for many social network analysis problems, notably
viral marketing.
First to Market is not Everything:
an Analysis of Preferential Attachment with Fitness
Proceedings of ACM STOC 2007, 135-144.
With C. Borgs, J. Chayes and C. Daskalakis.
The design of algorithms on complex
networks, such as routing, ranking or recommendation algorithms,
requires a detailed understanding of the growth
characteristics of the networks of interest, such as the Internet,
the web graph, social networks or online communities. To this end,
preferential attachment, in which the popularity (or relevance) of
a node is determined by its degree, is a well-known and appealing
random graph model, whose predictions are in accordance with
experiments on the web graph and several social networks. However,
its central assumption, that the popularity of the nodes depends
only on their degree, is not a realistic one, since every node has
potentially some intrinsic quality which can differentiate its
attractiveness from other nodes with similar degrees.
In this paper, we provide a rigorous analysis of preferential
attachment with fitness, suggested by Bianconi and Barabasi and
studied by Motwani and Xu, in which the degree of a
vertex is scaled by its quality to determine its attractiveness.
Including quality considerations in the classical preferential
attachment model provides a much more realistic description of
many complex networks, such as the web graph, and allows to
observe a much richer behavior in the growth dynamics of these
networks. Specifically, depending on the shape of the distribution
from which the qualities of the vertices are drawn, we observe
three distinct phases, namely a first-mover-advantage phase, a
fit-get-richer phase and an innovation-pays-off
phase. We precisely characterize the properties of the quality
distribution that result in each of these phases and we compute
the exact growth dynamics for each phase. The dynamics provide
rich information about the quality of the vertices, which can be
very useful in many practical contexts, including ranking
algorithms for the web, recommendation algorithms, as well as the
study of social networks. Furthermore, the mathematical techniques
we introduce to establish these dynamics could be applicable to a
wide variety of problems.
The Kesten-Stigum Reconstruction Bound Is Tight for Roughly Symmetric Binary
Channels
Proceedings of IEEE FOCS 2006, 518-530.
With C. Borgs, J. Chayes, and E. Mossel.
We establish the exact threshold for the reconstruction problem for a
binary asymmetric channel on the b-ary tree,
provided that the asymmetry is sufficiently small. This is the first
exact reconstruction threshold obtained in roughly
a decade. We discuss the implications of our result for Glauber dynamics,
phylogenetic reconstruction, and
so-called ``replica symmetry breaking'' in spin glasses and random
satisfiability problems.
Optimal Phylogenetic Reconstruction
Proceedings of ACM STOC 2006, 159-168.
With C. Daskalakis, E. Mossel.
It is well known that in order to reconstruct a tree on $n$ leaves,
sequences of length $\Omega(\log n)$ are needed. It was conjectured
by M. Steel that for the CFN evolutionary model, if the mutation
probability on all edges of the tree is less than $p^{\ast} = (\sqrt{2}-1)/2^{3/2}$
than the tree can be recovered from sequences of length $O(\log n)$. This was proven
by the second author in the special case where the tree is ``balanced''.
The second author also proved that if all edges have mutation probability larger
than $p^{\ast}$ then the length needed is $n^{\Omega(1)}$. This ``phase-transition'' in
the number of samples needed is closely related to the phase transition for the
reconstruction problem (or extremality of free measure) studied extensively
in statistical physics and probability.
Here we complete the proof of Steel's conjecture and give a reconstruction algorithm
using optimal (up to a multiplicative constant) sequence length. Our results further
extend to obtain optimal reconstruction algorithm for the Jukes-Cantor model with
short edges. All reconstruction algorithms run in time polynomial in the sequence
length.
The algorithm and the proofs are based on a novel combination of combinatorial,
metric and probabilistic arguments.
Learning nonsingular phylogenies and hidden Markov models
Proceedings of ACM STOC 2005, 366-375.
With E. Mossel.
Also in Annals of Applied Probability, 16(2):583-614, 2006.
In this paper, we study the problem of
learning phylogenies and hidden Markov models.
We call the Markov model nonsingular
if all transtion matrices have determinants
bounded away from $0$ (and $1$).
We highlight the role of the nonsingularity condition for the learning problem.
Learning hidden Markov models without the nonsingularity condition
is at least as hard as learning parity with noise.
On the other hand, we give a polynomial-time algorithm for learning
nonsingular phylogenies and hidden Markov models.
Transient growth in exactly counter-rotating Couette-Taylor flow
Theoretical and Computational Fluid Dynamics 16:43-48, 2002.
With H. Hristova, P. Schmid, L. Tuckerman.
Also in Physics of Fluids, 14(10), 2002.
Transient growth due to non-normality is investigated for the Taylor-Couette problem with
counter-rotating cylinders as a function of aspect ratio $\eta$ and Reynolds number $Re$.
For all $Re \leq 500$, transient growth is enhanced by curvature, i.e. is greater for
$\eta < 1$ than for $\eta = 1$, the plane Couette limit. For fixed $Re < 130$ it is
found that the greatest transient growth is achieved for $\eta$ between the Taylor-Couette
linear stability boundary, if it exists, and one, while for $Re > 130$ the greatest
transient growth is achieved for $\eta$ on the linear stability boundary. Transient growth
is shown to be approximately 20% higher near the linear stability boundary at $Re = 310$,
$\eta = 0.986$ than at $Re = 310$, $\eta = 1$, near the threshold observed for transition
in plane Couette. The energy in the optimal inputs is primarily meridional; that in the
optimal outputs is primarily azimuthal. Pseudospectra are calculated for two contrasting
cases. For large curvature, $\eta = 0.5,$ the pseudospectra adhere more closely to the
spectrum than in a narrow gap case, $\eta = 0.99$.
Journal
On Learning Thresholds of Parities and Unions of Rectangles in Random Walk Models
Random Structures and Algorithms 31(4):406-417, 2007.
In a recent breakthrough, [Bshouty et al., 2003] obtained
the first passive-learning algorithm for DNFs under
the uniform distribution, using
random walks. They also showed that DNFs are learnable
in the weaker Noise Sensitivity model. We extend those results
in several directions. We first show that thresholds of
parities, a natural class encompassing DNFs, cannot be learned efficiently
in the Noise Sensitivity model
using only statistical queries. In contrast, we show that
a cyclic version of the Random Walk model allows to learn efficiently
polynomially weighted thresholds of parities.
We also extend the algorithm of Bshouty et al. to the case
of Unions of Rectangles, a natural generalization of DNFs to $\{0,\ldots,b-1\}^n$.
Slow Emergence of Cooperation for Win-Stay Lose-Shift on Trees
Machine Learning 67(1-2):7-22, 2007. (Special Issue on Learning and Computational Game Theory)
With E. Mossel.
We consider a group of agents on a graph who repeatedly play the prisoner's
dilemma game against their neighbors. The players adapt their actions to the
past behavior of their opponents by applying the win-stay lose-shift strategy.
On a finite connected graph, it is easy to see that the system learns
to cooperate
by converging to the all-cooperate state in a finite time. We analyze the rate
of convergence in terms of the size and structure of the graph. [Dyer
et al., 2002]
showed that the system converges rapidly on the cycle, but that it takes a time
exponential in the size of the graph to converge to cooperation on the complete
graph. We show that the emergence of cooperation is exponentially slow in some
expander graphs. More surprisingly, we show that it is also exponentially slow
in bounded-degree trees, where many other dynamics are known to converge
rapidly.
Upstream Reciprocity and the Evolution of Gratitude
If someone is nice to you, you feel good and may be inclined
to be nice to somebody else. This every day experience is borne out by experimental
games: the recipients of an act of kindness are more inclined to help in turn,
even if the person who benefits from their generosity is somebody else. This
behavior, which has been called `upstream reciprocity', appears to be a misdirected
act of gratitude: you help somebody because somebody else has helped you. Does
this make any sense from an evolutionary or game theoretic perspective? In this
paper, we show that upstream reciprocity alone does not lead to the evolution of
cooperation. But upstream reciprocity can evolve and increase the level of
cooperation if it is linked to either direct reciprocity or network reciprocity.
We calculate the random chains of altruistic acts that are induced by upstream
reciprocity. Our analysis shows that gratitude and other positive emotions,
which increase the willingness to help others, can evolve in the competitive world
of natural selection.
Learning nonsingular phylogenies and hidden Markov models
Annals of Applied Probability, 16(2):583-614, 2006.
With E. Mossel.
Also in Proceedings of ACM STOC 2005, 366-375.
In this paper, we study the problem of
learning phylogenies and hidden Markov models.
We call the Markov model nonsingular
if all transtion matrices have determinants
bounded away from $0$ (and $1$).
We highlight the role of the nonsingularity condition for the learning problem.
Learning hidden Markov models without the nonsingularity condition
is at least as hard as learning parity with noise.
On the other hand, we give a polynomial-time algorithm for learning
nonsingular phylogenies and hidden Markov models.
A smoothing heuristic for a bilevel pricing problem
European Journal of Operational Research, 174(3):1396-1413, 2006.
With J.P. Dussault, P.Marcotte, G. Savard.
We provide a global heuristic for a class of bilevel programs with
bilinear objectives and linear contraints. The algorithm relies on
an interior point approach that can be seen as a combination of the
smoothing
and implicit programming techniques well-known to the MPEC litterature.
But contrary to what is usually done there, here the focus is on global
optimality. We tested the algorithm on large-scale instances of a network
pricing problem, one of the main applications of this model. The results
show that, on hard instances, the heuristic is a good alternative to
exact solvers based on mixed 0-1 programming formulations.
A Short Proof that Phylogenetic Tree Reconstruction by Maximum Likelihood is Hard
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 3(1):92-94, 2006.
Maximum likelihood is one of the most widely used techniques to infer
evolutionary histories. Although it is thought to
be intractable, a proof of its hardness has been lacking. Here, we give
a short proof that computing the maximum
likelihood tree is NP-hard by exploiting a connection between likelihood
and parsimony observed by Tuffley and Steel.
Bounding Fastest Mixing
Electronic Communications in Probability, 10:282-296, 2005.
In a series of recent works,
Boyd, Diaconis, and their co-authors
have introduced a semidefinite programming approach
for computing
the fastest mixing Markov chain on a graph of allowed transitions,
given a target stationary distribution. In this paper, we show that
standard mixing-time analysis techniques---variational
characterizations, conductance, canonical paths---can
be used to give simple, nontrivial lower and upper bounds
on the fastest mixing time. To test the applicability of this idea,
we consider several detailed examples including
the Glauber dynamics of the Ising model---and get sharp bounds.
Design and Analysis of an Approximation Algorithm for Stackelberg Network Pricing
Networks, 46(1):57-67, 2005.
With P. Marcotte, G. Savard.
We consider the problem of maximizing the revenue
raised from tolls set on the arcs of a transportation network, under
the constraint that users are assigned to toll-compatible shortest
paths. We first provide a polynomial algorithm with a worst-case
guarantee of $O(\log m_T)$, where $m_T$ denotes the number of toll
arcs. Next we show that the approximation is tight with respect to the
linear programming relaxation by constructing examples for which the
integrality gap is reached.
Transient Growth in Taylor-Couette Flow
Physics of Fluids, 14(10), 2002.
With H. Hristova, P. Schmid, L. Tuckerman.
Also in Theoretical and Computational Fluid Dynamics 16:43-48, 2002.
Transient growth due to non-normality is investigated
for the Taylor-Couette problem with counter-rotating
cylinders as a function of aspect ratio $\eta$ and Reynolds
number $Re$. For all $Re \leq 500$, transient growth is
enhanced by curvature, i.e. is greater for $\eta < 1$
than for $\eta = 1$, the plane Couette limit. For fixed
$Re < 130$ it is found that the greatest transient growth
is achieved for $\eta$ between the Taylor-Couette linear
stability boundary, if it exists, and one, while for
$Re > 130$ the greatest transient growth is achieved
for $\eta$ on the linear stability boundary. Transient
growth is shown to be approximately 20% higher near the
linear stability boundary at $Re = 310$, $\eta = 0.986$
than at $Re = 310$, $\eta = 1$, near the threshold observed
for transition in plane Couette. The energy in the optimal
inputs is primarily meridional; that in the optimal outputs
is primarily azimuthal. Pseudospectra are calculated for
two contrasting cases. For large curvature, $\eta = 0.5,$
the pseudospectra adhere more closely to the spectrum than
in a narrow gap case, $\eta = 0.99$.
Non-colliding Random Walks, Tandem Queues and Discrete Orthogonal Polynomial Ensembles
Electronic Journal of Probability, 7:1-24, 2002.
With W. Koenig, Neil O'Connell.
We show that the function $h(x)=\prod_{i<j}(x_j-x_i)$ is harmonic for any
random walk in $\R^k$ with exchangeable increments, provided the required moments exist.
For the subclass of random walks which can only exit the Weyl chamber
$W=\{x\colon x_1<x_2<\cdots<x_k\}$ onto a point where $h$ vanishes,
we define the corresponding Doob $h$-transform. For certain special cases,
we show that the marginal distribution of the conditioned process at a fixed time
is given by a familiar discrete orthogonal polynomial ensemble. These include the
Krawtchouk and Charlier ensembles, where the underlying walks are binomial and
Poisson, respectively. We refer to the corresponding conditioned processes in
these cases as the Krawtchouk and Charlier processes. In [O'Connell and Yor (2001b)],
a representation was obtained for the Charlier process by considering a sequence
of $M/M/1$ queues in tandem. We present the analogue of this representation theorem
for the Krawtchouk process, by considering a sequence of discrete-time $M/M/1$ queues
in tandem. We also present related results for random
walks on the circle, and relate a system of non-colliding walks in this case
to the discrete analogue of the circular unitary ensemble (CUE).
Thesis
Markov Models on Trees: Reconstruction and Applications
Ph.D. Thesis, University of California, Berkeley, 2007.
In this dissertation, we consider a number of inference problems
related to Markov models on trees---mostly motivated by applications
in evolutionary biology.
We begin with the ``ancestral reconstruction'' problem---also
known simply as the reconstruction problem:
Given the state at the leaves of the tree, how accurately can
one guess the state at the root?
Our main result in this area is a new
condition for non-reconstructibility. More precisely,
we extend a known bound, the Kesten-Stigum bound, to a
more general class of Markov transition matrices
than previously known, more specifically, roughly
symmetric binary channels.
The second problem we consider is the ``phylogenetic reconstruction'' problem:
Given samples of the leaf states, when
can one infer efficiently the tree structure from which the data was generated?
This is a central problem in biology where it corresponds
to reconstructing the evolutionary history of a group
of species from molecular sequences. Here we prove
two main results. First, we show that a standard inference
technique used in biology, maximum likelihood, is NP-hard,
that is, computationally intractable on worst-case instances.
Then, using
a connection to the ancestral reconstruction problem, we show
that, when the data is generated from a Markov model
with sufficiently low mutation rates, the sample requirement
for an accurate reconstruction
drops significantly. Moreover, we show that this transition, conjectured
by Mike Steel, is sharp.
The third problem we consider is the ``full reconstruction''
problem: Given samples of the leaf states,
when can one reconstruct efficiently the full Markov
model, that is, the tree structure as well as the edge transition
matrices? Using a standard framework from computational
learning theory, we give conditions on the transition matrices
for the feasibility of
an efficient reconstruction of the full model.
Finally, we consider a related model in networking,
multicast delay inference. We give an efficient
algorithm for reconstructing the routing tree and
its delay parameters.
Coauthors
Shankar Bhamidi,
Christian Borgs,
Jennifer Chayes,
Constantinos Daskalakis,
Jean-Pierre Dussault,
Hristina Hristova,
Wolfgang Koenig,
Patrice Marcotte,
Elchanan Mossel,
Martin Nowak,
Neil O'Connell,
Ram Rajagopal,
Gilles Savard,
Peter Schmid,
Mike Steel,
Laurette Tuckerman