Annotated Bibliography of Papers by Christopher Meek
Christopher Meek
Microsoft Research
Redmond, WA 98052, USA
meek@microsoft.com
March, 2005
Year: 2005, 2004, 2003, 2002, 2001,
2000, 1999,
1998, 1997,
1996, 1995,
1994
Structured Variational Inference Procedures and their Realizations
with D. Geiger (AI&Stats 2005)
We describe and prove the convergence of several algorithms for
approximate structured variational inference. We discuss the
computation cost of these algorithms and describe their relationship
to the mean-field and generalized-mean-field variational approaches
and other structured variational methods.
Efficient gradient computation for conditional Gaussian models
with B. Thiesson (AI&Stats 2005)
We introduce Recursive Exponential Mixed Models (REMMs)
and derive the gradient of the parameters for the incomplete-data
likelihood. We demonstrate how one can use probabilistic inference in
Conditional Gaussian (CG) graphical models, a special case of REMMs,
to compute the gradient for a CG model. We also demonstrate that this
approach can yield simple and effective algorithms for computing the
gradient for models with tied parameters and illustrate this approach
on stochastic ARMA models.
Large-Sample Learning of Bayesian Networks is NP-Hard
with D.M. Chickering and D. Heckerman (JMLR 2004)
In this paper, we provide new complexity results for algorithms that
learn discrete-variable Bayesian networks from data. Our results apply
whenever the learning algorithm uses a scoring criterion that favors
the simplest structure for which the model is able to represent the
generative distribution exactly. Our results therefore hold whenever
the learning algorithm uses a consistent scoring criterion and is
applied to a sufficiently large dataset. We show that identifying
high-scoring structures is NP-hard, even when any combination of one
or more of the following hold: the generative distribution is perfect
with respect to some DAG containing hidden variables; we are given an
independence oracle; we are given an inference oracle; we are given an
information oracle; we restrict potential solutions to structures in
which each node has at most k parents, for all $k\geq 3$. Our proof relies
on a new technical result that we establish in the appendices. In
particular, we provide a method for constructing the local
distributions in a Bayesian network such that the resulting joint
distribution is provably perfect with respect to the structure of the
network.
[JMLR][bibtex]
Efficient approximations for learning phylogenetic HMM models from data
with V. Jojic, N. Jojic, D. Geiger, A. Siepel, D. Haussler and D. Heckerman
(Bioinformatics 2004)
We consider models useful for learning an evolutionary or phylogenetic
tree from data consisting of DNA sequences corresponding to the leaves
of the tree. In particular, we consider a general probabilistic model
described in Siepel and Haussler that we call the phylogenetic-HMM
model which generalizes the classical probabilistic models of Neyman
and Felsenstein. Unfortunately, computing the likelihood of
phylogenetic-HMM models is intractable. We consider several
approximations for computing the likelihood of such models including
an approximation introduced in Siepel and Haussler, loopy belief
propagation and several variational methods. We demonstrate that,
unlike the other approximations, variational methods are accurate and
are guaranteed to lower bound the likelihood. In addition, we identify
a particular variational approximation to be best one in which the
posterior distribution is variationally approximated using the classic
Neyman and Felsenstein model. The application of our best approximation to
data from the cystic fibrosis transmembrane conductance regulator gene
region across nine eutherian mammals reveals a CpG effect.
[ISMB][bibtex]
Variations on Undirected Graphical Models and their Relationships
with D. Heckerman and T. Richardson.
Associated with an undirected graph is the set of
distributions obeying the Local Markov property, which are also the
unique stationary distribution of the corresponding single-site Gibbs
sampler. In this note we show that this set of distributions is not a
subset of the set of distributions obeying the global Markov property
with respect to a graph. It follows that this set is also not a subset
of the set of distributions which are limits of distributions
factorizing according to an undirected graph.
[TechReport][bibtex]
Probabilistic Models for Relational Data
with D. Heckerman and D. Koller.
We introduce a graphical language for relational data called the
probabilistic entity-relationship (PER) model. The model is an
extension of the entity-relationship model, a common model for the
abstract representation of database structure. We concentrate on the
directed version of this model---the directed acyclic probabilistic
entity-relationship (DAPER) model. The DAPER model is closely related
to the plate model and the probabilistic relational model (PRM),
existing models for relational data. The DAPER model is more
expressive than either existing model, and also helps to demonstrate
their similarity. In addition to describing the new language, we
discuss important facets of modeling relational data, including the
use of restricted relationships, self relationships, and probabilistic
relationships. Many examples are provided.
[TechReport][bibtex]
ARMA Time-Series Modeling with Graphical Models
with B. Thiesson, D.M. Chickering, D. Heckerman (UAI 2004).
We express the classic ARMA time-series model as a directed graphical
model. In doing so, we find that the deterministic relationships in
the model make it effectively impossible to use the EM algorithm for
learning model parameters. To remedy this problem, we replace the
deterministic relationships with Gaussian distributions having a small
variance, yielding the stochastic ARMA ($\sigma$ARMA) model. This
modification allows us to use the EM algorithm to learn parameters and
to forecast, even in situations where some data is missing. This
modification, in conjunction with the graphical-model approach, also
allows us to include cross predictors in situations where there are
multiple time series and/or additional non-temporal covariates. More
surprising, experiments suggest that the move to stochastic ARMA
yields improved accuracy through better smoothing. We demonstrate
improvements afforded by cross prediction and better smoothing on real
data.
Model-Based Clustering and Visualization of Navigation Patterns on a
Web Site
with I. Cadez, D. Heckerman, P. Smyth and S. White
(Data Mining and
Knowledge Discovery 2003)
We present a new methodology for exploring and analyzing navigation
patterns on a web site. The patterns that can be analyzed consist of
sequences of URL categories traversed by users. In our approach, we
first partition site users into clusters such that users with similar
navigation paths through the site are placed into the same
cluster. Then, for each cluster, we display these paths for users
within that cluster. The clustering approach we employ is model-based
(as opposed to distance-based) and partitions users according to the
order in which they request web pages. In particular, we cluster users
by learning a mixture of first-order Markov models using the
Expectation-Maximization algorithm. The runtime of our algorithm
scales linearly with the number of clusters and with the size of the
data; and our implementation easily handles hundreds of thousands of
user sessions in memory. In the paper, we describe the details of our
method and a visualization tool based on it called WebCANVAS. We
illustrate the use of our approach on user-traffic data from
msnbc.com.
[not available online][bibtex]
Practically Perfect
with Max Chickering (UAI-2003)
We provide a lower-bound on the probability of sampling a non-perfect
distribution when using a fixed number of bits to represent the
parameters of the Bayesian network. This bound approaches zero
exponentially fast as one increases the number of bits used to
represent the parameters. This result implies that perfect
distributions with fixed-length representations exist. We also provide
a lower-bound on the number of bits needed to guarantee that a
distribution sampled from a uniform Dirichlet distribution is perfect
with probability greater than $1/2$. This result is useful for
constructing randomized reductions for hardness proofs.
[pdf][bibtex]
Discriminative Model Selection for Density Models
with Bo Thiesson (AI &
Stats 2003)
Density models are a popular tool for building classifiers. When
using density models to build a classifier, one typically learns a
separate density model for each class of interest. These density
models are then combined to make a classifier through the use of
Bayes' rule utilizing the prior distribution over the classes. In
this paper, we provide a discriminative method for choosing among
alternative density models for each class to improve classification
accuracy.
[ps][pdf][bibtex]
Large-Sample Learning of Bayesian Networks is Hard
with Max Chickering
and David Heckerman (UAI 2003)
In this paper, we provide new complexity results for algorithms that
learn discrete-variable Bayesian networks from data. Our results
apply whenever the learning algorithm uses a scoring criterion that
favors the simplest model able to represent the generative
distribution exactly. Our results therefore hold whenever the
learning algorithm uses a consistent scoring criterion and is applied
to a sufficiently large dataset. We show that identifying
high-scoring structures is hard, even when we are given an
independence oracle, an inference oracle, and/or an information
oracle. Our negative results also apply to the learning of
discrete-variable Bayesian networks in which each node has at most $k$
parents, for all $k \geq 3$.
[pdf][bibtex]
Monotone DAG Faithfulness: A Bad Assumption
with Max Chickering
(submitted AI 12/2002)
In a recent paper, Cheng, Greiner, Kelly, Bell and Liu (Artificial
Intelligence 137:43-90; 2002) describe an algorithm for learning
Bayesian networks that---in a domain consisting of $n$
variables---identifies the optimal solution using $O(n^4)$ calls to a
mutual-information oracle. This seemingly incredible result relies on
(1) the standard assumption that the generative distribution is Markov
and faithful to some directed acyclic graph (DAG), and (2) a new
assumption about the generative distribution that the authors call
{\em monotone DAG faithfulness (MDF)}. The MDF assumption rests on an
intuitive connection between active paths in a Bayesian-network
structure and the mutual information among variables. The assumption
states that the (conditional) mutual information between a pair of
variables is a monotonic function of the set of active paths between
those variables; the more active paths between the variables the
higher the mutual information. In this paper, we demonstrate the
unfortunate result that, for any realistic learning scenario, the
monotone DAG faithfulness assumption is {\em incompatible} with the
faithfulness assumption. In fact, by assuming both MDF and
faithfulness, we restrict the class of possible Bayesian-network
structures to one for which the optimal solution can be identified
with O($n^2$) calls to an independence oracle.
[TechReport][bibtex]
Factorization of discrete probability distributions
with Dan
Geiger and Bernd Sturmfels
We formulate necessary and sufficient conditions for
an arbitrary discrete probability distribution to factor according to
an undirected graphical model, or a log-linear model, or other more
general exponential models. This result generalizes the well known
Hammersley-Clifford Theorem.
[Not available on-line][bibtex]
On the Toric Algebra of Graphical Models
with Dan
Geiger and Bernd Sturmfels
We formulate necessary and sufficient conditions for an arbitrary
discrete probability distribution to factor according to an undirected
graphical model, or a log-linear model, or other more general
exponential models. This characterization generalizes the well-known
Hammersley-Clifford Theorem. We show that for decomposable graphical
models these conditions are equivalent to a set of statistical
independence facts as in the Hammersley-Clifford Theorem but that for
non-decomposable graphical models they are not. We also show that
non-decomposable models can have non-rational maximum likelihood
estimates. Finally, using these results, we provide a characterization
of decomposable graphical models.
[postscript][bibtex]
CFW: A Collaborative Filtering System Using Posteriors Over
Weights of Evidence
with Carl M. Kadie and David Heckerman
We describe CFW, a computationally efficient algorithm for
collaborative filtering that uses posteriors over weights of
evidence. In experiments on real data, we show that this method
predicts as well or better than other methods in situations where
the size of the user query is small. The new approach works
particularly well when the user's query contains low frequency
(unpopular) items. The approach complements that of dependency
networks which perform well when the size of the query is large.
Also in this paper, we argue that the use of posteriors over
weights of evidence is a natural way to recommend similar
items---a task that is somewhat different from the usual
collaborative-filtering task.
[postscript][bibtex]
The Road to Asymptopia with David Maxwell Chickering
Identifying the presence or absence of inducing paths is
essential for any algorithm that uses an independence-based approach
to learning causal models from observational data generated from a
graphical model with hidden variables and possibly sampled with
selection bias. Conceptually, an inducing path is a sequence of
vertices in a directed acyclic graphical model (with hidden variables)
that guarantees that the endpoints of the path cannot be d-separated
given any subset of observed vertices in the graph. If there exists an
inducing path in the model from which the observed data is generated,
then we can deduce that the endpoints either (1) have a hidden common
cause, (2) one of the endpoints is a (possibly indirect) cause of the
other, or (3) both. Not surprisingly, the asymptotic correctness of
the independence-based approaches to learning causal models rely
heavily on being able to detect correctly the existence of inducing
paths; that is, they must judge correctly that the endpoints of the
path are dependent given a subset of the observed variables. In this
paper, we investigate inducing paths from an information-theoretic
perspective and examine experimentally the number of samples required
to reliably detect inducing paths of various types.
[Not available on-line] [bibtex]
Finding Optimal Bayesian Networks with David Maxwell Chickering
In this paper, we derive optimality results for greedy
Bayesian-network search algorithms that perform single edge
modifications at each step and use asymptotically consistent scoring
criteria. Our results extend those of Meek (1997) and Chickering
(2002), who demonstrate that in the limit of large datasets, such
search algorithms identify the generative (i.e. optimal) model if that
model is DAG perfect among the observable variables. We instead
guarantee a weaker form of optimality whenever the generative
distribution satisfies the composition property over the
observable variables, which is a more realistic assumption for real
domains. In addition, we show that the composition property is
guaranteed to hold whenever the dependence relationships in the
generative distribution can be characterized by paths between
singleton elements in some (generative) graphical model even when we
allow the generative model to include unobserved variables, and the
observed data to be subject to selection bias. The generalization
encompasses any situation where the generative distribution is perfect
with respect to (1) a DAG, (2) a chain graph, or (3) a Markov network.
[pdf] [bibtex]
Staged Mixture Modeling and Boosting with Bo Thiesson and David
Heckerman
In this paper, we introduce and evaluate a data-driven staged mixture
modeling technique for building density, regression, and
classification models. Our basic approach is to sequentially add
components to a finite mixture model using the structural expectation
maximization (SEM) algorithm. We show that our technique is
qualitatively similar to boosting. This correspondence is a natural
byproduct of the fact that we use the SEM algorithm to
sequentially fit the mixture model. Finally, in our experimental
evaluation, we demonstrate the effectiveness of our approach on a
variety of prediction and density estimation tasks using real-world
data.
[postscript][bibtex]
Autoregressive tree models for time-series analysis with David
Maxwell Chickering and David Heckerman
In this paper, we describe models for continuous-valued
time-series data that are useful for data mining in that they (1)
can be learned efficiently from data, (2) support accurate
predictions, and (3) are easy to interpret. We describe a class
of models called autoregressive tree (ART) models that are a
generalization of linear autoregressive (AR) models that
allow one to model non-linear time-series data by piece-wise
linear autoregressive models. We accomplish this generalization
by using decision trees with linear regressions at the leaves to
predict successive observations in the time series. Also, we
describe a Bayesian approach for learning AR and ART models from
data. Finally, we compare several alternative models and learning
methods using 2,494 data sets.
[postscript][bibtex]
Published in 2001
The Learning-Curve Sampling Method Applied to Model-Based
Clustering with Bo Thiesson and David Heckerman
We examine the learning-curve sampling method, an approach for
applying machine-learning algorithms to large data sets. The
approach is based on the observation that the computational cost
of learning a model increases as a function of the sample size of
the training data, whereas the accuracy of a model has
diminishing improvements as a function of sample size. Thus, the
learning-curve sampling method monitors the increasing costs and
performance as larger and larger amounts of data are used for
training, and terminates learning when future costs outweigh
future benefits. In this paper, we formalize the learning-curve
sampling method and its associated cost-benefit tradeoff in terms
of decision theory. In addition, we describe the application of
the learning-curve sampling method to the task of model-based
clustering via the expectation-maximization (EM) algorithm. In
experiments on three real data sets, we show that the
learning-curve sampling method produces models that are nearly as
accurate as those trained on complete data sets, but with
dramatically reduced learning times. Finally, we describe an
extension of the basic learning-curve approach for model-based
clustering that results in an additional speedup. This extension is
based on the observation that the shape of the learning curve for
a given model and data set is roughly independent of the number
of EM iterations used during training. Thus, we run EM for only a
few iterations to decide how many cases to use for training, and
then run EM to full convergence once the number of cases is
selected.
In JMLR [postscript][pdf][bibtex] also in
AI&Statistics [bibtex]
Accelerating EM for large databases with Bo Thiesson and David
Heckerman
The EM algorithm is a popular method for parameter estimation in a
variety of problems involving missing data. However, the EM algorithm
often requires significant computational resources and has been
dismissed as impractical for large databases. We present two
approaches that significantly reduce the computational cost of
applying the EM algorithm to databases with a large number of cases,
including databases with large dimensionality. Both approaches are
based on partial E-steps for which we can use the results of Neal and
Hinton (1998) to obtain the standard convergence guarantees
of EM. The first approach is a version of the incremental EM algorithm,
described in Neal and Hinton (1998), which cycles through data cases
in blocks. The number of cases in each block dramatically effects the
efficiency of the algorithm. We provide a method for selecting a near
optimal block size. The second approach, which we call lazy EM, will,
at scheduled iterations, evaluate the significance of each data case
and then proceed for several iterations actively using only the
significant cases. We demonstrate that both methods can significantly
reduce computational costs through their application to
high-dimensional real-world and synthetic mixture modeling problems
for large databases.
[postscript][pdf][bibtex]
Efficient Determination of Dynamic Split Points in a Decision
Tree with David Maxwell Chickering and Robert Rounthwaite
We consider the problem of choosing split points for continuous
predictor variables in a decision tree. Previous approaches to this
problem typically either (1) discretize the continuous predictor
values prior to learning or (2) apply a dynamic method that considers
all possible split points for each potential split. In this paper, we
describe a number of alternative approaches that generate a small
number of candidate split points dynamically with little overhead. We
argue that these approaches are preferable to pre-discretization, and
provide experimental evidence that they yield probabilistic decision
trees with the same prediction accuracy as the traditional dynamic
approach. Furthermore, because the time to grow a decision tree is
proportional to the number of split points evaluated, our approach is
significantly faster than the traditional dynamic approach.
Available by request [bibtex]
Stratified exponential families: graphical models and model
selection with Dan Geiger, David Heckerman and Henry King
We describe a hierarchy of exponential families which is useful
for distinguishing types of graphical models. Undirected
graphical models with no hidden variables are linear exponential
families (LEFs), directed acyclic graphical (DAG) models and
chain graphs with no hidden variables, including DAG models with
several families of local distributions, are curved exponential
families (CEFs) and graphical models with hidden variables are,
what we term, stratified exponential families (SEFs). A SEF is a
finite union of CEFs of various dimensions satisfying some
regularity conditions. We also show that this hierarchy of
exponential families is non-collapsing with respect to graphical
models by providing a graphical model which is a CEF but not an
LEF and a graphical model that is a SEF but not a CEF. Finally,
we show how to compute the dimension of a stratified exponential
family. These results are discussed in the context of model
selection of graphical models.
[postscript][pdf][bibtex]
Finding a Path is Harder than Finding a Tree
I consider the problem of learning an optimal path graphical model
from data and show the problem to be NP-hard for the maximum
likelihood and minimum description length approaches and a Bayesian
approach. This hardness result holds despite the fact that the problem
is a restriction of the polynomially solvable problem of finding the
optimal tree graphical model.
In JAIR
[postscript][pdf][bibtex]
Using Temporal Data for Making Recommendations with Andrew Zimdars and David Maxwell Chickering
We treat collaborative filtering as a univariate time series
estimation problem: given a user's previous votes, predict the next
vote. We describe two families of methods for transforming data to
encode time order in ways amenable to off-the-shelf classification and
density estimation tools, and examine the results of using these
approaches on several real-world data sets. The improvements in
predictive accuracy we realize recommend the use of other predictive
algorithms that exploit the temporal order of data.
[postscript][bibtex]
Published in 2000
Challenges of the email domain for text classification with Jake Brutlag
Interactive classification of email into a user-defined hierarchy of
folders is a natural domain for application of text classification
methods. This domain presents several challenges. First, the user's
changing mail-filing habits mandate classification technology adapt in
a dynamic environment. Second, the classification technology needs to
be able to handle heterogeneity in folder content and folder size.
Performance when there are only a small number of messages in a folder
is especially important. Third, methods must meet the processing and
memory requirements of a software implementation. We study three
promising methods and present an analysis of their behavior with
respect to these domain-specific challenges.
[postscript][bibtex]
Perfect Tree-like Markovian Distributions with Ann Becker and
Dan Geiger
Abstract Not available
Not available electronically [bibtex]
Global Partial Orders from Sequential Data with Heikki Mannila
Sequences of events arise in many applications, such as web browsing,
e-commerce, and monitoring of processes. An important problem in
mining sets of sequences of events is to get an overview of the
ordering relationships in the data. We present a method for finding
partial orders that describe the ordering relationships between the
events in a collection of sequences. The method is based on viewing a
partial order as a generative model for a set of sequences, and
applying mixture modeling techniques to obtain a descriptive set of
partial orders. Runtimes for our algorithm scale linearly in the
number of sequences and polynomially in the number of different event
types. Thus, the methods scales to handle large data sets and can be
used for reasonable numbers of different types of events. We
illustrate our technique by applying it to student enrollment data and
web browsing data.
[postscript][pdf][bibtex]
Goal-oriented clustering with David Maxwell Chickering, David
Heckerman, John C. Platt and Bo Thiesson
Abstract not available
[postscript] [bibtex]
Visualization of navigation patterns on a web site
using model based clustering with I. Cadez, D. Heckerman, P. Smyth and S. White
We present a new methodology for exploring and analyzing
navigation patterns on a web site. The patterns that can be analyzed
consist of sequences of URL categories traversed by users. In our
approach, we first partition site users into clusters such that users
with similar navigation paths through the site are placed into the
same cluster. Then, for each cluster, we display these paths for users
within that cluster. The clustering approach we employ is model-based
(as opposed to distance-based) and partitions users according to the
order in which they request web pages. In particular, we cluster
users by learning a mixture of first-order Markov models using the
Expectation-Maximization algorithm. The runtime of our algorithm
scales linearly with the number of clusters and with the size of the
data; and our implementation easily handles hundreds of thousands of
user sessions in memory. In the paper, we describe the details of our
method and a visualization tool based on it called WebCANVAS. We
illustrate the use of our approach on user-traffic data from
msnbc.com.
[postscript][bibtex] also [bibtex]
Dependency networks for inference, collaborative filtering, and
data visualization with D. Heckerman, D.M. Chickering, R. Rounthwaite
and C. Kadie
We describe a graphical model for probabilistic
relationships---an alternative to the Bayesian network---called a
dependency network. The graph of a dependency network, unlike a
Bayesian network, is potentially cyclic. The probability component of
a dependency network, like a Bayesian network, is a set of conditional
distributions, one for each node given its parents. We identify
several basic properties of this representation and describe a
computationally efficient procedure for learning the graph and
probability components from data. We describe the application of this
representation to probabilistic inference, collaborative filtering
(the task of predicting preferences), and the visualization of acausal
predictive relationships.
[pdf][bibtex] also
in UAI98 [bibtex]
Published in 1999
Singling out ill-fit items in a classification. Application to
the
taxonomy of Enterobacteriaceae with Helge G. Gyllenberg, Mats Gyllenberg,
Timo Koski, Tatu Lund and Heikki Mannila
Abstract not available
Not available electronically [bibtex]
On the geometry of DAG models with hidden variables with Dan
Geiger, David Heckerman and Henry King
Abstract not available
See paper on stratified exponential families. [bibtex]
Quantifier Elimination for statistical problems with Dan Geiger
Recent improvements on Tarski's procedure for quantifier elimination
in the first order theory of real numbers makes it feasible to solve
small instances of the following problems completely automatically:
1. listing all equality and inequality constraints implied by a
graphical model with hidden variables. 2. Comparing graphical models
with hidden variables (i.e., model equivalence, inclusion, and
overlap). 3. Answering questions about the identification of a model
or portion of a model, and about bounds on quantities derived from a
model. 4. Determining whether an independence assertion is implied
from a given set of independence assertions. We discuss the
foundations of quantifier elimination and demonstrate its application
to these problems.
[postscript][bibtex]
An Algorithm for causal inference in the presence of latent
variables and selection bias with Peter Spirtes and Thomas Richardson
Abstract not available
Not available electronically [bibtex] also
see UAI95 [bibtex]
Prediction and experimental design with Graphical causal models
with P. Spirtes, C. Glymour, R. Scheines, S. Fienberg and E. Slate
Abstract not available
Not available electronically [bibtex]
A Bayesian approach to causal discovery with David Heckerman
and Greg Cooper
We consider the Bayesian approach to the discovery of
directed acyclic causal models. The Bayesian approach is
related to the constraint-based approach, in that both methods rely on
the Causal Markov Assumption. Nonetheless, an important difference
between the two approaches is that, whereas the constraint-based
approach uses categorical information about conditional-independence
constraints in the domain, the Bayesian approach weighs the degree to
which such constraints hold. As a result, the Bayesian approach has
three distinct advantages over its constraint-based counterpart. One,
conclusions derived from the Bayesian approach are not susceptible to
incorrect categorical decisions about independence facts that can
occur with data sets of finite size. Two, using the Bayesian
approach, finer distinctions among model structures---both
quantitative and qualitative---can be made. Three, information from
several models can be combined to make better inferences and to better
account for modeling uncertainty. In addition to describing the
general Bayesian approach to causal discovery, we review approximation
methods for missing data and hidden variables, and illustrate
differences between the Bayesian and constraint-based methods using
artificial and real examples.
[postscript][bibtex]
Computationally efficient methods for selecting among mixtures of
graphical models, with discussion with B. Thiesson, D. Chickering and D.
Heckerman
We describe computationally efficient methods for Bayesian
model selection. The methods select among mixtures in which each
mixture component is a directed acyclic graphical model and can be
applied to incomplete data sets. The model-selection criterion that
we consider is the posterior probability of the model (structure)
given data. Our model-selection problem is difficult because (1) the
number of possible model structures grows super-exponentially with the
number of random variables and (2) missing data necessitates the use
of computationally slow approximations of model posterior probability.
We argue that simple search-and-score algorithms are infeasible for a
variety of problems, and introduce a feasible approach in which
parameter and structure search is interleaved and expected data is
treated as real data. Our approach can be viewed as the combination
of (1) a modified Cheeseman--Stutz asymptotic approximation for model
posterior probability and (2) the Expectation--Maximization algorithm.
We evaluate our procedure for selecting among mixtures of directed
acyclic graphical models on synthetic and real examples.
Not available electronically (Bayesian Statistics 6) [bibtex]
also see [bibtex]
Graphical Models and Exponential Families with Dan Geiger
We provide a classification of graphical models according to their
representation as subfamilies of exponential families. Undirected
graphical models with no hidden variables are linear exponential
families (LEFs), directed acyclic graphical models and chain graphs
with no hidden variables, including Bayesian networks with several
families of local distributions, are curved exponential families
(CEFs) and graphical models with hidden variables are stratified
exponential families (SEFs). An SEF is a finite union of CEFs
satisfying a {\em frontier\/} condition. In addition, we illustrate how
one can automatically generate independence and non-independence
constraints on the distributions over the observable variables implied
by a Bayesian network with hidden variables. The relevance of these
results for model selection is examined.
[postscript][bibtex]
Learning mixtures of DAG models with B. Thiesson, D.
Chickering and D. Heckerman
We describe computationally efficient methods for Bayesian
model selection. The methods select among mixtures in which each
mixture component is a directed acyclic graphical model and can be
applied to incomplete data sets. The model-selection criterion that
we consider is the posterior probability of the model (structure)
given data. Our model-selection problem is difficult because (1) the
number of possible model structures grows super-exponentially with the
number of random variables and (2) missing data necessitates the use
of computationally slow approximations of model posterior probability.
We argue that simple search-and-score algorithms are infeasible for a
variety of problems, and introduce a feasible approach in which
parameter and structure search is interleaved and expected data is
treated as real data. Our approach can be viewed as the combination
of (1) a modified Cheeseman--Stutz asymptotic approximation for model
posterior probability and (2) the Expectation--Maximization algorithm.
We evaluate our procedure for selecting among mixtures of directed
acyclic graphical models on synthetic and real examples.
[postscript][bibtex]
also in UAI98 [bibtex]
Using path diagrams as a structural equation modeling tool with
Peter Spirtes, Thomas Richardson, Richard Scheines and Clark Glymour
Abstract not available
Not available electronically [bibtex]
Asymptotic model selection for directed networks with hidden
variables with D. Geiger and D. Heckerman
We extend the Bayesian Information Criterion (BIC), an asymptotic
approximation for the marginal likelihood, to Bayesian networks with
hidden variables. This approximation can be used to select models
given large samples of data. The standard BIC as well as our
extension punishes the complexity of a model according to the
dimension of its parameters. We argue that the dimension of a Bayesian
network with hidden variables is the rank of the Jacobian matrix of
the transformation between the parameters of the network and the
parameters of the observable variables. We compute the dimensions of
several networks including the naive Bayes model with a hidden root
node.
[postscript][bibtex] also in
UAI96 [bibtex] and Erice volume [bibtex]
Graphical models: selecting causal and statistical models
Abstract not available
Not available [bibtex]
Embedded Bayesian Network Classifiers with David Heckerman
We consider learning methods and probability models for the task of
classification: the determination of a conditional probability
distribution for a finite-state outcome or class variable
Y given a set of explanatory or input variables X.
In the first part of the paper, we consider methods for learning
traditional Bayesian networks for classification. Using the full
Bayesian approach, we determine the conditional distribution of
interest by averaging over all possible model structures and their
parameters, given the data we have seen. When this approach is
intractable, we sometimes identify a single "good"' model structure
and use it as if it were the correct one---a process known as
model selection. We review two alternative Bayesian criteria for
model-selection described by Spiegelhalter et al. (1993) and Buntine
(1993), and show how both can be viewed within the prequential
framework of Dawid (1984). We review methods for computing the two
criteria, and provide sufficient conditions under which the two
criteria coincide. In the second part of the paper, we describe a
parameterized model for the probability distributions of a node Y
given its parents X in a Bayesian network. The probability model,
which we call an embedded Bayesian network classifier (EBNC), is
obtained from a (usually different) Bayesian network for Y and X
where X need not be the parents of Y. Like the decision tree,
noisy OR, and log-linear model, the EBNC is a parsimonious (i.e.,
low-dimension) model for encoding the interaction between a variable
and its parents. We show that an EBNC is a special case of a softmax
polynomial regression model. Also, we show how to identify a
minimum-size parameterization of an EBNC, and describe an efficient
approximation for learning the structure of Bayesian networks that
contain EBNCs.
[postscript][bibtex]
Models and Selection Criteria for Regression and Classification with
David Heckerman
Abstract not available
[postscript] also in UAI97 [bibtex]
A Bayesian approach to learning Bayesian networks with
local structure with D. Chickering and D. Heckerman
Recently several researchers have investigated techniques for using data
to learn Bayesian networks containing compact representations for
the conditional probability distributions (CPDs) stored at each
node. The majority of this work has concentrated on using
decision-tree representations for the CPDs. In addition, researchers
typically apply non-Bayesian (or asymptotically Bayesian) scoring
functions such as MDL to evaluate the goodness-of-fit of networks
to the data.
In this paper we investigate a Bayesian approach to learning Bayesian
networks that contain the more general {\em decision-graph}
representations of the CPDs. First, we describe how to evaluate the
posterior probability---that is, the Bayesian score---of such a
network, given a database of observed cases. Second, we describe
various search spaces that can be used, in conjunction with a scoring
function and a search procedure, to identify one or more high-scoring
networks. Finally, we present an experimental evaluation of
the search spaces, using a greedy algorithm and a Bayesian
scoring function.
[postscript] also in UAI97 [bibtex]
Structure and Parameter Learning for Causal Independence and Causal Interaction Models with David Heckerman
We begin by discussing causal independence models and generalize these
models to causal interaction models. Causal interaction models are
models that have independent mechanisms where mechanisms can have
several causes. In addition to introducing several particular types of
causal interaction models, we show how we can apply
the Bayesian approach to learning causal interaction models obtaining
approximate posterior distributions for the models and obtain MAP and
ML estimates for the parameters. We illustrate the approach with a
simulation study of learning model posteriors.
Available by request [bibtex]
Heuristic greedy search algorithms for latent variable models
with P. Spirtes and T. Richardson
Abstract not available
Not available electronically [bibtex]
The dimensionality of mixed ancestral graphs with P. Spirtes and
T. Richardson
Abstract not available
Not available electronically [bibtex]
Asymptotic model selection for directed networks with hidden
variables with D. Geiger and D. Heckerman
We extend the Bayesian Information Criterion (BIC), an asymptotic
approximation for the marginal likelihood, to Bayesian networks with
hidden variables. This approximation can be used to select models
given large samples of data. The standard BIC as well as our
extension punishes the complexity of a model according to the
dimension of its parameters. We argue that the dimension of a Bayesian
network with hidden variables is the rank of the Jacobian matrix of
the transformation between the parameters of the network and the
parameters of the observable variables. We compute the dimensions of
several networks including the naive Bayes model with a hidden root
node.
[postscript][bibtex] also in
UAI96 [bibtex] and Erice volume [bibtex]
Using d-separation to calculate zero partial correlations in
linear models with correlated errors with P. Spirtes, T. Richardson, R.
Scheines and C. Glymour
Abstract not available
Not available [bibtex]
Learning Bayesian networks with discrete variables from data
with Peter Spirtes
This paper describes a new greedy Bayesian search algorithm
(GBPS) for Bayesian networks and a new algorithm that combines the
scoring-based GBPS with the independence-based PC algorithm.
Simulation tests of these algorithms with previously published
algorithms are presented.
[postscript][bibtex]
Strong completeness and faithfulness in Bayesian networks
A completeness result for d-separation applied to discrete Bayesian
networks is presented and it is shown that in a strong
measure-theoretic sense almost all discrete distributions for a given
network structure are faithful; i.e.\ the independence facts true of
the distribution are all and only those entailed by the network
structure.
[postscript] [bibtex]
Causal inference and causal explanation with background knowledge
This paper presents correct algorithms for answering the
following two questions; (i) Does there exist a causal explanation consistent
with
a set of background knowledge which explains all of the observed
independence facts in a sample? (ii) Given that there is such a causal
explanation what are the causal relationships common to every such
causal explanation?
[postscript][bibtex]
Causal inference in the presence of latent variables and selection bias
with P. Spirtes and T. Richardson
Abstract not available
Not available electronically [bibtex]
Tetrad {II}: User's Manual with Richard Scheines, Peter Spirtes and
Clark Glymour
Our approach is to axiomatize the connection between causal structure
and statistical independence and conditional independence, and then
formulate algorithms that take as input statistical data and
background knowledge and output classes of causal models that explain
the data. The algorithms make statistical decisions about
independence and conditional independence, and then apply graph
theoretic search strategies to arrive at the appropriate class of
causal models. We have developed a number of algorithms for different
contexts of background knowledge, for example, time order, knowledge
about feedback, knowledge about the presence or absence of omitted
variables, and partial knowledge of causal relations that do or do not
exist among the variables under study.
Not available electronically. [bibtex]
Conditioning and Intervening with Clark Glymour
We consider the dispute between causal decision theorists and
evidential decision theorists over Newcomb-like problems. We introduce
a framework relating causation and directed graphs developed by
Spirtes et al. (1993) and evaluate several arguments in this
context. We argue that much of the debate between the two camps is
misplaces; the disputes turn on the distinction between conditioning
on an event E as against conditioning on an event I which is an action
to bring about E. We give the essential machinery for calculating the
effect of an intervention and consider recent work which extends the
basic account given here to the case where causal knowledge is
incomplete.
Not available electronically. [bibtex]