Annotated Bibliography of Papers by Christopher Meek

Christopher Meek
Microsoft Research

Redmond, WA 98052, USA
meek@microsoft.com

March, 2008


Tied Boltzmann Machines for Cold Start Recommendations
with Asela Gunawardana (ACM Recsys)

We describe a novel statistical model, the tied Boltzmann machine, for combining collaborative and content information for recommendations. In our model, pairwise interactions between items are captured through a Boltzmann machine, whose parameters are constrained according to the content associated with the items. This allows the model to use content information to recommend items that are not seen during training. We describe a tractable algorithm for training the model, and give experimental results evaluating the model in two cold start recommendation tasks on the MovieLens data set.

Inference for Multiplicative Models

Ydo Wexler and Christopher Meek (UAI)

The paper introduces multiplicative models, a generalization of known probabilistic models such as log-linear and graphical models. These models, which express probabilities via products of parameters, are shown to capture multiple forms of contextual independence between variables, including decision graphs and noisy-OR functions. An inference algorithm for multiplicative models is provided and its correctness is proved. The complexity analysis of the inference algorithm uses a more refined parameter than the tree-width of the underlying graph, and shows that the computational cost does not exceed that of the variable elimination algorithm in graphical models. The paper ends with examples where using the new models and algorithm is computationally beneficial.

MAS: a multiplicative approximation scheme for probabilistic inference

Ydo Wexler and Christopher Meek (NIPS)

We propose a multiplicative approximation scheme (MAS) for inference problems in graphical models, which can be applied to various inference algorithms. The method uses \epsilon-decompositions which decompose functions used throughout the inference procedure into functions over smaller sets of variables with a known error \epsilon. MAS translates these local approximations into bounds on the accuracy of the results. We show how to optimize \epsilon-decompositions and provide a fast closed-form solution for an L2 approximation. Applying MAS to the Variable Elimination inference algorithm, we introduce an algorithm we call DynaDecomp which is extremely fast in practice and provides guaranteed error bounds on the result. The superior accuracy and efficiency of DynaDecomp are demonstrated.

Mining Recommendations from the Web

Guy Shani, David Maxwell Chickering, and Christopher Meek (ACM Recsys)

In this paper we show how to use data from the internet to construct recommender systems. We provide experimental results, which include a user study, showing that our methods produce good recommendations in realistic applications. We describe how standard evaluation metrics are biased toward systems that simply recommend the most popular items, and we propose modifications to these metrics to fix the problem.

A Quality-Based Auction for Search Ad Markets with Aggregators
with Gunawardana and Biggs (ACM EC Workshop on Ad Auctions)

We explore the role of aggregators in search ad markets and their effects on that marketplace. Aggregators can have positive effects similar to those of an arbitrageur. We argue and provide empirical evidence that, unlike arbitrageurs, aggregators can continue to profit even in the absence of price imbalances. Furthermore, we argue that the standard generalized second price (GSP) auction mechanisms create incentives for aggregators to design their sites to negatively affect the user experience and that the existence of aggregators in the marketplace can also negatively affect non-aggregator merchants. We propose a specific quality-based GSP mechanism that reduces the negative impacts of aggregators on the user and marketplace while allowing some of their positive benefits for users.

Aggregators and Contextual Effects in Search Ad Markets
with Gunawardana (WWW Workshop on Targeting and Ranking for Online Advertising)

Partitioned Logistic Regression for Spam Filtering

Ming-Wei Chang, Wen-tau Yih, and Christopher Meek (ACM KDD)

Consistent Phrase Relevance Measures

Wen-tau Yih and Christopher Meek (ADKDD-08 Workshop)

Mobile Opportunistic Commerce: Mechanisms, Architecture, and Application

Ece Kumar, Eric Horvitz, and Chris Meek (AAMAS)

People Watcher: A Game for Eliciting Human-Transcribed Data for Automated Directory Assistance
with Paek and Yu (InterSpeech; ELRA Best Paper Award)

Automated Directory Assistance (ADA) allows users to request telephone or address information
of residential and business listings using speech recognition. Because callers often express listings
differently than how they are registered in the directory, ADA systems require transcriptions of
alternative phrasings for directory listings as training data, which can be costly to acquire. As
such, a framework in which data can be contributed voluntarily by large numbers of Internet users
has tremendous value. In this paper, we introduce People Watcher, a computer game that elicits
transcribed, alternative user phrasings for directory listings while at the same time entertaining
players. Data generated from the game not only overlapped actual audio transcriptions, but resulted
in a statistically significant improvement when utilized for ADA. Furthermore, semantic accuracy was
not statistically different from that obtained using the actual audio transcriptions.

Similarity Measures for Short Segments of Text
with Metzler and Dumais (ECIR)

Measuring the similarity between documents and queries has been extensively studied in information
retrieval. However, there are a growing number of tasks that require computing the similarity between
two very short segments of text. These tasks include query reformulation, sponsored search,
and image retrieval. Standard text similarity measures perform poorly on such tasks because of
data sparseness and the lack of context. In this work, we study this problem from an information
retrieval perspective, focusing on text representations and similarity measures. We examine a
range of similarity measures, including purely lexical measures, stemming, and language modeling-
based measures. We formally evaluate and analyze the methods on a query-query similarity task
using 363,822 queries from a web search log. Our analysis provides insights into the strengths and
weaknesses of each method, including important tradeoffs between effectiveness and efficiency.
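As a minimal illustration of the purely lexical end of the spectrum studied here (a plain bag-of-words cosine; the paper's stemming, relevance-weighted, and language-modeling measures are more sophisticated), a sketch:

```python
import math
from collections import Counter

def cosine_similarity(q1, q2):
    """Bag-of-words cosine similarity between two short text segments."""
    a, b = Counter(q1.lower().split()), Counter(q2.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Such measures fail in exactly the way described above when two related queries share no terms: `cosine_similarity("tv", "television")` is 0, which is what motivates the expanded representations the paper evaluates.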

Probabilistic Entity-Relationship Models, PRMs and Plate Models
with Heckerman and Koller (in Introduction to Statistical Relational Learning; MIT Press)

In this chapter, we introduce a graphical language for relational data called the probabilistic
entity-relationship (PER) model. The model is an extension of the entity-relationship model, a common
model for the abstract representation of database structure. We concentrate on the directed version
of the model — the directed acyclic probabilistic entity-relationship (DAPER) model. The DAPER
model is closely related to the plate model and the PRM, existing models of relational data. The
DAPER model is more expressive than either existing model. Many examples are provided.

Improving Similarity Measures for Short Segments of Text

with Yih (AAAI)

In this paper we improve previous work on measuring the similarity of short segments of text in
two ways. First, we introduce a Web-relevance similarity measure and demonstrate its effectiveness.
This measure extends the Web-kernel similarity function introduced by Sahami and Heilman (2006)
by using relevance weighted inner-product of term occurrences rather than TFIDF. Second, we
show that one can further improve the accuracy of similarity measures by using a machine learning
approach. Our methods outperform other state-of-the-art methods in a general query suggestion task.

Modeling Contextual Factors of Click Rates

with H. Becker and Chickering (AAAI)

In this paper, we develop and evaluate several probabilistic models of user click-through behavior
that are appropriate for modeling the click-through rate of items that are presented to the user
in a list. Potential applications include modeling the click-through rates of search results from a
search engine, items ranked by a recommendation system, and search advertisements returned by
a search engine. Our models capture contextual factors related to the presentation as well as the
underlying relevance or quality of the item. We focus on two types of contextual factors for a given
item: the positional context of the item and the quality of the other results. We evaluate our models
on a search advertising dataset from Microsoft's Live search engine and demonstrate that modeling
contextual factors improves the accuracy of click-through models.
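A toy sketch of the idea of separating positional context from item quality. The log data is fabricated, and the clicks-over-expected-clicks estimator below is a simple illustrative choice, not the probabilistic models developed in the paper:

```python
from collections import defaultdict

# Hypothetical click log: (item, position, clicked) triples.
log = [("a", 1, 1), ("a", 1, 0), ("a", 2, 0),
       ("b", 1, 0), ("b", 2, 1), ("b", 2, 0)]

# Positional context: empirical CTR at each position, pooled over items.
pos_clicks, pos_views = defaultdict(int), defaultdict(int)
for item, pos, c in log:
    pos_clicks[pos] += c
    pos_views[pos] += 1
exam = {p: pos_clicks[p] / pos_views[p] for p in pos_views}

def coec(item):
    """Item quality: observed clicks over clicks expected from position alone."""
    clicks = sum(c for i, p, c in log if i == item)
    expected = sum(exam[p] for i, p, c in log if i == item)
    return clicks / expected
```

Here `exam[p]` stands in for how clickable position `p` is, and `coec(item)` exceeds 1.0 for items that attract more clicks than their positions alone would predict.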

On the Toric Algebra of Graphical Models
with Geiger and Sturmfels (Annals of Statistics)

We formulate necessary and sufficient conditions for an arbitrary discrete probability distribution
to factor according to an undirected graphical model, or a log-linear model, or other more general
exponential models. For decomposable graphical models these conditions are equivalent to a set
of conditional independence statements similar to the Hammersley–Clifford theorem; however, we
show that for nondecomposable graphical models they are not. We also show that nondecomposable
models can have nonrational maximum likelihood estimates. These results are used to give several
novel characterizations of decomposable graphical models.
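For reference, "factoring according to an undirected graphical model" here means the standard clique factorization (a textbook definition, not a result of this paper):

```latex
p(x) \;=\; \frac{1}{Z} \prod_{C \in \mathcal{C}(G)} \psi_C(x_C)
```

where $\mathcal{C}(G)$ is the set of cliques of the graph $G$, the $\psi_C$ are nonnegative potential functions, and $Z$ is a normalizing constant; the paper characterizes exactly when an arbitrary discrete distribution admits such a representation.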

On the incompatibility of faithfulness and monotone-DAG-faithfulness
with Chickering (Artificial Intelligence)

Cheng, Greiner, Kelly, Bell and Liu [Artificial Intelligence 137 (2002) 43-90] describe an algorithm for
learning Bayesian networks that in a domain consisting of $n$ variables identifies the optimal solution
using $O(n^4)$ calls to a mutual-information oracle. This result relies on (1) the standard assumption
that the generative distribution is Markov and faithful to some directed acyclic graph (DAG), and (2)
a new assumption about the generative distribution that the authors call monotone DAG faithfulness
(MDF). The MDF assumption rests on an intuitive connection between active paths in a Bayesian-
network structure and the mutual information among variables. The assumption states that the
(conditional) mutual information between a pair of variables is a monotonic function of the set of
active paths between those variables; the more active paths between the variables the higher the
mutual information. In this paper, we demonstrate the unfortunate result that, for any realistic
learning scenario, the monotone DAG faithfulness assumption is incompatible with the faithfulness
assumption. Furthermore, for the class of Bayesian-network structures for which the two assumptions
are compatible, we can learn the optimal solution using standard approaches that require only $O(n^2)$
calls to an independence oracle.

A Variational Inference Procedure Allowing Internal Structure for Overlapping Clusters and Deterministic Constraints

with Geiger and Wexler (JAIR)

We develop a novel algorithm, called VIP*, for structured variational approximate inference. This
algorithm extends known algorithms to allow efficient multiple potential updates for overlapping
clusters, and overcomes the difficulties imposed by deterministic constraints. The algorithm's
convergence is proven and its applicability demonstrated for genetic linkage analysis.

Structural Periodic Measures for Time-Series Data

with Vlachos et al. (Data Mining Knowledge Discovery)

This work motivates the need for more flexible structural similarity measures between time-series
sequences, which are based on the extraction of important periodic features. Specifically, we present
non-parametric methods for accurate periodicity detection and we introduce new periodic distance
measures for time-series sequences. We combine these new measures with an effective metric tree
index structure for efficiently answering k-Nearest-Neighbor queries. The goal of these tools and
techniques is to assist in detecting, monitoring and visualizing structural periodic changes. It is
our belief that these methods can be directly applicable in the manufacturing industry for preventive
maintenance and in the medical sciences for accurate classification and anomaly detection.

Good Word Attacks on Statistical Spam Filters

with D. Lowd (CEAS)

Unsolicited commercial email is a significant problem for users and providers of email services. While
statistical spam filters have proven useful, senders of spam are learning to bypass these filters by
systematically modifying their email messages. In a good word attack, one of the most common
techniques, a spammer modifies a spam message by inserting or appending words indicative of
legitimate email. In this paper, we describe and evaluate the effectiveness of active and passive good
word attacks against two types of statistical spam filters: naive Bayes and maximum entropy filters.
We find that in passive attacks without any filter feedback, an attacker can get 50% of currently
blocked spam past either filter by adding 150 words or fewer. In active attacks allowing test
queries to the target filter, 30 words will get half
of blocked spam past either filter.
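A schematic of a passive good word attack against a naive-Bayes-style filter. The token probabilities and the zero threshold are invented for illustration; the paper's filters and attack strategies are substantially more elaborate:

```python
import math

# Hypothetical per-token probabilities for a trained filter.
p_spam = {"viagra": 0.9, "free": 0.7, "meeting": 0.1, "agenda": 0.05}
p_ham  = {"viagra": 0.01, "free": 0.2, "meeting": 0.6, "agenda": 0.5}

def spam_log_odds(tokens):
    """Naive Bayes log-odds that a token list is spam (unknown tokens neutral)."""
    return sum(math.log(p_spam.get(t, 0.5) / p_ham.get(t, 0.5)) for t in tokens)

def good_word_attack(message, good_words, threshold=0.0):
    """Append ham-indicative words until the score drops below the threshold."""
    tokens, added = list(message), []
    for w in good_words:
        if spam_log_odds(tokens) < threshold:
            break
        tokens.append(w)
        added.append(w)
    return tokens, added
```

With these toy numbers, three ham-indicative words are enough to push a spam message below the threshold, mirroring the paper's finding that surprisingly few words suffice.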

Adversarial Learning

with D. Lowd (KDD)

Many classification tasks, such as spam filtering, intrusion detection, and terrorism detection, are
complicated by an adversary who wishes to avoid detection. Previous work on adversarial
classification has made the unrealistic assumption that the attacker has perfect knowledge of the
classifier [2]. In this paper, we introduce the adversarial classifier reverse engineering (ACRE)
learning problem. We present efficient algorithms for reverse engineering linear classifiers with
either continuous or Boolean
features and demonstrate their effectiveness using real data from the domain of spam filtering.

Using epitomes to model genetic diversity: Rational design of HIV vaccines
with Jojic et al. (NIPS)

We introduce a new model of genetic diversity which summarizes a large input dataset into an
epitome, a short sequence or a small set of short sequences of probability distributions capturing
many overlapping subsequences from the dataset. The epitome as a representation has already been
used in modeling real-valued signals, such as images and audio. The discrete sequence model we
introduce in this paper targets applications in genetics, from multiple alignment to recombination
and mutation inference. In our experiments, we concentrate on modeling the diversity of HIV where
the epitome emerges as a natural model for producing relatively small vaccines covering a large
number of immune system targets known as epitopes. Our experiments show that the epitome
includes more epitopes than other vaccine designs of similar length, including cocktails of consensus
strains, phylogenetic tree centers, and observed strains. We also discuss epitome designs that take
into account uncertainty about T-cell cross-reactivity and epitope presentation. In our experiments,
we find that vaccine optimization is fairly robust to these uncertainties.

Perfect Tree-like Markovian Distributions
with Ann Becker and Dan Geiger (Probability and Mathematical Statistics)

We show that if a strictly positive joint probability distribution for a set of binary variables factors
according to a tree, then vertex separation represents all and only the independence relations encoded
in the distribution. The same result is shown to hold also for multivariate nondegenerate normal
distributions. Our proof uses a new property of conditional independence that holds for these two
classes of probability distributions.

Stochastic and contingent payment auctions
with Chickering and Wilson (Workshop on Sponsored Search Auctions, ACM-EC)

We describe a family of stochastic incentive-compatible auctions that generalize Vickrey auctions and
discuss situations in which such auctions are useful. We also discuss contingent-payment auctions and
their incentive-compatible counterparts. Finally, we discuss how one can utilize both reinforcement-
learning techniques and stochastic auctions to learn probabilities for contingencies.

Structured Variational Inference Procedures and their Realizations

with D. Geiger (AI&Stats 2005)

We describe and prove the convergence of several algorithms for
approximate structured variational inference. We discuss the
computation cost of these algorithms and describe their relationship
to the mean-field and generalized-mean-field variational approaches
and other structured variational methods.


Efficient gradient computation for conditional Gaussian models

with B. Thiesson (AI&Stats 2005)

We introduce Recursive Exponential Mixed Models (REMMs)
and derive the gradient of the parameters for the incomplete-data
likelihood. We demonstrate how one can use probabilistic inference in
Conditional Gaussian (CG) graphical models, a special case of REMMs,
to compute the gradient for a CG model. We also demonstrate that this
approach can yield simple and effective algorithms for computing the
gradient for models with tied parameters and illustrate this approach
on stochastic ARMA models.


Large-Sample Learning of Bayesian Networks is NP-Hard

with D.M. Chickering and D. Heckerman (JMLR 2004)

In this paper, we provide new complexity results for algorithms that
learn discrete-variable Bayesian networks from data. Our results apply
whenever the learning algorithm uses a scoring criterion that favors
the simplest structure for which the model is able to represent the
generative distribution exactly. Our results therefore hold whenever
the learning algorithm uses a consistent scoring criterion and is
applied to a sufficiently large dataset. We show that identifying
high-scoring structures is NP-hard, even when any combination of one
or more of the following hold: the generative distribution is perfect
with respect to some DAG containing hidden variables; we are given an
independence oracle; we are given an inference oracle; we are given an
information oracle; we restrict potential solutions to structures in
which each node has at most k parents, for all $k\geq 3$. Our proof relies
on a new technical result that we establish in the appendices. In
particular, we provide a method for constructing the local
distributions in a Bayesian network such that the resulting joint
distribution is provably perfect with respect to the structure of the
network.


Efficient approximations for learning phylogenetic HMM models from data

with V. Jojic, N. Jojic, D. Geiger, A. Siepel, D. Haussler and D. Heckerman
(Bioinformatics 2004)

We consider models useful for learning an evolutionary or phylogenetic
tree from data consisting of DNA sequences corresponding to the leaves
of the tree. In particular, we consider a general probabilistic model
described in Siepel and Haussler that we call the phylogenetic-HMM
model which generalizes the classical probabilistic models of Neyman
and Felsenstein. Unfortunately, computing the likelihood of
phylogenetic-HMM models is intractable. We consider several
approximations for computing the likelihood of such models including
an approximation introduced in Siepel and Haussler, loopy belief
propagation and several variational methods. We demonstrate that,
unlike the other approximations, variational methods are accurate and
are guaranteed to lower bound the likelihood. In addition, we identify a particular variational
approximation as the best: one in which the posterior distribution is variationally approximated
using the classic Neyman and Felsenstein model. The application of our best approximation to
data from the cystic fibrosis transmembrane conductance regulator gene
region across nine eutherian mammals reveals a CpG effect.


Variations on Undirected Graphical Models and their Relationships

with D. Heckerman and T. Richardson.

Associated with an undirected graph is the set of
distributions obeying the Local Markov property, which are also the
unique stationary distribution of the corresponding single-site Gibbs
sampler. In this note we show that this set of distributions is not a
subset of the set of distributions obeying the global Markov property
with respect to a graph. It follows that this set is also not a subset
of the set of distributions which are limits of distributions
factorizing according to an undirected graph.


Probabilistic Models for Relational Data

with D. Heckerman and D. Koller.

We introduce a graphical language for relational data called the
probabilistic entity-relationship (PER) model. The model is an
extension of the entity-relationship model, a common model for the
abstract representation of database structure. We concentrate on the
directed version of this model---the directed acyclic probabilistic
entity-relationship (DAPER) model. The DAPER model is closely related
to the plate model and the probabilistic relational model (PRM),
existing models for relational data. The DAPER model is more
expressive than either existing model, and also helps to demonstrate
their similarity. In addition to describing the new language, we
discuss important facets of modeling relational data, including the
use of restricted relationships, self relationships, and probabilistic
relationships. Many examples are provided.


ARMA Time-Series Modeling with Graphical Models

with B. Thiesson, D.M. Chickering, D. Heckerman (UAI 2004).

We express the classic ARMA time-series model as a directed graphical
model. In doing so, we find that the deterministic relationships in
the model make it effectively impossible to use the EM algorithm for
learning model parameters. To remedy this problem, we replace the
deterministic relationships with Gaussian distributions having a small
variance, yielding the stochastic ARMA ($\sigma$ARMA) model. This
modification allows us to use the EM algorithm to learn parameters and
to forecast, even in situations where some data is missing. This
modification, in conjunction with the graphical-model approach, also
allows us to include cross predictors in situations where there are
multiple time series and/or additional non-temporal covariates. More
surprisingly, experiments suggest that the move to stochastic ARMA
yields improved accuracy through better smoothing. We demonstrate
improvements afforded by cross prediction and better smoothing on real
data.
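The core modification above can be sketched in a few lines: sample from an ARMA(1,1) process in which the usual deterministic relationship is replaced by a Gaussian with small variance (the coefficients and variances here are arbitrary illustrative values, not estimates from the paper):

```python
import random

def sigma_arma_sample(n, a=0.8, b=0.3, sigma_e=1.0, sigma=0.05, seed=0):
    """Sample n values from a stochastic ARMA(1,1): the deterministic equation
    y_t = a*y_{t-1} + b*e_{t-1} + e_t is relaxed by N(0, sigma^2) noise, which
    is the change that makes EM-based learning workable."""
    rng = random.Random(seed)
    y, y_prev, e_prev = [], 0.0, 0.0
    for _ in range(n):
        e = rng.gauss(0.0, sigma_e)
        mean = a * y_prev + b * e_prev + e
        y_t = rng.gauss(mean, sigma)  # small-variance Gaussian, not equality
        y.append(y_t)
        y_prev, e_prev = y_t, e
    return y
```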


Identifying Similarities, Periodicities and Bursts for Online Search Queries
with Vlachos et al. (SIGMOD)

We present several methods for mining knowledge from the query logs of the MSN search engine.
Using the query logs, we build a time series for each query word or phrase (e.g., Thanksgiving or
Christmas gifts) where the elements of the time series are the number of times that a query is issued
on a day. All of the methods we describe use sequences of this form and can be applied to time
series data generally. Our primary goal is the discovery of semantically similar queries and we do
so by identifying queries with similar demand patterns. Utilizing the best Fourier coefficients and
the energy of the omitted components, we improve upon the state-of-the-art in time-series similarity
matching. The extracted sequence features are then organized in an efficient metric tree index
structure. We also demonstrate how to efficiently and accurately discover the important periods in
a time-series. Finally we propose a simple but effective method for identification of bursts (long or
short-term). Using the burst information extracted from a sequence, we are able to efficiently perform
query-by-burst on the database of time series. We conclude with a description of a tool that uses the
described methods and serves as an interactive exploratory data discovery tool
for the MSN query database.
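The "best Fourier coefficients" idea can be sketched with a stdlib-only DFT: keep the k largest-magnitude coefficients of a demand series and record the energy of what was discarded, which is the quantity used to tighten distance bounds (the normalization convention below is one arbitrary choice, not necessarily the paper's):

```python
import cmath

def dft(x):
    """Discrete Fourier transform, normalized by the series length."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n)) / n
            for f in range(n)]

def best_coefficients(x, k):
    """Keep the k largest-magnitude coefficients; return them with the
    energy of the omitted components."""
    coeffs = dft(x)
    order = sorted(range(len(coeffs)), key=lambda f: -abs(coeffs[f]))
    kept = {f: coeffs[f] for f in order[:k]}
    omitted_energy = sum(abs(coeffs[f]) ** 2 for f in order[k:])
    return kept, omitted_energy
```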

Model-Based Clustering and Visualization of Navigation Patterns on a Web Site

with I. Cadez, D. Heckerman, P. Smyth and S. White
(Data Mining and Knowledge Discovery 2003)

We present a new methodology for exploring and analyzing navigation
patterns on a web site. The patterns that can be analyzed consist of
sequences of URL categories traversed by users. In our approach, we
first partition site users into clusters such that users with similar
navigation paths through the site are placed into the same
cluster. Then, for each cluster, we display these paths for users
within that cluster. The clustering approach we employ is model-based
(as opposed to distance-based) and partitions users according to the
order in which they request web pages. In particular, we cluster users
by learning a mixture of first-order Markov models using the
Expectation-Maximization algorithm. The runtime of our algorithm
scales linearly with the number of clusters and with the size of the
data; and our implementation easily handles hundreds of thousands of
user sessions in memory. In the paper, we describe the details of our
method and a visualization tool based on it called WebCANVAS. We
illustrate the use of our approach on user-traffic data from
msnbc.com.
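A compact sketch of the clustering machinery described above: EM for a mixture of first-order Markov chains over page-category sequences. The random initialization, unit-pseudocount smoothing, and fixed iteration count are simplifications of the paper's method:

```python
import math
import random

def em_markov_mixture(sessions, n_clusters, n_states, iters=20, seed=1):
    """EM for a mixture of first-order Markov chains over integer state sequences."""
    rng = random.Random(seed)
    # Random soft initialization of cluster responsibilities.
    resp = [[rng.random() for _ in range(n_clusters)] for _ in sessions]
    for r in resp:
        s = sum(r)
        for k in range(n_clusters):
            r[k] /= s
    for _ in range(iters):
        # M-step: mixture weights, initial-state and transition parameters.
        pi = [sum(r[k] for r in resp) / len(sessions) for k in range(n_clusters)]
        init = [[1.0] * n_states for _ in range(n_clusters)]
        trans = [[[1.0] * n_states for _ in range(n_states)] for _ in range(n_clusters)]
        for r, seq in zip(resp, sessions):
            for k in range(n_clusters):
                init[k][seq[0]] += r[k]
                for a, b in zip(seq, seq[1:]):
                    trans[k][a][b] += r[k]
        for k in range(n_clusters):
            z = sum(init[k])
            init[k] = [v / z for v in init[k]]
            for a in range(n_states):
                z = sum(trans[k][a])
                trans[k][a] = [v / z for v in trans[k][a]]
        # E-step: posterior cluster responsibilities for each session.
        for i, seq in enumerate(sessions):
            logp = []
            for k in range(n_clusters):
                lp = math.log(pi[k]) + math.log(init[k][seq[0]])
                for a, b in zip(seq, seq[1:]):
                    lp += math.log(trans[k][a][b])
                logp.append(lp)
            m = max(logp)
            w = [math.exp(l - m) for l in logp]
            s = sum(w)
            resp[i] = [v / s for v in w]
    return resp, pi, trans
```

The runtime is linear in the number of clusters and in the total length of the sessions, matching the scaling noted in the abstract.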


Practically Perfect

with Max Chickering (UAI-2003)

We provide a lower-bound on the probability of sampling a non-perfect
distribution when using a fixed number of bits to represent the
parameters of the Bayesian network. This bound approaches zero
exponentially fast as one increases the number of bits used to
represent the parameters. This result implies that perfect
distributions with fixed-length representations exist. We also provide
a lower-bound on the number of bits needed to guarantee that a
distribution sampled from a uniform Dirichlet distribution is perfect
with probability greater than $1/2$. This result is useful for
constructing randomized reductions for hardness proofs.


Discriminative Model Selection for Density Models

with Bo Thiesson (AI & Stats 2003)

Density models are a popular tool for building classifiers. When
using density models to build a classifier, one typically learns a
separate density model for each class of interest. These density
models are then combined to make a classifier through the use of
Bayes' rule utilizing the prior distribution over the classes. In
this paper, we provide a discriminative method for choosing among
alternative density models for each class to improve classification
accuracy.
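The selection idea can be sketched as follows, using one-dimensional Gaussian candidate densities and a greedy per-class search; the candidate models, data, and search order are illustrative assumptions, not the paper's procedure:

```python
import math

def gauss_logpdf(x, mu, var):
    """Log density of N(mu, var) at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def pick_models(candidates, labelled, priors):
    """For each class, keep the candidate (mu, var) density that maximizes
    classification accuracy via Bayes' rule, given the other classes' picks."""
    picks = [cands[0] for cands in candidates]

    def accuracy():
        correct = 0
        for x, y in labelled:
            scores = [math.log(priors[c]) + gauss_logpdf(x, *picks[c])
                      for c in range(len(picks))]
            correct += scores.index(max(scores)) == y
        return correct / len(labelled)

    for c, cands in enumerate(candidates):
        best, best_acc = picks[c], -1.0
        for m in cands:
            picks[c] = m
            acc = accuracy()
            if acc > best_acc:
                best, best_acc = m, acc
        picks[c] = best
    return picks
```

The point of the discriminative criterion is visible here: a density that fits its own class well in a generative sense can still be a poor choice if it misclassifies points near the decision boundary.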


Large-Sample Learning of Bayesian Networks is Hard

with Max Chickering and David Heckerman (UAI 2003)

In this paper, we provide new complexity results for algorithms that
learn discrete-variable Bayesian networks from data. Our results
apply whenever the learning algorithm uses a scoring criterion that
favors the simplest model able to represent the generative
distribution exactly. Our results therefore hold whenever the
learning algorithm uses a consistent scoring criterion and is applied
to a sufficiently large dataset. We show that identifying
high-scoring structures is hard, even when we are given an
independence oracle, an inference oracle, and/or an information
oracle. Our negative results also apply to the learning of
discrete-variable Bayesian networks in which each node has at most $k$
parents, for all $k \geq 3$.


Monotone DAG Faithfulness: A Bad Assumption

with Max Chickering (submitted AI 12/2002)

In a recent paper, Cheng, Greiner, Kelly, Bell and Liu (Artificial
Intelligence 137:43-90; 2002) describe an algorithm for learning
Bayesian networks that---in a domain consisting of $n$
variables---identifies the optimal solution using $O(n^4)$ calls to a
mutual-information oracle. This seemingly incredible result relies on
(1) the standard assumption that the generative distribution is Markov
and faithful to some directed acyclic graph (DAG), and (2) a new
assumption about the generative distribution that the authors call
{\em monotone DAG faithfulness (MDF)}. The MDF assumption rests on an
intuitive connection between active paths in a Bayesian-network
structure and the mutual information among variables. The assumption
states that the (conditional) mutual information between a pair of
variables is a monotonic function of the set of active paths between
those variables; the more active paths between the variables the
higher the mutual information. In this paper, we demonstrate the
unfortunate result that, for any realistic learning scenario, the
monotone DAG faithfulness assumption is {\em incompatible} with the
faithfulness assumption. In fact, by assuming both MDF and
faithfulness, we restrict the class of possible Bayesian-network
structures to one for which the optimal solution can be identified
with $O(n^2)$ calls to an independence oracle.


Factorization of discrete probability distributions

with Dan Geiger and Bernd Sturmfels

We formulate necessary and sufficient conditions for
an arbitrary discrete probability distribution to factor according to
an undirected graphical model, or a log-linear model, or other more
general exponential models. This result generalizes the well known
Hammersley-Clifford Theorem.


On the Toric Algebra of Graphical Models

with Dan Geiger and Bernd Sturmfels

We formulate necessary and sufficient conditions for an arbitrary
discrete probability distribution to factor according to an undirected
graphical model, or a log-linear model, or other more general
exponential models. This characterization generalizes the well-known
Hammersley-Clifford Theorem. We show that for decomposable graphical
models these conditions are equivalent to a set of statistical
independence facts as in the Hammersley-Clifford Theorem but that for
non-decomposable graphical models they are not. We also show that
non-decomposable models can have non-rational maximum likelihood
estimates. Finally, using these results, we provide a characterization
of decomposable graphical models.


CFW: A Collaborative Filtering System Using Posteriors Over Weights of Evidence

with Carl M. Kadie and David Heckerman

We describe CFW, a computationally efficient algorithm for
collaborative filtering that uses posteriors over weights of
evidence. In experiments on real data, we show that this method
predicts as well or better than other methods in situations where
the size of the user query is small. The new approach works
particularly well when the user's query contains low frequency
(unpopular) items. The approach complements that of dependency
networks which perform well when the size of the query is large.
Also in this paper, we argue that the use of posteriors over
weights of evidence is a natural way to recommend similar
items---a task that is somewhat different from the usual
recommendation task.

The Road to Asymptopia

with David Maxwell Chickering

Identifying the presence or absence of inducing paths is
essential for any algorithm that uses an independence-based approach
to learning causal models from observational data generated from a
graphical model with hidden variables and possibly sampled with
selection bias. Conceptually, an inducing path is a sequence of
vertices in a directed acyclic graphical model (with hidden variables)
that guarantees that the endpoints of the path cannot be d-separated
given any subset of observed vertices in the graph. If there exists an
inducing path in the model from which the observed data is generated,
then we can deduce that the endpoints either (1) have a hidden common
cause, (2) one of the endpoints is a (possibly indirect) cause of the
other, or (3) both. Not surprisingly, the asymptotic correctness of
the independence-based approaches to learning causal models relies
heavily on being able to correctly detect the existence of inducing
paths; that is, they must judge correctly that the endpoints of the
path are dependent given a subset of the observed variables. In this
paper, we investigate inducing paths from an information-theoretic
perspective and examine experimentally the number of samples required
to reliably detect inducing paths of various types.

[Not available on-line] [bibtex]

Finding Optimal Bayesian Networks with David Maxwell Chickering

In this paper, we derive optimality results for greedy
Bayesian-network search algorithms that perform single edge
modifications at each step and use asymptotically consistent scoring
criteria. Our results extend those of Meek (1997) and Chickering
(2002), who demonstrate that in the limit of large datasets, such
search algorithms identify the generative (i.e. optimal) model if that
model is DAG perfect among the observable variables. We instead
guarantee a weaker form of optimality whenever the generative
distribution satisfies the composition property over the
observable variables, which is a more realistic assumption for real
domains. In addition, we show that the composition property is
guaranteed to hold whenever the dependence relationships in the
generative distribution can be characterized by paths between
singleton elements in some (generative) graphical model even when we
allow the generative model to include unobserved variables, and the
observed data to be subject to selection bias. The generalization
encompasses any situation where the generative distribution is perfect
with respect to (1) a DAG, (2) a chain graph, or (3) a Markov network.

[pdf] [bibtex]

Staged Mixture Modeling and Boosting with Bo Thiesson and David Heckerman

In this paper, we introduce and evaluate a data-driven staged mixture
modeling technique for building density, regression, and
classification models. Our basic approach is to sequentially add
components to a finite mixture model using the structural expectation
maximization (SEM) algorithm. We show that our technique is
qualitatively similar to boosting. This correspondence is a natural
byproduct of the fact that we use the SEM algorithm to
sequentially fit the mixture model. Finally, in our experimental
evaluation, we demonstrate the effectiveness of our approach on a
variety of prediction and density estimation tasks using real-world
data.

[postscript][bibtex]

Autoregressive tree models for time-series analysis with David Maxwell Chickering and David Heckerman

In this paper, we describe models for continuous-valued
time-series data that are useful for data mining in that they (1)
can be learned efficiently from data, (2) support accurate
predictions, and (3) are easy to interpret. We describe a class
of models called autoregressive tree (ART) models that are a
generalization of linear autoregressive (AR) models that
allow one to model non-linear time-series data by piece-wise
linear autoregressive models. We accomplish this generalization
by using decision trees with linear regressions at the leaves to
predict successive observations in the time series. Also, we
describe a Bayesian approach for learning AR and ART models from
data. Finally, we compare several alternative models and learning
methods using 2,494 data sets.
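
The piecewise-linear idea can be sketched as follows (an illustrative stub only, not the paper's Bayesian learning algorithm; `fit_art_stub` and the fixed threshold are assumptions for the example): split on the previous observation and fit a separate AR(1) line by least squares on each side.

```python
import numpy as np

def fit_ar1(y_prev, y_next):
    """Least-squares fit of y_next = a*y_prev + b; returns (a, b)."""
    A = np.column_stack([y_prev, np.ones_like(y_prev)])
    coef, *_ = np.linalg.lstsq(A, y_next, rcond=None)
    return coef

def fit_art_stub(series, threshold):
    """One-split ART: a separate AR(1) model on each side of `threshold`
    applied to the previous observation."""
    y_prev, y_next = series[:-1], series[1:]
    left = y_prev <= threshold
    return fit_ar1(y_prev[left], y_next[left]), fit_ar1(y_prev[~left], y_next[~left])

def predict(models, threshold, last_value):
    """Route the last observation to its leaf and apply that leaf's line."""
    a, b = models[0] if last_value <= threshold else models[1]
    return a * last_value + b
```

A full ART learns the tree structure (splits and depth) from data; this stub fixes a single split to show how the leaves generalize a plain AR model.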

[postscript][bibtex]

Published in 2001

The Learning-Curve Sampling Method Applied to Model-Based Clustering with Bo Thiesson and David Heckerman

We examine the learning-curve sampling method, an approach for
applying machine-learning algorithms to large data sets. The
approach is based on the observation that the computational cost
of learning a model increases as a function of the sample size of
the training data, whereas the accuracy of a model has
diminishing improvements as a function of sample size. Thus, the
learning-curve sampling method monitors the increasing costs and
performance as larger and larger amounts of data are used for
training, and terminates learning when future costs outweigh
future benefits. In this paper, we formalize the learning-curve
sampling method and its associated cost-benefit tradeoff in terms
of decision theory. In addition, we describe the application of
the learning-curve sampling method to the task of model-based
clustering via the expectation-maximization (EM) algorithm. In
experiments on three real data sets, we show that the
learning-curve sampling method produces models that are nearly as
accurate as those trained on complete data sets, but with
dramatically reduced learning times. Finally, we describe an
extension of the basic learning-curve approach for model-based
clustering that results in an additional speedup. This extension is
based on the observation that the shape of the learning curve for
a given model and data set is roughly independent of the number
of EM iterations used during training. Thus, we run EM for only a
few iterations to decide how many cases to use for training, and
then run EM to full convergence once the number of cases is
selected.
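
The cost-benefit stopping rule can be sketched as below (a minimal sketch: the function names and the monetized cost/benefit constants are illustrative assumptions, not the paper's decision-theoretic formulation).

```python
def learning_curve_sample(train_and_score, data, start=100, grow=2.0,
                          benefit_per_acc=1000.0, cost_per_case=0.01):
    """Grow the training set geometrically; stop when the value of the
    accuracy gained by the last increase falls below the cost of the
    next, larger training run."""
    n, prev_acc, best = start, 0.0, None
    while n <= len(data):
        acc = train_and_score(data[:n])
        gain = (acc - prev_acc) * benefit_per_acc       # monetized benefit
        extra_cost = (grow - 1.0) * n * cost_per_case   # cost of next run's extra cases
        best, prev_acc = (n, acc), acc
        if gain < extra_cost:   # future costs outweigh future benefits
            break
        n = int(n * grow)
    return best
```

With a diminishing-returns learner (accuracy 1 - 1/n, say), the loop stops long before consuming the full data set, which is the point of the method.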

In JMLR [postscript][pdf][bibtex] also in AI&Statistics [bibtex]

Accelerating EM for large databases with Bo Thiesson and David Heckerman

The EM algorithm is a popular method for parameter estimation in a
variety of problems involving missing data. However, the EM algorithm
often requires significant computational resources and has been
dismissed as impractical for large databases. We present two
approaches that significantly reduce the computational cost of
applying the EM algorithm to databases with a large number of cases,
including databases with large dimensionality. Both approaches are
based on partial E-steps for which we can use the results of Neal and
Hinton (1998) to obtain the standard convergence guarantees
of EM. The first approach is a version of the incremental EM algorithm,
described in Neal and Hinton (1998), which cycles through data cases
in blocks. The number of cases in each block dramatically affects the
efficiency of the algorithm. We provide a method for selecting a near
optimal block size. The second approach, which we call lazy EM, will,
at scheduled iterations, evaluate the significance of each data case
and then proceed for several iterations actively using only the
significant cases. We demonstrate that both methods can significantly
reduce computational costs through their application to
high-dimensional real-world and synthetic mixture modeling problems
for large databases.
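
A minimal sketch of the incremental variant, for a 2-component 1-D Gaussian mixture (my toy construction, assuming cases arrive in random order; the paper's block-size selection and lazy variant are not shown): each iteration redoes the E-step for one block only, refreshes that block's cached sufficient statistics, and runs the M-step on the totals.

```python
import numpy as np

def incremental_em(x, n_blocks=4, iters=40):
    """Incremental EM (partial E-steps in the style of Neal & Hinton 1998)
    for a 2-component 1-D Gaussian mixture. Assumes x is shuffled."""
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.array([x.var(), x.var()]) + 1e-6
    pi = np.array([0.5, 0.5])
    blocks = np.array_split(np.arange(len(x)), n_blocks)
    stats = [np.zeros((3, 2)) for _ in blocks]  # per block: sum r, r*x, r*x^2
    for t in range(iters):
        xb = x[blocks[t % n_blocks]]
        # E-step on the current block only
        logp = (-0.5 * ((xb[:, None] - mu) ** 2 / var + np.log(2 * np.pi * var))
                + np.log(pi))
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        rx = r * xb[:, None]
        stats[t % n_blocks] = np.stack([r.sum(0), rx.sum(0), (rx * xb[:, None]).sum(0)])
        # M-step from the totals of all cached blocks (stale blocks included)
        tot = sum(stats)
        pi = tot[0] / tot[0].sum()
        mu = tot[1] / tot[0]
        var = tot[2] / tot[0] - mu ** 2 + 1e-6
    return pi, mu, var
```

The computational saving comes from touching only one block of cases per iteration while the M-step still uses sufficient statistics for all the data.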

[postscript][pdf][bibtex]

Efficient Determination of Dynamic Split Points in a Decision Tree with David Maxwell Chickering and Robert Rounthwaite

We consider the problem of choosing split points for continuous
predictor variables in a decision tree. Previous approaches to this
problem typically either (1) discretize the continuous predictor
values prior to learning or (2) apply a dynamic method that considers
all possible split points for each potential split. In this paper, we
describe a number of alternative approaches that generate a small
number of candidate split points dynamically with little overhead. We
argue that these approaches are preferable to pre-discretization, and
provide experimental evidence that they yield probabilistic decision
trees with the same prediction accuracy as the traditional dynamic
approach. Furthermore, because the time to grow a decision tree is
proportional to the number of split points evaluated, our approach is
significantly faster than the traditional dynamic approach.
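
One way to generate a small candidate set dynamically (a sketch of the general idea; the paper's specific proposals may differ) is to take quantiles of a normal approximation fit to the predictor's values at the node:

```python
import statistics

def gaussian_quantile_splits(values, k=7):
    """Generate k candidate split points from a normal approximation to
    the predictor's distribution at the node, instead of evaluating
    every midpoint between consecutive sorted values."""
    mu = statistics.fmean(values)
    sd = statistics.stdev(values) or 1.0  # guard against constant values
    nd = statistics.NormalDist(mu, sd)
    # inverse CDF at evenly spaced probabilities
    return [nd.inv_cdf((i + 1) / (k + 1)) for i in range(k)]
```

Because the tree-growing time is proportional to the number of split points scored, evaluating k candidates instead of O(n) midpoints gives the speedup the abstract describes.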

Available by request [bibtex]

Stratified exponential families: graphical models and model selection with Dan Geiger, David Heckerman and Henry King

We describe a hierarchy of exponential families which is useful
for distinguishing types of graphical models. Undirected
graphical models with no hidden variables are linear exponential
families (LEFs); directed acyclic graphical (DAG) models and
chain graphs with no hidden variables, including DAG models with
several families of local distributions, are curved exponential
families (CEFs); and graphical models with hidden variables are,
what we term, stratified exponential families (SEFs). An SEF is a
finite union of CEFs of various dimensions satisfying some
regularity conditions. We also show that this hierarchy of
exponential families is non-collapsing with respect to graphical
models by providing a graphical model which is a CEF but not an
LEF and a graphical model that is a SEF but not a CEF. Finally,
we show how to compute the dimension of a stratified exponential
family. These results are discussed in the context of model
selection of graphical models.

[postscript][pdf][bibtex]

Finding a Path is Harder than Finding a Tree

I consider the problem of learning an optimal path graphical model
from data and show the problem to be NP-hard for the maximum
likelihood and minimum description length approaches and a Bayesian
approach. This hardness result holds despite the fact that the problem
is a restriction of the polynomially solvable problem of finding the
optimal tree graphical model.

In JAIR [postscript][pdf][bibtex]

Using Temporal Data for Making Recommendations with Andrew Zimdars and David Maxwell Chickering

We treat collaborative filtering as a univariate time series
estimation problem: given a user's previous votes, predict the next
vote. We describe two families of methods for transforming data to
encode time order in ways amenable to off-the-shelf classification and
density estimation tools, and examine the results of using these
approaches on several real-world data sets. The improvements in
predictive accuracy we realize recommend the use of other predictive
algorithms that exploit the temporal order of data.

[postscript][bibtex]

Published in 2000

Challenges of the email domain for text classification with Jake Brutlag

Interactive classification of email into a user-defined hierarchy of
folders is a natural domain for application of text classification
methods. This domain presents several challenges. First, the user's
changing mail-filing habits mandate that classification technology adapt
to a dynamic environment. Second, the classification technology needs to
be able to handle heterogeneity in folder content and folder size.
Performance when there are only a small number of messages in a folder
is especially important. Third, methods must meet the processing and
memory requirements of a software implementation. We study three
promising methods and present an analysis of their behavior with
respect to these domain-specific challenges.

[postscript][bibtex]

Perfect Tree-like Markovian Distributions with Ann Becker and Dan Geiger

Abstract not available

Not available electronically [bibtex]

Global Partial Orders from Sequential Data with Heikki Mannila

Sequences of events arise in many applications, such as web browsing,
e-commerce, and monitoring of processes. An important problem in
mining sets of sequences of events is to get an overview of the
ordering relationships in the data. We present a method for finding
partial orders that describe the ordering relationships between the
events in a collection of sequences. The method is based on viewing a
partial order as a generative model for a set of sequences, and
applying mixture modeling techniques to obtain a descriptive set of
partial orders. Runtimes for our algorithm scale linearly in the
number of sequences and polynomially in the number of different event
types. Thus, the method scales to handle large data sets and can be
used for reasonable numbers of different types of events. We
illustrate our technique by applying it to student enrollment data and
web browsing data.

[postscript][pdf][bibtex]

Goal-oriented clustering with David Maxwell Chickering, David Heckerman, John C. Platt and Bo Thiesson

Abstract not available

[postscript] [bibtex]

Visualization of navigation patterns on a web site using model based clustering with I. Cadez, D. Heckerman, P. Smyth and S. White

We present a new methodology for exploring and analyzing
navigation patterns on a web site. The patterns that can be analyzed
consist of sequences of URL categories traversed by users. In our
approach, we first partition site users into clusters such that users
with similar navigation paths through the site are placed into the
same cluster. Then, for each cluster, we display these paths for users
within that cluster. The clustering approach we employ is model-based
(as opposed to distance-based) and partitions users according to the
order in which they request web pages. In particular, we cluster
users by learning a mixture of first-order Markov models using the
Expectation-Maximization algorithm. The runtime of our algorithm
scales linearly with the number of clusters and with the size of the
data; and our implementation easily handles hundreds of thousands of
user sessions in memory. In the paper, we describe the details of our
method and a visualization tool based on it called WebCANVAS. We
illustrate the use of our approach on user-traffic data from
msnbc.com.
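
A minimal sketch of the clustering step (the names and smoothing constants are illustrative assumptions; the WebCANVAS visualization is not reproduced here): EM over a mixture of first-order Markov chains, assigning each session to the chain that best explains its transitions.

```python
import numpy as np

def cluster_sequences(seqs, n_states, k=2, iters=30, seed=0):
    """EM for a mixture of k first-order Markov chains over state ids
    0..n_states-1: clusters sessions by their transition behavior."""
    rng = np.random.default_rng(seed)
    T = rng.dirichlet(np.ones(n_states), size=(k, n_states))  # T[c, i, j]
    start = rng.dirichlet(np.ones(n_states), size=k)
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each cluster for each sequence
        logr = np.log(w)[None, :].repeat(len(seqs), axis=0)
        for s, seq in enumerate(seqs):
            logr[s] += np.log(start[:, seq[0]])
            for a, b in zip(seq, seq[1:]):
                logr[s] += np.log(T[:, a, b])
        r = np.exp(logr - logr.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: reweighted, lightly smoothed transition counts
        w = r.mean(axis=0)
        start = np.full((k, n_states), 1e-3)
        counts = np.full((k, n_states, n_states), 1e-3)
        for s, seq in enumerate(seqs):
            start[:, seq[0]] += r[s]
            for a, b in zip(seq, seq[1:]):
                counts[:, a, b] += r[s]
        start /= start.sum(axis=1, keepdims=True)
        T = counts / counts.sum(axis=2, keepdims=True)
    return r.argmax(axis=1)
```

Each EM pass is linear in the total length of the sessions and in k, which is the scaling property the abstract claims.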

[postscript][bibtex] also [bibtex]

Dependency networks for inference, collaborative filtering, and data visualization with D. Heckerman, D.M. Chickering, R. Rounthwaite and C. Kadie

We describe a graphical model for probabilistic
relationships---an alternative to the Bayesian network---called a
dependency network. The graph of a dependency network, unlike a
Bayesian network, is potentially cyclic. The probability component of
a dependency network, like a Bayesian network, is a set of conditional
distributions, one for each node given its parents. We identify
several basic properties of this representation and describe a
computationally efficient procedure for learning the graph and
probability components from data. We describe the application of this
representation to probabilistic inference, collaborative filtering
(the task of predicting preferences), and the visualization of acausal
predictive relationships.
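
The key structural point, each node's conditional is learned independently, which is why cyclic graphs are allowed, can be sketched as below. The paper learns probabilistic decision trees for the conditionals; this sketch swaps in plain empirical tables with Laplace smoothing for brevity.

```python
import numpy as np

def learn_dependency_network(data):
    """Fit one conditional per node: an empirical P(X_i = 1 | all other
    variables) for binary data, as a dict from the other variables'
    values to a probability. Each conditional is estimated on its own,
    so the implied graph may be cyclic."""
    n = data.shape[1]
    cpds = []
    for i in range(n):
        table = {}
        for row in data:
            key = tuple(np.delete(row, i))
            ones, total = table.get(key, (0, 0))
            table[key] = (ones + row[i], total + 1)
        # Laplace-smoothed conditional probabilities
        cpds.append({k: (o + 1) / (t + 2) for k, (o, t) in table.items()})
    return cpds
```

Inference in a dependency network then proceeds by (pseudo-)Gibbs sampling: repeatedly resample each variable from its learned conditional given the current values of the others.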

[pdf][bibtex] also in UAI98 [bibtex]

Published in 1999

Singling out ill-fit items in a classification. Application to the
taxonomy of Enterobacteriaceae
with Helge G. Gyllenberg, Mats Gyllenberg, Timo Koski, Tatu Lund and Heikki Mannila

Abstract not available

Not available electronically [bibtex]

On the geometry of DAG models with hidden variables with Dan Geiger, David Heckerman and Henry King

Abstract not available

See paper on stratified exponential families. [bibtex]

Quantifier Elimination for statistical problems with Dan Geiger

Recent improvements on Tarski's procedure for quantifier elimination
in the first order theory of real numbers makes it feasible to solve
small instances of the following problems completely automatically:
1. Listing all equality and inequality constraints implied by a
graphical model with hidden variables. 2. Comparing graphical models
with hidden variables (i.e., model equivalence, inclusion, and
intersection). 3. Answering questions about the identification of a model,
or portion of a model, and about bounds on quantities derived from a
model. 4. Determining whether an independence assertion is implied
from a given set of independence assertions. We discuss the
foundations of quantifier elimination and demonstrate its application
to these problems.

[postscript][bibtex]

An Algorithm for causal inference in the presence of latent variables and selection bias with Peter Spirtes and Thomas Richardson

Abstract not available

Not available electronically [bibtex] also see UAI95 [bibtex]

Prediction and experimental design with Graphical causal models with P. Spirtes, C. Glymour, R. Scheines, S. Fienberg and E. Slate

Abstract not available

Not available electronically [bibtex]

A Bayesian approach to causal discovery with David Heckerman and Greg Cooper

We consider the Bayesian approach to the discovery of
directed acyclic causal models. The Bayesian approach is
related to the constraint-based approach, in that both methods rely on
the Causal Markov Assumption. Nonetheless, an important difference
between the two approaches is that, whereas the constraint-based
approach uses categorical information about conditional-independence
constraints in the domain, the Bayesian approach weighs the degree to
which such constraints hold. As a result, the Bayesian approach has
three distinct advantages over its constraint-based counterpart. One,
conclusions derived from the Bayesian approach are not susceptible to
incorrect categorical decisions about independence facts that can
occur with data sets of finite size. Two, using the Bayesian
approach, finer distinctions among model structures---both
quantitative and qualitative---can be made. Three, information from
several models can be combined to make better inferences and to better
account for modeling uncertainty. In addition to describing the
general Bayesian approach to causal discovery, we review approximation
methods for missing data and hidden variables, and illustrate
differences between the Bayesian and constraint-based methods using
artificial and real examples.

[postscript][bibtex]

Computationally efficient methods for selecting among mixtures of graphical models, with discussion with B. Thiesson, D. Chickering and D. Heckerman

We describe computationally efficient methods for Bayesian
model selection. The methods select among mixtures in which each
mixture component is a directed acyclic graphical model and can be
applied to incomplete data sets. The model-selection criterion that
we consider is the posterior probability of the model (structure)
given data. Our model-selection problem is difficult because (1) the
number of possible model structures grows super-exponentially with the
number of random variables and (2) missing data necessitates the use
of computationally slow approximations of model posterior probability.
We argue that simple search-and-score algorithms are infeasible for a
variety of problems, and introduce a feasible approach in which
parameter and structure search is interleaved and expected data is
treated as real data. Our approach can be viewed as the combination
of (1) a modified Cheeseman--Stutz asymptotic approximation for model
posterior probability and (2) the Expectation--Maximization algorithm.
We evaluate our procedure for selecting among mixtures of directed
acyclic graphical models on synthetic and real examples.

Not available electronically (Bayesian Statistics 6) [bibtex] also see [bibtex]

Graphical Models and Exponential Families with Dan Geiger

We provide a classification of graphical models according to their
representation as subfamilies of exponential families. Undirected
graphical models with no hidden variables are linear exponential
families (LEFs), directed acyclic graphical models and chain graphs
with no hidden variables, including Bayesian networks with several
families of local distributions, are curved exponential families
(CEFs) and graphical models with hidden variables are stratified
exponential families (SEFs). An SEF is a finite union of CEFs
satisfying a frontier condition. In addition, we illustrate how
one can automatically generate independence and non-independence
constraints on the distributions over the observable variables implied
by a Bayesian network with hidden variables. The relevance of these
results for model selection is examined.

[postscript][bibtex]

Learning mixtures of DAG models with B. Thiesson, D. Chickering and D. Heckerman

We describe computationally efficient methods for Bayesian
model selection. The methods select among mixtures in which each
mixture component is a directed acyclic graphical model and can be
applied to incomplete data sets. The model-selection criterion that
we consider is the posterior probability of the model (structure)
given data. Our model-selection problem is difficult because (1) the
number of possible model structures grows super-exponentially with the
number of random variables and (2) missing data necessitates the use
of computationally slow approximations of model posterior probability.
We argue that simple search-and-score algorithms are infeasible for a
variety of problems, and introduce a feasible approach in which
parameter and structure search is interleaved and expected data is
treated as real data. Our approach can be viewed as the combination
of (1) a modified Cheeseman--Stutz asymptotic approximation for model
posterior probability and (2) the Expectation--Maximization algorithm.
We evaluate our procedure for selecting among mixtures of directed
acyclic graphical models on synthetic and real examples.

[postscript][bibtex] also in UAI98 [bibtex]

Using path diagrams as a structural equation modeling tool with Peter Spirtes, Thomas Richardson, Richard Scheines and Clark Glymour

Abstract not available

Not available electronically [bibtex]

Asymptotic model selection for directed networks with hidden variables with D. Geiger and D. Heckerman

We extend the Bayesian Information Criterion (BIC), an asymptotic
approximation for the marginal likelihood, to Bayesian networks with
hidden variables. This approximation can be used to select models
given large samples of data. The standard BIC as well as our
extension punishes the complexity of a model according to the
dimension of its parameters. We argue that the dimension of a Bayesian
network with hidden variables is the rank of the Jacobian matrix of
the transformation between the parameters of the network and the
parameters of the observable variables. We compute the dimensions of
several networks including the naive Bayes model with a hidden root
node.
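
To make the dimension-as-Jacobian-rank idea concrete, here is a numerical check (my construction, not code from the paper) for a naive Bayes model with a hidden binary root H and two binary observables: the map from the five free parameters to the four observable cell probabilities has Jacobian rank 3 at a generic point, so that is the effective dimension.

```python
import numpy as np

def joint(params):
    """Map naive-Bayes parameters (hidden binary root H, binary leaves
    X1, X2) to the 4-cell observable distribution P(x1, x2).
    params = (P(H=1), P(X1=1|H=1), P(X1=1|H=0), P(X2=1|H=1), P(X2=1|H=0))."""
    th, a1, a0, b1, b0 = params
    p = np.empty(4)
    for i, (x1, x2) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
        p[i] = (th * (a1 if x1 else 1 - a1) * (b1 if x2 else 1 - b1)
                + (1 - th) * (a0 if x1 else 1 - a0) * (b0 if x2 else 1 - b0))
    return p

def numeric_rank(f, x, eps=1e-6):
    """Rank of the Jacobian of f at x, by central differences."""
    cols = []
    for j in range(len(x)):
        d = np.zeros(len(x)); d[j] = eps
        cols.append((f(x + d) - f(x - d)) / (2 * eps))
    J = np.column_stack(cols)
    return np.linalg.matrix_rank(J, tol=1e-6)
```

The rank is at most 3 because the cell probabilities always sum to one, and it drops below 3 only at degenerate parameter values (e.g., when the two hidden states induce identical leaf distributions).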

[postscript][bibtex] also in UAI96 [bibtex] and Erice volume [bibtex]

Graphical models: selecting causal and statistical models

Abstract not available

Not available [bibtex]

An evaluation of machine-learning methods for predicting pneumonia mortality
with Cooper et al. (Artificial Intelligence in Medicine)

This paper describes the application of eight statistical and machine-learning methods to derive
computer models for predicting mortality of hospital patients with pneumonia from their findings
at initial presentation. The eight models were each constructed based on 9847 patient cases and
they were each evaluated on 4352 additional cases. The primary evaluation metric was the error in
predicted survival as a function of the fraction of patients predicted to survive. This metric is useful
in assessing a model's potential to assist a clinician in deciding whether to treat a given patient in
the hospital or at home. We examined the error rates of the models when predicting that a given
fraction of patients will survive. We examined survival fractions between 0.1 and 0.6. Over this
range, each model's predictive error rate was within 1% of that of every other model. When predicting
that approximately 30% of the patients will survive, all the models have an error rate of less than
1.5%. The models are distinguished more by the number of variables and parameters that they
contain than by their error rates; these differences suggest which models may be the most amenable
to future implementation as paper-based guidelines.

Embedded Bayesian Network Classifiers with David Heckerman

We consider learning methods and probability models for the task of
classification: the determination of a conditional probability
distribution for a finite-state outcome or class variable
Y given a set of explanatory or input variables X.
In the first part of the paper, we consider methods for learning
traditional Bayesian networks for classification. Using the full
Bayesian approach, we determine the conditional distribution of
interest by averaging over all possible model structures and their
parameters, given the data we have seen. When this approach is
intractable, we sometimes identify a single "good" model structure
and use it as if it were the correct one---a process known as
model selection. We review two alternative Bayesian criteria for
model-selection described by Spiegelhalter et al. (1993) and Buntine
(1993), and show how both can be viewed within the prequential
framework of Dawid (1984). We review methods for computing the two
criteria, and provide sufficient conditions under which the two
criteria coincide. In the second part of the paper, we describe a
parameterized model for the probability distributions of a node Y
given its parents X in a Bayesian network. The probability model,
which we call an embedded Bayesian network classifier (EBNC), is
obtained from a (usually different) Bayesian network for Y and X
where X need not be the parents of Y. Like the decision tree,
noisy OR, and log-linear model, the EBNC is a parsimonious (i.e.,
low-dimension) model for encoding the interaction between a variable
and its parents. We show that an EBNC is a special case of a softmax
polynomial regression model. Also, we show how to identify a
minimum-size parameterization of an EBNC, and describe an efficient
approximation for learning the structure of Bayesian networks that
contain EBNCs.

[postscript][bibtex]

Models and Selection Criteria for Regression and Classification with David Heckerman

Abstract not available

[postscript] also in UAI97 [bibtex]

A Bayesian approach to learning Bayesian networks with local structure with D. Chickering and D. Heckerman

Recently several researchers have investigated techniques for using data
to learn Bayesian networks containing compact representations for
the conditional probability distributions (CPDs) stored at each
node. The majority of this work has concentrated on using
decision-tree representations for the CPDs. In addition, researchers
typically apply non-Bayesian (or asymptotically Bayesian) scoring
functions such as MDL to evaluate the goodness-of-fit of networks
to the data.

In this paper we investigate a Bayesian approach to learning Bayesian
networks that contain the more general decision-graph
representations of the CPDs. First, we describe how to evaluate the
posterior probability---that is, the Bayesian score---of such a
network, given a database of observed cases. Second, we describe
various search spaces that can be used, in conjunction with a scoring
function and a search procedure, to identify one or more high-scoring
networks. Finally, we present an experimental evaluation of
the search spaces, using a greedy algorithm and a Bayesian
scoring function.

[postscript] also in UAI97 [bibtex]

Structure and Parameter Learning for Causal Independence and Causal Interaction Models with David Heckerman

We begin by discussing causal independence models and generalize these
models to causal interaction models. Causal interaction models are
models that have independent mechanisms where mechanisms can have
several causes. In addition to introducing several particular types of
causal interaction models, we show how we can apply
the Bayesian approach to learning causal interaction models obtaining
approximate posterior distributions for the models and obtain MAP and
ML estimates for the parameters. We illustrate the approach with a
simulation study of learning model posteriors.

Available by request [bibtex]

Heuristic greedy search algorithms for latent variable models with P. Spirtes and T. Richardson

Abstract not available

Not available electronically [bibtex]

The dimensionality of mixed ancestral graphs with P. Spirtes and T. Richardson

Abstract not available

Not available electronically [bibtex]

Asymptotic model selection for directed networks with hidden variables with D. Geiger and D. Heckerman

We extend the Bayesian Information Criterion (BIC), an asymptotic
approximation for the marginal likelihood, to Bayesian networks with
hidden variables. This approximation can be used to select models
given large samples of data. The standard BIC as well as our
extension punishes the complexity of a model according to the
dimension of its parameters. We argue that the dimension of a Bayesian
network with hidden variables is the rank of the Jacobian matrix of
the transformation between the parameters of the network and the
parameters of the observable variables. We compute the dimensions of
several networks including the naive Bayes model with a hidden root
node.

[postscript][bibtex] also in UAI96 [bibtex] and Erice volume [bibtex]

Using d-separation to calculate zero partial correlations in linear models with correlated errors with P. Spirtes, T. Richardson, R. Scheines and C. Glymour

Abstract not available

Not available [bibtex]

Learning Bayesian networks with discrete variables from data with Peter Spirtes

This paper describes a new greedy Bayesian search algorithm
(GBPS) for Bayesian networks and a new algorithm that combines the
scoring-based GBPS with the independence-based PC algorithm.
Simulation tests of these algorithms with previously published
algorithms are presented.

[postscript][bibtex]

Strong completeness and faithfulness in Bayesian networks

A completeness result for d-separation applied to discrete Bayesian
networks is presented and it is shown that in a strong
measure-theoretic sense almost all discrete distributions for a given
network structure are faithful; i.e., the independence facts true of
the distribution are all and only those entailed by the network
structure.

[postscript] [bibtex]

Causal inference and causal explanation with background knowledge

This paper presents correct algorithms for answering the
following two questions: (i) Does there exist a causal explanation consistent with
a set of background knowledge which explains all of the observed
independence facts in a sample? (ii) Given that there is such a causal
explanation what are the causal relationships common to every such
causal explanation?

[postscript][bibtex]

Causal inference in the presence of latent variables and selection bias with P. Spirtes and T. Richardson

Abstract not available

Not available electronically [bibtex]

Tetrad II: User's Manual with Richard Scheines, Peter Spirtes and Clark Glymour

Our approach is to axiomatize the connection between causal structure
and statistical independence and conditional independence, and then
formulate algorithms that take as input statistical data and
background knowledge and output classes of causal models that explain
the data. The algorithms make statistical decisions about
independence and conditional independence, and then apply graph
theoretic search strategies to arrive at the appropriate class of
causal models. We have developed a number of algorithms for different
contexts of background knowledge, for example, time order, knowledge
variables, and partial knowledge of causal relations that do or do not
exist among the variables under study.

Not available electronically. [bibtex]

Conditioning and Intervening with Clark Glymour

We consider the dispute between causal decision theorists and
evidential decision theorists over Newcomb-like problems. We introduce
a framework relating causation and directed graphs developed by
Spirtes et al. (1993) and evaluate several arguments in this
context. We argue that much of the debate between the two camps is
misplaced; the disputes turn on the distinction between conditioning
on an event E as against conditioning on an event I which is an action
to bring about E. We give the essential machinery for calculating the
effect of an intervention and consider recent work which extends the
basic account given here to the case where causal knowledge is
incomplete.

Not available electronically. [bibtex]