Annotated Bibliography of Papers by Christopher Meek

Christopher Meek

Microsoft Research

Redmond, WA 98052, USA

meek@microsoft.com

March, 2008

Year: 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994

*Tied Boltzmann Machines for Cold Start Recommendations*

with Asela Gunawardana (ACM RecSys)

We describe a novel statistical model, the tied Boltzmann machine,
for combining collaborative and content information for recommendations.
In our model, pairwise interactions between items are captured through
a Boltzmann machine, whose parameters are constrained according to the
content associated with the items. This allows the model to use content
information to recommend items that are not seen during training. We
describe a tractable algorithm for training the model, and give
experimental results evaluating the model in two cold start
recommendation tasks on the MovieLens data set.

*Inference for Multiplicative Models*

Ydo Wexler and Christopher Meek (UAI)

The paper introduces multiplicative models, a generalization of known probabilistic models such as log-linear and graphical models. These models, which express probabilities as products of parameters, are shown to capture multiple forms of contextual independence between variables, including decision graphs and noisy-OR functions. An inference algorithm for multiplicative models is provided and its correctness is proved. The complexity analysis of the inference algorithm uses a more refined parameter than the tree-width of the underlying graph, and shows that the computational cost does not exceed that of the variable elimination algorithm in graphical models. The paper ends with examples where using the new models and algorithm is computationally beneficial.
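As a small illustration of the contextual independence involved, a noisy-OR conditional distribution can be written as a pure product of parameters. The sketch below is our own toy example with made-up inhibition probabilities, not code from the paper.

```python
# Toy noisy-OR written as a product of parameters: q[i] is the
# probability that active parent i fails to turn the child on, and q0
# is a leak term. All numbers are made-up illustration values.

def noisy_or(active_parents, q, q0=1.0):
    """Return P(child = 1 | parents), where active_parents lists the
    indices of parents that are 'on'."""
    p_child_off = q0
    for i in active_parents:
        p_child_off *= q[i]          # each parent is independently inhibited
    return 1.0 - p_child_off

q = {0: 0.5, 1: 0.5, 2: 0.2}
print(noisy_or([0, 1], q))           # 1 - 0.5 * 0.5 = 0.75
```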

*MAS: a multiplicative approximation scheme for probabilistic inference*

Ydo Wexler and Christopher Meek (NIPS)

We propose a multiplicative approximation scheme (MAS) for inference problems in graphical models, which can be applied to various inference algorithms. The method uses $\epsilon$-decompositions, which decompose functions used throughout the inference procedure into functions over smaller sets of variables with a known error $\epsilon$. MAS translates these local approximations into bounds on the accuracy of the results. We show how to optimize $\epsilon$-decompositions and provide a fast closed-form solution for an L2 approximation. Applying MAS to the Variable Elimination inference algorithm, we introduce an algorithm we call DynaDecomp, which is extremely fast in practice and provides guaranteed error bounds on the result. The superior accuracy and efficiency of DynaDecomp are demonstrated.
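To picture what such a decomposition looks like, here is a sketch (our own illustration, not the paper's exact construction) that approximates a pairwise potential by a product of single-variable factors. In log space, the L2-optimal additive split of a matrix has a closed form built from row and column means, in the spirit of the closed-form L2 solution the abstract mentions.

```python
import numpy as np

# Sketch: approximate a pairwise potential phi(x, y) by f(x) * g(y).
# In log space this is a least-squares fit of a[i] + b[j] to log(phi),
# whose closed-form solution uses row means and column means.

def l2_decompose(phi):
    M = np.log(phi)
    grand = M.mean()
    a = M.mean(axis=1) - grand / 2.0          # row effects
    b = M.mean(axis=0) - grand / 2.0          # column effects
    eps = np.abs(M - (a[:, None] + b[None, :])).max()
    return np.exp(a), np.exp(b), eps          # phi matched within e**(+-eps)

phi = np.outer([1.0, 2.0], [3.0, 4.0])        # exactly multiplicative example
f, g, eps = l2_decompose(phi)
```

For an exactly multiplicative potential the recovered error is zero; for a general potential, `eps` plays the role of the known local error that gets translated into a global accuracy bound.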

*Mining Recommendations from the Web*

Guy Shani, David Maxwell Chickering, and Christopher Meek (ACM RecSys)

In this paper we show how to use data from the internet to construct recommender systems. We provide experimental results, which include a user study, showing that our methods produce good recommendations in realistic applications. We describe how standard evaluation metrics are biased toward systems that simply recommend the most popular items, and we propose modifications to these metrics to fix the problem.

*A Quality-Based Auction for Search Ad Markets with Aggregators*

with Gunawardana and Biggs (ACM EC Workshop on Ad Auctions)

We explore the role of aggregators in search ad markets and their effects on that marketplace. Aggregators can have positive effects similar to those of an arbitrageur. We argue and provide empirical evidence that, unlike arbitrageurs, aggregators can continue to profit even in the absence of price imbalances. Furthermore, we argue that the standard generalized second price (GSP) auction mechanisms create incentives for aggregators to design their sites to negatively affect the user experience and that the existence of aggregators in the marketplace can also negatively affect non-aggregator merchants. We propose a specific quality-based GSP mechanism that reduces the negative impacts of aggregators on the user and marketplace while allowing some of their positive benefits for users.

*Aggregators and Contextual Effects in Search Ad Markets*

with Gunawardana (WWW Workshop on Targeting and Ranking for Online Advertising)

Allocation and pricing of ad slots in search ad markets is typically done through a Generalized Second Price (GSP) auction, which assumes that the click-through rate (CTR) of an ad displayed in a particular position depends only on the identity of the ad and the position it is displayed in. In particular, it is assumed that there are no contextual effects, where the CTR of an ad depends on the other ads displayed with it. We argue that such effects do exist. In particular, we discuss how the economics of ad aggregation may lead to ad aggregators imposing a negative contextual effect on the CTR of advertisers listed below them. We perform an experiment where we intervene in the auction to vary the position of selected aggregators relative to other advertisers. We then measure CTR on ads controlling for ad identity and position to quantify the magnitude of the contextual effect of being listed below an aggregator. In addition we describe the result of the intervention on search engine revenue, which is consistent with an increase in advertiser social welfare.

*Partitioned Logistic Regression for Spam Filtering*

Ming-Wei Chang, Wen-tau Yih, and Christopher Meek, (ACM KDD)

*Consistent Phrase Relevance Measures*

Wen-tau Yih and Christopher Meek (ADKDD-08 Workshop)

*Mobile Opportunistic Commerce: Mechanisms, Architecture, and Application*

Ece Kamar, Eric Horvitz, and Chris Meek (AAMAS)

*People Watcher: A Game for Eliciting Human-Transcribed Data for Automated Directory Assistance*

with Paek and Yu (InterSpeech; ELRA Best Paper Award)

Automated Directory Assistance (ADA) allows users to request telephone or address information of residential and business listings using speech recognition. Because callers often express listings differently than how they are registered in the directory, ADA systems require transcriptions of alternative phrasings for directory listings as training data, which can be costly to acquire. As such, a framework in which data can be contributed voluntarily by large numbers of Internet users has tremendous value. In this paper, we introduce People Watcher, a computer game that elicits transcribed, alternative user phrasings for directory listings while at the same time entertaining players. Data generated from the game not only overlapped actual audio transcriptions, but resulted in a statistically significant improvement when utilized for ADA. Furthermore, semantic accuracy was not statistically different from using the actual audio transcriptions.

*Similarity Measures for Short Segments of Text*

with Metzler and Dumais (ECIR)

Measuring the similarity between documents and queries has been extensively studied in information retrieval. However, there are a growing number of tasks that require computing the similarity between two very short segments of text. These tasks include query reformulation, sponsored search, and image retrieval. Standard text similarity measures perform poorly on such tasks because of data sparseness and the lack of context. In this work, we study this problem from an information retrieval perspective, focusing on text representations and similarity measures. We examine a range of similarity measures, including purely lexical measures, stemming, and language modeling-based measures. We formally evaluate and analyze the methods on a query-query similarity task using 363,822 queries from a web search log. Our analysis provides insights into the strengths and weaknesses of each method, including important tradeoffs between effectiveness and efficiency.
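As a concrete example of the purely lexical baselines studied here (our own toy illustration, with a made-up background corpus for IDF), TF-IDF cosine similarity on two short segments shows exactly the sparseness problem the abstract describes: segments with no overlapping terms score zero regardless of meaning.

```python
import math
from collections import Counter

# Toy TF-IDF cosine similarity between short text segments. The tiny
# "background corpus" used for document frequencies is made up.

corpus = ["seattle hotels", "seattle mariners", "cheap hotels", "cheap flights"]
df = Counter(w for doc in corpus for w in set(doc.split()))
N = len(corpus)

def tfidf(text):
    tf = Counter(text.split())
    return {w: c * math.log((N + 1) / (df[w] + 1)) for w, c in tf.items()}

def cosine(a, b):
    va, vb = tfidf(a), tfidf(b)
    dot = sum(va[w] * vb.get(w, 0.0) for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Note that `cosine("seattle hotels", "cheap flights")` is exactly zero even though both are travel queries, which is the kind of failure that motivates the richer representations compared in the paper.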

*Probabilistic Entity-Relationship Models, PRMs and Plate Models*

with Heckerman and Koller (in Introduction to Statistical Relational Learning; MIT Press)

In this chapter, we introduce a graphical language for relational data called the probabilistic entity-relationship (PER) model. The model is an extension of the entity-relationship model, a common model for the abstract representation of database structure. We concentrate on the directed version of the model, the directed acyclic probabilistic entity-relationship (DAPER) model. The DAPER model is closely related to the plate model and the PRM, existing models of relational data. The DAPER model is more expressive than either existing model. Many examples are provided.

*Improving Similarity Measures for Short Segments of Text*

with Yih (AAAI)

In this paper we improve previous work on measuring the similarity of short segments of text in two ways. First, we introduce a Web-relevance similarity measure and demonstrate its effectiveness. This measure extends the Web-kernel similarity function introduced by Sahami and Heilman (2006) by using a relevance-weighted inner product of term occurrences rather than TFIDF. Second, we show that one can further improve the accuracy of similarity measures by using a machine learning approach. Our methods outperform other state-of-the-art methods in a general query suggestion task for multiple evaluation metrics.

*Modeling Contextual Factors of Click Rates*

with H. Becker and Chickering (AAAI)

In this paper, we develop and evaluate several probabilistic models of user click-through behavior that are appropriate for modeling the click-through rate of items that are presented to the user in a list. Potential applications include modeling the click-through rates of search results from a search engine, items ranked by a recommendation system, and search advertisements returned by a search engine. Our models capture contextual factors related to the presentation as well as the underlying relevance or quality of the item. We focus on two types of contextual factors for a given item: the positional context of the item and the quality of the other results. We evaluate our models on a search advertising dataset from Microsoft's Live search engine and demonstrate that modeling contextual factors improves the accuracy of click-through models.
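One common way to separate positional context from item quality (our own illustration of the general idea, not necessarily the paper's exact parameterization) is the examination hypothesis, which factors the observed click-through rate into an item effect and a position effect. All numbers below are made up.

```python
# Examination-hypothesis sketch:
#     CTR(item, position) = relevance(item) * examination(position)
# Made-up position discounts and item qualities for illustration.

examination = {1: 1.0, 2: 0.6, 3: 0.4}     # probability the slot is even seen
relevance = {"adA": 0.10, "adB": 0.05}     # item quality, position-free

def expected_ctr(item, position):
    return relevance[item] * examination[position]
```

Under this factorization, the weaker `adB` shown in position 1 out-clicks the stronger `adA` shown in position 3, which is why raw CTR comparisons mislead unless the model conditions on presentation context.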

*On the Toric Algebra of Graphical Models*

with Geiger and Sturmfels (Annals of Statistics)

We formulate necessary and sufficient conditions for an arbitrary discrete probability distribution to factor according to an undirected graphical model, or a log-linear model, or other more general exponential models. For decomposable graphical models these conditions are equivalent to a set of conditional independence statements similar to the Hammersley–Clifford theorem; however, we show that for nondecomposable graphical models they are not. We also show that nondecomposable models can have nonrational maximum likelihood estimates. These results are used to give several novel characterizations of decomposable graphical models.

*On the incompatibility of faithfulness and monotone-DAG-faithfulness*

with Chickering (Artificial Intelligence)

Cheng, Greiner, Kelly, Bell and Liu [Artificial Intelligence 137 (2002) 43-90] describe an algorithm for learning Bayesian networks that in a domain consisting of n variables identifies the optimal solution using $O(n^4)$ calls to a mutual-information oracle. This result relies on (1) the standard assumption that the generative distribution is Markov and faithful to some directed acyclic graph (DAG), and (2) a new assumption about the generative distribution that the authors call monotone DAG faithfulness (MDF). The MDF assumption rests on an intuitive connection between active paths in a Bayesian-network structure and the mutual information among variables. The assumption states that the (conditional) mutual information between a pair of variables is a monotonic function of the set of active paths between those variables; the more active paths between the variables, the higher the mutual information. In this paper, we demonstrate the unfortunate result that, for any realistic learning scenario, the monotone DAG faithfulness assumption is incompatible with the faithfulness assumption. Furthermore, for the class of Bayesian-network structures for which the two assumptions are compatible, we can learn the optimal solution using standard approaches that require only $O(n^2)$ calls to an independence oracle.

*A Variational Inference Procedure Allowing Internal Structure for Overlapping Clusters and Deterministic Constraints*

with Geiger and Wexler (JAIR)

We develop a novel algorithm, called VIP*, for structured variational approximate inference. This algorithm extends known algorithms to allow efficient multiple potential updates for overlapping clusters, and overcomes the difficulties imposed by deterministic constraints. The algorithm's convergence is proven and its applicability demonstrated for genetic linkage analysis.

*Structural Periodic Measures for Time-Series Data*

with Vlachos et al. (Data Mining and Knowledge Discovery)

This work motivates the need for more flexible structural similarity measures between time-series sequences, which are based on the extraction of important periodic features. Specifically, we present non-parametric methods for accurate periodicity detection and we introduce new periodic distance measures for time-series sequences. We combine these new measures with an effective metric tree index structure for efficiently answering k-Nearest-Neighbor queries. The goal of these tools and techniques is to assist in detecting, monitoring and visualizing structural periodic changes. It is our belief that these methods can be directly applicable in the manufacturing industry for preventive maintenance and in the medical sciences for accurate classification and anomaly detection.

*Good Word Attacks on Statistical Spam Filters*

with D. Lowd (CEAS)

Unsolicited commercial email is a significant problem for users and providers of email services. While statistical spam filters have proven useful, senders of spam are learning to bypass these filters by systematically modifying their email messages. In a good word attack, one of the most common techniques, a spammer modifies a spam message by inserting or appending words indicative of legitimate email. In this paper, we describe and evaluate the effectiveness of active and passive good word attacks against two types of statistical spam filters: naive Bayes and maximum entropy filters. We find that in passive attacks without any filter feedback, an attacker can get 50% of currently blocked spam past the filter by adding 150 words or fewer. In active attacks allowing test queries to the target filter, 30 words will get half of blocked spam past either filter.
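A toy version of the attack (our own illustration with made-up word likelihoods, far smaller than the real filters evaluated in the paper): under a naive Bayes log-odds score, appending a few words indicative of legitimate email drives a spam message below the decision threshold.

```python
import math

# Toy passive good word attack on a naive Bayes-style spam score.
# Per-word likelihoods are made-up illustration values; unknown words
# are treated as uninformative (ratio 1).

p_spam = {"viagra": 0.9, "free": 0.8, "meeting": 0.1, "thesis": 0.05, "agenda": 0.05}
p_ham  = {"viagra": 0.01, "free": 0.3, "meeting": 0.6, "thesis": 0.4, "agenda": 0.5}

def spam_log_odds(words):
    """Positive -> classified spam; negative -> classified legitimate."""
    return sum(math.log(p_spam.get(w, 0.5) / p_ham.get(w, 0.5)) for w in words)

original = ["viagra", "free"]                       # blocked: score > 0
attacked = original + ["meeting", "thesis", "agenda"]  # appended good words
```

Here three appended "good words" flip the classification; the paper quantifies how many such words are needed against real filters, with and without query access.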

*Adversarial learning*

with D. Lowd (KDD)

Many classification tasks, such as spam filtering, intrusion detection, and terrorism detection, are complicated by an adversary who wishes to avoid detection. Previous work on adversarial classification has made the unrealistic assumption that the attacker has perfect knowledge of the classifier [2]. In this paper, we introduce the adversarial classifier reverse engineering (ACRE) learning problem, the task of learning sufficient information about a classifier to construct adversarial attacks. We present efficient algorithms for reverse engineering linear classifiers with either continuous or Boolean features and demonstrate their effectiveness using real data from the domain of spam filtering.
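The key setting is that the attacker sees only the classifier's yes/no answers, not its weights. The sketch below is a much-simplified illustration in that spirit (not the ACRE algorithms from the paper): given query access to a hidden Boolean linear filter, the attacker appends candidate words one at a time until a blocked message gets through. The weights are made up and hidden from the attack routine.

```python
# Simplified membership-query sketch in the spirit of ACRE (not the
# paper's algorithm). The attacker may only call classify_spam; the
# weights below are made-up and conceptually hidden.

weights = {"viagra": 3.0, "free": 1.0, "hello": -1.0, "meeting": -2.0}
THRESHOLD = 0.0

def classify_spam(words):
    """The only interface the attacker has: a boolean verdict."""
    return sum(weights.get(w, 0.0) for w in words) > THRESHOLD

def evade(words, candidate_good_words):
    """Append candidates one at a time, querying after each, until the
    message is no longer classified as spam."""
    msg = list(words)
    for w in candidate_good_words:
        if not classify_spam(msg):
            break
        msg.append(w)
    return msg
```

The paper's contribution is doing this kind of thing with provably few queries and near-minimal modification cost for linear classifiers over continuous or Boolean features.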

*Using epitomes to model genetic diversity: Rational design of HIV vaccines*

with Jojic et al. (NIPS)

We introduce a new model of genetic diversity which summarizes a large input dataset into an epitome, a short sequence or a small set of short sequences of probability distributions capturing many overlapping subsequences from the dataset. The epitome as a representation has already been used in modeling real-valued signals, such as images and audio. The discrete sequence model we introduce in this paper targets applications in genetics, from multiple alignment to recombination and mutation inference. In our experiments, we concentrate on modeling the diversity of HIV, where the epitome emerges as a natural model for producing relatively small vaccines covering a large number of immune system targets known as epitopes. Our experiments show that the epitome includes more epitopes than other vaccine designs of similar length, including cocktails of consensus strains, phylogenetic tree centers, and observed strains. We also discuss epitome designs that take into account uncertainty about T-cell cross reactivity and epitope presentation. In our experiments, we find that vaccine optimization is fairly robust to these uncertainties.

*Perfect Tree-like Markovian Distributions*

with Ann Becker and Dan Geiger (Probability and Mathematical Statistics)

We show that if a strictly positive joint probability distribution for a set of binary variables factors according to a tree, then vertex separation represents all and only the independence relations encoded in the distribution. The same result is shown to hold also for multivariate nondegenerate normal distributions. Our proof uses a new property of conditional independence that holds for these two classes of probability distributions.

*Stochastic and contingent payment auctions*

with Chickering and Wilson (Workshop on Sponsored Search Auctions, ACM-EC)

We describe a family of stochastic incentive-compatible auctions that generalize Vickrey auctions and discuss situations in which such auctions are useful. We also discuss contingent-payment auctions and their incentive-compatible counterparts. Finally, we discuss how one can utilize both reinforcement-learning techniques and stochastic auctions to learn probabilities for contingencies.

*Structured Variational Inference Procedures and their Realizations*

with D. Geiger (AI&Stats 2005)

We describe and prove the convergence of several algorithms for approximate structured variational inference. We discuss the computation cost of these algorithms and describe their relationship to the mean-field and generalized-mean-field variational approaches and other structured variational methods.

*Efficient gradient computation for conditional Gaussian models*

with B. Thiesson (AI&Stats 2005)

We introduce Recursive Exponential Mixed Models (REMMs) and derive the gradient of the parameters for the incomplete-data likelihood. We demonstrate how one can use probabilistic inference in Conditional Gaussian (CG) graphical models, a special case of REMMs, to compute the gradient for a CG model. We also demonstrate that this approach can yield simple and effective algorithms for computing the gradient for models with tied parameters and illustrate this approach on stochastic ARMA models.

*Large-Sample Learning of Bayesian Networks is NP-Hard*

with D.M. Chickering and D. Heckerman (JMLR 2004)

In this paper, we provide new complexity results for algorithms that learn discrete-variable Bayesian networks from data. Our results apply whenever the learning algorithm uses a scoring criterion that favors the simplest structure for which the model is able to represent the generative distribution exactly. Our results therefore hold whenever the learning algorithm uses a consistent scoring criterion and is applied to a sufficiently large dataset. We show that identifying high-scoring structures is NP-hard, even when any combination of one or more of the following hold: the generative distribution is perfect with respect to some DAG containing hidden variables; we are given an independence oracle; we are given an inference oracle; we are given an information oracle; we restrict potential solutions to structures in which each node has at most k parents, for all $k \geq 3$. Our proof relies on a new technical result that we establish in the appendices. In particular, we provide a method for constructing the local distributions in a Bayesian network such that the resulting joint distribution is provably perfect with respect to the structure of the network.

[JMLR][bibtex]

*Efficient approximations for learning phylogenetic HMM models from data*

with V. Jojic, N. Jojic, D. Geiger, A. Siepel, D. Haussler and D. Heckerman (Bioinformatics 2004)

We consider models useful for learning an evolutionary or phylogenetic tree from data consisting of DNA sequences corresponding to the leaves of the tree. In particular, we consider a general probabilistic model described in Siepel and Haussler that we call the phylogenetic-HMM model, which generalizes the classical probabilistic models of Neyman and Felsenstein. Unfortunately, computing the likelihood of phylogenetic-HMM models is intractable. We consider several approximations for computing the likelihood of such models including an approximation introduced in Siepel and Haussler, loopy belief propagation and several variational methods. We demonstrate that, unlike the other approximations, variational methods are accurate and are guaranteed to lower bound the likelihood. In addition, we identify as best a particular variational approximation, one in which the posterior distribution is variationally approximated using the classic Neyman and Felsenstein model. The application of our best approximation to data from the cystic fibrosis transmembrane conductance regulator gene region across nine eutherian mammals reveals a CpG effect.

[ISMB][bibtex]

*Variations on Undirected Graphical Models and their Relationships*

with D. Heckerman and T. Richardson.

Associated with an undirected graph is the set of distributions obeying the Local Markov property, each of which is also the unique stationary distribution of the corresponding single-site Gibbs sampler. In this note we show that this set of distributions is not a subset of the set of distributions obeying the global Markov property with respect to a graph. It follows that this set is also not a subset of the set of distributions which are limits of distributions factorizing according to an undirected graph.

[TechReport][bibtex]

*Probabilistic Models for Relational Data*

with D. Heckerman and D. Koller.

We introduce a graphical language for relational data called the probabilistic entity-relationship (PER) model. The model is an extension of the entity-relationship model, a common model for the abstract representation of database structure. We concentrate on the directed version of this model---the directed acyclic probabilistic entity-relationship (DAPER) model. The DAPER model is closely related to the plate model and the probabilistic relational model (PRM), existing models for relational data. The DAPER model is more expressive than either existing model, and also helps to demonstrate their similarity. In addition to describing the new language, we discuss important facets of modeling relational data, including the use of restricted relationships, self relationships, and probabilistic relationships. Many examples are provided.

[TechReport][bibtex]

*ARMA Time-Series Modeling with Graphical Models*

with B. Thiesson, D.M. Chickering, D. Heckerman (UAI 2004).

We express the classic ARMA time-series model as a directed graphical model. In doing so, we find that the deterministic relationships in the model make it effectively impossible to use the EM algorithm for learning model parameters. To remedy this problem, we replace the deterministic relationships with Gaussian distributions having a small variance, yielding the stochastic ARMA ($\sigma$ARMA) model. This modification allows us to use the EM algorithm to learn parameters and to forecast, even in situations where some data is missing. This modification, in conjunction with the graphical-model approach, also allows us to include cross predictors in situations where there are multiple time series and/or additional non-temporal covariates. More surprisingly, experiments suggest that the move to stochastic ARMA yields improved accuracy through better smoothing. We demonstrate improvements afforded by cross prediction and better smoothing on real data.
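The core modification can be shown in a few lines. The sketch below (our own illustration; coefficients and the noise scale are made-up values) simulates an ARMA(1,1)-style process where the usual deterministic update is replaced by a Gaussian of small variance $\sigma^2$, so that $\sigma \to 0$ recovers exact ARMA while keeping every relationship stochastic, which is what makes EM applicable.

```python
import random

# sigma-ARMA sketch: the classic ARMA(1,1) update
#     y_t = a * y_{t-1} + b * e_{t-1} + e_t
# is deterministic given the noise e_t; here y_t is instead drawn from a
# Gaussian centered at that value with small variance sigma**2.

def simulate_sigma_arma(n, a=0.7, b=0.3, sigma=0.01, seed=0):
    rng = random.Random(seed)
    y, y_prev, e_prev = [], 0.0, 0.0
    for _ in range(n):
        e = rng.gauss(0.0, 1.0)
        mean = a * y_prev + b * e_prev + e
        y_t = rng.gauss(mean, sigma)    # sigma -> 0 recovers exact ARMA
        y.append(y_t)
        y_prev, e_prev = y_t, e
    return y
```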

*Identifying Similarities, Periodicities and Bursts for Online Search Queries*

with Vlachos et al. (SIGMOD)

We present several methods for mining knowledge from the query logs of the MSN search engine. Using the query logs, we build a time series for each query word or phrase (e.g., Thanksgiving or Christmas gifts) where the elements of the time series are the number of times that a query is issued on a day. All of the methods we describe use sequences of this form and can be applied to time series data generally. Our primary goal is the discovery of semantically similar queries and we do so by identifying queries with similar demand patterns. Utilizing the best Fourier coefficients and the energy of the omitted components, we improve upon the state-of-the-art in time-series similarity matching. The extracted sequence features are then organized in an efficient metric tree index structure. We also demonstrate how to efficiently and accurately discover the important periods in a time series. Finally we propose a simple but effective method for identification of bursts (long or short-term). Using the burst information extracted from a sequence, we are able to efficiently perform query-by-burst on the database of time series. We conclude the presentation with the description of a tool that uses the described methods, and serves as an interactive exploratory data discovery tool for the MSN query database.
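A rough sketch of the Fourier-based matching idea (our simplification: it keeps the first k coefficients, whereas the paper's improvement keeps the highest-energy "best" coefficients and also uses the energy of the omitted components): by Parseval's theorem, a distance computed on a subset of DFT coefficients lower bounds the true Euclidean distance, so it can prune candidates without false dismissals.

```python
import numpy as np

# Compare time series on a few DFT coefficients (truncated-spectrum
# features). The synthetic series below stand in for daily query-volume
# sequences; two series with similar demand patterns end up close.

def fourier_features(x, k=4):
    return np.fft.rfft(np.asarray(x, dtype=float))[:k]

def feature_dist(a, b, k=4):
    return np.linalg.norm(fourier_features(a, k) - fourier_features(b, k))

t = np.arange(128)
daily = np.sin(2 * np.pi * t / 128)            # one demand pattern
shifted = np.sin(2 * np.pi * t / 128 + 0.1)    # similar pattern, small shift
other = np.sin(2 * np.pi * 3 * t / 128)        # different periodicity
```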

*Model-Based Clustering and Visualization of Navigation Patterns on a Web Site*

with I. Cadez, D. Heckerman, P. Smyth and S. White (Data Mining and Knowledge Discovery 2003)

We present a new methodology for exploring and analyzing navigation patterns on a web site. The patterns that can be analyzed consist of sequences of URL categories traversed by users. In our approach, we first partition site users into clusters such that users with similar navigation paths through the site are placed into the same cluster. Then, for each cluster, we display these paths for users within that cluster. The clustering approach we employ is model-based (as opposed to distance-based) and partitions users according to the order in which they request web pages. In particular, we cluster users by learning a mixture of first-order Markov models using the Expectation-Maximization algorithm. The runtime of our algorithm scales linearly with the number of clusters and with the size of the data; and our implementation easily handles hundreds of thousands of user sessions in memory. In the paper, we describe the details of our method and a visualization tool based on it called WebCANVAS. We illustrate the use of our approach on user-traffic data from msnbc.com.

[not available online][bibtex]
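The clustering step can be sketched compactly. Below is a minimal EM loop for a mixture of first-order Markov chains over a tiny made-up alphabet of URL categories (our illustration; the real system scales to hundreds of thousands of sessions).

```python
import numpy as np

# EM for a mixture of first-order Markov models. States are integer URL
# categories; sessions are lists of states. Data and sizes are made up.

def fit_markov_mixture(sessions, n_states, n_clusters, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    pi = np.full(n_clusters, 1.0 / n_clusters)              # cluster priors
    init = rng.dirichlet(np.ones(n_states), n_clusters)     # start-state probs
    trans = rng.dirichlet(np.ones(n_states), (n_clusters, n_states))
    for _ in range(n_iter):
        # E step: log responsibility of each cluster for each session
        logr = np.zeros((len(sessions), n_clusters))
        for s, seq in enumerate(sessions):
            for c in range(n_clusters):
                lp = np.log(pi[c]) + np.log(init[c, seq[0]])
                for a, b in zip(seq, seq[1:]):
                    lp += np.log(trans[c, a, b])
                logr[s, c] = lp
        logr -= logr.max(axis=1, keepdims=True)
        r = np.exp(logr)
        r /= r.sum(axis=1, keepdims=True)
        # M step: re-estimate priors and (smoothed) transition counts
        pi = r.mean(axis=0)
        init = np.ones((n_clusters, n_states))               # +1 smoothing
        trans = np.ones((n_clusters, n_states, n_states))
        for s, seq in enumerate(sessions):
            for c in range(n_clusters):
                init[c, seq[0]] += r[s, c]
                for a, b in zip(seq, seq[1:]):
                    trans[c, a, b] += r[s, c]
        init /= init.sum(axis=1, keepdims=True)
        trans /= trans.sum(axis=2, keepdims=True)
    return pi, init, trans, r

# Two obvious behaviors: alternating between categories 0/1 vs. staying on 2.
sessions = [[0, 1, 0, 1, 0, 1]] * 5 + [[2, 2, 2, 2]] * 5
pi, init, trans, resp = fit_markov_mixture(sessions, n_states=3, n_clusters=2)
```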

*Practically Perfect*

with Max Chickering (UAI-2003)

We provide a lower-bound on the probability of sampling a non-perfect distribution when using a fixed number of bits to represent the parameters of the Bayesian network. This bound approaches zero exponentially fast as one increases the number of bits used to represent the parameters. This result implies that perfect distributions with fixed-length representations exist. We also provide a lower-bound on the number of bits needed to guarantee that a distribution sampled from a uniform Dirichlet distribution is perfect with probability greater than $1/2$. This result is useful for constructing randomized reductions for hardness proofs.

[pdf][bibtex]

*Discriminative Model Selection for Density Models*

with Bo Thiesson (AI & Stats 2003)

Density models are a popular tool for building classifiers. When using density models to build a classifier, one typically learns a separate density model for each class of interest. These density models are then combined to make a classifier through the use of Bayes' rule utilizing the prior distribution over the classes. In this paper, we provide a discriminative method for choosing among alternative density models for each class to improve classification accuracy.

[ps][pdf][bibtex]
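The idea can be illustrated in a few lines (our own toy construction, not the paper's algorithm): fit candidate density models per class, then pick the per-class combination by held-out classification accuracy under Bayes' rule rather than by per-class goodness of fit. All data and candidate models below are made up.

```python
import itertools
import math
import random

# Discriminative selection among per-class density models (toy sketch).

def fit_gaussian(xs):
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs) or 1e-6
    return m, v

def log_density(x, model):
    m, v = model
    return -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)

def accuracy(models, priors, data):
    hits = 0
    for x, y in data:
        # Bayes' rule: pick the class maximizing prior * density
        scores = [math.log(p) + log_density(x, m) for p, m in zip(priors, models)]
        hits += int(scores.index(max(scores)) == y)
    return hits / len(data)

def select_models(candidates_per_class, priors, heldout):
    """Try every per-class combination; keep the most accurate classifier."""
    return max(itertools.product(*candidates_per_class),
               key=lambda combo: accuracy(combo, priors, heldout))

rng = random.Random(0)
train0 = [rng.gauss(0.0, 1.0) for _ in range(200)]
train1 = [rng.gauss(3.0, 1.0) for _ in range(200)]
heldout = [(rng.gauss(0.0, 1.0), 0) for _ in range(100)] + \
          [(rng.gauss(3.0, 1.0), 1) for _ in range(100)]
candidates = [[fit_gaussian(train0), (0.0, 100.0)],   # good fit vs. overly flat
              [fit_gaussian(train1), (3.0, 100.0)]]
best = select_models(candidates, [0.5, 0.5], heldout)
```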

*Large-Sample Learning of Bayesian Networks is Hard*

with Max Chickering and David Heckerman (UAI 2003)

In this paper, we provide new complexity results for algorithms that learn discrete-variable Bayesian networks from data. Our results apply whenever the learning algorithm uses a scoring criterion that favors the simplest model able to represent the generative distribution exactly. Our results therefore hold whenever the learning algorithm uses a consistent scoring criterion and is applied to a sufficiently large dataset. We show that identifying high-scoring structures is hard, even when we are given an independence oracle, an inference oracle, and/or an information oracle. Our negative results also apply to the learning of discrete-variable Bayesian networks in which each node has at most $k$ parents, for all $k \geq 3$.

[pdf][bibtex]

*Monotone DAG Faithfulness: A Bad Assumption*

with Max Chickering (submitted AI 12/2002)

In a recent paper, Cheng, Greiner, Kelly, Bell and Liu (Artificial Intelligence 137:43-90; 2002) describe an algorithm for learning Bayesian networks that---in a domain consisting of $n$ variables---identifies the optimal solution using $O(n^4)$ calls to a mutual-information oracle. This seemingly incredible result relies on (1) the standard assumption that the generative distribution is Markov and faithful to some directed acyclic graph (DAG), and (2) a new assumption about the generative distribution that the authors call {\em monotone DAG faithfulness (MDF)}. The MDF assumption rests on an intuitive connection between active paths in a Bayesian-network structure and the mutual information among variables. The assumption states that the (conditional) mutual information between a pair of variables is a monotonic function of the set of active paths between those variables; the more active paths between the variables, the higher the mutual information. In this paper, we demonstrate the unfortunate result that, for any realistic learning scenario, the monotone DAG faithfulness assumption is {\em incompatible} with the faithfulness assumption. In fact, by assuming both MDF and faithfulness, we restrict the class of possible Bayesian-network structures to one for which the optimal solution can be identified with $O(n^2)$ calls to an independence oracle.

[TechReport][bibtex]

*Factorization of discrete probability distributions*

with Dan Geiger and Bernd Sturmfels

We formulate necessary and sufficient conditions for an arbitrary discrete probability distribution to factor according to an undirected graphical model, or a log-linear model, or other more general exponential models. This result generalizes the well known Hammersley-Clifford Theorem.

[Not available on-line][bibtex]

*On the Toric Algebra of Graphical Models*

with Dan Geiger and Bernd Sturmfels

We formulate necessary and sufficient conditions for an arbitrary discrete probability distribution to factor according to an undirected graphical model, or a log-linear model, or other more general exponential models. This characterization generalizes the well-known Hammersley-Clifford Theorem. We show that for decomposable graphical models these conditions are equivalent to a set of statistical independence facts as in the Hammersley-Clifford Theorem but that for non-decomposable graphical models they are not. We also show that non-decomposable models can have non-rational maximum likelihood estimates. Finally, using these results, we provide a characterization of decomposable graphical models.

[postscript][bibtex]

*CFW: A Collaborative Filtering System Using Posteriors Over Weights of Evidence*

with Carl M. Kadie and David Heckerman

We describe CFW, a computationally efficient algorithm for

collaborative filtering that uses posteriors over weights of

evidence. In experiments on real data, we show that this method

predicts as well as or better than other methods in situations where

the size of the user query is small. The new approach works

particularly well when the user's query contains low frequency

(unpopular) items. The approach complements that of dependency

networks which perform well when the size of the query is large.

Also in this paper, we argue that the use of posteriors over

weights of evidence is a natural way to recommend similar

items---a task that is somewhat different from the usual

collaborative-filtering task.

[postscript][bibtex]
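As an illustration of the underlying idea, the log weight of evidence that an observed item contributes to a hypothesis can be computed directly from its conditional probabilities. The sketch below uses the textbook point-estimate definition and entirely hypothetical probabilities; CFW itself places posteriors over these weights, which is not reproduced here.

```python
import math

def weight_of_evidence(p_item_given_like, p_item_given_dislike):
    """Log weight of evidence an observed item contributes toward a
    hypothesis (e.g. 'the user will like the target item').  This is
    the standard point-estimate definition, not CFW's posterior."""
    return math.log(p_item_given_like / p_item_given_dislike)

# A small query: each observed item's weight is summed to update the
# prior log-odds for the target item (all numbers are illustrative).
prior_log_odds = math.log(0.2 / 0.8)
evidence = [(0.6, 0.3), (0.1, 0.05)]  # hypothetical conditional probabilities
posterior_log_odds = prior_log_odds + sum(
    weight_of_evidence(a, b) for a, b in evidence)
```

Summing weights in log-odds space is what makes the scheme cheap even when the query contains only a few, possibly unpopular, items.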

*The Road to Asymptopia* with David Maxwell Chickering

Identifying the presence or absence of *inducing paths* is

essential for any algorithm that uses an independence-based approach

to learning causal models from observational data generated from a

graphical model with hidden variables and possibly sampled with

selection bias. Conceptually, an inducing path is a sequence of

vertices in a directed acyclic graphical model (with hidden variables)

that guarantees that the endpoints of the path cannot be d-separated

given any subset of observed vertices in the graph. If there exists an

inducing path in the model from which the observed data is generated,

then we can deduce that the endpoints either (1) have a hidden common

cause, (2) one of the endpoints is a (possibly indirect) cause of the

other, or (3) both. Not surprisingly, the asymptotic correctness of

the independence-based approaches to learning causal models relies

heavily on being able to detect correctly the existence of inducing

paths; that is, they must judge correctly that the endpoints of the

path are dependent given a subset of the observed variables. In this

paper, we investigate inducing paths from an information-theoretic

perspective and examine experimentally the number of samples required

to reliably detect inducing paths of various types.

[Not available on-line] [bibtex]

*Finding Optimal Bayesian Networks* with David Maxwell Chickering

In this paper, we derive optimality results for greedy

Bayesian-network search algorithms that perform single edge

modifications at each step and use asymptotically consistent scoring

criteria. Our results extend those of Meek (1997) and Chickering

(2002), who demonstrate that in the limit of large datasets, such

search algorithms identify the generative (i.e. optimal) model if that

model is DAG perfect among the observable variables. We instead

guarantee a weaker form of optimality whenever the generative

distribution satisfies the *composition property* over the

observable variables, which is a more realistic assumption for real

domains. In addition, we show that the composition property is

guaranteed to hold whenever the dependence relationships in the

generative distribution can be characterized by paths between

singleton elements in some (generative) graphical model even when we

allow the generative model to include unobserved variables, and the

observed data to be subject to selection bias. The generalization

encompasses any situation where the generative distribution is perfect

with respect to (1) a DAG, (2) a chain graph, or (3) a Markov network.

[pdf] [bibtex]

*Staged Mixture Modeling and Boosting* with Bo Thiesson and David
Heckerman

In this paper, we introduce and evaluate a data-driven staged mixture

modeling technique for building density, regression, and

classification models. Our basic approach is to sequentially add

components to a finite mixture model using the structural expectation

maximization (SEM) algorithm. We show that our technique is

qualitatively similar to boosting. This correspondence is a natural

byproduct of the fact that we use the SEM algorithm to

sequentially fit the mixture model. Finally, in our experimental

evaluation, we demonstrate the effectiveness of our approach on a

variety of prediction and density estimation tasks using real-world

data.

[postscript][bibtex]

*Autoregressive tree models for time-series analysis* with David
Maxwell Chickering and David Heckerman

In this paper, we describe models for continuous-valued

time-series data that are useful for data mining in that they (1)

can be learned efficiently from data, (2) support accurate

predictions, and (3) are easy to interpret. We describe a class

of models called *autoregressive tree (ART) models* that are a

generalization of linear *autoregressive (AR)* models that

allow one to model non-linear time-series data by piece-wise

linear autoregressive models. We accomplish this generalization

by using decision trees with linear regressions at the leaves to

predict successive observations in the time series. Also, we

describe a Bayesian approach for learning AR and ART models from

data. Finally, we compare several alternative models and learning

methods using 2,494 data sets.

[postscript][bibtex]
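The piecewise-linear idea behind ART models can be sketched with a single hypothetical split on the previous observation, each leaf holding its own order-1 linear autoregression. The threshold and coefficients below are illustrative stand-ins, not learned values.

```python
# Minimal sketch of an autoregressive tree of order 1: a decision-tree
# split on the previous value routes each time step to its own linear
# autoregressive predictor.  All parameters here are made up.

def art_predict(prev, split=0.0,
                left=(0.1, 0.9),    # (intercept, AR coefficient) below split
                right=(-0.2, 0.5)): # (intercept, AR coefficient) at/above split
    """Predict the next observation from the previous one with a
    two-leaf autoregressive tree."""
    intercept, coef = left if prev < split else right
    return intercept + coef * prev

series = [1.0, -0.5, 0.3]
predictions = [art_predict(x) for x in series]
```

A learned ART model would fit the tree structure, split points, and leaf regressions from data via the Bayesian approach the paper describes.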

Published in 2001

*The Learning-Curve Sampling Method Applied to Model-Based
Clustering* with Bo Thiesson and David Heckerman

We examine the learning-curve sampling method, an approach for

applying machine-learning algorithms to large data sets. The

approach is based on the observation that the computational cost

of learning a model increases as a function of the sample size of

the training data, whereas the accuracy of a model has

diminishing improvements as a function of sample size. Thus, the

learning-curve sampling method monitors the increasing costs and

performance as larger and larger amounts of data are used for

training, and terminates learning when future costs outweigh

future benefits. In this paper, we formalize the learning-curve

sampling method and its associated cost-benefit tradeoff in terms

of decision theory. In addition, we describe the application of

the learning-curve sampling method to the task of model-based

clustering via the expectation-maximization (EM) algorithm. In

experiments on three real data sets, we show that the

learning-curve sampling method produces models that are nearly as

accurate as those trained on complete data sets, but with

dramatically reduced learning times. Finally, we describe an

extension of the basic learning-curve approach for model-based

clustering that results in an additional speedup. This extension is

based on the observation that the shape of the learning curve for

a given model and data set is roughly independent of the number

of EM iterations used during training. Thus, we run EM for only a

few iterations to decide how many cases to use for training, and

then run EM to full convergence once the number of cases is

selected.

In JMLR [postscript][pdf][bibtex] also in
AI&Statistics [bibtex]
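The cost-benefit stopping rule can be sketched as follows; the conversion factor from accuracy to value is a hypothetical decision-theoretic utility, not a quantity from the paper's experiments.

```python
def learning_curve_stop(sample_sizes, accuracies, costs, value_per_accuracy=100.0):
    """Sketch of the learning-curve stopping rule: keep enlarging the
    training sample while the expected gain in accuracy (converted to
    value via a hypothetical utility) exceeds the added training cost."""
    for i in range(1, len(sample_sizes)):
        benefit = (accuracies[i] - accuracies[i - 1]) * value_per_accuracy
        extra_cost = costs[i] - costs[i - 1]
        if benefit < extra_cost:
            return sample_sizes[i - 1]  # stop before the unprofitable step
    return sample_sizes[-1]

# Diminishing accuracy improvements against linearly growing cost:
n = learning_curve_stop([1000, 2000, 4000, 8000],
                        [0.70, 0.80, 0.84, 0.85],
                        [1.0, 2.0, 4.0, 8.0])
```

With these illustrative curves the rule stops at 4000 cases, since the step to 8000 buys less accuracy than it costs.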

*Accelerating EM for large databases* with Bo Thiesson and David
Heckerman

The EM algorithm is a popular method for parameter estimation in a

variety of problems involving missing data. However, the EM algorithm

often requires significant computational resources and has been

dismissed as impractical for large databases. We present two

approaches that significantly reduce the computational cost of

applying the EM algorithm to databases with a large number of cases,

including databases with large dimensionality. Both approaches are

based on partial E-steps for which we can use the results of Neal and

Hinton (1998) to obtain the standard convergence guarantees

of EM. The first approach is a version of the incremental EM algorithm,

described in Neal and Hinton (1998), which cycles through data cases

in blocks. The number of cases in each block dramatically affects the

efficiency of the algorithm. We provide a method for selecting a

near-optimal block size. The second approach, which we call lazy EM, will,

at scheduled iterations, evaluate the significance of each data case

and then proceed for several iterations actively using only the

significant cases. We demonstrate that both methods can significantly

reduce computational costs through their application to

high-dimensional real-world and synthetic mixture modeling problems

for large databases.

[postscript][pdf][bibtex]
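A minimal sketch of the block-incremental variant, following the partial E-step scheme of Neal and Hinton (1998), for a two-component unit-variance Gaussian mixture with equal weights. The model and data are toy assumptions; the paper's block-size selection and lazy-EM machinery are not reproduced.

```python
import math

def incremental_em(data, block_size, iters=20):
    """Block-incremental EM sketch: each partial E-step refreshes
    responsibilities for one block of cases only, while the M-step
    reuses cached responsibilities for all other blocks."""
    mu = [min(data), max(data)]                  # crude initialization
    resp = [[0.5, 0.5] for _ in data]            # cached responsibilities
    blocks = [range(s, min(s + block_size, len(data)))
              for s in range(0, len(data), block_size)]
    for it in range(iters):
        block = blocks[it % len(blocks)]
        for i in block:                          # partial E-step on one block
            w = [math.exp(-0.5 * (data[i] - m) ** 2) for m in mu]
            z = w[0] + w[1]
            resp[i] = [w[0] / z, w[1] / z]
        for k in range(2):                       # M-step from all cached stats
            num = sum(resp[i][k] * data[i] for i in range(len(data)))
            den = sum(resp[i][k] for i in range(len(data)))
            mu[k] = num / den
    return sorted(mu)

data = [-2.1, -1.9, -2.0, 1.9, 2.1, 2.0]
means = incremental_em(data, block_size=2)
```

Because each iteration touches only one block, the per-iteration cost is a fraction of a full E-step while the Neal-Hinton argument preserves the usual convergence guarantees.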

*Efficient Determination of Dynamic Split Points in a Decision
Tree* with David Maxwell Chickering and Robert Rounthwaite

We consider the problem of choosing split points for continuous

predictor variables in a decision tree. Previous approaches to this

problem typically either (1) discretize the continuous predictor

values prior to learning or (2) apply a dynamic method that considers

all possible split points for each potential split. In this paper, we

describe a number of alternative approaches that generate a small

number of candidate split points dynamically with little overhead. We

argue that these approaches are preferable to pre-discretization, and

provide experimental evidence that they yield probabilistic decision

trees with the same prediction accuracy as the traditional dynamic

approach. Furthermore, because the time to grow a decision tree is

proportional to the number of split points evaluated, our approach is

significantly faster than the traditional dynamic approach.

Available by request [bibtex]
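The contrast between the traditional dynamic approach and a cheap candidate generator can be sketched as follows; the quantile scheme is just an illustrative stand-in for the several generators the paper evaluates.

```python
def all_split_points(values):
    """Traditional dynamic approach: every midpoint between adjacent
    distinct sorted values is a candidate split."""
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

def quantile_split_points(values, k=4):
    """A cheap alternative in the spirit of the paper: generate only a
    handful of candidates at quantiles of the values reaching the node
    (hypothetical scheme, for illustration only)."""
    v = sorted(values)
    return sorted({v[int(i * (len(v) - 1) / k)] for i in range(1, k)})

data = [0.1, 0.4, 0.4, 0.7, 1.2, 3.5, 3.6, 4.0]
```

Since tree-growing time is proportional to the number of splits evaluated, shrinking the candidate set from every midpoint to a few quantiles is where the speedup comes from.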

*Stratified exponential families: graphical models and model
selection* with Dan Geiger, David Heckerman and Henry King

We describe a hierarchy of exponential families which is useful

for distinguishing types of graphical models. Undirected

graphical models with no hidden variables are linear exponential

families (LEFs); directed acyclic graphical (DAG) models and

chain graphs with no hidden variables, including DAG models with

several families of local distributions, are curved exponential

families (CEFs); and graphical models with hidden variables are,

what we term, stratified exponential families (SEFs). An SEF is a

finite union of CEFs of various dimensions satisfying some

regularity conditions. We also show that this hierarchy of

exponential families is non-collapsing with respect to graphical

models by providing a graphical model which is a CEF but not an

LEF and a graphical model that is a SEF but not a CEF. Finally,

we show how to compute the dimension of a stratified exponential

family. These results are discussed in the context of model

selection of graphical models.

[postscript][pdf][bibtex]

*Finding a Path is Harder than Finding a Tree*

I consider the problem of learning an optimal path graphical model

from data and show the problem to be NP-hard for the maximum

likelihood and minimum description length approaches and a Bayesian

approach. This hardness result holds despite the fact that the problem

is a restriction of the polynomially solvable problem of finding the

optimal tree graphical model.

In JAIR
[postscript][pdf][bibtex]

*Using Temporal Data for Making Recommendations* with Andrew Zimdars and David Maxwell Chickering

We treat collaborative filtering as a univariate time series

estimation problem: given a user's previous votes, predict the next

vote. We describe two families of methods for transforming data to

encode time order in ways amenable to off-the-shelf classification and

density estimation tools, and examine the results of using these

approaches on several real-world data sets. The improvements in

predictive accuracy we realize recommend the use of other predictive

algorithms that exploit the temporal order of data.

[postscript][bibtex]

Published in 2000

*Challenges of the email domain for text classification* with Jake Brutlag

Interactive classification of email into a user-defined hierarchy of

folders is a natural domain for application of text classification

methods. This domain presents several challenges. First, the user's

changing mail-filing habits mandate that classification technology adapt to

a dynamic environment. Second, the classification technology needs to

be able to handle heterogeneity in folder content and folder size.

Performance when there are only a small number of messages in a folder

is especially important. Third, methods must meet the processing and

memory requirements of a software implementation. We study three

promising methods and present an analysis of their behavior with

respect to these domain-specific challenges.

[postscript][bibtex]

*Perfect Tree-like Markovian Distributions* with Ann Becker and
Dan Geiger

Abstract Not available

Not available electronically [bibtex]

*Global Partial Orders from Sequential Data* with Heikki Mannila

Sequences of events arise in many applications, such as web browsing,

e-commerce, and monitoring of processes. An important problem in

mining sets of sequences of events is to get an overview of the

ordering relationships in the data. We present a method for finding

partial orders that describe the ordering relationships between the

events in a collection of sequences. The method is based on viewing a

partial order as a generative model for a set of sequences, and

applying mixture modeling techniques to obtain a descriptive set of

partial orders. Runtimes for our algorithm scale linearly in the

number of sequences and polynomially in the number of different event

types. Thus, the method scales to handle large data sets and can be

used for reasonable numbers of different types of events. We

illustrate our technique by applying it to student enrollment data and

web browsing data.

[postscript][pdf][bibtex]

*Goal-oriented clustering* with David Maxwell Chickering, David
Heckerman, John C. Platt and Bo Thiesson

Abstract not available

[postscript] [bibtex]

*Visualization of navigation patterns on a web site
using model based clustering* with I. Cadez, D. Heckerman, P. Smyth and S. White

We present a new methodology for exploring and analyzing

navigation patterns on a web site. The patterns that can be analyzed

consist of sequences of URL categories traversed by users. In our

approach, we first partition site users into clusters such that users

with similar navigation paths through the site are placed into the

same cluster. Then, for each cluster, we display these paths for users

within that cluster. The clustering approach we employ is model-based

(as opposed to distance-based) and partitions users according to the

order in which they request web pages. In particular, we cluster

users by learning a mixture of first-order Markov models using the

Expectation-Maximization algorithm. The runtime of our algorithm

scales linearly with the number of clusters and with the size of the

data; and our implementation easily handles hundreds of thousands of

user sessions in memory. In the paper, we describe the details of our

method and a visualization tool based on it called WebCANVAS. We

illustrate the use of our approach on user-traffic data from

msnbc.com.

[postscript][bibtex] also [bibtex]
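The cluster-assignment step can be sketched with hand-written first-order Markov models over two URL categories; WebCANVAS learns such models with EM, and the two "clusters" below are purely hypothetical.

```python
import math

def session_log_likelihood(session, start_probs, trans_probs):
    """Log-likelihood of a navigation session (a list of URL-category
    ids) under a first-order Markov model."""
    ll = math.log(start_probs[session[0]])
    for prev, cur in zip(session, session[1:]):
        ll += math.log(trans_probs[prev][cur])
    return ll

def assign_cluster(session, clusters):
    """Assign a session to the cluster whose Markov model gives it the
    highest likelihood (the hard assignment used for display)."""
    scores = [session_log_likelihood(session, c["start"], c["trans"])
              for c in clusters]
    return scores.index(max(scores))

# Two illustrative clusters over categories 0 and 1:
news_reader = {"start": [0.9, 0.1], "trans": [[0.8, 0.2], [0.5, 0.5]]}
sports_reader = {"start": [0.1, 0.9], "trans": [[0.5, 0.5], [0.2, 0.8]]}
cluster = assign_cluster([1, 1, 1, 0], [news_reader, sports_reader])
```

Because only transition counts matter, evaluating a session is linear in its length, which is why the full method scales to hundreds of thousands of sessions.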

*Dependency networks for inference, collaborative filtering, and
data visualization* with D. Heckerman, D.M. Chickering, R. Rounthwaite
and C. Kadie

We describe a graphical model for probabilistic

relationships---an alternative to the Bayesian network---called a

dependency network. The graph of a dependency network, unlike a

Bayesian network, is potentially cyclic. The probability component of

a dependency network, like a Bayesian network, is a set of conditional

distributions, one for each node given its parents. We identify

several basic properties of this representation and describe a

computationally efficient procedure for learning the graph and

probability components from data. We describe the application of this

representation to probabilistic inference, collaborative filtering

(the task of predicting preferences), and the visualization of acausal

predictive relationships.

[pdf][bibtex] also
in UAI98 [bibtex]

Published in 1999

*Singling out ill-fit items in a classification. Application to
the
taxonomy of Enterobacteriaceae* with Helge G. Gyllenberg, Mats Gyllenberg,
Timo Koski, Tatu Lund and Heikki Mannila

Abstract not available

Not available electronically [bibtex]

Abstract not available

See paper on stratified exponential families. [bibtex]

Recent improvements on Tarski's procedure for quantifier elimination

in the first-order theory of real numbers make it feasible to solve

small instances of the following problems completely automatically:

1. Listing all equality and inequality constraints implied by a

graphical model with hidden variables. 2. Comparing graphical models

with hidden variables (i.e., model equivalence, inclusion, and

overlap). 3. Answering questions about the identification of a model

or portion of a model, and about bounds on quantities derived from a

model. 4. Determining whether an independence assertion is implied

from a given set of independence assertions. We discuss the

foundations of quantifier elimination and demonstrate its application

to these problems.

[postscript][bibtex]

Abstract not available

Not available electronically [bibtex] also see UAI95 [bibtex]

*Prediction and experimental design with Graphical causal models*
with P. Spirtes, C. Glymour, R. Scheines, S. Fienberg and E. Slate

Abstract not available

Not available electronically [bibtex]

*A Bayesian approach to causal discovery* with David Heckerman
and Greg Cooper

We consider the Bayesian approach to the discovery of

directed acyclic causal models. The Bayesian approach is

related to the constraint-based approach, in that both methods rely on

the Causal Markov Assumption. Nonetheless, an important difference

between the two approaches is that, whereas the constraint-based

approach uses categorical information about conditional-independence

constraints in the domain, the Bayesian approach weighs the degree to

which such constraints hold. As a result, the Bayesian approach has

three distinct advantages over its constraint-based counterpart. One,

conclusions derived from the Bayesian approach are not susceptible to

incorrect categorical decisions about independence facts that can

occur with data sets of finite size. Two, using the Bayesian

approach, finer distinctions among model structures---both

quantitative and qualitative---can be made. Three, information from

several models can be combined to make better inferences and to better

account for modeling uncertainty. In addition to describing the

general Bayesian approach to causal discovery, we review approximation

methods for missing data and hidden variables, and illustrate

differences between the Bayesian and constraint-based methods using

artificial and real examples.

[postscript][bibtex]

*Computationally efficient methods for selecting among mixtures of
graphical models, with discussion* with B. Thiesson, D. Chickering and D.
Heckerman

We describe computationally efficient methods for Bayesian

model selection. The methods select among mixtures in which each

mixture component is a directed acyclic graphical model and can be

applied to incomplete data sets. The model-selection criterion that

we consider is the posterior probability of the model (structure)

given data. Our model-selection problem is difficult because (1) the

number of possible model structures grows super-exponentially with the

number of random variables and (2) missing data necessitates the use

of computationally slow approximations of model posterior probability.

We argue that simple search-and-score algorithms are infeasible for a

variety of problems, and introduce a feasible approach in which

parameter and structure search is interleaved and expected data is

treated as real data. Our approach can be viewed as the combination

of (1) a modified Cheeseman--Stutz asymptotic approximation for model

posterior probability and (2) the Expectation--Maximization algorithm.

We evaluate our procedure for selecting among mixtures of directed

acyclic graphical models on synthetic and real examples.

Not available electronically (Bayesian Statistics 6) [bibtex]
also see [bibtex]

*Graphical Models and Exponential Families* with Dan Geiger

We provide a classification of graphical models according to their

representation as subfamilies of exponential families. Undirected

graphical models with no hidden variables are linear exponential

families (LEFs); directed acyclic graphical models and chain graphs

with no hidden variables, including Bayesian networks with several

families of local distributions, are curved exponential families

(CEFs); and graphical models with hidden variables are stratified

exponential families (SEFs). An SEF is a finite union of CEFs

satisfying a {\em frontier\/} condition. In addition, we illustrate how

one can automatically generate independence and non-independence

constraints on the distributions over the observable variables implied

by a Bayesian network with hidden variables. The relevance of these

results for model selection is examined.

[postscript][bibtex]

*Learning mixtures of DAG models* with B. Thiesson, D.
Chickering and D. Heckerman

We describe computationally efficient methods for Bayesian

model selection. The methods select among mixtures in which each

mixture component is a directed acyclic graphical model and can be

applied to incomplete data sets. The model-selection criterion that

we consider is the posterior probability of the model (structure)

given data. Our model-selection problem is difficult because (1) the

number of possible model structures grows super-exponentially with the

number of random variables and (2) missing data necessitates the use

of computationally slow approximations of model posterior probability.

We argue that simple search-and-score algorithms are infeasible for a

variety of problems, and introduce a feasible approach in which

parameter and structure search is interleaved and expected data is

treated as real data. Our approach can be viewed as the combination

of (1) a modified Cheeseman--Stutz asymptotic approximation for model

posterior probability and (2) the Expectation--Maximization algorithm.

We evaluate our procedure for selecting among mixtures of directed

acyclic graphical models on synthetic and real examples.

[postscript][bibtex]
also in UAI98 [bibtex]

*Using path diagrams as a structural equation modeling tool* with
Peter Spirtes, Thomas Richardson, Richard Scheines and Clark Glymour

Abstract not available

Not available electronically [bibtex]

*Asymptotic model selection for directed networks with hidden
variables* with D. Geiger and D. Heckerman

We extend the Bayesian Information Criterion (BIC), an asymptotic

approximation for the marginal likelihood, to Bayesian networks with

hidden variables. This approximation can be used to select models

given large samples of data. The standard BIC as well as our

extension punishes the complexity of a model according to the

dimension of its parameters. We argue that the dimension of a Bayesian

network with hidden variables is the rank of the Jacobian matrix of

the transformation between the parameters of the network and the

parameters of the observable variables. We compute the dimensions of

several networks including the naive Bayes model with a hidden root

node.

[postscript][bibtex] also in
UAI96 [bibtex] and Erice volume [bibtex]

*Graphical models: selecting causal and statistical models*

Abstract not available

Not available [bibtex]

*An evaluation of machine-learning methods for predicting pneumonia mortality
* with Cooper et al. (Artificial Intelligence in Medicine)

This paper describes the application of eight statistical and machine-learning
methods to derive

computer models for predicting mortality of hospital patients with pneumonia
from their findings

at initial presentation. The eight models were each constructed based on 9847
patient cases and

they were each evaluated on 4352 additional cases. The primary evaluation metric
was the error in

predicted survival as a function of the fraction of patients predicted to
survive. This metric is useful

in assessing a model's potential to assist a clinician in deciding whether to
treat a given patient in

the hospital or at home. We examined the error rates of the models when
predicting that a given

fraction of patients will survive. We examined survival fractions between 0.1
and 0.6. Over this

range, each model's predictive error rate was within 1% of that of every other
model. When predicting that approximately

30% of the patients will survive, all the models have an error rate of less than
1.5%. The models are

distinguished more by the number of variables and parameters that they contain
than by their error

rates; these differences suggest which models may be the most amenable to future
implementation

as paper-based guidelines.

*Embedded Bayesian Network Classifiers* with David Heckerman

We consider learning methods and probability models for the task of

classification: the determination of a conditional probability

distribution for a finite-state *outcome* or *class* variable

*Y* given a set of *explanatory* or *input* variables ** X**.

In the first part of the paper, we consider methods for learning

traditional Bayesian networks for classification. Using the full

Bayesian approach, we determine the conditional distribution of

interest by averaging over all possible model structures and their

parameters, given the data we have seen. When this approach is

intractable, we sometimes identify a single "good" model structure

and use it as if it were the correct one---a process known as

model selection. We consider two model-selection criteria, described by Spiegelhalter et al. (1993) and Buntine

(1993), and show how both can be viewed within the prequential

framework of Dawid (1984). We review methods for computing the two

criteria, and provide sufficient conditions under which the two

criteria coincide. In the second part of the paper, we describe a

parameterized model for the probability distribution of a node

given its parents, which we call an *embedded Bayesian network

classifier* (EBNC), obtained from a (usually different) Bayesian

network. Like the noisy OR and log-linear models, the EBNC is a parsimonious (i.e.,

low-dimension) model for encoding the interaction between a variable

and its parents. We show that an EBNC is a special case of a softmax

polynomial regression model. Also, we show how to identify a

minimum-size parameterization of an EBNC, and describe an efficient

approximation for learning the structure of Bayesian networks that

contain EBNCs.

[postscript][bibtex]

*Models and Selection Criteria for Regression and Classification* with
David Heckerman

Abstract not available

[postscript] also in UAI97 [bibtex]

*A Bayesian approach to learning Bayesian networks with
local structure* with D. Chickering and D. Heckerman

Recently several researchers have investigated techniques for using data

to learn Bayesian networks containing compact representations for

the conditional probability distributions (CPDs) stored at each

node. The majority of this work has concentrated on using

decision-tree representations for the CPDs. In addition, researchers

typically apply non-Bayesian (or asymptotically Bayesian) scoring

functions such as MDL to evaluate the goodness-of-fit of networks

to the data.

In this paper we investigate a Bayesian approach to learning Bayesian

networks that contain the more general {\em decision-graph}

representations of the CPDs. First, we describe how to evaluate the

posterior probability---that is, the Bayesian score---of such a

network, given a database of observed cases. Second, we describe

various search spaces that can be used, in conjunction with a scoring

function and a search procedure, to identify one or more high-scoring

networks. Finally, we present an experimental evaluation of

the search spaces, using a greedy algorithm and a Bayesian

scoring function.

[postscript] also in UAI97 [bibtex]

*Structure and Parameter Learning for Causal Independence and Causal Interaction Models* with David Heckerman

We begin by discussing causal independence models and generalize these

models to causal interaction models. Causal interaction models are

models that have independent mechanisms where mechanisms can have

several causes. In addition to introducing several particular types of

causal interaction models, we show how we can apply

the Bayesian approach to learning causal interaction models obtaining

approximate posterior distributions for the models and obtain MAP and

ML estimates for the parameters. We illustrate the approach with a

simulation study of learning model posteriors.

Available by request [bibtex]

*Heuristic greedy search algorithms for latent variable models*
with P. Spirtes and T. Richardson

Abstract not available

Not available electronically [bibtex]

*The dimensionality of mixed ancestral graphs* with P. Spirtes and
T. Richardson

Abstract not available

Not available electronically [bibtex]

*Asymptotic model selection for directed networks with hidden
variables* with D. Geiger and D. Heckerman

We extend the Bayesian Information Criterion (BIC), an asymptotic

approximation for the marginal likelihood, to Bayesian networks with

hidden variables. This approximation can be used to select models

given large samples of data. The standard BIC as well as our

extension punishes the complexity of a model according to the

dimension of its parameters. We argue that the dimension of a Bayesian

network with hidden variables is the rank of the Jacobian matrix of

the transformation between the parameters of the network and the

parameters of the observable variables. We compute the dimensions of

several networks including the naive Bayes model with a hidden root

node.

[postscript][bibtex] also in
UAI96 [bibtex] and Erice volume [bibtex]

*Using d-separation to calculate zero partial correlations in
linear models with correlated errors* with P. Spirtes, T. Richardson, R.
Scheines and C. Glymour

Abstract not available

Not available [bibtex]

*Learning Bayesian networks with discrete variables from data*
with Peter Spirtes

This paper describes a new greedy Bayesian search algorithm

(GBPS) for Bayesian networks and a new algorithm that combines the

scoring-based GBPS with the independence-based PC algorithm.

Simulation tests of these algorithms with previously published

algorithms are presented.

[postscript][bibtex]

*Strong completeness and faithfulness in Bayesian networks*

A completeness result for d-separation applied to discrete Bayesian

networks is presented and it is shown that in a strong

measure-theoretic sense almost all discrete distributions for a given

network structure are faithful; i.e.\ the independence facts true of

the distribution are all and only those entailed by the network

structure.

[postscript] [bibtex]

*Causal inference and causal explanation with background knowledge*

This paper presents correct algorithms for answering the

following two questions: (i) Does there exist a causal explanation consistent
with

a set of background knowledge which explains all of the observed

independence facts in a sample? (ii) Given that there is such a causal

explanation what are the causal relationships common to every such

causal explanation?

[postscript][bibtex]

*Causal inference in the presence of latent variables and selection bias*
with P. Spirtes and T. Richardson

Abstract not available

Not available electronically [bibtex]

*Tetrad {II}: User's Manual* with Richard Scheines, Peter Spirtes and
Clark Glymour

Our approach is to axiomatize the connection between causal structure

and statistical independence and conditional independence, and then

formulate algorithms that take as input statistical data and

background knowledge and output classes of causal models that explain

the data. The algorithms make statistical decisions about

independence and conditional independence, and then apply graph

theoretic search strategies to arrive at the appropriate class of

causal models. We have developed a number of algorithms for different

contexts of background knowledge, for example, time order, knowledge

about feedback, knowledge about the presence or absence of omitted

variables, and partial knowledge of causal relations that do or do not

exist among the variables under study.

Not available electronically. [bibtex]

*Conditioning and Intervening* with Clark Glymour

We consider the dispute between causal decision theorists and

evidential decision theorists over Newcomb-like problems. We introduce

a framework relating causation and directed graphs developed by

Spirtes et al. (1993) and evaluate several arguments in this

context. We argue that much of the debate between the two camps is

misplaced; the disputes turn on the distinction between conditioning

on an event E as against conditioning on an event I which is an action

to bring about E. We give the essential machinery for calculating the

effect of an intervention and consider recent work which extends the

basic account given here to the case where causal knowledge is

incomplete.

Not available electronically. [bibtex]