SEER: MAXIMUM LIKELIHOOD REGRESSION
FOR LEARNINGSPEED CURVES
BY
CARL MYERS KADIE
B.S., University of Illinois, 1985
M.S., University of Illinois, 1989
THESIS
Submitted in partial fulfillment of the
requirements
for the degree of Doctor of Philosophy
in Computer Science
in the Graduate College of the
University of Illinois at UrbanaChampaign, 1995
Urbana, Illinois
Seer: Maximum Likelihood Regression
for LearningSpeed Curves
Carl
Myers Kadie
Department of Computer Science
University of Illinois at UrbanaChampaign, 1995
David C. Wilkins, Advisor
The research presented here focuses on modeling machinelearning performance. The thesis introduces Seer, a system that generates empirical observations of classificationlearning performance and then uses those observations to create statistical models. The models can be used to predict the number of training examples needed to achieve a desired level and the maximum accuracy possible given an unlimited number of training examples. Seer advances the state of the art with 1) models that embody the best constraints for classification learning and most useful parameters, 2) algorithms that efficiently find maximumlikelihood models, and 3) a demonstration on realworld data from three domains of a practicable application of such modeling.
The first part of the thesis gives an overview of the requirements for a good maximumlikelihood model of classificationlearning performance. Next, reasonable design choices for such models are explored. Selection among such models is a task of nonlinear programming, but by exploiting appropriate problem constraints, the task is reduced to a nonlinear regression task that can be solved with an efficient iterative algorithm. The latter part of the thesis describes almost 100 experiments in the domains of soybean disease, heart disease, and audiological problems. The tests show that Seer is excellent at characterizing learningperformance and that it seems to be as good as possible at predicting learning performance. Finally, recommendations for choosing a regression model for a particular situation are made and directions for further research are identified.
Acknowledgments
Thanks to Nanci, my wife, for encouragement, support, patience, and love. Thanks to Benjamin, my son, for an extra boost of motivation. Thanks, also, to family and friends for their encouragement.
Thanks to my thesis advisor David C. Wilkins for his guidance. Thanks to James Edstrom, Ziad Najem, and Gunner Blix for reviewing drafts of this thesis. Thanks also to my thesis committee.
This research was conducted at the Beckman Institute for Advanced Science and Technology at the University of Illinois at UrbanaChampaign. Support was provided by the Fannie and John Hertz Foundation. This research was also supported in part by AFOSR grant F4926092J0545, ONR grant N0001488K0124, and ONR grant N000149410432.
Table of Contents
CHAPTER
1. Introduction...............................................................................................................
1.1. Inductive Classification Learning........................................................................
1.2. Regression on Inductive Learning.......................................................................
1.3. Overview of Thesis...........................................................................................
2. Related Work............................................................................................................
2.1. Theoretical Approaches: Computational Learning Theory....................................
2.2. Empirical Approaches......................................................................................
2.3. Effect of Skew and Multiple Classes on Learning Performance..........................
2.4. Summary........................................................................................................
3. Overview of LearningPerformance Models..............................................................
3.1. GoodFitting Models of LearningPerformance..................................................
3.2. Generalized CrossValidation...........................................................................
3.3. Conclusion......................................................................................................
4. Candidate Models of Learning Performance: Design and Selection Method.................
4.1. Candidate Deterministic Models.......................................................................
4.2. Modeling the Effect of Multiple Classes, Skewed Classes, and Noise..................
4.3. Nondeterministic Design Choices.....................................................................
4.4. Fitting Models to Data Efficiently.....................................................................
4.5. Summary........................................................................................................
5. Experimental Procedure and Results.........................................................................
5.1. Experimental Procedure...................................................................................
5.2. Experimental Results.......................................................................................
5.3. Summary and Conclusion.................................................................................
6. Conclusion...............................................................................................................
6.1. Summary........................................................................................................
6.2. Future Work...................................................................................................
6.3. Contributions...................................................................................................
References.................................................................................................................. 93
Vita............................................................................................................................ 97
Inductive learning methods now routinely create the initial knowledge base of classification expert systems from a library of classified examples [Gervarter, 1987]. Researchers have developed effective inductive learning methods for a variety of classification expert system representations. These include:
· the back propagation method for artificial neural networks [Hinton, 1989], the ID3 and PLS algorithms for decision trees [Quinlan, 1986; Rendell, 1986],
· the AQ15 and C4.5 methods for production rules [Michalski, 1986; Quinlan, 1992],
· parameter adjustment for Perceptrons [Minsky and Papert, 1988],
· the genetic algorithm for Holland classifiers [Booker, 1989],
· the Protos method for casebased systems [Porter, 1990], and
· algorithms for belief networks [Cooper, 1992].
For the expert system developer using these inductive approaches, the major effort in terms of time and cost is usually the creation of the library of classified cases. In domains such as equipment and medical diagnosis the data collection and encoding of a single case can take days of effort and considerable cost. Regression on learning performance as a tool in the creation of classified examples offers these benefits: regression models can tell whether the induction program can achieve a required level of classification accuracy in a particular domain. This is important because the classification level achievable by induction varies greatly from problem domain to problem domain. Shavlik et al. [1991] and Weiss [1991] reported variation ranging from 55% to 100%. If a developer requires a level of classification accuracy for a particular domain and this level is achievable by induction, regression models can also specify the approximate number of classified cases that the developer must collect to achieve the desired accuracy.
This thesis presents Seer, a system that generates observations of inductiveclassificationlearning performance and then uses those observations to create models of learning performance. Compared to earlier systems, Seer offers better and more useful models of learning performance, the ability to efficiently find the maximumlikelihood model, and validation with a practical inductive learning system (C4.5) on the realworld diagnosis domains of soybean disease, heart disease, and audiological proems. In a broader sense, regression is among the most important tools in experimental science. It allows fields as diverse as biometrics and econometrics make strong scientific conclusions based on the results of empirical observation. Seer shows how regression can be a practical analysis tool for the field of empirical machine learning.
The rest of this chapter offers a brief overview of the important ideas of the thesis. It defines inductive classification learning task and the task of regression on learningperformance data. It then lists the goals of this research. Finally, it previews the other chapters of this thesis.
Inductive classification learning is the type of learning that Seer models and about which it makes predictions. An instance of (inductive classification) learning has 6 components:
· An example space, E, of examples, e_{1}, e_{2}, e_{3}, ... described in terms of attributes. For example, here is an E and an e_{1} (from a madeup domain):
_{}
_{}
· A class space, C. For example:
_{}
· A target function , _{}, drawn from a space of possible targets
If 2.3 < brightness < 4.0 and
color
in {red, blue}
then if symmetric then dangerous
else normal
else normal
· A set of classified training examples in E´C drawn independently according to some fixed by unknown probability distribution D
_{}
· A classification rule (or hypothesis), _{}produced by a learning program (e.g. C4.5) from the training examples
If color in {blue}
then if brightness < 5.0 then dangerous
else
normal
else normal
· A set of labeled testing examples drawn according to D
The accuracy of a classification rule is the fraction of testing examples it classifies the same as the target.
The problem addressed by Seer is modeling and predicting learning performance based on learningperformance data. The input to Seer has two parts: a set of classified examples and an inductive learning program, such as C4.5. Figure 1.1 gives an example of classified examples from the domain of audiological problems.
t,mild,t,?, ... ,p3,mixed_cochlear_age_fixation
t,mild,t,?, ... ,p4,mixed_cochlear_age_otitis_media
f,mild,f,normal, ... ,p8,cochlear_unknown
...
t,normal,f,elevated, ... ,t26,cochlear_age_and_noise
Figure 1.1: Classified examples from the domain of audiological problems  Such examples are part of the input to Seer. Each line is an example. The last value in the line is the example’s classification. Chapter 5 gives more details of the Audiological and other realworld domains used in this thesis.
The output of Seer is a model of learning performance that can questions such as:
1. How many examples like these would the learner need to achieve 78.8% accuracy?
2. What accuracy would be possible if unlimited examples were available?
3. Factoring out the effects of noise, skewed classes, and multiple classes, what VapnikChervonenkis dimension would produce worstcase learning performance most like the performance observed?
4. How would stratified sampling (changing the proportion of examples of each class) affect learning?
Seer’s solution approach has two parts. First, it uses the classified examples to generate learningperformance data. Figure 1.2 shows a plot of learningperformance data. Each dot in the plot shows the result of one run of the learning program. For each run, Seer splits the classified examples into three disjoint sets: training examples, testing examples, and a possibly empty set of unused examples. The learning program is run on the training examples and the classification rule produced is tested on the testing examples. In the plot, the xaxis shows the number of training examples. The yaxis shows the accuracy of the classification rule on the testing examples.
Figure 1.2: Learningperformance data  Each dot shows the accuracy of a classification rule created with the number of training examples specified by the xaxis.
The second part of Seer’s solution approach is to find a good fitting model for the learningperformance data. Figure 1.3 shows the model that Seer creates for the learning performance data of Figure 1.2.. The thick line shows the deterministic part of the model. A nondeterministic part of the model, not shown, predicts how far the “accuracy dots” will scatter from the thick line. The dotted line is the model’s prediction of the largest possible accuracy.
Figure 1.3: The thick curve shows the deterministic part of the learningperformance model created by Seer. The dotted line is the model’s prediction of the largest achievable accuracy. The model predicts that the expected number of examples needed. The diamond shows that the actual accuracy achieved with 218 examples is very close to the 78.8% that the model predicted.
From the model, Seer can predict answers to the questions posed at the beginning of this section. (See Chapter 5 for representative examples of Seer’s predictions.)
1. How many examples like these would the learner need to achieve 78.8% accuracy?
The model predicts that with 218 examples the learner would achieve 78.8%. The diamond in Figure 1.3 shows the actual accuracy achieved with 218 examples. It is very close to what is predicted.
2. What accuracy would be possible if unlimited examples were available?
The model predicts a maximum accuracy of 86.2%.
3. Factoring out the effects of noise, skewed classes, and multiple classes, what VapnikChervonenkis dimension would produce worstcase learning performance most like the performance observed?
Seer can create a model that predicting that a learning task with a VC dimension d=0.559 would have a worstcase learning performance similar to the performance observed. (See Chapter 4 for details of how the VC dimension is generalized to realnumber values and to values less than 1.)
4. How would stratified sampling (changing the proportion of examples of each class) affect learning?
The model’s start parameter has value 0.149 and its skew parameter has value 1.191. Chapter 4 details how these parameters model the effect of class frequencies on learning performance.
The thesis is organized as follows. Chapter 2 describes previous work in three areas. Its first section reviews theoretical work on learning performance. Its second section discusses work characterizing the form of learning curves. The third discusses previous work on the effect of skew and multiple classes on learning performance. Chapter 3 gives an overview of learningperformance models. It defines them and argues that maximumlikelihood is the most appropriate criterion by which to select among them. It shows why popular regression methods, such as ordinary linear regression, are inappropriate for learning data and it demonstrates how to generate learningperformance data from a relatively small set of classified examples. Chapter 4 treats the selection of a set of candidate models as a design problem and enumerates some possible design choices for both the deterministic and nondeterministic part of the models. The chapter also develops a heuristic model of the effect of noise, skewed classes and multiple classes on learning. Finally, it shows how to reduce the difficult problem of finding the best model from a set of candidates to a problem of nonlinear regression that can be solved with an efficient iterative algorithm. In Chapter 5, the models are put to the test in a series of almost 100 experiments on real data from three diagnosis domains: soybean disease, heart disease, and audiological problem. The chapter also provides an analysis of the experiments. The thesis concludes with Chapter 6, which provides a summary and suggestions for future work.
Researchers have been creating models of learning performance since at least 1919 [Thorstone, 1919]. This chapter divides previous work into two approaches. Section 2.1 describes theoretical approaches, concentrating on computational learning theory. Section 2.2 covers empirical approaches and includes a table summarizing previous curvefitting approaches and showing how Seer advances the state of the art. Finally, Section 2.3 describes specific previous work related to predicting the effect of multiple classes and skewed classes on learning.
Work in computational learning theory (CLT) goes back at least to Gold [1967], but the work most relevant to this thesis started with Valiant [1984]. In that paper, Valiant introduced what is now called probablyapproximatelycorrect (PAC) learning. Roughly, a set of concepts is said to be PAC learnable if any concept in the set can be learned, with probability 1d, to within error less than e, in time (and trainingexample set size) polynomial in 1/d and 1/e.
Blumer et al. [1989] refined Valiant’s approach by introducing VapnikChervonenkis (VC) dimension analysis. VC analysis can be used to measure the expressibility of some representation schemes and to show if the schemes are PAC learnable. In addition to showing what is learning in polynomial time (and with a polynomial number of examples), computational learning theory is also used to determine worstcase learning performance bounds. ShaweTaylor et al. [1993] give this upper bound on the number of training examples needed to learn any concept in a concept set with VC dimension, d (>=2):
(2.1)
For practical problems, such as estimating the number of additional classified learning examples needed to achieve a classification rule with some desired accuracy, computational learning theory is of limited utility. One limitation is that, to date, CLT only has results for classification rules with simple representations, such as 3term conjunctive normal form functions. Developers cannot easily use computational learning theory on machine learning’s standard representations such as the ID3 decision tree and the AQ15 rule representations, although research is proceeding in this direction. Another limitation is that most computational learning theory analyzes only worstcase problems and classifiedexample distributions; the resulting learningperformance curves are very difference from the averagecase analysis needed to predict learning performance within the context of a realworld application. A third limitation is that the VapnikChervonenkis dimension approach used by computational learning theory cannot quantify the effect of noise, and essentially all realworld domains to an inductive learning program will have some degree of noise. Finally, computational learning theory generally analyzes only consistent learning algorithms, algorithms that produce classification rules that are consistent with all training examples. They ignore the lessthanconsistent algorithms used for realworld problems. A lessthanconsistent algorithm employs a stronger inductive bias to guide the search for a generalization of the data, such as heuristically trying to choose the generalization with the fewest numbers of disjunctions.
With time some or all of these limitations might be overcome, and there are some hopeful signs of progress. For example, the works of Kearns and Schapire [1989] and Goldman and Sloan [1992] deal with worstcase learning in the presence of noise. The works of Pazzani and Sarrett [1990], Hirschberg and Pazzani [1991], Iba and Langley [1992], and Langley et al. [1992] describe methods that produce averagecase analysis for some simple induction algorithms. The work of Ehrenfeucht and Haussler deals with worstcase learning of a less heuristic variant of the ID3 algorithm [Ehrenfeucht and Haussler, 1989]. To date, these techniques are not suitable for everyday use. For example, Iba and Langley [1992] perform an averagecase analysis of a simple machine learning algorithm that creates onelevel “decision trees”. The resultant learningperformance model is a complex combinatorial expression that could take more computer time to evaluate that would be taken to actually run a learning expression [Langley, personal communications, 1994]. This eliminates any advantage that analytic methods might have over empirical approaches.
In contrast to the theoretical approaches that start with some model of the learning task and then predict (or bounds) learning performance, empirical approaches start with observations of learning performance. Sometimes these observations are used to select a model from a set of models; sometimes they stand on their own.
The work of Gaines [1989] is an empirical exploration of how the quality of training examples created by an expert (called the expert’s case library) affects the number of training examples needed create a satisfactory set of rules via induction.
Gaines used the INDUCT knowledge acquisition tool in the domain of diagnoses of contact lens. He found that INDUCT created satisfactory rules when INDUCT was supplied with 18 “critical cases.” This number of needed cases, however, jumped to 90 cases when INDUCT was supplied with a representative sample of “merely correct cases.” In addition, INDUCT could still create satisfactory rules even if these correct expert cases contained 25% errors; however, it then required 326 cases. Moreover, the number of cases required jumped to 1970 cases when the cases contained 10% error and 1 irrelevant attribute on the average. Gaines’ research results demonstrate how the quality of the data dramatically affects the number of library cases required for induction.
Note that the method used by Gaines is empirical in nature. The determination of the effect of a particular amount of noise, irrelevant attributes and the like was calculated by running more and more cases under a particular set of conditions until a satisfactory set of rules was obtained.
Seer also gathers observations of learning performance, but in addition it has a set of candidate models—for example, a family of curves. It selects the model from the model set that best fits the data. If that data is representative and the model set is appropriate, the resulting model may be useful for predicting future learning performance over a wider range conditions that was observed. For example, Seer might be able to predict that approximately 2000 cases are required when the cases contain 10% errors, without having to construct and run 2000 cases with these characteristics.
This section reviews 75 years of research analyzing learningperformance data by fitting models to the data and highlighting Seer’s unique contributions.
Table 2.1 summarizes the section’s conclusions.
Over the years researchers in many fields have fit many curves (sets of candidate learningperformance models) to learningperformance data from many kinds of learning. The most popular candidate curves have been on these orders:
Exponential: log(e) = m (2.2)
Hyperbolic: e = 1/m (2.3)
Power Law : e = 1/m^{x}^{ }(2.4)^{}
Logit: logit[e] = m (2.5)
where m is the number of training examples and e is some measure of error or reaction time., The most popular kinds of learning considered have been skill acquisition (also called speedup learning) and classification learning. A rat learning to run a maze faster is an example of skill learning. The time to complete the maze would be e.
Quantitative modeling of learning was first considered in the field of experimental psychology. In Thorstone’s 1919 analysis of 40 curve shapes corresponding to data derived from humans learning to use a typewriter, he concluded that the hyperbolic curve fit best. . Mazur and Hastie [1978] reviewed of 75 years of debate in the psychological community about whether learning curves are exponential or hyperbolic. They then put these two curve families to the test on a wordrecall task and concluded that the hyperbolic fits the data better.
_{ }
Table 2.1: Summary of previous work on the shape of learning curves
Newell & Rosenbloom [1981] bridged the gap between psychology and artificial intelligence. Concentrating on skill learning, they reviewed work with skilllearning curves back to 1926. They also tested the applicability of the Power Law to a dozen skillacquisition domains, concluding:
“There exists a ubiquitous quantitative law of practice: It appears to follow a power law`_{;} that is, plotting the logarithm of the time to perform a task against the logarithm of the trial number always yields a straight line, more or less. We shall refer to this law variously as the loglog linear learning law or the power law of practice.”
Anderson [1989] showed that the Power Law of Learning could be a consequence of “rational” memory retrieval. He also referred to other models of memory and skill learning that are consistent with the Power Law of Learning. Logan [1992] gave a useful overview of Power Law research on skill acquisition since 1981.
Kadie [1991] analyzed constructiveinductive classification learning by fitting a model/curve based on the VC dimension bound to learningperformance data. The least squares fit resulted in an R^{2} of greater than 0.99
Amari [1993] reviewed “1/m results” from the fields of general stochastic descent dynamics, computational learning theory, statistical mechanics, information theory, and Bayesian statistics. Cohen [1992] reviewed “1/m” computational learning theory work. Cohen fit linear perceptron and a more complex artificial neural net to synthetic data from artificial targets and found some synthetic problems in which the exponential models fit better and some where the hyperbolic models fit better.
This thesis does not examine speedup learning. Its general goal is to provide the field of empirical machine learning with a practical and useful regression tool for classification learning. It extends the work of Kadie [1991] and the others’ work in three ways.
Learningperformance models: Best constraints and most useful parameters  We will see in Chapter 5 that all models of the same order (e.g., hyperbolic) fit the (classification) learningperformance data about equally well. What differentiates models is not their general shape but the details of their constraints and parameters. For example, unlike Seer, many learning models do not capture the constraint that classification accuracy is bounded by 0.0 and 1.0. Such models can predict expected accuracies less than 0 for low m values. Also, many models treat variance as continuously distributed. Such models can erroneously predict that it is possible to achieve 101% classification accuracy. Chapter 4 will show that Seer correctly models variance as a discrete distribution and that it can distinguish between variance caused by variation in the training examples from variance cause by variation in the testing examples. Like most of the models, Seer has a parameter, max, that models noise. Unlike the other models, it also has parameters, start and skew, that directly measure (and can be used to predict) the effect of multiple classes and skewed classes. Also, unlike other models, its measure of learning difficulty, d, can be calibrated to correspond to the VapnikChervonenkis dimension from computational learning theory.
Fitting criteria and algorithm: Finds maximumlikelihood models efficiently  As the quotation from Newell & Rosenbloom [1981] above suggests, many systems do curve fitting by first transforming the data and then applying linear regression. This introduces two problems. First, the procedure can fail. For example, if a learner’s imperfect classification rule just happens to correctly classify 5 of 5 examples, the observed error is 0. The fact that the logarithm of 0 is undefined results in the failure of the procedure. The second problem is that a leastsquares fit does not find the maximumlikelihood model for classification learning. Even if nonlinear regression is used rather than linear regression, leastsquares fit only finds the maximumlikelihood model if the data is continuous and the variance is constant. Neither assumption holds for (classification) learningperformance data. (See Section 3.1.) Seer works directly on the data (no transformations), finds the maximumlikelihood model directly, and makes the reasonable assumption that variance, rather than being constant, is determined by the binomial or similar discrete distribution. This would seem to make Seer’s fitting task one of difficult nonlinear programming, but as Section 4.4 details, Seer is able to reduce the problem to nonlinear regression and which it can solve with a quickly converging iterative algorithm.
Application and
experimental validation: Demonstrates a practical application on realworld
data  It is the author’s belief that Seer shows the first practical
application for learningperformance modeling of machine learning data, namely,
the estimation of the number of training examples needed to develop a
classification expert system via induction. Related to this, it is one of the
few efforts used to make predictions,
not just characterizations. Although several other efforts have applied
learningperformance modeling to noisy learning situations, Seer is the first
(within the author’s knowledge) to use noisy data produced by a practical
machine learning system such as C4.5. Experimentally, Seer provides (to the
best of the author’s knowledge), the first learningperformance modeling of a
machine learning algorithm on “natural” learning examples (See Chapter 5). The
other previous machine learning work has either been entirely theoretical (and
not applied to any empirical data) or has been applied only to synthetic
learning examples.
One of the contributions of this thesis is an analysis of how skewed classes and multiple classes affect learning performance and the learning curve. The analysis shows how class frequency information constrains the form of the curve. Previous work on the effect of class frequencies developed from a statistical design method called stratification [Brewer, 1982].
In some statistical problems, examples can be chosen by class. By way of illustration, a statistician may decide to poll exactly 25 Libertarians and exactly 25 Republicans even if Libertarians are less frequent than Republicans in the population being studied. Such a design would allow the statistician to make better conclusions about Libertarians and about the differences between the two groups while keeping cost constant relative to random sampling.
In inductive machine learning, work with skewed classes has concentrated on adapting algorithms designed for even class frequencies to work well with skewed class frequencies. Buntine [1989] created a decision tree learner that could be given class frequency information. Catlett [1991] performed a series of experiments on the effect of reducing the number of examples in the most frequent class (so that the learning algorithm would run faster). Although he concluded this is a useful tool for some situations, he believed that throwing away training examples (even from the most frequent class) is not a general solution to the problem of scaling to large data sets.
Quinlan [1991] documented some of the difficulty of learning with two skewed classes and shows how machine learners can minimize some of the difficulty by selectively adopting a “when in doubt, guess the most frequent class” policy.
Seer provides an explicit and predictive empirical model of the effects that previous work has observed.
Theoretical approaches of learningperformance modeling can be used, if enough is known about the learning task, to predict bounds on learning performance. However, for practical problems with noise and that use the best machine learning algorithms, theory cannot yet predict averagecase performance. Empirical approaches try to gather additional information, usually about a more narrow learning task, by observing actual learning performance. This information can then be generalized into quantitative models. Seer does this by fitting a curve (a learningperformance model) to observed learningperformance data. It advances the state of the art with: 1) learningperformance models that embody the best constraints (for classification learning) and most useful parameters 2) fitting algorithms that efficiently find maximumlikelihood models, and 3) a demonstration, on realworld data, of a practical application.
Our ultimate goal is to create a statistical model from observations of learning performance that can predict future learning performance. This chapter addresses two prerequisite questions. First, given enough learningperformance data and some candidate models of learning performance, how do we evaluate the fit of those models on that data? Second, how can we generate enough learningperformance data from a relatively small number of classified examples?
For both questions, assume learningperformance data is a set of tuples. Specifically, let _{} represent w runs of the learning program. For each run of the learning program a tuple <m_{i},y_{i},k_{i}> is recorded, where m_{i} is the number of training cases given to the inductive learning program, y_{i} is the number of testing cases the classification rule classifies correctly, and k_{i} is the number of cases used to test the classification rule created by the learning program.. Figure 3.1 shows an example plot of m_{i} vs. y_{i}/k_{i}.
Figure 3.1: Learningperformance data. The horizontal axis is m_{i}, the number of training examples. The vertical axis is y_{i}/k_{i}, the fraction of testing examples the learner’s classification rule correctly classifies.
This section will answer the question “What is a goodfitting model of learning performance?”.
A model is a function from m, the number of training examples, to a probability density. Given a value m, the number of training examples, and k, the number of testing examples, a model predicts the probability of each possible value of y, the number of correctly classified examples. Regression models are often defined in two parts: a deterministic part and a nondeterministic part [Aldrich and Nelson, 1984; McClave and Dietrich, 1988]. For example, in the model in Figure 3.2a, the deterministic part of the model has the form of a line _{}, and the nondeterministic part of the model is a normal distribution of fixed variance centered on z. This is the kind of model found by ordinary linear regression. Figure 3.2b illustrates another model. Again, there is a deterministic part, in this case taking the form of a curve, and a nondeterministic part, in this case taking the form of a discrete probability density. Chapter 4 will discuss specific candidates for the curve and the density. For now, our focus is on what a model does.
a) b)
Figure 3.2: Two statistical models
Regression is well established as a field of statistics and includes many specific techniques, such as ordinary linear regression via leastsquares minimization. This section shows why continuousresponse methods such as ordinary linear regression are not adequate for learningperformance data.
In situations with one realvalued dependent variable and one realvalued independent variable (that is, continuousresponse data), one can think of ordinary linear regression as a method of finding the straight line that best characterizes a set of observations. “Best” here means the line that minimizes the sum of the squares of vertical distances between the data points and the line. Two properties of learningperformance data make it inappropriate for ordinary linear regression. First, learning curves (plots of m vs. y/k) are not typically linear. They range from 0% to 100% expected accuracy, typically rising quickly and then slowly converging to or toward 100%. This violates ordinarylinear regression’s linearity assumption and results in predicted accuracies that are greater than 100%, such as 110%. Second, ordinary linear regression (and many other nonlinear continuous response methods) assume that errors  that is, differences between the data and the model  are distributed according to some constant variance. In other words, They assume that a situation where the model predicts 30% accuracy and the data shows 50% is just as likely as a situation where the model predicts 99% accuracy and the data shows 79% accuracy. In practice, the variance of the error shrinks significantly for learningperformance data as the accuracy gets closer to 100%.
These problems can sometimes be overcome by transforming the data so that it has more appropriate properties. For example, averaging observed accuracy and then transforming it with the _{} function [or the function _{}] will tend to linearize its relation with the number of training examples, m. Such a transformation will also tend to make the variance of the errors more constant.
A problem remains, however, for all regression methods that assume learningperformance data is continuousresponse, for example, by assuming that errors are normally distributed. Learningperformance data is binaryresponse data, not continuousresponse data; on each classification test case, the classifier will either have an accuracy of 0 or an accuracy of 1. If p(m) is the accuracy predicted by the model for a learning classification rule based on m training examples, the error on a single testing example will, with probability p(m), be 1p(m) and will, with probability, 1p(m), be 0p(m). Thus, the deviance between prediction and actual result will at best have a mean of zero and a variance equal to p(m)[1p(m)]. In other words, it is distributed discretely according to the binomial (or some similar distribution), not continuously. Although the binomial distribution can sometimes be approximated with the normal distribution, that approximation is not accurate if the sample size is small or if p(m) is near 1. Because small samples and accuracies near 1 are of special interest to us when predicting learning performance, we should not use continuousresponse regression on learningperformance data.
An alternative to continuousresponse regression is binaryresponse regression. For some types of continuousresponse regression, such as ordinary linear regression, finding a leastsquares fit produces the model with maximum likelihood. This is not true for binaryresponse regression so that it is usual to use other, usually more direct, but less efficient, methods for finding maximumlikelihood models [Cramer, 1986; Hosmer and Lemeshow, 1989]. Chapter 4 demonstrates Seer’s use of nonlinear optimization to this end. The next section discuses maximum likelihood further.
If our general goal is to find the most probable model from a set _{}, given some observed data D.:
Find_{}that maximizes _{} (3.1)
Then by Bayes’ rule, this is the same task as:
Find_{}that maximizes _{} (3.2)
If the number of M_{i}’s is finite, all M_{i}'s equiprobable (and that D is constant), this is the same task as:
Find_{}that maximizes _{} (3.3)
This assumption is typical but usually implicit in the field of statistics [Cramer, 1986, p. 7]. It is the reasonable given that no other knowledge is available. The result is called the maximumlikelihood model.
Consider a simple example of computing the likelihood of a model given some learningperformance data. Suppose the learning performance data is
{<100, 1, 1>, <100, 0, 0>, <150, 0, 1>, <150, 1, 1>, <200, 1, 1>, <200, 1, 1>}. (3.4)
Suppose furthermore one of our candidate models predicts an expected accuracy of 0.70 for classification rules based on 100 training examples, an expected accuracy of 0.80 for classification rules based on 150 training examples, and an expected accuracy of 0.90 for classification rules based on 200 training examples. Table 3.1 shows the likelihood calculation. Note that maximizing likelihood is equivalent to minimizing the log likelihood; this fact is used to simplify some calculations..
Learning Performance Datum 
Model’s Predicted Expected Accuracy 
Likelihood 
Log (natural) Likelihood 
<100, 1, 1> 
0.70 
0.70 
0.36 
<100, 0, 1> 
0.70 
0.30 
1.20 
<150, 0, 1> 
0.80 
0.20 
1.61 
<150, 1, 1> 
0.80 
0.80 
0.22 
<200, 1, 1> 
0.90 
0.90 
0.11 
<200, 1, 1> 
0.90 
0.90 
0.11 

Total: 
0.027 
3.60. 
Table 3.1. An example of computing the likelihood of a model
In many applications, a small, fixed set of examples (a case library) must supply both training examples for the inductive learning program and testing examples for measuring the performance of the classification rule the program produces. This section tells how to collect learning performance data from a small number of classified examples with a heuristic called generalized crossvalidation.
The standard statistical technique of crossvalidation (see Figure 3.3) would try to solve the problem by dividing the N classified examples into v disjoint subsets, each of approximate size N/v. Then it would repeatedly hold one subset out for testing and give the other examples in the other v1 subsets to the learner for training. The result would be learningperformance data of the form:
{<m_{1},y_{1},k_{1}>, <m_{2},y_{2},k_{2}>, ..., <m_{v},y_{v},k_{v}>} (3.5)
where the m_{i}’s are the number of training examples given to the learner (approximately (v1)N/v). The k_{i}’s are the number of testing examples (approximately N/v). The sum of the k_{i}’s will be exactly N because each case is used as a testing example exactly once. The y_{i}’s are the number of examples a classification rule classified correctly. They will range in value between 0 and k_{i}.
Input: A set N of classified examples
Procedure:
1. Shuffle order of examples
2. Partition into subsets (say, v subsets):
S_{1}, S_{2}, ... S_{v}
3. For each subset, train on the others, test on it.
Run 1: Train on N S_{1}; Test on S_{1}
...
Run i: Train on the set N S_{i} ; Test on S_{i}
...
Output: {<m_{1},y_{1}, k_{1}>,<m_{2},y_{2},k_{2}>, ..., <m_{i},y_{i},k_{i}>,...}
Figure 3.3: The crossvalidation procedure
As an example, if N is 101 and v is 20, then each subset will have 5 or 6 cases in it and resultant learningperformance data will look something like:
{<95, y_{1},6>, <96, y_{2},5>, ... <96, y_{20},5>} (3.6)
Crossvalidation is good in that it uses each case exactly once for testing. This means that the y_{i}’s are independent with respect to the testing examples (but not the training examples). Crossvalidation is inappropriate for learningperformance analysis, however, because it only provides data for one or two distinct values of m. In the example, it only provided data for m=95 and m=96. To do learningperformance analysis, we need data for many distinct values of m.
To meet this need, Seer uses generalized crossvalidation, a new heuristic. The idea is to perform crossvalidation for each distinct value of m of interest (see Figure 3.4).
Input: A set N of classified examples, set of mvalues of interest
Procedure:
For each m:
a. Shuffle the order of the examples
b. Partition the examples into v subsets:
c. For each subset, S_{i}
Train on m examples randomly selected without replacement from the set MSi
Test on S_{i}
Output: {<m_{1},y_{1,1}, k_{1,1}>,<m_{1},y_{1,2},k_{1,2}>, ..., <m_{i},y_{i,1},k_{i,1}>,...}
Figure 3.4: The generalized crossvalidation procedure
For example, suppose N is 101 and that one mvalue of interest is 45. We could divide the data into three subsets a, b, and c of size 33, 34, and 34, respectively. Next we could create three pairs of training data and testing data:
{<Nc, c>, <Nb, b>, <Na, a>} (3.7)
Each pair consists of 1) the union of all but one subset and 2) the remaining subset. The first element of the pair is the training set. In the example, each training set has 67 or 68 examples in it. That is more than we want, so we randomly select (without replacement) 45 examples from each training subset to actually be used for training. The process is repeated for all mvalues of interest.
In the example, the data was divided in the three initial subsets. In general, it will be divided into v subsets. The only constraint is that the number of initial subsets v must be such that _{}. In other words, it must be large enough that there will be enough training examples. In the experiments of Chapter 5, the default value of v is 10. For larger m, it is the minimum value that meets the constraint.
With generalized crossvalidation, for each distinct value of m, each of the N cases is used for testing once and only once. Because we assume the N cases are independent, learning performance data with the same m value is independent (with respect to the testing examples). Learning performance data with different values of m, however, are not independent with respect to the testing examples because the same examples used for testing when, say, m=49 are reused for testing when m=50. Considered as a whole, the learning performance data is, thus, not mutually independent. This will prevent us from making some statistical conclusions that depend heavily on the independence of the data. For example, it will make it impossible to create good confidence intervals on predictions because such confidence intervals depend on a good estimate of variance and that requires independent data. This is unfortunate, but it is inevitable when working with a small amount of data.
This chapter gave an overview of the regression problems on which Seer works. It defined the form of learningperformance data of interest to Seer as a set of tuples, {<m_{1},y_{1},k_{1}>, <m_{1},y_{1},k_{1}>, ...}, where m_{i} is the number of training examples given to a learning program, k_{i} is the number of testing examples given to the learner’s classification rule, and y_{i} is the number of testing examples the classification rule classifies correctly. The chapter also defined models of learning performance. A model is a function that given the number of training examples, m, and the number of testing examples, k, a model tells the probability of correctly classifying any y_{i} number of examples. The chapter also showed how to apply the notion (from the field of statistics) of maximum likelihood and that continuousreposes regression methods will not finding maximumlikelihood models of the learningperformance data. Finally, the chapter showed that when the total number of classified examples is small, a new procedure called generalized crossvalidation can be used to gather learningperformance data (at a cost of some statistical independence).
The next chapter gives the specifics of Seer’s regression. It details the learningperformance models that Seer considers and shows how Seer uses nonlinear optimization to find maximumlikelihood models of learning performance.
The previous chapter defined learningperformance models and showed how the maximumlikelihood criterion defines the best model in a possibly infinite set of candidate models. But which sets of candidate models should we consider? And how can we, algorithmically, find the best model from a set? This chapter answers these two questions. The chapter treats the first question (which sets of models to consider) as a threepart design problem: 1) designing the deterministic part of the model for noisefree learning with two equiprobable classes, 2) generalizing the deterministic model to handle multiple classes, skewed classes, and noise, and 3) designing the nondeterministic part of the model. Figure 4.1 gives an overview of the design choices considered. The chapter answers the second question (how to algorithmically find the best model from a set) in Section 4.4 which details the optimization techniques used. This chapter draws no conclusions about which designs are best. That question will be addressed in Chapter 5 by putting the designs to the test on real learningperformance data.
Figure 4.1: Design of sets of candidate models. We will start with sets of deterministic models based on logit, probit, Burr_{1,} and ED. Then we will add Model_{gen} or NoExp (NoExp defined in the next chapter) to improve fit when classification classes are skewed, multiple, or noisy. Finally, we will add a nondeterministic component to the (sets of) models. The numbered paths show the five designs put to the test on real data in Chapter 5.
Recall from section 3.1.1 that the deterministic part of a learningperformance model is the part that, given the number of training examples provided to the learner, predicts the expected accuracy of the learner’s hypothesis. (The nondeterministic part of the model predicts the error in the accuracy prediction.) This section finds some reasonable and some unreasonable deterministic (sets of) models. The most interesting of these, the effectivedimension learningperformance model, is inspired by models of learning developed by computational learning theory.
All the models considered will be binaryresponse models, that is; they predict number of occurrences of some binary event. In the context of learningperformance data, they predict the expected number of correctly classified examples, y, when the learner’s classification rule is tested on k examples. Put another way, they predict the expected accuracy, y/k, on a set of testing examples, where accuracy can range from 0 to 1.
This section will start out assuming that the learning task of interest is twoclass classification (for example, distinguishing healthy people from sick people). Also, it will assume that the two classes are equally probable and that classification is noisefree. In the example, this would mean that it assumes that it will be asked to classify about the same number of healthy people as sick people and that any two people with the identical symptoms will have the same classification (either healthy or sick). Section 4.2 relaxes this assumption. It shows how to generalize a learningperformance model of learning two, equiprobable, noisefree classes into a learningperformance model of learning multiple skewed classes with noise.
A common way to approach the problem of choosing a deterministic model is to try to find a link function. Its inverse, a regression model, should fit the data well with few parameters (simplicity). Four candidate link functions and their corresponding regression models are summarized in Table 4.1.
Link Function 
Corresponding Regression Model 

Name 
Value 
Name 
Value 
logit 
z= log(p/(1p)) 
logistic 
_{} 
probit 
z = ^{1}(p) 
normal 
p = (z) 
EDit 
_{} 
EDmodel 
p = EDit^{1}(z) 
Burr_{1} 
_{} 
BurrModel 
p = z/(z+1) 
Table 4.1 Seer's link functions and regression models  is the cumulative distribution function for the normal distribution, p is a predicted probability and m is the number of training examples. The variable z is (m1)/d+1, where m is the number of training examples and d is a parameter.
The two most commonly used link functions for binaryresponse regression are logit and probit. They are based on statistical distributions [McCullagh and Nelder, 1989]. The logit link function is the inverse of the cumulative distribution function (CDF) for the logistic probability distribution. Logit's inverse is called the logistic regression model. It is defined as:
_{} (4.1)
Figure 4.2a shows its plot. Because the logistic regression model is a CDF, it maps values from the _{} to _{} to the interval 0 to 1. The probit link function's inverse is called the normal regression model because it is the CDF of the standard normal distribution. Figure 4.2b plots the normal regression model.
EDit and Burr_{1} are link functions especially designed to fit learningperformance curves. The EDit link is inspired by results in computational learning theory [ShaweTaylor et al., 1993]. The next section explains the relationship. The Burr_{1} link, a simplification of the EDit link, is of a class of functions (the inverse polynomials) that have been used for leastsquares regression [Nelder, 1966], but not (to the author’s knowledge) for binaryresponse regression [Nelson, 1994].
a)
b)
Figure 4.2: a) Logistic model:_{} b) Probit (normal) model: p = (x)
Informally, the effective dimension is just a backward version of the VapnikChervonenkis (VC) dimension [Blumer et al., 1986]. It is a measure of complexity based on the expressiveness of a hypothesis (classificationrule) language.
Taking as input the complexity of the hypothesis space of the learning problem, a given VC dimension, d, defines a (worst‑case) learning curve. In contrast, the effective dimension of a learning problem is defined by a learning curve. Given a set of learningperformance data, the effective dimension is the d that defines a curve that best fits the data. Thus, unlike the way that computational learning theory uses the VC dimension, effective dimension analysis can take advantage of existing empirical performance data of an induction algorithm to make quantitative predictions of the behavior of the learning algorithm if it was given additional cases or was subjected to different environmental conditions, such as different levels of noise, irrelevant attributes, number of hidden units, etc.
Formally, ShaweTaylor et al [1993] gives this upper bound on the number of cases needed to learn any hypothesis in a hypothesis space with VC dimension, d (>=2):
m=_{} (4.2)
In these equations is the maximum allowed error (and 1 minus the minimum required accuracy). d is the minimum probability of failure. Log is natural logarithm. An intriguing aspect of this formulation is that the VCdimension approach offers statistical guarantees about induction which we think of as offering no guarantee.
The ShaweTaylor et al. relation, as used in computational learning theory, is unsuitable as a regression model for averagecase learning. It is only defined on integer d greater than 2, but even d=2 gives a slower raising learning curve than is typically seen with averagecase learning. The first step to defining the EDit link function of Seer is to change the VC bound so that it is defined on all nonnegative realvalued d:
for constant d, _{} (4.3)
Also, the function is expressed in terms of accuracy, p, rather than error, :
_{}=_{} (4.4)
_{} corresponds to a fixed learning curve. To allow it to be fit to data, parameters must be added. Putting in the effective dimension parameter, d, allows the curve to expand and compress horizontally:
_{}=_{} (4.5)
Figure 4.3 plots the original VC curve and EDit at d=2, the value of d at which they are most different. Even here they are close.
Figure 4.3: The solid line is the learning accuracy curve for the _{} for d=2. The dashed line is for _{} for d=2.
We will see in the next section that a learning curve for learning problems with two equiprobable classes should map one training example to an accuracy of 0.5. We can do this with this curve by shifting it to the left:
_{}=_{} (4.6)
The effectivedimension model is defined in terms of the inverse of _{}:
_{} (4.7)
Because _{} is of the form _{} it has no closedform inverse. _{} is, however, differentiable, so its inverse can be computed numerically using Newton's method. For even greater speed, a table can store selected values of _{}, and other values can be determined via linear interpolation.
The Burr_{1} model is similar in shape to, but simpler than, the EDmodel:
p = BurrModel(z) = z/(z+1) (4.8)
where z = (m1)/d+1. Nelder [1966] describes a class of regression models he calls inverse polynomial models and develops procedures for using these models with continuousresponse data. The reciprocal model is a specialization of this class of models. Aldrich and Nelson [1984] mention the function, which they call Burr_{1 }as a candidate for regression, but in personal communications Nelson [1994] says that, as far as he knows, no one has used it for regression on binaryresponse data before. See Chapter 2 for a discussion of the use of similar functions with continuousresponse data.
In this section the four deterministic models are compared. The comparison shows that the two models based on computational learning theory (EDmodel, based on EDit, and BurrModel, based on Burr_{1}) have different shapes than the betterknown logistic and probit (normal) binaryresponse models. We will see in the next chapter that, as one might suspected, the learninginspired models fit and predict learningperformance data better.
Figure 4.4 plots BurrModel, EDmodel, and the logistic and probit (normal) regression models fitted so that all four intersect at p = 0.70 and p = 0.75.
Figure 4.4: Four regression models fitted to intersect at accuracies 0.70 and 0.75. The top two curves are the logistic and probit (normal) regression models. The lower dashed line is BurrModel. The lower solid line is EDmodel.
Figure 4.5 shows the slope in the curves of Figure 4.4 as a function of accuracy. It shows that EDmodel and BurrModel are similar to each and quite different from the logistic model and the normal model. Figure 4.6 shows the ratio of EDit to Burr_{1}. It highlights the difference between EDit and Burr_{1}. As accuracy goes to 1, EDit becomes infinitely larger than Burr_{1}.
Figure 4.5: The slopes of the four curves as a function of accuracy. It shows that the logit and probit models are much different than the new models designed for learning curves.
Figure 4.6: The ratio for EDit and Burr_{1} as a function of accuracy. At high accuracies, the functions are not identical
This section develops a learningperformance model for learning in the presence of multiple classes, skewed classes, and noise. Like any good model, it makes predictions. Three of the model’s most interesting predictions are:
1. The effects of class skew and class multiplicity can be predicted from the class frequencies. Actual learningperformance data need not be considered.
2. Once the effects of skew, multiple classes, and noise are factored out, learning performance can be well characterized by a single parameter.
3. When classes are skewed, learning error does not decrease linearly with the reciprocal of the number of examples; rather, it decreases in a predictable exponential way with that reciprocal.
The learningperformance model presented in this section is not meant to be a statement of computational learning theory. Rather, it is meant to be a heuristic that will be tested empirically.
Intuitively, we would expect multiple classes to make learning more difficult. For example, learning to distinguish 19 diseases should be harder than learning to distinguish two. Similarly, noise should also make learning harder. For example, if 10% of the training and testing examples given to a learner have the wrong class label, the ultimate achievable accuracy as measured on those noisy testing examples is reduced to at least 90%. On the other hand, skewed data can make learning easier. For example, suppose the learning problem is skewed so that 95% of the training and testing data is of one class. The learner could be 95% accurate just by always guessing the most frequent class. Even with just one training, the learner would have a 95% chance of seeing the most frequent class, and thus by hypothesizing that everything is of the class it sees, obtain an expected accuracy of at least 90.25% (=95%´95%).
The model ultimately developed in this section, model_{gen}[z,start,skew,max], is this parameterized function from examples, m, to accuracy, p (as illustrated in Figure 4.7):
_{} (4.9)
where:
· model5[z] is a learningperformance model for the base case: 2 classes, 50%/50% class frequencies, no noise. One possible candidate is z/(z+1), the BurrModel of Section 4.1.2.2.
· d measures the learning rate.
· start is the expect accuracy when the number of training examples, m, is 1.
· skew is a measure of skew. A value of 1 is the skew when all classes are equiprobable.
· max is the highest possible expected accuracy. It is a measure of noise.
Figure 4.7: Function model_{gen}[z,start,skew,max] maps m, the number of examples, into p, the expected accuracy. It is defined in terms of a function model_{5}[m]. Its d parameter expands or contracts the curve along the xaxis. Its start and max parameters define the low and high yaxis values. Its skew parameter makes the curve more concave or less concave.
The model_{gen}[z,start,skew,max] learningperformance model depends on a model_{5}[z] learningperformance model must be assumed. Section 4.1 discusses reasonable candidates for model_{5}[z]. Values for start and skew can be determined from the class frequencies observed in the training examples (Section 4.2.5). Values for d and max can be determined with maximum likelihood estimation. That is the topic of Section 4.2.
As a base case, consider the learning performance of an inductive learning when it is given only a single training example. Its best strategy if it has no other knowledge is to guess what it sees. That is, it should hypothesize that all examples are of the same class as the only training example it has seen.
· Observation 4.1: If there are two classes each with probability 0.5, the expected accuracy of the guesswhatyousee classification rule on independent testing data is 0.5.
· Observation 4.2: If there is only one class with class frequency 100%  in other words, if all the testing and training examples have the same class label  then the expected accuracy is 1.
· Observation 4.3: As the number of classes goes to infinity each with an infinitesimal probability of appearing, the expected accuracy goes to 0.
· Observation 4.4: If there are two classes, class “A” with probability 0.9 and class “B” with probability 0.1  the expected accuracy is:
The
example is A: probability 0.9,
expected accuracy 0.9, 0.81
The example is B: probability 0.1,
expected accuracy 0.1, 0.01
TOTAL
EXPECTED ACCURACY = 0.82
· Observation 4.5: Generalizing, given 1 example, if the probability of any class c is f[c], then the expected accuracy on that class is f[c]. Let fvector be the vector of class probabilities, then overall expected accuracy is:
_{} (4.10)
So, if fvector is <0.9, 0.1>, p = <0.9, 0.1>^{T }· <0.9, 0.1> = 0.9^{2 }+ 0.1^{2} = 0.82
Modeling the effect of m training examples requires some additional assumptions.
· Assumption 4.1: For learning tasks with two classes each with frequency 0.5, the expected accuracy p can be well estimated by some function model_{5}[z], where z is (m1)/d+1, m is the number of training examples and d is a parameter related to learning rate. From Observation 4.1, the function should have the property that:
model_{5}[1]=0.5 (4.11)
This assumption and possible candidates for model_{5}[z] were the subject of Section 4.1. One candidate for the function is BurrModel:
model_{5}[z] = z/(z+1) (4.12)
It is plotted in Figure 4.8.
Figure 4.8. The z/(z+1) function, a candidate for model5[z]. The function maps z  z is (m1)/d+1, where m is the number of training examples and d is a constant  to p , accuracy. model5[z] is assumed to model the learning on data with two classes, each with a 0.5 frequency.
· Observation 4.6: If there is only one class with probability 1, the expected accuracy for all m is 1 (as before).
· Observation 4.7: As the number of classes all with an infinitesimal probability of appearing goes to infinity, the expected accuracy goes to 0 (for all finite m).
· Definition 4.1: Let model_{cla}[z,f[c]], where z is (m1)/d+1, be the expected accuracy on any class c with class probability f[c] of a learner given m training examples. This definition assumes that expected accuracy on a class is only a function of class frequency and number of examples and not of the class itself.
· Assumption 4.2: If f_{1 }< f_{2}, model_{cla}[z,f_{1}] < model_{cla}[z,f_{2}] (for all finite z). In other words, for a given number of training examples, the expected accuracy on a class will be greater the larger that class's frequency.
In general, given m examples, the expected accuracy on all classes is:
the weighted average of the accuracy on each class. Note that when there are two classes and each has probability 0.5, by (4.13) and Definition 4.1:
_{} (4.14)
In other words, model_{5}[z] is both the total expected accuracy on all classes when there are 2 classes of 50%/50% probability and the expected accuracy on a single class with 50% probability.
The next goal is to find a simple heuristic expression of model_{5}[z] in terms of model_{5}[z]. The constraints are:
model_{cla}[z,0.5] = model_{5}[z] Assumption 4.1
model_{cla}[z,0.0] = 0.0 (for finite z) Observation 4.7
model_{cla}[z,1.0] = 1.0 Observation 4.6
model_{cla}[1,f]= f Observation 4.5
If f_{1 }< f_{2}, model_{cla}[z,f_{1}] < model_{cla}[z,f_{2}] (for finite z) Assumption 4.2
The intuition motivating the heuristic is that the model_{5}[z] curve is our standard learning curve and that when a class probability is large, this standard curve will be pulled up toward higher accuracy and that when a class probability is low the class probability will be pulled low toward zero.
Figure 4.9. How a standard learningperformance model which has value 0.5 when z=1, might be pulled up to produced a curve with greater value when z=1 or pulled down to produced a lesser value when z=1.
For any z and f,
if 0.0 < f < 0.5, 0.0 < model_{cla}[z,f] < model_{5}[z] (4.15)
and
if 0.5 < f < 1.0, model_{5}[z] < model_{cla}[z,f] < 1.0 (4.16)
Here is a simple function with the right properties:
_{} (4.17)
For example, if model_{5}[z] is z/(z+1), here are the curves for f= 0.0, 0.1, …, 1.0.
Figure 4.10: Learningperformance models for classes with class frequencies f= 0.0, 0.1, …, 1.0. Each model is a learning curve function, model_{cla}[z,f], that maps z (a linear function of the number of training examples) to p, accuracy in identifying that class. Note that when the number of training examples is 1 (and z=1), the accuracy is the class frequency. This plot assumes that the “standard” learningperformance model, model_{5}[z], is z/(z+1) .
Applying the heuristic to model_{all}[z,fvector] gives
_{} (4.18)
If we have classnum classes each of equiprobability, the heuristic predicts this overall accuracy:
_{} (4.19)
If we have classnum classes with a variety of class probabilities, the heuristic predicts an overall accuracy of:
_{} (4.20)
This is the weighted sum of model_{cla}[z,f] functions. The sum can't be expressed exactly as a model_{cla}[z,f] curve, but it can be closely approximated with a generalization of the model_{cla}[z,f] function. Such an approximation makes sense since the model_{cla}[z,f] function is itself only a heuristic.
To see the intuition behind the approximation, start with the model_{cla}[z,f] function:
_{} (4.21)
Recall that when z=1, , the model_{cla}[z,f]= f. Thus, the exponent, Log_{½}[f/2]1, and value when z=1 coupled. The approximation is a generalization that uncouples these two values:
_{} (4.22)
where start is the expected accuracy when z=1 and skew is the exponent.
The final step in the approximation is setting the parameters start and skew so that model_{skew}[z,start,skew] _{} model_{all}[z,fvector]. The strategy will be to set start and skew so that the two functions intersect for three values of z.
First, we would like them to intersect as z goes to infinity. This is automatic because both functions do go to 1 as z goes to infinity.
Second, we would like them in intersect when z=1. The value of model_{all}[1,fvector] is fvector^{T}· fvector. The value of model_{skew}[1,start,skew] is start. So we can make the two functions intersect be setting _{} to fvector^{T}· fvector.
Third, we would like them to intersect at a third point. A convenient point is that _{} such that model_{5}[z]=0.75. At that value of z, model_{skew}[z,start,skew] will equal model_{all}[z,<f_{1},f_{2},…,f_{classnum}>] if and only if:
_{} (4.23)
The only unknown is skew; its value can be determined numerically with, for example, the secant method or with the NewtonRaphson Method [Press et al., 1992].
Here is how model_{skew}[z,start,skew] fits in with our previous results for learning in which all the classes are equiprobable. When number of classes is two and each has 50%/50% probability, our model is model_{5}[z]. This exactly equals model_{skew}[z,0.5,1]. More generally, when number of classes is classnum and each has the same probability, our model was model_{5}[z]^{ Log}½^{[f/2]1}. This exactly equals model_{skew}[z,1/classnum,1]
The plot shows model_{skew}[z,0.5,skew] for skew=0.001, 1, 2, 4, and 8. The topmost curve is skew=0.001.
Figure 4.11: Learningperformance models with a start value of .5 and skew values of 0.001, 2, 4, and 8. Each model is a learning curve function, model_{skew}[z,start,skew], that maps z (a linear function of the number of training examples) to p, accuracy in identifying that class. This plot assumes that the “standard” learningperformance model, model_{5}[z], is z/(z+1).
Here is an example from soybeandisease diagnoses in which there are many classes and the class frequencies vary widely . Suppose the class frequencies are:
<0.0130293, 0.00325733, 0.019544, 0.019544, 0.130293, 0.130293, 0.0325733, 0.0651466, 0.0325733, 0.0325733, 0.0325733, 0.130293, 0.0325733, 0.0325733, 0.0651466, 0.130293, 0.0325733, 0.0325733, 0.0325733 > (4.24)
The model_{all}[z,fvector] heuristic for these class frequencies is:
0.521172 model_{5}[z]^{2.94017} + 0.130293 model_{5}[z]^{3.94017} + 0.29316 model_{5}[z]^{4.94017} +
0.039088 model_{5}[z]^{5.67713}
+ 0.0130293 model_{5}[z]^{6.2621}
+ 0.00325733 model_{5}[z]^{8.26209 }(4.25)
If model_{5}[z] is assumed to be z/(z+1), model_{all}[z,fvector] looks like
Figure 4.12: A learningperformance model for the soybean disease data. The model is a function, model_{all}[z,fvector], that maps z (a linear function of the number of training examples) to p , accuracy in identifying that class. The quantity fvector is the vector of class frequencies for the soybean data. This plot assumes that the “standard” learningperformance model, model_{5}[z], is z/(z+1).
To determine the model_{skew}[z,start,skew] approximation to model_{all}[z,fvector], we need to determine start and skew. The value of start is just fvector^{T}· fvector = 0.0868867.
To determine skew, we want to find a value for it such that
model_{skew}[z,start,skew]=model_{all}[z,fvector] when model_{5}[z]=0.75. (4.26)
Substituting in values, we want to find a value for skew that will solve
_{} (4.27)
The value is found numerically to be 1.07556. So, model_{skew}[z,start,skew] is
0.0157871 + 0.984213 model_{5}[z]^{3.79106 }(4.28)^{}
which is much simpler than model_{all}[z,fvector].
Again assuming that model_{5}[z] is z/(z+1), here is a plot of both the model_{all}[z,fvector] heuristic and its model_{skew}[z,start,skew] approximation:
Figure 4.13: Two almost indistinguishable learningperformance models for the soybean disease data. One is the same model_{all}[z,fvector] function plotted in Figure 4.12. The other is model_{skew}[z,start,skew], a simpler function with much the same shape. This plot assumes that the “standard” learningperformance model, model_{5}[z], is z/(z+1).
There are many possible ways to define noise. For the purposes of this thesis, it is defined as the difference between 100% accuracy and max, the expected accuracy as z (and m) go to infinity.
This suggests that noise can be added to a learning curve model through scaling a learning curve model by the quantity max. This doesn't quite work because it would scale down start. The start value is determined from the observed class frequencies. If there is noise, the noise will already be reflected in these class frequencies and in start. Thus, a model of noise should not affect start.
We will do the next simplest thing, scale down the learning curve model, but do so with start as the origin.
The result is an expression very similar to the expression for model_{skew}[z,start,skew]. The plot shows model_{gen}[z,0.5,1,0.8] with model_{5}[z] defined as z/(z+1). Notice that model_{gen}[z,start,skew,1] = model_{skew}[z,start,skew].
In outline, the steps to fitting a model_{gen}[z,start,skew,max] curve to data are:
1. From the training data, measure fvector, the class frequencies.
2. From fvector, compute start and skew.
3. Into model_{gen}[z,start,skew,max] substitute the values of start and skew, also substitute in (m1)/d+1 for z and whatever function is being used for model_{5}[z]. For example, using the z/(z+1) for model_{5}[z] and the soybean frequencies we saw earlier produces:
_{} (4.30)
4. The learningperformance data gives observed values for m, the number of training examples, and p, the observed accuracy. Find the values of d and max that produce the curve that best fits the data. The details on how to do this are the subject of Section 4.4.
Figure 4.14: A learningperformance model that models the effect of 20% noise. The learningperformance model is model_{gen}[z,0.5,1,0.8]. The 0.8 parameter is 100%20%, the maximum attainable accuracy. This plot assumes that the “standard” learningperformance model, model_{5}[z], is z/(z+1).
But is the complexity of Equation (4.29) really necessary? To find out, we create model_{noexp,}, a simpler model based on Equation (4.29) but without the exponents.
In these case of two equiprobable classes, Equation (4.29) reduces to Equation (4.31), but in the next chapter we will true it on other situations to see how important the exponents are.
Recall from Section 3.1.1 that regression models are often constructed in two parts, a deterministic part and a nondeterministic part. In ordinary linear regression, for example, the deterministic part of the model is a line and the nondeterministic part is a normal distribution around the points in the line. Section 3.1.4 showed why the popular normal distribution was inappropriate for learningperformance data. This section develops two appropriate candidates for the nondeterministic part of learningperformance models.
The learningperformance data with which Seer works includes y, the number of testing examples correctly classified. This is necessarily an integer between 0 and k, the number of testing examples. If the testing examples are selected independently, the number of correct classifications should be binomially distributed because it is the result of independent identically distributed Bernoulli trials. For example, if the expected accuracy of the classification rule being tested is 0.705 and it is tested on 100 examples, the binomial distribution will specify the probability that the classification rule will get 70 examples correct, the probability it will get 71 correct, 69 correct, and so on for the range from 0 to 100.
The binomial distribution is:
_{} (4.32)
where y is the number of correctly classified testing examples, k is the total number of testing examples, p is the probability that a given testing example will be classified correctly, and _{}is the number of distinct subsets of size y of a set of size k.
The binomial distribution accounts for the randomness in testing examples, but not for other sources of randomness. For example, a learning program might produce a classification rule based on 49 training examples that is more accurate (as measured on an extremely large number of testing examples) than one based on 50 training examples. This could happen with a decision tree learner, for example, if the learner tended to make a split with 50 training examples that it didn’t make when given only 49 examples. In other words, even with an infinite number of training examples, reallife learning curves may have an inherent roughness caused by the randomness in the training examples.
A reasonable way to model this roughness is with the beta distribution. Intuitively, it can be thought of as a normal distribution that is bounded by 0 and 1. The beta distribution is usually parameterized with two values, a and b. For our purposes, however, it is more convenient to parameterize it with p, its mean, and pvar, a parameter related to variance. With this parameterization the beta distribution has this density:
_{}, where p_{1 }ranges from 0 to 1. (4.33)
To see how this could be put together with the binomial distribution, consider a random learningperformancelike data generator. The input to the deterministic model would be m, the number of training examples. The output will be p, the expected accuracy before considering the randomness in the training and testing examples. This p and a pvar value are the input into a beta random generator. The output is p_{1 }the expected accuracy after taking into account the randomness in the training examples. This p_{1} along with k, the number of testing examples, is the input into a random binomial random generator. The output y is the number of correctly classified training examples according to the random generator.
Combining a beta distribution with a binomial distribution, as done here, creates a betabinomial distribution:
_{} (4.34)
where y is the number of testing examples, k is the number of testing examples classified correctly, p is the output of the deterministic part of the model, and pvar is related to the amount of roughness caused by randomness in the training examples. When pvar is zero, this distribution is becomes the binomial distribution.
The distribution is also known as the Polya, PolyaEggenberger, negative hypergeometric, and generalized hypergeometric type IIA [Johnson et al., 1992; Griffiths, 1973]. It has been used in the past to model the scores of students in a class on a test, but not, so far as the author knows, as the nondeterministic part of a regression model [Wilcox, 1981].
When used as part of a regression model, y and k will be given as part of the learningperformance data. The value p will be provided by the deterministic part of the model. The pvar parameter, however, will need to be estimated. The estimation method is analogous to estimating the variance of the data for linear regression except that instead of minimizing least square error, the method must find the value of pvar that maximizes likelihood. The next section has computational details on this.
Chapter 3 defined the criteria with which to select the best learningperformance model for a set of data (namely maximizing likelihood). The previous sections of this chapter detailed the sets of learningperformance models to be considered. This section specifies the numerical techniques Seer uses to efficiently find the model with the maximum likelihood. Efficiency is important because the algorithms used are iterative. The difference in speed between a good implementation and a poor one could easily be 1000 to 1, and even more if the poor implementation doesn’t converge.
For any given problem, Seer is given 1) learning performance data, 2) a nondeterministic model (either the binomial distribution or the betabinomial distribution) and a deterministic model (for example, model_{gen} with the BurrModel as model_{5}). Seer’s task is to find the parameters of the deterministic model (e.g. d and max) and the parameters of the nondeterministic model (e.g. none or pvar) that produce a model with the maximum likelihood with respect to the data. Such parameters are called the maximum likelihood estimators.
As posed, the problem is an instance of nonlinear programming. The nonlinearprogramming constraints relate to the parameters. For example, pvar and d must be greater than 0 and max must range between start and 1.0. The problem can be simplified by eliminating these constraints. This can be done by making the original parameters functions of parameters that can take any real value [Dixon, 1972, p. 89]. For example, d can be replaced with daug^{2} and pvar can be replaced with pvaraug^{2}. Then maximum likelihood estimators can be found for daug and pvaraug. The square root of these estimators will be the estimators for d and pvar. The max parameter can be replaced with
_{} (4.35)
which will guarantee that max will be in the range from start to 1.
Removing the constraints turns the problem from one of nonlinear programming into one of nonlinear optimization. Seer uses the very fast LevenbergMarquardt method of nonlinear optimization [Marquardt, 1963; Press et al., 1992]. This method smoothly varies between inverseHessian methods (like NewtonRaphson) and steepest descent methods. Its use normally requires both the first partial derivatives (the gradient) and second partial derivatives (the Hessian) of the function to optimize. Seer does require the first derivatives, but it avoids the need for the Hessian estimating the Hessian Q as
_{} (4.36)
where q_{i} is the vector of first partial derivatives evaluated on the ith learning performance tuple [Cramer, 1986, p. 27].
When the learningperformance data is independent, covariance can be approximated. The asymptotic covariance matrix of the estimated parameters can be approximated with Q^{1.} [Cramer, 1986]. When the generalized crossvalidation method of Section 3.2 is used, the data is not independent, so covariance cannot be estimated.
This chapter detailed the design and selection method of the learningperformance models that Seer uses. We considered the design problem in three parts: First, create deterministic models for noisefree learning of two equally probable classes. We saw that the two models inspired by computational learning theory, EDmodel and BurrModel, have a much different shape than the two most popular binaryresponse regression models. The second part of the model design was modeling the effect of multiple classes, skewed classes, and noise. By considering progressively more complex cases, we traced the development of the new model_{gen} heuristic. It models the effects of interest with remarkably few parameters. The third part of model design involved creating a nondeterministic model. We saw that the Binomial distribution should be sufficient if variation in testing data was our only concern. However, if we are also concerned about variation in the training data, then the BetaBinomial distribution might be better. Finally, after all the design choices were enumerated, Section 4.4 specified how to efficiently find the best model in a possibly infinite set of models. The basic method is nonlinear optimization with an iterative algorithm.
Given these design choices and this selection method, which sets of models work best on real learningperformance data? That is the topic of the next chapter.
The previous chapter dealt with designing various reasonable components of learningperformance models. In this chapter we assemble these components into five learningperformance model sets. Using Seer, each set is tested on three realworld learning domains . Each model set is tested on how well it can characterize learningperformance data (3 trials) and on how well it can predict learning performance data (30 trials). Figure 5.1 summarizes the experimental variables (domain, task, and learningperformance model set) investigated. The experiments show that many of the model sets characterize well and predict as well as the data permits. They suggest that the choice of a model set is a tradeoff between computational ease and the ability to measure interesting aspects of the learning problem.
Domain 
Description 
# Examples 
# Attr 
# Classes 
Soybean 
Soybean Disease Diagnosis 
307 
35 
19 
Heart 
Heart Disease Diagnosis 
303 
13 
2 
Audio 
Audiological Problems 
226 
69 
24 
´
Task 
# Trials 
Characterize 
3 
Predict 
30 
´
Model Set 
Det. #2 
Det. #1 
Nondet. 
Modelgen 
Burr_{1} 
model_{gen} 
binomial 
Logistic 
Logistic model 
 
binomial 
ED 
EDit 
model_{gen} 
binomial 
NoExp 
Burr_{1} 
model_{noexp} 
binomial 
Pvar 
Burr_{1} 
model_{gen} 
betabinomial 
Figure 5.1: Experimental variables (domain, task, and model set) explored.
This section details the learning examples and inductive learning system Seer used in the experiments. It details the domain and task experimental variables. The model set experimental variable is the subject of chapter 4. Also, the section discusses how results were measured and statistical significance determined.
The experiments were run on classified learning examples from three domains, the same as were used in Shavlik et al. [1991] The sets used were:
Soybean disease diagnosis (“Soy”)
These classified examples come originally from R.S. Michalski and R.L. Chilausky [1980]. The collection consists of 307 examples with 35 attributes labeled with one of 19 classes. The classified examples are available on the Internet in <ftp://ics.uci.edu/pub/machinelearningdatabases/soybean/>.
Heart disease diagnosis (“Heart”)
Robert Detrano, M.D., Ph.D., created these classified examples at the V.A. Medical Center, Long Beach and Cleveland Clinic Foundation. They were originally used in Sandhu et al., [1989]. The examples consist of 303 examples with 13 attributes and 2 classes. The classified examples are available on the Internet in <ftp://ics.uci.edu/pub/machinelearningdatabases/heartdisease/>.
Audiological problems (“Audio”)
Professor Jergen developed these classified examples at the Baylor College of Medicine. They were originally used in Bareiss & Porter [1987]. The examples consist of 226 examples with 69 attributes labeled with one of 24 classes. The examples are available on the Internet in <ftp://ics.uci.edu/pub/machinelearningdatabases/audiology/>.
Seer was tested on two tasks: characterization and prediction. For the characterization task, all available classified learning examples were used to generate learningperformance data (using the generalizedcrossvalidation procedure described in Section 3.2). From each model set, Seer found the maximumlikelihood model. Success at characterization was measured (as detailed in Section 5.1.4) by how well a model fit the data in terms of loglikelihood. In order to see the variation caused by generalized crossvalidation, three trials were run for each combination of domain and model set.
For the prediction task, 100 classified learning examples were randomly selected (without replacement) from all the available examples. Using generalized crossvalidation on the 100 examples, Seer produced learningperformance data. For each model set, Seer found the model with maximum likelihood with respect to the learningperformance data. Success at prediction was measured by how well the model, which was based on only 100 classified examples, predicted learningperformance when the learner was given the entire data set. Figure 5.2 illustrates the process. To see the variation caused by the selection of the 100 examples and by generalized crossvalidation, 30 trials were run for each combination of domain and model set. Section 5.1.4 details how Seer compared the models to observed learning performance.
All the experiments reported here used the C4.5 machinelearning program [Quinlan, 1992]. C4.5 does its initial learning with decision trees and then tries to improve its generalization and noise handling by turning the trees to rules. It was run with its default settings.
Results were measured primarily with loglikelihood:
L_{1} = log_{} (5.1)
where Pr is probability, D is observed learningperformance data, and M_{1} is a learningperformance model. Because probably (or likelihood) ranges between 0 and 1, loglikelihood is a negative number. One disadvantage of the loglikelihood measure is that it depends on the amount of learningperformance data. Doubling the amount of data tends to double the value of the loglikelihood. This is a common problem in the field of statistics. The usual solution is to develop a normalized measure of it. By analogy to the R^{2} measure of linear regression, these measures are usually named R^{2}_{X} where “X” is some identifier. The rest of these section will develop an R^{2}_{X} measure for use on learningperformance model.
Figure 5.2: Prediction task  From 100 classified examples, Seer predicts the learning performance if all most all available _{l} classified examples were available, where m_{all‑most‑all} =300, 290, 218 for the Soy, Heart, and Audio domains, respectively.
The measures needs to take as input:
1. A learningperformance model or, if repeated experiments are conducted, a set of learningperformance models. This model is produced by a system such as Seer from a small set of classified examples.
2. A likelihood criterion, for example, the binomial distribution
3. Learningperformance data based on a larger set of classified examples  Each datum is of the form <m, y, k>, where m is the number of training examples, y is the number of testing examples the learner's hypothesis classified correctly, and k is the number of testing examples.
For example, suppose we wish to compare “A” and “B”, two ways of producing learningperformance models. They could, for example, could select the maximumlikelihood model from two sets of candidate models. Suppose that we give each method three sets of learningperformance data, each based on 100 classified examples, and call the resulting models “A1,” “A2,” “A3,” “B1,” “B2,” and “B3.”. Next, we ask each of the six models to predict the learning performance when m is 300. Suppose they predict: p_{A1} = 0.89, p_{A2}=.091, p_{A3}=.095, p_{B1} = 0.81, p_{B2}=0.85, p_{B3}=0.89. To create the final piece of input information, we put the predictions to the test. For example, suppose we have a total of 400 examples and do 4way crossvalidation with these results:
{<300, 92, 100>, <300, 100, 100>, <300, 96, 100>, <300, 89, 100>} (5.2)
where the first element of each tuple is the number of training examples, the second element is the number of correctly classified testing examples, and the third element is the total number of testing examples.
Intuitively, the output should be a better measure for better predictions. Hosmer and Lemeshow [1989, p. 148] recommends this measure that returns a value between 0 and 1 for logistic regression:
_{} (5.3)
L_{1} is the loglikelihood of the model of interest. L_{0} is the loglikelihood of the baseline model, the model that predicts the same accuracy regardless of m. L_{S} is the loglikelihood of the perfect, saturated model. The saturated model is the model with as many parameters are there are distinct values of m; for each m, it returns exactly the frequency observed. It is the optimum model with respect to likelihood (assuming error is measured with a distribution such a the binomial or the normal).
For our purposes, R^{2}_{L} is unsatisfactory. The problem is that it assumes that L_{0}, the log likelihood of the baseline model has a worse fit than L_{1}. Because our model is being tested on learning data it has not seen before, the assumption does not necessarily hold.
A new measure that avoids the problems and is, thus, more suitable is:
R^{2}_{Seer} = L_{s} / L_{1} (5.4)
Returning to the example, if the likelihood criterion is the binomial distribution, loglikelihood is:
_{} (5.5)
where p is predicted accuracy, y is the number of testing examples classified correct, and k is the total number of testing examples. Hosmer's measure can ignore the combinatorial term binomial(k,y) because the term cancels out. The term does not cancel out of the R^{2}_{Seer} measure and so must be considered.
The loglikelihood of model “A1” with respect to observations is:
l1_{A1} = ll_{binom}(0.89,92,100) + ll_{binom}(0.89,100,100)
+ ll_{binom}(0.89,96,100) + ll_{binom}(0.89, 89,100) = 20.9849
(5.6)
The value of the saturated model when m=300 is minimized (for the binomial likelihood criterion) when the saturated model predicts the mean observed frequency [Cramer, 1986, p. 159], which in this case is 0.9425.
The loglikelihood of this saturated model is:
ll_{binom}(0.9425,92,100)
+ ll_{binom}(0.9425,100,100) + ll_{binom}(0.9425,96,100) + ll_{binom}(0.9425,
89,100)
= 14.2974 (5.7)
So R^{2}_{Seer} is 14.2974/20.9849 = 0.681316.
The loglikelihood and R^{2}_{Seer} values for all of the example models are given in Table 5.1.
Model 
Loglike. 
R^{2}_{Seer} 
A1 
20.9849 
0.681316 
A2 
17.2222 
0.830170 
A3 
14.5238 
0.984412 
B1 
43.9230 
0.325510 
B2 
31.1878 
0.458429 
B3 
20.9849 
0.681316 
Table 5.1: R^{2}_{Seer} values for six models
The R^{2}_{Seer} measure can be generalized to produce a single measure for a group of related models, using the fact that the loglikelihood of a set of independent models is the sum of the loglikelihood of each.
_{} (5.8)
Applied to our example, this yields:
R^{2}_{Seer}(A’s) = _{}= 0.813417 (5.9)
and
R^{2}_{Seer} (B’s) = _{}= 0.446349 (5.10)
Two methods of generating model sets can be compared by looking at composite R^{2}_{Seer} values, but it is hard to evaluate the statistical significance of the values. Seer solves this problem by using paireddifference analysis, a standard statistical method [McClave & Dietrich, 1988]. With paireddifference analysis, each modelgeneration method is compared, headtohead, in a series of trials In each trial, every individual modelgeneration method is applied to the same set of learningperformance data. As Table 5.2 illustrates  using the example from Section 5.1.4  the loglikelihoods of the two models are found and the difference is calculated. Finally, these difference values are evaluated with the twotailed Student ttest to determine if the mean of the differences is significantly different than zero.

A 
B 

Trial 
Loglike. 
Loglike. 
Diff. 
1 
20.9849 
43.9230 
22.9381 
2 
17.2222 
31.1878 
13.9656 
3 
14.5238 
20.9849 
6.4611 
Table 5.2: The first step of paireddifference analysis: For each trial’s learningperformance data set, find the loglikelihood of the model produced by each of the two modelgeneration methods. Then, calculate the difference. The set of differences is evaluated with the twotailed Student ttest to see if it is different than zero.
Each series of experiments tested a set of learningperformance models The goals of this chapter were to:
1) Find which (if any) of the model sets’ designs were useful for characterizing and predicting learning performance, and
2) Help develop criteria for choosing which of the useful models to use in a particular situation.
The Modelgen experiments tested the set of learningperformance models described in Table 5.3. The set’s deterministic component is made up of the model_{gen} and Burr_{1.} Its nondeterministic component is the binomial distribution. The three goals of the Modelgen experiments were 1) to measure how well model_{gen }can characterize machine learning performance data 2) to measure how well model_{gen} can predict machine learning performance 3) to establish a benchmark for the other experiments.
Component 
Submodel 
Value 
Det. #1 
model_{gen} 
_{} 
Det. #2 
Burr_{1} 
model_{5}[z] = z/(z+1) 
Nondet. 
binomial 
_{} 
Table 5.3: Learningperformance model set tested in the Modelgen experiments
Figure 5.3 shows the first of three characterization trials for each of the three domain. The characterization task can be thought of as 1) generating accuracy points and 2) fitting a curve to those points. Table 5.4 quantifies the goodnessoffit for all 3 trials in all 3 domains. The R^{2}_{Seer} values are near 1.0 indicating that model_{gen }can fit learningperformance data well. (To get a perfect R^{2}_{Seer} value of 1.0, a learningperformance model would need to go through every accuracy point, which is generally undesirable).
Domain 
Loglike. 
R^{2}_{Seer} 
Soy (3) 
7654.25 
0.974 
Heart (3) 
7568.45 
0.984 
Audio (3) 
5582.62 
0.955 
Table 5.4: Goodnessoffit for Modelgen on the characterization task  Fit is measured with loglikelihood and R^{2}_{Seer}. A perfect fit has a R^{2}_{Seer} value of 1.0. All measures are composites of three trials.
The next several subsections look at prediction with model_{gen.} We focus first on just the Audio domain. In each of the 30 trials, Seer was given 100 classified examples. From those, it tried to answer two questions: 1) what accuracy could be expected with 218 training examples, and 2) what accuracy could be expected given an infinite number of training examples (asymptotic accuracy).
Soy Heart Audio
Figure 5.3: Characterization trial #1 for the three domains using Modelgen  The points show the learningperformance data generated with generalized crossvalidation. The solid curve is the deterministic part of the maximum likelihood Modelgen model. The dotted line is the estimated asymptotic accuracy, the accuracy given an infinite number of training examples.

Seer’s first step toward answering these questions is to generate learningperformance data. The plots of Figure 5.4 represents the learningperformance data (for 3 of the trials) as accuracy points. (We will see in the next section that these 3 trials were the best, median, and worst of the 30 trials, as measured by prediction accuracy.)
The curves in Figure 5.5 show Seer’s predictions for these 3 trials. The dotted lines show its estimate of the asymptotic accuracy. In the plots, the diamond shows “true” accuracy given 218 training examples. (Empirically, determining the real true accuracy would require an infinite number of testing examples; This “true” value is determined with crossvalidation on all 226 available classified examples.)
Table 5.5 shows the accuracies Seer predicts would be observed if the learner was given 218 training examples or an infinite number of examples. Thirty trials were run; the best, median, and worst trial are shown in the table. The “true” accuracy, if an infinite number of training examples were provided, is estimated by the model fit on 226 examples in Section 5.2.1.1.

p_{218} 
p_{¥} 
Best 
0.788 
0.862 
Median 
0.717 
0.757 
Worst 
0.606 
0.619 
“True” 
0.788 
0.823 
Table 5.5: Modelgen’s predicted accuracies (based on 100 classified examples) if the learner is given 218 or an infinite number of training examples  The predictions from the best, median, and worst trials are shown. The “true” values are estimated from the full set of 226 classified examples.
Table 5.6 first quantifies how well the models fit the learningperformance data (shown in Figure 5.5) generated from the 100 classified examples. The models do well and have a composite R^{2}_{Seer} of 0.981. Second, the table quantifies how well the models predicted p_{218}. Here they do less well, with a composite R^{2}_{Seer} of 0.925. This suggests that the problem may not be with prediction given the 100 classified examples, but rather that the 100 classified examples are sometimes not representative of the full set of 218 classified examples We will further investigate this problem in the next section.
Best Median Worst
Figure 5.4: Audiodomain learningperformance data based on 100 classified examples for the best, median, and worst Modelgen trials  Seer must find the Modelgen model that best fits each set of data. This maximumlikelihood model is used to predict the expected learning performance if more training examples were available.

Best Median Worst
Figure 5.5: Audiodomain maximumlikelihood models for the best, median, and worst trials for Modelgen  The solid curves are the deterministic models. The dotted lines are the estimated asymptotic accuracies. The diamond shows the “true” accuracy given 218 training examples.


Characterize the 100 
Predict p_{218} 


Loglike. 
R^{2}_{Seer} 
Loglike. 
R^{2}_{Seer} 

Best 
1838.29 
0.979 
43.81 
1.000 

Median 
1847.15 
0.980 
46.76 
0.937 

Worst 
1902.73 
0.981 
60.74 
0.721 

All (30) 
55923.45 
0.981 
1420.50 
0.925 

Table 5.6: Modelgen goodnessoffit on learningperformance data from 1) the given 100 classified examples and 2) the full data set  The better the prediction, the closer the “Predict p_{218}” R^{2}_{Seer} value will be to 1.000.
This subsection looks at prediction with model_{gen }on all three domains. The curves in Figure 5.6 show Seer’s predictions for all 30 trials. The plots provide a graphical way to see the variation caused, mostly, by starting with different sets of 100 classified examples. In the plots, the diamond shows “true” accuracy given nearly all the available classified examples as training examples. In all three domains, some models predict too high, others too low. Particularly in the Soy and Heart domains, the models seem unbiased; that is, they are as likely to go too high as to go too low.
Table 5.7 quantifies the characterization and prediction in all three domains. For the Soy domain, the composite R^{2}_{Seer} for prediction, 0.982, is high, higher even than the R^{2}_{Seer} for characterization. For the Heart domain, the R^{2}_{Seer} for characterization is higher, 0.978, and the R^{2}_{Seer} for prediction is a little bit lower, 0.963. For the Audio domain, as we saw in the previous section, the R^{2}_{Seer} for characterization is high, but the R^{2}_{Seer} for prediction is relatively low.

Characterize the 100 
Predict p_{all most all} 

Domain 
Loglike. 
R^{2}_{Seer} 
Loglike. 
R^{2}_{Seer} 
Soy (30) 
54507. 
0.969 
1664.62 
0.982 
Heart (30) 
54166. 
0.978 
1446.20 
0.963 
Audio (30) 
55923. 
0.981 
1420.50 
0.925 
Table 5.7: Modelgen’s goodnessoffit on learningperformance data from 1) the given 100 classified examples and 2) the full data set  The better the prediction, the closer the “Predict p_{218}” R^{2}_{Seer} value will be to 1.000. All values are composites from the 30 trials.
Soy Heart Audio
Figure 5.6: Modelgen’s predictive models for 30 trials in 3 domains  The diamond shows the “true” accuracy.

Why don’t all the predictions hit the target exactly? We can get insight into this question by looking in more detail at the worst prediction in each domain. Figure 5.7 shows those predictions and the learningperformance data on which they are based. As the plots show and Table 5.8 confirms, even the worstpredicting models characterize the given learningperformance data well. The problem is that the learningperformance data from 100 classified examples is unrepresentative of the learningperformance of the full set of classified examples. This is most obvious in the Heart domain. Here, working with only 100 classified examples, Seer observed accuracies greater than 0.80, but working with the full set of 303 examples, the observed accuracy p_{290} was 0.752. This strongly suggests that the main cause of bad predictions is that a random subset of classified examples may be unrepresentative of a larger set. Put more optimistically, Seer  using model_{gen}  seems to be doing as well as possible with the classified learning examples it has available.

Characterize the 100 
Predict p_{all most all} 

Domain 
Loglike. 
R^{2}_{Seer} 
Loglike. 
R^{2}_{Seer} 
Soy (worst) 
1855. 
0.979 
66.28 
0.822 
Heart (worst) 
1710. 
0.968 
65.10 
0.713 
Audio (worst) 
1903. 
0.981 
60.74 
0.721 
Table 5.8: Modelgen’s goodnessoffit on learningperformance data from the worst trials
Another way to analyze the experiments is to plot the parameters estimated by each model. Because 30 trials were run for each domain, standard deviations can be estimated after the fact. The Modelgen model set has 4 parameters. Seer estimates two of the parameters, start and skew, from the class frequency of the classified examples. Seer estimates the other two parameters, d and max, from the learningperformance data. Figure 5.8 is a scatter plot for the Heart domain of the d and max parameters. In the plot, the letters “A”, “B”, and “C” are the parameter estimates in the 3 characterization trials. The numbers, 1 ... 30, are the parameters estimates for the 30 prediction trials. Note that the prediction points (which are based on only 100 classified examples) cluster well around the characterization points (which are based on the full sets of classified examples).
Heart
Audio Heart Soy
Figure 5.7: Modelgen maximumlikelihood models for the worst trials in the three domains  The models, represented by the solid curves, fit the available learningperformance data well. The learningperformance data, however, is misleading, so the predictions of the accuracy to be expected on the set of training examples are inaccurate.

Figure 5.8: Scatter plot for the Heart domain of the d and max Modelgen parameters. The letters shows the parameter values for the 3 characterization trials. The numbers show the parameter values in for the 30 prediction trials.
Table 5.9 gives the mean and observed standard deviation for all parameters in all three domains. The Soy and Heart parameters estimates seem unbiased, clustering around the “true” parameter values. Audio’s d estimate is off apparently because the estimates of max are too low, and this affects the estimated accuracy.
These experiments showed that Modelgen can characterize learningperformance data well. The R^{2}_{Seer} values for the three domains were 0.974, 0.984, and 0.955. The predictions were generally good, too, with composite R^{2}_{Seer} values (over all 30 trials) for the three domains of 0.982, 0.963, and 0.925. In the worst of the trials the predictions were not as good. (R^{2}_{Seer} values 0.822, 0.713, and 0.721.) Even in these cases, however, the Seer using Modelgen seemed to be doing about as well as possible. The problem seemed to be that some samples of classified examples were not representative of the entire set. The Soy and Heart parameter estimates seem unbiased, clustering around the “true” parameter values. Audio’s d estimate is seemed to be systematically off, perhaps because its estimates of max were too low.

Characterization (all available classified examples, 3 trials) 
Prediction (100 classifies examples, 30 trials) 


mean 
s.d. 
mean 
s.d. 


Soy 






d 
12.680 
0.1659 
14.253 
2.1784 


max 
0.999 
0.0018 
0.988 
0.0459 


start 
0.087 
0.0010 
0.093 
0.0089 


skew 
1.075 
0.0019 
1.081 
0.0169 


p_{300} 
0.840 
0.0209 
0.840 
0.0321 


Heart 





d 
6.129 
1.1067 
6.207 
4.4863 


max 
0.753 
0.0023 
0.760 
0.0516 


start 
0.503 
0.0000 
0.509 
0.0110 


skew 
1.015 
0.0000 
1.042 
0.0516 


p_{290} 
0.752 
0.0248 
0.749 
0.0452 


Audio 





d 
7.718 
4.3788 
4.635 
1.5372 


max 
0.884 
0.1016 
0.795 
0.0855 


start 
0.141 
0.0000 
0.149 
0.0162 


skew 
1.196 
0.0000 
1.191 
0.0345 


p_{218} 
0.788 
0.0272 
0.746 
0.0650 


Table 5.9: Mean and standard deviation of estimated Modelgen parameters
In the next series of experiments, logistic models were found for exactly the same learningperformance data as was generated for these Modelgen experiments. This will allow headtohead comparison between Modelgen, a model set that was designed especially for learning prediction, and the Logistic model set, the most popular model set for binaryresponse data.
The Logistic experiments tested the set of learningperformance models described in Table 5.12. The set’s deterministic component is the logistic model. Its nondeterministic component is the binomial distribution. This experiment was designed to see how bad (or good) logistic models are for learning data. The model set was applied to the characterization task and prediction task, then it was compared to Modelgen, trialbytrial.
Component 
Submodel 
Value 
Det. 
logistic 
_{} 
Nondet. 
binomial 
_{} 
Table 5.10: Learningperformance model set tested in the Logistic experiments
Table 5.11 and Figure 5.9 show the Logistic model set trying to characterize the learningperformance data from the three domains. The composite R^{2}_{Seer} values reported in the table show that the fit is not good. In two domains, it is less than 0.9. The plots in Figure 5.9 (from trial #1, but the other two trials were similar) show the problem: the curve defined by logistic models are not the right shape for the learningperformance data. The Logistic model set only does best in the Heart domain and from the plot we can see the reason is that this data is the most linear and hence is easiest to model.
Domain 
Loglike. 
R^{2}_{Seer} 
Soy (3) 
9023.51 
0.826 
Heart (3) 
7730.06 
0.963 
Audio (3) 
5971.10 
0.893 
Table 5.11: Goodnessoffit for Logistic on learningperformance data from the three domains
Given the problems the Logistic model set has predicting learning performance, we would expect it to do even worse at predicting from a subset of 100 classified examples. Table 5.12 and Figure 5.10 confirm this expectation. In all 30 trials and in all 3 domains, every prediction is off. Even worse, the predictions are very biased; every prediction is too optimistic. Even though all the R^{2}_{Seer} values for characterization of the 100 examples is greater than 0.90, the best prediction R^{2}_{Seer} value is a poor 0.501.

Characterize the 100 
Predict p_{all most all} 

Domain 
Loglike. 
R^{2}_{Seer} 
Loglike. 
R^{2}_{Seer} 
Soy (30) 
55725. 
0.947 
7100. 
0.230 
Heart (30) 
54648. 
0.970 
2778. 
0.501 
Audio (30) 
58556.99 
0.937 
2898. 
0.454 
Table 5.12: Logistic’s goodnessoffit on learningperformance data from 1) the given 100 classified examples and 2) the full data set  Values are composites from the 30 trials.
Heart
Soy Heart Audio
Figure 5.9: Characterizing learningperformance data with logistic models  The curves defined by the models do not fit the data well.

Soy Heart Audio
Figure 5.10: Logistic’s predictive models for 30 trials in 3 domains  All 90 predictions are too high.

Table 5.13 compares, using paireddifference analysis, the performance of the Logistic model set to that of the Modelgen model set in all three domains. Both the performance characterizing the learning performance and performance predicting learning performance are shown. The table tells if one model set (Modelgen or Logistic) produced significantly better learningprediction models than the other in each scenario. For Soy/Predict, for example, there were 30 learningprediction problems. For every problem, the two model sets each produced a learningprediction model. The loglikelihoods of these two learningprediction models provide a measure of their goodness. A twotailed ttest was used to test if the mean difference between the two loglikelihood values was significantly different from zero. For these tests an a value less than 0.05 was considered significant. For every combination of task and domain, the fit of the Modelgen models (as measured by loglikelihood) was on average better than the fit of the logistic models and this difference was statistically significant.
Scenario 
Task 
Domain 
# of Trials 
Sig.Diff? 
Better 
a 
test stat. 
1 
characterize 
soy 
3 
yes 
modelgen 
<0.001 
24.730 
2 
characterize 
heart 
3 
yes 
modelgen 
0.001 
12.946 
3 
characterize 
audio 
3 
yes 
modelgen 
<0.001 
26.300 
4 
predict 
soy 
30 
yes 
modelgen 
<0.001 
24.888 
5 
predict 
heart 
30 
yes 
modelgen 
<0.001 
6.750 
6 
predict 
audio 
30 
yes 
modelgen 
<0.001 
8.980 
Table 5.13: Results of paireddifference analysis  For all six combinations of task and domain, the Modelgen models produce statisticallysignificantly better loglikelihoods than the Logistic models.
Analysis:
} ¥
The logistic models
characterized learning performance poorly and predicted learning performance
even more poorly. Paireddifference analysis showed that the Modelgen models
were better than the logistic models in every combination of task and domain
investigated.
The ED experiments tested the set of learningperformance models described in Table 5.14. The set’s deterministic component is made up of the model_{gen} and EDit_{.} Its nondeterministic component is the binomial distribution. The purpose of this experiment was to see if the EDit model, which is based more closely on computational learning results, performs better than the simple Burr_{1} model.
Component 
Submodel 
Value 
Det. #1 
model_{gen} 
_{} 
Det. #2 
EDit 
model_{5}[z] = EDit^{1}[z], where EDit[p] = _{} 
Nondet. 
binomial 
_{} 
Table 5.14: Learningperformance model set tested in the ED experiments
Figure 5.11 shows a typical characterization trial. Table 5.15 quantifies all the characterization trials in all three domains. The ED model set does well; all R^{2}_{Seer} values are greater than 0.95.
Figure 5.11: Characterizing learningperformance data from the Heart domain (Trial #1) using a ED model
Domain 
Loglike. 
R^{2}_{Seer} 
Soy (3) 
7828.60 
0.952 
Heart (3) 
7570.18 
0.984 
Audio (3) 
5426.34 
0.983 
Table 5.15: Goodnessoffit for ED predicting learningperformance data in three domains
Figure 5.12 shows all the models found for all 30 prediction trials in the Audio domain. As Table 5.16 shows, the ED model set predicts well based on a subset of examples. The lowest R^{2}_{Seer} values as 0.939. Also, as the plot illustrates, the predictions are clustered around the true value.
Figure 5.12: ED’s predictive models for the 30 trials of the Audio domain  The diamond shows the “true” accuracy.

Characterize 
Predict 

Domain 
Loglike. 
R^{2}_{Seer} 
Loglike. 
R^{2}_{Seer} 
Soy(30) 
55315. 
0.954 
1669. 
0.980 
Heart(30) 
54184. 
0.978 
1467. 
0.949 
Audio(30) 
56057. 
0.979 
1400. 
0.939 
Table 5.16: ED’s goodnessoffit on learningperformance data from 1) the given 100 classified examples and 2) the full data set  Values are composites from the 30 trials.
Table 5.17 gives the mean and observed standard deviation for all parameters in all three domains. The values of start and skew are by definition exactly the same as for Modelgen. The ED model set estimated slightly more optimistic max values, but they did not seem to be reliably closer or further from the "true" max value. Both models have a d parameter, but the d’s are used differently within the model set and so are not directly comparable. In the Soy domain, the ratio between the two d’s is 9.2. In Heart, it is 6.3 and in Audio it is 8.3. On the other hand, ED’s d parameter is comparable to the VapnikChervonenkis (VC) dimension from computational learning theory. For example, the mean d value for the Soydomain characterization task was 47.692. After factoring out the effects of noise, multiple classes, and skew, a ED model will approximate the worstcase learning model of a learning task with a VC dimension of 48.

Characterization (all available classified examples, 3 trials) 
Prediction (100 classifies examples, 30 trials) 


mean 
s.d. 
mean 
s.d. 


Soy 






d 
1.186 
0.0086 
1.555 
0.2212 


max 
1.000 
0.0000 
0.997 
0.0171 


start 
0.087 
0.0010 
0.093 
0.0089 


skew 
1.075 
0.0019 
1.081 
0.0169 


p_{300} 
0.840 
0.0209 
0.814 
0.0220 


Heart 





d 
0.711 
0.1503 
0.979 
1.0641 


max 
0.760 
0.0035 
0.783 
0.0695 


start 
0.503 
0.0000 
0.509 
0.0110 


skew 
1.015 
0.0000 
1.042 
0.0516 


p_{290} 
0.752 
0.0248 
0.761 
0.0504 


Audio 





d 
0.562 
0.0629 
0.559 
0.1956 


max 
0.852 
0.0148 
0.840 
0.0934 


start 
0.141 
0.0000 
0.149 
0.0162 


skew 
1.196 
0.0000 
1.191 
0.0345 


p_{218} 
0.788 
0.0272 
0.764 
0.0640 


Table 5.17: Mean and standard deviation of estimated ED parameters
So how does the ED model set do compared to the Modelgen data set? The question is interesting because the two model sets are identical except that Modelgen uses the simple Burr_{1}, while ED uses the computationallearning inspired EDit. Table 5.18 shows which did best in each combination of task and domain. Modelgen characterized better in 2 of the 3 domains. For the more important prediction task, Modelgen did better in one domain, ED did better in another, and there was no significant difference in the third domain.
Analysis: There is no reliable, measurable performance difference for prediction between ED and Modelgen, but Modelgen may characterize better. Both models have the same number of parameters (two determined by the class frequency of the classified examples and two determined by fitting learningperformance data). Modelgen’s Burr_{1} component is simpler and computationally much easier to work with than the ED models’ EDit component. The ED model set is worth using if we want a d parameter we can interpret as the empirical analog of the VC dimension.
Scenario 
Task 
Domain 
Sig.Diff? 
Better 
a 
test stat. 
1 
characterize 
soy 
yes 
modelgen 
0.002 
10.868 
2 
characterize 
heart 
yes 
modelgen 
0.016 
3.702 
3 
characterize 
audio 
no 
 
0.144 
0.995 
4 
predict 
soy 
no 
 
0.180 
0.362 
5 
predict 
heart 
yes 
modelgen 
0.018 
1.847 
6 
predict 
audio 
yes 
ED 
<0.001 
3.745 
Table 5.18: Results of paireddifference analysis  Neither the Modelgen nor the ED predicts consistently better.
The NoExp experiments tested the set of learningperformance models described in Table 5.19. The set’s deterministic component is made up of the model_{noexp} and Burr_{1.} Its nondeterministic component is the binomial distribution. The purpose of this experiment was to see if an even simpler submodel than model_{gen }could be as effective. The alternative, model_{noexp}, still scales the curve between start and max, but it does not adjust for skew using exponentiation. The hope is that the d in model_{noexp} can take the role of both d and skew in model_{gen} and that exponentiation can be avoided, simplifying the model.
Component 
Submodel 
Value 
Det. #1 
model_{noexp} 
_{} 
Det. #2 
Burr_{1} 
model_{5}[z] = z/(z+1) 
Nondet. 
binomial 
_{} 
Table 5.19: Learningperformance model set tested in the NoExp experiments
Figure 5.13 shows a typical characterization trial. Table 5.20 quantifies all the characterization trials in all three domains. The NoExp model does well; all R^{2}_{Seer} values are greater than 0.96.
Figure 5.13: Characterizing learningperformance data from the Heart domain (Trial #1) with the NoExp model
Domain 
Loglike. 
R^{2}_{Seer} 
Soy (3) 
7736.67 
0.964 
Heart (3) 
7568.44 
0.984 
Audio (3) 
5424.74 
0.983 
Table 5.20: Goodnessoffit for NoExp characterizing learningperformance data from the three domains
Figure 5.14 shows all the models found for all 30 prediction trials in the Heart domain. As Table 5.21 shows, the NoExp models predicts fairly well based on a subset of examples but the lowest R^{2}_{Seer} value, 0.866 in Heart domain, is lower than we have seen before. The plot suggests a cause: in a minority of the Heartdomain trials, the estimate of max, the asymptotic accuracy, was far off.
Table 5.22 gives the mean and observed standard deviation for all parameters in all three domains. The d parameter of the NoExp model set is incomparable with both the d parameter of Modelgen and the d parameter of ED.

Characterize 
Predict 

Domain 
Loglike. 
R^{2}_{Seer} 
Loglike. 
R^{2}_{Seer} 
Soy(30) 
54903. 
0.962 
1657. 
0.987 
Heart(30) 
54283. 
0.976 
1607. 
0.866 
Audio(30) 
55972. 
0.980 
1420. 
0.926 
Table 5.21: NoExp’s goodnessoffit on learningperformance data from 1) the given 100 classified examples and 2) the full data set
Figure 5.14: NoExp’s predictive models for the 30 trials of the Heart domain

Characterization (all available classified examples, 3 trials) 
Prediction (100 classifies examples, 30 trials) 


mean 
s.d. 
mean 
s.d. 


Soy 






d 
39.561 
0.3780 
47.692 
7.5012 


max 
1.000 
0.0000 
0.995 
0.0268 


start 
0.087 
0.0010 
0.093 
0.0089 


p_{300} 
0.840 
0.0209 
0.828 
0.0244 


Heart 





d 
8.668 
1.5806 
10.866 
9.9616 


max 
0.753 
0.0024 
0.773 
0.0792 


start 
0.503 
0.0000 
0.509 
0.0110 


p_{290} 
0.752 
0.0248 
0.758 
0.0621 


Audio 





d 
15.259 
1.6522 
15.096 
6.6226 


max 
0.835 
0.0136 
0.825 
0.0992 


start 
0.141 
0.0000 
0.149 
0.0162 


p_{218} 
0.788 
0.0272 
0.762 
0.0683 


Table 5.22: Mean and standard deviation of estimated NoExp parameters
Table 5.23 compares NoExp to Modelgen on each combination of task and domain. For characterization, Modelgen did better in one domain, NoExp did better in the Heart domain, and there was no noticeable difference in the third domain. For prediction, in two domains, there was no significant difference in predictive power. In the Heart domain, Modelgen predicted better.
Scenario 
Task 
Domain 
Sig.Diff? 
Better 
a 
test stat. 
1 
characterize 
soy 
yes 
modelgen 
0.002 
8.963 
2 
characterize 
heart 
yes 
NoExp 
0.026 
2.846 
3 
characterize 
audio 
no 
 
0.105 
1.007 
4 
predict 
soy 
no 
 
0.157 
0.916 
5 
predict 
heart 
yes 
modelgen 
0.006 
2.321 
6 
predict 
audio 
no 
 
0.221 
0.139 
Table 5.23: Results of paireddifference analysis  Modelgen predicted better in one domain.
Analysis: In most of the scenarios investigated, the exponent in the Modelgen model set is not important. The d parameter can be adjusted to compensate for the lack of an exponent. The NoExp models have the advantage of being simpler and easier to work with. On the other hand, they gets its simplicity by conflating two parameters (skew and d), one of which can be determined without fitting by just by looking at class frequencies. Modelgen’s d parameter is thus more useful because it comes closer to quantifying the inherent hardness of learning a classification rule without regard to class frequencies and skew.
The Pvar experiments tested the set of learningperformance models described in Table 5.24. The set’s deterministic component is made up of the model_{gen} and Burr_{1}, just as in the Modelgen experiments_{.} Its nondeterministic component is the betabinomial distribution. This experiment was designed to tell if there is any significant difference between using the binomial distribution and using the more sophisticated betabinomial. The only difficulty in this comparison is that each distribution defines “goodness” differently. If one is much better than the other, however, it ought to be able to beat the other at prediction by either likelihood standard.
Table 5.26 gives the mean and observed standard deviation for all parameters in all three domains. A pvar value of 0 reduces the betabinomial distribution to the binomial distribution. On all three domains, for the characterization task, the mean pvar is several standard deviations away from 0, suggesting the parameter is significant. On the prediction task, however, the mean pvar value is a best two standard deviations from 0 and, in the Soy domain, only one standard deviation from 0. This suggests that, for these prediction tasks, the betabinomial distribution was not useful.
Component 
Submodel 
Value 
Det. #1 
model_{gen} 
_{} 
Det. #2 
Burr_{1} 
model_{5}[z] = z/(z+1) 
Nondet. 
betabinomial 
_{} 
Table 5.24: Learningperformance model set tested in the Pvar experiments
Table 5.26 compares Pvar to Modelgen on each combination of task and domain. Assuming that the binomial distribution is correct, Modelgen did significantly better in two of the six scenarios. In the other four, neither model set did significantly better. On the other hand, if we assume that betabinomial is correct, then Pvar does significantly better in 4 of the scenarios, Modelgen does significantly better in one, and neither does significantly better in the remaining scenario.
Analysis: Only prediction matters here (since each finds the best characterizing fit by its criteria). Neither model set consistently beats the other by the other’s criteria. Thus, while the betabinomial distribution has intuitive appeal as a way to measure variance in training examples, for these scenarios the simpler and more popular binomial distribution seems to suffice.

Characterization (all available classified examples, 3 trials) 
Prediction (100 classifies examples, 30 trials) 


mean 
s.d. 
mean 
s.d. 


Soy 






d 
12.773 
0.0887 
14.252 
2.1762 


max 
1.000 
0.0000 
0.988 
0.0458 


pvar 
0.013 
0.0016 
0.010 
0.0060 


start 
0.087 
0.0010 
0.093 
0.0089 


skew 
1.075 
0.0019 
1.081 
0.0169 


p_{300} 
0.840 
0.0209 
0.840 
0.0321 


Heart 





d 
6.113 
1.1021 
6.188 
4.5237 


max 
0.753 
0.0023 
0.760 
0.0516 


pvar 
0.002 
0.0009 
0.006 
0.0063 


start 
0.503 
0.0000 
0.509 
0.0110 


skew 
1.015 
0.0000 
1.042 
0.0516 


p_{290} 
0.752 
0.0248 
0.749 
0.0451 


Audio 





d 
5.067 
0.5460 
4.645 
1.5405 


max 
0.823 
0.0127 
0.795 
0.0856 


pvar 
0.006 
0.0027 
0.010 
0.0052 


start 
0.141 
0.0000 
0.149 
0.0162 


skew 
1.196 
0.0000 
1.191 
0.0345 


p_{218} 
0.788 
0.0272 
0.746 
0.0650 


Table 5.25: Mean and standard deviation of estimated Pvar parameters
Scenario 
Task 
Domain 
Sig.Diff? 
Better 
a 
test stat. 
Assuming the Binomial Distribution is Correct 

1 
characterize 
soy 
yes 
modelgen 
0.049 
1.912 
2 
characterize 
heart 
yes 
modelgen 
0.015 
3.840 
3 
characterize 
audio 
no 
 
0.144 
1.000 
4 
predict 
soy 
no 
 
0.150 
0.863 
5 
predict 
heart 
no 
 
0.155 
0.511 
6 
predict 
audio 
no 
 
0.140 
0.794 
Assuming the BetaBinomial Distribution is Correct 

1 
characterize 
soy 
yes 
pvar 
0.004 
7.425 
2 
characterize 
heart 
yes 
pvar 
0.025 
2.903 
3 
characterize 
audio 
no 
 
0.099 
1.078 
4 
predict 
soy 
yes 
modelgen 
<0.001 
7.091 
5 
predict 
heart 
yes 
pvar 
0.015 
1.947 
6 
predict 
audio 
yes 
pvar 
0.005 
2.405 
Table 5.26: Results of paireddifference analysis  Neither model predicts consistently better.
The goal of this chapter was to find which (if any) of the model set designs were useful for characterizing and predicting learning performance. The goal was also to help develop criteria for choosing which of the useful models to use in a particular situation.
To meet these goals, Seer generated 99 learningperformance data sets (3 characterization trials and 30 prediction trials ´ 3 domains) using generalized crossvalidation and thousands of runs of the C4.5 learning algorithm. For each of the 99 learningperformance data sets, Seer found the maximumlikelihood model from 5 model sets. Results were measured with loglikelihood and R^{2}_{Seer}, a measure of goodnessoffit. The maximumlikelihood models were compared to each other using paireddifference analysis.
The Logistic model set was the only one that did poorly on both characterization and prediction. This model set is not appropriate for learningperformance data. The other models all did well on characterization and seemed to do as well as possible on prediction. The poor prediction seen in some trials seemed mostly attributable to unrepresentative subsets of the classified examples.
One way to select among the good performing model sets would be to always use the simplest. Simplicity is often measured by number of parameters. Most of the models had two parameters that needed to be fitted to the learningperformance data (max and d) and two parameters (start and skew) that could be determined from the class frequency of the classified examples. The NoExp model set did not have a skew parameter. The Pvar model set had an additional parameter, pvar, to fit to the data. Simplicity can also be evaluated by how easy a model set is to work with, computationally. By this criterion, NoExp is again the simplest. The next simplest would be Modelgen, whereas. ED and Pvar would be the most complex.
Another way to select among candidate
model sets is to choose the most useful and meaningful parameters. For example,
the max parameter found in most of
the models is very useful because it is an estimate of asymptotic accuracy.
Likewise, start and skew quantify the effect of multiple
classes and skewed classes on the learning. Also, the pvar parameter tries to quantify variance in the training data.
Finally, most of the model sets have a d
parameter. This is a measure of learning difficulty after other parameterized
effects are factored out. Uniquely among the model sets, the ED model set
produces a d that is comparable to
the d produced through
VapnikChervonenkisdimension analysis. Thus, although one model set is worst,
selection of the best model set depends on a tradeoff between simplicity and
usefulness. Table
5.27 summarizes the tradeoff.
Model Set 
Fit 
Complexity 
Measured effects 
Modelgen 
good 
medium 
multiple classes, skew, noise, difficulty 
Logistic 
poor 
 
 
ED 
good 
high 
multiple classes, skew, noise, VCcomparable difficulty 
NoExp 
good 
low 
multiple classes, noise, difficulty 
Pvar 
good 
high 
multiple classes, skew, noise, variance from training examples 
Table 5.27: Tradeoffs in selection of a model set. Except for Logistic, all the model sets characterize well and predict as well as the data allows. Some of the model sets are more simple, but others have useful and interesting parameters.
This concluding chapter highlights the important points of the preceding chapters, suggests possibilities for future work, and discusses the principle contributions of this research.
Inductive classification learning is the process of creating a classification rule based on a set of training examples. Two questions are of interest when constructing an expert system from classified examples:
1. How many classified examples will be needed to create a classification rule of the desired accuracy?
2. If an unlimited number of examples were possible, what accuracy would be possible?
Seer answers these questions by first generating learningperformance data from a set of existing classified examples. Then, it fits a learningperformance model to the data.
Seer advances the start of the art by making regression on learningperformance data practicable. Theoretical approaches to modeling learning performance, when applicable, can make predictions over a broad scope of learning tasks. For practical problems, however, with noise and the most useful machinelearning algorithms, theory cannot yet predict averagecase performance. Empirical approaches such as Seer are more narrow in scope. They work by gathering observations of actual learningperformance and then generalizing it. (They are in essence empirically learning about empirical learning.) Seer’s basic approach, which goes back at least 75 years, is to use regression to fit a curve (a learningperformance model) to the observed learningperformance data. Seer differs from previous learningcurveregression work:
1. with learningperformance models that embody the best constraints for classification learning and most useful parameters
2. with algorithms that efficiently find maximumlikelihood models, and
3. with a demonstration on realworld data of a practicable application.
Formally, Seer’s task starts with learningperformance data in the form of a set of tuples, {<m_{1},y_{1},k_{1}>, <m_{1},y_{1},k_{1}>, ...}, where m_{i} is the number of training examples given to a learning program, y_{i} is the number of testing examples the classification rule classifies correctly, and k_{i} is the number of testing examples given to the learner’s hypothesis. When the total number of classified examples given to Seer is small, it uses a new procedure called generalized crossvalidation to generate learningperformance data (at a cost of some statistical independence). Although we can informally think of the models Seer fits to that data as “curves”, more formally a learningperformance model is a function. From the number of training examples, m, and the number of testing examples, k, a model predicts the probability of correctly classifying exactly y examples.
To create Seer, two questions needed answers: first, which sets of candidate learningperformance models should Seer consider, and, second, how should Seer efficiently find the maximumlikelihood model from a model set. The question of creating candidate models was treated as a threepart design problem. The first involved creating deterministic models for noisefree learning of two equally probable classes. Two models inspired by computational learning theory, EDmodel and BurrModel, were found to have a much different shape than the models of logit and probit regression. The second part of the design involved modeling the effect of multiple classes, skewed classes, and noise. The new model_{gen} heuristically models the effects of interest with remarkably few parameters. The third part involved creating a nondeterministic model. The binomial distribution should be sufficient if variation in testing data is the only concern. However, if variation in the training data is also a concern, then the betabinomial distribution might be better.
The second question of how to find the maximumlikelihood model cannot be solved with popular and simple techniques such as leastsquares regression because the assumptions of leastsquares regression, such as constant variance, are not reasonable for learningperformance data. The problem of finding the maximumlikelihood model for learningperformance data is an instance of nonlinear programming. Seer, however, overcomes the difficulties of nonlinear programming by exploiting problem constraints and using appropriate transformations to reduce the problem to nonlinear regression, which Seer solves with an efficient iterative algorithm.
Experiments were conducted to determine which, if any, of the model sets designs were useful for characterizing and predicting learning performance. Seer generated 99 learningperformance data sets (3 characterization trials and 30 prediction trials ´ 3 domains). It used generalized crossvalidation and thousands of runs of the C4.5 learning algorithm. For each of the 99 learningperformance data sets, Seer found the maximumlikelihood model from the 5 model sets. Results were measured with loglikelihood and R^{2}_{Seer}, a measure based on loglikelihood but normalized to the range 0.0 to 1.0. The maximumlikelihood models were compared headtohead using paireddifference analysis. All the model sets (except the logistic model set) did well on characterization and seemed to do as well as possible on prediction. The poor prediction seen in some trials seemed mostly attributable to unrepresentative subsets of the classified examples.
One way to select among the good performing model sets would be to always use the simplest. Of the model sets considered, the one named NoExp had fewer parameters than the other models and was the easiest with which to work computationally. Another way to select model sets, however, would be to choose the one with the most useful and meaningful parameters. For example, the max parameter found in most of the models is very useful because it is an estimate of asymptotic accuracy. Likewise, start and skew quantify and predict the effect of multiple classes and skewed classes on the learning. Also, the pvar parameter tries to quantify variance in the training data. Finally, most of the model sets have a d parameter. This is a measure of learning difficulty after other parameterized effects are factored out. Uniquely among the model sets, the ED model set produces a d that is comparable to the d produced through VapnikChervonenkisdimension analysis. Thus, except for the Logistic model set, all the model sets characterize well and predict as well as the data allows. Some of the model sets are more simple, but others have useful and interesting parameters. Selecting the best one will always involve tradeoffs.
This section discusses opportunities for future work. Three directions are considered: 1) expanding the amount of performancedata available, 2) exploring the pvar and d_{ED} model parameters, and 3) generalizing the models to be multivariate.
Infinite Data Supply  If our only interest is in estimating how many more classified examples we need to create an acceptable expert system, then Seer’s current method of generating learningperformance data suffices. However, what if we are interested in characterizing the behavior of our inductive learning under specific distributions of learning targets and learning examples? In that case, the distributions can be used to generate random targets and examples. This gives Seer an inexhaustible supply of classified examples. By using each of these synthetic examples only once, the learningperformance data Seer creates is statistically independent. Confidence intervals can be obtained for all parameter estimates either from the covariance matrix created during model fitting or via Monte Carlo simulation [Press et al. 1992, p. 695]. The down side to this method is that it is no longer based on realworld data, but rather on a syntheticdata generator.
The pvar and d_{ED} Parameters  This thesis presented the pvar parameter from the betabinomial distribution as a way to characterize variation in training examples. This parameter could be important because it captures a source of uncertainty that is otherwise ignored. For example, it could tell us that although one inductive learner appears to have greater predictive accuracy than another (as measured with ordinary cross validation) that the difference is not statistically significant because it could be explained by trainingexample variation. In the experiments of this thesis, however, the pvar parameter did not contribute, in a compelling way, to prediction accuracy. Further work is needed to determine if and when reliable pvar values can be estimated. One way to try to gain an understanding of the problem might be to use infinite supplies of data as suggested above.
The thesis also presented the ED model, a model designed to create a parameter d that would be comparable to the VapnikChervonenkis dimension. Haussler [1988] promotes the VCdimension, d_{VC} as a measure of inductive bias. Because d_{ED} can be determined in many cases where d_{VC} cannot (and visa versa), it may be a useful measure of inductive bias. Future work is needed to explore this application.
Multivariate Models  When predicting learning performance, the number of training examples is typically the most important dependent variable, but it is not the only one of possible interest. We might also want to model how other variables affect performance. Examples of such variables include:
· Choice of inductive learning algorithm; for example, C4.5 versus ID3
· Configuration of learning algorithm; for example, number of hidden units in a neuralnet algorithm
· Number of irrelevant attributes in the classified examples
· Complexity of the target; for example, number of disjuncts in the target
Some of these variables are nearly always available, for example, the learning algorithm and its configuration are typically known. Other variables, such as number of disjuncts in the target, would only be known for syntheticallygenerated learning problems.
The result of multivariate regression would be a model that might look something like the plot in Figure 6.15. It would map dependent variables into expected accuracy, capturing with only a few numbers the general effects of learningtask attributes.
Figure 6.15: Example of what a multivariate learningperformance regression model might look like. In this example, the vertical axis represents accuracy, the horizontal axis represents number of training examples, and the axis that goes into the page represents the number of disjuncts in the learning target.
The general contribution of the thesis was to show how regression could be used to practically and efficiently analyze inductive classification learning. More specifically, given a set of classified examples and a learning system, such as C4.5, the thesis presented a way to predict learning averagecase learning performance and asymptotic accuracy. The data could be noisy and could originate from multiple and skewed classes. Furthermore, the thesis developed and evaluated several reasonable and effective model sets of learning performance. These models varied in their simplicity and in the usefulness of their parameters. One model set, EDmodel, produced a d parameter comparable to the VapnikChervonenkis dimension from Computation Learning Theory. It developed a heuristic that predicts, without needing learningperformance data, the effect of skewed classes and multiple classes on learning. It developed a measure, pvar, of the variation caused by different trainingexample sets. (Most similar statistical methods only measure the variation caused by different testing example sets.) It argued that maximum likelihood is the most appropriate criterion for selecting a model within a family of models that best fits the data and it showed how such maximumlikelihood models could be found efficiently.. It introduced a new method called generalized crossvalidation to create learningperformance data from a finite set of classified examples, if dependencies in the data are allowed. It was observed that learningperformance data based on subsets of the classified examples are sometimes unrepresentative. The thesis identified likelihood as the most appropriate way to judge the goodnessoffit of a model to data and defined R^{2}_{Seer} (a new measure based on likelihood) as a convenient way to normalize likelihood. Finally, the thesis showed how to use pairdifference analysis to compare competing learningperformance models.
References
[Aldrich and Nelson, 1984] Aldrich, J. H. and Nelson, F. D. (1984). Linear probability, logit and probit models. Sage Publications, Beverly Hills, CA.
[Amari, 1993] Amari, S.i. (1993). A universal theorem on learning curves. Neural Networks, 6:161166.
[Anderson and Milson, 1989] Anderson, J. R. and Milson, R. (1989). Human memory: An adaptive perspective. Psychological Review, 96(4):703719.
[Bareiss and Porter, 1987] Bareiss, E. and Porter, B. (1987). Protos: An exemplarbased learning apprentice. In Proceedings of the Fourth International Workshop on Machine Learning, pages 1223, Irvine, CA. Morgan Kaufmann Publishers.
[Blumer et al., 1989] Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. (1989). Learnability and the VapnikChervonenkis dimension. J. ACM, 36(4):929965.
[Booker et al., 1989] Booker, L. B., Goldberg, D. E., and Holland, J. H. (1989). Classifier systems and genetic algorithms. Artificial Intelligence, 40(13):235282.
[Brewer, 1982] Brewer, K. (1982). Stratified Designs, pages 18. Wiley, New York.
[Buntine, 1989] Buntine, W. L. (1989). Decision tree induction systems: A Bayesian analysis. In Uncertainty in Artificial Intelligence, volume 3, pages 190197. Elsevier Science Publishers, Amsterdam.
[Catlett, 1991] Catlett, J. (1991). Inductive learning from subsets or disposable of excess training data considered harmful. Technical Report 402, Basser Department of Computer Science, The University of Sydney, N.S.W. Australia 2006.
[Cohn and Tesauro, 1992] Cohn, D. and Tesauro, G. (1992). How tight are the VapnikChevonenkis bounds? Neural Computation, 4:249269.
[Cooper and Herskovits, 1992] Cooper, G. F. and Herskovits, E. (1992). A Baysian method for the induction of probabilistic networks from data. Machine Learning, 9(4):309348.
[Cramer, 1986] Cramer, J. S. (1986). Econometric applications of maximum likelihood methods. Cambridge University Press, Cambridge, UK.
[Dixon, 1972] Dixon, L. C. W. (1972). Nonlinear optimization. English Universities Press, London.
[Ehrenfeucht and Haussler, 1989] Ehrenfeucht, A. and Haussler, D. (1989). Learning decision trees from random examples. Information and Computation, 82:231246.
[Gaines, 1989] Gaines, B. R. (1989). The quantification of knowledge: Formal foundations for acquisition methodologies. In Ras, Z. W., editor, Methodologies for Intelligent Systems, 4, volume 4, pages 137149, Charlotte, NC.
[Gevarter, 1987] Gevarter, W. (1987). The nature and evaluation of commercial expert system building tools. IEEE Computer, 21(5):2441.
[Gold, 1967] Gold, E. M. (1967). Language identification in the limit. Information and Control, 10:447474.
[Goldman and Sloan, 1992] Goldman, S. A. and Sloan, R. H. (1992). Can PAC learning algorithms tolerate random attribute noise? Technical Report WUCS9225, Department of Computer Science, Washington University, St. Louis.
[Griffiths, 1973] Griffiths, D. A. (1973). Maximum likelihood estimation for the betabinomial distribution and an application to the household distribution of the total number of cases of disease. Biometrics, 29:637649.
[Haussler, 1988] Haussler, D. (1988). Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, 36:177221.
[Hinton, 1989] Hinton, G. E. (1989). Connectionist learning procedures. Artificial Intelligence, 40(13):185234.
[Hirschbert and Pazzani, 1991] Hirschbert, D. and Pazzani, M. (1991). Averagecase analysis of a kCNF learning algorithm. Technical Report 9150, Department of Information and Computer Science, University of California, Irvine, CA.
[Hosmer and Lemeshow, 1989] Hosmer, D. W. and Lemeshow, S. (1989). Applied logistic regression. Wiley, New York.
[Iba and Langley, 1992] Iba, W. and Langley, P. (1992). Induction of onelevel decision trees. In Proceedings of the Ninth International Workshop on Machine Learning, pages 233240, Aberdeen, Scotland. Morgan Kaufmann Publishers.
[Johnson et al., 1992] Johnson, N. L., Kotz, S., and Kemp, A. W. (1992). Univariate discrete distributions. John Wiley Sons, New York, 2nd edition.
[Kadie, 1991] Kadie, C. M. (1991). Quantifying the values of constructive induction, knowledge, and noise filtering on inductive learning. In Machine Learning: Proceedings of the Eighth International Workshop, pages 153157, Evanston, Illinois. Morgan Kaufmann Publishers.
[Kearns and Schapire, 1989] Kearns, M. J. and Schapire, R. E. (1989). Efficient distributionfree learning of probabilistic concepts. In Proceedings of the 30th Annual IEEE Symposium on Foundations of Computer Science, pages 382391, Research Triangle Park, NC. IEEE Computer Society Press.
[Langley, 1994] Langley, P. (1994). Personal communication.
[Langley et al., 1992] Langley, P., Iba, W., and Thompson, K. (1992). An analysis of Bayesian classifiers. In Proceedings of the 1992 National Conference on Artificial Intelligence, pages 223228, San Jose, CA.
[Logan, 1992] Logan, G. D. (1992). Shapes of reactiontime distributions and shapes of learning curves: A test of the instance theory of automaticity. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18(5):883914.
[Marquardt, 1963] Marquardt, D. (1963). Journal of the Society for Industrial and Applied Mathematics, volume 11.
[Mazur and Hastie, 1978] Mazur, J. E. and Hastie, R. (1978). Learning as accumulation: A reexamination of the learning curve. Psychological Bulletin, 85(6):11561274.
[McClave and Dietrich, 1988] McClave, J. T. and Dietrich, F. H. (1988). Statistics. Dellen Publishing Company, San Francisco, fourth edition.
[McCullagh and Nelder, 1989] McCullagh, P. and Nelder, J. A. (1989). Generalized linear models. Chapman and Hall, London, 2nd edition.
[Michalski and Chilausky, 1980] Michalski, R. S. and Chilausky, R. L. (1980). Learning by being told and learning from examples: an experimental comparison of the two methods of knowledge acquisition in the context of developing and expert system for soybean disease diagnosis. Policy Analysis and Information Systems, 4(2).
[Minsky and Papert, 1988] Minsky, M. and Papert, S. (1988). The Perceptron: Expanded Edition. MIT Press, Cambridge, MA. Original edition published in 1969.
[Nelder, 1966] Nelder, J. A. (1966). Inverse polynomials, a useful group of multifactor response functions. Biometrics, 22:128141.
[Nelson, 1994] Nelson, F. D. (1994). Personal communication.
[Newell and Rosenbloom, 1981] Newell, A. and Rosenbloom, P. (1981). Mechanism of skill acquisition and the law of practice. In The Soar Papers, volume 1, pages 81135. MIT Press.
[Pazzani and Sarrett, 1990] Pazzani, M. and Sarrett, W. (1990). Averagecase analysis of conjunctive learning algorithms. In Proceedings of the Seventh International Conference on Machine Learning, pages 339347, Austin, TX. Morgan Kaufmann Publishers.
[Porter et al., 1990] Porter, B. W., Bareiss, R., and Hlote, R. C. (1990). Concept learning and heuristic classification in weaktheory domains. Artificial Intelligence, 45(3):229263.
[Press et al., 1992] Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (1992). Numerical recipes in C: the art of scientific computing. Cambridge University Press, Cambridge, UK, 2nd edition.
[Quinlan, 1991] Quinlan, J. (1991). Improved estimates for the accuracy of small disjuncts. Machine Learning, 6:9398.
[Quinlan, 1986] Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1).
[Quinlan, 1992] Quinlan, J. R. (1992). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
[Rendell, 1986] Rendell, L. A. (1986). A general framework for induction and a study of selective induction. Machine Learning, 1(2):177226.
[Sandhu et al., 1989] Sandhu, S., Guppy, K., Lee, S., and Froelicher, V. (1989). International application of a new probability algorithm for the diagnosis of coronary artery disease. American Journal of Cardiology, 64:304310.
[Shavlik et al., 1991] Shavlik, J. W., Mooney, R., and Towell, G. (1991). Symbolic and neural learning algorithms: An experimental comparison. Machine Learning, 6:111143.
[ShaweTaylor et al., 1993] ShaweTaylor, J., Anthony, M., and Biggs, N. (1993). Bounding sample size with the VapnikChervonenkis dimension. Discrete Applied Mathematics, 42(1):6573.
[Thurstone, 1919] Thurstone, L. L. (1919). The learning curve equation. Psychological Monographs, 26(114):51.
[Valiant, 1984] Valiant, L. G. (1984). A theory of the learnable. Comm. Assoc. Comp. Mach., 27(11):11341142.
[Weiss and Kulikowski, 1991] Weiss, S. and Kulikowski, C. (1991). Computer Systems that Learn. Morgan Kaufmann Publishers, San Mateo, CA.
[Wilcox, 1981] Wilcox, R. R. (1981). A review of the betabinomial model and its extensions. Journal of Educational Statistics, 6(1):332.
Vita
Carl Myers Kadie was born on July 4, 1962, in Denver, Colorado. He attended the University of Illinois at UrbanaChampaign, where he received a Bachelor of Science degree in computer science with highest honors in January 1995. While an undergraduate, he worked every other semester as a coop at IBM, Houston, on Space Shuttle software and related projects. Following graduation he worked for 8 months as a coop at IBM Research, Yorktown Heights, NY, on the prototype of the Prodigy online service.
In the fall of 1985 he began his graduate studies in computer science at the University of Illinois. He completed his Masters of Science degree in October of 1989 with a thesis on machine learning of robot operators from examples of operator effects. In 1989 he was awarded Fannie and John Hertz Fellowship; it supported his graduate work for the next five years. In August of 1994, he started work in the Advanced Technology Division of Microsoft Corporation as a software development engineer working on Internet features for MSN, the Microsoft Network. His Ph.D., which focused on using maximumlikelihood regression to predict learningspeed curves, was conferred in October 1995.
Publications
and Presentations
C. M. Kadie & D. C. Wilkins, “LearningSpeed Curves: Their Use In Induction Of Classification Expert Systems,” International Journal of Knowledge Acquisition (To appear).
C. M. Kadie, “Applying Library Intellectual Freedom Principles to Public and Academic Computers,” In Computerization and Controversy: Value, Conflict, and Social Choices, 2^{nd} Edition, edited by R. Kling, Academic Press, San Diego (To appear).
C. M. Kadie, “Computers and Academic Freedom News: List of Banned Computer Material on College Campuses,” In Conceptual Issues on the Electronic Frontier, edited by Peter Ludlow, MIT Press, Cambridge, Massachusetts (To appear).
C. M. Kadie, “The AAAI93 Robot Building Contest,” AI Magazine, 14:4, Winter 1993.
C. M. Kadie, “The Academic Freedom Model,” In Proceedings of the ACM’s Third Conference on Computers, Freedom and Privacy, San Francisco, March 1993. Association for Computing Machinery.
C. M. Kadie, “Computers and Academic Freedom,” Presented at the National Academy of Sciences Workshop on Rights and Responsibilities of Participants in Networked Communities, November, 1992.
C. M. Kadie, “Quantifying the Value of Constructive Induction, Knowledge, and Noise Filtering on Inductive Learning,” Proceedings of the Eighth International Conference on Machine Learning, Evanston, Illinois, 1991.
C. M. Kadie, “Continuous Conceptual Set Covering: Learning Robot Operators from Examples,” Proceedings of the Eighth International Conference on Machine Learning, Evanston, Illinois, 1991.
C. M. Kadie, “Conceptual Set Covering: Improving FitandSplit Algorithms,” Proceedings of the Seventh International Conference on Machine Learning, Austin, Texas, June 1990.
C. M. Kadie, “Book Review of R.C. Johnson and C. Brown’s Cognizers: Neural Networks and Machines that Think,” Artificial Intelligence Review, 4:4, 1990.
C. M. Kadie, “Quantifying the Effects of Representation, Noise, and Method on Inductive Learning,” Presented at the first Workshop on Computational Learning Theory and Natural Learning Systems, 1990.
C. M. Kadie, “DiffyS: Learning Robot Operator Schemata From Examples,” Proceedings of the Fifth International Conference on Machine Learning, Ann Arbor, Michigan, June 1988.
C. M. Kadie. “Rational Nonmonotonic Reasoning,” Proceedings of the Fourth Workshop on Uncertainty in Artificial Intelligence, Minneapolis, August 1988.