Christopher M. Bishop
Neural Networks for Pattern Recognition
This book, currently in its 12th print run, provides the first comprehensive treatment of neural networks from the perspective of statistical pattern recognition. It comprises 504 pages with 160 figures, 300 references, and 129 graded exercises, and is published by Oxford University Press.
For those entering the field of artificial neural networks, there has been an acute need for an authoritative textbook that explains the main ideas clearly and consistently using the basic tools of linear algebra, calculus, and simple probability theory. There have been many attempts to provide such a text, but until now, none has succeeded. Some authors have failed to separate the basic ideas and principles from the soft and fuzzy intuitions that led to some of the models as well as to most of the exaggerated claims. Others have been unwilling to use the basic mathematical tools that are essential for a rigorous understanding of the material. Yet others have tried to cover too many different kinds of neural network without going into enough depth on any one of them. The most successful attempt to date has been "Introduction to the Theory of Neural Computation" by Hertz, Krogh and Palmer. Unfortunately, this book started life as a graduate course in statistical physics and it shows. So despite its many admirable qualities it is not ideal as a general textbook.
Bishop is a leading researcher who has a deep understanding of the material and has gone to great lengths to organize it into a sequence that makes sense. He has wisely avoided the temptation to try to cover everything and has therefore omitted interesting topics like reinforcement learning, Hopfield networks and Boltzmann machines in order to focus on the types of neural network that are most widely used in practical applications. He assumes that the reader has the basic mathematical literacy required for an undergraduate science degree, and using these tools he explains everything from scratch. Before introducing the multilayer perceptron, for example, he lays a solid foundation of basic statistical concepts. So the crucial concept of overfitting is first introduced using easily visualised examples of one-dimensional polynomials and only later applied to neural networks. An impressive aspect of this book is that it takes the reader all the way from the simplest linear models to the very latest Bayesian multilayer neural networks without ever requiring any great intellectual leaps.
Although Bishop has been involved in some of the most impressive applications of neural networks, the theme of the book is principles rather than applications. Nevertheless, it is much more useful than any of the applications-oriented texts in preparing the reader for applying this technology effectively. The crucial issues of how to get good generalization and rapid learning are covered in great depth and detail and there are also excellent discussions of how to preprocess the input and how to choose a suitable error function for the output.
It is a sign of the increasing maturity of the field that methods which were once justified by vague appeals to their neuron-like qualities can now be given a solid statistical foundation. Ultimately, we all hope that a better statistical understanding of artificial neural networks will help us understand how the brain actually works, but until that day comes it is reassuring to know why our current models work and how to use them effectively to solve important practical problems.
In recent years neural computing has emerged as a practical technology, with successful applications in many fields. The majority of these applications are concerned with problems in pattern recognition, and make use of feed-forward network architectures such as the multi-layer perceptron and the radial basis function network. It has also become widely acknowledged that successful applications of neural computing require a principled, rather than ad hoc, approach. My aim in writing this book has been to provide a more focused treatment of neural networks than previously available, one which reflects these developments. By deliberately concentrating on the pattern recognition aspects of neural networks, it has become possible to treat many important topics in much greater depth. For example, density estimation, error functions, parameter optimization algorithms, data pre-processing, and Bayesian methods are each the subject of an entire chapter.
From the perspective of pattern recognition, neural networks can be regarded as an extension of the many conventional techniques which have been developed over several decades. Indeed, this book includes discussions of several concepts in conventional statistical pattern recognition which I regard as essential for a clear understanding of neural networks. More extensive treatments of these topics can be found in the many texts on statistical pattern recognition, including Duda and Hart (1973), Hand (1981), Devijver and Kittler (1982), and Fukunaga (1990). Recent review articles by Ripley (1994) and Titterington (1994) have also emphasized the statistical underpinnings of neural networks.
Historically, many concepts in neural computing have been inspired by studies of biological networks. The perspective of statistical pattern recognition, however, offers a much more direct and principled route to many of the same concepts. For example, the sum-and-threshold model of a neuron arises naturally as the optimal discriminant function needed to distinguish two classes whose distributions are normal with equal covariance matrices. Similarly, the familiar logistic sigmoid is precisely the function needed to allow the output of a network to be interpreted as a probability, when the distribution of hidden unit activations is governed by a member of the exponential family of distributions.
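To make the first of these claims concrete, a short worked version (in generic notation, not necessarily the book's own) runs as follows. Suppose the two classes have Gaussian class-conditional densities with means \(\boldsymbol{\mu}_1\), \(\boldsymbol{\mu}_2\) and a shared covariance matrix \(\boldsymbol{\Sigma}\). Bayes' theorem then gives

\[
P(\mathcal{C}_1 \mid \mathbf{x})
  = \frac{p(\mathbf{x} \mid \mathcal{C}_1)\,P(\mathcal{C}_1)}
         {p(\mathbf{x} \mid \mathcal{C}_1)\,P(\mathcal{C}_1) + p(\mathbf{x} \mid \mathcal{C}_2)\,P(\mathcal{C}_2)}
  = \frac{1}{1 + \exp(-a)} \equiv \sigma(a),
\]
where
\[
a = \ln \frac{p(\mathbf{x} \mid \mathcal{C}_1)\,P(\mathcal{C}_1)}{p(\mathbf{x} \mid \mathcal{C}_2)\,P(\mathcal{C}_2)}
  = \mathbf{w}^{\mathrm{T}} \mathbf{x} + w_0,
\qquad
\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2),
\]
\[
w_0 = -\tfrac{1}{2}\boldsymbol{\mu}_1^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_1
      + \tfrac{1}{2}\boldsymbol{\mu}_2^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_2
      + \ln \frac{P(\mathcal{C}_1)}{P(\mathcal{C}_2)}.
\]

So the optimal discriminant is a weighted sum of the inputs passed through the logistic sigmoid, which is exactly the form of a single sum-and-threshold unit with a sigmoidal activation.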
Use as a course text
This book is aimed at researchers in neural computing as well as those wishing to apply neural networks to practical applications. It is also intended to be used as the primary text for a graduate-level, or advanced undergraduate-level, course on neural networks. In this case the book should be used sequentially, and care has been taken to ensure that where possible the material in any particular chapter depends only on concepts developed in earlier chapters.
Exercises are provided at the end of each chapter, and these are intended to reinforce concepts developed in the main text, as well as to lead the reader through some extensions of these concepts. Each exercise is assigned a grading according to its complexity and the length of time needed to solve it, ranging from (*) for a short, simple exercise, to (***) for a more extensive or more complex exercise. Some of the exercises call for analytical derivations or proofs, while others require varying degrees of numerical simulation. Many of the simulations can be carried out using numerical analysis and graphical visualization packages, while others specifically require the use of neural network software. Often suitable network simulators are available as add-on tool-kits to the numerical analysis packages. No particular software system has been prescribed, and the course tutor, or the student, is free to select an appropriate package from the many available. A few of the exercises require the student to develop the necessary code in a standard language such as C or C++. In this case some very useful software modules written in C, together with background information, can be found in Press et al. (1992).
This book is intended to be largely self-contained as far as the subject of neural networks is concerned, although some prior exposure to the subject may be helpful to the reader. A clear understanding of neural networks can only be achieved with the use of a certain minimum level of mathematics. It is therefore assumed that the reader has a good working knowledge of vector and matrix algebra, as well as integral and differential calculus for several variables. Some more specific results and techniques which are used at a number of places in the text are described in the appendices.
Overview of the chapters
The first chapter provides an introduction to the principal concepts of pattern recognition. By drawing an analogy with the problem of polynomial curve fitting, it introduces many of the central ideas, such as parameter optimization, generalization and model complexity, which will be discussed at greater length in later chapters of the book. This chapter also gives an overview of the formalism of statistical pattern recognition, including probabilities, decision criteria and Bayes' theorem.
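As a rough sketch of this analogy (the sine-plus-noise data, the particular polynomial orders, and the use of NumPy are assumptions made for illustration, not material from the book):

import numpy as np

rng = np.random.default_rng(0)

# Ten training points: a sine function corrupted by a little Gaussian noise.
x_train = np.linspace(0.0, 1.0, 10)
t_train = np.sin(2.0 * np.pi * x_train) + 0.1 * rng.standard_normal(x_train.size)

# A dense grid of noise-free points stands in for "new" data.
x_test = np.linspace(0.0, 1.0, 100)
t_test = np.sin(2.0 * np.pi * x_test)

for order in (1, 3, 9):
    coeffs = np.polyfit(x_train, t_train, order)   # least-squares polynomial fit
    rms_train = np.sqrt(np.mean((np.polyval(coeffs, x_train) - t_train) ** 2))
    rms_test = np.sqrt(np.mean((np.polyval(coeffs, x_test) - t_test) ** 2))
    print(f"order {order}: train RMS {rms_train:.3f}  test RMS {rms_test:.3f}")

The ninth-order polynomial can pass almost exactly through the ten training points, so its training error is tiny while its error on the dense grid is large: the over-fitting behaviour that motivates the later discussion of generalization and model complexity.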
Chapter 2 deals with the problem of modelling the probability distribution of a set of data, and reviews conventional parametric and non-parametric methods, as well as discussing more recent techniques based on mixture distributions. Aside from being of considerable practical importance in their own right, the concepts of probability density estimation are relevant to many aspects of neural computing.
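As a small, hedged illustration of the non-parametric side of this material (a Parzen-window estimator with a Gaussian kernel; the data and the smoothing parameter h are invented for the sketch):

import numpy as np

def parzen_density(x, data, h):
    """Kernel (Parzen-window) density estimate with Gaussian kernels of width h."""
    x = np.atleast_1d(x).astype(float)[:, None]     # evaluation points, shape (M, 1)
    u = (x - data[None, :]) / h                     # scaled distances, shape (M, N)
    kernels = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / h                 # average the kernel contributions

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(1.0, 1.0, 200)])
grid = np.linspace(-4.0, 4.0, 9)
print(parzen_density(grid, data, h=0.3))

The smoothing parameter h plays much the same role here as the polynomial order above: too small a value produces a noisy, spiky estimate, while too large a value washes out the genuine structure in the data.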
Neural networks having a single layer of adaptive weights are introduced in Chapter 3. Although such networks have less flexibility than multi-layer networks, they can play an important role in practical applications, and they also serve to motivate several ideas and techniques which are applicable to more general network structures.
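A minimal sketch of such a network, here a single logistic output unit trained by simple gradient descent on two synthetic Gaussian classes (the data, learning rate, and iteration count are assumptions, not taken from the book):

import numpy as np

rng = np.random.default_rng(2)

# Two well-separated Gaussian classes in two dimensions, with targets 0 and 1.
X = np.vstack([rng.normal(loc=(-1.0, -1.0), size=(100, 2)),
               rng.normal(loc=(1.0, 1.0), size=(100, 2))])
t = np.concatenate([np.zeros(100), np.ones(100)])

w = np.zeros(2)   # weights
b = 0.0           # bias
eta = 0.1         # learning rate

for _ in range(500):
    y = 1.0 / (1.0 + np.exp(-(X @ w + b)))    # logistic sigmoid of a weighted sum
    w -= eta * X.T @ (y - t) / len(t)         # gradient of the cross-entropy error
    b -= eta * np.mean(y - t)

y = 1.0 / (1.0 + np.exp(-(X @ w + b)))
print("training accuracy:", np.mean((y > 0.5) == (t == 1)))

Even this simplest of networks already raises the questions of error functions, gradient-based training, and the interpretation of outputs that recur throughout the later chapters.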
Chapter 4 provides a comprehensive treatment of the multi-layer perceptron, and describes the technique of error back-propagation together with efficient methods for evaluating the Jacobian and Hessian matrices.

An alternative, and complementary, approach to representing general non-linear mappings is provided by radial basis function networks, and is discussed in Chapter 5. These networks are motivated from several distinct perspectives, and hence provide a unifying framework linking a number of different approaches.
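For concreteness, here is a small sketch of the exact-interpolation view of a radial basis function network, with Gaussian basis functions centred on the training inputs (the one-dimensional data and the width parameter are invented for the sketch):

import numpy as np

def rbf_design_matrix(x, centres, width):
    """Gaussian basis function activations: one row per input, one column per centre."""
    return np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2.0 * width ** 2))

# One-dimensional training data.
x_train = np.linspace(0.0, 1.0, 8)
t_train = np.sin(2.0 * np.pi * x_train)

width = 0.2
Phi = rbf_design_matrix(x_train, x_train, width)   # centres placed on the data points
w = np.linalg.solve(Phi, t_train)                  # weights for exact interpolation

# The network reproduces the training targets exactly (up to round-off)
# and interpolates smoothly in between.
x_new = np.array([0.25, 0.5, 0.75])
print(rbf_design_matrix(x_new, x_train, width) @ w)

Roughly speaking, replacing the exact solve by a pseudo-inverse solution with fewer basis function centres than data points gives the smoother, more practical form of the network discussed in the chapter.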
Several different error functions can be used for training neural networks, and these are motivated, and their properties examined, in Chapter 6. The circumstances under which network outputs can be interpreted as probabilities are discussed, and the corresponding interpretation of hidden unit activations is also considered.
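For example (in standard notation, used here only for illustration), the cross-entropy error for a two-class problem with targets \(t^n \in \{0,1\}\) and network outputs \(y^n\) is

\[
E = -\sum_n \left\{ t^n \ln y^n + (1 - t^n) \ln (1 - y^n) \right\},
\]

which is the negative log likelihood of the data when \(y^n\) is interpreted as the probability that pattern \(n\) belongs to class one.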
Chapter 7 reviews many of the most important algorithms for optimizing the values of the parameters in a network, in other words for network training. Simple algorithms, based on gradient descent with momentum, have serious limitations, and an understanding of these helps to motivate some of the more powerful algorithms, such as conjugate gradients and quasi-Newton methods.
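A minimal sketch of the simplest of these algorithms, gradient descent with a momentum term, on an artificial quadratic error surface (the surface, learning rate, and momentum coefficient are arbitrary choices for illustration):

import numpy as np

def grad_E(w):
    """Gradient of a quadratic error surface whose curvature differs greatly between axes."""
    return np.array([1.0 * w[0], 20.0 * w[1]])

w = np.array([5.0, 5.0])
velocity = np.zeros_like(w)
eta, mu = 0.05, 0.9          # learning rate and momentum coefficient

for _ in range(200):
    velocity = mu * velocity - eta * grad_E(w)   # smoothed descent direction
    w = w + velocity

print("final weights:", w)    # should end up close to the minimum at the origin

Without the momentum term the learning rate would have to be kept small enough for the steepest direction of the error surface, making progress along the shallow direction very slow; the more powerful methods described in the chapter address this trade-off more systematically.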
One of the most important factors in determining the success of a practical application of neural networks is the form of pre-processing applied to the data. Chapter 8 covers a range of issues associated with data pre-processing, and describes several practical techniques related to dimensionality reduction and the use of prior knowledge.
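As one concrete example of the dimensionality-reduction techniques covered here, a minimal principal component analysis sketch (the function name and the choice of two components are placeholders, not the book's code):

import numpy as np

def pca_reduce(X, n_components):
    """Project centred data onto its leading principal components."""
    X_centred = X - X.mean(axis=0)
    cov = np.cov(X_centred, rowvar=False)            # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    leading = eigvecs[:, ::-1][:, :n_components]     # columns = leading eigenvectors
    return X_centred @ leading

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=200)      # one nearly redundant input
print(pca_reduce(X, n_components=2).shape)           # (200, 2)

Directions of large variance are not necessarily the directions most useful for the subsequent classification or regression task, which is one reason the chapter also discusses the limitations of such unsupervised techniques.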
Chapter 9 provides a number of insights into the problem of generalization, and describes methods for addressing the central issue of model order selection. The key insight of the bias-variance trade-off is introduced, and several techniques for optimizing this trade-off, including regularization, are treated at length.
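In generic notation (for a single input point \(\mathbf{x}\), with \(y(\mathbf{x};\mathcal{D})\) the mapping learned from a data set \(\mathcal{D}\) and \(h(\mathbf{x})\) the conditional average of the target), the decomposition reads

\[
\mathbb{E}_{\mathcal{D}}\!\left[ \{ y(\mathbf{x};\mathcal{D}) - h(\mathbf{x}) \}^2 \right]
= \underbrace{\left\{ \mathbb{E}_{\mathcal{D}}[\, y(\mathbf{x};\mathcal{D}) \,] - h(\mathbf{x}) \right\}^2}_{\text{(bias)}^2}
+ \underbrace{\mathbb{E}_{\mathcal{D}}\!\left[ \{ y(\mathbf{x};\mathcal{D}) - \mathbb{E}_{\mathcal{D}}[\, y(\mathbf{x};\mathcal{D}) \,] \}^2 \right]}_{\text{variance}},
\]

and the simplest regularizer, weight decay, manages this trade-off by minimizing \(\tilde{E} = E + \frac{\nu}{2} \sum_i w_i^2\), with the coefficient \(\nu\) controlling the effective complexity of the model.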
The final chapter discusses the treatment of neural networks from a Bayesian perspective. As well as providing a more fundamental view of learning in neural networks, the Bayesian approach also leads to practical procedures for assigning error bars to network predictions and for optimizing the values of regularization coefficients.
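In outline (again in generic notation), the posterior distribution of the weights is obtained from the prior and the likelihood of the data set \(D\) via Bayes' theorem, and predictions follow by integration rather than from a single weight vector:

\[
p(\mathbf{w} \mid D) = \frac{p(D \mid \mathbf{w})\, p(\mathbf{w})}{p(D)},
\qquad
p(t \mid \mathbf{x}, D) = \int p(t \mid \mathbf{x}, \mathbf{w})\, p(\mathbf{w} \mid D)\, \mathrm{d}\mathbf{w}.
\]

The spread of this predictive distribution is what provides the error bars on network predictions mentioned above.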
Some useful mathematical results are derived in the appendices, relating to the properties of symmetric matrices, Gaussian integration, Lagrange multipliers, calculus of variations, and principal component analysis.
An extensive bibliography is included, which is intended to provide useful pointers to the literature rather than a complete record of the historical development of the subject.
Finally, I wish to express my considerable gratitude to the many people who, in one way or another, have helped with the process of writing this book. The first of these is Jenna, who has displayed considerable patience and good humour, notwithstanding my consistent underestimates of the time and effort required to complete this book. I am particularly grateful to a number of people for carefully reviewing draft material for this book, and for discussions which in one way or another have influenced parts of the text: Geoff Hinton, David Lowe, Stephen Luttrell, David MacKay, Alan McLachlan, Martin Moller, Radford Neal, Cazhaow Qazaz, Brian Ripley, Richard Rohwer, David Saad, Iain Strachan, Markus Svensen, Lionel Tarassenko, David Wallace, Chris Williams, Peter Williams and Colin Windsor. I would also like to thank Richard Lister for providing considerable assistance while I was typesetting the text in LaTeX. Lastly, I wish to thank the staff at Oxford University Press for their help in the final stages of preparing this book.
1. Statistical Pattern Recognition
Overview of pattern recognition. Non-linear mappings. The curse of dimensionality. Analogy with polynomial curve fitting. Simple examples. Generalization and model complexity. Probabilities and probability density functions. Bayes' theorem. Decision boundaries and discriminant functions. Loss matrices and rejection thresholds.
2. Probability Density Estimation
Parametric methods. The normal distribution. Maximum likelihood. Bayesian learning. The Robbins-Monro algorithm. Non-parametric methods. Histograms, kernel-based models and K-nearest-neighbours. Smoothing parameters. Gaussian mixture models. The EM algorithm. On-line parameter estimation.
3. Single-Layer Networks
Linear discriminant functions. Logistic discrimination. Linear separability. Least-squares techniques, including geometrical interpretation and pseudo-inverse solution. The perceptron convergence theorem. Fisher's discriminant and its relation to least-squares.
4. The Multi-layer Perceptron
Threshold and sigmoidal activation functions. Universality. Weight-space symmetries. Higher-order networks. Kolmogorov's theorem. Error back-propagation and its computational efficiency. The Jacobian matrix. The Hessian matrix including diagonal and outer product approximations and exact evaluation. Fast multiplication by the Hessian.
5. Radial Basis Functions
Exact interpolation. Smooth interpolation and the pseudo-inverse. Origins in regularization theory, noisy interpolation theory, potential function theory and probability density estimation. Radial basis functions for classification. Relation between radial basis functions and the multi-layer perceptron. Basis function optimization including orthogonal least squares and clustering algorithms. Links with Gaussian mixtures and density estimation.
6. Error Functions
Relation to maximum likelihood. Sum-of-squares error. Linear sum rules. Interpretation of network outputs as conditional averages and as conditional probabilities. Justification for outer product Hessian approximation. Inverse problems. Minkowski-R measure. Modelling conditional probability densities. Periodic variables. Sum-of-squares error for classification. Interpretation of hidden unit activations. Weighted sum-of-squares. Cross-entropy error for two classes. Hidden units as feature detectors. Multi-class cross entropy. Softmax activation function and error function. Entropy and information theory. General conditions for network outputs to represent probabilities.
7. Parameter Optimization Algorithms
Error surfaces. Local quadratic approximation. Use of gradient information. Weight initialization. Stopping criteria. Test error functions. Gradient descent, including the role of momentum. Batch versus sequential training. Convergence properties and limitations of gradient descent. Conjugate gradients. Scaled conjugate gradients. Newton's method. Quasi-Newton methods. Limited memory quasi-Newton methods. The Levenberg-Marquardt algorithm.
8. Pre-Processing and Feature Extraction
Motivation. Pre- and post-processing. Input normalization and encoding. Representation of discrete data. Missing data. Time series prediction. Feature selection criteria and search procedures. Principal component analysis and its limitations. Intrinsic dimensionality. Invariances and prior knowledge. Shared weights. Higher-order networks.
9. Learning and Generalization
The bias-variance decomposition. Regularization. Simple weight decay. Relation to early stopping. Training with noise. Soft weight sharing. Architecture selection. Binary classification problems. Cascade correlation. Weight saliency and pruning. Node pruning. Committees of networks. Mixture of experts. Model order selection. Cross-validation. Stacked generalization. Complexity criteria. VC dimension.
10. Bayesian Techniques
Bayesian learning of network weights. Choice of priors and noise model. Consistent priors. Posterior distribution of weights. A simple example of Bayesian learning. The Gaussian approximation. Distribution of network outputs. Application to classification problems. The evidence framework for hyper-parameters. Integration over hyper-parameters versus maximization. Bayesian model comparison. Committees revisited. Application of Bayesian techniques in practice. Monte-Carlo methods. Minimum description length.
A. Eigenvectors and eigenvalues of real symmetric matrices. Quadratic forms.
B. Gaussian integration.
C. Lagrange multipliers.
D. Calculus of variations.
E. Principal component analysis.