Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. [Book web pages]
Lasserre, J., C. M. Bishop, and T. Minka (2006). Principled hybrids of generative and discriminative models. In Proceedings 2006 IEEE Conference on Computer Vision and Pattern Recognition, New York. [PDF]
Abstract
When labelled training data is plentiful, discriminative techniques are widely used since they give excellent generalization performance. However, for large-scale applications such as object recognition, hand labelling of data is expensive, and there is much interest in semi-supervised techniques based on generative models in which the majority of the training data is unlabelled. Although the generalization performance of generative models can often be improved by `training them discriminatively', they can then no longer make use of unlabelled data. In an attempt to gain the benefit of both generative and discriminative approaches, heuristic procedures have been proposed which interpolate between these two extremes by taking a convex combination of the generative and discriminative objective functions. In this paper we adopt a new perspective which says that there is only one correct way to train a given model, and that a `discriminatively trained' generative model is fundamentally a new model. From this viewpoint, generative and discriminative models correspond to specific choices for the prior over parameters. As well as giving a principled interpretation of `discriminative training', this approach opens the door to very general ways of interpolating between generative and discriminative extremes through alternative choices of prior. We illustrate this framework using both synthetic data and a practical example in the domain of multi-class object recognition. Our results show that, when the supply of labelled training data is limited, the optimum performance corresponds to a balance between the purely generative and the purely discriminative.
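The heuristic interpolation mentioned in this abstract can be written down in one line. The form below is a sketch in assumed notation (labelled pairs $(\mathbf{x}_n, c_n)$, shared parameters $\theta$, mixing weight $\alpha$); the abstract itself gives no formula, so the symbols are illustrative rather than the authors' own.

```latex
\mathcal{L}_{\alpha}(\theta)
  = \alpha \sum_{n} \ln p(c_n \mid \mathbf{x}_n, \theta)
  + (1 - \alpha) \sum_{n} \ln p(\mathbf{x}_n, c_n \mid \theta),
  \qquad 0 \le \alpha \le 1
```

Under this reading, $\alpha = 1$ recovers the purely discriminative objective and $\alpha = 0$ the purely generative one.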
Bishop, C. M. and J. Lasserre (2006). Generative or discriminative? getting the best of both worlds. In Valencia ISBA Eighth World Meeting on Bayesian Statistics. To appear.
Szummer, M. and C. M. Bishop (2006). Discriminative writer adaptation. In 10th International Workshop on Frontiers in Handwriting Recognition (IWFHR).
Abstract
We propose a general method for adapting a writer-independent classifier to a specific writer. We employ a mixture of experts formulation, where each classifier is trained on weighted clusters of writers. The clusters are determined by which experts classify their writing correctly. The method adapts by choosing the appropriate combination of classifiers for a new user. It applies to any probabilistic discriminative classifier, and adapts discriminatively without modeling the input feature distribution. We apply the method to online character recognition. We employ a mixture of neural networks as well as a mixture of logistic regressions. We train the mixture via conjugate gradient ascent or via the EM algorithm on 192,000 Latin characters of 98 classes and 216 writers, and show adaptation results for 21 writers.
Bishop, C. M., S. Muggleton, A. Kuppermann, P. Moin, and N. Ferguson (2006). Prediction machines. In S. Emmott and S. Rison (Eds.), 2020 Science, pp. 34-35. Microsoft Research.
Winn, J. and C. M. Bishop (2005). Variational message passing. Journal of Machine Learning Research 6, 661-694. [PDF]
Abstract
This paper presents Variational Message Passing (VMP), a general purpose algorithm for applying variational inference to a Bayesian Network. Like belief propagation, Variational Message Passing proceeds by passing messages between nodes in the graph and updating posterior beliefs using local operations at each node. Each such update increases a lower bound on the log evidence (unless already at a local maximum). In contrast to belief propagation, VMP can be applied to a very general class of conjugate-exponential models because it uses a factorised variational approximation. Furthermore, by introducing additional variational parameters, VMP can be applied to models containing non-conjugate distributions. The VMP framework also allows the lower bound to be evaluated, and this can be used both for model comparison and for detection of convergence. Variational Message Passing has been implemented in the form of a general purpose inference engine called VIBES (`Variational Inference for BayEsian networkS') which allows models to be specified graphically and then solved variationally without recourse to coding.
Ulusoy, I. and C. M. Bishop (2005a). Comparison of generative and discriminative techniques for object detection and classification. In C. S. J. Ponce, M. Herbert and A. Zisserman (Eds.), Proceedings Sicily Workshop on Object Recognition, Sicily. To appear. [PDF]
Abstract
Many approaches to object recognition are founded on probability theory, and can be broadly characterized as either generative or discriminative according to whether or not the distribution of the image features is modelled. Generative and discriminative methods have very different characteristics, as well as complementary strengths and weaknesses. In this chapter we introduce new generative and discriminative models for object detection and classification based on weakly labelled training data. We use these models to illustrate the relative merits of the two approaches in the context of a data set of widely varying images of non-rigid objects (animals). Our results support the assertion that neither approach alone will be sufficient for large scale object recognition, and we discuss techniques for combining the strengths of generative and discriminative approaches.
Ulusoy, I. and C. M. Bishop (2005b). Generative versus discriminative models for object recognition. In Proceedings IEEE International Conference on Computer Vision and Pattern Recognition, CVPR, San Diego. [PDF]
Abstract
Many approaches to object recognition are founded on probability theory, and can be broadly characterized as either generative or discriminative according to whether or not the distribution of the image features is modelled. Generative and discriminative methods have very different characteristics, as well as complementary strengths and weaknesses. In this paper we introduce new generative and discriminative models for object detection and classification based on weakly labelled training data. We use these models to illustrate the relative merits of the two approaches in the context of a data set of widely varying images of non-rigid objects (animals). Our results support the assertion that neither approach alone will be sufficient for large scale object recognition, and we discuss techniques for combining the strengths of generative and discriminative approaches.
Bishop, C. M. and I. Ulusoy (2005). Object recognition via local patch labelling. In J. Winkler, M. Niranjan, and N. Lawrence (Eds.), Proceedings 2004 Workshop on Machine Learning, Sheffield, pp. 1-21. Springer. [PDF]
Abstract
In recent years the problem of object recognition has received considerable attention from both the machine learning and computer vision communities. The key challenge of this problem is to be able to recognize any member of a category of objects in spite of wide variations in visual appearance due to variations in the form and colour of the object, occlusions, geometrical transformations (such as scaling and rotation), changes in illumination, and potentially non-rigid deformations of the object itself. In this paper we focus on the detection of objects within images by combining information from a large number of small regions, or `patches', of the image. Since detailed hand-segmentation and labelling of images is very labour intensive, we make use of `weakly labelled' data in which the training images are labelled only according to the presence or absence of each category of object. A major challenge presented by this problem is that the foreground object is accompanied by widely varying background clutter, and the system must learn to distinguish the foreground from the background without the aid of labelled data. In this paper we first show that patches which are highly relevant for the object discrimination problem can be selected automatically from a large dictionary of candidate patches during learning, and that this leads to improved classification compared to direct use of the full dictionary. We then explore alternative techniques which are able to provide labels for the individual patches, as well as for the image as a whole, so that each patch is identified as belonging to one of the object categories or to the background class. This provides a rough indication of the location of the object or objects within the image. Again these individual patch labels must be learned on the basis only of overall image class labels. 
We develop two such approaches, one discriminative and one generative, and compare their performance both in terms of patch labelling and image labelling. Our results show that good classification performance can be obtained on challenging data sets using only weak training labels, and they also highlight some of the relative merits of discriminative and generative approaches.
Bishop, C. M., M. Svensén, and G. E. Hinton (2004). Distinguishing text from graphics in on-line handwritten ink. In F. Kimura and H. Fujisawa (Eds.), Proceedings Ninth International Workshop on Frontiers in Handwriting Recognition, IWFHR-9, Tokyo, Japan, pp. 142-147. [PDF]
Abstract
We present a system that parses handwritten digital ink collected on-line by separating text strokes from graphics strokes. It utilizes not just the characteristics of the strokes, but also the information provided by the gaps between the strokes, as well as the temporal characteristics of the stroke sequence. It is built using machine learning techniques that infer the internal parameters of the system from real digital ink, collected using a Tablet PC.
Krishnapuram, B., C. M. Bishop, and M. Szummer (2004). Generative models and Bayesian model comparison for shape recognition. In F. Kimura and H. Fujisawa (Eds.), Proceedings Ninth International Workshop on Frontiers in Handwriting Recognition, IWFHR-9, Tokyo, Japan, pp. 20-25. [PDF]
Abstract
Recognition of hand-drawn shapes is an important and widely studied problem. By adopting a generative probabilistic framework we are able to formulate a robust and flexible approach to shape recognition which allows for a wide range of shapes and can learn new shapes from a single exemplar. It also provides meaningful probabilistic measures of model score which can be used as part of a larger probabilistic framework for interpreting a page of ink. We also show how Bayesian model comparison allows the trade-off between fitting the data and model complexity to be optimized automatically.
Svensén, M. and C. M. Bishop (2004). Robust Bayesian mixture modelling. Neurocomputing 64, 235-252. [PDF]
Abstract
Bayesian approaches to density estimation and clustering using mixture distributions allow the automatic determination of the number of components in the mixture. Previous treatments have focussed on mixtures having Gaussian components, but these are well known to be sensitive to outliers. This can lead to excessive sensitivity to small numbers of data points and consequent over-estimates of the number of components. In this paper we develop a Bayesian approach to mixture modelling based on Student-$t$ distributions, which are heavier tailed than Gaussians and hence more robust. By expressing the Student-$t$ distribution as a marginalization over additional latent variables we are able to derive a tractable variational inference algorithm for this model, which includes Gaussian mixtures as a special case. Results on a variety of real data sets demonstrate the improved robustness of our approach.
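The latent-variable representation mentioned in this abstract is commonly written as a Gaussian scale mixture: a Gaussian whose precision is multiplied by a Gamma-distributed scale, which is then integrated out. The sketch below assumes that this is the representation meant, with hypothetical function names and a simple midpoint quadrature that is only adequate for moderate degrees of freedom. It checks numerically that the marginal matches the closed-form Student-$t$ density.

```python
import math


def student_t_pdf(x, mu, lam, nu):
    """Closed-form Student-t density with mean mu, precision lam, nu degrees of freedom."""
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                + 0.5 * math.log(lam / (math.pi * nu)))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(lam * (x - mu) ** 2 / nu))


def scale_mixture_pdf(x, mu, lam, nu, eta_max=60.0, steps=60000):
    """Integrate N(x | mu, 1/(eta*lam)) * Gamma(eta | nu/2, nu/2) over eta > 0.

    Uses a midpoint rule on (0, eta_max); eta_max and steps are illustrative choices.
    """
    a = b = nu / 2
    log_gamma_norm = a * math.log(b) - math.lgamma(a)
    h = eta_max / steps
    total = 0.0
    for i in range(steps):
        eta = (i + 0.5) * h
        # Gaussian with precision eta * lam, evaluated at x
        log_gauss = (0.5 * math.log(eta * lam / (2 * math.pi))
                     - 0.5 * eta * lam * (x - mu) ** 2)
        # Gamma density over the latent scale eta (shape a, rate b)
        log_gam = log_gamma_norm + (a - 1) * math.log(eta) - b * eta
        total += math.exp(log_gauss + log_gam)
    return total * h


if __name__ == "__main__":
    exact = student_t_pdf(1.5, 0.0, 1.0, 4.0)
    approx = scale_mixture_pdf(1.5, 0.0, 1.0, 4.0)
    print(exact, approx)
```

Conditioned on the latent scale, each component is Gaussian again, which is what makes a tractable variational treatment plausible.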
Bishop, C. M. and M. Svensén (2004). Robust Bayesian mixture modelling. In M. Verleysen (Ed.), Proceedings Twelfth European Symposium on Artificial Neural Networks, pp. 69-74. d-side. [PDF]
Bishop, C. M. (2004). Clumps, clusters and classification. In A. Herbert and K. S. Jones (Eds.), Computer Systems: Theory, Technology and Applications. A Tribute to Roger Needham, Computer Monographs, pp. 39-49. Springer.
Bishop, C. M. and M. Svensén (2003). Bayesian hierarchical mixtures of experts. In U. Kjaerulff and C. Meek (Eds.), Proceedings Nineteenth Conference on Uncertainty in Artificial Intelligence, pp. 57-64. Morgan Kaufmann. [PDF] [Postscript]
Abstract
The Hierarchical Mixture of Experts (HME) is a well-known tree-structured model for regression and classification, based on soft probabilistic splits of the input space. In its original formulation its parameters are determined by maximum likelihood, which is prone to severe over-fitting, including singularities in the likelihood function. Furthermore the maximum likelihood framework offers no natural metric for optimizing the complexity and structure of the tree. Previous attempts to provide a Bayesian treatment of the HME model have relied either on local Gaussian representations based on the Laplace approximation, or have modified the model so that it represents the joint distribution of both input and output variables, which can be wasteful of resources if the goal is prediction. In this paper we describe a fully Bayesian treatment of the original HME model based on variational inference. By combining `local' and `global' variational methods we obtain a rigorous lower bound on the marginal probability of the data under the model. This bound is optimized during the training phase, and its resulting value can be used for model order selection. We present results using this approach for data sets describing robot arm kinematics.
Bishop, C. M. and B. Frey (Eds.) (2003). Proceedings Ninth International Workshop on Artificial Intelligence and Statistics. January 3 - 6, Key West, Florida, Published on CD-ROM and on-line. [On line proceedings]
Bishop, C. M., A. Blake, and B. Marthi (2003). Super-resolution enhancement of video. In C. M. Bishop and B. Frey (Eds.), Proceedings Artificial Intelligence and Statistics, Key West, Florida. Society for Artificial Intelligence and Statistics. ISBN 0-9727358-0-1. [PDF] [Postscript]
Abstract
We consider the problem of enhancing the resolution of video through the addition of perceptually plausible high frequency information. Our approach is based on a learned data set of image patches capturing the relationship between the middle and high spatial frequency bands of natural images. By introducing an appropriate prior distribution over such patches we can ensure consistency of static image regions across successive frames of the video, and also take account of object motion. A key concept is the use of the previously enhanced frame to provide part of the training set for super-resolution enhancement of the current frame. Our results show that a marked improvement in video quality can be achieved at reasonable computational cost.
Bishop, C. M. and J. Winn (2003). Structured variational distributions in VIBES. In C. M. Bishop and B. Frey (Eds.), Proceedings Artificial Intelligence and Statistics, Key West, Florida. Society for Artificial Intelligence and Statistics. ISBN 0-9727358-0-1. [PDF] [Postscript]
Abstract
Variational methods are becoming increasingly popular for the approximate solution of complex probabilistic models in machine learning, computer vision, information retrieval and many other fields. Unfortunately, for every new application it is necessary first to derive the specific forms of the variational update equations for the particular probabilistic model being used, and then to implement these equations in application-specific software. Each of these steps is both time consuming and error prone. We have therefore recently developed a general purpose inference engine called VIBES (`Variational Inference for Bayesian Networks') which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. New models are specified as a directed acyclic graph using an interface analogous to a drawing package, and VIBES then automatically generates and solves the variational equations. The original version of VIBES assumed a fully factorized variational posterior distribution. In this paper we present an extension of VIBES in which the variational posterior distribution corresponds to a sub-graph of the full probabilistic model. Such structured distributions can produce much closer approximations to the true posterior distribution. We illustrate this approach using an example based on Bayesian hidden Markov models.
Bishop, C. M. and M. E. Tipping (2003). Bayesian regression and classification. In J. Suykens, G. Horvath, S. Basu, C. Micchelli, and J. Vandewalle (Eds.), Advances in Learning Theory: Methods, Models and Applications, Volume 190, pp. 267-285. IOS Press, NATO Science Series III: Computer and Systems Sciences. [PDF] [Postscript]
Abstract
In recent years Bayesian methods have become widespread in many domains such as computer vision, signal processing, information retrieval and genome data analysis. The availability of fast computers allows the required computations to be performed in reasonable time, and thereby makes the benefits of a Bayesian treatment accessible to an ever broadening range of applications. In this tutorial we give an overview of the Bayesian approach to pattern recognition in the context of simple regression and classification problems. We then describe in detail a specific Bayesian model for regression and classification called the `Relevance Vector Machine'. This overcomes many of the limitations of the widely used Support Vector Machines, whilst retaining the highly desirable property of sparseness.
Bishop, C. M., D. Spiegelhalter, and J. Winn (2003). VIBES: A variational inference engine for Bayesian networks. In S. Becker, S. Thrun, and K. Obermayer (Eds.), Advances in Neural Information Processing Systems, Volume 15, pp. 793-800. MIT Press. [PDF] [Postscript]
Abstract
In recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. For each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. Each of these steps is both time consuming and error prone. In this paper we describe a general purpose inference engine called VIBES (`Variational Inference for Bayesian Networks') which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. New models are specified either through a simple script or via a graphical interface analogous to a drawing package. VIBES then automatically generates and solves the variational equations. We illustrate the power and flexibility of VIBES using examples from Bayesian mixture modelling.
Tipping, M. E. and C. M. Bishop (2003). Bayesian image super-resolution. In S. Becker, S. Thrun, and K. Obermayer (Eds.), Advances in Neural Information Processing Systems, Volume 15, pp. 1303-1310. [PDF] [Postscript]
Abstract
The extraction of a single high-quality image from a set of low-resolution images is an important problem which arises in fields such as remote sensing, surveillance, medical imaging and the extraction of still images from video. Typical approaches are based on the use of cross-correlation to register the images followed by the inversion of the transformation from the unknown high resolution image to the observed low resolution images, using regularization to resolve the ill-posed nature of the inversion process. In this paper we develop a Bayesian treatment of the super-resolution problem in which the likelihood function for the image registration parameters is based on a marginalization over the unknown high-resolution image. This approach allows us to determine the unknown point spread function, and is rendered tractable through the introduction of a Gaussian process prior over images. Results indicate a significant improvement over techniques based on MAP (maximum a-posteriori) point optimization of the high resolution image.
Bishop, C. M. (2002). Discussion of `Bayesian treed generalized linear models' by H. A. Chipman, E. I. George and R. E. McCulloch. In J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith, and M. West (Eds.), Proceedings Seventh Valencia International Meeting on Bayesian Statistics, Volume 7, pp. 98-101. Oxford University Press. [PDF]
Lawrence, N. D., A. I. T. Rowstron, C. M. Bishop, and M. J. Taylor (2002). Optimising synchronisation times for mobile devices. In T. G. Dietterich, S. Becker, and Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems, Volume 14, pp. 1401-1408. MIT Press. [PDF] [Postscript]
Abstract
With the increasing number of users of mobile computing devices (e.g. personal digital assistants) and the advent of third generation mobile phones, wireless communications are becoming increasingly important. Many applications rely on the device maintaining a replica of a data-structure which is stored on a server, for example news databases, calendars and e-mail. In this paper we explore the question of the optimal strategy for synchronising such replicas. We utilise probabilistic models to represent how the data-structures evolve and to model user behaviour. We then formulate objective functions which can be minimised with respect to the synchronisation timings. We demonstrate, using two real world data-sets, that a user can obtain more up-to-date information using our approach.
Corduneanu, A. and C. M. Bishop (2001). Variational Bayesian model selection for mixture distributions. In T. Richardson and T. Jaakkola (Eds.), Proceedings Eighth International Conference on Artificial Intelligence and Statistics, pp. 27-34. Morgan Kaufmann. [PDF] [Postscript]
Abstract
Mixture models, in which a probability distribution is represented as a linear superposition of component distributions, are widely used in statistical modelling and pattern recognition. One of the key tasks in the application of mixture models is the determination of a suitable number of components. Conventional approaches based on cross-validation are computationally expensive, are wasteful of data, and give noisy estimates for the optimal number of components. A fully Bayesian treatment, based on Markov chain Monte Carlo methods for instance, will return a posterior distribution over the number of components. However, in practical applications it is generally convenient, or even computationally essential, to select a single, most appropriate model. Recently it has been shown, in the context of linear latent variable models, that the use of hierarchical priors governed by continuous hyperparameters whose values are set by type-II maximum likelihood, can be used to optimize model complexity. In this paper we extend this framework to mixture distributions by considering the classical task of density estimation using mixtures of Gaussians. We show that, by setting the mixing coefficients to maximize the marginal log-likelihood, unwanted components can be suppressed, and the appropriate number of components for the mixture can be determined in a single training run without recourse to cross-validation. Our approach uses a variational treatment based on a factorized approximation to the posterior distribution.
Rowstron, A. I. T., N. D. Lawrence, and C. M. Bishop (2001). Probabilistic modelling of replica divergence. In HotOS 2001. [PDF] [Postscript]
Abstract
It is common in distributed systems to replicate data. In many cases this data evolves in a consistent fashion and this evolution can be modelled. A probabilistic model of the evolution allows us to estimate the divergence of the replicas and can be used by the application to alter its behaviour, for example to control synchronisation times, to determine the propagation of writes, and to convey to the user information about how much the data may have evolved. In this paper, we describe how the evolution of the data may be modelled and outline how the probabilistic model may be utilised in various applications, concentrating on a news database example.
Lerner, B., W. F. Clocksin, S. Dhanjal, M. A. Hulten, and C. M. Bishop (2001a). Feature representation for the automatic analysis of fluorescence in-situ hybridization images. IEEE Transactions on Systems, Man and Cybernetics A 31(6), 655-665. [PDF]
Abstract
Fast and accurate analysis of fluorescence in-situ hybridization images for signal counting will depend mainly upon two components: a classifier to discriminate between artifacts and valid signals of several fluorophores (colors), and well discriminating features to represent the signals. Our previous work (2001) has focused on the first component. To investigate the second component, we evaluate candidate feature sets by illustrating the probability density functions and scatter plots for the features. The analysis provides first insight into dependencies between features, indicates the relative importance of members of a feature set, and helps in identifying sources of potential classification errors. Class separability yielded by different feature subsets is evaluated using the accuracy of several neural network-based classification strategies, some of them hierarchical, as well as using a feature selection technique making use of a scatter criterion. Although applied to cytogenetics, the paper presents a comprehensive, unifying methodology of qualitative and quantitative evaluation of pattern feature representation essential for accurate image classification. This methodology is applicable to many other real-world pattern recognition problems.
Lerner, B., W. F. Clocksin, S. Dhanjal, M. A. Hulten, and C. M. Bishop (2001b). Automatic signal classification in fluorescence in-situ hybridization images. Cytometry 43(2), 87-93. [PDF]
Abstract
Background: Previous systems for dot (signal) counting in Fluorescence in situ hybridization (FISH) images have relied on an auto-focusing method for obtaining a clearly defined image. Because signals are distributed in three dimensions within the nucleus and artifacts such as debris and background fluorescence can attract the focussing method, valid signals can be left unfocussed or unseen. This leads to dot counting errors, which increase with the number of probes. Methods: The approach described here dispenses with auto-focussing, and instead relies on a neural network (NN) classifier that discriminates between in and out-of-focus images taken at different focal planes of the same field of view. Discrimination is performed by the NN, which classifies signals of each image as valid data or artifacts (due to out of focussing). The image that contains no artifacts is the in-focus image selected for dot counting proportion estimation. Results: Using an NN classifier and a set of features to represent signals improves upon previous discrimination schemes that are based on nonadaptable decision boundaries and single-feature signal representation. Moreover, the classifier is not limited by the number of probes. Three classification strategies, two of them hierarchical, have been examined and found to achieve each between 83% and 87% accuracy on unseen data. Screening, while performing dot counting, of in and out-of-focus images based on signal classification suggests an accurate and efficient alternative to that obtained using an auto-focussing mechanism.
Bishop, C. M. and J. Winn (2000). Non-linear Bayesian image modelling. In Proceedings Sixth European Conference on Computer Vision, Dublin, Volume 1, pp. 3-17. Springer. [PDF] [Postscript]
Abstract
In recent years several techniques have been proposed for modelling the low-dimensional manifolds, or `subspaces', of natural images. Examples include principal component analysis (as used for instance in `eigen-faces'), independent component analysis, and auto-encoder neural networks. Such methods suffer from a number of restrictions such as the limitation to linear manifolds or the absence of a probabilistic representation. In this paper we exploit recent developments in the fields of variational inference and latent variable models to develop a novel and tractable probabilistic approach to modelling manifolds which can handle complex non-linearities. Our framework comprises a mixture of sub-space components in which both the number of components and the effective dimensionality of the sub-spaces are determined automatically as part of the Bayesian inference procedure. We illustrate our approach using two classical problems: modelling the manifold of face images and modelling the manifolds of hand-written digits.
Bishop, C. M. and M. E. Tipping (2000a). Variational relevance vector machines. In C. Boutilier and M. Goldszmidt (Eds.), Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pp. 46-53. Morgan Kaufmann. [PDF] [Postscript]
Abstract
The Support Vector Machine (SVM) of Vapnik has become widely established as one of the leading approaches to pattern recognition and machine learning. It expresses predictions in terms of a linear combination of kernel functions centred on a subset of the training data, known as support vectors. Despite its widespread success, the SVM suffers from some important limitations, one of the most significant being that it makes point predictions rather than generating predictive distributions. Recently Tipping has formulated the Relevance Vector Machine (RVM), a probabilistic model whose functional form is equivalent to the SVM. It achieves comparable recognition accuracy to the SVM, yet provides a full predictive distribution, and also requires substantially fewer kernel functions. The original treatment of the RVM relied on the use of type II maximum likelihood (the `evidence framework') to provide point estimates of the hyperparameters which govern model sparsity. In this paper we show how the RVM can be formulated and solved within a completely Bayesian paradigm through the use of variational inference, thereby giving a posterior distribution over both parameters and hyperparameters. We demonstrate the practicality and performance of the variational RVM using both synthetic and real world examples.
Bishop, C. M. and M. E. Tipping (2000b). Variational relevance vector machines. In V. Nunez-Anton and E. Ferreira (Eds.), Proceedings 15th International Workshop on Statistical Modelling, Bilbao, Spain, pp. 1-17. Universidad del Pais Vasco.
Abstract
The Support Vector Machine (SVM) of Vapnik (1998) has become widely established as one of the leading approaches to pattern recognition and machine learning. It expresses predictions in terms of a linear combination of kernel functions centred on a subset of the training data, known as support vectors. Despite its widespread success, the SVM suffers from some important limitations, one of the most significant being that it makes point predictions rather than generating predictive distributions. Recently Tipping (1999) has formulated the Relevance Vector Machine (RVM), a probabilistic model whose functional form is equivalent to the SVM. It achieves comparable recognition accuracy to the SVM, yet provides a full predictive distribution, and also requires substantially fewer kernel functions. The original treatment of the RVM relied on the use of type II maximum likelihood (the `evidence framework') to provide point estimates of the hyperparameters which govern model sparsity. In this paper we show how the RVM can be formulated and solved within a completely Bayesian paradigm through the use of variational inference, thereby giving a posterior distribution over both parameters and hyperparameters. We demonstrate the practicality and performance of the variational RVM using both synthetic and real world examples.
Bishop, C. M. (1999). Variational principal components. In Proceedings Ninth International Conference on Artificial Neural Networks, ICANN'99, Volume 1, pp. 509–514. IEE. [PDF] [Postscript]
Abstract
One of the central issues in the use of principal component analysis (PCA) for data modelling is that of choosing the appropriate number of retained components. This problem was recently addressed through the formulation of a Bayesian treatment of PCA (Bishop, 1998) in terms of a probabilistic latent variable model. A central feature of this approach is that the effective dimensionality of the latent space (equivalent to the number of retained principal components) is determined automatically as part of the Bayesian inference procedure. In common with most non-trivial Bayesian models, however, the required marginalizations are analytically intractable, and so an approximation scheme based on a local Gaussian representation of the posterior distribution was employed. In this paper we develop an alternative, variational formulation of Bayesian PCA, based on a factorial representation of the posterior distribution. This approach is computationally efficient, and unlike other approximation schemes, it maximizes a rigorous lower bound on the marginal log probability of the observed data.
Tipping, M. E. and C. M. Bishop (1999a). Mixtures of probabilistic principal component analyzers. Neural Computation 11(2), 443–482. [PDF] [Postscript]
Abstract
Principal component analysis (PCA) is one of the most popular techniques for processing, compressing and visualising data, although its effectiveness is limited by its global linearity. While nonlinear variants of PCA have been proposed, an alternative paradigm is to capture data complexity by a combination of local linear PCA projections. However, conventional PCA does not correspond to a probability density, and so there is no unique way to combine PCA models. Previous attempts to formulate mixture models for PCA have therefore to some extent been ad hoc. In this paper, PCA is formulated within a maximum-likelihood framework, based on a specific form of Gaussian latent variable model. This leads to a well-defined mixture model for probabilistic principal component analysers, whose parameters can be determined using an EM algorithm. We discuss the advantages of this model in the context of clustering, density modelling and local dimensionality reduction, and we demonstrate its application to image compression and handwritten digit recognition.
Tipping, M. E. and C. M. Bishop (1999b). Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B 61(3), 611–622. [PDF]
Abstract
Principal component analysis (PCA) is a ubiquitous technique for data analysis and processing, but one which is not based upon a probability model. In this paper we demonstrate how the principal axes of a set of observed data vectors may be determined through maximum-likelihood estimation of parameters in a latent variable model closely related to factor analysis. We consider the properties of the associated likelihood function, giving an EM algorithm for estimating the principal subspace iteratively, and discuss, with illustrative examples, the advantages conveyed by this probabilistic approach to PCA.
Cornford, D., I. T. Nabney, and C. M. Bishop (1999). Neural network-based wind vector retrieval from satellite scatterometer data. Neural Computing and Applications 8, 206–217. [PDF] [Postscript]
Abstract
Obtaining wind vectors over the ocean is important for weather forecasting and ocean modelling. Several satellite systems used operationally by meteorological agencies utilise scatterometers to infer wind vectors over the oceans. In this paper we present the results of using novel neural network based techniques to estimate wind vectors from such data. The problem is partitioned into estimating wind speed and wind direction. Wind speed is modelled using a multi-layer perceptron (MLP) and a sum of squares error function. Wind direction is a periodic variable and a multi-valued function for a given set of inputs; a conventional MLP fails at this task, and so we model the full periodic probability density of direction conditioned on the satellite derived inputs using a Mixture Density Network (MDN) with periodic kernel functions. A committee of the resulting MDNs is shown to improve the results.
McGrogan, N., C. M. Bishop, and L. Tarassenko (1999). Neural network training using multi-channel data with aggregate labelling. In Proceedings Ninth International Conference on Artificial Neural Networks, ICANN'99, Volume 2, pp. 862–867. IEE. [HTML]
Abstract
The solution of classification problems using statistical techniques requires appropriately labelled training data. In the case of multi-channel data, however, the labels may only be available in aggregate form rather than as separate labels for each individual channel. Standard techniques, using a trained model to predict each channel separately, are therefore precluded. In this paper we present a new method of training neural network classifiers from aggregate labels. This technique allows the network to learn what significant events on individual channels result in the given labels. We apply this training method to two synthetic (but, in the second case, realistic) problems and compare the results with those from a classifier trained on the accurate channel labels, which would usually not be available. On previously unseen test data for the two problems 97.75% and 99.1% of feature vectors were classified correctly. These represent reductions of only 0.5% and 0.1% from classifiers trained on accurate labels for all channels.
Bishop, C. M. (1999a). Pattern recognition and feedforward neural networks. In R. A. Wilson and F. C. Keil (Eds.), The MIT Encyclopedia of the Cognitive Sciences, pp. 629–631. MIT Press. [PDF]
Bishop, C. M. (1999b). Latent variable models. In M. I. Jordan (Ed.), Learning in Graphical Models, pp. 371–403. MIT Press. [PDF] [Postscript]
Abstract
A powerful approach to probabilistic modelling involves supplementing a set of observed variables with additional latent, or hidden, variables. By defining a joint distribution over visible and latent variables, the corresponding distribution of the observed variables is then obtained by marginalization. This allows relatively complex distributions to be expressed in terms of more tractable joint distributions over the expanded variable space. One well-known example of a hidden variable model is the mixture distribution in which the hidden variable is the discrete component label. In the case of continuous latent variables we obtain models such as factor analysis. The structure of such probabilistic models can be made particularly transparent by giving them a graphical representation, usually in terms of a directed acyclic graph, or Bayesian network. In this chapter we provide an overview of latent variable models for representing continuous variables. We show how a particular form of linear latent variable model can be used to provide a probabilistic formulation of the well-known technique of principal components analysis (PCA). By extending this technique to mixtures, and hierarchical mixtures, of probabilistic PCA models we are led to a powerful interactive algorithm for data visualization. We also show how the probabilistic PCA approach can be generalized to non-linear latent variable models leading to the Generative Topographic Mapping algorithm (GTM). Finally, we show how GTM can itself be extended to model temporal data.
Bishop, C. M. (1999c). Bayesian PCA. In M. S. Kearns, S. A. Solla, and D. A. Cohn (Eds.), Advances in Neural Information Processing Systems, Volume 11, pp. 382–388. MIT Press. [PDF]
Abstract
The technique of principal component analysis (PCA) has recently been expressed as the maximum likelihood solution for a generative latent variable model. In this paper we use this probabilistic reformulation as the basis for a Bayesian treatment of PCA. Our key result is that the effective dimensionality of the latent space (equivalent to the number of retained principal components) can be determined automatically as part of the Bayesian inference procedure. An important application of this framework is to mixtures of probabilistic PCA models, in which each component can determine its own effective complexity.
Maass, W. and C. M. Bishop (1998). Pulsed Neural Networks. MIT Press. [Information about this book]
Bishop, C. M. (1998). Neural Networks and Machine Learning. Springer. [Information about this book]
Bishop, C. M. and M. E. Tipping (1998). A hierarchical latent variable model for data visualization. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(3), 281–293. [PDF]
Abstract
Visualization has proven to be a powerful and widely-applicable tool for the analysis and interpretation of multi-variate data. Most visualization algorithms aim to find a projection from the data space down to a two-dimensional visualization space. However, for complex data sets living in a high-dimensional space it is unlikely that a single two-dimensional projection can reveal all of the interesting structure. We therefore introduce a hierarchical visualization algorithm which allows the complete data set to be visualized at the top level, with clusters and sub-clusters of data points visualized at deeper levels. The algorithm is based on a hierarchical mixture of latent variable models, whose parameters are estimated using the expectation-maximization algorithm. We demonstrate the principle of the approach on a toy data set, and we then apply the algorithm to the visualization of a synthetic data set in 12 dimensions obtained from a simulation of multi-phase flows in oil pipelines, and to data in 36 dimensions derived from satellite images. A Matlab software implementation of the algorithm is publicly available from the World Wide Web.
Bishop, C. M., M. Svensén, and C. K. I. Williams (1998). Developments of the Generative Topographic Mapping. Neurocomputing 21, 203–224. [PDF] [Postscript]
Abstract
The Generative Topographic Mapping (GTM) model was introduced by Bishop et al. (1998) as a probabilistic re-formulation of the self-organizing map (SOM). It offers a number of advantages compared with the standard SOM, and has already been used in a variety of applications. In this paper we report on several extensions of the GTM, including an incremental version of the EM algorithm for estimating the model parameters, the use of local subspace models, extensions to mixed discrete and continuous data, semi-linear models which permit the use of high-dimensional manifolds whilst avoiding computational intractability, Bayesian inference applied to hyper-parameters, and an alternative framework for the GTM based on Gaussian processes. All of these developments directly exploit the probabilistic structure of the GTM, thereby allowing the underlying modelling assumptions to be made explicit. They also highlight the advantages of adopting a consistent probabilistic framework for the formulation of pattern recognition algorithms.
Lawrence, N., C. M. Bishop, and M. Jordan (1998). Mixture representations for inference and learning in Boltzmann machines. In Uncertainty in Artificial Intelligence, Volume 14, pp. 320–327. Morgan Kaufmann. [Postscript]
Abstract
Boltzmann machines are undirected graphical models with two-state stochastic variables, in which the logarithms of the clique potentials are quadratic functions of the node states. They have been widely studied in the neural computing literature, although their practical applicability has been limited by the difficulty of finding an effective learning algorithm. One well-established approach, known as mean field theory, represents the stochastic distribution using a factorized approximation. However, the corresponding learning algorithm often fails to find a good solution. We conjecture that this is due to the implicit uni-modality of the mean field approximation which is therefore unable to capture multi-modality in the true distribution. In this paper we use variational methods to approximate the stochastic distribution using multi-modal mixtures of factorized distributions. We present results for both inference and learning to demonstrate the effectiveness of this approach.
Bishop, C. M., M. Svensén, and C. K. I. Williams (1998). GTM: the Generative Topographic Mapping. Neural Computation 10(1), 215–234. [PDF] [Postscript]
Abstract
Latent variable models represent the probability density of data in a space of several dimensions in terms of a smaller number of latent, or hidden, variables. A familiar example is factor analysis, which is based on a linear transformation between the latent space and the data space. In this paper we introduce a form of non-linear latent variable model called the Generative Topographic Mapping for which the parameters of the model can be determined using the EM algorithm. GTM provides a principled alternative to the widely used Self-Organizing Map (SOM) of Kohonen (1982), and overcomes most of the significant limitations of the SOM. We demonstrate the performance of the GTM algorithm on a toy problem and on simulated data from flow diagnostics for a multi-phase oil pipeline.
Bishop, C. M. (1998). Variational learning in graphical models and neural networks. In Proceedings 8th International Conference on Artificial Neural Networks, ICANN'98, pp. 13–22. Springer. [PDF]
Abstract
Variational methods are becoming increasingly popular for inference and learning in probabilistic models. By providing bounds on quantities of interest, they offer a more controlled approximation framework than techniques such as Laplace's method, while avoiding the mixing and convergence issues of Markov chain Monte Carlo methods, or the possible computational intractability of exact algorithms. In this paper we review the underlying framework of variational methods and discuss example applications involving sigmoid belief networks, Boltzmann machines and feed-forward neural networks.
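The bounding property mentioned in this abstract can be illustrated with a minimal sketch. It assumes a single discrete latent variable z and a hypothetical table of joint values p(x, z) for one fixed observation x; neither is taken from the paper. For any distribution q(z), the quantity sum_z q(z) log(p(x, z) / q(z)) is a lower bound on log p(x), and the bound is tight when q equals the true posterior.

```python
import math

def log_evidence(joint):
    """Exact log p(x) for a discrete latent variable: log sum_z p(x, z)."""
    return math.log(sum(joint))

def lower_bound(joint, q):
    """Variational bound L(q) = sum_z q(z) * log(p(x, z) / q(z))."""
    return sum(qz * math.log(pz / qz) for pz, qz in zip(joint, q) if qz > 0)

# Hypothetical joint values p(x, z) for a fixed x and z in {0, 1, 2}.
joint = [0.10, 0.25, 0.05]
exact = log_evidence(joint)

# A crude approximating distribution and the exact posterior p(z | x).
uniform_q = [1 / 3, 1 / 3, 1 / 3]
posterior = [p / sum(joint) for p in joint]

print("log p(x):          ", exact)
print("bound, uniform q:  ", lower_bound(joint, uniform_q))
print("bound, posterior q:", lower_bound(joint, posterior))
```

The gap between the bound and log p(x) is the Kullback-Leibler divergence from q to the posterior, which is why improving q tightens the bound.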
Barber, D. and C. M. Bishop (1998). Ensemble learning in Bayesian neural networks. In C. M. Bishop (Ed.), Generalization in Neural Networks and Machine Learning, pp. 215–237. Springer. [PDF]
Abstract
Bayesian treatments of learning in neural networks are typically based either on a local Gaussian approximation to a mode of the posterior weight distribution, or on Markov chain Monte Carlo simulations. A third approach, called `ensemble learning', was introduced by Hinton (1993). It aims to approximate the posterior distribution by minimizing the Kullback-Leibler divergence between the true posterior and a parametric approximating distribution. The original derivation of a deterministic algorithm relied on the use of a Gaussian approximating distribution with a diagonal covariance matrix and hence was unable to capture the posterior correlations between parameters. In this chapter we show how the ensemble learning approach can be extended to full-covariance Gaussian distributions while remaining computationally tractable. We also extend the framework to deal with hyperparameters, leading to a simple re-estimation procedure. One of the benefits of our approach is that it yields a strict lower bound on the marginal likelihood, in contrast to other approximate procedures.
Goldberg, P. W., C. K. I. Williams, and C. M. Bishop (1998). Regression with input-dependent noise: A Gaussian process treatment. In Advances in Neural Information Processing Systems, Volume 10, pp. 493–499. MIT Press. [PDF]
Abstract
Gaussian processes provide natural non-parametric prior distributions over regression functions. In this paper we consider regression problems where there is noise on the output, and the variance of the noise depends on the inputs. If we assume that the noise is a smooth function of the inputs, then it is natural to model the noise variance using a second Gaussian process, in addition to the Gaussian process governing the noise-free output value. We show that prior uncertainty about the parameters controlling both processes can be handled and that the posterior distribution of the noise rate can be sampled from using Markov chain Monte Carlo methods. Our results on a synthetic data set give a posterior noise variance that well-approximates the true variance.
Bishop, C. M., N. Lawrence, T. Jaakkola, and M. I. Jordan (1998). Approximating posterior distributions in belief networks using mixtures. In Advances in Neural Information Processing Systems, Volume 10, pp. 416–422. [PDF]
Abstract
Exact inference in densely connected Bayesian networks is computationally intractable, and so there is considerable interest in developing effective approximation schemes. One approach which has been adopted is to bound the log likelihood using a mean-field approximating distribution. While this leads to a tractable algorithm, the mean field distribution is assumed to be factorial and hence unimodal. In this paper we demonstrate the feasibility of using a richer class of approximating distributions based on mixtures of mean field distributions. We derive an efficient algorithm for updating the mixture parameters and apply it to the problem of learning in sigmoid belief networks. Our results demonstrate a systematic improvement over simple mean field theory as the number of mixture components is increased.
Barber, D. and C. M. Bishop (1998). Ensemble learning for multi-layer networks. In M. I. Jordan, K. J. Kearns, and S. A. Solla (Eds.), Advances in Neural Information Processing Systems, Volume 10, pp. 395–401. [PDF]
Abstract
Bayesian treatments of learning in neural networks are typically based either on local Gaussian approximations to a mode of the posterior weight distribution, or on Markov chain Monte Carlo simulations. A third approach, called ensemble learning, was introduced by Hinton (1993). It aims to approximate the posterior distribution by minimizing the Kullback-Leibler divergence between the true posterior and a parametric approximating distribution. However, the derivation of a deterministic algorithm relied on the use of a Gaussian approximating distribution with a diagonal covariance matrix and so was unable to capture the posterior correlations between parameters. In this paper, we show how the ensemble learning approach can be extended to full-covariance Gaussian distributions while remaining computationally tractable. We also extend the framework to deal with hyperparameters, leading to a simple re-estimation procedure. Initial results from a standard benchmark problem are encouraging.
Bishop, C. M. (1997). Latent variables, topographic mappings and data visualization. In M. Marinaro and R. Tagliaferri (Eds.), Proceedings IX Italian Workshop on Neural Networks, Vietri sul Mare, Salerno, pp. 132. Springer.
Abstract
Most pattern recognition tasks, such as regression, classification and novelty detection, can be viewed in terms of probability density estimation. A powerful approach to probabilistic modelling is to represent the observed variables in terms of a number of hidden, or latent, variables. One well-known example of a hidden variable model is the mixture distribution in which the hidden variable is the discrete component label. In the case of continuous latent variables we obtain models such as factor analysis. In this paper we provide an overview of latent variable models, and we show how a particular form of linear latent variable model can be used to provide a probabilistic formulation of the well-known technique of principal components analysis (PCA). By extending this technique to mixtures, and hierarchical mixtures, of probabilistic PCA models we are led to a powerful interactive algorithm for data visualization. We also show how the probabilistic PCA approach can be generalized to non-linear latent variable models leading to the Generative Topographic Mapping algorithm (GTM). Finally, we show how GTM can itself be extended to model temporal data.
Barber, D. and C. M. Bishop (1997). On computing the KL divergence for Bayesian neural networks. Technical report, Neural Computing Research Group, Aston University, Birmingham, U.K.
Bishop, C. M., M. Svensén, and C. K. I. Williams (1997). GTM: a principled alternative to the Self-Organizing Map. In C. von der Malsburg, W. von Seelen, J. C. Vorbrüggen, and B. Sendhoff (Eds.), International Conference on Artificial Neural Networks, ICANN'96, pp. 165–170. Springer.
Abstract
The Self-Organizing Map (SOM) algorithm has been extensively studied and has been applied with considerable success to a wide variety of problems. However, the algorithm is derived from heuristic ideas and this leads to a number of significant limitations. In this paper, we consider the problem of modelling the probability density of data in a space of several dimensions in terms of a smaller number of latent, or hidden, variables. We introduce a novel form of latent variable model, which we call the GTM algorithm (for Generative Topographic Mapping), which allows general non-linear transformations from latent space to data space, and which is trained using the EM (expectation-maximization) algorithm. Our approach overcomes the limitations of the SOM, while introducing no significant disadvantages. We demonstrate the performance of the GTM algorithm on simulated data from flow diagnostics for a multi-phase oil pipeline.
Bishop, C. M. and C. S. Qazaz (1997). Bayesian inference of noise levels in regression. In Proceedings 1996 International Conference on Artificial Neural Networks, ICANN'96, Bochum, Germany, pp. 59–64. Springer. [PDF]
Abstract
In most treatments of the regression problem it is assumed that the distribution of target data can be described by a deterministic function of the inputs, together with additive Gaussian noise having constant variance. The use of maximum likelihood to train such models then corresponds to the minimization of a sum-of-squares error function. In many applications a more realistic model would allow the noise variance itself to depend on the input variables. However, the use of maximum likelihood for training such models would give highly biased results. In this paper we show how a Bayesian treatment can allow for an input-dependent variance while overcoming the bias of maximum likelihood.
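The bias of maximum likelihood referred to here can be seen in the simplest constant-variance case. The sketch below is an illustration only, not the input-dependent model of the paper: it assumes a hypothetical two-valued noise distribution with true variance 1 and computes the exact expectation of the maximum-likelihood variance estimate by enumerating every possible sample of size N. The result is (N - 1)/N times the true variance, because the fitted mean absorbs part of the spread.

```python
import itertools

def ml_variance(sample):
    """Maximum-likelihood variance: divide by N and use the fitted mean."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / n

def expected_ml_variance(values, n):
    """Exact expectation of the ML variance over all size-n samples drawn
    uniformly and independently from `values`."""
    samples = list(itertools.product(values, repeat=n))
    return sum(ml_variance(s) for s in samples) / len(samples)

# Hypothetical noise distribution: -1 or +1 with equal probability (variance 1).
values = [-1.0, 1.0]
for n in (2, 5, 10):
    print(n, expected_ml_variance(values, n), (n - 1) / n)
```

With few data points per region of input space, as happens when the variance is allowed to vary with the input, this underestimate becomes more severe, which is the situation the Bayesian treatment is designed to correct.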
Bishop, C. M., G. E. Hinton, and I. G. D. Strachan (1997). GTM through time. In Proceedings IEE Fifth International Conference on Artificial Neural Networks, Cambridge, U.K., pp. 111–116. [PDF] [Postscript]
Abstract
The standard GTM (generative topographic mapping) algorithm assumes that the data on which it is trained consists of independent, identically distributed (i.i.d.) vectors. For time series, however, the i.i.d. assumption is a poor approximation. In this paper we show how the GTM algorithm can be extended to model time series by incorporating it as the emission density in a hidden Markov model. Since GTM has discrete hidden states we are able to find a tractable EM algorithm, based on the forward-backward algorithm, to train the model. We illustrate the performance of GTM through time using flight recorder data from a helicopter.
Jordan, M. I. and C. M. Bishop (1997). Neural networks. In A. B. Tucker (Ed.), The Computer Science and Engineering Handbook, pp. 536–556. CRC Press. [PDF]
Bishop, C. M., M. Svensén, and C. K. I. Williams (1997a). Magnification factors for the GTM algorithm. In Proceedings IEE Fifth International Conference on Artificial Neural Networks, Cambridge, U.K., pp. 64–69. Institute of Electrical Engineers. [PDF] [Postscript]
Abstract
The Generative Topographic Mapping (GTM) algorithm of Bishop et al. (1997) has been introduced as a principled alternative to the Self-Organizing Map (SOM). As well as avoiding a number of deficiencies in the SOM, the GTM algorithm has the key property that the smoothness properties of the model are decoupled from the reference vectors, and are described by a continuous mapping from a lower-dimensional latent space into the data space. Magnification factors, which are approximated by the difference between code-book vectors in SOMs, can therefore be evaluated for the GTM model as continuous functions of the latent variables using the techniques of differential geometry. They play an important role in data visualization by highlighting the boundaries between data clusters, and are illustrated here for both a toy data set, and a problem involving the identification of crab species from morphological data.
Bishop, C. M., M. Svensén, and C. K. I. Williams (1997b). Magnification factors for the SOM and GTM algorithms. In Proceedings 1997 Workshop on Self-Organizing Maps, Helsinki University of Technology, Finland, pp. 333–338.
Abstract
Magnification factors specify the extent to which the area of a small patch of the latent (or `feature') space of a topographic mapping is magnified on projection to the data space, and are of considerable interest in both neuro-biological and data analysis contexts. Previous attempts to consider magnification factors for the self-organizing map (SOM) algorithm have been hindered because the mapping is only defined at discrete points (given by the reference vectors). In this paper we consider the batch version of SOM, for which a continuous mapping can be defined, as well as the Generative Topographic Mapping (GTM) algorithm of Bishop et al. (1997) which has been introduced as a probabilistic formulation of the SOM. We show how the techniques of differential geometry can be used to determine magnification factors as continuous functions of the latent space coordinates. The results are illustrated here using a problem involving the identification of crab species from morphological data.
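As a rough illustration of the quantity described in this abstract, the area magnification of a smooth map from a two-dimensional latent space into data space can be computed as sqrt(det(J^T J)), where J is the Jacobian of the map. The sketch below assumes this standard differential-geometry formula and uses two hypothetical mappings (a stretched plane and a paraboloid) in place of a trained GTM or batch SOM; the Jacobian is approximated by central differences.

```python
import math

def jacobian(mapping, u, v, h=1e-6):
    """Central-difference Jacobian columns of a map from 2-D latent space."""
    du = [(a - b) / (2 * h) for a, b in zip(mapping(u + h, v), mapping(u - h, v))]
    dv = [(a - b) / (2 * h) for a, b in zip(mapping(u, v + h), mapping(u, v - h))]
    return du, dv

def magnification(mapping, u, v):
    """Area magnification sqrt(det(J^T J)) at latent point (u, v)."""
    du, dv = jacobian(mapping, u, v)
    g11 = sum(a * a for a in du)              # entries of the 2x2 matrix J^T J
    g22 = sum(b * b for b in dv)
    g12 = sum(a * b for a, b in zip(du, dv))
    return math.sqrt(max(g11 * g22 - g12 * g12, 0.0))

def stretched_plane(u, v):
    """Hypothetical mapping: scales the latent axes by 2 and 3."""
    return (2.0 * u, 3.0 * v, 0.0)

def paraboloid(u, v):
    """Hypothetical curved mapping into three dimensions."""
    return (u, v, u * u + v * v)

print(magnification(stretched_plane, 0.3, -0.2))  # constant everywhere, about 6
print(magnification(paraboloid, 0.0, 0.0))        # flat at the origin, about 1
print(magnification(paraboloid, 1.0, 0.0))        # steeper region, about sqrt(5)
```

Because the value is defined at every latent point rather than only at reference vectors, it can be plotted as a continuous function over the latent space.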
Tipping, M. E. and C. M. Bishop (1997a). Mixtures of principal component analysers. In Proceedings IEE Fifth International Conference on Artificial Neural Networks, Cambridge, U.K., July, pp. 13–18. London: IEE.
Abstract
Principal component analysis (PCA) is a ubiquitous technique for data analysis but one whose effective application is restricted by its global linear character. While global nonlinear variants of PCA have been proposed, an alternative paradigm is to capture data nonlinearity by a mixture of local PCA models. However, existing techniques are limited by the absence of a probabilistic formalism with an appropriate likelihood measure and so require an arbitrary choice of implementation strategy. This paper shows how PCA can be derived from a maximum-likelihood procedure, based on a specialisation of factor analysis. This is then extended to develop a well-defined mixture model of principal component analyzers, and an expectation-maximisation algorithm for estimating all the model parameters is given.
Tipping, M. E. and C. M. Bishop (1997b). Hierarchical models for data visualization. In Proceedings IEE Fifth International Conference on Artificial Neural Networks, Cambridge, U.K., pp. 70–75.
Abstract
Visualization has proven to be a powerful and widely-applicable tool for the analysis and interpretation of data. Most visualization algorithms aim to find a projection from the data space down to a two-dimensional visualization space. However, for complex data sets living in a high-dimensional space it is unlikely that a single two-dimensional projection can reveal all of the interesting structure. We therefore introduce a hierarchical visualization algorithm which allows the complete data set to be visualized at the top level, with clusters and sub-clusters of data points visualized at deeper levels. The algorithm is based on a hierarchical mixture of latent variable models, whose parameters are estimated using the expectation-maximisation algorithm. We demonstrate the principle of the approach first on a toy data set, and then apply the algorithm to the visualization of a synthetic data set in 12 dimensions obtained from a simulation of multi-phase flows in oil pipelines.
Bishop, C. M. and M. E. Tipping (1997). Latent variable models and data visualization. In M. Titterington and J. Kay (Eds.), Statistics and Neural Networks, pp. 147–164. Oxford University Press.
Abstract
Visualization is a powerful and widely used technique for data analysis and data mining. For simple data sets a single projection of the data on to a two-dimensional plane, such as that provided by principal component analysis, may prove adequate. In the case of more complex data sets, however, it may be necessary to find multiple plots corresponding to different projection directions and/or different subsets of the data points in order to capture the full complexity of the data. Here we use latent variable models to construct a framework for data visualization which allows simultaneous soft clustering and projection of the data in a probabilistic setting. We first show how standard principal component analysis can be formulated in terms of maximum likelihood under a latent variable model. Next we extend the formalism to include both mixtures and hierarchical mixtures of principal component models, and derive the corresponding visualization algorithms. Finally, we illustrate the hierarchical approach to visualization using data sets obtained from multi-phase flows along oil pipelines, and from satellite image data.
Bishop, C. M. (1997). Bayesian neural networks. Journal of the Brazilian Computer Society 1(4), 61–68. Special issue on neural networks.
Abstract
Bayesian techniques have been developed over many years in a range of different fields, but have only recently been applied to the problem of learning in neural networks. As well as providing a consistent framework for statistical pattern recognition, the Bayesian approach offers a number of practical advantages including a solution to the problem of over-fitting. This article provides an introductory overview of the application of Bayesian methods to neural networks. It assumes the reader is familiar with standard feed-forward neural network models and how to train them using conventional techniques.
Qazaz, C. S., C. K. I. Williams, and C. M. Bishop (1997). An upper bound on the Bayesian error bars for generalized linear regression. In S. W. Ellacott, J. C. Mason, and I. J. Anderson (Eds.), Mathematics of Neural Networks: Models, Algorithms and Applications, pp. 295–299. Kluwer.
Abstract
In the Bayesian framework, predictions for a regression problem are expressed in terms of a distribution of output values. The mode of this distribution corresponds to the most probable output, while the uncertainty associated with the predictions can conveniently be expressed in terms of error bars. In this paper we consider the evaluation of error bars in the context of the class of generalized linear regression models. We provide insights into the dependence of the error bars on the location of the data points and we derive an upper bound on the true error bars in terms of the contributions from individual data points which are themselves easily evaluated.
Bishop, C. M. and I. T. Nabney (1997). Modelling conditional probability densities for periodic variables. In S. W. Ellacott, J. C. Mason, and I. J. Anderson (Eds.), Mathematics of Neural Networks: Models, Algorithms and Applications, pp. 118–122. Kluwer.
Abstract
Most conventional techniques for estimating conditional probability densities are inappropriate for applications involving periodic variables. In this paper we introduce three related techniques for tackling such problems, and test them using synthetic data. We then apply them to the problem of extracting the distribution of wind vector directions from radar scatterometer data.
Bishop, C. M. (1997). Neural networks. In A. Bullock and S. Trombley (Eds.), Fontana Dictionary of Modern Thought (Third ed.). Fontana Press.
Bishop, C. M. and C. S. Qazaz (1997). Regression with input-dependent noise: A Bayesian treatment. In Advances in Neural Information Processing Systems, Volume 9, pp. 347–353. MIT Press. [PDF]
Abstract
In most treatments of the regression problem it is assumed that the distribution of target data can be described by a deterministic function of the inputs, together with additive Gaussian noise having constant variance. The use of maximum likelihood to train such models then corresponds to the minimization of a sum-of-squares error function. In many applications a more realistic model would allow the noise variance itself to depend on the input variables. However, the use of maximum likelihood to train such models would give highly biased results. In this paper we show how a Bayesian treatment can allow for an input-dependent variance while overcoming the bias of maximum likelihood.
Bishop, C. M., M. Svensén, and C. K. I. Williams (1997). GTM: a principled alternative to the Self-Organizing Map. In M. C. Mozer, M. I. Jordan, and T. Petsche (Eds.), Advances in Neural Information Processing Systems, Volume 9, pp. 354–360. MIT Press. [PDF]
Abstract
The Self-Organizing Map (SOM) algorithm has been extensively studied and has been applied with considerable success to a wide variety of problems. However, the algorithm is derived from heuristic ideas and this leads to a number of significant limitations. In this paper, we consider the problem of modelling the probability density of data in a space of several dimensions in terms of a smaller number of latent, or hidden, variables. We introduce a novel form of latent variable model, which we call the GTM algorithm (for Generative Topographic Mapping), which allows general non-linear transformations from latent space to data space, and which is trained using the EM (expectation-maximization) algorithm. Our approach overcomes the limitations of the SOM, while introducing no significant disadvantages. We demonstrate the performance of the GTM algorithm on simulated data from flow diagnostics for a multi-phase oil pipeline.
Barber, D. and C. M. Bishop (1997). Bayesian model comparison by Monte Carlo chaining. In M. Mozer, M. Jordan, and T. Petsche (Eds.), Advances in Neural Information Processing Systems, Volume 9, pp. 333–339. MIT Press. [PDF]
Abstract
The techniques of Bayesian inference have been applied with great success to many problems in neural computing including evaluation of regression functions, determination of error bars on predictions, and the treatment of hyper-parameters. However, the problem of model comparison is a much more challenging one for which current techniques have significant limitations. In this paper we show how an extended form of Markov chain Monte Carlo, called chaining, is able to provide effective estimates of the relative probabilities of different models. We present results from the robot arm problem and compare them with the corresponding results obtained using the standard Gaussian approximation framework.
Jordan, M. I. and C. M. Bishop (1996). Neural networks. ACM Computing Surveys 28(1), 73–75. [PDF]
Bishop, C. M. and I. T. Nabney (1996). Modelling conditional probability distributions for periodic variables. Neural Computation 8(5), 1123–1133.
Abstract
Most conventional techniques for estimating conditional probability densities are inappropriate for applications involving periodic variables. In this paper we introduce three related techniques for tackling such problems, and investigate their performance using synthetic data. We then apply these techniques to the problem of extracting the distribution of wind vector directions from radar scatterometer data gathered by a remote-sensing satellite.
Bishop, C. M. (1996a). Neural networks: A pattern recognition perspective. In E. Fiesler and R. Beale (Eds.), Handbook of Neural Computation. Oxford University Press and IOP Publishing.
Abstract
The majority of current applications of neural networks are concerned with problems in pattern recognition. In this article we show how neural networks can be placed on a principled, statistical foundation, and we discuss some of the practical benefits which this brings.
Bishop, C. M. (1996b). Theoretical foundations of neural networks. In P. Borcherds, M. Bubak, and A. Maksymowicz (Eds.), Proceedings of Physics Computing 96, Krakow, pp. 500–507. Academic Computer Centre.
Abstract
Neural networks have often been motivated by superficial analogy with biological nervous systems. Recently, however, it has become widely recognised that the effective application of neural networks requires instead a deeper understanding of the theoretical foundations of these models. Insight into neural networks comes from a number of fields including statistical pattern recognition, computational learning theory, statistics, information geometry and statistical mechanics. As an illustration of the importance of understanding the theoretical basis for neural network models, we consider their application to the solution of multi-valued inverse problems. We show how a naive application of the standard least-squares approach can lead to very poor results, and how an appreciation of the underlying statistical goals of the modelling process allows the development of a more general and more powerful formalism which can tackle the problem of multi-modality.
Bishop, C. M., M. Svensén, and C. K. I. Williams (1996). EM optimization of latent variable density models. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems, Volume 8, pp. 465–471. MIT Press. [PDF]
Abstract
There is currently considerable interest in developing general non-linear density models based on latent, or hidden, variables. Such models have the ability to discover the presence of a relatively small number of underlying `causes' which, acting in combination, give rise to the apparent complexity of the observed data set. Unfortunately, to train such models generally requires large computational effort. In this paper we introduce a novel latent variable algorithm which retains the general non-linear capabilities of previous models but which uses a training procedure based on the EM algorithm. We demonstrate the performance of the model on a toy problem and on data from flow diagnostics for a multi-phase oil pipeline.
Bishop, C. M. (1995a). Neural Networks for Pattern Recognition. Oxford University Press. [Information about this book]
Abstract
This book provides the first comprehensive treatment of neural networks from the perspective of statistical pattern recognition. It comprises 504 pages with 160 figures, 300 references, and 129 graded exercises, and is published by Oxford University Press.
Bishop, C. M. (1995b). Training with noise is equivalent to Tikhonov regularization. Neural Computation 7(1), 108–116. [PDF]
Abstract
It is well known that the addition of noise to the input data of a neural network during training can, in some circumstances, lead to significant improvements in generalization performance. Previous work has shown that such training with noise is equivalent to a form of regularization in which an extra term is added to the error function. However, the regularization term, which involves second derivatives of the error function, is not bounded below, and so can lead to difficulties if used directly in a learning algorithm based on error minimization. In this paper we show that, for the purposes of network training, the regularization term can be reduced to a positive definite form which involves only first derivatives of the network mapping. For a sum-of-squares error function, the regularization term belongs to the class of generalized Tikhonov regularizers. Direct minimization of the regularized error function provides a practical alternative to training with noise.
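The equivalence can be checked by hand in the simplest possible case. The sketch below is an illustration only, not the paper's general derivation: it assumes a one-dimensional linear model y = w * x and symmetric two-point input noise taking the values +s and -s (both are assumptions chosen here so that the noise average is exact), and confirms that the noise-averaged squared error equals the clean squared error plus the penalty s**2 * w**2. All function names are hypothetical.

```python
def squared_error(w, x, t):
    """Squared error of the linear model y = w * x against target t."""
    return (w * x - t) ** 2


def noisy_error(w, x, t, s):
    """Squared error averaged over input noise taking values +s and -s.

    The two-point noise has zero mean and variance s**2 (an assumption
    made for this illustration so the average can be computed exactly).
    """
    return 0.5 * (squared_error(w, x + s, t) + squared_error(w, x - s, t))


def regularized_error(w, x, t, s):
    """Clean squared error plus a Tikhonov (ridge) penalty s**2 * w**2."""
    return squared_error(w, x, t) + (s ** 2) * (w ** 2)


if __name__ == "__main__":
    w, x, t, s = 1.5, 2.0, 1.0, 0.1
    # The two values agree up to floating-point rounding.
    print(noisy_error(w, x, t, s), regularized_error(w, x, t, s))
```

For a linear model the cross term in the noise averages to zero, which is why the agreement is exact here; for a non-linear network the abstract's statement concerns a regularization term involving derivatives of the network mapping, which this toy case does not attempt to reproduce.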
Bishop, C. M., P. S. Haynes, M. E. U. Smith, T. N. Todd, and D. L. Trotman (1995). Real-time control of a tokamak plasma using neural networks. Neural Computation 7, 206–217. [PDF]
Abstract
In this paper we present results from the first use of neural networks for real-time control of the high temperature plasma in a tokamak fusion experiment. The tokamak is currently the principal experimental device for research into the magnetic confinement approach to controlled fusion. In an effort to improve the energy confinement properties of the high temperature plasma inside tokamaks, recent experiments have focussed on the use of non-circular cross-sectional plasma shapes. However, the accurate generation of such plasmas represents a demanding problem involving simultaneous control of several parameters on a timescale as short as a few tens of microseconds. Application of neural networks to this problem requires fast hardware, for which we have developed a fully parallel custom implementation of a multilayer perceptron, based on a hybrid of digital and analogue techniques.
Williams, C. K. I., C. Qazaz, C. M. Bishop, and H. Zhu (1995). On the relationship between Bayesian error bars and the input data density. In Proceedings Fourth IEE International Conference on Artificial Neural Networks, Cambridge, UK, pp. 160–165. IEE.
Abstract
We investigate the dependence of Bayesian error bars on the distribution of data in input space. For generalized linear regression models we derive an upper bound on the error bars which shows that, in the neighbourhood of the data points, the error bars are substantially reduced from their prior values. For regions of high data density we also show that the contribution to the output variance due to the uncertainty in the weights can exhibit an approximate inverse proportionality to the probability density. Empirical results support these conclusions.
Nabney, I. T., C. M. Bishop, and C. Legleye (1995). Modelling conditional probability distributions for periodic variables. In Proceedings Fourth IEE International Conference on Artificial Neural Networks, Cambridge, UK, pp. 177–182. IEE.
Abstract
Most of the common techniques for estimating conditional probability densities are inappropriate for applications involving periodic variables. In this paper we introduce two novel techniques for tackling such problems, and investigate their performance using synthetic data. We then apply these techniques to the problem of extracting the distribution of wind vector directions from radar scatterometer data gathered by a remote-sensing satellite.
Bishop, C. M. (1995). Bayesian methods for neural networks. Technical Report NCRG/95/009, Neural Computing Research Group, Aston University.
Abstract
Bayesian techniques have been developed over many years in a range of different fields, but have only recently been applied to the problem of learning in neural networks. As well as providing a consistent framework for statistical pattern recognition, the Bayesian approach offers a number of practical advantages including a potential solution to the problem of over-fitting. This chapter aims to provide an introductory overview of the application of Bayesian methods to neural networks. It assumes the reader is familiar with standard feed-forward network models and how to train them using conventional techniques.
Bartlett, D., C. Bishop, R. Cahill, A. McLachlan, L. Porte, and A. Rookes (1995). Recent progress in the measurement and analysis of ECE on JET. In Proceedings of the 9th International Workshop on ECE and ECRH.
Abstract
Recent changes to the JET ECE diagnostic system have been made to accommodate the changing requirements of JET and to enhance further the measurement performance of the broad-band heterodyne radiometer. The radiometer frequency coverage has been extended, its spatial resolution has been improved and a technique for frequency selective sharing of radiation among its six mixers has been developed. An analysis of the limiting resolution of ECE measurements has been made, and applied to the determination of the optimum IF filter widths for the radiometer. These improvements and the resolution analysis are described, and some illustrative results are shown. A study of the feasibility of using neural networks to reduce the level of systematic uncertainty in the JET ECE data has been completed. The technique and the results are presented.
Bishop, C. M., P. S. Haynes, M. E. U. Smith, T. N. Todd, and D. L. Trotman (1995). Real-time control of a tokamak plasma using a hardware neural network. In J. G. Taylor (Ed.), Neural Networks, Chapter 12, pp. 193–216. Alfred Waller.
Abstract
One of the most promising approaches to achieving fusion of the light elements, as a potential large-scale energy source for the next century, is based on the magnetic confinement of an ionized high temperature plasma. Most of the current research in magnetic confinement makes use of toroidal plasma configurations in experiments known as tokamaks. Theoretical results have predicted that the characteristics of a tokamak plasma can be made more favourable to fusion if the cross-section of the plasma is appropriately shaped. However, the accurate generation of such plasmas, and the real-time control of their position and shape, represents a demanding problem involving the simultaneous adjustment of the currents through several control coils on time scales as short as a few tens of microseconds. In this paper we present results from the first use of neural networks for the control of the high temperature plasma in a tokamak fusion experiment. This application requires the use of fast hardware, for which we have developed a fully parallel custom implementation of a multi-layer perceptron, based on a hybrid of digital and analogue techniques. Our results demonstrate that the network is indeed capable of fast plasma control in accordance with the predictions of software simulations.
Bishop, C. M. (1995a). Regularization and complexity control in feed-forward networks. In F. Fogelman-Soulié and P. Gallinari (Eds.), Proceedings International Conference on Artificial Neural Networks ICANN'95, Volume 1, pp. 141–148. EC2 et Cie.
Abstract
In this paper we consider four alternative approaches to complexity control in feed-forward networks based respectively on architecture selection, regularization, early stopping, and training with noise. We show that there are close similarities between these approaches and we argue that, for most practical applications, the technique of regularization should be the method of choice.
Bishop, C. M. (1995b). Multiphase flow monitoring in oil pipelines. In A. F. Murray (Ed.), Applications of Neural Networks, Chapter 6, pp. 133–155. Kluwer.
Nabney, I. T. and C. M. Bishop (1995). Modelling conditional probability distributions for periodic variables. In F. Fogelman-Soulié and P. Gallinari (Eds.), Proceedings International Conference on Artificial Neural Networks ICANN'95, Volume 2, Paris, pp. 209–214. EC2 et Cie.
Abstract
Most of the common techniques for estimating conditional probability densities are inappropriate for applications involving periodic variables. In this paper we introduce two novel techniques for tackling such problems, and investigate their performance using synthetic data.
Bishop, C. M. and C. Legleye (1995). Estimating conditional probability densities for periodic variables. In G. Tesauro, D. S. Touretzky, and T. K. Leen (Eds.), Advances in Neural Information Processing Systems, Volume 7, Cambridge MA, pp. 641–648. MIT Press. [PDF]
Abstract
Most of the common techniques for estimating conditional probability densities are inappropriate for applications involving periodic variables. In this paper we introduce three novel techniques for tackling such problems, and investigate their performance using synthetic data. We then apply these techniques to the problem of extracting the distribution of wind vector directions from radar scatterometer data gathered by a remote-sensing satellite.
Bishop, C. M., P. S. Haynes, M. E. U. Smith, T. N. Todd, D. L. Trotman, and C. G. Windsor (1995). Real-time control of a tokamak plasma using neural networks. In G. Tesauro, D. S. Touretzky, and T. K. Leen (Eds.), Advances in Neural Information Processing Systems, Volume 7, Cambridge MA, pp. 1007–1014. MIT Press. [PDF]
Abstract
This paper presents results from the first use of neural networks for the real-time feedback control of high temperature plasmas in a Tokamak fusion experiment. The Tokamak is currently the principal experimental device for research into the magnetic confinement approach to controlled fusion. In the Tokamak, hydrogen plasmas, at temperatures of up to 100 million K, are confined by strong magnetic fields. Accurate control of the position and shape of the plasma boundary requires real-time feedback control of the magnetic field structure on a time-scale of a few tens of microseconds. Software simulations have demonstrated that a neural network approach can give significantly better performance than the linear technique currently used on most Tokamak experiments. The practical application of the neural network approach requires high-speed hardware, for which a fully parallel implementation of the multi-layer perceptron, using a hybrid of digital and analogue technology, has been developed.
Bishop, C. M. (1994a). Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University. [PDF] [Postscript]
Abstract
Minimization of a sum-of-squares or cross-entropy error function leads to network outputs which approximate the conditional averages of the target data, conditioned on the input vector. For classification problems, with a suitably chosen target coding scheme, these averages represent the posterior probabilities of class membership, and so can be regarded as optimal. For problems involving the prediction of continuous variables, however, the conditional averages provide only a very limited description of the properties of the target variables. This is particularly true for problems in which the mapping to be learned is multi-valued, as often arises in the solution of inverse problems, since the average of several correct target values is not necessarily itself a correct value. In order to obtain a complete description of the data, for the purposes of predicting the outputs corresponding to new input vectors, we must model the conditional probability distribution of the target data, again conditioned on the input vector. In this paper we introduce a new class of network models obtained by combining a conventional neural network with a mixture density model. The complete system is called a Mixture Density Network, and can in principle represent arbitrary conditional probability distributions in the same way that a conventional neural network can represent arbitrary functions. We demonstrate the effectiveness of Mixture Density Networks using both a toy problem and a problem involving robot inverse kinematics.
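The failure of conditional averages on a multi-valued mapping can be seen in a toy case. The sketch below is an illustration only, not the paper's experiment: it assumes the forward map x = t**2 (a hypothetical choice), whose inverse has the two branches t = +sqrt(x) and t = -sqrt(x), treats both branches as equally likely, and shows that their average does not satisfy the forward map. All function names are hypothetical.

```python
import math


def forward(t):
    """Forward map (assumed for illustration): x = t**2."""
    return t * t


def inverse_branches(x):
    """Both valid inverse solutions for a given x > 0."""
    root = math.sqrt(x)
    return [root, -root]


def conditional_average(x):
    """Mean of the valid solutions, assuming both branches are equally likely.

    This is the value a least-squares fit would approach in that setting.
    """
    branches = inverse_branches(x)
    return sum(branches) / len(branches)


if __name__ == "__main__":
    x = 4.0
    for t in inverse_branches(x):
        print(t, forward(t))        # each branch reproduces x
    avg = conditional_average(x)
    print(avg, forward(avg))        # the average does not
```

Modelling the full conditional distribution, as the abstract proposes, would keep both branches as separate modes instead of collapsing them to their mean.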
Bishop, C. M. (1994b). Novelty detection and neural network validation. IEE Proceedings: Vision, Image and Signal Processing 141(4), 217–222. Special issue on applications of neural networks.
Abstract
One of the key factors limiting the use of neural networks in many industrial applications has been the difficulty of demonstrating that a trained network will continue to generate reliable outputs once it is in routine use. An important potential source of errors arises from novel input data, that is input data which differ significantly from the data used to train the network. In this paper we investigate the relationship between the degree of novelty of input data and the corresponding reliability of the outputs from the network. We describe a quantitative procedure for assessing novelty, and we demonstrate its performance using an application involving the monitoring of oil flow in multi-phase pipelines.
Bishop, C. M., P. S. Haynes, M. E. U. Smith, T. N. Todd, and D. L. Trotman (1994). Fast feedback control of a high temperature fusion plasma. Neural Computing and Applications 2(3), 148–159.
Abstract
One of the most promising approaches to achieving fusion of the light elements, as a potential large-scale energy source for the next century, is based on the magnetic confinement of an ionized high temperature plasma. Most of the current research in magnetic confinement makes use of toroidal plasma configurations in experiments known as tokamaks. Theoretical results have predicted that the characteristics of a tokamak plasma can be made more favourable to fusion if the cross-section of the plasma is appropriately shaped. However, the accurate generation of such plasmas, and the real-time control of their position and shape, represents a demanding problem involving the simultaneous adjustment of the currents through several control coils on time scales as short as a few tens of microseconds. In this paper we present results from the first use of neural networks for the control of the high temperature plasma in a tokamak fusion experiment. This application requires the use of fast hardware, for which we have developed a fully parallel custom implementation of a multi-layer perceptron, based on a hybrid of digital and analogue techniques. Our results demonstrate that the network is indeed capable of fast plasma control in accordance with the predictions of software simulations.
Bishop, C. M. (1994). Neural networks and their applications. Review of Scientific Instruments 65(6), 1803–1832.
Abstract
Neural networks provide a range of powerful new techniques for solving problems in pattern recognition, data analysis and control. They have several notable features including high processing speeds and the ability to learn the solution to a problem from a set of examples. The majority of practical applications of neural networks currently make use of two basic network models. We describe these models in detail and explain the various techniques used to train them. Next we discuss a number of key issues which must be addressed when applying neural networks to practical problems, and highlight several potential pitfalls. Finally, we survey the various classes of problem which may be addressed using neural networks, and we illustrate them with a variety of successful applications drawn from a range of fields. It is intended that this review should be accessible to readers with no previous knowledge of neural networks, and yet also provide new insights for those already making practical use of these techniques.
Deliyanakis, N., C. M. Bishop, J. W. Connor, M. Cox, and D. C. Robinson (1994). An investigation of coupled energy and particle transport in tokamak plasmas. Plasma Physics and Controlled Fusion 36(9), 1391–1406. [PDF]
Abstract
This paper examines some experimental evidence for coupling between particle and energy transport in tokamak plasmas and presents analytical and numerical investigations of this type of transport. This coupling generally leads to discrepancies between the effective thermal diffusivities inferred from analyses of power balance and perturbation measurements. Such discrepancies have been observed experimentally. Comparisons are presented between the results from the numerical solution of a coupled transport model and data from experiments with modulated heating carried out on the DITE machine. The salient features of coupled transport have been assessed and demonstrated to be fully consistent with experimental data: it has been shown that transport matrices with relatively large off-diagonal components can lead to small apparent perturbations of the density, when the energy balance is perturbed, whilst still affecting the thermal transport considerably. Furthermore, perturbation measurements, used in conjunction with predictive transport codes, have emerged as a useful technique for validating transport models.
Bishop, C. M. (1993). Novelty detection and neural network validation. In S. Gielen and B. Kappen (Eds.), Proceedings International Conference on Artificial Neural Networks ICANN'93, pp. 789–794.
Abstract
One of the key factors limiting the use of neural networks in many industrial applications has been the difficulty of demonstrating that a trained network will continue to generate reliable outputs once it is in routine use. An important potential source of errors arises from input data which differs significantly from that used to train the network. In this paper we investigate the relation between the degree of novelty of input data and the corresponding reliability of the output data. We provide a quantitative procedure for measuring novelty, and we demonstrate its performance using an application involving the monitoring of oil flow in multi-phase pipelines.
Bishop, C. M., C. M. Roach, and M. G. von Hellermann (1993). Automatic analysis of JET charge exchange recombination spectra using neural networks. Plasma Physics and Controlled Fusion 35, 765–773. [PDF]
Abstract
The analysis of charge exchange recombination spectra represents a very challenging problem due to the presence of many overlapping spectral lines. Conventional approaches are based on iterative least-squares optimisation and suffer from the two difficulties of low speed and the need for a good initial approximation to the solution. This latter problem necessitates considerable human supervision of the analysis procedure. In this letter we show how neural network techniques allow charge exchange data to be analysed very rapidly, to give an approximate solution without the need for supervision. The network approach is well suited to the fast inter-shot analysis of large volumes of data, and can readily be implemented in dedicated hardware for real-time applications. The neural network can also be used to provide the initial guess for the standard least-squares algorithm when high accuracy is required.
Bishop, C. M. and G. D. James (1993). Analysis of multiphase flows using dual-energy gamma densitometry and neural networks. Nuclear Instruments and Methods in Physics Research A327, 580–593.
Abstract
Dual-energy gamma densitometry offers a powerful technique for the non-intrusive analysis of multi-phase flows. By employing multiple beam lines, information on the phase configuration can be obtained. Once the configuration is known, it then becomes possible in principle to determine the phase fractions. In practice, however, the extraction of the phase fractions from the densitometer data is complicated by the wide variety of phase configurations which can arise, and by the considerable difficulties of modelling multi-phase flows. In this paper we show that neural network techniques provide a powerful approach to the analysis of data from dual-energy gamma densitometers, allowing both the phase configuration and the phase fractions to be determined with high accuracy, while avoiding the uncertainties associated with modelling. The technique is well suited to the determination of oil, water and gas fractions in multi-phase oil pipelines. Results from linear and non-linear network models are compared, and a new technique for validating the network output is described.
Bishop, C. M. (1993). Curvature-driven smoothing: a learning algorithm for feedforward networks. IEEE Transactions on Neural Networks 4(5), 882–884. [PDF]
Abstract
The performance of feed-forward neural networks in real applications can often be improved significantly if use is made of a-priori information. For interpolation problems this prior knowledge frequently includes smoothness requirements on the network mapping, and can be imposed by the addition to the error function of suitable regularization terms. The new error function, however, now depends on the derivatives of the network mapping, and so the standard back-propagation algorithm cannot be applied. In this paper, we derive a computationally efficient learning algorithm, for a feed-forward network of arbitrary topology, which can be used to minimize the new error function. Networks having a single hidden layer, for which the learning algorithm simplifies, are treated as a special case.
Bishop, C. M., I. G. D. Strachan, J. O'Rourke, G. P. Maddison, and P. S. Thomas (1993). Reconstruction of tokamak density profiles using feed-forward networks. Neural Computing and Applications 1(1), 4–16.
Abstract
The tokamak is currently the principal magnetic confinement system for controlled fusion research. In seeking to understand the physics of the high temperature plasma inside the tokamak, it is important to have detailed information on the spatial distribution of electron density. One technique for density measurement uses laser interferometry, which gives line-integral information along chords through the plasma. This requires an inversion procedure to extract spatially local density information. In this paper we make use of feed-forward networks to extract local density profiles from the line-integral data obtained from the multichannel interferometer on the JET (Joint European Torus) tokamak. An important feature of our approach is the use of profile data from a second high resolution diagnostic system, called LIDAR, to train the network. The LIDAR system provides data at high spatial resolution but with a low repetition rate, and therefore has a complementary role to interferometry which operates at a high sampling rate but with much lower spatial resolution. Results show that the neural network is able to extract significantly more detailed profile information than the conventional Abel inversion method currently used on JET.
Bishop, C. M. (1993). Neural network validation: an illustration from the monitoring of multi-phase flows. In Proceedings IEE Conference on Artificial Neural Networks, pp. 41–45.
Abstract
One of the key factors limiting the use of neural networks in many industrial applications has been the difficulty of demonstrating that a trained network will continue to generate reliable outputs once it is in routine use. An important potential source of errors arises from novel input data, that is input data which differs significantly from the data used to train the network. In this paper we investigate the relation between the degree of novelty of input data and the corresponding reliability of the outputs from the network. We describe a quantitative procedure for assessing novelty, and we demonstrate its performance using an application involving the monitoring of oil flow in multi-phase pipelines.
Bishop, C. M. (1992). Exact calculation of the Hessian matrix for the multilayer perceptron. Neural Computation 4(4), 494–501. [PDF]
Abstract
The elements of the Hessian matrix consist of the second derivatives of the error measure with respect to the weights and thresholds in the network. They are needed in Bayesian estimation of network regularization parameters, for estimation of error bars on the network outputs, for network pruning algorithms, and for fast re-training of the network following a small change in the training data. In this paper we present an extended back-propagation algorithm which allows all elements of the Hessian matrix to be evaluated exactly for a feed-forward network of arbitrary topology. Software implementation of the algorithm is straightforward.
Bishop, C. M. and C. M. Roach (1992). Fast curve fitting using neural networks. Review of Scientific Instruments 63(10), 4450–4456.
Abstract
Neural networks provide a new tool for the fast solution of repetitive nonlinear curve-fitting problems. In this paper we introduce the concept of a neural network, and we show how it can be used for fitting functional forms to experimental data. The neural network algorithm is typically much faster than conventional iterative approaches. In addition, further substantial improvements in speed can be obtained by using special-purpose hardware implementations of the network, thus making the technique suitable for use in fast real-time applications. The basic concepts are illustrated using a simple example from fusion research, involving the determination of spectral line parameters from measurements of B-IV impurity radiation in the COMPASS-C tokamak.
Allen, L. and C. M. Bishop (1992). Neural network approach to energy confinement scaling in tokamaks. Plasma Physics and Controlled Fusion 34(7), 1291–1302. [PDF]
Abstract
Empirical studies of the scaling of Tokamak energy confinement times with machine parameters constitute a useful point of contact with physics-based transport theories. They also form the basis for the design of next-step and reactor grade Tokamaks. In most cases a simple power law (or sometimes offset linear) functional form is fitted to the data. Such linear regression techniques have the advantage of computational simplicity, but otherwise have little a priori justification. Neural networks provide a powerful general-purpose technique for nonlinear regression which exhibits no essential limitations on the functional form which can be fitted. In this paper we apply neural networks to the problem of energy confinement scaling in Tokamaks, and we illustrate the technique using data from the JET (Joint European Torus) Tokamak. The results show that the neural network approach leads to a substantial improvement in the ability to predict the energy confinement time as compared with linear regression. The significance of this result is discussed.
Bishop, C. M. (1992). Neural networks and their diagnostic applications. Review of Scientific Instruments 63(10), 4772–4774.
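The abstract above notes that a power-law scaling is usually fitted by linear regression. As a minimal sketch of why that works (not the authors' code), taking logarithms of tau = C * x1^a1 * x2^a2 * ... gives a model that is linear in log C and the exponents. The function name `fit_power_law`, the three input variables, and the exponent values below are hypothetical, and the data are synthetic and noise-free, not tokamak measurements.

```python
import numpy as np

def fit_power_law(X, tau):
    """Fit tau ~= C * prod_j X[:, j] ** alpha[j] by least squares in log space."""
    # Taking logs turns the power law into a linear model:
    #   log tau = log C + sum_j alpha_j * log X_j
    A = np.column_stack([np.ones(len(tau)), np.log(X)])
    coef, *_ = np.linalg.lstsq(A, np.log(tau), rcond=None)
    return np.exp(coef[0]), coef[1:]

# Synthetic, noise-free data with made-up exponents (illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(0.5, 2.0, size=(50, 3))
tau = 0.1 * X[:, 0] ** 1.0 * X[:, 1] ** -0.5 * X[:, 2] ** 0.3

C, alpha = fit_power_law(X, tau)
```

Because the fit is linear in log space, it can only represent products of powers; the neural network regression described in the abstract removes that restriction on functional form.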
Abstract
Neural network techniques offer a wide range of new opportunities for the analysis of data from plasma diagnostics. In particular, the class of neural network known as the multi-layer perceptron provides a general purpose approach to nonlinear data transformation between multidimensional spaces. In this paper, we outline the principles of the multi-layer perceptron, and illustrate its application to plasma diagnostics using two examples. The first of these concerns the extraction of line shape parameters from spectral data, and offers considerable improvements in speed compared with conventional approaches. The second application involves deconvolution of line integral data from a multichannel interferometer, allowing the extraction of more detailed density profiles than those obtained by conventional Abel inversion.
Bishop, C. M., P. Cox, P. S. Haynes, C. M. Roach, M. E. U. Smith, T. N. Todd, and D. L. Trotman (1992). A neural network approach to tokamak equilibrium control. In J. G. Taylor (Ed.), Neural Network Applications, pp. 114–128. Springer.
Abstract
We exploit the properties of the multi-layer perceptron to develop a neural network approach to the feedback control of plasma position and shape in a Tokamak experiment. The requirements of large bandwidth and high precision have led us to develop a custom hybrid analogue-digital hardware implementation of the neural network using conventional components. It is planned to demonstrate a complete system on the COMPASS Tokamak at Culham Laboratory.
Bishop, C. M. (1992). Curvature-driven smoothing in back-propagation neural networks. In J. G. Taylor and C. L. T. Mannion (Eds.), Theory and Applications of Neural Networks, pp. 139–148. Springer.
Abstract
The standard back-propagation learning algorithm for feed-forward networks aims to minimize the mean square error defined over a set of training data. This form of error measure can lead to the problem of over-fitting in which the network