C M Bishop,
Department of Engineering Science, University of Oxford
Microsoft Research, Cambridge
The solution of classification problems using statistical techniques requires appropriately labelled training data. In the case of multi-channel data, however, the labels may only be available in aggregate form rather than as separate labels for each individual channel. Standard techniques, using a trained model to predict each channel separately, are therefore precluded. In this paper we present a new method of training neural network classifiers from aggregate labels. This technique allows the network to learn which significant events on individual channels result in the given labels. We apply this training method to two synthetic (but, in the second case, realistic) problems and compare the results with those from a classifier trained on the accurate channel labels, which would usually not be available. On previously unseen test data for the two problems, 97.75% and 99.1% of feature vectors were classified correctly. These represent reductions of only 0.5 and 0.1 percentage points from classifiers trained on accurate labels for all channels.
The use of neural networks for classification is well documented and the requirements for training are similarly well known. One prerequisite of any training method is correctly labelled training data. When a neural network is used to analyse time-varying data it is usual for the data to be temporally segmented and a label assigned to each segment.
In a multi-channel environment the same segmentation process can be used on the data and classification networks applied to each channel independently. However, the available labelling may only be aggregate, i.e., for each time segment only a single label is given; the channels are not labelled individually. The label indicates the occurrence of a particular event on at least one of the recorded channels, but it cannot be taken as correct when the channels are inspected independently.
An example of this problem occurs in the detection of spikes in the human electroencephalogram (EEG) during the diagnosis of epilepsy. Typically a number of channels of data (commonly 20) are recorded and segmented temporally. A single aggregate label is assigned to each time segment indicating the presence of spikes in at least one of the channels but there is no indication of the channels in which the spikes occurred. As a result, the channels in which there is no spike are wrongly labelled. The task of relabelling each channel independently would require a significant amount of time on the part of a trained EEG technician and this is not a practical option.
In this paper we present a method which allows a neural network to be trained on the available aggregate labelling to identify what characteristic of individual channels gives rise to the observations. The trained network can subsequently be used to classify each channel individually. Our approach builds on that adopted by Keeler et al. to learn the spatial segmentation of hand-written numerals.
Figure 1 shows a simple example of the labelling problem which we have described. A time sequence of features (A, B, C or D) is shown over five channels. Each time slice has been given an aggregate label according to the presence (label 1) or absence (label 0) of a particular feature in at least one of the channels. By examining the data we can identify the critical feature (in this case, B) which results in an event being signalled. Once this has been established, subsequent data could be classified on each channel independently.
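The reasoning in this example can be sketched in a few lines of Python. The feature grid and labels below are invented for illustration (they are not the contents of Figure 1), and `critical_features` is a hypothetical helper name. A feature is treated as consistent with the aggregate labels if it appears in every time slice labelled 1 and in no time slice labelled 0.

```python
def critical_features(slices, labels):
    """Return the features whose presence in a slice matches the aggregate label.

    slices: list of time slices, each a list of per-channel feature symbols.
    labels: list of aggregate labels (1 = event signalled, 0 = no event).
    """
    candidates = set().union(*map(set, slices))
    return sorted(
        f for f in candidates
        if all((f in s) == bool(t) for s, t in zip(slices, labels))
    )


# Hypothetical data: 4 time slices over 5 channels.
slices = [
    ["A", "C", "B", "D", "A"],
    ["A", "A", "C", "D", "D"],
    ["B", "C", "C", "A", "D"],
    ["D", "C", "A", "A", "C"],
]
labels = [1, 0, 1, 0]
print(critical_features(slices, labels))
```

With continuous feature vectors this exhaustive check is not available, which is why a trained probabilistic model is needed instead.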
We start by presenting the theoretical background to the training method and then demonstrate its use on two synthetic data sets. In each case, results are compared against a neural network classifier trained on the full labelling of each channel. After a discussion of these results we conclude with some possible areas for future developments of this method.
To learn a solution to aggregate labelled problems we use the following approach. Suppose that the available training data consists of $N$ time slices and $C$ channels. In this case we have a set of feature vectors $\mathbf{x}_{nc}$ for $n = 1, \ldots, N$ and $c = 1, \ldots, C$. We also have an aggregate label provided by an expert for each time slice given by $t_n$, where $t_n \in \{0, 1\}$. This label indicates the presence of a particular event in at least one of the channels at time slice $n$.
In order to be able to classify the channels independently we need to train one model per channel, $y_c(\mathbf{x}; \mathbf{w}_c)$, where $\mathbf{w}_c$ is a vector of adaptive parameters. The output of model $c$ provides an estimate of the probability of our event being observed in channel $c$ at a given time slice $n$.
If we assume that the distribution of feature vectors is independent of channel, we can use the same model $y(\mathbf{x}; \mathbf{w})$ for each of the channels, in which case $y(\mathbf{x}_{nc}; \mathbf{w})$ now represents the probability of our event being observed in channel $c$ at time slice $n$. It is possible to use a feed-forward neural network, such as a multi-layer perceptron (MLP), as the non-linear model $y$, so that

$$ y_{nc} = y(\mathbf{x}_{nc}; \mathbf{w}) \qquad (1) $$

If we also assume that the channels are independent of one another then the probability that, at a time slice $n$, at least one of the channels contains our event is given by $p_n$, where

$$ p_n = 1 - \prod_{c=1}^{C} \left( 1 - y_{nc} \right) \qquad (2) $$

We can now train the network by minimising the negative log-likelihood, $E$. The likelihood of the aggregate labels is

$$ L = \prod_{n=1}^{N} p_n^{t_n} \left( 1 - p_n \right)^{1 - t_n} \qquad (3) $$

so that

$$ E = -\ln L = -\sum_{n=1}^{N} \left[ t_n \ln p_n + (1 - t_n) \ln (1 - p_n) \right] \qquad (4) $$
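A minimal NumPy sketch of this error function may help make the training signal concrete. It assumes binary aggregate labels, per-channel outputs in (0, 1), independent channels (so the probability of at least one event in a slice is one minus the product of the per-channel non-event probabilities), and a logistic sigmoid output unit. The function names and the `eps` clipping constant are illustrative choices, not part of the method as published.

```python
import numpy as np


def aggregate_probability(y):
    """Probability that at least one channel contains the event.

    y: array of shape (N, C) of per-channel event probabilities,
       with channels assumed independent.
    """
    return 1.0 - np.prod(1.0 - y, axis=1)


def aggregate_nll(y, t, eps=1e-12):
    """Negative log-likelihood of aggregate labels t (shape (N,), values 0/1)."""
    p = np.clip(aggregate_probability(y), eps, 1.0 - eps)
    return -np.sum(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))


def aggregate_nll_grad_logits(a, t, eps=1e-12):
    """Gradient of the error with respect to output pre-activations a (N, C).

    Assumes a logistic sigmoid output y = 1 / (1 + exp(-a)), for which the
    chain rule gives dE/da_nc = (p_n - t_n) * y_nc / p_n.
    """
    y = 1.0 / (1.0 + np.exp(-a))
    p = np.clip(aggregate_probability(y), eps, 1.0)
    return ((p - t) / p)[:, None] * y


# Two channels, each with event probability 0.5: the aggregate is 0.75.
print(aggregate_probability(np.array([[0.5, 0.5]])))
```

The gradient shows how credit is shared: when the aggregate label is 1, each channel is pushed towards the event class in proportion to its own output relative to the aggregate probability, so the channel that already looks most like the event receives most of the update.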
Having developed the theory behind the training method we shall now turn to some practical examples using synthetic data sets.
Our first synthetic problem consists of data from four two-dimensional, radially symmetric Gaussian distributions, with the two sampled coordinate values being the features used. Figure 2 gives a plot of the distributions used.
Independent training, validation and test data sets were constructed, each consisting of 100 sampled points on each of 4 "channels". For each channel, at each time slice, a point is sampled randomly from one of the four Gaussians (with equal priors). The labels for each channel are artificially assigned according to the following rule: the label is 1 if the point is taken from Gaussian A, 0 otherwise. The per-channel and aggregate labelling of the training data set are shown in Figure 3.
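The construction of such a data set can be sketched as follows. The means and the common standard deviation of the four Gaussians are hypothetical values chosen for illustration, since the text does not state them, and `make_gaussian_dataset` is an illustrative name.

```python
import numpy as np


def make_gaussian_dataset(n_slices=100, n_channels=4, seed=0):
    """Sample one point per (time slice, channel) and build both labellings."""
    rng = np.random.default_rng(seed)
    # Hypothetical means for Gaussians A-D and a shared standard deviation;
    # the values used in the paper are not given in the text.
    means = np.array([[0.0, 0.0], [2.0, 2.0], [-2.0, 2.0], [0.0, -2.5]])
    sigma = 0.7
    # Equal priors: pick one of the four Gaussians for every (slice, channel).
    source = rng.integers(0, 4, size=(n_slices, n_channels))
    x = means[source] + sigma * rng.standard_normal((n_slices, n_channels, 2))
    channel_labels = (source == 0).astype(int)      # 1 if drawn from Gaussian A
    aggregate_labels = channel_labels.max(axis=1)   # 1 if any channel is 1
    return x, channel_labels, aggregate_labels


x, channel_labels, aggregate_labels = make_gaussian_dataset()
print(x.shape, channel_labels.shape, aggregate_labels.shape)
```

Only `aggregate_labels` would be available in a real application; `channel_labels` is kept here so that the comparison classifier can be trained and the per-channel errors counted.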
MLP classifiers with structures of the form 2-$J$-1, for a range of hidden-layer sizes $J$, were trained using the scaled conjugate gradient optimisation method with the error function given in Equation 4. The target values are the aggregate labels shown in Figure 3(b). After training, the 2-11-1 network was identified as having the lowest error on the validation data set and is the model used for further testing.
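The training procedure can be sketched end to end as below. This is not the setup used in the paper: plain batch gradient descent stands in for scaled conjugate gradients, the hidden units are assumed to be tanh, and the data are generated from hypothetical Gaussian parameters. All function names are illustrative.

```python
import numpy as np


def forward(params, x):
    """x: (N, C, 2) feature vectors -> hidden activations (N, C, J), outputs (N, C)."""
    h = np.tanh(x @ params["W1"] + params["b1"])
    a = h @ params["W2"] + params["b2"]
    return h, 1.0 / (1.0 + np.exp(-a))


def aggregate_error(y, t, eps=1e-12):
    """Negative log-likelihood of aggregate labels t given per-channel outputs y."""
    p = np.clip(1.0 - np.prod(1.0 - y, axis=1), eps, 1.0 - eps)
    return -np.sum(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))


def train(x, t, n_hidden=4, steps=2000, lr=0.1, seed=0):
    """Batch gradient descent (a stand-in for scaled conjugate gradients)."""
    rng = np.random.default_rng(seed)
    params = {
        "W1": 0.5 * rng.standard_normal((x.shape[-1], n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": 0.5 * rng.standard_normal(n_hidden),
        "b2": 0.0,
    }
    history = []
    for _ in range(steps):
        h, y = forward(params, x)
        history.append(aggregate_error(y, t))
        p = np.clip(1.0 - np.prod(1.0 - y, axis=1), 1e-12, 1.0)
        # dE/da for sigmoid outputs: (p_n - t_n) * y_nc / p_n, shape (N, C).
        delta_a = ((p - t) / p)[:, None] * y
        # Back-propagate through the shared tanh hidden layer, shape (N, C, J).
        delta_h = delta_a[..., None] * params["W2"] * (1.0 - h ** 2)
        grads = {
            "W1": np.einsum("ncd,ncj->dj", x, delta_h),
            "b1": delta_h.sum(axis=(0, 1)),
            "W2": np.einsum("nc,ncj->j", delta_a, h),
            "b2": delta_a.sum(),
        }
        for key in params:
            params[key] = params[key] - lr * grads[key] / len(t)
    return params, history


# Hypothetical training set: 100 time slices, 4 channels, event = Gaussian 0.
rng = np.random.default_rng(1)
means = np.array([[0.0, 0.0], [2.0, 2.0], [-2.0, 2.0], [0.0, -2.5]])
source = rng.integers(0, 4, size=(100, 4))
x = means[source] + 0.7 * rng.standard_normal((100, 4, 2))
t = (source == 0).any(axis=1).astype(float)   # aggregate labels only
params, history = train(x, t)
print(history[0], history[-1])
```

Because the same weights are applied to every channel, the network sees only aggregate targets during training yet produces a separate output per channel. Model selection would repeat `train` for several values of `n_hidden` and keep the network with the lowest aggregate error on a validation set.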
Setting the decision boundary to 0.5 and applying the trained network independently to each of the channels of the test data set resulted in 9 of the 400 feature vectors (2.25%) being misclassified. Figure 4 shows the results graphically.
These results can be compared with the classification accuracy of an MLP network trained on the fully labelled data (i.e., the same training data and the same training procedure, except that we now use per-channel labels rather than aggregate labels). A 2-2-1 network gave the best generalisation performance and left 7 feature vectors (1.75%) misclassified from the test data set.
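Applying the decision boundary per channel amounts to thresholding each network output and comparing it with the true channel label. A small sketch, with invented outputs and labels and an illustrative function name:

```python
import numpy as np


def misclassified(y, channel_labels, threshold=0.5):
    """Count per-channel misclassifications at a given decision boundary."""
    predictions = (y >= threshold).astype(int)
    n_errors = int(np.sum(predictions != channel_labels))
    return n_errors, n_errors / channel_labels.size


# Hypothetical network outputs for 3 time slices on 4 channels.
y = np.array([[0.9, 0.1, 0.2, 0.4],
              [0.3, 0.6, 0.1, 0.0],
              [0.2, 0.1, 0.7, 0.8]])
labels = np.array([[1, 0, 0, 0],
                   [0, 0, 0, 0],
                   [0, 0, 1, 1]])
n_errors, rate = misclassified(y, labels)
print(n_errors, rate)
```

Note that the evaluation uses the per-channel labels, which are available here only because the data are synthetic.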
Study of the human electroencephalogram (EEG) recorded during the investigation of epilepsy has shown that a large majority of subjects suffering from epilepsy exhibit spikes in their EEG between seizures (inter-ictal spikes). In most cases when epilepsy is confirmed by analysis of the EEG, it is on the basis of inter-ictal activity. The detection of these inter-ictal spikes is therefore an important step in the diagnosis of epilepsy.
Recordings of the EEG are generally made over multiple (approximately 20) channels and the expert labelling of this data for spikes is a prime example of aggregate labelling: spikes are identified as occurring within a particular time period, but the channels in which the spikes occur are not recorded. The labelling of individual channels would be too time-consuming and so the ability to train a neural network spike detector from just the aggregate labels would be an important step forward. For this reason we have assembled another synthetic, but realistic, data set, designed to mimic the detection of EEG spikes. A five-coefficient auto-regressive (AR) model of human EEG sampled at 256 Hz during wakefulness has been used to generate four channels of synthetic background EEG. Spikes of variable height and duration (between 50 and 100 ms) have been inserted into this data at random, with a fixed probability that a spike will occur in any one-second time period. Figure 5 shows a short section of one channel of the signal.
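One channel of such a signal could be generated along the following lines. Every numerical choice here is an assumption made for illustration: the AR coefficients are arbitrary stable values rather than a fitted EEG model, the spike is triangular, and the height range and default spike probability are invented. Only the sampling rate, the AR order and the 50 to 100 ms duration range come from the text.

```python
import numpy as np


def synthetic_eeg(n_seconds=10, fs=256, spike_prob=0.1, seed=0):
    """Return one channel of AR(5) background with random spikes, plus
    a 0/1 label per one-second slice marking where a spike was inserted."""
    rng = np.random.default_rng(seed)
    # Hypothetical, stable AR(5) coefficients; not the fitted EEG model.
    ar = np.array([0.5, -0.2, 0.1, -0.05, 0.02])
    n = n_seconds * fs
    noise = rng.standard_normal(n)
    signal = np.zeros(n)
    for i in range(n):
        past = signal[max(0, i - 5):i][::-1]   # most recent sample first
        signal[i] = np.dot(ar[:len(past)], past) + noise[i]
    labels = np.zeros(n_seconds, dtype=int)
    min_width = int(np.ceil(0.05 * fs))        # at least 50 ms
    max_width = int(np.floor(0.10 * fs))       # at most 100 ms
    for s in range(n_seconds):
        if rng.random() < spike_prob:
            width = int(rng.integers(min_width, max_width + 1))
            height = rng.uniform(5.0, 10.0)    # assumed height range
            start = s * fs + int(rng.integers(0, fs - width + 1))
            # Triangular spike: rise to the peak, then fall back to zero.
            signal[start:start + width] += height * np.bartlett(width)
            labels[s] = 1
    return signal, labels


signal, labels = synthetic_eeg()
print(signal.shape, labels)
```

Calling the function once per channel with different seeds gives the per-channel labels; taking the maximum over channels for each second gives the aggregate label.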
Four-channel training and test data sets were constructed, each 250 seconds long. Since this is artificial data, as with the sampled Gaussian data in the first problem, the actual per-channel labels are known for both data sets.
The data is segmented into one-second time slices and the features used as input to the neural network are the mean slope and mean sharpness of the signal over each time slice. Slope and sharpness are each defined in terms of three consecutive EEG sample values.
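The exact definitions of slope and sharpness are not reproduced in the text above, so the sketch below substitutes common choices: the absolute central difference for slope and the absolute second difference for sharpness. These are assumptions, not the paper's definitions, and `slice_features` is an illustrative name.

```python
import numpy as np


def slice_features(signal, fs=256):
    """Mean slope and mean sharpness for each one-second slice.

    ASSUMED definitions (not taken from the paper), for consecutive samples
    x[i-1], x[i], x[i+1]:
        slope     = |x[i+1] - x[i-1]| / 2
        sharpness = |x[i+1] - 2*x[i] + x[i-1]|
    """
    n_slices = len(signal) // fs
    features = np.empty((n_slices, 2))
    for s in range(n_slices):
        x = np.asarray(signal[s * fs:(s + 1) * fs], dtype=float)
        slope = np.abs(x[2:] - x[:-2]) / 2.0
        sharpness = np.abs(x[2:] - 2.0 * x[1:-1] + x[:-2])
        features[s] = slope.mean(), sharpness.mean()
    return features


# A linear ramp has constant slope and zero sharpness under these definitions.
print(slice_features(np.arange(512.0)))
```

Each row of the result is the two-dimensional feature vector for one time slice of one channel.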
The two classes in this problem are almost linearly separable. A 2-4-1 network structure was used for classification with both the aggregate and the fully labelled data. Figure 7 shows the classifications given by the network trained on aggregate labels using a 0.5 decision boundary: 9 of the 1000 feature vectors (0.9%) were misclassified.
For comparison a 2-4-1 network trained on the fully labelled training set (i.e., per-channel labels rather than aggregate labels) left 8 feature vectors (0.8%) misclassified when applied to the test data.
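The error rates for both problems follow directly from the misclassification counts and the test-set sizes. The sizes are taken to be 100 time slices on 4 channels for the Gaussian problem and 250 one-second slices on 4 channels for the EEG problem, as described in the text; `error_rate` is an illustrative helper.

```python
def error_rate(n_errors, n_vectors):
    """Misclassification rate as a percentage."""
    return 100.0 * n_errors / n_vectors


results = {
    "Gaussian": {"n_vectors": 100 * 4, "aggregate": 9, "per_channel": 7},
    "EEG": {"n_vectors": 250 * 4, "aggregate": 9, "per_channel": 8},
}
for name, r in results.items():
    agg = error_rate(r["aggregate"], r["n_vectors"])
    full = error_rate(r["per_channel"], r["n_vectors"])
    print(f"{name}: aggregate {agg:.2f}%, per-channel {full:.2f}%, "
          f"gap {agg - full:.2f} points")
```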
Results from the application of this training method to two data sets have shown that it is possible to train a neural network classifier from aggregate labels with only a very slight reduction in performance (one or two additional misclassified feature vectors on each test set). This degradation is insignificant with respect to the cost (either financial or in terms of manpower) of extensive expert relabelling.
In the studies presented in this paper we have used synthetic data to show that the same model can be fitted to the data from individual channels as a result of training with an aggregate label. We used synthetic (but realistic, in the case of the EEG) data in order to have the correct individual labels also available, so that per-channel training could be compared with the method presented in this paper. Testing using real EEG data is currently in progress and we hope to use this method to detect automatically the onset of epileptic seizures in long-term recordings for which the amount of time required for a technician to relabel the available data on a per-channel basis is considered prohibitive.
Further development of the training method is required to support different models for each channel, to allow for spatial correlation between neighbouring channels, and to move beyond two class problems by allowing multiple outputs from the classifier.
Nick McGrogan is supported by an EPSRC studentship. We gratefully acknowledge the help of our clinical collaborators at the National Hospital for Neurology and Neurosurgery, Mr Philip Allen and Dr Sheilagh Smith, with the data collection and analysis.
The translation was initiated by Nick McGrogan on 1999-02-11