N McGrogan, C M Bishop, L Tarassenko
Department of Engineering Science, University of Oxford
Microsoft Research, Cambridge
The solution of classification problems using statistical techniques requires
appropriately labelled training data. In the case of multi-channel data,
however, the labels may only be available in aggregate form rather than as
separate labels for each individual channel. Standard techniques, using a
trained model to predict each channel separately, are therefore precluded. In
this paper we present a new method of training neural network classifiers from
aggregate labels. This technique allows the network to learn what significant
events on individual channels result in the given labels. We apply this training
method to two synthetic (but, in the second case, realistic) problems and
compare the results with those from a classifier trained on the accurate channel
labels, which would usually not be available. On previously unseen test data for
the two problems 97.75% and 99.1% of feature vectors were classified correctly.
These represent reductions of only 0.5% and 0.1% from classifiers trained on
accurate labels for all channels.
The use of neural networks for classification is well documented and the requirements for training are similarly well known. One prerequisite of any training method is correctly labelled training data [5]. When a neural network is used to analyse time-varying data it is usual for the data to be temporally segmented and a label assigned to each segment [4].
In a multi-channel environment the same segmentation process can be used on the data and classification networks applied to each channel independently. However, the available labelling may only be aggregate, i.e., for each time segment only a single label is given; the channels are not labelled individually. The label indicates the occurrence of a particular event on at least one of the recorded channels, but it cannot be taken as correct when the channels are inspected independently.
An example of this problem occurs in the detection of spikes in the human electroencephalogram (EEG) during the diagnosis of epilepsy. Typically a number of channels of data (commonly 20) are recorded and segmented temporally. A single aggregate label is assigned to each time segment indicating the presence of spikes in at least one of the channels but there is no indication of the channels in which the spikes occurred. As a result, the channels in which there is no spike are wrongly labelled. The task of relabelling each channel independently would require a significant amount of time on the part of a trained EEG technician and this is not a practical option.
In this paper we present a method which allows a neural network to be trained on the available aggregate labelling to identify what characteristic of individual channels gives rise to the observations. The trained network can subsequently be used to classify each channel individually. Our approach builds on that adopted by Keeler et al. [3] to learn the spatial segmentation of hand-written numerals.
Figure 1 shows a simple example of the labelling problem which we have described. A time sequence of features (A, B, C or D) is shown over five channels. Each time slice has been given an aggregate label according to the presence (label 1) or absence (label 0) of a particular feature in at least one of the channels. By examining the data we can identify the critical feature (in this case, B) which results in an event being signalled. Once this has been established, subsequent data could be classified on each channel independently.
We start by presenting the theoretical background to the training method and then demonstrate its use on two synthetic data sets. In each case, results are compared against a neural network classifier trained on the full labelling of each channel. After a discussion of these results we conclude with some possible areas for future developments of this method.
To learn a solution to aggregate-labelled problems we use the following
approach. Suppose that the available training data consists of $N$ time slices
and $C$ channels. In this case we have a set of feature vectors
$\mathbf{x}_{nc}$ for $n = 1, \ldots, N$ and $c = 1, \ldots, C$.
We also have an aggregate label provided by an expert for each time slice given
by $t_n$, where $t_n \in \{0, 1\}$.
This label indicates the presence of a particular event in at least one of the
channels at time slice $n$.
In order to be able to classify the channels independently we need to train
one model per channel, $y_c(\mathbf{x}_{nc}; \mathbf{w}_c)$,
where $\mathbf{w}_c$ is a vector of adaptive parameters. The output of model
$y_c$ provides an estimate of the probability of our event
being observed in channel $c$ at a given time slice $n$.
If we assume that the distribution of feature vectors is independent of
channel, we can use the same model for each of the channels, in
which case $y(\mathbf{x}_{nc}; \mathbf{w})$
now represents the probability of our event being observed in channel $c$
at time slice $n$.
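It is possible to use a feed-forward neural network, such as a multi-layer
perceptron (MLP), as the non-linear model $y(\mathbf{x}; \mathbf{w})$,
so that
$$a_{nc} = a(\mathbf{x}_{nc}; \mathbf{w}) \tag{1}$$
$$y_{nc} = \frac{1}{1 + \exp(-a_{nc})} \tag{2}$$
where $a_{nc}$ is the activation of the network's output unit for channel $c$
at time slice $n$.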
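If we also assume that the channels are independent of one another then the
probability that, at a time slice $n$,
at least one of the channels contains our event is given by $p_n$, where
$$p_n = 1 - \prod_{c=1}^{C} \left(1 - y_{nc}\right) \tag{3}$$
and $y_{nc} = y(\mathbf{x}_{nc}; \mathbf{w})$.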
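As a quick numerical illustration of Equation 3, in which the aggregate probability is one minus the product of the per-channel probabilities that no event is present, suppose four channels give outputs of 0.1, 0.2, 0.05 and 0.6 at some time slice. These values are invented for the example and are not taken from the experiments.

```latex
\begin{align*}
p_n &= 1 - (1 - 0.1)(1 - 0.2)(1 - 0.05)(1 - 0.6) \\
    &= 1 - 0.9 \times 0.8 \times 0.95 \times 0.4 \\
    &= 1 - 0.2736 = 0.7264
\end{align*}
```

A single confident channel is enough to push the aggregate probability high, whereas several weakly active channels raise it only modestly.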
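We can now train the network by minimising the negative log-likelihood,
$E$ [1]:
$$E = -\ln \prod_{n=1}^{N} p_n^{t_n} (1 - p_n)^{1 - t_n} \tag{4}$$
$$\phantom{E} = -\sum_{n=1}^{N} \left\{ t_n \ln p_n + (1 - t_n) \ln (1 - p_n) \right\} \tag{5}$$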
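The negative log-likelihood for aggregate labels can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: each channel probability is taken to be the logistic sigmoid of an output activation, the aggregate probability is one minus the product of the per-channel "no event" probabilities, and the function name `aggregate_nll`, the nested-list data layout and the `eps` clipping are illustrative choices rather than anything specified in the paper. Under these assumptions the derivative of the error with respect to a channel's activation works out to -(t_n - p_n) y_nc / p_n, which the function also returns.

```python
import math


def aggregate_nll(activations, labels, eps=1e-12):
    """Return (E, grads) for aggregate labels.

    activations[n][c] is the pre-sigmoid network output for channel c at
    time slice n; labels[n] is the aggregate label (0 or 1) for slice n.
    grads[n][c] is dE/d(activations[n][c]).
    """
    total = 0.0
    grads = []
    for a_n, t_n in zip(activations, labels):
        # Per-channel event probabilities (assumed logistic output unit).
        y_n = [1.0 / (1.0 + math.exp(-a)) for a in a_n]
        # Probability that at least one channel contains the event.
        p_n = 1.0 - math.prod(1.0 - y for y in y_n)
        # Clip to avoid log(0); an implementation detail, not from the paper.
        p_n = min(max(p_n, eps), 1.0 - eps)
        total -= t_n * math.log(p_n) + (1 - t_n) * math.log(1.0 - p_n)
        grads.append([-(t_n - p_n) * y / p_n for y in y_n])
    return total, grads
```

When the aggregate label is 0 the gradient for each channel reduces to its own output, so every channel is pushed towards "no event"; when the label is 1 the channels that already respond most strongly receive the largest share of the correction.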
Having developed the theory behind the training method we shall now turn to some practical examples using synthetic data sets.
Our first synthetic problem consists of data from four two-dimensional,
radially symmetric Gaussian distributions with the sampled $x_1$ and $x_2$
values being the features used. Figure 2 gives a plot of the distributions used.
Independent training, validation and test data sets were constructed, each
consisting of 100 sampled points on each of 4 "channels". For each
channel, at time slice $n$,
a point is sampled randomly from one of the four Gaussians (with equal priors).
The labels for each channel are artificially assigned according to the following
rule: the label is 1 if the point is taken from Gaussian A,
0 otherwise. The per-channel and aggregate labelling of the training data
set are shown in Figure 3.
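The construction of such a data set can be sketched as below. The centres and the standard deviation of the four Gaussians are hypothetical placeholders, since their actual values are not stated in the text, and the function name `make_dataset` is likewise an illustrative choice. The sketch draws one point per channel per time slice with equal priors, labels a channel 1 only when the point comes from Gaussian A, and sets the aggregate label to 1 when any channel in the slice is labelled 1.

```python
import random

# Hypothetical centres and spread; the values used in the paper are not given.
CENTRES = {"A": (0.0, 0.0), "B": (2.0, 0.0), "C": (0.0, 2.0), "D": (2.0, 2.0)}
SIGMA = 0.5


def make_dataset(n_slices=100, n_channels=4, seed=0):
    """Return (features, channel_labels, aggregate_labels).

    features[n][c] is an (x1, x2) pair, channel_labels[n][c] is 1 if the
    point came from Gaussian A, and aggregate_labels[n] is 1 if any channel
    in time slice n is labelled 1.
    """
    rng = random.Random(seed)
    features, channel_labels, aggregate_labels = [], [], []
    for _ in range(n_slices):
        row_x, row_t = [], []
        for _ in range(n_channels):
            name = rng.choice("ABCD")  # equal priors over the four Gaussians
            cx, cy = CENTRES[name]
            row_x.append((rng.gauss(cx, SIGMA), rng.gauss(cy, SIGMA)))
            row_t.append(1 if name == "A" else 0)
        features.append(row_x)
        channel_labels.append(row_t)
        aggregate_labels.append(max(row_t))
    return features, channel_labels, aggregate_labels
```

Only `features` and `aggregate_labels` would be shown to the classifier during aggregate training; `channel_labels` is kept for the comparison with per-channel training.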
MLP classifiers with structures of the form 2-$H$-1, for a range of
hidden-layer sizes $H$, were trained using the scaled conjugate
gradient optimisation method with the error function given in Equation 4.
The $t_n$ values are the aggregate labels shown in
Figure 3(b).
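A training loop of this kind can be sketched in plain Python. Note the substitutions: plain batch gradient descent is used here in place of scaled conjugate gradient, and the tanh hidden units, logistic output unit, weight initialisation and the names `forward` and `train_aggregate_mlp` are all assumptions made for illustration. The same network is applied to every channel, and only the aggregate label of each time slice enters the error.

```python
import math
import random


def forward(params, x):
    """Return (hidden activations, output probability) for one feature vector."""
    w1, b1, w2, b2 = params
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(w1, b1)]
    y = 1.0 / (1.0 + math.exp(-(sum(w * hj for w, hj in zip(w2, h)) + b2)))
    return h, y


def train_aggregate_mlp(features, labels, hidden=3, epochs=200, lr=0.1, seed=0):
    """Fit a dim-hidden-1 MLP to aggregate labels by batch gradient descent.

    features[n][c] is a feature vector; labels[n] is the aggregate label.
    Returns the parameters and the error recorded at each epoch.
    """
    rng = random.Random(seed)
    dim = len(features[0][0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0
    losses = []
    for _ in range(epochs):
        g_w1 = [[0.0] * dim for _ in range(hidden)]
        g_b1 = [0.0] * hidden
        g_w2 = [0.0] * hidden
        g_b2 = 0.0
        loss = 0.0
        for x_n, t_n in zip(features, labels):
            outs = [forward((w1, b1, w2, b2), x) for x in x_n]
            # Probability that at least one channel contains the event.
            p_n = 1.0 - math.prod(1.0 - y for _, y in outs)
            p_n = min(max(p_n, 1e-12), 1.0 - 1e-12)
            loss -= t_n * math.log(p_n) + (1 - t_n) * math.log(1.0 - p_n)
            for x, (h, y) in zip(x_n, outs):
                delta = -(t_n - p_n) * y / p_n  # dE/d(output activation)
                g_b2 += delta
                for j in range(hidden):
                    g_w2[j] += delta * h[j]
                    d_h = delta * w2[j] * (1.0 - h[j] ** 2)
                    g_b1[j] += d_h
                    for i in range(dim):
                        g_w1[j][i] += d_h * x[i]
        losses.append(loss)
        # Gradient step, averaged over the number of time slices.
        scale = lr / len(features)
        b2 -= scale * g_b2
        for j in range(hidden):
            w2[j] -= scale * g_w2[j]
            b1[j] -= scale * g_b1[j]
            for i in range(dim):
                w1[j][i] -= scale * g_w1[j][i]
    return (w1, b1, w2, b2), losses
```

After training, `forward(params, x)[1]` gives the per-channel probability for any single feature vector, which is what allows each channel to be classified independently.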
After training, the 2-11-1 network was identified as having the lowest error on
the validation data set and is the model used for further testing.
Setting the decision boundary to 0.5 and applying the trained network
independently to each of the channels of the test data set resulted in
9 feature vectors (2.25%) being misclassified. Figure 4
shows the results graphically.
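The evaluation step amounts to thresholding each per-channel output and counting disagreements with the known per-channel labels. A small sketch follows; the function name `misclassification_rate` and the flat-list layout are illustrative assumptions.

```python
def misclassification_rate(probabilities, true_labels, boundary=0.5):
    """Return (error count, error percentage) for per-channel predictions.

    probabilities and true_labels are flat sequences with one entry per
    feature vector; outputs above `boundary` are classified as 1.
    """
    errors = sum(
        int((p > boundary) != bool(t))
        for p, t in zip(probabilities, true_labels)
    )
    return errors, 100.0 * errors / len(true_labels)
```

For the first problem the test set holds 100 time slices on each of 4 channels, that is 400 feature vectors, so 9 errors correspond to 100 × 9 / 400 = 2.25%.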
These results can be compared with the classification accuracy of an MLP
network trained on the fully labelled data (i.e., the same training data and the
same training procedure, except that we now use per-channel labels $t_{nc}$
rather than aggregate labels $t_n$). A 2-2-1 network gave the best
generalisation performance and left 7 feature vectors (1.75%)
misclassified from the test data set.
Study of the human electroencephalogram (EEG) recorded during the investigation of epilepsy has shown that a large majority of subjects suffering from epilepsy exhibit spikes in their EEG between seizures (inter-ictal spikes) [6]. In most cases when epilepsy is confirmed by analysis of the EEG, it is on the basis of inter-ictal activity [2]. The detection of these inter-ictal spikes is therefore an important step in the diagnosis of epilepsy.
Recordings of the EEG are generally made over multiple
(approximately 20) channels and the expert labelling of this data for
spikes is a prime example of aggregate labelling: spikes are identified as
occurring within a particular time period, but the channels in which the spikes
occur are not recorded. The labelling of individual channels would be too
time-consuming and so the ability to train a neural network spike detector from
just the aggregate labels would be an important step forward. For this reason we
have assembled another synthetic, but realistic, data set, designed to mimic the
detection of EEG spikes. A five-coefficient auto-regressive (AR) model of human
EEG sampled at 256 Hz during wakefulness has been used to generate four
channels of synthetic background EEG. Spikes of variable height and duration
(between 50 and 100 ms) have been inserted into this data randomly (with a
probability of
that a spike will occur in a one second time period). Figure 5
shows a short section of one channel of the signal.
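One channel of such a signal could be generated along the following lines. Every numerical choice here is a hypothetical stand-in: the AR coefficients are not those fitted to human EEG, the unit-variance driving noise, triangular spike shape and height range are invented, and the per-second spike probability is left as the parameter `spike_prob` because its value is not assumed. Only the sampling rate, the AR order, the 50 to 100 ms spike duration and the one-second labelling period are taken from the description above.

```python
import math
import random

FS = 256  # sampling rate in Hz, as stated in the text
# Hypothetical, stable AR(5) coefficients (not fitted to real EEG).
AR_COEFFS = [0.5, -0.2, 0.1, -0.05, 0.02]


def synth_channel(seconds, spike_prob, seed=0):
    """Return (signal, labels): AR(5) background with triangular spikes.

    labels[s] is 1 if a spike was inserted in the s-th one-second period.
    """
    rng = random.Random(seed)
    n = seconds * FS
    x = [0.0] * n
    for i in range(n):
        past = sum(a * x[i - k - 1]
                   for k, a in enumerate(AR_COEFFS) if i - k - 1 >= 0)
        x[i] = past + rng.gauss(0.0, 1.0)
    labels = [0] * seconds
    min_w = math.ceil(0.05 * FS)   # 13 samples, about 51 ms
    max_w = math.floor(0.10 * FS)  # 25 samples, about 98 ms
    for s in range(seconds):
        if rng.random() < spike_prob:
            width = rng.randint(min_w, max_w)
            height = rng.uniform(5.0, 10.0)  # hypothetical height range
            start = s * FS + rng.randint(0, FS - width)
            for j in range(width):
                # Triangular profile rising from 0 to 1 and back to 0.
                tri = 1.0 - abs(2.0 * j / (width - 1) - 1.0)
                x[start + j] += height * tri
            labels[s] = 1
    return x, labels
```

Generating four channels with different seeds and taking the maximum of their per-second labels would give the aggregate label for each time slice, while the individual label lists play the role of the per-channel ground truth.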
Four-channel training and test data sets were constructed, each 250 seconds long. Since this is artificial data, as with the sampled Gaussian data in the first problem, the actual per-channel labels are known for both data sets.
The data is segmented into one-second time slices and the features used as
input to the neural network are the mean slope and mean sharpness of the signal
over each time slice. For three consecutive EEG sample values,
$s_{i-1}$, $s_i$ and $s_{i+1}$,
slope and sharpness are defined as [6]:
|
|
|
(6) |
|
|
|
(7) |
|
|
|
(8) |
|
|
|
(9) |
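As an illustration of how two such features might be computed for one time slice, the sketch below assumes that slope is the magnitude of the central first difference and sharpness the magnitude of the second difference over three consecutive samples. These definitions are assumptions chosen for the example and may differ in detail from those of [6]; the function name `slice_features` is also illustrative.

```python
def slice_features(samples):
    """Return (mean slope, mean sharpness) for one time slice.

    Assumed definitions for three consecutive samples prev, curr, nxt:
    slope = |nxt - prev| / 2 and sharpness = |nxt - 2 * curr + prev|.
    """
    slopes, sharps = [], []
    for prev, curr, nxt in zip(samples, samples[1:], samples[2:]):
        slopes.append(abs(nxt - prev) / 2.0)         # central first difference
        sharps.append(abs(nxt - 2.0 * curr + prev))  # second difference
    n = len(slopes)
    return sum(slopes) / n, sum(sharps) / n
```

With one-second slices at 256 Hz, each call would receive 256 samples and return the two-dimensional feature vector for that channel and time slice. A steady ramp gives a non-zero slope and zero sharpness, while an isolated peak gives a large sharpness.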
The two classes in this problem are almost linearly separable. A 2-4-1
network structure was used for classification with both the aggregate and the
fully labelled data. Figure 7
shows the classifications given by the network trained on aggregate labels using
a 0.5 decision boundary: 9 feature vectors (0.9%) were misclassified.
For comparison a 2-4-1 network trained on the fully labelled training set
(i.e., $t_{nc}$ labels rather than $t_n$ labels) left 8 feature vectors (0.8%)
misclassified when applied to the test data.
Results from the application of this training method to two training sets have shown that it is possible to train a neural network classifier from aggregate labels with only a very slight reduction in performance. This degradation is insignificant with respect to the cost (either financial or in terms of manpower) of extensive expert relabelling.
In the studies presented in this paper we have used synthetic data to show
that the same model $y(\mathbf{x}; \mathbf{w})$ can be fitted to the data
from individual channels as a result of training
with an aggregate label $t_n$.
We used synthetic (but realistic, in the case of the EEG) data in order to have
the correct individual labels $t_{nc}$
also available, so that per-channel training could be compared with the method
presented in this paper. Testing using real EEG data is currently in progress
and we hope to use this method to detect automatically the onset of epileptic
seizures in long-term recordings for which the amount of time required for a
technician to relabel the available data on a per-channel basis is considered
prohibitive.
Further development of the training method is required to support different models for each channel, to allow for spatial correlation between neighbouring channels, and to move beyond two-class problems by allowing multiple outputs from the classifier.
Nick McGrogan is supported by an EPSRC studentship. We gratefully acknowledge the help of our clinical collaborators at the National Hospital for Neurology and Neurosurgery, Mr Philip Allen and Dr Sheilagh Smith, with the data collection and analysis.
The translation was initiated by Nick McGrogan on 1999-02-11