Another long-established robust method is *regression diagnostics*. It
iteratively detects possibly wrong data and rejects them by analyzing the
globally fitted model. The classical approach works as follows:

- Determine an initial fit to the whole set of data through least squares.
- Compute the residual for each datum.
- Reject all data whose residuals exceed a predetermined threshold; if no data have been removed, then stop.
- Determine a new fit to the remaining data, and return to the second step (computing the residuals).
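The steps above can be sketched in a few lines of Python. This is only an illustrative sketch for a linear model `A @ x ≈ b`. The function name `fit_with_rejection`, the fixed `threshold`, and the `max_iter` safeguard are assumptions made for the example, not part of the original description.

```python
import numpy as np

def fit_with_rejection(A, b, threshold, max_iter=100):
    """Least-squares fit with iterative rejection of large residuals.

    A: (n, p) design matrix, b: (n,) observations.
    Returns the final parameters and a boolean mask of the retained data.
    """
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    keep = np.ones(len(b), dtype=bool)
    # Initial fit to the whole data set.
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    for _ in range(max_iter):
        # Residual of every datum under the current fit.
        residuals = b - A @ x
        # Reject retained data whose residual exceeds the threshold.
        reject = keep & (np.abs(residuals) > threshold)
        if not reject.any():
            break  # nothing removed: stop
        keep &= ~reject
        if keep.sum() < A.shape[1]:
            raise ValueError("too few data left to determine the fit")
        # New fit to the remaining data.
        x, *_ = np.linalg.lstsq(A[keep], b[keep], rcond=None)
    return x, keep

# Hypothetical example: points on the line b = 2 t + 1 with one moderate outlier.
t = np.arange(20.0)
A = np.column_stack([t, np.ones_like(t)])
b = 2.0 * t + 1.0
b[10] += 15.0
x, keep = fit_with_rejection(A, b, threshold=3.0)
print(x, np.flatnonzero(~keep))
```

In this example the outlier is moderate, so the initial fit is only slightly disturbed and the first pass isolates the bad datum.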

Clearly, the success of this method depends heavily on the quality of the
initial fit. If the initial fit is very poor, the residuals computed from it
are meaningless, and so is any diagnosis based on them for outlier
rejection. As pointed out by Barnett and Lewis, with least-squares techniques,
*even one or two outliers in a large set can wreak havoc*! This technique
therefore does not guarantee a correct solution. However, experience has shown
that it works well for problems with a moderate percentage of
outliers and, more importantly, with outliers having only *gross errors less than the
size of good data*.
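A small numerical experiment illustrates the failure mode. The data, the size of the gross error, and the threshold of 3 are all made up for the illustration. A single large error at the end of the range pulls the least-squares line so far that every good datum also has a residual above the threshold.

```python
import numpy as np

# Ten points on the line b = 2 t + 1, with one gross error on the last point.
t = np.arange(10.0)
A = np.column_stack([t, np.ones_like(t)])
b = 2.0 * t + 1.0
b[9] += 1000.0

# Ordinary least-squares fit to the whole data set.
x, *_ = np.linalg.lstsq(A, b, rcond=None)
slope, intercept = x
residuals = b - A @ x

# Which of the nine good data would be rejected with a threshold of 3?
good_flagged = np.abs(residuals[:9]) > 3.0
print("fitted slope:", slope)
print("good data flagged:", int(good_flagged.sum()), "of 9")
```

With residuals this distorted, thresholding the initial fit cannot separate the outlier from the good data.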

The threshold on residuals can be chosen from experience, for example using
graphical methods (plotting residuals at different scales). It is better to use an
a priori statistical noise model of the data and a chosen confidence level. Let
$r_i$ be the residual of the $i$-th datum, and $\sigma_i^2$ be the predicted
variance of the $i$-th residual based on the characteristics of the data noise
and the fit; then the standard test statistic $e_i = r_i / \sigma_i$ can be used. If
$e_i$ is not acceptable, the corresponding datum should be rejected.
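As a hedged sketch of such a test, assume a linear model with independent Gaussian noise of known standard deviation `noise_sigma`. Under that assumption the variance of the $i$-th residual is `noise_sigma**2 * (1 - h_ii)`, where `h_ii` is the $i$-th diagonal entry of the hat matrix of the fit. The function names and the two-sided normal quantile used as the acceptance limit are choices made for this example.

```python
import numpy as np
from statistics import NormalDist

def standardized_residuals(A, b, noise_sigma):
    """Residuals of a least-squares fit divided by their predicted std. dev."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    r = b - A @ x
    # Leverages h_ii from a thin QR factorization: H = Q Q^T.
    Q, _ = np.linalg.qr(A)
    h = np.sum(Q**2, axis=1)
    # Predicted std. dev. of each residual (assumes i.i.d. Gaussian noise).
    sigma_r = noise_sigma * np.sqrt(np.clip(1.0 - h, 1e-12, None))
    return r / sigma_r

def rejection_mask(e, confidence=0.99):
    """Flag data whose standardized residual lies outside the two-sided bound."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return np.abs(e) > z

# Hypothetical data: a line with small deterministic perturbations standing in
# for noise, plus one error of ten noise standard deviations.
t = np.arange(20.0)
A = np.column_stack([t, np.ones_like(t)])
b = 2.0 * t + 1.0 + 0.05 * (-1.0) ** np.arange(20)
b[10] += 1.0
e = standardized_residuals(A, b, noise_sigma=0.1)
flagged = rejection_mask(e, confidence=0.99)
print(np.flatnonzero(flagged))
```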

One improvement to the above technique uses *influence measures* to
pinpoint potential outliers. These measures assess the extent to which a
particular datum influences the fit by determining the change in the solution
when that datum is omitted. The refined technique works as follows:

- Determine an initial fit to the whole set of data through least squares.
- Conduct a statistical test of whether the measure of fit $f$ (e.g. the sum of squared residuals) is acceptable; if it is, then stop.
- For each datum $i$, delete it from the data set and determine the new fit, each giving a measure of fit denoted by $f_i$. Hence determine the change in the measure of fit, $\Delta f_i = f - f_i$, when datum $i$ is deleted.
- Delete the datum $i$ for which $\Delta f_i$ is the largest, and return to the second step.
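The refined procedure can be sketched as follows for a linear model, taking the sum of squared residuals as the measure of fit. The statistical acceptance test is reduced here to a simple comparison against a user-supplied bound `f_max`, which is an assumption of this sketch. The function names are likewise hypothetical. Each leave-one-out fit is recomputed from scratch for clarity rather than efficiency.

```python
import numpy as np

def sum_sq_residuals(A, b):
    """Measure of fit: sum of squared residuals of the least-squares solution."""
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    r = b - A @ x
    return float(r @ r)

def prune_by_influence(A, b, f_max):
    """Delete the most influential datum until the fit measure is acceptable.

    Returns the indices of the retained data.
    """
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    idx = np.arange(len(b))  # indices of the data still in the set
    while True:
        f = sum_sq_residuals(A[idx], b[idx])
        # Stop if the fit is acceptable or too few data would remain.
        if f <= f_max or len(idx) <= A.shape[1] + 1:
            return idx
        # Measure of fit with each datum deleted in turn.
        f_loo = np.array([
            sum_sq_residuals(np.delete(A[idx], k, axis=0), np.delete(b[idx], k))
            for k in range(len(idx))
        ])
        # Change in the measure of fit; delete the datum with the largest one.
        delta = f - f_loo
        idx = np.delete(idx, np.argmax(delta))

# Hypothetical example: exact line data with one corrupted observation.
t = np.arange(20.0)
A = np.column_stack([t, np.ones_like(t)])
b = 2.0 * t + 1.0
b[10] += 15.0
kept = prune_by_influence(A, b, f_max=1e-6)
print(kept)
```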

Note that the regression diagnostics approach depends heavily on a priori knowledge in choosing the thresholds for outlier rejection.

Thu Feb 8 11:42:20 MET 1996