# Pattern Recognition and Machine Learning: Data Sets

This page provides access to the data sets described and illustrated in Appendix A of Pattern Recognition and Machine Learning, which are also used in several of the book's examples and figures. I would like to thank Markus Svensén for putting this together.

## Handwritten Digits

The MNIST digits data are available from Yann LeCun’s MNIST page, which also contains a detailed description of the data. There's also a Matlab function to read the data into Matlab under Windows.
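
For readers not using Matlab, the following is a minimal Python sketch of a reader for the IDX binary layout in which the MNIST files are distributed: a big-endian header (magic number 2051 for image files, 2049 for label files) followed by unsigned bytes. The function names `parse_idx_images` and `parse_idx_labels` are hypothetical, and the sketch assumes the file contents have already been read into memory (and decompressed, if the download was gzip-compressed).

```python
import struct


def parse_idx_images(buf):
    """Parse an IDX image buffer into (images, rows, cols).

    Each image is returned as a flat list of pixel intensities (0-255),
    stored row by row.
    """
    # Header: four big-endian unsigned 32-bit integers.
    magic, count, rows, cols = struct.unpack(">IIII", buf[:16])
    if magic != 2051:
        raise ValueError(f"unexpected magic number for images: {magic}")
    size = rows * cols
    pixels = buf[16:16 + count * size]
    if len(pixels) != count * size:
        raise ValueError("buffer is shorter than the header implies")
    images = [list(pixels[i * size:(i + 1) * size]) for i in range(count)]
    return images, rows, cols


def parse_idx_labels(buf):
    """Parse an IDX label buffer into a list of integer labels."""
    # Header: two big-endian unsigned 32-bit integers.
    magic, count = struct.unpack(">II", buf[:8])
    if magic != 2049:
        raise ValueError(f"unexpected magic number for labels: {magic}")
    labels = list(buf[8:8 + count])
    if len(labels) != count:
        raise ValueError("buffer is shorter than the header implies")
    return labels
```

In use, the buffer would come from something like `open(path, "rb").read()`, where `path` points to one of the downloaded files.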

## Oil Flow

This data set can be retrieved in various formats from the GTM data web page.

## Old Faithful

There are several Old Faithful data sets in existence. The one used in PRML, which seems to be the most widely adopted, is available here.

## Synthetic Data

### Curve Fitting
The curve fitting data set contains 10 data points, uniformly spaced on [0,1] in x-space, with

y = sin(2πx) + N(0, 0.3²),

i.e., with additive Gaussian noise of standard deviation 0.3 (variance 0.09). The file has 10 rows of 2 columns ([x,y]). This is the actual data that was used to generate the plots in Figure 1.4 (and others).
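
A data set of the same form can be generated with the short Python sketch below. It will not reproduce the values in the file, since the original random draws are not known; the function name `make_curve_fitting_data` and the `seed` argument are hypothetical, and the sketch assumes that "uniformly spaced on [0,1]" includes both endpoints.

```python
import math
import random


def make_curve_fitting_data(n=10, noise_std=0.3, seed=0):
    """Return n (x, y) pairs with y = sin(2*pi*x) plus Gaussian noise."""
    rng = random.Random(seed)  # seed is arbitrary, chosen for repeatability
    # Assumption: the n inputs include both endpoints 0 and 1.
    xs = [i / (n - 1) for i in range(n)]
    # rng.gauss takes the standard deviation (0.3), not the variance (0.09).
    return [(x, math.sin(2 * math.pi * x) + rng.gauss(0.0, noise_std))
            for x in xs]


data = make_curve_fitting_data()
# Print in the same layout as the file: 10 rows of 2 columns ([x, y]).
for x, y in data:
    print(f"{x:.4f} {y:.4f}")
```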

### Classification
The classification data set contains 200 data points, sampled from a 3-component Gaussian mixture in 2D. This data was generated using the gmmsamp function from Netlab. The corresponding Gaussian mixture model had the parameters:

```matlab
mix.priors = [0.5 0.25 0.25];
mix.centres = [0 -0.1; 1 1; 1 -1];
mix.covars(:,:,1) = [0.625 -0.2165; -0.2165 0.875];
mix.covars(:,:,2) = [0.2241 -0.1368; -0.1368 0.9759];
mix.covars(:,:,3) = [0.2375 0.1516; 0.1516 0.4125];
```

The first component represents class 1 (blue circles, o, in the left panel of Figure A.7); the other two components represent class 0 (red crosses, ×). The file has 200 rows of 3 columns: the first two columns give the position of each data point, and the last column contains the label (0/1).
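
For illustration, data with the same structure can be drawn from this mixture using only the Python standard library, as sketched below. This is not Netlab's gmmsamp and will not reproduce the file: the names `cholesky_2x2` and `sample_mixture` and the `seed` argument are hypothetical, and the sketch assumes each point's component is drawn independently according to the priors.

```python
import math
import random

# Mixture parameters copied from the Netlab structure above.
PRIORS = [0.5, 0.25, 0.25]
CENTRES = [(0.0, -0.1), (1.0, 1.0), (1.0, -1.0)]
COVARS = [
    ((0.625, -0.2165), (-0.2165, 0.875)),
    ((0.2241, -0.1368), (-0.1368, 0.9759)),
    ((0.2375, 0.1516), (0.1516, 0.4125)),
]


def cholesky_2x2(cov):
    """Return (l11, l21, l22) of the lower-triangular L with L L^T = cov."""
    (a, b), (_, c) = cov
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 * l21)
    return l11, l21, l22


def sample_mixture(n=200, seed=0):
    """Return n rows of (x1, x2, label), with label 1 for the first component."""
    rng = random.Random(seed)  # seed is arbitrary, chosen for repeatability
    rows = []
    for _ in range(n):
        # Pick a component according to the mixing proportions.
        k = rng.choices(range(len(PRIORS)), weights=PRIORS)[0]
        l11, l21, l22 = cholesky_2x2(COVARS[k])
        # Transform two independent standard normals: x = mu + L z.
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x1 = CENTRES[k][0] + l11 * z1
        x2 = CENTRES[k][1] + l21 * z1 + l22 * z2
        rows.append((x1, x2, 1 if k == 0 else 0))
    return rows


rows = sample_mixture()
```

Each row follows the file layout described above: two position columns followed by the 0/1 label.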

Christopher M. Bishop 2008