Pattern Recognition and Machine Learning: Data Sets

This page provides access to the data sets described and illustrated in Appendix A of Pattern Recognition and Machine Learning, which are also used in several of the book's examples and figures. I would like to thank Markus Svensén for putting this together.

Handwritten Digits

The MNIST digits data are available from Yann LeCun’s MNIST page, which also contains a detailed description of the data. There's also a Matlab function to read the data into Matlab under Windows.

 
 

Oil Flow

This data set can be retrieved in various formats from the GTM data web-page.

 
 

Old Faithful

There are several Old Faithful data sets in existence. The one used in PRML, which seems to be the most widely adopted, is available here.

 
 

Synthetic Data

Curve Fitting
The curve fitting data contains 10 data points, uniformly spaced on [0,1] in x-space, with

y = sin(2πx) + ε,  ε ~ N(0, 0.3²),

i.e., with Gaussian noise of standard deviation 0.3 (variance 0.09). The file has 10 rows of 2 columns ([x,y]). This is the actual data that was used to generate the plots in Figure 1.4 (and others).
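
For readers who want data of the same form, the recipe above can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch, not the code that produced the published file, so it will not reproduce the actual values. The function name `make_curve_data`, the seed, and the choice to include both endpoints 0 and 1 in the "uniformly spaced" grid are assumptions.

```python
import math
import random


def make_curve_data(n=10, noise_std=0.3, seed=0):
    """Return n (x, y) pairs with y = sin(2*pi*x) + Gaussian noise.

    noise_std is the standard deviation (0.3, i.e. variance 0.09).
    """
    rng = random.Random(seed)
    # Assumption: "uniformly spaced on [0,1]" includes both endpoints.
    xs = [i / (n - 1) for i in range(n)]
    return [(x, math.sin(2 * math.pi * x) + rng.gauss(0.0, noise_std))
            for x in xs]


data = make_curve_data()
# Print in the same layout as the file: one row per point, columns [x, y].
for x, y in data:
    print(f"{x:.4f} {y:.4f}")
```

Setting `noise_std=0.0` returns the noise-free curve at the same x positions, which can be handy for plotting the underlying sinusoid alongside the samples.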

 
 

Classification
The classification data contains 200 data points, sampled from a 3-component Gaussian mixture in 2D. This data was generated using the gmmsamp function from Netlab. The corresponding Gaussian mixture model had the parameters:

mix.priors = [0.5 0.25 0.25];
mix.centres = [0 -0.1; 1 1; 1 -1];
mix.covars(:,:,1) = [0.625 -0.2165; -0.2165 0.875];
mix.covars(:,:,2) = [0.2241 -0.1368; -0.1368 0.9759];
mix.covars(:,:,3) = [0.2375 0.1516; 0.1516 0.4125];

The first component represents class 1 (blue circles, o, in the left panel of Figure A.7), the other two components class 0 (red crosses, ×). The file has 200 rows of 3 columns: the first two columns give the position of each data point, and the last column contains the label (0/1).
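
To illustrate how such a sample arises from the parameters listed above, here is a minimal Python sketch using only the standard library. It is not a port of Netlab's gmmsamp and will not reproduce the published file; the function names `cholesky_2x2` and `sample_mixture` and the seed are assumptions made for illustration. Each point picks a component according to the priors, then draws from that component's Gaussian via a Cholesky factor of its covariance.

```python
import math
import random

# Mixture parameters copied from the listing above.
PRIORS = [0.5, 0.25, 0.25]
CENTRES = [(0.0, -0.1), (1.0, 1.0), (1.0, -1.0)]
COVARS = [
    ((0.625, -0.2165), (-0.2165, 0.875)),
    ((0.2241, -0.1368), (-0.1368, 0.9759)),
    ((0.2375, 0.1516), (0.1516, 0.4125)),
]


def cholesky_2x2(cov):
    """Lower-triangular factor (l11, l21, l22) with L L^T = cov."""
    (a, b), (_, d) = cov
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(d - l21 * l21)
    return l11, l21, l22


def sample_mixture(n=200, seed=0):
    """Return n rows (x1, x2, label); label is 1 for the first component."""
    rng = random.Random(seed)
    factors = [cholesky_2x2(c) for c in COVARS]
    rows = []
    for _ in range(n):
        # Choose a component index with probability given by the priors.
        k = rng.choices(range(len(PRIORS)), weights=PRIORS)[0]
        l11, l21, l22 = factors[k]
        # Transform two independent standard normals by the Cholesky factor.
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x1 = CENTRES[k][0] + l11 * z1
        x2 = CENTRES[k][1] + l21 * z1 + l22 * z2
        rows.append((x1, x2, 1 if k == 0 else 0))
    return rows


rows = sample_mixture()
```

Because the components are chosen at random per point, the class proportions in a given sample only approximate the 0.5 / 0.5 split implied by the priors.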

 
  Christopher M. Bishop 2008