MSRA-CFW is a dataset of celebrity face images collected from the web. Starting from any face image, we obtain its near-duplicate images and their surrounding texts. We then detect the dominant person names by matching the texts against a large list of celebrity names from public websites such as Wikipedia. A classifier is applied to further identify the celebrities appearing in the web images. The final dataset contains 202792 faces of 1583 people.
The dataset includes image URLs for 202792 faces. The face labels are generated automatically, with high accuracy, by the algorithm of Zhang et al. [1]. To facilitate downloading the images, we provide a number of URLs for the near-duplicates of each face. In addition, thumbnail images and facial features (LBP) are provided for visualization and benchmarking purposes. For copyright reasons, we do not provide the original web images.
In the dataset, the files for each person are stored in a folder named after that person. The files in each folder fall into four types:
(1) thumbnails: downsampled images of the faces. These are for visualization purposes only. Please note that each image contains only one face, as detected by the face detector of Zhang and Viola [3].
(2) info.txt: contains the "original web images" (OWI) for the thumbnail images. For each thumbnail, info.txt contains a line of metadata followed by a list of near-duplicate image URLs. The metadata consists of: the number of near-duplicate image URLs, the file name of the corresponding thumbnail and the URL of the OWI.
(3) feature.bin: contains LBP features for the faces. The file starts with two int32 variables indicating the total number of faces and the dimension of LBP features, followed by a byte buffer storing all the features (one face after another).
(4) filelist_LBP.txt: each line of this file starts with a file name, and the lines appear in the same order as the features in feature.bin. The next four numbers on each line give the location of the face (left top right bottom) in the OWI from which the thumbnail was down-sampled, and the last two numbers give the size of the OWI.
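The file layouts described above can be sketched as a small Python reader. This is only an illustration under stated assumptions, not an official loader: the function names (read_features, read_filelist, read_info) are hypothetical, the header of feature.bin is assumed to be little-endian, the number of bytes per feature element is inferred from the file size because the description does not give it, and the text files are assumed to be UTF-8 with whitespace-separated fields.

```python
import struct
from pathlib import Path


def read_features(path):
    """Read feature.bin: two int32 header values, then the feature buffer.

    Assumptions: little-endian header; features stored back to back with
    no padding; all features have the same size.
    """
    data = Path(path).read_bytes()
    num_faces, dim = struct.unpack_from("<ii", data, 0)
    payload = data[8:]
    if num_faces <= 0 or dim <= 0 or not payload or len(payload) % (num_faces * dim):
        raise ValueError("header does not match the size of the buffer")
    # Bytes per feature element, inferred from the file size rather than
    # taken from the description (1 would mean one byte per dimension).
    item_size = len(payload) // (num_faces * dim)
    stride = dim * item_size
    features = [payload[i * stride:(i + 1) * stride] for i in range(num_faces)]
    return dim, item_size, features


def read_filelist(path):
    """Read filelist_LBP.txt: file name, face box and OWI size per line."""
    entries = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        # Split from the right so a file name containing spaces survives.
        name, *numbers = line.rsplit(None, 6)
        if len(numbers) != 6:
            raise ValueError(f"expected a name and six numbers: {line!r}")
        values = [float(v) for v in numbers]
        entries.append({
            "name": name,
            "box": tuple(values[:4]),       # left, top, right, bottom
            "owi_size": tuple(values[4:]),  # two values, kept in file order
        })
    return entries


def read_info(path):
    """Read info.txt: one metadata line, then that many near-duplicate URLs."""
    text = Path(path).read_text(encoding="utf-8")
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    records, i = [], 0
    while i < len(lines):
        # Metadata: URL count, thumbnail file name, URL of the OWI.
        count, rest = lines[i].split(None, 1)
        thumbnail, owi_url = rest.rsplit(None, 1)
        n = int(count)
        records.append({
            "thumbnail": thumbnail,
            "owi_url": owi_url,
            "near_duplicates": lines[i + 1:i + 1 + n],
        })
        i += 1 + n
    return records
```

Because filelist_LBP.txt follows the order of feature.bin, the two results can be paired with zip(read_filelist(...), features). The raw bytes of each feature are returned undecoded, so the element type can be chosen once item_size has been checked against the actual files.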
· We would like to thank the Bing team for the support in the dataset creation process, and various interns in the SMS Group of MSRA for the help in the dataset verification.
· Please cite the paper by Zhang et al. [1] if the dataset is used in academic research.
· If you have any questions or suggestions, please kindly contact us.
To use the dataset, you must read and accept the online agreement. By using the dataset, you agree to obey the terms of its license.
[1] Xiao Zhang, Lei Zhang, Xin-Jing Wang, Heung-Yeung Shum, "Finding Celebrities in Billions of Webpages", to appear in IEEE Transactions on Multimedia, 2012.
[2] Timo Ahonen, Abdenour Hadid, and Matti Pietikainen, "Face Recognition with Local Binary Patterns", in ECCV, 2004.
[3] Cha Zhang and Paul Viola, "Multiple-instance pruning for learning efficient cascade detectors", in NIPS, 2007.