MSR Action Recognition Datasets and Codes

 

HON4D Code and MSRActionPairs Dataset

 

MSRGesture3D (28MB): The dataset was captured by a Kinect device. There are 12 dynamic American Sign Language (ASL) gestures and 10 people. Each person performs each gesture 2-3 times. There are 336 files in total, each corresponding to a depth sequence. The hand portion (above the wrist) has been segmented. The file name has the format sub_depth_m_n, where m is the person index and n ranges from 1 to 36. Note that for some (m,n) the file sub_depth_m_n does not exist; for example, there is no "sub_depth_02_03". This is because bad sequences were excluded from the dataset. The mapping from n to gesture type is the following (a short snippet for converting n to a gesture label is given after the list):

{1,2,3}-> "ASL_Z";

{4,5,6} ->"ASL_J";

{7,8,9} ->"ASL_Where";

{10,11,12} ->"ASL_Store";

{13,14,15} ->"ASL_Pig";

{16,17,18} ->"ASL_Past";

{19,20,21}->"ASL_Hungary";

{22,23,24}->"ASL_Green";

{25,26,27}->"ASL_Finish";

{28,29,30}->"ASL_Blue";

{31,32,33}->"ASL_Bathroom";

{34,35,36}->"ASL_Milk";
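
For convenience, here is a minimal illustrative MATLAB snippet (not part of the dataset) that converts the index n from a file name to its gesture label using the mapping above:

gestures = {'ASL_Z','ASL_J','ASL_Where','ASL_Store','ASL_Pig','ASL_Past', ...
            'ASL_Hungary','ASL_Green','ASL_Finish','ASL_Blue','ASL_Bathroom','ASL_Milk'};
n = 14;                                % e.g., from a file named sub_depth_01_14
label = gestures{floor((n-1)/3) + 1};  % -> 'ASL_Pig'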

 

Each file is a MAT file that can be loaded with 64-bit MATLAB. Below is sample MATLAB code to load a file:

 

x = load('sub_depth_01_01');             % the MAT file contains the variable depth_part
nRows   = size(x.depth_part, 1);
nCols   = size(x.depth_part, 2);
nFrames = size(x.depth_part, 3);
for i = 1:nRows
  for j = 1:nCols
    for k = 1:nFrames
      depthval = x.depth_part(i, j, k);  % depth value at row i, column j, frame k
    end
  end
end
 

The following two papers report experimental results on this dataset:

Alexey Kurakin, Zhengyou Zhang, Zicheng Liu, A Real-Time System for Dynamic Hand Gesture Recognition with a Depth Sensor, EUSIPCO, 2012.

Jiang Wang, Zicheng Liu, Jan Chorowski, Zhuoyuan Chen, Ying Wu, Robust 3D Action Recognition with Random Occupancy Patterns, ECCV, 2012.

 

MSRDailyActivity3D (MSR Daily Activity 3D dataset): The dataset was captured by a Kinect device. There are 16 activities: drink, eat, read book, call cellphone, write on a paper, use laptop, use vacuum cleaner, cheer up, sit still, toss paper, play game, lie down on sofa, walk, play guitar, stand up, sit down. There are 10 subjects. Each subject performs each activity twice, once in a standing position and once in a sitting position. There is a sofa in the scene. Three channels are recorded: depth maps (.bin), skeleton joint positions (.txt), and RGB video (.avi). There are 16*10*2=320 files for each channel, so there are 320*3=960 files in total. Note that the RGB channel and depth channel are recorded independently, so they are not strictly synchronized.

The format of the skeleton file is as follows. The first integer is the number of frames. The second integer is the number of joints, which is always 20. For each frame, the first integer is the number of rows. This integer is 40 when exactly one skeleton is detected in the frame, and zero when no skeleton is detected. It is 80 when two skeletons are detected (this is rare; in that case we simply use the first skeleton in our experiments). For most frames, the number of rows is 40. Each joint corresponds to two rows: the first row is its real-world coordinates (x, y, z) and the second row is its screen coordinates plus depth (u, v, depth), where u and v are normalized to be within [0,1]. The last value in each row is supposed to be a confidence value, but it is not useful.
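
The following MATLAB sketch parses one skeleton file according to this format. It is illustrative only: it is not the official sample code, and the file name used below is just a placeholder.

fid = fopen('skeleton_file.txt', 'r');          % placeholder file name
nFrames = fscanf(fid, '%d', 1);                 % number of frames
nJoints = fscanf(fid, '%d', 1);                 % always 20
world  = zeros(nFrames, nJoints, 3);            % (x, y, z) real-world coordinates
screen = zeros(nFrames, nJoints, 3);            % (u, v, depth), u and v in [0, 1]
for f = 1:nFrames
  nRows = fscanf(fid, '%d', 1);                 % 0, 40, or 80
  if nRows == 0
    continue;                                   % no skeleton detected in this frame
  end
  rows = fscanf(fid, '%f', [4, nRows])';        % each row: three values plus confidence
  rows = rows(1:2*nJoints, :);                  % if two skeletons (80 rows), keep the first
  world(f, :, :)  = rows(1:2:end, 1:3);         % odd rows: real-world coordinates
  screen(f, :, :) = rows(2:2:end, 1:3);         % even rows: screen coordinates plus depth
end
fclose(fid);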

This diagram shows the correspondence between the 20 points in the skeleton data and the joints (Thanks to Yu Zhong from AIT, BAE Systems for providing this diagram).

Activity recognition experiments with this dataset are reported in the following paper:

Mining Actionlet Ensemble for Action Recognition with Depth Cameras, Jiang Wang, Zicheng Liu, Ying Wu, Junsong Yuan, IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012),  Providence, Rhode Island, June 16-21, 2012.

Part 1 (440 MB)

Part 2 (540 MB)

Part 3 (520 MB)

Part 4 (490 MB)

Part 5 (300 MB)

Part 6 (390 MB)

Part 7 (520 MB)

Part 8 (270 MB)

Sample code to load MSRDailyAct3D Dataset

 

MSR Action3D Dataset: 20 action types, 10 subjects, each subject performs each action 2 or 3 times. There are 567 depth map sequences in total. The resolution is 320x240. The data was recorded with a depth sensor similar to the Kinect device. The dataset is described in the following paper. Click here for a description of the subject splits used in various papers.

Action Recognition Based on A Bag of 3D Points, Wanqing Li, Zhengyou Zhang, Zicheng Liu, IEEE International Workshop on CVPR for Human Communicative Behavior Analysis (in conjunction with CVPR2010), San Francisco, CA, June, 2010.

Better classification results are reported in the following paper:

Mining Actionlet Ensemble for Action Recognition with Depth Cameras, Jiang Wang, Zicheng Liu, Ying Wu, Junsong Yuan, IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012), Providence, Rhode Island, June 16-21, 2012. Note that there is an error in the paper regarding the number of samples used in the experiment: the number 402 in the paper is incorrect; the correct number is 557. Out of the original 567 sequences in the MSR Action3D Dataset, 10 sequences are not used in this paper's experiment because the skeletons are either missing or too erroneous. Here is a list of the file names used in the experiment: list of file names.

Sample code to load MSR Action3D Dataset

Skeleton Data in screen coordinates (Thanks to Yi Wen Wan, University of North Texas, for data cleaning and conversion). There is a skeleton sequence file for each depth sequence in the Action3D dataset. A skeleton has 20 joint positions (see the image for an illustration of the joint positions). Four real numbers are stored for each joint: u, v, d, c, where (u,v) are screen coordinates, d is the depth value, and c is the confidence score. If a depth sequence has n frames, then the number of real numbers stored in the corresponding skeleton file is n*20*4. Click here for MATLAB code to visualize the skeleton motions (the code is provided by Antonio Vieira from Federal University of Minas Gerais).
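
As a rough illustration (assuming the values are stored as plain whitespace-separated text, joint by joint and frame by frame; the file name is a placeholder), such a file can be read into an nFrames x 20 x 4 array in MATLAB:

fid = fopen('skeleton_file.txt', 'r');                     % placeholder file name
vals = fscanf(fid, '%f');                                  % all n*20*4 real numbers
fclose(fid);
nFrames = numel(vals) / (20 * 4);
joints = permute(reshape(vals, 4, 20, nFrames), [3 2 1]);  % nFrames x 20 x 4
% joints(f, j, :) = [u, v, d, c] for joint j in frame f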

This diagram shows the correspondence between the 20 points in the skeleton data and the joints (Thanks to Yu Zhong from AIT, BAE Systems for providing this diagram).

Skeleton Data in real world coordinates (Thanks to Ferda Ofli, UC Berkeley, for processing the data).

 

MSR Action Dataset I (1.5GB):

The test dataset contains 16 video sequences with 63 actions in total: 14 hand clapping, 24 hand waving, and 25 boxing, performed by 10 subjects. Each sequence contains multiple types of actions. Some sequences contain actions performed by different people. There are both indoor and outdoor scenes. All of the video sequences are captured with cluttered and moving backgrounds. Each video has a low resolution of 320x240 and a frame rate of 15 frames per second, and the lengths range from 32 to 76 seconds. To evaluate performance, we manually label a spatio-temporal bounding box for each action. The ground truth labels can be found in the groundtruth.txt file. The ground truth format of each labeled action is "X width Y height T length".
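
For illustration only, a labeled action in this format can be turned into spatial and temporal ranges as follows (the numbers are hypothetical, and whether the end points are inclusive should be checked against groundtruth.txt):

gt = [120 60 40 100 300 45];        % hypothetical label: X width Y height T length
xRange = [gt(1), gt(1) + gt(2)];    % horizontal extent of the action
yRange = [gt(3), gt(3) + gt(4)];    % vertical extent
tRange = [gt(5), gt(5) + gt(6)];    % temporal extent (frames)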

If you use this dataset, please cite the following paper:

Discriminative Subvolume Search for Efficient Action Detection, Junsong Yuan, Zicheng Liu, Ying Wu, IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), Miami, Florida, June 22-24, 2009.

Subvolume Branch-and-Bound Search Code: (Win32 binary)

Given a volume of scores (each point in the volume is assigned a score), the program searches for all the subvolumes whose total score is above a user-specified threshold value. The algorithm is described in the paper below. The input file format and command syntax are described in the readme file contained in the downloaded package.

Discriminative Subvolume Search for Efficient Action Detection, Junsong Yuan, Zicheng Liu, Ying Wu, IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), Miami, Florida, June 22-24, 2009.

MSR Action Dataset II:

    Part 1 (1.5GB)

    Part 2 (1.3GB)

    Part 3 (1.4GB)

    Part 4 (1.4GB)

    Part 5 (1.2GB)

    Ground truth labels (.txt file)

This is an extended version of the Microsoft Research Action Data Set. It consists of 54 video sequences recorded in a crowded environment. The video resolution is 320x240 and the frame rate is 15 frames per second. Each video sequence consists of multiple actions, with 203 action instances in total. There are three action types: hand waving, hand clapping, and boxing. These action types overlap with those in the KTH dataset, so one can perform cross-dataset action detection by training on the KTH dataset and testing on this dataset. To make downloading easier, the dataset is split into 5 parts. Please note that the positions in the ground truth file are given with respect to a 160x120 spatial resolution; therefore, if you work at 320x240 resolution, you will need to scale the spatial positions in the ground truth file (a short rescaling sketch is given after the paper list below). This dataset was used in the following three papers:

Hierarchical Filtered Motion for Action Recognition in Crowded Videos, Y. Tian, L. Cao, Z. Liu, and Z. Zhang, IEEE Transactions on Systems, Man, and Cybernetics--Part C: Applications and Reviews, Vol. 42, No. 3, pp. 313-323, 2012.

Cross-dataset Action Detection, Liangliang Cao, Zicheng Liu, Thomas Huang, IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2010), San Francisco, CA, June, 2010.

Discriminative Video Pattern Search for Efficient Action Detection, Junsong Yuan, Zicheng Liu, Ying Wu, IEEE Transactions on Pattern Analysis and Machine Intelligence, accepted, 2010.
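
As noted above, the ground truth positions refer to 160x120 resolution. A minimal rescaling sketch (with a hypothetical label row) is:

gt = [60 30 20 50 300 45];     % hypothetical 160x120 label: X width Y height T length
gt(1:4) = gt(1:4) * 2;         % scale the spatial values to 320x240; temporal values unchanged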

Negative Data:

There are five video sequences of a person moving around in an office. This data was used in some of the experiments in Yuan et al.'s CVPR 2009 paper.

Senior Home Monitoring Dataset:

For information on the Senior Home Monitoring dataset, please go to the UESTC Senior Home Monitoring web page.

Contact: zliu@microsoft.com