IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
Mask-pose Cascaded CNN for 2D Hand Pose Estimation from Single Color Images
Yangang Wang1,
Cong Peng2,
Yebin Liu3
1Southeast University
2Nanjing University of Aeronautics and Astronautics
3Tsinghua University

Network architecture.
The proposed network has two stages: a mask prediction stage and a pose prediction stage. See the original paper for more details.

We present a cascaded convolutional neural network for 2D hand pose estimation from single in-the-wild RGB images. Inspired by the silhouette information commonly used in generative pose estimation approaches, we build the cascaded network with two stages: a mask prediction stage and a pose estimation stage. We find that, when the two-stage architecture is trained end to end, the two stages benefit each other in detecting the hand mask and the 2D pose. To further improve hand pose detection accuracy, we contribute a new RGB hand dataset named OneHand10K, which contains 10K RGB images. Each image contains a single hand. We manually annotated the segmentation masks and the keypoints for guided learning. We hope that this dataset will serve as a benchmark and encourage more people to work on this challenging topic. Experiments on the validation dataset demonstrate the superior performance of the proposed cascaded convolutional neural network.
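The data flow of the two-stage cascade described above can be illustrated with a minimal sketch. This is not the authors' network: the two stages are replaced by toy stand-in functions, the names `run_cascade`, `toy_mask_stage` and `toy_pose_stage` are hypothetical, and the assumption that the pose stage outputs one heatmap per keypoint (decoded by taking the maximum) is ours rather than something stated on this page. The sketch only shows the wiring: the mask is predicted first, and the pose stage receives both the image and the predicted mask.

```python
from typing import Callable, List, Tuple

Grid = List[List[float]]                        # one H x W channel as nested lists
MaskStage = Callable[[Grid], Grid]              # image -> hand mask with values in [0, 1]
PoseStage = Callable[[Grid, Grid], List[Grid]]  # (image, mask) -> one heatmap per keypoint


def argmax_2d(heatmap: Grid) -> Tuple[int, int]:
    """Return the (x, y) position of the largest value in a heatmap."""
    best_x, best_y, best_v = 0, 0, float("-inf")
    for y, row in enumerate(heatmap):
        for x, v in enumerate(row):
            if v > best_v:
                best_x, best_y, best_v = x, y, v
    return best_x, best_y


def run_cascade(image: Grid, mask_stage: MaskStage,
                pose_stage: PoseStage) -> Tuple[Grid, List[Tuple[int, int]]]:
    """Run mask prediction first, then pose prediction conditioned on the mask."""
    mask = mask_stage(image)
    heatmaps = pose_stage(image, mask)
    keypoints = [argmax_2d(h) for h in heatmaps]
    return mask, keypoints


# Toy stand-ins for the two CNN stages (hypothetical, for wiring only).
def toy_mask_stage(image: Grid) -> Grid:
    # Treat bright pixels as "hand".
    return [[1.0 if v > 0.5 else 0.0 for v in row] for row in image]


def toy_pose_stage(image: Grid, mask: Grid) -> List[Grid]:
    # A single "keypoint" heatmap: the image response suppressed outside the mask.
    return [[[v * m for v, m in zip(img_row, mask_row)]
             for img_row, mask_row in zip(image, mask)]]


image = [
    [0.1, 0.2, 0.1],
    [0.2, 0.9, 0.6],
    [0.1, 0.7, 0.3],
]
mask, keypoints = run_cascade(image, toy_mask_stage, toy_pose_stage)
print(mask)       # binary mask produced by the first stage
print(keypoints)  # one (x, y) location per heatmap
```

In a real implementation both stand-ins would be convolutional sub-networks trained jointly, so that errors in the pose output can also influence the mask stage.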
Training Code

Our network was originally trained with Caffe. The training code is listed below:

Training Network      Solver      Deploy Network

Note that we have since moved our training platform to Caffe2, and the original trained weights are no longer maintained. You can retrain the network yourself to reproduce the results of this paper.

Update on 2019-07
We have released a faster and more accurate method, SRHandNet, for real-time 2D hand pose estimation. The code is available; please check out the new method.
OneHand10K Dataset

Update on 2020-06
Many students, including undergraduates, master's students and Ph.D. candidates, have requested our dataset. Please read the following statements and send your request by e-mail with a carbon copy to your advisor. Otherwise, your e-mail will be automatically ignored. Thanks for your cooperation.

To request the dataset, please send an email stating
  1. your name, title or position, and institution or affiliation. (NOTE: this dataset is ONLY for research and non-commercial use. For copyright reasons, an institution or affiliation is required, and we do not accept requests from individual researchers or students. If you are a student, we encourage you to ask your advisor or a faculty member of your research institute, college or university to request the dataset. An institutional e-mail address is required.)
  2. a statement saying that you accept the following terms of licensing (please copy the licensing text into your email):

    The rights to copy, distribute, and use the OneHand10K dataset (henceforth called "OneHand10K") you are being given access to are under the control of Yangang Wang, director of the Vision and Cognition Lab, Southeast University. You are hereby given permission to copy this data in electronic or hardcopy form for your own scientific use and to distribute it for scientific use to colleagues within your research group. Inclusion of images or video made from this data in a scholarly publication (printed or electronic) is also permitted. In this case, credit must be given to the publication: *Mask-pose Cascaded CNN for 2D Hand Pose Estimation from Single Color Image*. For any other use, including distribution outside your research group, written permission is required from Yangang Wang. Any commercial use is not allowed. Commercial use includes but is not limited to sale of the data, derivatives, replicas, images, or video, inclusion in a product for sale, or inclusion in advertisements (printed or electronic), on commercially-oriented web sites, or in trade shows.



Related links

Yangang Wang, Cong Peng and Yebin Liu. "Mask-pose Cascaded CNN for 2D Hand Pose Estimation from Single Color Images". IEEE Transactions on Circuits and Systems for Video Technology, 29(11), 3258 - 3268, 2019.
Acknowledgments: The authors would like to thank Xiaoquan Lv, Biyao Shao, Junjie Zhu and Yining Xie for helping us build the dataset. This work was supported by the National Natural Science Foundation of China (No. 61806054, 6170320), the Natural Science Foundation of Jiangsu Province (No. BK20180355 and BK20170812) and the Foundation of Southeast University (No. 3208008410 and 1108007121).