Estimating hand gestures from depth images is more difficult than estimating human body pose, in part because existing algorithms fail to recognize the different appearances that the same gesture takes on after rotation. This paper proposes an improved hand gesture estimation method in which a convolutional neural network (CNN) predicts the angle by which the image should be pre-rotated in the image plane. First, the CNN is trained on data whose optimal rotation angles are labeled by an automatic algorithm. Then, before gesture recognition, the trained CNN regresses the pre-rotation angle, and the hand depth image is rotated by that angle. Finally, a random decision forest (RDF) classifies the hand pixels, and clustering yields the hand joint positions. Experiments show that the method reduces the error between the predicted and ground-truth hand joint positions, and that the accuracy of hand gesture estimation increases by about 4.69% on average over the baseline.
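The pre-rotation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the nearest-neighbor resampling, the sign convention of the angle, and the two callables `predict_angle` (standing in for the trained CNN) and `estimate_joints` (standing in for the RDF classification and clustering stage) are all assumptions made for illustration. The sketch rotates the depth image about its center, runs the joint estimator on the rotated image, and maps the resulting joint positions back to the original frame.

```python
import numpy as np


def rotate_in_plane(depth, angle_deg, background=0.0):
    """Rotate a 2D depth image about its center (image coords: x right, y down).

    Uses nearest-neighbor inverse mapping; pixels with no source are set to
    `background`. The resampling scheme is an assumption, not from the paper.
    """
    h, w = depth.shape
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    c, s = np.cos(theta), np.sin(theta)
    ys, xs = np.mgrid[0:h, 0:w]
    dx, dy = xs - cx, ys - cy
    # Inverse rotation: for every output pixel, find the source pixel.
    src_x = np.rint(c * dx + s * dy + cx).astype(int)
    src_y = np.rint(-s * dx + c * dy + cy).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.full_like(depth, background)
    out[valid] = depth[src_y[valid], src_x[valid]]
    return out


def rotate_points(points_xy, angle_deg, center_xy):
    """Rotate (x, y) points about center_xy with the same convention as above."""
    theta = np.deg2rad(angle_deg)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    center = np.asarray(center_xy, dtype=float)
    pts = np.asarray(points_xy, dtype=float)
    return (pts - center) @ rot.T + center


def estimate_joints_with_prerotation(depth, predict_angle, estimate_joints):
    """Pre-rotate, estimate joints, then undo the rotation on the joints.

    predict_angle:   hypothetical stand-in for the trained CNN regressor,
                     maps a depth image to an angle in degrees.
    estimate_joints: hypothetical stand-in for the RDF + clustering stage,
                     maps a depth image to an (N, 2) array of (x, y) joints.
    """
    h, w = depth.shape
    center = ((w - 1) / 2.0, (h - 1) / 2.0)
    angle = float(predict_angle(depth))
    rotated = rotate_in_plane(depth, angle)
    joints_rotated = estimate_joints(rotated)
    # Map the joints back into the original (unrotated) image frame.
    return rotate_points(joints_rotated, -angle, center)
```

In use, `predict_angle` and `estimate_joints` would wrap the trained models; any callables with the signatures noted in the docstring can be substituted for testing. Nearest-neighbor resampling is chosen here because it never blends depth values across object boundaries; other interpolation schemes are equally plausible.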