Fusing CNN and Transformer Network for Human Pose Estimation

Jiajia Shi; Fuchun Zhang; Zhenni Ma

doi:10.23977/acss.2024.080520


Author(s)

Jiajia Shi ¹, Fuchun Zhang ¹, Zhenni Ma ¹

Affiliation(s)

¹ School of Physics and Electronic Information, Yan'an University, Yan'an, 716000, China

Corresponding Author

Fuchun Zhang

ABSTRACT

Accurate human pose estimation is essential for downstream human action recognition and behavioral analysis. Convolutional networks extract local features well but struggle to model long-range dependencies, while Transformers excel at capturing global context but tend to lose fine-grained details. To address this, we propose the Dual Transformer and CNN Network (DTCNet), a dual-branch network that integrates global and local information for human pose estimation. It contains two branches: a Transformer branch that extracts global dependencies and a CNN branch that preserves local details. A fusion module lets the two branches interact, combining their complementary information to enhance representational power. Finally, a heatmap regression decoding unit produces the pose estimates. Experiments demonstrate that the dual-branch design balances accuracy and efficiency while addressing limitations of previous methods. DTCNet achieves significantly higher average accuracy than the baseline on standard datasets, improving by 2.9% on MPII and 2.1% on COCO, which supports the claim that it better captures both the long-range dependencies and the fine-grained details needed for accurate pose estimation.
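The abstract names a heatmap regression decoding unit but does not describe how it works. As a rough illustration only, the sketch below shows the common argmax style of heatmap decoding: each keypoint has one score map, and the location of its peak is scaled back to input-image pixels. The function name `decode_heatmaps`, the `stride` parameter and its default of 4, and the toy data are all assumptions made for this example, not details taken from DTCNet.

```python
from typing import List, Tuple

Heatmap = List[List[float]]  # one H x W score map for a single keypoint


def decode_heatmaps(
    heatmaps: List[Heatmap], stride: int = 4
) -> List[Tuple[int, int, float]]:
    """Return one (x, y, score) triple per keypoint, in input-image pixels.

    Assumption: plain argmax decoding with a fixed output stride. The
    paper's decoding unit may differ (e.g. sub-pixel refinement).
    """
    keypoints = []
    for hm in heatmaps:
        best_score = float("-inf")
        best_x = best_y = 0
        # Scan the map for its highest-scoring cell.
        for y, row in enumerate(hm):
            for x, score in enumerate(row):
                if score > best_score:
                    best_score, best_x, best_y = score, x, y
        # Map heatmap coordinates back to the input resolution.
        keypoints.append((best_x * stride, best_y * stride, best_score))
    return keypoints


if __name__ == "__main__":
    # Two hypothetical 3 x 3 heatmaps, one per keypoint.
    toy = [
        [[0.0, 0.1, 0.0], [0.2, 0.9, 0.1], [0.0, 0.1, 0.0]],
        [[0.0, 0.0, 0.7], [0.0, 0.1, 0.0], [0.0, 0.0, 0.0]],
    ]
    print(decode_heatmaps(toy))  # [(4, 4, 0.9), (8, 0, 0.7)]
```

In this style of decoding, the peak score is often kept as a per-keypoint confidence, which is why the sketch returns it alongside the coordinates.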

KEYWORDS

Keypoint detection, CNN, Transformer, feature fusion

CITE THIS PAPER

Jiajia Shi, Fuchun Zhang, Zhenni Ma, Fusing CNN and Transformer Network for Human Pose Estimation. Advances in Computer, Signals and Systems (2024) Vol. 8: 174-184. DOI: http://dx.doi.org/10.23977/acss.2024.080520.

