Fusing CNN and Transformer Network for Human Pose Estimation
DOI: 10.23977/acss.2024.080520
Author(s)
Jiajia Shi 1, Fuchun Zhang 1, Zhenni Ma 1
Affiliation(s)
1 School of Physics and Electronic Information, Yan'an University, Yan'an, 716000, China
Corresponding Author
Fuchun Zhang
ABSTRACT
Accurate human pose estimation is essential for downstream human action recognition and behavioral analysis. Existing convolutional networks can extract local feature information but fail to model long-range dependencies, while Transformers excel at capturing global context but lose fine-grained details. To address this, we propose a dual-branch network, the Dual Transformer and CNN Network (DTCNet), that integrates global and local information for human pose estimation. It contains two branches: a Transformer branch that extracts global dependencies and a CNN branch that preserves local details. A fusion module lets the two branches interact, combining their complementary information to enhance representational power. Finally, a heatmap regression decoding unit produces the pose estimates. Experiments demonstrate that this dual-branch design balances accuracy and efficiency while addressing limitations of previous methods. DTCNet achieves higher average accuracy than the baseline on standard datasets, with improvements of 2.9% on MPII and 2.1% on COCO, indicating that it better captures both the long-range dependencies and the fine-grained details needed for accurate pose estimation.
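The abstract does not say how the heatmap regression decoding unit turns heatmaps into coordinates. As a rough illustration only, the sketch below uses the common argmax approach: take the highest-scoring cell of each keypoint heatmap and rescale it to input-image pixels. The function name `decode_heatmaps`, the `(width, height)` argument, and the cell-centre convention are assumptions made for this example, not details from the paper.

```python
from typing import List, Tuple

Heatmap = List[List[float]]  # one H x W score map per keypoint


def decode_heatmaps(
    heatmaps: List[Heatmap],
    image_size: Tuple[int, int],
) -> List[Tuple[float, float, float]]:
    """Return (x, y, score) per keypoint in input-image pixel coordinates.

    image_size is (width, height) of the network input. The heatmaps are
    assumed to be a uniformly downsampled version of that input
    (an assumption of this sketch, not a detail taken from the paper).
    """
    img_w, img_h = image_size
    keypoints = []
    for hm in heatmaps:
        h, w = len(hm), len(hm[0])
        # Locate the highest-scoring cell (the first one wins on ties).
        best_score, best_x, best_y = hm[0][0], 0, 0
        for y, row in enumerate(hm):
            for x, score in enumerate(row):
                if score > best_score:
                    best_score, best_x, best_y = score, x, y
        # Map the centre of that cell back to input-image coordinates.
        px = (best_x + 0.5) * img_w / w
        py = (best_y + 0.5) * img_h / h
        keypoints.append((px, py, best_score))
    return keypoints
```

Called with one heatmap per joint, the function returns one `(x, y, score)` triple per joint. Practical decoders often refine the argmax with a sub-pixel offset toward the neighbouring maximum, which this sketch leaves out.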
KEYWORDS
Keypoint detection, CNN, Transformer, feature fusion
CITE THIS PAPER
Jiajia Shi, Fuchun Zhang, Zhenni Ma, Fusing CNN and Transformer Network for Human Pose Estimation. Advances in Computer, Signals and Systems (2024) Vol. 8: 174-184. DOI: http://dx.doi.org/10.23977/acss.2024.080520.