Multi-scale Self-Attention Convolutional Networks for Skeleton-Based Action Recognition

Yuwen Fang; Zonghui Wang

doi:10.23977/acss.2025.090212

Author(s)

Yuwen Fang ¹, Zonghui Wang ¹

Affiliation(s)

¹ School of Computer and Information Sciences, Chongqing Normal University, Chongqing, China

Corresponding Author

Yuwen Fang

ABSTRACT

Skeleton-based action recognition is a core task in video understanding and is widely used in human-computer interaction, intelligent monitoring, and sports analysis. Existing graph convolutional networks (GCNs) model the spatial dependencies between joints effectively by constructing a skeletal connection graph, but their temporal modeling usually relies on fixed-window temporal convolution. This makes it difficult to capture global dynamic associations between distant frames, so key temporal features of complex actions are lost. To address this, this paper proposes a feature extraction framework based on temporal context enhancement. First, the framework uses a GCN to explicitly encode the spatial dependencies of skeletal joints and extract spatial features that carry physical connection priors. Second, a multi-scale temporal convolution module captures the local temporal dynamics between adjacent frames. Third, a self-attention mechanism along the temporal dimension models cross-frame associations in the feature sequence output by the temporal convolution, adaptively capturing key dependencies between distant action frames through dynamic weight allocation. Together, these stages realize temporal modeling from local to global. Experimental results on the NTU RGB+D dataset show that the proposed method significantly outperforms existing advanced models on skeleton-based action recognition, verifying the effectiveness of the temporal self-attention mechanism for modeling complex action dynamics.
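The local-to-global temporal stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the spatial GCN stage has already produced one feature vector per frame, replaces the learned temporal kernels with simple mean filters, and uses identity query, key, and value projections in the attention step. The function names and the kernel sizes (3 and 5) are hypothetical choices made only for illustration.

```python
import math


def temporal_conv(seq, kernel_size):
    """Mean filter over a centred temporal window (stand-in for a learned kernel).

    seq is a list of T frames, each a list of C channel values.
    Windows are truncated at the sequence boundaries.
    """
    half = kernel_size // 2
    out = []
    for t in range(len(seq)):
        window = seq[max(0, t - half): t + half + 1]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out


def multi_scale_temporal_conv(seq, kernel_sizes=(3, 5)):
    """Run one branch per kernel size and concatenate the branches per frame."""
    branches = [temporal_conv(seq, k) for k in kernel_sizes]
    return [sum((b[t] for b in branches), []) for t in range(len(seq))]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_self_attention(seq):
    """Scaled dot-product attention across frames.

    Assumption: queries, keys, and values are the input features themselves
    (identity projections); a trained model would learn these projections.
    Returns the attended features and the T x T attention weights.
    """
    d = len(seq[0])
    scale = math.sqrt(d)
    out, weights = [], []
    for q in seq:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in seq]
        w = softmax(scores)
        weights.append(w)
        out.append([sum(w[s] * seq[s][c] for s in range(len(seq)))
                    for c in range(d)])
    return out, weights


# Toy input: T = 4 frames with C = 2 channels each (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
local_feats = multi_scale_temporal_conv(frames)              # local dynamics
global_feats, attn = temporal_self_attention(local_feats)    # global context
```

Each row of `attn` holds the weights one frame assigns to every frame in the sequence, including distant ones, which is the dynamic weight allocation the abstract refers to. The convolution branches, by contrast, only see frames inside their fixed windows.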

KEYWORDS

Skeletal action recognition; graph convolutional network; temporal self-attention mechanism; multi-scale temporal convolution; spatiotemporal modeling

CITE THIS PAPER

Yuwen Fang, Zonghui Wang, Multi-scale Self-Attention Convolutional Networks for Skeleton-Based Action Recognition. Advances in Computer, Signals and Systems (2025) Vol. 9: 99-107. DOI: http://dx.doi.org/10.23977/acss.2025.090212.
