Applying Transfer Learning for Syllable-Based Speech Recognition in Tibetan Language

DOI: 10.23977/fcvpr.2023.010101

Author(s)

Senyan Li 1, Guanyu Li 1, Sirui Li 1

Affiliation(s)

1 Northwest Minzu University, Lanzhou, Gansu, 730000, China

Corresponding Author

Guanyu Li

ABSTRACT

This article explores Tibetan speech recognition and reviews its development. In recent years, end-to-end methods have been applied to Tibetan speech recognition, but because training data are scarce, their performance has been unsatisfactory. This article therefore introduces a transfer learning method: Mandarin, which belongs to the same language family as Tibetan, is used to train a pre-trained model whose parameters initialize the Tibetan speech recognition model. On the public xbmu-amdo31 Tibetan dataset, our method achieves an 11.8% relative reduction in phoneme error rate compared with the baseline system. The method not only improves speech recognition performance for low-resource languages but also has the potential to be extended to other languages within the same family. Overall, this article highlights the importance of transfer learning in speech recognition and its potential to improve speech recognition systems for low-resource languages.
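The abstract describes the transfer strategy only at a high level. The short PyTorch sketch below illustrates the core idea of initializing a target-language model from a source-language pre-trained model and then fine-tuning it; the toy LSTM encoder, vocabulary sizes, checkpoint filename, and learning rate are illustrative assumptions rather than the authors' actual setup.

import torch
import torch.nn as nn

class SimpleASRModel(nn.Module):
    # Toy acoustic model: an encoder plus a per-frame output projection
    # (e.g., for CTC). Stands in for whatever architecture the paper uses.
    def __init__(self, feat_dim: int, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, num_layers=2, batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        hidden, _ = self.encoder(feats)
        return self.output(hidden)

# 1) Source model with a Mandarin vocabulary; the checkpoint path is hypothetical.
mandarin_model = SimpleASRModel(feat_dim=80, hidden_dim=256, vocab_size=4000)
# mandarin_model.load_state_dict(torch.load("mandarin_pretrained.pt"))

# 2) Target model with a Tibetan syllable vocabulary of a different size.
tibetan_model = SimpleASRModel(feat_dim=80, hidden_dim=256, vocab_size=1200)

# 3) Transfer: copy the pre-trained encoder weights; the output layer is
#    re-initialized because the two vocabularies do not match.
tibetan_model.encoder.load_state_dict(mandarin_model.encoder.state_dict())

# 4) Fine-tune all parameters on Tibetan data, typically at a reduced
#    learning rate so the transferred weights are not destroyed.
optimizer = torch.optim.Adam(tibetan_model.parameters(), lr=1e-4)

Fine-tuning every layer, as sketched here, is one common choice for cross-lingual transfer; freezing the lower encoder layers and updating only the upper layers and the output projection is an equally plausible variant that the abstract does not specify.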

KEYWORDS

Speech recognition, Tibetan language, low-resource, transfer learning, Amdo dialect

CITE THIS PAPER

Senyan Li, Guanyu Li, Sirui Li, Applying Transfer Learning for Syllable-Based Speech Recognition in Tibetan Language. Frontiers in Computer Vision and Pattern Recognition (2023) Vol. 1: 1-8. DOI: http://dx.doi.org/10.23977/fcvpr.2023.010101.
