| Field | Value |
|---|---|
| Graduate student | 白立亭 Pai, Li-Ting |
| Thesis title | 針對端到端語音辨識中語境偏移之適應性研究 (A Study on Contextual Biasing Adaptation in End-to-End Speech Recognition) |
| Advisor | 陳柏琳 Chen, Berlin |
| Oral defense committee | 陳柏琳 Chen, Berlin; 陳冠宇 Chen, Kuan-Yu; 曾厚強 Tseng, Ho-Chiang |
| Defense date | 2025/01/17 |
| Degree | 碩士 Master |
| Department | 資訊工程學系 Department of Computer Science and Information Engineering |
| Year of publication | 2025 |
| Academic year of graduation | 113 |
| Language | 中文 (Chinese) |
| Number of pages | 47 |
| Chinese keywords | 語音辨識、語境偏移、關鍵詞辨識、提示微調 |
| English keywords | Speech Recognition, Contextual Biasing, Keyword Recognition, Prompt-Tuning |
| Research method | 實驗設計法 (Experimental design) |
| DOI URL | http://doi.org/10.6345/NTNU202500280 |
| Thesis type | 學術論文 (Academic thesis) |
Abstract:

With the arrival of the post-pandemic era, online meetings have become the norm, leading to a growing demand for speech transcription technology. However, in these meeting scenarios, speech recognition systems struggle to accurately recognize specialized terminology, names, and keywords, which affects the completeness and precision of the transcription results. These issues are especially common in meetings involving industry-specific or specialized knowledge, such as in healthcare, law, and finance. In such contexts, accurately transcribing keywords and proper nouns not only improves the readability of meeting minutes but also facilitates more effective retrieval and extraction of important information in subsequent analysis. To address this need, speech recognition technology has gradually introduced contextual biasing and text prompting. By integrating domain-specific word lists and specialized terminology databases, a system can more accurately recognize important content in meetings and further enhance the quality and utility of meeting data. This study focuses on enhancing the contextual sensitivity of speech recognition models by introducing different types of semantic features and specific prompts to improve the recognition of domain-specific vocabulary. The results show that prompt-based training on the AISHELL-1 dataset achieves a 13.8% relative word error rate reduction and a 7.5% relative entity error rate reduction. These findings indicate that the approach effectively heightens the model's sensitivity to specialized terminology and critical vocabulary, reduces errors on biasing words, and improves transcription accuracy. By providing contextual cues for the vocabulary, the model can more accurately recognize and correctly transcribe the relevant content in professional settings, thereby reducing errors caused by a lack of context.
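As an illustration of the ideas in the abstract, the sketch below shows how a biasing word list can be supplied as a text prompt at decoding time, and how a "relative error rate reduction" figure is computed. This is a minimal sketch, not the thesis's prompt-tuning pipeline: it assumes the openai-whisper package and uses its `initial_prompt` argument as a stand-in for prompt-based contextual biasing; the model size, audio file, biasing words, and error-rate numbers are hypothetical placeholders.

```python
# Minimal sketch: prompt-based contextual biasing at inference time (illustrative only).
# Assumes the openai-whisper package; the thesis's own method trains the model with
# prompts (prompt-tuning) rather than relying solely on a decoding-time initial_prompt.

import whisper


def relative_reduction(baseline: float, biased: float) -> float:
    """Relative error-rate reduction in percent: (baseline - biased) / baseline * 100."""
    return (baseline - biased) / baseline * 100.0


def transcribe_with_bias(audio_path: str, bias_words: list[str]) -> str:
    """Decode one utterance while hinting the decoder with a domain word list."""
    model = whisper.load_model("small")      # hypothetical model size
    prompt = "、".join(bias_words)            # join Mandarin keywords into a text prompt
    result = model.transcribe(
        audio_path,
        language="zh",
        initial_prompt=prompt,               # contextual cue for rare or domain terms
    )
    return result["text"]


if __name__ == "__main__":
    # Hypothetical biasing list for a medical-domain meeting.
    bias = ["阿茲海默症", "胰島素", "心肌梗塞"]
    print(transcribe_with_bias("meeting_clip.wav", bias))

    # Arbitrary example values, only to show how a relative reduction is computed.
    print(f"relative CER reduction: {relative_reduction(10.0, 8.5):.1f}%")  # -> 15.0%
```

Decoding-time prompting of this kind is only the inference-side analogue of the contextual biasing studied in the thesis; per the abstract, the thesis additionally trains the model with such prompts and semantic features so that sensitivity to the biasing words is learned rather than merely hinted at decode time.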