Author: Cheng, Hao-Tien (鄭皓天)
Thesis title: Multi-accent English Speech Recognition (多口音英語語音辨識)
Advisor: Chen, Berlin (陳柏琳)
Oral defense committee: Chen, Berlin (陳柏琳); Hung, Jeih-Weih (洪志偉); Chiang, Chen-Yu (江振宇)
Oral defense date: 2024/01/20
Degree: Master
Department: Department of Computer Science and Information Engineering
Publication year: 2024
Academic year of graduation: 112 (ROC calendar)
Language: Chinese
Number of pages: 41
Keywords: Speech Recognition, Accent, Multi-task Learning, Data Visualization, Model Probing, Adapter
Research method: Experimental design
DOI URL: http://doi.org/10.6345/NTNU202400347
Thesis type: Academic thesis
  • With globalization, English has become increasingly important as an international lingua franca. At the same time, differences in native-language background, region, and culture have increased the diversity of English accents, which makes it difficult for speech recognition systems to handle accented English.
    This thesis investigates how to improve the Conformer model's recognition of multi-accent English speech when accented training data is limited, by strengthening the model's ability to discriminate between accents. It proposes adding an accent classification task to the speech recognition model to raise the model's sensitivity to different accents. Experiments show that this method lowers the word error rate on accented English speech compared with conventional speech recognition methods. The thesis also visualizes the accent features at different layers of the model encoder to examine what information each layer's features represent.
    In addition, the thesis evaluates the Whisper model, which is trained on large amounts of data, on the multi-accent English speech recognition task in its English-only and multilingual versions and at different model sizes. It also compares training with LoRA against full fine-tuning, with the aim of providing clearer guidance for model selection.
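The abstract describes adding an accent classification task to the speech recognition model. One common way to combine such an auxiliary task with the recognition objective is a weighted sum of the two losses. The sketch below illustrates that idea only; the abstract does not give the exact formulation, so the function names (`cross_entropy`, `multitask_loss`), the `accent_weight` parameter, and all numeric values are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of class `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def multitask_loss(asr_loss, accent_logits, accent_label, accent_weight=0.1):
    """Return (total_loss, accent_loss) for one utterance.

    Assumption: the total is asr_loss + accent_weight * accent_loss.
    The thesis may combine the two terms differently.
    """
    accent_loss = cross_entropy(accent_logits, accent_label)
    return asr_loss + accent_weight * accent_loss, accent_loss


# Hypothetical example: an ASR loss of 2.0 and an accent classifier that is
# completely undecided between 8 accent classes (all logits equal).
total, accent = multitask_loss(
    asr_loss=2.0,
    accent_logits=[0.0] * 8,
    accent_label=3,
    accent_weight=0.1,
)
print(f"accent loss = {accent:.4f}, total loss = {total:.4f}")
```

Setting `accent_weight` to 0 recovers plain speech recognition training, so the weight controls how strongly accent discrimination shapes the shared encoder.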

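    Chapter 1 Introduction
      1.1 Research Background
      1.2 Research Motivation
      1.3 Research Contributions
    Chapter 2 Literature Review
      2.1 Background
      2.2 Multi-accent Speech Recognition Methods
        2.2.1 Acoustic Models and Speech Processing
        2.2.2 Independent Accent-Specific Acoustic Models
        2.2.3 Multi-accent Deep Neural Networks
        2.2.4 Adversarial Training
        2.2.5 Accent Embeddings
        2.2.6 Residual Adapters
        2.2.7 Accent Classification Task Features
        2.2.8 General Encoder and Accent Encoder
    Chapter 3 Methods and Procedures
      3.1 Conformer Model
        3.1.1 Introduction
        3.1.2 Conformer Model Architecture
        3.1.3 Pre-training and Fine-tuning
        3.1.4 Auxiliary Multi-task Learning
      3.2 Whisper Model
        3.2.1 Introduction to the Whisper Model
        3.2.2 LoRA
    Chapter 4 Experiments and Results
      4.1 Datasets
        4.1.1 LibriSpeech Dataset
        4.1.2 AESRC2020 Dataset
      4.2 Evaluation Metrics
      4.3 Experimental Results
        4.3.1 LibriSpeech Pre-training and Fine-tuning
        4.3.2 Multi-task Learning with Accent Classification
        4.3.3 Weight of the Accent Classification Task Loss
        4.3.4 Visualization of Internal Conformer Features
        4.3.5 Analysis of Adding a Language Model
        4.3.6 Adding a Domain Adversarial Training (DAT) Auxiliary Loss
        4.3.7 Out-of-domain Data Testing
        4.3.8 Fine-tuning Different Whisper Models on the Multi-accent Task
        4.3.9 Training Whisper Models with LoRA
    Chapter 5 Conclusion and Future Work
    References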

