
Graduate Student: Tzu-Ting Yang (楊子霆)
Thesis Title: A Study on the Effectiveness of Language Acuity-Enhanced Encoders in Code-Switching Speech Recognition (提升編碼器語言敏銳度在語碼轉換語音辨識中的有效性之研究)
Advisor: Berlin Chen (陳柏琳)
Committee Members: Berlin Chen (陳柏琳), Hsin-Min Wang (王新民), Jeih-weih Hung (洪志偉), Chen-Yu Chiang (江振宇)
Defense Date: 2024/07/22
Degree: Master's
Department: Department of Computer Science and Information Engineering
Publication Year: 2024
Graduation Academic Year: 112 (ROC calendar)
Language: English
Pages: 52
Keywords (Chinese): 自動語音辨識、語碼轉換、混合專家模型、解絞損失、中間層損失、非尖峰CTC損失
Keywords (English): automatic speech recognition, code-switching, mixture of experts, disentangle loss, intermediate loss, non-peaky CTC loss
Research Method: Experimental design
DOI: http://doi.org/10.6345/NTNU202401823
Thesis Type: Academic thesis

Abstract:

    With the advent of End-to-End (E2E) neural networks, the field of Automatic Speech Recognition (ASR) has entered a revolutionary new era. E2E ASR consolidates the modules of the traditional speech recognition pipeline into a single, cohesive neural network capable of directly transcribing input speech signals into the corresponding text. This innovation not only streamlines modeling but also significantly reduces the inconsistencies that can arise when each module is trained independently. In monolingual recognition, E2E ASR models have achieved near-human accuracy, a significant milestone in the evolution of speech recognition technology.
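
    To ground the E2E formulation, the following is a minimal PyTorch sketch of the core pattern: a single network maps acoustic frames directly to per-frame token distributions, trained with a CTC objective, one standard E2E criterion. The tiny linear "encoder", the 80-dimensional features, and the 5000-token vocabulary are illustrative assumptions, not the thesis's setup.

import torch
import torch.nn as nn

encoder = nn.Sequential(            # stand-in for a real E2E speech encoder
    nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 5000),
)
ctc = nn.CTCLoss(blank=0)           # index 0 reserved for the CTC blank

feats = torch.randn(4, 120, 80)                  # (batch, frames, mel bins)
log_probs = encoder(feats).log_softmax(dim=-1)   # per-frame token distributions
log_probs = log_probs.transpose(0, 1)            # CTCLoss expects (time, batch, vocab)

targets = torch.randint(1, 5000, (4, 20))        # dummy reference transcripts
input_lengths = torch.full((4,), 120, dtype=torch.long)
target_lengths = torch.full((4,), 20, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                                  # one network, one objective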
    According to estimates, more than 60% of the world's population today is multilingual. In spoken communication, multilingual speakers often switch between languages unconsciously, prompted by factors such as their learning environment or changes in mood. This phenomenon, known as Code-Switching (CS), is especially prevalent in highly internationalized countries such as Taiwan, Singapore, and Malaysia. To handle CS, a model must not only account for acoustic features but also learn to pinpoint the exact moments at which the language switches. The complexity of this task often degrades the performance of E2E ASR systems, making CS one of the most pressing challenges in speech recognition.
    To tackle this challenge, we propose D-MoE (Disentangle-based Mixture-of-Experts), an encoder architecture designed to leverage the underlying information shared between languages while effectively reducing language confusion in the acoustic embeddings. Building on this, we introduce Language Acoustic Boundary Injection (LABI), a technique that establishes language boundaries within the encoder, subtly enriching the language knowledge carried by the acoustic embeddings and further sharpening the model's acuity to different languages.
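
    To make these ideas concrete, here is a minimal sketch, assuming PyTorch, of a bi-expert encoder layer: a shared path, Mandarin- and English-specific experts, a frame-level language gate, and a disentangle penalty. Everything in it (the BiExpertLayer name, the linear experts, the cosine-similarity penalty) is an illustrative assumption in the spirit of the abstract above, not the thesis's exact D-MoE design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BiExpertLayer(nn.Module):
    """A shared path plus two language-specific experts (here Mandarin and
    English), mixed by a frame-level language gate."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.shared = nn.Linear(dim, dim)     # cross-lingual shared knowledge
        self.expert_zh = nn.Linear(dim, dim)  # Mandarin-specific expert (assumed)
        self.expert_en = nn.Linear(dim, dim)  # English-specific expert (assumed)
        self.gate = nn.Linear(dim, 2)         # frame-level LID router

    def forward(self, x: torch.Tensor):
        # x: (batch, time, dim) acoustic embeddings
        h_zh = self.expert_zh(x)
        h_en = self.expert_en(x)
        post = self.gate(x).softmax(dim=-1)   # per-frame language posteriors
        routed = post[..., :1] * h_zh + post[..., 1:] * h_en
        out = self.shared(x) + routed         # shared path + language-routed path
        # Disentangle penalty: discourage the experts from encoding the same
        # information by penalizing frame-wise cosine similarity between them.
        disentangle = F.cosine_similarity(h_zh, h_en, dim=-1).abs().mean()
        return out, post, disentangle

layer = BiExpertLayer(dim=256)
out, lang_post, d_loss = layer(torch.randn(4, 100, 256))

    In training, the disentangle term would be weighted into the usual CTC/attention objectives, and supervising lang_post with frame-level language labels is one plausible way to inject a language boundary into the encoder, in the spirit of LABI above.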

Table of Contents

Abstract (Chinese)
Abstract (English)
List of Tables
List of Figures
Chapter 1 Introduction
    1.1 Background
    1.2 Evolution of Model Architectures
    1.3 Motivation
Chapter 2 Common Method
    2.1 Data Augmentation
    2.2 Multi-Task Learning Architecture
Chapter 3 Proposed Methodology
    3.1 Disentangle-Based Mixture-of-Expert (D-MoE)
        3.1.1 Overall Architecture of D-MoE
        3.1.2 Language-Aware Encoder (LAE)
        3.1.3 Integration into the MoE Architecture
        3.1.4 Disentangling Between Two Languages
    3.2 Language Acoustic Boundary Injection (LABI)
        3.2.1 Overall Architecture
        3.2.2 LID Information Block
        3.2.3 Language Boundary Alignment Loss
        3.2.4 Deep Language Posterior Injection
Chapter 4 Experiments
    4.1 Corpus
    4.2 Experimental Setup
    4.3 Overall Comparison
    4.4 Ablation Study
        4.4.1 Component-wise Evaluation of D-MoE
        4.4.2 Impact of NPC Loss
    4.5 Exemplars of Visualization
        4.5.1 Pre- and Post-Disentanglement
        4.5.2 Pre- and Post-Integration of NPC Loss
Chapter 5 Conclusion and Outlook
References
