
Author: Fan Jiang, Shao-Wei (范姜紹瑋)
Title: Multi-Encoder based End-to-End Model for English Mispronunciation Detection and Diagnosis (多編碼器端到端模型於英語錯誤發音檢測與診斷)
Advisor: Chen, Berlin (陳柏琳)
Committee Members: Hung, Jeih-Weih (洪志偉); Lin, Bor-Shen (林伯慎); Chen, Kuan-Yu (陳冠宇)
Oral Defense Date: 2021/07/30
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of Publication: 2021
Academic Year of Graduation: 109 (ROC calendar, 2020-2021)
Language: Chinese
Number of Pages: 60
Keywords (Chinese): 錯誤發音檢測和診斷, 電腦輔助發音訓練, 口音嵌入, 多任務學習, 端到端模型
Keywords (English): mispronunciation detection and diagnosis, computer-assisted pronunciation training, accent embeddings, multi-task learning, end-to-end model
Research Methods: experimental design, comparative study
DOI URL: http://doi.org/10.6345/NTNU202100982
Document Type: Academic thesis
Usage Statistics: 124 views, 15 downloads
    With the acceleration of globalization, more and more people need to learn a second language (L2), yet the number of language teachers cannot keep pace with the demand for language learning. A growing body of research has therefore focused on computer-assisted pronunciation training (CAPT), which uses computers to help learners study more conveniently and effectively. The most important module in CAPT is mispronunciation detection and diagnosis (MD&D), which builds on automatic speech recognition (ASR) as its core technology. Existing MD&D models, however, still face two problems. The first is task mismatch: a pure ASR task does not fully exploit the text prompt during training. The second is accent diversity: L2 learners have distinctive pronunciation habits, and the acoustic and linguistic characteristics of these habits make recognition difficult for the model. To address these two problems, this thesis proposes two directions of improvement for end-to-end MD&D (E2E MD&D) models. First, we augment the input with text prompts of different granularities (phoneme and character), making E2E ASR better suited to the MD&D task. Second, we design two accent-aware modules with complementary aims, one that supplies accent information to the model and one that removes it, in order to mitigate the impact of accent diversity on the E2E MD&D system. Experimental results on the public L2 corpus L2-ARCTIC show that the proposed MD&D models are clearly advantageous and effective.
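
    To make the task concrete: in recognition-based MD&D (e.g., [Leung et al., 2019]), the phone sequence recognized from the learner's speech is aligned against the canonical phone sequence of the text prompt; any mismatch is a detected mispronunciation, and the recognized phone serves as the diagnosis. The following Python sketch illustrates this standard decision rule only; the function name and the difflib-based alignment are illustrative choices, not taken from the thesis.

        # Align canonical (prompt) phones against recognized phones and report
        # each mismatch as (position, canonical phone, recognized phone).
        from difflib import SequenceMatcher

        def mdd_decisions(canonical, recognized):
            errors = []
            sm = SequenceMatcher(a=canonical, b=recognized, autojunk=False)
            for op, i1, i2, j1, j2 in sm.get_opcodes():
                if op == "replace":    # substitution: a wrong phone was produced
                    errors.extend((i1 + k, c, r) for k, (c, r) in
                                  enumerate(zip(canonical[i1:i2], recognized[j1:j2])))
                elif op == "delete":   # deletion: a canonical phone was omitted
                    errors.extend((i1 + k, c, None) for k, c in enumerate(canonical[i1:i2]))
                elif op == "insert":   # insertion: an extra phone was produced
                    errors.extend((i1, None, r) for r in recognized[j1:j2])
            return errors

        # "think" pronounced with /S/ for /TH/: detected at position 0, diagnosed as S
        print(mdd_decisions(["TH", "IH", "NG", "K"], ["S", "IH", "NG", "K"]))
        # -> [(0, 'TH', 'S')]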
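
    The first proposed direction, text-prompt input augmentation, adds a second encoder over the prompt's phone or character sequence so that the decoder attends jointly over acoustic and prompt representations. Below is a minimal PyTorch sketch of one plausible dual-encoder, hybrid CTC-attention layout; every module choice and size here is an illustrative assumption, not the thesis's exact architecture.

        import torch
        import torch.nn as nn

        class DualEncoderMDD(nn.Module):
            def __init__(self, n_phones, feat_dim=80, d_model=256):
                super().__init__()
                # Acoustic encoder over log-Mel frames.
                self.acoustic_enc = nn.LSTM(feat_dim, d_model // 2, num_layers=3,
                                            batch_first=True, bidirectional=True)
                # Text-prompt encoder over the canonical phone sequence.
                self.prompt_emb = nn.Embedding(n_phones, d_model)
                self.prompt_enc = nn.LSTM(d_model, d_model // 2,
                                          batch_first=True, bidirectional=True)
                # CTC branch on the acoustic states (+1 for the blank label).
                self.ctc_out = nn.Linear(d_model, n_phones + 1)
                # Attention decoder that attends over both encoders' outputs.
                self.dec_emb = nn.Embedding(n_phones, d_model)
                self.att = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
                self.dec_rnn = nn.LSTM(2 * d_model, d_model, batch_first=True)
                self.dec_out = nn.Linear(d_model, n_phones)

            def forward(self, feats, prompt, prev_tokens):
                h_ac, _ = self.acoustic_enc(feats)                  # (B, T, d)
                h_pr, _ = self.prompt_enc(self.prompt_emb(prompt))  # (B, L, d)
                ctc_logits = self.ctc_out(h_ac)                     # CTC branch
                q = self.dec_emb(prev_tokens)                       # (B, U, d)
                memory = torch.cat([h_ac, h_pr], dim=1)             # joint memory
                ctx, _ = self.att(q, memory, memory)                # cross-attention
                dec_h, _ = self.dec_rnn(torch.cat([q, ctx], dim=-1))
                return ctc_logits, self.dec_out(dec_h)

        model = DualEncoderMDD(n_phones=42)
        feats = torch.randn(2, 300, 80)           # 2 utterances, 300 frames
        prompt = torch.randint(0, 42, (2, 30))    # canonical phone ids
        prev = torch.randint(0, 42, (2, 30))      # teacher-forced decoder inputs
        ctc_logits, att_logits = model(feats, prompt, prev)

    Training such a model would follow the hybrid objective of [Watanabe et al., 2017], L = λ·L_CTC + (1 − λ)·L_attention, with λ (e.g., 0.3) balancing the two branches.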
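
    The second direction builds two accent-aware modules with opposite aims: one supplies accent information (for instance, by attaching a learned accent embedding to the encoder input) and the other suppresses it. A common way to suppress accent information, which the thesis's soft variant may or may not follow exactly, is adversarial training through a gradient reversal layer; the sketch below uses illustrative shapes throughout.

        import torch
        import torch.nn.functional as F

        class GradReverse(torch.autograd.Function):
            """Identity in the forward pass; negated, scaled gradient in backward."""
            @staticmethod
            def forward(ctx, x, lamb):
                ctx.lamb = lamb
                return x.view_as(x)

            @staticmethod
            def backward(ctx, grad_output):
                return -ctx.lamb * grad_output, None

        # Illustrative shapes: 8 utterances, 120 frames, 256-dim encoder states,
        # and 6 accent classes (L2-ARCTIC covers speakers from six L1 backgrounds).
        encoder_states = torch.randn(8, 120, 256, requires_grad=True)
        accent_labels = torch.randint(0, 6, (8,))
        accent_head = torch.nn.Linear(256, 6)

        # Pool to utterance level, then classify accent through the reversal layer:
        # the head learns to predict the accent, while the reversed gradient pushes
        # the encoder toward accent-invariant representations.
        pooled = GradReverse.apply(encoder_states.mean(dim=1), 1.0)
        adv_loss = F.cross_entropy(accent_head(pooled), accent_labels)
        adv_loss.backward()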

    Chapter 1  Introduction
        1.1  Research Motivation
        1.2  Task Description
        1.3  Automatic Speech Recognition
            1.3.1  DNN-HMM based Speech Recognition
                1.3.1.1  Feature Extraction
                1.3.1.2  Acoustic Model
                1.3.1.3  Language Model
                1.3.1.4  Decoding
            1.3.2  End-to-End Speech Recognition
                1.3.2.1  Connectionist Temporal Classification (CTC)
                1.3.2.2  Attention Model
                1.3.2.3  Hybrid CTC-Attention Model
        1.4  Computer-Assisted Pronunciation Training
            1.4.1  Types of Mispronunciation
            1.4.2  DNN-HMM based Mispronunciation Detection
            1.4.3  End-to-End Mispronunciation Detection
            1.4.4  Feedback
            1.4.5  Evaluation Metrics
        1.5  Research Content and Contributions
        1.6  Thesis Organization
    Chapter 2  Literature Review and Methods
        2.1  Mispronunciation Detection and Diagnosis Methods
            2.1.1  DNN-HMM based MD&D
                2.1.1.1  Goodness of Pronunciation (GOP)
                2.1.1.2  Extended Recognition Network (ERN)
                2.1.1.3  Acoustic Phonetic Model (APM)
            2.1.2  End-to-End MD&D
                2.1.2.1  Pronunciation-Score based Models
                2.1.2.2  Recognition-Result based Models
    Chapter 3  Experimental Setup
        3.1  Corpora
            3.1.1  TIMIT
            3.1.2  L2-ARCTIC
            3.1.3  Phoneme Statistics
        3.2  Experimental Methods
            3.2.1  Conventional DNN-HMM Model Design
            3.2.2  Baseline End-to-End Model Design
            3.2.3  Text-Prompt Dual-Encoder Model
            3.2.4  Accent-Aware Dual-Encoder Model
                3.2.4.1  Specific Accent Awareness
                3.2.4.2  Soft Accent Awareness
            3.2.5  Combined Text-Prompt and Accent-Aware Multi-Encoder Model
        3.3  MD&D Evaluation Method
    Chapter 4  Experimental Results and Discussion
        4.1  Goodness of Pronunciation and Baseline End-to-End Models
        4.2  Text-Prompt Dual-Encoder Model
        4.3  Accent-Aware Dual-Encoder Model
        4.4  Combined Text-Prompt and Accent-Aware Multi-Encoder Model
        4.5  Overall Comparison
    Chapter 5  Conclusion and Future Work
    References

    [Atal, 1974] B. S. Atal, "Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification," Journal of the Acoustical Society of America, vol. 55, no. 6, pp. 1304-1312, 1974.
    [Bahdanau et al., 2014] D. Bahdanau, K. Cho and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint, 2014.
    [Chen et al., 2018] L. Chen, J. Tao, S. Ghaffarzadegan and Y. Qian, "End-to-end neural network based automated speech scoring," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 2018.
    [Chiu and Chen, 2021] S. H. Chiu and B. Chen, "Innovative BERT-based reranking language models for speech recognition," in Proceedings of the Spoken Language Technology Workshop, 2021.
    [Chorowski et al., 2015] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho and Y. Bengio, "Attention-based models for speech recognition," arXiv preprint, 2015.
    [Davis and Mermelstein, 1980] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357-366, 1980.
    [Demenko et al., 2009] G. Demenko, A. Wagner, N. Cylwik and O. Jokisch, "An audiovisual feedback system for acquiring L2 pronunciation and L2 prosody," in Proceedings of the International Workshop on Speech and Language Technology in Education, 2009.
    [Devlin et al., 2018] J. Devlin, M. W. Chang, K. Lee and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint, 2018.
    [Fant, 1973] G. Fant, Speech Sounds and Features. Cambridge, MA: MIT Press, 1973.
    [Feng et al., 2020] Y. Feng, G. Fu, Q. Chen and K. Chen, "SED-MDD: Towards sentence dependent end-to-end mispronunciation detection and diagnosis," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 2020.
    [Garofolo et al., 1993] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus and D. S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," NASA STI/Recon Technical Report, no. 93, 1993.
    [Graves et al., 2006] A. Graves, S. Fernández, F. Gomez and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning, 2006.
    [Harrison et al., 2008] A. M. Harrison, W. Y. Lau, H. Meng and L. Wang, "Improving mispronunciation detection and diagnosis of learners' speech with context-sensitive phonological rules based on language transfer," in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), 2008.
    [Harrison et al., 2009] A. M. Harrison, W. K. Lo, X. Qian and H. Meng, "Implementation of an extended recognition network for mispronunciation detection and diagnosis in computer-assisted pronunciation training," in Proceedings of the International Workshop on Speech and Language Technology in Education, 2009.
    [Hinton et al., 2012] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath and B. Kingsbury, "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012.
    [Hu et al., 2015] W. Hu, Y. Qian, F. K. Soong and Y. Wang, "Improved mispronunciation detection with deep neural network trained acoustic models and transfer learning based logistic regression classifiers," Speech Communication, vol. 67, pp. 154-166, 2015.
    [Kim et al., 1997] Y. Kim, H. Franco and L. Neumeyer, "Automatic pronunciation scoring of specific phone segments for language instruction," in Proceedings of the European Conference on Speech Communication and Technology, 1997.
    [Laborde et al., 2016] V. Laborde, T. Pellegrini, L. Fontan, J. Mauclair, H. Sahraoui and J. Farinas, "Pronunciation assessment of Japanese learners of French with GOP scores and phonetic information," in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), 2016.
    [Lee and Glass, 2012] A. Lee and J. Glass, "A comparison-based approach to mispronunciation detection," in Proceedings of the Spoken Language Technology Workshop, 2012.
    [Leung et al., 2019] W. K. Leung, X. Liu and H. Meng, "CNN-RNN-CTC based end-to-end mispronunciation detection and diagnosis," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 2019.
    [Lo et al., 2020] T. H. Lo, S. Y. Weng, H. J. Chang and B. Chen, "An effective end-to-end modeling approach for mispronunciation detection," arXiv preprint, 2020.
    [Mao et al., 2018] S. Mao, Z. Wu, R. Li, X. Li, H. Meng and L. Cai, "Applying multitask learning to acoustic-phonemic model for mispronunciation detection and diagnosis in L2 English speech," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 2018.
    [Meng et al., 2007] H. Meng, Y. Y. Lo, L. Wang and W. Y. Lau, "Deriving salient learners’ mispronunciations from cross-language phonological comparisons," in Proceedings of the Workshop on Automatic Speech Recognition & Understanding, 2007.
    [Peters et al., 2018] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee and L. Zettlemoyer, "Deep contextualized word representations," arXiv preprint, 2018.
    [Povey et al., 2011] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer and K. Vesely, "The Kaldi speech recognition toolkit," in Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2011.
    [Li et al., 2016] K. Li, X. Qian and H. Meng, "Mispronunciation detection and diagnosis in L2 English speech using multidistribution deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 1, pp. 193-207, 2016.
    [Qian et al., 2010] X. Qian, H. Meng and F. Soong, "Capturing L2 segmental mispronunciations with joint-sequence models in computer-aided pronunciation training (CAPT)," in Proceedings of the International Symposium on Chinese Spoken Language Processing, 2010.
    [Qian et al., 2012] X. Qian, H. Meng and F. K. Soong, "The use of DBN-HMMs for mispronunciation detection and diagnosis in L2 English to support computer-aided pronunciation training," in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), 2012.
    [Radford et al., 2018] A. Radford, K. Narasimhan, T. Salimans and I. Sutskever, "Improving language understanding by generative pre-training," Technical report, OpenAI, 2018.
    [Rogerson-Revell, 2021] P. M. Rogerson-Revell, "Computer-assisted pronunciation training (CAPT): Current issues and future directions," RELC Journal, vol. 52, no. 1, pp. 189-205, 2021.
    [Stevens, 2000] K. N. Stevens, Acoustic Phonetics. Cambridge, MA: MIT Press, 2000.
    [Truong et al., 2005] K. P. Truong, A. Neri, F. de Wet, C. Cucchiarini and H. Strik, "Automatic detection of frequent pronunciation errors made by L2-learners," in Proceedings of the European Conference on Speech Communication and Technology, 2005.
    [Truong et al., 2007] K. P. Truong, A. Neri, F. de Wet, C. Cucchiarini and H. Strik, "Comparing classifiers for pronunciation error detection," in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), 2007.
    [Viterbi, 1967] A. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Transactions on Information Theory, vol. 13, no. 2, pp. 260-269, 1967.
    [Wang and Lee, 2012] Y. B. Wang and L. S. Lee, "Improved approaches of modeling and detecting error patterns with empirical analysis for computer-aided pronunciation training," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 2012.
    [Watanabe et al., 2017] S. Watanabe, T. Hori, S. Kim, J. R. Hershey and T. Hayashi, "Hybrid CTC/attention architecture for end-to-end speech recognition," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240-1253, 2017.
    [Watanabe et al., 2018] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala and T. Ochiai, "ESPnet: End-to-end speech processing toolkit," arXiv preprint, 2018.
    [Wei et al., 2009] S. Wei, G. Hu, Y. Hu and R. H. Wang, "A new method for mispronunciation detection using support vector machine based on pronunciation space models," Speech Communication, vol. 51, no. 10, pp. 896-905, 2009.
    [Witt and Young, 2000] S. M. Witt and S. J. Young, "Phone-level pronunciation scoring and assessment for interactive language learning," Speech Communication, vol. 30, no. 2-3, pp. 95-108, 2000.
    [Xu et al., 2015] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in Proceedings of the International Conference on Machine Learning, 2015.
    [Zhang et al., 2008] F. Zhang, C. Huang, F. K. Soong, M. Chu and R. Wang, "Automatic mispronunciation detection for Mandarin," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 2008.
    [Zhao et al., 2018] G. Zhao, S. Sonsaat, A. O. Silpachai, I. Lucic, E. Chukharev-Hudilainen, J. Levis and R. Gutierrez-Osuna, "L2-ARCTIC: A non-native English speech corpus," in Proceedings of Interspeech, 2018.
