
Graduate student: 林孟欣 (Lin, Meng-Shin)
Thesis title: 第二外語學習者之自動發音評測及錯誤發音偵測研究
Research on Automatic Pronunciation Assessment and Mispronunciation Detection for Second Language Learners
Advisor: 陳柏琳 (Chen, Berlin)
Oral defense committee: 陳柏琳 (Chen, Berlin)
陳冠宇 (Chen, Kuan-Yu)
曾厚強 (Tseng, Hou-Chiang)
Oral defense date: 2024/07/23
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of publication: 2024
Academic year of graduation: 112 (ROC calendar)
Language: Chinese
Number of pages: 55
Chinese keywords: 電腦輔助發音訓練, 自動發音評估, 錯誤發音偵測與診斷
English keywords: computer-assisted pronunciation training, automatic pronunciation assessment, mispronunciation detection and diagnosis
Research method: experimental design
DOI URL: http://doi.org/10.6345/NTNU202401845
Thesis type: academic thesis
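  • With the trend of globalization, computer-assisted pronunciation training (CAPT) systems are growing in popularity and are used to reduce teachers' workload, to assess pronunciation in online courses, and to help learners practice language skills. This thesis proposes a series of CAPT modeling techniques for teaching and self-study applications. For automatic pronunciation assessment (APA), we address data imbalance with a class-balanced loss function and resampling, which narrows the gap between the training and test sets and yields significant performance gains on the imbalanced speechocean762 dataset. For mispronunciation detection and diagnosis (MDD), we use a novel text-prompt-guided dictation model that balances precision and recall through phone-dependent thresholds, and we introduce a multi-view audio encoder that provides fine-grained pronunciation cues. These methods identify and diagnose L2 learners' pronunciation errors more precisely and provide immediate feedback. Comprehensive experiments on the L2-ARCTIC benchmark dataset show that our approach is practically feasible against several competitive baselines. Future research could explore more diverse languages and pronunciation scenarios to further improve the applicability and practicality of CAPT systems. We also hope to explore joint APA and MDD models that exploit the strengths of both to give learners better feedback.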

    With the trend of globalization, computer-assisted pronunciation training (CAPT) systems are becoming increasingly popular and are applied to reduce teachers' workload, to assess pronunciation in online courses, and to help learners practice language skills. This thesis proposes a series of CAPT modeling techniques for teaching and self-study applications. For Automatic Pronunciation Assessment (APA), we tackle data imbalance by adopting a class-balanced loss function and resampling methods, which narrows the gap between the training and test sets and yields significant performance improvements on the imbalanced speechocean762 dataset. For Mispronunciation Detection and Diagnosis (MDD), we employ a novel text-prompt-guided dictation model that balances precision and recall through phone-dependent thresholds, and we introduce a multi-view audio encoder that provides fine-grained articulatory cues. These methods enable more precise identification and diagnosis of L2 learners' pronunciation errors and offer timely feedback. Comprehensive experimental results on the L2-ARCTIC benchmark dataset indicate that our methods are practically feasible compared with multiple competitive baselines. Future research can explore more diverse languages and pronunciation scenarios to further enhance the applicability and practicality of CAPT systems. We also hope to explore joint models of APA and MDD that leverage the advantages of both to give learners better feedback. In summary, this study demonstrates the potential of these techniques in CAPT systems: they improve the accuracy of pronunciation assessment and better assist language learners in improving their pronunciation skills.
    This thesis explores the feasibility of broadening the practical application of CAPT and aims to have a positive impact on language education by promoting the spread and development of language learning.
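    The abstract mentions phone-dependent thresholds for balancing precision and recall but does not say how they are chosen. The following is a minimal sketch of one common way to do it, not the thesis's actual procedure: for each phone, pick the threshold that maximizes F1 on held-out data. The function names, the candidate grid, the default threshold of 0.5, and the assumption that the detector outputs a mispronunciation probability per phone are all hypothetical.

```python
from collections import defaultdict


def f1_score(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def tune_phone_thresholds(samples, candidates=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """Pick one threshold per phone by maximizing F1 on held-out samples.

    samples: iterable of (phone, score, is_mispronounced), where score is a
    hypothetical mispronunciation probability produced by some detector.
    Ties are resolved in favor of the earliest candidate in the grid.
    """
    by_phone = defaultdict(list)
    for phone, score, label in samples:
        by_phone[phone].append((score, label))

    thresholds = {}
    for phone, items in by_phone.items():
        best_t, best_f1 = candidates[0], -1.0
        for t in candidates:
            tp = sum(1 for s, y in items if s >= t and y)
            fp = sum(1 for s, y in items if s >= t and not y)
            fn = sum(1 for s, y in items if s < t and y)
            f1 = f1_score(tp, fp, fn)
            if f1 > best_f1:
                best_t, best_f1 = t, f1
        thresholds[phone] = best_t
    return thresholds


def detect(phone, score, thresholds, default=0.5):
    """Flag a phone as mispronounced if its score reaches that phone's threshold."""
    return score >= thresholds.get(phone, default)


# Toy held-out data (made up for illustration only).
dev = [
    ("th", 0.35, True), ("th", 0.45, True), ("th", 0.20, False),
    ("ae", 0.55, False), ("ae", 0.65, True), ("ae", 0.75, True),
]
thresholds = tune_phone_thresholds(dev)
```

    With a single global threshold of 0.5, this toy detector would miss both "th" errors and raise a false alarm on "ae"; per-phone thresholds avoid both, which is the kind of precision/recall trade-off the abstract refers to.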

    Chapter 1 Introduction
      1.1 Research Background and Motivation
      1.2 Research Directions
        1.2.1 Automatic Pronunciation Assessment (APA)
        1.2.2 Mispronunciation Detection and Diagnosis (MDD)
      1.3 Research Content and Contributions
        1.3.1 Automatic Pronunciation Assessment (APA)
        1.3.2 Mispronunciation Detection and Diagnosis (MDD)
      1.4 Thesis Organization
    Chapter 2 Literature Review
      2.1 Computer-Assisted Pronunciation Training (CAPT)
      2.2 Automatic Pronunciation Assessment (APA)
        2.2.1 Goodness of Pronunciation (GOP)
      2.3 Mispronunciation Detection and Diagnosis (MDD)
        2.3.1 MDD Based on Pronunciation Scoring
        2.3.2 MDD Based on Dictation Models
          2.3.2.1 Connectionist Temporal Classification (CTC)
      2.4 Self-Supervised Learning
      2.5 Articulatory Features
    Chapter 3 Experimental Methods and Setup
      3.1 Automatic Pronunciation Assessment (APA)
        3.1.1 APA Experimental Corpus
        3.1.2 Experimental Methods
          3.1.2.1 Baseline Setup
          3.1.2.2 Evaluation Methods
      3.2 Mispronunciation Detection and Diagnosis (MDD)
        3.2.1 Experimental Corpora
          3.2.1.1 TIMIT
          3.2.1.2 L2-ARCTIC
        3.2.2 Experimental Methods
          3.2.2.1 Acoustic Encoder
          3.2.2.2 Articulatory Feature Encoder
          3.2.2.3 Multi-View Acoustic Model
          3.2.2.4 Text-Prompt Dual Encoder
          3.2.2.5 Joint Network
          3.2.2.6 Mispronunciation Detector
          3.2.2.7 Phone Predictor
          3.2.2.8 Training Objectives
          3.2.2.9 Text-Prompt-Guided Mispronunciation Detection
        3.2.3 Evaluation Metrics
        3.2.4 Baseline Models
    Chapter 4 Experimental Results
      4.1 Automatic Pronunciation Assessment (APA)
      4.2 Mispronunciation Detection and Diagnosis (MDD)
    Chapter 5 Conclusions and Future Research Directions
    References

