Graduate Student: 張修瑞 (Chang, Hsiu-Jui)
Thesis Title: 端對端語音辨識技術於電腦輔助發音訓練 (Computer-assisted Pronunciation Training Leveraging End-to-End Speech Recognition Techniques)
Advisor: 陳柏琳 (Chen, Berlin)
Degree: Master
Department: Department of Computer Science and Information Engineering (資訊工程學系)
Year of Publication: 2019
Academic Year of Graduation: 107
Language: Chinese
Pages: 50
Keywords (Chinese): 端對端語音辨識、連結時序分類、注意力模型、聲學模型、發音檢測、發音診斷
Keywords (English): End-to-end speech recognition, Connectionist temporal classification, Attention model, Acoustic model, Mispronunciation detection, Mispronunciation diagnosis
DOI: http://doi.org/10.6345/NTNU201900510
Document Type: Academic thesis
Access Counts: 187 views, 52 downloads
Abstract:
The primary tasks of a computer-assisted pronunciation training (CAPT) system are mispronunciation detection and mispronunciation diagnosis. Previous research on CAPT has mostly relied on the forced-alignment procedure of a traditional speech recognition system: the acoustic models first segment an utterance into phone-level regions, and goodness of pronunciation (GOP) scores are then computed for each segment against the full phone set, or against a set of easily confused phones, with respect to the text prompt. On a separate front, the recently proposed end-to-end speech recognition architecture simplifies many of the steps required to train a traditional recognizer, so acoustic modeling under this framework has become a popular research topic in recent years; its two predominant instantiations are connectionist temporal classification (CTC) and the attention-based model. However, current explorations of this architecture are far more concerned with correctly mapping speech feature vectors to text sequences than with phone-level recognition for downstream applications such as CAPT. In view of this, this thesis investigates mispronunciation detection and diagnosis on the strength of end-to-end speech recognition. Drawing on earlier studies built on traditional acoustic models, it proposes three detection methods based on end-to-end acoustic models: 1) a method based directly on the speech recognition results; 2) a method leveraging the confidence scores produced during recognition; and 3) a method combining the confidence scores with the N-best recognition results. Notably, methods 1) and 3) accomplish mispronunciation diagnosis at the same time as detection. A series of experiments on a Mandarin mispronunciation detection and diagnosis task shows that performing detection and diagnosis directly from the recognition results, as well as jointly using the confidence scores and the N-best results, outperforms the conventional two-stage approach that computes GOP scores via forced alignment.
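For context, the GOP score referenced above is commonly defined, following Witt and Young's formulation, as the duration-normalized log ratio between the likelihood of the canonical phone and that of the best competing phone over the force-aligned segment. A minimal statement of this conventional definition (the baseline in the thesis may use a variant):

```latex
% GOP score for canonical phone p, where O^{(p)} is the speech segment
% force-aligned to p, NF(p) is its frame count, and Q is the phone inventory.
\mathrm{GOP}(p) = \frac{1}{NF(p)}\,
  \log \frac{P\bigl(O^{(p)} \mid p\bigr)}
            {\max_{q \in Q} P\bigl(O^{(p)} \mid q\bigr)}
```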
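To make method 1) concrete, the sketch below illustrates one straightforward realization: align the phone sequence recognized by the end-to-end model against the canonical phone sequence of the text prompt using Levenshtein-style dynamic programming, flag substitutions and deletions as mispronunciations, and read the diagnosis directly off the recognized phone. All names here are illustrative assumptions, not the thesis's actual implementation.

```python
# Minimal sketch: detection/diagnosis directly from recognition results.
# Phones are represented as lists of strings, e.g. ["zh", "i4"].

def align(canonical, recognized):
    """Return edit operations aligning `recognized` to `canonical`."""
    m, n = len(canonical), len(recognized)
    # dp[i][j] = edit distance between canonical[:i] and recognized[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if canonical[i - 1] == recognized[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # match/substitution
    # Backtrace to recover the operation sequence.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dp[i][j] == dp[i - 1][j - 1] + (canonical[i - 1] != recognized[j - 1])):
            ops.append(("match" if canonical[i - 1] == recognized[j - 1] else "sub",
                        canonical[i - 1], recognized[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("del", canonical[i - 1], None))   # phone not produced
            i -= 1
        else:
            ops.append(("ins", None, recognized[j - 1]))  # extraneous phone
            j -= 1
    return list(reversed(ops))

def detect_and_diagnose(canonical, recognized):
    """Flag mispronunciations; the recognized phone doubles as the diagnosis."""
    return [(op, ref, hyp) for op, ref, hyp in align(canonical, recognized)
            if op in ("sub", "del")]

# Example: the learner produces "zh i4" where the prompt expects "z i4".
print(detect_and_diagnose(["z", "i4"], ["zh", "i4"]))
# -> [('sub', 'z', 'zh')]: /z/ detected as mispronounced, diagnosed as /zh/
```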
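Similarly, method 3) can be pictured as letting each N-best hypothesis vote on every canonical phone position with a weight given by its recognition confidence. The sketch below, reusing `align` from the previous sketch, is a hypothetical illustration under that reading; the 0.5 threshold and the aggregation scheme are assumptions, not the thesis's exact recipe.

```python
from collections import defaultdict

def nbest_detect(canonical, nbest, threshold=0.5):
    """nbest: list of (phone_sequence, confidence) pairs from the recognizer."""
    votes = [defaultdict(float) for _ in canonical]  # per-position phone -> weight
    for hyp, conf in nbest:
        pos = 0
        for op, _ref, rec in align(canonical, hyp):
            if op in ("match", "sub"):
                votes[pos][rec] += conf
                pos += 1
            elif op == "del":
                votes[pos][None] += conf   # this hypothesis judged the phone absent
                pos += 1
            # insertions do not consume a canonical position
    results = []
    for phone, tally in zip(canonical, votes):
        total = sum(tally.values()) or 1.0
        if tally[phone] / total < threshold:
            diagnosis = max(tally, key=tally.get)  # strongest competing label
            results.append((phone, diagnosis))
    return results

# Example with a 2-best list: the top hypothesis hears "zh", the second "z".
nbest = [(["zh", "i4"], 0.7), (["z", "i4"], 0.3)]
print(nbest_detect(["z", "i4"], nbest))
# -> [('z', 'zh')]: /z/ flagged as mispronounced, diagnosed as /zh/
```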