| Field | Value |
|---|---|
| Graduate Student | 賴敏軒 |
| Thesis Title | 實證探究多種鑑別式語言模型於語音辨識之研究 (Empirical Comparisons of Various Discriminative Language Models for Speech Recognition) |
| Advisor | 陳柏琳 (Chen, Berlin) |
| Degree | Master |
| Department | Department of Computer Science and Information Engineering |
| Year of Publication | 2011 |
| Graduation Academic Year | 99 (ROC calendar; 2010–2011) |
| Language | Chinese |
| Pages | 68 |
| Keywords | speech recognition; discriminative language model; margin; training criterion |
| Document Type | Academic thesis |
Abstract: Language modeling (LM), at the heart of most automatic speech recognition (ASR) systems, aims to capture the regularities of a natural language; its model parameters are estimated from a large amount of training text. N-gram language models (especially bigram and trigram models), which determine the probability of a word given the preceding n-1 words of history, are the most prominently used. The n-gram model, normally trained with the maximum likelihood (ML) criterion, is not always capable of achieving the minimum recognition error rate, which is in fact what the final evaluation metric measures. To address this problem, a range of discriminative language modeling (DLM) methods have recently been proposed and demonstrated with varying degrees of success; they aim to correctly discriminate among the recognition hypotheses so as to select the best one, rather than merely fit the distribution of the training data. In this thesis, we first present an empirical investigation of several leading DLM methods designed to boost speech recognition performance. We then propose margin-based DLM training methods that penalize incorrect recognition hypotheses in proportion to their word error rate (WER) distance from the desired (oracle) hypothesis, i.e., the one with the minimum WER. Experiments conducted on a large vocabulary continuous speech recognition (LVCSR) task illustrate the performance merits of the methods instantiated from our DLM framework when compared with other existing methods.
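The margin-based training idea in the abstract can be sketched as a perceptron-style reranker over N-best lists, where the update for a wrongly top-ranked hypothesis is weighted by its WER gap to the oracle. This is a minimal illustrative sketch, not the thesis's exact formulation: the feature set (unigram + bigram counts), learning rate, and all function names are assumptions.

```python
# Sketch of margin-based discriminative LM training over N-best lists:
# incorrect hypotheses are penalized in proportion to their WER distance
# from the oracle (minimum-WER) hypothesis. Illustrative only.
from collections import Counter

def edit_distance(a, b):
    """Levenshtein distance between two word sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[m][n]

def wer(hyp, ref):
    """Word error rate of a hypothesis against the reference."""
    return edit_distance(hyp, ref) / max(1, len(ref))

def features(hyp):
    """Sparse unigram + bigram count features (a common DLM feature set)."""
    f = Counter(hyp)
    f.update(zip(hyp, hyp[1:]))
    return f

def score(w, f):
    """Linear model score: dot product of weights and sparse features."""
    return sum(w.get(k, 0.0) * v for k, v in f.items())

def train(nbest_lists, refs, epochs=5, lr=0.1):
    """Perceptron-style updates, weighted by the WER margin to the oracle."""
    w = {}
    for _ in range(epochs):
        for hyps, ref in zip(nbest_lists, refs):
            oracle = min(hyps, key=lambda h: wer(h, ref))
            best = max(hyps, key=lambda h: score(w, features(h)))
            if best != oracle:
                # Penalty grows with how much worse the chosen hypothesis is.
                margin = wer(best, ref) - wer(oracle, ref)
                fo, fb = features(oracle), features(best)
                for k in set(fo) | set(fb):
                    w[k] = w.get(k, 0.0) + lr * margin * (fo.get(k, 0) - fb.get(k, 0))
    return w
```

On a toy N-best list, after training, rescoring with the learned weights ranks the oracle hypothesis first; the WER-weighted margin simply makes updates from badly wrong hypotheses larger than updates from near-misses.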