| Graduate student: | 朱芳輝 (Fang-Hui Chu) |
|---|---|
| Thesis title: | 資料選取方法於鑑別式聲學模型訓練之研究 (Training Data Selection for Discriminative Training of Acoustic Models) |
| Advisor: | 陳柏琳 (Berlin Chen) |
| Degree: | Master |
| Department: | Department of Computer Science and Information Engineering |
| Year of publication: | 2008 |
| Academic year of graduation: | 96 (ROC calendar; 2007-2008) |
| Language: | Chinese |
| Number of pages: | 116 |
| Chinese keywords: | 資料選取, 鑑別式訓練, 聲學模型, 語音辨識 |
| English keywords: | Data Selection, Discriminative Training, Acoustic Models, Speech Recognition |
| Document type: | Academic thesis |
This thesis investigates training data selection approaches for improving minimum phone error (MPE) based discriminative training of acoustic models, applied to Mandarin large vocabulary continuous speech recognition (LVCSR). First, borrowing from Boosting algorithms (AdaBoost) the idea of emphasizing the training samples that the already-trained classifier misclassifies, we modify the weights of the statistics accumulated for each training utterance during MPE training, so that utterances prone to being misrecognized contribute more to acoustic model training. In addition, multiple speech recognition systems whose acoustic models were trained under different data selection criteria are combined at different recognition stages to further reduce the recognition error rate. Second, we propose a data selection approach that operates in the expected phone accuracy domain of the word lattices of the training utterances. It selects training data at two granularities, utterances and phone arcs, to supply MPE training with more discriminative training samples. Third, we integrate the proposed approaches with a previously proposed frame-level data selection approach based on normalized entropy, and with a frame-level phone accuracy function, to further improve MPE training. All experiments were carried out on the Mandarin broadcast news corpus (MATBN) of Public Television Service (PTS) news, and the results provide preliminary evidence that the proposed training data selection approaches are feasible.
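The three selection ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the exponential weighting form, the `alpha` scale, the accuracy band `[low, high]`, and the direction of the entropy threshold are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import math


def boosted_utterance_weights(phone_accuracies, alpha=1.0):
    """Boosting-style weights: lower accuracy gives a larger weight.

    Assumption: an exponential form exp(alpha * (1 - accuracy)),
    normalized so that the weights average to 1.
    """
    if not phone_accuracies:
        return []
    raw = [math.exp(alpha * (1.0 - acc)) for acc in phone_accuracies]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


def select_by_expected_accuracy(expected_accuracies, low, high):
    """Return indices of utterances (or phone arcs) whose expected phone
    accuracy lies inside the hypothetical band [low, high]."""
    return [i for i, acc in enumerate(expected_accuracies) if low <= acc <= high]


def normalized_entropy(posteriors):
    """Entropy of one frame's posterior distribution divided by log(N),
    so that the value lies in [0, 1]."""
    n = len(posteriors)
    if n < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in posteriors if p > 0.0)
    return entropy / math.log(n)


def select_frames(frame_posteriors, threshold):
    """Keep frames whose normalized entropy is at least `threshold`.

    Assumption: high-entropy (more confusable) frames are the ones kept.
    """
    return [t for t, post in enumerate(frame_posteriors)
            if normalized_entropy(post) >= threshold]


if __name__ == "__main__":
    # Toy numbers, invented purely for illustration.
    print(boosted_utterance_weights([0.9, 0.5], alpha=2.0))
    print(select_by_expected_accuracy([0.95, 0.6, 0.2], low=0.3, high=0.8))
    print(select_frames([[0.5, 0.5], [1.0, 0.0], [0.7, 0.3]], threshold=0.8))
```

In an actual MPE trainer, the weights would scale each utterance's accumulated statistics, and the selected indices would restrict which utterances, phone arcs, or frames contribute to those statistics.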