Graduate Student: 李鴻欣 (Hung-Shin Lee)
Thesis Title: 基於分類錯誤之線性鑑別式特徵轉換應用於大詞彙連續語音辨識 (Classification Error-based Linear Discriminative Feature Transformation for Large Vocabulary Continuous Speech Recognition)
Advisor: 陳柏琳 (Berlin Chen)
Degree: Master
Department: 資訊工程學系 (Department of Computer Science and Information Engineering)
Year of Publication: 2009
Academic Year of Graduation: 97
Language: Chinese
Number of Pages: 107
Keywords (Chinese): 語音辨識、鑑別分析、特徵擷取、特徵轉換
Keywords (English): speech recognition, discriminant analysis, feature extraction, feature transformation
Thesis Type: Academic thesis
Access Count: 228 views, 2 downloads
Abstract:
The goal of linear discriminant analysis (LDA) is to find a linear transformation that projects the original data onto a lower-dimensional feature space while preserving the geometric separability among classes. However, LDA does not always guarantee higher classification accuracy. One possible reason is that its objective function is not directly tied to the classification error rate, so it is not necessarily suited to the decision rule of a particular classifier; automatic speech recognition (ASR) is a good example. In this thesis, we extend classical LDA by exploring the relationship between the empirical classification error rate and the Mahalanobis distance for each pair of easily confusable phone classes, and by modifying the between-class scatter matrix so that it is estimated from the pairwise empirical classification accuracy of each class pair rather than from their pairwise Euclidean distances. The new method not only retains the lightweight solvability of LDA, but also requires no assumption about the probability distribution of the data.
In addition, we propose a novel linear discriminative feature extraction method, called generalized likelihood ratio discriminant analysis (GLRDA), which seeks a lower-dimensional feature space based on the concept of the likelihood ratio test. GLRDA not only accounts for the heteroscedasticity of the data, that is, the covariance matrices of the classes may be flexibly treated as different, but also obtains a lower-dimensional feature subspace that benefits classification by minimizing the probability of the most confusing situation among classes, as described by the null hypothesis. We also show that LDA and heteroscedastic linear discriminant analysis (HLDA) can be regarded as two special cases of GLRDA. Furthermore, to enhance the robustness of the speech features, GLRDA can be combined with the empirical confusion information provided by the recognizer.
Experimental results show that, on a Mandarin Chinese large vocabulary continuous speech recognition task, the proposed methods outperform LDA and other existing extensions such as HLDA.
English Abstract:
The goal of linear discriminant analysis (LDA) is to seek a linear transformation that projects an original data set into a lower-dimensional feature subspace while simultaneously retaining geometrical class separability. However, LDA cannot always guarantee better classification accuracy. One possible reason is that its criterion is not directly associated with the classification error rate, so it does not necessarily accommodate the allocation rule governed by a given classifier, such as that employed in automatic speech recognition (ASR). In this thesis, we extend classical LDA by leveraging the relationship between the empirical phone classification error rate and the Mahalanobis distance for each phone class pair. To this end, we modify the original between-class scatter from a measure based on the pairwise Euclidean distance to one based on the pairwise empirical classification accuracy of each class pair, while preserving the lightweight solvability of LDA and, like LDA, making no distributional assumption.
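To make the modification concrete, the following is a minimal sketch, not taken verbatim from the thesis, of a confusion-weighted pairwise between-class scatter; the weighting function w(·) and the pairwise error statistics ε_ij are placeholders for the quantities the thesis derives from the empirical error rate and the Mahalanobis distance:

```latex
% Classical LDA between-class scatter, written in its equivalent pairwise form
% (p_i: class prior, \mu_i: class mean):
S_b = \sum_{i<j} p_i\, p_j\, (\mu_i - \mu_j)(\mu_i - \mu_j)^{\top}

% Sketch of the confusion-weighted variant: each class pair is reweighted by a
% function of its empirical pairwise classification error \varepsilon_{ij}
% (hypothetical weighting w(\cdot); the thesis defines the actual form):
\tilde{S}_b = \sum_{i<j} w(\varepsilon_{ij})\, p_i\, p_j\, (\mu_i - \mu_j)(\mu_i - \mu_j)^{\top}

% As in standard LDA, the projection is then given by the leading eigenvectors
% of S_w^{-1} \tilde{S}_b, so the closed-form (lightweight) solvability is kept:
\Theta^{*} = \arg\max_{\Theta}\,
  \operatorname{tr}\!\left[(\Theta^{\top} S_w \Theta)^{-1}\,\Theta^{\top} \tilde{S}_b\, \Theta\right]
```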
Furthermore, we present a novel discriminative linear feature transformation, named generalized likelihood ratio discriminant analysis (GLRDA), based on the likelihood ratio test (LRT). It seeks a lower-dimensional feature subspace by making the most confusing situation, described by the null hypothesis, as unlikely to happen as possible, without the homoscedasticity assumption on the class distributions. We also show that classical LDA and its well-known extension, heteroscedastic linear discriminant analysis (HLDA), are two special cases of the proposed method. The empirical class confusion information can be further incorporated into GLRDA for better recognition performance.
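As a hedged illustration of the likelihood-ratio view (the exact hypotheses, likelihood forms, and the confusion-weighted extension are those defined in the thesis), the criterion can be sketched as follows, where H_0 describes the most confusing situation in the projected space and H_1 the class-separated alternative:

```latex
% Generalized likelihood ratio evaluated in the subspace given by projection \Theta:
\lambda(\Theta) =
  \frac{\max_{H_0} \, L\!\left(\{\Theta^{\top} x_n\};\, H_0\right)}
       {\max_{H_1} \, L\!\left(\{\Theta^{\top} x_n\};\, H_1\right)},
\qquad
\Theta^{*} = \arg\min_{\Theta} \; \lambda(\Theta)

% Illustrative reading (assumed Gaussian class models): with a shared
% (homoscedastic) covariance under both hypotheses, the ratio reduces to a
% Wilks' Lambda-type determinant ratio whose minimization recovers classical
% LDA, while keeping class-specific covariances under H_1 leads to an
% HLDA-like criterion (S_w: within-class scatter, S_t: total scatter, N: sample size):
\lambda(\Theta)^{2/N} =
  \frac{\left|\Theta^{\top} S_w \Theta\right|}{\left|\Theta^{\top} S_t \Theta\right|}
```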
Experimental results demonstrate that our approaches yield moderate improvements over LDA and other existing methods, such as HLDA, on a Chinese large vocabulary continuous speech recognition (LVCSR) task.