
Graduate Student: Wen-Yi Chu (朱紋儀)
Thesis Title: Exploring Modulation Spectrum Normalization for Robust Speech Recognition (調變頻譜特徵正規化於強健語音辨識之研究)
Advisor: Berlin Chen (陳柏琳)
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of Publication: 2011
Graduation Academic Year: 99 (ROC calendar, 2010-2011)
Language: Chinese
Number of Pages: 69
Chinese Keywords: automatic speech recognition, speech robustness, nonnegative matrix factorization, probabilistic latent semantic analysis
English Keywords: speech recognition, robustness method, nonnegative matrix factorization, probabilistic latent semantic analysis
Thesis Type: Academic thesis
    Abstract (translated from Chinese): In the development of automatic speech recognition, robustness has long been an important research topic. Among the many robustness techniques, enhancing and compensating speech feature parameters constitutes one major school. In recent years, a considerable number of new methods have improved feature robustness by updating the temporal sequences of speech features and their modulation spectra. Most of these techniques normalize the statistical properties of the temporal sequences or modulation spectra so as to reduce inter-utterance mismatch and thereby improve recognizer robustness. This thesis instead adopts a new perspective, aiming at decomposition and component analysis of the modulation spectrum, and proposes two modulation spectrum normalization methods. First, it uses nonnegative matrix factorization (NMF) to extract the important basis vectors of the modulation spectrum and updates the modulation spectrum accordingly to obtain more robust speech features. Second, it further gives the modulation spectrum a probabilistic interpretation and, adopting the concept of probabilistic latent semantic analysis (PLSA), performs probabilistic component analysis on the modulation spectrum, extracting the more important components to obtain more robust speech features. All experiments were conducted on the internationally used Aurora-2 continuous-digit database. Compared with the baseline using Mel-frequency cepstral coefficient (MFCC) features, the proposed methods significantly reduce the word error rate. In addition, the thesis combines the proposed methods with several well-known robust feature techniques; experiments show that, relative to any single method, the combinations further improve recognition accuracy, indicating that the proposed methods are well complementary to many robust feature techniques.
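The NMF-based normalization described in the abstract can be sketched as follows. This is a minimal illustration, not the thesis's exact procedure: the toy data, the choice of 4 basis vectors, and the per-utterance FFT framing are all assumptions; the thesis learns the basis vectors from the modulation spectra of clean training speech and evaluates on Aurora-2. The factorization itself uses the standard Lee & Seung multiplicative updates.

```python
import numpy as np

def nmf(V, r, n_iter=200, eps=1e-9):
    """Factorize a nonnegative matrix V (m x n) as W @ H with W (m x r),
    H (r x n), using Lee & Seung multiplicative updates."""
    rng = np.random.default_rng(0)
    m, n = V.shape
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update encodings
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update basis spectra
    return W, H

# Hypothetical toy data: magnitude modulation spectra of one feature
# coefficient's time series across several utterances
# (rows = utterances, columns = modulation-frequency bins).
rng = np.random.default_rng(1)
utterances = [rng.standard_normal(128) for _ in range(8)]
spectra = np.stack([np.abs(np.fft.rfft(u)) for u in utterances])

# Learn r basis spectra, then reconstruct each utterance's modulation
# spectrum in the space spanned by those bases; the thesis pairs the
# updated magnitude spectrum with the original phase to get new features.
W, H = nmf(spectra.T, r=4)   # columns of W are basis modulation spectra
approx = (W @ H).T           # normalized (reconstructed) spectra
print(approx.shape)          # → (8, 65)
```

In the thesis the bases come from clean training data only, so projecting noisy spectra onto their span suppresses components the clean speech never exhibits.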

    The environmental mismatch caused by additive noise and/or channel distortion often seriously degrades the performance of a speech recognition system. Therefore, various robustness methods have been proposed, and one prevalent school of thought aims to refine the modulation spectra of speech feature sequences. In this thesis, we propose two novel methods to normalize the modulation spectra of speech feature sequences. First, we leverage nonnegative matrix factorization (NMF) to extract a common set of basis spectral vectors that capture the intrinsic temporal structure inherent in the modulation spectra of clean training speech features. The new modulation spectra of the speech features, constructed by mapping the original modulation spectra into the space spanned by these basis vectors, are shown to have good noise robustness. Second, to render the modulation spectra of speech feature sequences from a probabilistic perspective, we employ probabilistic latent semantic analysis (PLSA) with a latent set of topic distributions to explore the relationship between each modulation frequency and the magnitude modulation spectrum as a whole. All experiments were carried out on the Aurora-2 database and task. Experimental results show that the features updated via NMF and PLSA maintain high recognition accuracy under both matched and mismatched noisy conditions, and are quite competitive with those obtained by other existing methods.
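The PLSA-based method treats each utterance's magnitude modulation spectrum as a "document" over modulation-frequency "words", with magnitudes acting as pseudo-counts. A minimal EM sketch of that idea follows; the toy data, the number of latent topics, and the rescaling of the smoothed spectrum back to the original total magnitude are illustrative assumptions, not the thesis's exact recipe.

```python
import numpy as np

def plsa(N, k, n_iter=100, eps=1e-12):
    """PLSA via EM on a nonnegative count-like matrix N (docs x words).
    Returns P(z|d) with shape (docs, k) and P(w|z) with shape (k, words)."""
    rng = np.random.default_rng(0)
    d, w = N.shape
    Pz_d = rng.random((d, k)); Pz_d /= Pz_d.sum(1, keepdims=True)
    Pw_z = rng.random((k, w)); Pw_z /= Pw_z.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities P(z|d,w), shape (d, k, w)
        R = Pz_d[:, :, None] * Pw_z[None, :, :]
        R /= R.sum(1, keepdims=True) + eps
        # M-step: re-estimate topic mixtures and topic-word distributions
        NR = N[:, None, :] * R                      # n(d,w) * P(z|d,w)
        Pw_z = NR.sum(0); Pw_z /= Pw_z.sum(1, keepdims=True) + eps
        Pz_d = NR.sum(2); Pz_d /= Pz_d.sum(1, keepdims=True) + eps
    return Pz_d, Pw_z

# Hypothetical data: magnitude modulation spectra of 6 utterances
# (rows = "documents", columns = modulation-frequency "words").
rng = np.random.default_rng(1)
mags = np.abs(np.fft.rfft(rng.standard_normal((6, 64)), axis=1))

Pz_d, Pw_z = plsa(mags, k=3)
# Smoothed spectrum: sum_z P(z|d) P(w|z), rescaled to the original
# total magnitude so only the spectral *shape* is normalized.
smoothed = (Pz_d @ Pw_z) * mags.sum(1, keepdims=True)
```

Keeping only the dominant latent components is what discards the spectral detail most affected by noise, mirroring the component-selection idea stated in the abstract.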

    Table of Contents

    Chapter 1: Introduction
      1.1 Research Background
      1.2 Robust Speech Techniques
      1.3 Research Content and Contributions
      1.4 Thesis Organization
    Chapter 2: Literature Review
      2.1 Speech Feature Extraction
      2.2 Robust Speech Feature Techniques
        2.2.1 Effects of Noise on Modulation Spectrum Features
        2.2.2 Feature Transformation Methods in the Modulation Spectrum Domain
        2.2.3 Feature Transformation Methods in the Temporal Domain
          2.2.3.1 Data-Driven Linear Feature Space Transformations
          2.2.3.2 Speech Feature Normalization
    Chapter 3: Experimental Corpus and Baseline Results
      3.1 Experimental Corpus
      3.2 Experimental Setup
      3.3 Recognition Performance Evaluation
      3.4 Baseline Experimental Results
    Chapter 4: A Study of Modulation Spectrum Factorization
      4.1 NMF-Based Modulation Spectrum Normalization
      4.2 Experimental Results of NMF-Based Modulation Spectrum Normalization
        4.2.1 Results of the NMF Method on Original MFCC Features
        4.2.2 Results of Combining NMF with Other Robust Feature Algorithms
        4.2.3 Results of the NMF Method on Different Feature Types
        4.2.4 Effectiveness of NMF in Reducing Modulation Spectrum Magnitude Distortion
    Chapter 5: Extensions of the Study of Modulation Spectrum Factorization
      5.1 PLSA-Based Modulation Spectrum Normalization
      5.2 Experimental Results of PLSA-Based Modulation Spectrum Normalization
        5.2.1 Results of the PLSA Method on Original MFCC Features
        5.2.2 Results of Combining PLSA with Other Robust Feature Algorithms
        5.2.3 Performance Comparison of PLSA with Other Modulation Spectrum Update Methods
        5.2.4 Effectiveness of PLSA in Reducing Modulation Spectrum Magnitude Distortion
      5.3 A Study of Other Data Analysis Techniques for Modulation Spectrum Factorization
        5.3.1 PCA-Based Modulation Spectrum Normalization
        5.3.2 ICA-Based Modulation Spectrum Normalization
      5.4 Experimental Results of Other Data Analysis Techniques for Modulation Spectrum Factorization
        5.4.1 Results of the PCA and ICA Methods on Original MFCC Features
        5.4.2 Results of Combining PCA and ICA with Other Robust Feature Algorithms
    Chapter 6: Conclusions and Future Work
      6.1 Conclusions
      6.2 Future Work
    Chapter 7: References

    Acero, A. (1990), “Acoustical and environmental robustness for automatic speech recognition,” Ph.D. Dissertation, Carnegie Mellon University.
    Beyerlein, P., X. Aubert, R. Haeb-Umbach, M. Harris, D. Klakow, A. Wendemuth, S. Molau, H. Ney, M. Pitz and A. Sixtus (2002), “Large vocabulary continuous speech recognition of broadcast news - The Philips/RWTH approach,” Speech Communication, vol. 37: pp. 109-131.
    Schuller, B., F. Weninger, M. Wöllmer, Y. Sun, G. Rigoll (2010), “Non-negative matrix factorization as noise-robust feature extractor for speech recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process.
    Boll, S. F. (1979), “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27(2): pp. 113-120.
    Comon, P. (1994), “Independent component analysis – A new concept?” Signal Process., vol. 36, pp. 287-314.
    Cooke, M., P. Green, L. Josifovski, A. Vizinho (2001), “Robust automatic speech recognition with missing and uncertain acoustic data,” Speech Communication, vol. 34: pp. 267-285.
    Chen, C. P. and J. Bilmes (2007), “MVA processing of speech features,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15(1): pp. 257-270.
    Chen, B. (2009), “Word topic models for spoken document retrieval and transcription.” ACM Transaction on Asian Language Information Processing, Vol. 8, No. 1, pp. 2:1-2:27.
    Davis, S. B. and P. Mermelstein (1980). “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28(4): pp. 357-366.
    Dharanipragada, S. and M. Padmanabhan (2000), “A nonlinear unsupervised adaptation technique for speech recognition,” Interspeech2000: 6th International Conference on Spoken Language Processing (ICSLP), Beijing, China.
    Driesen, J. and H. Van Hamme (2011), “Modeling vocabulary acquisition, adaptation and generalization in infants using adaptive Bayesian PLSA,” Neurocomputing.
    Droppo, J. (2008), Tutorial of European Signal Processing Conference (EUSIPCO), 2008.
    Duda, R. O. and P. E. Hart (1973), “Pattern classification and scene analysis,” New York, John Wiley and Sons.
    Ephraim, Y. and H. L. Van Trees (1995), “A signal subspace approach for speech enhancement,” IEEE Transactions on Speech and Audio Processing, vol. 3(4): pp. 251-266, July.
    Furui, S. (1981), “Cepstral analysis techniques for automatic speaker verification,” IEEE Transaction on Acoustic, Speech and Signal Processing, vol. 29(2): pp. 254-272.
    Gales, M. J. F. (1998), “Maximum likelihood linear transformations for HMM-based speech recognition,” Computer Speech and Language, vol. 12(2): pp. 75-98.
    Gales, M. J. F. (2002), “Maximum likelihood multiple subspace projections for Hidden Markov Models,” IEEE Transactions on Speech and Audio Processing, vol. 10(2): pp. 37-47.
    Gales, M. J. F. and S. J. Young (1995), “Robust speech recognition in additive and convolutional noise using parallel model combination,” Computer Speech and Language, vol. 9: pp. 289-307.
    Gauvain, J.-L. and C.-H. Lee (1994), “Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains,” IEEE Transaction on Speech and Audio Processing, vol. 2(2): pp. 291-297.
    Greenberg, S. (1997), “On the origins of speech intelligibility in the real world,” Proceedings of ESCA-NATO Tutorial and Research Workshop on Robust Speech Recognition for Unknown Communication Channels, Pont-a-Mousson, France, April.
    Hermansky, H. (1991), “Perceptual linear predictive (PLP) analysis of speech,” Journal of the Acoustical Society of America, vol. 87: pp. 1738-1752.
    Hermansky, H. (1995), “Exploring temporal domain for robustness in speech recognition,” Proc. of 15th International Congress on Acoustics, vol. II.: pp. 61-64, June 1995.
    Hermansky, H. (1997), “Should recognizers have ears?” Invited Tutorial Paper, Proceedings of ESCA-NATO Tutorial and Research Workshop on Robust Speech Recognition for Unknown Communication Channels, pp. 1-10, Pont-a-Mousson, France, April.
    Hermansky, H. and N. Morgan. (1994), “RASTA processing of speech,” IEEE Transactions on Speech and Audio Processing, vol. 2(4): pp. 578-589.
    Hilger, F. and H. Ney (2001), “Quantile based histogram equalization for noise robust speech recognition,” Interspeech'2001 - 7th European Conference on Speech Communication and Technology (Eurospeech), Aalborg, Denmark.
    Hilger, F. and H. Ney (2006), “Quantile based histogram equalization for noise robust large vocabulary speech recognition,” IEEE Transactions on Audio, Speech and Language Processing, vol. 14(3): pp. 845-854.
    Hirsch, H. G. and D. Pearce (2000), “The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions,” in Proc. ISCA ITRW ASR2000, Paris, France.
    Hofmann, T. (1999), “Probabilistic latent semantic analysis,” in Proc. Uncertainty in Artificial Intelligence (UAI).
    Huang, X., A. Acero, H.-W. Hon (2001), “Spoken language processing: A guide to theory, algorithm and system development,” Upper Saddle River, NJ, USA, Prentice Hall PTR.
    Hung, J.-W. and W.-Y. Tsai (2008), “Constructing modulation frequency domain based features for robust speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing.
    Huang, S.-Y., W.-H. Tu, J.-W. Hung (2009), “A study of sub-band modulation spectrum compensation for robust speech recognition,” ROCLING XXI: Conference on Computational Linguistics and Speech Processing (ROCLING 2009), Taichung, Taiwan.
    Huo, Q., C. Chan, C. H. Lee (1995), “Bayesian adaptive learning of the parameters of hidden Markov model for speech recognition,” IEEE Transactions on Speech and Audio Processing, vol. 3(4): pp. 334-345.
    Hyvärinen, A. (1999), “Gaussian moments for noisy independent component analysis,” IEEE Signal Processing Letters, vol. 6, no. 6.
    Koehler, J., N. Morgan, H. Hermansky, H. G. Hirsch, G. Tong (1994), “Integrating RASTA-PLP into speech recognition,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '94), Albuquerque, New Mexico, pp. 421-424.
    Koo, B., J. D. Gibson, S. D. Gray (1989), “Filtering of colored noise for speech enhancement and coding,” IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP '89), Glasgow, Scotland.
    Kumar, N. (1997), “Investigation of silicon-auditory models and generalization of linear discriminant analysis for improved speech recognition,” Ph.D. Dissertation, Johns Hopkins University.
    Lee, D. D. and H. S. Seung (1999), “Learning the parts of objects by non-negative matrix factorization,” Nature, 401:788–791.
    Lee, D. D. and H. S. Seung (2000), “Algorithms for Non-negative Matrix Factorization,” Advances in Neural Information Processing Systems 13.
    Lin, S. H., Y. M. Yeh, B. Chen (2006a), “Exploiting polynomial-fit histogram equalization and temporal average for robust speech recognition,” Interspeech'2006 - 9th International Conference on Spoken Language Processing (ICSLP), Pittsburgh, Pennsylvania.
    Lin, S. H., Y. M. Yeh, B. Chen (2006b), “An improved histogram equalization approach for robust speech recognition,” ROCLING XVIII: Conference on Computational Linguistics and Speech Processing (ROCLING 2006), Hsinchu, Taiwan.
    Leggetter, C. J. and P. C. Woodland (1995), “Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models,” Computer Speech and Language, vol. 9: pp. 171-185.
    Molau, S. (2003), “Normalization in the acoustic feature space for improved speech recognition,” Ph.D. Dissertation, RWTH Aachen University.
    Molau, S., F. Hilger, H. Ney (2003), “Feature space normalization in adverse acoustic conditions,” IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), Hong Kong.
    Molau, S., M. Pitz, H. Ney (2001), “Histogram based normalization in the acoustic feature space,” IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU '01), Trento, Italy.
    Mika, S. (1999), “Fisher discriminant analysis with kernels,” IEEE International Workshop on Neural Networks for Signal Processing (NNSP 1999), Madison, Wisconsin.
    Raj, B. (2000), “Reconstruction of incomplete spectrograms for robust speech recognition,” Ph.D. Dissertation, ECE Department, Carnegie Mellon University, Pittsburgh.
    Saon, G., M. Padmanabhan, R. Gopinath, S. Chen (2000), “Maximum likelihood discriminant feature spaces,” IEEE International Conference on Acoustics, Speech, Signal processing (ICASSP '00), Istanbul, Turkey.
    Segura, J. C., C. Benitez, et al. (2004), “Cepstral domain segmental nonlinear feature transformations for robust speech recognition,” IEEE Signal Processing Letters, vol. 11(5): pp. 517-520.
    Sun, L.-C., C.-W. Hsu, L.-S. Lee (2007), “Modulation Spectrum Equalization for robust Speech Recognition,” IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU '07).
    Torre, A., A. M. Peinado, J. C. Segura, J. L. Perez-Cordoba, M. C. Benitez, A. J. Rubio (2005), “Histogram equalization of speech representation for robust speech recognition,” IEEE Transactions on Speech and Audio Processing, vol. 13(3): pp. 355-366.
    Torre, A., J. C. Segura, C. Benitez, A. M. Peinado, A. J. Rubio (2002), “Non-linear transformations of the feature space for robust Speech recognition,” IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP '02), Orlando, Florida.
    Varga, A. P. and R. K. Moore (1990), “Hidden Markov model decomposition of speech and noise,” in Proc. International Conference on Acoustics, Speech and Signal Processing, pp. 845-848, Albuquerque, NM, U.S.A., April.
    Viikki, O. and K. Laurila (1998), “Cepstral domain segmental feature vector normalization for noise robust speech recognition,” Speech Communication, vol. 25: pp. 133-147.
    Wu, J., Q. Huo, D. Zhu (2005), “An environment compensated maximum likelihood training approach based on stochastic vector mapping,” IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP '05), Philadelphia, Pennsylvania.
    Xiao, X., E. S. Chng, H. Li (2008), “Normalization of the speech modulation spectra for robust speech recognition,” IEEE Transaction on Audio, Speech, and Language Processing, vol. 16, no. 8.
    Young, S., G. Evermann, et al. (2006), “The HTK Book (for HTK Version 3.4),” Cambridge University.
