Graduate Student: 汪逸婷
Thesis Title: 運用調變頻譜分解技術於強健語音特徵擷取之研究 (Leveraging Modulation Spectrum Factorization Techniques for Robust Speech Recognition)
Advisor: 陳柏琳
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of Publication: 2014
Graduation Academic Year: 102 (ROC calendar)
Language: Chinese
Number of Pages: 86
Keywords (Chinese): 調變頻譜、強健性、自動語音辨識、非負矩陣分解法、稀疏性、壓縮感知法
Keywords (English): modulation spectrum, robustness, automatic speech recognition, nonnegative matrix factorization, sparsity, compressive sensing
Thesis Type: Academic thesis
Modulation spectrum processing of acoustic features has received considerable attention in the area of robust automatic speech recognition (ASR) because of its relative simplicity and good empirical performance. This thesis focuses on two lines of work. The first extends nonnegative matrix factorization (NMF): an emerging school of thought conducts NMF in the modulation spectrum domain so as to distill intrinsic, noise-invariant temporal structure from acoustic features for better robustness. We extend NMF in two directions: cluster-based NMF, which groups the training utterances into clusters, and sparse NMF, which imposes a sparsity constraint on the factorization. The second is compressive sensing (CS), a framework that recovers a signal from a compact set of the most relevant measurements; we propose the novel idea of applying CS to the modulation spectra of speech features. In cluster-based NMF, utterances belonging to different clusters are processed with their own cluster-specific sets of basis vectors, so that NMF can capture the important information in speech more precisely without interference from inter-utterance variability. Sparse NMF explores the role of sparsity in NMF so that the derived basis vectors yield sparser, more localized, and less redundant representations of the modulation spectra. All experiments were conducted on the widely used Aurora-2 database and task, and further validated on the large-vocabulary Aurora-4 task. Empirical evidence reveals that both proposed extensions improve the robustness of speech recognition and achieve recognition accuracy competitive with, or better than, several widely used modulation-spectrum-based methods.
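The thesis itself contains no code, but the NMF machinery underlying both proposed extensions is the standard Lee-Seung multiplicative-update algorithm. The sketch below is only illustrative: the random "feature trajectories", the dimensions, and the function names are assumptions, not the thesis's implementation. The modulation spectrum is taken as the magnitude FFT of each feature dimension's trajectory along time, and the optional `l1` term crudely mimics the sparsity constraint of the sparse-NMF variant.

```python
import numpy as np

def nmf(V, r, n_iter=300, l1=0.0, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for V ~ W @ H (all nonnegative).

    l1 > 0 adds a simple L1 penalty on the encodings H, sketching the
    sparse-NMF idea; l1 = 0 recovers plain NMF.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + eps
    H = rng.random((r, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + l1 + eps)   # encoding update
        W *= (V @ H.T) / (W @ H @ H.T + eps)        # basis update
    return W, H

# Illustrative "acoustic features": 100 frames x 13 cepstral dimensions.
rng = np.random.default_rng(1)
feats = rng.random((100, 13))

# Modulation spectrum: magnitude FFT along the time axis of each dimension.
V = np.abs(np.fft.rfft(feats, axis=0))   # shape (51, 13), nonnegative

W, H = nmf(V, r=4)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

A cluster-based variant along the lines of the first extension would first group utterances (e.g. by k-means over utterance-level statistics) and learn one cluster-specific basis `W` per group instead of a single global one.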
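The compressive-sensing direction mentioned in the abstract recovers a sparse signal from far fewer linear measurements than its length by L1 minimization. The following minimal sketch uses the iterative shrinkage-thresholding algorithm (ISTA); the synthetic sparse "modulation spectrum" and all sizes are assumptions for illustration, not the thesis's setup.

```python
import numpy as np

def ista(A, y, lam=1e-3, n_iter=2000):
    """Solve min_x 0.5*||A x - y||^2 + lam*||x||_1 by iterative
    shrinkage-thresholding (proximal gradient descent)."""
    L = np.linalg.norm(A, 2) ** 2           # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = x + (A.T @ (y - A @ x)) / L     # gradient step on the data term
        x = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # soft threshold
    return x

rng = np.random.default_rng(0)
n, m, k = 128, 48, 5                        # signal length, measurements, nonzeros

# Synthetic sparse signal standing in for a sparse modulation spectrum.
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)

A = rng.standard_normal((m, n)) / np.sqrt(m)   # random Gaussian sensing matrix
y = A @ x_true                                  # compressed measurements

x_hat = ista(A, y)
rec_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
```

With many fewer measurements than signal samples (48 versus 128 here), the L1-regularized reconstruction still recovers the sparse signal closely, which is the property the thesis exploits on modulation spectra.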