Author: 林士翔
Lin, Shih-Hsiang
Thesis title: 數據擬合與分群方法於強健語音特徵擷取之研究
Exploring the Use of Data Fitting and Clustering Techniques for Robust Speech Recognition
Advisors: 葉耀明
Yeh, Yao-Ming
陳柏琳
Chen, Berlin
Degree: Master
Department: 資訊教育研究所
Graduate Institute of Information and Computer Education
Year of publication: 2007
Academic year of graduation: 95 (ROC calendar)
Language: Chinese
Number of pages: 107
Keywords (Chinese): 語音辨識, 語音強健技術, 統計圖等化法, 數據擬合, 遺失特徵理論
Keywords (English): Speech Recognition, Robustness, Histogram Equalization, Data-Fitting, Missing Feature Theory
Document type: Academic thesis
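  • Speech has long been the most natural and easiest-to-use medium of human communication. Undoubtedly, speech will also become the principal medium of interaction between people and all kinds of intelligent electronic devices, so automatic speech recognition (ASR) technology will play the most critical role in that interaction. Most current ASR systems achieve very good recognition results in ideal, clean laboratory environments where the speech signal is free of interference. When applied in real-world environments, however, recognition accuracy often drops substantially, because complex environmental factors cause a mismatch between the training and test conditions. Speech robustness techniques have therefore become especially important and have received considerable attention.
    Research on speech robustness methods can be broadly discussed at two levels, according to what is processed: methods that start from the speech feature values themselves, and methods that start from their statistical distributions. Each kind of research has its own strengths and weaknesses. This thesis attempts to combine the strengths of both and uses data-fitting techniques to improve the recognition performance of ASR systems. We first propose cluster-based polynomial-fit histogram equalization (CPHEQ), which uses the concept of histogram equalization (HEQ) together with stereo training speech data to obtain polynomial transformation functions. We then make some assumptions about this method and extend it, deriving two further methods. The first is polynomial-fit histogram equalization (PHEQ), which remedies the large memory footprint and processor time required by conventional histogram equalization. The second is selective cluster-based polynomial-fit histogram equalization (SCPHEQ), which draws on missing feature theory to reconstruct speech feature parameters. The speech recognition experiments use the Aurora-2 corpus. The results show that, under clean-condition training, the proposed methods significantly reduce the word error rate relative to the baseline and also outperform other conventional speech robustness methods.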

    Speech is the primary and most convenient means of communication between individuals. Automatic speech recognition (ASR) is also expected to play a more active role and to serve as the major human-machine interface between people and different kinds of intelligent electronic devices in the near future. Most current state-of-the-art ASR systems achieve quite high recognition performance in controlled laboratory environments. However, when the systems are moved out of the laboratory and deployed in real-world applications, their performance often degrades dramatically, because varying environmental effects lead to a mismatch between the acoustic conditions of the training and test speech data. Robustness techniques have therefore received considerable attention in recent years.
    Robustness techniques in general fall into two categories, according to whether they operate on the speech feature values themselves or on their corresponding probability distributions. Each category has its own strengths and limitations. In this thesis, several attempts were made to integrate these two sources of information to improve current speech robustness methods by using a novel data-fitting scheme. First, cluster-based polynomial-fit histogram equalization (CPHEQ), based on histogram equalization and polynomial regression, was proposed to directly characterize the relationship between the speech feature vectors and their corresponding probability distributions by utilizing stereo speech training data. We then extended the idea of CPHEQ with some additional assumptions and derived two further methods, namely polynomial-fit histogram equalization (PHEQ) and selective cluster-based polynomial-fit histogram equalization (SCPHEQ). PHEQ uses polynomial regression to efficiently approximate the inverse of the cumulative distribution functions of speech feature vectors for HEQ. This avoids the high computation cost and large storage consumption of traditional HEQ methods. SCPHEQ is based on missing feature theory and uses polynomial regression to reconstruct unreliable feature components. All experiments were carried out on the Aurora-2 database and task. Experimental results showed that, for clean-condition training, our methods achieved a considerable word error rate reduction over the baseline system and also significantly outperformed the other robustness methods.
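
    The PHEQ idea described above, fitting a polynomial to the inverse cumulative distribution function so that equalization needs only a handful of coefficients, can be sketched as follows. This is a minimal illustration for a single feature dimension, not the thesis implementation: the function names, the rank-based CDF estimate, the polynomial order of 7, and the synthetic Gaussian data are all assumptions made for the example.

```python
import numpy as np


def empirical_cdf(values):
    """Rank-based CDF estimate in (0, 1) for each element of a 1-D array."""
    ranks = np.argsort(np.argsort(values))  # 0-based rank of every sample
    return (ranks + 0.5) / len(values)


def fit_inverse_cdf(clean_values, order=7):
    """Least-squares polynomial mapping a CDF value to a clean feature value.

    The polynomial order is an illustrative assumption.
    """
    return np.polyfit(empirical_cdf(clean_values), clean_values, order)


def equalize(noisy_values, coeffs):
    """Map each noisy value through its own CDF value and the fitted polynomial."""
    return np.polyval(coeffs, empirical_cdf(noisy_values))


# Synthetic stand-ins for one cepstral dimension (assumed data, not Aurora-2).
rng = np.random.default_rng(0)
clean = rng.normal(loc=0.0, scale=1.0, size=5000)   # training features
noisy = 3.0 + 0.4 * rng.normal(size=400)            # shifted, compressed test features

coeffs = fit_inverse_cdf(clean)
equalized = equalize(noisy, coeffs)

print("noisy     mean/std:", noisy.mean(), noisy.std())
print("equalized mean/std:", equalized.mean(), equalized.std())
```

    In this sketch only order + 1 coefficients are kept per feature dimension, in contrast to a table-based HEQ that would store a histogram or quantile table for the reference distribution.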

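    Chapter 1 Introduction
      1.1 Research Background
      1.2 Statistical Speech Recognition
      1.3 Speech Robustness Techniques
      1.4 Research Content and Contributions
      1.5 Organization of the Thesis
    Chapter 2 Literature Review
      2.1 Speech Feature Extraction
      2.2 Effects of Noise Interference
      2.3 Robust Speech Feature Techniques
        2.3.1 Feature Transformation
          2.3.1.1 Data-Dependent Linear Feature Space Transformation
          2.3.1.2 Feature Normalization
        2.3.2 Feature Compensation
        2.3.3 Feature Reconstruction
          2.3.3.1 Missing-Feature Reconstruction Applied in Front-End Feature Extraction
          2.3.3.2 Missing-Feature Reconstruction Applied in Back-End Speech Decoding
    Chapter 3 Experimental Corpus and Baseline Experimental Results
      3.1 Experimental Corpus
      3.2 Experimental Setup
      3.3 Recognition Performance Evaluation
      3.4 Baseline Experimental Results
    Chapter 4 Improvements to Feature Compensation
      4.1 Cluster-based Polynomial-fit Histogram Equalization
      4.2 Experimental Results for Cluster-based Polynomial-fit Histogram Equalization
      4.3 Experimental Results for Cluster-based Polynomial-fit Histogram Equalization Combined with Different Speech Features
    Chapter 5 Extensions of Cluster-based Polynomial-fit Histogram Equalization
      5.1 Polynomial-fit Histogram Equalization
        5.1.1 Experimental Results for Polynomial-fit Histogram Equalization (PHEQ)
      5.2 Selective Cluster-based Polynomial-fit Histogram Equalization
        5.2.1 Experimental Results for Selective Cluster-based Polynomial-fit Histogram Equalization
    Chapter 6 Conclusions and Future Work
      6.1 Conclusions
      6.2 Future Work
    References
    Publications by the Author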

