Graduate student: | 劉成韋 Liu Cheng-Wei |
---|---|
Thesis title: | 強健性語音辨識上關於特徵正規化與其它改良技術的研究 A Study on Feature Normalization and Other Improved Techniques for Robust Speech Recognition |
Advisor: | 陳柏琳 Chen, Berlin |
Degree: | Master |
Department: | Department of Computer Science and Information Engineering |
Year of publication: | 2005 |
Academic year of graduation: | 93 |
Language: | Chinese |
Number of pages: | 115 |
Keywords (Chinese): | 特徵抽取 (feature extraction), 特徵正規化 (feature normalization), 統計圖等化法 (histogram equalization), 頻譜熵 (spectral entropy), 語音辨識 (speech recognition), 強健性 (robustness) |
Keywords (English): | feature extraction, feature normalization, histogram equalization, spectral entropy, speech recognition, Aurora 2.0, robust |
Thesis type: | Academic thesis |
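Over thousands of years of evolution, human beings have continuously accumulated and passed on the wisdom of daily life, so in the past the pace of change in civilization matched the pace of human evolution. Today, however, technology advances far faster than humans evolve, and the multimedia audio-visual information available in daily life keeps growing: broadcast radio and television programs, voice mail, lecture recordings, digital archives, and so on. For this reason, handheld mobile devices that can access such multimedia information anytime and anywhere are receiving more and more attention. In most of this multimedia, speech is clearly one of the principal carriers of semantic content. Moreover, speech has always been the most natural and most direct means of human communication. If speech can serve as the bridge between people and technology products, it is not only friendly and effective but also spares users tedious operating procedures. Technology products on the market today are generally becoming smaller, so touch-based operation is gradually becoming less convenient. In addition, traditional human-machine interfaces such as the mouse and keyboard cannot be used appropriately in every environment; in a moving car, for example, they are inconvenient. Using speech as the human-machine interface would therefore greatly improve convenience and bring technology and daily life closer together. However, speech recognition is usually disturbed by complicating factors such as background noise, channel effects, and speaker and linguistic differences, so recognition systems cannot reach their best performance and recognition rates often fall short of expectations.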
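The aim of this thesis is to study and compare many current speech robustness techniques, improve them, and finally integrate them into a new technique. The main approach is based on table-based histogram equalization (THEQ), combined with other related techniques to improve speech robustness. THEQ is then refined into a modified histogram equalization, in which the reference distribution is split according to frame type into silence and speech; following the characteristics of Chinese, speech is further subdivided into INITIALs and FINALs. The proposed modified histogram equalization raises the recognition rate by a relative 4.04% over conventional table-based histogram equalization and by at least a relative 5.75% over the baseline recognition rate. We also combine spectral entropy features extracted from the speech signal with linear discriminant analysis and append them to the conventional speech features to form a new feature set, which yields a relative improvement of nearly 1.00%. Combining the new features with the other research topic of this thesis (THEQ) gives an additive effect, raising the average relative improvement in recognition rate to 5.19%.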
Over thousands of years of evolution, human beings have continuously acquired and accumulated knowledge from daily life. For most of that time, civilization and human evolution advanced at roughly the same pace. Today, however, the rapid development of technology has far outpaced human evolution. Huge quantities of multimedia information, such as broadcast radio and television programs, voice mail, and digital archives, keep growing and filling our computers, networks, and lives. Accessing multimedia information anytime and anywhere from small handheld mobile devices is therefore receiving more and more attention. Speech is the primary and most convenient means of communication between people, and in the near future it is expected to play a more active role as the major human-machine interface between people and many kinds of smart devices. It would thus be much more convenient to use speech as the human-machine interface, and to automatically transcribe, retrieve, and summarize multimedia using the speech information inherent in it. However, speech recognition is usually hampered by complicating factors such as background and channel noise and speaker and linguistic variation, which leave current state-of-the-art recognition systems still far from perfect.
With these observations in mind, this thesis makes several attempts to improve current speech robustness techniques and to find a way to integrate them. The experiments were carried out on the Aurora 2.0 database and on Mandarin broadcast news speech collected in Taiwan. Considering the phonetic characteristics of the Chinese language, a modified histogram equalization (MHEQ) approach is first proposed. Separate reference histograms are established for the silence and speech segments (MHEQ-2) or, more precisely, for the silence, INITIAL, and FINAL segments (MHEQ-3) in Chinese. In clean environments, the proposed approach yields relative improvements of more than 5.75% over the baseline system and 4.04% over the conventional table-based histogram equalization (THEQ) approach. Furthermore, spectral entropy features obtained after Linear Discriminant Analysis (LDA) are used to augment the Mel-frequency cepstral features, and initial results indicate considerable improvements. Finally, fusion of the proposed approaches is also investigated, with very promising results.
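The histogram equalization idea summarized above can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names (`build_reference_table`, `equalize`, `equalize_by_class`), the number of table entries, and the class labels are all hypothetical, and the code assumes one feature dimension is processed at a time. The first function stores quantiles of the training data as a lookup table, the second maps each test value to the reference value at the same cumulative probability, and the third uses a separate reference table per frame class, in the spirit of the silence/speech split described for MHEQ.

```python
import numpy as np

def build_reference_table(train_values, n_bins=100):
    """Store quantiles of one feature dimension as a lookup table (assumed size n_bins)."""
    levels = (np.arange(n_bins) + 0.5) / n_bins      # CDF levels in (0, 1)
    table = np.quantile(np.asarray(train_values, dtype=float), levels)
    return levels, table

def equalize(test_values, levels, table):
    """Map each test value to the reference value that has the same CDF level."""
    x = np.asarray(test_values, dtype=float)
    ranks = np.argsort(np.argsort(x))                # rank of each value within the utterance
    cdf = (ranks + 0.5) / len(x)                     # empirical CDF from order statistics
    return np.interp(cdf, levels, table)             # table lookup with linear interpolation

def equalize_by_class(test_values, frame_labels, tables):
    """Class-dependent variant: one reference table per frame class (labels are hypothetical)."""
    x = np.asarray(test_values, dtype=float)
    labels = np.asarray(frame_labels)
    out = x.copy()                                   # frames with an unknown label pass through
    for label, (levels, table) in tables.items():
        mask = labels == label
        if mask.any():
            out[mask] = equalize(x[mask], levels, table)
    return out

# Usage with synthetic data: the test values are a shifted and scaled copy of the training range.
train = np.linspace(-1.0, 1.0, 1001)
levels, table = build_reference_table(train)
noisy = 3.0 * train[::10] + 5.0
restored = equalize(noisy, levels, table)
```

Because the mapping depends only on ranks, any monotonic distortion of a feature dimension is undone up to the resolution of the table. The per-class variant assumes frame labels are already available, for example from a voice activity detector or a first recognition pass, which this sketch does not provide.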