Graduate student: 楊明璋 (Yang, Ming-Jhang)
Thesis title: 探索基於生成對抗網路之新穎強健性技術於語音辨識的應用 (Exploring Generative Adversarial Network Based Robustness Techniques for Automatic Speech Recognition)
Advisor: 陳柏琳 (Chen, Berlin)
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of publication: 2019
Academic year of graduation: 107
Language: Chinese
Pages: 46
Keywords (Chinese): automatic speech recognition, robust speech recognition, generative adversarial networks, deep learning techniques, feature robustness techniques, modulation spectrum
DOI URL: http://doi.org/10.6345/NTNU201900632
Document type: Academic thesis
In recent years, deep learning techniques have achieved major breakthroughs in many fields and excelled in a wide range of practical applications, automatic speech recognition among them. Although mainstream speech recognition systems can already match human listening performance on certain benchmark tasks, they are not robust to environmental interference the way humans are; that is, despite substantial improvements in these systems, noise still degrades recognition accuracy to some extent. Common environmental noise sources include background voices, trains, bus stops, car noise, and restaurant babble. Robustness techniques therefore play an important role in the development of today's speech recognition systems. In view of this, this thesis investigates effective enhancement methods based on generative adversarial networks, applied to the modulation spectra of speech feature vector sequences. A series of experiments on the Aurora-4 corpus shows that the proposed methods can improve recognition performance.
Nowadays deep learning technologies have achieved record-breaking results in a wide array of realistic applications, such as automatic speech recognition (ASR). Even though mainstream ASR systems evaluated on a few benchmark tasks have already reached human-like performance, they, in reality, are not robust to environmental distortions in the manner that humans are. In view of this, this thesis sets out to develop effective enhancement methods, stemming from the so-called generative adversarial networks (GAN), for use in the modulation domain of speech feature vector sequences. A series of experiments conducted on the Aurora-4 database and task seem to demonstrate the utility of our proposed methods.
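The abstracts describe enhancement applied in the modulation domain of speech feature vector sequences: each feature dimension's temporal trajectory is transformed to its modulation spectrum, the magnitude spectrum is modified, and the trajectory is reconstructed. As a rough illustration only (not the thesis's actual method), the sketch below computes modulation spectra with numpy and applies a per-bin gain where the thesis would use a learned GAN generator; the names `enhance` and `gain` are hypothetical.

```python
import numpy as np

def modulation_spectrum(features):
    """One-sided modulation spectrum: FFT along the time axis of a
    (T, D) feature matrix, one spectrum per feature dimension."""
    return np.fft.rfft(features, axis=0)

def enhance(features, gain):
    """Scale the magnitude modulation spectrum by a per-bin `gain`,
    keep the phase, and reconstruct the time-domain trajectories.
    A GAN generator would replace the fixed `gain` in the thesis's setting."""
    spec = np.fft.rfft(features, axis=0)
    mag, phase = np.abs(spec), np.angle(spec)
    modified = (gain[:, None] * mag) * np.exp(1j * phase)
    return np.fft.irfft(modified, n=features.shape[0], axis=0)

# Toy example: 100 frames of 13-dimensional features.
rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 13))
gain = np.ones(51)        # a length-100 signal has 51 one-sided bins
out = enhance(feats, gain)  # identity gain reconstructs the input
```

With an identity gain the round trip `irfft(rfft(x))` returns the original trajectories, which is a convenient sanity check before swapping in a learned magnitude-domain model.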