
Graduate Student: 何冠勳 (Ho, Kuan-Hsun)
Thesis Title: 語音增益之研究 — 適應性與可解釋性 (Improving Compatibility and Interpretability in Speech Enhancement)
Advisor: 陳柏琳 (Chen, Berlin)
Oral Defense Committee: 江振宇 (Chiang, Chen-Yu), 洪志偉 (Hung, Jeih-Weih), 陳柏琳 (Chen, Berlin)
Oral Defense Date: 2024/01/20
Degree: Master
Department: Department of Computer Science and Information Engineering
Publication Year: 2024
Academic Year of Graduation: 112 (ROC calendar; 2023–2024)
Language: English
Number of Pages: 62
Chinese Keywords: 語音增益、兼容性、強健性語音辨識、處理偽影、可解釋性、Sinc卷積、關鍵頻帶
English Keywords: Speech Enhancement, Compatibility, Noise-robust Speech Recognition, Processing Artifacts, Interpretability, Sinc-convolution, Crucial Bands
DOI URL: http://doi.org/10.6345/NTNU202400233
Thesis Type: Academic thesis
    This thesis takes an in-depth look at the field of Speech Enhancement (SE), a critical process that refines speech signals by reducing noise and distortion. Leveraging deep neural networks (DNNs), this study addresses two fundamental challenges: 1) exploring the compatibility between SE and automatic speech recognition (ASR) systems, and 2) improving the interpretability of DNN-based SE models.
    The motivation arises from the artifacts that SE models may introduce during operation, which can compromise ASR performance and thus call for a re-evaluation of the learning objective. To address this, a novel Noise- and Artifact-aware loss function (NAaLoss) is proposed, which significantly improves ASR performance while preserving SE quality.
    In addition, among DNN-based SE methods, we explore a novel design, Sinc-based convolution (Sinc-conv), to strike a balance between interpretability and the learning freedom of time-domain approaches. Building on it, we devise the reformed Sinc-convolution (rSinc-conv), which not only advances the state of the art in SE but also reveals the specific frequency combinations that neural networks prioritize during SE.
    This research makes substantial contributions: 1) defining processing artifacts in SE, demonstrating the effectiveness of NAaLoss, gaining insight by visualizing artifacts, and bridging the gap between SE and ASR objectives; 2) developing rSinc-conv tailored for SE, which offers advantages in training efficiency, filter diversity, and interpretability; and 3) analyzing what the network prioritizes, exploring filters of different shapes, and evaluating various SE models, all of which further our understanding and improvement of SE networks. Overall, this research aims to contribute to the discourse in the SE field and to pave a technical path toward more robust and efficient SE in real-world scenarios.

    This work delves into the domain of Speech Enhancement (SE), a critical process for refining speech signals by reducing noise and distortions. Leveraging deep neural networks (DNNs), this study addresses two fundamental challenges: 1) exploring the compatibility between SE and Automatic Speech Recognition (ASR) systems, and 2) enhancing the interpretability of DNN-based SE models.
    The motivation stems from the potential introduction of artifacts by SE models that can compromise ASR performance, necessitating a re-evaluation of the learning objectives. To tackle this, a novel Noise- and Artifact-aware loss function (NAaLoss) is proposed, significantly improving ASR performance while preserving SE quality.
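    The thesis's exact NAaLoss formulation is not reproduced in this record, but the general recipe of an artifact-aware objective can be sketched. The hypothetical artifact_aware_loss below (the function name and the weights alpha, beta, gamma are placeholders, not quantities from the thesis) splits the enhanced waveform into a scaled-clean-speech part, a residual-noise part, and an artifact part by least-squares projection onto the clean and noise signals, then penalizes the three error terms with separate weights; it illustrates the idea of treating noise and artifacts as distinct error sources, and is not the author's loss.

```python
import torch

def artifact_aware_loss(enhanced, clean, noise, alpha=1.0, beta=1.0, gamma=1.0):
    """enhanced, clean, noise: (batch, samples) waveforms, where noisy = clean + noise."""
    # Least-squares projection of the enhanced signal onto span{clean, noise}
    # via the normal equations (a small ridge keeps the 2x2 system invertible).
    basis = torch.stack([clean, noise], dim=-1)                       # (B, T, 2)
    gram = basis.transpose(1, 2) @ basis + 1e-6 * torch.eye(2, device=basis.device)
    rhs = basis.transpose(1, 2) @ enhanced.unsqueeze(-1)              # (B, 2, 1)
    coeffs = torch.linalg.solve(gram, rhs)                            # (B, 2, 1)
    a, b = coeffs[:, 0], coeffs[:, 1]
    target_part = a * clean                    # scaled clean speech kept by the model
    noise_part = b * noise                     # residual noise leaking through
    artifact_part = enhanced - target_part - noise_part  # error the projection cannot explain
    # Weight speech distortion, residual noise, and processing artifacts separately.
    distortion = torch.mean((target_part - clean) ** 2)
    residual_noise = torch.mean(noise_part ** 2)
    artifacts = torch.mean(artifact_part ** 2)
    return alpha * distortion + beta * residual_noise + gamma * artifacts
```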
    Within DNN-based SE methods, a novel approach, Sinc-based convolution (Sinc-conv), is explored to strike a balance between the interpretability of spectral approaches and the learning freedom of time-domain methods. Building on this, we devise the reformed Sinc-conv (rSinc-conv), which not only advances the state of the art in SE but also sheds light on the specific frequency components prioritized by neural networks during SE.
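    For readers unfamiliar with Sinc-based convolution, the minimal sketch below shows a standard Sinc-conv layer in the spirit of SincNet, the design that rSinc-conv reforms: each output channel is a band-pass filter parameterized only by two learnable cutoff frequencies rather than free-form kernel weights, which is what makes the learned frequency bands directly readable. The channel count, kernel size, and initialization are illustrative assumptions, and this is not the thesis's rSinc-conv.

```python
import math
import torch
import torch.nn as nn

class SincConv1d(nn.Module):
    """Standard Sinc-based convolution: learnable band-pass filters on raw waveforms."""

    def __init__(self, out_channels=64, kernel_size=129, sample_rate=16000):
        super().__init__()
        self.kernel_size = kernel_size
        self.sample_rate = sample_rate
        # Learnable low cutoffs and bandwidths (Hz), initialized on a linear scale.
        low = torch.linspace(30.0, sample_rate / 2 - 200.0, out_channels)
        band = torch.full((out_channels,), 100.0)
        self.low_hz = nn.Parameter(low.unsqueeze(1))
        self.band_hz = nn.Parameter(band.unsqueeze(1))
        # Left half of a symmetric Hamming window and the corresponding time axis.
        half = (kernel_size - 1) // 2
        self.register_buffer("window", torch.hamming_window(kernel_size, periodic=False)[:half])
        self.register_buffer("n_", 2 * math.pi * torch.arange(-half, 0) / sample_rate)

    def forward(self, x):
        # Clamp cutoffs to a valid range: 0 < f1 < f2 < Nyquist.
        f1 = torch.abs(self.low_hz)
        f2 = torch.clamp(f1 + torch.abs(self.band_hz), max=self.sample_rate / 2)
        # Left half of each band-pass impulse response:
        # g[n] = 2*f2*sinc(2*pi*f2*n) - 2*f1*sinc(2*pi*f1*n), then Hamming-windowed.
        left = (torch.sin(f2 * self.n_) - torch.sin(f1 * self.n_)) / (self.n_ / 2) * self.window
        center = (2 * (f2 - f1)).view(-1, 1)                 # filter value at n = 0
        filters = torch.cat([left, center, left.flip(dims=[1])], dim=1)
        filters = filters.view(-1, 1, self.kernel_size)
        return nn.functional.conv1d(x, filters, padding=self.kernel_size // 2)

# x: (batch, 1, samples) raw waveform -> (batch, 64, samples) band-limited features
y = SincConv1d()(torch.randn(2, 1, 16000))
```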
    This research makes substantial contributions, including defining processing artifacts in SE, demonstrating the effectiveness of NAaLoss, visualizing artifacts for insights, and bridging the gap between SE and ASR objectives. The development of rSinc-conv tailored for SE offers advantages in training efficiency, filter diversity, and interpretability. Insights into neural network attention, exploration of differently shaped filters, and evaluation of various SE models further advance the understanding and improvement of SE networks. Overall, this work aims to contribute to the discourse in SE and pave the way for more robust and efficient SE techniques with broader applications in real-world scenarios.

    Table of Contents:
    Abstract
    摘要 (Chinese Abstract)
    Table of Contents
    List of Tables
    List of Figures
    List of Acronyms
    Chapter 1. Introduction
        1.1. Background
        1.2. Motivation
            1.2.1. Why Compatibility?
            1.2.2. Where is Interpretability?
        1.3. Contribution
    Chapter 2. Related Work
        2.1. Between SE and ASR
        2.2. Parametric Filters in Neural Speech Processing
    Chapter 3. Methodology
        3.1. NAaLoss
            3.1.1. Problem Formulation
            3.1.2. Proposed Solution lying in Objective Function
        3.2. rSinc-conv
            3.2.1. Reformulation
            3.2.2. Cutoff Frequencies Initialization
            3.2.3. Attribute
    Chapter 4. Experimental Setting
        4.1. Dataset
        4.2. Baseline SE Models
            4.2.1. Basic Network
            4.2.2. Simple Network — Conv-TasNet
            4.2.3. Advanced Network — MANNER
        4.3. Experiments on Compatibility
        4.4. Experiments on Interpretability
        4.5. Evaluation Metrics
    Chapter 5. Result and Discussion
        5.1. Compatibility
            5.1.1. Baselines
            5.1.2. On Basic SE
            5.1.3. On Advanced SE
            5.1.4. Discussion
            5.1.5. Auxiliary Knowledge in Perception
        5.2. Interpretability
            5.2.1. On Simple SE
            5.2.2. On Advanced SE
            5.2.3. Discussion
            5.2.4. Different Frequency Response
            5.2.5. Invasion in Depth
    Chapter 6. Conclusion and Outlook
    References
    Appendix
        A. Sub-loss Weightings in NAaLoss Experiments
        B. Obtaining a Triangular Filter
        C. Deriving BIF

