
Graduate Student: 李佩穎 (Lee, Pei-Ying)
Thesis Title: 新穎語者自動分段標記技術之研究 (A Study on Novel Speaker Diarization Techniques)
Advisor: 陳柏琳 (Berlin Chen)
Oral Defense Committee: 陳柏琳 (Berlin Chen), 陳冠宇 (Chen, Kuan-Yu), 曾厚強 (Tseng, Hou-Chiang)
Oral Defense Date: 2024/07/23
Degree: Master
Department: Department of Computer Science and Information Engineering
Publication Year: 2024
Academic Year of Graduation: 112 (ROC calendar)
Language: Chinese
Number of Pages: 46
Keywords: speaker diarization, end-to-end neural diarization, multi-head attention, auxiliary loss
Research Method: Experimental Design
DOI URL: http://doi.org/10.6345/NTNU202401826
Thesis Type: Academic Thesis
Abstract:
    Speaker diarization has rich application potential in broadcast programs, meetings, online media, and other fields, and it can be combined with automatic speech recognition (ASR) or speech emotion recognition (SER) to extract meaningful information from conversational content. However, the error rate of ASR increases significantly when the number of speakers exceeds two, a situation known as the cocktail party problem.
    To address the problem of an unknown number of speakers and to improve overall performance, the end-to-end neural diarization with encoder-decoder attractors (EEND-EDA) model was developed, and numerous studies have investigated this problem in depth. Although some studies have combined speaker diarization with ASR or large language models (LLMs) to increase practicality, these methods do not improve the encoder's hidden states. This study therefore focuses on improving the processing of speech feature signals to enhance model performance.
    To this end, we first replace the Transformer backbone with Branchformer to strengthen the model's speaker-discrimination capability. Second, to guide the attention mechanism to focus more on voice activity, we add an auxiliary loss function. Finally, we experiment with modifying the log-Mel features to improve the model's generalization ability. We investigate whether these changes improve diarization performance under both fixed and unknown numbers of speakers, providing a new option for such models.
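    As an illustration of the auxiliary-loss idea above, the following is a minimal PyTorch sketch, not the thesis's actual implementation: the names (speaker_wise_vad_loss, total_loss, lam), the tensor shapes, and the fixed weighting scheme are all assumptions, and EEND's permutation-invariant label assignment is omitted for brevity.

        import torch
        import torch.nn.functional as F

        def speaker_wise_vad_loss(activity_logits: torch.Tensor,
                                  labels: torch.Tensor) -> torch.Tensor:
            # Hypothetical auxiliary term: binary cross-entropy between
            # per-speaker activity estimates derived from attention heads,
            # shape (batch, time, n_speakers), and binary activity labels.
            return F.binary_cross_entropy_with_logits(activity_logits, labels)

        def total_loss(diar_logits: torch.Tensor,
                       labels: torch.Tensor,
                       head_activity_logits: torch.Tensor,
                       lam: float = 0.1) -> torch.Tensor:
            # Main diarization loss; in real EEND training the labels are
            # first permuted to the best speaker assignment (omitted here).
            main = F.binary_cross_entropy_with_logits(diar_logits, labels)
            # Weighted auxiliary speaker-wise VAD loss nudging the attention
            # mechanism toward frames containing voice activity.
            return main + lam * speaker_wise_vad_loss(head_activity_logits,
                                                      labels)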

    Table of Contents:
    Chapter 1: Introduction
      1.1 Research Background
      1.2 Research Motivation
      1.3 Thesis Organization
    Chapter 2: Literature Review
      2.1 Clustering-Based Algorithms
      2.2 End-to-End Neural Models (EEND)
        2.2.1 End-to-End Neural Speaker Diarization with Permutation-Free Objectives
        2.2.2 End-to-End Neural Speaker Diarization with Self-Attention
        2.2.3 Encoder-Decoder Based Attractors for End-to-End Neural Diarization
      2.3 Hybrid Systems
      2.4 Evaluation Metrics
        2.4.1 Diarization Error Rate (DER)
        2.4.2 Jaccard Error Rate (JER)
    Chapter 3: Methodology
      3.1 Sinc-Extractor
      3.2 The Branchformer Model
        3.2.1 Architecture Overview
      3.3 Speaker-wise VAD Auxiliary Loss
    Chapter 4: Experiments and Results
      4.1 Corpora
        4.1.1 Simulated Datasets
        4.1.2 Real Datasets
      4.2 Experimental Setup
      4.3 Training Configurations
        4.3.1 Fixed Number of Speakers
        4.3.2 Unknown Number of Speakers
      4.4 Evaluation Results
        4.4.1 Fixed Number of Speakers
        4.4.2 Unknown Number of Speakers
        4.4.3 Speaker-wise VAD Loss on Specific Datasets
        4.4.4 SincExtractor on Specific Datasets
    Chapter 5: Conclusions and Future Research Directions
    References
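    For reference, the Diarization Error Rate listed in Section 2.4.1 above is conventionally defined (standard background knowledge, not quoted from the thesis) as

        \mathrm{DER} = \frac{T_{\mathrm{FA}} + T_{\mathrm{miss}} + T_{\mathrm{conf}}}{T_{\mathrm{total}}}

    where T_FA is false-alarm time (non-speech scored as speech), T_miss is missed-speech time, T_conf is speaker-confusion time, and T_total is the total ground-truth speech time.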

