
Graduate Student: 楊宥芩 (Yang, You-Chin)
Thesis Title: 基於對比式訓練之輕量化開放詞彙的關鍵詞辨識 (Small-footprint Open-vocabulary Keyword Spotting Using Contrastive Learning)
Advisor: 陳柏琳 (Chen, Berlin)
Oral Defense Committee: 陳柏琳 (Chen, Berlin), 王新民 (Wang, Xin Min), 洪志偉 (Hung, Jeih-weih), 江振宇 (Chiang, Chen-Yu)
Oral Defense Date: 2024/07/22
Degree: Master
Department: 資訊工程學系 (Department of Computer Science and Information Engineering)
Year of Publication: 2024
Academic Year of Graduation: 112 (ROC calendar)
Language: Chinese
Number of Pages: 41
Chinese Keywords: 關鍵詞辨識、零樣本、對比學習、開放詞彙、自定義
English Keywords: keyword spotting, user-defined, zero-shot, contrastive learning, open-vocabulary
Research Methods: comparative research; observational research
DOI URL: http://doi.org/10.6345/NTNU202401799
Thesis Type: Academic thesis
    As smart devices become more widespread, keyword spotting technology is becoming increasingly important. Its goal is to determine whether specific keywords occur in continuous speech. The task is highly challenging because it requires not only accurate detection of the target keywords but also effective rejection of other, non-target keywords. With the rapid development of deep neural networks, keyword spotting based on deep neural networks has achieved significant gains in accuracy. However, traditional deep-neural-network keyword spotting systems require a large amount of speech containing the target keywords as training data, so they can only recognize a fixed set of keywords, and those keywords are difficult to replace once training is completed: new speech data for each new target keyword must be collected and the model retrained. This thesis focuses on implementing an open-vocabulary keyword spotting system. The system uses a self-attention mechanism to fuse speech features with text embedding vectors into effective joint embeddings, and a discriminator computes a confidence score for each joint embedding; the system decides whether to activate based on these confidence scores. Contrastive learning is further used to address the false alarms that arise when multiple keywords are enrolled and an incorrect keyword receives an excessively high confidence score. For pre-training the audio encoder, in addition to an encoder pre-trained on a classification task over 5,000 keyword classes, we adopt a more parameter-efficient audio encoder architecture that saves 100K parameters and is pre-trained on a classification task over 500 keyword classes. In recognizing 10 new keywords that never appeared during training, our approach achieves an accuracy of 94.08%, a 12% improvement over the baseline methods.
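
    To make the pipeline described above concrete, the following is a minimal, hypothetical PyTorch sketch, not the thesis implementation: text embeddings attend over audio features to form a joint embedding, a small discriminator maps the pooled joint embedding to a confidence score, and a CLIP/InfoNCE-style contrastive loss treats the other keywords in a batch, including confusable ones, as negatives. All module names, layer sizes, pooling choices, and the loss combination are illustrative assumptions.

    # Hypothetical sketch of an attention-based open-vocabulary KWS model;
    # dimensions and losses are assumptions for illustration only.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class OpenVocabKWS(nn.Module):
        def __init__(self, audio_dim=40, text_dim=64, joint_dim=128):
            super().__init__()
            self.audio_proj = nn.Linear(audio_dim, joint_dim)
            self.text_proj = nn.Linear(text_dim, joint_dim)
            # attention fuses the two modalities: text tokens query audio frames
            self.attn = nn.MultiheadAttention(joint_dim, num_heads=4, batch_first=True)
            # discriminator turns the pooled joint embedding into a match logit
            self.discriminator = nn.Sequential(
                nn.Linear(joint_dim, joint_dim), nn.ReLU(), nn.Linear(joint_dim, 1))

        def forward(self, audio_feats, text_embs):
            # audio_feats: (B, T_a, audio_dim) acoustic feature frames
            # text_embs:   (B, T_t, text_dim)  keyword token embeddings
            a = self.audio_proj(audio_feats)
            t = self.text_proj(text_embs)
            joint, _ = self.attn(query=t, key=a, value=a)   # (B, T_t, joint_dim)
            joint = joint.mean(dim=1)                        # pooled joint embedding
            logit = self.discriminator(joint).squeeze(-1)    # confidence score (B,)
            return logit, a.mean(dim=1), t.mean(dim=1)

    def contrastive_loss(audio_emb, text_emb, temperature=0.07):
        # InfoNCE over a batch: row i is a matching (audio, keyword) pair; every
        # other keyword in the batch, including confusable ones, is a negative,
        # which pushes down the confidence of wrong keywords.
        a = F.normalize(audio_emb, dim=-1)
        t = F.normalize(text_emb, dim=-1)
        logits = a @ t.T / temperature                       # (B, B) similarity matrix
        targets = torch.arange(a.size(0), device=a.device)   # diagonal = positives
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

    model = OpenVocabKWS()
    audio = torch.randn(8, 100, 40)   # 8 utterances, 100 frames of 40-dim features
    text = torch.randn(8, 12, 64)     # 8 enrolled keywords, 12 token embeddings each
    match = torch.ones(8)             # 1 = utterance contains its paired keyword
    logit, a_emb, t_emb = model(audio, text)
    loss = F.binary_cross_entropy_with_logits(logit, match) + contrastive_loss(a_emb, t_emb)

    At inference, an enrolled keyword would trigger the system only when its confidence score exceeds a threshold; the contrastive term is what keeps the scores of near-miss, confusable keywords low.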

    Chapter 1 Introduction
    1.1 Research Motivation
    1.2 Research Contributions
    1.3 Chapter Overview
    Chapter 2 Literature Review
    2.1 Fixed-vocabulary Keyword Spotting
    2.2 Open-vocabulary Keyword Spotting
    2.2.1 Few-shot Open-vocabulary Keyword Spotting
    2.2.2 Zero-shot Open-vocabulary Keyword Spotting
    Chapter 3 Methodology
    3.1 Encoders
    3.1.1 Audio Encoder
    3.1.2 Text Encoder
    3.2 Pattern Processor
    3.2.1 Pattern Extractor
    3.2.2 Pattern Discriminator
    3.3 Contrastive Speech-Text Pre-training
    3.3.1 Pattern Discriminator with Contrastive Speech-Text Training
    3.3.2 Handling Confusable Negative Samples with Contrastive Learning
    3.4 Training Criteria
    Chapter 4 Keyword Corpora
    4.1 LibriPhrase
    4.2 Google Speech Command
    4.3 Multilingual Spoken Words Corpus
    4.4 Qualcomm Keyword Speech Dataset
    Chapter 5 Experimental Design and Results
    5.1 Experimental Procedure
    5.1.1 Acoustic Feature Extraction
    5.1.2 Pre-trained Feature Extractor
    5.1.3 Data Augmentation
    5.1.4 Confusable Keywords
    5.1.5 Datasets and Evaluation Metrics
    5.1.6 Training Procedure and Parameter Settings
    5.2 Evaluation Methods
    5.2.1 Area Under the Curve
    5.2.2 Accuracy
    5.3 Results and Discussion
    5.3.1 Experiment 1: Overall Comparison of Model Performance across Multiple Datasets, Including AUC and Accuracy
    5.3.2 Experiment 2: Effect of Multi-dataset and Confusable-keyword Training Strategies on Model Performance
    5.3.3 Experiment 3: Accuracy Comparison on Newly Defined Keywords
    5.3.4 Experiment 4: Comparative Analysis of the Effect of Contrastive Learning on Keyword Confidence Scores
    5.3.5 Experiment 5: Keyword Spotting Accuracy under Background Noise and Background Speaker Interference
    5.3.6 Experiment 6: Comparison of Model Performance with Different Audio Encoder Training Methods
    Chapter 6 Conclusions and Future Work
    6.1 Conclusions
    6.2 Future Work
    References

