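| Graduate student: | 邱世弦 Chiu, Shih-Hsuan |
|---|---|
| Thesis title: | 使用跨語句上下文語言模型和圖神經網路於會話語音辨識重新排序之研究 A Study on Hypothesis Reranking with Cross-utterance Contextualized Language Models and Graph Neural Networks for Conversational Speech Recognition |
| Advisor: | 陳柏琳 Chen, Berlin |
| Oral defense committee: | 王新民 Wang, Hsin-Min; 洪志偉 Hung, Jeih-Weih; 王家慶 Wang, Jia-Ching; 陳柏琳 Chen, Berlin |
| Oral defense date: | 2021/08/30 |
| Degree: | Master |
| Department: | 資訊工程學系 Department of Computer Science and Information Engineering |
| Year of publication: | 2021 |
| Academic year of graduation: | 109 (ROC calendar) |
| Language: | English |
| Number of pages: | 41 |
| Keywords (translated from Chinese): | automatic speech recognition, language model, conversational speech, cross-utterance information, N-best list, reranking, contextualized language model, graph neural network |
| Keywords (English): | automatic speech recognition, language modeling, conversational speech, cross-utterance, N-best hypothesis reranking, BERT, GCN |
| Research methods: | experimental design, phenomenon analysis, quantitative research, scientific research, information engineering |
| DOI URL: | http://doi.org/10.6345/NTNU202101350 |
| Thesis type: | Academic thesis |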
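Language models play an extremely important role in a speech recognition system: they quantify the semantic and syntactic acceptability, in natural language, of a recognized candidate sentence (word sequence). In recent years, language models based on neural network architectures have clearly outperformed traditional n-gram language models, mainly because the former have a superior ability to capture longer-range context. However, given the high computational complexity of neural language models, they are usually applied in second-pass N-best hypothesis reranking to rescore each candidate hypothesis. This alternative and lightweight approach, which can use more sophisticated neural language models to integrate task-related cues or adaptation mechanisms in order to rerank the hypotheses better, has attracted wide interest and become an important research direction in speech recognition.
On the other hand, effectively recognizing conversational speech with a speech recognition system plays a key role in moving toward intelligent conversational AI. Related applications, including virtual assistants, smart speakers, and interactive voice response, are ubiquitous in our daily lives. In these real-world applications, interaction with users usually (or ideally) takes place over multiple turns of speech. Such conversational speech exhibits some common linguistic phenomena, such as topical coherence and word recurrence, but these phenomena and ways of handling them remain to be explored. Based on these observations, we first use contextualized language models (for example, BERT) to recast the N-best hypothesis reranking task as a prediction problem. To further strengthen our models for conversational speech, we then explore a series of topic and history adaptation techniques, which fall roughly into three parts: (1) an effective method for incorporating cross-utterance information into the model; (2) an effective method for extracting task-related global information with unsupervised topic modeling; and (3) a novel method that uses a graph neural network (for example, GCN) to extract global structural dependencies among words. We conducted a series of experiments on the international benchmark AMI meeting corpus to evaluate the proposed methods. The experimental results show that, in terms of reducing word error rate, the proposed methods are effective and feasible in comparison with some current state-of-the-art and mainstream methods.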
Language models (LMs) play a significant role in an automatic speech recognition (ASR) system by providing a likelihood for any word sequence hypothesis. In recent years, neural network (NN)-based LMs have been shown to consistently outperform classical n-gram LMs, mainly owing to their superior ability to model longer-range contextual dependencies. Nevertheless, because of their high computational complexity, neural LMs are usually applied at the second-pass N-best hypothesis reranking stage to rescore the hypotheses produced by an ASR system. This alternative and lightweight approach, which reranks N-best hypotheses with more sophisticated neural LMs, has attracted considerable interest and become an important research direction in ASR.
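The second-pass reranking step described above can be illustrated with a minimal sketch. It assumes each hypothesis already carries a first-pass log-score and a neural LM log-score, and combines them with a simple linear interpolation. The `Hypothesis` class, the `lm_weight` parameter, and the toy scores are hypothetical names and values chosen for illustration; they are not taken from the thesis, whose actual scoring scheme may differ.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    first_pass_score: float  # log-score from the first-pass decoder (assumed given)
    neural_lm_score: float   # log-probability from a second-pass neural LM (assumed given)


def rerank_nbest(hyps, lm_weight=0.5):
    """Return hypotheses sorted by an interpolated log-score, best first.

    lm_weight is a hypothetical interpolation weight: 0.0 keeps the
    first-pass ranking, 1.0 ranks by the neural LM alone.
    """
    def combined(h):
        return (1.0 - lm_weight) * h.first_pass_score + lm_weight * h.neural_lm_score

    return sorted(hyps, key=combined, reverse=True)


# Toy 3-best list with made-up scores.
nbest = [
    Hypothesis("i scream for ice cream", -12.0, -20.0),
    Hypothesis("ice cream for ice cream", -11.5, -26.0),
    Hypothesis("i scream four ice cream", -12.5, -24.0),
]

best = rerank_nbest(nbest, lm_weight=0.5)[0]
print(best.text)
```

In this toy list the first pass alone prefers the second hypothesis, while the interpolated score promotes the first one, which is the kind of correction second-pass reranking is meant to provide.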
Meanwhile, effective recognition of conversational speech plays a crucial role in conversational AI. Applications ranging from virtual assistants and smart speakers to interactive voice response (IVR) systems, among others, have become ubiquitous in our daily lives. These real-world applications typically interact with users over multiple turns of speech utterances, which exhibit conversation-level phenomena such as topical coherence and word recurrence; these phenomena, however, remain underexplored. In view of the above, we frame ASR N-best reranking with contextualized language models (such as BERT) as a prediction problem. To further enhance our models for conversational speech, we explore a set of topic and history modeling techniques that fall broadly into three categories: 1) an effective way to incorporate cross-utterance information into the model; 2) an efficient way to leverage task-specific global information with unsupervised topic modeling; and 3) a novel approach to distilling global structural dependencies among words with a graph neural network (such as GCN). We carry out a series of experiments with the proposed methods on the AMI benchmark meeting corpus. Experimental results demonstrate the effectiveness and feasibility of our methods, in terms of word error rate (WER) reduction, in comparison with some current top-of-the-line methods.
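For readers unfamiliar with the evaluation metric, the sketch below computes WER in its standard form: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the thesis experiments presumably relied on a toolkit scorer rather than code like this.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: simple whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.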