Graduate student: | 陳映文 Chen, Ying-Wen |
---|---|
Thesis title: | 語言模型調適使用語者用詞特徵於會議語音辨識之研究 Language Model Adaptation Leveraging Speaker-Aware Word-Usage Characteristics for Meeting Speech Recognition |
Advisor: | 陳柏琳 Chen, Berlin |
Degree: | Master |
Department: | Department of Computer Science and Information Engineering |
Year of publication: | 2018 |
Academic year of graduation: | 106 |
Language: | Chinese |
Number of pages: | 74 |
Chinese keywords: | 會議語音辨識 (meeting speech recognition), 語言模型 (language model), 語者調適 (speaker adaptation), 遞迴式類神經網路 (recurrent neural network) |
English keywords: | speech recognition, language modeling, speaker adaptation, recurrent neural networks |
DOI URL: | http://doi.org/10.6345/THE.NTNU.DCSIE.009.2018.B02 |
Thesis type: | Academic thesis |
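Recording what is said in a meeting accurately is important work: reading the minutes lets people who did not attend understand what was discussed, and once the speech has been transcribed into text, meeting content can also be retrieved more precisely. Manual minute-taking, however, is laborious and time-consuming, so transcribing meeting conversations with automatic speech recognition can save considerable time and labor. Meeting corpora nevertheless differ greatly from other common corpora such as news reports; they typically contain uncommon words, short sentences, mixed-language usage, and personal speaking habits.
In view of this, this thesis attempts to address the problems that differences in speakers' word-usage characteristics cause in meeting speech recognition. The presence of multiple speakers may imply multiple language patterns; moreover, people do not strictly follow grammar when speaking, and their speech usually contains hesitations, pauses, personal idioms, and other idiosyncratic ways of speaking. Yet most language models used in previous meeting speech recognition are not adjusted for different speakers. They instead assume that all speakers share the same language patterns, so the transcripts of multiple speakers are pooled into one training set from which a single language model is trained. To move beyond this assumption, this study supplies additional speaker-dependent information to both the training and the prediction of the language model, that is, speaker adaptation of the language model. Two test-time scenarios are considered, "known speakers" and "unknown speakers"; for each, a method for extracting speaker features is proposed, and ways of using those features to aid language model training are explored.
A series of speaker-adaptation experiments for language models on Mandarin and English meeting speech recognition tasks shows that the proposed language models perform well in both the known-speaker and unknown-speaker scenarios and outperform existing state-of-the-art methods.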
In a meeting environment, faithfully producing the minutes is an important task. By reading the minutes, people who did not participate can understand the content of the meeting. Moreover, because the spoken content has been transcribed into text, searching for relevant meetings in a database becomes more accurate. However, manually transcribing a meeting is often labor-intensive and time-consuming, so automatic speech recognition (ASR) technology is a good surrogate for this purpose. It is also worth noting that meeting corpora differ greatly from the speech corpora that are more commonly dealt with, such as news datasets. A meeting corpus usually contains uncommon words, short sentences, code-mixing phenomena, and diverse personal speaking characteristics.
In view of the above, this thesis sets out to alleviate the problems caused by the multiple-speaker situation that occurs frequently in meetings, with the aim of improving ASR. Speakers in a multiple-speaker situation express themselves in a wide variety of ways. That is, people do not strictly follow grammar when speaking; they tend to hesitate or pause, and they often use personal idioms and other idiosyncratic expressions. Nevertheless, the language models employed in ASR of meeting recordings rarely account for these facts and instead assume that all speakers in a meeting share the same speaking style or word-usage behavior. As a result, a single language model is built from the manual transcripts of all speakers' utterances, pooled as one training set. To relax this assumption, we incorporate additional information cues into the training and prediction phases of language modeling to accommodate the variety of speaker-related characteristics, i.e., we conduct speaker adaptation for language modeling. To this end, two scenarios for the prediction phase, "known speakers" and "unknown speakers," are considered; for each, we develop a method to extract speaker-related information cues, and we explore how these cues can aid the training of language models.
A series of experiments on automatic transcription of Mandarin and English meeting recordings shows that the proposed language models, together with different mechanisms for speaker adaptation, achieve good performance gains over the state-of-the-art methods compared in the thesis.