| Field | Value |
|---|---|
| Graduate Student | 李宗勳 (Lee, Tsung-Hsun) |
| Thesis Title | 語者確認使用不同語句嵌入函數之比較研究 (A Comparative Study of Utterance-Embedding Generation Functions for Speaker Verification) |
| Advisor | 陳柏琳 (Chen, Berlin) |
| Oral Defense Committee | 曾厚強、劉士弘、陳柏琳 |
| Oral Defense Date | 2021/08/24 |
| Degree | Master (碩士) |
| Department | 資訊工程學系 (Department of Computer Science and Information Engineering) |
| Year of Publication | 2021 |
| Academic Year of Graduation | 109 |
| Language | Chinese |
| Number of Pages | 43 |
| Chinese Keywords | 語者確認、語音辨識、小樣本學習 |
| English Keywords | Speaker verification, Speech recognition, Few-shot learning |
| Research Method | Experimental design |
| DOI URL | http://doi.org/10.6345/NTNU202101267 |
| Thesis Type | Academic thesis |
A speaker-embedding model uses a neural network to map utterances into a space in which distances reflect the similarity between speakers. This form of metric learning was first proposed for face recognition and has recently been applied to speaker verification, driving much of the recent progress on the task. However, there is still a clear accuracy gap between speakers seen in the training set and unseen speakers, and few-shot learning is well suited to the unseen-speaker setting. In real-world conditions, a speaker verification system must identify the speaker of short utterances, whereas the utterances available at training time are relatively long, and recent speaker verification models perform poorly on short utterances. In this work, we use the prototypical network loss, the triplet loss, and a state-of-the-art few-shot learning method to optimize the speaker-embedding model. We use the VoxCeleb1 and VoxCeleb2 datasets, which contain 1,221 and 5,994 speakers, respectively. The experimental results show that the speaker-embedding model performs better with our proposed loss functions.
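To make the metric-learning objective concrete, below is a minimal sketch of a prototypical-network loss of the kind the abstract describes: each speaker's prototype is the mean of its support embeddings, and query utterances are classified by negative squared distance to the prototypes. The sketch assumes PyTorch; the tensor shapes, the `prototypical_loss` helper, and the toy usage are illustrative assumptions, not the thesis implementation.

```python
# Minimal sketch (illustrative, not the thesis code) of a prototypical-network
# loss for speaker embeddings, assuming tensors of shape
# (n_speakers, n_utterances, embedding_dim).
import torch
import torch.nn.functional as F

def prototypical_loss(support: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
    # support: (S, K, D) embeddings used to form one prototype per speaker
    # query:   (S, Q, D) embeddings; query[i] are utterances of speaker i
    prototypes = support.mean(dim=1)                     # (S, D) speaker centroids
    q = query.reshape(-1, query.size(-1))                # (S*Q, D), speaker-major order
    dists = torch.cdist(q, prototypes) ** 2              # squared Euclidean distances
    logits = -dists                                      # closer prototype -> larger logit
    targets = torch.arange(support.size(0)).repeat_interleave(query.size(1))
    return F.cross_entropy(logits, targets)

# Toy usage with random, L2-normalized "embeddings": 5 speakers,
# 3 support and 2 query utterances each, 128-dimensional vectors.
support = F.normalize(torch.randn(5, 3, 128), dim=-1)
query = F.normalize(torch.randn(5, 2, 128), dim=-1)
print(prototypical_loss(support, query))
```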