
Author: 王馨偉
Wang, Hsin-Wei
Thesis title: 適用於改善語音辨識的新穎調適方法與後處理模型
Novel Adaptation and Post-processing Approaches for Improving Speech Recognition
Advisor: 陳柏琳
Chen, Berlin
Oral defense committee: 洪志偉
Hung, Jeih-Weih
陳冠宇
Chen, Kuan-Yu
曾厚強
Tseng, Hou-Chiang
陳柏琳
Chen, Berlin
Oral defense date: 2023/07/21
Degree: Master
Department: 資訊工程學系
Department of Computer Science and Information Engineering
Publication year: 2023
Academic year of graduation: 111 (ROC calendar; 2022 to 2023)
Language: English
Number of pages: 50
Keywords (Chinese): 語音辨識後處理、N個最佳假設重新排序、非自回歸、單詞共現圖、自動語音辨識
Keywords (English): post-processing of speech recognition, N-best hypotheses reranking, non-autoregressive, word co-occurrence graphs, automatic speech recognition
Research method: Experimental design
DOI URL: http://doi.org/10.6345/NTNU202301492
Thesis type: Academic thesis
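  • Thanks to synergistic breakthroughs in neural model architectures and training algorithms, automatic speech recognition (ASR) has recently achieved great success and reached human-level performance. Nevertheless, ASR performance in many real-world use cases remains far from perfect.
    Research interest has surged in designing and developing feasible post-processing modules that improve recognition performance by refining ASR output sentences. These modules fall broadly into two categories. The first is ASR N-best hypothesis reranking, which aims to find the hypothesis with the lowest word error rate in a given list of N hypotheses. The second, inspired by Chinese spelling correction (CSC) and English spelling correction (ESC), aims to detect and correct text-level errors in ASR output sentences. In this thesis, we attempt to integrate both approaches into an ASR error correction (AEC) module and investigate how different types of features affect AEC. The proposed method, named REDECORATE, is suited to correcting plain-text transcriptions obtained from off-the-shelf speech services.
    In most cases, relevant plain-text data for a target domain is comparatively easy to obtain, so knowledge gathered from such data can be used to steer a general-domain ASR model toward the target domain more effectively. In light of this, we propose another novel error correction method based on word co-occurrence graphs constructed from domain-adaptation data. A given neural ASR model can readily access knowledge about the semantic context of a speech utterance in a plug-and-play manner, without introducing additional parameters. This method, named GRACE, can be applied in a plug-and-play manner either to adapt a custom-trained ASR model or to correct ASR transcriptions directly.
    A series of experiments on the AISHELL-1 benchmark dataset shows that the proposed methods significantly reduce the character error rate (CER) over strong ASR baselines.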

    Automatic speech recognition (ASR) has achieved remarkable success and reached human parity because of synergistic breakthroughs in neural network architectures and training algorithms. However, the performance of ASR in many real-world applications is still far from perfect.
    An increasing number of researchers have attempted to design and develop feasible post-processing modules that improve ASR performance by refining ASR output sentences. These modules can be divided into two categories: those based on the reranking of ASR N-best hypotheses and those focusing on spelling correction. Reranking aims to find the oracle hypothesis, that is, the one with the lowest word error rate, in a given list of N-best hypotheses. Spelling correction aims to detect and correct errors in ASR output transcriptions. In this study, we attempted to integrate N-best hypothesis reranking with correction methods into a single ASR error correction (AEC) module and examined the effects of various types of features on AEC.
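    The oracle hypothesis described above can be illustrated with a minimal sketch. It assumes the reference transcription is available, which holds only during evaluation; a practical reranker has to estimate hypothesis quality without it. The function names `edit_distance`, `error_rate`, and `oracle_hypothesis` and the toy sentences are illustrative assumptions, not taken from the thesis.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalized by reference length.

    Gives WER for lists of words and CER for lists of characters.
    """
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def oracle_hypothesis(reference, nbest):
    """Return (index, hypothesis, error rate) of the lowest-error N-best entry.

    Ties are broken in favor of the earlier (higher-ranked) hypothesis.
    """
    scored = [(error_rate(reference, hyp), idx) for idx, hyp in enumerate(nbest)]
    best_rate, best_idx = min(scored)
    return best_idx, nbest[best_idx], best_rate


# Toy example (hypothetical data): the second hypothesis matches the reference.
reference = "the cat sat on the mat".split()
nbest = [
    "the cat sat on a mat".split(),
    "the cat sat on the mat".split(),
    "a cat sit on the mat".split(),
]
idx, hyp, rate = oracle_hypothesis(reference, nbest)
```

    Passing lists of characters instead of lists of words turns the same function into a character error rate, the metric reported for AISHELL-1.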
    In most cases, obtaining text data for a target domain is easier than obtaining speech data for that domain, and the knowledge acquired from text data can be used to efficiently bias a general-domain ASR model toward the target domain. Therefore, we propose a novel error correction method that leverages word co-occurrence graphs constructed using domain-adaptive data. Knowledge regarding the semantic context of a speech utterance can be readily accessed by a given neural ASR model in a plug-and-play manner without the need to introduce additional parameters. This knowledge is accessed through word co-occurrence graphs, allowing the ASR model to draw on the contextual relationships between words.
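    As a rough illustration of what a word co-occurrence graph built from target-domain text might look like, the sketch below counts how often two tokens appear in the same sentence. The abstract does not specify the co-occurrence window, the edge weighting, or how the graph is distilled, so sentence-level counting, the `min_count` threshold, and the name `build_cooccurrence_graph` are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_graph(sentences, min_count=1):
    """Build an undirected weighted graph from tokenized sentences.

    The weight of edge (a, b) is the number of sentences containing both
    tokens. Edges seen fewer than `min_count` times are dropped.
    """
    edge_counts = Counter()
    for tokens in sentences:
        # Sort the unique tokens so each unordered pair has one canonical key.
        for a, b in combinations(sorted(set(tokens)), 2):
            edge_counts[(a, b)] += 1

    graph = {}
    for (a, b), count in edge_counts.items():
        if count >= min_count:
            graph.setdefault(a, {})[b] = count
            graph.setdefault(b, {})[a] = count
    return graph


# Toy domain text (hypothetical data).
domain_text = [
    ["speech", "recognition", "model"],
    ["speech", "recognition", "error"],
    ["language", "model"],
]
graph = build_cooccurrence_graph(domain_text)
pruned = build_cooccurrence_graph(domain_text, min_count=2)
```

    Because the graph is plain counted data rather than trained weights, it can be rebuilt from new domain text without touching the ASR model, which is one way to read the plug-and-play property described above.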
    A series of experiments conducted on the AISHELL-1 benchmark dataset indicated that the proposed methods achieve a markedly lower character error rate (CER) than strong baseline ASR approaches.

    Chapter 1 Introduction
      1.1 Background
      1.2 Motivation
      1.3 Problem Scenarios
      1.4 Organization of the Thesis
    Chapter 2 Related Work
      2.1 Shallow Fusion
      2.2 Parameter-Efficient Fine-Tuning for ASR
      2.3 Top-N ASR Hypothesis Reranking
      2.4 ASR Error Correction
      2.5 Summary of This Chapter
    Chapter 3 Proposed Methodology
      3.1 Two Scenarios to Be Resolved
      3.2 REDECORATE
        3.2.1 ASR N-best Hypothesis Reranking Module
        3.2.2 Detection Module
        3.2.3 Phonetic Encoder
        3.2.4 Semantic Encoder
        3.2.5 Correction Module
      3.3 GRACE
        3.3.1 Knowledge Graph Construction Module
        3.3.2 Knowledge Graph Distillation Module
        3.3.3 ASR Transcription Refinement Module
    Chapter 4 Experiments
      4.1 Experimental Setting
        4.1.1 Datasets
        4.1.2 Experimental Setup for Case A
        4.1.3 Experimental Setup for Case B
      4.2 Baseline
      4.3 Experimental Results for Case A
        4.3.1 Reranking Results
        4.3.2 Experiments on Different ASR Baselines
        4.3.3 Ablation Study
      4.4 Experimental Results for Case B
        4.4.1 Improved Accuracy Achieved Using the CTC Output
        4.4.2 Overall Performance of GRACE
        4.4.3 Effectiveness of the Co-Occurrence Graph
        4.4.4 Enhanced Accuracy Achieved with the Mask-CTC Output
    Chapter 5 Conclusion and Outlook
      5.1 Conclusion
      5.2 Outlook
    References

