
Author: Yang, Ping (楊平)
Thesis Title: Development and Evaluation of Chinese Open-Domain Question Answering System (開放領域中文問答系統之建置與評估)
Advisor: Tseng, Yuen-Hsien (曾元顯)
Committee Members: Wu, I-Chin (吳怡瑾); Lee, Lung-Hao (李龍豪)
Oral Defense Date: 2021/07/09
Degree: Master
Department: Graduate Institute of Library and Information Studies
Year of Publication: 2021
Academic Year of Graduation: 109 (2020–2021)
Language: Chinese
Number of Pages: 90
Keywords: Chinese Open-Domain Question Answering System, User Testing of Question Answering System, Machine Reading Comprehension, Deep Learning, Artificial Intelligence
DOI URL: http://doi.org/10.6345/NTNU202100914
Document Type: Academic thesis
In recent years, with rapid advances in artificial intelligence, span-extraction machine reading comprehension models have surpassed human performance on datasets such as SQuAD. Question answering architectures that build on such models by adding a document collection and a document retriever have also achieved good results. However, how well this dataset-measured performance carries over to real-world use is the question this study set out to explore.
This study carried out two main tasks. The first was to develop and compare different implementations of a question answering system, using automated dataset testing to evaluate which implementation performs best. The second was to have human participants test the best-performing system and to analyze the results.
Four main results were obtained. First, the best configuration used Chinese Wikipedia as the document collection; Elasticsearch as the document retriever; a Sentence Pair Classification model, trained on the DRCD dataset from the BERT-Base Chinese pre-trained model, as the document re-ranker; and a span-extraction machine reading comprehension model, trained on DRCD plus CMRC 2018 from the MacBERT-large pre-trained model, as the document reader. Among all systems tested in this study, this architecture achieved the best Top 10 performance, scoring F1 = 71.355 and EM = 55.17 when tested on the DRCD Test set plus the CMRC 2018 Dev set.
Second, this study recruited 33 participants, who tested the system with a total of 289 questions. Within the Top 10, 70.24% of the questions could be answered by the system. This rate falls between the F1 and EM scores of the automated tests, indicating that the automated tests and the user tests yield similar results.
Third, an analysis of the 29.76% of questions that could not be answered concluded that, in most cases, the failure occurred because the correct article could not be retrieved from the document collection.
Fourth, questions answerable at Top 1 accounted for 26.3% of all questions, while those answerable at Top 2 through 10 accounted for 43.94%. This means that for many questions the system can find an answer but ranks it in the wrong position; a better answer ranking mechanism would therefore substantially improve the system's practical usefulness.
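
As a minimal sketch of how a retriever / re-ranker / reader pipeline of this kind can be wired together, the following Python fragment uses the Elasticsearch client and Hugging Face Transformers. The index name ("zhwiki"), document field ("text"), and fine-tuned model paths are placeholders and assumptions for illustration, not the thesis's exact configuration.

```python
# Hypothetical sketch of a retriever -> re-ranker -> reader QA pipeline.
# Index name, field name, and model paths are placeholders, not the
# thesis's exact setup.
import torch
from elasticsearch import Elasticsearch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          pipeline)

es = Elasticsearch("http://localhost:9200")  # elasticsearch-py 8.x style client

# Document re-ranker: a BERT sentence-pair classifier scoring how likely a
# retrieved passage contains the answer (label 1 assumed to mean "relevant").
rerank_tok = AutoTokenizer.from_pretrained("path/to/bert-base-chinese-drcd-reranker")
rerank_model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/bert-base-chinese-drcd-reranker")

# Document reader: a span-extraction MRC model fine-tuned on DRCD + CMRC 2018.
reader = pipeline("question-answering", model="path/to/macbert-large-drcd-cmrc")


def answer(question: str, k: int = 10) -> str:
    # 1) Retrieve k candidate passages from the Chinese Wikipedia index (BM25).
    resp = es.search(index="zhwiki", query={"match": {"text": question}}, size=k)
    passages = [hit["_source"]["text"] for hit in resp["hits"]["hits"]]

    # 2) Re-rank passages by the classifier's probability of relevance.
    def relevance(passage: str) -> float:
        inputs = rerank_tok(question, passage, truncation=True,
                            max_length=512, return_tensors="pt")
        with torch.no_grad():
            logits = rerank_model(**inputs).logits
        return torch.softmax(logits, dim=-1)[0, 1].item()

    passages.sort(key=relevance, reverse=True)

    # 3) Extract an answer span from each passage; keep the highest-scoring one.
    spans = [reader(question=question, context=p) for p in passages]
    return max(spans, key=lambda s: s["score"])["answer"]
```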

    With rapid development, artificial intelligence has surpassed human performance on span-extraction machine reading comprehension datasets such as SQuAD. Building on this achievement, question answering architectures that combine a document collection, a document retriever, and a document reader have also achieved good results. However, will such a system obtain similar results in the real world? That is the question this research is curious about.
    Our research has two tasks. The first is to develop and compare different QA system implementations using automated dataset testing. Second, we ask users to test the best QA system and analyze the results.
    Finally, we obtained four results. First, the best configuration uses Chinese Wikipedia as the documents collection; Elasticsearch as the document retriever; a Sentence Pair Classification model trained on the DRCD dataset from the BERT-Base Chinese pre-trained model as the document re-ranker; and a span-extraction machine reading comprehension model trained on the DRCD and CMRC 2018 datasets from the MacBERT-large pre-trained model as the document reader. This architecture achieved the best Top 10 result among all the systems tested in our research, with F1 = 71.355 and EM = 55.17, tested on the DRCD Test set and the CMRC 2018 Dev set.
    Second, this study recruited 33 users, who tested the system with 289 questions. Within the Top 10, 70.24% of the questions could be answered by the system. This rate lies between the F1 and EM scores of the dataset testing, which means the results of dataset testing and user testing are similar.
    Third, we analyzed the 29.76% of questions that went unanswered and found that, in most cases, the cause was that the correct document could not be retrieved from the documents collection.
    Fourth, questions answerable at Top 1 account for 26.3% of all questions, while Top 2 to 10 account for 43.94%. This means many questions are answerable if the ranking is correct; the practicality of the QA system would be greatly improved by a better answer ranking mechanism.
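
    The comparison between the user-test answerability rate and the automated F1 / EM scores makes sense because EM only credits exact matches while F1 gives partial credit, so F1 always upper-bounds EM. For reference, a simplified character-level version of these span metrics is sketched below; this is not the official DRCD / CMRC 2018 evaluation script, which additionally normalizes answers and takes the maximum over multiple references.

```python
# Simplified character-level EM and F1 for Chinese span answers (a sketch;
# the official DRCD / CMRC 2018 scripts also normalize punctuation and take
# the maximum score over multiple reference answers).
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip() == reference.strip())


def char_f1(prediction: str, reference: str) -> float:
    pred_chars = list(prediction.strip())
    ref_chars = list(reference.strip())
    overlap = sum((Counter(pred_chars) & Counter(ref_chars)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(ref_chars)
    return 2 * precision * recall / (precision + recall)


# EM gives no credit for a near-miss, while F1 rewards character overlap.
print(exact_match("台北市", "臺北市"))          # 0.0
print(round(char_f1("台北市", "臺北市"), 2))    # 0.67
```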

    Chapter 1  Introduction
        Section 1  Research Background and Motivation
        Section 2  Research Objectives and Questions
    Chapter 2  Literature Review
        Section 1  Types of Question Answering Systems
        Section 2  Open-Domain Question Answering Systems
        Section 3  Datasets
        Section 4  Evaluation Methods
    Chapter 3  Research Methods
        Section 1  Research Scope and Limitations
        Section 2  Research Implementation and Procedures
        Section 3  QA System Development and Automated Dataset Evaluation
        Section 4  User Evaluation
    Chapter 4  Experimental Results and Analysis
        Section 1  Document Reader Results
        Section 2  Document Retriever and Overall QA System Performance
        Section 3  Document Re-ranker Results
        Section 4  User Evaluation Results
    Chapter 5  Conclusions and Future Work
        Section 1  Conclusions
        Section 2  Future Work
    References
    Appendix 1  Test Instructions on the User Testing Platform Website
    Appendix 2  Complete Results of the Automated QA System Tests
    Appendix 3  Questions Removed from the User Evaluation and Reasons for Removal
    Appendix 4  User Evaluation Questions after Removing Non-compliant Items

    Abdi, A., Idris, N., & Ahmad, Z. (2018). QAPD: An ontology-based question answering system in the physics domain. Soft Computing, 22(1), 213–230. https://doi.org/10.1007/s00500-016-2328-2
    Chen, D., Fisch, A., Weston, J., & Bordes, A. (2017). Reading Wikipedia to Answer Open-Domain Questions. ArXiv:1704.00051 [Cs]. http://arxiv.org/abs/1704.00051
    Cui, Y., Che, W., Liu, T., Qin, B., Wang, S., & Hu, G. (2020). Revisiting Pre-Trained Models for Chinese Natural Language Processing. Findings of the Association for Computational Linguistics: EMNLP 2020, 657–668. https://doi.org/10.18653/v1/2020.findings-emnlp.58
    Cui, Y., Liu, T., Che, W., Xiao, L., Chen, Z., Ma, W., Wang, S., & Hu, G. (2019). A Span-Extraction Dataset for Chinese Machine Reading Comprehension. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 5882–5888. https://doi.org/10.18653/v1/D19-1600
    Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv:1810.04805 [Cs]. http://arxiv.org/abs/1810.04805
    Ferrucci, D. A. (2012). Introduction to “This is Watson.” IBM Journal of Research and Development, 56(3.4), 1:1-1:15. https://doi.org/10.1147/JRD.2012.2184356
    Green, B. F., Wolf, A. K., Chomsky, C., & Laughery, K. (1961). Baseball: An automatic question-answerer. Papers Presented at the May 9-11, 1961, Western Joint IRE-AIEE-ACM Computer Conference on - IRE-AIEE-ACM ’61 (Western), 219. https://doi.org/10.1145/1460690.1460714
    Joshi, M., Choi, E., Weld, D. S., & Zettlemoyer, L. (2017). TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. ArXiv:1705.03551 [Cs]. http://arxiv.org/abs/1705.03551
    Karpukhin, V., Oğuz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., & Yih, W. (2020). Dense Passage Retrieval for Open-Domain Question Answering. ArXiv:2004.04906 [Cs]. http://arxiv.org/abs/2004.04906
    Rajpurkar, P., Jia, R., & Liang, P. (2018). Know What You Don’t Know: Unanswerable Questions for SQuAD. ArXiv:1806.03822 [Cs]. http://arxiv.org/abs/1806.03822
    Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). SQuAD: 100,000+ Questions for Machine Comprehension of Text. ArXiv:1606.05250 [Cs]. http://arxiv.org/abs/1606.05250
    Saha, A., Aralikatte, R., Khapra, M. M., & Sankaranarayanan, K. (2018). DuoRC: Towards Complex Language Understanding with Paraphrased Reading Comprehension. ArXiv:1804.07927 [Cs]. http://arxiv.org/abs/1804.07927
    Shao, C. C., Liu, T., Lai, Y., Tseng, Y., & Tsai, S. (2018). DRCD: a Chinese Machine Reading Comprehension Dataset. ArXiv:1806.00920 [Cs.CL]. https://arxiv.org/abs/1806.00920
    Trischler, A., Wang, T., Yuan, X., Harris, J., Sordoni, A., Bachman, P., & Suleman, K. (2017). NewsQA: A Machine Comprehension Dataset. ArXiv:1611.09830 [Cs]. http://arxiv.org/abs/1611.09830
    Voorhees, E. M. (1999). The TREC-8 Question Answering Track Report. Proceedings of the Eighth Text REtrieval Conference (TREC-8), 83–106.
    Wang, S., Yu, M., Guo, X., Wang, Z., Klinger, T., Zhang, W., Chang, S., Tesauro, G., Zhou, B., & Jiang, J. (2017). R^3: Reinforced Reader-Ranker for Open-Domain Question Answering. ArXiv:1709.00023 [Cs]. http://arxiv.org/abs/1709.00023
    Wang, S., Yu, M., Jiang, J., Zhang, W., Guo, X., Chang, S., Wang, Z., Klinger, T., Tesauro, G., & Campbell, M. (2018). Evidence Aggregation for Answer Re-Ranking in Open-Domain Question Answering. ArXiv:1711.05116 [Cs]. http://arxiv.org/abs/1711.05116
    Woods, W. A., Kaplan, R. M., & Nash-Webber, B. (1972). The Lunar Science Natural Language Information System: Final Report.
    Yang, W., Xie, Y., Lin, A., Li, X., Tan, L., Xiong, K., Li, M., & Lin, J. (2019). End-to-End Open-Domain Question Answering with BERTserini. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), 72–77. https://doi.org/10.18653/v1/N19-4013
