| Field | Value |
| --- | --- |
| Author | 楊平 (Yang, Ping) |
| Thesis title | 開放領域中文問答系統之建置與評估 (Development and Evaluation of Chinese Open-Domain Question Answering System) |
| Advisor | 曾元顯 (Tseng, Yuen-Hsien) |
| Committee members | 吳怡瑾 (Wu, I-Chin); 李龍豪 (Lee, Lung-Hao) |
| Oral defense date | 2021/07/09 |
| Degree | Master |
| Department | 圖書資訊學研究所 (Graduate Institute of Library and Information Studies) |
| Publication year | 2021 |
| Graduation academic year | 109 (2020–2021) |
| Language | Chinese |
| Pages | 90 |
| Keywords (Chinese) | 中文開放領域問答系統、問答系統使用者測試、機器閱讀理解、深度學習、人工智慧 |
| Keywords (English) | Chinese Open-Domain Question Answering System, User Testing of Question Answering System, Machine Reading Comprehension, Deep Learning, Artificial Intelligence |
| DOI | http://doi.org/10.6345/NTNU202100914 |
| Thesis type | Academic thesis |
| Views / Downloads | 150 / 21 |
With the rapid advance of artificial intelligence in recent years, span-extraction machine reading comprehension models have surpassed human performance on datasets such as SQuAD. Question answering architectures that build on such models by adding a document collection and a document retriever have also achieved strong results. However, how well this benchmark performance carries over to real-world use remains an open question, and it is the question this study set out to answer.
This study carried out two tasks. The first was to develop and compare different implementations of a question answering system, using automated dataset-based testing to determine which implementation performs best. The second was to have human participants test the best-performing system and to analyze the results.
Four main results were obtained. First, the best-performing architecture used Chinese Wikipedia as the document collection; Elasticsearch as the document retriever; a Sentence Pair Classification model, fine-tuned on the DRCD dataset from the BERT-Base Chinese pre-trained model, as the document re-ranker; and a span-extraction machine reading comprehension model, fine-tuned on DRCD plus CMRC 2018 from the MacBERT-large pre-trained model, as the document reader. Among all systems tested in this study, this architecture achieved the best Top 10 performance, scoring F1 = 71.355 and EM = 55.17 when tested on the DRCD test set plus the CMRC 2018 dev set.
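As a concrete illustration of this retriever / re-ranker / reader architecture, a minimal sketch follows using the real `elasticsearch` and Hugging Face `transformers` APIs. The index name, field name, checkpoint paths, and candidate counts are illustrative assumptions, not artifacts of the thesis: `./reranker-drcd` stands in for a bert-base-chinese model fine-tuned on DRCD for sentence-pair relevance, and `./reader-drcd-cmrc` for hfl/chinese-macbert-large fine-tuned on DRCD + CMRC 2018 for span extraction.

```python
# Sketch of the retriever -> re-ranker -> reader pipeline described above.
# Index name "zhwiki", field "text", and both checkpoint paths are
# hypothetical placeholders for illustration.
import torch
from elasticsearch import Elasticsearch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          pipeline)

es = Elasticsearch("http://localhost:9200")
rerank_tok = AutoTokenizer.from_pretrained("./reranker-drcd")
rerank_model = AutoModelForSequenceClassification.from_pretrained("./reranker-drcd")
reader = pipeline("question-answering", model="./reader-drcd-cmrc")

def answer(question: str, k_retrieve: int = 50, k_read: int = 10):
    # 1. Document retriever: BM25 full-text search over the article index.
    hits = es.search(index="zhwiki", size=k_retrieve,
                     query={"match": {"text": question}})["hits"]["hits"]
    docs = [h["_source"]["text"] for h in hits]

    # 2. Document re-ranker: score each (question, article) pair; keep the
    #    probability of the "relevant" class (binary head assumed).
    enc = rerank_tok([question] * len(docs), docs, truncation=True,
                     padding=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        scores = rerank_model(**enc).logits.softmax(dim=-1)[:, 1]
    order = scores.argsort(descending=True)[:k_read].tolist()

    # 3. Document reader: extract one answer span per top-ranked article.
    return [reader(question=question, context=docs[i]) for i in order]
```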
Second, this study recruited 33 participants, who tested the system with a total of 289 questions. 70.24% of the questions were answered by the system within the Top 10, a figure that falls between the F1 and EM scores of the automated tests, indicating that automated testing and user testing yielded similar results.
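The F1 and EM being compared here are the standard span-extraction metrics: EM counts a prediction as correct only on an exact string match, while F1 gives partial credit for token overlap. The sketch below computes overlap at the character level, as is common for Chinese QA (e.g., the CMRC 2018 evaluation); the normalization details of the thesis's own scoring are not given, so treat this as illustrative.

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    # EM: 1 if the prediction matches the reference exactly, else 0.
    return float(pred.strip() == gold.strip())

def char_f1(pred: str, gold: str) -> float:
    # Token-overlap F1 computed over characters, the usual unit for Chinese.
    pred_chars, gold_chars = list(pred.strip()), list(gold.strip())
    common = Counter(pred_chars) & Counter(gold_chars)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(gold_chars)
    return 2 * precision * recall / (precision + recall)

# A prediction missing two characters of the gold answer gets F1 credit
# but no EM credit:
print(exact_match("臺灣師範大學", "國立臺灣師範大學"))  # 0.0
print(char_f1("臺灣師範大學", "國立臺灣師範大學"))      # ~0.857
```

Because EM is strict and F1 is lenient, a binary "answered within Top 10" rate from human judges can plausibly land between the two, as observed here.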
Third, analysis of the 29.76% of questions that could not be answered showed that in most cases the cause was a failure to retrieve the correct article from the document collection.
Fourth, questions answerable at Top 1 made up 26.3% of all questions, while Top 2 through Top 10 accounted for another 43.94%. For many questions, then, the system did find the answer but ranked it too low; a better answer ranking mechanism would substantially improve the system's practical usefulness.
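One hypothetical shape such a mechanism could take (not the thesis's design) is to re-sort the Top 10 candidate answers by a weighted mix of the reader's span confidence and the document relevance score, in the spirit of the answer re-ranking of Wang et al. (2018); the weight `alpha` below is an arbitrary illustrative value.

```python
def rerank_answers(candidates, alpha=0.7):
    """Sort candidate answers by a weighted mix of reader and retriever scores.

    candidates: list of dicts like
        {"answer": str, "reader_score": float, "doc_score": float}
    alpha: weight on the reader's span confidence (illustrative value).
    """
    def combined(c):
        return alpha * c["reader_score"] + (1 - alpha) * c["doc_score"]
    return sorted(candidates, key=combined, reverse=True)
```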
Abdi, A., Idris, N., & Ahmad, Z. (2018). QAPD: An ontology-based question answering system in the physics domain. Soft Computing, 22(1), 213–230. https://doi.org/10.1007/s00500-016-2328-2
Chen, D., Fisch, A., Weston, J., & Bordes, A. (2017). Reading Wikipedia to Answer Open-Domain Questions. arXiv:1704.00051 [cs]. http://arxiv.org/abs/1704.00051
Cui, Y., Che, W., Liu, T., Qin, B., Wang, S., & Hu, G. (2020). Revisiting Pre-Trained Models for Chinese Natural Language Processing. Findings of the Association for Computational Linguistics: EMNLP 2020, 657–668. https://doi.org/10.18653/v1/2020.findings-emnlp.58
Cui, Y., Liu, T., Che, W., Xiao, L., Chen, Z., Ma, W., Wang, S., & Hu, G. (2019). A Span-Extraction Dataset for Chinese Machine Reading Comprehension. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 5882–5888. https://doi.org/10.18653/v1/D19-1600
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs]. http://arxiv.org/abs/1810.04805
Ferrucci, D. A. (2012). Introduction to “This is Watson.” IBM Journal of Research and Development, 56(3.4), 1:1-1:15. https://doi.org/10.1147/JRD.2012.2184356
Green, B. F., Wolf, A. K., Chomsky, C., & Laughery, K. (1961). Baseball: An automatic question-answerer. Proceedings of the Western Joint IRE-AIEE-ACM Computer Conference (IRE-AIEE-ACM '61 (Western)), 219–224. https://doi.org/10.1145/1460690.1460714
Joshi, M., Choi, E., Weld, D. S., & Zettlemoyer, L. (2017). TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. arXiv:1705.03551 [cs]. http://arxiv.org/abs/1705.03551
Karpukhin, V., Oğuz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., & Yih, W. (2020). Dense Passage Retrieval for Open-Domain Question Answering. arXiv:2004.04906 [cs]. http://arxiv.org/abs/2004.04906
Rajpurkar, P., Jia, R., & Liang, P. (2018). Know What You Don't Know: Unanswerable Questions for SQuAD. arXiv:1806.03822 [cs]. http://arxiv.org/abs/1806.03822
Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv:1606.05250 [cs]. http://arxiv.org/abs/1606.05250
Saha, A., Aralikatte, R., Khapra, M. M., & Sankaranarayanan, K. (2018). DuoRC: Towards Complex Language Understanding with Paraphrased Reading Comprehension. arXiv:1804.07927 [cs]. http://arxiv.org/abs/1804.07927
Shao, C. C., Liu, T., Lai, Y., Tseng, Y., & Tsai, S. (2018). DRCD: A Chinese Machine Reading Comprehension Dataset. arXiv:1806.00920 [cs.CL]. https://arxiv.org/abs/1806.00920
Trischler, A., Wang, T., Yuan, X., Harris, J., Sordoni, A., Bachman, P., & Suleman, K. (2017). NewsQA: A Machine Comprehension Dataset. arXiv:1611.09830 [cs]. http://arxiv.org/abs/1611.09830
Voorhees, E. M. (1999). The TREC-8 Question Answering Track Report. Proceedings of the Eighth Text REtrieval Conference (TREC-8), 83–106.
Wang, S., Yu, M., Guo, X., Wang, Z., Klinger, T., Zhang, W., Chang, S., Tesauro, G., Zhou, B., & Jiang, J. (2017). R³: Reinforced Reader-Ranker for Open-Domain Question Answering. arXiv:1709.00023 [cs]. http://arxiv.org/abs/1709.00023
Wang, S., Yu, M., Jiang, J., Zhang, W., Guo, X., Chang, S., Wang, Z., Klinger, T., Tesauro, G., & Campbell, M. (2018). Evidence Aggregation for Answer Re-Ranking in Open-Domain Question Answering. arXiv:1711.05116 [cs]. http://arxiv.org/abs/1711.05116
Woods, W. A., Kaplan, R. M., & Nash-Webber, B. L. (1972). The Lunar Sciences Natural Language Information System: Final Report. Bolt Beranek and Newman.
Yang, W., Xie, Y., Lin, A., Li, X., Tan, L., Xiong, K., Li, M., & Lin, J. (2019). End-to-End Open-Domain Question Answering with BERTserini. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), 72–77. https://doi.org/10.18653/v1/N19-4013