Graduate student: | Shang, Tsung-Cheng (尚宗承) |
---|---|
Thesis title: | Applying Summarization System and Information Distance Method to Biomedical Question Answering System (應用摘要系統與資訊距離方法於生醫問答系統之研究) |
Advisor: | Hou, Wen-Juan (侯文娟) |
Degree: | Master |
Department: | Department of Computer Science and Information Engineering (資訊工程學系) |
Year of publication: | 2014 |
Academic year of graduation: | 102 (ROC calendar) |
Language: | Chinese |
Number of pages: | 98 |
Chinese keywords: | 資訊距離, 摘要, 答案驗證, 機器閱讀問答系統評估, 跨語言評估會議, 字詞擴充 |
English keywords: | Information distance, Summarization, Answer validation, QA4MRE, CLEF, Query expansion |
Thesis type: | Academic thesis |
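This thesis takes Alzheimer’s disease as its subject and studies biomedical question answering systems. The aim is to apply the properties of a summarization system and the information distance method to question answering, in the hope that machine learning, supported by existing related literature and a background knowledge base, can identify the correct answers to questions of this kind.
The test data consist of four test sets related to Alzheimer’s disease. Each test set contains one test document and 10 questions about that document; every question is single-choice with five options. A background knowledge base is also used. Its sources are articles on Alzheimer’s disease from the Medical Literature Analysis and Retrieval System Online (Medline), obtained from Pubmed Central, and biological articles and abstracts on Alzheimer’s disease provided by the Massachusetts Alzheimer’s Disease Research Center in Massachusetts, USA.
The research is organized around two methods with different architectures. Method 1 builds on the biomedical question answering system proposed by Bing-Han Tsai (蔡秉翰) in 2013 and combines it with a summarization system that summarizes either the test document or the background knowledge base, with the aim of extracting the important information in the articles. Method 2 rests on the idea that the information distance between a question and its correct answer should be smaller than the information distance between the question and any other candidate answer. The information distance method is therefore adapted to the characteristics of the QA4MRE data, and TFIDF computation and term expansion are added.
Finally, experiments were run on both methods. The Method 1 experiments found that the documents in the background knowledge base are only weakly related to the topics of the questions in the corresponding test set, which means that most of the information in those documents is unimportant; summarizing the background knowledge base therefore extracts the important information effectively. The Method 2 experiments found that, for the information distance method, increasing the number of Question Focus items effectively raises accuracy.
These experiments show that, when a summarization system and the information distance method are applied to a biomedical question answering system, both summarizing the documents in the background knowledge base and applying the weighting computation of the information distance method yield good results.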
This study takes Alzheimer’s disease as its subject and implements a biomedical question answering system. The purpose of the thesis is to apply both the properties of a summarization system and an information distance method to question answering. Machine learning techniques are also applied in an attempt to find the correct answer from the related literature and background knowledge.
The test data are composed of four test sets. Each set includes one document and ten questions, with five answer options per question. Each question has exactly one correct answer among the choices. The study also uses background collections drawn from articles in the Medical Literature Analysis and Retrieval System Online (Medline) and from the Massachusetts Alzheimer’s Disease Research Center.
In the thesis, two approaches are adopted towards developing an effective question answering system. The first approach builds on the methods of Hou and Tsai (2014) and extends them with a summarization technique to extract the important information. The second approach is based on the concept of information distance. The thesis assumes that the information distance between the question and its correct answer should be smaller than the distances between the question and the incorrect answers. The information distance method is adapted to fit the characteristics of QA4MRE, and two further techniques, TFIDF computation and query expansion, are also used in the second approach.
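The selection rule in the second approach (choose the candidate closest to the question) can be sketched as follows. This is a minimal illustration, not the thesis's formula: it assumes an approximation in the style of the normalized Google distance, computed from document frequencies over a tiny in-memory corpus, and it omits TFIDF weighting and query expansion. The names `doc_freq`, `distance`, `pick_answer`, and `TOY_DOCS` are hypothetical.

```python
import math

# Hypothetical toy corpus: each document is a set of lowercase terms.
TOY_DOCS = [
    {"amyloid", "plaque", "alzheimer"},
    {"amyloid", "alzheimer", "tau"},
    {"insulin", "diabetes"},
    {"tau", "alzheimer", "tangle"},
    {"insulin", "glucose"},
]

def doc_freq(terms, docs):
    """Count the documents that contain every term in `terms`."""
    return sum(1 for d in docs if all(t in d for t in terms))

def distance(x, y, docs):
    """Frequency-based distance between two terms (assumed NGD-style form)."""
    fx = doc_freq([x], docs)
    fy = doc_freq([y], docs)
    fxy = doc_freq([x, y], docs)
    if fx == 0 or fy == 0 or fxy == 0:
        return float("inf")  # the terms never co-occur: maximally distant
    lx, ly = math.log(fx), math.log(fy)
    denom = math.log(len(docs)) - min(lx, ly)
    if denom == 0:
        return 0.0  # both terms occur in every document
    return (max(lx, ly) - math.log(fxy)) / denom

def pick_answer(question_focus, candidates, docs):
    """Return the candidate with the smallest mean distance to the focus terms."""
    def mean_distance(candidate):
        total = sum(distance(q, candidate, docs) for q in question_focus)
        return total / len(question_focus)
    return min(candidates, key=mean_distance)

print(pick_answer(["alzheimer"], ["amyloid", "insulin", "glucose"], TOY_DOCS))
```

In a real system the frequencies would come from a search engine or an indexed background collection rather than a hard-coded list, and the list of focus terms is where additional Question Focus items would be supplied.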
The experiments on the first approach show that the literature in the background knowledge is only weakly related to the questions in the test set. Because that literature contains a great deal of noise, summarizing it effectively captures the important information needed. The experiments on the second approach show that increasing the number of “Question Focus” items effectively improves the accuracy of the system.
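The idea of summarizing noisy background text so that only the central sentences remain can be illustrated with a small extractive summarizer. This sketch is an assumption for illustration only: it ranks sentences by degree centrality over bag-of-words cosine similarity, a simplified relative of graph-based methods such as LexRank, and it is not the summarization system used in the thesis. The function names, the threshold, and the sample sentences are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def summarize(sentences, k=2, threshold=0.2):
    """Keep the k sentences that are similar to the most other sentences."""
    vectors = [Counter(tokenize(s)) for s in sentences]
    degree = [
        sum(1 for j, v in enumerate(vectors)
            if j != i and cosine(u, v) >= threshold)
        for i, u in enumerate(vectors)
    ]
    # Highest degree first; ties are broken by original position.
    top = sorted(range(len(sentences)), key=lambda i: (-degree[i], i))[:k]
    return [sentences[i] for i in sorted(top)]  # restore document order

SAMPLE = [
    "Amyloid plaques accumulate in Alzheimer disease.",
    "Tau tangles also appear in Alzheimer disease.",
    "The weather was sunny on the day of the conference.",
    "Amyloid and tau are both studied in Alzheimer disease.",
]

for line in summarize(SAMPLE, k=2):
    print(line)
```

The off-topic sentence shares no terms with the others, so its degree is zero and it is dropped, which mirrors the filtering effect described for the background literature.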
In summary, both summarization and information distance methods are applied to the biomedical question answering system in this study. The experiments show that summarizing the literature in the background knowledge and applying the information distance method can yield good results.
Ask Jeeves. Available from http://www.ask.com
Bhaskar, Pinaki, Pakray, Partha, Banerjee, Somnath, Banerjee, Samadrita, Bandyopadhyay, Sivaji and Gelbukh, Alexander (2012). Question Answering System for QA4MRE@CLEF 2012. CLEF 2012 Workshop on Question Answering For Machine Reading Evaluation (QA4MRE). CLEF 2012 Labs and Workshop - Working Notes Papers.
Bhattacharya, Sanmitra and Toldo, Luca (2012). Question Answering for Alzheimer Disease Using Information Retrieval. CLEF 2012 Evaluation Labs and Workshop - Working Notes Papers.
Cao, Ling, Qiu, Xipeng and Huang, Xuanjing (2011). Deep Question Answering for Single Document with Lexical Chains. Main Task of Question Answering for Machine Reading Evaluation at CLEF 2011.
CLEF2013. Available from http://www.clef2013.org/
Erkan, Güneş and Radev, Dragomir R. (2004). LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. Journal of Artificial Intelligence Research, 22, pp. 457-479.
GDep parser. Available from http://people.ict.usc.edu/~sagae/parser/gdep/index.html
Google search engine. http://www.google.com
Hou, Wen-Juan and Tsai, Bing-Han (2014). An Answer Validation Concept Based Approach for Question Answering in Biomedical Domain. Modern Advances in Applied Intelligent Systems: 27th International Conference on Industrial Engineering and Other Applications of Applied Intelligent Systems, IEA/AIE 2014, Kaohsiung, Taiwan, June 3-6, 2014, Proceedings, Part I. Moonis Ali et al. (Eds.), IEA/AIE 2014, Part I, LNAI 8481, pp. 148-159, Springer International Publishing Switzerland, July, 2014.
LA-PDFText. Available from http://code.google.com/p/lapdftext/
Li, Fangtao, Zhang, Xian and Zhu, Xiaoyan (2008). Answer Validation by Information Distance Calculation. Coling 2008: Proceedings of the 2nd Workshop on Information Retrieval for Question Answering, pp. 42-49.
Li, Ming and Vitanyi, Paul (2008). An Introduction to Kolmogorov Complexity and Its Applications. Third Edition, Springer Verlag.
Manning, Christopher D., Raghavan, Prabhakar and Schütze, Hinrich (2008). Introduction to Information Retrieval. Cambridge University Press.
MEAD. Available from http://www.summarization.com/mead/
Morante, Roser, Krallinger, Martin, Valencia, Alfonso and Daelemans, Walter. Machine Reading of Biomedical Texts about Alzheimer’s Disease. QA4MRE Pilot Task – Machine Reading of Biomedical Texts about Alzheimer’s Disease at CLEF 2012.
Pakray, Partha, Bhaskar, Pinaki, Banerjee, Somnath, Pal, Bidhan Chandra, Bandyopadhyay, Sivaji and Gelbukh, Alexander (2011). A Hybrid Question Answering System based on Information Retrieval and Answer Validation. Main Task of Question Answering for Machine Reading Evaluation at CLEF 2011.
Porter, M.F. (1980). An Algorithm for Suffix Stripping. Program, 14(3), pp. 130-137.
Porter Stemmer. Available from http://tartarus.org/martin/PorterStemmer/
QA4MRE. Available from http://nlp.uned.es/clef-qa/
Qiu, Yonggang and Frei, H.P. (1993). Concept Based Query Expansion. Proceedings of ACM SIGIR International Conference on Research and Development in Information Retrieval, pp. 160-169.
Ramakrishnan, C., Patnia, A., Hovy, E. and Burns, G. (2012). Layout-Aware Text Extraction from Full-text PDF of Scientific Articles. Source Code for Biology and Medicine, 7(1), p. 7.
Robertson, Stephen and Zaragoza, Hugo (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 3(4), pp. 333-389.
Stopword List. Available from http://www.lextek.com/manuals/onix/stopwords1.html
Strohman, T., Metzler, D., Turtle, H. and Croft, W.B. (2005). Indri: a Language-Model Based Search Engine for Complex Queries. Proceedings of the International Conference on Intelligent Analysis.
Wren, Jonathan D. (2011). Question Answering Systems in Biology and Medicine—the Time is Now. Bioinformatics, 27(14), pp. 2025-2026.
Yahoo search engine. http://tw.yahoo.com
Zhou, Guangyou, Cai, Li, Zhao, Jun and Liu, Kang (2011). Phrase-Based Translation Model for Question Retrieval in Community Question Answer Archives. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pp. 653-662.
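Tsai, Bing-Han (蔡秉翰) (2013). 以答案驗證方法為基礎之生醫相關問答系統 [A Biomedical Question Answering System Based on Answer Validation Methods]. Master's thesis, Department of Computer Science and Information Engineering, National Taiwan Normal University.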