Graduate student: | 陳孝源 |
---|---|
Thesis title: | 人類基因與疾病關係之規則擷取 Rule Extraction in Genes and Diseases |
Advisor: | 侯文娟 |
Degree: | Master |
Department: | Department of Computer Science and Information Engineering |
Year of publication: | 2012 |
Academic year of graduation: | 100 |
Language: | Chinese |
Number of pages: | 39 |
Chinese keywords: | 規則擷取, 規則學習, 疾病與基因關係, 生物醫學文獻探勘 |
English keywords: | rule extraction, rule learning, gene-disease relationship, biomedical text mining |
Thesis type: | Academic thesis |
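Across the many bioinformatics publications on human genetic diseases, researchers have tried various methods to compute the degree of association between human genetic diseases and genes, and to find rules or correlations that explain how the two are related. If such a method proves suitable, it can be applied to future publications: running these rules and computations over the large volume of newly published literature identifies relationships between diseases and genes. This helps readers and saves time, and researchers hope it will speed up progress in biomedicine and lead sooner to ways of treating these diseases.
The method used in this thesis is summarized as follows. Our data come from the Medical Literature Analysis and Retrieval System Online (MEDLINE). First, we extract the fields we need from MEDLINE: TI, the title, and AB, the abstract. Second, we use the morbid data provided by Online Mendelian Inheritance in Man (OMIM) as the gold standard to pick out the correct sentences, that is, sentences in which a genetic disease and a gene are actually related. We then use the Memory-Based Shallow Parser (MBSP) to parse these correct sentences, together with randomly selected incorrect sentences, to obtain part-of-speech information, and we use the ALEPH system, an inductive logic programming (ILP) framework, to learn rules. An ILP framework has three elements: a hypothesis H, background knowledge B, and examples E. Given B and E, H can be derived. Among the learned rules, we propose several scoring calculations and select the better rules experimentally. In the evaluation, the selected rules are used to find related diseases and genes, with precision and recall as the evaluation criteria. The experimental results show a best F-score of 66.9%, with a precision of 70.6% and a recall of 63.5%.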
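The first step, pulling the TI and AB fields out of MEDLINE records, can be sketched as below. This is a minimal illustration and not the thesis's own code. It assumes the records are in the plain-text MEDLINE display format, where each field starts with a tag padded to four characters followed by `- ` and wrapped lines are indented. The function name `extract_ti_ab` and the sample records are made up for the example.

```python
def extract_ti_ab(medline_text):
    """Return one {"TI": ..., "AB": ...} dict per record (hypothetical helper)."""
    records = []
    current = {}
    tag = None
    for line in medline_text.splitlines():
        if not line.strip():
            # A blank line separates records.
            if current:
                records.append(current)
                current = {}
            tag = None
            continue
        if len(line) > 5 and line[4] == "-":
            # A new field: tag in columns 0-3, value from column 6 onward.
            tag = line[:4].strip()
            value = line[6:].strip()
            current[tag] = (current[tag] + " " + value) if tag in current else value
        elif tag is not None and line.startswith(" "):
            # An indented line continues the previous field.
            current[tag] += " " + line.strip()
    if current:
        records.append(current)
    # Keep only the title and abstract fields.
    return [{"TI": r.get("TI", ""), "AB": r.get("AB", "")} for r in records]


# Made-up sample records, for illustration only.
SAMPLE = (
    "PMID- 0000001\n"
    "TI  - A made-up title that is long enough to\n"
    "      wrap onto a second line.\n"
    "AB  - A made-up abstract sentence.\n"
    "\n"
    "PMID- 0000002\n"
    "TI  - Another made-up title.\n"
    "AB  - Another made-up abstract.\n"
)

if __name__ == "__main__":
    for record in extract_ti_ab(SAMPLE):
        print(record["TI"], "|", record["AB"])
```

Sentence splitting, OMIM-based labeling, and MBSP parsing would follow this step and are not shown here.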
In the biomedical literature on human genetic diseases, researchers have tried different methods to find rules or relations linking human genetic diseases and genes. If such methods work well, the resulting rules can be used to find relations in further biomedical literature faster and more easily. Researchers expect these methods to speed up progress in the biomedical domain and, in turn, to help find ways to cure these diseases.
We used data provided by the Medical Literature Analysis and Retrieval System Online (MEDLINE). First, we retrieved the required information from MEDLINE, namely TI and AB, where TI is the title and AB is the abstract. Second, we used the morbid data provided by Online Mendelian Inheritance in Man (OMIM) to find the correct sentences about human genetic diseases and genes, and we also picked incorrect sentences at random. Third, we used the Memory-Based Shallow Parser (MBSP) to parse these sentences and obtain part-of-speech and other information. Finally, we used the ALEPH system to learn rules from this information. ALEPH is an ILP framework. An ILP framework contains three elements: a hypothesis H, background knowledge B, and examples E. Given B and E, we can infer H, which corresponds to the rules in our experiment. We proposed several calculation methods to select better rules, and then used these rules to find sentences that relate human genetic diseases and genes. We used precision, recall, and F-score as evaluation metrics. The experimental results showed a best F-score of 66.9%, with a precision of 70.6% and a recall of 63.5%.
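As a quick consistency check on the reported figures, the sketch below recomputes the F-score from the stated precision and recall. It assumes the F-score is the balanced F1 (the harmonic mean of precision and recall), which the abstract does not state explicitly. The function name `f_score` and its `beta` parameter are illustrative.

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1.0 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Precision and recall as reported in the abstract.
    f1 = f_score(0.706, 0.635)
    print(f"{100 * f1:.1f}%")  # prints 66.9%
```

Under that assumption, a precision of 70.6% and a recall of 63.5% reproduce the reported 66.9%.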