Graduate student: | 陳立哲 Li-Che Chen |
---|---|
Thesis title: | 生物資訊文獻中人類遺傳疾病與基因關聯度之研究 The Study of Gene-Disease Associations from the Bioinformatics Literature |
Advisor: | 侯文娟 Hou, Wen-Juan |
Degree: | Master |
Department: | Department of Computer Science and Information Engineering |
Publication year: | 2011 |
Academic year of graduation: | 99 |
Language: | Chinese |
Pages: | 43 |
Chinese keywords: | human genetic disease, gene, medical literature database, Online Mendelian Inheritance in Man |
English keywords: | humanity genetic disease, gene, Medline, OMIM |
Thesis type: | Academic thesis |
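This thesis investigates the degree of association between human genetic diseases and genes in the literature, with the aim of uncovering relationships between the two. The goal is that, for future bioinformatics literature, one can quickly determine whether a human genetic disease that appears in a document is associated with a gene that appears in the same document.
The data used in this thesis come from the Medical Literature Analysis and Retrieval System Online (Medline), from which we extract the required fields: PMID, TI, and AB. PMID is the record's ID number, TI is the title, and AB is the body text (the abstract). Next, Geniatagger is used to tag the genes that appear in AB. Then, data on human genetic diseases and their related genes are downloaded from the Online Mendelian Inheritance in Man (OMIM) website, and both resources are used to tag the diseases and genes that appear in AB.
For this study, we propose two classes of scoring methods; the second class is further varied to derive new scoring methods. The first class comprises five methods: the first uses a density formula, and the second uses a gravity formula that comes in four variations (five methods in total). The second class is Dice, a measure commonly used in natural language processing. Taking this formula as the basic framework, we adjust and extend it, and we also consider a general ratio formula and extensions of that ratio formula.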
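The dictionary-based part of this tagging step can be pictured with a minimal sketch. The term lists, the function name `tag_terms`, and the sample sentence are illustrative assumptions: in the thesis the lists would be derived from OMIM, and gene tagging also relies on Geniatagger, which is not reproduced here.

```python
import re

def tag_terms(text, terms, label):
    """Return sorted (start, end, label, matched_text) tuples for every
    case-insensitive, whole-word occurrence of a term in text."""
    hits = []
    for term in terms:
        # Lookarounds keep "at" from matching inside "cat".
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.start(), m.end(), label, m.group(0)))
    return sorted(hits)

# Hypothetical stand-ins for term lists derived from OMIM.
DISEASES = ["cystic fibrosis", "Huntington disease"]
GENES = ["CFTR", "HTT"]

ab = "Mutations in CFTR cause cystic fibrosis."
tags = sorted(tag_terms(ab, GENES, "GENE") + tag_terms(ab, DISEASES, "DISEASE"))
for start, end, label, surface in tags:
    print(start, end, label, surface)
```

Keeping character offsets alongside each label matters here because the position-based scoring methods described in this abstract need to know where each disease and gene mention occurs.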
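In the final results, the first two methods (the density and gravity formulas) reach a highest accuracy of only about 10%, which is low. The reason is that they compute their scores using only position and TFIDFT (Term Frequency Inverse Document Frequency (Term)) variables and ignore certain characteristics of diseases and genes, so their scores are not discriminative. By contrast, the formulas built on Dice take Gene Ontology into account. These factors fit the spirit of the experiment, so higher scores lie closer to the correct pairings, and once the score passes a threshold, accuracy reaches 100%.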
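As a point of reference for the term-weighting variable mentioned above, the following is a minimal sketch of the textbook TF-IDF weight, tf × log(N / df). The abstract does not define its TFIDFT variant, so this standard form, the function name `tf_idf`, and the toy documents are assumptions rather than the thesis's own formula.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per
    document, with weight = tf * log(N / df) (textbook form, assumed)."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))          # count each term once per document
    weights = []
    for tokens in docs:
        tf = Counter(tokens)            # raw term frequency in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy corpus of two tokenized abstracts.
docs = [
    ["brca1", "mutation", "breast", "cancer"],
    ["cftr", "mutation", "cystic", "fibrosis"],
]
w = tf_idf(docs)
print(w[0]["brca1"])     # term found in only one document: positive weight
print(w[0]["mutation"])  # term found in every document: weight 0.0
```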
In this study, we explore the relationships between human genetic diseases and genes in the literature, in the hope that our approach helps clarify how the two are related. The purpose of this thesis is to let readers determine more efficiently, from bioinformatics documents, whether a genetic disease mentioned in a document is related to a gene mentioned in the same document.
This study uses data from the Medical Literature Analysis and Retrieval System Online (Medline), specifically the PMID, TI, and AB fields. PMID is the ID number, TI is the title, and AB is the content. Next, we use Geniatagger to tag the genes that appear in AB. Then, we download information about human genetic diseases and their related genes from the Online Mendelian Inheritance in Man (OMIM) website. With these two resources, we are able to tag the genes and diseases that appear in AB.
We propose two types of scoring methods in this research. The first type is divided into five kinds: the first uses a density formula, and the second uses a gravity formula, which has four variations. The second type is Dice. We take this formula as a foundation and extend it, and we also use a general ratio formula and extensions of that ratio formula.
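The field-extraction step described above can be sketched as follows, assuming the records are in the MEDLINE display format (a tag padded to four characters, then `- `, with continuation lines indented by six spaces). The function name `parse_medline` and the sample records are hypothetical, not taken from the thesis.

```python
def parse_medline(text, wanted=("PMID", "TI", "AB")):
    """Parse MEDLINE-format text into a list of dicts with the wanted fields."""
    records, current, tag = [], {}, None
    for line in text.splitlines():
        if not line.strip():                       # blank line ends a record
            if current:
                records.append(current)
            current, tag = {}, None
        elif line[:4].strip() and line[4:6] == "- ":   # new field, e.g. "TI  - "
            tag = line[:4].strip()
            if tag in wanted:
                current[tag] = line[6:].strip()
        elif tag in wanted and line.startswith("      "):  # continuation line
            current[tag] += " " + line.strip()
    if current:                                    # flush the last record
        records.append(current)
    return records

# Hypothetical records for illustration only.
sample = """\
PMID- 00000001
TI  - A hypothetical title that wraps
      onto a second line.
AB  - Hypothetical abstract text.
AU  - Doe J

PMID- 00000002
TI  - Another hypothetical title.
"""

recs = parse_medline(sample)
for rec in recs:
    print(rec)
```

Fields outside `wanted` (such as `AU` in the sample) are skipped, and a record without an abstract simply has no `AB` key, so downstream tagging code should check for it.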
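For orientation, the plain Dice coefficient that the second type builds on can be sketched as below, assuming it is applied to the sets of documents in which a gene and a disease are tagged. The extensions mentioned in the abstract are not specified there and are not reproduced; the function name `dice` and the PMID sets are hypothetical.

```python
def dice(docs_with_gene, docs_with_disease):
    """Plain Dice coefficient: 2 * |A & B| / (|A| + |B|).
    Returns 0.0 when both sets are empty."""
    a, b = set(docs_with_gene), set(docs_with_disease)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical document IDs in which each term was tagged.
gene_docs = {"pmid1", "pmid2", "pmid3"}
disease_docs = {"pmid2", "pmid3", "pmid4", "pmid5"}

score = dice(gene_docs, disease_docs)
print(round(score, 4))  # 2 * 2 / (3 + 4)
```

The score ranges from 0 (the gene and disease never co-occur) to 1 (they always appear in exactly the same documents).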
For the first type, the highest accuracy is approximately ten percent, which is rather low. The reason is that these methods use only position and Term Frequency Inverse Document Frequency (Term) variables and ignore certain features of diseases and genes, so their scores are not discriminative. Next, we use formulas that take Dice as their foundation and also consider Gene Ontology, which matches the spirit of the experiment. As a result, higher scores lie closer to the correct pairs, and once the score exceeds a threshold, the accuracy reaches one hundred percent.
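The threshold behavior described above can be illustrated with a minimal sketch. It assumes that "accuracy" means the fraction of disease-gene pairs scoring above the threshold that are correct pairs; the function name `precision_above`, the scores, and the gold pairs are hypothetical and are not results from the thesis.

```python
def precision_above(scored_pairs, gold_pairs, threshold):
    """scored_pairs: {(disease, gene): score}. Returns the fraction of pairs
    with score > threshold that are in gold_pairs, or None if none qualify."""
    predicted = [pair for pair, s in scored_pairs.items() if s > threshold]
    if not predicted:
        return None
    correct = sum(1 for pair in predicted if pair in gold_pairs)
    return correct / len(predicted)

# Hypothetical scores and reference pairs for illustration only.
scores = {
    ("D1", "G1"): 0.9,
    ("D1", "G2"): 0.2,
    ("D2", "G3"): 0.7,
    ("D2", "G1"): 0.4,
}
gold = {("D1", "G1"), ("D2", "G3")}

for t in (0.1, 0.3, 0.5):
    print(t, precision_above(scores, gold, t))
```

In this toy data the value rises as the threshold increases, because the incorrect pairs have the lowest scores and are filtered out first.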