Graduate student: 郭博元
Thesis title: A Hybrid Method for Discovering Disease-Gene Associations from Biomedical Texts (結合統計與規則探討生醫文件疾病與基因之關係)
Advisor: 侯文娟
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of publication: 2014
Academic year of graduation: 102 (ROC calendar)
Language: Chinese
Number of pages: 37
Chinese keywords: 規則學習, 統計方法, 疾病與基因關係, 生物醫學文獻探勘
English keywords: Rule learning, Statistical method, Gene-disease relationship, Biomedical text mining
Thesis type: Academic thesis
  • This study automatically extracts relationships between human genetic diseases and genes from the biomedical literature. The experimental data are the Mendelian Inheritance in Man (MIM) records listed in the morbid file of the Online Mendelian Inheritance in Man (OMIM) database. To build the corpus, sentences that contain both a human genetic disease and a related gene listed in the morbid file are collected and treated as correct sentences; sentences that contain neither are randomly selected and treated as incorrect sentences. The Memory-Based Shallow Parser (MBSP) then tags these sentences to obtain the information needed for rule learning, and rules are learned by simulating the ALEPH system. The learned rules are applied to the biomedical texts to capture gene-disease pairs within a single sentence and across adjacent sentences. A statistical method then decides whether each captured pair is valid, using a Z-score derived from the observed value minus the expected value. Further experiments combine this approach with additional constraints and vary the number of rules. Precision, recall, and F-score serve as the evaluation metrics.
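    The abstract does not give the exact Z-score formula or the scoring procedure, so the following is only a minimal sketch of the two quantitative steps it names. It assumes a simple independence baseline for the expected co-occurrence count (a binomial model over text units); the function names, parameters, and sample pairs are hypothetical and are not taken from the thesis.

```python
import math


def cooccurrence_z_score(n_total, n_gene, n_disease, n_both):
    """Z-score of an observed gene-disease co-occurrence count.

    Assumption (not from the thesis): each of the n_total text units
    mentions the gene and the disease independently, so the co-occurrence
    count follows a binomial distribution with success probability p.
    """
    p = (n_gene / n_total) * (n_disease / n_total)
    expected = n_total * p
    std_dev = math.sqrt(n_total * p * (1.0 - p))
    if std_dev == 0.0:
        return 0.0  # no variation under the baseline, so no evidence either way
    return (n_both - expected) / std_dev


def precision_recall_f_score(predicted, gold):
    """Precision, recall, and balanced F-score for sets of (gene, disease) pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    # Hypothetical counts: 1000 units, gene in 50, disease in 40, both in 10.
    print(cooccurrence_z_score(1000, 50, 40, 10))

    # Hypothetical extracted pairs versus hypothetical reference pairs.
    predicted = {("BRCA1", "breast cancer"), ("CFTR", "cystic fibrosis"),
                 ("TP53", "asthma")}
    gold = {("BRCA1", "breast cancer"), ("CFTR", "cystic fibrosis"),
            ("HTT", "Huntington disease")}
    print(precision_recall_f_score(predicted, gold))
```

    In a pipeline of this kind, a pair would be kept only when its Z-score exceeds a chosen threshold; the threshold itself is a tunable parameter that this sketch leaves open.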

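    List of Tables
    List of Figures
    Chapter 1 Introduction
        Section 1 Research Motivation
        Section 2 Research Objectives
        Section 3 Thesis Organization
    Chapter 2 Related Work
    Chapter 3 Methods and Procedures
        Section 1 Introduction
        Section 2 Experimental Data and Tools
        Section 3 Research Framework and Methods
        Section 4 Description of the Research Method
    Chapter 4 Experiments and Results
        Section 1 Experimental Data
        Section 2 Evaluation Metrics
        Section 3 Experimental Results
        Section 4 Analysis and Discussion
    Chapter 5 Conclusions and Future Work
    References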

    Adamic, Lada A., Wilkinson, Dennis, Huberman, Bernardo A. and Adar, Eytan (2002). “A Literature Based Method for Identifying Gene-Disease Connections,” Proceedings of IEEE Computer Society Bioinformatics Conference 2002, 1: 109-117, 2002.

    Al-Mubaid, Hisham and Singh, Rajit K. (2005). “A New Text Mining Approach for Finding Protein-to-Disease Associations,” American Journal of Biochemistry and Biotechnology, 1(3): 145-152, 2005.

    ALEPH. Available from http://www.cs.ox.ac.uk/activities/machlearn/Aleph/aleph.html

    Cheung, Warren A., Ouellette, B.F. Francis and Wasserman, Wyeth W. (2012). “Inferring Novel Gene-Disease Associations Using Medical Subject Heading Over-Representation Profiles,” Genome Medicine, 4: 75, 2012.

    GENIA Corpus. Available from http://www.nactem.ac.uk/genia/

    Maglott, D., Ostell, J., Pruitt, K.D. and Tatusova, T. (2007). “Entrez Gene: Gene-centered Information at NCBI,” Nucleic Acids Research, 35 (Database issue): D26-31, 2007.

    MBT (Memory-Based Tagger-Generator and Tagger). Available from http://ilk.uvt.nl/mbt/

    Memory-Based Shallow Parser. Available from www.clips.ua.ac.be/pages/MBSP#server

    MeSH (Medical Subject Headings). Available from https://www.nlm.nih.gov/mesh/meshhome.html

    Mitchell, J.A., Aronson, A.R., Mork, J.G., Folk, L.C., Humphrey, S.M. and Ward, J.M. (2003). “Gene Indexing: Characterization and Analysis of NLM's GeneRIFs,” Proceedings of AMIA Annual Symposium, 460-464, 2003.

    Muggleton, Stephen and de Raedt, Luc (1994). “Inductive Logic Programming: Theory and Methods,” The Journal of Logic Programming, 19-20: 629-679, 1994.

    NLM (National Library of Medicine). Available from https://www.nlm.nih.gov/

    OMIM (Online Mendelian Inheritance in Man). Available from http://www.ncbi.nlm.nih.gov/omim

    P-value. Available from http://en.wikipedia.org/wiki/P-value

    Pruitt, K.D. and Maglott, D.R. (2001). “RefSeq and LocusLink: NCBI Gene-centered Resources,” Nucleic Acids Research, 29(1): 137-40, 2001.

    Srinivasan, Ashwin (2000). “The Aleph Manual,” Technical Report, Computing Laboratory, Oxford University, 2000. Available from http://www.cs.ox.ac.uk/activities/machlearn/Aleph/aleph.html

    TiMBL (Tilburg Memory-Based Learner). Available from http://ilk.uvt.nl/timbl/

    Wain, H.M., Lush, M., Ducluzeau, F. and Povey, S. (2002). “Genew: The Human Gene Nomenclature Database,” Nucleic Acids Research, 30(1): 169-71, 2002.

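    陳孝源 (2012). “Rule Extraction for Relationships between Human Genes and Diseases” (in Chinese), Master's thesis, Graduate Institute of Computer Science and Information Engineering, National Taiwan Normal University, 2012.

    劉宇錚 (2013). “Exploring Relationships between Human Diseases and Genes Using Adjacent-Sentence Information” (in Chinese), Master's thesis, Graduate Institute of Computer Science and Information Engineering, National Taiwan Normal University, 2013.

    蔡育霖 (2013). “A Probabilistic-Model-Based Method for Coreference Resolution in Biomedical Texts” (in Chinese), Master's thesis, Graduate Institute of Computer Science and Information Engineering, National Taiwan Normal University, 2013.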
