Student: | 林祺傑 Lin, Ci-Jie
---|---
Title: | 新聞面向事實自動擷取與整合之研究 Aspect Retrieval and Integration for News Fact
Advisor: | 柯佳伶 Koh, Jia-Ling
Degree: | Master
Department: | 資訊工程學系 Department of Computer Science and Information Engineering
Year of publication: | 2016
Academic year of graduation: | 104 (ROC calendar, 2015–2016)
Language: | Chinese
Pages: | 74
Keywords (Chinese): | 事實句擷取、新聞事實擷取、資訊整合
Keywords (English): | fact sentence extraction, news fact extraction, information integration
DOI URL: | https://doi.org/10.6345/NTNU202203989
Document type: | Academic thesis
Information spreads rapidly on the Internet, and news media have shifted from traditional newspapers and magazines to online platforms for distributing news. For the same news event, however, the reports from different media are partly similar and partly different, so users must spend time and effort to consolidate the factual information themselves. This thesis therefore proposes a method for automatically extracting news fact information: topic keywords are extracted from the article text to select candidate topic-related fact sentences, and a classification step then identifies the topic-related fact sentences among them. For fact extraction, a method is designed that, starting from the topic fact sentences and the output of natural language analysis, extracts fact triples consisting of a facet term, a relation term, and a description term. For information integration, both facet similarity and description similarity between triples are considered: hierarchical clustering groups the fact information of different facets, and an incremental merging procedure combines fact triples whose facets or descriptions are semantically similar. Experimental results show that fact sentence extraction, triple extraction, and triple merging all achieve good results. The proposed method can thus automatically integrate the different facets of information reported in related articles, letting users efficiently gain an understanding of all facets of a news event.
The Internet has accelerated the flow of information, and in recent years news media have replaced traditional newspapers and magazines by spreading news online. However, users have to spend considerable time and effort to obtain exact factual information, because news documents about the same event collected from different media share similar content while each may also report additional facts of its own. To solve this problem, we propose a method that automatically extracts and integrates the factual information in news documents. Candidate fact sentences are selected by extracting topic keywords from the news content; various features of the candidate sentences are then used in a classification step to identify the fact sentences. To provide fact information, triples consisting of a facet term, a relation term, and a description term are extracted from the topic sentences using a natural language analysis tool. The similarity of the facet terms between two triples is then used to cluster the extracted triples by agglomerative hierarchical clustering. Within each cluster, an incremental method combines pairs of triples that have similar facet or description terms, producing integrated fact information. The performance evaluation shows that fact sentence extraction, triple extraction, and triple combination all perform well. The proposed approach can effectively integrate facet information from different news documents, giving users a comprehensive understanding of the reported event.
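To make the integration stage more concrete, the following Python sketch illustrates it under simplifying assumptions: the fact triples are taken as already extracted, a character-level Jaccard similarity stands in for whatever term-similarity measure the thesis actually uses, and the 0.5 thresholds as well as the names `sim`, `cluster_by_facet`, and `merge_cluster` are illustrative inventions rather than parts of the thesis implementation.

```python
# Sketch of the integration stage: group fact triples
# (facet term, relation term, description term) by facet similarity with
# agglomerative clustering, then merge similar triples inside each cluster.
from itertools import combinations


def sim(a: str, b: str) -> float:
    """Jaccard similarity over character sets -- a crude stand-in
    for the semantic term similarity used in the thesis."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def cluster_by_facet(triples, threshold=0.5):
    """Average-linkage agglomerative clustering on facet-term similarity."""
    clusters = [[t] for t in triples]
    while True:
        best, pair = threshold, None
        for (i, ci), (j, cj) in combinations(enumerate(clusters), 2):
            avg = sum(sim(x[0], y[0]) for x in ci for y in cj) / (len(ci) * len(cj))
            if avg >= best:
                best, pair = avg, (i, j)
        if pair is None:           # no pair is similar enough any more
            return clusters
        i, j = pair                # merge the two most similar clusters
        clusters[i].extend(clusters[j])
        del clusters[j]


def merge_cluster(cluster, threshold=0.5):
    """Incrementally fold triples with a similar facet or description
    into one integrated record per facet."""
    merged = []
    for facet, rel, desc in cluster:
        for rec in merged:
            if sim(facet, rec["facet"]) >= threshold or \
               any(sim(desc, d) >= threshold for d in rec["descriptions"]):
                rec["relations"].add(rel)
                rec["descriptions"].add(desc)
                break
        else:                      # no similar record found: start a new one
            merged.append({"facet": facet,
                           "relations": {rel},
                           "descriptions": {desc}})
    return merged


if __name__ == "__main__":
    # Toy triples extracted from three hypothetical earthquake reports.
    triples = [
        ("地震規模", "達到", "6.4"),
        ("地震規模", "為", "芮氏6.4"),
        ("傷亡人數", "超過", "百人"),
    ]
    for cluster in cluster_by_facet(triples):
        print(merge_cluster(cluster))
```

Average-linkage clustering over facet terms yields one group per reported facet, and the incremental loop in `merge_cluster` folds each new triple into an existing record whenever its facet or description is similar enough, which mirrors the pairing idea described in the abstract.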