Graduate student: | 張益豪 Chang, Yi-Hao |
---|---|
Thesis title: | 結合監督式及非監督式方法進行新聞文章意見持有者辨識之研究 Combining the Supervised and Unsupervised Approaches to Identifying Opinion Holders in News |
Advisor: | 侯文娟 Hou, Wen-Juan |
Degree: | Master's |
Department: | 資訊工程學系 Department of Computer Science and Information Engineering |
Year of publication: | 2016 |
Academic year of graduation: | 104 (ROC calendar, 2015-2016) |
Language: | Chinese |
Number of pages: | 70 |
Chinese keywords: | 意見探勘、意見持有者辨識、支援向量機、監督式學習、非監督式方法 |
English keywords: | opinion mining, opinion holder identification, support vector machine, supervised learning, unsupervised learning approach |
DOI URL: | https://doi.org/10.6345/NTNU202203822 |
Thesis type: | Academic thesis |
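Opinion mining helps us automatically extract subjective information that people find interesting and useful from large volumes of text from reliable sources. An opinion sentence can be divided into four parts: the opinion topic, the opinion holder, the opinion claim, and the opinion sentiment. The goal of this study is to identify opinion holders. This study proposes a method that combines supervised and unsupervised learning to identify the article author or the term representing the holder in an opinion sentence. The main workflow is divided into two tasks: article-author opinion identification and opinion holder identification.
Opinion holder identification aims to extract, from an opinion sentence, the name of the person or organization expressing that opinion. Building on supervised learning, we manually annotate the terms representing opinion holders in documents that contain subjective opinion sentences, then preprocess the text with natural language processing methods (including word segmentation, part-of-speech tagging, and named entity recognition). Each of the two main tasks then passes opinion sentences through its own set of support vector machine models to identify article authors and opinion holders. Article-author opinion identification uses features covering lexical, part-of-speech, punctuation, named entity, syntactic, and opinion word information; opinion holder identification uses features covering part-of-speech, lexical, named entity, sentence composition, and punctuation information. Finally, the results of the two parts are merged to produce the opinion holders reported by the system.
When an opinion sentence contains multiple opinion holder candidates, we use a formula to compute the term that represents the opinion holder, and rules defined in this study correct holder terms that are incomplete. In addition, where the opinion holder involves anaphora resolution, this study applies the Hobbs algorithm, which works on syntactic parses. Experiments on an English news corpus show that the system achieves an F-1 score of 91.58% for article-author identification and 71.83% for opinion holder identification. On this basis, we also carried out cross-validation and a feature-deletion analysis of feature importance, and the system obtained good identification results.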
Opinion mining helps us automatically extract useful subjective information from large volumes of text from reliable sources. An opinion sentence can be decomposed into four parts: opinion topic, opinion holder, opinion claim, and opinion sentiment. Our goal is to identify opinion holders. This study proposes a combination of supervised and unsupervised approaches to extract the article author and the opinion holders. The work is divided into two phases: identifying opinions attributed to the article author, and identifying the holders of opinion sentences in a labeled corpus.
The purpose of opinion holder identification is to extract the persons or organizations that express the opinion in a subjective opinion sentence. The approach is based on supervised learning over manually annotated corpora drawn from online news articles. Preprocessing uses natural language processing techniques such as segmentation, part-of-speech tagging, and named entity recognition. Our feature analysis draws on both machine learning (support vector machines, SVMs) and unsupervised pattern recognition techniques. Different SVM models are evaluated in cross-validation experiments. The proposed features comprise lexical, part-of-speech, punctuation mark, named entity, syntactic, position, phrase composition, and opinion word features.
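The feature groups named above could be encoded for one candidate token roughly as follows. This is a minimal sketch, not the thesis's implementation: the function name `candidate_features`, the feature keys, the Penn Treebank style tags, and the toy sentence are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Each token is (word, POS tag, named-entity tag); the tag sets are illustrative.
Token = Tuple[str, str, str]


def candidate_features(sentence: List[Token], index: int) -> Dict[str, object]:
    """Build a feature dictionary for the token at `index` (hypothetical keys)."""
    word, pos, ner = sentence[index]
    # Sentinels mark the sentence boundaries when there is no neighbor.
    prev_word = sentence[index - 1][0].lower() if index > 0 else "<BOS>"
    next_word = sentence[index + 1][0].lower() if index + 1 < len(sentence) else "<EOS>"
    return {
        "word": word.lower(),                                  # lexical feature
        "pos": pos,                                            # part-of-speech feature
        "ner": ner,                                            # named entity feature
        "is_person_or_org": ner in {"PERSON", "ORGANIZATION"},
        "prev_word": prev_word,
        "next_word": next_word,
        "next_is_punct": next_word in {",", ".", ":", '"'},    # punctuation feature
        "relative_position": index / len(sentence),            # position feature
    }


# Toy sentence, not taken from the thesis corpus.
sentence = [
    ("Lee", "NNP", "PERSON"),
    ("said", "VBD", "O"),
    ("the", "DT", "O"),
    ("plan", "NN", "O"),
    ("is", "VBZ", "O"),
    ("risky", "JJ", "O"),
    (".", ".", "O"),
]
features = candidate_features(sentence, 0)
```

A dictionary like this would still need to be converted into a numeric vector (for example by one-hot encoding the string values) before it could be passed to an SVM trainer.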
The study also addresses sentences that contain multiple opinion holder candidates. The proposed approach includes unsupervised extraction methods that detect opinion holders without labeled training data. Manually written rules revise incomplete holder representations, and the Hobbs algorithm is applied to resolve anaphora. Our approach is tested on an annotated news corpus with 10-fold cross-validation and feature deletion analysis, obtaining F-1 scores of 91.58% for identifying the author's opinions and 71.83% for opinion holder identification. These results indicate that the approach performs well on both tasks.
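For readers unfamiliar with the evaluation terms, the sketch below shows how an F-1 score is computed from error counts and how a data set can be split into 10 folds. It is an illustrative outline only: the helper names `f1_score` and `k_fold_indices` are assumptions, the contiguous fold layout is one simple choice among several, and the abstract does not say how the thesis built its folds.

```python
from typing import List, Tuple


def f1_score(true_positive: int, false_positive: int, false_negative: int) -> float:
    """F-1 is the harmonic mean of precision and recall."""
    predicted = true_positive + false_positive
    actual = true_positive + false_negative
    if true_positive == 0 or predicted == 0 or actual == 0:
        return 0.0
    precision = true_positive / predicted
    recall = true_positive / actual
    return 2 * precision * recall / (precision + recall)


def k_fold_indices(n_samples: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Return (train, test) index lists for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


# Toy counts, not results from the thesis.
example_f1 = f1_score(true_positive=8, false_positive=2, false_negative=2)
folds = k_fold_indices(25, k=10)
```

In each round, a model would be trained on the `train` indices and scored on the `test` indices, and the ten scores would then be averaged.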