
Author: Hsu, Yu-wen (徐毓雯)
Thesis title: Automatic Feature Terms Extraction for Product Opinions (產品評論特徵自動擷取之研究)
Advisor: Koh, Jia-Ling (柯佳伶)
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of publication: 2011
Academic year of graduation: 99 (ROC calendar)
Language: Chinese
Number of pages: 60
Keywords (Chinese): 產品評論特徵, 自動擷取, 字詞重要性評估函式, 意見探勘
Keywords (English): feature terms of products, automatic extraction, importance measure function of terms, opinion mining
Document type: Academic thesis
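  • In most current opinion mining research, product feature terms are either specified manually or chosen by term frequency, and they must be specified again for each new product category. We therefore aim to extract product feature terms automatically, reducing the manual effort spent on feature selection. This thesis applies several term importance measures to study how to extract product feature terms from forum posts automatically and effectively. Taking nouns as candidate feature terms, we process a forum corpus and a camera-introduction corpus separately. Within each corpus, we count how often each term appears in the posts discussing each brand, which reflects commonly mentioned features; we use the divergence between the probabilities with which a term appears for different brands to reflect brand-specific features; and we use the degree of association between a brand and a term to reflect brand-related features. In addition, we consider the divergence of term probabilities across the two corpora to reflect the product feature terms commonly used in forum and camera-introduction documents, and then filter out general colloquial terms using a list of common words. We propose a product feature term importance measure function that combines the importance scores obtained from these analyses as the basis for extraction. Experimental results show that selecting terms with the proposed function can effectively extract product feature terms automatically.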

    In recent research on opinion mining, the feature terms of products are usually assigned manually or determined according to term frequencies. Consequently, considerable effort is required whenever a different kind of product is chosen. For this reason, the goal of this thesis is to study how to extract feature terms of products from forum documents automatically and effectively. We select forum posts and expert commentaries as the corpora. Within a corpus, the nouns appearing in the documents are selected as the candidate feature terms. For each candidate term, the term frequency is counted over the documents discussing a certain brand, which shows the popularity of a feature term. The divergence of probability between different brands is calculated for each candidate term, which identifies the feature terms particular to a brand. The correlation of a feature term with a brand is also calculated to show the terms related to a brand. Furthermore, the divergence of probability between the two corpora is calculated for each candidate term to show the terms characteristic of each corpus. Finally, we propose an importance measure function that combines the scores of the above evaluation methods to evaluate the importance of terms. The experimental results show that the ranked list of feature terms obtained with the importance measure function extracts product feature terms automatically and effectively.
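    The measures named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the exact formulas, smoothing, and combination weights are not given in this record, so the code assumes the textbook definitions of relative term frequency, a per-term Kullback-Leibler contribution, pointwise mutual information, and a per-term Jensen-Shannon contribution, and combines them with a plain weighted sum. All function names, the `eps` smoothing constant, and the `weights` parameter are hypothetical.

```python
import math
from collections import Counter


def term_probs(docs):
    """Relative frequency of each term over a list of tokenized documents."""
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def kl_term(p, q, eps=1e-9):
    """Contribution p * log2(p / q) of one term to KL(P || Q).

    eps is an assumed floor for q so that unseen terms do not divide by zero.
    """
    if p == 0.0:
        return 0.0
    return p * math.log2(p / max(q, eps))


def js_term(p, q):
    """Contribution of one term to the Jensen-Shannon divergence of P and Q."""
    m = 0.5 * (p + q)
    return 0.5 * kl_term(p, m) + 0.5 * kl_term(q, m)


def pmi(p_term_given_brand, p_term):
    """Pointwise mutual information between a term and a brand, in bits."""
    return math.log2(p_term_given_brand / p_term)


def rank_terms(brand_docs, other_docs, reference_docs,
               weights=(1.0, 1.0, 1.0, 1.0)):
    """Rank candidate terms of one brand by a weighted sum of four scores.

    brand_docs:     tokenized posts about the brand of interest
    other_docs:     tokenized posts about the other brands (same corpus)
    reference_docs: tokenized documents from the second corpus
    weights:        hypothetical weights for (TF, KL, PMI, JS); the scores
                    are left unnormalized in this sketch
    """
    p_brand = term_probs(brand_docs)
    p_other = term_probs(other_docs)
    p_all = term_probs(brand_docs + other_docs)
    p_ref = term_probs(reference_docs)
    w_tf, w_kl, w_mi, w_js = weights

    scores = {}
    for term, p in p_brand.items():
        scores[term] = (
            w_tf * p                                      # common features
            + w_kl * kl_term(p, p_other.get(term, 0.0))   # brand-specific
            + w_mi * pmi(p, p_all[term])                  # brand-related
            + w_js * js_term(p, p_ref.get(term, 0.0))     # cross-corpus
        )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy data; real input would be the noun candidates after POS tagging.
    brand = [["lens", "zoom", "battery"], ["lens", "price"]]
    other = [["battery", "price", "screen"], ["price", "screen"]]
    reference = [["lens", "sensor", "zoom"]]
    for term, score in rank_terms(brand, other, reference):
        print(f"{term}\t{score:.3f}")
```

    Because the four scores live on different scales, a practical version would probably normalize each one before summing; the common-word filtering step mentioned in the abstract is also left out here.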

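    Table of Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
      1-1 Research Motivation and Objectives
      1-2 Method
      1-3 Thesis Organization
    Chapter 2 Literature Review
      2-1 General Document Feature Representation Methods
      2-2 Opinion Extraction Methods
      2-3 Feature Term Extraction Methods
    Chapter 3 System Architecture and Data Preprocessing
      3-1 System Architecture and Workflow
      3-2 Data Collection and Preprocessing
        3-2.1 Corpus Construction and Processing
          3-2.1-1 Building the Forum Corpus
          3-2.1-2 Building the Camera-Introduction Corpus
        3-2.2 Sentence Segmentation and Part-of-Speech Tagging
      3-3 Building the Document Content Index
        3-3.1 Overview of Lucene
        3-3.2 Building the Document Index
    Chapter 4 Statistical Term Analysis Methods
      4-1 Term Types in the Corpora
      4-2 Within-Corpus Term Analysis
        4-2.1 Term Frequency
        4-2.2 Kullback-Leibler Divergence
        4-2.3 Mutual Information
      4-3 Cross-Corpus Term Analysis
        4-3.1 KLCF divergence
        4-3.2 Jensen-Shannon Divergence
    Chapter 5 Term Importance Evaluation
      5-1 Within-Corpus Term Importance Analysis
        5-1.1 Term Frequency Analysis
        5-1.2 Kullback-Leibler Divergence Analysis
        5-1.3 Mutual Information Analysis
      5-2 Cross-Corpus Term Importance Analysis
        5-2.1 KLCF divergence Analysis
        5-2.2 Jensen-Shannon Divergence Analysis
        5-2.3 Frequency Lists Filtering
      5-3 Term Importance Measure Function
        5-3.1 Within-Corpus Term Importance Information
        5-3.2 Combining Cross-Corpus Term Importance Information
    Chapter 6 Experimental Results and Discussion
      6-1 Experimental Data Sources
      6-2 Experimental Evaluation
        [Experiment 1] Proportion of Within-Corpus Terms That Are Feature Terms
        [Experiment 2] Accuracy of Within-Corpus Term Information in the Forum Corpus
        [Experiment 3] Accuracy of Within-Corpus Term Information in the Camera-Introduction Corpus
        [Experiment 4] Accuracy of Cross-Corpus Term Information
    Chapter 7 Conclusions and Future Work
    References

    List of Tables
    Table 5.1 Top 20 terms for the four major brands in the forum corpus
    Table 5.2 Top 20 terms for the four major brands in the camera-introduction corpus
    Table 5.3 Top 20 terms by KL divergence for the four major brands in the forum corpus
    Table 5.4 Top 20 terms by KL divergence for the four major brands in the camera-introduction corpus
    Table 5.5 Top 20 terms by Mutual Information for the four major brands in the forum corpus
    Table 5.6 Top 20 terms by Mutual Information for the four major brands in the camera-introduction corpus
    Table 5.7 Top 20 terms by KLCF for the four major brands across corpora
    Table 5.8 Top 20 terms by DJS for the four major brands across corpora
    Table 5.9 Top 20 common terms ranked by FL(t)
    Table 6.1 Number of sentences and terms in the corpora
    Table 6.2 Number of feature terms
    Table 6.3 Coverage ratio of feature terms within the corpora
    Table 6.4 Top 20 terms found by each measure function, using Sony as an example

    List of Figures
    Figure 3.1 System flowchart
    Figure 3.2 Brand pages of the camera forum
    Figure 3.3 Format of discussion posts in the forum
    Figure 3.4 Post content on a forum web page
    Figure 3.5 Camera-introduction website
    Figure 3.6 Text of a camera-introduction web page
    Figure 3.7 Sentence segmentation and part-of-speech tagging results for forum documents
    Figure 3.8 Sentence segmentation and part-of-speech tagging results for camera-introduction documents
    Figure 3.9 Structure of the Document object
    Figure 3.10 Document retrieval flowchart
    Figure 5.1 List of common words
    Figure 6.1 Accuracy of individual term importance analyses within the forum corpus
    Figure 6.2 Accuracy of pairwise combinations of term importance methods within the forum corpus
    Figure 6.3 Accuracy of combinations of multiple term importance methods within the forum corpus
    Figure 6.4 Comparison of accuracy for camera feature terms in the forum corpus
    Figure 6.5 Accuracy of individual term importance methods within the camera-introduction corpus
    Figure 6.6 Accuracy of pairwise combinations of term importance methods within the camera-introduction corpus
    Figure 6.7 Accuracy of combinations of multiple term importance methods within the camera-introduction corpus
    Figure 6.8 Comparison of accuracy for camera feature terms in the camera-introduction corpus
    Figure 6.9 Accuracy of individual cross-corpus term importance analysis methods
    Figure 6.10 Accuracy of cross-corpus term importance combining two analysis methods
    Figure 6.11 Accuracy of cross-corpus term importance combining multiple analysis methods
    Figure 6.12 Accuracy of each importance measure function
    Figure 6.13 Accuracy for camera feature terms under each importance measure function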

