
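Graduate student: 許翼麟
Thesis title: 以討論人物隱含類別輔助論壇討論句自動分類之研究
Sentence Classification for Web Forum by Using the Implicit Categories of Discussed Persons
Advisor: 柯佳伶
Koh, Jia-Ling
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of publication: 2010
Academic year of graduation: 98 (ROC calendar)
Language: Chinese
Number of pages: 50
Keywords (Chinese): 分類 (classification), 棒球 (baseball), 論壇 (forum), 人物 (person), 新聞擴展句 (news expansion sentence)
Keywords (English): classification, baseball, forum
Thesis type: Academic thesis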
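  • Web forums are platforms where users freely share and exchange opinions, and they are filled with all kinds of discussion. A popular baseball game can easily draw more than a hundred replies in a single day, so users cannot easily find the viewpoints that interest them in the large volume of discussion. The method studied in this thesis classifies discussion sentences according to the type of person discussed in the forum content and the degree of association of the words used. If a person's name appears in a discussion sentence, the name can be used to query recent news articles as a source of expansion data. Discussion sentences are automatically sorted into pitcher sentences and fielder sentences. Feature selection is an important part of the classification process. During feature selection, we use statistical methods to obtain features that adequately represent each class, build feature vectors from them, and train on these feature vectors to build a classification model. With these tools, each discussion sentence in the forum is assigned to either the pitcher-sentence class or the fielder-sentence class. The experimental results show that the classes determined by the proposed system are highly consistent with the actual classes, and that using news expansion sentences yields even better classification results.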
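    The pipeline described above (news expansion, feature vectors, classification) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the table of contents lists both SVM and K-nearest-neighbor classifiers, and this sketch uses only a simple cosine-similarity nearest-neighbor vote. All function names (`expand_with_news`, `to_vector`, `knn_classify`), the feature words, the player names, and the toy sentences are hypothetical.

```python
import math
from collections import Counter

def expand_with_news(tokens, known_names, news_sentences):
    """Append the tokens of news sentences that mention a name found in `tokens`."""
    names = [t for t in tokens if t in known_names]
    expanded = list(tokens)
    for news in news_sentences:
        if any(name in news for name in names):
            expanded.extend(news)
    return expanded

def to_vector(tokens, feature_words):
    """Term-count vector over a fixed, ordered list of feature words."""
    counts = Counter(tokens)
    return [counts[w] for w in feature_words]

def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_classify(vector, training, k=3):
    """Majority label among the k training vectors most similar to `vector`."""
    nearest = sorted(training, key=lambda item: cosine(vector, item[0]),
                     reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical feature words, training sentences, and news (not thesis data).
features = ["strikeout", "fastball", "homer", "rbi"]
training = [
    (to_vector(["strikeout", "fastball"], features), "pitcher"),
    (to_vector(["fastball", "strikeout", "strikeout"], features), "pitcher"),
    (to_vector(["homer", "rbi"], features), "fielder"),
    (to_vector(["homer", "homer", "rbi"], features), "fielder"),
]
news = [["Wang", "strikeout", "fastball"], ["Chen", "homer"]]

# The forum sentence has no feature words of its own; the name "Wang"
# pulls in a news sentence that does.
sentence = ["Wang", "looked", "great", "today"]
expanded = expand_with_news(sentence, {"Wang", "Chen"}, news)
label = knn_classify(to_vector(expanded, features), training, k=3)
```

    The point of the toy example is that a short forum sentence may carry no class-specific words at all, so its feature vector is empty until the news sentences retrieved by the person's name are appended.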

    Web forums are platforms where users share their opinions, and they host many kinds of discussion. A popular baseball game can draw hundreds of replies, which makes it hard for users to find the viewpoints they are interested in among the large number of posts. In this thesis, discussion sentences are classified according to the person being discussed or the degree of association of the words in the content. Each discussion sentence is classified as either a pitcher-related sentence or a batter-related sentence. Feature selection plays an important part in the classification process: we select representative features using the Jensen-Shannon divergence and build feature vectors from them. Using these vectors, our system learns a classification model, through which the discussion sentences in the forum are assigned to the appropriate class. The experimental results show that the class assigned by the proposed method is highly consistent with the actual class. In addition, if a person's name appears in a sentence, we query the latest news to obtain news expansion sentences, and using these news expansion sentences yields better classification results.
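    As a rough illustration of the feature selection step, the sketch below scores each word by its contribution to the Jensen-Shannon divergence between the word distributions of the two classes. The abstract does not state exactly how the divergence is applied, so the per-word scoring, the function names (`word_distribution`, `js_word_scores`), and the toy sentences are assumptions made for illustration only.

```python
import math
from collections import Counter

def word_distribution(sentences, vocab):
    """Relative frequency of each vocabulary word over tokenized sentences."""
    counts = Counter(w for s in sentences for w in s)
    total = sum(counts[w] for w in vocab)
    return {w: counts[w] / total for w in vocab}

def js_word_scores(pitcher_sents, batter_sents):
    """Per-word contribution to the Jensen-Shannon divergence (base-2 logs).

    Summing the returned scores gives the JS divergence between the two
    class distributions. Assumes both sentence lists contain some words.
    """
    vocab = {w for s in pitcher_sents + batter_sents for w in s}
    p = word_distribution(pitcher_sents, vocab)
    q = word_distribution(batter_sents, vocab)
    scores = {}
    for w in vocab:
        m = 0.5 * (p[w] + q[w])  # mixture probability, positive for every vocab word
        score = 0.0
        if p[w] > 0:
            score += 0.5 * p[w] * math.log2(p[w] / m)
        if q[w] > 0:
            score += 0.5 * q[w] * math.log2(q[w] / m)
        scores[w] = score
    return scores

# Toy, hypothetical tokenized sentences (not data from the thesis).
pitcher = [["fastball", "strikeout", "inning"], ["strikeout", "bullpen"]]
batter = [["homer", "swing", "inning"], ["homer", "rbi"]]

scores = js_word_scores(pitcher, batter)
# Keep the highest-scoring words as candidate feature words.
top_features = sorted(scores, key=scores.get, reverse=True)[:3]
```

    In this toy data, a word such as "inning" that is equally frequent in both classes scores zero, while words concentrated in one class score highest, which is the behavior one would want from a class-discriminating feature score.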

    List of Tables
    List of Figures
    Chapter 1 Introduction
      1-1 Research Motivation
      1-2 Research Objectives
      1-3 Research Scope and Limitations
      1-4 Organization of the Thesis
    Chapter 2 Literature Review
      2-1 Data Classification Methods
      2-2 Document Classification
      2-3 Sentence Classification
      2-4 Named Entity Recognition
    Chapter 3 System Architecture and Workflow
      3-1 System Architecture and Workflow
    Chapter 4 Data Collection and Preprocessing
      4-1 Training Data Collection
        4-1.1 News Content Download
        4-1.2 Sentence Segmentation
        4-1.3 Extraction of Training Sentences for Classification
      4-2 Training Set Preprocessing
      4-3 Test Set Collection and Preprocessing
    Chapter 5 Discussion Sentence Classification Method
      5-1 Building Feature Vectors
        5-1.1 Feature Word Selection
        5-1.2 Building Sentence Feature Vectors
      5-2 Building the Classification Model
        5-2.1 Support Vector Machine (SVM) Classifier
        5-2.2 K-Nearest Neighbor Classifier
      5-3 Discussion Sentence Classification Method
    Chapter 6 Experimental Setup and Analysis
      6-1 Experimental Data Sources
      6-2 Experimental Results
    Chapter 7 Conclusions and Future Work
    References

