
Author: 范喬彬
Chou-Bin Fan
Thesis title: 以字詞類別概念輔助部落格文件分群之研究
An Effective Approach for Weblog Documents Clustering based on Categorical Concepts of Words
Advisor: 柯佳伶
Koh, Jia-Ling
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of publication: 2010
Academic year of graduation: 98 (ROC calendar)
Language: Chinese
Number of pages: 69
Chinese keywords: 資料探勘 (data mining), 部落格文章分群 (blog post clustering), 類別特徵向量 (category feature vector)
English keywords: Data Mining, Blog Post Clustering, Category Feature Vector
Thesis type: Academic thesis
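  • This thesis uses the ODP (Open Directory Project) directory structure as an external knowledge source. Through ODP's query function, we obtain the categories to which each word belongs and use them as features; the categories of all words in a post, together with their weights, are combined to build the post's feature vector. The aim is to overcome the shortcomings of feature vectors built purely from extracted keywords and thereby achieve better topic-based post clustering. In addition, blogs differ in how concentrated the topics of their posts are, and a common problem when clustering with the K-Means algorithm is not knowing how to set an appropriate number of clusters K. This thesis therefore also proposes a way to determine the number of clusters and the initial representative points for K-Means automatically from the feature vectors of the posts in the collection, making blog post clustering more automated.
    We applied the category feature vector method and the term feature vector method to post clustering experiments and evaluated the results with Accuracy and Purity. The evaluation shows that, for most blogs in the test set, the category feature vector method yields better clustering results than the term feature vector method. The experiments also show that combining the category feature vectors built from post titles and from compound words further improves clustering.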
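The idea of turning word weights into a category feature vector can be sketched as follows. This is only a minimal illustration, not the thesis's actual formula: it assumes each word's weight (for example a TF-IDF value) is simply added to every category the word maps to. The function name `category_feature_vector`, the `term_categories` lookup table (a stand-in for the result of an ODP query), and all sample data are hypothetical.

```python
from collections import defaultdict


def category_feature_vector(term_weights, term_categories):
    """Build a category feature vector for one post.

    Assumption (not taken from the thesis): the weight of a category is
    the sum of the weights of all terms that map to that category.
    """
    vector = defaultdict(float)
    for term, weight in term_weights.items():
        # Terms with no known category contribute nothing.
        for category in term_categories.get(term, []):
            vector[category] += weight
    return dict(vector)


# Hypothetical term weights for one post (e.g. TF-IDF values).
term_weights = {"guitar": 0.5, "concert": 0.25, "python": 0.25}

# Hypothetical stand-in for the categories an ODP query would return.
term_categories = {
    "guitar": ["Arts/Music"],
    "concert": ["Arts/Music", "Arts/Performing_Arts"],
    "python": ["Computers/Programming", "Science/Biology"],
}

vec = category_feature_vector(term_weights, term_categories)
for category, weight in sorted(vec.items()):
    print(category, weight)
```

In this sketch, two different words that share a category ("guitar" and "concert") reinforce the same dimension, which is the kind of overlap a purely keyword-based vector would miss.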

    Our approach uses the ODP (Open Directory Project) directory structure as an external knowledge source. Using ODP's query function, we obtain the categories to which each word belongs and use them as the word's features. The category feature vector of a post is built by merging the categories of all words in the post, weighted by the corresponding word weights. This is intended to overcome the drawbacks of feature vectors built from keyword frequency alone and to yield better topic-based clustering. We also propose a method for choosing the value of K in the K-means algorithm: it takes the category relations among the posts of a blog into account, which makes clustering more automatic.
    We compare the clustering results of our approach with those of a term-based feature vector approach using the Purity and Accuracy measures. The experiments show that our approach outperforms the term-based approach on most blogs in the test set. We also build additional feature vectors from the titles and phrases of posts, and show that these two features further improve clustering.
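For readers unfamiliar with the Purity measure, the following minimal sketch uses the standard definition: each cluster is credited with the size of its largest ground-truth class, and the credits are summed and divided by the number of posts. The record does not say how the thesis defines Accuracy, so that measure is left out. The function name and the sample labels are hypothetical.

```python
from collections import Counter


def purity(cluster_labels, true_labels):
    """Standard purity: sum of majority-class counts per cluster, over N."""
    if len(cluster_labels) != len(true_labels):
        raise ValueError("label lists must have the same length")
    if not cluster_labels:
        raise ValueError("label lists must not be empty")
    # Count ground-truth classes inside each cluster.
    members = {}
    for cluster, truth in zip(cluster_labels, true_labels):
        members.setdefault(cluster, Counter())[truth] += 1
    majority_total = sum(max(counts.values()) for counts in members.values())
    return majority_total / len(true_labels)


# Hypothetical clustering of six posts into two clusters.
clusters = [0, 0, 0, 1, 1, 1]
topics = ["travel", "travel", "food", "food", "food", "travel"]
score = purity(clusters, topics)
print(round(score, 3))
```

Purity always reaches 1.0 when every post is placed in its own cluster, so it is usually reported alongside a second measure, as is done here with Accuracy.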

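    List of Tables
    List of Figures
    Chapter 1 Introduction
        1-1 Research Motivation
        1-2 Research Objectives
        1-3 Scope and Limitations
        1-4 Thesis Organization
    Chapter 2 Literature Review
        2-1 Data Clustering Methods
        2-2 Document Relevance Evaluation Methods
        2-3 Using External Knowledge to Assist Document Mining
        2-4 Document Clustering
    Chapter 3 System Architecture and Workflow
    Chapter 4 Data Preprocessing
        4-1 Downloading Blog Posts
        4-2 Blog Post Preprocessing
    Chapter 5 Building Category Feature Vectors for Posts
        5-1 Computing Term TF-IDF
        5-2 Extracting ODP Category Features for Terms
        5-3 Computing Category Feature Values
    Chapter 6 Blog Post Clustering Method
        6-1 Strategy for Selecting Initial Centroids
        6-2 Clustering Posts with the K-means Algorithm
    Chapter 7 Experimental Results and Discussion
        7-1 Dataset Description
        7-2 Evaluating the Effectiveness of ODP Category Features
        7-3 Evaluating Blog Post Clustering Results
    Chapter 8 Conclusions and Future Work
        8-1 Conclusions
        8-2 Future Work
    References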

