簡易檢索 / 詳目顯示

研究生: 謝俊緯
Chun-Wei Hsieh
論文名稱: 網頁點選資料流中最近瀏覽樣式探勘方法之研究
Mining Recent Path Traversal Patterns on Webclick Streams
指導教授: 柯佳伶
Koh, Jia-Ling
學位類別: 碩士
Master
系所名稱: 資訊教育研究所
Graduate Institute of Information and Computer Education
論文出版年: 2006
畢業學年度: 94
語文別: 中文
論文頁數: 66
中文關鍵詞: 資料探勘資料流瀏覽樣式
英文關鍵詞: Data Mining, Data Streams, Path Traversal Patterns
論文種類: 學術論文
相關次數: 點閱:255下載:2
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  • 摘要

    網頁點選資料流中最近瀏覽樣式探勘方法之研究

    謝俊緯

    從歷史資料中探勘出的常見瀏覽樣式代表長期的現象,未必能反應最近的趨勢,通常網站經營者對最近使用者的瀏覽樣式會比較感興趣,因此本論文提出從網頁點選資料流中探勘最近封閉常見瀏覽樣式的方法,稱為RPTP(mining Recent Path Traversal Patterns on webclick streams)演算法,其採用滑動視窗及Lossy Counting方法的觀念,只保留最近固定數目之連接記錄中的常見及潛在常見瀏覽樣式,因此能以動態探勘方式,有效率地從網頁點選資料流中探勘出瀏覽樣式。本方法並未保留原始資料,只需記錄最近常見瀏覽樣式與最近潛在瀏覽樣式資訊。此外,本論文方法探討從儲存結構中有效率探勘出封閉瀏覽樣式的技術,以避免探勘結果中的重覆資訊,讓探勘使用者能夠更容易地分析結果。我們並結合封閉樣式的觀念,減少所需儲存樣式的數量。由實驗結果顯示,本方法可在合理的儲存空間下需求下快速進行最近常見瀏覽樣式探勘,且和相關論文相較,可較快速反應出資料流中最近常見瀏覽樣式的改變。

    Abstract

    Mining Recent Path Traversal Patterns on Webclick Streams

    by

    Chun-Wei Hsieh

    Frequent traversal patterns extracted from the history data represent the mining results of long term but not necessary the recent trend. However, the web administrators are usually interesting in the traversal path of recent users. Therefore, an algorithm, called RPTP, for mining recent path traversal patterns on webclick streams is proposed in this thesis. In our approach, the lossy counting techniques are applied to maintain frequent and semi-frequent patterns in a sliding window of recent user sessions. Hence, frequent patterns on webclick streams are discovered efficiently in a dynamic way. It is not necessary for RPTP to store the original data. Instead, the appearing information of recent frequent and semi-frequent patterns is recorded. Moreover, the strategies for mining closed frequent patterns from the constructed data structures are provided to avoid generating redundant information in the mining result. Accordingly, the concept of closed patterns is applied to reduce the number of maintained patterns. The experimental results show that the RPTP achieves an efficient execution time under a reasonable memory requirement. Furthermore, by comparing with the related work, RPTP provides a shorter response time to reflect the change of frequent traversal patterns on webclick streams.

    目 錄 附表目錄 ii 附圖目錄 iii 第一章 緒論 1 1.1背景與研究動機 1 1.2相關文獻探討 3 1.3論文方法 14 1.4論文架構 15 第二章 相關名詞與問題定義 16 第三章 探勘最近封閉常見瀏覽樣式 22 3.1取出最大向前瀏覽序列 22 3.2瀏覽樣式儲存結構 25 3.3樣式儲存結構維護方法 27 3.4探勘最近常見瀏覽樣式 29 第四章 儲存結構改進方法 39 4.1減少重覆資訊儲存 39 4.2樹狀網頁架構處理方式 44 第五章 演算法執行效率評估 50 5.1實驗環境 50 5.2資料產生方式 50 5.3實驗評估 51 第六章 結論及未來研究方向 64 參考文獻 65 附表目錄 表2.1 mfr_WSW(5)中的瀏覽樣式 20 表2.2 mfr_WSW(6)中的瀏覽樣式 20 表4.1 最大向前瀏覽序列 40 表5.1 實驗資料集的參數說明 51 表5.2 不同最大誤差值設定下探勘出的樣式累計總數 58 表5.3 演算法所需記憶體 59 附圖目錄 圖1.1 SPADE的支持度計數 5 圖1.2 相同字首樹 6 圖1.3 使用者瀏覽記錄 7 圖1.4 KR串列與相同字首樹 9 圖2.1 網頁點選資料流範例 19 圖2.2 最大向前瀏覽序列 19 圖3.1 取出最大向前瀏覽序列演算法 23 圖3.2 取出最大向前瀏覽序列 24 圖3.3 最近瀏覽樣式樹 26 圖3.4 瀏覽樣式維護演算法 28 圖3.5 檢查封閉瀏覽樣式虛擬程式碼 30 圖3.6 加入S1的瀏覽序列後之儲存結構內容 32 圖3.7 加入S2的瀏覽序列後之儲存結構內容 33 圖3.8 加入S3的瀏覽序列後之儲存結構內容 33 圖3.9 加入S4的瀏覽序列後之儲存結構內容 34 圖3.10 加入S5的瀏覽序列後之儲存結構內容 34 圖3.11 加入S5並進行探勘後的儲存結構內容 35 圖3.12 最近封閉常見瀏覽樣式檢查過程 36 圖3.13 刪除S1的記錄後之儲存結構內容 37 圖3.14 加入S6的瀏覽序列後之儲存結構內容 37 圖3.15 加入S6並進行探勘後的儲存結構內容 38 圖3.16 刪除S2的記錄後之儲存結構內容 38 圖4.1 加入S1的瀏覽序列後之儲存結構內容 42 圖4.2 加入S2的瀏覽序列後之儲存結構內容 42 圖4.3 加入S3的瀏覽序列後之儲存結構內容 43 圖4.4 樹狀結構網頁資料 44 圖4.5 加入S1的瀏覽序列後之儲存結構內容 47 圖4.6 加入S2的瀏覽序列後之儲存結構內容 47 圖4.7 加入S3的瀏覽序列後之儲存結構內容 48 圖4.8 加入S4的瀏覽序列後之儲存結構內容 48 圖4.9 加入S5的瀏覽序列後之儲存結構內容 49 圖5.1 探勘結果的錯誤率 52 圖5.2 探勘結果的漏失率 53 圖5.3 變動視窗大小對累計執行時間之比較 54 圖5.4 變動視窗大小對單位執行時間之比較 55 圖5.5 變動最小支持度門檻值對累計執行時間之比較 56 圖5.6 變動最小支持度門檻值對單位執行時間之比較 56 圖5.7 變動最大誤差設定值對累計執行時間之比較 57 圖5.8 變動最大誤差設定值對單位執行時間之比較 58 圖5.9 改進演算法對累計執行時間之比較 59 圖5.10 實驗一資料集產生示意圖 60 圖5.11 最近常見瀏覽樣式改變之反應時間比較 61 圖5.12 實驗二資料集產生示意圖 62 圖5.13 最近常見瀏覽樣式改變之反應時間比較二 63

    參考文獻
    [1] R. Agrawal and R. Srikant, “Fast Algorithm for Mining Association Rules in Large Databases,” in Proceeding of the 20th International Conference on Very Large Data Bases (VLDB'94), page 487-499, Santiago de Chile, Chile, September 12-15, 1994.
    [2] R. Agrawal, and R. Srikant, “Mining Sequentila Patterns,” in Proceeding of the 11th IEEE International Conference on Data Engineering (ICDE'95), page 3-14, Taipei, Taiwan, March 6-10, 1995.
    [3] R. Agrawal, and R. Srikant, “Mining Sequential Patterns: Generalizations and Performance Improvements,” in Proceeding of the 5th International Conference on Extending DataBase Technology (EDBT'96), Avignon, France, March 25-29, 1996.
    [4] M.-S. Chen, J. S. Park, and P. S. Yu, "Efficient Data Mining for Path Traversal Patterns," IEEE Transaction on Knowledge and Data Engineering, Vol. 10, No. 2, page 209-221, April, 1998.
    [5] Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz, ”Moment: Maintaining Closed Frequent Itemsets over a Stream Sliding Window,” in Proceeding of the 4th IEEE International Conference on Data Mining (ICDM'04), Brighton, UK, November 01–04, 2004.
    [6] C. Giannella, J. Han, J. Pei, X. Yan, and P. S. Yu, “Mining Frequent Patterns in Data Streams at Multiple Time Granularities,” in Proceeding of the National Science Foundation Workshop on Next Generation Data Mining (NGDM02), Baltimore, November 1-3, 2002.
    [7] H.-F. Li, S.-Y. Lee, and M.-K. Shan, "DSM-TKP: Mining Top-K Path Traversal Patterns over Web Click-Streams," in Proceeding of the 2005 IEEE/WIC/ACM International Joint Conference on Web Intelligence (WI’05), France, September 19-22, 2005.
    [8] G. S. Manku, and R. Motwani, “Approximate frequency counts over data Streams,” in Proceeding of the 28th International Conference on Very Large Data Bases (VLDB’02), page 346-357, Hong Kong, China, August 20-23, 2002.
    [9] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. C. Hsu, “PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth,” in Proceeding of the 17th IEEE International Conference on Data Engineering (ICDE'01), page 215-226, Heidelberg, Germany, April 2-6, 2001.
    [10] M. Seno and G. Karypis, “SLPMiner: An Algorithm for Finding Frequent Sequential Patterns Using Length-Decreasing Support Constraint,” in Proceeding of the 2nd IEEE International Conference on Data Mining (ICDM'02), Maebashi TERRSA, Maebashi City, Japan, December 9-12, 2002.
    [11] P. Tzvetkov, X. Yan, and J. Han, “TSP: Mining Top-K Closed Sequential Patterns,” in Proceeding of the 3rd IEEE International Conference on Data Mining (ICDM'03), Melbourne, Florida, USA, November 19-22, 2003.
    [12] A. Udechukwu, K. Barker, and R. Alhajj ,“Maintaining Knowledge-Bases of Navigational Patterns from Streams of Navigational Sequences,” in Proceeding of the 15th IEEE International Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications (RIDE-SDMA'05), page 37-44, Tokyo, Japan, April 3-7, 2005.
    [13] X. Yan, J. Han, and R. Afshar, “CloSpan: Mining Closed Sequential Patterns in Large Datasets,” in Proceeding of SIAM International Conference on Data Mining (SDM'03), San Francisco, California, USA, May 1-3, 2003.
    [14] M. J. Zaki, “SPADE: An Efficient Algorithm for Mining Frequent Sequences,” in Proceeding of Machine Learning Journal, special issue on Unsupervised Learning (Doug Fisher, ed.), Vol. 42 Nos. 1/2, page 31-60, Jan/Feb 2001.

    QR CODE