簡易檢索 / 詳目顯示

研究生: 石舒寧
論文名稱: 近似探勘資料流最近常見資料項集之研究
Approximately Mining Recent Frequent Itemsets on Data Streams
指導教授: 柯佳伶
學位類別: 碩士
Master
系所名稱: 資訊教育研究所
Graduate Institute of Information and Computer Education
論文出版年: 2005
畢業學年度: 93
語文別: 中文
論文頁數: 65
中文關鍵詞: 資料探勘資料流資料項集項目集
論文種類: 學術論文
相關次數: 點閱:373下載:1
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  • 近來有很多實際的應用,其資料是以資料流的形式產生。本論文針對資料流探勘最近常見資料項集的問題,提出兩個近似探勘的方法,分別稱為ATS演算法(Average TimeStamp mining method)及FCP演算法(Frequency Changing Point mining method)。這兩個演算法對資料流裡每筆交易皆只需掃描處理一次,記錄下可能為最近常見資料項集的出現摘要資訊。由摘要資訊結構中的資訊,不需儲存目前視窗中的交易資料,即可更新並刪除不屬於目前視窗的過時資訊,並從中快速地近似找出最近常見資料項集。其中FCP演算法記錄資料項集在出現頻率改變點的累計次數,可用以估算出該資料項集在目前視窗中的支持度,近似探勘出目前交易視窗中的最近常見資料項集。由實驗結果顯示,採用FCP演算法來近似探勘資料流中的最近常見資料項集,可達到極高的正確率(precision)及回覆率(recall)。且相較於以往所提出的SW演算法,可有效率減少資料維護及探勘執行時間,且明顯減少所需記憶體大小。

    Recently, the data of many real applications are generated in the form of data streams. In this thesis, two approximately mining methods, named ATS (Average TimeStamp mining method) and FCP (Frequency Changing Point mining method), are proposed for finding recent frequent itemsets on data streams. In these two approaches, each transaction in the data stream is examined at most once and the appearing summarization of possible recent frequent itemsets is recorded. According to the data structure of appearing summarization, the outdated data which does not belong to the current window can be deleted without needing to store the transactions in the current window. Then recent frequent itemsets can be mined approximately and efficiently. Among the two approaches, FCP algorithm stores the accumulative counts for itemsets at their frequency changing points. Such information can be used to estimate the supports of itemsets in current window and then the recent frequent itemsets are mined approximately. The experimental results show that, FCP algorithm can achieve high precision and recall when used for mining recent frequent itemsets in data streams. Moreover, by comparing with the existing SW algorithm, it shows FCP algorithm improves the efficiency of data maintenance and mining process significantly and reduces the memory usage further.

    附表目錄 ii 附圖目錄 iii 第一章 緒論 1 1-1 背景與研究動機 1 1-2 相關研究 3 1-3 論文方法 8 1-4 論文架構 9 第二章 問題定義及背景知識 10 2-1 問題定義 10 2-2 FP-tree結構及其動態調整方法 12 第三章 出現摘要資訊儲存方法 22 3-1 平均時戳法 22 3-2 出現頻率改變點法 30 第四章 類FP-tree出現摘要儲存結構探勘法 42 4-1 平均時戳探勘法(Average TimeStamp mining method)42 4-2 出現頻率改變點探勘法(Frequency Changing Point mining method)46 第五章 演算法執行效率評估 52 5-1 資料產生方式 52 5-2 實驗評估 53 5-3 實驗結果總結 60 第六章 結論 61 參考文獻 62

    [1] Hua-Fu Li, Suh-Yin Lee, and Man-Kwan Shan, “An Efficient Algorithm for Mining Frequent Itemests over the Entire History of Data Streams, ” Accepted for publication in the Proc. of the First International Workshop on Knowledge Discovery in Data Streams, to be held in conjunction with the 15th European Conference on Machine Learning (ECML 2004).
    [2] J. K. Koh and Shui-Feng Shieh, “An Efficient Approach for Maintaining Association Rules based on Adjusting FP-tree Structure, ” Lecture Notes in Computer Science: DASFAA’04 Database Systems for Advanced Applications, Springer-Verlag, 2004.
    [3] Yun Chi, Haixun Wang, Philip S. Yu, and Richard R. Muntz, “Moment: Maintaining Closed Frequent Itemsets over a Stream Sliding window, ” in Proc. of the 4th IEEE International Conference on Data Mining, pages pp. 59-66, 2004.
    [4] C. Jin, W. Qian, C. Sha, J. X. Yu, and A. Zhou, “Dynamically Maintaining Frequent Items Over a Data Stream, ” in Proc. of the 12th ACM International Conference on Information and Knowledge Management, 2003.
    [5] C. Giannella, J. Han, J. Pei, X. Yan and P.S. Yu, “Mining Frequent Patterns in Data Streams at Multiple Time Granularities, ” H. Kargupta, A. Joshi, K. Sivakumar, and Y. Yesha (eds.), Next Generation Data Mining, MIT Press, 2003.
    [6] J. H, Chang and W.S. Lee, “Finding Recent Frequent Itemsets Adaptively over Online Data Streams, ” in Proc. of the 9th ACM International Conference on Knowledge Discovery and Data Ming, 2003.
    [7] J. H. Chang and W. S. Lee, “A Sliding Window Method for Finding Recently Frequent Itemsets over Online Data Streams, ” in Proc. of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.
    [8] Q. Zheng, K. Xu, and S. Ma, “When to Update the Sequential Patterns of Stream Data?, ” in Proc. of the 7th Pacific-Asia In Conference on Knowledge Discovery and Data Mining, 2003.
    [9] C.-Y. Chang, M.-S. Chen, and C.-H. Lee, “Mining General Temporal Association Rules for Items with Different Exhibition Periods, ” in Proc. of the 2nd IEEE International Conference on Data Mining, 2002.
    [10] G.. S. Manku and R. Chen Motwani, “Approximate frequent counts over data Streams, ” in Proc. of the 28th International Conference on Very Large Database, Hong Kong, China Aug, 2002.
    [11] K Wang, L. Tang, J. Han, and J. Liu, “Top Down FP-Growth for Association Rule Mining, ” in Proc. of the 6th Pacific Area Conference on Knowledge Discovery and Data Mining, May 6-8, Taipei, Taiwan, PAKDD-2002.
    [12] Q. Zheng, K. Xu, W. Lv, and S. Ma, “The Algorithm of Updating Sequential Patterns,” in Proc. of the 5th International Workshop on High Performance Data Mining (HPDM), in conjunction with 2nd SIAM Conference on Data Mining, USA, 2002.
    [13] C.-H. Lee, C-R. Lin and M. S. Chen, “Sliding Window Filtering: An Efficient Algorithm for Incremental Mining, ” in Proc. of the 10th International Conference on Information and Knowledge Management, pages 263-270, Atlanta, GE, Nov. 2001.
    [14] J. Han, J. Pei, and Y. Yin, “Mining Frequent Patterns without Candidate Generation, ” in Proc. of the ACM SIGMOD International Conference on Management of Data, pages 1-12, Dallas, Texas, USA, 2000.
    [15] S. Thomas, S. Bodagala, K. Alsabti, and S. Ranka, “An Efficient Approach for the Incremental Association Rule Mining, ” in Proc. of the 3rd Incremental conference on Knowledge Discovery and Data Mining (KDD97), New Port Beach, California, August, pages 263-266, 1997.
    [16] R. Agrawal and R. Srikant, “Fast Algorithm for Mining Association Rule in Large Databases,” in Proc. of the 20th International Conference on Very Large Database, 1994.

    QR CODE