研究生: |
李蕙君 Huei-Jyun Li |
---|---|
論文名稱: |
資料流最近常見項目集變動探勘之研究 Mining Recently Frequent Itemsets Change over Data Streams |
指導教授: |
柯佳伶
Koh, Jia-Ling |
學位類別: |
碩士 Master |
系所名稱: |
資訊工程學系 Department of Computer Science and Information Engineering |
論文出版年: | 2009 |
畢業學年度: | 97 |
語文別: | 中文 |
論文頁數: | 64 |
中文關鍵詞: | 資料流 、滑動視窗 、資料探勘 |
英文關鍵詞: | data stream, sliding window, data mining |
論文種類: | 學術論文 |
相關次數: | 點閱:150 下載:4 |
分享至: |
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報 |
本論文針對資料流滑動視窗的模型提出一個探勘狀態變動項目集的方法,稱為CV-SCD(Cross-Verify Status Change Detection)演算法。本方法主要利用兩棵稱為Base-Tree及Delta-Tree的相同字首樹之樹狀結構,儲存在任一時間點t時滑動視窗中所有交易資料,以及從t到t+1之間新增及過時的交易資料,並利用Base-Tree及Delta-Tree的資訊判斷出狀態變動項目集,再同時對兩棵樹遞迴建立包含特定項目的條件樹,以探勘出更長的狀態變動項目集。本論文對固定區間長度探勘出的狀態變動項目集儲存成狀態變動資料項集快照,並採用金字塔式時間框架的結構來儲存快照,提供可由使用者指定特定時間區間對其中各狀態變動項目集的變動情形進行相對特性分析。實驗結果顯示,當新增及過時的交易資料相對於滑動視窗資料為少量,或是資料集中包含之項目種類較多,或是在支持度小的情況下,CV-SCD演算法相較於以FP-growth探勘出常見項目集後再進行狀態變動項目集比對可顯著增進執行效率。
In this thesis, a method for discovering recently status-changed itemsets over data streams is proposed, which is named the CV-SCD (Cross-Verify Status Change Detection) algorithm. In this algorithm, two prefix tree structures, which are named Base-Tree and Delta-Tree, respectively, are constructed for maintaining the transaction data in a sliding window at time t, and the newly inserted and removed transactions at time t+1. The data stored in the Base-Tree and Delta-Tree is used to discover the status-changed itemsets at t+1 with respect to t. By performing cross-verification on Delta-Tree and Base-Tree, then by constructing conditional Delta-Tree and conditional Base-Tree recursively, all the status-changed itemsets are discovered in depth-first search. The discovered status-changed itemsets within a fixed time interval are stored in a snapshot of status-changed itemsets. Accordingly, given a specific period of time interval, the changing characteristics of itemsets in this interval can be compared relatively.
The experiment results show that FP-growth to discover all the frequent itemsets at each time point and then performing itemsets matching to get status-changed itemsets, the CV-SCD algorithm provides significant improvement on execution time when the added and expired transactions are few, or there are many different items in transaction data, or the support is small.
[1] J. Han, J. Pei, and Y. Yin, “Mining Frequent Patterns without Candidate Generation, ” in Proc. of ACM SIGMOD Int. Conf. on Management of data, 2000.
[2] Y. Chi, H. Wang , P. S. Yu ,and R. R. Muntz,” Moment: Maintaining closed frequent itemsets over a stream sliding window,” In Proc. of Int. Conf. on Data Mining and Knowledge Discovery, 2004.
[3] N.Jiang and L.Gruenwald, ”CFI-Stream: mining closed frequent itemsets in data streams,” In Proc. of the 12th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining,2006.
[4] H. J. Woo, W. S. Lee, “estMax: Tracing maximal frequent itemsets over online data streams,” in Proc. of the 2007 Seventh IEEE Int. Conf. on Data Mining, 2007.
[5] B. Mozafari, H. Thakkar, C. Zaniolo, “Verifying and Mining Frequent Patterns from Large Windows over Data Streams,” in Proc. of the 25th Int. Conf. on Data Engineering, 2008.
[6] M. Feng, G. Dong, J. Li, Y. Tan, L. Wong, “Evolution and Maintenance of Frequent Pattern Space when Transactions are Removed,” in Proc. of 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2007.
[7] S. K. Tanbeer, C. F. Ahmed, B. Jeong, Y. Lee, “CP-Tree: A Tree Structure for Single-Pass Frequent Pattern Mining” in Proc. of 12th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2008.
[8] C. C. Aggarwal, J. Han, J. Wang, P. S. Yu, “A Framework for Clustering Evolving Data Streams,” in Proc. of the 29th Int. Conf. on Very large data bases - Volume 29, 2003.
[9] E. J. Spinosa, A. P. de L. F. de Carvalho, J. Gama, “OLINDDA: A cluster-based approach for detecting novelty and concept drift in data streams, ” in Proc. of the 2007 ACM symposium on Applied computing, 2007.
[10] P. Zhang, X. Zhu, Y. Shi, “Categorizing and Mining Concept Drifting Data Streams, ” in Proc. of the 14th ACM SIGKDD Int. Conf. on Knowledge discovery and data mining, 2008.
[11] S. K. Tanbeer, C. F. Ahmed, B. S. Jeong, Y. K. Lee, “Efficient Frequent Pattern Mining over Data Streams, ” in Proc. of the 17th ACM Conf. on Information and knowledge management, 2008.
[12] D. Burdick, M. Calimlim, J. Gehrke, “MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases, “ in Proc. of the 18th Int. Conf. on Data Engineering, 2001.
[13] J. Cheng, Y. Ke, W. Ng, “A Survey on Algorithms for Mining Frequent Itemsets over Data Streams, “ in Knowledge and Information Systems, 2006.
[14] K. Li, Y. Wang, M. Elahi, X. Li, H. Wang, “Mining Recent Frequent Itemsets in Data Streams with Optimistic Pruning, “ in in Proc. of 12th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2008.
[15] X. Sun, M. E. Orlowska, X. Li, “Finding Frequent Itemsets in High-Speed Data Streams, ” in Proc. of 6th SIAM International Conference on Data Mining, 2006.
[16] R. Agrawal and R. Srikant, “Fast Algorithm for Mining Association Rule in Large Databases,” in Proc. 20th Int. Conf. Very Large Data Bases, 1994.
[17] M.-S. Chen, J. Han, P. S. Yu, “Data Mining: An Overview from a Database Perspective,” in Proc. of the IEEE Transactions on Knowledge and Data Engineering, Volume 8(6): pages 866-883, 1996.
[18] Manku, G. Singh, “Approximate frequency counts over data streams,” in Proc. of the 28th Int. Conf. on Very Large Data Bases, 2002.
[19] J. H. Chang and W. S. Lee, “A Sliding Window Method for Finding Recently Frequent Itemsets over Online Data Streams,” In Journal of Information Science and Engineering, Vol. 20, No. 4, July 2004.
[20] C. Y. Lin, “Frequent Patterns Change Detection over Data Streams, ” 2007.
[21] http://www.almaden.ibm.com/cs/disciplines/iis/
[22] http://www.ecn.purdue.edu/KDDCUP/