研究生: |
周蓓旻 |
---|---|
論文名稱: |
近似探勘資料流中最近重覆樣式方法之研究 Approximately Mining Recently Repeating Patterns on Data Streams |
指導教授: |
柯佳伶
Koh, Jia-Ling |
學位類別: |
碩士 Master |
系所名稱: |
資訊教育研究所 Graduate Institute of Information and Computer Education |
論文出版年: | 2006 |
畢業學年度: | 94 |
語文別: | 中文 |
論文頁數: | 53 |
中文關鍵詞: | 重覆樣式 、資料流 、資料探勘 |
英文關鍵詞: | Repeating pattern, Data Stream, Data Mining |
論文種類: | 學術論文 |
相關次數: | 點閱:181 下載:6 |
分享至: |
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報 |
重覆樣式可以顯示資料項出現的前後關聯性,應用於資料摘要與資訊預測的依據。愈來愈多的應用之資料輸入方式形成資料流型態,傳統對靜態資料庫探勘重覆樣式的探勘方法已不適用。此外,在資料流的動態環境下,若從整個歷史資料序列中探勘出重覆樣式,則無法反應資料流中的最新趨勢。因此,本論文提出有效率偵測動態資料流中的最近重覆樣式的兩個演算法,分別稱為出現位元序列漸進探勘法及保留樣式估算法。出現位元序列漸進探勘法運用出現位元序列表示法計算出資料樣式的出現次數,並保留最大重覆樣式的出現位元序列資訊。當最近視窗序列內容改變,將運用所記錄之最大重覆樣式的出現位元序列方式資訊進行新重覆樣式之漸進探勘,以減少探勘計算成本。保留樣式估算法則保留重覆樣式、潛在候選樣式、及2-資訊樣式,並運用分段計算方式記錄資料樣式最近出現次數,架構一個可有效率存取樣式的保留樣式資料結構,由最大字首子樣式及最大字尾子樣式估算出未保留樣式的出現次數,達到近似探勘出所有資料流中最近視窗內最近重覆樣式的方式。實驗結果顯示出現位元序列漸進探勘法可有效率的正確探勘出最近視窗序列中的最近重覆樣式,而保留樣式估算法則可以更快速的近似探勘出資料流中的最近重覆樣式。
Repeating patterns represent temporal relations among data items, which could be used for data summarization and data prediction. More and more data of various applications is generated as a data stream. Accordingly, the traditional strategies for mining repeating patterns on static database are not suitable in a data stream environment. Besides, in the dynamic environment of a data stream, mining the repeating patterns from the whole history data sequence does not extract the newest trend of patterns in the data stream. For this reason, two algorithms for efficiently mining recently repeating patterns in a data stream are proposed in this thesis. One is named the appearing-bit-sequence-based incremental mining algorithm and the other one is named the basic-patterns estimating-based algorithm. The incremental mining approach applies appearing bit sequences to compute the frequencies of data patterns efficiently within the sliding window. By maintaining the appearing bit sequences of maximal repeating patterns, the newly generated recently repeating patterns are mined from the maintained information to reduce processing cost when the window slides. The estimating-based method maintains the repeating patterns, potential repeating patterns, and 2-item patterns, a partition-based scheme is used to count the frequencies of patterns. By constructing a data structure to support efficiently access of remained patterns, the frequency of an unretained pattern is estimated according to the frequencies of its maximum prefix-subpattern and suffix-subpattern. The experimental results show that the incremental mining method is an efficient way for mining recently repeating patterns correctly. And the estimating-based method provides a even more faster way to discover recently repeating patterns from a data stream approximately.
[1] T. Calders , B. Goethals “Depth-First Non-Derivable Itemset Mining” in Proc. of 2005 SIAM Int.Conf. on Data Mining (SDM'05)
[2] J.H. Chang and W.S. Lee “Finding Recent Frequent Itemsets Adaptively over Online Data Streams” in Proc. of the 9th ACM International Conference on Knowledge Discovery and Data Ming, 2003
[3] C.C. Chang, Y.C. Li, J.S. Lee “An Efficient Algorithm for Incremental Mining of Association Rules”in Proc. of RIDE-SDMA’05
[4] D.W Cheung, J.Han, V.T. Ng, and C.Y.Wong, “Maintenance of Discovered Association Rules in Large Databases: An Incremental Update Technique,” in Proc. of the 12th International Conference on Data Engineering (ICDE’96)
[5] Y. Chi, H. Wang, P.S. Yu, R.R. Muntz“ Moment:Maintaining Closed Frequent Itemsets over a Stream Sliding Window ”, in Proc. of 2004 Int. Conf. on Data Engineering (ICDE'04)
[6] B. Goethals , J. Muhonen , H. Toivonen “Mining Non-Derivable Association Rules” in Proc. of 2005 SIAM Int.Conf. on Data Mining (SDM'05)
[7] J.L. Hsu, C.C. Liu, Arbee L.P. Chen, “Discovering Nontrivial Repeating Patterns in Music Data” in Proc. of IEEE Transactions on Multimedia , 2001
[8] J.L. Koh , W.D.C.Yu “Efficient Feature Mining in Music Objects” in Proc. of Lecture Notes in Computer Science:DEXA ’01:Database and Expert Systems Applications,pp.221-231 , Springer-Verlag , 2001
[9] J.L. Koh and Y.T. Kung “An Efficient Approach for Mining Top-K Fault-Tolerant Repeating Patterns.”in Proc. of Lecture Notes in Computer Science:DASFAA’06:Database Systems for Advanced Applications , Springer-Verlag , 2006
[10] H. Li, S. Lee, M.K. Shan “Online Mining (Recently) Maximal Frequent Itemsets over Data Streams” in Proc. of RIDE-SDMA’05
[11] C.H. Lin, D.Y. Chiu, Y.H. Wu, Arbee L.P. Chen “Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window” in Proc. of SIAM International Conference on Data Mining 2005
[12] C.C. Liu, J.L. Hsu, and Arbee L.P. Chen ,“Efficient Theme and Non-Trivial Repeating Pattern Discovering in Music Databases,” in Proc. of the 15th International Conference on Data Engineering (ICDE’99)
[13] N.H. Liu, Y.H. Wu, Arbee L.P. Chen “An Efficient Approach to Extracting Approximate Repeating Patterns in Music Databases” in Proc. of International Conference on Database Systems for Advanced Applications (DASFAA 05)
[14] G.S. Manku and R.C. Motwani,”Approximate frequent counts over data Streams” in Proc. of the 28th International Conference on Very Large Database,Hong Kong, China Aug, 2002
[15] S. Thomas, S. Bodagala ,K.Alsabti, and S.Ranka, “An Efficient Algorithm for the Incremental Updation of Association Rules in Large Databases” in Proc. of 3rd International conference on Knowledge Discovery and Data Mining (KDD97)
[16] J. Wang and J. Han, “BIDE: Efficient Mining of Frequent Closed Sequences” in Proc. of 2004 Int. Conf. on Data Engineering (ICDE'04)