研究生: |
彭紹峻 Peng, Shao-Chun |
---|---|
論文名稱: |
以MapReduce分散式架構有效率進行相似資料配對搜尋之研究 MapReduce Approach for Efficient Computation of All-Pairs Similarity Search |
指導教授: |
柯佳伶
Koh, Jia-Ling |
學位類別: |
碩士 Master |
系所名稱: |
資訊工程學系 Department of Computer Science and Information Engineering |
論文出版年: | 2015 |
畢業學年度: | 103 |
語文別: | 中文 |
論文頁數: | 67 |
中文關鍵詞: | 相似資料配對問題 、分群篩除方法 、MapReduce演算法 |
英文關鍵詞: | All Pair Similarity Search(APSS), partition based data filtering, MapReduce algorithm |
論文種類: | 學術論文 |
相關次數: | 點閱:92 下載:7 |
分享至: |
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報 |
為了有效率計算相似資料配對搜尋問題,本論文提供一個有效篩除不相似配對的方法,並設計成對應的MapReduce平行處理架構。我們提出以最大值出現維度對資料預先分群,並設計改變分群數的方式及平衡各群間資料個數的分群調整方法。根據分群結果,將相似配對計算工作分成群內相似配對以及跨群相似配對;群內相似配對運用反向串列索引方式快速計算,在跨群配對方面,本研究先以計算群代表向量的方式對進行群配對的篩除,計算可能產生相似配對的群配對,再對群配對中資料進行候選配對的組合。此外本研究也運用MapReduce平行處理架構,以平行處理上述各步驟所需執行工作。實驗結果顯示此方法適合採用MapReduce平行處理架構,可比集中式處理有效減少相似資料配對問題的回覆時間。
For solving the All Pair Similarity Search (APSS) problem efficiently, this thesis aims to provides an effective approach to filter non-similar pairs. Besides, the MapReduce version of the approach is provided to solve the problem in parallel. At first, for each data point, the dimension with the maximum value is used to decide the corresponding group of data partition. An adjusting method is designed to further balance the number of elements in each data group. The similar pairs could be inter-group similar pairs or intra-group similar pairs. For finding the intra-group similar pairs, we apply the inverted list index to improve the efficiency of computing the similarity of data pairs in each group. In addition, for speeding up the computation of finding inter-group similar pairs, the leader-vector is used to represent each group for estimating the upper bound of similarity between each group pair. Only the pairs of groups, whose upper bounds of similarity are larger than the given similarity threshold, need to exactly compute similarity of the inter-group data pairs. Finally, we design the corresponding MapReduce algorithm to perform the proposed approach in parallel. The experimental results show that the proposed partition-based method can fit into the MapReduce programing scheme properly. Moreover, the proposed MapReduce algorithm can effectively reduce the response time of solving the APSS problem than the centralized approach.
[1] M. Alabduljalil, X. Tang and T.Yang. Optimizing Parallel Algorithms for All Pairs Similarity Search. WSDM, 2013.
[2] D. C. Anastasiu and G. Karypis. L2AP: Fast Cosine Similarity Search With Prefix L-2 Norm Bounds. ICDE, 2014.
[3] A. Arasu, V. Ganti and R. Kaushik. Efficient Exact Set-Similarity Joins. VLDB, 2006.
[4] A. Awekar, N. F. Samatova1 and P. Breimyer. Incremental All Pairs Similarity Search for Varying Similarity Thresholds with Reduced I/O Overhead. 3rd SNA-KDD workshop, 2009.
[5] R. J. Bayardo, Y. Ma and R. Srikant. Scaling Up All Pairs Similarity Search. WWW 2007.
[6] S. Chaudhuri, V. Ganti and R. Kaushik. A Primitive Operator for Similarity Joins in Data Cleaning. ICDE, 2006.
[7] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI, 2004.
[8] G. D. Francisci, C. Lucchese and R. Baraglia. Scaling Out All Pairs Similarity Search with MapReduce. LSDS-IR, 2010.
[9] A. Gionis and P. Indyky. Similarity Search in High Dimensions via Hashing. 25th VLDB Conference, 1999.
[10] J. Lin. Brute Force and Indexed Approaches to Pairwise Document Similarity Comparisons with MapReduce. SIGIR, 2009.
[11] A. Metwally and C. Faloutsos. V-SMART-Join: A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors. VLDB Endowment, 2012.
[12] L. A. Ribeiro and T. H¨arder. Efficient Set Similarity Joins Using Min-prefixes. ADBIS, 2009.
[13] V. Satuluri. Bayesian Locality Sensitive Hashing for Fast Similarity Search. VLDB, 2012.
[14] X. Tang, M. Alabduljalil, X. Jin, Tao Yang. Load Balancing for Partition-based Similarity Search. SIGIR, 2014
[15] Y.Wang, A. Metwally and S. Parthasarathy. Scalable All-Pairs Similarity Search in Metric Spaces. KDD, 2013.
[16] J. Wang, G. Li and Jianhua Feng Department. Can We Beat the Prefix Filtering An Adaptive Framework for Similarity Join and Search. SIGMOD, 2012.