Efficient Parallel Algorithm for Mining High Utility Patterns Based on Spark
Junqiang Liu, Rong Zhao, Xiangcai Yang, Yong Zhang, Xiaoning Jiang
School of Information and Electronic Engineering, Zhejiang Gongshang University, Hangzhou 310018, China
23 June 2019
High Utility Pattern Mining, Sequential Algorithms, Frameworks
Motivation
Support = statistical significance only
Utility = user's interest + statistical significance
Anti-monotonicity holds for frequent patterns (FP)
Anti-monotonicity does not hold for high utility patterns (HUP)
Problem Statement & Preliminaries: High Utility Pattern Mining
FP: What products are frequently purchased together?
HUP: What products purchased together have high profits?
The utility of a set of products = the profits of those products in the transactions that contain all of them, depending on quantity and price/cost
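The definition above can be made concrete with a small sketch. The profit table, the transactions, and the function names below are hypothetical toy values chosen for illustration; they do not come from the slides. The example also shows why anti-monotonicity holds for support but not for utility: a superset can never be more frequent than its subset, but it can be more profitable.

```python
# Hypothetical toy data: unit profit per item, and purchased quantity per transaction.
PROFIT = {"a": 1, "b": 2, "c": 5}
TRANSACTIONS = [
    {"a": 2, "b": 1},
    {"a": 1, "c": 2},
    {"a": 3, "b": 2, "c": 1},
]

def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if all(i in t for i in itemset))

def utility(itemset, transactions, profit):
    """Sum of quantity * unit profit of the itemset's items,
    taken over the transactions that contain the whole itemset."""
    total = 0
    for t in transactions:
        if all(i in t for i in itemset):
            total += sum(t[i] * profit[i] for i in itemset)
    return total

sup_a = support({"a"}, TRANSACTIONS)             # 3
sup_ac = support({"a", "c"}, TRANSACTIONS)       # 2, never above sup_a
u_a = utility({"a"}, TRANSACTIONS, PROFIT)       # 6
u_ac = utility({"a", "c"}, TRANSACTIONS, PROFIT) # 19, above u_a
```

With a minimum utility of, say, 10, {a} is not a high utility pattern while its superset {a, c} is, so a low-utility pattern cannot simply be pruned together with all of its supersets.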
Problem Statement & Preliminaries: Well-known Sequential Algorithms
Algorithm       Venue        Search              Candidates   Bound
TwoPhase [1]    KDD          Breadth (Apriori)   With         TWU
CTU-PROL [3]    PAKDD        Breadth (Apriori)   With         TWU
IHUP [5]        TKDE         Depth (FP-Growth)   With         TWU
UPGrowth [6]    KDD, TKDE    Depth (FP-Growth)   With         TWU
D2HUP [7]       ICDM, TKDE   Depth (OP)          Without      Tight bound
HUI-Miner [8]   CIKM         Depth (Eclat)       Without      Tight bounds
Problem Statement & Preliminaries: Spark / MapReduce Framework
Data are distributed over a cluster
One split on one node
Represented as <key, value> pairs: input, output, and interim results
Processing by a series of jobs
A job is dispatched to where a data split resides, and executed in parallel
A job is defined by a mapper and a reducer, and executed in two phases
Resilient Distributed Dataset (RDD): memory based; transformations / actions on RDD
Figure: a master and a cluster of servers (nodes)
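The job model described above (splits of <key, value> records, a mapper, a reducer, two phases) can be imitated on a single machine. The sketch below is plain Python, not Spark or Hadoop code; `run_job`, `count_mapper`, and `sum_reducer` are hypothetical names introduced only for this illustration, and the splits are toy data.

```python
from collections import defaultdict

def run_job(splits, mapper, reducer):
    """Single-machine imitation of one map/reduce job."""
    # Map phase: on a real cluster each split is processed on the node holding it.
    mapped = []
    for split in splits:
        for record in split:
            mapped.extend(mapper(record))
    # Shuffle: group the interim <key, value> pairs by key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce phase: one reducer call per key.
    return {key: reducer(key, values) for key, values in groups.items()}

def count_mapper(transaction):
    """Emit <item, 1> for every item of a transaction."""
    return [(item, 1) for item in transaction]

def sum_reducer(key, values):
    """Add up all values emitted for one key."""
    return sum(values)

# Two hypothetical splits, as if stored on two different nodes.
splits = [
    [["a", "b"], ["a", "c"]],
    [["a", "b", "c"]],
]
counts = run_job(splits, count_mapper, sum_reducer)  # item -> support count
```

Here the job computes item supports. The same map, group, reduce shape is what a distributed framework parallelizes across the nodes of the cluster.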
Our Mining Approach: Breadth-First Search, Improved Utility Lists
Adapting HUI-Miner, which is derived from Eclat and is depth-first
Ordering items (e, c, b, a, d) in ascending order of transaction utility
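A breadth-first search proceeds level by level: all k-patterns are produced before any (k+1)-pattern. One common way to do this, sketched below, is to group k-patterns by their shared (k-1)-prefix and extend each pattern with the last item of a later pattern in the same group. This is an illustrative sketch only: `next_level` is a hypothetical name, utility-based pruning is left out, and only the item ordering is taken from the slide.

```python
from collections import defaultdict
from itertools import combinations

ORDER = ["e", "c", "b", "a", "d"]  # item ordering given on the slide
RANK = {item: r for r, item in enumerate(ORDER)}

def next_level(patterns):
    """Join k-patterns that share a (k-1)-prefix into (k+1)-patterns.

    Each pattern is a tuple of items sorted according to ORDER.
    """
    by_prefix = defaultdict(list)
    for p in patterns:
        by_prefix[p[:-1]].append(p)
    result = []
    for group in by_prefix.values():
        # Sort by last item so every pair is joined in the global order.
        group.sort(key=lambda p: RANK[p[-1]])
        for px, py in combinations(group, 2):
            result.append(px + (py[-1],))
    return result

level2 = [("e", "a"), ("e", "b"), ("c", "a")]
level3 = next_level(level2)  # only {e,b} and {e,a} share a prefix
```

Because each group depends only on its own prefix, the groups can be processed independently, which is what makes the level-wise scheme a natural fit for key-based parallel frameworks.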
Our Mining Approach: Join Utility Lists
Mining high utility patterns by joining utility lists
Two k-patterns are joined into a (k+1)-pattern
Example: {e,b}, UL({e,b}) and {e,a}, UL({e,a}) are joined into {e,b,a}, UL({e,b,a})
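The join can be sketched as follows. This is a HUI-Miner-style join under stated assumptions, not the authors' exact procedure: each utility-list entry is assumed to hold `iutil` (utility of the pattern in the transaction), `rutil` (utility of the items that follow the pattern in the item ordering), and `piutil`, read here as the utility of the pattern's prefix in the same transaction, so that the prefix's own list is not needed during the join. The `Entry` type, the function names, and the numbers are hypothetical.

```python
from typing import Dict, NamedTuple

class Entry(NamedTuple):
    iutil: int   # utility of the pattern in this transaction
    rutil: int   # utility of the items ordered after the pattern
    piutil: int  # ASSUMPTION: utility of the pattern's prefix in this transaction

UtilityList = Dict[int, Entry]  # tid -> Entry

def join(ul_px: UtilityList, ul_py: UtilityList) -> UtilityList:
    """Join the lists of Px and Py (same prefix P, x ordered before y) into Pxy."""
    joined = {}
    for tid, ex in ul_px.items():
        ey = ul_py.get(tid)
        if ey is None:
            continue  # Pxy occurs only where both Px and Py occur
        joined[tid] = Entry(
            iutil=ex.iutil + ey.iutil - ex.piutil,  # the prefix was counted twice
            rutil=ey.rutil,                         # items after y
            piutil=ex.iutil,                        # the prefix of Pxy is Px
        )
    return joined

def upper_bound(ul: UtilityList) -> int:
    """Sum of iutil + rutil: bounds the utility of any extension of the pattern."""
    return sum(e.iutil + e.rutil for e in ul.values())

# Toy transaction 1 holds e=3, b=4, a=5, d=2; transaction 2 holds e=6, a=10.
ul_eb = {1: Entry(iutil=7, rutil=7, piutil=3)}
ul_ea = {1: Entry(iutil=8, rutil=2, piutil=3), 2: Entry(iutil=16, rutil=0, piutil=6)}
ul_eba = join(ul_eb, ul_ea)
```

In transaction 1 the joined entry has iutil 7 + 8 - 3 = 12, which matches e + b + a = 3 + 4 + 5. Transaction 2 drops out because it does not contain b.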
New Parallel Algorithm Based on Spark: Three Phases
<i, (u(i,tid), u(t,tid))> -> <i, twu(i)>
<i, (tid, iutil, rutil)> -> <i, List(tid, iutil, rutil, piutil)> -> <i, (List(,,,), iutilSum, rutilSum)>
<Pk-1, UL(Pk-1)> -> <Pk-2, (Pk-1, UL(Pk-1))> -> <Pk, List(,,,)> -> <Pk, UL(Pk)>
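The key-value pairs above suggest that the early steps compute twu(i) for every item and then build one utility list per item. The sketch below imitates those two steps in plain Python on a hypothetical toy database. It follows the usual HUI-Miner conventions (drop items whose TWU is below the threshold, order the rest by ascending TWU), which is an assumption about what the slides intend. On Spark, the per-item sums would typically be expressed with RDD transformations such as `flatMap` and `reduceByKey`.

```python
from collections import defaultdict

# Hypothetical toy database: tid -> {item: utility of that item in the transaction}
DB = {
    1: {"a": 2, "b": 2},
    2: {"a": 1, "c": 10},
    3: {"a": 3, "b": 4, "c": 5},
}

def compute_twu(db):
    """twu(i) = sum of the transaction utilities of the transactions containing i."""
    twu = defaultdict(int)
    for items in db.values():
        tu = sum(items.values())   # u(t, tid)
        for item in items:         # "map": emit <i, u(t, tid)>
            twu[item] += tu        # "reduce": sum per item
    return dict(twu)

def build_item_lists(db, twu, min_util):
    """Build <i, List(tid, iutil, rutil)> for items with twu(i) >= min_util."""
    order = sorted((i for i in twu if twu[i] >= min_util), key=lambda i: (twu[i], i))
    rank = {item: r for r, item in enumerate(order)}
    lists = defaultdict(list)
    for tid, items in db.items():
        kept = sorted((i for i in items if i in rank), key=rank.__getitem__)
        remaining = sum(items[i] for i in kept)
        for i in kept:
            remaining -= items[i]  # utility of the kept items ordered after i
            lists[i].append((tid, items[i], remaining))
    return order, dict(lists)

twu = compute_twu(DB)
order, item_lists = build_item_lists(DB, twu, min_util=20)
```

With this toy data, b has a TWU of 16 and is discarded at a threshold of 20, so its utility no longer contributes to the remaining utilities of the other items.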
Experimental Evaluation
Algorithms compared: Phps and PhpMR
Experimental Evaluation: Running time with changing minUtil
Figure: running time (s) of Phps and PhpMR versus minutil (%) on (a) Chess, (b) WebView, (c) T10DI6N1KD1M, and (d) Chainstore
Experimental Evaluation: Running time with each iteration
Figure: running time (s) of Phps and PhpMR per pass of iteration on (a) Chess, minutil = 37%, and (b) Chainstore, minutil = 0.5%
Conclusion and Future Work
An improved vertical data structure
A three-phase parallel mining framework
An efficient algorithm
References
[1] Y. Liu, W. Liao, and A. Choudhary. A fast high utility itemsets mining algorithm. In Proceedings of the Utility-Based Data Mining Workshop in conjunction with the 11th ACM SIGKDD, 2005, p253-262.
[2] Y.-C. Li, J.-S. Yeh, and C.-C. Chang. Isolated items discarding strategy for discovering high utility itemsets. Data & Knowledge Engineering, 2008, 64(1): 198-217.
[3] A. Erwin, R. P. Gopalan, and N. R. Achuthan. Efficient mining of high utility itemsets from large datasets. In Proceedings of PAKDD 2008, 2008, p554-561.
[4] J. W. Han, J. Pei, Y. W. Yin, et al. Mining Frequent Patterns without Candidate Generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000, p1-12.
[5] C. F. Ahmed, S. K. Tanbeer, B.-S. Jeong, et al. Efficient tree structures for high utility pattern mining in incremental databases. IEEE Transactions on Knowledge and Data Engineering, 2009, p1708-1721.
[6] V. S. Tseng, C.-W. Wu, B.-E. Shie, et al. UP-Growth: an efficient algorithm for high utility itemset mining. In Proceedings of the 16th ACM SIGKDD, 2010, p253-262.
[7] J. Liu, K. Wang, and B. Fung. Direct Discovery of High Utility Itemsets without Candidate
[8] M. Liu, J. Qu. Mining high utility itemsets without candidate generation. In Proceedings of CIKM 2012, 2012, p55-64.
[9] Matei Zaharia. An architecture for fast and general data processing on large clusters. Technical Report No. UCB/EECS-2014-12, University of California at Berkeley.
[10] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004, p137-150.