Efficient Parallel Algorithm for Mining High Utility Patterns Based - - PowerPoint PPT Presentation


Efficient Parallel Algorithm for Mining High Utility Patterns Based on Spark Junqiang Liu, Rong Zhao, Xiangcai Yang, Yong Zhang, Xiaoning Jiang Zhejiang Gongshang University, Hangzhou 310018, China 23


SLIDE 1

Efficient Parallel Algorithm for Mining High Utility Patterns Based on Spark

Junqiang Liu, Rong Zhao, Xiangcai Yang, Yong Zhang, Xiaoning Jiang Zhejiang Gongshang University, Hangzhou 310018, China

School of Information and Electronic Engineering, Zhejiang Gongshang University

23 June 2019

SLIDE 2

Content

  • Motivation
  • Problem Statement & Preliminaries
      • High Utility Pattern Mining, Sequential Algorithms, Frameworks
  • Our Mining Approach
  • New Parallel Algorithm Based on Spark
  • Experimental Evaluation
  • Conclusion and Future Work
  • References

SLIDE 3

Motivation

  • High Utility Pattern (HUP) Mining vs Frequent Pattern (FP) Mining
      • HUP: utility = user’s interest + statistical significance
      • FP: support = statistical significance only
  • HUP mining is much harder than FP mining
      • Anti-monotonicity holds for FP: support of a pattern ≤ support of its sub-pattern
      • Anti-monotonicity does not hold for HUP: utility of a pattern ≤ utility of its sub-pattern is not guaranteed
  • Parallelization to deal with hardness in mining big data
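The contrast between the two measures can be checked on a tiny example. The sketch below is only an illustration: the transactions, the unit profits, and the helper names `support` and `utility` are hypothetical and do not come from the slides.

```python
# Hypothetical toy database: each transaction maps item -> purchased quantity.
transactions = [
    {"a": 1, "b": 1},
    {"a": 1, "b": 5},
    {"a": 1},
]
unit_profit = {"a": 1, "b": 10}  # hypothetical profit per unit


def support(pattern):
    """Number of transactions that contain every item of the pattern."""
    return sum(1 for t in transactions if pattern.issubset(t))


def utility(pattern):
    """Total profit of the pattern's items over the transactions containing it."""
    return sum(
        sum(t[i] * unit_profit[i] for i in pattern)
        for t in transactions
        if pattern.issubset(t)
    )


sub, sup = {"a"}, {"a", "b"}
print(support(sub), support(sup))  # 3 2: the super-pattern is never more frequent
print(utility(sub), utility(sup))  # 3 62: the super-pattern can be far more profitable
```

Because a super-pattern may have a higher utility than its sub-pattern, a low-utility pattern cannot simply be discarded together with all of its extensions, which is what makes pruning harder than in FP mining.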

SLIDE 4

High Utility Pattern Mining

  • HUP: What products purchased together have high profits?
      • The utility of a set of products = the profits of the products in transactions containing them, depending on quantity and price/cost
  • FP: What products are frequently purchased together?

Shopping Transactions

Tid   Items
t1    b:1, c:2, d:1, g:1
t2    a:4, b:1, c:3, d:1, e:1
t3    a:4, c:2, d:1
t4    c:2, e:1, f:1
...   ...

Utility table

I     U
a     1
b     2
c     1
d     5
...   ...
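As a worked example of the definition, the following sketch computes pattern utilities from the four transactions and four unit utilities visible on the slide. The function name `pattern_utility` is an assumption made for illustration; the unit utilities of e, f, and g are elided on the slide, so only patterns over a, b, c, and d are evaluated.

```python
# Transactions shown on the slide: tid -> {item: quantity}. Further rows are elided there.
transactions = {
    "t1": {"b": 1, "c": 2, "d": 1, "g": 1},
    "t2": {"a": 4, "b": 1, "c": 3, "d": 1, "e": 1},
    "t3": {"a": 4, "c": 2, "d": 1},
    "t4": {"c": 2, "e": 1, "f": 1},
}
# Unit utilities shown on the slide; the values for e, f, g are not given.
unit_utility = {"a": 1, "b": 2, "c": 1, "d": 5}


def pattern_utility(pattern):
    """Sum quantity * unit utility of the pattern's items over transactions containing all of them."""
    total = 0
    for items in transactions.values():
        if all(i in items for i in pattern):
            total += sum(items[i] * unit_utility[i] for i in pattern)
    return total


print(pattern_utility({"a"}))            # 8  (t2: 4, t3: 4)
print(pattern_utility({"a", "b"}))       # 6  (t2: 4 + 2)
print(pattern_utility({"a", "c", "d"}))  # 23 (t2: 12, t3: 11)
```

On these four rows, extending {a} with b lowers the utility while extending it with c and d raises it.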

SLIDE 5

Well-known Sequential Mining Algorithms

Algorithm   References      Search Strategy     Candidates   Pruning Strategy
TwoPhase    [1] KDD         Breadth (Apriori)   With         TWU
CTU-PROL    [3] PAKDD       Breadth (Apriori)   With         TWU
IHUP        [5] TKDE        Depth (FP-Growth)   With         TWU
UPGrowth    [6] KDD, TKDE   Depth (FP-Growth)   With         TWU
D2HUP       [7] ICDM, TKDE  Depth (OP)          Without      Tight bound
HUI-Miner   [8] CIKM        Depth (Eclat)       Without      Tight bounds

SLIDE 6

Distributed Computing Frameworks: Spark / MapReduce [9,10]

  • Data are distributed over a cluster
      • One split on one node
      • Represented as <key, value> pairs: input, output, and interim results
  • Processing by a series of jobs
      • A job is dispatched to where a data split resides, and executed in parallel
      • A job is defined by a mapper and a reducer, and executed in two phases
  • Resilient Distributed Dataset (RDD): memory based
  • Transformations / Actions on RDD

[Figure: a cluster of servers (nodes) with one master and several slaves]
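The two-phase job model can be imitated on a single machine. The sketch below is a plain-Python simulation, not Spark code: the `splits` data and the names `mapper`, `reducer`, and `run_job` are hypothetical, and the job simply counts item occurrences.

```python
from collections import defaultdict

# Hypothetical data splits, one per node; each record is a transaction (list of items).
splits = [
    [["a", "b"], ["a", "c"]],
    [["b", "c"], ["a", "b", "c"]],
]


def mapper(transaction):
    """Map phase: emit one <item, 1> pair per item of a transaction."""
    return [(item, 1) for item in transaction]


def reducer(key, values):
    """Reduce phase: combine all values that share a key."""
    return key, sum(values)


def run_job(splits):
    # Map phase: splits are independent, so a cluster would process them in parallel.
    interim = [pair for split in splits for record in split for pair in mapper(record)]
    # Shuffle: group the interim <key, value> pairs by key.
    groups = defaultdict(list)
    for key, value in interim:
        groups[key].append(value)
    # Reduce phase: one reducer call per key.
    return dict(reducer(key, values) for key, values in groups.items())


counts = run_job(splits)
print(counts)  # {'a': 3, 'b': 3, 'c': 3}
```

In PySpark the same job would typically be written as `flatMap` followed by `reduceByKey` on an RDD, with the interim results kept in memory rather than written to disk between phases.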

SLIDE 7

Our Mining Approach: Breadth-First Search, Improved Utility Lists

  • Breadth-First Search
      • adapting HUI-Miner, which is derived from Eclat and is Depth-First
  • Improved vertical data structure: UtilityList
      • Ordering items, e, c, b, a, d, in ascending transaction utilities

[Figure: items paired with their utility lists, e.g. <{e}, UL({e})>, <{b}, UL({b})>]
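To make the vertical structure concrete, here is a minimal sketch of building utility lists in the style of HUI-Miner [8], where each entry holds (tid, iutil, rutil). The database, the unit utilities, and the item order are hypothetical; the order is assumed to be precomputed, and the extra field of the improved structure is not modeled here.

```python
# Hypothetical database: tid -> {item: quantity}, with hypothetical unit utilities.
db = {
    1: {"a": 2, "b": 1},
    2: {"a": 1, "c": 3},
    3: {"a": 1, "b": 2, "c": 1},
}
unit = {"a": 1, "b": 3, "c": 2}
order = ["b", "c", "a"]  # assumed: items already sorted in ascending order


def build_utility_lists(db, unit, order):
    """Return item -> list of (tid, iutil, rutil) entries."""
    rank = {item: r for r, item in enumerate(order)}
    lists = {item: [] for item in order}
    for tid, transaction in db.items():
        utils = {i: q * unit[i] for i, q in transaction.items()}
        remaining = sum(utils.values())
        # Visit the items in the global order; rutil is the utility of the items still to come.
        for item in sorted(transaction, key=rank.get):
            remaining -= utils[item]
            lists[item].append((tid, utils[item], remaining))
    return lists


uls = build_utility_lists(db, unit, order)
print(uls["b"])  # [(1, 3, 2), (3, 6, 3)]
print(uls["a"])  # [(1, 2, 0), (2, 1, 0), (3, 1, 0)]
```

Summing the iutil column gives the utility of the item, and iutil plus rutil gives an upper bound on the utility of any extension of it that follows the item order.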

SLIDE 8

Our Mining Approach (cont): Join Utility Lists

  • Mining high utility patterns by joining UtilityLists
      • two k-patterns → one (k+1)-pattern
      • e.g. <{e,a}, UL({e,a})> and <{e,b}, UL({e,b})> → <{e,b,a}, UL({e,b,a})>
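The join can be sketched as follows. This is an illustration under assumptions, not the authors' code: entries are taken to be (tid, iutil, rutil, piutil), and `piutil` is assumed to hold the utility of the pattern's (k-1)-prefix in that transaction, so that the join does not need the prefix's own list. The sample lists are hypothetical.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    tid: int
    iutil: int   # utility of the pattern in transaction tid
    rutil: int   # utility of the items after the pattern (in the item order) in tid
    piutil: int  # assumed meaning: utility of the pattern's (k-1)-prefix in tid


def join(ul_px, ul_py):
    """Join the lists of Px and Py (same prefix P, x ordered before y) into the list of Pxy."""
    by_tid = {e.tid: e for e in ul_py}
    joined = []
    for ex in ul_px:
        ey = by_tid.get(ex.tid)
        if ey is None:
            continue  # Pxy only occurs in transactions containing both Px and Py
        # The prefix utility is included in both iutil values, so subtract it once.
        iutil = ex.iutil + ey.iutil - ex.piutil
        joined.append(Entry(ex.tid, iutil, ey.rutil, ex.iutil))
    return joined


# Hypothetical 1-pattern lists for the item order b, c, a (prefix utility is 0).
ul_b = [Entry(1, 3, 2, 0), Entry(3, 6, 3, 0)]
ul_c = [Entry(2, 6, 1, 0), Entry(3, 2, 1, 0)]
ul_a = [Entry(1, 2, 0, 0), Entry(2, 1, 0, 0), Entry(3, 1, 0, 0)]

ul_bc = join(ul_b, ul_c)     # two 1-patterns -> 2-pattern {b,c}
ul_ba = join(ul_b, ul_a)     # two 1-patterns -> 2-pattern {b,a}
ul_bca = join(ul_bc, ul_ba)  # two 2-patterns sharing prefix {b} -> 3-pattern {b,c,a}
print(ul_bca)
```

In HUI-Miner [8], a pattern is reported when the sum of its iutil values reaches the minimum utility, and its extensions are skipped when the sum of iutil and rutil falls below it.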

SLIDE 9

Phps: Parallel high utility pattern mining based on Spark

Three phases, with the <key, value> pairs handled in each:

Phase I:    <i, (u(i,tid), u(t,tid))> → <i, twu(i)>
Phase II:   <i, (tid, iutil, rutil)> → <i, List(tid, iutil, rutil, piutil)> → <i, (List(,,,), iutilSum, rutilSum)>
Phase III:  <Pk-1, UL(Pk-1)> → <Pk-2, (Pk-1, UL(Pk-1))> → <Pk, List(,,,)> → <Pk, UL(Pk)>
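Phase I can be pictured as one map step and one reduce-by-key step. The sketch below runs on plain Python lists rather than Spark RDDs and rests on an assumed reading of the notation: u(i,tid) is the utility of item i in transaction tid, u(t,tid) is the total utility of that transaction, and twu(i) is the sum of the latter over all transactions containing i. The data and helper names are hypothetical.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical database: tid -> {item: quantity}, with hypothetical unit utilities.
db = {
    1: {"a": 2, "b": 1},
    2: {"a": 1, "c": 3},
    3: {"a": 1, "b": 2, "c": 1},
}
unit = {"a": 1, "b": 3, "c": 2}


def phase1_map(tid, transaction):
    """Emit <i, (u(i, tid), transaction utility)> for every item i of the transaction."""
    utils = {i: q * unit[i] for i, q in transaction.items()}
    tu = sum(utils.values())
    return [(i, (u, tu)) for i, u in utils.items()]


def reduce_by_key(pairs, fn):
    """Group values by key and fold each group with fn (a stand-in for a reduce-by-key step)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(fn, values) for key, values in groups.items()}


pairs = [p for tid, t in db.items() for p in phase1_map(tid, t)]
# Keep only the transaction utility of each pair and sum it per item.
twu = reduce_by_key(((i, tu) for i, (_, tu) in pairs), lambda x, y: x + y)
print(twu)  # {'a': 21, 'b': 14, 'c': 16}
```

Sorting the items by these values gives an ascending item order of the kind used when building the utility lists.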

SLIDE 10

Experimental Evaluation

  • 2 algorithms
      • Phps: our algorithm
      • PhpMR: the competitor
  • 4 datasets

Dataset        #Items   #Trans.     Avg. Trans. Len.
Chess          76       3,196       37
WebView-1      497      59,602      2.5
T10DI6N1KD1M   1,000    933,493     10
Chainstore     46,086   1,112,949   7.2

SLIDE 11

Running time with changing minUtil

[Figure: running time (s) of Phps and PhpMR versus minutil (%) on (a) Chess, (b) WebView, (c) T10DI6N1KD1M, (d) Chainstore]

SLIDE 12

Running time with each iteration

[Figure: running time (s) of Phps and PhpMR per pass of iteration on (a) Chess, minutil = 37%, and (b) Chainstore, minutil = 0.5%]

SLIDE 13

Conclusion

  • Phps: a parallel Eclat-like algorithm based on Spark
      • An improved vertical data structure
      • A three-phase parallel mining framework
      • An efficient algorithm

Future Work

  • Hybrid Search: BF + DF
  • More Pruning in Phase I (filtering irrelevant items)
  • Algorithms parallelizing D2HUP
  • Algorithms on new parallel programming frameworks

SLIDE 14

References

[1] Y. Liu, W. Liao, and A. Choudhary. A fast high utility itemsets mining algorithm. In Proceedings of the Utility-Based Data Mining Workshop in conjunction with the 11th ACM SIGKDD, 2005, p253-262.
[2] Y.-C. Li, J.-S. Yeh, and C.-C. Chang. Isolated items discarding strategy for discovering high utility itemsets. Data & Knowledge Engineering, 2008, 64(1): 198-217.
[3] A. Erwin, R. P. Gopalan, and N. R. Achuthan. Efficient mining of high utility itemsets from large datasets. In Proceedings of PAKDD 2008, 2008, p554-561.
[4] J. W. Han, J. Pei, Y. W. Yin, et al. Mining Frequent Patterns without Candidate Generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000, p1-12.
[5] C. F. Ahmed, S. K. Tanbeer, B.-S. Jeong, et al. Efficient tree structures for high utility pattern mining in incremental databases. In IEEE Transactions on Knowledge and Data Engineering, 2009, p1708-1721.
[6] V. S. Tseng, C.-W. Wu, B.-E. Shie, et al. UP-Growth: an efficient algorithm for high utility itemset mining. In Proceedings of the 16th ACM SIGKDD, 2010, p253-262.
[7] J. Liu, K. Wang, and B. Fung. Direct Discovery of High Utility Itemsets without Candidate Generation. In IEEE 12th International Conference on Data Mining, 2012, p101-109.
[8] M. Liu, J. Qu. Mining high utility itemsets without candidate generation. In Proceedings of CIKM 2012, 2012, p55-64.
[9] Matei Zaharia. An architecture for fast and general data processing on large clusters. Technical Report No. UCB/EECS-2014-12, University of California at Berkeley.
[10] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004, p137-150.

SLIDE 15

Thank You! Questions?

SLIDE 16

IEEE DSC 2019 - IEEE International Conference on Data Science in Cyberspace
BDMC 2019 - Big Data Mining for Cyberspace

23 June, 2019, 8:30 - 9:30
Workshop Chair: Zhaoquan Gu and Jing Qiu
http://www.ieee-dsc.org/2019/