smartAdvisor Jan Neerbek Alexandra Institute Agenda d60: A - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Case study: d60 Raptor smartAdvisor

Jan Neerbek Alexandra Institute

slide-2
SLIDE 2

2

Agenda

· d60: A cloud/data mining case · Cloud · Data Mining · Market Basket Analysis · Large data sets · Our solution

slide-3
SLIDE 3

3

Alexandra Institute

The Alexandra Institute is a non-profit company that works with application-oriented IT research.

Our focus is pervasive computing, and we activate the business potential of our members and customers through research-based, user-driven innovation.

slide-4
SLIDE 4

4

The case: d60

· Danish company · A similar-products recommendation engine · d60 was outgrowing its servers (late 2010) · They saw potential in moving to Azure

slide-5
SLIDE 5

5

The setup

[Diagram labels: Internet, Webshops, Log shopping patterns, Do data mining, Product recommendations]

slide-6
SLIDE 6

6

The cloud potential

· Elasticity · No upfront server cost · Cheaper licenses · Faster calculations

slide-7
SLIDE 7

7

Challenges

· No SQL Server Analysis Services (SSAS) · Small compute nodes · Partitioned database (50 GB) · SQL Server ingress/egress access is slow

slide-8
SLIDE 8

8

The cloud

Node Node Node Node Node Node Node

slide-9
SLIDE 9

9

The cloud and services

Node Node Node Node Node Node Node Data layer service Messaging Service

slide-10
SLIDE 10

10

Data layer service

· Application specific (schema/layout) · SQL, table or other · Easily becomes a bottleneck · Can be difficult to scale

Data layer service

slide-11
SLIDE 11

11

Messaging service Task Queues

· Standard data structure · Built-in ordering (FIFO) · Can be scaled · Good for asynchronous messages

Messaging Service
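As a rough illustration of the task-queue properties listed above (FIFO ordering, asynchronous hand-off), here is a minimal single-process sketch using Python's standard `queue.Queue`. It is a stand-in for a cloud messaging service, not the service used in the case; the `worker` and `run` names and the upper-casing "work" are hypothetical.

```python
import queue
import threading

def worker(tasks, results):
    """Pull task messages in FIFO order until a None sentinel arrives."""
    while True:
        message = tasks.get()
        if message is None:               # sentinel: no more work
            tasks.task_done()
            break
        results.append(message.upper())   # placeholder for real work
        tasks.task_done()

def run(messages):
    tasks = queue.Queue()                 # queue.Queue is FIFO
    results = []
    thread = threading.Thread(target=worker, args=(tasks, results))
    thread.start()
    for message in messages:              # producer does not wait for the worker
        tasks.put(message)
    tasks.put(None)
    tasks.join()                          # block until every message is processed
    thread.join()
    return results

if __name__ == "__main__":
    print(run(["count", "collect", "recurse"]))
```

With a single worker the results come back in the order the messages were queued; with several workers the queue still hands messages out in FIFO order, but completion order is no longer guaranteed.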

slide-12
SLIDE 12

12

slide-13
SLIDE 13

13

Data mining

Data mining is the use of automated data analysis techniques to uncover relationships among data items.

Market basket analysis is a data mining technique that discovers co-occurrence relationships among activities performed by specific individuals.

[about.com/wikipedia.org]

slide-14
SLIDE 14

14

Market basket analysis

Customer1: Avocado, Milk, Butter, Potatoes
Customer2: Milk, Diapers, Avocado, Beer
Customer3: Beef, Lemons, Beer, Chips
Customer4: Cereal, Beer, Beef, Diapers

slide-15
SLIDE 15

15

Market basket analysis

Customer1: Avocado, Milk, Butter, Potatoes
Customer2: Milk, Diapers, Avocado, Beer
Customer3: Beef, Lemons, Beer, Chips
Customer4: Cereal, Beer, Beef, Diapers

· Itemset (Diapers, Beer) occurs in 50% of the baskets · The frequency threshold is a parameter · Find as many frequent itemsets as possible
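The 50% figure can be checked directly from the four baskets on the slide. The sketch below computes the support of an itemset as the fraction of baskets that contain it, and brute-forces the frequent pairs for a given threshold. The function names are illustrative, and the brute-force pair search is only practical for tiny examples like this one.

```python
from itertools import combinations

BASKETS = {
    "Customer1": {"Avocado", "Milk", "Butter", "Potatoes"},
    "Customer2": {"Milk", "Diapers", "Avocado", "Beer"},
    "Customer3": {"Beef", "Lemons", "Beer", "Chips"},
    "Customer4": {"Cereal", "Beer", "Beef", "Diapers"},
}

def support(itemset, baskets=BASKETS):
    """Fraction of baskets that contain every item in `itemset`."""
    wanted = set(itemset)
    hits = sum(1 for items in baskets.values() if wanted <= items)
    return hits / len(baskets)

def frequent_pairs(threshold, baskets=BASKETS):
    """All item pairs whose support is at least `threshold` (brute force)."""
    items = sorted(set().union(*baskets.values()))
    return [pair for pair in combinations(items, 2)
            if support(pair, baskets) >= threshold]

if __name__ == "__main__":
    print(support({"Diapers", "Beer"}))   # 0.5
    print(frequent_pairs(0.5))
```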

slide-16
SLIDE 16

16

Market basket analysis

· Popular, effective algorithm: FP-growth · Based on the FP-tree data structure · Requires all data in near-memory · Most research in distributed models has been for cluster setups

slide-17
SLIDE 17

17

Building the FP-tree

(extends the prefix-tree structure)

Customer1: Avocado, Milk, Butter, Potatoes

[Figure: FP-tree with nodes Avocado, Butter, Milk, Potatoes]

slide-18
SLIDE 18

18

Building the FP-tree

Customer2: Milk, Diapers, Avocado, Beer

[Figure: FP-tree with nodes Avocado, Butter, Milk, Potatoes]

slide-19
SLIDE 19

19

Building the FP-tree

Customer2: Milk, Diapers, Avocado, Beer

[Figure: FP-tree with nodes Avocado, Butter, Milk, Potatoes, Beer, Diapers, Milk]

slide-20
SLIDE 20

20

Building the FP-tree

Customer2: Milk, Diapers, Avocado, Beer

[Figure: FP-tree with nodes Avocado, Butter, Milk, Potatoes, Beer, Diapers, Milk]

slide-21
SLIDE 21

21

Building the FP-tree

[Figure: FP-tree with nodes Avocado, Butter, Milk, Potatoes, Beer, Diapers, Milk, Beef, Beer, Chips, Lemon, Cereal, Diapers]
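A minimal sketch of the insertion steps shown on these slides: each basket is sorted and inserted into a prefix tree, and shared prefixes share nodes while counts are incremented. Two assumptions are made here: items are ordered alphabetically, which is what the slide figures appear to do (FP-tree implementations more commonly order by descending frequency), and the header table with node links is omitted. `FPNode`, `build_fp_tree`, and `render` are illustrative names.

```python
BASKETS = [
    ["Avocado", "Milk", "Butter", "Potatoes"],   # Customer1
    ["Milk", "Diapers", "Avocado", "Beer"],      # Customer2
    ["Beef", "Lemons", "Beer", "Chips"],         # Customer3
    ["Cereal", "Beer", "Beef", "Diapers"],       # Customer4
]

class FPNode:
    def __init__(self, item=None, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}        # item -> FPNode

def build_fp_tree(transactions):
    """Insert every transaction as a path from the root, sharing prefixes."""
    root = FPNode()
    for transaction in transactions:
        node = root
        for item in sorted(transaction):   # assumption: alphabetical item order
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root

def render(node, depth=0):
    """Return the tree as indented 'item:count' lines."""
    lines = []
    for item in sorted(node.children):
        child = node.children[item]
        lines.append("  " * depth + f"{item}:{child.count}")
        lines.extend(render(child, depth + 1))
    return lines

if __name__ == "__main__":
    print("\n".join(render(build_fp_tree(BASKETS))))
```

Under the alphabetical ordering, Customer1 and Customer2 share the Avocado node, and Customer3 and Customer4 share the Beef and Beer nodes.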

slide-22
SLIDE 22

22

FP-growth

Grows the frequent itemsets, recursively

FP-growth(FP-tree tree) {
    …
    for-each (item in tree) {
        count = CountOccur(tree, item);
        if (IsFrequent(count)) {
            OutputSet(item);
            sub = tree.GetTree(tree, item);
            FP-growth(sub);
        }
    }
}
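The recursion in the pseudocode can be made runnable in a simplified form. The sketch below is not a full FP-growth implementation: instead of a compressed FP-tree it recurses on plain lists of projected transactions, which keeps the same count, test, output, project, recurse shape. The alphabetical item order and the name `pattern_growth` are assumptions for illustration.

```python
from collections import Counter

BASKETS = [
    ["Avocado", "Milk", "Butter", "Potatoes"],
    ["Milk", "Diapers", "Avocado", "Beer"],
    ["Beef", "Lemons", "Beer", "Chips"],
    ["Cereal", "Beer", "Beef", "Diapers"],
]

def pattern_growth(transactions, min_count, suffix=()):
    """Yield (itemset, count) pairs; transactions are sorted item tuples."""
    counts = Counter()
    for items in transactions:             # CountOccur for every item
        counts.update(items)
    for item in sorted(counts):
        if counts[item] < min_count:       # IsFrequent
            continue
        itemset = (item,) + suffix
        yield itemset, counts[item]        # OutputSet
        # Plays the role of GetTree: keep only transactions containing `item`,
        # cut down to the items that precede it, so no itemset repeats.
        conditional = [items[:items.index(item)]
                       for items in transactions if item in items]
        yield from pattern_growth(conditional, min_count, itemset)

if __name__ == "__main__":
    database = [tuple(sorted(basket)) for basket in BASKETS]
    for itemset, count in pattern_growth(database, min_count=2):
        print(itemset, count)
```

With `min_count=2` on the four example baskets, this reports five frequent single items and three frequent pairs, including (Beer, Diapers).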

slide-23
SLIDE 23

23

FP-growth algorithm

Divide and Conquer

Traverse tree

[Figure: FP-tree with nodes Avocado, Butter, Milk, Potatoes, Beer, Diapers, Milk, Beef, Beer, Chips, Lemon, Cereal, Diapers]

slide-24
SLIDE 24

24

FP-growth algorithm

Divide and Conquer

Generate sub-trees

[Figure: FP-tree with nodes Avocado, Butter, Milk, Potatoes, Beer, Diapers, Milk, Beef, Beer, Chips, Lemon, Cereal, Diapers]

slide-25
SLIDE 25

25

FP-growth algorithm

Divide and Conquer

Call recursively

[Figure: FP-tree with nodes Avocado, Butter, Milk, Potatoes, Beer, Diapers, Milk, Beef, Beer, Chips, Lemon, Cereal, Diapers; second tree with nodes Avocado, Butter, Beer, Diapers]

slide-26
SLIDE 26

26

FP-growth algorithm Memory usage

The FP-tree does not fit in local memory; what to do?

· Emulate Distributed Shared Memory

slide-27
SLIDE 27

27

Distributed Shared Memory?

· To add nodes is to add memory · Works best in tightly coupled setups with low-latency, high-speed networks

[Figure: five CPU/Memory pairs; labels: Shared Memory, Network]

slide-28
SLIDE 28

28

FP-growth algorithm Memory usage

The FP-tree does not fit in local memory; what to do?

· Emulate Distributed Shared Memory · Optimize your data structures · Buy more RAM · Get a good idea

slide-29
SLIDE 29

29

Get a good idea

· Database scans are serial and can be distributed · The list of items used in the recursive calls uniquely determines what part of the data we are looking at
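To make the two observations concrete, here is a small sketch under one assumption about what "the list of items" selects: the transactions containing all listed items, cut down to the items that sort before them. The `project` function is a hypothetical name. Because it only scans transactions one by one, scanning two halves of the database separately and concatenating the results gives the same answer as one full scan.

```python
BASKETS = [
    ["Avocado", "Milk", "Butter", "Potatoes"],
    ["Milk", "Diapers", "Avocado", "Beer"],
    ["Beef", "Lemons", "Beer", "Chips"],
    ["Cereal", "Beer", "Beef", "Diapers"],
]

def project(transactions, item_list):
    """Select the part of the data that a sorted list of items refers to."""
    selected = []
    for transaction in transactions:
        items = sorted(transaction)
        if all(item in items for item in item_list):
            # Keep only the items that sort before the first listed item.
            cut = items.index(item_list[0]) if item_list else len(items)
            selected.append(tuple(items[:cut]))
    return selected

if __name__ == "__main__":
    whole = project(BASKETS, ("Beer", "Diapers"))
    halves = (project(BASKETS[:2], ("Beer", "Diapers"))
              + project(BASKETS[2:], ("Beer", "Diapers")))
    print(whole, whole == halves)
```

No tree has to be shipped between machines in this scheme: the item list is enough for any node to recompute its share of the data from its own transactions.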

slide-30
SLIDE 30

30

Get a good idea

[Figure: FP-tree with nodes Avocado, Butter, Milk, Potatoes, Beer, Diapers, Milk, Beef, Beer, Chips, Lemon, Cereal, Diapers; second tree with nodes Avocado, Butter, Beer, Diapers]

slide-31
SLIDE 31

31

Get a good idea

Milk Butter, Milk

Avocado Butter Beer Diapers Avocado Avocado Beer

Diapers,Milk

These are postfix paths

slide-32
SLIDE 32

32

slide-33
SLIDE 33

33

Buckets

· Use postfix paths for messaging · Working with buckets

Transactions Items

slide-34
SLIDE 34

34

FP-growth revisited

FP-growth(FP-tree tree) {
    …
    for-each (item in tree) {
        count = CountOccur(tree, item);
        if (IsFrequent(count)) {
            OutputSet(item);
            sub = tree.GetTree(tree, item);
            FP-growth(sub);
        }
    }
}

Annotations on the code: Replaced with postfix · Done in parallel · Done in parallel · Done in parallel
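One way to read the annotations is that the tree argument is replaced by a postfix (a list of items) and the recursive calls become messages that independent workers can pick up. The sketch below is a single-process approximation of that reading, not the d60 implementation: a `deque` stands in for the message queue, the full transaction list stands in for the data layer, and items are ordered alphabetically. All names are illustrative.

```python
from collections import Counter, deque

BASKETS = [
    ["Avocado", "Milk", "Butter", "Potatoes"],
    ["Milk", "Diapers", "Avocado", "Beer"],
    ["Beef", "Lemons", "Beer", "Chips"],
    ["Cereal", "Beer", "Beef", "Diapers"],
]

def message_driven_growth(transactions, min_count):
    """Find frequent itemsets by processing postfix messages from a queue."""
    database = [tuple(sorted(t)) for t in transactions]
    messages = deque([()])                 # start with the empty postfix
    frequent = {}
    while messages:
        postfix = messages.popleft()       # a message is just a list of items
        counts = Counter()
        for items in database:             # scan: could run on many nodes
            if all(p in items for p in postfix):
                cut = items.index(postfix[0]) if postfix else len(items)
                counts.update(items[:cut])
        for item, count in counts.items():
            if count >= min_count:
                extended = (item,) + postfix
                frequent[extended] = count
                messages.append(extended)  # replaces the recursive call
    return frequent

if __name__ == "__main__":
    for itemset, count in sorted(message_driven_growth(BASKETS, 2).items()):
        print(itemset, count)
```

Since each message carries everything needed to process it, messages can be handled in any order and by any worker, which is what allows the loop body to be spread over several nodes.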

slide-35
SLIDE 35

35

Communication

Node Node Node Node Data layer

slide-36
SLIDE 36

36

Revised Communication

Node Node Node Node Data layer

MQ

slide-37
SLIDE 37

37

Running FP-growth

· Distribute buckets · Count items (with postfix size=n) · Collect counts (per postfix) · Call recursive · Standard FP-growth

slide-38
SLIDE 38

38

Running FP-growth

· Distribute buckets · Count items (with postfix size=n) · Collect counts (per postfix) · Call recursive · Standard FP-growth
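The first three steps (distribute buckets, count items for a postfix, collect counts per postfix) can be sketched as follows. This assumes a bucket is simply a partition of the transactions, which the slides do not spell out, and it uses a thread pool to stand in for separate compute nodes. `make_buckets`, `count_in_bucket`, and `collect_counts` are hypothetical names.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

BASKETS = [
    ["Avocado", "Milk", "Butter", "Potatoes"],
    ["Milk", "Diapers", "Avocado", "Beer"],
    ["Beef", "Lemons", "Beer", "Chips"],
    ["Cereal", "Beer", "Beef", "Diapers"],
]

def make_buckets(transactions, n):
    """Distribute transactions round-robin over n buckets (n >= 1)."""
    buckets = [[] for _ in range(n)]
    for index, transaction in enumerate(transactions):
        buckets[index % n].append(tuple(sorted(transaction)))
    return buckets

def count_in_bucket(bucket, postfix):
    """Count items that can extend `postfix`, looking at one bucket only."""
    counts = Counter()
    for items in bucket:
        if all(p in items for p in postfix):
            cut = items.index(postfix[0]) if postfix else len(items)
            counts.update(items[:cut])
    return counts

def collect_counts(buckets, postfix):
    """Count every bucket in parallel, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=len(buckets)) as pool:
        partials = pool.map(count_in_bucket, buckets, [postfix] * len(buckets))
        total = Counter()
        for partial in partials:
            total.update(partial)
    return total

if __name__ == "__main__":
    buckets = make_buckets(BASKETS, 2)
    print(collect_counts(buckets, ("Diapers",)))
```

The merged counts do not depend on how the transactions were split into buckets, so the number of buckets can be tuned to the number and size of the available nodes.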

slide-39
SLIDE 39

39

Collecting what we have learned

· Message-driven work, using a message queue · Peer-to-peer for intermediate results · Distribute data for scalability (buckets) · Small messages (list of items) · Together, these allow us to distribute FP-growth

slide-40
SLIDE 40

40

Advantages

· Configurable work sizes · Good distribution of work · Robust against computer failure · Fast!

slide-41
SLIDE 41

41

So what about performance?

[Chart: time axis from 00:00:00 to 04:30:00 in 30-minute steps; categories 1, 2, 4, 8. Labels: Message-driven FP-growth, FP-growth, Total node time]

slide-42
SLIDE 42

42

Thank you!