
SLIDE 1

Case study: d60 Raptor smartAdvisor

Jan Neerbek Alexandra Institute

SLIDE 2

Agenda

· d60: A cloud/data mining case
· Cloud
· Data Mining
· Market Basket Analysis
· Large data sets
· Our solution

SLIDE 3

Alexandra Institute

The Alexandra Institute is a non-profit company that works with application-oriented IT research.

The focus is pervasive computing, and we activate the business potential of our members and customers through research-based, user-driven innovation.

SLIDE 4

The case: d60

· Danish company
· A similar products recommendation engine
· d60 was outgrowing their servers (late 2010)
· They saw a potential in moving to Azure

SLIDE 5

The setup

[Diagram labels: Internet; Webshops; Log shopping patterns; Do data mining; Product Recommendations]

SLIDE 6

The cloud potential

· Elasticity
· No upfront server cost
· Cheaper licenses
· Faster calculations

SLIDE 7

Challenges

· No SQL Server Analysis Services (SSAS)
· Small compute nodes
· Partitioned database (50GB)
· SQL Server ingress/egress access is slow

SLIDE 8

The cloud

[Diagram: seven compute nodes]

SLIDE 9

The cloud and services

[Diagram: seven compute nodes, a data layer service, and a messaging service]

SLIDE 10

Data layer service

· Application specific (schema/layout)
· SQL, table or other
· Easily becomes a bottleneck
· Can be difficult to scale

SLIDE 11

Messaging service: Task Queues

· Standard data structure
· Built-in ordering (FIFO)
· Can be scaled
· Good for asynchronous messages
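As a rough illustration of the task-queue idea, the sketch below uses Python's in-process `queue.Queue` as a stand-in for a cloud messaging service. It is not the Azure queue API; the function name `run_workers` and the squaring "work" are purely hypothetical.

```python
import queue
import threading

def run_workers(tasks, n_workers=3):
    """Process tasks from a FIFO queue with a small pool of worker threads."""
    q = queue.Queue()              # FIFO: items are handed out in insertion order
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            task = q.get()
            if task is None:       # sentinel: this worker is done
                break
            with lock:             # protect the shared result list
                results.append((task, task * task))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for task in tasks:
        q.put(task)                # producers never wait for a consumer
    for _ in threads:
        q.put(None)                # one sentinel per worker
    for t in threads:
        t.join()
    return results

print(sorted(run_workers([1, 2, 3, 4])))
```

Tasks are handed out in FIFO order, but with several workers the completion order is not guaranteed, which is why the results are sorted before printing.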

SLIDE 12

SLIDE 13

Data mining

Data mining is the use of automated data analysis techniques to uncover relationships among data items.

Market basket analysis is a data mining technique that discovers co-occurrence relationships among activities performed by specific individuals.

[about.com/wikipedia.org]

SLIDE 14

Market basket analysis

Customer1: Avocado, Milk, Butter, Potatoes
Customer2: Milk, Diapers, Avocado, Beer
Customer3: Beef, Lemons, Beer, Chips
Customer4: Cereal, Beer, Beef, Diapers

SLIDE 15

Market basket analysis

Customer1: Avocado, Milk, Butter, Potatoes
Customer2: Milk, Diapers, Avocado, Beer
Customer3: Beef, Lemons, Beer, Chips
Customer4: Cereal, Beer, Beef, Diapers

· The itemset (Diapers, Beer) occurs in 50% of the baskets
· Frequency threshold parameter
· Find as many frequent itemsets as possible
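The 50% figure can be checked directly on the four baskets. The following is a minimal brute-force sketch (not FP-growth); the names `support` and `frequent_pairs` are illustrative, and the 0.5 threshold is chosen only to match the slide's example.

```python
from collections import Counter
from itertools import combinations

# The four baskets from the slide.
baskets = [
    {"Avocado", "Milk", "Butter", "Potatoes"},   # Customer1
    {"Milk", "Diapers", "Avocado", "Beer"},      # Customer2
    {"Beef", "Lemons", "Beer", "Chips"},         # Customer3
    {"Cereal", "Beer", "Beef", "Diapers"},       # Customer4
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= b for b in baskets) / len(baskets)

def frequent_pairs(baskets, threshold):
    """Brute-force pair counting; only sensible for a toy data set."""
    counts = Counter()
    for b in baskets:
        counts.update(combinations(sorted(b), 2))
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items() if c / n >= threshold}

print(support({"Diapers", "Beer"}, baskets))   # 0.5
print(frequent_pairs(baskets, 0.5))
```

With a threshold of 0.5, three pairs qualify: (Avocado, Milk), (Beef, Beer) and (Beer, Diapers).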

SLIDE 16

Market basket analysis

· Popular, effective algorithm: FP-growth
· Based on the FP-tree data structure
· Requires all data in near-memory
· Most research in distributed models has been for cluster setups

SLIDE 17

Building the FP-tree

(extends the prefix-tree structure)

Customer1: Avocado, Milk, Butter, Potatoes

[Tree diagram, node labels: Avocado, Butter, Milk, Potatoes]

SLIDE 18

Building the FP-tree

Customer2: Milk, Diapers, Avocado, Beer

[Tree diagram, node labels: Avocado, Butter, Milk, Potatoes]

SLIDE 19

Building the FP-tree

Customer2: Milk, Diapers, Avocado, Beer

[Tree diagram, node labels: Avocado, Butter, Milk, Potatoes, Beer, Diapers, Milk]

SLIDE 20

Building the FP-tree

Customer2: Milk, Diapers, Avocado, Beer

[Tree diagram, node labels: Avocado, Butter, Milk, Potatoes, Beer, Diapers, Milk]

SLIDE 21

Building the FP-tree

[Tree diagram, node labels: Avocado, Butter, Milk, Potatoes, Beer, Diapers, Milk, Beef, Beer, Chips, Lemon, Cereal, Diapers]
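The build-up shown on these slides can be sketched as a counted prefix tree. The node labels suggest that each basket is inserted in alphabetical item order; that ordering is an assumption here (the canonical FP-tree orders items by descending frequency), and `Node`, `insert` and `paths` are illustrative names.

```python
class Node:
    """One tree node: an item, how many baskets pass through it, and its children."""
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}

def insert(root, basket):
    """Insert one basket along a single path, sharing existing prefixes."""
    node = root
    for item in sorted(basket):        # assumption: alphabetical item order
        node = node.children.setdefault(item, Node(item))
        node.count += 1

def paths(node, prefix=()):
    """Yield every root-to-leaf path as a tuple of items."""
    if not node.children:
        if prefix:
            yield prefix
        return
    for item in sorted(node.children):
        yield from paths(node.children[item], prefix + (item,))

baskets = [
    ["Avocado", "Milk", "Butter", "Potatoes"],   # Customer1
    ["Milk", "Diapers", "Avocado", "Beer"],      # Customer2
    ["Beef", "Lemons", "Beer", "Chips"],         # Customer3
    ["Cereal", "Beer", "Beef", "Diapers"],       # Customer4
]

root = Node()
for basket in baskets:
    insert(root, basket)

for path in paths(root):
    print(" > ".join(path))
```

Customer1 and Customer2 share the Avocado node, and Customer3 and Customer4 share the Beef and Beer nodes, so the tree has two top-level branches and four leaf paths.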

SLIDE 22

FP-growth

Grows the frequent itemsets recursively:

    FP-growth(FP-tree tree) {
        …
        for-each (item in tree)
            count = CountOccur(tree, item);
            if (IsFrequent(count)) {
                OutputSet(item);
                sub = tree.GetTree(tree, item);
                FP-growth(sub);
            }
    }
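The recursion in the pseudocode can be made concrete with a small runnable sketch. As a simplification, it works on plain lists of transactions (conditional databases) instead of a compressed FP-tree, and it assumes alphabetical item order; `grow` and `min_count` are hypothetical names.

```python
from collections import Counter

def grow(transactions, min_count, suffix=(), out=None):
    """Recursively collect frequent itemsets.

    transactions: list of item tuples with no repeated items.
    suffix: items already fixed by the enclosing recursive calls.
    """
    if out is None:
        out = {}
    counts = Counter(item for t in transactions for item in t)
    for item, count in sorted(counts.items()):
        if count >= min_count:
            itemset = (item,) + suffix
            out[itemset] = count                       # OutputSet(item)
            # Conditional data: transactions containing the item,
            # restricted to items that sort before it (avoids duplicates).
            sub = [tuple(i for i in t if i < item)
                   for t in transactions if item in t]
            grow(sub, min_count, itemset, out)         # FP-growth(sub)
    return out

baskets = [
    ("Avocado", "Milk", "Butter", "Potatoes"),
    ("Milk", "Diapers", "Avocado", "Beer"),
    ("Beef", "Lemons", "Beer", "Chips"),
    ("Cereal", "Beer", "Beef", "Diapers"),
]

for itemset, count in grow(baskets, min_count=2).items():
    print(itemset, count)
```

With `min_count=2` (50% of four baskets) this yields five frequent single items and the three frequent pairs (Avocado, Milk), (Beef, Beer) and (Beer, Diapers).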

SLIDE 23

FP-growth algorithm

Divide and Conquer

Traverse tree

[Tree diagram, node labels: Avocado, Butter, Milk, Potatoes, Beer, Diapers, Milk, Beef, Beer, Chips, Lemon, Cereal, Diapers]

SLIDE 24

FP-growth algorithm

Divide and Conquer

Generate sub-trees

[Tree diagram, node labels: Avocado, Butter, Milk, Potatoes, Beer, Diapers, Milk, Beef, Beer, Chips, Lemon, Cereal, Diapers]

SLIDE 25

FP-growth algorithm

Divide and Conquer

Call recursively

[Tree diagram, node labels: Avocado, Butter, Milk, Potatoes, Beer, Diapers, Milk, Beef, Beer, Chips, Lemon, Cereal, Diapers; sub-tree labels: Avocado, Butter, Beer, Diapers]

SLIDE 26

FP-growth algorithm Memory usage

The FP-tree does not fit in local memory; what to do?

· Emulate Distributed Shared Memory

SLIDE 27

Distributed Shared Memory?

· To add nodes is to add memory
· Works best in tightly coupled setups with low-latency, high-speed networks

[Diagram: five CPU/Memory pairs; labels: Shared Memory, Network]

SLIDE 28

FP-growth algorithm Memory usage

The FP-tree does not fit in local memory; what to do?

· Emulate Distributed Shared Memory
· Optimize your data structures
· Buy more RAM
· Get a good idea

SLIDE 29

Get a good idea

· Database scans are serial and can be distributed
· The list of items used in the recursive calls uniquely determines what part of data we are looking at
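The second point can be illustrated as follows: given only the list of items fixed so far, any node holding raw transactions can recompute the slice of data a recursive call would look at. This is a minimal sketch assuming alphabetical item order; `conditional_transactions` is a hypothetical helper name.

```python
def conditional_transactions(transactions, items):
    """Return the part of the data selected by a non-empty list of items.

    Keeps the transactions containing every listed item, restricted to
    the items that sort before the smallest listed item.
    """
    wanted = set(items)
    first = min(wanted)
    return [tuple(i for i in sorted(t) if i < first)
            for t in transactions if wanted <= set(t)]

baskets = [
    ("Avocado", "Milk", "Butter", "Potatoes"),
    ("Milk", "Diapers", "Avocado", "Beer"),
    ("Beef", "Lemons", "Beer", "Chips"),
    ("Cereal", "Beer", "Beef", "Diapers"),
]

print(conditional_transactions(baskets, ["Diapers"]))
print(conditional_transactions(baskets, ["Beer", "Diapers"]))
```

Because the result depends only on the item list and the transactions, a message carrying the item list is enough to describe a unit of work; no tree needs to be shipped between machines.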

SLIDE 30

Get a good idea

[Tree diagram, node labels: Avocado, Butter, Milk, Potatoes, Beer, Diapers, Milk, Beef, Beer, Chips, Lemon, Cereal, Diapers; sub-tree labels: Avocado, Butter, Beer, Diapers]

SLIDE 31

Get a good idea

[Diagram labels: Milk; Butter, Milk; Avocado; Butter; Beer; Diapers; Avocado; Avocado; Beer; Diapers, Milk]

These are postfix paths

SLIDE 32

SLIDE 33

Buckets

· Use postfix paths for messaging
· Working with buckets

[Diagram labels: Transactions, Items]
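One way to read the bucket idea is that the transactions are split into independent chunks, each chunk is counted separately for a given postfix, and the partial counts are merged. The sketch below assumes that reading; the round-robin split and the names `make_buckets`, `count_in_bucket` and `merge` are hypothetical.

```python
from collections import Counter

def make_buckets(transactions, n_buckets):
    """Split transactions round-robin into n_buckets lists."""
    buckets = [[] for _ in range(n_buckets)]
    for i, t in enumerate(transactions):
        buckets[i % n_buckets].append(t)
    return buckets

def count_in_bucket(bucket, postfix):
    """Count items that co-occur with every item of the postfix in one bucket."""
    wanted = set(postfix)
    counts = Counter()
    for t in bucket:
        if wanted <= set(t):
            counts.update(i for i in t if i not in wanted)
    return counts

def merge(partial_counts):
    """Sum per-bucket counters into one total."""
    total = Counter()
    for c in partial_counts:
        total.update(c)
    return total

baskets = [
    ("Avocado", "Milk", "Butter", "Potatoes"),
    ("Milk", "Diapers", "Avocado", "Beer"),
    ("Beef", "Lemons", "Beer", "Chips"),
    ("Cereal", "Beer", "Beef", "Diapers"),
]

buckets = make_buckets(baskets, 2)
total = merge(count_in_bucket(b, ("Beer",)) for b in buckets)
print(total["Diapers"], total["Beef"], total["Avocado"])
```

The merged counts equal what a single scan over all transactions would produce, so each bucket can be counted on a different machine.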

SLIDE 34

FP-growth revisited

    FP-growth(FP-tree tree) {
        …
        for-each (item in tree)
            count = CountOccur(tree, item);
            if (IsFrequent(count)) {
                OutputSet(item);
                sub = tree.GetTree(tree, item);
                FP-growth(sub);
            }
    }

Slide annotations: "Replaced with postfix" (once), "Done in parallel" (three times)

SLIDE 35

Communication

[Diagram: four nodes and a data layer]

SLIDE 36

Revised Communication

[Diagram: four nodes, a data layer, and an MQ]

SLIDE 37

Running FP-growth

· Distribute buckets
· Count items (with postfix size=n)
· Collect counts (per postfix)
· Call recursive
· Standard FP-growth

SLIDE 38

Running FP-growth

· Distribute buckets
· Count items (with postfix size=n)
· Collect counts (per postfix)
· Call recursive
· Standard FP-growth
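The loop described by these steps might look roughly like the single-process simulation below. It is a sketch under several assumptions: a `deque` stands in for the message queue, each message is a postfix, items are ordered alphabetically, and the final hand-over to standard FP-growth is not modeled. All function names are hypothetical.

```python
from collections import Counter, deque

def count_in_bucket(bucket, postfix):
    """Per-bucket work: count items that extend the postfix."""
    wanted = set(postfix)
    first = min(postfix) if postfix else None
    counts = Counter()
    for t in bucket:
        if wanted <= set(t):
            counts.update(i for i in t if first is None or i < first)
    return counts

def message_driven_mining(buckets, min_count):
    """Process postfix messages until the queue is empty."""
    messages = deque([()])             # start message: the empty postfix
    frequent = {}
    while messages:
        postfix = messages.popleft()
        total = Counter()
        for bucket in buckets:         # in a cloud setup: one node per bucket
            total.update(count_in_bucket(bucket, postfix))
        for item, count in total.items():
            if count >= min_count:
                new_postfix = (item,) + postfix
                frequent[new_postfix] = count
                messages.append(new_postfix)   # "call recursive" via a message
    return frequent

buckets = [
    [("Avocado", "Milk", "Butter", "Potatoes"),
     ("Beef", "Lemons", "Beer", "Chips")],
    [("Milk", "Diapers", "Avocado", "Beer"),
     ("Cereal", "Beer", "Beef", "Diapers")],
]

for itemset, count in sorted(message_driven_mining(buckets, 2).items()):
    print(itemset, count)
```

Each message is just a short list of items, and a lost or repeated message only affects the counts for that one postfix, which fits the small-message, queue-driven design the slides describe.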

SLIDE 39

Collecting what we have learned

· Message-driven work, using a message queue
· Peer-to-peer for intermediate results
· Distribute data for scalability (buckets)
· Small messages (list of items)
· Allows us to distribute FP-growth

SLIDE 40

Advantages

· Configurable work sizes
· Good distribution of work
· Robust against computer failure
· Fast!

SLIDE 41

So what about performance?

[Chart: time axis from 00:00:00 to 04:30:00; x-axis ticks 1, 2, 4, 8; series: Message-driven FP-growth, FP-growth, Total node time]

SLIDE 42

Thank you!