A Pattern-Aware Graph Mining System



slide-1
SLIDE 1

A Pattern-Aware Graph Mining System

Kasra Jamshidi Rakesh Mahadasa Keval Vora Simon Fraser University

https://github.com/pdclab/peregrine

slide-2
SLIDE 2

Why should you pay attention?

slide-3
SLIDE 3

Peregrine executes 700x faster

slide-4
SLIDE 4

Peregrine consumes 100x less memory

slide-5
SLIDE 5

Peregrine scales to 100x larger datasets

slide-6
SLIDE 6

On 8x fewer machines

slide-7
SLIDE 7

With a more expressive API

slide-8
SLIDE 8

Graph Mining

[Figure, built up over slides 8-11: the data graph (vertices a, b, c, d, e, f, g), a subgraph on vertices a, b, d, and a pattern]

slide-12
SLIDE 12

Graph Mining

[Figure, slides 12-14: edge-induced subgraphs on vertices a, b, d]

slide-15
SLIDE 15

Graph Mining

[Figure, slides 15-17: vertex-induced subgraphs on vertices a, b, d]

slide-18
SLIDE 18

Graph Mining

[Figure, slides 18-20: the data graph (vertices a, b, c, d, e, f, g) next to its edge-induced and vertex-induced subgraphs on vertices a, b, d]
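The difference between the two matching semantics can be made concrete with a small standalone sketch. It does not use Peregrine; `Graph`, `makeGraph`, `triangleWithTail`, and `countWedges` are hypothetical names chosen for illustration only. For a three-vertex path pattern, an edge-induced match only needs the two pattern edges to exist in the data graph, while a vertex-induced match also requires that the data graph has no extra edge among the matched vertices.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <set>
#include <utility>
#include <vector>

// Undirected graph as adjacency sets; vertex ids are 0..n-1 (illustrative type).
using Graph = std::vector<std::set<int>>;

Graph makeGraph(int n, const std::vector<std::pair<int, int>> &edges) {
  Graph g(n);
  for (const auto &e : edges) {
    assert(e.first >= 0 && e.first < n && e.second >= 0 && e.second < n);
    g[e.first].insert(e.second);
    g[e.second].insert(e.first);
  }
  return g;
}

// Example data graph: triangle 0-1-2 plus a pendant edge 2-3.
Graph triangleWithTail() {
  return makeGraph(4, {{0, 1}, {1, 2}, {0, 2}, {2, 3}});
}

// Counts matches of the three-vertex path x - c - y (c is the center).
// vertexInduced == false: any two distinct neighbors of c form a match.
// vertexInduced == true:  x and y must additionally be non-adjacent.
long countWedges(const Graph &g, bool vertexInduced) {
  long matches = 0;
  for (std::size_t c = 0; c < g.size(); ++c) {
    for (auto x = g[c].begin(); x != g[c].end(); ++x) {
      for (auto y = std::next(x); y != g[c].end(); ++y) {
        if (vertexInduced && g[*x].count(*y)) continue;  // extra edge x-y
        ++matches;
      }
    }
  }
  return matches;
}
```

On `triangleWithTail()`, the edge-induced count is 5 and the vertex-induced count is 2, because the three paths inside the triangle are closed by a third edge and so are not vertex-induced matches.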

slide-21
SLIDE 21

Graph Mining

Frequent Patterns (Edge-Induced)

[Figure: data graph with vertices a, b, c, d, e, f, g]

slide-22
SLIDE 22

Graph Mining

Unlabeled Pattern Distribution (Vertex-Induced)

[Figure: data graph with vertices a, b, c, d, e, f, g; pattern counts shown: 2, 1, 1, 2]

slide-23
SLIDE 23

Scalability Challenge

  • 4-motif counting on Orkut graph (|V| = 3M, |E| = 117M)

123,503,340,341,270 subgraphs

slide-26
SLIDE 26

System Requirements

  • Uniqueness
  • Structure
  • Interestingness

[Figures, slides 27-31: three-vertex examples on vertices x, y, z illustrating uniqueness and structure]

slide-36
SLIDE 36

Existing Work

[Table: existing systems compared on Uniqueness, Structure, and Interestingness]

  • Arabesque (SOSP '15)
  • RStream (OSDI '18)
  • Fractal (SIGMOD '19)
  • AutoMine (SOSP '19)

Limitations of existing systems:

  • Overlook user requirements
  • Per-subgraph computations


slide-40
SLIDE 40

Pattern Awareness

  • Pattern Selection
  • Pattern Matching

slide-42
SLIDE 42

Pattern Awareness

Pattern Selection

slide-43
SLIDE 43

Pattern Programming

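#include <iostream>
#include "Peregrine.hh"

using namespace Peregrine;

void motifCounting(int size) {
  // Load the data graph from disk.
  DataGraph G("path/to/graph/");
  // Generate every vertex-induced pattern with `size` vertices.
  auto patterns = PatternGenerator::all(size, VERTEX_INDUCED);
  // Count the matches of each pattern in G.
  auto counts = count(G, patterns);
  for (auto &[pattern, n] : counts)
    std::cout << pattern << " " << n << std::endl;
}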


slide-47
SLIDE 47

Pattern Programming

Generated patterns can be modified before matching (built up over slides 47-50):

DataGraph G("path/to/graph/");
auto patterns = PatternGenerator::all(size, VERTEX_INDUCED);
// Label the vertices of the first generated pattern.
patterns[0].set_labels({'a', 'b', 'c', 'd'});
// Add an edge between pattern vertices 1 and 5.
patterns[0].add_edge(1, 5);
// Append a pattern loaded from a file.
patterns.emplace_back("path/to/pattern.txt");
auto counts = count(G, patterns);

slide-51
SLIDE 51

Pattern Programming

DataGraph G("path/to/graph/");
auto patterns = PatternGenerator::all(size, VERTEX_INDUCED);
// Build a triangle pattern explicitly, edge by edge.
auto pattern = Pattern().add_edge(1, 2)
                        .add_edge(1, 3)
                        .add_edge(2, 3);
auto counts = count(G, {pattern});


slide-54
SLIDE 54

Pattern Programming

void frequentSubgraphMining() {
  DataGraph G("path/to/graph/");
  // Start from the edge-induced patterns generated for size 2.
  auto patterns = PatternGenerator::all(2, EDGE_INDUCED);
  // For every match, add its vertex mapping to the pattern's domain.
  auto mapDomain = [](auto &&match, auto &&aggregator) {
    aggregator.map(match.pattern, match.mapping);
  };
  auto results = match<Pattern, Domain>(G, patterns, mapDomain);
  for (auto &[pattern, frequency] : results)
    std::cout << pattern << " " << frequency << std::endl;
}
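The slide does not show what the `Domain` aggregation value computes. A plausible reading, stated here as an assumption, is the minimum image-based (MNI) support measure commonly used in frequent subgraph mining: for each pattern vertex, collect the distinct data vertices it is mapped to, and take the smallest such set as the pattern's frequency. The class below is a hypothetical standalone stand-in named `Domain` for readability; it is not Peregrine's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical stand-in for a domain-style aggregation value (assumed MNI support).
class Domain {
 public:
  explicit Domain(std::size_t patternSize) : images_(patternSize) {}

  // Records one match: mapping[i] is the data vertex matched to pattern vertex i.
  void add(const std::vector<int> &mapping) {
    assert(mapping.size() == images_.size());
    for (std::size_t i = 0; i < images_.size(); ++i)
      images_[i].insert(mapping[i]);
  }

  // Combines two partial domains, e.g. from different worker threads.
  void merge(const Domain &other) {
    assert(other.images_.size() == images_.size());
    for (std::size_t i = 0; i < images_.size(); ++i)
      images_[i].insert(other.images_[i].begin(), other.images_[i].end());
  }

  // Support: the smallest number of distinct images over all pattern vertices.
  std::size_t support() const {
    if (images_.empty()) return 0;
    std::size_t s = images_[0].size();
    for (const auto &img : images_) s = std::min(s, img.size());
    return s;
  }

 private:
  std::vector<std::set<int>> images_;
};
```

Under this measure, three matches of a single-edge pattern that all share the same first vertex have support 1, not 3, which keeps the measure anti-monotone as patterns grow.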


slide-58
SLIDE 58

Pattern Awareness

  • Pattern Selection
  • Pattern Matching

slide-59
SLIDE 59

Pattern Awareness

Pattern Selection: Anti-Edge

[Figure: pattern on vertices u and v containing an anti-edge]

slide-60
SLIDE 60

Pattern Awareness

Pattern Selection: Anti-Vertex

[Figure: pattern on vertices u and v containing an anti-vertex]

slide-61
SLIDE 61

Pattern Awareness

  • Pattern Selection
  • Pattern Matching

slide-63
SLIDE 63

Pattern Awareness

Pattern Matching

  • Symmetry breaking (RECOMB '07)
  • Core pattern reduction (SIGMOD '16)

slide-64
SLIDE 64

Symmetry Breaking

[Figure, slides 65-69: a four-vertex pattern on vertices u, x, y, v, redrawn with u and v exchanged and with x and y exchanged]

slide-70
SLIDE 70

Symmetry Breaking

[Figure, slides 70-78: two workers match the pattern (u, x, y, v) against the data graph (vertices a, b, c, d, e, f, g)]

  • Worker 1: u = b, x = a, y = g, v = d
  • Worker 2: u = d, x = a, y = g, v = b

Both workers arrive at the same subgraph on vertices a, b, d, g.

slide-79
SLIDE 79

Symmetry Breaking

Partial orders: u < v, x < y

[Figure, slides 79-81: pattern on vertices u, v, x, y]
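The effect of the two partial orders can be sketched without Peregrine. The transcript does not preserve the pattern's edges, so the code below assumes, purely for illustration, a diamond: edges u-v, u-x, u-y, v-x, v-y. That pattern has exactly the two symmetries the slides show (exchanging u with v, and x with y). `Graph`, `makeGraph`, and `countDiamonds` are hypothetical names.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Undirected graph as adjacency sets; vertex ids are 0..n-1 (illustrative type).
using Graph = std::vector<std::set<int>>;

Graph makeGraph(int n, const std::vector<std::pair<int, int>> &edges) {
  Graph g(n);
  for (const auto &e : edges) {
    assert(e.first >= 0 && e.first < n && e.second >= 0 && e.second < n);
    g[e.first].insert(e.second);
    g[e.second].insert(e.first);
  }
  return g;
}

// Counts edge-induced mappings of the assumed diamond pattern
// (edges u-v, u-x, u-y, v-x, v-y) into g.
// With breakSymmetry, only mappings with u < v and x < y are kept,
// so each subgraph is reported once instead of four times.
long countDiamonds(const Graph &g, bool breakSymmetry) {
  long matches = 0;
  const int n = static_cast<int>(g.size());
  for (int u = 0; u < n; ++u) {
    for (int v : g[u]) {  // edge u-v
      if (breakSymmetry && !(u < v)) continue;
      for (int x : g[u]) {  // edge u-x
        if (x == v || !g[v].count(x)) continue;  // edge v-x
        for (int y : g[u]) {  // edge u-y
          if (y == v || y == x || !g[v].count(y)) continue;  // edge v-y
          if (breakSymmetry && !(x < y)) continue;
          ++matches;
        }
      }
    }
  }
  return matches;
}
```

On a data graph that is itself a single diamond, the unrestricted search reports 4 mappings and the restricted search reports 1. On a 4-clique the counts are 24 and 6.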

slide-82
SLIDE 82

Core Pattern

[Figure, slides 82-84: pattern on vertices u, v, x, y]

slide-85
SLIDE 85

Core Pattern

Pattern-Unaware

[Figure, slides 85-89: a match is grown one vertex at a time in the data graph (vertices a, b, c, d, e, f, g): (b, d), then (b, a, d), then (b, a, g, d)]

slide-90
SLIDE 90

Core Pattern

Pattern-Aware

[Figure, slides 90-96: u and v are matched first, to b and d, with x and y still open; filling in x and y gives (b, a, g, d) and, once vertex h is added to the data graph, also (b, a, h, d)]
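A rough standalone sketch of the idea, again assuming the diamond pattern (edges u-v, u-x, u-y, v-x, v-y), which the transcript does not confirm: the pair {u, v} touches every pattern edge, so it can serve as the core. Once the core edge is matched, the remaining vertices x and y are not explored step by step; they are read off the intersection of the adjacency lists of u and v. `Graph`, `makeGraph`, `DiamondMatch`, and `matchDiamondsViaCore` are hypothetical names, and this is not Peregrine's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <set>
#include <utility>
#include <vector>

// Undirected graph as adjacency sets; vertex ids are 0..n-1 (illustrative type).
using Graph = std::vector<std::set<int>>;

Graph makeGraph(int n, const std::vector<std::pair<int, int>> &edges) {
  Graph g(n);
  for (const auto &e : edges) {
    assert(e.first >= 0 && e.first < n && e.second >= 0 && e.second < n);
    g[e.first].insert(e.second);
    g[e.second].insert(e.first);
  }
  return g;
}

struct DiamondMatch {
  int u, v, x, y;
};

// Matches the core edge (u, v) with u < v, then completes x < y from the
// common neighbors of u and v. Each subgraph is produced exactly once.
std::vector<DiamondMatch> matchDiamondsViaCore(const Graph &g) {
  std::vector<DiamondMatch> out;
  const int n = static_cast<int>(g.size());
  for (int u = 0; u < n; ++u) {
    for (int v : g[u]) {
      if (!(u < v)) continue;  // visit each core edge once
      // Candidates for both non-core vertices: neighbors shared by u and v.
      std::vector<int> common;
      std::set_intersection(g[u].begin(), g[u].end(), g[v].begin(),
                            g[v].end(), std::back_inserter(common));
      for (std::size_t i = 0; i < common.size(); ++i)
        for (std::size_t j = i + 1; j < common.size(); ++j)
          out.push_back(DiamondMatch{u, v, common[i], common[j]});
    }
  }
  return out;
}
```

The design point is that the expensive exploration is limited to the core, while the remaining vertices come from set intersections, which also makes it cheap to count completions without materializing them.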

slide-97
SLIDE 97

Pattern Awareness

Pattern Matching

  • Symmetry breaking (RECOMB '07)
  • Core pattern reduction (SIGMOD '16)

slide-98
SLIDE 98

Pattern Awareness

Pattern Matching

  • Early termination

97

slide-99
SLIDE 99

Early Termination

// bound is a ratio in [0, 1], so it is compared in floating point.
bool globalClusteringCoefficient(double bound) {
  DataGraph G("path/to/graph/");
  // Count all triplets (3-vertex stars) up front.
  auto triplet = PatternGenerator::star(3);
  int numTriplets = count(G, {triplet});
  auto countAndCheck = [=](auto &&match, auto &&aggregator) {
    // Read the running triangle count; stop once the bound is exceeded.
    int numTriangles = aggregator.readValue(match.pattern);
    if (3.0 * numTriangles / numTriplets > bound)
      aggregator.stop();
    else
      aggregator.map(match.pattern, 1);
  };
  auto triangle = PatternGenerator::clique(3);
  auto result = match<Pattern, int>(G, triangle, countAndCheck);
  return 3.0 * result[triangle] / numTriplets > bound;
}

[Figure, slides 101-108: a triplet and a triangle]
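The same early-termination logic can be tried out without Peregrine. The sketch below is a standalone approximation under stated assumptions: triplets are counted as pairs of neighbors around each vertex, triangles are enumerated once each, and the search returns as soon as the running ratio exceeds the bound. `Graph`, `makeGraph`, `Result`, `countTriplets`, and `clusteringExceeds` are hypothetical names.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Undirected graph as adjacency sets; vertex ids are 0..n-1 (illustrative type).
using Graph = std::vector<std::set<int>>;

Graph makeGraph(int n, const std::vector<std::pair<int, int>> &edges) {
  Graph g(n);
  for (const auto &e : edges) {
    assert(e.first >= 0 && e.first < n && e.second >= 0 && e.second < n);
    g[e.first].insert(e.second);
    g[e.second].insert(e.first);
  }
  return g;
}

struct Result {
  bool exceeds;        // true if 3 * triangles / triplets > bound
  long trianglesSeen;  // triangles enumerated before returning
};

// Number of triplets: for each center vertex, every pair of its neighbors.
long countTriplets(const Graph &g) {
  long triplets = 0;
  for (const auto &nbrs : g) {
    const long d = static_cast<long>(nbrs.size());
    triplets += d * (d - 1) / 2;
  }
  return triplets;
}

// Enumerates each triangle a < b < c once and stops as soon as the running
// clustering coefficient exceeds the bound.
Result clusteringExceeds(const Graph &g, double bound) {
  const long triplets = countTriplets(g);
  if (triplets == 0) return {false, 0};
  long triangles = 0;
  const int n = static_cast<int>(g.size());
  for (int a = 0; a < n; ++a) {
    for (int b : g[a]) {
      if (b <= a) continue;
      for (int c : g[b]) {
        if (c <= b || !g[a].count(c)) continue;
        ++triangles;
        if (3.0 * triangles / triplets > bound) return {true, triangles};
      }
    }
  }
  return {false, triangles};
}
```

On a 4-clique (12 triplets, 4 triangles) with a bound of 0.4, the search stops after the second triangle, since 6/12 already exceeds the bound. With a bound of 1.0 all four triangles are visited and the result is false.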

slide-110
SLIDE 110

Comparison with Existing Work

  • Peregrine with 16 logical cores and 32GB RAM
  • Arabesque & Fractal with 8x16 logical cores and 8x32GB RAM
  • RStream with 96 logical cores and 192GB RAM

109

slide-111
SLIDE 111

Comparison with Existing Work

110


slide-113
SLIDE 113

Effects of Pattern Awareness

112

slide-114
SLIDE 114

Scalability

113



slide-125
SLIDE 125

Pattern-aware programming and processing models

  • Shift abstraction from subgraph to pattern
  • User program is transparent to the system
  • Up to 42x faster than pattern-unaware
  • Up to 737x faster than state-of-the-art

https://github.com/pdclab/peregrine

124