SLIDE 12 Application selection – memory usage patterns
- 1. GroupBy (from BigSQLBench)
- 2. Join (from BigSQLBench)
[Figure: GroupBy dataflow. The map stage runs map(), merge(), and spill() to turn the input records into (key, value) pairs such as (1, a), (2, b), (3, c), (1, d); the reduce stage runs groupByKey() to accumulate all values of each key into records such as (1, [a, d, A]), (3, [c, C]), (2, [b, B]), (4, [D]). Memory usage pattern: long-lived accumulated records in the reduce stage.]
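The accumulation pattern above can be sketched without Spark. A minimal plain-Python stand-in for groupByKey() (the function name is illustrative, not Spark's API) shows where the long-lived records come from:

```python
from collections import defaultdict

def group_by_key(records):
    """groupByKey()-style accumulation: every value of a key is held
    in memory until the whole (key, [values]) record is emitted --
    the long-lived accumulated records pattern."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)   # values accumulate per key
    return dict(groups)

# group_by_key([(1, "a"), (2, "b"), (1, "d")])
# -> {1: ["a", "d"], 2: ["b"]}
```

In a real reduce task the grouped values live until the whole group is consumed, which is why a heavily skewed key can exhaust task memory.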
[Figure: Join dataflow. The map stage runs map() to emit (key, value) pairs from both input tables; the reduce stage runs join(), which first accumulates the values of each key from both sides, e.g. (1, [(a, d), A]) and (2, [b, (B, D)]), and then pairs every left value with every right value, e.g. (2, (b, B)) and (2, (b, D)). Memory usage patterns: long-lived accumulated records, plus massive temporary records generated by the per-key Cartesian product in join().]
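A minimal plain-Python sketch (no Spark; the function name is illustrative) of an equi-join in the style of Spark's join(), making the per-key cross product explicit:

```python
from collections import defaultdict

def join(left, right):
    """Pair every left value of a key with every right value of that
    key. The accumulated right-side values are the long-lived records;
    the per-key Cartesian product yields the massive temporary records."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)   # accumulate one side per key
    return [(key, (lv, rv))
            for key, lv in left
            for rv in right_by_key[key]]  # Cartesian product per key

# join([(2, "b")], [(2, "B"), (2, "D")])
# -> [(2, ("b", "B")), (2, ("b", "D"))]
```

If one key has m left values and n right values, join() materializes m * n output records for that key, which is why skewed keys blow up memory here.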
[Figure: SVM dataflow. The input matrix of feature vectors $x_i$ with labels $y_i$ is partitioned across tasks and cached. In each iteration, every map task scans its cached partition and, using $w^T x_i$, $x_i$, and $y_i$ for each record, produces one partial pair $\bigl(\sum_{i=1}^{n} \mathrm{grad}(w, x_i),\ \sum_{i=1}^{n} \mathrm{loss}(w, x_i)\bigr)$ over its $n$ records. The reduce stage adds the partial gradient and loss sums, computes the new hyperplane vector in compute(w_new), and the driver broadcasts the new $w^T$ to all tasks for the next iteration. Memory usage patterns: humongous data objects (each task's accumulated gradient vector) and long-lived cached records (the cached input matrix).]
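A minimal plain-Python sketch of one iteration of the pattern above, assuming a hinge loss (the function names are illustrative, not MLlib's API): each task folds its cached partition into a single partial (gradient, loss) pair, and the driver sums the partials and updates w.

```python
def partial_sums(w, partition):
    """One map task: fold a cached partition of (features, label)
    records into a single (gradient-sum, loss-sum) pair. The gradient
    vector is one humongous object of feature-vector length."""
    grad = [0.0] * len(w)
    loss = 0.0
    for x, y in partition:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))      # y * w.x
        if margin < 1.0:                        # hinge loss is active
            grad = [g - y * xi for g, xi in zip(grad, x)]      # grad += -y*x
            loss += 1.0 - margin
    return grad, loss

def iterate(w, partitions, step):
    """Driver side: sum the partial gradients/losses from all tasks,
    take a gradient step, and (in Spark) broadcast the new w."""
    total_grad = [0.0] * len(w)
    for grad, loss in (partial_sums(w, p) for p in partitions):
        total_grad = [t + g for t, g in zip(total_grad, grad)]
    return [wi - step * g for wi, g in zip(w, total_grad)]

# iterate([0.0, 0.0], [[([1.0, 0.0], 1.0)]], 1.0) -> [1.0, 0.0]
```

Only one (gradient, loss) pair per partition crosses the shuffle, but each pair is as wide as the feature space, which is what makes these objects humongous for high-dimensional data.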
[Figure: PageRank dataflow. The map stage turns the edge list, e.g. (1, 2), (3, 5), (3, 6), into per-page adjacency lists such as (1, [2]), (3, [5, 6]), (6, [3, 7]), which are cached as long-lived cached records and reused by every iterative stage (1st, 2nd, 3rd, ...). Each iterative stage runs join() to combine the cached adjacency lists with the current ranks, e.g. (1, [(2), 1.0]), emits a rank contribution along every out-edge, e.g. (5, 0.5) and (6, 0.5) from (3, [(5, 6), 1.0]), and then runs reduceByKey() to sum the contributions per page, e.g. (1, 2.0). Memory usage patterns: long-lived accumulated records in reduceByKey(), on top of the long-lived cached records.]
- 3. SVM (from Spark MLlib)
- 4. PageRank (from Spark Graph library)
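One iterative stage of the PageRank pattern can be sketched in plain Python (no Spark, no damping factor; names are illustrative): join the cached adjacency lists with the current ranks, emit rank / out_degree along every edge, then sum contributions per page as reduceByKey() would.

```python
from collections import defaultdict

def pagerank_step(links, ranks):
    """One iterative stage: links is the cached {page: [out-links]}
    map (long-lived cached records), ranks is {page: rank}. The
    contribs accumulator plays the role of reduceByKey(_ + _)."""
    contribs = defaultdict(float)
    for page, out_links in links.items():   # join() on page id
        rank = ranks.get(page, 0.0)
        for dest in out_links:
            contribs[dest] += rank / len(out_links)  # one record per edge
    return dict(contribs)

# pagerank_step({3: [5, 6]}, {3: 1.0}) -> {5: 0.5, 6: 0.5}
```

The cached adjacency lists live for the whole job while the per-page sums accumulate within each stage, so both long-lived patterns from the figure appear at once.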