SLIDE 1

An Experimental Evaluation of Garbage Collectors on Big Data Applications

August 2019

Lijie Xu1, Tian Guo2, Wensheng Dou1, Wei Wang1, Jun Wei1

1 Institute of Software, Chinese Academy of Sciences; 2 Worcester Polytechnic Institute

The 45th International Conference on Very Large Data Bases (VLDB 2019)

SLIDE 2

[Figure: garbage-collected languages and the big data frameworks built on them]

Popular big data frameworks rely on garbage-collected languages to manage in-memory objects

These frameworks rely on the JVM garbage collector to manage the in-memory objects generated by big data applications.

SLIDE 3

GC inefficiency

Big data applications suffer from heavy GC overhead

  • GC time can take up to ~30% of a Spark application's execution time [1]

[Figure: memory usage over time; frequent and long GC pauses occur as usage approaches the memory size]

[1] https://stackoverflow.com/questions/38965787/spark-executor-gc-taking-long

GC: When JVM memory is nearly full, the garbage collector pauses the application to reclaim unused objects.
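To quantify this overhead on a concrete job, one simple option is to turn on the JVM's GC logs for each executor. The sketch below is a hypothetical Spark 2.x / Java 8 setup (the application name is made up; the flags are standard HotSpot ones), not the measurement harness used in this study; per-task GC time is also shown in the Spark web UI.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: enable Java 8 GC logging on every executor so that young/full
// GC pause times can be read from the executor logs.
val conf = new SparkConf()
  .setAppName("gc-overhead-probe")   // hypothetical application name
  .set("spark.executor.extraJavaOptions",
       "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
val sc = new SparkContext(conf)
// ... run the workload, then inspect the executor logs or the "GC Time" column in the web UI ...
sc.stop()
```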

SLIDE 4

Question


Q: What are the causes of heavy GC overhead?

SLIDE 5


Example: Spark Join Application

[Figure: Spark join dataflow; map() tasks emit (key, value) records, shuffle write partitions them in memory, shuffle read aggregates them in memory, and reduce() joins matching keys; the joined result may be cached. Annotations mark the sources of in-memory objects reachable from GC roots: user code (large intermediate computing results), shuffle write/read (large intermediate data), and cached data (large cached data)]

The causes of heavy GC overhead

Cause 1: Big data applications generate massive data objects ⇒ GC is time-consuming.
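The dataflow above corresponds to a very ordinary Spark program. A minimal sketch of such a join is shown below; the input paths and the comma-separated record format are hypothetical, not the benchmark's actual data:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object JoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("join-sketch"))

    // Two datasets of (key, value) pairs; every parsed record becomes a JVM object.
    val left = sc.textFile("hdfs:///data/left").map { line =>
      val Array(k, v) = line.split(","); (k, v)
    }
    val right = sc.textFile("hdfs:///data/right").map { line =>
      val Array(k, v) = line.split(","); (k, v)
    }

    // Shuffle write/read buffers records in memory (execution space), and the
    // join materializes one output record per matching key pair.
    val joined = left.join(right)

    // Optionally keep the joined result in memory (storage space).
    joined.cache()
    println(joined.count())
    sc.stop()
  }
}
```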

SLIDE 6

Memory space managed by the framework (the same join dataflow as above):

  • 1. User space: for user code
  • 2. Execution space: for shuffle write/read
  • 3. Storage space: for cached data

The causes of heavy GC overhead

Cause 2: The framework only manages intermediate and cached data in a logical space ⇒ it relies on garbage collectors to manage the actual data objects.

SLIDE 7


The causes of heavy GC overhead

Cause 3: Current garbage collectors are not designed for big data applications (they are not aware of the characteristics of big data objects).

SLIDE 8

Three popular garbage collectors

JVM has three popular garbage collectors

  • Parallel, CMS, and G1 collectors
  • One JVM uses only one collector at runtime
[Figure: JVM heap layouts; Parallel/CMS use a contiguous Young Generation (Eden + two Survivor spaces) and Old Generation, while G1 divides the heap into equal-sized regions labeled Eden (E), Survivor (S), Old (O), Humongous (H), or non-allocated]

  • Parallel/CMS collector: contiguous generations
  • G1 collector: non-contiguous generations (equal-sized regions)

  • Young Gen: for storing short-lived objects
  • Old Gen: for storing long-lived objects
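Which of the three collectors a JVM uses is chosen with a startup flag. A minimal, hedged sketch of selecting one for Spark executors on Java 8 (where all three are available; the heap size shown is illustrative, not the paper's configuration):

```scala
import org.apache.spark.SparkConf

// One JVM runs exactly one collector, picked at startup:
//   -XX:+UseParallelGC        Parallel collector (the Java 8 default)
//   -XX:+UseConcMarkSweepGC   CMS collector
//   -XX:+UseG1GC              G1 collector
val conf = new SparkConf()
  .set("spark.executor.memory", "6g")                      // illustrative heap size
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")  // swap this flag to compare collectors
```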

SLIDE 9

Three popular garbage collectors

GC process: mark live objects, sweep unused objects, and (optionally) compact the space.

[Figure: GC process; before GC, live and unused objects are reachable from GC roots; the collector then marks and sweeps the unused objects]

Different GC algorithms:

  • Parallel GC is a stop-the-world collector: application threads are stopped while the GC threads mark and sweep.
  • CMS/G1 GC are concurrent collectors: short stop-the-world pauses bracket marking and sweeping phases that run concurrently with the application threads.

SLIDE 10

Research questions

Q1: Why are current garbage collectors inefficient for big data applications? What are the root causes?
Q2: Are there any GC optimization methods?

SLIDE 11
Methodology – Experimental evaluation

  • 1. Select representative big data applications (SQL, graph, machine learning) with different memory usage patterns
  • 2. Run the applications on different garbage collectors (Parallel, CMS, and G1) to identify GC patterns
  • 3. Analyze the correlation between memory usage patterns and GC patterns to identify the causes of GC inefficiency

[Figure: memory usage over time; frequent and long GC pauses occur as usage approaches the memory size]

SLIDE 12

Application selection – memory usage patterns

  • 1. GroupBy (from BigSQLBench)
  • 2. Join (from BigSQLBench)
  • 3. SVM (from Spark MLlib)
  • 4. PageRank (from Spark Graph library)

[Figure: dataflows and memory usage patterns of the four applications; GroupBy and Join accumulate long-lived shuffled records during groupByKey()/join(), Join additionally produces massive temporary records through its cartesian product, SVM keeps long-lived cached training records and aggregates humongous gradient/loss vectors while the driver broadcasts the new hyperplane vector, and PageRank accumulates long-lived shuffled records in each iteration on top of long-lived cached data]
SLIDE 13

Application selection – memory usage patterns

  • 1. GroupBy (SQL)
  • 2. Join (SQL)

P1 (long-lived accumulated records): shuffled records are accumulated in memory ⇒ long-lived objects ⇒ stored in the Old Gen of the JVM heap.

SLIDE 14

Application selection – memory usage patterns

P2 (massive temporary records): temporary results are generated in user code (e.g., cartesian()) ⇒ short-lived objects ⇒ stored in the Young Gen, while the P1 records stay in the Old Gen.
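Both patterns arise from ordinary Spark operators. The sketch below (hypothetical input path and record format) shows the shape of such code: groupByKey() builds long-lived per-key buffers, while the value-pairing step is an illustrative stand-in for the benchmark's cartesian product and churns out short-lived temporaries.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MemoryPatternSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("memory-pattern-sketch"))

    val pairs = sc.textFile("hdfs:///data/uservisits").map { line =>   // hypothetical input
      val Array(k, v) = line.split(",")
      (k, v)
    }

    // P1: long-lived accumulated records. Each reduce task buffers all values of
    // its keys until the stage finishes, so the buffers get promoted to the Old Gen.
    val grouped = pairs.groupByKey()

    // P2: massive temporary records. Pairing every value with every other value of
    // the same key floods the heap with short-lived tuples that die in the Young Gen.
    val temporaries = grouped.flatMap { case (k, vs) =>
      for (a <- vs; b <- vs) yield (k, (a, b))
    }

    println(grouped.count() + " groups, " + temporaries.count() + " temporary pairs")
    sc.stop()
  }
}
```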

SLIDE 15

Application selection – memory usage patterns

Machine learning app: SVM with cached data

[Figure: SVM dataflow; the input matrix of (features $x_i$, label $y_i$) rows is cached in memory across iterations, map tasks compute per-record gradient and loss terms, reduce tasks aggregate $\sum_{i=1}^{n} \mathrm{grad}(w, x_i)$ and $\sum_{i=1}^{n} \mathrm{loss}(w, x_i)$, and the driver broadcasts the new hyperplane vector $w^T$ for compute(w_new) in the next iteration]

P1 (long-lived cached records): SVM stores its training data in memory ⇒ long-lived cached records ⇒ stored in the Old Gen of the JVM heap.

SLIDE 16

P2 (humongous data objects): SVM generates large vectors (large arrays); a single vector reaches 345MB ⇒ humongous data objects ⇒ stored in the Old Gen, alongside the P1 long-lived cached records.
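A minimal sketch of how an SVM-style job produces both patterns, assuming a simplified dense-vector representation, a hypothetical comma-separated input format, and a toy update rule (this is not MLlib's actual SVM implementation):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SvmMemorySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("svm-memory-sketch"))
    val numFeatures = 1000   // illustrative; with tens of millions of features one
                             // double array already weighs in at hundreds of MB

    // P1: the training data is cached, so these records survive every iteration.
    val training = sc.textFile("hdfs:///data/svm").map { line =>   // hypothetical "label,f1,f2,..." lines
      val parts = line.split(",")
      (parts.head.toDouble, parts.tail.map(_.toDouble))
    }.cache()

    var w = new Array[Double](numFeatures)
    for (_ <- 1 to 10) {
      val bcW = sc.broadcast(w)
      // P2: every task builds a gradient array as large as the feature space, and
      // the aggregated copies become humongous objects in the Old Gen.
      val (gradSum, lossSum) = training.treeAggregate((new Array[Double](numFeatures), 0.0))(
        seqOp = { case ((grad, lossAcc), (y, x)) =>
          val margin = y * x.zip(bcW.value).map { case (xi, wi) => xi * wi }.sum
          if (margin < 1) for (i <- x.indices) grad(i) -= y * x(i)
          (grad, lossAcc + math.max(0.0, 1 - margin))
        },
        combOp = { case ((g1, l1), (g2, l2)) =>
          for (i <- g1.indices) g1(i) += g2(i)
          (g1, l1 + l2)
        })
      println(s"hinge loss = $lossSum")
      w = w.indices.map(i => w(i) - 0.01 * gradSum(i)).toArray   // toy gradient step, not the real update rule
    }
    sc.stop()
  }
}
```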

SLIDE 17


Application selection – memory usage patterns

Iterative Graph App: PageRank with cached data

P1: Iterative long-lived accumulated records

[Figure: PageRank dataflow; the cached data is reused in every iteration, and each iterative stage joins ranks with it and aggregates contributions with reduceByKey()]

PageRank generates shuffled records in each iteration ⇒ Iterative long-lived accumulated records ⇒ Similar memory usage pattern in each iteration

SLIDE 18

P2 (long-lived cached records): the cached data persists across iterations, in addition to P1 (iterative long-lived accumulated records).

SLIDE 19

Similar memory usage pattern in each iteration.
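PageRank's iterative pattern maps onto the classic Spark formulation. A minimal sketch (hypothetical edge-list input, fixed iteration count, no GraphX) is:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PageRankSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pagerank-sketch"))

    // P2: the adjacency lists are cached and stay live for the whole job.
    val links = sc.textFile("hdfs:///data/edges")            // hypothetical "src dst" lines
      .map { line => val Array(src, dst) = line.split(" "); (src, dst) }
      .groupByKey()
      .cache()

    var ranks = links.mapValues(_ => 1.0)

    for (_ <- 1 to 10) {
      // P1: each iteration shuffles and accumulates contribution records, which
      // live until that iteration's reduceByKey() completes.
      val contribs = links.join(ranks).values.flatMap { case (neighbors, rank) =>
        neighbors.map(dst => (dst, rank / neighbors.size))
      }
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    }

    ranks.saveAsTextFile("hdfs:///out/ranks")                // hypothetical output path
    sc.stop()
  }
}
```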

SLIDE 20

Methodology – Input data variation

Vary input data size to simulate different memory pressures

Application | Data-1.0 (100%) | Data-0.5 (50%)
GroupBy | 200GB Uservisits (1.2B rows) [1] | 50% rows (100GB)
Join | 200GB Uservisits (1.2B rows) + 40GB Rankings (600M rows) [1] | 50% rows (100GB, 20GB)
SVM | 21GB KDD2012 matrix [2] (149M rows, 54M features) | 50% columns (11.2GB, 27M features)
PageRank | 25GB Twitter graph [3] (476M edges, 17M nodes) | 50% edges (12.2GB, 238M edges)

[1] Generated by HiBench. https://github.com/Intel-bigdata/HiBench
[2] KDD Cup 2012 dataset. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
[3] Twitter social graph. http://an.kaist.ac.kr/traces/WWW2010.html

SLIDE 21

Testbed

[Figure: cluster topology; one master node running the driver and eight slave nodes (slave1..slave8) running the tasks of a Spark application]

  • Alibaba Cloud: 9 nodes, Java 8, Spark 2.1.2
  • Each node runs 4 JVMs (tasks)
  • Each JVM runs one task (core = 1, memory = 6.5GB) ⇒ observe the memory usage and GC activities of each task
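The per-task sizing above can be expressed directly in the Spark configuration. The sketch below uses the values from the slide, but the instance count and application name are illustrative assumptions, not the authors' exact submit script:

```scala
import org.apache.spark.SparkConf

// One core and a 6.5GB heap per executor JVM, so each executor runs one task at a time.
val conf = new SparkConf()
  .setAppName("gc-evaluation-testbed")      // hypothetical application name
  .set("spark.executor.cores", "1")
  .set("spark.executor.memory", "6656m")    // 6.5GB per executor JVM
  .set("spark.executor.instances", "32")    // assumption: 8 worker nodes x 4 executors each
```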

SLIDE 22

Experimental results – Observation 1

App execution time and GC time comparison

Observation 1: Applications with long-lived accumulated records suffer from high GC overhead (1-40 mins)

App execution time (min) and GC time (min) for Parallel / CMS / G1:

GroupBy-1.0 (20.1%): execution 45.4 / 36.3 / 39.4, GC 19 / 0.9 / 1.2
Join-1.0 (30.5%): execution 78.7 / 54.7 / 57.1, GC 41 / 0.7 / 2.6
SVM-0.5 (3.2%): execution 6.2 / 6 / 6, GC 0.4 / 0.3 / 0.1
PageRank-0.5 (49.1%): execution 26.1 / 19.5 / 38.3, GC 11.3 / 3.5 / 3.3

SLIDE 23

Where are long-lived accumulated records from?

[Figure: object graph of a reduce task; massive long-lived accumulated records referenced from GC roots]

In GroupBy, Join, and PageRank, shuffled records are aggregated in memory by the reduce-side operators (e.g., groupByKey()), accumulating massive long-lived records, e.g., ~40M records (5.5GB).

SLIDE 24

Memory usage pattern of long-lived accumulated records

The accumulated records are long-lived, so storing the increasing number of accumulated records requires a large old generation in the JVM.

SLIDE 25

If the old generation is limited ⇒ frequent GCs.

[Figure: memory usage over time; frequent and long GC pauses occur as usage approaches the memory size]

SLIDE 26

Findings about the garbage collectors

[Figure: Old Gen usage (before GC, after GC, allocated) and young/full GC pause times over time for the slowest GroupBy-1.0 task under (a) Parallel, (b) CMS, and (c) G1; shuffle spills are marked, and the CMS task shows a long (11 s) full GC pause due to a concurrent mode failure]

Different heap sizing policies lead to different GC frequencies

  • Parallel GC: allocates the smallest old gen and shrinks it with memory usage ⇒ suffers from frequent GC
  • G1 GC: allocates a large old gen and shrinks it with memory usage ⇒ fewer GCs than Parallel
  • CMS GC: allocates the largest old gen and does not shrink it ⇒ the fewest GC pauses
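For context, the generation sizes that these policies adjust can also be pinned manually with standard HotSpot flags. A hedged sketch for a Parallel GC executor (the values are illustrative, not a recommendation from the paper):

```scala
import org.apache.spark.SparkConf

// Illustrative only: give the executor a 6GB max heap, make the old generation
// twice the young generation (-XX:NewRatio=2), and switch off Parallel GC's
// adaptive resizing so the old generation is not shrunk under the accumulated records.
val conf = new SparkConf()
  .set("spark.executor.memory", "6g")
  .set("spark.executor.extraJavaOptions",
       "-XX:+UseParallelGC -XX:NewRatio=2 -XX:-UseAdaptiveSizePolicy")
```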

SLIDE 27

Findings about GC inefficiency


Finding 1: Current collectors allocate a limited old generation based on historical and current memory usage ⇒ frequent GC pauses.

SLIDE 28

Implication about the heap sizing policy


Implication: Design more intelligent heap sizing policies for accommodating big data objects (especially long-lived accumulated records).

SLIDE 29


Experimental results – Observation 2

App execution time and GC time comparison

Observation 2: Applications run with Parallel GC suffer from longer app execution and GC times than those run with CMS/G1 GC (see the comparison under Observation 1).

SLIDE 30

GC process


  • 1. Mark live objects
  • 2. Sweep unused objects

[Figure: live and unused objects reachable from GC roots before a GC]

SLIDE 31

Big data applications have too many objects to mark and reclaim in each GC activity ⇒ long individual GC pauses (10-20s).
SLIDE 32

Why is Parallel GC slower than CMS/G1 GC?

  • Parallel GC (stop-the-world): the application threads are stopped for the whole GC (10-20s pauses)
  • CMS/G1 GC (concurrent marking/sweeping): marking and sweeping run concurrently with the application threads, with only short stop-the-world pauses (less than 1s for remark)

Finding 2: Parallel collector’s stop-the-world marking/sweeping algorithm leads to more pause time (10-20s)

SLIDE 33

Concurrent garbage collectors


Question:

Are concurrent collectors good enough for big data applications?

SLIDE 34

Finding about marking & sweeping algorithm


Finding 3: CMS and G1 collectors' concurrent marking algorithms suffer from CPU contention with CPU-intensive data operators.

[Figure: during CMS/G1 concurrent marking/sweeping, the GC threads run alongside the application threads and contend with them for CPU]

SLIDE 35

Finding about marking & sweeping algorithm

[Figure: Old Gen usage, full GC pauses, concurrent mark phases, and CPU/memory utilization over time for the slowest Join-1.0 task under (b) CMS and (c) G1; the shuffle and output phases and shuffle spills are marked]

The concurrent mark phases lead to high CPU usage during concurrent GC activities (Finding 3).
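One knob that exists today for this contention is the number of concurrent GC threads; a hedged sketch follows (the value is workload-dependent and not a recommendation from the paper):

```scala
import org.apache.spark.SparkConf

// Cap the concurrent marking threads so they steal fewer cores from the
// CPU-intensive data operators; fewer GC threads also means longer marking cycles.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:ConcGCThreads=1")
```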

SLIDE 36

Implication about GC algorithm


Implication: Design more efficient marking & sweeping algorithms for big data objects to reduce individual GC pause time

SLIDE 37

Experimental results – Observation 3

Reliability problem

Observation 3: With the G1 collector, the SVM application suffers from an OutOfMemory error while processing humongous objects.

SVM on large data (data-1.0), app execution time / GC time in minutes: Parallel 15.2 / 1.2, CMS 14.5 / 1.1, G1: OutOfMemory.

SLIDE 38

Root cause of the OOM error


Humongous data object (~500 MB vector array)

[Figure: the SVM execution plan generates humongous data objects; reduce tasks aggregate $\sum_{i=1}^{n} \mathrm{grad}(w, x_i)$ and $\sum_{i=1}^{n} \mathrm{loss}(w, x_i)$, and the driver broadcasts the new hyperplane vector $w^T$]

SLIDE 39

Root cause of the OOM error

[Figure: G1 heap of equal-sized regions (Eden, Survivor, Old, Humongous, non-allocated); allocating the ~500 MB humongous vector array fails with OOM]

Root cause: there is not enough contiguous space to keep this humongous object.
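Background on the mechanism (not part of the paper's evaluation): G1 treats any object larger than half a region as humongous and places it in contiguous regions, and HotSpot caps the region size at 32MB, so a ~500MB array always needs on the order of 16 contiguous free regions. The one related tuning knob is sketched below; it reduces how many objects cross the humongous threshold but cannot remove the contiguity requirement for an object this large:

```scala
import org.apache.spark.SparkConf

// Hedged sketch: enlarge G1 regions to the 32MB maximum so that fewer mid-sized
// objects are classified as humongous. A ~500MB vector array still needs many
// contiguous free regions, so this mitigates rather than fixes the OOM above.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:G1HeapRegionSize=32m")
```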

SLIDE 40

Root cause of the OOM error


Implication: Redesign the object allocation algorithm to balance the trade-off between memory utilization and reliability.

SLIDE 41

The other findings

(see our paper for details)

  • ParallelGC tasks trigger 1.5x more shuffle spills than CMS and G1 tasks. The root cause is that the Parallel collector has the smallest available heap size, which leads to the lowest spill threshold for ParallelGC tasks.
  • Threshold-based full GC triggering conditions lead to frequent but unnecessary full GC pauses on the long-lived accumulated records. Due to different full GC triggering thresholds, ParallelGC suffers from 1.7x more full GC pauses than G1, and G1 suffers from 7x more full GC pauses than CMS.
  • For iterative applications that need to reclaim massive long-lived accumulated records in each iteration, the CMS collector's concurrent sweeping algorithm achieves 16x shorter full GC time than G1's incremental sweeping algorithm.
  • ParallelGC tasks suffer from 2.5-7.6x higher CPU usage than CMS and G1 tasks, due to 1.7-12x more full GC pauses and 10x longer individual full GC pauses.
  • G1 tasks suffer from 1.1-1.2x higher physical memory usage than ParallelGC and CMS tasks.
  • Compared to the CMS and G1 collectors, the Parallel collector's inappropriate generation resizing timing mechanism leads to 38% more full GC pauses.

SLIDE 42

GC optimization methods

We propose three GC optimization methods:

  • Prediction-based dynamic heap sizing policies
  • Label-based object marking algorithms ⇒ explicitly label the long-lived data objects based on their lifecycles to avoid unnecessary marking
  • Overriding-based object reclamation algorithms

[Figure: GroupBy-0.5 slowest G1 task (Old Gen usage, GC pauses, CPU and memory utilization) and the object graph of long-lived data objects reachable from GC roots]

SLIDE 43

Related work

  • Performance studies on big data applications: PerformanceStudyOnSpark [NSDI 2015], StudyOnMemoryBloat [ISMM 2013], OutOfMemoryStudy [ISSRE 2015]
  • Framework memory management optimization: MemTune [IPDPS 2016], Broom [HotOS 2015], Façade [ASPLOS 2015], Deca [VLDB 2016], Tungsten [SparkSQL]
  • GC optimization for big data applications: Yak [OSDI 2016], NG2C [ISMM 2017]
SLIDE 44

Conclusions

  • We summarize the unique memory usage patterns of big data applications
  • We experimentally evaluate three collectors and identify the root causes of GC inefficiencies
  • We propose GC optimization methods
SLIDE 45

Q & A Thanks!
