HawkEye: Efficient Fine-grained OS Support for Huge Pages


SLIDE 1

HawkEye: Efficient Fine-grained OS Support for Huge Pages

Ashish Panwar1, Sorav Bansal2, K. Gopinath1

Indian Institute of Science (IISc), Bangalore1; Indian Institute of Technology, Delhi2

Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2019

SLIDES 2-8

[Animation: a virtual address space mapped onto a physical address space with base pages - too much TLB pressure! Mapping with huge pages yields fewer misses.]

SLIDE 9

OS Challenges

❑ Complex trade-offs

  • Memory bloat vs. performance
  • Page fault latency vs. the number of page faults

❑ Challenges due to (external) fragmentation

  • How to leverage limited memory contiguity
  • Fairness in huge page allocation

SLIDE 10

Memory bloat vs. performance

SLIDES 11-15

Internal fragmentation

[Animation: virtual memory backed by physical memory through a huge page mapping]

  • Aggressive allocation: unused base pages inside the huge page become bloat
  • Conservative allocation: lower TLB reach (impacts performance)

SLIDE 16

Bloat vs. performance

  • Aggressive: higher performance, higher bloat
  • Conservative: lower performance, lower bloat

SLIDE 17

Latency vs. # page faults

SLIDES 18-25

▪ A page fault has three steps: find a free page, zero-fill it, map it (pre, zero-fill, post)
▪ 4-KB faults: zero-filling accounts for about 25% of fault latency
▪ 2-MB faults: dominated by zero-filling (97%)
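
The jump from 25% to 97% follows from simple arithmetic: a 2-MB huge page contains 512 base pages, so the zero-fill work grows 512x while the find-and-map work stays roughly fixed per fault. A toy cost model (the unit costs below are made up, calibrated only so that zero-filling is 25% of a 4-KB fault as on the slide) illustrates the effect:

```python
# Illustrative cost model, not measured data: fixed cost covers
# finding a free page and mapping it; zero-fill cost scales with size.
FIXED_COST = 3.0          # pre + post work (find + map), in arbitrary units
ZERO_COST_PER_4KB = 1.0   # zero-filling one 4-KB base page

def zero_fill_fraction(pages_4kb):
    """Fraction of total page-fault latency spent zero-filling."""
    zero = ZERO_COST_PER_4KB * pages_4kb
    return zero / (zero + FIXED_COST)

print(f"4-KB fault: {zero_fill_fraction(1):.0%} zero-filling")    # 25%
print(f"2-MB fault: {zero_fill_fraction(512):.0%} zero-filling")
```

With these constants the 2-MB fault comes out above 99% zero-filling, the same regime as the 97% measured on the slide.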

SLIDE 26

Latency vs. # page faults

  • Aggressive: high latency, fewer faults
  • Conservative: low latency, higher faults

SLIDE 27

Conservative vs. aggressive (Tradeoff-1: bloat vs. performance; Tradeoff-2: latency vs. # page faults):

                     FreeBSD   Linux
  Memory bloat       Low       High
  Performance        Low       High
  Allocation latency Low       High
  # page faults      High      Low

Current systems favor opposite ends of the design spectrum:

  • FreeBSD is conservative (compromises on performance)
  • Linux is throughput-oriented (compromises on latency and bloat)

SLIDES 28-31

Ingens (OSDI'16)

▪ Asynchronous allocation
  • Huge pages allocated in the background: low latency, but too many page faults
▪ Utilization-threshold based allocation
  • Tunable bloat vs. performance: manual tuning
  • Adaptive based on memory pressure
▪ Fairness driven by a per-process fairness metric
  • Heuristic based on past behavior: weak correlation with page walk overhead

SLIDE 32

Current state-of-the-art

                     FreeBSD   Linux   Ingens
  Memory bloat       Low       High    Tunable
  Performance        Low       High    Tunable
  Allocation latency Low       High    Low
  # page faults      High      Low     High

▪ Hard to find the sweet spot for the utilization threshold in Ingens
  • Application dependent, phase dependent

SLIDE 33

HawkEye

SLIDE 34

Key Optimizations

➢ Asynchronous page pre-zeroing[1]
➢ Content-deduplication-based bloat mitigation
➢ Fine-grained intra-process allocation
➢ Fairness driven by hardware performance counters

[1] Optimizing the Idle Task and Other MMU Tricks, OSDI '99

SLIDE 35

Asynchronous page pre-zeroing

▪ Pages are zero-filled in the background
▪ Potential issues:
  • Cache pollution – leverage non-temporal writes
  • DRAM bandwidth consumption – rate-limited
  • CPU utilization – limited (e.g., 5%)
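
The rate-limiting idea can be sketched as follows (illustrative Python standing in for kernel code; the worker, names, and duty-cycle mechanism here are assumptions, not HawkEye's implementation): a background worker zeroes one free page at a time, then sleeps long enough that its busy time stays near the target CPU share.

```python
import threading
import time
from collections import deque

PAGE_SIZE = 4096
DUTY_CYCLE = 0.05  # cap pre-zeroing at ~5% of CPU time

# Simulated free list: 8 dirty (non-zero) pages awaiting pre-zeroing.
free_pages = deque(bytearray(b"\xaa" * PAGE_SIZE) for _ in range(8))
zeroed_pages = deque()

def prezero_worker():
    """Zero free pages in the background, sleeping after each page so
    busy time stays near DUTY_CYCLE of wall-clock time. (In-kernel
    code would also use non-temporal stores to avoid cache pollution.)
    """
    while free_pages:
        start = time.perf_counter()
        page = free_pages.popleft()
        page[:] = bytes(PAGE_SIZE)      # the actual zero-fill
        zeroed_pages.append(page)
        busy = time.perf_counter() - start
        # busy / (busy + idle) == DUTY_CYCLE  =>  idle below:
        time.sleep(busy * (1 - DUTY_CYCLE) / DUTY_CYCLE)

worker = threading.Thread(target=prezero_worker, daemon=True)
worker.start()
worker.join()
print(f"{len(zeroed_pages)} pages pre-zeroed")
```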

SLIDE 36

Asynchronous page pre-zeroing

Enables aggressive allocation with low latency:

✓ 13.8x faster VM spin-up
✓ 1.26x higher throughput (Redis)

SLIDE 37

Mitigating bloat

SLIDES 38-41

Mitigating bloat

▪ Observation: unused base pages within a huge page remain zero-filled
▪ Identify bloat by scanning memory
▪ Dedup zero-filled base pages to remove the bloat

[Animation: virtual memory mapped to physical memory via a huge page mapping; the unused base pages are zero-filled]
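
A minimal sketch of the recovery step (user-space Python standing in for page-table manipulation; the function and names are illustrative, not HawkEye's code): scan the base pages inside a huge page and remap any still-zero base page to a single shared zero page, copy-on-write style.

```python
PAGE = 4096
HUGE = 2 * 1024 * 1024
BASE_PER_HUGE = HUGE // PAGE        # 512 base pages per 2-MB huge page

ZERO_PAGE = bytes(PAGE)             # one shared, read-only zero page

def recover_bloat(huge_page):
    """Return (mapping, bytes_recovered): mapping[i] points at the
    shared ZERO_PAGE for base pages that are still zero-filled
    (i.e., never written), mimicking copy-on-write dedup."""
    mapping, recovered = [], 0
    for i in range(BASE_PER_HUGE):
        base = bytes(huge_page[i * PAGE:(i + 1) * PAGE])
        if base == ZERO_PAGE:       # unused base page => bloat
            mapping.append(ZERO_PAGE)
            recovered += PAGE
        else:
            mapping.append(base)    # keep the live data
    return mapping, recovered

# A huge page where only the first two base pages were ever touched:
hp = bytearray(HUGE)
hp[0:PAGE] = b"\x01" * PAGE
hp[PAGE:2 * PAGE] = b"\x02" * PAGE
mapping, recovered = recover_bloat(hp)
print(f"recovered {recovered // 1024} KB of {HUGE // 1024} KB")
```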

SLIDE 42

Mitigating bloat

▪ Ease of detecting non-zero pages

[Figure: distance (bytes) / offset (bytes) of the first non-zero content in non-zero pages]
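
The point of the measurement above is that a non-zero page usually reveals a non-zero byte close to its start, so the zero-page check is cheap: compare word-sized chunks and bail on the first non-zero word. A small sketch (my code, not HawkEye's scanner):

```python
def is_zero_page(page: bytes) -> bool:
    """Check a 4-KB page 64 bits at a time; for a typical non-zero
    page the first non-zero word appears near the start, so the
    generator short-circuits after a few comparisons."""
    return all(word == 0 for word in memoryview(page).cast("Q"))

print(is_zero_page(bytes(4096)))            # True
print(is_zero_page(b"\x07" + bytes(4095)))  # False, exits on word 0
```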

SLIDE 43

Mitigating bloat

✓ Automated "bloat vs. performance" management

[Figure: Redis RSS (GB) over time (seconds) under Linux, Ingens, and HawkEye across phases P1: insert, P2: delete, P3: insert; annotated with out-of-memory vs. success outcomes]

SLIDE 44

                     FreeBSD   Linux   Ingens    HawkEye
  Memory bloat       Low       High    Tunable   Automated
  Performance        Low       High    Tunable   Automated
  Allocation latency Low       High    Low       Low
  # page faults      High      Low     High      Low

SLIDE 45

Fine-grained (intra-process) allocation

▪ Maximizing performance with limited contiguity

SLIDE 46

Fine-grained (intra-process) allocation

▪ Maximizing performance with limited contiguity
▪ access-coverage: # base pages accessed per second
  ❖ A good indicator of the TLB contention due to a region

[Figure: XSBench access-coverage profile highlighting hot regions]

SLIDE 47

Fine-grained (intra-process) allocation

▪ Track access-coverage per region (access_map)
▪ Allocate huge pages in sorted order (top to bottom)
✓ Yields higher profit per allocation
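
The access_map idea can be sketched as follows (illustrative Python; the region granularity, sampling source, and names are assumptions, not HawkEye's implementation): count the distinct base pages touched in each 2-MB region during a sampling window, then promote regions in decreasing access-coverage order.

```python
from collections import defaultdict

REGION = 2 * 1024 * 1024   # candidate huge-page region (2 MB)
PAGE = 4096

def access_coverage(sampled_addrs):
    """access_map: region index -> # distinct base pages accessed in
    the sampling window (a proxy for the TLB contention the region
    causes)."""
    touched = defaultdict(set)
    for va in sampled_addrs:
        touched[va // REGION].add((va % REGION) // PAGE)
    return {region: len(pages) for region, pages in touched.items()}

def promotion_order(sampled_addrs):
    """Regions sorted hottest-first; allocating huge pages top to
    bottom yields the highest profit per allocation."""
    cov = access_coverage(sampled_addrs)
    return sorted(cov, key=cov.get, reverse=True)

# Region 0 is touched on 3 distinct base pages, region 1 on only 1:
addrs = [0, PAGE, 2 * PAGE, PAGE // 2, REGION + PAGE]
print(promotion_order(addrs))   # region 0 is promoted before region 1
```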

SLIDE 48

Fine-grained (intra-process) allocation

[Figure: page walk overhead (%) over time (seconds) for Linux, Ingens, and HawkEye on XSBench, annotated with access-coverage]

SLIDE 49

Fine-grained (intra-process) allocation

[Figure: execution time (ms) saved per huge page allocation for Graph500, XSBench, and NPB_CG.D under Linux, Ingens, and HawkEye]

SLIDE 50

Fair (inter-process) allocation

▪ Prioritize allocation to the process with the highest expected improvement
▪ How to estimate page walk overhead?
  • Profile hardware performance counters
  • Low cost, accurate!
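
A sketch of the estimation and selection logic (the sampling interface and names here are assumptions; on x86 a page-walk-duration performance event would supply walk_cycles): compute each process's page-walk overhead as the fraction of cycles spent in hardware page walks, and give the next huge page to the process with the highest overhead.

```python
def page_walk_overhead(walk_cycles, total_cycles):
    """Fraction of CPU cycles spent in hardware page walks."""
    return walk_cycles / total_cycles

def next_huge_page_recipient(samples):
    """samples: pid -> (walk_cycles, total_cycles). The process with
    the highest measured overhead has the highest expected
    improvement from an extra huge page, so it is served first."""
    return max(samples, key=lambda pid: page_walk_overhead(*samples[pid]))

# Hypothetical counter samples for three processes:
samples = {
    101: (5_000, 100_000),    # 5% page-walk overhead
    102: (30_000, 100_000),   # 30%: TLB-sensitive, gets priority
    103: (1_000, 100_000),    # 1%: TLB-insensitive
}
print(next_huge_page_recipient(samples))   # 102
```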

SLIDE 51

Fair (inter-process) allocation

[Figure: % speedup of cactusADM, tigr, Graph500, lbm_s, SVM, XSBench, and CG.D under Linux, Ingens, and HawkEye when running alongside a TLB-insensitive process]

SLIDES 52-53

Summary

▪ OS support for huge pages involves complex tradeoffs
▪ Balancing fine-grained control with high performance
▪ Dealing with fragmentation for efficiency and fairness

HawkEye: resolving fundamental conflicts for huge page optimizations

https://github.com/apanwariisc/HawkEye
