HawkEye: Efficient Fine-grained OS Support for Huge Pages


SLIDE 1

HawkEye: Efficient Fine-grained OS Support for Huge Pages

Ashish Panwar1, Sorav Bansal2, K. Gopinath1

Indian Institute of Science (IISc), Bangalore1; Indian Institute of Technology, Delhi2

Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2019

SLIDES 2-8

[Animation: a virtual address space mapped onto a physical address space with base pages - too much TLB pressure! Mapping with huge pages yields fewer misses.]

SLIDE 9

OS Challenges

❑ Complex trade-offs

  • Memory bloat vs. performance
  • Page fault latency vs. the number of page faults

❑ Challenges due to (external) fragmentation

  • How to leverage limited memory contiguity
  • Fairness in huge page allocation

SLIDE 10

Memory bloat vs. performance

SLIDES 11-15

Internal fragmentation

[Animation: virtual memory backed by physical memory through a huge page mapping]

  • Aggressive allocation: unused base pages inside the huge page become bloat
  • Conservative allocation: lower TLB reach (impacts performance)

SLIDE 16

Bloat vs. performance

  • Aggressive: higher performance, higher bloat
  • Conservative: lower performance, lower bloat

SLIDE 17

Latency vs. # page faults

SLIDES 18-25

▪ A page fault has three steps: find a free page, zero-fill it, map it (pre, zero-fill, post)
▪ 4-KB faults: zero-filling accounts for about 25% of fault latency
▪ 2-MB faults: dominated by zero-filling (97%)
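
The jump from 25% to 97% follows from simple arithmetic: a 2-MB huge page contains 512 base pages, so the zero-fill work grows 512x while the find-and-map work stays roughly fixed per fault. A toy cost model (the unit costs below are made up, calibrated only so that zero-filling is 25% of a 4-KB fault as on the slide) illustrates the effect:

```python
# Illustrative cost model, not measured data: fixed cost covers
# finding a free page and mapping it; zero-fill cost scales with size.
FIXED_COST = 3.0          # pre + post work (find + map), in arbitrary units
ZERO_COST_PER_4KB = 1.0   # zero-filling one 4-KB base page

def zero_fill_fraction(pages_4kb):
    """Fraction of total page-fault latency spent zero-filling."""
    zero = ZERO_COST_PER_4KB * pages_4kb
    return zero / (zero + FIXED_COST)

print(f"4-KB fault: {zero_fill_fraction(1):.0%} zero-filling")    # 25%
print(f"2-MB fault: {zero_fill_fraction(512):.0%} zero-filling")
```

With these constants the 2-MB fault comes out above 99% zero-filling, the same regime as the 97% measured on the slide.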

SLIDE 26

Latency vs. # page faults

  • Aggressive: high latency, fewer faults
  • Conservative: low latency, higher faults

SLIDE 27

Conservative vs. aggressive (Tradeoff-1: bloat vs. performance; Tradeoff-2: latency vs. # page faults):

                     FreeBSD   Linux
  Memory bloat       Low       High
  Performance        Low       High
  Allocation latency Low       High
  # page faults      High      Low

Current systems favor opposite ends of the design spectrum:

  • FreeBSD is conservative (compromises on performance)
  • Linux is throughput-oriented (compromises on latency and bloat)

SLIDES 28-31

Ingens (OSDI'16)

▪ Asynchronous allocation
  • Huge pages allocated in the background: low latency, but too many page faults
▪ Utilization-threshold based allocation
  • Tunable bloat vs. performance: manual tuning
  • Adaptive based on memory pressure
▪ Fairness driven by a per-process fairness metric
  • Heuristic based on past behavior: weak correlation with page walk overhead

SLIDE 32

Current state-of-the-art

                     FreeBSD   Linux   Ingens
  Memory bloat       Low       High    Tunable
  Performance        Low       High    Tunable
  Allocation latency Low       High    Low
  # page faults      High      Low     High

▪ Hard to find the sweet spot for the utilization threshold in Ingens
  • Application dependent, phase dependent

SLIDE 33

HawkEye

SLIDE 34

Key Optimizations

➢ Asynchronous page pre-zeroing[1]
➢ Content-deduplication-based bloat mitigation
➢ Fine-grained intra-process allocation
➢ Fairness driven by hardware performance counters

[1] Optimizing the Idle Task and Other MMU Tricks, OSDI '99

SLIDE 35

Asynchronous page pre-zeroing

▪ Pages are zero-filled in the background
▪ Potential issues:
  • Cache pollution – leverage non-temporal writes
  • DRAM bandwidth consumption – rate-limited
  • CPU utilization – limited (e.g., 5%)
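
The rate-limiting idea can be sketched as follows (illustrative Python standing in for kernel code; the worker, names, and duty-cycle mechanism here are assumptions, not HawkEye's implementation): a background worker zeroes one free page at a time, then sleeps long enough that its busy time stays near the target CPU share.

```python
import threading
import time
from collections import deque

PAGE_SIZE = 4096
DUTY_CYCLE = 0.05  # cap pre-zeroing at ~5% of CPU time

# Simulated free list: 8 dirty (non-zero) pages awaiting pre-zeroing.
free_pages = deque(bytearray(b"\xaa" * PAGE_SIZE) for _ in range(8))
zeroed_pages = deque()

def prezero_worker():
    """Zero free pages in the background, sleeping after each page so
    busy time stays near DUTY_CYCLE of wall-clock time. (In-kernel
    code would also use non-temporal stores to avoid cache pollution.)
    """
    while free_pages:
        start = time.perf_counter()
        page = free_pages.popleft()
        page[:] = bytes(PAGE_SIZE)      # the actual zero-fill
        zeroed_pages.append(page)
        busy = time.perf_counter() - start
        # busy / (busy + idle) == DUTY_CYCLE  =>  idle below:
        time.sleep(busy * (1 - DUTY_CYCLE) / DUTY_CYCLE)

worker = threading.Thread(target=prezero_worker, daemon=True)
worker.start()
worker.join()
print(f"{len(zeroed_pages)} pages pre-zeroed")
```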

SLIDE 36

Asynchronous page pre-zeroing

Enables aggressive allocation with low latency:

✓ 13.8x faster VM spin-up
✓ 1.26x higher throughput (Redis)

SLIDE 37

Mitigating bloat

SLIDES 38-41

Mitigating bloat

▪ Observation: unused base pages within a huge page remain zero-filled
▪ Identify bloat by scanning memory
▪ Dedup zero-filled base pages to remove the bloat

[Animation: virtual memory mapped to physical memory via a huge page mapping; the unused base pages are zero-filled]
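
A minimal sketch of the recovery step (user-space Python standing in for page-table manipulation; the function and names are illustrative, not HawkEye's code): scan the base pages inside a huge page and remap any still-zero base page to a single shared zero page, copy-on-write style.

```python
PAGE = 4096
HUGE = 2 * 1024 * 1024
BASE_PER_HUGE = HUGE // PAGE        # 512 base pages per 2-MB huge page

ZERO_PAGE = bytes(PAGE)             # one shared, read-only zero page

def recover_bloat(huge_page):
    """Return (mapping, bytes_recovered): mapping[i] points at the
    shared ZERO_PAGE for base pages that are still zero-filled
    (i.e., never written), mimicking copy-on-write dedup."""
    mapping, recovered = [], 0
    for i in range(BASE_PER_HUGE):
        base = bytes(huge_page[i * PAGE:(i + 1) * PAGE])
        if base == ZERO_PAGE:       # unused base page => bloat
            mapping.append(ZERO_PAGE)
            recovered += PAGE
        else:
            mapping.append(base)    # keep the live data
    return mapping, recovered

# A huge page where only the first two base pages were ever touched:
hp = bytearray(HUGE)
hp[0:PAGE] = b"\x01" * PAGE
hp[PAGE:2 * PAGE] = b"\x02" * PAGE
mapping, recovered = recover_bloat(hp)
print(f"recovered {recovered // 1024} KB of {HUGE // 1024} KB")
```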

SLIDE 42

Mitigating bloat

▪ Ease of detecting non-zero pages

[Figure: distance (bytes) / offset (bytes) of the first non-zero content in non-zero pages]
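
The point of the measurement above is that a non-zero page usually reveals a non-zero byte close to its start, so the zero-page check is cheap: compare word-sized chunks and bail on the first non-zero word. A small sketch (my code, not HawkEye's scanner):

```python
def is_zero_page(page: bytes) -> bool:
    """Check a 4-KB page 64 bits at a time; for a typical non-zero
    page the first non-zero word appears near the start, so the
    generator short-circuits after a few comparisons."""
    return all(word == 0 for word in memoryview(page).cast("Q"))

print(is_zero_page(bytes(4096)))            # True
print(is_zero_page(b"\x07" + bytes(4095)))  # False, exits on word 0
```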

SLIDE 43

Mitigating bloat

✓ Automated "bloat vs. performance" management

[Figure: Redis RSS (GB) over time (seconds) under Linux, Ingens, and HawkEye across phases P1: insert, P2: delete, P3: insert; annotated with out-of-memory vs. success outcomes]

SLIDE 44

                     FreeBSD   Linux   Ingens    HawkEye
  Memory bloat       Low       High    Tunable   Automated
  Performance        Low       High    Tunable   Automated
  Allocation latency Low       High    Low       Low
  # page faults      High      Low     High      Low

SLIDE 45

Fine-grained (intra-process) allocation

▪ Maximizing performance with limited contiguity

SLIDE 46

Fine-grained (intra-process) allocation

▪ Maximizing performance with limited contiguity
▪ access-coverage: # base pages accessed per second
  ❖ A good indicator of the TLB contention due to a region

[Figure: XSBench access-coverage profile highlighting hot regions]

SLIDE 47

Fine-grained (intra-process) allocation

▪ Track access-coverage per region (access_map)
▪ Allocate huge pages in sorted order (top to bottom)
✓ Yields higher profit per allocation
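
The access_map idea can be sketched as follows (illustrative Python; the region granularity, sampling source, and names are assumptions, not HawkEye's implementation): count the distinct base pages touched in each 2-MB region during a sampling window, then promote regions in decreasing access-coverage order.

```python
from collections import defaultdict

REGION = 2 * 1024 * 1024   # candidate huge-page region (2 MB)
PAGE = 4096

def access_coverage(sampled_addrs):
    """access_map: region index -> # distinct base pages accessed in
    the sampling window (a proxy for the TLB contention the region
    causes)."""
    touched = defaultdict(set)
    for va in sampled_addrs:
        touched[va // REGION].add((va % REGION) // PAGE)
    return {region: len(pages) for region, pages in touched.items()}

def promotion_order(sampled_addrs):
    """Regions sorted hottest-first; allocating huge pages top to
    bottom yields the highest profit per allocation."""
    cov = access_coverage(sampled_addrs)
    return sorted(cov, key=cov.get, reverse=True)

# Region 0 is touched on 3 distinct base pages, region 1 on only 1:
addrs = [0, PAGE, 2 * PAGE, PAGE // 2, REGION + PAGE]
print(promotion_order(addrs))   # region 0 is promoted before region 1
```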

SLIDE 48

Fine-grained (intra-process) allocation

[Figure: page walk overhead (%) over time (seconds) for Linux, Ingens, and HawkEye on XSBench, annotated with access-coverage]

SLIDE 49

Fine-grained (intra-process) allocation

[Figure: execution time (ms) saved per huge page allocation for Graph500, XSBench, and NPB_CG.D under Linux, Ingens, and HawkEye]

SLIDE 50

Fair (inter-process) allocation

▪ Prioritize allocation to the process with the highest expected improvement
▪ How to estimate page walk overhead?
  • Profile hardware performance counters
  • Low cost, accurate!
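
A sketch of the estimation and selection logic (the sampling interface and names here are assumptions; on x86 a page-walk-duration performance event would supply walk_cycles): compute each process's page-walk overhead as the fraction of cycles spent in hardware page walks, and give the next huge page to the process with the highest overhead.

```python
def page_walk_overhead(walk_cycles, total_cycles):
    """Fraction of CPU cycles spent in hardware page walks."""
    return walk_cycles / total_cycles

def next_huge_page_recipient(samples):
    """samples: pid -> (walk_cycles, total_cycles). The process with
    the highest measured overhead has the highest expected
    improvement from an extra huge page, so it is served first."""
    return max(samples, key=lambda pid: page_walk_overhead(*samples[pid]))

# Hypothetical counter samples for three processes:
samples = {
    101: (5_000, 100_000),    # 5% page-walk overhead
    102: (30_000, 100_000),   # 30%: TLB-sensitive, gets priority
    103: (1_000, 100_000),    # 1%: TLB-insensitive
}
print(next_huge_page_recipient(samples))   # 102
```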

SLIDE 51

Fair (inter-process) allocation

[Figure: % speedup of cactusADM, tigr, Graph500, lbm_s, SVM, XSBench, and CG.D under Linux, Ingens, and HawkEye when running alongside a TLB-insensitive process]

SLIDES 52-53

Summary

▪ OS support for huge pages involves complex tradeoffs
▪ Balancing fine-grained control with high performance
▪ Dealing with fragmentation for efficiency and fairness

HawkEye: resolving fundamental conflicts for huge page optimizations

https://github.com/apanwariisc/HawkEye
