
SLIDE 1

ACCELERATING ASYNCHRONOUS EVENTS FOR HYBRID PARALLEL RUNTIMES

Kyle C. Hale and Peter Dinda

SLIDE 2

v3vee.org v3vee.org/palacios

nautilus.halek.co

SLIDE 3

SOFTWARE EVENTS

an event occurs in some execution context; another execution context (for example, a thread) takes action based on that event

SLIDE 4

SOME TYPES OF EVENTS

  • message arrival
  • work is completed
  • work is available
  • something terrible happened

SLIDE 5


AN EXAMPLE: LEGION

[Figure: threads 1 and 2 on CPUs 0-2; pthread worker threads wait for work in pthread_cond_wait(); when a unit of work arrives, pthread_cond_broadcast() wakes all of them, and the woken threads RACE to claim it]

SLIDE 6

ASYNCHRONOUS EVENTS

the receiving side is not blocked
  • other things can run

SLIDE 7

WE WANT FAST EVENTS

notification latency: the time from the moment of the event trigger to the first instruction of the event-handling code

we want to minimize this

SLIDE 8

WHAT’S THE LOWER LIMIT?

light!

SLIDE 9

what we want: SoL† what we actually get with existing software events: SoL††

†speed of light

††s**t out of luck

SLIDE 10

OUTLINE

  • software abstractions for asynchronous events
  • hardware capabilities
  • event performance
  • NEMO: benefits of kernel mode
  • NEMO: closer to the hardware

SLIDE 11

CONDITION VARIABLES

[Figure: pthread_cond_wait() puts the calling thread on the condition variable's queue; pthread_cond_signal() moves a waiter to the CPU's ready queue, where a scheduling delay passes before it actually runs]

SLIDE 12

BROADCAST

[Figure: pthread_cond_broadcast() moves every thread on the condition variable's queue onto the ready queues of CPU 0 and CPU 1]

SLIDE 13

IMMEDIATELY VISIBLE ISSUES

  • we're at the behest of the scheduler
  • broadcast is linear in the number of waiters
  • we can't tell the scheduler to initiate a "fast" wakeup

SLIDE 14

OUTLINE

  • software abstractions for asynchronous events
  • hardware capabilities
  • event performance
  • NEMO: benefits of kernel mode
  • NEMO: closer to the hardware

SLIDE 15

WHAT CAN WE DO IN HARDWARE?

inter-processor interrupts (IPIs)

[Figure: the sender raises interrupt vector n; the receiving core indexes its IDT by n to reach the handler code, the first instruction executed on the receiving end]

SLIDE 16

IPIS ARE FAST

[Plot: CDF of IPI latency in cycles, measured from the BSP (core 0), for destinations sharing a logical core, physical core, NUMA domain, or socket; 95th percentile = 1728 cycles]

SLIDE 17

OUTLINE

  • software abstractions for asynchronous events
  • hardware capabilities
  • event performance
  • NEMO: benefits of kernel mode
  • NEMO: closer to the hardware

SLIDE 18

MEASURING EVENT WAKEUP LATENCY

[Figure: core 0 calls read_start_time() at the event trigger; core i calls read_end_time() at the first instruction of the handler]

SLIDE 19

MEASURING EVENT BROADCAST LATENCY

[Figure: core 0 calls read_start_time() at the event trigger; cores i, j, and k call read_end_time_i(), read_end_time_j(), and read_end_time_k() at the first instruction of their handlers]

SLIDE 20

EXISTING SOFTWARE EVENTS ARE SLOW

[Plot: cycles to wakeup, unicast]

mechanism        µ        min   max    σ
pthread condvar  25176.5  1145  29955  3698.93
futex wakeup     24640.5  81    29996  3750.51
unicast IPI      1572.68  1150  17397  523.279

unicast IPI is ~16x faster than the software mechanisms

SLIDE 21

BROADCASTS ARE ALSO TERRIBLE

[Plot: cycles to wakeup, broadcast]

mechanism        µ        min    max          σ
pthread condvar  995795   17538  2.17277e+06  544512
futex wakeup     370630   16402  1.89553e+06  199680
broadcast IPI    12827.3  1252   57467        2931.32

broadcast IPI is ~29x faster than the software mechanisms

SLIDE 22

SYNCHRONY

for broadcasts, we want events to be delivered to all cores at the same time
  • useful for, e.g., BSP-style applications with events
  • we measure the deviation of wakeup time across cores in a broadcast

SLIDE 23

SYNCHRONY

  • 70x difference between hardware IPIs and software mechanisms
  • hardly any predictability!

SLIDE 24

SLIDE 25

OUTLINE

  • software abstractions for asynchronous events
  • hardware capabilities
  • event performance
  • NEMO: benefits of kernel mode
  • NEMO: closer to the hardware

SLIDE 26

NAUTILUS

[Figure: Nautilus Aerokernel architecture: a parallel application and parallel runtime run as an HRT kernel entirely in kernel mode (no user mode), with full privileged hardware access; Aerokernel services include paging, threads, timers, interrupts, topology, synchronization/events, and allocation]

[Hale, Dinda HPDC '15] [Hale, Dinda VEE '16] [Hale, Hetland, Dinda FRIDAY]

SLIDE 27

SLIDE 28

RETAINING FAMILIAR INTERFACES

  • use a lightweight, kernel-mode framework (like Nautilus) to eliminate overheads
  • maintain userspace interfaces (e.g. condition variable wait, signal, broadcast)
  • if we build our kernel from scratch, how fast can we get?

SLIDE 29

NEMO HAS 2 COMPATIBLE CONDITION VARIABLE IMPLEMENTATIONS

  • lightweight condition variables
  • condition variables that leverage IPI access to "kick" the scheduler

SLIDE 30

EXISTING SOFTWARE EVENTS ARE SLOW

[Plot: cycles to wakeup, unicast]

mechanism        µ        min   max    σ
pthread condvar  25176.5  1145  29955  3698.93
futex wakeup     24640.5  81    29996  3750.51
unicast IPI      1572.68  1150  17397  523.279

unicast IPI is ~16x faster than the software mechanisms

SLIDE 31

NEMO SPEEDS THINGS UP

[Plot: cycles to wakeup, unicast]

mechanism                 µ        min   max    σ
pthread condvar           25176.5  1145  29955  3698.93
futex wakeup              24640.5  81    29996  3750.51
Aerokernel condvar        9128.78  4195  29990  3025.12
Aerokernel condvar + IPI  5348.51  4730  6392   290.006
unicast IPI               1572.68  1150  17397  523.279

SLIDE 32

NEMO SPEEDS THINGS UP

[Plot: cycles to wakeup, unicast]

mechanism                 µ        min   max    σ
pthread condvar           25176.5  1145  29955  3698.93
futex wakeup              24640.5  81    29996  3750.51
Aerokernel condvar        9128.78  4195  29990  3025.12
Aerokernel condvar + IPI  5348.51  4730  6392   290.006
unicast IPI               1572.68  1150  17397  523.279

Nemo events: ~5x faster than the user-space mechanisms

SLIDE 33

BROADCASTS ARE ALSO TERRIBLE

[Plot: cycles to wakeup, broadcast]

mechanism        µ        min    max          σ
pthread condvar  995795   17538  2.17277e+06  544512
futex wakeup     370630   16402  1.89553e+06  199680
broadcast IPI    12827.3  1252   57467        2931.32

broadcast IPI is ~29x faster than the software mechanisms

SLIDE 34

NEMO BRINGS US CLOSER TO IPI BROADCAST LATENCY

[Plot: cycles to wakeup, broadcast]

mechanism                 µ        min    max          σ
pthread condvar           995795   17538  2.17277e+06  544512
futex wakeup              370630   16402  1.89553e+06  199680
Aerokernel condvar        265820   3258   612959       159421
Aerokernel condvar + IPI  132417   7842   464015       98637.4
broadcast IPI             12827.3  1252   57467        2931.32

Nemo events: ~3x faster than the user-space mechanisms

SLIDE 35

SYNCHRONY


we can do 2x better than user-space mechanisms (with compatible interfaces)

SLIDE 36

OUTLINE

  • software abstractions for asynchronous events
  • hardware capabilities
  • event performance
  • NEMO: benefits of kernel mode
  • NEMO: closer to the hardware

SLIDE 37

WHAT IF WE GIVE UP THE FAMILIAR INTERFACE?

  • modify condition variable semantics
  • we don't necessarily care which context (thread) receives the event, as long as it's handled at a particular core
  • not appropriate for all situations

SLIDE 38

ACTIVE MESSAGES

[Figure: a message lands in memory at a CPU; a handler handle_msg() runs directly against it]

claim: a better fit than, e.g., condition variables for many event-based schemes

SLIDE 39

we want to use IPIs as an active message substrate
problem: IPIs don't have a payload!

SLIDE 40

  • allocate several event IDs
  • when a core receives the interrupt, look up the event ID in a table indexed by core ID

[Figure: an Action Lookup Table with one column per core (core 0, core 1, …, core n-1); nemo_notify_event(core=1, event=3) selects event ID 3 in core 1's column]

SLIDE 41

an event ID corresponds to an "action" (a handler)

[Figure: an Action Descriptor Table indexed by event ID (0 … m-1); event ID 3 maps to a descriptor at 0xdeadbeef whose handler is handle_event()]

SLIDE 42

NEMO WAKEUPS HAVE CONSTANT OFFSET FROM IPIS

[Plot: CDF of wakeup latency in cycles, measured from the BSP (core 0), for unicast IPI and Nemo event notify; 95th percentile = 1728 cycles (IPI) vs. 1824 cycles (Nemo): a constant offset of ~100 cycles]

SLIDE 43

BROADCAST LATENCY ALSO ON PAR WITH IPIS

[Plot: cycles to wakeup, broadcast]

mechanism       µ      min   max    σ
IPI broadcast   12792  1252  26838  2718.73
Nemo broadcast  12958  1376  29703  2819.29

SLIDE 44

NEMO ACHIEVES TIGHT SYNCHRONY

< 50 cycles variation in broadcast wakeups between cores

SLIDE 45

SUMMARY

if you want asynchronous event delivery close to hardware latency, existing mechanisms are pretty terrible

SOME WAYS TO FIX IT:
  • throw out general-purpose OS abstractions (e.g. the user/kernel boundary)
  • throw out typical event abstractions
  • use the hardware directly!

SLIDE 46

THANKS

http://halek.co (me)
http://presciencelab.org (our lab)
http://nautilus.halek.co (Nautilus)
http://xstack.sandia.gov/hobbes (Hobbes Exascale OS/R project)

Thank you

SLIDE 47

BACKUPS

SLIDE 48

TIGHT SYNCHRONY FOR IPIS, NOT FOR SOFTWARE EVENTS

[Plot: CDF of σ (cycles) for pthread condvar, futex broadcast, and broadcast IPI; the IPI's synchrony is ~70x tighter than the software mechanisms]

SLIDE 49

NEMO GETS US CLOSER

[Plot: CDF of σ (cycles) for pthread condvar, futex broadcast, Aerokernel condvar, Aerokernel condvar + IPI, and broadcast IPI; the Aerokernel mechanisms are ~2x tighter than the user-space ones]

SLIDE 50

NEMO ACHIEVES TIGHT SYNCHRONY

[Plot: CDF of σ for IPI broadcast vs. Nemo broadcast, with an inset zooming in on 2520 to 2790 cycles; Nemo is within ~50 cycles of the raw IPI]

SLIDE 51

HOW FAST CAN A NOTIFICATION BE IN H/W?

[Figure: two threads on different cores, each with a private cache, share memory; the receiver polls with while (!stuff_happened); the sender's write to stuff_happened is delivered by a cache invalidation (10s of cycles)]

SLIDE 52

UNICAST IPI VS MEMORY POLLING

[Plot: CDF of latency in cycles, measured from the BSP (core 0), for unicast IPI vs. synchronous event (memory polling); the gap reflects the interrupt network vs. the cache coherence network]

SLIDE 53

WHY THE GAP?

SLIDE 54

we're bounded by interrupt handling logic in the hardware

IPI cost breakdown: APIC write, time on the wire, destination handling
(note: these are indirectly measured)

SLIDE 55

we used to have a problem like this with INT 0x80 syscalls…
solution: introduce a new instruction (SYSENTER/SYSCALL) and skip a lot of the interrupt handling logic

SLIDE 56

[Plot: CDF of latency in cycles, measured from the BSP (core 0), for unicast IPI, synchronous event (memory polling), and projected remote syscall; the gap between the projection and polling is the cost of destination handling]