ACCELERATING ASYNCHRONOUS EVENTS FOR HYBRID PARALLEL RUNTIMES
Kyle C. Hale and Peter Dinda
v3vee.org v3vee.org/palacios
nautilus.halek.co
SOFTWARE EVENTS
an event occurs in some execution context; another execution context takes action based on it
for example, a thread
[Diagram: thread 1 and thread 2 running across CPU 0, CPU 1, CPU 2]
pthread worker threads: waiting for work (pthread_cond_wait())
unit of work
pthread_cond_broadcast()
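The setup on this slide (workers parked in pthread_cond_wait() until a unit of work is published with pthread_cond_broadcast()) can be sketched as follows; the names work_ready, have_work, and run_demo are illustrative, not from the talk:

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  work_ready = PTHREAD_COND_INITIALIZER;
static bool have_work = false;
static int  work_done = 0;

/* worker: block until a unit of work is announced, then handle it */
static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!have_work)                 /* guard against spurious wakeups */
        pthread_cond_wait(&work_ready, &lock);
    work_done++;                       /* "take action based on the event" */
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* trigger: publish the unit of work, then wake every waiting worker at once */
int run_demo(int nworkers)
{
    pthread_t t[16];
    for (int i = 0; i < nworkers; i++)
        pthread_create(&t[i], NULL, worker, NULL);

    pthread_mutex_lock(&lock);
    have_work = true;
    pthread_cond_broadcast(&work_ready);   /* the event notification */
    pthread_mutex_unlock(&lock);

    for (int i = 0; i < nworkers; i++)
        pthread_join(t[i], NULL);
    return work_done;
}
```

Note the while loop around the wait: a woken worker must recheck the predicate, which is part of why this path is slower than it looks.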
notification latency: the time from the moment of event trigger to the first instruction of event handling code
†speed of light
††s**t out of luck
- software abstractions for asynchronous events
- hardware capabilities
- event performance
- NEMO: benefits of kernel mode
- NEMO: closer to the hardware
[Diagram: pthread_cond_wait() moves the calling thread from the CPU's ready queue onto the condition variable's queue; pthread_cond_signal() moves a thread back to the ready queue, where it sits out a scheduling delay before running]
[Diagram: pthread_cond_broadcast() moves every thread on the condition variable's queue onto the ready queues of CPU 0 and CPU 1]
- software abstractions for asynchronous events
- hardware capabilities
- event performance
- NEMO: benefits of kernel mode
- NEMO: closer to the hardware
[Diagram: an interrupt arriving with vector n indexes the IDT to locate its handler code, the first instruction executed on the receiving end]
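As a rough userspace illustration of what an IDT entry holds, this packs an x86-64 interrupt gate descriptor. The 16-byte field layout follows the architected gate format; make_int_gate is a hypothetical helper name, not kernel code from the talk:

```c
#include <stdint.h>

/* x86-64 interrupt gate descriptor (16 bytes). An interrupt with vector n
 * indexes the IDT; the CPU jumps to the handler offset packed here. */
struct idt_gate {
    uint16_t offset_low;    /* handler address bits 15:0   */
    uint16_t selector;      /* kernel code segment selector */
    uint8_t  ist;           /* interrupt stack table index  */
    uint8_t  type_attr;     /* present, DPL, gate type      */
    uint16_t offset_mid;    /* handler address bits 31:16   */
    uint32_t offset_high;   /* handler address bits 63:32   */
    uint32_t reserved;
};

/* pack a 64-bit interrupt gate pointing at `handler` through segment `cs` */
struct idt_gate make_int_gate(uint64_t handler, uint16_t cs)
{
    struct idt_gate g = {0};
    g.offset_low  = handler & 0xffff;
    g.offset_mid  = (handler >> 16) & 0xffff;
    g.offset_high = (handler >> 32) & 0xffffffff;
    g.selector    = cs;
    g.type_attr   = 0x8e;   /* present | DPL 0 | 64-bit interrupt gate */
    return g;
}
```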
[CDF of unicast IPI latency in cycles, measured from the BSP (core 0); 95th percentile = 1728 cycles]
[Diagram: machine topology: socket, NUMA domain, physical core, logical core]
- software abstractions for asynchronous events
- hardware capabilities
- event performance
- NEMO: benefits of kernel mode
- NEMO: closer to the hardware
[Diagram: core 0 calls read_start_time() at the event trigger; core i calls read_end_time() at the first instruction of event handling]
[Diagram, broadcast case: core 0 calls read_start_time() at the event trigger; cores i, j, and k each timestamp their first handled instruction with read_end_time_i(), read_end_time_j(), read_end_time_k()]
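A minimal sketch of this measurement idea, under the assumption of synchronized invariant TSCs across cores, and using a second thread with a shared flag in place of a real cross-core notification path; __rdtsc() makes it x86/GCC-specific, and measure_once is an illustrative name:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() */

static atomic_int triggered;
static uint64_t start_cycles, end_cycles;

/* the "core i" side: spin until the event fires, then timestamp the
 * first instruction of the handling path (read_end_time()) */
static void *waiter(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&triggered, memory_order_acquire))
        ;
    end_cycles = __rdtsc();
    return NULL;
}

/* the "core 0" side: timestamp the trigger, fire the event, and report
 * one cycles-to-wakeup sample */
uint64_t measure_once(void)
{
    pthread_t t;
    atomic_store(&triggered, 0);
    pthread_create(&t, NULL, waiter, NULL);

    start_cycles = __rdtsc();            /* read_start_time() */
    atomic_store_explicit(&triggered, 1, memory_order_release);

    pthread_join(t, NULL);               /* join makes end_cycles visible */
    return end_cycles - start_cycles;
}
```

In a real run one would pin the waiter to a specific core and take many samples to build the CDFs shown next.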
[Cycles to wakeup, unicast case:
pthread condvar: µ = 25176.5, min = 1145, max = 29955, σ = 3698.93
futex wakeup: µ = 24640.5, min = 81, max = 29996, σ = 3750.51
unicast IPI: µ = 1572.68, min = 1150, max = 17397, σ = 523.279
unicast IPI: 16x faster]
[Cycles to wakeup, broadcast case:
pthread condvar: µ = 995795, min = 17538, max = 2.17277e+06, σ = 544512
futex wakeup: µ = 370630, min = 16402, max = 1.89553e+06, σ = 199680
broadcast IPI: µ = 12827.3, min = 1252, max = 57467, σ = 2931.32
broadcast IPI: 29x faster]
- software abstractions for asynchronous events
- hardware capabilities
- event performance
- NEMO: benefits of kernel mode
- NEMO: closer to the hardware
[Diagram: Nautilus Aerokernel structure. The HRT kernel runs entirely in kernel mode with full privileged hardware access (user mode: nothing); the parallel runtime and parallel application sit directly on Aerokernel services: paging, threads, timers, misc, hardware interrupts, topology, allocation, synchronization/events]
[Hale, Dinda HPDC '15] [Hale, Dinda VEE '16] [Hale, Hetland, Dinda FRIDAY]
[Cycles to wakeup, unicast case:
pthread condvar: µ = 25176.5, min = 1145, max = 29955, σ = 3698.93
futex wakeup: µ = 24640.5, min = 81, max = 29996, σ = 3750.51
Aerokernel condvar: µ = 9128.78, min = 4195, max = 29990, σ = 3025.12
Aerokernel condvar + IPI: µ = 5348.51, min = 4730, max = 6392, σ = 290.006
unicast IPI: µ = 1572.68, min = 1150, max = 17397, σ = 523.279]
32
5000 10000 15000 20000 25000 30000 p t h r e a d c
d v a r f u t e x w a k e u p A e r
e r n e l c
d v a r A e r
e r n e l c
d v a r + I P I u n i c a s t I P I
µ = 25176.5 min = 1145 max = 29955 σ = 3698.93 µ = 24640.5 min = 81 max = 29996 σ = 3750.51 µ = 9128.78 min = 4195 max = 29990 σ = 3025.12 min = 4730 max = 6392 µ = 5348.51 σ = 290.006 min = 1150 max = 17397 µ = 1572.68 σ = 523.279
Cycles to Wakeup
Nemo events 5x
33
[Cycles to wakeup, broadcast case:
pthread condvar: µ = 995795, min = 17538, max = 2.17277e+06, σ = 544512
futex wakeup: µ = 370630, min = 16402, max = 1.89553e+06, σ = 199680
Aerokernel condvar: µ = 265820, min = 3258, max = 612959, σ = 159421
Aerokernel condvar + IPI: µ = 132417, min = 7842, max = 464015, σ = 98637.4
broadcast IPI: µ = 12827.3, min = 1252, max = 57467, σ = 2931.32
Nemo events: 3x faster]
- software abstractions for asynchronous events
- hardware capabilities
- event performance
- NEMO: benefits of kernel mode
- NEMO: closer to the hardware
[Diagram: a message arrives at a CPU; a handler, handle_msg(), runs over memory]
claim: a better fit than, e.g., condition variables for many event-based schemes
[Diagram: cores 0, 1, …, n-1 each hold an Action Lookup Table; nemo_notify_event(core=1, event=3) selects the entry for event ID 3 on core 1]
[Diagram: the Action Descriptor Table maps event IDs 0 … m-1 to actions; event ID 3 maps to handle_event() with argument 0xdeadbeef]
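A hedged userspace sketch of the per-core action tables: in Nautilus the notification would be carried by an IPI to the destination core, whereas this sketch simply invokes the registered action in place. Table sizes, nemo_register_event, and the demo handler are assumptions; only nemo_notify_event's name comes from the slide:

```c
#include <stdint.h>
#include <stddef.h>

#define NEMO_MAX_CORES  64
#define NEMO_MAX_EVENTS 32

typedef void (*nemo_action_t)(void *arg);

/* per-core action descriptor table: event ID -> handler + argument */
struct nemo_action { nemo_action_t fn; void *arg; };
static struct nemo_action actions[NEMO_MAX_CORES][NEMO_MAX_EVENTS];

/* bind an action to (core, event); returns 0 on success, -1 on bad input */
int nemo_register_event(int core, int event, nemo_action_t fn, void *arg)
{
    if (core < 0 || core >= NEMO_MAX_CORES ||
        event < 0 || event >= NEMO_MAX_EVENTS || !fn)
        return -1;
    actions[core][event] = (struct nemo_action){ fn, arg };
    return 0;
}

/* nemo_notify_event(core=1, event=3): look up the action registered on the
 * destination core and run it (the real system raises an IPI instead) */
int nemo_notify_event(int core, int event)
{
    if (core < 0 || core >= NEMO_MAX_CORES ||
        event < 0 || event >= NEMO_MAX_EVENTS)
        return -1;
    struct nemo_action *a = &actions[core][event];
    if (!a->fn)
        return -1;
    a->fn(a->arg);
    return 0;
}

/* demo handler for illustration only */
int demo_hits;
void demo_handler(void *arg) { demo_hits += (int)(intptr_t)arg; }
```

The key design point is that dispatch is a table index plus an indirect call, with no queues and no scheduler involvement.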
[CDF of cycles measured from the BSP (core 0): unicast IPI 95th percentile = 1728; Nemo event notify 95th percentile = 1824; only ~100 cycles of added latency]
[Cycles to wakeup, broadcast:
IPI broadcast: µ = 12792, min = 1252, max = 26838, σ = 2718.73
Nemo broadcast: µ = 12958, min = 1376, max = 29703, σ = 2819.29]
me: http://halek.co
http://presciencelab.org
Nautilus: http://nautilus.halek.co
Hobbes Exascale OS/R project: http://xstack.sandia.gov/hobbes
[CDF of σ in cycles: pthread condvar, futex broadcast, broadcast IPI; 70x]
[CDF of σ in cycles: pthread condvar, futex broadcast, Aerokernel condvar, Aerokernel condvar + IPI, broadcast IPI; 2x]
[CDF of σ: IPI broadcast vs. Nemo broadcast; within ~50 cycles]
[Diagram: a thread spins on a flag in memory, while (!stuff_happened), with the line replicated in each core's cache]
[Diagram: when another thread sets the flag (stuff happened), the spinning core's cached copy is invalidated (10s of cycles)]
[CDF of cycles measured from the BSP (core 0): unicast IPI vs. synchronous event (memory polling)]
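The polling scheme above can be sketched with a C11 atomic flag. The 64-byte alignment (one flag per cache line, so the only coherence traffic on notification is a single line invalidation) is an assumption about the target machine, and the poll_demo harness is illustrative:

```c
#include <stdatomic.h>
#include <pthread.h>

/* synchronous event via memory polling: the flag gets its own cache line */
struct poll_event {
    _Alignas(64) atomic_int fired;
};

/* waiter side: while (!stuff_happened) spin; burns a core, but the wakeup
 * path is just one coherence invalidation, not an interrupt + scheduler */
void poll_event_wait(struct poll_event *e)
{
    while (!atomic_load_explicit(&e->fired, memory_order_acquire))
        ;
}

/* trigger side: a single store */
void poll_event_trigger(struct poll_event *e)
{
    atomic_store_explicit(&e->fired, 1, memory_order_release);
}

static struct poll_event ev;

static void *poller(void *out)
{
    poll_event_wait(&ev);
    *(int *)out = 1;         /* observed the event */
    return NULL;
}

/* spin up one waiter, fire the event, confirm it was observed */
int poll_demo(void)
{
    int observed = 0;
    pthread_t t;
    pthread_create(&t, NULL, poller, &observed);
    poll_event_trigger(&ev);
    pthread_join(t, NULL);
    return observed;
}
```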
[Diagram: the cache-coherence network and the interrupt network are separate paths through the machine]
[Diagram: IPI cost breakdown: APIC write, time on wire, destination handling (note: these are indirectly measured)]
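For reference, the "APIC write" step of a unicast IPI is one write of the local APIC's interrupt command register (ICR). This sketch only packs the 64-bit xAPIC ICR value and never touches hardware; the field layout follows the architected register, and the helper name is made up:

```c
#include <stdint.h>

/* pack a fixed-delivery unicast IPI command for the xAPIC ICR:
 * bits 0-7   = vector to raise on the destination
 * bits 8-10  = delivery mode (0 = fixed)
 * bit  14    = level (1 = assert)
 * bits 56-63 = destination APIC ID */
uint64_t apic_icr_fixed_ipi(uint8_t vector, uint8_t dest_apic_id)
{
    uint64_t icr = 0;
    icr |= vector;                           /* interrupt vector */
    icr |= (0ULL << 8);                      /* delivery mode: fixed */
    icr |= (1ULL << 14);                     /* level: assert */
    icr |= ((uint64_t)dest_apic_id << 56);   /* destination field */
    return icr;
}
```

The "time on wire" and "destination handling" components start only after this value lands in the ICR, which is why they can only be measured indirectly from the sender's side.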
[CDF of cycles measured from the BSP (core 0): unicast IPI, synchronous event (memory polling), and projected remote syscall, isolating the cost of destination handling]