Scalable NUMA-aware Blocking Synchronization Primitives
Sanidhya Kashyap, Changwoo Min, Taesoo Kim
The rise of big NUMA machines
Importance of NUMA awareness
[Figure: W1–W6, spread over NUMA node 1 and NUMA node 2, contend for a lock (L) on a file. NUMA-oblivious order: W1 W2 W3 W4 W6 W5. NUMA-aware/hierarchical order: W1 W6 W2 W3 W4 W5.]
Lock research:
– Dekker's algorithm (1962)
– Semaphore (1965)
– Lamport's bakery algorithm (1974)
– Backoff lock (1989)
– Ticket lock (1991)
– MCS lock (1991)
– HBO lock (2003)
– Hierarchical lock – HCLH (2006)
– Flat combining NUMA lock (2011)
– Remote Core locking (2012)
– Cohort lock (2012)
– RW cohort lock (2013)
– Malthusian lock (2014)
– HMCS lock (2015)
– AHMCS lock (2016)
Linux kernel locks:
– Spinlock: TTAS → Semaphore: TTAS + block → Rwsem: TTAS + block
– Spinlock: ticket → Mutex: TTAS + spin + block → Rwsem: TTAS + spin + block
– Spinlock: ticket → Mutex: TTAS + block → Rwsem: TTAS + block
– Spinlock: qspinlock → Mutex: TTAS + spin + block → Rwsem: TTAS + spin + block
Timeline: 1990s, 2011, 2014, 2016
– Cohort lock single instance: 1600 bytes
– Example: 1–4 GB of lock space vs. 38 MB for Linux's locks
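A quick back-of-the-envelope check of the footprint figures above. This sketch only divides the quoted totals by the quoted per-instance size; reading "GB" and "MB" as powers of two is an assumption, and the helper name `lock_instances` is made up for illustration. Under those assumptions, 1–4 GB corresponds to roughly 0.67 to 2.7 million cohort lock instances, about 27 to 108 times the 38 MB figure.

```python
COHORT_LOCK_BYTES = 1600   # per-instance size quoted on the slide
GIB = 2**30                # assumption: "GB" is read as 2**30 bytes
MIB = 2**20                # assumption: "MB" is read as 2**20 bytes


def lock_instances(total_bytes, per_lock_bytes=COHORT_LOCK_BYTES):
    """Number of whole lock instances that fit in total_bytes."""
    return total_bytes // per_lock_bytes


# Instance counts implied by the 1 GB and 4 GB ends of the quoted range.
low = lock_instances(1 * GIB)
high = lock_instances(4 * GIB)

# How much larger the quoted range is than the quoted 38 MB.
ratio_low = (1 * GIB) / (38 * MIB)
ratio_high = (4 * GIB) / (38 * MIB)

print(f"{low:,} to {high:,} cohort lock instances")
print(f"{ratio_low:.0f}x to {ratio_high:.0f}x the 38 MB footprint")
```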
– #threads <= #cores (under-subscription)
– #threads > #cores (over-subscription)
[Figure: lock throughput vs. #thread for three waiting strategies: spinning, parking, and spin + park.]
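The three waiting strategies can be contrasted with a small user-space sketch of "spin + park". It is not the kernel mutex or the CST lock: it wraps Python's `threading.Lock`, and the class name `SpinThenParkLock`, the `spin_budget_s` knob, and the `parks` counter are hypothetical names introduced here. A waiter retries a non-blocking acquire for a bounded time (spinning) and only then blocks (parking).

```python
import threading
import time


class SpinThenParkLock:
    """Illustrative spin-then-park mutex built on threading.Lock."""

    def __init__(self, spin_budget_s=0.001):
        self._lock = threading.Lock()
        self.spin_budget_s = spin_budget_s  # hypothetical tuning knob
        self.parks = 0  # statistic only; not updated atomically

    def acquire(self):
        deadline = time.monotonic() + self.spin_budget_s
        # Phase 1: spin, retrying a non-blocking acquire until the budget expires.
        while True:
            if self._lock.acquire(blocking=False):
                return "spun"
            if time.monotonic() >= deadline:
                break
        # Phase 2: park, i.e. block until the holder releases the lock.
        self.parks += 1
        self._lock.acquire()
        return "parked"

    def release(self):
        self._lock.release()
```

With a spin budget of zero this degenerates to one try followed by parking; with a very large budget it behaves like pure spinning, which wastes CPU time once #threads exceeds #cores.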
– Scheduling overhead in the critical path
– Cache-line contention while scheduling out
– Allocate socket-specific data structure when used
– 1.5–10X less memory consumption
– Limit the spinning up to a waiter's time quantum
– Pass the lock to an active waiter
– Improves scalability by 1.2–4.7X
➢ Cohort lock principle
➢ Allocate socket structure (snode) when used
➢ Snodes stay allocated for the lifetime of the lock
➢ Maintain separate per-socket parking lists for
➢ Pass the lock to a spinning waiter
➢ Waiters park themselves if more than one tasks are
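The "allocate socket structure (snode) when used" point can be sketched as a lock that starts with an empty socket list and creates a per-socket node on first use. This is a single-threaded model under stated assumptions: `SNode`, `LazySocketLock`, `snode_for`, and `enqueue` are hypothetical names, plain Python lists stand in for the waiting and parking queues, and a real implementation would need an atomic publish step when two threads from a new socket arrive at once.

```python
from dataclasses import dataclass, field


@dataclass
class SNode:
    """Per-socket state; created only when a socket first uses the lock."""
    socket_id: int
    waiting: list = field(default_factory=list)  # stand-in for the waiting queue
    parking: list = field(default_factory=list)  # stand-in for the parking list


class LazySocketLock:
    """Hypothetical container showing on-demand snode allocation only."""

    def __init__(self):
        self.socket_list = {}  # socket id -> SNode, empty until first use

    def snode_for(self, socket_id):
        # Allocate the snode the first time a thread from this socket arrives.
        snode = self.socket_list.get(socket_id)
        if snode is None:
            snode = SNode(socket_id)
            self.socket_list[socket_id] = snode
        return snode

    def enqueue(self, thread_name, socket_id):
        # Join the waiting queue of the caller's own socket.
        self.snode_for(socket_id).waiting.append(thread_name)


lock = LazySocketLock()
for name, socket_id in [("T1", 1), ("T2", 1), ("T3", 1), ("T4", 2)]:
    lock.enqueue(name, socket_id)
print(sorted(lock.socket_list))  # sockets that actually used the lock
```

A lock touched from only one socket pays for one snode, which is where the memory saving over a statically sized per-socket array comes from.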
[Figure: socket_list growth as threads arrive. The lock starts with an empty socket_list and a global_tail. T1/S1 arrives: socket_list [S1]. T2/S1 and T3/S1 arrive: socket_list stays [S1]. T4/S2 arrives: socket_list [S1, S2].]
CST lock instance
Status legend:
– L: locked
– UW: unparked/spinning waiter
– PW: parked / blocked / scheduled out waiter
[Figure: CST lock instance (socket_list, global_tail). Each snode carries waiting_tail, parking_tail, snode_next, and a status; each waiter node carries next, p_next, and a status. T1/S1 arrives on socket 1 (socket_list [S1]) and acquires the global lock. T2/S1 and T3/S1 then queue on socket 1 as UW. T4/S2 arrives on socket 2 (socket_list [S1, S2]) as UW.]
[Figure: CST lock instance with T1–T4 queued as before; a waiter on socket 1 (T2/S1 in the diagram) parks and its status changes from UW to PW.]
– Update the status from UW to PW
– Add themselves to the parking list
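The two parking steps above can be modelled as a small state machine. The slides give only the steps themselves; treating the status change as a compare-and-swap, so that a waiter cannot park at the same moment the holder hands it the lock, is an assumption of this sketch. `Waiter`, `cas_status`, `park`, and `grant` are hypothetical names, and a `threading.Lock` stands in for a hardware atomic.

```python
import threading
from enum import Enum


class Status(Enum):
    L = "locked"
    UW = "unparked/spinning waiter"
    PW = "parked waiter"


class Waiter:
    def __init__(self, name):
        self.name = name
        self.status = Status.UW  # waiters start out spinning
        self._guard = threading.Lock()  # models an atomic compare-and-swap

    def cas_status(self, expected, new):
        """Set status to `new` only if it is currently `expected`."""
        with self._guard:
            if self.status is expected:
                self.status = new
                return True
            return False


def park(waiter, parking_list):
    """Waiter side: UW -> PW, then join the parking list."""
    if waiter.cas_status(Status.UW, Status.PW):
        parking_list.append(waiter)
        return True
    return False  # the lock was handed over first, so do not go to sleep


def grant(waiter):
    """Holder side: hand the lock to a waiter that is still spinning."""
    return waiter.cas_status(Status.UW, Status.L)
```

Because both transitions start from UW, exactly one of `park` and `grant` can succeed for a given waiter, which rules out handing the lock to a thread that has just scheduled itself out.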
[Figure: lock hand-off on socket 1. T1/S1 holds the lock (L) while T2/S1 and T3/S1 wait as UW. T2/S1 parks (PW). When T1 releases, the lock passes to the spinning waiter T3/S1 (L) and T2 stays parked.]
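The hand-off shown in the diagram (T2 parked, T3 spinning, T3 receives the lock) can be sketched as a scan over the waiters in arrival order. This is a simplified single-threaded model: the function name `hand_off`, the dictionary layout, and the `woken` flag are hypothetical, and the fallback of waking the oldest parked waiter when nobody is spinning is an assumption rather than something stated on the slides.

```python
def hand_off(waiters):
    """Choose the next lock holder from `waiters` (dicts in arrival order).

    Returns the waiter that now holds the lock, or None if nobody waits.
    """
    # First choice: a waiter that is still spinning can take over at once.
    for w in waiters:
        if w["status"] == "UW":
            w["status"] = "L"
            return w
    # Fallback (assumption): nobody is spinning, so wake the oldest parked waiter.
    for w in waiters:
        if w["status"] == "PW":
            w["status"] = "L"
            w["woken"] = True  # marks that a wake-up would be needed
            return w
    return None


queue = [{"name": "T2", "status": "PW"}, {"name": "T3", "status": "UW"}]
holder = hand_off(queue)
print(holder["name"], [w["status"] for w in queue])
```

Skipping T2 keeps the wake-up latency of a sleeping thread out of the critical path, at the cost of strict arrival-order fairness.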
– File system: inode
– Memory management: mmap_sem
– Read-write lock
– Pseudo code
[Figure: Throughput (jobs/hour) and memory utilization (memory footprint, MB) vs. #thread for Vanilla, Cohort, and CST.]
[Figure: Enumerate a directory (rwsem) and file creation (mutex), in M ops/sec vs. #thread (1 to 256), for Vanilla, Cohort, and CST.]
– NUMA-aware mutex and read-write semaphore
– Resolve NUMA-aware lock’s footprint issue
– Scheduling-aware parking/wake-up strategy
– Mitigate scheduler interaction