SLIDE 1

Scalable NUMA-aware Blocking Synchronization Primitives

Sanidhya Kashyap, Changwoo Min, Taesoo Kim

SLIDES 2–4

The rise of big NUMA machines

SLIDES 5–7

Importance of NUMA awareness

[Figure: six writers (W1–W6) spread across NUMA node 1 and NUMA node 2 contend for lock L protecting a file. A NUMA-oblivious ordering serves them as W1 W2 W3 W4 W6 W5, bouncing the lock between sockets; a NUMA-aware/hierarchical ordering serves same-socket waiters back to back: W1 W6, then W2 W3 W4 W5.]

Idea: Make synchronization primitives NUMA-aware!

SLIDES 8–11

Lock research efforts and their use

Lock research efforts: Dekker's algorithm (1962), Semaphore (1965), Lamport's bakery algorithm (1974), Backoff lock (1989), Ticket lock (1991), MCS lock (1991), HBO lock (2003), Hierarchical lock – HCLH (2006), Flat combining NUMA lock (2011), Remote Core Locking (2012), Cohort lock (2012), RW cohort lock (2013), Malthusian lock (2014), HMCS lock (2015), AHMCS lock (2016). The locks from HBO (2003) onward are the NUMA-aware locks.

Linux kernel lock adoption/modification:

1990s: spinlock → TTAS; semaphore → TTAS + block; rwsem → TTAS + block
2011: spinlock → ticket; mutex → TTAS + spin + block; rwsem → TTAS + spin + block
2014: spinlock → ticket; mutex → TTAS + block; rwsem → TTAS + block
2016: spinlock → qspinlock; mutex → TTAS + spin + block; rwsem → TTAS + spin + block

Adopting NUMA-aware locks is not easy.

SLIDE 12

Issues with NUMA-aware primitives

  • Memory footprint overhead
    – A single Cohort lock instance: 1,600 bytes
    – Example: 1–4 GB of lock space for 10 M inodes vs 38 MB for Linux's locks
  • No support for blocking/parking behavior
SLIDES 13–17

Blocking/parking approach

  • Under-subscription
    – #threads <= #cores
  • Over-subscription
    – #threads > #cores
  • Spin-then-park strategy
    1) Spin for a certain duration
    2) Add to a parking list
    3) Schedule out (park/block)

[Figure: lock throughput vs #threads for the spinning, parking, and spin + park strategies, across the under-subscribed and over-subscribed regions.]
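The spin-then-park steps above can be sketched in user-space C. This is a minimal illustration, not the paper's implementation: the spin budget, `waiter_t`, and its fields are invented for the example, and the "parking list" is reduced to blocking on a single condition variable.

```c
#include <pthread.h>
#include <stdatomic.h>

#define SPIN_BUDGET 1000   /* illustrative spin duration */

typedef struct {
    atomic_int ready;       /* set by the lock holder at hand-off */
    pthread_mutex_t m;      /* protects the parking state */
    pthread_cond_t cv;      /* waiters park (block) here */
} waiter_t;

void waiter_wait(waiter_t *w)
{
    /* 1) Spin for a certain duration. */
    for (int i = 0; i < SPIN_BUDGET; i++)
        if (atomic_load(&w->ready))
            return;
    /* 2)+3) Add to the parking list and schedule out. */
    pthread_mutex_lock(&w->m);
    while (!atomic_load(&w->ready))
        pthread_cond_wait(&w->cv, &w->m);
    pthread_mutex_unlock(&w->m);
}

void waiter_wake(waiter_t *w)
{
    pthread_mutex_lock(&w->m);
    atomic_store(&w->ready, 1);
    pthread_cond_signal(&w->cv);   /* harmless if the waiter never parked */
    pthread_mutex_unlock(&w->m);
}
```

Checking `ready` under the mutex before sleeping is what prevents a lost wake-up between the spin phase and the park phase.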

SLIDE 18

Issues with blocking synchronization primitives

  • High memory footprint for NUMA-aware locks
  • Inefficient blocking strategy
    – Scheduling overhead in the critical path
    – Cache-line contention while scheduling out

SLIDE 19

CST lock

  • NUMA-aware lock
  • Low memory footprint
    – Allocates socket-specific data structures only when used
    – 1.5–10X less memory consumption
  • Efficient parking/wake-up strategy
    – Limits spinning to a waiter's time quantum
    – Passes the lock to an active waiter
    – Improves scalability by 1.2–4.7X

SLIDE 20

CST lock design

  • NUMA-aware lock
    ➢ Cohort lock principle
    + Mitigates cache-line contention and bouncing
  • Memory-efficient data structure
    ➢ Allocate a socket structure (snode) only when used
    ➢ Snodes stay active for the lifetime of the lock
    + Does not stress the memory allocator

SLIDE 21

CST lock design

  • NUMA-aware parking list
    ➢ Maintain separate per-socket parking lists for readers and writers
    + Mitigates cache-line contention in the over-subscribed scenario
    + Allows distributed wake-up of parked readers

SLIDE 22

CST lock design

  • Remove scheduler intervention
    ➢ Pass the lock to a spinning waiter
    ➢ Waiters park themselves if more than one task is running on their CPU (system load)
    + Scheduler is not involved in the critical path
    + Guarantees forward progress of the system

SLIDES 23–26

Lock instantiation

  • Initially, no snodes are allocated
  • A thread in a particular socket initiates the allocation

[Figure: a CST lock instance holds a socket_list and a global_tail. The list starts empty; when T1 arrives on socket 1, it allocates snode S1 (socket_list = [S1]); T2 and T3 on socket 1 reuse S1; when T4 arrives on socket 2, it allocates S2 (socket_list = [S1, S2]).]
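The on-demand allocation above can be sketched with a compare-and-swap: the first thread from a socket installs a freshly allocated snode, and a thread that loses the race frees its copy and uses the winner's. This is a sketch, not the paper's code; `MAX_SOCKETS`, `get_snode`, and the field set are illustrative.

```c
#include <stdatomic.h>
#include <stdlib.h>

#define MAX_SOCKETS 8   /* illustrative upper bound */

struct snode { int socket_id; /* waiting_tail, parking_tail, snode_next ... */ };

struct cst_lock {
    _Atomic(struct snode *) snodes[MAX_SOCKETS];   /* NULL until first use */
};

struct snode *get_snode(struct cst_lock *l, int sid)
{
    struct snode *s = atomic_load(&l->snodes[sid]);
    if (s)
        return s;                        /* common case: already installed */
    struct snode *fresh = calloc(1, sizeof(*fresh));
    fresh->socket_id = sid;
    struct snode *expected = NULL;
    if (atomic_compare_exchange_strong(&l->snodes[sid], &expected, fresh))
        return fresh;                    /* we won the installation race */
    free(fresh);                         /* another thread installed first */
    return expected;                     /* CAS stored the winner here */
}
```

A lock instance thus costs only the array of pointers until threads on a socket actually contend, which is where the footprint savings come from.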

SLIDES 27–32

CST lock phase

  • Each thread allocates its thread-specific structure on the stack
  • Three states for each node:
    – L: locked
    – UW: unparked/spinning waiter
    – PW: parked/blocked (scheduled-out) waiter

[Figure: each snode carries a waiting_tail, a parking_tail, and an snode_next link; each per-thread node carries next and p_next pointers plus a state. T1 on socket 1 enqueues on S1's waiting queue, takes the socket-local lock, then acquires the global lock (global_tail). T2 and T3 queue behind T1 as spinning waiters (UW); T4 on socket 2 allocates S2 and queues there as UW.]
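The per-socket queueing in the lock phase can be sketched as an MCS-style queue on the snode's waiting_tail. This is a simplified sketch under stated assumptions: the global lock, the parking list (p_next, PW), and the cohort hand-off are elided, and the waiter spins forever instead of spinning then parking.

```c
#include <stdatomic.h>
#include <stddef.h>

enum { ST_L, ST_UW };   /* locked; unparked/spinning waiter */

struct qnode  { _Atomic int state; _Atomic(struct qnode *) next; };
struct snode_q { _Atomic(struct qnode *) waiting_tail; };

void socket_lock(struct snode_q *s, struct qnode *me)
{
    atomic_store(&me->state, ST_UW);
    atomic_store(&me->next, NULL);
    struct qnode *pred = atomic_exchange(&s->waiting_tail, me);
    if (!pred) {                           /* empty queue: lock acquired */
        atomic_store(&me->state, ST_L);
        return;
    }
    atomic_store(&pred->next, me);         /* link behind predecessor */
    while (atomic_load(&me->state) == ST_UW)
        ;                                  /* spin (spin-then-park elided) */
}

void socket_unlock(struct snode_q *s, struct qnode *me)
{
    struct qnode *succ = atomic_load(&me->next);
    if (!succ) {
        struct qnode *expected = me;
        if (atomic_compare_exchange_strong(&s->waiting_tail, &expected, NULL))
            return;                        /* no waiter: queue is empty */
        while (!(succ = atomic_load(&me->next)))
            ;                              /* successor still enqueuing */
    }
    atomic_store(&succ->state, ST_L);      /* pass the socket-local lock */
}
```

Because each `struct qnode` lives on its owner's stack, the queue adds no per-lock heap state beyond the tail pointer, matching the slide's "allocate on the stack" point.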

SLIDE 33

CST lock phase: blocking/parking

  • Before scheduling out, waiters atomically:
    – Update their status from UW to PW
    – Add themselves to the parking list

[Figure: T2 on socket 1 transitions from UW to PW and moves onto S1's parking list (parking_tail); T3 keeps spinning on the waiting queue; each snode carries a condition variable (CV) for wake-up.]
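The atomic UW→PW status update can be sketched as a compare-and-swap: a waiter may park only while it is still in UW, and if the state has already changed (e.g. the holder handed it the lock, state L), the CAS fails and parking must be abandoned. The states mirror the slides; `try_park` and the field names are illustrative.

```c
#include <stdatomic.h>

enum qstate { STATE_L, STATE_UW, STATE_PW };

struct qnode { _Atomic int state; struct qnode *next, *p_next; };

/* Returns 1 if the waiter moved UW -> PW and may now add itself to the
 * parking list and schedule out; 0 if parking must be abandoned. */
int try_park(struct qnode *q)
{
    int expected = STATE_UW;
    return atomic_compare_exchange_strong(&q->state, &expected, STATE_PW);
}
```

This single CAS is what makes "update the status and join the parking list" safe against a concurrent lock hand-off.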

SLIDES 34–37

CST unlock phase

Pass the lock to a spinning waiter.

[Figure: T1 (L) heads S1's waiting queue with T2 and T3 queued behind it as UW. T2 then parks (UW → PW). On unlock, T1 skips the parked T2 and hands the lock directly to the still-spinning T3, whose state becomes L.]
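The hand-off above can be sketched as a scan that grants the lock to the first waiter still spinning (UW), skipping parked waiters (PW) so no scheduler wake-up sits in the critical path. This is a sketch: state names mirror the slides, `pass_to_spinning` is invented, and the fallback that wakes parked waiters off the critical path is elided.

```c
#include <stdatomic.h>
#include <stddef.h>

enum { N_L, N_UW, N_PW };   /* locked; spinning waiter; parked waiter */

struct qnode { _Atomic int state; struct qnode *next; };

/* Returns the waiter that received the lock, or NULL if every
 * successor had parked (the wake-up path would then take over). */
struct qnode *pass_to_spinning(struct qnode *holder)
{
    for (struct qnode *q = holder->next; q; q = q->next) {
        int expected = N_UW;
        /* CAS so a waiter racing into PW is never granted the lock. */
        if (atomic_compare_exchange_strong(&q->state, &expected, N_L))
            return q;
    }
    return NULL;
}
```

The CAS here is the mirror image of the waiter's UW→PW transition: exactly one side wins, so a waiter is either granted the lock or parks, never both.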

SLIDE 38

Implementation

  • Implemented in the Linux kernel
  • Structures modified
    – File system: inode
    – Memory management: mmap_sem
  • Please see our paper for
    – The read-write lock
    – Pseudo-code

https://github.com/sslab-gatech/cst-locks

SLIDE 39

Evaluation

  • Performance of the locks in terms of scalability and memory footprint?
  • Effectiveness of the blocking/parking strategy?
  • Setup: 8-socket, 120-core NUMA machine
SLIDE 40

Case study: Psearchy

  • Overcomes memory footprint and scheduling overhead
  • Uses 1.5–9.1X less memory than the Cohort lock
  • Improves throughput by 1.4–1.6X

[Figure: throughput (jobs/hour) and memory footprint (MB) vs #threads (20–120) for Vanilla, Cohort, and CST.]

SLIDE 41

Effective parking strategy

  • Better performance in both the under- and over-subscribed scenarios
  • Improves scalability by 1.3–3.7X

[Figure: M ops/sec vs #threads (1–256) for enumerating a directory (rwsem) and file creation (mutex), comparing Vanilla, Cohort, and CST.]

SLIDE 42

Conclusion

  • Two blocking synchronization primitives
    – NUMA-aware mutex and read-write semaphore
  • Dynamically allocated data structures
    – Resolve the NUMA-aware locks' footprint issue
  • Efficient spin-then-park strategy
    – Scheduling-aware parking/wake-up strategy
    – Mitigates scheduler interaction

Thank you!