SLIDE 1

Scalable NUMA-aware Blocking Synchronization Primitives

Sanidhya Kashyap, Changwoo Min, Taesoo Kim

SLIDES 2–4

The rise of big NUMA machines

SLIDES 5–7

Importance of NUMA awareness

[Figure: six writers (W1–W6) spread across NUMA node 1 and NUMA node 2 contend for lock L protecting a file. A NUMA-oblivious ordering serves them as W1 W2 W3 W4 W6 W5, bouncing the lock between sockets; a NUMA-aware/hierarchical ordering serves same-socket waiters back to back: W1 W6, then W2 W3 W4 W5.]

Idea: Make synchronization primitives NUMA-aware!

SLIDES 8–11

Lock research efforts and their use

Lock research efforts: Dekker's algorithm (1962), Semaphore (1965), Lamport's bakery algorithm (1974), Backoff lock (1989), Ticket lock (1991), MCS lock (1991), HBO lock (2003), Hierarchical lock – HCLH (2006), Flat combining NUMA lock (2011), Remote Core Locking (2012), Cohort lock (2012), RW cohort lock (2013), Malthusian lock (2014), HMCS lock (2015), AHMCS lock (2016). The locks from HBO (2003) onward are the NUMA-aware locks.

Linux kernel lock adoption/modification:

1990s: spinlock → TTAS; semaphore → TTAS + block; rwsem → TTAS + block
2011: spinlock → ticket; mutex → TTAS + spin + block; rwsem → TTAS + spin + block
2014: spinlock → ticket; mutex → TTAS + block; rwsem → TTAS + block
2016: spinlock → qspinlock; mutex → TTAS + spin + block; rwsem → TTAS + spin + block

Adopting NUMA-aware locks is not easy.

SLIDE 12

Issues with NUMA-aware primitives

  • Memory footprint overhead
    – A single Cohort lock instance: 1,600 bytes
    – Example: 1–4 GB of lock space for 10 M inodes vs 38 MB for Linux's locks
  • No support for blocking/parking behavior
SLIDES 13–17

Blocking/parking approach

  • Under-subscription
    – #threads <= #cores
  • Over-subscription
    – #threads > #cores
  • Spin-then-park strategy
    1) Spin for a certain duration
    2) Add to a parking list
    3) Schedule out (park/block)

[Figure: lock throughput vs #threads for the spinning, parking, and spin + park strategies, across the under-subscribed and over-subscribed regions.]
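The spin-then-park steps above can be sketched in user-space C. This is a minimal illustration, not the paper's implementation: the spin budget, `waiter_t`, and its fields are invented for the example, and the "parking list" is reduced to blocking on a single condition variable.

```c
#include <pthread.h>
#include <stdatomic.h>

#define SPIN_BUDGET 1000   /* illustrative spin duration */

typedef struct {
    atomic_int ready;       /* set by the lock holder at hand-off */
    pthread_mutex_t m;      /* protects the parking state */
    pthread_cond_t cv;      /* waiters park (block) here */
} waiter_t;

void waiter_wait(waiter_t *w)
{
    /* 1) Spin for a certain duration. */
    for (int i = 0; i < SPIN_BUDGET; i++)
        if (atomic_load(&w->ready))
            return;
    /* 2)+3) Add to the parking list and schedule out. */
    pthread_mutex_lock(&w->m);
    while (!atomic_load(&w->ready))
        pthread_cond_wait(&w->cv, &w->m);
    pthread_mutex_unlock(&w->m);
}

void waiter_wake(waiter_t *w)
{
    pthread_mutex_lock(&w->m);
    atomic_store(&w->ready, 1);
    pthread_cond_signal(&w->cv);   /* harmless if the waiter never parked */
    pthread_mutex_unlock(&w->m);
}
```

Checking `ready` under the mutex before sleeping is what prevents a lost wake-up between the spin phase and the park phase.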

SLIDE 18

Issues with blocking synchronization primitives

  • High memory footprint for NUMA-aware locks
  • Inefficient blocking strategy
    – Scheduling overhead in the critical path
    – Cache-line contention while scheduling out

SLIDE 19

CST lock

  • NUMA-aware lock
  • Low memory footprint
    – Allocates socket-specific data structures only when used
    – 1.5–10X less memory consumption
  • Efficient parking/wake-up strategy
    – Limits spinning to a waiter's time quantum
    – Passes the lock to an active waiter
    – Improves scalability by 1.2–4.7X

SLIDE 20

CST lock design

  • NUMA-aware lock
    ➢ Cohort lock principle
    + Mitigates cache-line contention and bouncing
  • Memory-efficient data structure
    ➢ Allocate a socket structure (snode) only when used
    ➢ Snodes stay active for the lifetime of the lock
    + Does not stress the memory allocator

SLIDE 21

CST lock design

  • NUMA-aware parking list
    ➢ Maintain separate per-socket parking lists for readers and writers
    + Mitigates cache-line contention in the over-subscribed scenario
    + Allows distributed wake-up of parked readers

SLIDE 22

CST lock design

  • Remove scheduler intervention
    ➢ Pass the lock to a spinning waiter
    ➢ Waiters park themselves if more than one task is running on their CPU (system load)
    + Scheduler is not involved in the critical path
    + Guarantees forward progress of the system

SLIDES 23–26

Lock instantiation

  • Initially, no snodes are allocated
  • A thread in a particular socket initiates the allocation

[Figure: a CST lock instance holds a socket_list and a global_tail. The list starts empty; when T1 arrives on socket 1, it allocates snode S1 (socket_list = [S1]); T2 and T3 on socket 1 reuse S1; when T4 arrives on socket 2, it allocates S2 (socket_list = [S1, S2]).]
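The on-demand allocation above can be sketched with a compare-and-swap: the first thread from a socket installs a freshly allocated snode, and a thread that loses the race frees its copy and uses the winner's. This is a sketch, not the paper's code; `MAX_SOCKETS`, `get_snode`, and the field set are illustrative.

```c
#include <stdatomic.h>
#include <stdlib.h>

#define MAX_SOCKETS 8   /* illustrative upper bound */

struct snode { int socket_id; /* waiting_tail, parking_tail, snode_next ... */ };

struct cst_lock {
    _Atomic(struct snode *) snodes[MAX_SOCKETS];   /* NULL until first use */
};

struct snode *get_snode(struct cst_lock *l, int sid)
{
    struct snode *s = atomic_load(&l->snodes[sid]);
    if (s)
        return s;                        /* common case: already installed */
    struct snode *fresh = calloc(1, sizeof(*fresh));
    fresh->socket_id = sid;
    struct snode *expected = NULL;
    if (atomic_compare_exchange_strong(&l->snodes[sid], &expected, fresh))
        return fresh;                    /* we won the installation race */
    free(fresh);                         /* another thread installed first */
    return expected;                     /* CAS stored the winner here */
}
```

A lock instance thus costs only the array of pointers until threads on a socket actually contend, which is where the footprint savings come from.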

SLIDES 27–32

CST lock phase

  • Each thread allocates its thread-specific structure on the stack
  • Three states for each node:
    – L: locked
    – UW: unparked/spinning waiter
    – PW: parked/blocked (scheduled-out) waiter

[Figure: each snode carries a waiting_tail, a parking_tail, and an snode_next link; each per-thread node carries next and p_next pointers plus a state. T1 on socket 1 enqueues on S1's waiting queue, takes the socket-local lock, then acquires the global lock (global_tail). T2 and T3 queue behind T1 as spinning waiters (UW); T4 on socket 2 allocates S2 and queues there as UW.]
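The per-socket queueing in the lock phase can be sketched as an MCS-style queue on the snode's waiting_tail. This is a simplified sketch under stated assumptions: the global lock, the parking list (p_next, PW), and the cohort hand-off are elided, and the waiter spins forever instead of spinning then parking.

```c
#include <stdatomic.h>
#include <stddef.h>

enum { ST_L, ST_UW };   /* locked; unparked/spinning waiter */

struct qnode  { _Atomic int state; _Atomic(struct qnode *) next; };
struct snode_q { _Atomic(struct qnode *) waiting_tail; };

void socket_lock(struct snode_q *s, struct qnode *me)
{
    atomic_store(&me->state, ST_UW);
    atomic_store(&me->next, NULL);
    struct qnode *pred = atomic_exchange(&s->waiting_tail, me);
    if (!pred) {                           /* empty queue: lock acquired */
        atomic_store(&me->state, ST_L);
        return;
    }
    atomic_store(&pred->next, me);         /* link behind predecessor */
    while (atomic_load(&me->state) == ST_UW)
        ;                                  /* spin (spin-then-park elided) */
}

void socket_unlock(struct snode_q *s, struct qnode *me)
{
    struct qnode *succ = atomic_load(&me->next);
    if (!succ) {
        struct qnode *expected = me;
        if (atomic_compare_exchange_strong(&s->waiting_tail, &expected, NULL))
            return;                        /* no waiter: queue is empty */
        while (!(succ = atomic_load(&me->next)))
            ;                              /* successor still enqueuing */
    }
    atomic_store(&succ->state, ST_L);      /* pass the socket-local lock */
}
```

Because each `struct qnode` lives on its owner's stack, the queue adds no per-lock heap state beyond the tail pointer, matching the slide's "allocate on the stack" point.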

SLIDE 33

CST lock phase: blocking/parking

  • Before scheduling out, waiters atomically:
    – Update their status from UW to PW
    – Add themselves to the parking list

[Figure: T2 on socket 1 transitions from UW to PW and moves onto S1's parking list (parking_tail); T3 keeps spinning on the waiting queue; each snode carries a condition variable (CV) for wake-up.]
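The atomic UW→PW status update can be sketched as a compare-and-swap: a waiter may park only while it is still in UW, and if the state has already changed (e.g. the holder handed it the lock, state L), the CAS fails and parking must be abandoned. The states mirror the slides; `try_park` and the field names are illustrative.

```c
#include <stdatomic.h>

enum qstate { STATE_L, STATE_UW, STATE_PW };

struct qnode { _Atomic int state; struct qnode *next, *p_next; };

/* Returns 1 if the waiter moved UW -> PW and may now add itself to the
 * parking list and schedule out; 0 if parking must be abandoned. */
int try_park(struct qnode *q)
{
    int expected = STATE_UW;
    return atomic_compare_exchange_strong(&q->state, &expected, STATE_PW);
}
```

This single CAS is what makes "update the status and join the parking list" safe against a concurrent lock hand-off.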

SLIDES 34–37

CST unlock phase

Pass the lock to a spinning waiter.

[Figure: T1 (L) heads S1's waiting queue with T2 and T3 queued behind it as UW. T2 then parks (UW → PW). On unlock, T1 skips the parked T2 and hands the lock directly to the still-spinning T3, whose state becomes L.]
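The hand-off above can be sketched as a scan that grants the lock to the first waiter still spinning (UW), skipping parked waiters (PW) so no scheduler wake-up sits in the critical path. This is a sketch: state names mirror the slides, `pass_to_spinning` is invented, and the fallback that wakes parked waiters off the critical path is elided.

```c
#include <stdatomic.h>
#include <stddef.h>

enum { N_L, N_UW, N_PW };   /* locked; spinning waiter; parked waiter */

struct qnode { _Atomic int state; struct qnode *next; };

/* Returns the waiter that received the lock, or NULL if every
 * successor had parked (the wake-up path would then take over). */
struct qnode *pass_to_spinning(struct qnode *holder)
{
    for (struct qnode *q = holder->next; q; q = q->next) {
        int expected = N_UW;
        /* CAS so a waiter racing into PW is never granted the lock. */
        if (atomic_compare_exchange_strong(&q->state, &expected, N_L))
            return q;
    }
    return NULL;
}
```

The CAS here is the mirror image of the waiter's UW→PW transition: exactly one side wins, so a waiter is either granted the lock or parks, never both.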

SLIDE 38

Implementation

  • Implemented in the Linux kernel
  • Structures modified
    – File system: inode
    – Memory management: mmap_sem
  • Please see our paper for
    – The read-write lock
    – Pseudo-code

https://github.com/sslab-gatech/cst-locks

SLIDE 39

Evaluation

  • Performance of the locks in terms of scalability and memory footprint?
  • Effectiveness of the blocking/parking strategy?
  • Setup: 8-socket, 120-core NUMA machine
SLIDE 40

Case study: Psearchy

  • Overcomes memory footprint and scheduling overhead
  • Uses 1.5–9.1X less memory than the Cohort lock
  • Improves throughput by 1.4–1.6X

[Figure: throughput (jobs/hour) and memory footprint (MB) vs #threads (20–120) for Vanilla, Cohort, and CST.]

SLIDE 41

Effective parking strategy

  • Better performance in both the under- and over-subscribed scenarios
  • Improves scalability by 1.3–3.7X

[Figure: M ops/sec vs #threads (1–256) for enumerating a directory (rwsem) and file creation (mutex), comparing Vanilla, Cohort, and CST.]

SLIDE 42

Conclusion

  • Two blocking synchronization primitives
    – NUMA-aware mutex and read-write semaphore
  • Dynamically allocated data structures
    – Resolve the NUMA-aware locks' footprint issue
  • Efficient spin-then-park strategy
    – Scheduling-aware parking/wake-up strategy
    – Mitigates scheduler interaction

Thank you!