Lock-Free Algorithms - Martin Thompson (@mjpt777), Mike Barker (@mikeb2701) - PowerPoint PPT Presentation

SLIDE 1

Martin Thompson - @mjpt777
Mike Barker - @mikeb2701

Lock-Free Algorithms

SLIDE 2

Modern Hardware

SLIDE 3

Modern Hardware (Intel Nehalem)

[Diagram: quad-core Nehalem memory hierarchy with approximate access latencies]

  • Registers/Buffers: <1ns
  • L1 (per core): ~4 cycles, ~1ns
  • L2 (per core): ~12 cycles, ~3ns
  • L3 (shared): ~45 cycles, ~15ns
  • QPI (between sockets): ~20ns
  • SDRAM (via memory controller, MC): ~65ns

SLIDE 4

SLIDE 5

Memory Ordering

[Diagram: per-core memory ordering structures]

  • Each core (Core 1 .. Core n): Registers, Execution Units, a Memory Order Buffer (MOB) containing separate Load and Store Buffers, Line-Fill/Write-Combining (LF/WC) Buffers, private L1 and L2
  • All cores share the L3 cache

SLIDE 6

Cache Structure & Coherence

[Diagram: per-core cache structure]

  • L0(I): 1.5k µops (decoded µop cache)
  • L1(D): 32K, L1(I): 32K
  • L2: 256K
  • L3: 8-20MB, shared
  • Data paths between levels: 128 bits (L1-L2), 128 bits, 256 bits (L3 ring)
  • 64-byte “cache-lines”, “8-way with write-back”
  • LF/WC Buffers, MC & QPI
  • MESI+F state model, built from SRAM

SLIDE 7

Main Memory

[Diagram: SDRAM organisation]

  • DRAM is accessed in 64-bit words over a channel/bus
  • A memory module is divided into banks (Bank 0 .. Bank n)
  • Access sequence: Bank Select, then RAS (row address) + CAS (column address)
  • Each bank holds its open row in a Row Buffer

SLIDE 8

Memory Models

SLIDE 9

Hardware Memory Models

Memory consistency models define which values a read may observe when multiple threads interact through shared memory.

  • Program Order (PO) for a single thread
  • Sequential Consistency (SC) [Lamport 1979]

> What you expect a program to do! (for race-free programs)

  • Strict Consistency (Linearizability)

> Some special instructions

  • Total Store Order (TSO)

> SPARC model that is weaker than SC

  • x86/64 is TSO + (Total Lock Order & Causal Consistency)

> http://www.youtube.com/watch?v=WUfvvFD5tAA

  • Other Processors have weaker models
SLIDE 10

Intel x86/64 Memory Model

http://www.multicoreinfo.com/research/papers/2008/damp08-intel64.pdf
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.html

  • 1. Loads are not reordered with other loads.
  • 2. Stores are not reordered with other stores.
  • 3. Stores are not reordered with older loads.
  • 4. Loads may be reordered with older stores to different locations, but not with older stores to the same location.

  • 5. In a multiprocessor system, memory ordering obeys causality (memory ordering respects transitive visibility).

  • 6. In a multiprocessor system, stores to the same location have a total order.
  • 7. In a multiprocessor system, locked instructions have a total order.
  • 8. Loads and stores are not reordered with locked instructions.
SLIDE 11

Language/Runtime Memory Models

Some languages/runtimes have a well-defined memory model for portability:

  • Java Memory Model (Java 5)
  • C++ 11
  • Erlang

For most other languages we are at the mercy of the compiler:

  • Instruction reordering
  • C “volatile” is inadequate
  • Register allocation for caching values
  • No mapping to the hardware memory model
  • Fences/Barriers need to be applied
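The Java Memory Model guarantee can be seen in a minimal safe-publication sketch (a hypothetical example, not from the slides): a write to a volatile field happens-before any subsequent read that observes it, so ordinary writes made before the volatile write become visible too.

```java
// Sketch of JMM safe publication: 'data' is a plain field, 'ready' is
// the volatile publication flag. Once the reader observes ready == true,
// the earlier plain write to 'data' is guaranteed visible.
public class SafePublication {
    static int data;                   // plain field
    static volatile boolean ready;     // publication flag

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!ready) { }         // spin until the flag is published
            System.out.println(data);  // guaranteed to print 42
        });
        reader.start();
        data = 42;                     // ordinary write...
        ready = true;                  // ...published by the volatile write
        reader.join();
    }
}
```

Without `volatile` on `ready`, the compiler and hardware would be free to reorder or cache the writes, and the reader could spin forever or observe a stale `data`.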
SLIDE 12

Measuring What Is Going On

SLIDE 13

Model Specific Registers (MSR)

  • Many and varied uses

> Timestamp Invariant Counter
> Memory Type Range Registers

  • Performance Counters!!!

> L2/L3 Cache Hits/Misses
> TLB Hits/Misses
> QPI Transfer Rates
> Instruction and Cycle Counts
> Lots of others....

SLIDE 14

Accessing MSRs

void rdmsr(uint32_t msr, uint32_t* lo, uint32_t* hi)
{
    asm volatile("rdmsr" : "=a"(*lo), "=d"(*hi) : "c"(msr));
}

void wrmsr(uint32_t msr, uint32_t lo, uint32_t hi)
{
    asm volatile("wrmsr" :: "c"(msr), "a"(lo), "d"(hi));
}

SLIDE 15

On Linux

RandomAccessFile f = new RandomAccessFile("/dev/cpu/0/msr", "rw");
FileChannel ch = f.getChannel();

ByteBuffer buffer = ByteBuffer.allocate(8);
buffer.order(ByteOrder.LITTLE_ENDIAN);

ch.read(buffer, msrNumber);
long value = buffer.getLong(0);

SLIDE 16

Contention Is The Enemy

SLIDE 17

Contention

  • Managing Contention

> Locks
> CAS Techniques

  • Little’s & Amdahl’s Laws

> L = λW
> Sequential Component Constraint

  • Single Writer Principle
  • Shared Nothing Designs
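The two laws above can be made concrete with a few lines of arithmetic (the request rate, wait time, and sequential fraction below are illustrative numbers, not from the talk):

```java
// Worked numbers for Little's Law and Amdahl's Law.
public class QueueingLaws {
    // Little's Law: average items in the system L = lambda * W
    static double littlesLaw(double arrivalRate, double waitTimeSeconds) {
        return arrivalRate * waitTimeSeconds;
    }

    // Amdahl's Law: speedup is capped by the sequential fraction s
    static double amdahl(double sequentialFraction, int processors) {
        return 1.0 / (sequentialFraction
                      + (1.0 - sequentialFraction) / processors);
    }

    public static void main(String[] args) {
        // 100,000 requests/sec, each waiting 0.5ms => ~50 in the system
        System.out.println(littlesLaw(100_000, 0.0005));
        // 5% sequential work caps speedup near 20x, however many cores
        System.out.println(amdahl(0.05, 1024));
    }
}
```

The second result is why the "Sequential Component Constraint" matters: a lock-protected critical section is exactly such a sequential component.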
SLIDE 18

Locks

SLIDE 19

Software Locks

  • Mutex, Semaphore, Critical Section, etc.

> What happens when un-contended?
> What happens when contention occurs?
> What if we need condition variables?
> What are the costs of software locks?
> Can they be optimised?
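As a sketch of the condition-variable question: in Java, ReentrantLock pairs with Condition the way pthread_mutex_t pairs with pthread_cond_t (class and method names below are illustrative, not from the slides):

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Minimal lock + condition-variable sketch: a consumer blocks until a
// producer publishes a value and signals.
public class ConditionExample {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition available = lock.newCondition();
    private int sequence = -1;

    void publish(int value) {
        lock.lock();
        try {
            sequence = value;
            available.signal();        // wake one waiting consumer
        } finally {
            lock.unlock();
        }
    }

    int awaitValue() throws InterruptedException {
        lock.lock();
        try {
            while (sequence < 0) {     // guard against spurious wake-ups
                available.await();
            }
            return sequence;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ConditionExample ex = new ConditionExample();
        Thread consumer = new Thread(() -> {
            try { System.out.println(ex.awaitValue()); }
            catch (InterruptedException ignored) { }
        });
        consumer.start();
        Thread.sleep(50);              // let the consumer block first
        ex.publish(7);
        consumer.join();
    }
}
```

Every publish/await round trip here pays for lock acquisition, a kernel-assisted wake-up, and the associated cache traffic, which is the cost the slide is asking about.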

SLIDE 20

Hardware Locks

  • Atomic Instructions

> Compare And Swap
> Lock instructions on x86

– LOCK XADD is a bit special

  • Used to update sequences and pointers
  • What are the costs of these operations?
  • Guess how software locks are created?
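A hedged Java illustration of fetch-and-add: on HotSpot/x86, AtomicLong.getAndIncrement() typically compiles down to LOCK XADD, which returns the old value and increments in one atomic step. That is what makes it a natural fit for the "sequences and pointers" use above.

```java
import java.util.concurrent.atomic.AtomicLong;

// Two threads claim sequence numbers concurrently via fetch-and-add;
// no increments are lost, unlike a plain 'counter++'.
public class SequenceExample {
    public static void main(String[] args) throws InterruptedException {
        AtomicLong sequence = new AtomicLong(0);
        Runnable claim = () -> {
            for (int i = 0; i < 100_000; i++) {
                sequence.getAndIncrement();   // atomic fetch-and-add
            }
        };
        Thread a = new Thread(claim);
        Thread b = new Thread(claim);
        a.start(); b.start();
        a.join(); b.join();
        System.out.println(sequence.get());   // always 200000
    }
}
```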
SLIDE 21

Let’s Look At A Lock-Free Algorithm

SLIDE 22

Single Producer – Single Consumer Queue

public final class ConcurrentArrayQueue<E> implements Queue<E>
{
    private final E[] ringBuffer;
    private volatile int addedCounter = 0;
    private volatile int removedCounter = 0;

    public ConcurrentArrayQueue(final int size)
    {
        ringBuffer = (E[])new Object[size];
    }

SLIDE 23

Single Producer – Single Consumer Queue

    public boolean offer(final E e)
    {
        if (addedCounter - removedCounter == ringBuffer.length)
        {
            return false;
        }

        ringBuffer[addedCounter % ringBuffer.length] = e;
        addedCounter++;

        return true;
    }

SLIDE 24

Single Producer – Single Consumer Queue

    public E poll()
    {
        if (addedCounter == removedCounter)
        {
            return null;
        }

        int removeIndex = removedCounter % ringBuffer.length;
        E element = ringBuffer[removeIndex];
        ringBuffer[removeIndex] = null;
        removedCounter++;

        return element;
    }
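Restating the two methods above as a self-contained class (dropping the rest of the Queue<E> interface, and renaming it to avoid clashing with the slide code) lets the single-threaded behaviour be checked directly:

```java
// Minimal restatement of the slide's SPSC queue to exercise its FIFO
// ordering and full/empty behaviour single-threaded.
public class SpscQueueDemo {
    static final class SpscQueue<E> {
        private final E[] ringBuffer;
        private volatile int addedCounter = 0;
        private volatile int removedCounter = 0;

        @SuppressWarnings("unchecked")
        SpscQueue(int size) { ringBuffer = (E[]) new Object[size]; }

        boolean offer(E e) {
            if (addedCounter - removedCounter == ringBuffer.length) {
                return false;                  // full
            }
            ringBuffer[addedCounter % ringBuffer.length] = e;
            addedCounter++;
            return true;
        }

        E poll() {
            if (addedCounter == removedCounter) {
                return null;                   // empty
            }
            int removeIndex = removedCounter % ringBuffer.length;
            E element = ringBuffer[removeIndex];
            ringBuffer[removeIndex] = null;    // let the slot be GC'd
            removedCounter++;
            return element;
        }
    }

    public static void main(String[] args) {
        SpscQueue<String> q = new SpscQueue<>(2);
        System.out.println(q.offer("a"));  // true
        System.out.println(q.offer("b"));  // true
        System.out.println(q.offer("c"));  // false: queue is full
        System.out.println(q.poll());      // a (FIFO order)
        System.out.println(q.offer("c"));  // true: a slot was freed
        System.out.println(q.poll());      // b
        System.out.println(q.poll());      // c
        System.out.println(q.poll());      // null: empty
    }
}
```

Note the counters only ever advance: `addedCounter - removedCounter` still yields the correct size after int wraparound, which is why the code never resets them.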

SLIDE 25

Let’s Apply Some “Mechanical Sympathy”

SLIDE 26

Mechanical Sympathy In Action

  • Power of 2 Queue Size
  • Padded counters to prevent false sharing
  • Avoiding lock instructions on volatile operations
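A quick check of the first point: when size is a power of two, the bit-mask (size - 1) computes the same ring index as the much slower % operator, which is what the queue on the next slide exploits.

```java
// For size = 2^k, (i & (size - 1)) == (i % size) for all non-negative i.
public class MaskDemo {
    public static void main(String[] args) {
        int size = 1024;                 // power of two
        int mask = size - 1;
        for (int i = 0; i < 5_000; i++) {
            if ((i & mask) != (i % size)) {
                throw new AssertionError("mismatch at " + i);
            }
        }
        System.out.println("index " + (1500 & mask)); // 1500 % 1024 = 476
    }
}
```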
SLIDE 27

Single Producer – Single Consumer Queue 2

public final class ConcurrentArrayQueue2<E> implements Queue<E>
{
    private final int maxSize;
    private final int mask;
    private final E[] ringBuffer;
    private final AtomicInteger addedCounter = new PaddedAtomicInteger(0);
    private final AtomicInteger removedCounter = new PaddedAtomicInteger(0);

    public ConcurrentArrayQueue2(final int size)
    {
        maxSize = findNextPowerOfTwo(size);
        mask = maxSize - 1;
        ringBuffer = (E[])new Object[maxSize];
    }
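The two helpers referenced above, findNextPowerOfTwo and PaddedAtomicInteger, are not shown on the slides. A plausible minimal version of each (the bodies are assumptions; the padding approach relies on HotSpot laying subclass fields out after the parent's):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical implementations of the helpers used by the slide code.
public class Helpers {
    // Round up to the next power of two, e.g. 10 -> 16, 16 -> 16.
    static int findNextPowerOfTwo(int size) {
        return 1 << (32 - Integer.numberOfLeadingZeros(size - 1));
    }

    // Pad the counter out towards its own 64-byte cache line so the
    // producer and consumer counters cannot falsely share a line.
    static class PaddedAtomicInteger extends AtomicInteger {
        @SuppressWarnings("unused")
        private long p1, p2, p3, p4, p5, p6, p7; // 56 bytes of padding

        PaddedAtomicInteger(int initialValue) { super(initialValue); }
    }

    public static void main(String[] args) {
        System.out.println(findNextPowerOfTwo(10)); // 16
        System.out.println(findNextPowerOfTwo(16)); // 16
        PaddedAtomicInteger counter = new PaddedAtomicInteger(0);
        counter.lazySet(counter.get() + 1);
        System.out.println(counter.get());          // 1
    }
}
```

On modern JVMs the @Contended annotation achieves the same isolation without hand-written padding fields.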

SLIDE 28

Single Producer – Single Consumer Queue 2

    public boolean offer(final E e)
    {
        int added = addedCounter.get();
        if (added - removedCounter.get() == maxSize)
        {
            return false;
        }

        ringBuffer[added & mask] = e;
        addedCounter.lazySet(added + 1);

        return true;
    }

SLIDE 29

Single Producer – Single Consumer Queue 2

    public E poll()
    {
        int removed = removedCounter.get();
        if (addedCounter.get() == removed)
        {
            return null;
        }

        int removeIndex = removed & mask;
        E element = ringBuffer[removeIndex];
        ringBuffer[removeIndex] = null;
        removedCounter.lazySet(removed + 1);

        return element;
    }

SLIDE 30

Concurrent Queue Performance Results

                        Ops/Sec (Millions)   Mean Latency (ns)
LinkedBlockingQueue             5            ~32,000 / ~500
ArrayBlockingQueue              6            ~32,000 / ~600
ConcurrentLinkedQueue          15            NA / ~180
ConcurrentArrayQueue           15            NA / ~120
ConcurrentArrayQueue2          65            NA / ~120

Note: None of these tests are run with thread affinity set.
Latency: Blocking - put() & take() / Non-Blocking - offer() & poll()

SLIDE 31

False Sharing

SLIDE 32

[Diagram: two threads writing adjacent words]

  • Unpadded: *address1 (thread a) and *address2 (thread b) sit on the same 64-byte cache line, so every write by one thread invalidates the line in the other core
  • Padded: the two addresses sit on separate cache lines, so the threads proceed independently

SLIDE 33

int64_t* address = seq->address;

for (int i = 0; i < ITERATIONS; i++)
{
    int64_t value = *address;
    value += i;
    *address = value;

    asm volatile("lock addl $0x0,(%rsp)");
}

False Sharing Test Results

SLIDE 34

False Sharing Test Results

                  Unpadded    Padded
Million Ops/sec      12.4      104.9
L2 Hit Ratio         1.16%     23.05%
L3 Hit Ratio         2.51%     39.18%
Instructions       4559 M     4508 M
CPU Cycles        63480 M     7551 M
Ins/Cycle Ratio      0.07       0.60

SLIDE 35

Signalling

SLIDE 36

// Lock
pthread_mutex_lock(&lock);
sequence = i;
pthread_cond_signal(&condition);
pthread_mutex_unlock(&lock);

// Soft Barrier
asm volatile("" ::: "memory");
sequence = i;

// Fence
asm volatile("" ::: "memory");
sequence = i;
asm volatile("lock addl $0x0,(%rsp)");

SLIDE 37

Signalling Costs

                  Lock      Fence     Soft
Million Ops/Sec     9.4      45.7    108.1
L2 Hit Ratio       17.26     28.17    13.32
L3 Hit Ratio        0.78     29.60    27.99
Instructions    12846 M     906 M    801 M
CPU Cycles      28278 M    5808 M   1475 M
Ins/Cycle           0.45      0.16     0.54

SLIDE 38

How Far Can We Go With Lock Free Algorithms?

SLIDE 39

Further Adventures With Lock-Free Algorithms

  • State Machines
  • CAS operations
  • Wait-Free in addition to Lock-Free algorithms
  • Thread Affinity
  • Busy spinning and back-off on x86
SLIDE 40

Questions?

Blog (Martin): http://mechanical-sympathy.blogspot.com/
Blog (Mike): http://bad-concurrency.blogspot.com/
Code: http://github.com/mikeb01/nonblock
Twitter: @mjpt777, @mikeb2701

“The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry.”

  • Henry Petroski