spcl.inf.ethz.ch @spcl_eth
MACIEJ BESTA, HERMANN SCHWEIZER, TORSTEN HOEFLER
Evaluating the Cost of Atomic Operations on Modern Architectures
LARGE-SCALE IRREGULAR GRAPH PROCESSING
[1] A. Lumsdaine et al. Challenges in Parallel Graph Processing. Parallel Processing Letters. 2007.
Hardware; Topologies; OS, middleware; Algorithms; Programming models
PACT’15, SC’14, ICS’15, HPDC’15, HPDC’14, SC’13, HPDC’16
LOCKS
[Figure: an example graph; Proc p, Proc q]
Intuitive semantics
Serialization
Possibly complex protocols
High performance distributed locks?
ATOMIC OPERATIONS
[Figure: Proc p, Proc q]
High performance
Very common, a true hardware mechanism
Complex protocols
Subtle issues (ABA problem, ...)
Do we really understand their performance?
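The ABA problem named above can be shown with a minimal single-threaded sketch that replays one interleaving by hand. The function names are hypothetical, and the version counter in the second function is one common mitigation, not something the slides prescribe.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Replays the interleaving "reader sees A, writer does A -> B -> A, reader CASes".
// Returns true if the CAS succeeds, i.e. the intermediate change goes unnoticed.
bool aba_goes_unnoticed() {
    std::atomic<int> slot{'A'};
    int seen = slot.load();   // "thread 1" reads A and is then delayed
    assert(seen == 'A');
    slot.store('B');          // "thread 2" changes A -> B ...
    slot.store('A');          // ... and back to A
    // "thread 1" resumes: CAS compares values only, so it cannot tell.
    return slot.compare_exchange_strong(seen, 'C');
}

// Same interleaving, but every write also bumps a version counter
// stored in the upper 32 bits (assumed mitigation, for illustration only).
bool version_tag_detects_aba() {
    auto pack = [](std::uint32_t value, std::uint32_t version) {
        return (std::uint64_t{version} << 32) | value;
    };
    std::atomic<std::uint64_t> slot{pack('A', 0)};
    std::uint64_t seen = slot.load();
    slot.store(pack('B', 1));
    slot.store(pack('A', 2));
    // The value matches but the version does not, so the CAS fails.
    return !slot.compare_exchange_strong(seen, pack('C', 3));
}
```

Both functions return true: the plain CAS is fooled, the tagged one is not.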
[PPoPP’14] Fast Concurrent Lock-Free Binary Search Trees
[PPoPP’14] A General Technique for Non-blocking Trees
[PPoPP’14] Practical Concurrent Binary Search Trees via Logical Ordering
[PPoPP’14] A Practical Wait-Free Simulation for Lock-Free Data Structures
[PPoPP’15] [PPoPP’15] [SPAA’15] [SPAA’16]
Cache level? Locality?
Architecture
Atomic?
Performance metrics?
Cache coherence state?
Contention?
Operand size?
Alignment?
Mechanisms?
Atomic?
Compare-and-Swap (CAS)
Fetch-and-Add (FAA)
Swap (SWP)
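For reference, the three atomics map directly onto `std::atomic` member functions in C++. The sketch below only illustrates their semantics. The struct and function names are hypothetical, and the instruction each call compiles to depends on the compiler and target.

```cpp
#include <atomic>
#include <cassert>

struct AtomicsDemo {
    int faa_old;       // value returned by Fetch-and-Add
    int swp_old;       // value returned by Swap
    bool cas_hit;      // CAS with a matching expected value
    bool cas_miss;     // CAS with a stale expected value
    int cas_observed;  // what the failed CAS reports as the current value
    int final_value;
};

AtomicsDemo run_atomics_demo() {
    std::atomic<int> x{5};
    AtomicsDemo r{};

    r.faa_old = x.fetch_add(3);   // FAA: x becomes 8, old value 5 is returned
    r.swp_old = x.exchange(10);   // SWP: x becomes 10, old value 8 is returned

    int expected = 10;
    r.cas_hit = x.compare_exchange_strong(expected, 42);   // x == 10, so x becomes 42

    expected = 10;                                         // stale expectation
    r.cas_miss = x.compare_exchange_strong(expected, 7);   // x == 42, so nothing is written
    r.cas_observed = expected;                             // failed CAS stores the current value here

    r.final_value = x.load();
    assert(r.final_value == 42);
    return r;
}
```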
Performance metrics?
Latency
Bandwidth
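A crude way to see the difference between the two metrics is to time a chain of dependent atomics (each address depends on the previous result, which approximates latency) against a stream of independent ones (which approximates bandwidth). This is only a single-threaded sketch under assumptions that are not from the slides: 64-byte cache lines, 64 padded counters, and no control over thread pinning or the initial cache coherence state. `measure_faa` and the struct names are hypothetical.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>

struct alignas(64) PaddedCounter {        // one counter per (assumed) 64-byte cache line
    std::atomic<std::size_t> value{0};
};

struct Measurement {
    std::size_t operations;   // atomics issued
    std::size_t counted;      // sum of all counters afterwards (sanity check)
    double nanoseconds;       // elapsed wall-clock time
    double ns_per_op() const { return nanoseconds / static_cast<double>(operations); }
    double ops_per_second() const { return static_cast<double>(operations) / (nanoseconds * 1e-9); }
};

Measurement measure_faa(std::size_t operations, bool dependent) {
    assert(operations > 0);
    std::array<PaddedCounter, 64> slots;
    std::size_t index = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < operations; ++i) {
        const std::size_t old = slots[index].value.fetch_add(1, std::memory_order_relaxed);
        // dependent: the next address is derived from the value just returned
        // independent: the next address is known in advance
        index = dependent ? (index + old + 1) % slots.size() : (i + 1) % slots.size();
    }
    const auto stop = std::chrono::steady_clock::now();
    std::size_t counted = 0;
    for (const auto& slot : slots) counted += slot.value.load();
    return {operations, counted,
            std::chrono::duration<double, std::nano>(stop - start).count()};
}
```

`measure_faa(n, true).ns_per_op()` gives a latency-style number and `measure_faa(n, false).ops_per_second()` a bandwidth-style one. Absolute values depend entirely on the machine.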
Cache coherence state?
Modified: in one cache and dirty
Exclusive: in one cache and clean
Shared: in >1 cache and clean
Invalid: garbage data
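The four states can be written down as a small model of one cache line as seen by several caches. The sketch below follows textbook MESI, not any specific processor: a write by one cache leaves that copy Modified and every other copy Invalid. All names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

enum class State { Modified, Exclusive, Shared, Invalid };

// Number of other caches that hold a valid copy and must therefore be
// invalidated before `writer` may modify the line.
template <std::size_t N>
std::size_t copies_to_invalidate(const std::array<State, N>& line, std::size_t writer) {
    assert(writer < N);
    std::size_t count = 0;
    for (std::size_t i = 0; i < N; ++i)
        if (i != writer && line[i] != State::Invalid) ++count;
    return count;
}

// Per-cache states of the line after `writer` has obtained ownership and written.
template <std::size_t N>
std::array<State, N> write_by(std::array<State, N> line, std::size_t writer) {
    assert(writer < N);
    for (std::size_t i = 0; i < N; ++i)
        line[i] = (i == writer) ? State::Modified : State::Invalid;
    return line;
}
```

A line that starts as Exclusive or Modified in the writer's cache has no copies to invalidate. A Shared line has at least one.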
Architecture
EXCLUSIVE OR MODIFIED STATE
[Figure: Core, Cache, Cache, Cache; Cache line, Cache coherence state]
Atomic: Read for ownership, then Execute
Read for ownership = max(read latency, invalidation latency)
Execute = constant
In the Exclusive or Modified state: Read for ownership = read latency
[Plot legend: mean of data, predictions]
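Written out as a worked equation, the model for this case reads as follows. Two points are assumptions that the slides do not state explicitly: the two phases are taken to add up, and the invalidation term is taken to vanish because a line in the Exclusive or Modified state has no copies in other caches.

```latex
\begin{align*}
L_{\text{atomic}} &= L_{\text{RFO}} + L_{\text{exec}}
  && \text{(assumed: the two phases add up)} \\
L_{\text{RFO}} &= \max\bigl(L_{\text{read}},\, L_{\text{inval}}\bigr),
  \qquad L_{\text{exec}} = \text{constant} \\
\text{E or M state:}\quad L_{\text{inval}} &= 0
  \;\Rightarrow\; L_{\text{RFO}} = L_{\text{read}}
  \;\Rightarrow\; L_{\text{atomic}} = L_{\text{read}} + L_{\text{exec}}
\end{align*}
```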
HASWELL, EXCLUSIVE
BULLDOZER, EXCLUSIVE
HASWELL, EXCLUSIVE: Alignment?
BULLDOZER, EXCLUSIVE: Operand size?
HASWELL, ATOMICS
PERFORMANCE INSIGHTS
Different atomics have the same latency in most scenarios
CAS is the fastest in some cases
Small operand sizes give the best performance
Unaligned atomics should be avoided at all costs
No parallel execution (low bandwidth), even when there are no data dependencies
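One way to act on the alignment insight is to make sure an atomic operand never crosses a cache-line boundary. The sketch below assumes 64-byte cache lines and reads "unaligned" as "spanning two lines". Both are assumptions, and the helper and type names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;   // assumed cache-line size in bytes

// True if an operand of `size` bytes starting at `address` spans two cache lines.
constexpr bool straddles_cache_line(std::uintptr_t address, std::size_t size) {
    return size != 0 && (address / kCacheLine) != ((address + size - 1) / kCacheLine);
}

// Forcing cache-line alignment keeps the operand within a single line
// and also keeps unrelated data off that line.
struct alignas(kCacheLine) AlignedCounter {
    std::atomic<std::uint64_t> value{0};
};

static_assert(alignof(AlignedCounter) == kCacheLine,
              "AlignedCounter must start on a cache-line boundary");
```

An 8-byte operand at offset 56 still fits in one line. The same operand at offset 60 does not.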
HASWELL, EXCLUSIVE
HASWELL, MODIFIED
IVY BRIDGE, EXCLUSIVE
XEON PHI, MODIFIED / EXCLUSIVE
SHARED STATE
[Figure: Core, Cache, Cache, Cache; Cache line, Cache coherence state]
Atomic: Read for ownership, then Execute
Read for ownership = max(read latency, invalidation latency)
Execute = constant
In the Shared state: Read for ownership = invalidation latency
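The state-dependent model from the slides can be sketched as a small function. The additive composition of the two phases is an assumption. The type and function names are hypothetical, and any numbers passed in are placeholders, not measurements from the talk.

```cpp
#include <cassert>

enum class LineState { Modified, Exclusive, Shared };

struct Latencies {
    double read_ns;          // latency of reading the cache line
    double invalidation_ns;  // latency of invalidating copies in other caches
    double execute_ns;       // constant cost of executing the atomic itself
};

// Read-for-ownership latency as given on the slides:
// Exclusive/Modified -> read latency, Shared -> invalidation latency.
double read_for_ownership_ns(LineState state, const Latencies& l) {
    if (state == LineState::Shared) return l.invalidation_ns;
    return l.read_ns;
}

// Assumed composition: read for ownership followed by the constant execute phase.
double atomic_latency_ns(LineState state, const Latencies& l) {
    assert(l.read_ns >= 0 && l.invalidation_ns >= 0 && l.execute_ns >= 0);
    return read_for_ownership_ns(state, l) + l.execute_ns;
}
```

With placeholder inputs of 10 ns read, 30 ns invalidation and 5 ns execute, the sketch yields 15 ns for Exclusive or Modified and 35 ns for Shared.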
HASWELL, SHARED
Multicore SMP Systems. MICRO’09