Evaluating the Cost of Atomic Operations on Modern Architectures (PowerPoint presentation)

SLIDE 1

spcl.inf.ethz.ch @spcl_eth

MACIEJ BESTA, HERMANN SCHWEIZER, TORSTEN HOEFLER

Evaluating the Cost of Atomic Operations on Modern Architectures

SLIDE 6

LARGE-SCALE IRREGULAR GRAPH PROCESSING

[1] A. Lumsdaine et al. Challenges in Parallel Graph Processing. Parallel Processing Letters. 2007.

SLIDE 15

Computing abstractions

Layers: Hardware; Topologies; OS, middleware; Algorithms; Programming models

Venues: PACT’15, SC’14, ICS’15, HPDC’15, HPDC’14, SC’13, HPDC’16

A BRIEF SUMMARY OF RESEARCH I DO IN HPC

Most of these layers require efficient synchronization

SLIDE 27

SYNCHRONIZATION MECHANISMS

LOCKS

An example graph (Proc p, Proc q)

Intuitive semantics

Possibly complex protocols

Serialization

High performance distributed locks?

SLIDE 35

SYNCHRONIZATION MECHANISMS

ATOMIC OPERATIONS

Proc p, Proc q

High performance

Very common, truly a hardware mechanism

Complex access patterns

Complex protocols

Subtle issues (ABA problem, ...)

Do we really understand their performance?
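The ABA problem named on this slide can be shown in a few lines. The sketch below is single-threaded and deterministic on purpose: the "interference" is simulated by plain stores. The function names are hypothetical, and the version counter is one common mitigation, not something the slides propose.

```cpp
#include <atomic>
#include <cstdint>

// Plain CAS compares values only, so it cannot tell that A was
// replaced by B and then restored to A in the meantime.
bool plain_cas_after_aba() {
    std::atomic<std::uint32_t> slot{1};            // value "A"
    std::uint32_t seen = slot.load();              // observer reads A
    slot.store(2);                                 // simulated interference: A -> B
    slot.store(1);                                 // simulated interference: B -> A
    return slot.compare_exchange_strong(seen, 3);  // succeeds despite the changes
}

// Pack a 32-bit value with a 32-bit version that is bumped on every update.
std::uint64_t pack(std::uint32_t value, std::uint32_t version) {
    return (static_cast<std::uint64_t>(version) << 32) | value;
}

// With a version tag the restored A differs from the A that was read.
bool versioned_cas_after_aba() {
    std::atomic<std::uint64_t> slot{pack(1, 0)};
    std::uint64_t seen = slot.load();              // observer reads (A, version 0)
    slot.store(pack(2, 1));                        // A -> B
    slot.store(pack(1, 2));                        // B -> A, but version is now 2
    return slot.compare_exchange_strong(seen, pack(3, 3));  // fails: version changed
}
```

Here plain_cas_after_aba() returns true and versioned_cas_after_aba() returns false. The tag widens the operand, which ties into the operand-size question raised later in the talk.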

SLIDE 40

ATOMICS: POPULARITY

[PPoPP’14] Fast Concurrent Lock-Free Binary Search Trees
[PPoPP’14] A General Technique for Non-blocking Trees
[PPoPP’14] Practical Concurrent Binary Search Trees via Logical Ordering
[PPoPP’14] A Practical Wait-Free Simulation for Lock-Free Data Structures
[PPoPP’15] [PPoPP’15] [SPAA’15] [SPAA’16]

Used in so many designs... But do we really know their raw performance?

Raw performance... Of what?

SLIDE 44

ATOMICS: PERFORMANCE DIMENSIONS

Cache level? Locality?

SLIDE 53

ATOMICS: PERFORMANCE DIMENSIONS

Cache coherence state? Atomic? Contention? Operand size? Performance metrics? Alignment? Mechanisms? Architecture

SLIDE 57

ATOMICS: PERFORMANCE DIMENSIONS

Atomic? Compare-and-Swap (CAS), Fetch-and-Add (FAA), Swap (SWP)
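The three atomics named on this slide map directly onto std::atomic member functions in C++. The sketch below is illustrative only: the wrapper names cas, faa and swp are hypothetical, and the default sequentially consistent ordering is an assumption, since the slides do not discuss memory ordering.

```cpp
#include <atomic>
#include <cstdint>

// Compare-and-Swap: store `desired` only if the current value equals
// `expected`. Returns true if the store happened.
bool cas(std::atomic<std::uint64_t>& target,
         std::uint64_t expected, std::uint64_t desired) {
    return target.compare_exchange_strong(expected, desired);
}

// Fetch-and-Add: add `delta` and return the value held before the addition.
std::uint64_t faa(std::atomic<std::uint64_t>& target, std::uint64_t delta) {
    return target.fetch_add(delta);
}

// Swap: store `desired` unconditionally and return the previous value.
std::uint64_t swp(std::atomic<std::uint64_t>& target, std::uint64_t desired) {
    return target.exchange(desired);
}
```

Starting from a value of 5, cas(x, 5, 7) succeeds and leaves 7, faa(x, 3) then returns 7 and leaves 10, and swp(x, 1) returns 10 and leaves 1.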

SLIDE 60

ATOMICS: PERFORMANCE DIMENSIONS

Performance metrics? Latency, Bandwidth

SLIDE 65

ATOMICS: PERFORMANCE DIMENSIONS

Cache coherence state?
Modified: in one cache and dirty
Exclusive: in one cache and clean
Shared: in >1 cache and clean
Invalid: garbage data

SLIDE 66

ATOMICS: PERFORMANCE DIMENSIONS

Architecture

SLIDE 74

RESEARCH QUESTIONS

How do we model the performance of atomics?

What is the performance difference between various atomics?

What is the influence of various parameters and mechanisms?

SLIDE 86

LATENCY MODEL

[Diagram: Core; Cache, Cache, …, Cache; Cache line]

Read for ownership = max(read latency, invalidation latency); depends on the cache coherence state

Execute = constant; depends on the atomic
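The two terms of the model can be written down as a small function. This is only a sketch: the slide names the two stages but does not show how they combine, so adding them is an assumption made here for illustration. The struct and function names are hypothetical, and any numbers passed in are placeholders, not measurements from the talk.

```cpp
#include <algorithm>

// Hypothetical model inputs, all in the same time unit (for example ns).
struct ModelInputs {
    double read_latency;          // time to fetch the cache line
    double invalidation_latency;  // time to invalidate other copies
    double execute_latency;       // constant cost of executing the atomic
};

// Read for ownership = max(read latency, invalidation latency).
double read_for_ownership(const ModelInputs& in) {
    return std::max(in.read_latency, in.invalidation_latency);
}

// Assumption: total latency is the read-for-ownership term plus the
// constant execute term.
double atomic_latency(const ModelInputs& in) {
    return read_for_ownership(in) + in.execute_latency;
}
```

With placeholder inputs of 10, 25 and 5, the read-for-ownership term is 25 and the modelled latency is 30. The max reflects that the read and the invalidation can overlap, so only the slower of the two is on the critical path.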

SLIDE 89

LATENCY MODEL

EXCLUSIVE OR MODIFIED STATE

[Diagram: Core; Cache, Cache, …, Cache; Cache line]

Read for ownership = read latency; depends on the cache coherence state

Execute = constant; depends on the atomic

[Plot legend: predictions, observed data, mean of observed data]

SLIDE 90

LATENCY

HASWELL, EXCLUSIVE

SLIDE 91

CAS FAA

LATENCY

BULLDOZER, EXCLUSIVE

SLIDE 92

LATENCY

HASWELL, EXCLUSIVE

Alignment?

SLIDE 93

64 bit 128 bit

LATENCY

BULLDOZER, EXCLUSIVE

Operand size?

SLIDE 94

BANDWIDTH

HASWELL, ATOMICS

SLIDE 100

CONCLUSIONS

PERFORMANCE INSIGHTS

Different atomics have the same latency in most scenarios
CAS is the fastest for some cases
Small operand sizes give best performance
Unaligned atomics should be avoided at all costs
No parallel execution (low bandwidth) even if there are no data dependencies
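The advice on alignment can be enforced at compile time. The sketch below assumes a 64-byte cache line, which is common on x86 but is not stated in the slides. PaddedCounter and fits_in_one_line are hypothetical names introduced for illustration.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size in bytes (not taken from the slides).
constexpr std::size_t kCacheLine = 64;

// One atomic counter per cache line: it starts on a line boundary,
// so the 8-byte operand can never straddle two lines.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

static_assert(alignof(PaddedCounter) == kCacheLine,
              "counter must start on a cache-line boundary");
static_assert(sizeof(PaddedCounter) == kCacheLine,
              "counter must occupy exactly one cache line");

// True if an object of `size` bytes at address `addr` lies entirely
// within a single cache line.
bool fits_in_one_line(std::uintptr_t addr, std::size_t size) {
    return addr / kCacheLine == (addr + size - 1) / kCacheLine;
}
```

For example, an 8-byte operand at offset 56 fits in one line, while the same operand at offset 60 spans two lines and would be an unaligned atomic of the kind the slide warns against.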

SLIDE 101

Atomics Read

LATENCY

HASWELL, EXCLUSIVE

SLIDE 102

Atomics Read

LATENCY

HASWELL, MODIFIED

SLIDE 103

LATENCY

IVY BRIDGE, EXCLUSIVE

CAS

SLIDE 105

LATENCY

XEON PHI, MODIFIED / EXCLUSIVE

CAS

SLIDE 108

LATENCY MODEL

SHARED STATE

[Diagram: Core; Cache, Cache, …, Cache; Cache line]

Read for ownership = invalidation latency; depends on the cache coherence state

Execute = constant; depends on the atomic

SLIDE 109

Atomics Read

LATENCY

HASWELL, SHARED

SLIDE 110

How to force cache coherence state

F(M): Write cache line (invalidates all copies)
F(E): F(M) → flush → read
F(S): F(E) → read by some other core

D. Hackenberg, D. Molka, W. Nagel. Comparing Cache Architectures and Coherency Protocols on x86-64 Multicore SMP Systems. MICRO’09.
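The F(M), F(E) and F(S) recipes can be traced on a toy software model of a single cache line. This is a sketch of textbook MESI transitions only, not the benchmark code used in the talk. ToyLine and the force_* functions are hypothetical names, and real processors have additional states and details that the model ignores.

```cpp
#include <cstddef>
#include <vector>

enum class State { Modified, Exclusive, Shared, Invalid };

// Toy model: the state of one cache line in each core's private cache.
class ToyLine {
public:
    explicit ToyLine(std::size_t cores) : state_(cores, State::Invalid) {}

    // A write leaves the writer Modified and invalidates all other copies.
    void write(std::size_t core) {
        for (auto& s : state_) s = State::Invalid;
        state_[core] = State::Modified;
    }

    // A flush drops every cached copy of the line.
    void flush() {
        for (auto& s : state_) s = State::Invalid;
    }

    // A read miss yields Exclusive if no other cache holds the line;
    // otherwise every holder, including the reader, becomes Shared.
    void read(std::size_t core) {
        if (state_[core] != State::Invalid) return;  // hit: state unchanged
        bool others = false;
        for (std::size_t i = 0; i < state_.size(); ++i) {
            if (i != core && state_[i] != State::Invalid) {
                state_[i] = State::Shared;
                others = true;
            }
        }
        state_[core] = others ? State::Shared : State::Exclusive;
    }

    State state(std::size_t core) const { return state_[core]; }

private:
    std::vector<State> state_;
};

// F(M): write the cache line.
void force_modified(ToyLine& line, std::size_t core) { line.write(core); }

// F(E): F(M), then flush, then read.
void force_exclusive(ToyLine& line, std::size_t core) {
    force_modified(line, core);
    line.flush();
    line.read(core);
}

// F(S): F(E), then a read by some other core.
void force_shared(ToyLine& line, std::size_t core, std::size_t other) {
    force_exclusive(line, core);
    line.read(other);
}
```

On a two-core ToyLine, force_modified(line, 0) leaves core 0 Modified and core 1 Invalid, force_exclusive(line, 0) leaves core 0 Exclusive, and force_shared(line, 0, 1) leaves both cores Shared.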