Evaluating the Cost of Atomic Operations on Modern Architectures



slide-1
SLIDE 1

spcl.inf.ethz.ch @spcl_eth

MACIEJ BESTA, HERMANN SCHWEIZER, TORSTEN HOEFLER

Evaluating the Cost of Atomic Operations on Modern Architectures
slide-2
SLIDE 2

spcl.inf.ethz.ch @spcl_eth

LARGE-SCALE IRREGULAR GRAPH PROCESSING

slide-6
SLIDE 6

spcl.inf.ethz.ch @spcl_eth

LARGE-SCALE IRREGULAR GRAPH PROCESSING

[1] A. Lumsdaine et al. Challenges in Parallel Graph Processing. Parallel Processing Letters. 2007.

slide-7
SLIDE 7

spcl.inf.ethz.ch @spcl_eth

A BRIEF SUMMARY OF RESEARCH I DO IN HPC

slide-8
SLIDE 8

spcl.inf.ethz.ch @spcl_eth

Computing abstractions

A BRIEF SUMMARY OF RESEARCH I DO IN HPC

slide-9
SLIDE 9

spcl.inf.ethz.ch @spcl_eth

Computing abstractions

Hardware

PACT’15 HPDC’15 HPDC’14 HPDC’16

A BRIEF SUMMARY OF RESEARCH I DO IN HPC

slide-10
SLIDE 10

spcl.inf.ethz.ch @spcl_eth

Computing abstractions

Hardware Topologies

PACT’15 SC14 HPDC’15 HPDC’14 HPDC’16

A BRIEF SUMMARY OF RESEARCH I DO IN HPC

slide-11
SLIDE 11

spcl.inf.ethz.ch @spcl_eth

Computing abstractions

Hardware Topologies OS, middleware

PACT’15 SC14 ICS’15 HPDC’15 HPDC’14 HPDC’16

A BRIEF SUMMARY OF RESEARCH I DO IN HPC

slide-12
SLIDE 12

spcl.inf.ethz.ch @spcl_eth

Computing abstractions

Hardware Topologies OS, middleware Algorithms

PACT’15 SC14 ICS’15 HPDC’15 HPDC’14 SC13 HPDC’16

A BRIEF SUMMARY OF RESEARCH I DO IN HPC

slide-13
SLIDE 13

spcl.inf.ethz.ch @spcl_eth

Computing abstractions

Hardware Topologies OS, middleware Algorithms Programming models

PACT’15 SC14 ICS’15 HPDC’15 HPDC’14 SC13 HPDC’16

A BRIEF SUMMARY OF RESEARCH I DO IN HPC

slide-15
SLIDE 15

spcl.inf.ethz.ch @spcl_eth

Computing abstractions

Hardware Topologies OS, middleware Algorithms Programming models

PACT’15 SC14 ICS’15 HPDC’15 HPDC’14 SC13 HPDC’16

A BRIEF SUMMARY OF RESEARCH I DO IN HPC

Most of these layers require efficient synchronization

slide-16
SLIDE 16

spcl.inf.ethz.ch @spcl_eth

SYNCHRONIZATION MECHANISMS

LOCKS

slide-17
SLIDE 17

spcl.inf.ethz.ch @spcl_eth

Proc q Proc p

SYNCHRONIZATION MECHANISMS

LOCKS

slide-18
SLIDE 18

spcl.inf.ethz.ch @spcl_eth

Proc q Proc p

SYNCHRONIZATION MECHANISMS

LOCKS

An example graph

slide-23
SLIDE 23

spcl.inf.ethz.ch @spcl_eth

Proc q Proc p

Intuitive semantics

SYNCHRONIZATION MECHANISMS

LOCKS

An example graph

slide-25
SLIDE 25

spcl.inf.ethz.ch @spcl_eth

Proc q Proc p

Intuitive semantics

SYNCHRONIZATION MECHANISMS

LOCKS

Serialization An example graph

slide-26
SLIDE 26

spcl.inf.ethz.ch @spcl_eth

Proc q Proc p

Possibly complex protocols Intuitive semantics

SYNCHRONIZATION MECHANISMS

LOCKS

Serialization An example graph

slide-27
SLIDE 27

spcl.inf.ethz.ch @spcl_eth

Proc q Proc p

Possibly complex protocols Intuitive semantics

SYNCHRONIZATION MECHANISMS

LOCKS

Serialization An example graph High performance distributed locks?

slide-28
SLIDE 28

spcl.inf.ethz.ch @spcl_eth

Proc q Proc p

SYNCHRONIZATION MECHANISMS

ATOMIC OPERATIONS

slide-29
SLIDE 29

spcl.inf.ethz.ch @spcl_eth

Proc q Proc p

SYNCHRONIZATION MECHANISMS

ATOMIC OPERATIONS

Complex access patterns 

slide-30
SLIDE 30

spcl.inf.ethz.ch @spcl_eth

Proc q Proc p

High performance

SYNCHRONIZATION MECHANISMS

ATOMIC OPERATIONS

Complex access patterns 

slide-31
SLIDE 31

spcl.inf.ethz.ch @spcl_eth

Proc q Proc p

High performance

SYNCHRONIZATION MECHANISMS

ATOMIC OPERATIONS

Complex access patterns 

Very common, truly hardware mechanism

slide-33
SLIDE 33

spcl.inf.ethz.ch @spcl_eth

Proc q Proc p

Complex protocols High performance

SYNCHRONIZATION MECHANISMS

ATOMIC OPERATIONS

Complex access patterns 

Very common, truly hardware mechanism

slide-34
SLIDE 34

spcl.inf.ethz.ch @spcl_eth

Proc q Proc p

Complex protocols High performance

SYNCHRONIZATION MECHANISMS

ATOMIC OPERATIONS

Subtle issues (ABA problem, ...)

Complex access patterns 

Very common, truly hardware mechanism
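The ABA problem named on this slide can be illustrated with a minimal single-threaded sketch. It is an illustration only: the function name `aba_goes_unnoticed` is hypothetical, and the interleaving of two threads is simulated by plain sequential stores rather than real concurrency.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: returns true if a CAS that expects `a` still succeeds
// after the shared value has gone a -> b -> a behind the caller's back.
inline bool aba_goes_unnoticed(int a, int b) {
    std::atomic<int> shared{a};

    int expected = shared.load();  // "thread 1" reads A and is then delayed
    shared.store(b);               // "thread 2" changes A to B ...
    shared.store(a);               // ... and then back to A

    // "Thread 1" resumes. CAS compares values only, so it cannot tell that
    // the location was modified twice in the meantime, and it succeeds.
    return shared.compare_exchange_strong(expected, b);
}
```

In a pointer-based lock-free structure the same effect can let a CAS succeed on a node that was freed and reallocated, which is why such designs often pair the pointer with a version counter.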

slide-35
SLIDE 35

spcl.inf.ethz.ch @spcl_eth

Proc q Proc p

Complex protocols High performance

SYNCHRONIZATION MECHANISMS

ATOMIC OPERATIONS

Subtle issues (ABA problem, ...)

Complex access patterns 

Do we really understand their performance?

Very common, truly hardware mechanism

slide-36
SLIDE 36

spcl.inf.ethz.ch @spcl_eth

ATOMICS: POPULARITY

slide-38
SLIDE 38

spcl.inf.ethz.ch @spcl_eth

ATOMICS: POPULARITY

[PPoPP’14] Fast Concurrent Lock-Free Binary Search Trees
[PPoPP’14] A General Technique for Non-blocking Trees
[PPoPP’14] Practical Concurrent Binary Search Trees via Logical Ordering
[PPoPP’14] A Practical Wait-Free Simulation for Lock-Free Data Structures
[PPoPP’15] [PPoPP’15] [SPAA’15] [SPAA’16]

slide-39
SLIDE 39

spcl.inf.ethz.ch @spcl_eth

ATOMICS: POPULARITY

[PPoPP’14] Fast Concurrent Lock-Free Binary Search Trees
[PPoPP’14] A General Technique for Non-blocking Trees
[PPoPP’14] Practical Concurrent Binary Search Trees via Logical Ordering
[PPoPP’14] A Practical Wait-Free Simulation for Lock-Free Data Structures
[PPoPP’15] [PPoPP’15] [SPAA’15] [SPAA’16]

Used in so many designs... But do we really know their raw performance?

slide-40
SLIDE 40

spcl.inf.ethz.ch @spcl_eth

ATOMICS: POPULARITY

[PPoPP’14] Fast Concurrent Lock-Free Binary Search Trees
[PPoPP’14] A General Technique for Non-blocking Trees
[PPoPP’14] Practical Concurrent Binary Search Trees via Logical Ordering
[PPoPP’14] A Practical Wait-Free Simulation for Lock-Free Data Structures
[PPoPP’15] [PPoPP’15] [SPAA’15] [SPAA’16]

Used in so many designs... But do we really know their raw performance? Raw performance... Of what?

slide-41
SLIDE 41

spcl.inf.ethz.ch @spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

slide-43
SLIDE 43

spcl.inf.ethz.ch @spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Cache level?

slide-44
SLIDE 44

spcl.inf.ethz.ch @spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Cache level? Locality?

slide-46
SLIDE 46

spcl.inf.ethz.ch @spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Atomic?

slide-47
SLIDE 47

spcl.inf.ethz.ch @spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Atomic? Performance metrics?

slide-48
SLIDE 48

spcl.inf.ethz.ch @spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Atomic? Performance metrics? Architecture

slide-49
SLIDE 49

spcl.inf.ethz.ch @spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Atomic? Contention? Performance metrics? Architecture

slide-50
SLIDE 50

spcl.inf.ethz.ch @spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Cache coherence state? Atomic? Contention? Performance metrics? Architecture

slide-51
SLIDE 51

spcl.inf.ethz.ch @spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Cache coherence state? Atomic? Contention? Operand size? Performance metrics? Architecture

slide-52
SLIDE 52

spcl.inf.ethz.ch @spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Cache coherence state? Atomic? Contention? Operand size? Performance metrics? Alignment? Architecture

slide-53
SLIDE 53

spcl.inf.ethz.ch @spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Cache coherence state? Atomic? Contention? Operand size? Performance metrics? Alignment? Mechanisms? Architecture

slide-54
SLIDE 54

spcl.inf.ethz.ch @spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Atomic?

slide-55
SLIDE 55

spcl.inf.ethz.ch @spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Atomic? Compare-and-Swap (CAS)

slide-56
SLIDE 56

spcl.inf.ethz.ch @spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Atomic? Compare-and-Swap (CAS), Fetch-and-Add (FAA)

slide-57
SLIDE 57

spcl.inf.ethz.ch @spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Atomic? Compare-and-Swap (CAS), Fetch-and-Add (FAA), Swap (SWP)
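As a brief illustration of the three atomics listed here, the sketch below expresses them with C++ `std::atomic`. The mapping to `compare_exchange_strong`, `fetch_add`, and `exchange` is the standard C++ one. The struct and function names are hypothetical, and which hardware instruction each call compiles to depends on the architecture and compiler.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical record of what each atomic returned in the demo below.
struct AtomicDemoResult {
    bool cas_succeeded;
    std::uint64_t faa_previous;
    std::uint64_t swp_previous;
    std::uint64_t final_value;
};

inline AtomicDemoResult run_atomics_demo() {
    std::atomic<std::uint64_t> x{10};

    // CAS: store 20 only if x currently holds the expected value 10.
    std::uint64_t expected = 10;
    bool ok = x.compare_exchange_strong(expected, 20);  // x: 10 -> 20

    // FAA: add 5 and return the value held before the addition.
    std::uint64_t before_add = x.fetch_add(5);          // returns 20, x = 25

    // SWP: unconditionally store 7 and return the previous value.
    std::uint64_t before_swap = x.exchange(7);          // returns 25, x = 7

    return {ok, before_add, before_swap, x.load()};
}
```

All three are read-modify-write operations on a single location; they differ only in how the new value is computed and whether the write is conditional.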

slide-58
SLIDE 58

spcl.inf.ethz.ch @spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Performance metrics?

slide-59
SLIDE 59

spcl.inf.ethz.ch @spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Performance metrics? Latency

slide-60
SLIDE 60

spcl.inf.ethz.ch @spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Performance metrics? Bandwidth Latency

slide-61
SLIDE 61

spcl.inf.ethz.ch @spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Cache coherence state?

slide-62
SLIDE 62

spcl.inf.ethz.ch @spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Cache coherence state? Modified: in one cache and dirty

slide-63
SLIDE 63

spcl.inf.ethz.ch @spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Cache coherence state? Modified: in one cache and dirty Exclusive: in one cache and clean

slide-64
SLIDE 64

spcl.inf.ethz.ch @spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Cache coherence state? Modified: in one cache and dirty Exclusive: in one cache and clean Shared: in >1 cache and clean

slide-65
SLIDE 65

spcl.inf.ethz.ch @spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Cache coherence state? Modified: in one cache and dirty Exclusive: in one cache and clean Shared: in >1 cache and clean Invalid: garbage data

slide-66
SLIDE 66

spcl.inf.ethz.ch @spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Architecture

slide-71
SLIDE 71

spcl.inf.ethz.ch @spcl_eth

RESEARCH QUESTIONS

slide-72
SLIDE 72

spcl.inf.ethz.ch @spcl_eth

How do we model the performance of atomics?

RESEARCH QUESTIONS

slide-73
SLIDE 73

spcl.inf.ethz.ch @spcl_eth

What is the performance difference between various atomics? How do we model the performance of atomics?

RESEARCH QUESTIONS

slide-74
SLIDE 74

spcl.inf.ethz.ch @spcl_eth

What is the performance difference between various atomics?
What is the influence of various parameters and mechanisms?
How do we model the performance of atomics?

RESEARCH QUESTIONS

slide-75
SLIDE 75

spcl.inf.ethz.ch @spcl_eth

Core Cache Cache Cache

Cache line

LATENCY MODEL

slide-77
SLIDE 77

spcl.inf.ethz.ch @spcl_eth

Core Cache Cache Cache

Cache line Read for ownership

LATENCY MODEL

slide-78
SLIDE 78

spcl.inf.ethz.ch @spcl_eth

Core Cache Cache Cache

Cache line Read for ownership

LATENCY MODEL

Cache coherence state

slide-79
SLIDE 79

spcl.inf.ethz.ch @spcl_eth

Core Cache Cache Cache

Cache line Read for ownership = max(read latency, invalidation latency)

LATENCY MODEL

Cache coherence state

slide-80
SLIDE 80

spcl.inf.ethz.ch @spcl_eth

Core Cache Cache Cache

Cache line Cache line Read for ownership = max(read latency, invalidation latency)

LATENCY MODEL

Cache coherence state

slide-82
SLIDE 82

spcl.inf.ethz.ch @spcl_eth

Core Cache Cache Cache

Cache line Cache line Read for ownership Execute = max(read latency, invalidation latency) = constant

LATENCY MODEL

Cache coherence state

slide-83
SLIDE 83

spcl.inf.ethz.ch @spcl_eth

Core Cache Cache Cache

Cache line Cache line Read for ownership Execute = max(read latency, invalidation latency) = constant

LATENCY MODEL

Cache coherence state Atomic

slide-85
SLIDE 85

spcl.inf.ethz.ch @spcl_eth

Core Cache Cache Cache

Cache line Cache line Read for ownership Execute Atomic = max(read latency, invalidation latency) = constant

LATENCY MODEL

Cache coherence state Atomic

slide-86
SLIDE 86

spcl.inf.ethz.ch @spcl_eth

Core Cache Cache Cache

Cache line Cache line
Read for ownership = max(read latency, invalidation latency)
Execute = constant
Atomic Cache coherence state

LATENCY MODEL

Cache coherence state Atomic

slide-87
SLIDE 87

spcl.inf.ethz.ch @spcl_eth

Core Cache Cache Cache

Cache line Cache line Read for ownership Execute Atomic Cache coherence state = constant

LATENCY MODEL

EXCLUSIVE OR MODIFIED STATE

= max(read latency, invalidation latency)

slide-88
SLIDE 88

spcl.inf.ethz.ch @spcl_eth

Core Cache Cache Cache

Cache line Cache line
Read for ownership = read latency
Execute = constant
Atomic Cache coherence state

LATENCY MODEL

EXCLUSIVE OR MODIFIED STATE

slide-89
SLIDE 89

spcl.inf.ethz.ch @spcl_eth

Core Cache Cache Cache

Cache line Cache line
Read for ownership = read latency
Execute = constant
Atomic Cache coherence state

LATENCY MODEL

EXCLUSIVE OR MODIFIED STATE

[Plot legend: predictions, observed data, mean of observed data]
slide-90
SLIDE 90

spcl.inf.ethz.ch @spcl_eth

LATENCY

HASWELL, EXCLUSIVE

slide-91
SLIDE 91

spcl.inf.ethz.ch @spcl_eth

CAS FAA

LATENCY

BULLDOZER, EXCLUSIVE

slide-92
SLIDE 92

spcl.inf.ethz.ch @spcl_eth

LATENCY

HASWELL, EXCLUSIVE

Alignment?

slide-93
SLIDE 93

spcl.inf.ethz.ch @spcl_eth

64 bit 128 bit

LATENCY

BULLDOZER, EXCLUSIVE

Operand size?

slide-94
SLIDE 94

spcl.inf.ethz.ch @spcl_eth

BANDWIDTH

HASWELL, ATOMICS

slide-95
SLIDE 95

spcl.inf.ethz.ch @spcl_eth

CONCLUSIONS

PERFORMANCE INSIGHTS

slide-96
SLIDE 96

spcl.inf.ethz.ch @spcl_eth

CONCLUSIONS

PERFORMANCE INSIGHTS

The same latency of different atomics in most scenarios

slide-97
SLIDE 97

spcl.inf.ethz.ch @spcl_eth

CONCLUSIONS

PERFORMANCE INSIGHTS

The same latency of different atomics in most scenarios CAS is the fastest for some cases

slide-98
SLIDE 98

spcl.inf.ethz.ch @spcl_eth

CONCLUSIONS

PERFORMANCE INSIGHTS

The same latency of different atomics in most scenarios CAS is the fastest for some cases Unaligned atomics should be avoided at all costs

slide-99
SLIDE 99

spcl.inf.ethz.ch @spcl_eth

CONCLUSIONS

PERFORMANCE INSIGHTS

The same latency of different atomics in most scenarios CAS is the fastest for some cases Unaligned atomics should be avoided at all costs No parallel execution (low bandwidth) even if there are no data deps

slide-100
SLIDE 100

spcl.inf.ethz.ch @spcl_eth

CONCLUSIONS

PERFORMANCE INSIGHTS

  • The same latency of different atomics in most scenarios
  • CAS is the fastest for some cases
  • Small operand sizes give best performance
  • Unaligned atomics should be avoided at all costs
  • No parallel execution (low bandwidth) even if there are no data dependencies
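One way to act on the alignment insight is sketched below: give each atomic its own cache-line-aligned slot so that its operand never straddles two lines. This is a sketch under assumptions, not something taken from the slides: the 64-byte line size, the `PaddedCounter` type, and the `within_one_line` helper are all hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache line size; 64 bytes is common on x86 but not universal.
constexpr std::size_t kCacheLine = 64;

// Hypothetical counter type: alignas places the atomic at the start of a
// cache line and pads the struct so neighbours land on separate lines.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

static_assert(alignof(PaddedCounter) == kCacheLine, "must be line-aligned");
static_assert(sizeof(PaddedCounter) == kCacheLine, "must fill exactly one line");

// True if the `size` bytes starting at `p` lie inside a single cache line.
inline bool within_one_line(const void* p, std::size_t size) {
    auto addr = reinterpret_cast<std::uintptr_t>(p);
    return addr / kCacheLine == (addr + size - 1) / kCacheLine;
}
```

For example, an 8-byte operand starting at byte offset 60 would span two lines and fail the `within_one_line` check, whereas every `PaddedCounter::value` passes it.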

slide-101
SLIDE 101

spcl.inf.ethz.ch @spcl_eth

Atomics Read

LATENCY

HASWELL, EXCLUSIVE

slide-102
SLIDE 102

spcl.inf.ethz.ch @spcl_eth

Atomics Read

LATENCY

HASWELL, MODIFIED

slide-103
SLIDE 103

spcl.inf.ethz.ch @spcl_eth

LATENCY

IVY BRIDGE, EXCLUSIVE

CAS

slide-105
SLIDE 105

spcl.inf.ethz.ch @spcl_eth

LATENCY

XEON PHI, MODIFIED / EXCLUSIVE

CAS

slide-106
SLIDE 106

spcl.inf.ethz.ch @spcl_eth

Core Cache Cache Cache

Cache line Cache line Read for ownership Execute Atomic Cache coherence state = constant

LATENCY MODEL

EXCLUSIVE OR MODIFIED STATE

= max(read latency, invalidation latency)

slide-107
SLIDE 107

spcl.inf.ethz.ch @spcl_eth

Core Cache Cache Cache

Cache line Cache line Read for ownership Execute Atomic Cache coherence state = constant

LATENCY MODEL

SHARED STATE

= max(read latency, invalidation latency)

slide-108
SLIDE 108

spcl.inf.ethz.ch @spcl_eth

Core Cache Cache Cache

Cache line Cache line
Read for ownership = invalidation latency
Execute = constant
Atomic Cache coherence state

LATENCY MODEL

SHARED STATE
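The latency model on these slides can be written down as a small sketch. Several things here are assumptions rather than content of the slides: that the read-for-ownership and execute phases simply add up, that the invalidation term is zero in the Exclusive and Modified states because no other copy exists, and all type and function names. Any numbers fed into it are placeholders, not measurements.

```cpp
#include <algorithm>
#include <cassert>

// Coherence state of the target cache line before the atomic is issued.
enum class LineState { Modified, Exclusive, Shared };

// Hypothetical parameter set; values would come from measurements.
struct LatencyParams {
    double read_ns;          // latency to read the line from where it resides
    double invalidation_ns;  // latency to invalidate copies in other caches
    double execute_ns;       // constant cost of executing the atomic itself
};

// Read for ownership = max(read latency, invalidation latency).
// In E or M state there are no other copies, so only the read term remains.
inline double read_for_ownership_ns(LineState s, const LatencyParams& p) {
    double invalidation = (s == LineState::Shared) ? p.invalidation_ns : 0.0;
    return std::max(p.read_ns, invalidation);
}

// Assumed total: ownership acquisition followed by the constant execute phase.
inline double atomic_latency_ns(LineState s, const LatencyParams& p) {
    return read_for_ownership_ns(s, p) + p.execute_ns;
}
```

With placeholder values of 10 ns read, 30 ns invalidation, and 5 ns execute, the sketch yields 15 ns for an Exclusive or Modified line and 35 ns for a Shared line, matching the slides' simplification whenever invalidation dominates the read.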

slide-109
SLIDE 109

spcl.inf.ethz.ch @spcl_eth

Atomics Read

LATENCY

HASWELL, SHARED

slide-110
SLIDE 110

spcl.inf.ethz.ch @spcl_eth

How to force cache coherence state

  • F(M): Write cache line (invalidates all copies)
  • F(E): F(M) → flush → read
  • F(S): F(E) → read by some other core

D. Hackenberg, D. Molka, W. Nagel. Comparing Cache Architectures and Coherency Protocols on x86-64 Multicore SMP Systems. MICRO’09
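The F(M) and F(E) steps listed above might be sketched as follows on x86-64, using the `_mm_clflush` and `_mm_mfence` intrinsics. This is a rough sketch with hypothetical names (`Line`, `flush_line`, `force_modified`, `force_exclusive`); user code cannot observe the resulting coherence state directly, so the intended state is an assumption about typical hardware behaviour. F(S) is omitted because it needs a second thread pinned to another core to read the line after F(E).

```cpp
#include <cassert>
#include <cstdint>
#if defined(__x86_64__)
#include <immintrin.h>
#endif

// One cache line worth of storage (64-byte line size assumed).
struct alignas(64) Line {
    volatile std::uint64_t word;
};

// Flush the line holding `p` out of the cache hierarchy.
inline void flush_line(const void* p) {
#if defined(__x86_64__)
    _mm_mfence();    // make the preceding write globally visible first
    _mm_clflush(p);  // evict the line from all cache levels
    _mm_mfence();    // complete the flush before any later load
#else
    (void)p;         // no portable flush; on other targets this is a no-op
#endif
}

// F(M): writing the line invalidates other copies, leaving it Modified.
inline void force_modified(Line& line, std::uint64_t v) {
    line.word = v;
}

// F(E): F(M), then flush, then read; the reload is expected to bring the
// line back clean and held by this core only.
inline std::uint64_t force_exclusive(Line& line, std::uint64_t v) {
    force_modified(line, v);
    flush_line(&line);
    return line.word;
}
```

A benchmark would call one of these helpers on the preparing core immediately before timing the atomic on the measuring core, so that each sample starts from the same intended state.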