SLIDE 1

Carnegie Mellon

School of Computer Science

1

Architecture

Memory Consistency Models

Adam Wierman Daniel Neill

  • Adve, Pai, and Ranganathan. Recent advances in memory consistency models for hardware shared-memory systems, 1999.
  • Gniady, Falsafi, and Vijaykumar. Is SC+ILP=RC?, 1999.
  • Hill. Multiprocessors should support simple memory consistency models, 1998.

SLIDE 2

Memory consistency models

  • The memory consistency model of a shared-memory system determines the order in which memory operations will appear to execute to the programmer.
– Processor 1 writes to some memory location…
– Processor 2 reads from that location…
– Do I get the result I expect?
  • Different models make different guarantees; the processor can reorder/overlap memory operations as long as the guarantees are upheld.

Tradeoff between programmability and performance!

SLIDE 3

Code example 1

initially Data1 = Data2 = Flag = 0

P1:
    Data1 = 64
    Data2 = 55
    Flag = 1

P2:
    while (Flag != 1) {;}
    register1 = Data1
    register2 = Data2

What should happen?

SLIDE 4

Code example 1

initially Data1 = Data2 = Flag = 0

P1:
    Data1 = 64
    Data2 = 55
    Flag = 1

P2:
    while (Flag != 1) {;}
    register1 = Data1
    register2 = Data2

What could go wrong?

SLIDE 5

Three models of memory consistency

  • Sequential Consistency (SC):
– Memory operations appear to execute one at a time, in some sequential order.
– The operations of each individual processor appear to execute in program order.
  • Processor Consistency (PC):
– Allows reads following a write to execute out of program order (if they’re not reading/writing the same address!)
– Writes may not be immediately visible to other processors, but become visible in program order.
  • Release Consistency (RC):
– All reads and writes (to different addresses!) are allowed to operate out of program order.

SLIDE 6

Code example 1

initially Data1 = Data2 = Flag = 0

P1:
    Data1 = 64
    Data2 = 55
    Flag = 1

P2:
    while (Flag != 1) {;}
    register1 = Data1
    register2 = Data2

Does it work under:

  • SC (no relaxation)?
  • PC (Write→Read relaxation)?
  • RC (all relaxations)?
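
One way to explore the question for Code example 1 is to enumerate executions by brute force. The sketch below is a toy model, not a description of any real processor: it assumes memory is a single atomic store, treats a relaxation as permission to swap two operations of one processor that touch different addresses, and replaces the spin loop by a single read of Flag that is only counted when it returns 1. The names `RELAXED`, `legal_orders`, `interleavings`, and `outcomes` are illustrative.

```python
from itertools import permutations

# Pairs (earlier kind, later kind) that each model lets run out of program order.
RELAXED = {
    "SC": set(),
    "PC": {("W", "R")},
    "RC": {("W", "R"), ("W", "W"), ("R", "R"), ("R", "W")},
}

P1 = [("W", "Data1", 64), ("W", "Data2", 55), ("W", "Flag", 1)]
# One pass of the spin loop (kept only if it sees Flag == 1), then the data reads.
P2 = [("R", "Flag", "flag"), ("R", "Data1", "register1"), ("R", "Data2", "register2")]


def legal_orders(prog, model):
    """Yield every order of prog in which only relaxed pairs are swapped."""
    n = len(prog)
    for order in permutations(range(n)):
        pos = {op: k for k, op in enumerate(order)}
        swapped = [(a, b) for a in range(n) for b in range(a + 1, n) if pos[b] < pos[a]]
        if all((prog[a][0], prog[b][0]) in RELAXED[model] and prog[a][1] != prog[b][1]
               for a, b in swapped):
            yield [prog[k] for k in order]


def interleavings(a, b):
    """Yield every merge of lists a and b that keeps each list's own order."""
    if not a or not b:
        yield a + b
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def outcomes(model):
    """Set of (register1, register2) values P2 can see after leaving the loop."""
    seen = set()
    for o1 in legal_orders(P1, model):
        for o2 in legal_orders(P2, model):
            for trace in interleavings(o1, o2):
                mem = {"Data1": 0, "Data2": 0, "Flag": 0}
                regs = {}
                for kind, addr, x in trace:
                    if kind == "W":
                        mem[addr] = x
                    else:
                        regs[x] = mem[addr]
                if regs["flag"] == 1:  # the loop exited
                    seen.add((regs["register1"], regs["register2"]))
    return seen


for model in ("SC", "PC", "RC"):
    print(model, sorted(outcomes(model)))
```

In this model SC and PC only ever produce (64, 55), because P1 issues nothing but writes and P2 nothing but reads, so the Write→Read relaxation never applies. With all relaxations enabled, stale values such as (0, 0) also appear.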

SLIDE 7

Code example 2

initially Flag1 = Flag2 = 0

P1:
    Flag1 = 1
    register1 = Flag2
    if (register1 == 0) critical section

P2:
    Flag2 = 1
    register2 = Flag1
    if (register2 == 0) critical section

What should happen?

SLIDE 8

Code example 2

initially Flag1 = Flag2 = 0

P1:
    Flag1 = 1
    register1 = Flag2
    if (register1 == 0) critical section

P2:
    Flag2 = 1
    register2 = Flag1
    if (register2 == 0) critical section

What could go wrong?

SLIDE 9

Code example 2

initially Flag1 = Flag2 = 0

P1:
    Flag1 = 1
    register1 = Flag2
    if (register1 == 0) critical section

P2:
    Flag2 = 1
    register2 = Flag1
    if (register2 == 0) critical section

Does it work under:

  • SC (no relaxation)?
  • PC (Write→Read relaxation)?
  • RC (all relaxations)?
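
For Code example 2, the Write→Read relaxation can be pictured as a per-processor FIFO write buffer: a write sits in the buffer while later reads go ahead. The sketch below explores every schedule of such a machine. It is an assumed toy model (atomic memory, reads forwarded from the processor's own buffer), and `dekker_outcomes` is an illustrative name, not something taken from the cited papers.

```python
def dekker_outcomes(use_write_buffer):
    """Return every reachable (register1, register2) pair for Code example 2."""
    progs = (
        [("W", "Flag1", 1), ("R", "Flag2", "register1")],  # P1
        [("W", "Flag2", 1), ("R", "Flag1", "register2")],  # P2
    )
    results = set()

    def step(pcs, bufs, mem, regs):
        progressed = False
        for p in (0, 1):
            # Choice 1: processor p executes its next instruction.
            if pcs[p] < len(progs[p]):
                progressed = True
                kind, addr, x = progs[p][pcs[p]]
                new_pcs = pcs[:p] + (pcs[p] + 1,) + pcs[p + 1:]
                if kind == "W" and use_write_buffer:
                    # The write waits in p's buffer, invisible to the other processor.
                    new_bufs = bufs[:p] + (bufs[p] + ((addr, x),),) + bufs[p + 1:]
                    step(new_pcs, new_bufs, mem, regs)
                elif kind == "W":
                    step(new_pcs, bufs, {**mem, addr: x}, regs)
                else:
                    # A read sees p's own newest buffered write, else memory.
                    value = mem[addr]
                    for a, v in bufs[p]:
                        if a == addr:
                            value = v
                    step(new_pcs, bufs, mem, {**regs, x: value})
            # Choice 2: the oldest write in p's buffer drains to memory.
            if bufs[p]:
                progressed = True
                addr, v = bufs[p][0]
                new_bufs = bufs[:p] + (bufs[p][1:],) + bufs[p + 1:]
                step(pcs, new_bufs, {**mem, addr: v}, regs)
        if not progressed:  # both programs finished and both buffers empty
            results.add((regs["register1"], regs["register2"]))

    step((0, 0), ((), ()), {"Flag1": 0, "Flag2": 0}, {})
    return results


print("SC:          ", sorted(dekker_outcomes(False)))
print("write buffer:", sorted(dekker_outcomes(True)))
```

The outcome (0, 0) means both processors enter the critical section. In this model it is unreachable without the buffer and reachable with it, which is the failure the slide asks about.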

SLIDE 10

The performance/programmability tradeoff

[Figure: axes labeled "Increasing performance" and "Increasing programmability"]

SLIDE 11

Programming difficulty

  • PC/RC include special synchronization operations to allow specific instructions to execute atomically and in program order.
  • The programmer must identify conflicting memory operations, and ensure that they are properly synchronized.
  • Missing or incorrect synchronization → program gives unexpected/incorrect results.
  • Too many unnecessary synchronizations → performance reduced (no better than SC?)

Idea: normally ensure sequential consistency; allow programmer to specify when relaxation possible?

SLIDE 12

Code example 1, revisited

initially Data1 = Data2 = Flag = 0

P1:
    Data1 = 64
    Data2 = 55
    MEMBAR (ST-ST)
    Flag = 1

P2:
    while (Flag != 1) {;}
    MEMBAR (LD-LD)
    register1 = Data1
    register2 = Data2

Programmer adds synchronization commands…
… and now it works as expected!
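
The effect of the two MEMBARs can be checked in a small model. The sketch below assumes a fully relaxed machine in which a processor's operations may execute in any order unless a MEMBAR separates them, and it assumes the barriers sit where their names suggest: ST-ST between the data writes and the Flag write, LD-LD between the Flag read and the data reads. The tuple encoding of a barrier and all function names are illustrative.

```python
from itertools import permutations

ST_ST = ("MEMBAR", ("W", "W"))  # earlier stores before later stores
LD_LD = ("MEMBAR", ("R", "R"))  # earlier loads before later loads

P1 = [("W", "Data1", 64), ("W", "Data2", 55), ST_ST, ("W", "Flag", 1)]
# One pass of the spin loop (kept only if it sees Flag == 1), then the data reads.
P2 = [("R", "Flag", "flag"), LD_LD, ("R", "Data1", "register1"), ("R", "Data2", "register2")]


def legal_orders(prog):
    """Yield the orders of prog's memory operations that its MEMBARs permit."""
    ops = [(i, op) for i, op in enumerate(prog) if op[0] != "MEMBAR"]
    fences = [(i, op[1]) for i, op in enumerate(prog) if op[0] == "MEMBAR"]
    for order in permutations(ops):
        pos = {i: k for k, (i, _) in enumerate(order)}
        ok = all(
            pos[i] < pos[j]
            for f, (before, after) in fences
            for i, a in ops
            for j, b in ops
            if i < f < j and a[0] == before and b[0] == after
        )
        if ok:
            yield [op for _, op in order]


def interleavings(a, b):
    """Yield every merge of lists a and b that keeps each list's own order."""
    if not a or not b:
        yield a + b
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def outcomes(p1, p2):
    """Set of (register1, register2) values seen once the loop has exited."""
    seen = set()
    for o1 in legal_orders(p1):
        for o2 in legal_orders(p2):
            for trace in interleavings(o1, o2):
                mem = {"Data1": 0, "Data2": 0, "Flag": 0}
                regs = {}
                for kind, addr, x in trace:
                    if kind == "W":
                        mem[addr] = x
                    else:
                        regs[x] = mem[addr]
                if regs["flag"] == 1:
                    seen.add((regs["register1"], regs["register2"]))
    return seen


def without_membars(prog):
    return [op for op in prog if op[0] != "MEMBAR"]


print("with MEMBARs:   ", sorted(outcomes(P1, P2)))
print("without MEMBARs:", sorted(outcomes(without_membars(P1), without_membars(P2))))
```

With both barriers the only result is (64, 55). Dropping them lets the stale values back in, and dropping just one of the two is also enough to break the example in this model.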

SLIDE 13

Performance of memory consistency models

  • Relaxed memory models (PC/RC) hide much of the long latency of memory operations by reordering and overlapping some or all memory operations.
– PC/RC can use write buffering.
– RC can be aggressively out of order.
  • This is particularly important:
– When cache performance is poor, resulting in many memory operations.
– In distributed shared memory systems, where remote memory accesses may take much longer than local memory accesses.
  • Performance results for straightforward implementations: as compared to SC, PC and RC reduce execution time by 23% and 46% respectively (Adve et al).
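
A reduction in execution time is not the same number as a speedup. As a small worked example, assuming the 23% and 46% figures are fractions of the SC execution time, the conversion is 1 / (1 - reduction); the helper name `speedup` is illustrative.

```python
def speedup(time_reduction):
    """Speedup over the baseline for a fractional reduction in execution time."""
    if not 0 <= time_reduction < 1:
        raise ValueError("time_reduction must be in [0, 1)")
    return 1.0 / (1.0 - time_reduction)


for model, reduction in (("PC", 0.23), ("RC", 0.46)):
    print(f"{model}: {speedup(reduction):.2f}x the speed of SC")
```

Under that reading, PC runs at roughly 1.30 times and RC at roughly 1.85 times the speed of the straightforward SC implementation.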

SLIDE 14

The big question

How can SC approach the performance of RC?

SLIDE 15

How can SC approach RC?

2 Techniques

  • Hardware Optimizations
  • Compiler Optimizations

SLIDE 16

Hardware Optimizations

What can SC do?

Can SC have per-processor caches? YES
Can SC have non-binding prefetching? YES
Can SC have multithreading? YES
Can SC use a write buffer? NO

SC cannot reorder memory operations because it might cause inconsistency.

SLIDE 17

Hardware Optimizations

Speculation with SC

SC only needs to appear to do memory operations in order
1. Speculatively perform all memory operations
2. Roll back to “sequentially consistent” state if constraints are violated

This emulates RC as long as rollbacks are infrequent.

SLIDE 18

Hardware Optimizations

Speculation with SC

SC only needs to appear to do memory operations in order
1. Speculatively perform all memory operations
2. Roll back to “sequentially consistent” state if constraints are violated

  • Must allow both loads and stores to bypass each other
  • Needs a very large speculative state
  • Don’t introduce overhead to the pipeline

SLIDE 19

Hardware Optimizations

Speculation with SC

SC only needs to appear to do memory operations in order
1. Speculatively perform all memory operations
2. Roll back to “sequentially consistent” state if constraints are violated

  • Must detect violations quickly
  • Must be able to roll back quickly
  • Rollbacks can’t happen often
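
The two steps can be illustrated with a toy software model. This is only a sketch of the idea, not the hardware described in the cited papers: the class `SpeculativeCore`, its methods, and the rule "a remote write to a speculatively loaded address is a violation" are all assumptions made for illustration.

```python
class SpeculativeCore:
    """Toy core that loads early and undoes the loads if a violation shows up."""

    def __init__(self, memory):
        self.memory = memory
        self.regs = {}
        self.history = []  # (register, previous value, address) per uncommitted load
        self.rollbacks = 0

    def speculative_load(self, reg, addr):
        # Step 1: perform the load right away, but remember how to undo it.
        self.history.append((reg, self.regs.get(reg), addr))
        self.regs[reg] = self.memory[addr]

    def remote_write(self, addr, value):
        """Another processor writes addr; return True if that forced a rollback."""
        self.memory[addr] = value
        for index, (_, _, loaded_addr) in enumerate(self.history):
            if loaded_addr == addr:
                self._rollback(index)
                return True
        return False

    def _rollback(self, index):
        # Step 2: undo the violating load and everything younger, youngest first.
        for reg, previous, _ in reversed(self.history[index:]):
            if previous is None:
                self.regs.pop(reg, None)
            else:
                self.regs[reg] = previous
        del self.history[index:]
        self.rollbacks += 1

    def commit(self):
        """No violation can occur any more: the speculative loads become final."""
        self.history.clear()


memory = {"Flag": 0, "Data1": 0}
core = SpeculativeCore(memory)
core.speculative_load("register1", "Data1")  # runs ahead of the Flag check
core.remote_write("Data1", 64)               # conflicts with the early load
core.remote_write("Flag", 1)
core.speculative_load("flag", "Flag")
core.speculative_load("register1", "Data1")  # re-executed after the rollback
core.commit()
print(core.regs, "rollbacks:", core.rollbacks)
```

The early load of Data1 is undone once the conflicting write arrives, so the committed state matches an execution that did its loads in program order. Every rollback costs re-execution, which is why rollbacks must stay rare.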

SLIDE 20

Hardware Optimizations

Results

SC only needs to appear to do memory operations in order

These changes were implemented in SC++ and results showed a narrowing gap as compared to PC and RC
… but SC++ used significantly more hardware.

The gap is negligible! (Unlimited SHiQ, BLT)

SLIDE 21

How can SC approach RC?

2 Techniques

  • Hardware Optimizations
  • Compiler Optimizations

SLIDE 22

Compiler Optimizations

Compiler optimizations?

If we could figure out ahead of time which operations need to be run in order, we wouldn’t need speculation.

P1:
    Data1 = 64
    Data2 = 55
    Flag = 1

P2:
    while (Flag != 1) {;}
    register1 = Data1
    register2 = Data2

SLIDE 23

Compiler Optimizations

Where are the conflicts?

P1:
    Data1 = 64
    Data2 = 55
    Flag = 1

P2:
    while (Flag != 1) {;}
    register1 = Data1
    register2 = Data2

[Figure: conflict graph over the operations Write Data1, Write Data2, Write Flag, Read Flag, Read Data1, Read Data2]

Guaranteeing that no operations on an edge in a cycle are reordered guarantees consistency!

If there are no cycles then there are no conflicts.
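
The cycle test can be sketched for the two-processor example above. The code below makes a simplifying assumption: it only looks for cycles made of one program-order pair on each processor joined by two conflicts (two accesses conflict when they touch the same address and at least one is a write). The names `conflicts` and `must_stay_ordered` are illustrative, and this is a minimal sketch, not the compiler analysis from the cited papers.

```python
from itertools import combinations

P1 = [("W", "Data1"), ("W", "Data2"), ("W", "Flag")]
P2 = [("R", "Flag"), ("R", "Data1"), ("R", "Data2")]


def conflicts(a, b):
    """Accesses conflict if they share an address and at least one is a write."""
    return a[1] == b[1] and "W" in (a[0], b[0])


def must_stay_ordered(p1, p2):
    """Program-order pairs lying on a cycle u -> v ~ x -> y ~ u.

    '->' is program order within one processor, '~' is a conflict across them.
    """
    keep = set()
    for u, v in combinations(p1, 2):      # u precedes v on the first processor
        for x, y in combinations(p2, 2):  # x precedes y on the second processor
            if conflicts(v, x) and conflicts(y, u):
                keep.add((u, v))
                keep.add((x, y))
    return keep


for first, second in sorted(must_stay_ordered(P1, P2)):
    print(first, "must stay before", second)
```

For this example the pairs that must keep their order are each data write before the Flag write on P1, and the Flag read before each data read on P2. The two data writes, and likewise the two data reads, lie on no such cycle and may be reordered freely.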

SLIDE 24

Conclusion

SC approaches RC

Speculation and compiler optimizations allow SC to achieve nearly the same performance as RC

RC approaches SC

Programming constructs allow the user to mark possible conflicts as synchronization operations and still obtain the simplicity of SC
