Fast and Accurate Performance Analysis of Synchronization - PowerPoint PPT Presentation



SLIDE 1

Fast and Accurate Performance Analysis of Synchronization

Mario Badr and Natalie Enright Jerger

SLIDE 2

Evaluating Multi-Threaded Performance

  • Difficult and Time Consuming
  • Non-Determinism
  • Cross-stack effects
  • Different Architectures
  • Goal: Make it Straightforward and Fast
  • One trace, many total orders
  • High level of abstraction

SLIDE 3

More Synchronization != More Overhead

(Chart for the ferret benchmark.)

SLIDE 4

Multi-threaded, Multi-core Workflows

Programmer: Write Multi-threaded Program → Profile for Bottlenecks → Implement Optimization → Release Program

Systems Researcher: Change Kernel → Test Implementation → Modify Implementation → Release Modifications

Architect: Design Multi-processor → Simulate with Benchmarks → Optimize Design → Release Chip

One architecture? Multiple architectures? Architectures that don’t exist? One application? Application input? Simulation time?

SLIDE 5

Cross-Stack Interactions for Synchronization

  • Application
  • Thread Library / Application Runtime
  • Operating System
  • Architecture

SLIDE 6

Modelling Multithreaded Applications

(Diagram: the application is captured as a thread model representation; a runtime/OS model and an architecture model, driven by an architectural configuration, complete the stack.)

SLIDE 7

Execution of a Parallel Program

(Timeline diagram of threads t1 to t4.)

SLIDE 8

What impacts a thread’s execution time?

  • Heterogeneity
    • Architectures (e.g., big.LITTLE)
    • Dynamic Voltage and Frequency Scaling (DVFS)
  • Contention
  • Synchronization
  • Many other things

SLIDE 9

The Impact of Heterogeneity

(Timeline diagram of threads t1 to t4 running on heterogeneous cores.)

SLIDE 10

The Impact of Synchronization

(Timeline diagram of threads t1 to t4 with synchronization stalls.)

SLIDE 11

Heterogeneity and Synchronization

(Timeline diagram of threads t1 to t4.)

The order and time of synchronization events impacts performance.

SLIDE 12

Modelling Cross-Stack Interactions

  • How to represent a multi-threaded application?
    • Task Graph
    • Trace
  • How to model the operating system and runtime?
    • Thread scheduling
    • Synchronization
  • How to model the architecture?
    • Rate of execution (e.g., cycles per instruction)

SLIDE 13

The Producer Consumer Example

Adding Work to a Queue / Removing Work from a Queue

SLIDE 14

Representing an Application

Synchronization Trace

Thread     Event    Primitive
Consumer   Lock     mutex
Producer   Lock     mutex
Consumer   Wait     enqueue
Producer   Signal   enqueue
Producer   Unlock   mutex
Consumer   Signal   dequeue
Consumer   Unlock   mutex

Task Graph

(Task graph diagram: Lock, Signal, Unlock, and Wait nodes for the Producer and Consumer threads.)

SLIDE 15

The order of synchronization events

  • A synchronization trace gives us the program order of each thread
  • We want to determine the total order of all synchronization events
  • The total order must be correct
  • Safety (e.g., no two threads in the same critical section)
  • Liveness (e.g., all threads make progress eventually)

SLIDE 16

Consumer locks mutex first (original trace):
Consumer Lock → Producer Lock → Consumer Wait → Producer Signal → Producer Unlock → Consumer Signal → Consumer Unlock

Consumer is much faster than producer:
Consumer Lock → Consumer Wait → Producer Lock → Producer Signal → Producer Unlock → Consumer Signal → Consumer Unlock

Producer locks mutex first:
Producer Lock → Consumer Lock → Producer Signal → Producer Unlock → Consumer Wait → Consumer Signal → Consumer Unlock

Producer is much faster than consumer:
Producer Lock → Producer Signal → Producer Unlock → Consumer Lock → Consumer Wait → Consumer Signal → Consumer Unlock

One Trace, Multiple Total Orders – Captures Non-Determinism

SLIDE 17

Modelling Locks and Condition Variables

Per-Lock Thread Queues · Condition Variable Counters

  • On wait
    • Decrement counter by 1
  • On signal
    • Increment counter by 1
  • On broadcast
    • Increment counter by the number of waiting consumers
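The counter bookkeeping described above can be sketched in a few lines. The naming here is ours, not the paper's; negative values mean threads are waiting, non-negative values mean outstanding signals:

```c
/* Hypothetical sketch of the model's condition-variable counter. */
typedef struct {
    int counter;  /* negative: waiters outstanding; non-negative: signals banked */
} cv_model;

static void cv_wait(cv_model *cv)                   { cv->counter -= 1; }
static void cv_signal(cv_model *cv)                 { cv->counter += 1; }
static void cv_broadcast(cv_model *cv, int waiters) { cv->counter += waiters; }
```

For example, a wait on a fresh counter leaves it at -1 (one waiter); a later signal returns it to 0, releasing that waiter.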

(Diagram: a per-lock queue holding waiting threads t1 to t4.)

SLIDE 18

Estimating the Time Between Events

  1. Dynamic Instructions: the distance between events
  2. Core Frequency and Microarchitecture: the rate between events
  3. The Scheduling of Threads: the opportunity to execute dynamic instructions
  4. The Timing of Prior Events: the dependencies between threads
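Points 1 and 2 combine into a simple timing rule. A sketch under our own naming (not the paper's API):

```c
/* Time between two events on one thread: dynamic instruction count scaled by
   the core's cycles-per-instruction, divided by its clock frequency. The
   scheduler (3) and prior events (4) then decide when this interval begins. */
static double segment_seconds(unsigned long long instructions,
                              double cpi, double freq_hz)
{
    return (double)instructions * cpi / freq_hz;
}
```

For example, 10^9 instructions at a CPI of 1.0 on a 2.6 GHz core take about 0.385 s.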

SLIDE 19

Our High Level Abstraction

Trace

TID(1) Acquire(A) 100
TID(3) Acquire(A) 342
TID(2) Barrier(B) 612
TID(1) Release(A) 30
TID(3) Release(A) 34
TID(1) Barrier(B) 843
TID(3) Barrier(B) 702
...

Each trace line gives a thread ID (TID), the synchronization event and the object it is acting on, and the instruction count between events for that thread.

(Diagram of the model, with three annotated parts:)

A. Thread Model: each thread is a sequence of events; the model tracks the current event and the instructions remaining until the next one.
B. Synchronization Model: resolves inter-thread dependencies and tells the scheduler which threads sleep and which are scheduled.
C. Architecture Model: a controller maps threads to cores and maintains the queue of executing threads; each core has its own frequency and the CPI for each thread.
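A record for the trace format above might look like this sketch; the struct and field names are ours, purely illustrative:

```c
#include <stdio.h>
#include <string.h>

/* One trace record, e.g. "TID(1) Acquire(A) 100". */
typedef struct {
    int tid;                   /* thread ID */
    char event[16];            /* Acquire, Release, Barrier, ... */
    char object;               /* the synchronization object, e.g. A or B */
    unsigned long long insns;  /* instructions since this thread's previous event */
} trace_record;

/* Parse one trace line; returns 1 on success, 0 on a malformed line. */
static int parse_record(const char *line, trace_record *r)
{
    return sscanf(line, "TID(%d) %15[^(](%c) %llu",
                  &r->tid, r->event, &r->object, &r->insns) == 4;
}
```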

SLIDE 20

Validation Methodology

  • Benchmarks: PARSEC 3.0, Splash-3
  • Execution time measured with GNU time
  • Traces generated with Pin
  • Cycles-per-instruction profiled with VTune™
  • Architecture: Intel Xeon E5-2650 v2
  • 2 sockets, 8 cores per socket, 2 threads per core
  • 20 MB L3 Cache
  • 2.6 GHz
  • Three runs for each experiment

SLIDE 21

Assumptions and Approximations

  • Cycles-per-instruction encompasses microarchitecture and memory hierarchy performance
  • Synchronization events have zero latency
  • Context switches have zero latency
  • The synchronization model approximates application state (e.g., for condition variables)

SLIDE 22

Model Validation: 4 Cores, Single Socket

(Bar chart: average measured vs. estimated execution time, in seconds, per benchmark.)

SLIDE 23

Model Validation: 32 Cores, Dual Socket

(Bar chart: average measured vs. estimated execution time, in seconds, per benchmark.)

SLIDE 24

Water (nsquared): 8 Cores

Estimated with Our Model · Estimated with VTune™

(Per-thread breakdown of average computation vs. synchronization time for thread IDs 1 to 7; our model reports time in nanoseconds, VTune™ in seconds.)

SLIDE 25

Model Runtime

Benchmark          Input Set   Input Size   Trace Size   Runtime
blackscholes       Native      603 MB       1.1 KB       4 ms
bodytrack          Native      616 MB       31 MB        4.9 minutes
water (nsquared)   Native      3.6 MB       53 MB        7.5 minutes
average                                     2 MB         32 seconds

Orders of magnitude faster than simulation of smaller input sets.

SLIDE 26

Conclusion

  • A very high level of abstraction can accurately and quickly estimate the performance of a multi-threaded application on a multi-core processor
    • Average 7.2% error in total execution time
    • Average of 32 seconds to generate an estimate
  • Programmers and systems researchers can evaluate on many architectures
  • Architects can evaluate with native inputs and many applications

SLIDE 27

Future Work

  • How much non-determinism is there across multiple traces of an application?
  • How can a {memory, network} contention model be added to reduce error without significantly increasing model complexity?

SLIDE 28

Our Work is Open Source

https://github.com/mariobadr/simsync-pmam
License: Apache 2.0
Mario Badr and Natalie Enright Jerger

SLIDE 29

Scenario A – Consumer locks mutex first

Thread     Event    Primitive
Consumer   Lock     mutex
Producer   Lock     mutex
Consumer   Wait     enqueue
Producer   Signal   enqueue
Producer   Unlock   mutex
Consumer   Signal   dequeue
Consumer   Unlock   mutex

1. Consumer locks mutex
2. Producer attempts lock
   • Producer blocked
3. Consumer waits for enqueue
   • Consumer blocked, silent unlock
   • Producer unblocked, silent lock
4. Producer signals enqueue
   • Consumer tries to lock, remains blocked
5. Producer unlocks mutex
   • Consumer unblocked, silent lock
6. Consumer signals dequeue
7. Consumer unlocks mutex

SLIDE 30

Scenario B – Consumer is much faster

Thread     Event    Primitive
Consumer   Lock     mutex
Consumer   Wait     enqueue
Producer   Lock     mutex
Producer   Signal   enqueue
Producer   Unlock   mutex
Consumer   Signal   dequeue
Consumer   Unlock   mutex

1. Consumer locks mutex
2. Consumer waits for enqueue
   • Consumer blocked, silent unlock
3. Producer locks mutex
4. Producer signals enqueue
   • Consumer tries lock, remains blocked
5. Producer unlocks mutex
   • Consumer unblocked, silent lock
6. Consumer signals dequeue
7. Consumer unlocks mutex

SLIDE 31

Scenario C – Producer locks mutex first

Thread     Event    Primitive
Producer   Lock     mutex
Consumer   Lock     mutex
Producer   Signal   enqueue
Producer   Unlock   mutex
Consumer   Wait     enqueue
Consumer   Signal   dequeue
Consumer   Unlock   mutex

1. Producer locks mutex
2. Consumer attempts lock
   • Consumer blocked
3. Producer signals enqueue
4. Producer unlocks mutex
   • Consumer unblocked
5. Consumer locks mutex
6. Consumer does not have to wait
7. Consumer signals dequeue
8. Consumer unlocks mutex

SLIDE 32

Scenario D – Producer is much faster

Thread     Event    Primitive
Producer   Lock     mutex
Producer   Signal   enqueue
Producer   Unlock   mutex
Consumer   Lock     mutex
Consumer   Wait     enqueue
Consumer   Signal   dequeue
Consumer   Unlock   mutex

1. Producer locks mutex
2. Producer signals enqueue
3. Producer unlocks mutex
4. Consumer locks mutex
5. Consumer does not have to wait
6. Consumer signals dequeue
7. Consumer unlocks mutex
