Challenges for Worst-case Execution Time Analysis of Multi-core Architectures

Jan Reineke, Computer Science, Saarland University. Presented at Intel, Braunschweig, April 29, 2013.


The Context: Hard Real-Time Systems

Safety-critical applications:

• Avionics, automotive, and train industries; manufacturing.
• Embedded controllers must finish their tasks within given time bounds.
• Developers would like to know the Worst-Case Execution Time (WCET) to give a guarantee.

Examples: a side airbag in a car must react in < 10 msec; crankshaft-synchronous tasks must react in < 45 microsec.

The Timing Analysis Problem

[Diagram: given the embedded software and a microarchitecture, does the software meet its timing requirements?]

What does the execution time depend on?

• The input, determining which path is taken through the program.
• The state of the hardware platform: due to caches, pipelining, speculation, etc.
• Interference from the environment: external interference, as seen from the analyzed task, on shared busses, caches, and memory.

[Diagram: hardware platforms of increasing complexity: a simple CPU with memory; a complex CPU (out-of-order execution, branch prediction, etc.) with an L1 cache and main memory; a multicore with per-core complex CPUs and L1 caches, a shared L2 cache, and main memory.]


Example of Influence of Microarchitectural State

PowerPC 755

    x = a + b;       compiles to:

        LOAD r2, _a
        LOAD r1, _b
        ADD  r3, r2, r1

Even for this short sequence, the execution time varies with the microarchitectural state, e.g. whether the operands hit in the cache and which instructions are already in the pipeline.


Example of Influence of Corunning Tasks in Multicores

Radojkovic et al. (ACM TACO, 2012) report, on the Intel Atom and Intel Core 2 Quad, slow-downs of up to 14x due to interference on the shared L2 cache and memory controller.

Challenges

1. Modeling: How to construct sound timing models?

2. Analysis: How to precisely and efficiently bound the WCET?

3. Design: How to design microarchitectures that enable precise and efficient WCET analysis?


The Modeling Challenge

A timing model is a formal specification of the microarchitecture's timing. An incorrect timing model may lead to an incorrect WCET bound.

Current Process of Deriving the Timing Model

Today, timing models are derived from the microarchitecture by hand.
→ Time-consuming, and
→ error-prone.

Future Process 1: Deriving the Timing Model from a VHDL Model

Derive the timing model automatically from a formal specification of the microarchitecture, e.g. its VHDL model.
→ Less manual effort, thus less time-consuming, and
→ provably correct.

Future Process 2: Deriving the Timing Model by Measurements

Perform measurements on the hardware and infer the timing model automatically from these measurements, using ideas from automata learning.
→ No manual effort, and
→ (under certain assumptions) provably correct.
→ Also useful to validate assumptions about the microarchitecture.


Proof-of-concept: Automatic Modeling of the Cache Hierarchy

• The cache model is an important part of the timing model.
• It can be characterized by a few parameters:
  - ABC: associativity, block size, capacity
  - replacement policy

chi [Abel and Reineke, RTAS 2013] derives all of these parameters fully automatically.

[Diagram: a cache organized as N cache sets of A ways (the associativity), each way holding one block of B bytes together with its tag.]

Example: Intel Core 2 Duo E6750, L1 Data Cache

[Plot: number of L1 misses as a function of the size of the accessed memory region. The measured curve reveals: Capacity = 32 KB, Way Size = 4 KB (and hence associativity = 32 KB / 4 KB = 8).]
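To make the measurement idea concrete, here is a minimal C sketch, not the chi tool itself: the plot above counts misses, whereas this sketch merely times a pointer chase over growing working sets; the average access latency jumps once the working set exceeds the cache capacity. The stride and size range are assumptions, and a POSIX clock_gettime() is assumed to be available.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define STRIDE   64          /* assumed cache-block size in bytes */
    #define ACCESSES 10000000L

    void *sink;                  /* keeps the chase from being optimized away */

    static double avg_latency_ns(size_t bytes) {
        size_t nblocks = bytes / STRIDE;
        size_t step = STRIDE / sizeof(void *);
        void **buf = malloc(nblocks * STRIDE);
        size_t *order = malloc(nblocks * sizeof(size_t));
        /* Random cyclic pointer chain over the blocks, so that hardware
           prefetchers cannot hide the misses. */
        for (size_t i = 0; i < nblocks; i++) order[i] = i;
        for (size_t i = nblocks - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i, tmp = order[i];
            order[i] = order[j]; order[j] = tmp;
        }
        for (size_t i = 0; i < nblocks; i++)
            buf[order[i] * step] = &buf[order[(i + 1) % nblocks] * step];
        free(order);
        void **p = &buf[0];
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < ACCESSES; i++)
            p = (void **)*p;                 /* serialized, data-dependent loads */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        sink = p;
        free(buf);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        return ns / ACCESSES;
    }

    int main(void) {
        for (size_t kb = 4; kb <= 128; kb *= 2)
            printf("%3zu KB working set: %.2f ns per access\n",
                   kb, avg_latency_ns(kb * 1024));
        return 0;
    }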

Replacement Policy

The approach for inferring the replacement policy is inspired by methods for learning finite automata, but is heavily specialized to the problem domain. It discovered a replacement policy of the Intel Atom D525 that is, to our knowledge, undocumented.

[Diagram: the inferred policy, illustrated on an example access sequence.]

More information: http://embedded.cs.uni-saarland.de/chi.php
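As a conceptual illustration of how policies can be told apart by measurements, and not the actual inference algorithm of chi: the C sketch below simulates LRU and FIFO on a single 4-way set and counts misses on a crafted access sequence; the two policies produce different miss counts, so the sequence distinguishes them.

    #include <stdio.h>
    #include <string.h>

    #define WAYS 4

    typedef struct { char line[WAYS]; int fill; } Set;

    /* Simulate one access to block 'b'; returns 1 on a miss.
       line[] is kept ordered from next victim (index 0) to last victim. */
    static int access_block(Set *s, char b, int lru) {
        for (int i = 0; i < s->fill; i++) {
            if (s->line[i] == b) {                       /* hit */
                if (lru) {                               /* LRU: move to MRU slot */
                    memmove(&s->line[i], &s->line[i + 1], s->fill - i - 1);
                    s->line[s->fill - 1] = b;
                }                                        /* FIFO: order unchanged */
                return 0;
            }
        }
        if (s->fill == WAYS) {                           /* miss in a full set: evict */
            memmove(&s->line[0], &s->line[1], WAYS - 1);
            s->fill--;
        }
        s->line[s->fill++] = b;
        return 1;
    }

    int main(void) {
        const char *seq = "abcdaea";                     /* distinguishing sequence */
        for (int lru = 0; lru <= 1; lru++) {
            Set s = { {0}, 0 };
            int misses = 0;
            for (const char *p = seq; *p; p++)
                misses += access_block(&s, *p, lru);
            printf("%s: %d misses on \"%s\"\n", lru ? "LRU " : "FIFO", misses, seq);
        }
        return 0;
    }

On "abcdaea" the simulated LRU set incurs 5 misses and the FIFO set 6, so one crafted sequence already separates the two candidate policies.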


Modeling Challenge: Future Work

Extend automation to other parts of the microarchitecture:

• translation lookaside buffers, branch predictors
• shared caches in multicores, including their coherency protocols
• out-of-order pipelines?


The Analysis Challenge

The goal is precise and efficient timing analysis: bound

    WCET_H(P) := \max_{i \in \mathrm{Inputs}} \; \max_{h \in \mathrm{States}(H)} ET_H(P, i, h)

where the maximum over i ∈ Inputs considers all possible program inputs, and the maximum over h ∈ States(H) considers all possible initial states of the hardware H.


Explicitly evaluating ET for all inputs and all hardware states is not feasible in practice: there are simply too many.
→ We need abstraction, and thus approximation.
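For intuition only, here is a toy C version of the definition above, with a made-up one-bit input and one-bit hardware state; real input and state spaces are astronomically large, which is exactly why this enumeration is infeasible and abstraction is needed instead.

    #include <stdio.h>

    /* ET_H(P, i, h): a fictitious timing function of a toy program with one
       input-controlled branch and one cache line that is initially hot or cold. */
    static unsigned et(unsigned input_bit, unsigned cache_cold) {
        unsigned cycles = 10;                  /* base cost of the toy program    */
        if (input_bit)  cycles += 20;          /* the longer path is taken        */
        if (cache_cold) cycles += 50;          /* the first load misses the cache */
        return cycles;
    }

    int main(void) {
        unsigned wcet = 0;
        for (unsigned i = 0; i <= 1; i++)          /* all program inputs          */
            for (unsigned h = 0; h <= 1; h++) {    /* all initial hardware states */
                unsigned t = et(i, h);
                if (t > wcet) wcet = t;
            }
        printf("WCET of the toy program: %u cycles\n", wcet);
        return 0;
    }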


The Analysis Challenge: State of the Art

Component: caches, branch target buffers
Status: precise & efficient abstractions for LRU [Ferdinand, 1999]; not-so-precise but efficient abstractions for FIFO, PLRU, MRU [Grund and Reineke, 2008-2011].

Component: complex pipelines
Status: precise but very inefficient; little abstraction. Major challenge: timing anomalies.

Component: shared resources, e.g. busses, shared caches, DRAM
Status: no realistic approaches yet. Major challenge: interference between hardware threads → execution time depends on corunning tasks.

Timing Anomalies

A timing anomaly occurs when the local worst case does not imply the global worst case.

[Diagram: scheduling anomaly. Tasks A to E on two resources; the locally faster case for task A leads to a different dispatch order and a longer overall schedule.]

[Diagram: speculation anomaly. If A is a cache hit, a speculative prefetch of B is issued before the branch condition is evaluated, which in turn causes a miss for C; the locally better case (the hit) leads to the globally worse execution.]
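The following self-contained C example illustrates a scheduling anomaly of this kind under greedy, priority-based dispatching on two resources. The task set (durations, dependencies, priorities) is constructed for this sketch and is not the instance from the slide: shortening task A lengthens the overall schedule.

    #include <stdio.h>

    #define NTASKS 5
    #define NRES   2

    typedef struct {
        const char *name;
        int res;    /* resource the task runs on (0 or 1) */
        int dur;    /* duration in cycles */
        int dep;    /* index of the single predecessor, or -1 */
        int prio;   /* smaller value = dispatched first among ready tasks */
    } Task;

    static int makespan(int dur_A) {
        Task t[NTASKS] = {
            {"A", 0, dur_A, -1, 0},
            {"B", 1, 2,     -1, 0},
            {"C", 0, 3,      4, 0},   /* C depends on E */
            {"D", 1, 4,      0, 1},   /* D depends on A; preferred over E */
            {"E", 1, 5,     -1, 2},
        };
        int started[NTASKS] = {0}, finish[NTASKS] = {0};
        int busy_until[NRES] = {0, 0};
        int n_started = 0, end = 0;

        for (int now = 0; n_started < NTASKS; now++) {
            for (int r = 0; r < NRES; r++) {
                if (busy_until[r] > now) continue;          /* resource still busy */
                int pick = -1;
                for (int i = 0; i < NTASKS; i++) {
                    if (started[i] || t[i].res != r) continue;
                    int d = t[i].dep;
                    if (d >= 0 && !(started[d] && finish[d] <= now)) continue;
                    if (pick < 0 || t[i].prio < t[pick].prio) pick = i;
                }
                if (pick >= 0) {                            /* greedy dispatch */
                    started[pick] = 1;
                    finish[pick]  = now + t[pick].dur;
                    busy_until[r] = finish[pick];
                    n_started++;
                    if (finish[pick] > end) end = finish[pick];
                }
            }
        }
        return end;
    }

    int main(void) {
        printf("A takes 3 cycles -> overall schedule length %d\n", makespan(3));
        printf("A takes 1 cycle  -> overall schedule length %d (local best, global worst)\n",
               makespan(1));
        return 0;
    }

With the longer A the schedule finishes after 11 cycles, with the shorter A only after 14, because the early completion of A lets the higher-priority D grab resource 2 before E.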

The Design Challenge

Wanted: a multi-/many-core architecture with

• no timing anomalies → precise & efficient analysis of individual cores,
• temporal isolation between cores → independent/incremental development & analysis,

and high performance!


Approaches to the Design Challenge

At the level of individual cores:

• simple in-order pipelines, with static or no branch prediction
• scratchpad memories (software-controlled caches) or LRU caches

Approaches to the Design Challenge

For resources shared among multiple cores (a TDMA arbitration sketch follows below):

• Temporal partitioning, e.g.
  - TDMA arbitration of buses and shared memories
  - the thread-interleaved pipeline in PRET
• Spatial partitioning, e.g.
  - partition shared caches
  - partition shared DRAM

→ Temporal isolation
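A minimal C sketch of TDMA bus arbitration, with made-up slot length and core count: the worst-case wait until a given core may use the bus depends only on the TDMA schedule, not on what the other cores do, which is the temporal-isolation property argued for above. (A full analysis would also check that the remaining slot time suffices for the transfer.)

    #include <stdio.h>

    #define NCORES   4
    #define SLOT_LEN 10      /* bus cycles per TDMA slot (made-up value) */

    /* Core that owns the bus at a given point in time. */
    static int slot_owner(unsigned time) {
        return (time / SLOT_LEN) % NCORES;
    }

    /* Worst-case delay until core 'c' is granted the bus, over all request times
       within one TDMA round. Independent of what the other cores actually do. */
    static unsigned worst_case_wait(int c) {
        unsigned worst = 0;
        for (unsigned t = 0; t < NCORES * SLOT_LEN; t++) {
            unsigned wait = 0;
            while (slot_owner(t + wait) != c) wait++;
            if (wait > worst) worst = wait;
        }
        return worst;
    }

    int main(void) {
        for (int c = 0; c < NCORES; c++)
            printf("core %d: worst-case wait for the bus = %u cycles\n",
                   c, worst_case_wait(c));
        return 0;
    }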


Design Challenge: Predictable Pipelining

[Pipeline figure from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2007.]

Pipelining: Hazards

[Hazards figure from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2007.]

Forwarding helps, but not all the time…

    LD   R1, 45(R2)
    DADD R5, R1, R7
    BE   R5, R3, R0
    ST   R5, 48(R2)

[Diagram: pipeline charts (F D E M W) for this sequence, comparing the unpipelined execution, the ideal pipelined execution ("the dream"), and the real pipelined execution ("the reality") with memory, data, and branch hazard stalls.]


Solution: PTARM Thread-interleaved Pipelines [Lickly et al., CASES 2008]

[Diagram: five hardware threads T1-T5 rotate through the five pipeline stages F, D, E, M, W; in every cycle each thread occupies exactly one stage.]

Each thread occupies only one stage of the pipeline at a time:
→ no hazards; perfect utilization of the pipeline
→ simple hardware implementation (no forwarding, etc.)
→ each instruction takes the same amount of time
→ temporal isolation between different hardware threads

Drawback: reduced single-thread performance. (A small simulation of the stage rotation follows below.)
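A tiny C sketch of the round-robin stage rotation, assuming five threads and five stages as on the slide; it prints, for each cycle, which thread occupies which stage, making it visible that no thread ever has two instructions in flight at once.

    #include <stdio.h>

    #define NTHREADS 5
    #define NSTAGES  5

    static const char *stage_name[NSTAGES] = { "F", "D", "E", "M", "W" };

    int main(void) {
        /* In cycle t, stage s holds an instruction of thread (t - s) mod NTHREADS,
           so every thread occupies exactly one stage per cycle. */
        for (int t = 0; t < 10; t++) {
            printf("cycle %2d: ", t);
            for (int s = 0; s < NSTAGES; s++) {
                if (t - s >= 0)
                    printf("%s:T%d  ", stage_name[s], (t - s) % NTHREADS + 1);
                else
                    printf("%s:--  ", stage_name[s]);   /* stage not yet filled */
            }
            printf("\n");
        }
        return 0;
    }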


Design Challenge: DRAM Controller

The memory controller translates sequences of memory accesses by clients (CPUs and I/O) into legal sequences of DRAM commands. It

• needs to obey all timing constraints,
• needs to insert refresh commands sufficiently often, and
• needs to translate "physical" memory addresses into bank/row/column tuples (see the sketch below).

[Diagram: several CPUs and I/O devices connected through an interconnect with arbitration to the memory controller, which drives the DRAM module.]
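An illustrative C sketch of such an address translation; the field layout (8-byte bus words, 10 column bits, 3 bank bits, 14 row bits) is an assumption for the example, not the mapping of any particular controller.

    #include <stdio.h>
    #include <stdint.h>

    typedef struct { unsigned bank, row, col; } DramCoord;

    /* Assumed layout: 8-byte bus words, 10 column bits, 3 bank bits, 14 row bits. */
    static DramCoord map_address(uint64_t addr) {
        DramCoord c;
        addr >>= 3;                    /* byte offset within an 8-byte bus word */
        c.col  = addr & 0x3FF;         /* 10 column bits                        */
        addr >>= 10;
        c.bank = addr & 0x7;           /* 3 bank bits (bank interleaving)       */
        addr >>= 3;
        c.row  = addr & 0x3FFF;        /* 14 row bits                           */
        return c;
    }

    int main(void) {
        uint64_t addrs[] = { 0x00000000, 0x00002040, 0x01ABCDE8 };
        for (int i = 0; i < 3; i++) {
            DramCoord c = map_address(addrs[i]);
            printf("0x%08llx -> bank %u, row %u, col %u\n",
                   (unsigned long long)addrs[i], c.bank, c.row, c.col);
        }
        return 0;
    }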

Dynamic RAM Timing Constraints

[Diagram: a DIMM with two ranks of x16 DRAM devices sharing the address/command bus and the 64-bit data bus; inside each device, multiple banks share the I/O gating and control logic, and each bank has its own DRAM array with row decoder, sense amplifiers/row buffer, and column decoder.]

DRAM memory controllers have to conform to timing constraints that define minimal distances between consecutive DRAM commands. Almost all of these constraints are due to the sharing of resources at different levels of the hierarchy:

• rows within a bank share the sense amplifiers,
• banks within a DRAM device share I/O gating and control logic, and
• different ranks share the data/address/command busses.

In addition, refresh commands need to be inserted sufficiently often.

General-Purpose DRAM Controllers

• They schedule DRAM commands dynamically.
• The timing is hard to predict even for a single client, since the timing of a request depends on past requests:
  - request to the same or to a different bank?
  - request to an open or to a closed row within the bank?
  - the controller might reorder requests to minimize latency
  - controllers dynamically schedule refreshes
• There is no temporal isolation; the timing depends on the behavior of the other clients:
  - they influence the sequence of "past requests"
  - arbitration may or may not provide guarantees


General-Purpose DRAM Controllers

[Diagram: two threads each issue a stream of requests (Thread 1: Load B1.R3.C2, Load B2.R4.C3, Store B4.R3.C5; Thread 2: Load B3.R3.C2, Load B3.R5.C3, Store B2.R3.C5); arbitration interleaves them into a single stream seen by the memory controller, so the command schedule, and thus each request's latency, depends on the corunning thread.]


[Diagram: the request sequence Load B1.R3.C2, Load B1.R4.C3, Load B1.R3.C5 can be translated by the memory controller into different command sequences; depending on which row of bank 1 is currently open (and on possible reordering), the last load needs either a row activation plus a column access (RAS B1.R3, CAS B1.C5) or only a column access (CAS B1.C5).]
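To make the history dependence concrete, a small C sketch of per-access latency with a row-buffer model; the cycle counts and the simple precharge/activate/column-access breakdown are placeholder assumptions, not the parameters of a real device.

    #include <stdio.h>

    #define T_PRE 15   /* precharge: close the open row (placeholder value) */
    #define T_RAS 15   /* activate: open the requested row                  */
    #define T_CAS 15   /* column access                                     */

    /* open_row[bank] holds the currently open row, or -1 if the bank is precharged. */
    static unsigned access_latency(int open_row[], int bank, int row) {
        unsigned lat = T_CAS;
        if (open_row[bank] != row) {                  /* row miss */
            if (open_row[bank] != -1) lat += T_PRE;   /* close the open row first */
            lat += T_RAS;                             /* then open the needed row */
            open_row[bank] = row;
        }
        return lat;
    }

    int main(void) {
        int open_row[8] = { -1, -1, -1, -1, -1, -1, -1, -1 };
        struct { int bank, row; } reqs[] = { {1, 3}, {1, 4}, {1, 3} };
        for (int i = 0; i < 3; i++)
            printf("access bank %d, row %d: %u cycles\n",
                   reqs[i].bank, reqs[i].row,
                   access_latency(open_row, reqs[i].bank, reqs[i].row));
        return 0;
    }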


PRET DRAM Controller: Three Innovations [Reineke et al., CODES+ISSS 2011]

• Expose the internal structure of DRAM devices: expose the individual banks within the DRAM device as multiple independent resources.
• Defer refreshes to the end of transactions: allows the refresh latency to be hidden.
• Perform refreshes "manually": replace the standard refresh command with multiple reads.

[Diagram: CPUs and I/O connected through the interconnect with arbitration to the PRET DRAM controller, which treats each DRAM bank of the module as a separate resource.]

PRET DRAM Controller: Exploiting the Internal Structure of the DRAM Module

• A DRAM module consists of 4-8 banks in 1-2 ranks; the banks share only the command and data bus and are otherwise independent.
• Partition the banks into four groups in alternating ranks.
• Cycle through the groups in a time-triggered fashion:
  - successive accesses to the same group obey the timing constraints, and
  - reads/writes to different groups do not interfere.

→ This provides four independent and predictable resources.

Conventional DRAM Controller (DRAMSim2) vs PRET DRAM Controller: Latency Evaluation

[Plots: latencies of the conventional (DRAMSim2) and the PRET memory controller. One plot shows latency under varying interference (number of other threads occupied) for fixed transfer sizes of 1024 B and 4096 B; the other shows average latency for varying transfer size at maximal interference.]

More information: http://chess.eecs.berkeley.edu/pret/


Emerging Challenge: Microarchitecture Selection & Configuration

Given the embedded software and its timing requirements, which member of a family of microarchitectures (a platform) should be chosen?

Choices:
• processor frequency
• sizes and latencies of local memories
• latency and bandwidth of the interconnect
• presence of a floating-point unit

Select a microarchitecture that a) satisfies all timing requirements, and b) minimizes cost/size/energy.
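A toy C sketch of this selection step, with made-up configurations, WCET bounds, and costs; in practice the WCET bound for each configuration would come from the analysis discussed above.

    #include <stdio.h>

    typedef struct {
        const char *name;
        unsigned wcet_us;    /* assumed WCET bound on this configuration */
        unsigned cost;       /* abstract cost (price, area, or energy)   */
    } Config;

    int main(void) {
        const unsigned deadline_us = 450;
        const Config family[] = {
            { "100 MHz, 16 KB SPM, no FPU", 620, 10 },
            { "200 MHz, 16 KB SPM, no FPU", 430, 14 },
            { "200 MHz, 32 KB SPM, no FPU", 380, 17 },
            { "300 MHz, 32 KB SPM, FPU",    240, 25 },
        };
        const Config *best = NULL;
        for (size_t i = 0; i < sizeof family / sizeof family[0]; i++)
            if (family[i].wcet_us <= deadline_us &&
                (best == NULL || family[i].cost < best->cost))
                best = &family[i];
        if (best)
            printf("cheapest feasible configuration: %s (WCET %u us, cost %u)\n",
                   best->name, best->wcet_us, best->cost);
        else
            printf("no configuration meets the deadline\n");
        return 0;
    }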


Conclusions

Challenges in modeling, analysis, and design remain. Progress based on automation, abstraction, and partitioning has been made.

Thank you for your attention!