SLIDE 1
Multiprocessor Allocation and Scheduling using Advanced Optimization Technology

  • L. Benini, M. Milano, D. Bertozzi*
  • M. Lombardi, A. Guerri, M. Ruggiero

Università di Bologna, *Università di Ferrara

Embedded Application pull

[Figure: embedded-application pull — projected applications by year of introduction (2005–2015), with energy-efficiency targets rising from 5 GOPS/W to 1 TOPS/W [IMEC]. Examples: sign recognition, A/V streaming, adaptive routing, collision avoidance, autonomous driving, 3D projected display, gesture/emotion/expression recognition, structured and H.264 encoding/decoding, 3D TV and gaming, 3D ambient interaction, language dictation, image and face recognition (security), ubiquitous navigation, Gbit radio, UWB, 802.11n, mobile baseband.]

SLIDE 2

MPSoC – 2005 ITRS roadmap

[Figure: 2005 ITRS roadmap — the projected number of processing engines per SoC grows from 16 in 2005 to 878 in 2020 (right axis), with total logic and memory size normalized to 2005 (left axis). [Martin06]]

MPSoC Platform Evolution

[Figure: projected MPSoC platform at 45 nm — tiles of <4 mm² running below 1 GHz, ~30 Mtr each, with a local memory hierarchy, network interface, power/test management and router; a bus-based multiprocessor with 3D stacked main memory and I/O peripherals; a software stack of applications, software optimizations, middleware, RTOS and APIs, and a run-time controller for mapping & scheduling of V, Vt, Fclk, IL.]

SLIDE 3

Design as optimization

Design space — the set of "all" possible design choices.
Constraints — solutions that we are not willing to accept.
Cost function — a property we are interested in (execution time, power, reliability…).

When & Why Offline Optimization?

Plenty of design-time knowledge:
  applications are pre-characterized at design time;
  dynamic transitions occur between different pre-characterized scenarios.

Aggressive exploitation of system resources:
  reduces overdesign (lowers cost);
  gives strong performance guarantees.

Applicable to many embedded applications.
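The three ingredients above fit in a few lines of code. The following is a minimal illustrative sketch, not taken from the slides: the design space is the set of clock-divider settings of a 200 MHz core, the constraint rejects settings that miss a deadline, and the cost function is energy. The workload and energy numbers are invented.

```c
#define N_POINTS 4
#define TASK_CYCLES 1000000.0   /* invented workload: 1 Mcycle   */
#define BASE_HZ 200000000.0     /* 200 MHz baseline system clock */

/* design space: clock dividers and (invented) energy costs */
static const int    divider[N_POINTS]   = {1, 2, 4, 8};
static const double energy_uj[N_POINTS] = {8.0, 4.5, 2.8, 2.0};

/* execution time of the task at BASE_HZ / divider[i], in milliseconds */
static double exec_ms(int i) {
    return TASK_CYCLES / (BASE_HZ / divider[i]) * 1000.0;
}

/* exhaustive design-space search: keep only points that meet the deadline
 * (constraint), return the index that minimizes energy (cost function),
 * or -1 if the whole design space is infeasible */
int best_point(double deadline_ms) {
    int best = -1;
    for (int i = 0; i < N_POINTS; i++) {
        if (exec_ms(i) > deadline_ms) continue;
        if (best < 0 || energy_uj[i] < energy_uj[best]) best = i;
    }
    return best;
}
```

With a 25 ms deadline the search discards divider 8 (40 ms) and picks divider 4, the cheapest feasible point.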

SLIDE 4

Application Mapping

The problem of allocating and scheduling task graphs on multiprocessors in a distributed real-time system is NP-hard.

New tool flows are needed for efficient mapping of multi-task applications onto hardware platforms.

[Figure: a task graph T1…T8 is allocated to processors Proc. 1…Proc. N (each with a private memory, connected by an interconnect) and scheduled on the time axis so that all tasks finish within the deadline.]

Scheduling & Voltage Scaling

[Figure: power vs. time for tasks τ1, τ2, τ3 on a CPU with scalable Vdd and Vbs — running at lower frequencies f1, f2, f3 fills the slack before the deadline and trades speed for energy.]

Different voltages yield different frequencies; in plain mapping and scheduling the frequency is given (the fastest one).

Voltage and frequency scaling make the problem even harder! Current off-line approaches solve mapping, scheduling and voltage selection separately (sequentially).

SLIDE 5

Target architecture

  • Homogeneous computation tiles: ARM cores (including instruction and data caches) with tightly coupled, software-controlled scratch-pad memories (SPM);
  • AMBA AHB interconnect;
  • DMA engine;
  • RTEMS OS;
  • Power models for a 0.13 µm technology (STM);
  • Variable voltage/frequency cores with discrete (Vdd, f) pairs;
  • Frequency dividers scale down the baseline 200 MHz system clock;
  • Cores communicate through non-cacheable shared memory;
  • Semaphore and interrupt facilities are used for synchronization.

[Figure: target architecture — tiles with synchronization modules and private memories, plus a shared memory, all attached to the AMBA AHB interconnect; a programmable clock-tree generator (with an interconnect slave and programming register) derives per-tile clocks CLOCK 1…CLOCK N from the system clock.]

Task graph

A set of tasks T with task dependencies. Execution times are expressed in clock cycles: WCN(Ti). Communication times (writes & reads) are expressed as WCN(WTiTj) and WCN(RTiTj).

These values can be back-annotated from functional simulation or computed using WCET analysis tools (e.g. AbsInt).

Node types: Normal; Fork, And; Branch, Or.

Application model

[Figure: example task graph Task1…Task6, each node annotated with its WCN(Ti) and each arc (i, j) with WCN(WTiTj) and WCN(RTiTj).]
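As a rough sketch of how the WCN annotations above might be stored, consider the following structures. The field names and all cycle counts are invented for illustration, not taken from the tool.

```c
/* one directed edge of the task graph: task src writes, task dst reads */
typedef struct {
    int  src, dst;
    long wcn_write;   /* WCN(W T_src T_dst): write cost in cycles */
    long wcn_read;    /* WCN(R T_src T_dst): read cost in cycles  */
} Edge;

#define N_TASKS 3
#define N_EDGES 2

/* WCN(Ti): worst-case cycle count of each task (invented values) */
static const long wcn_task[N_TASKS] = {12000, 8000, 5000};

static const Edge edges[N_EDGES] = {
    {0, 1, 300, 300},   /* T1 -> T2 */
    {0, 2, 450, 450},   /* T1 -> T3 */
};

/* worst-case cycles charged to task t: its own computation plus the
 * write half of its outgoing edges and the read half of incoming ones */
long task_load(int t) {
    long load = wcn_task[t];
    for (int e = 0; e < N_EDGES; e++) {
        if (edges[e].src == t) load += edges[e].wcn_write;
        if (edges[e].dst == t) load += edges[e].wcn_read;
    }
    return load;
}
```

Such per-task loads are exactly what a back-annotation step would feed into the optimizer's duration parameters.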

SLIDE 6

Task memory requirements

[Figure: two tiles (#1, #2) on the system bus, each with an ARM core, interrupt controller, semaphores, SPM and private memory.]

Each task has three kinds of memory requirements:
  Program Data;
  Internal State;
  Communication queues.

Communicating tasks might run:
  on the same processor → negligible communication cost;
  on different processors → costly message-exchange procedure.

Task storage can be allocated by the optimizer:
  on the local SPM;
  on the remote private memory.

SLIDE 7

Application Development Flow

[Figure: application development flow — a communication task graph (CTG) feeds a characterization phase (simulator) that produces application profiles; an optimization phase (optimizer) then computes allocation and scheduling, yielding the optimal SW application implementation for platform execution.]

Optimization framework

Deterministic & stochastic task graphs.

Constraints:
  resources — computation, communication, storage;
  timing — task deadlines, makespan.

Objective functions:
  performance (e.g. makespan), power (energy), bus utilization.

This general modeling framework yields highly unstructured optimization problems: no black-box/generic optimizer can solve them efficiently. We developed a flexible algorithmic framework which is tuned to specific problems.

SLIDE 8

Logic Based Benders Decomposition

[Diagram: the objective function (communication cost & energy consumption) is split between a master — allocation & frequency assignment by INTEGER PROGRAMMING, subject to memory constraints — and a subproblem — scheduling by CONSTRAINT PROGRAMMING, subject to timing constraints. Valid allocations pass from master to subproblem; infeasibilities return to the master as linear no-good constraints.]

  • Decomposes the problem into 2 sub-problems:
    Allocation & assignment (& frequency setting) → IP. Objective function: e.g. minimizing the energy consumed during execution and communication of tasks.
    Scheduling → CP. Objective function: e.g. minimizing the energy consumed during frequency switching.

Computational scalability

Against simplified pure CP and IP formulations, the hybrid approach clearly outperforms both. With search time bounded to 1000 s, CP and IP found a solution in only about 50% of the instances, while the hybrid approach always found one (deterministic task graphs, mapping & scheduling, instances of 16 to 100 tasks).
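The master/subproblem loop can be illustrated with a deliberately tiny stand-in — this is not the paper's IP/CP machinery, just its shape. The "master" enumerates allocations of three tasks to two processors minimizing communication cost, the "subproblem" checks that each processor's work fits a deadline, and infeasible allocations come back as no-good cuts. All durations and costs are invented.

```c
#define N_TASKS 3
#define DEADLINE 8

static const int dur[N_TASKS] = {4, 4, 4};
/* comm[i][j]: cost paid when i and j end up on different processors */
static const int comm[N_TASKS][N_TASKS] = {
    {0, 3, 0},
    {3, 0, 5},
    {0, 5, 0},
};

static int cut[1 << N_TASKS];   /* no-good cuts returned by the subproblem */

/* master (IP stand-in): cheapest allocation not excluded by a cut;
 * bit t of the mask is the processor of task t */
static int master(void) {
    int best = -1, best_cost = 0;
    for (int m = 0; m < (1 << N_TASKS); m++) {
        if (cut[m]) continue;
        int cost = 0;
        for (int i = 0; i < N_TASKS; i++)
            for (int j = i + 1; j < N_TASKS; j++)
                if (((m >> i) & 1) != ((m >> j) & 1)) cost += comm[i][j];
        if (best < 0 || cost < best_cost) { best = m; best_cost = cost; }
    }
    return best;
}

/* subproblem (CP stand-in): a non-preemptive schedule of independent
 * tasks is feasible iff each processor's total work fits the deadline */
static int schedulable(int m) {
    int load[2] = {0, 0};
    for (int t = 0; t < N_TASKS; t++) load[(m >> t) & 1] += dur[t];
    return load[0] <= DEADLINE && load[1] <= DEADLINE;
}

/* logic-based Benders loop: solve master, check subproblem, cut, repeat */
int solve(void) {
    for (;;) {
        int m = master();
        if (m < 0) return -1;          /* no allocation left: infeasible */
        if (schedulable(m)) return m;  /* valid allocation found         */
        cut[m] = 1;                    /* no-good: forbid this allocation */
    }
}
```

Here the two zero-cost allocations (everything on one processor) overload that processor, so both are cut; the loop then returns the cheapest schedulable split.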

SLIDE 9

Computational Scalability

Hundreds of decision variables — far beyond the capability of an ILP solver or CP solver alone: deterministic task graphs (mapping & scheduling & (v, f) selection) and stochastic task graphs (mapping & scheduling & minimum bus usage).

Optimality gap: comparison with a heuristic 2-phase solution (GA) shows a "timing barrier" — the gap becomes significant when constraints are tight.

SLIDE 10

Optimization Development

The abstraction gap between high-level optimization tools and standard application programming models can introduce unpredictable and undesired behaviours. Programmers must be conscious of the simplifying assumptions made by the optimization tools.

[Diagram: flow from platform modelling through optimization and analysis to an optimal solution and a starting implementation; the abstraction gap lies between that starting point and the final implementation executed on the platform.]

Challenge: the Abstraction Gap

Validation of optimizer solutions: throughput

The optimal allocation & schedule was validated on a virtual platform over 250 instances:
  MAX error lower than 10%;
  AVG error equal to 4.51%, with a standard deviation of 1.94;
  all deadlines are met.

[Histogram: probability (%) vs. throughput difference (%) between optimizer prediction and virtual-platform measurement, over the range −5% to 11%.]

SLIDE 11

Validation of optimizer solutions: power

The optimal allocation & schedule was validated on a virtual platform over 250 instances:
  MAX error lower than 10%;
  AVG error equal to 4.80%, with a standard deviation of 1.71.

[Histogram: probability (%) vs. energy-consumption difference (%) between optimizer prediction and virtual-platform measurement, over the range −5% to 11%.]

GSM Encoder

Throughput required: 1 frame / 10 ms, with 2 processors and 4 possible frequency & voltage settings.

Task graph: 10 computational tasks; 15 communication tasks.

Energy without optimizations: 50.9 µJ; with optimizations: 17.1 µJ → a 66.4% saving.
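A quick check of the saving figure, assuming it is computed as the relative energy reduction:

```c
/* relative energy reduction, in percent */
double energy_saving_pct(double baseline_uj, double optimized_uj) {
    return (baseline_uj - optimized_uj) / baseline_uj * 100.0;
}
```

With the slide's numbers, (50.9 − 17.1) / 50.9 ≈ 66.4%, matching the claim.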
SLIDE 12

Challenge: programming environment

A software development toolkit helps programmers with the software implementation:
  a generic customizable application template (OFFLINE SUPPORT);
  a set of high-level APIs in the RT-OS (RTEMS) (ONLINE SUPPORT).

The main goals are:
  predictable application execution after the optimization step;
  guarantees on high performance and constraint satisfaction.

Starting from a high-level task and data-flow graph, software developers can easily and quickly build their application infrastructure. Programmers can intuitively translate the high-level representation into C code using our facilities and library.

    #define N_CPU 2

    // Node Behaviour: 0 AND; 1 OR; 2 FORK; 3 BRANCH
    uint node_behaviour[TASK_NUMBER] = {2,3,3,..};

    // Node Type: 0 NORMAL; 1 BRANCH; 2 STOCHASTIC
    uint node_type[TASK_NUMBER] = {1,2,2,1,..};

    uint task_on_core[TASK_NUMBER] = {1,1,2,1};
    int schedule_on_core[N_CPU][TASK_NUMBER] = {{1,2,4,8}..};

    uint queue_consumer[..][..] = { {0,1,1,0,..}, {0,0,0,1,1,.},
                                    {0,0,0,0,0,1,1..}, {0,0,0,0,..}..};
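A hypothetical static dispatcher built on such template arrays might look as follows. The 4-task instance, the 0-terminated schedule rows, and the run_task stub are all invented for illustration; they are not the toolkit's actual code.

```c
#include <assert.h>

#define TASK_NUMBER 4
#define N_CPU 2

/* hypothetical 4-task instance of the template arrays: task t runs on
 * core task_on_core[t]; each core executes its row of schedule_on_core
 * in order (tasks are numbered from 1, 0 marks the end of the row) */
static const unsigned task_on_core[TASK_NUMBER] = {1, 1, 2, 1};
static const int schedule_on_core[N_CPU][TASK_NUMBER] = {
    {1, 2, 4, 0},   /* core 1 */
    {3, 0, 0, 0},   /* core 2 */
};

/* execution trace, recorded so the order can be inspected */
int trace[TASK_NUMBER];
int traced = 0;

static void run_task(int t) { trace[traced++] = t; }  /* stub body */

/* static dispatcher for core `cpu` (1-based): walk its schedule row */
void dispatch(int cpu) {
    for (int s = 0; s < TASK_NUMBER; s++) {
        int t = schedule_on_core[cpu - 1][s];
        if (t == 0) break;
        /* the allocation and schedule tables must agree */
        assert(task_on_core[t - 1] == (unsigned)cpu);
        run_task(t);
    }
}
```

Running dispatch(1) then dispatch(2) executes tasks 1, 2, 4 on core 1 and task 3 on core 2, exactly the order fixed offline by the optimizer.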

Example

  • Number of nodes (e.g. 12): #define TASK_NUMBER 12
  • Graph of activities
  • Node type: Normal, Branch, Conditional, Terminator
  • Node behaviour: Or, And, Fork, Branch
  • Number of CPUs: 2
  • Task allocation
  • Task scheduling
  • Arc priorities

[Figure: a 12-node example graph (N1, B2, B3, C4…C7, N8…N11, T12, corresponding to tasks T1…T12) with fork, or, and, and branch behaviours and arc priorities a1…a14, mapped onto processors P1 and P2 and scheduled on the time axis within the deadline.]

SLIDE 13

Relationship with RT techniques

We can handle periodic task graphs; multiple rates can be analyzed by unrolling and periodic extension. We cannot deal with aperiodic/sporadic tasks unknown at design time — they would require unbounded unrolling. The framework currently assumes non-preemptive scheduling.

Summary & future work

Toward a mature SDK:
  mature programmer support (Eclipse toolkit, OpenMAX support);
  extended semantics (multi-rate SDF);
  ports to real platforms (Cell BE underway, Nomadik under discussion).

Optimization engine enhancements:
  dealing with multiple use cases;
  variable execution times;
  aggressive communication scheduling on NoCs;
  addressing preemption and sporadic tasks.

SLIDE 14

Backup Slides Allocation problem model

Decision variables:
  X_{tfp} = 1 if task t executes on processor p at frequency f;
  W_{ijfp} = 1 if tasks i and j run on different cores and task i, on core p, writes data to j at frequency f;
  R_{ijfp} = 1 if tasks i and j run on different cores and task j, on core p, reads data from i at frequency f.

Constraints and objective:

  \sum_{p=1}^{P} \sum_{f=1}^{M} X_{tfp} = 1  \quad \forall t
  \sum_{p=1}^{P} \sum_{f=1}^{M} W_{ijfp} \le 1  \quad \forall (i,j) \in T
  \sum_{p=1}^{P} \sum_{f=1}^{M} R_{ijfp} \le 1  \quad \forall (i,j) \in T
  \sum_{p=1}^{P} \sum_{f=1}^{M} ( W_{ijfp} - R_{ijfp} ) = 0  \quad \forall (i,j) \in T
  OF = En_{Comp} + En_{Write} + En_{Read}

Each task can execute on only one processor at one frequency. Communication between tasks can execute only once per execution, and one write corresponds to one read. The objective function minimizes the energy consumption associated with task execution and communication.

SLIDE 15

Allocation problem model

[Figure: two CPUs and a shared memory on the system bus.]

  En_{Comp}  = \sum_{t=1}^{T} \sum_{f=1}^{M} \sum_{p=1}^{P} X_{tfp} \, WCN(T_t) \, E_f
  En_{Write} = \sum_{t=1}^{T} \sum_{f=1}^{M} \sum_{p=1}^{P} ( X_{ifp} \, WCN(LocW_{ij}) + W_{ijfp} ( WCN(mW_{ij}) - WCN(LocW_{ij}) ) ) \, E_f
  En_{Read}  = \sum_{t=1}^{T} \sum_{f=1}^{M} \sum_{p=1}^{P} ( X_{ifp} \, WCN(LocR_{ij}) + R_{ijfp} ( WCN(mR_{ij}) - WCN(LocR_{ij}) ) ) \, E_f
  OF = En_{Comp} + En_{Write} + En_{Read}

En_{Comp} is the computation energy for all tasks in the system. En_{Write} is the communication energy for writes to shared memory; writes are carried out at the same frequency as the task. En_{Read} is the communication energy for reads from shared memory; reads are carried out at the same frequency as the task.
Scheduling problem model

Tasks have a phased behaviour made of atomic activities:
  INPUT — input data reading;
  EXEC — computation activity;
  OUTPUT — output data writing.

The objective function minimizes the energy consumption associated with frequency switching.

  • Processors are modelled as unary resources.
  • The bus is modelled as an additive resource.

The duration of task i is fixed once its mode (frequency) is fixed.

[Figure: task execution — a reading phase of forked parallel input reads that join into the exec activity, followed by a writing phase of parallel output writes, for tasks i and j.]

Precedence constraints between task i and a dependent task j:
  Start_i + duration_i \le Start_j   (tasks on the same processor at the same frequency)
  Start_i + duration_i + T \le Start_j   (tasks on the same processor at different frequencies; T is the switching overhead)
  Start_i + duration_i + dWrite_{ij} + dRead_{ij} \le Start_j   (tasks on different processors)
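The three precedence cases can be sketched as a single feasibility check. The Sched struct, the switch-overhead parameter t_switch, and the d_write/d_read durations are illustrative assumptions, not the paper's actual CP model.

```c
/* per-task schedule data (illustrative; start and dur in cycles) */
typedef struct { int start, dur, proc, freq; } Sched;

/* Checks Start_i + duration_i (+ overhead) <= Start_j in the three
 * cases: same processor & frequency, same processor with a frequency
 * switch (t_switch), or different processors (write + read transfer). */
int precedence_ok(Sched i, Sched j, int t_switch, int d_write, int d_read) {
    if (i.proc == j.proc && i.freq == j.freq)
        return i.start + i.dur <= j.start;            /* same proc, same freq */
    if (i.proc == j.proc)
        return i.start + i.dur + t_switch <= j.start; /* same proc, switch    */
    return i.start + i.dur + d_write + d_read <= j.start; /* different procs */
}
```

A scheduler stand-in would call this for every arc of the task graph to validate a candidate schedule.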

SLIDE 16

Queue ordering optimization

Communication ordering affects system performance.

[Figure: tasks T1…T6 and communications C1…C5 on CPU1 and CPU2 — depending on the order in which the communication queues are served, a ready task must wait ("Wait!") or can run immediately ("RUN!").]

SLIDE 17

Synchronization among tasks

[Figure: tasks T1…T4 with communications C1…C3, T1 mapped on Proc. 1 and T2, T3, T4 on Proc. 2; T4 is suspended on a semaphore while waiting for its data and re-activated when they arrive — semaphores do not busy-block the processor.]