slide-1
SLIDE 1

Combining Data Remapping and Voltage/Frequency Scaling of Second Level Memory for Energy Reduction in Embedded Systems

Sudarshan K. Srinivasan, Jun Cheol Park and Vincent J. Mooney III Georgia Institute of Technology

{darshan, jcpark, mooney}@ece.gatech.edu

slide-2
SLIDE 2

ESCODES 24 Sep. 2002 Jun Cheol Park Georgia Institute of Technology 2

Outline

  • Introduction
  • Motivation
  • Related Work in Power Modeling
  • Experimental Setup
  • Data Remapping
  • Voltage/Frequency Scaling of Off-chip Memory and Bus
  • Experimental Results
  • Conclusion

slide-3
SLIDE 3

Introduction

  • Power/energy is a major issue in embedded systems
  • Mobile devices require longer usage time

slide-4
SLIDE 4

Introduction (Cont.)

  • Memory consumes up to 45% of the total system power*
  • Memory is a main target for power/energy reduction

[Figure: pie chart of system power split into memory and non-memory; block diagram of an embedded system with processor, L1 cache, off-chip bus, and off-chip memory]

*P. Panda, N. Dutt, and A. Nicolau. Memory Issues In Embedded Systems-On-Chip, Optimizations and Exploration. Kluwer Academic Publishers, 1999.

slide-5
SLIDE 5

Motivation

  • Voltage/frequency scaling: hardware technique, energy reduction
  • Data remapping: software technique, reduction in execution time (E.T.) and energy

[Figure: embedded system with processor + L1 cache and off-chip bus + L2 cache, annotated with the two techniques]

slide-6
SLIDE 6

Related Work in Power Modeling

  • SimpleScalar/ARM PowerAnalyzer*: cycle-level power/performance simulator
  • SimplePower**: architectural power estimation tool
  • SimplePower does not capture the energy of the processor's control unit or of clock generation

* http://www.eecs.umich.edu/~jringenb/power/
** http://www.cse.psu.edu/~mdl/software.htm

slide-7
SLIDE 7

Experimental Setup

System energy = processor core energy + off-chip bus energy + L1 and L2 caches energy

slide-8
SLIDE 8

Experimental Setup (Cont.)

Processor core power

[Figure: processor core power estimation flow. Blocks: benchmark program (C/C++), binary translation, ARM9-based system architecture, RTL description (Verilog), functional simulation (VCS), toggle rate (activity) generation, synthesize Verilog model, processor core power]

slide-9
SLIDE 9

Experimental Setup (Cont.)

Processor core power

  • MARS (Michigan ARM Simulator)
  • A cycle-accurate Verilog model of a RISC processor
  • Capable of running ARM instructions

slide-10
SLIDE 10

Experimental Setup (Cont.)

Processor core power

  • Collect the toggle rate of internal logic signals using Synopsys VCS simulation
  • Synthesize the Verilog model using the TSMC 0.25 µm library

slide-11
SLIDE 11

Experimental Setup (Cont.)

Processor core power

  • Estimate power using Synopsys Power Compiler

slide-12
SLIDE 12

Experimental Setup (Cont.)

Off-chip bus power

  • Bus capacitance obtained from an actual board
  • Printed circuit board with SA110 processor (Skiff board)

[Figure: off-chip bus power estimation flow. Blocks: benchmark program (C/C++), binary translation, ARM9-based system architecture, RTL description (Verilog), functional simulation (VCS), toggle rate (activity) generation, off-chip bus parameters (Skiff board), off-chip bus power]
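As a rough illustration of how a measured bus capacitance and a simulated toggle rate could be combined into a bus power figure, the sketch below uses the standard CMOS switching-power relation P = 0.5 · α · C · V² · f. The slides do not state the formula or the measured values, so the function name and every number here are assumptions.

```python
def bus_switching_power(toggles_per_cycle, capacitance_farads, voltage, frequency_hz):
    """Average switching power in watts.

    Assumes each toggle (0->1 or 1->0) dissipates 0.5 * C * V^2 on average,
    and that toggles_per_cycle is the total over all bus lines per bus clock.
    """
    energy_per_toggle = 0.5 * capacitance_farads * voltage ** 2
    return toggles_per_cycle * energy_per_toggle * frequency_hz


# Hypothetical inputs, NOT the Skiff board measurements.
power_w = bus_switching_power(
    toggles_per_cycle=8.0,      # assumed average number of line toggles per cycle
    capacitance_farads=20e-12,  # assumed 20 pF per bus line
    voltage=3.3,
    frequency_hz=100e6,
)
print(f"{power_w * 1e3:.1f} mW")
```

With real inputs, the toggle count would come from the functional simulation and the capacitance from the board measurement.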

slide-13
SLIDE 13

Experimental Setup (Cont.)

L1 and L2 caches energy

  • TRIMARAN*: integrated compilation and performance monitoring infrastructure
  • TRICEPS: ARM-like processor simulator; generates ARM code
  • SMACS (Smart Memory and Cache Hierarchy Simulator): cache activity statistics, Kamble and Ghose model**

[Figure: TRIMARAN flow with TRICEPS and SMACS; outputs: execution statistics, execution time, L1 and L2 caches energy]

*TRIMARAN http://www.trimaran.org
**M. Kamble and K. Ghose, "Analytical energy dissipation models for low power caches," Proceedings of the International Symposium on Low Power Electronics and Design, pp. 143-148, Aug. 1997.

slide-14
SLIDE 14

Experimental Setup (Cont.)

System energy = (processor core power + off-chip bus power) × execution time (from TRIMARAN) + L1 and L2 caches energy
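The energy bookkeeping on this slide can be sketched as follows: the two power figures are multiplied by the execution time, and the cache energy is added on top. The cycle count is the baseline value reported in the results; the 100 MHz clock is the processor frequency shown later in the deck. The power and cache-energy inputs are hypothetical placeholders, and the function names are illustrative only.

```python
CORE_CLOCK_HZ = 100e6  # processor clock shown later in the deck


def execution_time_s(cycles, clock_hz=CORE_CLOCK_HZ):
    """Convert a simulated cycle count into seconds."""
    return cycles / clock_hz


def system_energy_j(core_power_w, bus_power_w, cache_energy_j, exec_time_s):
    """Power terms are integrated over the run; cache energy is already in joules."""
    return (core_power_w + bus_power_w) * exec_time_s + cache_energy_j


t = execution_time_s(803_645_821)  # baseline cycle count from the results table
e = system_energy_j(
    core_power_w=1.0,    # hypothetical
    bus_power_w=0.1,     # hypothetical
    cache_energy_j=2.0,  # hypothetical
    exec_time_s=t,
)
print(f"time = {t:.3f} s, energy = {e:.3f} J")
```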

slide-15
SLIDE 15

Data Remapping*

  • A compile-time technique for performance enhancement and energy reduction
  • Remaps data so that items more likely to be used together are grouped into the same cache block
  • Enhances spatial locality

*K. Palem, R. Rabbah, P. Korkmaz, V. Mooney and K. Puttaswamy, "Design Space Optimization of Embedded Memory Systems via Data Remapping," Proceedings of the Languages, Compilers, and Tools for Embedded Systems (LCTES'02), pp. 28-37, June 2002.

slide-16
SLIDE 16

Data Remapping (Cont.)

Amount of data fetched before and after remapping (Traveling salesman problem in Olden Suite)

slide-17
SLIDE 17

Data Remapping (Cont.)

  • An item in memory is accessed by initiating a load of the contents of a memory location or address
  • Since a memory access is expensive, a set of adjacent memory locations is loaded at the same time and stored in a cache
  • The set of adjacent memory locations is known as a memory block
  • Blocks do not overlap and have the same size
  • Each address can be mapped to a block in memory

[Figure: data objects laid out across memory blocks]
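Because blocks are equal-sized and non-overlapping, the address-to-block mapping described above reduces to integer division. A minimal sketch, assuming a 32-byte block (the L2 line size quoted in the experiments) and hypothetical helper names:

```python
BLOCK_SIZE = 32  # bytes; assumed, matches the 32B L2 line size in the experiments


def block_index(address, block_size=BLOCK_SIZE):
    """Which memory block an address falls into."""
    return address // block_size


def block_offset(address, block_size=BLOCK_SIZE):
    """Position of the address inside its block."""
    return address % block_size


for addr in (0, 31, 32, 100):
    print(addr, block_index(addr), block_offset(addr))
```

Addresses 0 and 31 share a block, so loading one brings the other into the cache; address 32 starts the next block.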

slide-18
SLIDE 18

Data Remapping (Cont.)

  • Data reorganization is the relocation of data objects in memory

[Figure: data objects in memory blocks, with reorganization moving objects between blocks]

slide-19
SLIDE 19

Data Remapping (Cont.)

  • Analyze the application's memory access pattern, then remap data

[Figure: addresses mapped to memory blocks]
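To make the effect of remapping concrete, the toy sketch below counts how many memory blocks a traversal touches before and after frequently used fields are packed together. It is not the authors' remapping algorithm; the record layout, sizes, and names are all hypothetical and chosen only to illustrate the spatial-locality argument.

```python
BLOCK_SIZE = 32  # bytes per memory block (assumed)


def blocks_touched(addresses, block_size=BLOCK_SIZE):
    """Number of distinct blocks that must be fetched for these addresses."""
    return len({addr // block_size for addr in addresses})


NUM_RECORDS = 64     # hypothetical
RECORD_SIZE = 16     # bytes per record, of which only one field is "hot"
HOT_FIELD_SIZE = 4   # bytes

# Original layout: records back to back, hot field at the start of each record.
original = [i * RECORD_SIZE for i in range(NUM_RECORDS)]

# Remapped layout: the hot fields of all records are stored contiguously.
remapped = [i * HOT_FIELD_SIZE for i in range(NUM_RECORDS)]

before = blocks_touched(original)
after = blocks_touched(remapped)
print(f"blocks fetched: {before} before remapping, {after} after")
```

Fewer blocks fetched means less traffic to the off-chip memory for the same traversal.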

slide-20
SLIDE 20

Voltage/frequency scaling of off-chip memory and bus*

  • Scaling down the supply voltage of the off-chip bus and memory (L2 cache)
  • P is proportional to V²
  • Significant energy saving in L2 cache
  • Doubling the memory access latency
  • L2 cache miss rate affects system performance significantly

*K. Puttaswamy, K. Choi, J. C. Park, V. J. Mooney III, A. Chatterjee and P. Ellervee, "System Level Power-Performance Trade-Offs in Embedded Systems Using Voltage and Frequency Scaling of Off-Chip Buses and Memory," Proceedings of International Symposium on System Synthesis, to appear, October, 2002, Kyoto, Japan.
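A quick worked example of the V² relation, assuming only dynamic (switching) power of the form P = C · V² · f and ignoring leakage. The operating points, 3.3 V at 100 MHz scaled to 2.0 V at 50 MHz, are the L2 memory settings shown elsewhere in the deck; the function names are illustrative.

```python
def relative_energy_per_access(v_scaled, v_nominal):
    """Switching energy per access scales with V^2 (capacitance unchanged)."""
    return (v_scaled / v_nominal) ** 2


def relative_power(v_scaled, v_nominal, f_scaled, f_nominal):
    """Dynamic power scales with V^2 * f."""
    return relative_energy_per_access(v_scaled, v_nominal) * (f_scaled / f_nominal)


energy_ratio = relative_energy_per_access(2.0, 3.3)
power_ratio = relative_power(2.0, 3.3, 50e6, 100e6)
print(f"energy per access: {energy_ratio:.3f} of nominal")
print(f"dynamic power:     {power_ratio:.3f} of nominal")
```

Under these assumptions each L2 access costs roughly 37% of its nominal switching energy, which is the saving that must be weighed against the doubled access latency.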

slide-21
SLIDE 21

Voltage/frequency scaling of off-chip memory and bus (Cont.)

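[Figure: processor core and L2 memory operating points before and after scaling]

  • Before scaling: processor core at 100 MHz, 2.75 V; L2 memory at 100 MHz, 3.3 V
  • After scaling: processor core at 100 MHz, 2.75 V; L2 memory at 50 MHz, 2.0 V
  • Write buffer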

slide-22
SLIDE 22

Experimental Results

  • Two Olden benchmarks (Health and Perimeter) are used
  • The L2 cache and buses are scaled down to 2 V, 50 MHz
  • The benchmarks are remapped and simulated with a 50 MHz L2 cache
  • A system with half-size L1 and L2 caches is simulated
  • Data remapping can achieve the same execution time with half the cache resources

slide-23
SLIDE 23

Experimental Results (Cont.)

Configuration                 Execution cycles  Delay (s)  Energy (J)  Energy*Delay  % Energy reduction  % Energy*Delay reduction
Before DR, FVM                803645821         8.036      17.076      137.231       -                   -
After DR                      479612138         4.796      10.360      49.687        39.33               63.79
After FVM                     892552982         8.926      14.316      127.778       16.16               6.89
After DR+FVM                  578046486         5.78       9.274       53.608        45.69               60.94
After DR+FVM 1/2 size L1      603275469         6.033      9.468       57.118        44.55               58.38
After DR+FVM 1/2 size L2      711151104         7.112      11.158      79.35         34.66               42.18
After DR+FVM 1/2 size L1,L2   736311686         7.363      10.134      74.618        40.65               45.63

Energy delay with frequency/voltage scaling of memory (FVM) and data remapping (DR) for health benchmark (L1 32KB 16B/line, L2 1MB 32B/line)
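The derived rows of the table can be recomputed from the cycle counts and energies as a consistency check. In the sketch below, the cycle and energy values are copied from the table. The 100 MHz clock is an assumption inferred from the ratio of cycles to delay, and the helper names are illustrative.

```python
CLOCK_HZ = 100e6  # assumed: cycles / delay in the table is consistent with 100 MHz

# (execution cycles, energy in joules) copied from the table
RESULTS = {
    "Before DR, FVM": (803_645_821, 17.076),
    "After DR": (479_612_138, 10.360),
    "After FVM": (892_552_982, 14.316),
    "After DR+FVM": (578_046_486, 9.274),
    "After DR+FVM 1/2 size L1": (603_275_469, 9.468),
    "After DR+FVM 1/2 size L2": (711_151_104, 11.158),
    "After DR+FVM 1/2 size L1,L2": (736_311_686, 10.134),
}


def energy_delay(cycles, energy_j):
    """Energy*delay product, with delay derived from the cycle count."""
    return energy_j * cycles / CLOCK_HZ


def percent_reduction(value, baseline):
    return 100.0 * (1.0 - value / baseline)


base_cycles, base_energy = RESULTS["Before DR, FVM"]
base_ed = energy_delay(base_cycles, base_energy)

# name -> (% energy reduction, % energy*delay reduction)
reductions = {
    name: (
        percent_reduction(energy, base_energy),
        percent_reduction(energy_delay(cycles, energy), base_ed),
    )
    for name, (cycles, energy) in RESULTS.items()
}

for name, (d_energy, d_ed) in reductions.items():
    print(f"{name:30s} {d_energy:6.2f} {d_ed:6.2f}")
```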

slide-24
SLIDE 24

Experimental Results (Cont.)

[Figure: bar chart of % energy reduction and % energy*delay reduction (axis from 10 to 70) for Before DR, FVM; After DR; After FVM; After DR+FVM; After DR+FVM 1/2 size L1; After DR+FVM 1/2 size L2; After DR+FVM 1/2 size L1, L2]

  • Maximum of 46% energy reduction
  • Energy consumption of the cache reduced by half after halving L1 and L2 cache without performance loss

Energy delay with frequency/voltage scaling of memory (FVM) and data remapping (DR) for health benchmark (L1 32KB 16B/line, L2 1MB 32B/line)

slide-25
SLIDE 25

Conclusion

  • Combination of two techniques (HW & SW) to maximize energy reduction
  • Achieves 46% energy reduction without performance loss
  • Achieves 1/2 energy consumption with half-size cache at the same performance

slide-26
SLIDE 26

Thank you.