

slide-1
SLIDE 1

PACT 2013

Parallel Frame Rendering: Trading Responsiveness for Energy on a Mobile GPU

09 / September / 2013

Jose-Maria Arnau (1)

jarnau@ac.upc.edu

Joan-Manuel Parcerisa (1)

jmanel@ac.upc.edu

Polychronis Xekalakis (2)

polychronis.xekalakis@intel.com

(1) Universitat Politecnica de Catalunya
(2) Intel Labs, Intel Corporation

slide-3
SLIDE 3

Bandwidth Usage for Graphics

Textures and 3D models from: http://www.turbosquid.com

62%

slide-6
SLIDE 6

Texture Reuse

[Figure: Frame i and Frame i+1]

86% of the texture dataset is shared

Mobile games exhibit a high degree of texture similarity between consecutive frames
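As a rough illustration of how a sharing figure like this could be measured, the sketch below compares the sets of texture blocks touched by two consecutive frames. The block addresses and the `shared_fraction` helper are hypothetical, and the slides do not say exactly how the shared percentage was computed; here it is taken, as an assumption, relative to the blocks of the second frame.

```python
def shared_fraction(frame_a, frame_b):
    """Fraction of frame_b's distinct texture blocks that frame_a also touched.

    Assumption: sharing is measured relative to the second frame's dataset.
    """
    a, b = set(frame_a), set(frame_b)
    if not b:
        return 0.0
    return len(a & b) / len(b)


# Hypothetical texture block addresses touched by two consecutive frames.
frame_i = [0x1000, 0x1040, 0x1080, 0x10C0, 0x2000]
frame_i_plus_1 = [0x1040, 0x1080, 0x10C0, 0x2000, 0x3000]

print(f"shared: {shared_fraction(frame_i, frame_i_plus_1):.0%}")
```

A real measurement would use the texture addresses recorded by a simulator or a trace tool rather than hand-written lists.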

slide-7
SLIDE 7

Texture Reuse

Mobile games exhibit a high degree of texture similarity between consecutive frames

slide-9
SLIDE 9

Outline

  • 1. Motivation
  • 2. Conventional Rendering
  • 3. Parallel Frame Rendering
  • 4. Experimental Results
  • 5. Conclusions

slide-15
SLIDE 15

Conventional Tile-Based Rendering

[Figure: block diagram.
CPU: Process user inputs, Physical simulation, Dispatch drawing commands.
GPU: Command Processor, Geometry Unit, Tiling Engine, Raster Unit 0, Raster Unit 1, L2 Cache, Memory Controller.
System Memory: Color Buffer, Textures, Geometry.
Annotation: Capacity miss.]

[Figure: timeline (Time axis) with CPU stage, GPU stage and Screen refresh rows for frames F0 to F4.]

slide-17
SLIDE 17

L2 Cache Reuse Distances

The L2 Cache cannot capture the inter-frame texture reuse due to the huge distances
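One common way to quantify this is the LRU stack (reuse) distance: the number of distinct blocks touched between two accesses to the same block. The sketch below is a minimal, unoptimized version run on a made-up trace; the `reuse_distances` name and the trace are illustrative and are not taken from the tooling behind the slides.

```python
from collections import OrderedDict


def reuse_distances(trace):
    """Return, per access, the number of distinct blocks touched since the
    previous access to the same block (None for a first touch)."""
    stack = OrderedDict()  # least recently used block first
    distances = []
    for block in trace:
        if block in stack:
            keys = list(stack)
            # Blocks after `block` in LRU order were touched more recently.
            distances.append(len(keys) - 1 - keys.index(block))
            stack.move_to_end(block)
        else:
            distances.append(None)
            stack[block] = True
    return distances


# Two "frames" that each touch the same four blocks once, back to back.
trace = list(range(4)) * 2
print(reuse_distances(trace))  # [None, None, None, None, 3, 3, 3, 3]
```

In a fully associative LRU cache holding C blocks, an access hits only if its distance is below C. When one frame touches more distinct blocks than the cache holds, every reuse of the previous frame's data therefore misses.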

slide-18
SLIDE 18

Outline

  • 1. Motivation
  • 2. Conventional Rendering
  • 3. Parallel Frame Rendering
  • 4. Experimental Results
  • 5. Conclusions

slide-25
SLIDE 25

Parallel Frame Rendering

[Figure: Conventional GPU (Geometry Unit, Tiling Engine, Raster Unit 0, Raster Unit 1, L2 Cache, Memory Controller) next to a Clustered GPU (Cluster 0 with Geometry Unit 0, Tiling Engine 0, Raster Unit 0; Cluster 1 with Geometry Unit 1, Tiling Engine 1, Raster Unit 1; Shared L2 Cache; Memory Controller). Timelines (Time axis): Render Frame 0 and Render Frame 1 for each design.]

Textures are fetched once in the shared L2 cache and accessed by the 2 clusters in a short timespan

slide-31
SLIDE 31

Parallel Frame Rendering

[Figure: two timelines (Time axis) with CPU stage, GPU stage and Screen refresh rows.
Conventional rendering: CPU F0 to F4, GPU F0 to F4, screen refresh F0 to F4.
PFR: CPU F0 to F4; GPU Cluster 0 - F0 and GPU Cluster 1 - F1, then GPU Cluster 0 - F2 and GPU Cluster 1 - F3; screen refresh F0 to F2.
Annotations: Input lag (marked on both timelines); Smaller reuse distance; Bigger input lag.]

PFR saves bandwidth by overlapping the memory accesses of 2 consecutive frames
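The effect can be illustrated with a toy LRU cache model. This is a sketch under strong simplifying assumptions that are not from the slides: both frames touch exactly the same texture blocks in the same order, the cache is fully associative, and the two clusters advance in lockstep. The names `count_misses`, `textures` and `capacity` are hypothetical.

```python
from collections import OrderedDict


def count_misses(trace, capacity):
    """Count misses of a fully associative LRU cache of `capacity` blocks."""
    cache = OrderedDict()  # least recently used block first
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)
        else:
            misses += 1
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the LRU block
    return misses


textures = list(range(1000))  # blocks touched by one frame (made up)
capacity = 100                # cache much smaller than the frame's dataset

# Conventional rendering: all of frame i, then all of frame i+1.
sequential = textures + textures
# PFR-like: the two frames access each block back to back.
interleaved = [b for block in textures for b in (block, block)]

print(count_misses(sequential, capacity))   # 2000
print(count_misses(interleaved, capacity))  # 1000
```

In this idealized setting the interleaved trace fetches each block from memory once instead of twice. Real frames differ and the clusters drift apart, so actual savings would be smaller.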


slide-32
SLIDE 32

User Inputs in Mobile Games

The user does not provide any input most of the time, except in hotpursuit and ibowl

slide-36
SLIDE 36

Reactive-PFR

State transitions:

Conventional Rendering (1 frame) -> Parallel Frame Rendering (2 frames): T frames without user inputs

Parallel Frame Rendering (2 frames) -> Conventional Rendering (1 frame): User provides inputs
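The two-state behaviour can be sketched as a small controller. This is an illustrative model rather than the authors' implementation: the class name `ReactivePFR`, the per-frame `on_frame` call, the choice to start in conventional mode, and switching exactly on the T-th idle frame are all assumptions.

```python
class ReactivePFR:
    """Toy two-mode controller: 1 frame (conventional) or 2 frames (PFR)."""

    def __init__(self, idle_threshold):
        self.idle_threshold = idle_threshold  # T in the slide
        self.idle_frames = 0                  # consecutive frames without input
        self.frames_in_parallel = 1           # assumed initial mode

    def on_frame(self, has_user_input):
        """Update the mode for one frame and return frames rendered in parallel."""
        if has_user_input:
            # Any input sends the GPU back to conventional rendering.
            self.idle_frames = 0
            self.frames_in_parallel = 1
        else:
            self.idle_frames += 1
            if self.idle_frames >= self.idle_threshold:
                self.frames_in_parallel = 2
        return self.frames_in_parallel


ctrl = ReactivePFR(idle_threshold=3)
inputs = [False, False, False, False, True, False]
print([ctrl.on_frame(x) for x in inputs])  # [1, 1, 2, 2, 1, 1]
```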

slide-41
SLIDE 41

N-Frames Reactive-PFR

[Figure: clustered GPU with four clusters (Cluster 0 to Cluster 3, each with its own Geometry Unit, Tiling Engine and Raster Unit), a Shared L2 Cache and a Memory Controller.]

Cluster-to-frame assignment:

1 frame: Cluster 0, 1, 2 and 3 work on Frame i

2 frames: Cluster 0 and 1 work on Frame i; Cluster 2 and 3 work on Frame i + 1

4 frames: Cluster 0: Frame i; Cluster 1: Frame i + 1; Cluster 2: Frame i + 2; Cluster 3: Frame i + 3

State transitions:

Conventional Rendering (1 frame) -> PFR (2 frames): T1 frames without user inputs

PFR (2 frames) -> PFR (4 frames): T2 frames without user inputs

PFR (2 frames) -> Conventional Rendering (1 frame): User provides inputs

PFR (4 frames) -> Conventional Rendering (1 frame): User provides inputs
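The cluster-to-frame mapping for the three modes follows one pattern, which can be written as a single helper. The function name `frame_offset` and its signature are hypothetical; the sketch only assumes that clusters are split into equal contiguous groups, one group per frame in flight.

```python
def frame_offset(cluster, num_clusters, frames_in_parallel):
    """Return k such that `cluster` works on Frame i + k.

    Assumes clusters are divided into equal, contiguous groups.
    """
    if frames_in_parallel <= 0 or num_clusters % frames_in_parallel != 0:
        raise ValueError("frames_in_parallel must evenly divide num_clusters")
    clusters_per_frame = num_clusters // frames_in_parallel
    return cluster // clusters_per_frame


for frames in (1, 2, 4):
    offsets = [frame_offset(c, 4, frames) for c in range(4)]
    print(frames, "frame(s):", offsets)
# 1 frame(s): [0, 0, 0, 0]
# 2 frame(s): [0, 0, 1, 1]
# 4 frame(s): [0, 1, 2, 3]
```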

slide-42
SLIDE 42

Delay Randomly PFR

■ PFR delays all user inputs

Big bandwidth savings

Worse responsiveness

■ R-PFR and NR-PFR do not delay any input

Same responsiveness as conventional GPUs

Smaller bandwidth savings

■ Delay Randomly PFR (DR-PFR)

Delay a percentage, P, of randomly selected frames with user inputs

State transitions:

Conventional Rendering (1 frame) -> PFR (2 frames): T frames without user inputs

PFR (2 frames) -> Conventional Rendering (1 frame): User provides inputs && random() >= P

PFR (2 frames) -> PFR (2 frames): User provides inputs && random() < P
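The random test that decides whether an input is delayed can be sketched as follows. The helper name `leaves_pfr` is hypothetical, and P is treated here as a fraction in [0, 1] rather than a percentage, which is an assumption about units.

```python
import random


def leaves_pfr(has_user_input, p, rng):
    """Decide, while in PFR mode, whether to fall back to conventional rendering.

    An input is delayed (the GPU stays in PFR) when rng.random() < p,
    so the GPU leaves PFR only on an input with rng.random() >= p.
    """
    return has_user_input and rng.random() >= p


rng = random.Random(0)  # fixed seed so repeated runs give the same estimate
trials = 10_000
delayed = sum(not leaves_pfr(True, 0.25, rng) for _ in range(trials))
print(f"delayed about {delayed / trials:.1%} of frames with user inputs")
```

With p = 0 no input is delayed, matching the Reactive behaviour; with p = 1 every input is delayed, matching plain PFR.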

slide-43
SLIDE 43

Outline

  • 1. Motivation
  • 2. Conventional Rendering
  • 3. Parallel Frame Rendering
  • 4. Experimental Results
  • 5. Conclusions

slide-44
SLIDE 44

Evaluation Methodology

■ TEAPOT simulator

Runs unmodified Android applications

Mobile GPU timing simulator (TBDR)

Power estimations (McPAT)

“TEAPOT: A Toolset for Evaluating Performance, Power and Image Quality on Mobile Graphics Systems”. ICS 2013.

■ Workloads

2D games: angrybirds, badpiggies

Simple 3D games: crazysnowboard, ibowl, templerun

Complex 3D games: captainamerica, chaos, hotpursuit, sleepyjack

■ Hardware Parameters

Technology        32 nm                 L2 Cache          128 KB, 8-way, 12 cycles
Frequency         300 MHz               Tile Cache        32 KB, 4-way, 4 cycles
Tile size         16 x 16               Texture Caches    8 KB, 2-way, 1 cycle
Screen            800 x 480 (WVGA)      Main memory       1 GB, 8 bytes/cycle

                        Conv. Rendering    PFR, R-PFR, DR-PFR    NR-PFR
Num clusters            1                  2                     4
Raster units/cluster    4                  2                     1
Vertex Proc./cluster    4                  2                     1
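A couple of derived quantities follow from these parameters by simple arithmetic. The sketch assumes that the 8 bytes/cycle main-memory figure is counted in cycles of the 300 MHz clock, which the slide does not state; the variable names are illustrative.

```python
SCREEN_W, SCREEN_H = 800, 480   # WVGA screen, from the table
TILE = 16                       # tile size is 16 x 16 pixels
FREQ_HZ = 300_000_000           # 300 MHz
BYTES_PER_CYCLE = 8             # main memory, assumed per 300 MHz cycle

tiles_x = SCREEN_W // TILE      # 50 tiles across
tiles_y = SCREEN_H // TILE      # 30 tiles down
tiles_per_frame = tiles_x * tiles_y

peak_bytes_per_second = FREQ_HZ * BYTES_PER_CYCLE

# Every configuration in the table has the same total number of raster units.
raster_units_total = {"Conv.": 1 * 4, "PFR": 2 * 2, "NR-PFR": 4 * 1}

print(tiles_per_frame)               # 1500
print(peak_bytes_per_second / 1e9)   # 2.4 (GB/s, under the stated assumption)
print(raster_units_total)
```

The last line shows that the compared configurations redistribute the same four raster units across clusters rather than adding hardware.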

slide-48
SLIDE 48

Normalized Memory Traffic

16.5% 27.5%

NR-PFR provides bigger savings in 5 games, despite achieving high responsiveness...

...but the Reactive versions offer modest savings for games with intensive user interaction

slide-49
SLIDE 49

Speedup

GPUs are bandwidth bound

slide-50
SLIDE 50

Normalized Energy

slide-51
SLIDE 51

Trading Responsiveness for Energy

slide-52
SLIDE 52

Outline

  • 1. Motivation
  • 2. Conventional Rendering
  • 3. Parallel Frame Rendering
  • 4. Experimental Results
  • 5. Conclusions

slide-53
SLIDE 53

Conclusions

■ Consecutive frames exhibit a high degree of texture similarity, since a big percentage of the texture dataset is shared

■ Mobile GPU memory bandwidth can be significantly reduced by overlapping the memory accesses of 2 consecutive frames; we term this technique Parallel Frame Rendering (PFR)

■ PFR comes at a cost in the responsiveness of the system, as the input lag is increased

■ Reactive versions of PFR are able to adapt to the amount of input provided by the user, achieving high responsiveness when necessary

■ The experimental results show that PFR and its different variations provide consistent performance and energy improvements
