[PPT] - Hardware Acceleration of Graphics and Imaging Algorithms Using FPGAs - PowerPoint Presentation

SLIDE 1

Hardware Acceleration of Graphics and Imaging Algorithms Using FPGAs

Pavel Zemčík, Department of Computer Graphics and Multimedia, Faculty of Information Technology, Brno University of Technology, Czech Republic, e-mail: zemcik@fit.vutbr.cz

SLIDE 2

Overview

History and current state of the art

Possible future development

FPGA capabilities and configuration, DSP

Design examples

Algorithm examples

Conclusion

SLIDE 3

History and current state of the art

SLIDE 4

Historical development

1) Special hardware - “the only way”

[Diagram: Computer, Graphics device]

SLIDE 5

Historical development

2) Integration - “the cheaper way”

[Diagram: Computer, RAM, Graphics]

SLIDE 6

Historical development

3) On-board memory - “the faster way”

[Diagram: Computer, Graphics, Dedicated RAM]

SLIDE 7

Historical development

4) Acceleration - “the modern way”

[Diagram: Computer, Graphics, Dedicated RAM, Graphics pipeline]

SLIDE 8

Historical development

Bandwidth

ISA 8/16 bit - 5 MB/s

PCI 32 bit - 132 MB/s

AGP 64 bit - 512 MB/s
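
The bandwidth figures above follow from bus width multiplied by transfer rate. A minimal sketch of that arithmetic is given below; the function name `peak_bandwidth_mb_s` is hypothetical, and the 33 MHz PCI clock used in the note after the block is an assumption, since the slide gives only the bus width and the resulting figure.

```c
#include <assert.h>
#include <stdint.h>

/* Theoretical peak bus bandwidth in MB/s:
 * (bytes moved per transfer) * (millions of transfers per second).
 * This ignores protocol overhead, so real throughput is lower. */
uint32_t peak_bandwidth_mb_s(uint32_t bus_width_bits, uint32_t mtransfers_per_s)
{
    assert(bus_width_bits % 8 == 0);   /* whole bytes only */
    return (bus_width_bits / 8) * mtransfers_per_s;
}
```

Under the assumed 33 MHz clock, `peak_bandwidth_mb_s(32, 33)` reproduces the 132 MB/s quoted for 32-bit PCI.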

SLIDE 9

Historical development

Resolution

CGA - 320x200x4 colours

VGA - 640x480x256 colours

XGA, … 1600x1200 and more, 32-bit RGB

SLIDE 10

Current state of the art

Typical configuration

AGP interface

64 MB RAM - 256 bit, >3 GB/s

Graphics pipeline (3D transformation, clipping, shading, 2D/3D textures, partially programmable, etc.)

>100 000 polygons/s, >1G pixels/s

SLIDE 11

Current state of the art

Pixel-oriented architectures (Pixel Planes)

Each pixel has its own “CPU”

Extremely high pixel rates

Limited features in texturing etc.

SLIDE 12

Current state of the art

Volume rendering architecture (VolumePro)

Very high voxel rate

Shadows and perspective projection (so far) not included

SLIDE 13

Possible future development

SLIDE 14

Possible future development

Frequency cannot increase much

Frequency - light travels only 30 cm in 1 ns, and electrical signal propagation inside the chips is much slower
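
The frequency argument above can be made concrete with a short calculation. The sketch below takes only the 30 cm per nanosecond figure from the slide; the function name and the clock rates in the note are illustrative assumptions.

```c
/* Upper bound on how far a signal could travel during one clock period,
 * using the slide's figure of 30 cm per nanosecond for light.
 * Real on-chip signals are slower, so the true reach is smaller. */
double light_distance_per_cycle_cm(double freq_mhz)
{
    const double cm_per_ns = 30.0;          /* figure taken from the slide */
    double period_ns = 1000.0 / freq_mhz;   /* 1000 MHz <=> 1 ns period */
    return cm_per_ns * period_ns;
}
```

At 1000 MHz the bound is 30 cm per cycle and at 500 MHz it is 60 cm, so every increase in clock rate shrinks the distance a signal can cover within a cycle.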

SLIDE 15

Possible future development

Parallelism and configurable devices

Parallelism - semantic problems

Device configuration - difficult to handle for “programmers”

SLIDE 16

Possible future development

Questions

What is the frequency limit?

What is the limit in the bus width?

Why would programmable logic be implemented only in graphics accelerators and not in CPUs?

SLIDE 17

FPGA capabilities and configuration, DSP

SLIDE 18

FPGA features

General purpose versus application specific

Current processors (and DSPs) are suitable for any algorithm but are not fast enough (e.g. for 2D or 3D data)

Speed in processing can be achieved by hard-wired digital circuits, but these are considered application-specific

SLIDE 19

Processor

Sequential processing
– poor performance
– easy to program

Fixed architecture

Cheap

[Diagram: MUL ADD 1, MUL ADD 2]
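
The sequential MUL/ADD chain in the diagram can be sketched as an ordinary multiply-accumulate loop. This is only an illustration of processing one MUL and one ADD per step; the function name and the 16-bit sample type are assumptions, not taken from the slides.

```c
#include <stddef.h>
#include <stdint.h>

/* Sequential multiply-accumulate: one MUL and one ADD per iteration,
 * executed one after another as on a single processor. */
int32_t mac_sequential(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += (int32_t)a[i] * b[i];   /* widen before multiplying */
    return acc;
}
```

The loop is easy to write, which matches the “easy to program” point, but the run time grows linearly with `n` because nothing executes in parallel.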

SLIDE 20

More Processors

Multiprocessing is difficult to handle algorithmically

Memory throughput or communication speed is the limiting factor

Price and power efficiency poor

SLIDE 21

More Processors

Parallel processing

Fixed architecture

Programmable

[Diagram: MUL ADD 1, MUL ADD 2, MUL ADD N]

SLIDE 22

Hard-wired

Each circuit usable only for one task or a very limited set of tasks

Maximum size/complexity is the limiting factor

Expensive to design

SLIDE 23

Hard-wired

Parallel processing

Fixed architecture

Few functions

Not programmable

[Diagram: MUL ADD 1, MUL 2, MUL 3, MUL N]

SLIDE 24

FPGAs

Parallel processing

Flexible architecture

Configurable for different tasks

[Diagram: MUL ADD 1, MUL 2, MUL 3, MUL N]

SLIDE 25

Benchmark Example

FPGA is 10 times faster than DSP (source Xilinx)

SLIDE 26

FPGA DSP is Lower Cost

Price per Million MACs per Second (source Xilinx)

SLIDE 27

FPGA structure

SLIDE 28

FPGA configurable logic block

SLIDE 29

CPU vs. FPGA Speedup Example

Experimental data

Average Speedup = 24

SLIDE 30

Price versus Performance

Software solution = cheap, but slow

Hardware solution = fast, but expensive

[Chart: Price versus Performance for SW, HW+SW and HW, with MAX possible price and MIN required performance marked]

SLIDE 31

Solution? Coupled DSPs and FPGAs

Potentially the best solution
– provides both programmability and performance

Possible trend
– modify the concept of DSPs
– include programmable circuits inside the DSP
– cannot really be influenced by the developers

SLIDE 32

Dynamic Reconfiguration

Algorithms example (3D Graphics)
– Texture
– Shadow
– Reflections
– Perspective
– Edge

DSP
– one function at a time

FPGA
– more functions at a time

Re-configurable FPGA
– all functions done in time
– some functions run while others are loading

SLIDE 33

Reconfiguration Advantages

Lower cost by reusing silicon for multiple functions over time

Significant performance increase in FPGA hardware versus software DSP implementation

Possible partial reconfiguration - function swapping

SLIDE 34

DSP development

SLIDE 35

DSP example

TI 320C32

SLIDE 36

DSP example II

DSP I/O

SLIDE 37

DSP algorithm example

SLIDE 38

Algorithm example

Image erosion using 3x3 square mask
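
The slide names the operation without showing it in the transcript, so the following is a minimal sketch of binary erosion with a 3x3 square mask, not the author's implementation. The function name, the one-byte-per-pixel layout and the choice to clear border pixels are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Binary erosion with a 3x3 square mask (illustrative sketch).
 * in/out hold one byte per pixel, row-major, non-zero = set.
 * An output pixel is 1 only if all nine pixels of its 3x3
 * neighbourhood are set. Border pixels are written as 0 because
 * their neighbourhood is incomplete (an assumption). */
void erode3x3(const uint8_t *in, uint8_t *out, size_t width, size_t height)
{
    assert(in != out);   /* not an in-place algorithm */
    for (size_t y = 0; y < height; ++y) {
        for (size_t x = 0; x < width; ++x) {
            uint8_t keep = (y > 0 && y + 1 < height &&
                            x > 0 && x + 1 < width);
            for (size_t dy = 0; keep && dy < 3; ++dy)
                for (size_t dx = 0; keep && dx < 3; ++dx)
                    if (in[(y + dy - 1) * width + (x + dx - 1)] == 0)
                        keep = 0;
            out[y * width + x] = keep;
        }
    }
}
```

Each output pixel needs up to nine reads and comparisons, which gives a feel for why a sequential processor spends several cycles per pixel on this operation.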

SLIDE 39

Algorithm example ASM

Can execute in 9 machine cycles

SLIDE 40

General DSP (CPU) strong points

Good at math operations on single data (or few data with an MMX-like SIMD device)

Wide integer and float data (e.g. 32 bits)

Powerful addressing functions

SLIDE 41

General DSP (CPU) weaknesses

Very poor bit manipulation (shift and logical functions)

Poor parallelism

Slow conditional execution

SLIDE 42

Which features do we like?

DSP:
– Fast processing of wide data
– Addressing logic

FPGA:
– Parallelism
– Bit manipulation

SLIDE 43

How to combine the features?

Separate DSP and FPGA circuits

DSP and FPGA sharing the memory

DSP and FPGA in pipeline

DSP and FPGA sharing the data bus

SLIDE 44

Separate DSP and FPGA

Data exchange

Price

SLIDE 45

DSP and FPGA sharing memory

Complex FPGA design

Too many pins

Price

SLIDE 46

DSP and FPGA in pipeline

Speed

Price

Variability

SLIDE 47

DSP and FPGA sharing data bus

Simple FPGA

Small number of pins

Price

Speed

SLIDE 48

Suitable designs

Pipeline - for very high performance systems with fixed function

Shared data bus - for price/performance sensitive systems and for systems that need dynamic reconfiguration of FPGAs

Combination of both above approaches

SLIDE 49

Algorithm example FPGA II

Long registers

Image bands
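
One plausible reading of “long registers” is that a whole band of pixels is held in a wide register and processed with bitwise logic in a single step. The sketch below models that idea in C for 3x3 binary erosion on 64 pixels at once. The packing of one pixel per bit and the function name are assumptions; the slide does not show the actual FPGA design.

```c
#include <stdint.h>

/* Erode 64 horizontally adjacent binary pixels at once.
 * up, mid, down are three consecutive image rows packed one pixel
 * per bit. Pixels outside the 64-bit word are treated as 0. */
uint64_t erode_row64(uint64_t up, uint64_t mid, uint64_t down)
{
    uint64_t v = up & mid & down;      /* vertical AND of the three rows */
    return v & (v << 1) & (v >> 1);    /* AND with both horizontal neighbours */
}
```

Here the whole band needs only a handful of AND and shift operations, which is the kind of bit manipulation the earlier slides list as an FPGA strength and a DSP weakness.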

SLIDE 50

Design examples

SLIDE 51

Basic design

TMS320C32 DSP - 32-bit floating point DSP (60 MFLOP/s, 4 KB internal RAM)

XILINX Spartan FPGA with good I/O capabilities

FPGA connected through DMA channels and DSP “host port”

SLIDE 52

Experimental design

60 MHz DSP

2 MB RAM, 15 ns

54 I/O pins

Dynamic FPGA configuration

Stackable

RS232 remote control

SLIDE 53

Design Overview

TMS320C6711 DSP - 32-bit VLIW DSP (1 GFLOP/s, 256 KB internal RAM), can possibly be replaced with “6411”

XILINX Virtex partially reconfigurable FPGA with good I/O capabilities

FPGA connected through DMA channels and DSP “host port”

SLIDE 54

Block diagram

[Diagram: FPGA (Xilinx Virtex); SRAM (local) (1 MByte, 8bit); DSP (TI 320C6711) 256kB SRAM; SDRAM (64 MByte, 32bit); Host link; Peer links (LVDS)]

SLIDE 55

More detailed block diagram

SLIDE 56

Algorithm example FPGA

Executes in 1 cycle

Data flow may delay to 2 or more cycles

Can use local storage for speedup

SLIDE 57

Possible problems

DSP
– difficult to use internal DSP memory
– out-of-order execution complicated

FPGA
– I/O (data bus) speed critical
– reconfiguration difficult to handle

SLIDE 58

Algorithm design

SLIDE 59

Algorithm design

VHDL language “codesign”

Automated C/C++ language analysers

C/C++ profilers & manual identification of portions to be implemented in FPGA is not “nice and sophisticated”, but is still the only reasonable way to go

SLIDE 60

Dynamic reconfiguration?

The algorithms generally are diverse and a single “algorithm core” does not exist

Several “algorithm cores” must exist; the options are an increase in complexity or dynamic reconfiguration of the FPGA

Reconfiguration of FPGA is memory intensive

SLIDE 61

How to reconfigure?

Multithreaded environment with synchronization events from the FPGA

Data transfer using DSP’s DMA channels

Reconfigure only partially if the FPGA allows this feature (e.g. XILINX Virtex)

SLIDE 62

FPGA reconfiguration

SLIDE 63

FPGA resource allocation

FPGA is a shared resource well known from operating systems theory

FPGA sharing may cause deadlocks

To prevent deadlocks, use e.g. the “Banker’s algorithm”

To increase speed with a small number of threads, “prefetch” can be used
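
The Banker's algorithm mentioned above grants a request only if the resulting state is still “safe”, meaning some order exists in which every thread can obtain its maximum need and finish. A minimal sketch of that safety check follows, treating FPGA units as the resources. The function name, the array layout and the size limits are assumptions for illustration and do not come from the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_THREADS 8   /* illustrative limits, not from the slides */
#define MAX_UNITS   4

/* Banker's algorithm safety check.
 * available[u]              : free instances of FPGA unit u
 * alloc[t * n_units + u]    : instances of u currently held by thread t
 * max_need[t * n_units + u] : most instances of u thread t may ever hold
 * Returns true if every thread can finish in some order. */
bool bankers_is_safe(int n_threads, int n_units, const int *available,
                     const int *alloc, const int *max_need)
{
    int work[MAX_UNITS];
    bool done[MAX_THREADS] = { false };

    assert(n_threads <= MAX_THREADS && n_units <= MAX_UNITS);
    memcpy(work, available, (size_t)n_units * sizeof work[0]);

    for (int finished = 0; finished < n_threads; ) {
        bool progress = false;
        for (int t = 0; t < n_threads; ++t) {
            if (done[t])
                continue;
            bool fits = true;   /* can the remaining need of t be met now? */
            for (int u = 0; u < n_units; ++u)
                if (max_need[t * n_units + u] - alloc[t * n_units + u] > work[u])
                    fits = false;
            if (!fits)
                continue;
            for (int u = 0; u < n_units; ++u)   /* t finishes, releases all */
                work[u] += alloc[t * n_units + u];
            done[t] = true;
            progress = true;
            ++finished;
        }
        if (!progress)
            return false;   /* nobody can proceed: unsafe state */
    }
    return true;
}
```

An allocator built on this check would tentatively apply a request, call the function, and make the thread wait if the result is `false`.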

SLIDE 64

Algorithm example

void Process1(const Image * A, Image * OutA)
{
    Alloc(UnitX);               // prefetch X
    PreprocessInDSP(A);
    Wait(UnitX);                // use X now!
    Execute(UnitX, A, OutA);
    Free(UnitX);                // free X
}

void Process2(const Image * B, Image * OutB)
{
    Alloc(UnitY);               // prefetch Y
    Wait(UnitY);                // and use immediately
    Execute(UnitY, B, OutB);
    Free(UnitY);                // free Y
}

[Timing diagram: DSP and FPGA activity for images A and B, showing Load X,Y, Exec X,Y and Free X,Y]
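
The benefit of the prefetch in `Process1` can be estimated with a simple timing model: loading the FPGA unit overlaps with DSP preprocessing instead of preceding it. The sketch below is only that model; the function names and the idea of expressing each phase as a single duration are assumptions, and no timings are given in the slides.

```c
/* Total time when the unit is loaded first and the DSP
 * preprocessing starts only afterwards (no prefetch). */
unsigned serial_time(unsigned load, unsigned preprocess, unsigned exec)
{
    return load + preprocess + exec;
}

/* Total time when the load is started early (prefetch) and runs
 * concurrently with preprocessing; execution waits for the slower one. */
unsigned prefetch_time(unsigned load, unsigned preprocess, unsigned exec)
{
    unsigned overlap = load > preprocess ? load : preprocess;
    return overlap + exec;
}
```

With hypothetical durations of 5, 3 and 2 time units, the serial order takes 10 units and the prefetched order takes 7. `Process2` has no preprocessing to overlap with, so in this model it gains nothing from the early `Alloc`.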

SLIDE 65

Example

SLIDE 66

Rendering engine

SLIDE 67

Rendering engine

SLIDE 68

Rendering engine

SLIDE 69

Conclusions

DSP co-operation with FPGA is possible

Suitable for both cost-efficient and very high performance designs

Still difficult to generalise where and how to “place the portions of algorithm”

SLIDE 70