SLIDE 1

Hardware Acceleration of Graphics and Imaging Algorithms Using FPGAs

Pavel Zemčík
Department of Computer Graphics and Multimedia, Faculty of Information Technology, Brno University of Technology, Czech Republic
e-mail: zemcik@fit.vutbr.cz

SLIDE 2

Overview

  • History and current state of the art
  • Possible future development
  • FPGA capabilities and configuration, DSP
  • Design examples
  • Algorithm examples
  • Conclusion

SLIDE 3

History and current state of the art

SLIDE 4

Historical development

1) Special hardware - “the only way”

[Diagram blocks: Computer, Graphics device]

SLIDE 5

Historical development

2) Integration - “the cheaper way”

[Diagram blocks: Computer, RAM, Graphics]

SLIDE 6

Historical development

3) On board memory - “the faster way”

[Diagram blocks: Computer, Graphics, Dedicated RAM]

SLIDE 7

Historical development

4) Acceleration - “the modern way”

[Diagram blocks: Computer, Graphics, Dedicated RAM, Graphics pipeline]

SLIDE 8

Historical development

Bandwidth

  • ISA 8/16 bit - 5 MB/s
  • PCI 32 bit - 132 MB/s
  • AGP 64 bit - 512 MB/s
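
The PCI figure is consistent with bus width multiplied by clock rate, assuming the standard 33 MHz PCI clock, which the slide does not state. A minimal sketch in C; the function name is hypothetical and the model ignores protocol overhead and wait states, so it is not meant to reproduce the ISA or AGP figures:

```c
#include <assert.h>

/* Theoretical peak bandwidth in MB/s (1 MB = 10^6 bytes) of a parallel bus
 * that moves one word of `width_bits` per clock at `clock_mhz`.
 * Hypothetical helper; ignores protocol overhead and wait states. */
double peak_bandwidth_mb_per_s(int width_bits, double clock_mhz)
{
    assert(width_bits > 0 && width_bits % 8 == 0);
    return (width_bits / 8) * clock_mhz;
}
```

Calling `peak_bandwidth_mb_per_s(32, 33.0)` gives 132, which matches the PCI entry.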

SLIDE 9

Historical development

Resolution

  • CGA - 320x200x4 colours
  • VGA - 640x480x256 colours
  • XGA, … 1600x1200 and more, 32-bit RGB

SLIDE 10

Current state of the art

Typical configuration

  • AGP interface
  • 64 MB RAM - 256 bit, >3 GB/s
  • Graphics pipeline (3D transformation, clipping, shading, 2D/3D textures, partially programmable, etc.)
  • >100 000 polygons/s, >1G pixels/s

SLIDE 11

Current state of the art

Pixel oriented architectures (Pixel Planes)

  • Each pixel has its own “CPU”
  • Extremely high pixel rates
  • Limited features in texturing etc.

SLIDE 12

Current state of the art

Volume rendering architecture (VolumePro)

  • Very high voxel rate
  • Shadows and perspective projection (so far) not included

SLIDE 13

Possible future development

SLIDE 14

Possible future development

Frequency cannot increase much

  • Frequency - light travels only 30 cm in 1 ns, and electrical signal propagation inside chips is much slower
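
The 30 cm figure can be checked directly: it is the speed of light divided by the clock frequency. A minimal sketch in C, with a hypothetical function name, that gives the distance covered in one clock period:

```c
#include <assert.h>

/* Distance in centimetres that light travels in vacuum during one clock
 * period at `freq_hz`. Hypothetical helper for illustration only. */
double light_cm_per_cycle(double freq_hz)
{
    const double c_m_per_s = 299792458.0;  /* speed of light in vacuum */
    assert(freq_hz > 0.0);
    return c_m_per_s / freq_hz * 100.0;
}
```

At 1 GHz (a 1 ns period) this is just under 30 cm, and at 3 GHz just under 10 cm. On-chip signals are slower than this, so the usable distance per cycle is smaller still.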

SLIDE 15

Possible future development

Parallelism and configurable devices

  • Parallelism - semantic problems
  • Device configuration - difficult to handle for “programmers”

SLIDE 16

Possible future development

Questions

  • What is the frequency limit?
  • What is the limit in the bus width?
  • Why would programmable logic be implemented only in graphics accelerators and not in CPUs?

SLIDE 17

FPGA capabilities and configuration, DSP

SLIDE 18

FPGA features

General purpose versus application specific

  • Current processors (and DSPs) are suitable for any algorithm but are not fast enough (e.g. for 2D or 3D data)
  • Speed in processing can be achieved by hard-wired digital circuits but these are considered application-specific

SLIDE 19

Processor

  • Sequential processing
    – poor performance
    – easy to program
  • Fixed architecture
  • Cheap

[Diagram blocks: MUL ADD 1, MUL ADD 2]

SLIDE 20

More Processors

  • Multiprocessing is difficult to handle algorithmically
  • Memory throughput or communication speed is the limiting factor
  • Price and power efficiency poor

SLIDE 21

More Processors

  • Parallel processing
  • Fixed architecture
  • Programmable

[Diagram blocks: MUL ADD 1, MUL ADD 2, MUL ADD N]

SLIDE 22

Hard-wired

  • Each circuit useable only for one task or a very limited set of tasks
  • Maximum size (complexity) is the limiting factor
  • Expensive to design

SLIDE 23

Hard-wired

  • Parallel processing
  • Fixed architecture
  • Few functions
  • Not programmable

[Diagram blocks: MUL ADD 1, MUL 2, MUL 3, MUL N]

SLIDE 24

FPGAs

  • Parallel processing
  • Flexible architecture
  • Configurable for different tasks

[Diagram blocks: MUL ADD 1, MUL 2, MUL 3, MUL N]

SLIDE 25

Benchmark Example

  • FPGA is 10 times faster than DSP (source Xilinx)

SLIDE 26

FPGA DSP is Lower Cost

  • Price per Million MACs per Second (source Xilinx)

SLIDE 27

FPGA structure

SLIDE 28

FPGA configurable logic block

SLIDE 29

CPU vs. FPGA Speedup Example

  • Experimental data
  • Average Speedup = 24

SLIDE 30

Price versus Performance

  • Software solution = cheap, but slow
  • Hardware solution = fast, but expensive

[Plot: price versus performance for SW, HW, and HW+SW solutions, with the maximum possible price and the minimum required performance marked]

SLIDE 31

Solution? Coupled DSPs and FPGAs

  • Potentially the best solution
    – provides both programmability and performance
  • Possible trend
    – modify the concept of DSPs
    – include programmable circuits inside the DSP
    – cannot really be influenced by developers

SLIDE 32

Dynamic Reconfiguration

  • Algorithms example (3D Graphics)
    – Texture
    – Shadow
    – Reflections
    – Perspective
    – Edge
  • DSP
    – one function at a time
  • FPGA
    – more functions at a time
  • Re-configurable FPGA
    – all functions done in time
    – some functions run while others are loading

SLIDE 33

Reconfiguration Advantages

  • Lower cost by reusing silicon for multiple functions over time
  • Significant performance increase in FPGA hardware versus software DSP implementation
  • Possible partial reconfiguration - function swapping

SLIDE 34

DSP development

SLIDE 35

DSP example

TI 320C32

SLIDE 36

DSP example II

DSP I/O

SLIDE 37

DSP algorithm example

SLIDE 38

Algorithm example

  • Image erosion using 3x3 square mask
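
The slide's own listing is not reproduced in this transcript. As an illustration only, erosion with a 3x3 square mask can be sketched in C as a minimum over each pixel's 3x3 neighbourhood. The function name, the 8-bit row-major image layout, and the border rule (neighbours outside the image are ignored) are assumptions of this sketch, not details from the presentation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Grey-scale erosion with a 3x3 square mask: each output pixel is the
 * minimum over the pixel and its eight neighbours. For binary images
 * (0 or 255) this reduces to a logical AND of the neighbourhood.
 * Assumption: neighbours outside the image are ignored.
 * `src` and `dst` are row-major, w*h bytes, and must not overlap. */
void erode3x3(const uint8_t *src, uint8_t *dst, size_t w, size_t h)
{
    assert(src != NULL && dst != NULL && src != dst);
    for (size_t y = 0; y < h; ++y) {
        size_t y0 = (y > 0) ? y - 1 : 0;
        size_t y1 = (y + 1 < h) ? y + 1 : h - 1;
        for (size_t x = 0; x < w; ++x) {
            size_t x0 = (x > 0) ? x - 1 : 0;
            size_t x1 = (x + 1 < w) ? x + 1 : w - 1;
            uint8_t m = 255;
            /* Scan the (clipped) 3x3 window and keep the smallest value. */
            for (size_t yy = y0; yy <= y1; ++yy)
                for (size_t xx = x0; xx <= x1; ++xx)
                    if (src[yy * w + xx] < m)
                        m = src[yy * w + xx];
            dst[y * w + x] = m;
        }
    }
}
```

A single dark pixel in an otherwise white image grows into a dark 3x3 square. Each interior output pixel needs nine reads and comparisons, which is why the operation is costly on a sequential processor.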

SLIDE 39

Algorithm example ASM

  • Can execute in 9 machine cycles

SLIDE 40

General DSP (CPU) strong points

  • Good at math operations on single data items (or a few data items with an MMX-like SIMD unit)
  • Wide integer and float data (e.g. 32 bits)
  • Powerful addressing functions

SLIDE 41

General DSP (CPU) weaknesses

  • Very poor bit manipulation (shift and logical functions)
  • Poor parallelism
  • Slow conditional execution

SLIDE 42

Which features do we like?

  • DSP:
    – Fast processing of wide data
    – Addressing logic
  • FPGA:
    – Parallelism
    – Bit manipulation

SLIDE 43

How to combine the features?

  • Separate DSP and FPGA circuits
  • DSP and FPGA sharing the memory
  • DSP and FPGA in pipeline
  • DSP and FPGA sharing the data bus

SLIDE 44

Separate DSP and FPGA

  • Data exchange
  • Price

SLIDE 45

DSP and FPGA sharing memory

  • Complex FPGA design
  • Too many pins
  • Price

SLIDE 46

DSP and FPGA in pipeline

  • Speed
  • Price
  • Variability

SLIDE 47

DSP and FPGA sharing data bus

  • Simple FPGA
  • Small number of pins
  • Price
  • Speed

SLIDE 48

Suitable designs

  • Pipeline - for very high performance systems with fixed function
  • Shared data bus - for price/performance sensitive systems and for systems that need dynamic reconfiguration of FPGAs
  • Combination of both above approaches

SLIDE 49

Algorithm example FPGA II

  • Long registers
  • Image bands
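
The slide gives no detail on “long registers” and “image bands”. One plausible reading, stated here as an assumption, is that each row of a binary image is held in a wide register and the image is processed in vertical bands as wide as that register, so that a 3x3 erosion becomes a few shifts and ANDs per row. A minimal sketch in C, with 64-bit words standing in for the long registers and hypothetical function names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Horizontal 1x3 erosion of one packed row: a bit survives only if it and
 * both horizontal neighbours are set. Shifts bring in zeros, so pixels
 * outside the band count as background (an assumption of this sketch). */
static uint64_t erode_row(uint64_t row)
{
    return row & (row << 1) & (row >> 1);
}

/* Binary 3x3 erosion of a band of `h` rows, each row packed into one 64-bit
 * word (one bit per pixel). Rows above and below the band count as
 * background. `src` and `dst` must not overlap. */
void erode3x3_band(const uint64_t *src, uint64_t *dst, size_t h)
{
    assert(src != NULL && dst != NULL && src != dst);
    for (size_t y = 0; y < h; ++y) {
        uint64_t above = (y > 0) ? erode_row(src[y - 1]) : 0;
        uint64_t below = (y + 1 < h) ? erode_row(src[y + 1]) : 0;
        dst[y] = above & erode_row(src[y]) & below;
    }
}
```

All 64 pixels of a row are handled by the same few logic operations, which is the kind of bit-level parallelism that maps well onto FPGA logic. An image wider than one word would need neighbouring bands to overlap by one pixel on each side, which this sketch does not handle.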

SLIDE 50

Design examples

SLIDE 51

Basic design

  • TMS320C32 32-bit floating point DSP (60 MFLOP/s, 4 KB internal RAM)
  • XILINX Spartan FPGA with good I/O capabilities
  • FPGA connected through DMA channels and DSP “host port”

SLIDE 52

Experimental design

  • 60 MHz DSP
  • 2 MB RAM, 15 ns
  • 54 I/O pins
  • Dynamic FPGA configuration
  • Stackable
  • RS232 remote control

SLIDE 53

Design Overview

  • TMS320C6711 32-bit VLIW DSP (1 GFLOP/s, 256 KB internal RAM), possibly can be replaced with “6411”
  • XILINX Virtex partially reconfigurable FPGA with good I/O capabilities
  • FPGA connected through DMA channels and DSP “host port”

SLIDE 54

Block diagram

[Block diagram: FPGA (Xilinx Virtex); SRAM (local) (1 MByte, 8 bit); DSP (TI 320C6711); 256 kB SRAM; SDRAM (64 MByte, 32 bit); Host link; Peer links (LVDS)]

SLIDE 55

More detailed block diagram

SLIDE 56

Algorithm example FPGA

  • Executes in 1 cycle
  • Data flow may delay this to 2 or more cycles
  • Can use local storage for speedup

SLIDE 57

Possible problems

  • DSP
    – difficult to use internal DSP memory
    – out-of-order execution complicated
  • FPGA
    – I/O (data bus) speed critical
    – reconfiguration difficult to handle

SLIDE 58

Algorithm design

SLIDE 59

Algorithm design

  • VHDL language “codesign”
  • Automated C/C++ language analysers
  • C/C++ profilers & manual identification of portions to be implemented in FPGA is not “nice and sophisticated”, but is still the only reasonable way to go

SLIDE 60

Dynamic reconfiguration?

  • The algorithms are generally diverse and a single “algorithm core” does not exist
  • Several “algorithm cores” must exist
    – the options are a complexity increase or dynamic reconfiguration of the FPGA
  • Reconfiguration of FPGA is memory intensive

SLIDE 61

How to reconfigure?

  • Multithreaded environment with synchronization events from the FPGA
  • Data transfer using DSP’s DMA channels
  • Reconfigure only partially if the FPGA allows this feature (e.g. XILINX Virtex)

SLIDE 62

FPGA reconfiguration

SLIDE 63

FPGA resource allocation

  • FPGA is a shared resource, a concept well known from operating systems theory
  • FPGA sharing may cause deadlocks
  • To prevent deadlocks, use e.g. the “Banker’s algorithm”
  • To increase speed with a small number of threads, “prefetch” can be used
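
The slide names the Banker's algorithm without showing it. Its core is a safety check: a request is granted only if, afterwards, some order still exists in which every thread can obtain what it needs and finish. A minimal sketch in C, assuming FPGA units are modelled as countable resource types; the function name, array layout, and size limits are hypothetical and not taken from the presentation:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_THREADS 8
#define MAX_UNITS   4

/* Banker's algorithm safety check.
 * available[u]               : free instances of FPGA unit type u
 * allocated[t * n_units + u] : instances currently held by thread t
 * need[t * n_units + u]      : instances thread t may still request
 * Returns true if some order exists in which every thread can finish. */
bool is_safe(int n_threads, int n_units, const int *available,
             const int *allocated, const int *need)
{
    int work[MAX_UNITS];
    bool finished[MAX_THREADS] = { false };

    assert(n_threads > 0 && n_threads <= MAX_THREADS);
    assert(n_units > 0 && n_units <= MAX_UNITS);
    memcpy(work, available, (size_t)n_units * sizeof work[0]);

    for (int done = 0; done < n_threads; ) {
        bool progressed = false;
        for (int t = 0; t < n_threads; ++t) {
            if (finished[t])
                continue;
            bool fits = true;
            for (int u = 0; u < n_units; ++u)
                if (need[t * n_units + u] > work[u])
                    fits = false;
            if (!fits)
                continue;
            /* Thread t can run to completion and release what it holds. */
            for (int u = 0; u < n_units; ++u)
                work[u] += allocated[t * n_units + u];
            finished[t] = true;
            progressed = true;
            ++done;
        }
        if (!progressed)
            return false;   /* the remaining threads could deadlock */
    }
    return true;
}
```

With two threads where one holds unit X and needs Y while the other holds Y and needs X, and nothing is free, the check returns false. An allocator built on this would tentatively apply a request, run the check, and roll the request back if the resulting state is unsafe.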

SLIDE 64

Algorithm example

void Process1(const Image * A, Image * OutA)
{
    Alloc(UnitX);               // prefetch X
    PreprocessInDSP(A);
    Wait(UnitX);                // use X now!
    Execute(UnitX, A, OutA);
    Free(UnitX);                // free X
}

void Process2(const Image * B, Image * OutB)
{
    Alloc(UnitY);               // prefetch Y
    Wait(UnitY);                // and use immediately
    Execute(UnitY, B, OutB);
    Free(UnitY);                // free Y
}

[Timing diagram: DSP and FPGA activity for images A and B, showing the Load, Exec, and Free phases of units X and Y]

SLIDE 65

Example

SLIDE 66

Rendering engine

SLIDE 67

Rendering engine

SLIDE 68

Rendering engine

SLIDE 69

Conclusions

  • DSP co-operation with FPGA is possible
  • Suitable for both cost efficient and very high performance designs
  • Still difficult to generalise where and how to “place the portions of algorithm”

SLIDE 70

The end