Hardware Acceleration of Graphics and Imaging Algorithms Using FPGAs
Pavel Zemk
Department of Computer Graphics and Multimedia
Overview
- History and current state of the art
- Possible future development
- FPGA capabilities and configuration, DSP
- Design examples
- Algorithm examples
- Conclusion
History and current state of the art
Historical development
1) Special hardware - “the only way” [Diagram labels: Computer, Graphics device]
2) Integration - “the cheaper way” [Diagram labels: Computer, RAM, Graphics]
3) On-board memory - “the faster way” [Diagram labels: Computer, Graphics, Dedicated RAM]
4) Acceleration - “the modern way” [Diagram labels: Computer, Graphics, Dedicated RAM, Graphics pipeline]
Historical development
Bandwidth
- ISA 8/16 bit - 5 MB/s
- PCI 32 bit - 132 MB/s
- AGP 64 bit - 512 MB/s
Historical development
Resolution
- CGA - 320x200x4 colours
- VGA - 640x480x256 colours
- XGA, … 1600x1200 and more, 32-bit RGB
Current state of the art
Typical configuration
- AGP interface
- 64 MB RAM - 256 bit, >3 GB/s
- Graphics pipeline (3D transformation, clipping, shading, 2D/3D textures, partially programmable, etc.)
- >100 000 polygons/s, >1G pixels/s
Current state of the art
Pixel oriented architectures (Pixel Planes)
- Each pixel has its own “CPU”
- Extremely high pixel rates
- Limited features in texturing etc.
Current state of the art
Volume rendering architecture (VolumePro)
- Very high voxel rate
- Shadows and perspective projection (so far) not included
Possible future development
Frequency cannot increase much
- Frequency - light travels only 30 cm in 1 ns, and electrical signal propagation inside the chips is much slower
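As a rough worked example of the frequency argument: taking the slide's figure of 30 cm per nanosecond, the distance that light could cover within one clock period shrinks in proportion to the clock frequency. The sketch below is illustrative only. The function name light_cm_per_cycle is hypothetical, and it models light in vacuum, so it gives an upper bound; signals inside a chip travel a shorter distance.

```c
#include <assert.h>

/* Distance light travels in vacuum during one clock period, in centimetres.
   Uses the slide's round figure of 30 cm per nanosecond. */
double light_cm_per_cycle(double clock_ghz)
{
    assert(clock_ghz > 0.0);
    const double cm_per_ns = 30.0;       /* figure taken from the slide */
    double period_ns = 1.0 / clock_ghz;  /* a 1 GHz clock has a 1 ns period */
    return cm_per_ns * period_ns;
}
```

At 1 GHz this gives 30 cm per cycle, and at 3 GHz about 10 cm, which is already comparable to the size of a circuit board.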
Possible future development
Parallelism and configurable devices
- Parallelism - semantic problems
- Device configuration - difficult to handle for “programmers”
Possible future development
Questions
- What is the frequency limit?
- What is the limit in the bus width?
- Why would programmable logic be implemented only in graphics accelerators and not in CPUs?
FPGA capabilities and configuration, DSP
FPGA features
General purpose versus application specific
- Current processors (and DSPs) are suitable for any algorithm but are not fast enough (e.g. for 2D or 3D data)
- Speed in processing can be achieved by hard-wired digital circuits, but these are considered application-specific
Processor
- Sequential processing
  – poor performance
  – easy to program
- Fixed architecture
- Cheap
[Diagram labels: MUL, ADD 1, MUL, ADD 2]
More Processors
- Multiprocessing is difficult to handle algorithmically
- Memory throughput or communication speed is the limiting factor
- Price and power efficiency are poor
More Processors
- Parallel processing
- Fixed architecture
- Programmable
[Diagram labels: MUL, ADD 1, MUL, ADD 2, MUL, ADD N]
Hard-wired
- Each circuit is usable only for one task or a very limited set of tasks
- Maximum size and complexity is the limiting factor
- Expensive to design
Hard-wired
- Parallel processing
- Fixed architecture
- Few functions
- Not programmable
[Diagram labels: MUL, ADD 1, MUL 2, MUL 3, MUL N]
FPGAs
- Parallel processing
- Flexible architecture
- Configurable for different tasks
[Diagram labels: MUL, ADD 1, MUL 2, MUL 3, MUL N]
Benchmark Example
- FPGA is 10 times faster than DSP (source: Xilinx)
FPGA DSP is Lower Cost
- Price per Million MACs per Second (source: Xilinx)
FPGA structure
FPGA configurable logic block
CPU vs. FPGA Speedup Example
- Experimental data
- Average Speedup = 24
Price versus Performance
- Software solution = cheap, but slow
- Hardware solution = fast, but expensive
[Chart: price versus performance for SW, HW+SW and HW, with limits marked for the maximum possible price and the minimum required performance]
Solution? Coupled DSPs and FPGAs
- Potentially the best solution
  – provides both programmability and performance
- Possible trend
  – modify the concept of DSPs
  – include programmable circuits inside the DSP
  – cannot really be influenced by developers
Dynamical Reconfiguration
- Algorithms example (3D Graphics)
  – Texture
  – Shadow
  – Reflections
  – Perspective
  – Edge
- DSP
  – one function at a time
- FPGA
  – more functions at a time
- Re-configurable FPGA
  – all functions done in time
  – some functions run while others are loading
Reconfiguration Advantages
- Lower cost by reusing silicon for multiple functions over time
- Significant performance increase in FPGA hardware versus software DSP implementation
- Possible partial reconfiguration - function swapping
DSP development
DSP example
TI 320C32
DSP example II
DSP I/O
DSP algorithm example
Algorithm example
- Image erosion using 3x3 square mask
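For reference, erosion with a 3x3 square mask can be sketched in plain C as follows. This is a minimal software version, not the code from the slides: the function name erode3x3, the 8-bit row-major image layout, and the choice to copy border pixels unchanged are all assumptions made for illustration. Each interior output pixel is the minimum of the nine input pixels around it, which for binary images is the logical AND of the neighbourhood.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Erode an 8-bit, row-major image with a 3x3 square mask.
   Interior pixels become the minimum of their 3x3 neighbourhood.
   Border pixels are copied unchanged (an assumption; the slides do not
   say how borders are handled). in and out must not overlap. */
void erode3x3(const uint8_t *in, uint8_t *out, size_t width, size_t height)
{
    assert(in != NULL && out != NULL && in != out);
    for (size_t y = 0; y < height; ++y) {
        for (size_t x = 0; x < width; ++x) {
            uint8_t m = in[y * width + x];
            int interior = x > 0 && y > 0 && x + 1 < width && y + 1 < height;
            if (interior) {
                /* Scan the nine pixels centred on (x, y). */
                for (size_t dy = 0; dy < 3; ++dy) {
                    for (size_t dx = 0; dx < 3; ++dx) {
                        uint8_t v = in[(y - 1 + dy) * width + (x - 1 + dx)];
                        if (v < m)
                            m = v;
                    }
                }
            }
            out[y * width + x] = m;
        }
    }
}
```

The inner loops read nine pixels and perform eight comparisons per output pixel, which is the work that a sequential processor has to serialise and that parallel hardware can do at once.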
Algorithm example ASM
- Can execute in 9 machine cycles
General DSP (CPU) strong points
- Good at math operations on single data items (or a few items with an MMX-like SIMD unit)
- Wide integer and float data (e.g. 32 bits)
- Powerful addressing functions
General DSP (CPU) weaknesses
- Very poor bit manipulation (shift and logical functions)
- Poor parallelism
- Slow conditional execution
Which features do we like?
- DSP:
  – Fast processing of wide data
  – Addressing logic
- FPGA:
  – Parallelism
  – Bit manipulation
How to combine the features?
- Separate DSP and FPGA circuits
- DSP and FPGA sharing the memory
- DSP and FPGA in pipeline
- DSP and FPGA sharing the data bus
Separate DSP and FPGA
- Data exchange
- Price
DSP and FPGA sharing memory
- Complex FPGA design
- Too many pins
- Price
DSP and FPGA in pipeline
- Speed
- Price
- Variability
DSP and FPGA sharing data bus
- Simple FPGA
- Small number of pins
- Price
- Speed
Suitable designs
- Pipeline - for very high performance systems with fixed function
- Shared data bus - for price/performance sensitive systems and for systems that need dynamically reconfigurable FPGAs
- Combination of both approaches above
Algorithm example FPGA II
- Long registers
- Image bands
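The slide gives no detail on how long registers and image bands are used, so the following is only one possible reading, written as a C model rather than an FPGA design. It assumes a binary image packed one pixel per bit and treats a band of three rows at a time, eroding 32 pixels in a single step with shifts and ANDs. The names erode_row32 and erode3x3_band32, the 32-bit width, and the treatment of pixels outside the word as 0 are all assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Horizontal 3-pixel erosion of 32 binary pixels packed in one word:
   a bit stays set only if it and both horizontal neighbours are set.
   Pixels outside the word are treated as 0 (assumption). */
uint32_t erode_row32(uint32_t row)
{
    return (row << 1) & row & (row >> 1);
}

/* 3x3 erosion of the middle row of a three-row band.
   The square mask is separable: AND the three rows vertically,
   then erode the result horizontally. */
uint32_t erode3x3_band32(uint32_t above, uint32_t middle, uint32_t below)
{
    return erode_row32(above & middle & below);
}
```

A register as wide as an image row would let hardware apply the same three shifts and ANDs to a whole row at once, which is the kind of bit-level parallelism listed among the FPGA strengths.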
Design examples
Basic design
- TMS320C32 DSP: 32-bit floating point DSP (60 MFLOP/s, 4KB internal RAM)
- XILINX Spartan FPGA with good I/O capabilities
- FPGA connected through DMA channels and DSP “host port”
Experimental design
- 60 MHz DSP
- 2MB RAM 15ns
- 54 I/O pins
- Dynamic FPGA configuration
- Stackable
- RS232 remote control
Design Overview
- TMS320C6711 DSP: 32-bit VLIW DSP (1 GFLOP/s, 256KB internal RAM), can possibly be replaced with “6411”
- XILINX Virtex partially reconfigurable FPGA with good I/O capabilities
- FPGA connected through DMA channels and DSP “host port”
Block diagram
[Block diagram labels: FPGA (Xilinx Virtex); SRAM (local) (1 MByte, 8 bit); DSP (TI 320C6711); 256 kB SRAM; SDRAM (64 MByte, 32 bit); Host link; Peer links (LVDS)]
More detailed block diagram
Algorithm example FPGA
- Executes in 1 cycle
- Data flow may delay it to 2 or more cycles
- Can use local storage for speedup
Possible problems
- DSP
  – difficult to use internal DSP memory
  – out-of-order execution complicated
- FPGA
  – I/O (data bus) speed critical
  – reconfiguration difficult to handle
Algorithm design
- VHDL language “codesign”
- Automated C/C++ language analysers
- C/C++ profilers & manual identification of portions to be implemented in FPGA is not “nice and sophisticated”, but is still the only reasonable way to go
Dynamic reconfiguration?
- The algorithms are generally diverse, and a single “algorithm core” does not exist
- Several “algorithm cores” must exist
  – the options are increased complexity or dynamical reconfiguration of the FPGA
- Reconfiguration of the FPGA is memory intensive
How to reconfigure?
- Multithreaded environment with synchronization events from the FPGA
- Data transfer using the DSP’s DMA channels
- Reconfigure only partially if the FPGA allows this feature (e.g. XILINX Virtex)
FPGA reconfiguration
FPGA resource allocation
- FPGA is a shared resource, well known from operating systems theory
- FPGA sharing may cause deadlocks
- To prevent deadlocks, use e.g. the “Banker’s algorithm”
- To increase speed with a small number of threads, “prefetch” can be used
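The safety check at the heart of the Banker's algorithm can be sketched as below. This is a generic textbook version, not the implementation used in the design described here: the function name banker_is_safe, the flat array layout, and the limits MAX_THREADS and MAX_KINDS are assumptions for illustration, and a "kind" stands for whatever FPGA resource is being shared (for example a reconfigurable region). A request should be granted only if the state that results from it still passes this check.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_THREADS 16  /* illustrative limits, not taken from the slides */
#define MAX_KINDS   8

/* Returns true if the state is safe, i.e. there is some order in which
   every thread can obtain its remaining need, finish, and release
   everything it holds.
   available: [n_kinds] free units of each resource kind
   allocated: [n_threads * n_kinds] units currently held by each thread
   max_need:  [n_threads * n_kinds] maximum units each thread may claim */
bool banker_is_safe(int n_threads, int n_kinds,
                    const int *available,
                    const int *allocated,
                    const int *max_need)
{
    assert(n_threads >= 0 && n_threads <= MAX_THREADS);
    assert(n_kinds >= 0 && n_kinds <= MAX_KINDS);

    int work[MAX_KINDS];
    bool done[MAX_THREADS] = { false };
    for (int k = 0; k < n_kinds; ++k)
        work[k] = available[k];

    int finished = 0;
    while (finished < n_threads) {
        bool progressed = false;
        for (int t = 0; t < n_threads; ++t) {
            if (done[t])
                continue;
            /* Can thread t get everything it may still ask for? */
            bool can_finish = true;
            for (int k = 0; k < n_kinds; ++k) {
                int still_needed = max_need[t * n_kinds + k]
                                 - allocated[t * n_kinds + k];
                if (still_needed > work[k]) {
                    can_finish = false;
                    break;
                }
            }
            if (!can_finish)
                continue;
            /* Assume t runs to completion and releases what it holds. */
            for (int k = 0; k < n_kinds; ++k)
                work[k] += allocated[t * n_kinds + k];
            done[t] = true;
            ++finished;
            progressed = true;
        }
        if (!progressed)
            return false;  /* no remaining thread can finish: unsafe */
    }
    return true;
}
```

With two threads that each hold one unit and may each need one more, the state is unsafe when no unit is free, because neither thread can finish first.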
Algorithm example
    void Process1(const Image * A, Image * OutA)
    {
        Alloc(UnitX);             // prefetch X
        PreprocessInDSP(A);
        Wait(UnitX);              // use X now!
        Execute(UnitX, A, OutA);
        Free(UnitX);              // free X
    }

    void Process2(const Image * B, Image * OutB)
    {
        Alloc(UnitY);             // prefetch Y
        Wait(UnitY);              // and use immediately
        Execute(UnitY, B, OutB);
        Free(UnitY);              // free Y
    }
[Timeline diagram labels: DSP and FPGA processing images A and B; Load X,Y; Exec X,Y; Free X,Y]
Example
Rendering engine
Conclusions
- DSP co-operation with FPGA is possible
- Suitable for both cost efficient and very high performance designs
- Still difficult to generalise where and how