The Sliding Window Algorithm


SLIDE 1

The Sliding Window Algorithm

  • The “Sliding Window” algorithm sums several small sub-matrices of a matrix of values.
  • The maximum of these sums is then located and its coordinates are passed out of the program.
  • This is used as a calorimetry trigger, for locating events with high-energy jets.

– (The following 6 slides have been copied from Matthew's presentation)

SLIDE 2

Sliding window: serial

SLIDE 3

Sliding window: serial

SLIDE 4

Sliding window: serial

SLIDE 5

Sliding window: serial

SLIDE 6

Sliding window: serial etc...

SLIDE 7

Sliding window: parallel (5x5 border around each submatrix – use your imagination)

SLIDE 8

Two Approaches to the Sliding Window

  • CPU Algorithm (Standard)
  • GPU Algorithm
  • Hybrid Algorithm
    – Use the GPU to perform the sliding window sum
    – Transfer the resulting matrix of sums to the CPU
    – Use the CPU to locate the maximum
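The CPU half of the hybrid scheme above is a plain linear scan over the matrix of window sums after it has been copied back from the GPU. A minimal sketch, with illustrative names (in the real algorithm the `sums` buffer would arrive via a device-to-host copy rather than being a host vector):

```cpp
#include <vector>

struct MaxLoc {
    int row, col;   // coordinates of the maximum window sum
    double sum;
};

// Linear O(N) scan over the rows x cols matrix of window sums (row-major)
// that the GPU produced, locating the maximum on the CPU.
MaxLoc locateMax(const std::vector<double>& sums, int rows, int cols) {
    MaxLoc best{0, 0, sums[0]};
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            if (sums[r * cols + c] > best.sum)
                best = {r, c, sums[r * cols + c]};
    return best;
}
```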

SLIDE 9

Motivation : Time Complexity

  • The time complexity of the algorithm to locate a maximum on the CPU is linear, O(N).
  • The time complexity of the best algorithm to do the same on the GPU is O(N log N).
    – Even if the GPU cores and CPU cores had the same processing speed, more calculations are required by the GPU to perform the same task.

SLIDE 10

Small Problem : Find Max

The speed-up is not fully realized for small window sizes because the GPU finishes the calculation nearly as fast as new calculation commands can be issued. Note that at ATLAS-scale problems (~5,000) this algorithm performs MUCH worse than the CPU version. Matthew wrote a new algorithm that works better at ATLAS-scale problems, but not as well at extreme sizes.

SLIDE 11

Small Problem : Sliding Window Speed-Up

For those concerned: the sudden drops on this plot are due to my testing procedure, which uses rectangular grids of varying dimensions. Threads are issued one warp (32 threads) at a time, and I declared each block to be a constant 256 threads. Because of this there are problem sizes for which a large number of threads are inactive.

SLIDE 12

Large Problem : Find Max

Even at extremely large sizes, the speed-up offered by the GPU algorithm pales in comparison to the speed-up of the sliding window.

SLIDE 13

Large Problem : Sliding Window Speed-Up

Here the speed-up has plateaued: the speed-up of any algorithm is limited by the number of GPU cores that can run simultaneously.

SLIDE 14

Motivation : Processing Speed vs Copy Speed

  • The GPU cores are individually much slower than a CPU core.
  • The copy speed from the GPU to the CPU is very fast, and the result that needs to be copied is relatively small.
    – It may be worth the time to copy the memory to the CPU, since the CPU can locate the maximum much faster.

SLIDE 15

Small Problem Fraction Plots

SLIDE 16

Large Problem Fraction Plots

SLIDE 17

Conclusion

  • At ATLAS scale, the speed-up grows fastest and is greatest for the Hybrid algorithm (see ratio plot).
  • Beyond the ATLAS scale (a factor of ~10 greater), the purely GPU algorithm becomes better.

SLIDE 18

Small Problem : Ratio

At small problem sizes (the current ATLAS size is around 5,000), the Hybrid algorithm provides the greater speed-up.

SLIDE 19

Large Problem : Ratio

This shows that at extremely large problem sizes the purely GPU-based algorithm provides a greater speed-up than the Hybrid.

SLIDE 20

Small Problem Speed Up

SLIDE 21

Large Problem Speed-Up

SLIDE 22

Backup Slide : CUDA Card Specs

  • 8 SMs (streaming multiprocessors) with 192 cores each (1,536 cores total) @ ~1000 MHz
  • ~15.75 GB/s bandwidth to host