CS4402-9535: High-Performance Computing with CUDA
Marc Moreno Maza
University of Western Ontario, London, Ontario (Canada)
UWO-CS4402-CS9535
(Moreno Maza) CS4402-9535: High-Performance Computing with CUDA UWO-CS4402-CS9535 1 / 113
Plan
Optimizing Matrix Transpose with CUDA
1. The half warp writes four half rows of the idata matrix tile to the
2. After a
3. the half warp writes four half columns of tile to four half rows of an
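The three steps above are cut short in this copy of the slides, so the following is only a minimal sketch of the usual shared-memory transpose pattern they appear to describe, not necessarily the kernel shown in the lecture. The names `transposeCoalesced`, `TILE_DIM`, `BLOCK_ROWS`, `odata` and `transposeOnDevice` are assumptions; only `idata` and `tile` come from the text. With a 32 x 8 thread block, each half warp of 16 threads handles half of a 32-element tile row, and the loop of four iterations gives the "four half rows" mentioned above.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE_DIM   32  // assumed tile edge (a 32 x 32 shared-memory tile)
#define BLOCK_ROWS 8   // assumed block height: each thread moves 32 / 8 = 4 elements

// Transposes a row-major matrix idata (height rows, width columns) into
// odata (width rows, height columns).
// Assumes width and height are multiples of TILE_DIM.
__global__ void transposeCoalesced(float* odata, const float* idata,
                                   int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM];

    // Step 1: copy rows of the input tile into shared memory. Consecutive
    // threads read consecutive global addresses, so the loads coalesce.
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        tile[threadIdx.y + i][threadIdx.x] = idata[(y + i) * width + x];

    // Step 2: wait until every thread of the block has filled its part.
    __syncthreads();

    // Step 3: read columns of the tile and write them as rows of odata,
    // so the global stores coalesce as well.
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        odata[(y + i) * height + x] = tile[threadIdx.x][threadIdx.y + i];
}

// Hypothetical host wrapper (not from the slides): returns false on a CUDA
// error or when the dimensions do not satisfy the kernel's assumptions.
bool transposeOnDevice(const std::vector<float>& in, std::vector<float>& out,
                       int width, int height)
{
    if (width <= 0 || height <= 0 || width % TILE_DIM || height % TILE_DIM)
        return false;
    if (in.size() != (size_t)width * height) return false;

    size_t bytes = in.size() * sizeof(float);
    float *dIn = nullptr, *dOut = nullptr;
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return false; }

    out.resize(in.size());
    cudaMemcpy(dIn, in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 grid(width / TILE_DIM, height / TILE_DIM), block(TILE_DIM, BLOCK_ROWS);
    transposeCoalesced<<<grid, block>>>(dOut, dIn, width, height);

    cudaError_t err = cudaGetLastError();  // launch-configuration errors
    if (err == cudaSuccess)                // this copy also waits for the kernel
        err = cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return err == cudaSuccess;
}
```

A call such as `transposeOnDevice(in, out, 96, 64)` should leave `out[x * 64 + y] == in[y * 96 + x]` for every element. Matrices whose dimensions are not multiples of 32 would need bounds checks that this sketch leaves out.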
Optimizing Matrix Transpose with CUDA
1. Shared memory is divided into 16 equally-sized memory modules,
2. These banks can be accessed simultaneously, and to achieve maximum
3. The exception to this rule is when all threads in a half warp read
4. One can use the warp_serialize flag when profiling CUDA
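To make the bank rule concrete, here is a small host-side sketch that counts how many distinct words a half warp requests from the busiest bank. It is a simplified model, not a description of the hardware scheduler: it assumes 16 banks as stated above, 32-bit words with successive words in successive banks, and it treats threads that read the same word as a single request. All names (`bankOf`, `conflictDegree`, `stridedAccess`) are hypothetical.

```cuda
#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

constexpr int NUM_BANKS = 16;  // bank count stated in the text
constexpr int HALF_WARP = 16;  // threads that access shared memory together

// Bank that holds a given 32-bit word (assumed mapping: word index mod 16).
inline int bankOf(int wordIndex) { return wordIndex % NUM_BANKS; }

// Largest number of distinct words requested from any single bank.
// 1 means conflict-free; d > 1 models a d-way bank conflict.
int conflictDegree(const std::vector<int>& wordIndices)
{
    std::set<int> perBank[NUM_BANKS];
    for (int w : wordIndices) perBank[bankOf(w)].insert(w);
    std::size_t worst = 0;
    for (const auto& words : perBank) worst = std::max(worst, words.size());
    return (int)worst;
}

// Word indices touched by a half warp where thread t reads word t * stride.
std::vector<int> stridedAccess(int stride)
{
    std::vector<int> words(HALF_WARP);
    for (int t = 0; t < HALF_WARP; ++t) words[t] = t * stride;
    return words;
}
```

Under this model, `conflictDegree(stridedAccess(1))` is 1 (every thread hits its own bank), a stride of 2 gives 2, a stride of 16 gives 16, and a stride of 0 (all threads reading one word) gives 1.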
Optimizing Matrix Transpose with CUDA
1. The coalesced transpose uses a 32 × 32 shared memory array of
2. For this sized array, all data in columns k and k+16 are mapped to
3. As a result, when writing partial columns from tile in shared
4. A simple way to avoid this conflict is to pad the shared memory array
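The bank arithmetic behind this can be written out as a short worked equation. It assumes 32-bit elements stored row-major with a row width of W words, 16 banks, and successive words in successive banks; it also assumes that the padding meant here is one extra column (W = 33), which the truncated text does not confirm.

```latex
% bank of the element in row r, column c of a tile with row width W
\[
  \operatorname{bank}(r, c) = (rW + c) \bmod 16
\]
% unpadded tile: the bank does not depend on the row
\[
  W = 32:\quad \operatorname{bank}(r, c) = (32r + c) \bmod 16 = c \bmod 16
\]
% tile padded by one column: the bank advances by one per row
\[
  W = 33:\quad \operatorname{bank}(r, c) = (33r + c) \bmod 16 = (r + c) \bmod 16
\]
```

With W = 32, every element of a column lands in the same bank, and columns k and k+16 share it, so 16 threads reading a partial column conflict 16 ways. With W = 33, 16 consecutive rows of one column fall into 16 different banks, so the same read is conflict-free under these assumptions.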
Performance Optimization
1. Partition data into subsets that fit into shared memory.
2. Handle each data subset with one thread block.
3. Load the subset from global memory to shared memory, using
4. Perform the computation on the subset from shared memory.
5. Copy the result from shared memory back to global memory.
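The five steps above can be sketched on a simple task: summing an array, one chunk per thread block. This is an illustration chosen for brevity, not code from the slides. The names `blockSum`, `sumOnDevice` and `BLOCK_SIZE` are hypothetical, and since step 3 is cut short in this copy, the sketch assumes that each thread loads exactly one element.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK_SIZE 256  // assumed threads per block (a power of two)

// Each block sums one chunk of BLOCK_SIZE elements and writes one partial sum.
__global__ void blockSum(const float* in, float* blockSums, int n)
{
    __shared__ float chunk[BLOCK_SIZE];         // step 1: the subset fits here

    int tid = threadIdx.x;
    int gid = blockIdx.x * BLOCK_SIZE + tid;    // step 2: one block per subset

    chunk[tid] = (gid < n) ? in[gid] : 0.0f;    // step 3: global -> shared
    __syncthreads();

    // Step 4: tree reduction carried out entirely in shared memory.
    for (int s = BLOCK_SIZE / 2; s > 0; s >>= 1) {
        if (tid < s) chunk[tid] += chunk[tid + s];
        __syncthreads();
    }

    if (tid == 0) blockSums[blockIdx.x] = chunk[0];  // step 5: shared -> global
}

// Hypothetical host wrapper: adds the per-block partial sums on the CPU.
bool sumOnDevice(const std::vector<float>& in, float& total)
{
    total = 0.0f;
    int n = (int)in.size();
    if (n == 0) return true;
    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;

    float *dIn = nullptr, *dSums = nullptr;
    if (cudaMalloc(&dIn, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&dSums, blocks * sizeof(float)) != cudaSuccess) {
        cudaFree(dIn);
        return false;
    }
    cudaMemcpy(dIn, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, BLOCK_SIZE>>>(dIn, dSums, n);

    std::vector<float> sums(blocks);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(sums.data(), dSums, blocks * sizeof(float),
                         cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dSums);
    if (err != cudaSuccess) return false;

    for (float s : sums) total += s;
    return true;
}
```

For an input of 1000 ones, `sumOnDevice` launches four blocks and should set `total` to 1000. The final addition of partial sums is done on the host only to keep the sketch short.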
Parallel Reduction
Parallel Scan
Exercises