I/O Lower Bounds and Algorithms for Matrix-Matrix Multiplication
Tyler M. Smith July 5, 2017
◮ I/O lower bounds with a tight constant: 2mnk/√S
◮ A family of algorithms for machines with any number of levels of cache
◮ Outperform the state-of-the-art Goto's Algorithm by 38%
◮ C += AB
◮ C is m × n, A is m × k, and B is k × n
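As a concrete reference point, the operation C += AB can be sketched as a triple loop (a minimal Python sketch; real implementations block for the memory hierarchy, as the rest of this talk discusses):

```python
# Reference triple-loop implementation of C += A B,
# where C is m x n, A is m x k, and B is k x n.
def mmm(A, B, C):
    m, k, n = len(A), len(B), len(B[0])
    for i in range(m):
        for j in range(n):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
    return C
```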
◮ Each element of A is used n times
◮ Each element of B is used m times
◮ Each element of C is used k times
◮ If all matrices fit into fast memory, the O(n²) memops are amortized over O(n³) flops
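The reuse counts above can be tallied directly (a small sketch; the function name is mine, and each element is counted as loaded once):

```python
# Reuse counts for C += A B: every element of A appears in n updates,
# every element of B in m, and every element of C in k. With m = n = k,
# O(n^3) flops are amortized over the O(n^2) elements touched.
def reuse_and_intensity(m, n, k):
    memops = m * k + k * n + m * n   # each element loaded once
    flops = 2 * m * n * k            # one multiply + one add per update
    return {"A": n, "B": m, "C": k, "flops_per_memop": flops / memops}
```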
◮ Goto’s Algorithm
[Figure: Goto's Algorithm. The 5th loop around the micro-kernel partitions n into blocks Bj and Cj of width nC; the 4th loop partitions k into blocks Ap and Bp of depth kC, packing Bp → B̃p into the L3 cache; the 3rd loop partitions m into blocks Ai and Ci of height mC, packing Ai → Ãi into the L2 cache; the 2nd and 1st loops partition into micro-panels of width nR and height mR streamed through the L1 cache; the micro-kernel updates an mR × nR block of C in registers, with data streamed from main memory.]
◮ An element of A is reused nc times
◮ An element of B is reused m times
◮ An element of C is reused kc times
I/O costs:
◮ A: mnk/nc
◮ B: mnk/m
◮ C: mnk/kc
[Figure: Roofline plot for the 4-core Intel i7-7700K: GFLOPS vs. flops per byte (1–512), with Goto's Algorithm plotted against the roofline.]
[Figure: Roofline plots of Goto's Algorithm, GFLOPS vs. flops per byte (1–512), at main-memory bandwidths of 51.2 GB/s and 6.4 GB/s.]
◮ 2 layers of memory: slow and fast
◮ Slow memory has unlimited capacity
◮ Fast memory has capacity S
◮ Data must be in fast memory before computing with it
◮ I/O lower bound: Ω(mnk/√S)
◮ I/O lower bound: mnk/(2√2 · √S)
◮ With a little calculus this can be improved to mnk/√S
◮ I/O lower bound: 2mnk/√S
◮ Under submission at ACM TOMS
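For comparison, the successive bounds can be evaluated side by side (a sketch; the function names are mine, and the constant in the Ω bound is taken as 1):

```python
from math import sqrt

# The successive I/O lower bounds for MMM with mnk multiplications
# and fast-memory capacity S (in elements).
def bound_omega(m, n, k, S):          # Omega(mnk / sqrt(S))
    return m * n * k / sqrt(S)

def bound_improved(m, n, k, S):       # mnk / (2 sqrt(2) sqrt(S))
    return m * n * k / (2 * sqrt(2) * sqrt(S))

def bound_tight(m, n, k, S):          # the 2 mnk / sqrt(S) bound of this work
    return 2 * m * n * k / sqrt(S)
```

For m = n = k = 1000 and S = 10000 elements these give 10^7, about 3.5 × 10^6, and 2 × 10^7 element transfers, respectively.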
◮ Each phase has an I/O cost of exactly S (except the last)
◮ Let F be an upper bound on the multiplications during a phase
◮ There are mnk total multiplications during MMM
◮ There must be at least mnk/F phases
◮ Determine F based on the number of elements available
◮ Each phase: 2S elements available as inputs and 2S elements available as outputs
◮ Using NA, NB, and NC elements of A, B, and C, a phase can perform at most √(NA · NB · NC) multiplications
◮ NA ≤ 2S, NB ≤ 2S, and NC ≤ 2S
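Putting the pieces together numerically (a sketch under the assumptions above: each of NA, NB, NC is at most 2S, and each phase costs S of I/O):

```python
from math import sqrt

# A phase with N_A, N_B, N_C <= 2S available elements performs at most
# sqrt(N_A * N_B * N_C) multiplications, so F = (2S)^(3/2) at most.
def max_mults_per_phase(S):
    return (2 * S) ** 1.5

# At least mnk / F phases, each with I/O cost S,
# recovering the mnk / (2 sqrt(2) sqrt(S)) bound.
def io_lower_bound(m, n, k, S):
    return S * m * n * k / max_mults_per_phase(S)
```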
◮ In an FMA, elements of A, B, and C are all inputs
◮ We can reason about the input cost of C
◮ Let M be the I/O cost of a phase: each phase can have S + M inputs and S + M outputs
◮ This adds a degree of freedom to our lower bound
◮ NA + NB + NC ≤ S + M
◮ Maximizing over M, the bound is greatest when M = 2S
◮ The greatest lower bound is 2mnk/√S
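The maximization over M can be checked numerically (a sketch; the grid search and variable names are mine):

```python
from math import sqrt

# With per-phase I/O cost M and N_A + N_B + N_C <= S + M, a phase performs
# at most ((S + M) / 3)^(3/2) multiplications, giving the bound below.
# A grid search shows it peaks at M = 2S with value 2 mnk / sqrt(S).
def bound(m, n, k, S, M):
    return M * m * n * k / ((S + M) / 3) ** 1.5

S = 100.0
best_M = max(range(1, 1001), key=lambda M: bound(1, 1, 1, S, M))
```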
[Figure: Roofline plots showing the lower bound and Goto's Algorithm, GFLOPS vs. flops per byte (1–512), at main-memory bandwidths of 51.2 GB/s and 6.4 GB/s.]
◮ Single level of cache
◮ Multiple levels of cache
Per iteration:
◮ Ci,j: mc nc reads and mc nc writes
◮ Ai: mc k reads
◮ Bj: k nc reads
In total:
◮ C: mn reads and mn writes
◮ A: mnk/nc reads
◮ B: mnk/mc reads
With mc = nc = √S:
◮ C: mn reads and mn writes
◮ A: mnk/√S reads
◮ B: mnk/√S reads
◮ I/O cost is 2mnk/√S
◮ Same as the lower bound
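A blocked algorithm with this I/O behavior keeps one block of C resident while panels of A and B stream past it. A minimal sketch (assuming `bs` plays the role of mc = nc ≈ √S):

```python
# Blocked C += A B: a bs x bs block of C stays resident while the
# corresponding panels of A and B stream through, so C moves once
# while A is read n/bs times and B is read m/bs times.
def blocked_mmm(A, B, C, bs):
    m, k, n = len(A), len(B), len(B[0])
    for i0 in range(0, m, bs):
        for j0 in range(0, n, bs):
            # C[i0:i0+bs][j0:j0+bs] is the resident block
            for p in range(k):
                for i in range(i0, min(i0 + bs, m)):
                    a = A[i][p]
                    for j in range(j0, min(j0 + bs, n)):
                        C[i][j] += a * B[p][j]
    return C
```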
◮ Resident A, Resident B, and Resident C
◮ Each is associated with a shape of MMM
◮ We can do it with two loops
◮ e.g. B3A2C1
◮ Depends on the ratio between Sh and Sh−1
◮ Optimize for the Lh and Lh−2 caches
◮ Under the right circumstances, the Lh guest matrix can be
◮ We can think of Goto's Algorithm as "skipping" the L3 and L1 caches
◮ We can call Goto's Algorithm "A2CR"
◮ Intel i7-7700K CPU
◮ 4 cores
◮ Hyper-threading disabled
◮ Turbo mode disabled, CPU set to 4.2 GHz
B3A2 Algorithm (block of B reused in the L3 cache; block of A reused in the L2 cache):
◮ Partition n with blocksize 768
◮ Partition k with blocksize 768
◮ Partition m with blocksize 120
◮ Partition k with blocksize 192
◮ Inner kernel

Goto's Algorithm (block of A reused in the L2 cache):
◮ Partition n with blocksize 3000
◮ Partition k with blocksize 192
◮ Partition m with blocksize 120
◮ Inner kernel
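The B3A2 partitioning can be sketched as a loop nest (a hypothetical Python sketch; only the tiling is shown, with the inner kernel left abstract):

```python
# Enumerate the subproblems visited by the B3A2 partitioning:
# n by 768, k by 768, m by 120, then k again by 192; the innermost
# tile is handed to the inner kernel.
def b3a2_tiles(m, n, k):
    for jc in range(0, n, 768):                        # partition n
        for pc in range(0, k, 768):                    # partition k (B block -> L3)
            for ic in range(0, m, 120):                # partition m (A block -> L2)
                for pr in range(pc, min(pc + 768, k), 192):  # partition k again
                    yield (ic, jc, pr)                 # inner kernel runs here

tiles = list(b3a2_tiles(240, 768, 768))
```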
[Figure: Roofline plots of the lower bound, MOMMS Goto (A2), and MOMMS B3A2, GFLOPS vs. flops per byte (1–512), at 51.2 GB/s (2 channels of DDR4 3200 RAM) and 6.4 GB/s (1 channel of DDR4 800 RAM).]
[Figure: GFLOPS vs. m = n = k (1000–4000) for MOMMS Goto (A2) and MOMMS B3A2 at main-memory bandwidths of 6.4, 9.6, 12.8, and 51.2 GB/s.]
A3B2 Algorithm (block of A reused in the L3 cache; block of B reused in the L2 cache):
◮ Partition m with blocksize 768
◮ Partition k with blocksize 768
◮ Partition n with blocksize 120
◮ Partition k with blocksize 192
◮ Inner kernel

C3A2 Algorithm (block of C reused in the L3 cache; block of A reused in the L2 cache):
◮ Partition n with blocksize 624
◮ Partition m with blocksize 624
◮ Partition k with blocksize 156
◮ Partition m with blocksize 156
◮ Inner kernel
[Figure: GFLOPS (1000–4000 per varied dimension) for MOMMS Goto (A2), MOMMS A3B2, MOMMS B3A2, and MOMMS C3A2 across seven problem shapes: varying m (n = k = 600), varying k = n (m = 600), varying n (m = k = 600), varying m = k (n = 600), varying k (m = n = 600), varying m = n (k = 600), and square (m = n = k).]
[Figure: GFLOPS vs. m = n = k (1000–4000) for MOMMS Goto (A2), MOMMS B3A2, MOMMS C3A2, BLIS, MKL, and ATLAS at 6.4 GB/s and 51.2 GB/s.]
◮ We can reason about the optimality of algorithms
◮ Better L3 cache utilization
◮ We know how to use further levels of the memory hierarchy
◮ Parallelization
◮ Algorithms for other operations (rest of the level-3 BLAS, …)
[Figure: comparison, as a function of the ratio Sh−1/Sh, of an LRU Lh cache when optimizing for Lh and Lh−1; the best case when optimizing for Lh and Lh−1; and blocking for only Lh−1.]
◮ Each element of C is involved in k flops
◮ kc flops are accumulated into an element of C each time it is read
◮ Each element of C is read from and written to main memory k/kc times
◮ I/O cost of C is 2mnk/kc
◮ I/O cost of A is mnk/nc
◮ I/O cost of B can be amortized completely
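These costs can be tallied in a few lines (a sketch; the function and variable names are mine):

```python
# I/O costs when kc flops are accumulated into each element of C per read:
# C moves 2mn * (k/kc) elements in total, A is read mnk/nc times, and the
# movement of B is amortized completely.
def io_costs(m, n, k, kc, nc):
    c_io = 2 * m * n * (k / kc)   # each element of C read and written k/kc times
    a_io = m * n * k / nc
    b_io = 0                      # amortized completely
    return c_io, a_io, b_io
```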
C4A2 Algorithm (blocks reused in the L4, L3, and L2 caches):
◮ Partition m with blocksize 3600
◮ Partition n with blocksize 3600
◮ Partition k with blocksize 192
◮ Partition m with blocksize 120
◮ Inner kernel
[Figure: GFLOPS vs. m = n = k (up to 15000) for MOMMS Goto (A2) and MOMMS C4A2 at main-memory bandwidths of 6.4, 8.53, 10.66, and 38.4 GB/s.]
[Figure: GFLOPS vs. m = n = k (up to 15000) for MOMMS Goto (A2), MOMMS C4A2, BLIS, and MKL at 6.4 GB/s and 38.4 GB/s.]
◮ NA + NB + NC ≤ 2S
◮ √(NA · NB · NC) ≤ √(xyz) for some x, y, z ∈ ℝ with x + y + z = 2S
◮ The maximum occurs at x = y = z = 2S/3
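The claim that the product is maximized at x = y = z can be checked with a small grid search (a sketch; the grid resolution and names are mine):

```python
import itertools

# Subject to x + y + z = 2S with x, y, z > 0, the product xyz (and hence
# sqrt(xyz)) is maximized at x = y = z = 2S/3.
def best_split(S, steps=60):
    pts = [2 * S * i / steps for i in range(1, steps)]
    best = max(((x * y * (2 * S - x - y), x, y)
                for x, y in itertools.product(pts, pts)
                if 2 * S - x - y > 0))
    return best[1], best[2], 2 * S - best[1] - best[2]
```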
◮ Fills the L1 cache with a square block of A or B
◮ Streams the other two matrices