

SLIDE 1

I/O Lower Bounds and Algorithms for Matrix-Matrix Multiplication

Tyler M. Smith
July 5, 2017

SLIDE 2

Introduction

◮ Dense matrix-matrix multiplication (MMM)
◮ Goal: reduce I/O cost for machines with hierarchical memory
◮ Novel contributions:
  ◮ I/O lower bounds with a tight constant: 2mnk/√S
  ◮ A family of algorithms for machines with any number of levels of memory hierarchy
  ◮ Outperform the state-of-the-art Goto's Algorithm by 38% when there is low bandwidth to main memory

SLIDE 3

Problem definition

◮ Classical MMM:
  ◮ C += AB
  ◮ C is m × n, A is m × k, and B is k × n
◮ Reduce I/O cost for MMM algorithms

SLIDE 4

Hierarchical memory

[Figure: the memory hierarchy.]

SLIDE 5

Blocked algorithms

◮ MMM is an operation with a lot of opportunities for reuse:
  ◮ Each element of A is used n times
  ◮ Each element of B is used m times
  ◮ Each element of C is used k times
◮ With O(n²) elements, one can perform O(n³) flops
◮ If all matrices fit into fast memory, amortize O(n²) memops with O(n³) flops
◮ Work with blocks of matrices at a time, where the blocks can fit into fast memory
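The blocking idea above can be sketched in Rust (the language MOMMS itself is written in); this is a minimal illustration of a blocked algorithm, not code from the talk:

```rust
// Minimal blocked MMM sketch: C += A * B with square blocking.
// Matrices are stored row-major in flat slices; `nb` is the blocksize,
// chosen so three nb x nb blocks fit in fast memory together.
fn blocked_mmm(m: usize, n: usize, k: usize, nb: usize,
               a: &[f64], b: &[f64], c: &mut [f64]) {
    for jj in (0..n).step_by(nb) {
        for ii in (0..m).step_by(nb) {
            for pp in (0..k).step_by(nb) {
                // One block multiply: amortizes O(nb^2) memops
                // over O(nb^3) flops.
                for i in ii..(ii + nb).min(m) {
                    for j in jj..(jj + nb).min(n) {
                        let mut acc = c[i * n + j];
                        for p in pp..(pp + nb).min(k) {
                            acc += a[i * k + p] * b[p * n + j];
                        }
                        c[i * n + j] = acc;
                    }
                }
            }
        }
    }
}

fn main() {
    // Tiny check: [[1,2],[3,4]] * [[5,6],[7,8]] = [[19,22],[43,50]].
    let a = vec![1.0, 2.0, 3.0, 4.0];
    let b = vec![5.0, 6.0, 7.0, 8.0];
    let mut c = vec![0.0; 4];
    blocked_mmm(2, 2, 2, 1, &a, &b, &mut c);
    assert_eq!(c, vec![19.0, 22.0, 43.0, 50.0]);
}
```

The three outer loops walk blocks; only the inner three loops touch data, so the block operands are what must stay resident in fast memory.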

SLIDE 6

Building blocks of dense linear algebra

◮ MMM is the bottom of the food chain
◮ Level-3 BLAS
◮ LAPACK/FLAME
◮ ScaLAPACK/Elemental

SLIDE 7

Outline

◮ Introduction
◮ State-of-the-art MMM
  ◮ Goto's Algorithm
◮ Lower bounds
◮ Algorithms
◮ Experiments

SLIDE 8

Goto's Algorithm

[Figure: the five loops around the micro-kernel. The 5th loop partitions n with blocksize nC (Bj, Cj); the 4th loop partitions k with blocksize kC (Ap, Bp) and packs Bp → B̃p; the 3rd loop partitions m with blocksize mC (Ai, Ci) and packs Ai → Ãi; the 2nd and 1st loops partition with blocksizes nR and mR; the micro-kernel computes in registers, with B̃p held in the L3 cache, Ãi in the L2 cache, and a micro-panel in the L1 cache.]

SLIDE 9

Goto's Algorithm

[Figure: the 5th loop around the micro-kernel, partitioning B and C into column panels Bj and Cj of width nC, which reside in main memory.]

SLIDE 10

Goto's Algorithm

[Figure: the 4th loop around the micro-kernel, partitioning the k dimension with blocksize kC into Ap and Bp, and packing Bp → B̃p into the L3 cache.]

SLIDE 11

Goto's Algorithm

[Figure: the 3rd loop around the micro-kernel, partitioning the m dimension with blocksize mC into Ai and Ci, and packing Ai → Ãi into the L2 cache.]

SLIDE 12

Goto's Algorithm

[Figure: the 2nd loop around the micro-kernel, partitioning B̃p and Ci into panels of width nR.]

SLIDE 13

Goto's Algorithm

[Figure: the 1st loop around the micro-kernel, partitioning Ãi into row panels of height mR.]

SLIDE 14

Goto's Algorithm

[Figure: the complete picture, as on Slide 8, with each loop annotated by the level of the memory hierarchy it blocks for: registers, the L1, L2, and L3 caches, and main memory.]

SLIDE 15

I/O cost of Goto's Algorithm

◮ Reuse dictates the I/O cost for Goto's Algorithm
◮ Each time an element is read from main memory:
  ◮ An element of A is reused nc times
  ◮ An element of B is reused m times
  ◮ An element of C is reused kc times
◮ Overall I/O costs of:
  ◮ A: mnk/nc
  ◮ B: mnk/m
  ◮ C: mnk/kc
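Plugging numbers into these formulas makes the costs concrete; a small sketch (the sizes and blocksizes below are illustrative, not the talk's tuned values):

```rust
// Total main-memory reads for Goto's Algorithm per the slide's model:
// A is read mnk/nc times in total, B mnk/m (= nk), and C mnk/kc.
fn goto_reads(m: f64, n: f64, k: f64, nc: f64, kc: f64) -> f64 {
    let mnk = m * n * k;
    mnk / nc + mnk / m + mnk / kc
}

fn main() {
    // Illustrative square problem: m = n = k = 1200, nc = 300, kc = 200.
    // A: 1200^3/300 = 5,760,000; B: 1200^2 = 1,440,000; C: 1200^3/200 = 8,640,000.
    let reads = goto_reads(1200.0, 1200.0, 1200.0, 300.0, 200.0);
    assert_eq!(reads, 5_760_000.0 + 1_440_000.0 + 8_640_000.0);
}
```

Note how the C term dominates when kc is small relative to nc, which is why the bound on reuse of C matters in the analysis that follows.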

SLIDE 16

Roofline model

[Roofline plot: GFLOPS vs. flops per byte for Goto's Algorithm on a 4-core Intel i7-7700K.]

SLIDE 17

Roofline model

[Roofline plots: GFLOPS vs. flops per byte for Goto's Algorithm, with bandwidth to main memory of 51.2 GB/s (left) and 6.4 GB/s (right).]

SLIDE 18

Outline

◮ Introduction
◮ State-of-the-art MMM
◮ Lower bounds
◮ Algorithms
◮ Experiments

SLIDE 19

I/O lower bounds

◮ Theoretical minimum I/O cost for an operation
◮ We want to find the greatest I/O lower bound
◮ Model of computation:
  ◮ 2 layers of memory: slow and fast
  ◮ Slow memory has unlimited capacity
  ◮ Fast memory has capacity S
  ◮ Data must be in fast memory before computing with it

SLIDE 20

Related work

◮ Hong and Kung (1981)
  ◮ I/O lower bound: Ω(mnk/√S)
◮ Irony, Toledo, and Tiskin (2004)
  ◮ I/O lower bound: mnk/(2√2 √S)
  ◮ With a little calculus this can be improved to mnk/√S
◮ Tyler Smith, Robert van de Geijn, Bradley Lowery, and Julien Langou (2017)
  ◮ I/O lower bound: 2mnk/√S
  ◮ Under submission at ACM TOMS

SLIDE 21

Lower bound strategy

◮ Consider any algorithm for MMM
◮ Break the algorithm into phases
  ◮ Each phase has an I/O cost of exactly S¹
◮ If there must be at least h phases, and each phase has an I/O cost of S, the overall I/O cost must be at least Sh
◮ Determine the minimum number of phases:
  ◮ Let F be an upper bound on the multiplications during a phase
  ◮ There are mnk total multiplications during MMM
  ◮ There must be at least mnk/F phases
◮ Determine F based on the number of elements available
  ◮ Each phase: 2S elements available as inputs and 2S elements available as outputs

¹except the last

SLIDE 22

Upper bound on elementary multiplications in a phase

Irony, Toledo, and Tiskin (2004)

◮ Inequality from Loomis and Whitney (1949):
  ◮ Using NA, NB, and NC elements of A, B, and C, one can perform at most √(NA NB NC) multiplications
◮ At most 2S elements available as inputs, and 2S elements available as outputs
  ◮ NA ≤ 2S, NB ≤ 2S, and NC ≤ 2S
◮ At most √(8S³) = 2√2 S√S multiplications in a phase
◮ Gives an overall lower bound of (1/(2√2)) mnk/√S

SLIDE 23

Improving the lower bound

◮ Assume we perform FMAs instead of elementary multiplications
  ◮ In an FMA, elements of A, B, and C are all inputs
  ◮ We can reason about the input cost of C
◮ What if we generalize the I/O cost of each phase?
  ◮ Each phase can have S + M inputs and S + M outputs
  ◮ This adds a degree of freedom to our lower bound

SLIDE 24

Upper bound on FMAs during a phase

◮ There are at most S + M inputs
  ◮ NA + NB + NC ≤ S + M
◮ We again use the Loomis-Whitney inequality
  ◮ Maximize √(NA NB NC) subject to NA + NB + NC = S + M
  ◮ Maximized when NA = NB = NC
  ◮ Then our lower bound is 3√3 M mnk / ((S + M)√(S + M))
◮ Finding the greatest lower bound:
  ◮ Maximizing over M, this occurs when M = 2S
  ◮ The greatest lower bound is 2mnk/√S
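The "maximizing over M" step can be written out explicitly; a sketch of the calculus, consistent with the bound stated above:

```latex
% Lower bound as a function of M, from the phase argument:
%   B(M) = 3\sqrt{3}\, M\, mnk \,/\, (S+M)^{3/2}.
% Only the factor M/(S+M)^{3/2} depends on M, so set its derivative to zero:
\[
\frac{d}{dM}\,\frac{M}{(S+M)^{3/2}}
  = \frac{(S+M)^{3/2} - \tfrac{3}{2}\,M\,(S+M)^{1/2}}{(S+M)^{3}}
  = \frac{S - \tfrac{1}{2}M}{(S+M)^{5/2}} = 0
  \quad\Longrightarrow\quad M = 2S,
\]
% and substituting M = 2S recovers the tight constant:
\[
B(2S) = \frac{3\sqrt{3}\cdot 2S\, mnk}{(3S)^{3/2}}
      = \frac{6\sqrt{3}\, S\, mnk}{3\sqrt{3}\, S^{3/2}}
      = \frac{2\,mnk}{\sqrt{S}}.
\]
```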

SLIDE 25

Roofline model

[Roofline plots: GFLOPS vs. flops per byte for Goto's Algorithm against the lower bound, with bandwidth to main memory of 51.2 GB/s (left) and 6.4 GB/s (right).]

SLIDE 26

Outline

◮ Introduction
◮ State-of-the-art MMM
◮ Lower bounds
◮ Algorithms
  ◮ Single level of cache
  ◮ Multiple levels of cache
◮ Experiments

SLIDE 27

Resident C

[Figure: C += AB.]

SLIDE 28

Resident C

Partition the m dimension with blocksize mc.

SLIDE 29

Resident C

Partition the n dimension with blocksize nc.

SLIDE 30

Resident C

Move an mc × nc block of C into fast memory.

SLIDE 31

Resident C

Stream panels of A and B from slow memory.

SLIDE 32

Resident C

Partition the k dimension with blocksize 1.

SLIDE 33

Resident C

Move vectors into fast memory.

SLIDE 34

I/O cost for Resident C

◮ I/O cost per block dot product:
  ◮ Ci,j: mc·nc reads and mc·nc writes
  ◮ Ai: mc·k reads
  ◮ Bj: k·nc reads
◮ Total I/O cost:
  ◮ C: mn reads and mn writes
  ◮ A: mnk/nc reads
  ◮ B: mnk/mc reads

SLIDE 35

Choosing blocksizes for Resident C

◮ If mc ≈ nc ≈ √S:
◮ Total I/O cost:
  ◮ C: mn reads and mn writes
  ◮ A: mnk/√S reads
  ◮ B: mnk/√S reads
◮ If m, n, k are large and we can ignore lower-order terms:
  ◮ I/O cost is 2mnk/√S
  ◮ Same as the lower bound
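The Resident C algorithm built up on the preceding slides can be written as a short Rust loop nest (a minimal illustration, not MOMMS code): the mc × nc block of C is what stays in fast memory while rank-1 updates stream vectors of A and B past it.

```rust
// Resident-C sketch: an mc x nc block of C stays in fast memory while
// rank-1 updates stream a column of A and a row of B past it.
// Row-major storage.  With mc ~ nc ~ sqrt(S), main-memory traffic
// is ~ 2mnk/sqrt(S) plus lower-order terms for reading/writing C.
fn resident_c(m: usize, n: usize, k: usize, mc: usize, nc: usize,
              a: &[f64], b: &[f64], c: &mut [f64]) {
    for ii in (0..m).step_by(mc) {
        for jj in (0..n).step_by(nc) {
            // The mc x nc block C[ii.., jj..] is now "resident".
            for p in 0..k {
                // Stream column p of A and row p of B (blocksize-1
                // partition of the k dimension).
                for i in ii..(ii + mc).min(m) {
                    for j in jj..(jj + nc).min(n) {
                        c[i * n + j] += a[i * k + p] * b[p * n + j];
                    }
                }
            }
        }
    }
}

fn main() {
    // Tiny check: [[1,2],[3,4]] * [[5,6],[7,8]] = [[19,22],[43,50]].
    let a = vec![1.0, 2.0, 3.0, 4.0];
    let b = vec![5.0, 6.0, 7.0, 8.0];
    let mut c = vec![0.0; 4];
    resident_c(2, 2, 2, 2, 2, &a, &b, &mut c);
    assert_eq!(c, vec![19.0, 22.0, 43.0, 50.0]);
}
```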

SLIDE 36

Three algorithms

[Figure: the Resident C, Resident B, and Resident A shapes of MMM, distinguishing data held in cache from data in main memory.]

SLIDE 37

Resident A, B, and C algorithms in Goto's Algorithm

[Figure.]

SLIDE 38

Algorithms for multiple levels of cache

◮ Suppose we have 2 levels of cache: L2 and L1
◮ We have 3 algorithms:
  ◮ Resident A, Resident B, and Resident C
  ◮ Each is associated with a shape of MMM
◮ Suppose we have one of those shapes at the L2 level
◮ Then how do we also encounter one at the L1 level?
  ◮ We can do it with two loops

SLIDE 39

Resident C at the L2 cache

[Figure: a block of C resident in the L2 cache.]

SLIDE 40

L1 outer loop

Partition the k dimension.

[Figure legend: resident block of the L2 cache.]

SLIDE 41

L1 outer loop

Partition the k dimension.

[Figure legend: resident block of the L2 cache.]

SLIDE 42

L1 inner loop

Partition either the m or the n dimension.

[Figure legend: resident block of the L2 cache.]

SLIDE 43

L1 inner loop

Partition either the m or the n dimension.

[Figure legend: resident block of the L2 cache; resident block of the L1 cache.]

SLIDE 44

L1 inner loop

Partition either the m or the n dimension.

[Figure legend: resident block of the L2 cache; guest panel of the L2 cache; resident block of the L1 cache.]

SLIDE 45

Resident A at the L2 cache

[Figure legend: resident block of the L2 cache; guest panel of the L2 cache; resident block of the L1 cache.]

SLIDE 46

Resident B at the L2 cache

[Figure legend: resident block of the L2 cache; guest panel of the L2 cache; resident block of the L1 cache.]

SLIDE 47

Families of algorithms

◮ We start out with one of the three shapes at the Lh cache
◮ With 2 loops, we have one of the other two shapes at the Lh−1 cache
◮ Repeat the process for subsequent levels of cache
◮ We name algorithms based on the resident matrix at each level of cache
  ◮ e.g. B3A2C1

SLIDE 48

Tradeoffs

◮ Blocking for the Lh−1 cache means more data must fit into Lh
  ◮ For LRU caches, all elements used during one iteration of the Lh−1 outer loop must fit into the Lh cache
  ◮ For ideal caches, the Lh resident matrix and Lh guest matrix must fit into the Lh cache
◮ This increases Lh I/O cost
  ◮ Depends on the ratio between Sh and Sh−1

SLIDE 49

What if it's not worth optimizing for both levels of cache?

◮ One option is to use smaller blocksizes for the Lh−1 cache
◮ Skipping a level of cache:
  ◮ Optimize for the Lh and Lh−2 caches
  ◮ Under the right circumstances, the Lh guest matrix can be placed in the Lh−1 cache
◮ We can think of Goto's Algorithm as "skipping" the L3 and L1 caches
  ◮ We can call Goto's Algorithm "A2CR"

SLIDE 50

Outline

◮ Introduction
◮ State-of-the-art MMM
◮ Lower bounds
◮ Algorithms
◮ Experiments

SLIDE 51

Experimental setup

◮ Custom-built PC with an unlocked CPU and enthusiast motherboard
◮ Vary BCLK, the CPU multiplier, and the memory multiplier to change system characteristics
◮ System details:
  ◮ Intel i7-7700K CPU
  ◮ 4 cores
  ◮ Hyperthreading disabled
  ◮ Turbo mode disabled, CPU set to 4.2 GHz
◮ Hypothesis: if we reduce bandwidth to main memory, algorithms that better utilize the last-level cache become more efficient than those that do not

SLIDE 52

MOMMS

◮ Multilevel Optimized Matrix-matrix Multiplication Sandbox
◮ Framework written in Rust
◮ Uses composition to instantiate different algorithms for MMM
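Composition here means building an algorithm by nesting loop components around a kernel. A hypothetical Rust sketch in that spirit (the trait and type names below are illustrative, not MOMMS's actual API):

```rust
// Hypothetical composition-by-generics sketch: each node of an
// algorithm implements GemmNode; loop nodes wrap a child node.
struct Mat { rows: usize, cols: usize, data: Vec<f64> }

trait GemmNode {
    fn run(&self, a: &Mat, b: &Mat, c: &mut Mat);
}

// Leaf node: a plain triple-loop kernel, C += A * B.
struct TripleLoop;
impl GemmNode for TripleLoop {
    fn run(&self, a: &Mat, b: &Mat, c: &mut Mat) {
        for i in 0..c.rows {
            for j in 0..c.cols {
                for p in 0..a.cols {
                    c.data[i * c.cols + j] +=
                        a.data[i * a.cols + p] * b.data[p * b.cols + j];
                }
            }
        }
    }
}

// Loop node: partition the m dimension with blocksize `bs`,
// delegating each row panel to the child node.
struct PartitionM<S: GemmNode> { bs: usize, child: S }
impl<S: GemmNode> GemmNode for PartitionM<S> {
    fn run(&self, a: &Mat, b: &Mat, c: &mut Mat) {
        for ii in (0..c.rows).step_by(self.bs) {
            let hi = (ii + self.bs).min(c.rows);
            // Copy out the row panels of A and C (a real framework
            // would use matrix views instead of copies).
            let ap = Mat { rows: hi - ii, cols: a.cols,
                data: a.data[ii * a.cols..hi * a.cols].to_vec() };
            let mut cp = Mat { rows: hi - ii, cols: c.cols,
                data: c.data[ii * c.cols..hi * c.cols].to_vec() };
            self.child.run(&ap, b, &mut cp);
            c.data[ii * c.cols..hi * c.cols].copy_from_slice(&cp.data);
        }
    }
}

fn main() {
    // An "algorithm" is just a nested type.
    let alg = PartitionM { bs: 1, child: TripleLoop };
    let a = Mat { rows: 2, cols: 2, data: vec![1.0, 2.0, 3.0, 4.0] };
    let b = Mat { rows: 2, cols: 2, data: vec![5.0, 6.0, 7.0, 8.0] };
    let mut c = Mat { rows: 2, cols: 2, data: vec![0.0; 4] };
    alg.run(&a, &b, &mut c);
    assert_eq!(c.data, vec![19.0, 22.0, 43.0, 50.0]);
}
```

With analogous PartitionN and PartitionK nodes, an algorithm such as B3A2 would be spelled as a nested type, with one loop node per level of blocking.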

SLIDE 53

Algorithms for an Intel i7-7700K

B3A2 Algorithm:
  ◮ Partition n with blocksize 768
  ◮ Partition k with blocksize 768
  ◮ Partition m with blocksize 120
  ◮ Partition k with blocksize 192
  ◮ Inner kernel

Goto's Algorithm:
  ◮ Partition n with blocksize 3000
  ◮ Partition k with blocksize 192
  ◮ Partition m with blocksize 120
  ◮ Inner kernel

[Figure legend: blocks reused in the L3 cache; blocks reused in the L2 cache.]

SLIDE 54

Roofline model

[Roofline plots: GFLOPS vs. flops per byte for MOMMS Goto (A2) and MOMMS B3A2 against the lower bound, at 51.2 GB/s (2 channels of DDR4-3200 RAM) and 6.4 GB/s (1 channel of DDR4-800 RAM).]

SLIDE 55

Varying bandwidth for the i7-7700K

[Performance plots: GFLOPS vs. m = n = k for MOMMS Goto (A2) and MOMMS B3A2 at main-memory bandwidths of 6.4, 9.6, 12.8, and 51.2 GB/s.]

SLIDE 56

Algorithms for an Intel i7-7700K

A3B2 Algorithm:
  ◮ Partition m with blocksize 768
  ◮ Partition k with blocksize 768
  ◮ Partition n with blocksize 120
  ◮ Partition k with blocksize 192
  ◮ Inner kernel

C3A2 Algorithm:
  ◮ Partition n with blocksize 624
  ◮ Partition m with blocksize 624
  ◮ Partition k with blocksize 156
  ◮ Partition m with blocksize 156
  ◮ Inner kernel

[Figure legend: blocks reused in the L3 cache; blocks reused in the L2 cache.]

SLIDE 57

Different shapes of MMM

[Performance plots: GFLOPS for MOMMS Goto (A2), MOMMS A3B2, MOMMS B3A2, and MOMMS C3A2 across seven problem shapes: varying m (n = k = 600), varying k = n (m = 600), varying n (m = k = 600), varying m = k (n = 600), varying k (m = n = 600), varying m = n (k = 600), and square (varying m = n = k).]

SLIDE 58

Comparing with other implementations for the i7-7700K

[Performance plots: GFLOPS vs. m = n = k at 6.4 GB/s and 51.2 GB/s for MOMMS Goto (A2), MOMMS B3A2, MOMMS C3A2, BLIS, MKL, and ATLAS.]

SLIDE 59

Conclusion

◮ New lower bounds
  ◮ We can reason about the optimality of algorithms
◮ A new family of algorithms
  ◮ Better L3 cache utilization
  ◮ We know how to use further levels of the memory hierarchy (L4, out of core, etc.)
◮ Future work
  ◮ Parallelization
  ◮ Algorithms for other operations (rest of the level-3 BLAS, matrix factorizations, etc.)

SLIDE 60

Thank you!

Questions?

◮ Tyler M. Smith
◮ tms@cs.utexas.edu
◮ MOMMS can be found at github.com/tlrmchlsmth/momms

SLIDE 61

Backup

SLIDE 62

Tradeoffs

[Plot: I/O cost relative to the lower bound as a function of the cache-size ratio Sh−1/Sh, for three scenarios: an LRU Lh cache while optimizing for Lh and Lh−1; the best case optimizing for Lh and Lh−1; and blocking for only Lh−1. Annotated ratios include S2:S1, S3:S2, S3:(4S2), and S3:S1.]

SLIDE 63

I/O cost of Goto's Algorithm

◮ Let's look at the I/O cost of C:
  ◮ Each element of C is involved in k flops
  ◮ kc flops are accumulated into an element of C each time it is read from and written to main memory
  ◮ Each element of C is read from and written to main memory k/kc times
  ◮ I/O cost of 2mnk/kc
◮ Can analyze the I/O cost of A and B similarly:
  ◮ I/O cost of A is mnk/nc
  ◮ I/O cost of B can be amortized completely

SLIDE 64

An algorithm for an Intel i7-5775C

C4A2 Algorithm:
  ◮ Partition m with blocksize 3600
  ◮ Partition n with blocksize 3600
  ◮ Partition k with blocksize 192
  ◮ Partition m with blocksize 120
  ◮ Inner kernel

[Figure legend: blocks reused in the L4 cache; blocks reused in the L3 cache; blocks reused in the L2 cache.]

SLIDE 65

Varying bandwidth for the i7-5775C

[Performance plots: GFLOPS vs. m = n = k for MOMMS Goto (A2) and MOMMS C4A2 at main-memory bandwidths of 6.4, 8.53, 10.66, and 38.4 GB/s.]

SLIDE 66

Comparing with other implementations for the i7-5775C

[Performance plots: GFLOPS vs. m = n = k at 6.4 GB/s and 38.4 GB/s for MOMMS Goto (A2), MOMMS C4A2, BLIS, and MKL.]

SLIDE 67

Upper bound on FMAs during a phase with I/O cost S

◮ Again, we will use the Loomis-Whitney inequality
◮ In MMM, A, B, and C are inputs
◮ There are at most 2S inputs:
  ◮ NA + NB + NC ≤ 2S
  ◮ √(NA NB NC) ≤ √(xyz) for some x, y, z ∈ R with x + y + z = 2S
◮ Maximize √(xyz) under the constraint x + y + z = 2S:
  ◮ x = y = z = 2S/3
  ◮ F = (2√2/(3√3)) S√S
◮ Then our lower bound is S((3√3/(2√2)) mnk/(S√S) − 1)
◮ Or: (3√3/(2√2)) mnk/√S − S = 1.837 mnk/√S − S

SLIDE 68

Analysis of ATLAS

Whaley, Petitet, and Dongarra (2001)

for (j = 0; j < N-1; j += NB) {
  for (i = 0; i < M-1; i += NB) {
    for (p = 0; p < K-1; p += NB) {
      ON_CHIP_MATMUL(A[i : i+NB][p : p+NB],
                     B[p : p+NB][j : j+NB],
                     C[i : i+NB][j : j+NB]);
    }
  }
}

SLIDE 69

Analysis of ATLAS

Whaley, Petitet, and Dongarra (2001)

◮ Inner kernel is an nb × nb × nb MMM
  ◮ Fills the L1 cache with a square block of A or B
  ◮ Streams the other two matrices
◮ The next inner-kernel invocation uses the same block of C, different A and B
◮ Each element of A, B, and C is reused in cache nb times
◮ I/O cost for each of A, B, and C is mnk/√S1
◮ Overall cost is roughly 3mnk/√S1