[PPT] - Column-Based Matrix Partitioning for Parallel Matrix Multiplication PowerPoint Presentation

SLIDE 1

Motivation Parallel Matrix Multiplication Routine Matrix Partitioning Algorithms Experimental Results Conclusions

Column-Based Matrix Partitioning for Parallel Matrix Multiplication on Heterogeneous Processors Based on Functional Performance Models

David Clarke Alexey Lastovetsky Vladimir Rychkov

Heterogeneous Computing Laboratory, School of Computer Science and Informatics, University College Dublin, Belfield, Dublin 4, Ireland http://hcl.ucd.ie

HeteroPar’2011

David Clarke, Alexey Lastovetsky, Vladimir Rychkov Heterogeneous Two-Dimensional Matrix Partitioning

SLIDE 2


Outline

◮ Motivation
◮ Parallel Matrix Multiplication Routine
◮ Matrix Partitioning Algorithms
◮ Experimental Results
◮ Conclusions

SLIDE 3

Why optimize Matrix Multiplication?

Matrix Multiplication

◮ LU Decomposition
◮ Cholesky Decomposition
◮ Solving a System of Linear Equations
◮ Image Processing
◮ Molecular Simulation

Complexity of order O(n^3)

SLIDE 4

[Figure: C = A × B]

SLIDE 5

[Figure: C = A × B, with matrices A, B and C partitioned among processors P1, P2, ..., Pi]

SLIDE 6

[Figure: C = A × B, with matrices A, B and C partitioned among processors P1, P2, ..., Pi]

SLIDE 7

[Figure: C = A × B, with matrices A, B and C partitioned among processors P1, P2, ..., Pi]

SLIDE 8

[Figure: C = A × B, with matrices A, B and C partitioned among processors P1, P2, ..., Pi]

SLIDE 9

Optimising Parallel Matrix Multiplication on a Heterogeneous Platform

◮ Partition in proportion to processor speed.
◮ Minimise volume of communication.
◮ Partition in proportion to interconnect speed.

SLIDE 10

SLIDE 11

allocate and initialise matrices A, B, C;
allocate workspace WA, WB;
for k = 0 → N − 1 do
    if (is pivot row) then
        point WB to local pivot row of B;
        broadcast WB to all in column;
    else
        receive WB;
    end if
    if (is pivot column) then
        point WA to local pivot column of A;
        send WA horizontally;
    else
        receive WA;
    end if
    DGEMM(..., WA, WB, C, ...);
end for

[Figure: matrices A, B and C with workspaces WA and WB]
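The loop above builds C as a sum of N outer products: at step k the pivot column of A (held in WA) and the pivot row of B (held in WB) are combined in one DGEMM call. A minimal serial sketch of that accumulation follows, with plain Python lists standing in for the distributed matrices and no communication; the function name `outer_product_matmul` is hypothetical.

```python
def outer_product_matmul(A, B):
    """Compute C = A x B as a sum of rank-1 (outer product) updates."""
    n_rows, n_inner, n_cols = len(A), len(B), len(B[0])
    C = [[0.0] * n_cols for _ in range(n_rows)]
    for k in range(n_inner):
        WA = [A[i][k] for i in range(n_rows)]  # pivot column of A
        WB = B[k]                               # pivot row of B
        # Stand-in for the DGEMM call: C += WA (as a column) * WB (as a row).
        for i in range(n_rows):
            for j in range(n_cols):
                C[i][j] += WA[i] * WB[j]
    return C


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = outer_product_matmul(A, B)
```

In the parallel routine each processor would apply the same update only to its own rectangle of C, using the parts of WA and WB that it receives.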

SLIDE 12

[Same routine as on slide 11. Figure: matrices A and B with workspaces WA and WB]

SLIDE 13

[Same routine as on slide 11. Figure: matrices A and B]

SLIDE 14

Benchmarking on each processor must be independent of other processors: serial code.

allocate and initialise matrices A, B, C;
allocate workspace WA, WB;
start timer;
MPI_Send(A, ..., MPI_COMM_SELF);
MPI_Recv(WA, ..., MPI_COMM_SELF);
memcpy(WB, B, ...);
DGEMM(..., WA, WB, C, ...);
stop timer;
free memory;
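The shape of such a serial benchmark can be sketched in Python. This is only an illustration under loud assumptions: list copies stand in for the send/receive to self and for memcpy, a triple loop stands in for DGEMM, and the function name `benchmark_speed` and the inner dimension `k` are hypothetical. The absolute numbers it returns say nothing about real DGEMM performance.

```python
import time


def benchmark_speed(m, n, k=1):
    """Time one (m x k) by (k x n) update, copies included; return FLOP/s."""
    A = [[1.0] * k for _ in range(m)]
    B = [[1.0] * n for _ in range(k)]
    C = [[0.0] * n for _ in range(m)]
    start = time.perf_counter()
    WA = [row[:] for row in A]   # stand-in for MPI_Send / MPI_Recv to self
    WB = [row[:] for row in B]   # stand-in for memcpy(WB, B, ...)
    for i in range(m):           # stand-in for DGEMM(..., WA, WB, C, ...)
        for j in range(n):
            acc = C[i][j]
            for t in range(k):
                acc += WA[i][t] * WB[t][j]
            C[i][j] = acc
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
    return 2.0 * m * n * k / elapsed


speed = benchmark_speed(40, 40, 4)
```

Timing the copies together with the kernel mirrors the pseudocode above, where the timer encloses the self-communication as well as the DGEMM call.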

SLIDE 15

Matrix Partitioning Algorithms

◮ Column-Based Partitioning (Kalinov & Lastovetsky, 1999) (KL)
◮ Minimising Total Communication Volume (Beaumont, Boudet, Rastello, Robert, 2001) (BR)
◮ 1D Functional Performance Model-based Partitioning (Lastovetsky, Reddy, 2007) (FPM1D)
◮ 2D Functional Performance Model-based Partitioning (Lastovetsky, Reddy, 2010) (FPM-KL)
◮ New Two-Dimensional Matrix Partitioning Algorithm (FPM-BR)

SLIDE 16

Column-Based Partitioning (KL)

◮ Processors are arranged into columns.
◮ The width of each column is in proportion to the sum of the speeds of the processors in that column.
◮ Within each column the heights are calculated in proportion to speed.

[Figure: processors P1 to P9 arranged into columns (P1, P2, P3), (P4, P5, P6) and (P7, P8, P9), and the resulting partition of the matrix]

SLIDE 17


Column-Based Partitioning (KL)

◮ Processors are arranged into columns.
◮ The width of each column is in proportion to the sum of the speeds of the processors in that column.
◮ Within each column the heights are calculated in proportion to speed.
◮ However, communication cost is not taken into account.
◮ Uses inaccurate, single-value performance model of processor speed.
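The width and height rules above can be sketched for constant (single-value) speeds as follows. The function name `column_based_partition`, the real-valued (unrounded) sizes and the example speeds are assumptions made for illustration; the arrangement of processors into columns is taken as given.

```python
def column_based_partition(columns, N):
    """columns: constant processor speeds grouped by column.

    Returns one (width, heights) pair per column for an N x N matrix.
    """
    total_speed = sum(sum(col) for col in columns)
    layout = []
    for col in columns:
        col_speed = sum(col)
        width = N * col_speed / total_speed          # width ~ sum of speeds in column
        heights = [N * s / col_speed for s in col]   # height ~ speed within column
        layout.append((width, heights))
    return layout


# Two columns: speeds (1, 1) in the first and (2, 4) in the second.
layout = column_based_partition([[1.0, 1.0], [2.0, 4.0]], N=80)
```

With these rules each processor's area, width times height, is proportional to its speed: the processor of speed 4 receives half of the 80 × 80 matrix.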

SLIDE 18

Minimising Total Communication Volume (BR)

◮ Column-based algorithm.
◮ Computes:
    ◮ Optimum number of columns
    ◮ Optimum number of processors in each column
◮ Such that:
    ◮ Workload is distributed in proportion to speed,
    ◮ Total volume of communication is minimised.

SLIDE 19


Minimising Total Communication Volume (BR)

◮ Column-based algorithm.
◮ Computes:
    ◮ Optimum number of columns
    ◮ Optimum number of processors in each column
◮ Such that:
    ◮ Workload is distributed in proportion to speed,
    ◮ Total volume of communication is minimised.
◮ However, uses inaccurate, single-value performance model of processor speed.
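One way to search for the number of columns and the number of processors per column is a small dynamic program, sketched below. It is not presented as the authors' implementation. It assumes a unit square, areas proportional to constant speeds, a cost equal to the sum of half perimeters of the rectangles, and columns that hold processors which are consecutive after sorting by area. The function name `br_column_arrangement` is hypothetical.

```python
def br_column_arrangement(areas):
    """Return (cost, columns): smallest sum of half perimeters on the unit
    square found by the search, and the normalised areas grouped by column."""
    total = sum(areas)
    s = sorted(a / total for a in areas)
    p = len(s)
    prefix = [0.0]
    for a in s:
        prefix.append(prefix[-1] + a)
    inf = float("inf")
    # best[c][q]: minimal cost of placing the first q areas into c columns.
    best = [[inf] * (p + 1) for _ in range(p + 1)]
    cut = [[0] * (p + 1) for _ in range(p + 1)]
    best[0][0] = 0.0
    for c in range(1, p + 1):
        for q in range(c, p + 1):
            for r in range(c - 1, q):
                if best[c - 1][r] == inf:
                    continue
                # New column holds areas r..q-1: its heights sum to 1 and each
                # of its (q - r) rectangles has width prefix[q] - prefix[r].
                cost = best[c - 1][r] + 1.0 + (q - r) * (prefix[q] - prefix[r])
                if cost < best[c][q]:
                    best[c][q], cut[c][q] = cost, r
    c_best = min(range(1, p + 1), key=lambda c: best[c][p])
    columns, q = [], p
    for c in range(c_best, 0, -1):
        r = cut[c][q]
        columns.append(s[r:q])
        q = r
    return best[c_best][p], columns[::-1]


cost, columns = br_column_arrangement([1.0, 1.0, 1.0, 1.0])
```

For four equal speeds the search settles on two columns of two processors, that is, a 2 × 2 grid of squares.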

SLIDE 20

Minimising Total Communication Volume (BR)

[Figure: rectangle of processor P_i, with height m_i and width n_i, shown in matrices A and B]

Total volume of communication = $\sum_{i=1}^{p} (m_i + n_i)$

“the sum of the half perimeters”, minimised when m_i ≈ n_i
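A small numeric check of the half-perimeter sum, comparing two ways of giving four equally fast processors a quarter of a 1 × 1 matrix each. The function name `total_comm_volume` and the two layouts are chosen here for illustration only.

```python
def total_comm_volume(rectangles):
    """Sum of half perimeters (m_i + n_i) over all rectangles."""
    return sum(m + n for m, n in rectangles)


# Four rectangles of area 1/4 each, given as (m_i, n_i) = (height, width).
strips = [(0.25, 1.0)] * 4    # four full-width strips
squares = [(0.5, 0.5)] * 4    # 2 x 2 grid of squares, m_i = n_i

volume_strips = total_comm_volume(strips)
volume_squares = total_comm_volume(squares)
```

The strips give a sum of 5, the squares a sum of 4, which is the sense in which the volume is smallest when m_i ≈ n_i.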

SLIDE 21

Realistic Performance Models

◮ Traditionally, processor performance is defined by a constant number.
◮ In reality, speed is a function of problem size.
◮ Algorithms based on constant performance models are only applicable for limited problem sizes.

[Figure: Matrix Multiplication on Grid5000; speed (GFLOPS) against problem size N]

SLIDE 22


1D Functional Performance Model-based Partitioning (FPM1D)

◮ Problem is solved geometrically by noting that the points (d_i, s_i(d_i)) lie on a line passing through the origin when d_i / s_i(d_i) = constant.
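The ratio d_i / s_i(d_i) is the execution time of processor i, so the condition above says that all processors finish together. The sketch below finds such a distribution numerically by bisecting on the common time, rather than by the geometric line search itself. It assumes that each time d / s_i(d) grows with d; the speed functions and the names `solve_d` and `fpm1d_partition` are hypothetical.

```python
def solve_d(speed, T, d_max):
    """Largest d in [0, d_max] with d / speed(d) <= T."""
    lo, hi = 0.0, d_max
    for _ in range(100):
        mid = (lo + hi) / 2
        if mid / speed(mid) <= T:
            lo = mid
        else:
            hi = mid
    return lo


def fpm1d_partition(speeds, n):
    """Find d_i with sum(d_i) = n and d_i / s_i(d_i) equal for all i."""
    t_lo, t_hi = 0.0, max(n / s(n) for s in speeds)
    for _ in range(100):
        T = (t_lo + t_hi) / 2
        if sum(solve_d(s, T, n) for s in speeds) < n:
            t_lo = T
        else:
            t_hi = T
    return [solve_d(s, t_hi, n) for s in speeds]


# Hypothetical models: one constant speed, one that degrades with size.
speeds = [lambda d: 2.0, lambda d: 4.0 / (1.0 + d / 100.0)]
parts = fpm1d_partition(speeds, 300.0)
```

Here the second processor is twice as fast on small problems, yet it receives less than half of the work because its speed falls as its share grows.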

SLIDE 23

[Figure: speed (GFLOPS) as a function of m and n for hcl16 (384MB RAM) and hcl13 (1024MB RAM)]

SLIDE 24


2D Functional Performance Model-based Partitioning (FPM-KL)

◮ Column-based partitioning with 2D performance models.
◮ Processors are arranged in a grid p × q.
◮ Column widths are initially distributed n_j = N/q for all j.

Iterating:

1. 1D models are sliced from 2D at column widths.
2. Optimum partitioning within each column is solved with FPM1D algorithm.
3. If imbalance < ε then finished, else continue.
4. Single-value speeds from this partitioning are used to calculate new column widths.

[Figure: speed (GFLOPS) as a function of m and n for hcl16 (384MB RAM) and hcl13 (1024MB RAM)]
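The four steps can be sketched as a loop. This is an illustration under stated assumptions, not the published implementation: the 2D speed functions are hypothetical callables `f(m, n)`, the within-column solve is a plain bisection on the common time, imbalance is taken as (t_max − t_min) / t_max, and sizes are real-valued.

```python
def partition_heights(slices, width, N):
    """Heights m_i summing to N with equal time m_i * width / s_i(m_i)."""
    def height_for(s, T):
        lo, hi = 0.0, N
        for _ in range(60):
            mid = (lo + hi) / 2
            if mid * width / s(mid) <= T:
                lo = mid
            else:
                hi = mid
        return lo

    t_lo, t_hi = 0.0, max(N * width / s(N) for s in slices)
    for _ in range(60):
        T = (t_lo + t_hi) / 2
        if sum(height_for(s, T) for s in slices) < N:
            t_lo = T
        else:
            t_hi = T
    return [height_for(s, t_hi) for s in slices]


def fpm_kl(speed2d, N, eps=0.01, max_iter=50):
    """speed2d[j][i](m, n): speed of processor i in column j on m x n."""
    q = len(speed2d)
    widths = [N / q] * q                                   # initial n_j = N / q
    for _ in range(max_iter):
        heights, times, col_speed = [], [], []
        for j, col in enumerate(speed2d):
            w = widths[j]
            slices = [lambda m, f=f, w=w: f(m, w) for f in col]  # step 1
            h = partition_heights(slices, w, N)                  # step 2
            heights.append(h)
            times += [m * w / f(m, w) for f, m in zip(col, h)]
            col_speed.append(sum(f(m, w) for f, m in zip(col, h)))
        if (max(times) - min(times)) / max(times) < eps:         # step 3
            break
        widths = [N * s / sum(col_speed) for s in col_speed]     # step 4
    return widths, heights


def const(v):
    return lambda m, n: v


widths, heights = fpm_kl([[const(1.0), const(1.0)], [const(2.0), const(4.0)]], 80.0)
```

With constant speeds the loop stops after the first width update, at widths in proportion to the column speed sums. With size-dependent speeds it may need more iterations, and `max_iter` only bounds the work; convergence is not guaranteed by this sketch.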

SLIDE 25

2D Functional Performance Model-based Partitioning (FPM-KL)

◮ Does not take communication cost into account.
◮ Processor grid is fixed.
◮ Relies on single speed values to calculate new column widths.
◮ Building full 2D models is expensive.

[Figure: Matrix Multiplication Benchmark on HCL cluster; speed (GFLOPS) against problem size in matrix elements, for hcl13 and hcl16]

[Figure: speed (GFLOPS) as a function of m and n for hcl16 (384MB RAM) and hcl13 (1024MB RAM)]

SLIDE 26

New Two-Dimensional Matrix Partitioning Algorithm (FPM-BR)

◮ Height m_i and width n_i combined into one parameter, area d_i = m_i × n_i.
◮ Square areas are benchmarked, m = n = √d.
◮ Partition with FPM1D algorithm, find area rectangles.
◮ BR algorithm computes ordering and shape of these rectangles.

[Figure: Matrix Multiplication Benchmark on Grid5000-Lille; speed (GFLOPS) against problem size in matrix elements, for chirloute-3, chimint-1, chinqchint-1 and chicon-1]
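Two pieces of this pipeline are easy to sketch: building the 1D model in the area d from benchmarks of squares, and turning areas that have already been grouped into columns into rectangle shapes. Both function names, the 2D speed function and the example areas are hypothetical, and the column grouping is taken as given rather than computed.

```python
import math


def square_model(speed2d):
    """1D model in the area d, from benchmarks of squares m = n = sqrt(d)."""
    return lambda d: speed2d(math.sqrt(d), math.sqrt(d))


def shapes_from_areas(columns, N):
    """columns: areas d_i grouped by column; the areas must sum to N * N.

    Returns (m_i, n_i) = (height, width) for every rectangle.
    """
    shapes = []
    for col in columns:
        width = sum(col) / N      # each column spans the full height N
        shapes.append([(d / width, width) for d in col])
    return shapes


s1d = square_model(lambda m, n: m + n)   # toy 2D speed function
shapes = shapes_from_areas([[800.0, 800.0], [1600.0, 3200.0]], N=80.0)
```

Every rectangle keeps the area assigned to it, since its height is the area divided by the column width.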

SLIDE 27

Assumption: the performance on a square area is the same as the performance on any rectangle of the same area, s(x, x) = s(x/c, c·x).

This is not always true.

[Figure: speed (GFLOPS) against ratio m:n, from 1:40 to 40:1; lines connect benchmarks of equal area]

SLIDE 28


Assumption: the performance on a square area is the same as the performance on any rectangle of the same area, s(x, x) = s(x/c, c·x).

This is not always true.

[Figure: speed (GFLOPS) against ratio m:n, from 1:40 to 40:1; lines connect benchmarks of equal area]

However, the goal of the BR algorithm is to make the rectangles square.

[Figure: frequency against ratio m:n, from 1:1.4 to 1.4:1]

[Figure: speed (GFLOPS) against ratio m:n, from 1:1.4 to 1.4:1; lines connect benchmarks of equal area]

SLIDE 29


Experimental Results

[Figure: total execution time (sec) against total matrix size Nb, for Homogeneous Distribution, BR, FPM-KL and FPM-BR]

16 heterogeneous nodes, local HCL cluster.

SLIDE 30

[Figure: total execution time (sec) against total matrix size Nb, for Homogeneous Distribution, BR, FPM-KL and FPM-BR]

64 nodes from Grid5000 Lille site (4 types of nodes).

SLIDE 31

Matrix partitioning for 14 nodes

FPM-KL
[Figure: partition of the matrix among nodes 01 to 14]
TVC: 9 Time: 192.2 sec

FPM-BR
[Figure: partition of the matrix among nodes 01 to 14]
TVC: 7.457 Time: 166.0 sec

SLIDE 32

Conclusions

◮ New FPM-BR algorithm can outperform existing algorithms.
◮ Allows use of simpler 1D models.
◮ Total volume of communication is minimised.

SLIDE 33

Questions?
