SLIDE 1

An Exploration of Optimization Algorithms for High Performance Tensor Completion

Shaden Smith¹∗, Jongsoo Park², and George Karypis¹

¹Department of Computer Science & Engineering, University of Minnesota
²Parallel Computing Lab, Intel Corporation
∗shaden@cs.umn.edu

SLIDE 2

Outline

◮ Introduction & Preliminaries
  ◮ Tensor Completion
  ◮ Evaluation Criteria
◮ Optimization Algorithms
  ◮ Alternating Least Squares
  ◮ Coordinate Descent
  ◮ Stochastic Gradient Descent
◮ Comparison of Optimization Methods
◮ Conclusions

SLIDE 3

Table of Contents

◮ Introduction & Preliminaries
  ◮ Tensor Completion
  ◮ Evaluation Criteria
◮ Optimization Algorithms
◮ Comparison of Optimization Methods
◮ Conclusions

SLIDE 4

Tensor introduction

◮ Tensors are the generalization of matrices to ≥ 3D.
◮ Tensors have m dimensions (or modes).
◮ We will use dimensions I×J×K in this talk.

[Figure: an I×J×K tensor with modes labeled users, items, and contexts.]

SLIDE 5

Tensor completion

◮ Many tensors are sparse due to missing or unknown data.
  ◮ Missing values are not treated as zero.
◮ Assumption: the underlying data is low rank.
◮ Tensor completion estimates a low rank model to recover missing entries.
◮ Applications: precision healthcare, product recommendation, cybersecurity, and others.

SLIDE 6

Tensor completion

◮ Many tensors are sparse due to missing or unknown data.
  ◮ Missing values are not treated as zero.
◮ Assumption: the underlying data is low rank.
◮ Tensor completion estimates a low rank model to recover missing entries.
◮ Applications: precision healthcare, product recommendation, cybersecurity, and others.
◮ The canonical polyadic decomposition (CPD) models a tensor as the summation of rank-1 tensors.

[Figure: the tensor approximated as a sum of F rank-1 (outer-product) tensors.]

SLIDE 7

Tensor completion with the CPD

R(i, j, k) is written as the inner product of A(i, :), B(j, :), and C(k, :).

[Figure: the non-zero R(i, j, k) maps to row i of A, row j of B, and row k of C.]

SLIDE 8

Tensor completion with the CPD

R(i, j, k) is written as the inner product of A(i, :), B(j, :), and C(k, :).

We arrive at a non-convex optimization problem:

    minimize over A, B, C:   L(R, A, B, C)  +  λ ( ‖A‖F² + ‖B‖F² + ‖C‖F² )
                                 (loss)              (regularization)

where the loss is taken over the observed (non-zero) entries:

    L(R, A, B, C) = ½ Σ_{(i,j,k) ∈ nnz(R)} ( R(i, j, k) − Σ_{f=1..F} A(i, f) B(j, f) C(k, f) )²
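
As a concrete reference, here is a minimal NumPy sketch of this objective, assuming the observed entries are stored in coordinate (COO) form; the function and argument names (cpd_loss, coords, vals) are illustrative and not part of SPLATT.

```python
import numpy as np

def cpd_loss(coords, vals, A, B, C, lam):
    """Regularized CPD completion objective:
    L(R, A, B, C) + lam * (||A||_F^2 + ||B||_F^2 + ||C||_F^2).

    coords : (nnz, 3) integer array of observed (i, j, k) indices.
    vals   : (nnz,)   array of the observed values R(i, j, k).
    A, B, C: factor matrices of shapes (I, F), (J, F), (K, F).
    """
    i, j, k = coords[:, 0], coords[:, 1], coords[:, 2]
    # Model prediction for each observed entry: sum_f A(i,f)*B(j,f)*C(k,f).
    pred = np.einsum('nf,nf,nf->n', A[i], B[j], C[k])
    loss = 0.5 * np.sum((vals - pred) ** 2)
    reg = lam * (np.linalg.norm(A) ** 2 +
                 np.linalg.norm(B) ** 2 +
                 np.linalg.norm(C) ** 2)
    return loss + reg
```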

SLIDE 9

Challenges

Optimization algorithms

◮ Algorithms for matrix completion are relatively mature.
◮ How do their tensor adaptations perform on HPC systems?
◮ Several properties to consider when comparing algorithms:
  1. Number of operations.
  2. Convergence rate.
  3. Computational intensity.
  4. Parallelism.

SLIDE 10

Experimental setup

◮ Source code was implemented as part of SPLATT with MPI+OpenMP.
◮ Experiments are on the Cori supercomputer at NERSC.
  ◮ Nodes have two sixteen-core Intel (Haswell) processors.
◮ Experiments show a rank-10 factorization of the Yahoo Music (KDD Cup) tensor.
  ◮ 210 million user-song-month ratings.
  ◮ More datasets and ranks in the paper.
◮ Root-mean-squared error (RMSE) on a test set measures solution quality:

    RMSE = √( 2 · L(R, A, B, C) / nnz(R) )
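
A corresponding sketch of the RMSE evaluation on a held-out test set, reusing the COO layout from the loss sketch above (again with illustrative names); since L is half the sum of squared errors, √(2·L/nnz) is simply the root of the mean squared error.

```python
import numpy as np

def rmse(coords, vals, A, B, C):
    """Root-mean-squared error over a set of held-out entries,
    equivalent to sqrt(2 * L(R, A, B, C) / nnz(R)) without regularization."""
    i, j, k = coords[:, 0], coords[:, 1], coords[:, 2]
    pred = np.einsum('nf,nf,nf->n', A[i], B[j], C[k])
    return np.sqrt(np.mean((vals - pred) ** 2))
```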

SLIDE 11

Table of Contents

◮ Introduction & Preliminaries
◮ Optimization Algorithms
  ◮ Alternating Least Squares
  ◮ Coordinate Descent
  ◮ Stochastic Gradient Descent
◮ Comparison of Optimization Methods
◮ Conclusions

SLIDE 12

Alternating least squares (ALS)

◮ Each row of A is a linear least squares problem.
◮ Hᵢ is an |R(i, :, :)| × F matrix:
  ◮ R(i, j, k) → B(j, :) ∗ C(k, :) (elementwise multiplication).
◮ A(i, :) ← ( Hᵢᵀ Hᵢ + λI )⁻¹ Hᵢᵀ vec(R(i, :, :))   (normal equations)

[Figure: the least-squares system for A(i, :), with rows of Hᵢ formed from rows of B and C.]

SLIDE 13

Alternating least squares (ALS)

◮ Normal equations Nᵢ = Hᵢᵀ Hᵢ are formed one non-zero at a time.
◮ Hᵢᵀ vec(R(i, :, :)) is similarly accumulated into a vector qᵢ.

Algorithm 1 ALS: updating A(i, :)
1: Nᵢ ← 0_{F×F}, qᵢ ← 0_{F×1}
2: for (i, j, k) ∈ R(i, :, :) do
3:   x ← B(j, :) ∗ C(k, :)
4:   Nᵢ ← Nᵢ + xᵀ x
5:   qᵢ ← qᵢ + R(i, j, k) xᵀ
6: end for
7: A(i, :) ← (Nᵢ + λI)⁻¹ qᵢ
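
A direct NumPy transcription of Algorithm 1 for a single row, under the same COO assumptions as the earlier sketches (names such as als_update_row are mine, not SPLATT's):

```python
import numpy as np

def als_update_row(slice_coords, slice_vals, B, C, lam):
    """Algorithm 1: update A(i, :) from the non-zeros of slice R(i, :, :).

    slice_coords / slice_vals hold only the entries whose first index is i.
    """
    F = B.shape[1]
    N = np.zeros((F, F))            # normal equations N_i = H_i^T H_i
    q = np.zeros(F)                 # right-hand side q_i = H_i^T vec(R(i,:,:))
    for (_, j, k), r in zip(slice_coords, slice_vals):
        x = B[j] * C[k]             # elementwise (Hadamard) product
        N += np.outer(x, x)         # one rank-1 update per non-zero
        q += r * x
    return np.linalg.solve(N + lam * np.eye(F), q)
```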

SLIDE 14

BLAS-3 formulation

◮ Element-wise computation is an outer product formulation.
  ◮ O(F²) work with O(F²) data per non-zero.
◮ Instead, append (B(j, :) ∗ C(k, :)) to a matrix Z.
◮ When Z is full, do a rank-k update: Nᵢ ← Nᵢ + Zᵀ Z.

Algorithm 2 ALS: updating A(i, :)
1: Nᵢ ← 0_{F×F}, qᵢ ← 0_{F×1}, Z ← 0
2: for (i, j, k) ∈ R(i, :, :) do
3:   Append (x ← B(j, :) ∗ C(k, :)) to Z
4:   qᵢ ← qᵢ + R(i, j, k) xᵀ
5: end for
6: Nᵢ ← Nᵢ + Zᵀ Z
7: A(i, :) ← (Nᵢ + λI)⁻¹ qᵢ
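
For contrast with Algorithm 1, a sketch of the blocked (BLAS-3) variant: rows B(j, :) ∗ C(k, :) are buffered in Z and the per-non-zero rank-1 updates are replaced by one GEMM per full block. The block size and names are illustrative choices, not SPLATT's.

```python
import numpy as np

def als_update_row_blocked(slice_coords, slice_vals, B, C, lam, block=128):
    """Algorithm 2: accumulate N_i with Z^T Z instead of per-non-zero outer products."""
    F = B.shape[1]
    N = np.zeros((F, F))
    q = np.zeros(F)
    Z = np.empty((block, F))
    n = 0
    for (_, j, k), r in zip(slice_coords, slice_vals):
        Z[n] = B[j] * C[k]          # append the Hadamard product to Z
        q += r * Z[n]
        n += 1
        if n == block:              # Z is full: one rank-k (GEMM) update
            N += Z.T @ Z
            n = 0
    if n:                           # flush the partially filled block
        N += Z[:n].T @ Z[:n]
    return np.linalg.solve(N + lam * np.eye(F), q)
```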

SLIDE 15

Parallel ALS

◮ We impose a 1D partition on each of the factors.
◮ Non-zeros are then distributed according to the row partitionings.
◮ Only the updated rows need to be communicated.
◮ If a mode is short, cooperatively form rows and aggregate the normal equations.

[Figure: 1D row partitions of A, B, and C across processes.]

SLIDE 16

ALS evaluation

295× relative speedup and 153× speedup over base-ALS.

[Figure: time per epoch (s), log scale, vs. number of cores (1–1024) for base-ALS and ALS.]

base-ALS is a pure-MPI implementation in C++ [Karlsson ’15]. ALS is our MPI+OpenMP implementation with one MPI rank per node.

SLIDE 17

Coordinate descent (CCD++)

◮ Select a variable and update it while holding all others constant.
  ◮ The update can be done exactly because the single-variable problem is quadratic.
◮ Rank-1 factors are updated in sequence.

SLIDE 18

CCD++ formulation

◮ O(F) work per non-zero.
◮ Each epoch requires 3F passes over the tensor.
  ◮ Heavily dependent on memory bandwidth.

    δijk ← R(i, j, k) − Σ_{f=1..F} A(i, f) B(j, f) C(k, f)

    αi ← Σ_{R(i,:,:)} δijk ( B(j, f) C(k, f) )
    βi ← Σ_{R(i,:,:)} ( B(j, f) C(k, f) )²

    A(i, f) ← αi / ( βi + λ )
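
A vectorized NumPy sketch of one CCD++ update of column A(:, f), under the same COO assumptions and illustrative naming as before. The residual δ is formed here with column f's own contribution added back, which incremental CCD++ implementations maintain rather than recompute; the denominator mirrors βi + λ from the update rule above.

```python
import numpy as np

def ccd_update_column_A(f, coords, vals, A, B, C, lam):
    """Update every entry of A(:, f) exactly, holding all other variables fixed."""
    i, j, k = coords[:, 0], coords[:, 1], coords[:, 2]
    pred = np.einsum('nf,nf,nf->n', A[i], B[j], C[k])
    # Residual with the f-th rank-1 component removed from the model.
    delta = vals - pred + A[i, f] * B[j, f] * C[k, f]
    bc = B[j, f] * C[k, f]
    # alpha_i and beta_i accumulated over the non-zeros of each slice R(i, :, :).
    alpha = np.bincount(i, weights=delta * bc, minlength=A.shape[0])
    beta = np.bincount(i, weights=bc * bc, minlength=A.shape[0])
    A[:, f] = alpha / (beta + lam)
```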

SLIDE 19

Compressed sparse fiber (CSF)

◮ CSF is a generalization of the CSR structure for matrices.
◮ Paths from roots to leaves encode non-zeros.
◮ CSF reduces the memory bandwidth of the tensor and also structures accesses to the factors.
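
To make the idea concrete, here is a toy construction of a CSF-like tree for a 3-mode tensor: non-zeros are sorted, each level stores the indices of one mode, and CSR-style pointer arrays link a node to its children. Real CSF in SPLATT adds mode ordering, tiling, and other optimizations; this sketch only illustrates the core layout and is not SPLATT's data structure.

```python
import numpy as np

def build_csf_toy(coords, vals):
    """Toy 3-level CSF: fids[l] holds the mode-l index of each node at level l,
    and fptr[l][n] : fptr[l][n+1] is the range of node n's children one level down."""
    order = np.lexsort((coords[:, 2], coords[:, 1], coords[:, 0]))
    fids = [[], [], []]
    fptr = [[], []]                   # child-start pointers for levels 0 and 1
    values = []
    prev_i = prev_j = None
    for (i, j, k), v in zip(coords[order], vals[order]):
        if i != prev_i:               # new root node (mode-0 index)
            fids[0].append(int(i))
            fptr[0].append(len(fids[1]))
            prev_i, prev_j = i, None
        if j != prev_j:               # new mode-1 node under the current root
            fids[1].append(int(j))
            fptr[1].append(len(fids[2]))
            prev_j = j
        fids[2].append(int(k))        # leaf: mode-2 index plus its value
        values.append(float(v))
    fptr[0].append(len(fids[1]))      # sentinels closing the last nodes
    fptr[1].append(len(fids[2]))
    return fids, fptr, values
```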

SLIDE 20

Parallel CCD++

◮ Shared-memory: each entry of A(:, f) is computed in parallel.
◮ Distributing non-zeros with a 3D grid limits communication to the grid layers.
◮ Distributing non-zeros requires αi and βi to be aggregated.
  ◮ Communication volume is O(IF) per process.
◮ For short modes, use a grid dimension of 1 and fully replicate the factor.

[Figure: a 3D decomposition of the tensor with factor blocks A1, A2, B1, B2, B3, C1, and C2.]

SLIDE 21

CCD++ shared-memory evaluation

Replicating the short mode improves speedup from 4× to 16×.

[Figure: speedup vs. number of cores (1–32) for ideal, CSF, and CSF+replication.]

CSF uses a CSF representation. CSF+replication uses a CSF representation and replicates α and β for load balance.

SLIDE 22

CCD++ distributed-memory evaluation

685× relative speedup and 21× speedup over base-CCD++.

[Figure: time per epoch (s), log scale, vs. number of cores (1–1024) for base-CCD++ and CCD++.]

base-CCD++ is a pure-MPI implementation in C++ [Karlsson '15]. CCD++ is our MPI+OpenMP implementation with two MPI ranks per node.

SLIDE 23

Stochastic gradient descent (SGD)

◮ Randomly select entry R(i, j, k) and update A, B, and C.
◮ O(F) work per non-zero.

    δijk ← R(i, j, k) − Σ_{f=1..F} A(i, f) B(j, f) C(k, f)

    A(i, :) ← A(i, :) + η [ δijk ( B(j, :) ∗ C(k, :) ) − λ A(i, :) ]
    B(j, :) ← B(j, :) + η [ δijk ( A(i, :) ∗ C(k, :) ) − λ B(j, :) ]
    C(k, :) ← C(k, :) + η [ δijk ( A(i, :) ∗ B(j, :) ) − λ C(k, :) ]

η is the step size; typically O(10⁻³).
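
A serial sketch of one SGD epoch implementing exactly these row updates (COO storage as before; sgd_epoch and its default parameter values are illustrative). Each update uses the old copies of the three rows, matching the simultaneous form above.

```python
import numpy as np

def sgd_epoch(coords, vals, A, B, C, lam=0.02, eta=1e-3, rng=None):
    """Visit the observed entries in random order and apply the three row updates."""
    rng = rng if rng is not None else np.random.default_rng()
    for n in rng.permutation(len(vals)):
        i, j, k = coords[n]
        delta = vals[n] - np.dot(A[i] * B[j], C[k])   # residual for this entry
        a, b, c = A[i].copy(), B[j].copy(), C[k].copy()
        A[i] += eta * (delta * (b * c) - lam * a)
        B[j] += eta * (delta * (a * c) - lam * b)
        C[k] += eta * (delta * (a * b) - lam * c)
```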

SLIDE 24

Stratified SGD [Beutel ’14]

◮ Strata identify independent blocks of non-zeros.
◮ Each stratum is processed in parallel.

Limitations of stratified SGD:
◮ There is only as much parallelism as the smallest dimension.
◮ Sparsely populated strata are communication bound.

SLIDE 25

Asynchronous SGD (ASGD)

◮ Processes overlap updates and exchanges to avoid divergence.

◮ Local solutions are combined via a weighted sum.

◮ Go Hogwild! on shared-memory systems.

Limitations of ASGD:

◮ Convergence suffers unless updates are frequently exchanged.

SLIDE 26

Hybrid stratified/asynchronous SGD

◮ Limit the number of strata to reduce communication.
◮ Assign multiple processes to the same stratum (called a team).
◮ Each process performs updates on its own local factors.
◮ At the end of a stratum, updates are exchanged among the team.

SLIDE 27

Effects of stratification on SGD @ 1024 cores

Hybrid stratification combines the speed of ASGD with the stability of stratification.

[Figure: RMSE vs. time (seconds) at 1024 cores for asynchronous, hybrid, and stratified SGD.]

Hybrid uses sixteen teams of four MPI processes.

SLIDE 28

Table of Contents

◮ Introduction & Preliminaries
◮ Optimization Algorithms
◮ Comparison of Optimization Methods
◮ Conclusions

SLIDE 29

Strong scaling

◮ SGD exhibits initial slowdown as strata teams are populated.
◮ All methods scale to (past) 1024 cores.

[Figure: time per epoch (s), log scale, vs. number of cores (32–1024) for ALS, SGD, and CCD++.]

SLIDE 30

Convergence @ 1 core

SGD rapidly converges to a high quality solution.

[Figure: RMSE vs. time (seconds) at 1 core for ALS, CCD++, and SGD.]

Convergence is detected if the RMSE does not improve after 20 epochs.

SLIDE 31

Convergence @ 1024 cores

◮ ALS now has the lowest time-to-solution.
◮ CCD++ and SGD exhibit similar convergence rates.

[Figure: RMSE vs. time (seconds) at 1024 cores for ALS, CCD++, and SGD.]

Convergence is detected if the RMSE does not improve after 20 epochs.

SLIDE 32

Table of Contents

◮ Introduction & Preliminaries
◮ Optimization Algorithms
◮ Comparison of Optimization Methods
◮ Conclusions

SLIDE 33

Wrapping Up

◮ Some of the principles in sparse matrix computations are useful here, but tensors bring many new challenges!
◮ Careful attention to sparsity and data structures can give over 10× speedups.
◮ ALS has a high convergence rate and performs well on modern architectures due to its high compute intensity.
◮ CCD++ may be best for very large scale systems or ranks, however.

http://cs.umn.edu/~splatt/

SLIDE 34

Questions?

SLIDE 35

Backup Slides

SLIDE 36

Communication volume on Yahoo!

[Figure: average communication volume (bytes, ×10⁸) per node vs. number of nodes (1–32) for ALS, SGD, and CCD++.]

Figure: Average communication volume per node on the Yahoo! dataset. CCD++ and SGD use two MPI ranks per node and ALS uses one.

SLIDE 37

Netflix strong scaling

[Figure: time per epoch (s), log scale, vs. number of nodes (1–32) on Netflix for ALS, SGD, and CCD++.]

SLIDE 38

Amazon strong scaling

[Figure: time per epoch (s), log scale, vs. number of nodes (1–32) on Amazon for ALS, CCD++, and SGD.]

SLIDE 39

Scaling factorization rank on 1024 cores

[Figure: time per epoch (s) vs. factorization rank (10–80) for ALS, SGD, and CCD++.]

Figure: Effects of increasing factorization rank on the Yahoo! dataset.
