HIGHLY PARALLEL METHODS FOR MACHINE LEARNING AND SIGNAL RECOVERY (PowerPoint presentation transcript)


SLIDE 1

HIGHLY PARALLEL METHODS FOR MACHINE LEARNING AND SIGNAL RECOVERY

Tom Goldstein

SLIDE 2

TOPICS

  • Introduction
  • ADMM / Fast ADMM
  • Application: Distributed computing
  • Automation & Adaptivity

SLIDE 3

FIRST-ORDER METHODS

Generalizes gradient descent:

$$\text{minimize}_x \; F(x), \qquad x_{k+1} = x_k - \tau \nabla F(x_k)$$

(Figure: gradient-descent iterates $x_0, x_1, x_2, x_3, x_4$.)

SLIDE 4

FIRST-ORDER METHODS

Generalizes gradient descent:

$$\text{minimize}_x \; F(x), \qquad x_{k+1} = x_k - \tau \nabla F(x_k)$$

Pros:
  • Linear complexity
  • Parallelizable
  • Low memory requirements

SLIDE 5

FIRST-ORDER METHODS

$$\text{minimize}_x \; F(x), \qquad x_{k+1} = x_k - \tau \nabla F(x_k)$$

Pros:
  • Linear complexity
  • Parallelizable
  • Low memory requirements

Con:
  • Poor convergence rates

SLIDE 6

FIRST-ORDER METHODS

$$\text{minimize}_x \; F(x), \qquad x_{k+1} = x_k - \tau \nabla F(x_k)$$

Pros:
  • Linear complexity
  • Parallelizable
  • Low memory requirements

Con:
  • Poor convergence rates

Solution: Adaptivity and Acceleration

SLIDE 7

CONSTRAINED PROBLEMS

Big idea: Lagrange multipliers

$$\text{minimize } H(u) + G(v) \quad \text{subject to } Au + Bv = b$$

$$\max_\lambda \min_{u,v} \; H(u) + G(v) + \langle \lambda, b - Au - Bv \rangle + \frac{\tau}{2}\|b - Au - Bv\|^2$$

SLIDES 8-10

(Identical to Slide 7.)

SLIDE 11

CONSTRAINED PROBLEMS

Big idea: Lagrange multipliers

$$\text{minimize } H(u) + G(v) \quad \text{subject to } Au + Bv = b$$

$$\max_\lambda \min_{u,v} \; H(u) + G(v) + \langle \lambda, b - Au - Bv \rangle + \frac{\tau}{2}\|b - Au - Bv\|^2$$

  • Optimality for $\lambda$: $b - Au - Bv = 0$
  • Reduced energy: $H(u) + G(v)$
  • Saddle point = solution to the constrained problem

SLIDE 12

ADMM

Alternating Direction Method of Multipliers

$$\text{minimize } H(u) + G(v) \quad \text{subject to } Au + Bv = b$$

Big idea: Lagrange multipliers

$$\max_\lambda \min_{u,v} \; H(u) + G(v) + \langle \lambda, b - Au - Bv \rangle + \frac{\tau}{2}\|b - Au - Bv\|^2$$

$$u_{k+1} = \arg\min_u \; H(u) - \langle \lambda_k, Au \rangle + \frac{\tau}{2}\|b - Au - Bv_k\|^2$$
$$v_{k+1} = \arg\min_v \; G(v) - \langle \lambda_k, Bv \rangle + \frac{\tau}{2}\|b - Au_{k+1} - Bv\|^2$$
$$\lambda_{k+1} = \lambda_k + \tau(b - Au_{k+1} - Bv_{k+1})$$
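These three updates are simple to prototype. Below is a minimal Python sketch (not from the talk) of the generic two-block iteration; argmin_H and argmin_G stand in for whatever sub-problem solvers a particular application supplies.

```python
import numpy as np

def admm(argmin_H, argmin_G, A, B, b, tau=1.0, iters=100):
    """Generic two-block ADMM sketch for
    minimize H(u) + G(v) subject to Au + Bv = b.
    argmin_H(lam, v) and argmin_G(lam, u) are user-supplied solvers for the
    two sub-problems; they absorb the tau-penalty terms internally."""
    u = np.zeros(A.shape[1])
    v = np.zeros(B.shape[1])
    lam = np.zeros(b.shape[0])
    for _ in range(iters):
        u = argmin_H(lam, v)                   # u-update (v and lambda fixed)
        v = argmin_G(lam, u)                   # v-update (u and lambda fixed)
        lam = lam + tau * (b - A @ u - B @ v)  # dual ascent on the multiplier
    return u, v, lam
```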

SLIDE 13

EXAMPLE PROBLEMS

Non-smooth problems:

  • TV Denoising: $\min_u |\nabla u| + \frac{\mu}{2}\|u - f\|^2$
  • TV Deblurring: $\min_u |\nabla u| + \frac{\mu}{2}\|Ku - f\|^2$
  • General Problem: $\min_u |\nabla u| + \frac{\mu}{2}\|Au - f\|^2$

(Figure: noisy image.)

Goldstein & Osher, “Split Bregman,” 2009

SLIDES 14-18

(Builds of Slide 13; the callouts highlight the clean image, the total-variation term, the blurred image, the convolution operator $K$, and the general problem.)

SLIDE 19

WHY IS SPLITTING GOOD?

Non-smooth problem: $\min_u |\nabla u| + \frac{\mu}{2}\|Au - f\|^2$

Goldstein & Osher, “Split Bregman,” 2009

SLIDE 20

WHY IS SPLITTING GOOD?

Non-smooth problem: $\min_u |\nabla u| + \frac{\mu}{2}\|Au - f\|^2$

  • Make change of variables: $v \leftarrow \nabla u$

Goldstein & Osher, “Split Bregman,” 2009

SLIDE 21

WHY IS SPLITTING GOOD?

Non-smooth problem: $\min_u |\nabla u| + \frac{\mu}{2}\|Au - f\|^2$

  • Make change of variables: $v \leftarrow \nabla u$
  • ‘Split Bregman’ form:

$$\text{minimize } |v| + \frac{\mu}{2}\|Au - f\|^2 \quad \text{subject to } v - \nabla u = 0$$

Goldstein & Osher, “Split Bregman,” 2009

SLIDE 22

WHY IS SPLITTING GOOD?

Non-smooth problem: $\min_u |\nabla u| + \frac{\mu}{2}\|Au - f\|^2$

  • Make change of variables: $v \leftarrow \nabla u$
  • ‘Split Bregman’ form:

$$\text{minimize } |v| + \frac{\mu}{2}\|Au - f\|^2 \quad \text{subject to } v - \nabla u = 0$$

  • Augmented Lagrangian:

$$|v| + \frac{\mu}{2}\|Au - f\|^2 + \langle \lambda, v - \nabla u \rangle + \frac{\tau}{2}\|v - \nabla u\|^2$$

Goldstein & Osher, “Split Bregman,” 2009

SLIDE 23

WHY IS SPLITTING GOOD?

ADMM for TV:

$$\min_u |\nabla u| + \frac{\mu}{2}\|Au - f\|^2 \;\longrightarrow\; |v| + \frac{\mu}{2}\|Au - f\|^2 + \langle \lambda, v - \nabla u \rangle + \frac{\tau}{2}\|v - \nabla u\|^2$$

$$u_{k+1} = \arg\min_u \; \frac{\mu}{2}\|Au - f\|^2 + \frac{\tau}{2}\|v_k - \nabla u - \lambda_k\|^2$$
$$v_{k+1} = \arg\min_v \; |v| + \frac{\tau}{2}\|v - \nabla u_{k+1} - \lambda_k\|^2$$
$$\lambda_{k+1} = \lambda_k + \tau(\nabla u_{k+1} - v_{k+1})$$

Goldstein, Osher. 2008
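To make the splitting concrete, here is a small 1D TV-denoising sketch in the same spirit (illustrative code, not the authors' implementation), with $A = I$, a forward-difference matrix D in place of $\nabla$, and the scaled form of the multiplier update: the u-update is a linear solve and the v-update is shrinkage.

```python
import numpy as np

def tv_denoise_1d(f, mu=8.0, tau=1.0, iters=200):
    """ADMM / split Bregman sketch for min_u |Du| + (mu/2)||u - f||^2 in 1D."""
    n = len(f)
    D = np.diff(np.eye(n), axis=0)               # forward-difference operator, (n-1) x n
    u = f.copy()
    v = np.zeros(n - 1)                           # v ~ Du
    lam = np.zeros(n - 1)                         # scaled multiplier
    M = mu * np.eye(n) + tau * D.T @ D            # normal-equation matrix for the u-update
    for _ in range(iters):
        u = np.linalg.solve(M, mu * f + tau * D.T @ (v - lam))   # quadratic u-update
        w = D @ u + lam
        v = np.sign(w) * np.maximum(np.abs(w) - 1.0 / tau, 0.0)  # shrinkage: prox of |.|
        lam = lam + D @ u - v                                    # multiplier update
    return u

# Example: denoise a noisy piecewise-constant signal
f = np.concatenate([np.zeros(50), np.ones(50)]) + 0.1 * np.random.randn(100)
u = tv_denoise_1d(f)
```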

SLIDE 24

WHY IS SPLITTING BAD?

TV Denoising: $\min_u |\nabla u| + \frac{\mu}{2}\|u - f\|^2$

SLIDE 25

GRADIENT VS. NESTEROV

Gradient: $O\!\left(\frac{1}{k}\right)$, Nesterov: $O\!\left(\frac{1}{k^2}\right)$

(Figure: iterates $x_0, \dots, x_4$ for each method.)

SLIDE 26

GRADIENT VS. NESTEROV

Gradient: $O\!\left(\frac{1}{k}\right)$, Nesterov: $O\!\left(\frac{1}{k^2}\right)$ (optimal, Nemirovski and Yudin ’83)

(Figure: iterates $x_0, \dots, x_4$ for each method.)

SLIDE 27

NESTEROV’S METHOD

$$\text{minimize}_x \; F(x)$$

Gradient descent step, acceleration factor, and prediction step:

$$x_{k+1} = y_k - \tau \nabla F(y_k)$$
$$\alpha_{k+1} = \frac{1}{2}\left(1 + \sqrt{4\alpha_k^2 + 1}\right)$$
$$y_{k+1} = x_{k+1} + \frac{\alpha_k - 1}{\alpha_{k+1}}\,(x_{k+1} - x_k)$$

Nesterov ’83
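A minimal sketch of the three updates above, assuming $F$ is smooth and a fixed stepsize $\tau$ is supplied (illustrative, not from the slides):

```python
import numpy as np

def nesterov(grad_F, x0, tau, iters=100):
    """Nesterov's accelerated gradient sketch: grad_F returns the gradient of F,
    tau is a fixed stepsize (e.g. 1/L for an L-smooth F)."""
    x = np.asarray(x0, dtype=float)
    y = x.copy()
    alpha = 1.0
    for _ in range(iters):
        x_new = y - tau * grad_F(y)                               # gradient step at predicted point
        alpha_new = 0.5 * (1.0 + np.sqrt(4.0 * alpha**2 + 1.0))   # acceleration factor
        y = x_new + (alpha - 1.0) / alpha_new * (x_new - x)       # prediction (momentum) step
        x, alpha = x_new, alpha_new
    return x

# Example: minimize the quadratic F(x) = 0.5 x^T Q x
Q = np.diag([1.0, 10.0])
x_star = nesterov(lambda x: Q @ x, x0=[5.0, 5.0], tau=1.0 / 10.0, iters=200)
```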

SLIDE 28

ACCELERATED SPLITTING METHODS

SLIDE 29

HOW TO MEASURE CONVERGENCE?

No “objective” to minimize.

(Figures: the unconstrained problem minimizes a convex objective; the constrained problem seeks a saddle point of the Lagrangian.)

SLIDE 30

RESIDUALS

$$\text{minimize } H(u) + G(v) \quad \text{subject to } Au + Bv = b$$

  • Lagrangian: $\min_{u,v} \max_\lambda \; H(u) + G(v) + \langle \lambda, b - Au - Bv \rangle$

SLIDE 31

RESIDUALS

  • Lagrangian: $\min_{u,v} \max_\lambda \; H(u) + G(v) + \langle \lambda, b - Au - Bv \rangle$
  • Derivative for $\lambda$: $b - Au - Bv = 0$
  • Derivative for $u$: $\partial H(u) - A^T\lambda = 0$

SLIDE 32

RESIDUALS

  • Lagrangian: $\min_{u,v} \max_\lambda \; H(u) + G(v) + \langle \lambda, b - Au - Bv \rangle$
  • Derivative for $\lambda$: $b - Au - Bv = 0$, giving the residual $r_k = b - Au_k - Bv_k$
  • Derivative for $u$: $\partial H(u) - A^T\lambda = 0$, giving the residual $d_k = \partial H(u_k) - A^T\lambda_k$
  • We have convergence when derivatives are ‘small’

SLIDE 33

RESIDUALS

  • Lagrangian: $\min_{u,v} \max_\lambda \; H(u) + G(v) + \langle \lambda, b - Au - Bv \rangle$
  • Derivative for $\lambda$: $b - Au - Bv = 0$, giving $r_k = b - Au_k - Bv_k$
  • Derivative for $u$: $\partial H(u) - A^T\lambda = 0$, giving $d_k = \tau A^T B(v_k - v_{k-1})$
  • We have convergence when derivatives are ‘small’

SLIDE 34

EXPLICIT RESIDUALS

  • Explicit formulas for residuals:

$$r_k = b - Au_k - Bv_k, \qquad d_k = \tau A^T B(v_k - v_{k-1})$$

SLIDE 35

EXPLICIT RESIDUALS

  • Explicit formulas for residuals:

$$r_k = b - Au_k - Bv_k, \qquad d_k = \tau A^T B(v_k - v_{k-1})$$

  • Combined residual:

$$c_k = \|r_k\|^2 + \frac{1}{\tau}\|d_k\|^2$$

Goldstein, O'Donoghue, Setzer, Baraniuk. 2012 & Yuan and He. 2012

SLIDE 36

EXPLICIT RESIDUALS

  • Combined residual: $c_k = \|r_k\|^2 + \frac{1}{\tau}\|d_k\|^2$
  • ADMM/AMA converge at rate $c_k \le O(1/k)$

Goldstein, O'Donoghue, Setzer, Baraniuk. 2012 & Yuan and He. 2012

SLIDE 37

EXPLICIT RESIDUALS

  • Combined residual: $c_k = \|r_k\|^2 + \frac{1}{\tau}\|d_k\|^2$
  • ADMM/AMA converge at rate $c_k \le O(1/k)$
  • Goal: $O\!\left(\frac{1}{k^2}\right)$

Goldstein, O'Donoghue, Setzer, Baraniuk. 2012 & Yuan and He. 2012
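Both residuals are available from quantities the iteration already computes, so a residual-based stopping test is essentially free. A sketch (variable names are illustrative):

```python
import numpy as np

def combined_residual(A, B, b, u, v, v_old, tau):
    """Primal residual r_k, dual residual d_k, and combined residual c_k."""
    r = b - A @ u - B @ v               # primal residual
    d = tau * A.T @ (B @ (v - v_old))   # dual residual (explicit formula)
    c = np.linalg.norm(r) ** 2 + (1.0 / tau) * np.linalg.norm(d) ** 2
    return r, d, c

# inside the ADMM loop:
#   _, _, c = combined_residual(A, B, b, u, v, v_old, tau)
#   if c < tol: break
```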

SLIDE 38

FAST ADMM

Require: $v_{-1} = \hat v_0 \in \mathbb{R}^{N_v}$, $\lambda_{-1} = \hat\lambda_0 \in \mathbb{R}^{N_b}$, $\tau > 0$

1: for $k = 1, 2, 3, \dots$ do
2:   $u_k = \arg\min_u \; H(u) - \langle \hat\lambda_k, Au \rangle + \frac{\tau}{2}\|b - Au - B\hat v_k\|^2$
3:   $v_k = \arg\min_v \; G(v) - \langle \hat\lambda_k, Bv \rangle + \frac{\tau}{2}\|b - Au_k - Bv\|^2$
4:   $\lambda_k = \hat\lambda_k + \tau(b - Au_k - Bv_k)$
5:   $\alpha_{k+1} = \frac{1 + \sqrt{1 + 4\alpha_k^2}}{2}$
6:   $\hat v_{k+1} = v_k + \frac{\alpha_k - 1}{\alpha_{k+1}}(v_k - v_{k-1})$
7:   $\hat\lambda_{k+1} = \lambda_k + \frac{\alpha_k - 1}{\alpha_{k+1}}(\lambda_k - \lambda_{k-1})$
8: end for

Goldstein, O'Donoghue, Setzer, Baraniuk. 2012
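A direct transcription of the listing into Python, assuming user-supplied solvers for the two sub-problems (a sketch, not the authors' reference code):

```python
import numpy as np

def fast_admm(argmin_H, argmin_G, A, B, b, tau=1.0, iters=100):
    """Fast (accelerated) ADMM sketch for
    minimize H(u) + G(v) subject to Au + Bv = b.
    argmin_H(lam_hat, v_hat) and argmin_G(lam_hat, u) solve the two sub-problems."""
    v_hat = v_old = np.zeros(B.shape[1])
    lam_hat = lam_old = np.zeros(b.shape[0])
    alpha = 1.0
    for _ in range(iters):
        u = argmin_H(lam_hat, v_hat)                                  # step 2
        v = argmin_G(lam_hat, u)                                      # step 3
        lam = lam_hat + tau * (b - A @ u - B @ v)                     # step 4
        alpha_new = (1.0 + np.sqrt(1.0 + 4.0 * alpha**2)) / 2.0       # step 5
        v_hat = v + (alpha - 1.0) / alpha_new * (v - v_old)           # step 6
        lam_hat = lam + (alpha - 1.0) / alpha_new * (lam - lam_old)   # step 7
        v_old, lam_old, alpha = v, lam, alpha_new
    return u, v, lam
```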

SLIDE 39

(Identical to Slide 38.)

SLIDE 40

CONVERGENCE RESULTS

To prove formal convergence bounds, need assumptions:

  • Strong convexity of the objective
  • Stepsize restriction

Theorem. Suppose $H$ and $G$ are strongly convex and that
$$\tau^3 < \frac{\sigma_H \sigma_G^2}{\rho(A^TA)\,\rho(B^TB)^2},$$
then fast ADMM converges with
$$c_k \le \frac{C\tau\,\|\hat\lambda_1 - \lambda^\star\|^2}{(k+2)^2}.$$

Goldstein, O'Donoghue, Setzer, Baraniuk. 2012

Without strong convexity, convergence is still guaranteed using a “restart” method.

SLIDE 41

RESULTS: ROF

$$\min_u |\nabla u| + \frac{\mu}{2}\|u - f\|^2$$

SLIDE 42

MACHINE LEARNING: ELASTIC NET

$$\min_u \; \lambda_1|u| + \frac{\lambda_2}{2}\|u\|^2 + \frac{1}{2}\|Au - f\|^2$$

Random $A$, sparsity = 15/40.

(Figure: relative error vs. iteration for ADMM, Fast ADMM with a restricted stepsize, and Fast ADMM; Fast ADMM is roughly 20X faster.)

Goldstein, O'Donoghue, Setzer, Baraniuk. 2012

SLIDE 43

DISTRIBUTED MACHINE LEARNING

Consensus and Transpose Reduction Methods

SLIDE 44

WHY DISTRIBUTE?

Google compute engine: CPUs vs. GPU

  • Very scalable
  • Cheap (cloud platforms)
  • Communication cheap

  • Not so scalable
  • Expensive
  • Communication expensive

  • Data is stored across many servers
  • Datasets are big (memory is an issue)
  • Communication is expensive
SLIDE 45

LINEAR CLASSIFIERS

  • Feature vectors: vectors in $\mathbb{R}^n$ containing descriptions of objects
  • Labels: +1/−1 labels indicating which “class” each vector lies in

(Figure: training data $a_1, a_2, \dots, a_{10}$ with +1/−1 labels.)

SLIDE 46

LINEAR CLASSIFIERS

Goal: learn a line $a^T w = 0$ that separates the two object classes. What line is best?

(Figure: the labeled training points and a candidate separating line; the slope is determined by $w$.)

SLIDE 47

SUPPORT VECTOR MACHINE

SVM: choose the line with maximum “margin”.

Margin width $= \frac{1}{\|w\|}$; separating line $a^T w = 0$, with margin boundaries $a^T w = 1$ and $a^T w = -1$.

SLIDE 48

SUPPORT VECTOR MACHINE

Margin width $= \frac{1}{\|w\|}$.

Hinge loss: penalize points that lie within the margin:

$$\sum_i h(a_i^T w)$$

SLIDE 49

SUPPORT VECTOR MACHINE

Margin width $= \frac{1}{\|w\|}$.

Hinge loss: penalize points that lie within the margin:

$$\sum_i h(a_i^T w) = h(Aw) \quad \text{(short-hand)}$$

Combined objective:

$$\text{minimize } \frac{1}{2}\|w\|^2 + h(Aw)$$
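In code, the combined objective is only a couple of lines. The sketch below folds the labels into the rows of A (row i equal to $y_i a_i$), which is an assumption about how the data are encoded, not something stated on the slide.

```python
import numpy as np

def hinge(z):
    """Hinge loss h(z) = sum_i max(0, 1 - z_i)."""
    return np.maximum(0.0, 1.0 - z).sum()

def svm_objective(w, A):
    """SVM objective 0.5*||w||^2 + h(Aw); rows of A are label-scaled features y_i * a_i."""
    return 0.5 * np.dot(w, w) + hinge(A @ w)
```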

SLIDE 50

SCALED ADMM

$$\text{minimize } f(x) + g(y) \quad \text{subject to } Ax + By + c = 0$$

Augmented Lagrangian:
$$L_\tau(x, y, \lambda) = f(x) + g(y) + \langle \lambda, Ax + By + c \rangle + \frac{\tau}{2}\|Ax + By + c\|^2$$

Scaled Lagrangian (these differ by a constant):
$$L_\tau(x, y, \lambda) = f(x) + g(y) + \frac{\tau}{2}\left\|Ax + By + c + \tfrac{1}{\tau}\lambda\right\|^2$$

SLIDE 51

SCALED ADMM

$$\text{minimize } f(x) + g(y) \quad \text{subject to } Ax + By + c = 0$$

Scaled Lagrangian, with $\hat\lambda \leftarrow \lambda/\tau$:
$$L_\tau(x, y, \hat\lambda) = f(x) + g(y) + \frac{\tau}{2}\|Ax + By + c + \hat\lambda\|^2$$

Scaled ADMM:
$$x_{k+1} = \arg\min_x \; f(x) + \frac{\tau}{2}\|Ax + By_k + c + \hat\lambda_k\|^2$$
$$y_{k+1} = \arg\min_y \; g(y) + \frac{\tau}{2}\|Ax_{k+1} + By + c + \hat\lambda_k\|^2$$
$$\hat\lambda_{k+1} = \hat\lambda_k + Ax_{k+1} + By_{k+1} + c$$

SLIDE 52

DISTRIBUTED PROBLEMS

Example: sparse least squares
$$\text{minimize } \mu|x| + \frac{1}{2}\|Ax - b\|^2, \qquad A = \begin{pmatrix} A_1 \\ A_2 \\ \vdots \\ A_N \end{pmatrix}$$

Data stored on different servers:
$$\text{minimize } \mu|x| + \sum_i \frac{1}{2}\|A_i x - b_i\|^2$$

General form:
$$\text{minimize } g(x) + \sum_i f_i(x)$$

SLIDE 53

CONSENSUS ADMM

$$\text{minimize } g(x) + \sum_i f_i(x)$$

Central server holds the global variable $z$; every client gets a local copy $x_i$ of the unknowns:

$$\text{minimize } g(z) + \sum_i f_i(x_i) \quad \text{subject to } x_i = z, \;\forall i$$

Boyd et al. ‘10

SLIDE 54

CONSENSUS ADMM

$$\text{minimize } g(z) + \sum_i f_i(x_i) \quad \text{subject to } x_i = z, \;\forall i$$

Scaled augmented Lagrangian:
$$L = g(z) + \sum_i f_i(x_i) + \sum_i \frac{\tau}{2}\|x_i - z + \lambda_i\|^2$$

Consensus ADMM:
  • central server: $z^{k+1} = \arg\min_z \; g(z) + \sum_i \frac{\tau}{2}\|x_i^k - z + \lambda_i^k\|^2$
  • remote client: $x_i^{k+1} = \arg\min_{x_i} \; f_i(x_i) + \frac{\tau}{2}\|x_i - z^{k+1} + \lambda_i^k\|^2$
  • remote client: $\lambda_i^{k+1} = \lambda_i^k + x_i^{k+1} - z^{k+1}$

SLIDE 55

EXAMPLE: LASSO

$$\text{minimize } \mu|x| + \sum_i \frac{1}{2}\|A_i x - b_i\|^2$$

Scaled augmented Lagrangian:
$$L = \mu|z| + \sum_i \frac{1}{2}\|A_i x_i - b_i\|^2 + \sum_i \frac{\tau}{2}\|x_i - z + \lambda_i\|^2$$

Consensus LASSO:
  • MPI reduce: $\eta^k = \frac{1}{N}\sum_i \left(x_i^k + \lambda_i^k\right)$
  • central server: $z^{k+1} = \arg\min_z \; \mu|z| + \frac{N\tau}{2}\|z - \eta^k\|^2$
  • remote client: $x_i^{k+1} = \arg\min_{x_i} \; \frac{1}{2}\|A_i x_i - b_i\|^2 + \frac{\tau}{2}\|x_i - z^{k+1} + \lambda_i^k\|^2$
  • remote client: $\lambda_i^{k+1} = \lambda_i^k + x_i^{k+1} - z^{k+1}$
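A serial simulation of these consensus LASSO updates (each block A_i, b_i would really live on a separate client; names and defaults are illustrative):

```python
import numpy as np

def soft_threshold(w, t):
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def consensus_lasso(A_blocks, b_blocks, mu=0.1, tau=1.0, iters=200):
    """Consensus ADMM sketch for min_x mu*|x|_1 + sum_i 0.5*||A_i x - b_i||^2."""
    N = len(A_blocks)
    d = A_blocks[0].shape[1]
    X = [np.zeros(d) for _ in range(N)]      # local copies x_i
    Lam = [np.zeros(d) for _ in range(N)]    # scaled multipliers lambda_i
    z = np.zeros(d)                          # global variable
    # pre-factor the local normal equations (A_i^T A_i + tau I)
    facs = [np.linalg.cholesky(A.T @ A + tau * np.eye(d)) for A in A_blocks]
    rhs0 = [A.T @ b for A, b in zip(A_blocks, b_blocks)]
    for _ in range(iters):
        # central server: z-update is a soft-threshold of the averaged message
        eta = np.mean([x + lam for x, lam in zip(X, Lam)], axis=0)
        z = soft_threshold(eta, mu / (N * tau))
        # remote clients: local least-squares x-updates and multiplier updates
        for i in range(N):
            rhs = rhs0[i] + tau * (z - Lam[i])
            X[i] = np.linalg.solve(facs[i].T, np.linalg.solve(facs[i], rhs))
            Lam[i] = Lam[i] + X[i] - z
    return z
```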

SLIDE 56

PROBLEMS WITH CONSENSUS

$$\text{minimize } \mu|x| + \sum_i \frac{1}{2}\|A_i x - b_i\|^2$$

Works well for homogeneous data: everyone agrees.

$$\text{minimize } \mu|x| + \frac{1}{2}\|Ax - b\|^2, \qquad A = \begin{pmatrix} A_1 \\ A_2 \\ \vdots \\ A_N \end{pmatrix}$$

What about heterogeneous data? What about many cores?

SLIDE 57

TRANSPOSE REDUCTION

$$\text{minimize } \frac{1}{2}\|Ax - b\|^2 \quad\Rightarrow\quad x = (A^TA)^{-1}A^Tb \quad \text{(normal equations)}$$

(Figure: forming the small matrix $A^TA$ from the tall matrix $A$ and its transpose.)

SLIDE 58

TRANSPOSE REDUCTION

$$\text{minimize } \frac{1}{2}\|Ax - b\|^2 \quad\Rightarrow\quad x = (A^TA)^{-1}A^Tb \quad \text{(normal equations)}$$

Distributed computation with $A = \begin{pmatrix} A_1 \\ A_2 \\ \vdots \\ A_N \end{pmatrix}$:
$$A^Tb = \sum_i A_i^T b_i, \qquad A^TA = \sum_i A_i^T A_i$$

Big idea: solve complex problems with ADMM, solve the least-squares sub-problems with TR.
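The reduction itself is one sum per server. A sketch of the distributed normal equations, simulated serially:

```python
import numpy as np

def tr_least_squares(A_blocks, b_blocks):
    """Transpose-reduction sketch: each server forms A_i^T A_i and A_i^T b_i
    locally; only these small matrices are summed (e.g. via an MPI reduce)."""
    AtA = sum(A.T @ A for A in A_blocks)   # d x d, independent of the number of rows
    Atb = sum(A.T @ b for A, b in zip(A_blocks, b_blocks))
    return np.linalg.solve(AtA, Atb)       # normal equations x = (A^T A)^{-1} A^T b
```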

SLIDE 59

UNWRAPPED ADMM

$$\text{minimize } g(x) + f(Ax) = g(x) + \sum_i f_i(A_i x)$$

Example: SVM
$$\text{minimize } \frac{1}{2}\|x\|^2 + h(Ax), \qquad A = \text{data}, \; h = \text{hinge loss}$$

SLIDE 60

UNWRAPPED ADMM

$$\text{minimize } g(x) + f(Ax) = g(x) + \sum_i f_i(A_i x)$$

Example: SVM
$$\text{minimize } \frac{1}{2}\|x\|^2 + h(Ax)$$

“Unwrapped” form:
$$\text{minimize } \frac{1}{2}\|x\|^2 + h(z) \quad \text{subject to } z = Ax$$

SLIDE 61

TRANSPOSE REDUCTION ADMM

$$\text{minimize } \frac{1}{2}\|x\|^2 + h(Ax)$$

“Unwrapped” form:
$$\text{minimize } \frac{1}{2}\|x\|^2 + h(z) \quad \text{subject to } z = Ax$$

Scaled augmented Lagrangian:
$$\text{minimize } \frac{1}{2}\|x\|^2 + h(z) + \frac{\tau}{2}\|z - Ax + \lambda\|^2$$

Least squares (the $x$-update): use TR here.
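A sketch of unwrapped ADMM for the SVM objective: the x-update is exactly the least-squares solve that transpose reduction provides, and the z-update is the separable prox of the hinge loss. The prox formula below is the standard closed form and is an assumption, not taken from the slides.

```python
import numpy as np

def prox_hinge(w, t):
    """Elementwise prox of t * max(0, 1 - z)."""
    z = w.copy()
    z[w < 1.0 - t] = w[w < 1.0 - t] + t
    z[(w >= 1.0 - t) & (w < 1.0)] = 1.0
    return z

def unwrapped_admm_svm(A, tau=1.0, iters=200):
    """Unwrapped ADMM sketch for min_x 0.5*||x||^2 + h(Ax),
    where rows of A are label-scaled features y_i * a_i."""
    m, d = A.shape
    x, z, lam = np.zeros(d), np.zeros(m), np.zeros(m)
    M = np.eye(d) + tau * A.T @ A              # TR: only the d x d matrix A^T A is needed
    for _ in range(iters):
        x = np.linalg.solve(M, tau * A.T @ (z + lam))   # least-squares x-update
        z = prox_hinge(A @ x - lam, 1.0 / tau)          # hinge-prox z-update
        lam = lam + z - A @ x                           # multiplier update
    return x
```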

SLIDE 62

DISTRIBUTED VERSION

$$\text{minimize } \frac{1}{2}\|x\|^2 + \sum_i h(A_i x)$$

“Unwrapped” form:
$$\text{minimize } \frac{1}{2}\|x\|^2 + \sum_i h(z_i) \quad \text{subject to } z_i = A_i x$$

Scaled augmented Lagrangian:
$$\text{minimize } \frac{1}{2}\|x\|^2 + \sum_i h(z_i) + \frac{\tau}{2}\|z_i - A_i x + \lambda_i\|^2$$

SLIDE 63

DISTRIBUTED STEPS

Scaled augmented Lagrangian:
$$\text{minimize } \frac{1}{2}\|x\|^2 + \sum_i h(z_i) + \frac{\tau}{2}\|z_i - A_i x + \lambda_i\|^2$$

  • Setup phase: form $A^TA = \sum_i A_i^T A_i$
  • Remote servers, $z$-update: $z_i^{k+1} = \arg\min_{z_i} \; h(z_i) + \frac{\tau}{2}\|z_i - A_i x^k + \lambda_i^k\|^2$
  • Central server, $x$-update: $x^{k+1} = \arg\min_x \; \frac{1}{2}\|x\|^2 + \sum_i \frac{\tau}{2}\|z_i^{k+1} - A_i x + \lambda_i^k\|^2$
  • Remote servers, $\lambda$-update: $\lambda_i^{k+1} = \lambda_i^k + z_i^{k+1} - A_i x^{k+1}$

SLIDE 64

TRANSPOSE REDUCTION

(Figure: data flow for SGD vs. transpose reduction between the cloud data and the central server.)

Transpose reduction ADMM:
  • Servers work together to find one solution
  • Central server makes simple queries
  • Fits a classifier (SVM) on a 7TB dataset in 15 seconds using 7500 cores

SLIDE 65

EXPERIMENT: SVM

2K features, 50K data points/core.

(Figure: transpose reduction vs. consensus; roughly a 1000X speedup.)

Goldstein & Taylor ‘15

SLIDE 66

GUIDE STAR CATALOG II

All known objects from the Palomar and UK Schmidt surveys; about 2 billion objects.

H. Jenkner, B. Lasker, ‘90

SLIDE 67

USNO GUIDE STAR DATABASE

960M feature vectors (1.8TB), 2500 cores. Consensus does not converge until t = 1180 sec.

60X speedup!

Goldstein & Taylor ‘15

SLIDE 68

MORE COMPLEX PROBLEMS

GLMs & linear classifiers; neural nets.

SLIDE 69

ADMM FOR NEURAL NETS

(Figure: a network with input activations $a_1$, hidden activations $a_2$, output activations $a_3$, and pre-activations $z_2, z_3$.)

$$W_1 a_1 = z_2, \quad \sigma(z_2) = a_2, \quad W_2 a_2 = z_3, \quad \sigma(z_3) = a_3$$

$$\text{minimize } \ell(a_3)$$

SLIDE 70

ADMM FOR NEURAL NETS

$$W_1 a_1 = z_2, \quad \sigma(z_2) = a_2, \quad W_2 a_2 = z_3, \quad \sigma(z_3) = a_3$$

$$\text{minimize } \ell(a_3) \quad \text{subject to } z_2 = W_1 a_1,\; a_2 = \sigma(z_2),\; z_3 = W_2 a_2,\; a_3 = \sigma(z_3)$$

SLIDE 71

ADMM FOR NEURAL NETS

$$\text{minimize } \ell(a_3) + \frac{1}{2}\|z_2 - W_1 a_1\|^2 + \frac{1}{2}\|a_2 - \sigma(z_2)\|^2 + \frac{1}{2}\|z_3 - W_2 a_2\|^2 + \frac{1}{2}\|a_3 - \sigma(z_3)\|^2$$

SLIDE 72

MINIMIZATION STEPS

$$\text{minimize } \ell(a_3) + \frac{1}{2}\|z_2 - W_1 a_1\|^2 + \frac{1}{2}\|a_2 - \sigma(z_2)\|^2 + \frac{1}{2}\|z_3 - W_2 a_2\|^2 + \frac{1}{2}\|a_3 - \sigma(z_3)\|^2$$

  • Solve for activations: least squares + ridge penalty (convex)
  • Solve for weights: least squares (convex)
  • Solve for inputs: coordinate minimization (non-convex but global)

SLIDE 73

MINIMIZATION STEPS

  • Solve for activations: least squares + ridge penalty (convex)
  • Solve for weights: least squares (convex)
  • Solve for inputs: coordinate minimization (non-convex but global)
  • Use TR to solve the least squares: scales across nodes

SLIDE 74

LAGRANGE MULTIPLIERS

Classical ADMM / Bregman iteration adds a multiplier for every constraint:
$$\text{minimize } \ell(a_3) + \frac{1}{2}\|z_2 - W_1 a_1\|^2 + \frac{1}{2}\|a_2 - \sigma(z_2)\|^2 + \frac{1}{2}\|z_3 - W_2 a_2\|^2 + \frac{1}{2}\|a_3 - \sigma(z_3)\|^2$$
$$+ \langle \lambda_1, z_2 - W_1 a_1\rangle + \langle \lambda_2, a_2 - \sigma(z_2)\rangle + \langle \lambda_3, z_3 - W_2 a_2\rangle + \langle \lambda_4, a_3 - \sigma(z_3)\rangle$$
…unstable because of the non-linear constraints.

Instead, keep a single multiplier term:
$$\text{minimize } \ell(a_3) + \frac{1}{2}\|z_2 - W_1 a_1\|^2 + \frac{1}{2}\|a_2 - \sigma(z_2)\|^2 + \frac{1}{2}\|z_3 - W_2 a_2\|^2 + \frac{1}{2}\|a_3 - \sigma(z_3)\|^2 + \langle \lambda, a_3\rangle$$

SLIDE 75

NEURAL NETS

2496-core ADMM vs. GPU (K40, 2880 cores).

2 hidden layers, ~150K weights. “Street View” house number classification: 120K data points.

(Figure: TR on 2496 cores vs. CG, SGD, and L-BFGS on the GPU.)

“Training Neural Networks Without Gradients: A Scalable ADMM Approach.” Taylor, Burmeister, Xu, Singh, Patel, Goldstein. ICML 2016

SLIDE 76

NEURAL NETS

7500-core ADMM vs. GPU (K40, 2880 cores).

3 hidden layers, ~300K weights. “Higgs” particle classification: 10.5M points.

(Figure: TR on 7500 cores vs. BFGS and SGD on the GPU.)

“Training Neural Networks Without Gradients: A Scalable ADMM Approach.” Taylor, Burmeister, Xu, Singh, Patel, Goldstein. ICML 2016

SLIDE 77

NEURAL NETS

Same setup as Slide 76; roughly a 100X speedup.

“Training Neural Networks Without Gradients: A Scalable ADMM Approach.” Taylor, Burmeister, Xu, Singh, Patel, Goldstein. ICML 2016

SLIDE 78

ADAPTIVE METHODS

$$\text{minimize } H(u) + G(v) \quad \text{subject to } Au + Bv = b$$

Augmented Lagrangian:
$$\max_\lambda \min_{u,v} \; H(u) + G(v) + \langle \lambda, b - Au - Bv \rangle + \frac{\tau}{2}\|b - Au - Bv\|^2$$

How to choose $\tau$? (Compare gradient descent: $\text{minimize } f(x)$, $x_{k+1} = x_k - \tau \nabla f(x_k)$.)

SLIDE 79

SPECTRAL STEPSIZE

$$\text{minimize } f(x), \qquad x_{k+1} = x_k - \tau \nabla f(x_k)$$

“Spectral” approximation: fit a quadratic model to $y = f(x)$ near the current iterate $x_k$.

Barzilai & Borwein, “Two-point stepsize gradient methods,” 1988

SLIDE 80

SPECTRAL STEPSIZE

$$\text{minimize } f(x), \qquad x_{k+1} = x_k - \tau \nabla f(x_k)$$

“Spectral” approximation: model $y = f(x)$ near $x_k$ by the quadratic $y = \frac{\alpha}{2}\|x\|^2$.

  • Optimal stepsize $= \frac{1}{\alpha}$

Barzilai & Borwein, “Two-point stepsize gradient methods,” 1988

SLIDE 81

SPECTRAL STEPSIZE

(Build of Slide 80, showing the step from $x_k$ to $x_{k+1}$ under the quadratic model.)

Barzilai & Borwein, “Two-point stepsize gradient methods,” 1988

SLIDE 82

HOW TO GET STEPSIZE

$$\text{minimize } f(x), \qquad x_{k+1} = x_k - \tau \nabla f(x_k)$$

From the quadratic model $y = \frac{\alpha}{2}\|x\|^2$: $\nabla f(x) = \alpha x$, so
$$\nabla f(x_{k+1}) - \nabla f(x_k) = \alpha\,(x_{k+1} - x_k).$$

Least-squares solution:
$$\alpha = \frac{(x_{k+1} - x_k)^T\left(\nabla f(x_{k+1}) - \nabla f(x_k)\right)}{\|x_{k+1} - x_k\|^2}$$

  • Optimal stepsize $= \frac{1}{\alpha}$

Barzilai & Borwein, “Two-point stepsize gradient methods,” 1988
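Gradient descent with this spectral (Barzilai-Borwein) stepsize is only a few lines; a sketch with illustrative defaults:

```python
import numpy as np

def bb_gradient_descent(grad_f, x0, tau0=1e-2, iters=100):
    """Gradient descent with the Barzilai-Borwein (spectral) stepsize."""
    x = np.asarray(x0, dtype=float)
    g = grad_f(x)
    tau = tau0
    for _ in range(iters):
        x_new = x - tau * g
        g_new = grad_f(x_new)
        s, y = x_new - x, g_new - g                # iterate and gradient differences
        denom = s @ s
        if denom == 0.0:                           # no movement: converged
            return x_new
        alpha = (s @ y) / denom                    # least-squares curvature estimate
        tau = 1.0 / alpha if alpha > 0 else tau0   # optimal step for the quadratic model
        x, g = x_new, g_new
    return x
```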

SLIDE 83

(Identical to Slide 82.)

SLIDE 84

FORWARD-BACKWARD METHOD

$$\text{minimize } f(x) + g(x)$$

$$\hat x_{k+1} = x_k - \tau \nabla f(x_k), \qquad x_{k+1} = \arg\min_x \; g(x) + \frac{1}{2\tau}\|x - \hat x_{k+1}\|^2$$

Advantages:
  • Fast! Superlinear for some problems
  • Automated
  • Can solve non-differentiable problems
  • Can handle simple set constraints
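A sketch of the forward-backward iteration for L1-penalized least squares: a gradient step on f followed by the prox (shrinkage) step on g. This is generic illustrative code, not the FASTA package's interface; the spectral stepsize from the previous slides could replace the fixed tau.

```python
import numpy as np

def soft_threshold(w, t):
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def forward_backward_l1(A, b, mu=0.1, tau=None, iters=300):
    """Forward-backward sketch for min_x 0.5*||Ax - b||^2 + mu*|x|_1."""
    if tau is None:
        tau = 1.0 / np.linalg.norm(A, 2) ** 2    # 1/L, with L = ||A||_2^2 the Lipschitz constant
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x_half = x - tau * (A.T @ (A @ x - b))   # forward (gradient) step on f
        x = soft_threshold(x_half, tau * mu)     # backward (prox) step on g = mu*|.|_1
    return x
```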

SLIDE 85

FASTA: Fast Adaptive Shrinkage/Thresholding Algorithm

$$\text{minimize } f(x) + g(x)$$

Solves: L1 least squares, sparse classification, matrix completion, democratic representations, total variation, semidefinite programs, etc.

You provide: $\nabla f(x)$ and $\text{prox}_g(x)$.

FASTA does the rest: stepsize choice, acceleration, stopping conditions, preconditioners, line search, convergence theory.

SLIDE 86

FASTA: Fast Adaptive Shrinkage/Thresholding Algorithm

Paper: “A Field Guide to Forward-Backward Splitting with a FASTA Implementation,” T. Goldstein, C. Studer, R. Baraniuk.

Solves: L1 least squares, sparse classification, matrix completion, democratic representations, total variation, semidefinite programs, etc.

SLIDE 87

ADAPTIVE ADMM

$$\text{minimize } H(u) + G(v) \quad \text{subject to } Au + Bv = b$$

Lagrangian:
$$\max_\lambda \min_{u,v} \; H(u) + G(v) + \lambda^T(Au + Bv - b)$$

$$\max_\lambda \min_{u,v} \; \left[H(u) + \lambda^T Au\right] + \left[G(v) + \lambda^T Bv\right] - \lambda^T b$$

The two bracketed minimizations define the conjugate terms $H^*(\lambda)$ and $G^*(\lambda)$.

SLIDE 88

ADAPTIVE ADMM

$$\text{minimize } H(u) + G(v) \quad \text{subject to } Au + Bv = b$$

Lagrangian:
$$\max_\lambda \min_{u,v} \; H(u) + G(v) + \lambda^T(Au + Bv - b)$$

Dual problem (no constraints):
$$\min_\lambda \; H^*(A^T\lambda) - \langle \lambda, b \rangle + G^*(B^T\lambda)$$

SLIDE 89

ADAPTIVE ADMM

Dual problem (no constraints):
$$\min_\lambda \; \underbrace{H^*(A^T\lambda)}_{\approx\, \frac{\alpha}{2}\|\lambda\|^2} - \langle \lambda, b \rangle + \underbrace{G^*(B^T\lambda)}_{\approx\, \frac{\beta}{2}\|\lambda\|^2}$$

  • Optimal stepsize $= \frac{1}{\sqrt{\alpha\beta}}$

SLIDE 90

ADAPTIVE ADMM

$$\text{minimize } H(u) + G(v) \quad \text{subject to } Au + Bv = b$$

Dual problem (no constraints), with quadratic models $\frac{\alpha}{2}\|\lambda\|^2$ and $\frac{\beta}{2}\|\lambda\|^2$ for the two conjugate terms:
$$\min_\lambda \; H^*(A^T\lambda) - \langle \lambda, b \rangle + G^*(B^T\lambda)$$

  • Optimal stepsize $= \frac{1}{\sqrt{\alpha\beta}}$

Curvatures are “free” given the ADMM iterates:
$$\alpha = \frac{(\hat\lambda_k - \hat\lambda_0)^T A(u_k - u_{k_0})}{\|\hat\lambda_k - \hat\lambda_0\|^2}, \qquad \beta = \frac{(\lambda_k - \lambda_0)^T B(v_k - v_{k_0})}{\|\lambda_k - \lambda_0\|^2}$$
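The curvature estimates plug straight into the penalty parameter. A sketch of the computation performed between iterations (the published method adds safeguarding; names here are illustrative):

```python
import numpy as np

def spectral_penalty(lam_hat, lam_hat0, u, u0, lam, lam0, v, v0, A, B):
    """Spectral penalty sketch: estimate the two dual curvatures from current
    ADMM iterates and an earlier iterate k0, then set tau = 1/sqrt(alpha*beta)."""
    dl_hat = lam_hat - lam_hat0
    dl = lam - lam0
    alpha = dl_hat @ (A @ (u - u0)) / (dl_hat @ dl_hat)   # curvature of the H* term
    beta = dl @ (B @ (v - v0)) / (dl @ dl)                # curvature of the G* term
    return 1.0 / np.sqrt(alpha * beta)
```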

SLIDE 91

DEPENDENCE ON STEPSIZE GUESS

(a) Elastic net regression.

(Figure: iterations vs. initial penalty parameter for Vanilla ADMM, Fast ADMM, Residual balancing, and Adaptive ADMM.)

Zheng Xu, Mario Figueiredo, Tom Goldstein. “Adaptive ADMM with spectral penalty parameter selection.” 2014

SLIDE 92

PROBLEM SCALING

(b) Quadratic programming.

(Figure: iterations vs. problem scale for Vanilla ADMM, Fast ADMM, Residual balancing, and Adaptive ADMM.)

Zheng Xu, Mario Figueiredo, Tom Goldstein. “Adaptive ADMM with spectral penalty parameter selection.” 2014

SLIDE 93

WRAP UP

  • ADMM: complex problems become simple sub-steps; great for model-fitting problems
  • Distributed variants: transpose reduction
  • Bells and whistles: Fast ADMM for poorly-conditioned problems, adaptive variants for automation

SLIDE 94

ACKNOWLEDGEMENTS

Thanks to my collaborators:

  • Fast Alternating Direction Methods, “Transpose Reduction” & “Training Neural Nets without Gradients”: Brendan O’Donoghue (Google DeepMind), Ernie Esser (Univ. of British Columbia), Simon Setzer (Saarland University), Rich Baraniuk (Rice), Gavin Taylor (US Naval Academy), Ankit Patel (Rice)
  • Adaptive ADMM with spectral penalty parameter selection: Zheng Xu (Maryland), Mario Figueiredo (University of Lisbon)

SLIDE 95

Questions??