HIGHLY PARALLEL METHODS FOR MACHINE LEARNING AND SIGNAL RECOVERY
Tom Goldstein
TOPICS
- Introduction
- ADMM / Fast ADMM
- Application: Distributed computing
- Automation & Adaptivity
FIRST-ORDER METHODS
    minimize_x F(x)
Gradient descent generates iterates x^0, x^1, x^2, ... by stepping down the gradient:
    x^{k+1} = x^k − τ∇F(x^k)
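As a concrete reference point, here is a minimal NumPy sketch of this iteration (illustrative only; the test function and stepsize are placeholders, not anything from the talk):

```python
import numpy as np

def gradient_descent(grad_F, x0, tau, iters=100):
    """Plain gradient descent: x_{k+1} = x_k - tau * grad F(x_k)."""
    x = x0.copy()
    for _ in range(iters):
        x = x - tau * grad_F(x)
    return x

# Example: minimize F(x) = 0.5*||x - c||^2, whose gradient is x - c.
c = np.array([1.0, -2.0])
x_star = gradient_descent(lambda x: x - c, np.zeros(2), tau=0.5)
```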
ADMM
    minimize H(u) + G(v)   subject to   Au + Bv = b
Augmented Lagrangian saddle-point formulation:
    max_λ min_{u,v}  H(u) + G(v) + ⟨λ, b − Au − Bv⟩ + (τ/2)‖b − Au − Bv‖²
At a saddle point, the primal feasibility condition b − Au − Bv = 0 holds. ADMM minimizes over u and v one block at a time, then takes a gradient ascent step in λ:
    u^{k+1} = argmin_u  H(u) − ⟨λ^k, Au⟩ + (τ/2)‖b − Au − Bv^k‖²
    v^{k+1} = argmin_v  G(v) − ⟨λ^k, Bv⟩ + (τ/2)‖b − Au^{k+1} − Bv‖²
    λ^{k+1} = λ^k + τ(b − Au^{k+1} − Bv^{k+1})
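To make the three steps concrete, here is a minimal NumPy sketch that instantiates the template for a lasso-type splitting with H(u) = ½‖Du − f‖², G(v) = μ|v|, A = I, B = −I, b = 0 (an illustration under assumed names such as `admm_lasso` and `shrink`, not code from the talk):

```python
import numpy as np

def shrink(x, t):
    """Soft-thresholding: the proximal operator of t*|.|_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def admm_lasso(D, f, mu, tau=1.0, iters=200):
    """Two-block ADMM for  min_u 0.5*||D u - f||^2 + mu*|u|_1,
    split as H(u) = 0.5*||D u - f||^2, G(v) = mu*|v|_1, constraint u = v."""
    n = D.shape[1]
    u, v, lam = np.zeros(n), np.zeros(n), np.zeros(n)
    M = D.T @ D + tau * np.eye(n)       # u-step is a linear solve with this matrix
    Dtf = D.T @ f
    for _ in range(iters):
        u = np.linalg.solve(M, Dtf + lam + tau * v)   # u-step
        v = shrink(u - lam / tau, mu / tau)           # v-step: prox of G
        lam = lam + tau * (v - u)                     # dual ascent step
    return u
```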
APPLICATION: SPLIT BREGMAN / TOTAL VARIATION
Denoising (recover a clean image u from a noisy image f):
    min_u |∇u| + (μ/2)‖u − f‖²
Deblurring (K is a convolution operator, f a blurred image):
    min_u |∇u| + (μ/2)‖Ku − f‖²
General problem:
    min_u |∇u| + (μ/2)‖Au − f‖²
Goldstein & Osher, "Split Bregman," 2009
Introduce a splitting variable v ≈ ∇u:
    minimize |v| + (μ/2)‖Au − f‖²   subject to   v − ∇u = 0
Augmented Lagrangian:
    |v| + (μ/2)‖Au − f‖² + ⟨λ, v − ∇u⟩ + (τ/2)‖v − ∇u‖²
Split Bregman / ADMM iteration:
    u^{k+1} = argmin_u  (μ/2)‖Au − f‖² + (τ/2)‖v^k − ∇u − λ^k‖²
    v^{k+1} = argmin_v  |v| + (τ/2)‖v − ∇u^{k+1} − λ^k‖²
    λ^{k+1} = λ^k + τ(∇u^{k+1} − v^{k+1})
Goldstein & Osher, 2008
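The v-step has a closed form. A small sketch of that single update, assuming the anisotropic case where |v| is the sum of absolute values (`v_step` is an illustrative name):

```python
import numpy as np

def v_step(grad_u, lam, tau):
    """v-update of the split Bregman iteration above: elementwise
    soft-thresholding of grad_u + lam at level 1/tau."""
    w = grad_u + lam
    return np.sign(w) * np.maximum(np.abs(w) - 1.0 / tau, 0.0)
```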
WHY ACCELERATE?
[Figure: gradient-descent iterates x^0, ..., x^4 vs. accelerated iterates.] Plain gradient methods converge at rate O(1/k); accelerated first-order methods achieve the optimal O(1/k²) rate.
Nemirovski and Yudin '83
Nesterov's accelerated gradient method for minimize_x F(x):
    x^{k+1} = y^k − τ∇F(y^k)
    α_{k+1} = (1 + sqrt(1 + 4α_k²)) / 2
    y^{k+1} = x^{k+1} + ((α_k − 1)/α_{k+1}) (x^{k+1} − x^k)
Nesterov '83
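A minimal sketch of this accelerated iteration (illustrative; `nesterov` and its arguments are assumed names):

```python
import numpy as np

def nesterov(grad_F, x0, tau, iters=100):
    """Nesterov's accelerated gradient method with the alpha_k momentum rule above."""
    x, y, alpha = x0.copy(), x0.copy(), 1.0
    for _ in range(iters):
        x_new = y - tau * grad_F(y)                            # gradient step at y
        alpha_new = 0.5 * (1 + np.sqrt(1 + 4 * alpha**2))      # momentum parameter
        y = x_new + ((alpha - 1) / alpha_new) * (x_new - x)    # overshoot / momentum
        x, alpha = x_new, alpha_new
    return x
```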
OBSTACLE: NO "OBJECTIVE" TO MINIMIZE
The ADMM saddle-point formulation has no single objective whose decrease can be monitored, so progress has to be measured another way.
MEASURING CONVERGENCE: RESIDUALS
For  minimize H(u) + G(v) subject to Au + Bv = b,  with Lagrangian H(u) + G(v) + ⟨λ, b − Au − Bv⟩, the optimality conditions are
    primal (stationarity in λ):  b − Au − Bv = 0
    dual (stationarity in u):    0 ∈ ∂H(u) − Aᵀλ
The ADMM iterates violate these conditions by the primal and dual residuals
    r_k = b − Au_k − Bv_k,    d_k = τAᵀB(v_k − v_{k−1}),
which combine into a single quantity
    c_k = ‖r_k‖² + (1/τ)‖d_k‖².
For plain ADMM, c_k ≤ O(1/k) (Yuan and He, 2012). The accelerated method below achieves O(1/k²) (Goldstein, O'Donoghue, Setzer, Baraniuk, 2012).
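For reference, the residuals are cheap to evaluate; a sketch with dense NumPy arrays (the function name is illustrative):

```python
import numpy as np

def combined_residual(A, B, b, u, v, v_prev, tau):
    """Primal residual r, dual residual d, and combined residual c
    for the ADMM iterates above."""
    r = b - A @ u - B @ v               # primal residual
    d = tau * A.T @ (B @ (v - v_prev))  # dual residual
    return r, d, r @ r + (d @ d) / tau
```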
FAST ADMM (for strongly convex problems)
Require: v_{−1} = v̂_0 ∈ R^{N_v}, λ_{−1} = λ̂_0 ∈ R^{N_b}, τ > 0, α_1 = 1
1: for k = 1, 2, 3, ... do
2:   u_k = argmin_u  H(u) − ⟨λ̂_k, Au⟩ + (τ/2)‖b − Au − Bv̂_k‖²
3:   v_k = argmin_v  G(v) − ⟨λ̂_k, Bv⟩ + (τ/2)‖b − Au_k − Bv‖²
4:   λ_k = λ̂_k + τ(b − Au_k − Bv_k)
5:   α_{k+1} = (1 + sqrt(1 + 4α_k²)) / 2
6:   v̂_{k+1} = v_k + ((α_k − 1)/α_{k+1}) (v_k − v_{k−1})
7:   λ̂_{k+1} = λ_k + ((α_k − 1)/α_{k+1}) (λ_k − λ_{k−1})
8: end for
Goldstein, O'Donoghue, Setzer, Baraniuk. 2012
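A generic driver for this accelerated loop, as a hedged sketch (the two argmin subproblems are passed in as callables; `fast_admm`, `u_step`, `v_step` are illustrative names):

```python
import numpy as np

def fast_admm(u_step, v_step, A, B, b, v0, lam0, tau, iters=200):
    """Accelerated (Nesterov-type) ADMM skeleton following the listing above.
    u_step(lam_hat, v_hat) and v_step(lam_hat, u) solve steps 2 and 3."""
    v_prev, lam_prev = v0.copy(), lam0.copy()
    v_hat, lam_hat, alpha = v0.copy(), lam0.copy(), 1.0
    for _ in range(iters):
        u = u_step(lam_hat, v_hat)                          # step 2
        v = v_step(lam_hat, u)                              # step 3
        lam = lam_hat + tau * (b - A @ u - B @ v)           # step 4
        alpha_new = 0.5 * (1 + np.sqrt(1 + 4 * alpha**2))   # step 5
        beta = (alpha - 1) / alpha_new
        v_hat = v + beta * (v - v_prev)                     # step 6
        lam_hat = lam + beta * (lam - lam_prev)             # step 7
        v_prev, lam_prev, alpha = v, lam, alpha_new
    return u, v, lam
```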
CONVERGENCE GUARANTEE
To prove formal convergence bounds, assumptions are needed. Suppose H and G are strongly convex (with moduli σ_H and σ_G) and that
    τ³ < σ_H σ_G² / ( ρ(AᵀA) ρ(BᵀB)² ),
then Fast ADMM converges with
    c_k ≤ C_τ ‖λ̂_1 − λ*‖² / (k + 2)².
Goldstein, O'Donoghue, Setzer, Baraniuk. 2012
Without strong convexity, convergence is still guaranteed using a "restart" method: keep the momentum only while the combined residual c_k decreases, otherwise discard it and fall back to a plain ADMM step.
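A simplified sketch of such a restart test (the exact bookkeeping in the paper differs slightly; `eta` and the helper name are assumptions):

```python
import numpy as np

def accelerate_or_restart(c_new, c_old, alpha, v, v_prev, lam, lam_prev, eta=0.999):
    """Accelerate while the combined residual decreases by a factor eta;
    otherwise discard the momentum (restart). All vectors are NumPy arrays."""
    if c_new < eta * c_old:
        alpha_new = 0.5 * (1 + np.sqrt(1 + 4 * alpha**2))
        beta = (alpha - 1) / alpha_new
        v_hat = v + beta * (v - v_prev)            # keep the Nesterov overshoot
        lam_hat = lam + beta * (lam - lam_prev)
    else:
        alpha_new, v_hat, lam_hat = 1.0, v.copy(), lam.copy()  # restart: no momentum
    return alpha_new, v_hat, lam_hat
```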
EXPERIMENTS: FAST ADMM
Test problems: TV denoising, min_u |∇u| + (μ/2)‖u − f‖², and elastic-net regression, min_u λ_1|u| + (λ_2/2)‖u‖² + (1/2)‖Au − f‖².
[Plot: relative error vs. iteration under a restricted stepsize for ADMM and the fast ADMM variants.]
Goldstein, O'Donoghue, Setzer, Baraniuk. 2012
APPLICATION: DISTRIBUTED COMPUTING (CPU clusters vs. GPUs)
EXAMPLE: LINEAR SVM
[Figure: training points a_1, a_2, ..., a_10 with labels ±1, separating hyperplane with margin planes aᵀw = 1 and aᵀw = −1, margin width 1/‖w‖.]
Summing the hinge loss h over the training examples gives Σ_i h(a_iᵀw) = h(Aw), so the training problem is
    minimize (1/2)‖w‖² + h(Aw)
SCALED ADMM
    minimize f(x) + g(y)   subject to   Ax + By + c = 0
Augmented Lagrangian:
    L_τ(x, y, λ) = f(x) + g(y) + ⟨λ, Ax + By + c⟩ + (τ/2)‖Ax + By + c‖²
Scaled Lagrangian (these differ only by a constant):
    L_τ(x, y, λ) = f(x) + g(y) + (τ/2)‖Ax + By + c + (1/τ)λ‖²
Writing λ̂ = λ/τ gives scaled ADMM:
    x^{k+1} = argmin_x  f(x) + (τ/2)‖Ax + By^k + c + λ̂^k‖²
    y^{k+1} = argmin_y  g(y) + (τ/2)‖Ax^{k+1} + By + c + λ̂^k‖²
    λ̂^{k+1} = λ̂^k + Ax^{k+1} + By^{k+1} + c
CONSENSUS ADMM
Example: sparse (L1-regularized) least squares,
    minimize μ|x| + (1/2)‖Ax − b‖²,
with the data matrix split into row blocks A = [A_1; A_2; ...; A_N] stored on different servers:
    minimize μ|x| + Σ_i (1/2)‖A_i x − b_i‖².
More generally, for  minimize g(x) + Σ_i f_i(x),  a central server holds the global variable z and every client keeps a local copy x_i of the unknowns:
    minimize g(z) + Σ_i f_i(x_i)   subject to   x_i = z, ∀i
Boyd et al. '10
Scaled augmented Lagrangian:
    L = g(z) + Σ_i f_i(x_i) + Σ_i (τ/2)‖x_i − z + λ_i‖²
Consensus ADMM:
    central server:  z^{k+1} = argmin_z  g(z) + Σ_i (τ/2)‖x_i^k − z + λ_i^k‖²
    remote client:   x_i^{k+1} = argmin_{x_i}  f_i(x_i) + (τ/2)‖x_i − z^{k+1} + λ_i^k‖²
    remote client:   λ_i^{k+1} = λ_i^k + x_i^{k+1} − z^{k+1}
CONSENSUS LASSO
    minimize μ|x| + Σ_i (1/2)‖A_i x − b_i‖²
Scaled augmented Lagrangian:
    L = μ|z| + Σ_i (1/2)‖A_i x_i − b_i‖² + Σ_i (τ/2)‖x_i − z + λ_i‖²
Consensus LASSO updates:
    MPI reduce:      η^k = (1/N) Σ_i (x_i^k + λ_i^k)
    central server:  z^{k+1} = argmin_z  μ|z| + (Nτ/2)‖z − η^k‖²
    remote client:   x_i^{k+1} = argmin_{x_i}  (1/2)‖A_i x_i − b_i‖² + (τ/2)‖x_i − z^{k+1} + λ_i^k‖²
    remote client:   λ_i^{k+1} = λ_i^k + x_i^{k+1} − z^{k+1}
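A single-process sketch of these updates (the N clients are simulated in a loop; in a real deployment the η sum is an MPI reduce and each x_i, λ_i update runs on its own node; `consensus_lasso` is an illustrative name):

```python
import numpy as np

def consensus_lasso(A_blocks, b_blocks, mu, tau=1.0, iters=100):
    """Consensus ADMM for the lasso, following the updates above."""
    N, n = len(A_blocks), A_blocks[0].shape[1]
    x = [np.zeros(n) for _ in range(N)]
    lam = [np.zeros(n) for _ in range(N)]
    z = np.zeros(n)
    for _ in range(iters):
        eta = sum(x[i] + lam[i] for i in range(N)) / N                    # MPI reduce in practice
        z = np.sign(eta) * np.maximum(np.abs(eta) - mu / (N * tau), 0.0)  # central z-update
        for i in range(N):                                                # runs on remote clients
            Ai, bi = A_blocks[i], b_blocks[i]
            x[i] = np.linalg.solve(Ai.T @ Ai + tau * np.eye(n),
                                   Ai.T @ bi + tau * (z - lam[i]))        # x_i-update
            lam[i] = lam[i] + x[i] - z                                    # dual update
    return z
```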
TRANSPOSE REDUCTION
    minimize μ|x| + (1/2)‖Ax − b‖²,   A = [A_1; A_2; ...; A_N]
The least-squares part,  minimize (1/2)‖Ax − b‖²,  has solution (AᵀA)^{−1}Aᵀb. When A is distributed in row blocks, the small matrices needed for this solve are sums of local contributions,
    AᵀA = Σ_i A_iᵀ A_i,    Aᵀb = Σ_i A_iᵀ b_i,
so each server communicates only its (features × features) piece rather than its raw data block.
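A sketch of that accumulation (the per-block sums stand in for MPI reduces; `transpose_reduction` is an illustrative name):

```python
import numpy as np

def transpose_reduction(A_blocks, b_blocks):
    """Accumulate A^T A and A^T b from per-server blocks, then solve
    the small normal equations (A^T A) x = A^T b centrally."""
    n = A_blocks[0].shape[1]
    AtA, Atb = np.zeros((n, n)), np.zeros(n)
    for Ai, bi in zip(A_blocks, b_blocks):   # each term is computed where the data lives
        AtA += Ai.T @ Ai
        Atb += Ai.T @ bi
    return np.linalg.solve(AtA, Atb)
```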
UNWRAPPED ADMM
More generally, consider
    minimize g(x) + f(Ax) = g(x) + Σ_i f_i(A_i x).
Example: SVM,  minimize (1/2)‖x‖² + h(Ax),  where A is the data and h the hinge loss.
Introduce z = Ax:
    minimize (1/2)‖x‖² + h(z)   subject to   z = Ax
Scaled augmented Lagrangian:
    (1/2)‖x‖² + h(z) + (τ/2)‖z − Ax + λ‖²
With per-server blocks z_i = A_i x this becomes
    (1/2)‖x‖² + Σ_i h(z_i) + (τ/2) Σ_i ‖z_i − A_i x + λ_i‖²
Setup phase: form AᵀA = Σ_i A_iᵀ A_i by transpose reduction. Then, with the scaled augmented Lagrangian above:
    remote servers (z-update):  z_i^{k+1} = argmin_{z_i}  h(z_i) + (τ/2)‖z_i − A_i x^k + λ_i^k‖²
    central server (x-update):  x^{k+1} = argmin_x  (1/2)‖x‖² + (τ/2) Σ_i ‖z_i^{k+1} − A_i x + λ_i^k‖²
    remote servers (λ-update):  λ_i^{k+1} = λ_i^k + z_i^{k+1} − A_i x^{k+1}
The x-update is a least-squares solve whose normal equations use only the precomputed AᵀA and reduced right-hand sides, so it scales across nodes.
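The z-update separates over the entries of z_i, and for the hinge loss it has a closed form. A sketch of that proximal map, assuming h(z) = max(0, 1 − z) applied elementwise with labels already folded into A_i (`prox_hinge` is an illustrative name):

```python
import numpy as np

def prox_hinge(w, t):
    """Elementwise prox of t * max(0, 1 - z): returns argmin_z h(z) + (1/(2t))*(z - w)^2.
    In the z-update above, w = A_i x^k - lambda_i^k and t = 1/tau."""
    return np.minimum(w + t, np.maximum(w, 1.0))
```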
EXPERIMENTS: TRANSPOSE REDUCTION vs. CONSENSUS
[Diagram: how data moves between the nodes and the central server/cloud for SGD versus transpose reduction.]
Synthetic problem (2K features, 50K data points per core, 7500 cores): solve time in seconds, transpose reduction vs. consensus.
Goldstein & Taylor '15
Large-scale dataset (H. Jenkner, B. Lasker, '90): 960M feature vectors (1.8 TB) on 2500 cores; consensus ADMM does not converge until t = 1180 sec.
Goldstein & Taylor '15
FROM GLMs & LINEAR CLASSIFIERS TO NEURAL NETS
[Diagram: a small network with input activations a_1, hidden activations a_2, output a_3, and pre-activations z_2, z_3.]
Forward pass:  W_1 a_1 = z_2,  σ(z_2) = a_2,  W_2 a_2 = z_3,  σ(z_3) = a_3
Training problem:
    minimize ℓ(a_3)   subject to   z_2 = W_1 a_1,  a_2 = σ(z_2),  z_3 = W_2 a_2,  a_3 = σ(z_3)
Relax the constraints into quadratic penalties:
    minimize ℓ(a_3) + (1/2)‖z_2 − W_1 a_1‖² + (1/2)‖a_2 − σ(z_2)‖² + (1/2)‖z_3 − W_2 a_2‖² + (1/2)‖a_3 − σ(z_3)‖²
Alternating minimization over the blocks is then tractable:
    Solve for activations a: least squares + ridge penalty (convex)
    Solve for weights W: least squares (convex)
    Solve for the inputs z to each activation: coordinate minimization (non-convex, but solvable globally)
Transpose reduction (TR) is used for the least-squares solves, so they scale across nodes.
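For instance, with the penalty above the weight update is an ordinary least-squares problem. A minimal sketch, assuming activations are stored with one column per training sample (`weight_update` is an illustrative name; the method in the paper may add regularization):

```python
import numpy as np

def weight_update(a_prev, z):
    """Minimize ||z - W a_prev||_F^2 over W: the closed-form least-squares
    solution W = z a_prev^T (a_prev a_prev^T)^{-1} (assumes a_prev a_prev^T is invertible)."""
    return np.linalg.solve(a_prev @ a_prev.T, a_prev @ z.T).T
```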
Classical ADMM / Bregman iteration would add a multiplier term for every constraint,
    + ⟨λ_1, z_2 − W_1 a_1⟩ + ⟨λ_2, a_2 − σ(z_2)⟩ + ⟨λ_3, z_3 − W_2 a_2⟩ + ⟨λ_4, a_3 − σ(z_3)⟩,
but this is unstable because the constraints are non-linear. Instead, keep a multiplier term only on the output:
    minimize ℓ(a_3) + ⟨λ, a_3⟩ + (the quadratic penalties above)
EXPERIMENTS: CLUSTER ADMM vs. GPU
[Plots: ADMM with transpose reduction (TR) on 2496 and 7500 cores compared against a GPU (K40, 2880 cores) running conjugate gradient, SGD, and L-BFGS.]
Roughly a 100X speedup.
"Training Neural Networks Without Gradients: A Scalable ADMM Approach." Taylor, Burmeister, Xu, Singh, Patel, Goldstein. ICML 2016
AUTOMATION & ADAPTIVITY: CHOOSING THE PENALTY
    minimize H(u) + G(v)   subject to   Au + Bv = b
    max_λ min_{u,v}  H(u) + G(v) + ⟨λ, b − Au − Bv⟩ + (τ/2)‖b − Au − Bv‖²
The augmented Lagrangian contains a penalty parameter τ. How should it be chosen?
"SPECTRAL" STEPSIZE SELECTION
    minimize f(x),    x^{k+1} = x^k − τ∇f(x^k)
Locally approximate f by the isotropic quadratic y = (α/2)‖x‖², for which ∇f(x) = αx and hence
    ∇f(x^{k+1}) − ∇f(x^k) = α(x^{k+1} − x^k).
The curvature can be estimated from two iterates,
    α = (x^{k+1} − x^k)ᵀ(∇f(x^{k+1}) − ∇f(x^k)) / ‖x^{k+1} − x^k‖²,
and the stepsize is chosen as τ = 1/α.
Barzilai & Borwein, "Two-point step size gradient methods," 1988
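A sketch of the rule (`bb_stepsize` is an illustrative name):

```python
import numpy as np

def bb_stepsize(x_new, x_old, g_new, g_old):
    """Barzilai-Borwein 'spectral' stepsize: estimate the curvature alpha from
    two iterates and their gradients, then return tau = 1/alpha."""
    s = x_new - x_old                     # change in iterates
    y = g_new - g_old                     # change in gradients
    alpha = np.dot(s, y) / np.dot(s, s)   # curvature estimate
    return 1.0 / alpha
```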
FORWARD-BACKWARD SPLITTING (FASTA)
    minimize f(x) + g(x)
Forward (gradient) step:   x̂^{k+1} = x^k − τ∇f(x^k)
Backward (proximal) step:  x^{k+1} = argmin_x  g(x) + (1/(2τ))‖x − x̂^{k+1}‖²
Advantages: the only ingredients are ∇f(x) and prox_g(x), and spectral stepsize rules apply directly.
Solves: L1 least squares, sparse classification, matrix completion, democratic representations, total variation, semidefinite programs, etc.
Paper: "A Field Guide to Forward-Backward Splitting with a FASTA Implementation." T. Goldstein, C. Studer, R. Baraniuk.
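A minimal sketch of the forward-backward iteration (`fbs` and its arguments are illustrative; this is not the FASTA package itself):

```python
import numpy as np

def fbs(grad_f, prox_g, x0, tau, iters=100):
    """Forward-backward splitting for min f(x) + g(x):
    prox_g(v, t) should return argmin_x g(x) + (1/(2t))*||x - v||^2."""
    x = x0.copy()
    for _ in range(iters):
        x_hat = x - tau * grad_f(x)   # forward (gradient) step on f
        x = prox_g(x_hat, tau)        # backward (proximal) step on g
    return x

# Example: L1 least squares, min 0.5*||Dx - b||^2 + mu*|x|_1
# grad_f = lambda x: D.T @ (D @ x - b)
# prox_g = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - mu * t, 0.0)
```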
THE DUAL PROBLEM
    minimize H(u) + G(v)   subject to   Au + Bv = b
    max_λ min_{u,v}  H(u) + G(v) + λᵀ(Au + Bv − b)
Grouping the terms,
    max_λ  min_u [H(u) + λᵀAu] + min_v [G(v) + λᵀBv] − λᵀb,
the inner minimizations are (negatives of) the convex conjugates H* and G*. This gives the dual problem, which has no constraints:
    min_λ  H*(−Aᵀλ) + ⟨λ, b⟩ + G*(−Bᵀλ)
If the two dual terms behave locally like quadratics, (α/2)‖λ‖² and (β/2)‖λ‖², a natural penalty parameter balances their curvatures:
    τ = 1 / √(αβ)
These curvatures can be estimated in Barzilai-Borwein fashion from quantities the ADMM iteration already produces, comparing the current iterate k with an earlier iterate k_0, so they are essentially "free":
    α = (λ̂_k − λ̂_{k0})ᵀ A(u_k − u_{k0}) / ‖λ̂_k − λ̂_{k0}‖²,
    β = (λ_k − λ_{k0})ᵀ B(v_k − v_{k0}) / ‖λ_k − λ_{k0}‖².
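A literal sketch of these estimates and the resulting penalty (illustrative names; the full algorithm in the paper is more careful about when to accept the estimates and update τ):

```python
import numpy as np

def spectral_penalty(A, B, u_k, u_k0, v_k, v_k0, lam_k, lam_k0, lam_hat_k, lam_hat_k0):
    """Estimate the two dual curvatures from ADMM iterates at steps k and k0,
    then return the balanced penalty tau = 1/sqrt(alpha*beta)."""
    d_hat = lam_hat_k - lam_hat_k0
    d = lam_k - lam_k0
    alpha = np.dot(d_hat, A @ (u_k - u_k0)) / np.dot(d_hat, d_hat)
    beta = np.dot(d, B @ (v_k - v_k0)) / np.dot(d, d)
    return 1.0 / np.sqrt(alpha * beta)
```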
EXPERIMENTS: ADAPTIVE ADMM
[Plot (a), elastic net regression: iterations needed vs. the initial penalty parameter for Vanilla ADMM, Fast ADMM, residual balancing, and Adaptive ADMM. Plot (b), quadratic programming: iterations needed vs. problem scale for the same four methods.]
Zheng Xu, Mario Figueiredo, Tom Goldstein. "Adaptive ADMM with Spectral Penalty Parameter Selection."
SUMMARY
- Fast alternating direction methods
- "Transpose reduction" & training neural nets without gradients
- Adaptive ADMM with spectral penalty parameter selection