slide-1
SLIDE 1

Composite Objective Mirror Descent

John C. Duchi¹,³   Shai Shalev-Shwartz²   Yoram Singer³   Ambuj Tewari⁴

¹University of California, Berkeley   ²Hebrew University of Jerusalem, Israel   ³Google Research   ⁴Toyota Technological Institute, Chicago

June 29, 2010


slide-2
SLIDE 2

Large scale logistic regression

Problem: $n$ huge,
$$\min_x \; \underbrace{\frac{1}{n}\sum_{i=1}^{n}\log\bigl(1+\exp(\langle a_i, x\rangle)\bigr)}_{=f(x)} \;+\; \lambda\|x\|_1$$

"Usual" approach: online gradient descent (Zinkevich '03). Let $g_t = \nabla\log(1+\exp(\langle a_t, x_t\rangle))$ and take
$$x_{t+1} = x_t - \eta_t g_t - \eta_t\lambda\,\mathrm{sign}(x_t),$$
then perform an online-to-batch conversion.
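A minimal numpy sketch of this recipe (all names ours; a synthetic setup where labels are folded into the rows of `A`):

```python
import numpy as np

def ogd_l1_logistic(A, lam=0.1, eta=0.1, T=1000, seed=0):
    """Online subgradient descent on f(x) + lam*||x||_1, logistic loss."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    avg = np.zeros(d)                 # running average: online-to-batch conversion
    for t in range(1, T + 1):
        a = A[rng.integers(n)]        # sample one example a_t
        g = a / (1.0 + np.exp(-a @ x))             # grad of log(1 + exp(<a_t, x>))
        x = x - eta * g - eta * lam * np.sign(x)   # subgradient step
        avg += (x - avg) / t
    return avg
```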


slide-3
SLIDE 3

Problems with usual approach

◮ Regret bound/convergence rate: set $G = \max_t \|g_t + \lambda\,\mathrm{sign}(x_t)\|_2$; then
$$f(x_T) + \lambda\|x_T\|_1 = f(x^*) + \lambda\|x^*\|_1 + O\!\left(\frac{\|x^*\|_2\, G}{\sqrt{T}}\right)$$
  • But $G = \Theta(\sqrt{d})$: an additional penalty from the $\mathrm{sign}(x_t)$ term
◮ No sparsity in $x_T$
◮ Why should we suffer for the $\|\cdot\|_1$ term?

slide-5
SLIDE 5

Online Gradient Descent

Let $g_t = \nabla\log(1+\exp(\langle a_t, x_t\rangle)) + \lambda\,\mathrm{sign}(x_t)$. OGD step (Zinkevich '03):
$$x_{t+1} = x_t - \eta g_t = \mathop{\mathrm{argmin}}_x \left\{ \eta\langle g_t, x\rangle + \frac{1}{2}\|x - x_t\|_2^2 \right\}$$

(figures: the linear model $f(x_t) + \langle g_t, x - x_t\rangle$ beneath $f(x) + \lambda\|x\|_1$, and the proximal view $\langle g_t, x\rangle + B_\psi(x, x_t)$)

slide-7
SLIDE 7

Problems with Subgradient Methods

◮ Subgradients are non-informative at singularities


slide-10
SLIDE 10

Composite Objective Approach

Let $g_t = \nabla\log(1+\exp(\langle a_t, x_t\rangle))$. Truncated gradient (Langford et al. '08, Duchi & Singer '09):
$$x_{t+1} = \mathop{\mathrm{argmin}}_x \left\{ \frac{1}{2}\|x - x_t\|^2 + \eta\langle g_t, x\rangle + \eta\lambda\|x\|_1 \right\} = \mathrm{sign}(x_t - \eta g_t)\odot\bigl[|x_t - \eta g_t| - \eta\lambda\bigr]_+$$

(figure: $x_t - \eta g_t$ soft-thresholded to $[|x_t - \eta g_t| - \eta\lambda]_+$)


slide-11
SLIDE 11

Composite Objective Approach

The update is $x_{t+1} = \mathrm{sign}(x_t - \eta g_t)\odot[|x_t - \eta g_t| - \eta\lambda]_+$. Two nice things:

◮ Sparsity from the $[\cdot]_+$ truncation
◮ Convergence rate: let $G = \max_t \|g_t\|_2$; then
$$f(x_T) + \lambda\|x_T\|_1 = f(x^*) + \lambda\|x^*\|_1 + O\!\left(\frac{\|x^*\|_2\, G}{\sqrt{T}}\right)$$
  • No extra penalty from the $\lambda\|x\|_1$ term!
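A one-function numpy sketch of this soft-thresholding step (the function name is ours):

```python
import numpy as np

def truncated_gradient_step(x, g, eta, lam):
    """x_{t+1} = sign(x - eta*g) * [|x - eta*g| - eta*lam]_+."""
    z = x - eta * g                                   # plain gradient step
    return np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)
```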


slide-12
SLIDE 12

Abstraction to Regularized Online Convex Optimization

Repeat:

◮ Learner plays point $x_t$
◮ Receive $f_t + \varphi$ ($\varphi$ known)
◮ Suffer loss $f_t(x_t) + \varphi(x_t)$

Goal: attain small regret
$$R(T) := \sum_{t=1}^{T}\bigl[f_t(x_t) + \varphi(x_t)\bigr] - \inf_{x\in\mathcal{X}}\sum_{t=1}^{T}\bigl[f_t(x) + \varphi(x)\bigr]$$


slide-13
SLIDE 13

Composite Objective MIrror Descent

Let $g_t = \nabla f_t(x_t)$. Comid step:
$$x_{t+1} = \mathop{\mathrm{argmin}}_{x\in\mathcal{X}} \bigl\{ B_\psi(x, x_t) + \eta\langle g_t, x\rangle + \eta\varphi(x) \bigr\}$$

(figures: the pieces $\varphi(x)$, $f(x)$, $f(x) + \varphi(x)$, and the per-step model $B_\psi(x, x_t) + \langle g_t, x\rangle + \varphi(x)$)
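One way to read the update (our gloss, from first-order optimality of the unconstrained step): zero must lie in the subgradient of the objective at $x_{t+1}$, so
$$\nabla\psi(x_{t+1}) \in \nabla\psi(x_t) - \eta g_t - \eta\,\partial\varphi(x_{t+1}),$$
i.e. a mirror (dual-space) gradient step followed by a correction for $\varphi$. Every derived update below has this shape: dual step, then shrinkage, then inverse map.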


slide-15
SLIDE 15

Convergence Results

Old (online gradient / mirror descent):
Theorem: For any $x^* \in \mathcal{X}$,
$$\sum_{t=1}^{T}\bigl[f_t(x_t) + \varphi(x_t) - f_t(x^*) - \varphi(x^*)\bigr] \le \frac{1}{\eta}B_\psi(x^*, x_1) + \frac{\eta}{2}\sum_{t=1}^{T}\|\nabla f_t(x_t) + \nabla\varphi(x_t)\|^2$$

New (Comid):
Theorem: For any $x^* \in \mathcal{X}$,
$$\sum_{t=1}^{T}\bigl[f_t(x_t) + \varphi(x_t) - f_t(x^*) - \varphi(x^*)\bigr] \le \frac{1}{\eta}B_\psi(x^*, x_1) + \frac{\eta}{2}\sum_{t=1}^{T}\|\nabla f_t(x_t)\|^2$$

The $\nabla\varphi$ term drops out of the new bound.
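Balancing the two terms of the new bound (a standard step, not spelled out on the slide): with $G \ge \max_t \|\nabla f_t(x_t)\|$, choosing
$$\eta = \frac{1}{G}\sqrt{\frac{2\,B_\psi(x^*, x_1)}{T}} \quad\Longrightarrow\quad R(T) \le G\sqrt{2\,B_\psi(x^*, x_1)\,T},$$
which is the $O(\|x^*\|\,G\sqrt{T})$ rate quoted earlier.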


slide-17
SLIDE 17

Derived Algorithms

◮ FOBOS (Duchi & Singer, 2009)
◮ $p$-norm divergences
◮ Mixed-norm regularization
◮ Matrix Comid



slide-19
SLIDE 19

p-norms

Better $\ell_1$ algorithms: $\varphi(x) = \lambda\|x\|_1$

◮ Idea: non-Euclidean geometry (e.g. dense gradients, sparse $x^*$)
◮ Recall $\frac{1}{2(p-1)}\|x\|_p^2$ is strongly convex over $\mathbb{R}^d$ w.r.t. $\ell_p$ for $1 < p \le 2$
◮ Take $\psi(x) = \frac{1}{2}\|x\|_p^2$

Corollary: when $\|f_t'(x_t)\|_\infty \le G_\infty$, take $p = 1 + 1/\log d$ to get
$$R(T) = O\bigl(\|x^*\|_1\, G_\infty\sqrt{T\log d}\bigr)$$

slide-20
SLIDE 20

Derived p-norm algorithms

SMIDAS (Shalev-Shwartz & Tewari 2009): take $\varphi(x) = \lambda\|x\|_1$. Assume $\mathrm{sign}([\nabla\psi(x)]_j) = \mathrm{sign}(x_j)$ and define $S_\lambda(z) = \mathrm{sign}(z)\cdot[|z| - \lambda]_+$. Then
$$x_{t+1} = (\nabla\psi)^{-1}\bigl(S_{\eta\lambda}\bigl(\nabla\psi(x_t) - \eta f_t'(x_t)\bigr)\bigr)$$

(figure: the dual step shrunk by $\lambda\eta_t$, giving the intermediate $x_{t+1/2}$ and the new iterate $x_{t+1}$)
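A numpy sketch of this update for $\psi(x) = \frac{1}{2}\|x\|_p^2$. The closed forms of $\nabla\psi$ and its inverse (the $q$-norm link, $1/p + 1/q = 1$) are standard, but the transcription is ours:

```python
import numpy as np

def pnorm_link(x, p):
    """grad of 0.5*||x||_p^2:  sign(x_j) |x_j|^(p-1) ||x||_p^(2-p)."""
    norm = np.linalg.norm(x, ord=p)
    if norm == 0.0:
        return np.zeros_like(x)
    return np.sign(x) * np.abs(x) ** (p - 1) * norm ** (2 - p)

def smidas_step(x, g, eta, lam, p):
    """x_{t+1} = (grad psi)^{-1}( S_{eta*lam}( grad psi(x) - eta*g ) )."""
    q = p / (p - 1)                        # (grad psi)^{-1} is the q-norm link
    theta = pnorm_link(x, p) - eta * g     # gradient step in the dual space
    theta = np.sign(theta) * np.maximum(np.abs(theta) - eta * lam, 0.0)  # S_{eta*lam}
    return pnorm_link(theta, q)            # map back to the primal
```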


slide-21
SLIDE 21

Comid with mixed norms

$$\varphi(X) = \|X\|_{\ell_1/\ell_q} = \sum_{j=1}^{d}\|x_j\|_q, \qquad X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} \;\Rightarrow\; \begin{bmatrix} \|x_1\|_q \\ \|x_2\|_q \\ \vdots \\ \|x_d\|_q \end{bmatrix}$$

◮ Separable and solvable using the previous methods
◮ Multitask and multiclass learning
  • $x_j$ associated with feature $j$
  • Penalize $x_j$ once

slide-22
SLIDE 22

Mixed-norm p-norm algorithms

Specialize the problem to
$$\min_x\; \langle v, x\rangle + \frac{1}{2}\|x\|_p^2 + \lambda\|x\|_\infty$$

◮ Closed form? No.
◮ Dual problem ($x^* = v - \beta$):
$$\min_\beta\; \|v - \beta\|_q \quad \text{subject to} \quad \|\beta\|_1 \le \lambda$$

slide-24
SLIDE 24

Mixed-norm p-norm algorithms

Problem:
$$\min_\beta\; \|v - \beta\|_q \quad \text{subject to} \quad \|\beta\|_1 \le \lambda$$

Observation: monotonicity of $\beta$, so $v_i \ge v_j$ implies $\beta_i \ge \beta_j$.

Root-finding problem (a search sketch follows below):
$$\lambda = \sum_{i=1}^{d}\beta_i(\theta) = \sum_{i=1}^{d}\bigl[v_i - \theta^{1/(q-1)}\bigr]_+$$

(figure: $\beta_i(\theta)$ against the sorted entries $v_i$)

Solve with a median-like search.
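Since $c \mapsto \sum_i [v_i - c]_+$ is nonincreasing, plain bisection already finds the threshold. A hedged sketch, with $c$ standing in for $\theta^{1/(q-1)}$ (the talk's median-like search is linear-time; this simpler variant, assuming $v \ge 0$, is ours):

```python
import numpy as np

def solve_dual_threshold(v, lam, q, tol=1e-10):
    """Bisection for c = theta^(1/(q-1)) with sum_i [v_i - c]_+ = lam; assumes v >= 0."""
    def excess(c):
        return np.sum(np.maximum(v - c, 0.0)) - lam
    lo, hi = 0.0, float(np.max(v))         # excess is nonincreasing on [lo, hi]
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    c = 0.5 * (lo + hi)
    beta = np.maximum(v - c, 0.0)          # beta_i(theta)
    return c ** (q - 1), beta              # theta and the dual solution
```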


slide-26
SLIDE 26

Matrix Comid

Idea: get sparsity in the spectrum of $X \in \mathbb{R}^{d_1\times d_2}$. Take
$$\varphi(X) = |||X|||_1 = \sum_{i=1}^{\min\{d_1,d_2\}}\sigma_i(X)$$

Schatten $p$-norms: apply $p$-norms to the singular values of $X \in \mathbb{R}^{d_1\times d_2}$:
$$|||X|||_p = \|\sigma(X)\|_p = \Bigl(\sum_{i=1}^{\min\{d_1,d_2\}}\sigma_i(X)^p\Bigr)^{1/p}$$

Important fact: for $1 < p \le 2$,
$$\psi(X) = \frac{1}{2(p-1)}\,|||X|||_p^2$$
is strongly convex w.r.t. $|||\cdot|||_p$ (Ball et al., 1994)

slide-28
SLIDE 28

Matrix Comid

Consequence: take $p = 1 + 1/\log d$ and $G_\infty \ge |||f_t'(X_t)|||_\infty$. Comid with the above $\psi$ has
$$R(T) = O\bigl(G_\infty\, |||X^*|||_1\sqrt{T\log d}\bigr)$$

slide-29
SLIDE 29

Trace-norm Regularization

Idea: get sparsity in the spectrum; take $\varphi(X) = |||X|||_1 = \sum_i \sigma_i(X)$.
$$X_{t+1} = \mathop{\mathrm{argmin}}_{X\in\mathcal{X}} \bigl\{ \eta\langle f_t'(X_t), X\rangle + B_\psi(X, X_t) + \eta\lambda\,|||X|||_1 \bigr\}$$

For $1 < p \le 2$, the update is:
1. Compute the SVD $X_t = U\,\mathrm{diag}(\sigma(X_t))\,V^\top$
2. Gradient step: $X_{t+\frac{1}{2}} = U\,\mathrm{diag}\bigl(\nabla\psi(\sigma(X_t))\bigr)V^\top - \eta f_t'(X_t)$
3. Compute the SVD $X_{t+\frac{1}{2}} = U\,\mathrm{diag}\bigl(\sigma(X_{t+\frac{1}{2}})\bigr)V^\top$
4. Shrinkage: $X_{t+1} = U\,\mathrm{diag}\bigl((\nabla\psi)^{-1}S_{\eta\lambda}(\sigma(X_{t+\frac{1}{2}}))\bigr)V^\top$
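A numpy sketch of the four steps, reusing `pnorm_link` from the SMIDAS sketch above (transcription ours; for $p = 2$ both links are identities and the step reduces to plain singular-value soft-thresholding, as on the next slide):

```python
import numpy as np

def matrix_comid_step(X, grad, eta, lam, p):
    """One trace-norm Comid step with psi built from the Schatten p-norm."""
    q = p / (p - 1)
    # Step 1: SVD of the current iterate.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Step 2: gradient step applied to the mirrored spectrum.
    X_half = U @ np.diag(pnorm_link(s, p)) @ Vt - eta * grad
    # Step 3: SVD of the intermediate point.
    U, s_half, Vt = np.linalg.svd(X_half, full_matrices=False)
    # Step 4: shrink the (nonnegative) singular values, then map back.
    s_new = np.maximum(s_half - eta * lam, 0.0)   # S_{eta*lam} on sigma >= 0
    return U @ np.diag(pnorm_link(s_new, q)) @ Vt
```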


slide-30
SLIDE 30

Trace-norm Regularization Example

Proximal function: $\psi(X) = \frac{1}{2}|||X|||_2^2 = \frac{1}{2}\|X\|_{\mathrm{Fr}}^2$

Update: $X_{t+\frac{1}{2}} = X_t - \eta f_t'(X_t) \;\bigl(= U\Sigma_{t+\frac{1}{2}}V^\top\bigr)$

Shrinkage: $X_{t+1} = U\bigl[\Sigma_{t+\frac{1}{2}} - \eta\lambda\bigr]_+ V^\top$

(figures: singular components $u_i\sigma_i$ before shrinkage; after shrinkage only $u_1[\sigma_1 - \lambda]_+$ and $u_2[\sigma_2 - \lambda]_+$ survive)

slide-32
SLIDE 32

Proof ideas for trace-norm

Idea: unitary invariance reduces everything to the vector case (Lewis 1995):
$$\nabla\psi(X) = U\,\mathrm{diag}\bigl[\nabla\psi(\sigma(X))\bigr]V^\top, \qquad \partial|||X|||_1 = U\,\mathrm{diag}\bigl(\partial\|\sigma(X)\|_1\bigr)V^\top$$
Simply reduce to the vector case with $\ell_1$-regularization.


slide-33
SLIDE 33

Conclusions and Related Work

◮ All derivations apply to Regularized Dual Averaging (Xiao 2009); for the $\ell_1$ case see the sketch below:
$$x_{t+1} = \mathop{\mathrm{argmin}}_{x\in\mathcal{X}} \Bigl\{ \eta\sum_{\tau=1}^{t}\langle g_\tau, x\rangle + \eta t\,\varphi(x) + \psi(x) \Bigr\}$$
◮ Analysis of online convex programming for regularized objectives
◮ Unify several previous algorithms (projected gradient, mirror descent, forward-backward splitting)
◮ Derived algorithms for several regularization functions
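For $\psi = \frac{1}{2}\|\cdot\|_2^2$ and $\varphi = \lambda\|\cdot\|_1$, the RDA minimization above has a closed form; a sketch of that special case (ours):

```python
import numpy as np

def rda_l1_iterate(grads, eta, lam):
    """x_{t+1} = argmin_x eta*<sum g_tau, x> + eta*t*lam*||x||_1 + 0.5*||x||_2^2."""
    gsum = np.sum(grads, axis=0)          # sum of gradients g_1..g_t
    t = len(grads)
    return -eta * np.sign(gsum) * np.maximum(np.abs(gsum) - t * lam, 0.0)
```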
