Machine Learning: Chenhao Tan, University of Colorado Boulder - PowerPoint PPT Presentation

SLIDE 1

Machine Learning: Chenhao Tan

University of Colorado Boulder

LECTURE 5. Slides adapted from Jordan Boyd-Graber, Tom Mitchell, Ziv Bar-Joseph.

Machine Learning: Chenhao Tan | Boulder | 1 of 27

SLIDE 2

Quiz question

For a test instance (x, y) and a naïve Bayes classifier P̂, which of the following statements is true?

  • (A) Σ_c P̂(y = c | x) = 1
  • (B) Σ_c P̂(x | c) = 1
  • (C) Σ_c P̂(x | c) P̂(c) = 1

SLIDE 3

Overview

  • Objective function
  • Gradient Descent
  • Stochastic Gradient Descent

SLIDE 4

Reminder: Logistic Regression

P(Y = 0 | X) = 1 / (1 + exp[β0 + Σ_i βi Xi])   (1)

P(Y = 1 | X) = exp[β0 + Σ_i βi Xi] / (1 + exp[β0 + Σ_i βi Xi])   (2)

  • Discriminative prediction: P(y | x)
  • Classification uses: sentiment analysis, spam detection
  • What we didn't talk about is how to learn β from data
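Equations (1) and (2) can be sketched directly in code. A minimal Python illustration; the weights and feature values below are made up for demonstration:

```python
import math

def p_y1(beta0, beta, x):
    """P(Y = 1 | X): logistic (sigmoid) of the linear score beta0 + sum_i beta_i * x_i."""
    score = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return math.exp(score) / (1.0 + math.exp(score))

def p_y0(beta0, beta, x):
    """P(Y = 0 | X) = 1 / (1 + exp(score)); complements p_y1."""
    return 1.0 - p_y1(beta0, beta, x)

# With all-zero weights the score is 0, so both classes get probability 0.5.
print(p_y1(0.0, [0.0, 0.0], [1.0, 2.0]))  # 0.5
```

Note that (1) and (2) sum to one for any input, as the two class probabilities must.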

SLIDE 5

Objective function

Outline

  • Objective function
  • Gradient Descent
  • Stochastic Gradient Descent

SLIDE 6

Objective function

Logistic Regression: Objective Function

Maximize likelihood:

Obj ≡ log P(Y | X, β) = Σ_j log P(y(j) | x(j), β)
    = Σ_j [ y(j) (β0 + Σ_i βi x(j)_i) − log(1 + exp(β0 + Σ_i βi x(j)_i)) ]

SLIDE 7

Objective function

Logistic Regression: Objective Function

Minimize negative log likelihood (loss):

L ≡ −log P(Y | X, β) = −Σ_j log P(y(j) | x(j), β)
  = Σ_j [ −y(j) (β0 + Σ_i βi x(j)_i) + log(1 + exp(β0 + Σ_i βi x(j)_i)) ]

  • Training data {(x, y)} are fixed. The objective function is a function of β . . . what values of β give a good value?

β∗ = arg min_β L(β)
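The negative log likelihood above can be computed directly. A minimal Python sketch, assuming binary labels y ∈ {0, 1} and toy data:

```python
import math

def neg_log_likelihood(beta0, beta, data):
    """L = sum_j [ -y_j * score_j + log(1 + exp(score_j)) ] over examples (x, y)."""
    total = 0.0
    for x, y in data:
        score = beta0 + sum(b * xi for b, xi in zip(beta, x))
        total += -y * score + math.log(1.0 + math.exp(score))
    return total

# With all-zero parameters, every example contributes log 2 (probability 0.5).
data = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]
print(neg_log_likelihood(0.0, [0.0, 0.0], data))
```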

SLIDE 10

Objective function

Convexity

L(β) is convex for logistic regression.

Proof.

  • The logistic loss −yv + log(1 + exp(v)) is convex in v.
  • Composition with a linear function preserves convexity.
  • A sum of convex functions is convex.

SLIDE 11

Gradient Descent

Outline

  • Objective function
  • Gradient Descent
  • Stochastic Gradient Descent

SLIDE 12

Gradient Descent

Convexity

  • Convex function
  • Doesn't matter where you start, if you go down along the gradient
  • Gradient!

SLIDE 14

Gradient Descent

Convexity

  • It would have been much harder if this were not convex.

SLIDE 15

Gradient Descent

Gradient Descent (non-convex)

Goal

Optimize the loss function with respect to the variables β.

[Figure: objective plotted against the parameter; successive steps 1, 2, 3 move downhill toward the "Undiscovered Country" of the minimum.]

SLIDE 24

Gradient Descent

Gradient Descent (non-convex)

Goal

Optimize the loss function with respect to the variables β:

β^(l+1)_j = β^(l)_j − η ∂L/∂βj

Luckily, (vanilla) logistic regression is convex.
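The update rule can be sketched as a generic loop. This toy sketch minimizes a one-dimensional convex function rather than logistic regression itself; the function and step size are made up for illustration:

```python
def gradient_descent(grad, beta, eta=0.1, steps=100):
    """Repeatedly apply beta^(l+1) = beta^(l) - eta * grad(beta^(l))."""
    for _ in range(steps):
        beta = beta - eta * grad(beta)
    return beta

# Toy convex objective f(beta) = (beta - 3)^2, whose gradient is 2 * (beta - 3).
minimizer = gradient_descent(lambda b: 2.0 * (b - 3.0), beta=0.0)
print(minimizer)  # approaches the true minimizer 3.0
```

Because the toy objective is convex, the loop converges to the same minimizer regardless of the starting point, which is the point of the preceding slides.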

SLIDE 26

Gradient Descent

Gradient for Logistic Regression

To ease notation, define

πi = exp(βᵀxi) / (1 + exp(βᵀxi))   (3)

Our objective function is

L = −Σ_i log p(yi | xi) = Σ_i Li, where Li = −log πi if yi = 1, and Li = −log(1 − πi) if yi = 0   (4)

SLIDE 27

Gradient Descent

Taking the Derivative

Apply the chain rule:

∂L/∂βj = Σ_i ∂Li(β)/∂βj = Σ_i { −(1/πi) ∂πi/∂βj if yi = 1;  (1/(1 − πi)) ∂πi/∂βj if yi = 0 }   (5)

If we plug in the derivative

∂πi/∂βj = πi (1 − πi) xij,   (6)

we can merge the two cases:

∂Li/∂βj = −(yi − πi) xij.   (7)
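Equation (7) gives the whole per-example gradient vector at once. A minimal Python sketch with illustrative inputs:

```python
import math

def example_gradient(beta, x, y):
    """Per-example gradient of the logistic loss: dL_i/dbeta_j = -(y_i - pi_i) * x_ij."""
    score = sum(b * xi for b, xi in zip(beta, x))
    pi = math.exp(score) / (1.0 + math.exp(score))  # eq. (3)
    return [-(y - pi) * xj for xj in x]

# At beta = 0, pi = 0.5, so for a positive example the gradient is -0.5 * x.
print(example_gradient([0.0, 0.0], [1.0, 2.0], 1))  # [-0.5, -1.0]
```

The merged form is what makes the implementation so short: one expression covers both the yi = 1 and yi = 0 cases of equation (5).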

SLIDE 28

Gradient Descent

Gradient for Logistic Regression

Gradient

∇β L(β) = [∂L(β)/∂β0, . . . , ∂L(β)/∂βn]   (8)

Update

Δβ ≡ η ∇β L(β)   (9)
β′i ← βi − η ∂L(β)/∂βi   (10)

Why are we subtracting? What would we do if we wanted to do ascent?

η: the step size, which must be greater than zero.

SLIDE 31

Gradient Descent

Choosing Step Size

[Figure: objective plotted against the parameter, illustrating the effect of different step sizes.]

SLIDE 35

Gradient Descent

Remaining issues

  • When to stop?
  • What if β keeps getting bigger?

SLIDE 36

Gradient Descent

Regularized Conditional Log Likelihood

Unregularized

β∗ = arg min_β − Σ_j ln p(y(j) | x(j), β)   (11)

Regularized

β∗ = arg min_β − Σ_j ln p(y(j) | x(j), β) + (μ/2) Σ_i βi²   (12)

μ is a "regularization" parameter that trades off between likelihood and having small parameters.
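The regularized objective (12) as a Python sketch. One assumption to flag: this version leaves the bias β0 out of the penalty, a common convention that the slide does not make explicit:

```python
import math

def regularized_nll(beta0, beta, data, mu):
    """Objective (12): -sum_j ln p(y^(j) | x^(j), beta) + (mu/2) * sum_i beta_i^2.
    The bias beta0 is not penalized here (a modeling choice, not from the slide)."""
    total = 0.0
    for x, y in data:
        score = beta0 + sum(b * xi for b, xi in zip(beta, x))
        total += -y * score + math.log(1.0 + math.exp(score))
    return total + 0.5 * mu * sum(b * b for b in beta)
```

Larger μ pushes the minimizer toward smaller weights, which addresses the "what if β keeps getting bigger?" issue from the previous slide.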

SLIDE 38

Stochastic Gradient Descent

Outline

  • Objective function
  • Gradient Descent
  • Stochastic Gradient Descent

SLIDE 39

Stochastic Gradient Descent

Approximating the Gradient

  • Our datasets are big (too big to fit into memory)
  • . . . or data are changing / streaming
  • Hard to compute the true gradient

∇L(β) ≡ Ex [∇L(β, x)]   (13)

  • Average over all observations
  • What if we compute an update just from one observation?

SLIDE 42

Stochastic Gradient Descent

Getting to Union Station

Pretend it's a pre-smartphone world and you want to get to Union Station.

SLIDE 43

Stochastic Gradient Descent

Stochastic Gradient for Regularized Regression

L = −log p(y | x; β) + (μ/2) Σ_j βj²   (14)

Taking the derivative (with respect to example xi):

∂L/∂βj = −(yi − πi) xij + μβj   (15)

SLIDE 45

Stochastic Gradient Descent

Stochastic Gradient for Logistic Regression

Given a single observation xi chosen at random from the dataset,

βj ← β′j − η (μβ′j − xij [yi − πi])   (16)
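Update (16) as a one-line sketch in Python; the function and argument names are my own:

```python
def sgd_update(beta, x, y, pi, eta, mu):
    """One stochastic step (eq. 16): beta_j <- beta'_j - eta * (mu * beta'_j - x_ij * (y_i - pi_i))."""
    return [bj - eta * (mu * bj - xj * (y - pi)) for bj, xj in zip(beta, x)]

# With mu = 0 this reduces to beta_j <- beta'_j + eta * (y - pi) * x_j,
# the unregularized update used in the worked example that follows.
print(sgd_update([0.0, 0.0], [1.0, 4.0], 1, 0.5, 1.0, 0.0))  # [0.5, 2.0]
```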

SLIDE 46

Stochastic Gradient Descent

Example Documents

Update rule: β[j] = β[j] + η (yi − πi) xi[j]

  • β = (βbias, βA, βB, βC, βD) = (0, 0, 0, 0, 0)

y1 = 1: A A A A B B B C
y2 = 0: B C C C D D D D

(Assume step size η = 1.0.) You first see the positive example. First, compute π1:

π1 = Pr(y1 = 1 | x1) = exp(βᵀx1) / (1 + exp(βᵀx1)) = exp(0) / (exp(0) + 1) = 0.5

Updates:

βbias = β′bias + η · (y1 − π1) · x1,bias = 0.0 + 1.0 · (1.0 − 0.5) · 1.0 = 0.5
βA = β′A + η · (y1 − π1) · x1,A = 0.0 + 1.0 · (1.0 − 0.5) · 4.0 = 2.0
βB = β′B + η · (y1 − π1) · x1,B = 0.0 + 1.0 · (1.0 − 0.5) · 3.0 = 1.5
βC = β′C + η · (y1 − π1) · x1,C = 0.0 + 1.0 · (1.0 − 0.5) · 1.0 = 0.5
βD = β′D + η · (y1 − π1) · x1,D = 0.0 + 1.0 · (1.0 − 0.5) · 0.0 = 0.0

SLIDE 64

Stochastic Gradient Descent

Example Documents

Update rule: β[j] = β[j] + η (yi − πi) xi[j]

  • β = (0.5, 2, 1.5, 0.5, 0)

y1 = 1: A A A A B B B C
y2 = 0: B C C C D D D D

(Assume step size η = 1.0.) Now you see the negative example. What's π2?

π2 = Pr(y2 = 1 | x2) = exp(βᵀx2) / (1 + exp(βᵀx2)) = exp(0.5 + 1.5 + 1.5 + 0) / (exp(0.5 + 1.5 + 1.5 + 0) + 1) ≈ 0.97

Updates:

βbias = β′bias + η · (y2 − π2) · x2,bias = 0.5 + 1.0 · (0.0 − 0.97) · 1.0 = −0.47
βA = β′A + η · (y2 − π2) · x2,A = 2.0 + 1.0 · (0.0 − 0.97) · 0.0 = 2.0
βB = β′B + η · (y2 − π2) · x2,B = 1.5 + 1.0 · (0.0 − 0.97) · 1.0 = 0.53
βC = β′C + η · (y2 − π2) · x2,C = 0.5 + 1.0 · (0.0 − 0.97) · 3.0 = −2.41
βD = β′D + η · (y2 − π2) · x2,D = 0.0 + 1.0 · (0.0 − 0.97) · 4.0 = −3.88
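The two worked passes above can be reproduced in a few lines of Python. The feature vectors are word counts ordered as [bias, A, B, C, D]:

```python
import math

def pi_prob(beta, x):
    """pi = exp(beta . x) / (1 + exp(beta . x)), as in eq. (3)."""
    score = sum(b * xj for b, xj in zip(beta, x))
    return math.exp(score) / (1.0 + math.exp(score))

def sgd_step(beta, x, y, eta=1.0):
    """Unregularized update from the slides: beta[j] += eta * (y - pi) * x[j]."""
    pi = pi_prob(beta, x)
    return [bj + eta * (y - pi) * xj for bj, xj in zip(beta, x)]

beta = [0.0] * 5                           # [bias, A, B, C, D]
beta = sgd_step(beta, [1, 4, 3, 1, 0], 1)  # positive doc: [0.5, 2.0, 1.5, 0.5, 0.0]
beta = sgd_step(beta, [1, 0, 1, 3, 4], 0)  # negative doc: approx [-0.47, 2.0, 0.53, -2.41, -3.88]
print(beta)
```

The slide rounds π2 to 0.97; with the exact value (≈ 0.9707) the final weights differ from the slide's numbers only in the third decimal place.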

SLIDE 82

Stochastic Gradient Descent

Algorithm

  1. Initialize the vector β to all zeros.
  2. For t = 1, . . . , T:
     For each example (xi, yi) and feature j:
       • Compute πi ≡ Pr(yi = 1 | xi)
       • Set βj = β′j − η (μβ′j − (yi − πi) xij)
  3. Output the parameters β1, . . . , βd.

Any issues? Inefficiency due to updates on every j even if xij = 0. Lazy sparse updates are covered in the readings.
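The three steps above can be sketched as a complete training loop. A minimal version; the hyperparameter values and the per-epoch shuffle are illustrative choices, not prescribed by the slide:

```python
import math
import random

def train_sgd(data, dim, eta=0.1, mu=0.01, epochs=20, seed=0):
    """Stochastic gradient descent for regularized logistic regression:
    1. initialize beta to all zeros,
    2. for t = 1..T, for each example update every coordinate
       beta_j <- beta_j - eta * (mu * beta_j - (y - pi) * x_j),
    3. output the parameters."""
    rng = random.Random(seed)
    beta = [0.0] * dim
    for _ in range(epochs):
        examples = list(data)
        rng.shuffle(examples)  # visit examples in random order each epoch
        for x, y in examples:
            score = sum(b * xj for b, xj in zip(beta, x))
            pi = math.exp(score) / (1.0 + math.exp(score))  # Pr(y = 1 | x)
            beta = [bj - eta * (mu * bj - (y - pi) * xj)
                    for bj, xj in zip(beta, x)]
    return beta

# The two example documents, features ordered as [bias, A, B, C, D]:
docs = [([1, 4, 3, 1, 0], 1), ([1, 0, 1, 3, 4], 0)]
beta = train_sgd(docs, dim=5)
```

Note that this naive version has exactly the inefficiency flagged on the slide: every βj is touched for every example even when xij = 0; the lazy sparse updates from the readings avoid that.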

SLIDE 85

Stochastic Gradient Descent

Proofs about Stochastic Gradient

  • Depend on the convexity of the objective and on how close ε you want to get to the actual answer
  • The best bounds depend on changing η over time and per dimension (not all features are created equal)

SLIDE 86

Stochastic Gradient Descent

Running time

  • Number of iterations to get to accuracy ε: L(β) − L(β∗) ≤ ε
  • Gradient descent: if L is strongly convex, O(log(1/ε)) iterations
  • Stochastic gradient descent: if L is strongly convex, O(1/ε) iterations