Machine Learning: Chenhao Tan, University of Colorado Boulder - PowerPoint PPT Presentation

SLIDE 1

Machine Learning: Chenhao Tan

University of Colorado Boulder

LECTURE 5. Slides adapted from Jordan Boyd-Graber, Tom Mitchell, Ziv Bar-Joseph.

Machine Learning: Chenhao Tan | Boulder | 1 of 27

SLIDE 2

Quiz question

For a test instance (x, y) and a naïve Bayes classifier P̂, which of the following statements is true?

  • (A) Σ_c P̂(y = c | x) = 1
  • (B) Σ_c P̂(x | c) = 1
  • (C) Σ_c P̂(x | c) P̂(c) = 1

SLIDE 3

Overview

  • Objective function
  • Gradient Descent
  • Stochastic Gradient Descent

SLIDE 4

Reminder: Logistic Regression

P(Y = 0 | X) = 1 / (1 + exp[β0 + Σ_i βi Xi])   (1)

P(Y = 1 | X) = exp[β0 + Σ_i βi Xi] / (1 + exp[β0 + Σ_i βi Xi])   (2)

  • Discriminative prediction: P(y | x)
  • Classification uses: sentiment analysis, spam detection
  • What we didn't talk about is how to learn β from data
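Equations (1) and (2) can be sketched directly in code. A minimal Python illustration; the weights and feature values below are made up for demonstration:

```python
import math

def p_y1(beta0, beta, x):
    """P(Y = 1 | X): logistic (sigmoid) of the linear score beta0 + sum_i beta_i * x_i."""
    score = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return math.exp(score) / (1.0 + math.exp(score))

def p_y0(beta0, beta, x):
    """P(Y = 0 | X) = 1 / (1 + exp(score)); complements p_y1."""
    return 1.0 - p_y1(beta0, beta, x)

# With all-zero weights the score is 0, so both classes get probability 0.5.
print(p_y1(0.0, [0.0, 0.0], [1.0, 2.0]))  # 0.5
```

Note that (1) and (2) sum to one for any input, as the two class probabilities must.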

SLIDE 5

Objective function

Outline

  • Objective function
  • Gradient Descent
  • Stochastic Gradient Descent

SLIDE 6

Objective function

Logistic Regression: Objective Function

Maximize likelihood:

Obj ≡ log P(Y | X, β) = Σ_j log P(y(j) | x(j), β)
    = Σ_j [ y(j) (β0 + Σ_i βi x(j)_i) − log(1 + exp(β0 + Σ_i βi x(j)_i)) ]

SLIDE 7

Objective function

Logistic Regression: Objective Function

Minimize negative log likelihood (loss):

L ≡ −log P(Y | X, β) = −Σ_j log P(y(j) | x(j), β)
  = Σ_j [ −y(j) (β0 + Σ_i βi x(j)_i) + log(1 + exp(β0 + Σ_i βi x(j)_i)) ]

  • Training data {(x, y)} are fixed. The objective function is a function of β . . . what values of β give a good value?

β∗ = arg min_β L(β)
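The negative log likelihood above can be computed directly. A minimal Python sketch, assuming binary labels y ∈ {0, 1} and toy data:

```python
import math

def neg_log_likelihood(beta0, beta, data):
    """L = sum_j [ -y_j * score_j + log(1 + exp(score_j)) ] over examples (x, y)."""
    total = 0.0
    for x, y in data:
        score = beta0 + sum(b * xi for b, xi in zip(beta, x))
        total += -y * score + math.log(1.0 + math.exp(score))
    return total

# With all-zero parameters, every example contributes log 2 (probability 0.5).
data = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]
print(neg_log_likelihood(0.0, [0.0, 0.0], data))
```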

SLIDE 10

Objective function

Convexity

L(β) is convex for logistic regression.

Proof.

  • The logistic loss −yv + log(1 + exp(v)) is convex in v.
  • Composition with a linear function preserves convexity.
  • A sum of convex functions is convex.

SLIDE 11

Gradient Descent

Outline

  • Objective function
  • Gradient Descent
  • Stochastic Gradient Descent

SLIDE 12

Gradient Descent

Convexity

  • Convex function
  • Doesn't matter where you start, if you go down along the gradient
  • Gradient!

SLIDE 14

Gradient Descent

Convexity

  • It would have been much harder if this were not convex.

SLIDE 15

Gradient Descent

Gradient Descent (non-convex)

Goal

Optimize the loss function with respect to the variables β.

[Figure: objective plotted against the parameter; successive steps 1, 2, 3 move downhill toward the "Undiscovered Country" of the minimum.]

SLIDE 24

Gradient Descent

Gradient Descent (non-convex)

Goal

Optimize the loss function with respect to the variables β:

β^(l+1)_j = β^(l)_j − η ∂L/∂βj

Luckily, (vanilla) logistic regression is convex.
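The update rule can be sketched as a generic loop. This toy sketch minimizes a one-dimensional convex function rather than logistic regression itself; the function and step size are made up for illustration:

```python
def gradient_descent(grad, beta, eta=0.1, steps=100):
    """Repeatedly apply beta^(l+1) = beta^(l) - eta * grad(beta^(l))."""
    for _ in range(steps):
        beta = beta - eta * grad(beta)
    return beta

# Toy convex objective f(beta) = (beta - 3)^2, whose gradient is 2 * (beta - 3).
minimizer = gradient_descent(lambda b: 2.0 * (b - 3.0), beta=0.0)
print(minimizer)  # approaches the true minimizer 3.0
```

Because the toy objective is convex, the loop converges to the same minimizer regardless of the starting point, which is the point of the preceding slides.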

SLIDE 26

Gradient Descent

Gradient for Logistic Regression

To ease notation, define

πi = exp(βᵀxi) / (1 + exp(βᵀxi))   (3)

Our objective function is

L = −Σ_i log p(yi | xi) = Σ_i Li, where Li = −log πi if yi = 1, and Li = −log(1 − πi) if yi = 0   (4)

SLIDE 27

Gradient Descent

Taking the Derivative

Apply the chain rule:

∂L/∂βj = Σ_i ∂Li(β)/∂βj = Σ_i { −(1/πi) ∂πi/∂βj if yi = 1;  (1/(1 − πi)) ∂πi/∂βj if yi = 0 }   (5)

If we plug in the derivative

∂πi/∂βj = πi (1 − πi) xij,   (6)

we can merge the two cases:

∂Li/∂βj = −(yi − πi) xij.   (7)
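Equation (7) gives the whole per-example gradient vector at once. A minimal Python sketch with illustrative inputs:

```python
import math

def example_gradient(beta, x, y):
    """Per-example gradient of the logistic loss: dL_i/dbeta_j = -(y_i - pi_i) * x_ij."""
    score = sum(b * xi for b, xi in zip(beta, x))
    pi = math.exp(score) / (1.0 + math.exp(score))  # eq. (3)
    return [-(y - pi) * xj for xj in x]

# At beta = 0, pi = 0.5, so for a positive example the gradient is -0.5 * x.
print(example_gradient([0.0, 0.0], [1.0, 2.0], 1))  # [-0.5, -1.0]
```

The merged form is what makes the implementation so short: one expression covers both the yi = 1 and yi = 0 cases of equation (5).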

SLIDE 28

Gradient Descent

Gradient for Logistic Regression

Gradient

∇β L(β) = [∂L(β)/∂β0, . . . , ∂L(β)/∂βn]   (8)

Update

Δβ ≡ η ∇β L(β)   (9)
β′i ← βi − η ∂L(β)/∂βi   (10)

Why are we subtracting? What would we do if we wanted to do ascent?

η: the step size, which must be greater than zero.

SLIDE 31

Gradient Descent

Choosing Step Size

[Figure: objective plotted against the parameter, illustrating the effect of different step sizes.]

SLIDE 35

Gradient Descent

Remaining issues

  • When to stop?
  • What if β keeps getting bigger?

SLIDE 36

Gradient Descent

Regularized Conditional Log Likelihood

Unregularized

β∗ = arg min_β − Σ_j ln p(y(j) | x(j), β)   (11)

Regularized

β∗ = arg min_β − Σ_j ln p(y(j) | x(j), β) + (μ/2) Σ_i βi²   (12)

μ is a "regularization" parameter that trades off between likelihood and having small parameters.
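The regularized objective (12) as a Python sketch. One assumption to flag: this version leaves the bias β0 out of the penalty, a common convention that the slide does not make explicit:

```python
import math

def regularized_nll(beta0, beta, data, mu):
    """Objective (12): -sum_j ln p(y^(j) | x^(j), beta) + (mu/2) * sum_i beta_i^2.
    The bias beta0 is not penalized here (a modeling choice, not from the slide)."""
    total = 0.0
    for x, y in data:
        score = beta0 + sum(b * xi for b, xi in zip(beta, x))
        total += -y * score + math.log(1.0 + math.exp(score))
    return total + 0.5 * mu * sum(b * b for b in beta)
```

Larger μ pushes the minimizer toward smaller weights, which addresses the "what if β keeps getting bigger?" issue from the previous slide.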

SLIDE 38

Stochastic Gradient Descent

Outline

  • Objective function
  • Gradient Descent
  • Stochastic Gradient Descent

SLIDE 39

Stochastic Gradient Descent

Approximating the Gradient

  • Our datasets are big (too big to fit into memory)
  • . . . or data are changing / streaming
  • Hard to compute the true gradient

∇L(β) ≡ Ex [∇L(β, x)]   (13)

  • Average over all observations
  • What if we compute an update just from one observation?

SLIDE 42

Stochastic Gradient Descent

Getting to Union Station

Pretend it's a pre-smartphone world and you want to get to Union Station.

SLIDE 43

Stochastic Gradient Descent

Stochastic Gradient for Regularized Regression

L = −log p(y | x; β) + (μ/2) Σ_j βj²   (14)

Taking the derivative (with respect to example xi):

∂L/∂βj = −(yi − πi) xij + μβj   (15)

SLIDE 45

Stochastic Gradient Descent

Stochastic Gradient for Logistic Regression

Given a single observation xi chosen at random from the dataset,

βj ← β′j − η (μβ′j − xij [yi − πi])   (16)
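Update (16) as a one-line sketch in Python; the function and argument names are my own:

```python
def sgd_update(beta, x, y, pi, eta, mu):
    """One stochastic step (eq. 16): beta_j <- beta'_j - eta * (mu * beta'_j - x_ij * (y_i - pi_i))."""
    return [bj - eta * (mu * bj - xj * (y - pi)) for bj, xj in zip(beta, x)]

# With mu = 0 this reduces to beta_j <- beta'_j + eta * (y - pi) * x_j,
# the unregularized update used in the worked example that follows.
print(sgd_update([0.0, 0.0], [1.0, 4.0], 1, 0.5, 1.0, 0.0))  # [0.5, 2.0]
```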

SLIDE 46

Stochastic Gradient Descent

Example Documents

Update rule: β[j] = β[j] + η (yi − πi) xi[j]

  • β = (βbias, βA, βB, βC, βD) = (0, 0, 0, 0, 0)

y1 = 1: A A A A B B B C
y2 = 0: B C C C D D D D

(Assume step size η = 1.0.) You first see the positive example. First, compute π1:

π1 = Pr(y1 = 1 | x1) = exp(βᵀx1) / (1 + exp(βᵀx1)) = exp(0) / (exp(0) + 1) = 0.5

Updates:

βbias = β′bias + η · (y1 − π1) · x1,bias = 0.0 + 1.0 · (1.0 − 0.5) · 1.0 = 0.5
βA = β′A + η · (y1 − π1) · x1,A = 0.0 + 1.0 · (1.0 − 0.5) · 4.0 = 2.0
βB = β′B + η · (y1 − π1) · x1,B = 0.0 + 1.0 · (1.0 − 0.5) · 3.0 = 1.5
βC = β′C + η · (y1 − π1) · x1,C = 0.0 + 1.0 · (1.0 − 0.5) · 1.0 = 0.5
βD = β′D + η · (y1 − π1) · x1,D = 0.0 + 1.0 · (1.0 − 0.5) · 0.0 = 0.0

SLIDE 64

Stochastic Gradient Descent

Example Documents

Update rule: β[j] = β[j] + η (yi − πi) xi[j]

  • β = (0.5, 2, 1.5, 0.5, 0)

y1 = 1: A A A A B B B C
y2 = 0: B C C C D D D D

(Assume step size η = 1.0.) Now you see the negative example. What's π2?

π2 = Pr(y2 = 1 | x2) = exp(βᵀx2) / (1 + exp(βᵀx2)) = exp(0.5 + 1.5 + 1.5 + 0) / (exp(0.5 + 1.5 + 1.5 + 0) + 1) ≈ 0.97

Updates:

βbias = β′bias + η · (y2 − π2) · x2,bias = 0.5 + 1.0 · (0.0 − 0.97) · 1.0 = −0.47
βA = β′A + η · (y2 − π2) · x2,A = 2.0 + 1.0 · (0.0 − 0.97) · 0.0 = 2.0
βB = β′B + η · (y2 − π2) · x2,B = 1.5 + 1.0 · (0.0 − 0.97) · 1.0 = 0.53
βC = β′C + η · (y2 − π2) · x2,C = 0.5 + 1.0 · (0.0 − 0.97) · 3.0 = −2.41
βD = β′D + η · (y2 − π2) · x2,D = 0.0 + 1.0 · (0.0 − 0.97) · 4.0 = −3.88
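The two worked passes above can be reproduced in a few lines of Python. The feature vectors are word counts ordered as [bias, A, B, C, D]:

```python
import math

def pi_prob(beta, x):
    """pi = exp(beta . x) / (1 + exp(beta . x)), as in eq. (3)."""
    score = sum(b * xj for b, xj in zip(beta, x))
    return math.exp(score) / (1.0 + math.exp(score))

def sgd_step(beta, x, y, eta=1.0):
    """Unregularized update from the slides: beta[j] += eta * (y - pi) * x[j]."""
    pi = pi_prob(beta, x)
    return [bj + eta * (y - pi) * xj for bj, xj in zip(beta, x)]

beta = [0.0] * 5                           # [bias, A, B, C, D]
beta = sgd_step(beta, [1, 4, 3, 1, 0], 1)  # positive doc: [0.5, 2.0, 1.5, 0.5, 0.0]
beta = sgd_step(beta, [1, 0, 1, 3, 4], 0)  # negative doc: approx [-0.47, 2.0, 0.53, -2.41, -3.88]
print(beta)
```

The slide rounds π2 to 0.97; with the exact value (≈ 0.9707) the final weights differ from the slide's numbers only in the third decimal place.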

SLIDE 82

Stochastic Gradient Descent

Algorithm

  1. Initialize the vector β to all zeros.
  2. For t = 1, . . . , T:
     For each example (xi, yi) and feature j:
       • Compute πi ≡ Pr(yi = 1 | xi)
       • Set βj = β′j − η (μβ′j − (yi − πi) xij)
  3. Output the parameters β1, . . . , βd.

Any issues? Inefficiency due to updates on every j even if xij = 0. Lazy sparse updates are covered in the readings.
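The three steps above can be sketched as a complete training loop. A minimal version; the hyperparameter values and the per-epoch shuffle are illustrative choices, not prescribed by the slide:

```python
import math
import random

def train_sgd(data, dim, eta=0.1, mu=0.01, epochs=20, seed=0):
    """Stochastic gradient descent for regularized logistic regression:
    1. initialize beta to all zeros,
    2. for t = 1..T, for each example update every coordinate
       beta_j <- beta_j - eta * (mu * beta_j - (y - pi) * x_j),
    3. output the parameters."""
    rng = random.Random(seed)
    beta = [0.0] * dim
    for _ in range(epochs):
        examples = list(data)
        rng.shuffle(examples)  # visit examples in random order each epoch
        for x, y in examples:
            score = sum(b * xj for b, xj in zip(beta, x))
            pi = math.exp(score) / (1.0 + math.exp(score))  # Pr(y = 1 | x)
            beta = [bj - eta * (mu * bj - (y - pi) * xj)
                    for bj, xj in zip(beta, x)]
    return beta

# The two example documents, features ordered as [bias, A, B, C, D]:
docs = [([1, 4, 3, 1, 0], 1), ([1, 0, 1, 3, 4], 0)]
beta = train_sgd(docs, dim=5)
```

Note that this naive version has exactly the inefficiency flagged on the slide: every βj is touched for every example even when xij = 0; the lazy sparse updates from the readings avoid that.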

SLIDE 85

Stochastic Gradient Descent

Proofs about Stochastic Gradient

  • Depend on the convexity of the objective and on how close ε you want to get to the actual answer
  • The best bounds depend on changing η over time and per dimension (not all features are created equal)

SLIDE 86

Stochastic Gradient Descent

Running time

  • Number of iterations to get to accuracy ε: L(β) − L(β∗) ≤ ε
  • Gradient descent: if L is strongly convex, O(log(1/ε)) iterations
  • Stochastic gradient descent: if L is strongly convex, O(1/ε) iterations