

slide-1
SLIDE 1

Summary

◮ Linearly separable classification problems.
◮ Logistic loss ℓ_log and (empirical) risk R_log.
◮ Gradient descent.

20 / 68

slide-2
SLIDE 2

(Slide from last time) Classification

For now, let's consider binary classification: Y = {−1, +1}. A linear predictor w ∈ R^d classifies according to sign(w^T x) ∈ {−1, +1}.

Given ((x_i, y_i))_{i=1}^n and a predictor w ∈ R^d, we want sign(w^T x_i) and y_i to agree.

21 / 68

slide-3
SLIDE 3

(Slide from last time) Logistic loss 1

Let's state our classification goal with a generic margin loss ℓ:

  R_ℓ(w) = (1/n) Σ_{i=1}^n ℓ(y_i w^T x_i);

the key properties we want:
◮ ℓ is continuous;
◮ ℓ(z) ≥ c·1[z ≤ 0] = c·ℓ_zo(z) for some c > 0 and any z ∈ R, which implies R_ℓ(w) ≥ c·R_zo(w);
◮ ℓ′(0) < 0 (pushes stuff from the wrong side to the right side).

22 / 68

slide-4
SLIDE 4

(Slide from last time) Logistic loss 1

Let's state our classification goal with a generic margin loss ℓ:

  R_ℓ(w) = (1/n) Σ_{i=1}^n ℓ(y_i w^T x_i);

the key properties we want:
◮ ℓ is continuous;
◮ ℓ(z) ≥ c·1[z ≤ 0] = c·ℓ_zo(z) for some c > 0 and any z ∈ R, which implies R_ℓ(w) ≥ c·R_zo(w);
◮ ℓ′(0) < 0 (pushes stuff from the wrong side to the right side).

Examples.
◮ Squared loss, written in margin form: ℓ_ls(z) := (1 − z)^2; note ℓ_ls(ŷ y) = (1 − ŷ y)^2 = y^2 (1 − ŷ y)^2 = (y − ŷ)^2.
◮ Logistic loss: ℓ_log(z) = ln(1 + exp(−z)).
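The second property can be spot-checked numerically; here is a minimal sketch (not from the slides) using torch, with the tightest constant c = ln 2 for the logistic loss:

import torch

# Minimal sketch (assumed grid of margins z): check l_log(z) >= c * 1[z <= 0] with c = ln 2.
z = torch.linspace(-5.0, 5.0, steps=101)
l_log = torch.log1p(torch.exp(-z))           # logistic loss ln(1 + exp(-z))
print(l_log[z <= 0].min().item())            # ~0.6931 = ln 2, attained at z = 0
print(torch.log(torch.tensor(2.0)).item())   # ln 2; so l_log >= (ln 2) * l_zo pointwise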

22 / 68

slide-5
SLIDE 5

(Slide from last time) Logistic loss 2

[Figure: two contour plots (left: logistic loss; right: squared loss).]

23 / 68

slide-6
SLIDE 6

(Slide from last time) Logistic loss 3

[Figure: two contour plots (left: logistic loss; right: squared loss).]

24 / 68

slide-7
SLIDE 7

(Slide from last time) Gradient descent 1

Given a function F : R^d → R, gradient descent is the iteration w_{i+1} := w_i − η_i ∇_w F(w_i), where w_0 is given, and η_i is a learning rate / step size.

[Figure: contour plot of an objective F over R^2.]

25 / 68

slide-8
SLIDE 8

(Slide from last time) Gradient descent 1

Given a function F : R^d → R, gradient descent is the iteration w_{i+1} := w_i − η_i ∇_w F(w_i), where w_0 is given, and η_i is a learning rate / step size.

[Figure: contour plot of an objective F over R^2.]

Does this work for least squares?

25 / 68

slide-9
SLIDE 9

(Slide from last time) Gradient descent 1

Given a function F : R^d → R, gradient descent is the iteration w_{i+1} := w_i − η_i ∇_w F(w_i), where w_0 is given, and η_i is a learning rate / step size.

[Figure: contour plot of an objective F over R^2.]

Does this work for least squares? Later we’ll show it works for least squares and logistic regression due to convexity.

25 / 68

slide-10
SLIDE 10

(Slide from last time) Gradient descent 2

Gradient descent is the iteration: w_{i+1} := w_i − η_i ∇_w R_log(w_i).
◮ Note ℓ′_log(z) = −1 / (1 + exp(z)), and use the chain rule (hw1!).

◮ Or use pytorch:

import torch

def GD(X, y, loss, step=0.1, n_iters=10000):
    w = torch.zeros(X.shape[1], requires_grad=True)
    for i in range(n_iters):
        l = loss(X, y, w).mean()       # empirical risk at the current w
        l.backward()                   # populate w.grad via autograd
        with torch.no_grad():
            w -= step * w.grad         # gradient descent step
            w.grad.zero_()             # clear the gradient for the next iteration
    return w
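For concreteness, a hedged usage sketch (not from the slides; the helper name logistic_loss and the synthetic data are illustrative assumptions):

import torch

# Hypothetical per-example logistic loss ln(1 + exp(-y_i w^T x_i)), with y_i in {-1, +1}.
def logistic_loss(X, y, w):
    return torch.log1p(torch.exp(-y * (X @ w)))

torch.manual_seed(0)
X = torch.randn(100, 2)                                   # synthetic features (assumption)
y = torch.where(X[:, 0] + X[:, 1] > 0,
                torch.tensor(1.0), torch.tensor(-1.0))    # linearly separable labels
w_hat = GD(X, y, logistic_loss, step=0.1, n_iters=1000)
print(w_hat)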

26 / 68

slide-11
SLIDE 11

Part 2 of logistic regression...

slide-12
SLIDE 12
5. A maximum likelihood derivation
slide-13
SLIDE 13

MLE and ERM

We've studied an ERM perspective on logistic regression:
◮ Form the empirical logistic risk R_log(w) = (1/n) Σ_{i=1}^n ln(1 + exp(−y_i w^T x_i)).
◮ Approximately solve arg min_{w ∈ R^d} R_log(w) via gradient descent (or another convex optimization technique).

We only justified it with "popularity"! Today we'll derive R_log via Maximum Likelihood Estimation (MLE).
1. We form a model for Pr[Y = 1 | X = x], parameterized by w.
2. We form a full-data log-likelihood (equivalent to R_log).

Let's first describe the distributions underlying the data.

27 / 68

slide-14
SLIDE 14

Learning prediction functions

IID model for supervised learning: (X_1, Y_1), . . . , (X_n, Y_n), (X, Y) are iid random pairs (i.e., labeled examples).
◮ X takes values in X. E.g., X = R^d.
◮ Y takes values in Y. E.g., (regression problems) Y = R; (classification problems) Y = {1, . . . , K} or Y = {0, 1} or Y = {−1, +1}.

1. We observe (X_1, Y_1), . . . , (X_n, Y_n), and then choose a prediction function (i.e., predictor) f̂ : X → Y. This is called "learning" or "training".
2. At prediction time, observe X, and form prediction f̂(X).
3. Outcome is Y, and
◮ squared loss is (f̂(X) − Y)^2 (regression problems);
◮ zero-one loss is 1{f̂(X) ≠ Y} (classification problems).

Note: expected zero-one loss is E[1{f̂(X) ≠ Y}] = P(f̂(X) ≠ Y), which we also call the error rate.
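As a small illustration (the predictions and labels below are made up), the empirical error rate is just the average zero-one loss over a sample:

import torch

# Assumed predictions and labels in {-1, +1}, for illustration only.
preds = torch.tensor([1, -1, 1, 1, -1])
labels = torch.tensor([1, 1, 1, -1, -1])
error_rate = (preds != labels).float().mean()   # average zero-one loss
print(error_rate.item())                        # 0.4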

28 / 68

slide-15
SLIDE 15

Distributions over labeled examples

X: space of possible side-information (feature space).
Y: space of possible outcomes (label space or output space).

The distribution P of a random pair (X, Y) taking values in X × Y can be thought of in two parts:
1. Marginal distribution P_X of X: P_X is a probability distribution on X.
2. Conditional distribution P_{Y|X=x} of Y given X = x, for each x ∈ X: P_{Y|X=x} is a probability distribution on Y.

29 / 68

slide-16
SLIDE 16

Optimal classifier

For binary classification, what function f : X → {0, 1} has the smallest risk (i.e., error rate) R(f) := P(f(X) ≠ Y)?
◮ Conditional on X = x, the minimizer of the conditional risk ŷ ↦ P(ŷ ≠ Y | X = x) is
  ŷ := 1 if P(Y = 1 | X = x) > 1/2, and ŷ := 0 if P(Y = 1 | X = x) ≤ 1/2.
◮ Therefore, the function f⋆ : X → {0, 1} with
  f⋆(x) := 1 if P(Y = 1 | X = x) > 1/2, and f⋆(x) := 0 if P(Y = 1 | X = x) ≤ 1/2, for x ∈ X,
has the smallest risk.
◮ f⋆ is called the Bayes (optimal) classifier.

For Y = {1, . . . , K}, f⋆(x) = arg max_{y ∈ Y} P(Y = y | X = x), x ∈ X.
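As a toy illustration (not from the slides; the conditional probability function eta below is an assumption), the Bayes rule just thresholds P(Y = 1 | X = x) at 1/2:

import torch

# Assumed conditional probability function P(Y = 1 | X = x); any map into [0, 1] would do.
def eta(x):
    return torch.sigmoid(x.sum())

def bayes_classifier(x):
    # f*(x) = 1 iff P(Y = 1 | X = x) > 1/2
    return int(eta(x) > 0.5)

print(bayes_classifier(torch.tensor([0.3, 0.4])))   # 1, since eta([0.3, 0.4]) ~ 0.67 > 1/2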

30 / 68

slide-17
SLIDE 17

Logistic regression

Suppose X = R^d and Y = {0, 1}. A logistic regression model is a statistical model where the conditional probability function has a particular form:

  Y | X = x ∼ Bern(η_w(x)), x ∈ R^d, with η_w(x) := logistic(x^T w), x ∈ R^d

(with parameters w ∈ R^d), and

  logistic(z) := 1 / (1 + e^{−z}) = e^z / (1 + e^z), z ∈ R.

[Figure: plot of the logistic function, increasing from 0 to 1 with value 1/2 at z = 0.]

◮ Conditional distribution of Y given X is Bernoulli; marginal distribution of X is not specified.
◮ With least squares, Y | X = x was N(w^T x, σ^2).
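A minimal simulation sketch of this model (the marginal of X and the parameter w below are assumptions, since the model leaves the marginal unspecified):

import torch

def logistic(z):
    return 1.0 / (1.0 + torch.exp(-z))       # same as torch.sigmoid(z)

torch.manual_seed(0)
w = torch.tensor([1.0, -2.0])                # assumed parameters
X = torch.randn(5, 2)                        # assumed marginal for X (standard normal)
p = logistic(X @ w)                          # eta_w(x) = P(Y = 1 | X = x)
Y = torch.bernoulli(p)                       # Y | X = x ~ Bern(eta_w(x))
print(p, Y)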

31 / 68

slide-18
SLIDE 18

MLE for logistic regression

Log-likelihood of w in the iid logistic regression model, given data (X_i, Y_i) = (x_i, y_i) for i = 1, . . . , n:

  ln Π_{i=1}^n η_w(x_i)^{y_i} (1 − η_w(x_i))^{1−y_i}
    = Σ_{i=1}^n [ y_i ln η_w(x_i) + (1 − y_i) ln(1 − η_w(x_i)) ]
    = − Σ_{i=1}^n [ y_i ln(1 + exp(−w^T x_i)) + (1 − y_i) ln(1 + exp(w^T x_i)) ]
    = − Σ_{i=1}^n ln(1 + exp(−(2y_i − 1) w^T x_i)),

and the old form is recovered with labels ỹ_i := 2y_i − 1 ∈ {−1, +1}.
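A quick numerical check of this algebra (a sketch with assumed synthetic data): the Bernoulli negative log-likelihood with y_i in {0, 1} should equal the logistic-loss sum with the labels mapped to {-1, +1}.

import torch

torch.manual_seed(0)
n, d = 50, 3
X = torch.randn(n, d)                          # assumed synthetic data
y01 = torch.randint(0, 2, (n,)).float()        # labels in {0, 1}
w = torch.randn(d)

eta = torch.sigmoid(X @ w)                     # eta_w(x_i)
neg_loglik = -(y01 * torch.log(eta) + (1 - y01) * torch.log(1 - eta)).sum()

y_pm = 2 * y01 - 1                             # labels in {-1, +1}
logistic_sum = torch.log1p(torch.exp(-y_pm * (X @ w))).sum()

print(torch.allclose(neg_loglik, logistic_sum, atol=1e-4))   # True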

32 / 68

slide-19
SLIDE 19

Log-odds function and classifier

Equivalent way to characterize the logistic regression model: the log-odds function, given by

  log-odds_β(x) = ln( η_β(x) / (1 − η_β(x)) ) = ln( (e^{x^T β} / (1 + e^{x^T β})) / (1 / (1 + e^{x^T β})) ) = x^T β,

is a linear function¹, parameterized by β ∈ R^d.

¹Some authors allow an affine function; we can get this using affine expansion.

33 / 68

slide-20
SLIDE 20

Log-odds function and classifier

Equivalent way to characterize the logistic regression model: the log-odds function, given by

  log-odds_β(x) = ln( η_β(x) / (1 − η_β(x)) ) = ln( (e^{x^T β} / (1 + e^{x^T β})) / (1 / (1 + e^{x^T β})) ) = x^T β,

is a linear function¹, parameterized by β ∈ R^d.

Bayes optimal classifier f_β : R^d → {0, 1} in the logistic regression model:
  f_β(x) = 0 if x^T β ≤ 0, and f_β(x) = 1 if x^T β > 0.

¹Some authors allow an affine function; we can get this using affine expansion.

33 / 68

slide-21
SLIDE 21

Log-odds function and classifier

Equivalent way to characterize the logistic regression model: the log-odds function, given by

  log-odds_β(x) = ln( η_β(x) / (1 − η_β(x)) ) = ln( (e^{x^T β} / (1 + e^{x^T β})) / (1 / (1 + e^{x^T β})) ) = x^T β,

is a linear function¹, parameterized by β ∈ R^d.

Bayes optimal classifier f_β : R^d → {0, 1} in the logistic regression model:
  f_β(x) = 0 if x^T β ≤ 0, and f_β(x) = 1 if x^T β > 0.

Such classifiers are called linear classifiers.

¹Some authors allow an affine function; we can get this using affine expansion.

33 / 68

slide-22
SLIDE 22

Where does the logistic regression model come from?

The following is one way the logistic regression model comes about (but not the only way).

34 / 68

slide-23
SLIDE 23

Where does the logistic regression model come from?

The following is one way the logistic regression model comes about (but not the only way). Consider the following generative model for (X, Y ) where Y ∼ Bern(π), X | Y = y ∼ N(µy, Σ).

34 / 68

slide-24
SLIDE 24

Where does the logistic regression model come from?

The following is one way the logistic regression model comes about (but not the only way). Consider the following generative model for (X, Y ) where Y ∼ Bern(π), X | Y = y ∼ N(µy, Σ). ◮ Parameters: π ∈ [0, 1], µ0, µ1 ∈ Rd, Σ ∈ Rd×d sym. & pos. def.

34 / 68

slide-25
SLIDE 25

Where does the logistic regression model come from?

The following is one way the logistic regression model comes about (but not the only way). Consider the following generative model for (X, Y ) where Y ∼ Bern(π), X | Y = y ∼ N(µy, Σ). ◮ Parameters: π ∈ [0, 1], µ0, µ1 ∈ Rd, Σ ∈ Rd×d sym. & pos. def.

[Figure: density of a two-component Gaussian mixture with equal priors P(ω1) = P(ω2) = 0.5 and decision regions R1, R2.]

Figure shows (unconditional) probability density function for X.

34 / 68

slide-26
SLIDE 26

Statistical model for conditional distribution

Suppose we are given the following.
◮ p_Y: probability mass function for Y.
◮ p_{X|Y=y}: conditional probability density function for X given Y = y.

35 / 68

slide-27
SLIDE 27

Statistical model for conditional distribution

Suppose we are given the following.
◮ p_Y: probability mass function for Y.
◮ p_{X|Y=y}: conditional probability density function for X given Y = y.

What is the conditional distribution of Y given X?

35 / 68

slide-28
SLIDE 28

Statistical model for conditional distribution

Suppose we are given the following.
◮ p_Y: probability mass function for Y.
◮ p_{X|Y=y}: conditional probability density function for X given Y = y.

What is the conditional distribution of Y given X? By Bayes' rule: for any x ∈ R^d,

  P(Y = y | X = x) = p_Y(y) · p_{X|Y=y}(x) / p_X(x)

(where p_X is the unconditional density for X).

35 / 68

slide-29
SLIDE 29

Statistical model for conditional distribution

Suppose we are given the following.
◮ p_Y: probability mass function for Y.
◮ p_{X|Y=y}: conditional probability density function for X given Y = y.

What is the conditional distribution of Y given X? By Bayes' rule: for any x ∈ R^d,

  P(Y = y | X = x) = p_Y(y) · p_{X|Y=y}(x) / p_X(x)

(where p_X is the unconditional density for X). Therefore, the log-odds function is

  log-odds(x) = ln( (p_Y(1) / p_Y(0)) · (p_{X|Y=1}(x) / p_{X|Y=0}(x)) ).

35 / 68

slide-30
SLIDE 30

Log-odds function for our toy model

Log-odds function:

  log-odds(x) = ln( p_Y(1) / p_Y(0) ) + ln( p_{X|Y=1}(x) / p_{X|Y=0}(x) ).

36 / 68

slide-31
SLIDE 31

Log-odds function for our toy model

Log-odds function:

  log-odds(x) = ln( p_Y(1) / p_Y(0) ) + ln( p_{X|Y=1}(x) / p_{X|Y=0}(x) ).

In our toy model, we have Y ∼ Bern(π) and X | Y = y ∼ N(µ_y, AA^T), so:

  log-odds(x) = ln( π / (1 − π) ) + ln( exp(−(1/2) ‖A^{−1}(x − µ_1)‖_2^2) / exp(−(1/2) ‖A^{−1}(x − µ_0)‖_2^2) )
    = ln( π / (1 − π) ) − (1/2) ‖A^{−1}(x − µ_1)‖_2^2 + (1/2) ‖A^{−1}(x − µ_0)‖_2^2
    = ln( π / (1 − π) ) − (1/2) ( ‖A^{−1}µ_1‖_2^2 − ‖A^{−1}µ_0‖_2^2 )   [constant: does not depend on x]
      + (µ_1 − µ_0)^T (AA^T)^{−1} x   [linear function of x].

36 / 68

slide-32
SLIDE 32

Log-odds function for our toy model

Log-odds function:

  log-odds(x) = ln( p_Y(1) / p_Y(0) ) + ln( p_{X|Y=1}(x) / p_{X|Y=0}(x) ).

In our toy model, we have Y ∼ Bern(π) and X | Y = y ∼ N(µ_y, AA^T), so:

  log-odds(x) = ln( π / (1 − π) ) + ln( exp(−(1/2) ‖A^{−1}(x − µ_1)‖_2^2) / exp(−(1/2) ‖A^{−1}(x − µ_0)‖_2^2) )
    = ln( π / (1 − π) ) − (1/2) ‖A^{−1}(x − µ_1)‖_2^2 + (1/2) ‖A^{−1}(x − µ_0)‖_2^2
    = ln( π / (1 − π) ) − (1/2) ( ‖A^{−1}µ_1‖_2^2 − ‖A^{−1}µ_0‖_2^2 )   [constant: does not depend on x]
      + (µ_1 − µ_0)^T (AA^T)^{−1} x   [linear function of x].

◮ This is an affine function of x.

36 / 68

slide-33
SLIDE 33

Log-odds function for our toy model

Log-odds function:

  log-odds(x) = ln( p_Y(1) / p_Y(0) ) + ln( p_{X|Y=1}(x) / p_{X|Y=0}(x) ).

In our toy model, we have Y ∼ Bern(π) and X | Y = y ∼ N(µ_y, AA^T), so:

  log-odds(x) = ln( π / (1 − π) ) + ln( exp(−(1/2) ‖A^{−1}(x − µ_1)‖_2^2) / exp(−(1/2) ‖A^{−1}(x − µ_0)‖_2^2) )
    = ln( π / (1 − π) ) − (1/2) ‖A^{−1}(x − µ_1)‖_2^2 + (1/2) ‖A^{−1}(x − µ_0)‖_2^2
    = ln( π / (1 − π) ) − (1/2) ( ‖A^{−1}µ_1‖_2^2 − ‖A^{−1}µ_0‖_2^2 )   [constant: does not depend on x]
      + (µ_1 − µ_0)^T (AA^T)^{−1} x   [linear function of x].

◮ This is an affine function of x.
◮ Hence, the statistical model for Y | X is a logistic regression model (with affine feature expansion).

36 / 68

slide-34
SLIDE 34

Log-odds function for our toy model

Log-odds function:

  log-odds(x) = ln( p_Y(1) / p_Y(0) ) + ln( p_{X|Y=1}(x) / p_{X|Y=0}(x) ).

In our toy model, we have Y ∼ Bern(π) and X | Y = y ∼ N(µ_y, AA^T), so:

  log-odds(x) = ln( π / (1 − π) ) + ln( exp(−(1/2) ‖A^{−1}(x − µ_1)‖_2^2) / exp(−(1/2) ‖A^{−1}(x − µ_0)‖_2^2) )
    = ln( π / (1 − π) ) − (1/2) ‖A^{−1}(x − µ_1)‖_2^2 + (1/2) ‖A^{−1}(x − µ_0)‖_2^2
    = ln( π / (1 − π) ) − (1/2) ( ‖A^{−1}µ_1‖_2^2 − ‖A^{−1}µ_0‖_2^2 )   [constant: does not depend on x]
      + (µ_1 − µ_0)^T (AA^T)^{−1} x   [linear function of x].

◮ This is an affine function of x.
◮ Hence, the statistical model for Y | X is a logistic regression model (with affine feature expansion).
◮ Important: the logistic regression model forgets about p_{X|Y=y}!
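A numerical sanity check of this derivation (a sketch; the particular π, means, and A below are assumptions): the exact log-odds computed from the two Gaussian densities should match the affine expression above.

import torch

torch.manual_seed(0)
d = 3
pi = 0.3                                        # assumed class prior
mu0, mu1 = torch.randn(d), torch.randn(d)       # assumed class means
A = torch.randn(d, d) + 2 * torch.eye(d)        # Sigma = A A^T (almost surely positive definite)
Sigma_inv = torch.inverse(A @ A.T)

def log_density(x, mu):
    # log N(x; mu, A A^T) up to the normalizing constant shared by both classes
    diff = x - mu
    return -0.5 * diff @ Sigma_inv @ diff

x = torch.randn(d)
prior_term = torch.log(torch.tensor(pi / (1 - pi)))
exact = prior_term + log_density(x, mu1) - log_density(x, mu0)
const = prior_term - 0.5 * (mu1 @ Sigma_inv @ mu1 - mu0 @ Sigma_inv @ mu0)
affine = const + (mu1 - mu0) @ Sigma_inv @ x
print(torch.allclose(exact, affine, atol=1e-4))   # True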

36 / 68

slide-35
SLIDE 35
6. Multiclass classification and cross-entropy
slide-36
SLIDE 36

Multiclass?

All our methods so far handle multiclass:
◮ k-nn and decision tree: plurality label.
◮ Least squares: arg min_{W ∈ R^{d×k}} ‖AW − B‖_F^2 with B ∈ R^{n×k}; W ∈ R^{d×k} is k separate linear regressors in R^d.

How about linear classifiers?
◮ At prediction time, x ↦ arg max_y f̂(x)_y (see the sketch below).
◮ As in the binary case: interpretation f(x)_y = Pr[Y = y | X = x].

What is a good loss function?
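Before turning to the loss, a tiny sketch (with assumed shapes and values) of the arg max prediction rule above for a linear model W ∈ R^{d×k}:

import torch

torch.manual_seed(0)
d, k = 3, 4
W = torch.randn(d, k)                 # one linear scorer per class (assumed values)
x = torch.randn(d)
scores = W.T @ x                      # f(x) in R^k
print(int(torch.argmax(scores)))      # predicted class (0-based index)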

37 / 68

slide-37
SLIDE 37

Cross-entropy

Given two probability vectors p, q ∈ ∆_k = {p ∈ R_{≥0}^k : Σ_i p_i = 1},

  H(p, q) = − Σ_{i=1}^k p_i ln q_i   (cross-entropy).

◮ If p = q, then H(p, q) = H(p) (entropy); indeed

  H(p, q) = − Σ_{i=1}^k p_i ln( (q_i / p_i) · p_i ) = H(p) [entropy] + KL(p, q) [KL divergence].

Since KL ≥ 0, and moreover KL(p, q) = 0 iff p = q, this is the cost/entropy of p plus a penalty for differing.

◮ Choose the encoding ỹ = e_y for y ∈ {1, . . . , k}, and ŷ ∝ exp(f(x)) with f : R^d → R^k;

  ℓ_ce(ỹ, f(x)) = H(ỹ, ŷ)
    = − Σ_{i=1}^k ỹ_i ln( exp(f(x)_i) / Σ_{j=1}^k exp(f(x)_j) )
    = − ln( exp(f(x)_y) / Σ_{j=1}^k exp(f(x)_j) )
    = −f(x)_y + ln Σ_{j=1}^k exp(f(x)_j).

(In pytorch, use torch.nn.CrossEntropyLoss()(f(x), y).)
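A minimal sketch (the scores and labels below are assumptions) checking that the formula above matches pytorch's built-in loss, which takes raw scores f(x) and integer labels (0-based in pytorch):

import torch

torch.manual_seed(0)
k, n = 4, 6
scores = torch.randn(n, k)                   # raw scores f(x_i), one row per example
y = torch.randint(0, k, (n,))                # labels in {0, ..., k-1}

manual = (-scores[torch.arange(n), y] + torch.logsumexp(scores, dim=1)).mean()
builtin = torch.nn.CrossEntropyLoss()(scores, y)
print(torch.allclose(manual, builtin))       # True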

38 / 68

slide-38
SLIDE 38

Cross-entropy, classification, and margins

The zero-one loss for classification is

  ℓ_zo(y, f(x)) = 1[ y ≠ arg max_j f(x)_j ].

In the multiclass case, we can define the margin as

  f(x)_y − max_{j ≠ y} f(x)_j,

interpreted as "the distance by which f is correct". (Can be negative!)

Since ln Σ_j exp(z_j) ≈ max_j z_j, cross-entropy satisfies

  ℓ_ce(ỹ, f(x)) = −f(x)_y + ln Σ_j exp(f(x)_j) ≈ −f(x)_y + max_j f(x)_j,

thus minimizing cross-entropy maximizes margins.
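A tiny numeric illustration (the scores are made up) of ln Σ_j exp(z_j) ≈ max_j z_j:

import torch

z = torch.tensor([3.0, -1.0, 0.5])
print(torch.logsumexp(z, dim=0).item())   # ~3.09
print(z.max().item())                     # 3.0; log-sum-exp exceeds the max by at most ln(k)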

39 / 68

slide-39
SLIDE 39

Cross-entropy and logistic loss

With a linear model f(x) = W^T x for W ∈ R^{d×k}, with two labels {1, 2},

  ℓ_ce(e_1, f(x)) = − ln( exp(f(x)_1) / (exp(f(x)_1) + exp(f(x)_2)) ) = ln( 1 + exp(f(x)_2 − f(x)_1) ),
  ℓ_ce(e_2, f(x)) = ln( 1 + exp(f(x)_1 − f(x)_2) ).

Thus if we write ỹ := 2y − 3 and v := W_{:2} − W_{:1}, then

  ln(1 + exp(−ỹ v^T x)) = ℓ_ce(e_y, W^T x).

40 / 68
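A quick numerical check of this reduction (with assumed W, x, and label): for k = 2 the cross-entropy of a linear model equals the logistic loss on the margin ỹ v^T x.

import torch

torch.manual_seed(0)
d = 3
W = torch.randn(d, 2)                         # assumed linear model, columns W_{:1}, W_{:2}
x = torch.randn(d)
y = 2                                         # label in {1, 2}, as on the slide
y_tilde = 2 * y - 3                           # in {-1, +1}
v = W[:, 1] - W[:, 0]                         # W_{:2} - W_{:1} (0-based columns)

f = W.T @ x
ce = -f[y - 1] + torch.logsumexp(f, dim=0)    # l_ce(e_y, W^T x)
logistic = torch.log1p(torch.exp(-y_tilde * (v @ x)))
print(torch.allclose(ce, logistic))           # True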

slide-40
SLIDE 40
7. Summary
slide-41
SLIDE 41

Summary

Part 1.
◮ Linearly separable classification problems.
◮ Logistic loss ℓ_log and (empirical) risk R_log.
◮ Gradient descent.

Part 2.
◮ MLE perspective on logistic regression.
◮ Cross-entropy.

41 / 68