SLIDE 1

Support Vector Machines Part 1

CS 760@UW-Madison

SLIDE 2

Goals for the lecture

you should understand the following concepts

  • the margin
  • the linear support vector machine
  • the primal and dual formulations of SVM learning
  • support vectors
  • Optional: variants of SVM
  • Optional: Lagrange Multiplier


SLIDE 3

Motivation

SLIDE 4

Linear classification

(π‘₯βˆ—)π‘ˆπ‘¦ = 0 Class +1 Class -1 π‘₯βˆ— (π‘₯βˆ—)π‘ˆπ‘¦ > 0 (π‘₯βˆ—)π‘ˆπ‘¦ < 0 Assume perfect separation between the two classes

SLIDE 5

Attempt

  • Given training data

𝑦𝑗, 𝑧𝑗 : 1 ≀ 𝑗 ≀ π‘œ i.i.d. from distribution 𝐸

  • Hypothesis 𝑧 = sign(𝑔

π‘₯ 𝑦 ) = sign(π‘₯π‘ˆπ‘¦)

  • 𝑧 = +1 if π‘₯π‘ˆπ‘¦ > 0
  • 𝑧 = βˆ’1 if π‘₯π‘ˆπ‘¦ < 0
  • Let’s assume that we can optimize to find π‘₯
SLIDE 6

Multiple optimal solutions?

Class +1 Class -1 π‘₯2 π‘₯3 π‘₯1 Same on empirical loss; Different on test/expected loss

SLIDE 7

What about π‘₯1?

Class +1 Class -1 π‘₯1

New test data

SLIDE 8

What about π‘₯3?

Class +1 Class -1 π‘₯3

New test data

SLIDE 9

Most confident: π‘₯2

Class +1 Class -1 π‘₯2

New test data

SLIDE 10

Intuition: margin

Class +1 Class -1 π‘₯2

large margin

SLIDE 11

Margin

SLIDE 12

Margin

We are going to prove the following expression for the margin using a geometric argument.

  β€’ Lemma 1: $x$ has distance $\frac{|f_w(x)|}{\|w\|}$ to the hyperplane $f_w(x) = w^\top x = 0$
  β€’ Lemma 2: $x$ has distance $\frac{|f_{w,b}(x)|}{\|w\|}$ to the hyperplane $f_{w,b}(x) = w^\top x + b = 0$

Need two geometric facts:

  β€’ $w$ is orthogonal to the hyperplane $f_{w,b}(x) = w^\top x + b = 0$
  β€’ Let $v$ be a direction (i.e., a unit vector). Then the length of the projection of $x$ onto $v$ is $v^\top x$

SLIDE 13

Margin

  • Lemma 1: 𝑦 has distance

|𝑔

π‘₯ 𝑦 |

| π‘₯ | to the hyperplane 𝑔 π‘₯ 𝑦 =

π‘₯π‘ˆπ‘¦ = 0 Proof:

  • π‘₯ is orthogonal to the hyperplane
  • The unit direction is

π‘₯ | π‘₯ |

  • The projection of 𝑦 is

π‘₯ π‘₯ π‘ˆ

𝑦 =

𝑔

π‘₯(𝑦)

| π‘₯ | π‘₯ | π‘₯ |

𝑦

π‘₯ π‘₯

π‘ˆ

𝑦

SLIDE 14

Margin: with bias

  • Lemma 2: 𝑦 has distance

|𝑔π‘₯,𝑐 𝑦 | | π‘₯ |

to the hyperplane 𝑔

π‘₯,𝑐 𝑦 =

π‘₯π‘ˆπ‘¦ + 𝑐 = 0 Proof:

  • Let 𝑦 = 𝑦βŠ₯ + 𝑠

π‘₯ | π‘₯ |, then |𝑠| is the distance

  • Multiply both sides by π‘₯π‘ˆ and add 𝑐
  • Left hand side: π‘₯π‘ˆπ‘¦ + 𝑐 = 𝑔

π‘₯,𝑐 𝑦

  • Right hand side: π‘₯π‘ˆπ‘¦βŠ₯ + 𝑠

π‘₯π‘ˆπ‘₯ | π‘₯ | + 𝑐 = 0 + 𝑠| π‘₯ |

SLIDE 15

𝑧 𝑦 = π‘₯π‘ˆπ‘¦ + π‘₯0 The notation here is:

Figure from Pattern Recognition and Machine Learning, Bishop

Margin: with bias

SLIDE 16

Support Vector Machine (SVM)

SLIDE 17

SVM: objective

  • Absolute margin over all training data points:

𝛿 = min

𝑗

|𝑔

π‘₯,𝑐 𝑦𝑗 |

| π‘₯ |

  • Since only want correct 𝑔

π‘₯,𝑐, and recall 𝑧𝑗 ∈ {+1, βˆ’1}, we define

the margin to be 𝛿 = min

𝑗

𝑧𝑗𝑔

π‘₯,𝑐 𝑦𝑗

| π‘₯ |

  • If 𝑔

π‘₯,𝑐 incorrect on some 𝑦𝑗, the margin is negative

SLIDE 18

SVM: objective

  • Maximize margin over all training data points:

max

π‘₯,𝑐 𝛿 = max π‘₯,𝑐 min 𝑗

𝑧𝑗𝑔

π‘₯,𝑐 𝑦𝑗

| π‘₯ | = max

π‘₯,𝑐 min 𝑗

𝑧𝑗(π‘₯π‘ˆπ‘¦π‘— + 𝑐) | π‘₯ |

  • A bit complicated …
SLIDE 19

SVM: simplified objective

  • Observation: when (π‘₯, 𝑐) scaled by a factor 𝑑, the margin

unchanged 𝑧𝑗(𝑑π‘₯π‘ˆπ‘¦π‘— + 𝑑𝑐) | 𝑑π‘₯ | = 𝑧𝑗(π‘₯π‘ˆπ‘¦π‘— + 𝑐) | π‘₯ |

  • Let’s consider a fixed scale such that

π‘§π‘—βˆ— π‘₯π‘ˆπ‘¦π‘—βˆ— + 𝑐 = 1 where π‘¦π‘—βˆ— is the point closest to the hyperplane

SLIDE 20

SVM: simplified objective

  • Let’s consider a fixed scale such that

π‘§π‘—βˆ— π‘₯π‘ˆπ‘¦π‘—βˆ— + 𝑐 = 1 where π‘¦π‘—βˆ— is the point closet to the hyperplane

  • Now we have for all data

𝑧𝑗 π‘₯π‘ˆπ‘¦π‘— + 𝑐 β‰₯ 1 and at least for one 𝑗 the equality holds

  • Then the margin over all training points is

1 | π‘₯ |

SLIDE 21

SVM: simplified objective

  • Optimization simplified to

min

π‘₯,𝑐

1 2 π‘₯

2

𝑧𝑗 π‘₯π‘ˆπ‘¦π‘— + 𝑐 β‰₯ 1, βˆ€π‘—

  • How to find the optimum ෝ

π‘₯βˆ—?

  • Solved by Lagrange multiplier method
SLIDE 22

SVM: optimization

SLIDE 23

SVM: optimization

  • Optimization (Quadratic Programming):

min

π‘₯,𝑐

1 2 π‘₯

2

𝑧𝑗 π‘₯π‘ˆπ‘¦π‘— + 𝑐 β‰₯ 1, βˆ€π‘—

  • Generalized Lagrangian:

β„’ π‘₯, 𝑐, 𝜷 = 1 2 π‘₯

2

βˆ’ ෍

𝑗

𝛽𝑗[𝑧𝑗 π‘₯π‘ˆπ‘¦π‘— + 𝑐 βˆ’ 1] where 𝜷 is the Lagrange multiplier

SLIDE 24

SVM: optimization

  • KKT conditions:

πœ–β„’ πœ–π‘₯ = 0, β†’ π‘₯ = σ𝑗 𝛽𝑗𝑧𝑗𝑦𝑗 (1) πœ–β„’ πœ–π‘ = 0, β†’ 0 = σ𝑗 𝛽𝑗𝑧𝑗

(2)

  • Plug into β„’:

β„’ π‘₯, 𝑐, 𝜷 = σ𝑗 𝛽𝑗 βˆ’

1 2 Οƒπ‘—π‘˜ π›½π‘—π›½π‘˜π‘§π‘—π‘§π‘˜π‘¦π‘— π‘ˆπ‘¦π‘˜ (3)

combined with 0 = σ𝑗 𝛽𝑗𝑧𝑗 , 𝛽𝑗 β‰₯ 0

SLIDE 25

SVM: optimization

  • Reduces to dual problem:

β„’ π‘₯, 𝑐, 𝜷 = ෍

𝑗

𝛽𝑗 βˆ’ 1 2 ෍

π‘—π‘˜

π›½π‘—π›½π‘˜π‘§π‘—π‘§π‘˜π‘¦π‘—

π‘ˆπ‘¦π‘˜

෍

𝑗

𝛽𝑗𝑧𝑗 = 0, 𝛽𝑗 β‰₯ 0

  • Since π‘₯ = σ𝑗 𝛽𝑗𝑧𝑗𝑦𝑗, we have π‘₯π‘ˆπ‘¦ + 𝑐 = σ𝑗 𝛽𝑗𝑧𝑗𝑦𝑗

π‘ˆπ‘¦ + 𝑐

Only depend on inner products
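
For intuition, the dual can also be solved numerically with a generic constrained optimizer (an illustrative sketch on made-up data, not from the slides); entries of $\boldsymbol{\alpha}$ that come back clearly positive mark the support vectors:

```python
import numpy as np
from scipy.optimize import minimize

# hypothetical separable toy data (illustrative values only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
G = (y[:, None] * X) @ (y[:, None] * X).T      # G[i, k] = y_i y_k x_i^T x_k

def neg_dual(alpha):
    # negated dual objective, since the solver minimizes
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

res = minimize(neg_dual, x0=np.zeros(len(y)),
               bounds=[(0, None)] * len(y),
               constraints={"type": "eq", "fun": lambda a: a @ y})

alpha = res.x
w = (alpha * y) @ X                            # w = sum_i alpha_i y_i x_i
sv = np.where(alpha > 1e-6)[0]                 # support vectors: alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)                 # recover b from the support vectors
print(alpha, w, b)
```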

SLIDE 26

Support Vectors

  • those instances with Ξ±i > 0

are called support vectors

  • they lie on the margin

boundary

  • solution NOT changed if

delete the instances with Ξ±i =

support vectors

  • final solution is a sparse linear combination of the training

instances
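
The same sparsity is easy to see with an off-the-shelf solver (an illustrative sketch using scikit-learn, not part of the slides; a very large C approximates the hard-margin case):

```python
import numpy as np
from sklearn.svm import SVC

# hypothetical separable toy data (illustrative values only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

print(clf.support_)               # indices of the training points with alpha_i > 0
print(clf.support_vectors_)       # the support vectors themselves
print(clf.dual_coef_)             # alpha_i * y_i for each support vector
print(clf.coef_, clf.intercept_)  # w and b recovered from the dual solution
```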

SLIDE 27

Optional: Lagrange Multiplier

SLIDE 28

Lagrangian

  • Consider optimization problem:

min

π‘₯

𝑔(π‘₯) β„Žπ‘— π‘₯ = 0, βˆ€1 ≀ 𝑗 ≀ π‘š

  • Lagrangian:

β„’ π‘₯, 𝜸 = 𝑔 π‘₯ + ෍

𝑗

π›Ύπ‘—β„Žπ‘—(π‘₯) where 𝛾𝑗’s are called Lagrange multipliers

SLIDE 29

Lagrangian

  • Consider optimization problem:

min

π‘₯

𝑔(π‘₯) β„Žπ‘— π‘₯ = 0, βˆ€1 ≀ 𝑗 ≀ π‘š

  • Solved by setting derivatives of Lagrangian to 0

πœ–β„’ πœ–π‘₯𝑗 = 0; πœ–β„’ πœ–π›Ύπ‘— = 0

SLIDE 30

Generalized Lagrangian

  • Consider optimization problem:

min

π‘₯

𝑔(π‘₯) 𝑕𝑗 π‘₯ ≀ 0, βˆ€1 ≀ 𝑗 ≀ 𝑙 β„Žπ‘˜ π‘₯ = 0, βˆ€1 ≀ π‘˜ ≀ π‘š

  • Generalized Lagrangian:

β„’ π‘₯, 𝜷, 𝜸 = 𝑔 π‘₯ + ෍

𝑗

𝛽𝑗𝑕𝑗(π‘₯) + ෍

π‘˜

π›Ύπ‘˜β„Žπ‘˜(π‘₯) where 𝛽𝑗, π›Ύπ‘˜β€™s are called Lagrange multipliers

SLIDE 31

Generalized Lagrangian

  • Consider the quantity:

πœ„π‘„ π‘₯ ≔ max

𝜷,𝜸:𝛽𝑗β‰₯0 β„’ π‘₯, 𝜷, 𝜸

  • Why?

πœ„π‘„ π‘₯ = α‰Šπ‘” π‘₯ , if π‘₯ satisfies all the constraints +∞, if π‘₯ does not satisfy the constraints

  • So minimizing 𝑔 π‘₯ is the same as minimizing πœ„π‘„ π‘₯

min

π‘₯ 𝑔 π‘₯ = min π‘₯ πœ„π‘„ π‘₯ = min π‘₯

max

𝜷,𝜸:𝛽𝑗β‰₯0 β„’ π‘₯, 𝜷, 𝜸

SLIDE 32

Lagrange duality

  • The primal problem

π‘žβˆ— ≔ min

π‘₯ 𝑔 π‘₯ = min π‘₯

max

𝜷,𝜸:𝛽𝑗β‰₯0 β„’ π‘₯, 𝜷, 𝜸

  • The dual problem

π‘’βˆ— ≔ max

𝜷,𝜸:𝛽𝑗β‰₯0min π‘₯ β„’ π‘₯, 𝜷, 𝜸

  • Always true:

π‘’βˆ— ≀ π‘žβˆ—

SLIDE 33

Lagrange duality

  • The primal problem

π‘žβˆ— ≔ min

π‘₯ 𝑔 π‘₯ = min π‘₯

max

𝜷,𝜸:𝛽𝑗β‰₯0 β„’ π‘₯, 𝜷, 𝜸

  • The dual problem

π‘’βˆ— ≔ max

𝜷,𝜸:𝛽𝑗β‰₯0min π‘₯ β„’ π‘₯, 𝜷, 𝜸

  • Interesting case: when do we have

π‘’βˆ— = π‘žβˆ—?

SLIDE 34

Lagrange duality

  • Theorem: under proper conditions, there exists π‘₯βˆ—, πœ·βˆ—, πœΈβˆ—

such that π‘’βˆ— = β„’ π‘₯βˆ—, πœ·βˆ—, πœΈβˆ— = π‘žβˆ— Moreover, π‘₯βˆ—, πœ·βˆ—, πœΈβˆ— satisfy Karush-Kuhn-Tucker (KKT) conditions: πœ–β„’ πœ–π‘₯𝑗 = 0, 𝛽𝑗𝑕𝑗 π‘₯ = 0 𝑕𝑗 π‘₯ ≀ 0, β„Žπ‘˜ π‘₯ = 0, 𝛽𝑗 β‰₯ 0

SLIDE 35

Lagrange duality

  • Theorem: under proper conditions, there exists π‘₯βˆ—, πœ·βˆ—, πœΈβˆ—

such that π‘’βˆ— = β„’ π‘₯βˆ—, πœ·βˆ—, πœΈβˆ— = π‘žβˆ— Moreover, π‘₯βˆ—, πœ·βˆ—, πœΈβˆ— satisfy Karush-Kuhn-Tucker (KKT) conditions: πœ–β„’ πœ–π‘₯𝑗 = 0, 𝛽𝑗𝑕𝑗 π‘₯ = 0 𝑕𝑗 π‘₯ ≀ 0, β„Žπ‘˜ π‘₯ = 0, 𝛽𝑗 β‰₯ 0 dual complementarity

SLIDE 36

Lagrange duality

  • Theorem: under proper conditions, there exists π‘₯βˆ—, πœ·βˆ—, πœΈβˆ—

such that π‘’βˆ— = β„’ π‘₯βˆ—, πœ·βˆ—, πœΈβˆ— = π‘žβˆ—

  • Moreover, π‘₯βˆ—, πœ·βˆ—, πœΈβˆ— satisfy Karush-Kuhn-Tucker (KKT)

conditions: πœ–β„’ πœ–π‘₯𝑗 = 0, 𝛽𝑗𝑕𝑗 π‘₯ = 0 𝑕𝑗 π‘₯ ≀ 0, β„Žπ‘˜ π‘₯ = 0, 𝛽𝑗 β‰₯ 0 dual constraints primal constraints

SLIDE 37

Lagrange duality

  • What are the proper conditions?
  • A set of conditions (Slater conditions):
  • 𝑔, 𝑕𝑗 convex, β„Žπ‘˜ affine, and exists π‘₯ satisfying all 𝑕𝑗 π‘₯ < 0
  • There exist other sets of conditions
  • Check textbooks, e.g., Convex Optimization by Boyd and

Vandenberghe

SLIDE 38

Optional: Variants of SVM

SLIDE 39

Hard-margin SVM

  • Optimization (Quadratic Programming):

min

π‘₯,𝑐

1 2 π‘₯

2

𝑧𝑗 π‘₯π‘ˆπ‘¦π‘— + 𝑐 β‰₯ 1, βˆ€π‘—

SLIDE 40

Soft-margin SVM [Cortes & Vapnik, Machine Learning 1995]

  • if the training instances are not linearly separable, the previous

formulation will fail

  • we can adjust our approach by using slack variables (denoted

by πœ‚π‘—) to tolerate errors min

π‘₯,𝑐,πœ‚π‘—

1 2 π‘₯

2

+ 𝐷 ෍

𝑗

πœ‚π‘— 𝑧𝑗 π‘₯π‘ˆπ‘¦π‘— + 𝑐 β‰₯ 1 βˆ’ πœ‚π‘—, πœ‚π‘— β‰₯ 0, βˆ€π‘—

  • 𝐷 determines the relative importance of maximizing margin vs.

minimizing slack
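
A minimal sketch of the soft-margin program, in the same style as the hard-margin sketch earlier (hypothetical non-separable data and an arbitrary C, not from the slides):

```python
import cvxpy as cp
import numpy as np

# hypothetical toy data that is NOT linearly separable (one label flipped on purpose)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [1.5, 1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 1.0                                   # trade-off between margin and slack

w, b = cp.Variable(2), cp.Variable()
xi = cp.Variable(len(y), nonneg=True)     # slack variables xi_i >= 0

objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
cp.Problem(objective, constraints).solve()

print(w.value, b.value, xi.value)         # a nonzero xi_i marks a margin violation
```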

SLIDE 41

The effect of $C$ in soft-margin SVM

Figure from Ben-Hur & Weston, Methods in Molecular Biology 2010

SLIDE 42

Hinge loss

  • when we covered neural nets, we talked about minimizing

squared loss and cross-entropy loss

  • SVMs minimize hinge loss

loss (error) when 𝑧 = 1 model output β„Ž π’š squared loss 0/1 loss hinge loss
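
For reference, the three losses for a positive example ($y = 1$) as a function of the model output (a small sketch, not part of the slides):

```python
import numpy as np

def hinge_loss(y, h):
    """Hinge loss max(0, 1 - y*h), the loss minimized by SVMs."""
    return np.maximum(0.0, 1.0 - y * h)

def zero_one_loss(y, h):
    return (np.sign(h) != y).astype(float)

def squared_loss(y, h):
    return (y - h) ** 2

h = np.linspace(-2, 2, 5)        # model outputs for a y = +1 example
print(hinge_loss(1, h))          # [3. 2. 1. 0. 0.]
print(zero_one_loss(1, h))       # [1. 1. 1. 0. 0.]  (sign(0) counts as an error here)
print(squared_loss(1, h))        # [9. 4. 1. 0. 1.]
```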

SLIDE 43

Support Vector Regression

  • the SVM idea can also be

applied in regression tasks

  • an πœ—-insensitive error

function specifies that a training instance is well explained if the model’s prediction is within πœ— of 𝑧𝑗

(π‘₯βŠ€π‘¦ + 𝑐) βˆ’ 𝑧 = πœ— 𝑧 βˆ’ (π‘₯βŠ€π‘¦ + 𝑐) = πœ—

SLIDE 44

Support Vector Regression

  • Regression using slack variables (denoted by πœ‚π‘—, πœŠπ‘—) to tolerate

errors min

π‘₯,𝑐,πœ‚π‘—,πœŠπ‘—

1 2 π‘₯

2

+ 𝐷 ෍

𝑗

πœ‚π‘— + πœŠπ‘— π‘₯π‘ˆπ‘¦π‘— + 𝑐 βˆ’ 𝑧𝑗 ≀ πœ— + πœ‚π‘—, 𝑧𝑗 βˆ’ π‘₯π‘ˆπ‘¦π‘— + 𝑐 ≀ πœ— + πœŠπ‘—, πœ‚π‘—, πœŠπ‘— β‰₯ 0.

slack variables allow predictions for some training instances to be

  • ff by more than πœ—
SLIDE 45

THANK YOU

Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Elad Hazan, Tom Dietterich, and Pedro Domingos.