SLIDE 1

Support Vector Machines Part 1

CS 760@UW-Madison

SLIDE 2

Goals for the lecture

you should understand the following concepts

  • the margin
  • the linear support vector machine
  • the primal and dual formulations of SVM learning
  • support vectors
  • Optional: variants of SVM
  • Optional: Lagrange Multiplier


SLIDE 3

Motivation

SLIDE 4

Linear classification

(π‘₯βˆ—)π‘ˆπ‘¦ = 0 Class +1 Class -1 π‘₯βˆ— (π‘₯βˆ—)π‘ˆπ‘¦ > 0 (π‘₯βˆ—)π‘ˆπ‘¦ < 0 Assume perfect separation between the two classes

SLIDE 5

Attempt

  • Given training data

𝑦𝑗, 𝑧𝑗 : 1 ≀ 𝑗 ≀ π‘œ i.i.d. from distribution 𝐸

  • Hypothesis 𝑧 = sign(𝑔

π‘₯ 𝑦 ) = sign(π‘₯π‘ˆπ‘¦)

  • 𝑧 = +1 if π‘₯π‘ˆπ‘¦ > 0
  • 𝑧 = βˆ’1 if π‘₯π‘ˆπ‘¦ < 0
  • Let’s assume that we can optimize to find π‘₯
SLIDE 6

Multiple optimal solutions?

Class +1 Class -1 π‘₯2 π‘₯3 π‘₯1 Same on empirical loss; Different on test/expected loss

SLIDE 7

What about π‘₯1?

Class +1 Class -1 π‘₯1

New test data

SLIDE 8

What about π‘₯3?

Class +1 Class -1 π‘₯3

New test data

SLIDE 9

Most confident: π‘₯2

Class +1 Class -1 π‘₯2

New test data

SLIDE 10

Intuition: margin

Class +1 Class -1 π‘₯2

large margin

SLIDE 11

Margin

SLIDE 12

Margin

We are going to prove the following expression for the margin using a geometric argument.

  β€’ Lemma 1: $x$ has distance $\frac{|f_w(x)|}{\|w\|}$ to the hyperplane $f_w(x) = w^\top x = 0$
  β€’ Lemma 2: $x$ has distance $\frac{|f_{w,b}(x)|}{\|w\|}$ to the hyperplane $f_{w,b}(x) = w^\top x + b = 0$

Need two geometric facts:

  β€’ $w$ is orthogonal to the hyperplane $f_{w,b}(x) = w^\top x + b = 0$
  β€’ Let $v$ be a direction (i.e., a unit vector). Then the length of the projection of $x$ onto $v$ is $v^\top x$

SLIDE 13

Margin

  • Lemma 1: 𝑦 has distance

|𝑔

π‘₯ 𝑦 |

| π‘₯ | to the hyperplane 𝑔 π‘₯ 𝑦 =

π‘₯π‘ˆπ‘¦ = 0 Proof:

  • π‘₯ is orthogonal to the hyperplane
  • The unit direction is

π‘₯ | π‘₯ |

  • The projection of 𝑦 is

π‘₯ π‘₯ π‘ˆ

𝑦 =

𝑔

π‘₯(𝑦)

| π‘₯ | π‘₯ | π‘₯ |

𝑦

π‘₯ π‘₯

π‘ˆ

𝑦

SLIDE 14

Margin: with bias

  • Lemma 2: 𝑦 has distance

|𝑔π‘₯,𝑐 𝑦 | | π‘₯ |

to the hyperplane 𝑔

π‘₯,𝑐 𝑦 =

π‘₯π‘ˆπ‘¦ + 𝑐 = 0 Proof:

  • Let 𝑦 = 𝑦βŠ₯ + 𝑠

π‘₯ | π‘₯ |, then |𝑠| is the distance

  • Multiply both sides by π‘₯π‘ˆ and add 𝑐
  • Left hand side: π‘₯π‘ˆπ‘¦ + 𝑐 = 𝑔

π‘₯,𝑐 𝑦

  • Right hand side: π‘₯π‘ˆπ‘¦βŠ₯ + 𝑠

π‘₯π‘ˆπ‘₯ | π‘₯ | + 𝑐 = 0 + 𝑠| π‘₯ |

SLIDE 15

𝑧 𝑦 = π‘₯π‘ˆπ‘¦ + π‘₯0 The notation here is:

Figure from Pattern Recognition and Machine Learning, Bishop

Margin: with bias

SLIDE 16

Support Vector Machine (SVM)

SLIDE 17

SVM: objective

  • Absolute margin over all training data points:

𝛿 = min

𝑗

|𝑔

π‘₯,𝑐 𝑦𝑗 |

| π‘₯ |

  • Since only want correct 𝑔

π‘₯,𝑐, and recall 𝑧𝑗 ∈ {+1, βˆ’1}, we define

the margin to be 𝛿 = min

𝑗

𝑧𝑗𝑔

π‘₯,𝑐 𝑦𝑗

| π‘₯ |

  • If 𝑔

π‘₯,𝑐 incorrect on some 𝑦𝑗, the margin is negative

SLIDE 18

SVM: objective

  • Maximize margin over all training data points:

max

π‘₯,𝑐 𝛿 = max π‘₯,𝑐 min 𝑗

𝑧𝑗𝑔

π‘₯,𝑐 𝑦𝑗

| π‘₯ | = max

π‘₯,𝑐 min 𝑗

𝑧𝑗(π‘₯π‘ˆπ‘¦π‘— + 𝑐) | π‘₯ |

  • A bit complicated …
SLIDE 19

SVM: simplified objective

  • Observation: when (π‘₯, 𝑐) scaled by a factor 𝑑, the margin

unchanged 𝑧𝑗(𝑑π‘₯π‘ˆπ‘¦π‘— + 𝑑𝑐) | 𝑑π‘₯ | = 𝑧𝑗(π‘₯π‘ˆπ‘¦π‘— + 𝑐) | π‘₯ |

  • Let’s consider a fixed scale such that

π‘§π‘—βˆ— π‘₯π‘ˆπ‘¦π‘—βˆ— + 𝑐 = 1 where π‘¦π‘—βˆ— is the point closest to the hyperplane

SLIDE 20

SVM: simplified objective

  • Let’s consider a fixed scale such that

π‘§π‘—βˆ— π‘₯π‘ˆπ‘¦π‘—βˆ— + 𝑐 = 1 where π‘¦π‘—βˆ— is the point closet to the hyperplane

  • Now we have for all data

𝑧𝑗 π‘₯π‘ˆπ‘¦π‘— + 𝑐 β‰₯ 1 and at least for one 𝑗 the equality holds

  • Then the margin over all training points is

1 | π‘₯ |

SLIDE 21

SVM: simplified objective

  • Optimization simplified to

min

π‘₯,𝑐

1 2 π‘₯

2

𝑧𝑗 π‘₯π‘ˆπ‘¦π‘— + 𝑐 β‰₯ 1, βˆ€π‘—

  • How to find the optimum ෝ

π‘₯βˆ—?

  • Solved by Lagrange multiplier method
SLIDE 22

SVM: optimization

SLIDE 23

SVM: optimization

  • Optimization (Quadratic Programming):

min

π‘₯,𝑐

1 2 π‘₯

2

𝑧𝑗 π‘₯π‘ˆπ‘¦π‘— + 𝑐 β‰₯ 1, βˆ€π‘—

  • Generalized Lagrangian:

β„’ π‘₯, 𝑐, 𝜷 = 1 2 π‘₯

2

βˆ’ ෍

𝑗

𝛽𝑗[𝑧𝑗 π‘₯π‘ˆπ‘¦π‘— + 𝑐 βˆ’ 1] where 𝜷 is the Lagrange multiplier

SLIDE 24

SVM: optimization

  • KKT conditions:

πœ–β„’ πœ–π‘₯ = 0, β†’ π‘₯ = σ𝑗 𝛽𝑗𝑧𝑗𝑦𝑗 (1) πœ–β„’ πœ–π‘ = 0, β†’ 0 = σ𝑗 𝛽𝑗𝑧𝑗

(2)

  • Plug into β„’:

β„’ π‘₯, 𝑐, 𝜷 = σ𝑗 𝛽𝑗 βˆ’

1 2 Οƒπ‘—π‘˜ π›½π‘—π›½π‘˜π‘§π‘—π‘§π‘˜π‘¦π‘— π‘ˆπ‘¦π‘˜ (3)

combined with 0 = σ𝑗 𝛽𝑗𝑧𝑗 , 𝛽𝑗 β‰₯ 0

SLIDE 25

SVM: optimization

  • Reduces to dual problem:

β„’ π‘₯, 𝑐, 𝜷 = ෍

𝑗

𝛽𝑗 βˆ’ 1 2 ෍

π‘—π‘˜

π›½π‘—π›½π‘˜π‘§π‘—π‘§π‘˜π‘¦π‘—

π‘ˆπ‘¦π‘˜

෍

𝑗

𝛽𝑗𝑧𝑗 = 0, 𝛽𝑗 β‰₯ 0

  • Since π‘₯ = σ𝑗 𝛽𝑗𝑧𝑗𝑦𝑗, we have π‘₯π‘ˆπ‘¦ + 𝑐 = σ𝑗 𝛽𝑗𝑧𝑗𝑦𝑗

π‘ˆπ‘¦ + 𝑐

Only depend on inner products
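
For intuition, the dual can also be solved numerically with a generic constrained optimizer (an illustrative sketch on made-up data, not from the slides); entries of $\boldsymbol{\alpha}$ that come back clearly positive mark the support vectors:

```python
import numpy as np
from scipy.optimize import minimize

# hypothetical separable toy data (illustrative values only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
G = (y[:, None] * X) @ (y[:, None] * X).T      # G[i, k] = y_i y_k x_i^T x_k

def neg_dual(alpha):
    # negated dual objective, since the solver minimizes
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

res = minimize(neg_dual, x0=np.zeros(len(y)),
               bounds=[(0, None)] * len(y),
               constraints={"type": "eq", "fun": lambda a: a @ y})

alpha = res.x
w = (alpha * y) @ X                            # w = sum_i alpha_i y_i x_i
sv = np.where(alpha > 1e-6)[0]                 # support vectors: alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)                 # recover b from the support vectors
print(alpha, w, b)
```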

SLIDE 26

Support Vectors

  • those instances with Ξ±i > 0

are called support vectors

  • they lie on the margin

boundary

  • solution NOT changed if

delete the instances with Ξ±i =

support vectors

  • final solution is a sparse linear combination of the training

instances
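
The same sparsity is easy to see with an off-the-shelf solver (an illustrative sketch using scikit-learn, not part of the slides; a very large C approximates the hard-margin case):

```python
import numpy as np
from sklearn.svm import SVC

# hypothetical separable toy data (illustrative values only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

print(clf.support_)               # indices of the training points with alpha_i > 0
print(clf.support_vectors_)       # the support vectors themselves
print(clf.dual_coef_)             # alpha_i * y_i for each support vector
print(clf.coef_, clf.intercept_)  # w and b recovered from the dual solution
```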

SLIDE 27

Optional: Lagrange Multiplier

SLIDE 28

Lagrangian

  • Consider optimization problem:

min

π‘₯

𝑔(π‘₯) β„Žπ‘— π‘₯ = 0, βˆ€1 ≀ 𝑗 ≀ π‘š

  • Lagrangian:

β„’ π‘₯, 𝜸 = 𝑔 π‘₯ + ෍

𝑗

π›Ύπ‘—β„Žπ‘—(π‘₯) where 𝛾𝑗’s are called Lagrange multipliers

SLIDE 29

Lagrangian

  • Consider optimization problem:

min

π‘₯

𝑔(π‘₯) β„Žπ‘— π‘₯ = 0, βˆ€1 ≀ 𝑗 ≀ π‘š

  • Solved by setting derivatives of Lagrangian to 0

πœ–β„’ πœ–π‘₯𝑗 = 0; πœ–β„’ πœ–π›Ύπ‘— = 0

SLIDE 30

Generalized Lagrangian

  • Consider optimization problem:

min

π‘₯

𝑔(π‘₯) 𝑕𝑗 π‘₯ ≀ 0, βˆ€1 ≀ 𝑗 ≀ 𝑙 β„Žπ‘˜ π‘₯ = 0, βˆ€1 ≀ π‘˜ ≀ π‘š

  • Generalized Lagrangian:

β„’ π‘₯, 𝜷, 𝜸 = 𝑔 π‘₯ + ෍

𝑗

𝛽𝑗𝑕𝑗(π‘₯) + ෍

π‘˜

π›Ύπ‘˜β„Žπ‘˜(π‘₯) where 𝛽𝑗, π›Ύπ‘˜β€™s are called Lagrange multipliers

SLIDE 31

Generalized Lagrangian

  • Consider the quantity:

πœ„π‘„ π‘₯ ≔ max

𝜷,𝜸:𝛽𝑗β‰₯0 β„’ π‘₯, 𝜷, 𝜸

  • Why?

πœ„π‘„ π‘₯ = α‰Šπ‘” π‘₯ , if π‘₯ satisfies all the constraints +∞, if π‘₯ does not satisfy the constraints

  • So minimizing 𝑔 π‘₯ is the same as minimizing πœ„π‘„ π‘₯

min

π‘₯ 𝑔 π‘₯ = min π‘₯ πœ„π‘„ π‘₯ = min π‘₯

max

𝜷,𝜸:𝛽𝑗β‰₯0 β„’ π‘₯, 𝜷, 𝜸

SLIDE 32

Lagrange duality

  • The primal problem

π‘žβˆ— ≔ min

π‘₯ 𝑔 π‘₯ = min π‘₯

max

𝜷,𝜸:𝛽𝑗β‰₯0 β„’ π‘₯, 𝜷, 𝜸

  • The dual problem

π‘’βˆ— ≔ max

𝜷,𝜸:𝛽𝑗β‰₯0min π‘₯ β„’ π‘₯, 𝜷, 𝜸

  • Always true:

π‘’βˆ— ≀ π‘žβˆ—

SLIDE 33

Lagrange duality

  • The primal problem

π‘žβˆ— ≔ min

π‘₯ 𝑔 π‘₯ = min π‘₯

max

𝜷,𝜸:𝛽𝑗β‰₯0 β„’ π‘₯, 𝜷, 𝜸

  • The dual problem

π‘’βˆ— ≔ max

𝜷,𝜸:𝛽𝑗β‰₯0min π‘₯ β„’ π‘₯, 𝜷, 𝜸

  • Interesting case: when do we have

π‘’βˆ— = π‘žβˆ—?

SLIDE 34

Lagrange duality

  • Theorem: under proper conditions, there exists π‘₯βˆ—, πœ·βˆ—, πœΈβˆ—

such that π‘’βˆ— = β„’ π‘₯βˆ—, πœ·βˆ—, πœΈβˆ— = π‘žβˆ— Moreover, π‘₯βˆ—, πœ·βˆ—, πœΈβˆ— satisfy Karush-Kuhn-Tucker (KKT) conditions: πœ–β„’ πœ–π‘₯𝑗 = 0, 𝛽𝑗𝑕𝑗 π‘₯ = 0 𝑕𝑗 π‘₯ ≀ 0, β„Žπ‘˜ π‘₯ = 0, 𝛽𝑗 β‰₯ 0

SLIDE 35

Lagrange duality

  • Theorem: under proper conditions, there exists π‘₯βˆ—, πœ·βˆ—, πœΈβˆ—

such that π‘’βˆ— = β„’ π‘₯βˆ—, πœ·βˆ—, πœΈβˆ— = π‘žβˆ— Moreover, π‘₯βˆ—, πœ·βˆ—, πœΈβˆ— satisfy Karush-Kuhn-Tucker (KKT) conditions: πœ–β„’ πœ–π‘₯𝑗 = 0, 𝛽𝑗𝑕𝑗 π‘₯ = 0 𝑕𝑗 π‘₯ ≀ 0, β„Žπ‘˜ π‘₯ = 0, 𝛽𝑗 β‰₯ 0 dual complementarity

SLIDE 36

Lagrange duality

  • Theorem: under proper conditions, there exists π‘₯βˆ—, πœ·βˆ—, πœΈβˆ—

such that π‘’βˆ— = β„’ π‘₯βˆ—, πœ·βˆ—, πœΈβˆ— = π‘žβˆ—

  • Moreover, π‘₯βˆ—, πœ·βˆ—, πœΈβˆ— satisfy Karush-Kuhn-Tucker (KKT)

conditions: πœ–β„’ πœ–π‘₯𝑗 = 0, 𝛽𝑗𝑕𝑗 π‘₯ = 0 𝑕𝑗 π‘₯ ≀ 0, β„Žπ‘˜ π‘₯ = 0, 𝛽𝑗 β‰₯ 0 dual constraints primal constraints

SLIDE 37

Lagrange duality

  • What are the proper conditions?
  • A set of conditions (Slater conditions):
  • 𝑔, 𝑕𝑗 convex, β„Žπ‘˜ affine, and exists π‘₯ satisfying all 𝑕𝑗 π‘₯ < 0
  • There exist other sets of conditions
  • Check textbooks, e.g., Convex Optimization by Boyd and

Vandenberghe

SLIDE 38

Optional: Variants of SVM

SLIDE 39

Hard-margin SVM

  • Optimization (Quadratic Programming):

min

π‘₯,𝑐

1 2 π‘₯

2

𝑧𝑗 π‘₯π‘ˆπ‘¦π‘— + 𝑐 β‰₯ 1, βˆ€π‘—

SLIDE 40

Soft-margin SVM [Cortes & Vapnik, Machine Learning 1995]

  • if the training instances are not linearly separable, the previous

formulation will fail

  • we can adjust our approach by using slack variables (denoted

by πœ‚π‘—) to tolerate errors min

π‘₯,𝑐,πœ‚π‘—

1 2 π‘₯

2

+ 𝐷 ෍

𝑗

πœ‚π‘— 𝑧𝑗 π‘₯π‘ˆπ‘¦π‘— + 𝑐 β‰₯ 1 βˆ’ πœ‚π‘—, πœ‚π‘— β‰₯ 0, βˆ€π‘—

  • 𝐷 determines the relative importance of maximizing margin vs.

minimizing slack
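
A minimal sketch of the soft-margin program, in the same style as the hard-margin sketch earlier (hypothetical non-separable data and an arbitrary C, not from the slides):

```python
import cvxpy as cp
import numpy as np

# hypothetical toy data that is NOT linearly separable (one label flipped on purpose)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [1.5, 1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 1.0                                   # trade-off between margin and slack

w, b = cp.Variable(2), cp.Variable()
xi = cp.Variable(len(y), nonneg=True)     # slack variables xi_i >= 0

objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
cp.Problem(objective, constraints).solve()

print(w.value, b.value, xi.value)         # a nonzero xi_i marks a margin violation
```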

SLIDE 41

The effect of $C$ in soft-margin SVM

Figure from Ben-Hur & Weston, Methods in Molecular Biology 2010

SLIDE 42

Hinge loss

  • when we covered neural nets, we talked about minimizing

squared loss and cross-entropy loss

  • SVMs minimize hinge loss

loss (error) when 𝑧 = 1 model output β„Ž π’š squared loss 0/1 loss hinge loss
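
For reference, the three losses for a positive example ($y = 1$) as a function of the model output (a small sketch, not part of the slides):

```python
import numpy as np

def hinge_loss(y, h):
    """Hinge loss max(0, 1 - y*h), the loss minimized by SVMs."""
    return np.maximum(0.0, 1.0 - y * h)

def zero_one_loss(y, h):
    return (np.sign(h) != y).astype(float)

def squared_loss(y, h):
    return (y - h) ** 2

h = np.linspace(-2, 2, 5)        # model outputs for a y = +1 example
print(hinge_loss(1, h))          # [3. 2. 1. 0. 0.]
print(zero_one_loss(1, h))       # [1. 1. 1. 0. 0.]  (sign(0) counts as an error here)
print(squared_loss(1, h))        # [9. 4. 1. 0. 1.]
```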

SLIDE 43

Support Vector Regression

  • the SVM idea can also be

applied in regression tasks

  • an πœ—-insensitive error

function specifies that a training instance is well explained if the model’s prediction is within πœ— of 𝑧𝑗

(π‘₯βŠ€π‘¦ + 𝑐) βˆ’ 𝑧 = πœ— 𝑧 βˆ’ (π‘₯βŠ€π‘¦ + 𝑐) = πœ—

SLIDE 44

Support Vector Regression

  • Regression using slack variables (denoted by πœ‚π‘—, πœŠπ‘—) to tolerate

errors min

π‘₯,𝑐,πœ‚π‘—,πœŠπ‘—

1 2 π‘₯

2

+ 𝐷 ෍

𝑗

πœ‚π‘— + πœŠπ‘— π‘₯π‘ˆπ‘¦π‘— + 𝑐 βˆ’ 𝑧𝑗 ≀ πœ— + πœ‚π‘—, 𝑧𝑗 βˆ’ π‘₯π‘ˆπ‘¦π‘— + 𝑐 ≀ πœ— + πœŠπ‘—, πœ‚π‘—, πœŠπ‘— β‰₯ 0.

slack variables allow predictions for some training instances to be

  • ff by more than πœ—
SLIDE 45

THANK YOU

Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Elad Hazan, Tom Dietterich, and Pedro Domingos.