Support Vector Machines & Kernels
Lecture 5
David Sontag, New York University
Slides adapted from Luke Zettlemoyer and Carlos Guestrin
Support Vector Machines
QP form:
  \min_{w,b}\ \tfrac{1}{2}\|w\|^2 + C \sum_j \xi_j \quad \text{s.t. } y_j(w \cdot x_j + b) \ge 1 - \xi_j,\ \xi_j \ge 0
More "natural" form:
  \min_{w,b}\ \underbrace{\tfrac{\lambda}{2}\|w\|^2}_{\text{regularization term}} + \underbrace{\tfrac{1}{n} \sum_j \max\{0,\ 1 - y_j(w \cdot x_j + b)\}}_{\text{empirical loss}}
Equivalent if λ = 1/(nC) (the two objectives differ only by a constant rescaling)
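To make the "natural" form concrete, here is a minimal NumPy sketch of the regularized hinge-loss objective (function and variable names are mine, not from the slides):

```python
import numpy as np

def svm_objective(w, b, X, y, lam):
    """Regularized hinge-loss objective:
    (lam/2)*||w||^2 + (1/n) * sum_j max(0, 1 - y_j (w.x_j + b))."""
    margins = y * (X @ w + b)                # y_j (w.x_j + b) for every example
    hinge = np.maximum(0.0, 1.0 - margins)   # hinge loss per example
    return 0.5 * lam * np.dot(w, w) + hinge.mean()
```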
Subgradient method
Step size:
Stochastic subgradient
Subgradient
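For concreteness, the (sub)gradient being used here is the standard one for the per-example regularized hinge objective (a textbook fact; the slides' exact notation may differ):

\partial_w \Big[ \tfrac{\lambda}{2}\|w\|^2 + \max\{0,\ 1 - y_j\, w \cdot x_j\} \Big] \ni
\begin{cases} \lambda w - y_j x_j & \text{if } y_j\, w \cdot x_j < 1 \\ \lambda w & \text{otherwise} \end{cases}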
PEGASOS
Subgradient Projection
At each step Pegasos uses a subset A_t ⊆ S of the training set: choosing A_t = S gives the subgradient method, while choosing |A_t| = 1 gives stochastic gradient descent.
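A minimal sketch of Pegasos with |A_t| = 1 (one example per step); the step size η_t = 1/(λt) and the optional projection onto the ball of radius 1/√λ follow the original Pegasos algorithm, but the function and variable names are mine:

```python
import numpy as np

def pegasos(X, y, lam, n_iters=10000, seed=0):
    """Pegasos with |A_t| = 1: stochastic subgradient descent on the
    regularized hinge loss, with step size eta_t = 1/(lam*t)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, n_iters + 1):
        i = rng.integers(n)                      # pick one example at random
        eta = 1.0 / (lam * t)                    # step size 1/(lambda*t)
        if y[i] * (X[i] @ w) < 1:                # margin violated: hinge is active
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:                                    # hinge inactive: only shrink w
            w = (1 - eta * lam) * w
        # optional projection onto the ball of radius 1/sqrt(lam)
        norm = np.linalg.norm(w)
        if norm > 1.0 / np.sqrt(lam):
            w *= (1.0 / np.sqrt(lam)) / norm
    return w
```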
Run-Time of Pegasos
- Choosing |A_t| = 1
- Run-time required for Pegasos to find an ε-accurate solution with probability ≥ 1 − δ
- Run-time does not depend on #examples
- Depends on “difficulty” of problem (λ and ε)
n = # of features
Experiments
- 3 datasets (provided by Joachims)
– Reuters CCAT (800K examples, 47k features)
– Physics ArXiv (62k examples, 100k features)
– Covertype (581k examples, 54 features)

Training Time (in seconds):

                 Pegasos   SVM-Perf   SVM-Light
  Reuters              2         77      20,075
  Covertype            6         85      25,514
  Astro-Physics        2          5          80
What’s Next!
- Learn one of the most interesting and exciting recent advancements in machine learning
– The "kernel trick"
– High-dimensional feature spaces at no extra cost
- But first, a detour
– Constrained optimization!
Constrained optimization
Example: minimize f(x) = x²
  No constraint: x* = 0
  Constraint x ≥ −1: x* = 0
  Constraint x ≥ 1: x* = 1
How do we solve with constraints? Lagrange Multipliers!!!
Lagrange multipliers – Dual variables
Problem: \min_x\ x^2 \ \text{s.t.}\ x \ge b
Rewrite the constraint x ≥ b as x − b ≥ 0, add a Lagrange multiplier α and a new constraint α ≥ 0, and introduce the Lagrangian (objective):
  L(x, \alpha) = x^2 - \alpha (x - b)
We will solve:
  \min_x \max_{\alpha \ge 0} L(x, \alpha)
Why is this equivalent?
- min is fighting max!
If x < b: (x − b) < 0, so max_{α≥0} −α(x − b) = ∞
- min won't let this happen!
If x > b (and α ≥ 0): (x − b) > 0, so max_{α≥0} −α(x − b) = 0, attained at α* = 0
- min is cool with 0, and then L(x, α) = x² (the original objective)
If x = b: α can be anything, and L(x, α) = x² (the original objective)
The min on the outside forces max to behave, so constraints will be satisfied.
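As a quick worked instance (my own example, not from the slides), take b = 1, i.e. minimize x² subject to x ≥ 1:

\begin{aligned}
L(x, \alpha) &= x^2 - \alpha (x - 1), \qquad \alpha \ge 0 \\
\frac{\partial L}{\partial x} &= 2x - \alpha = 0 \;\Rightarrow\; x = \frac{\alpha}{2} \\
\max_{\alpha \ge 0} L\!\left(\frac{\alpha}{2}, \alpha\right) &= \max_{\alpha \ge 0} \left( \alpha - \frac{\alpha^2}{4} \right) \;\Rightarrow\; \alpha^* = 2,\ x^* = 1
\end{aligned}

which matches the constrained solution x* = 1 from the example above.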
Dual SVM derivation (1) – the linearly separable case (hard margin SVM)
Original optimization problem:
  \min_{w,b}\ \tfrac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_j (w \cdot x_j + b) \ge 1 \ \ \forall j
Rewrite constraints as 1 - y_j (w \cdot x_j + b) \le 0, with one Lagrange multiplier α_j ≥ 0 per example.
Lagrangian:
  L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_j \alpha_j \big[ y_j (w \cdot x_j + b) - 1 \big]
Our goal now is to solve:
  \min_{w,b} \max_{\alpha \ge 0} L(w, b, \alpha)
Dual SVM derivation (2) – the linearly separable case (hard margin SVM)
Swap min and max:
  (Primal)  \min_{w,b} \max_{\alpha \ge 0} L(w, b, \alpha)   ⟶   (Dual)  \max_{\alpha \ge 0} \min_{w,b} L(w, b, \alpha)
Slater's condition from convex optimization guarantees that these two optimization problems are equivalent!
Dual SVM derivation (3) – the linearly separable case (hard margin SVM)
(Dual)  \max_{\alpha \ge 0} \min_{w,b} L(w, b, \alpha)
Can solve for the optimal w, b as a function of α by setting the derivatives of the inner minimization to zero:
  \frac{\partial L}{\partial w} = w - \sum_j \alpha_j y_j x_j = 0 \ \Rightarrow\ w = \sum_j \alpha_j y_j x_j
  \frac{\partial L}{\partial b} = -\sum_j \alpha_j y_j = 0 \ \Rightarrow\ \sum_j \alpha_j y_j = 0
Substituting these values back in (and simplifying), we obtain:
(Dual)
  \max_{\alpha \ge 0}\ \sum_j \alpha_j - \tfrac{1}{2} \sum_j \sum_k \alpha_j \alpha_k y_j y_k (x_j \cdot x_k) \quad \text{s.t.}\quad \sum_j \alpha_j y_j = 0
The sums run over all training examples; x_j · x_k is a dot product, and the α's and y's are scalars.
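A small NumPy sketch of how the primal solution is recovered once the dual variables are known (the alphas here are assumed to come from some QP solver; names are illustrative):

```python
import numpy as np

def primal_from_dual(alpha, X, y):
    """Recover w = sum_j alpha_j y_j x_j and b from dual variables alpha."""
    w = (alpha * y) @ X                      # sum_j alpha_j y_j x_j
    sv = np.flatnonzero(alpha > 1e-8)        # support vectors have alpha_j > 0
    b = np.mean(y[sv] - X[sv] @ w)           # on support vectors, y_j (w.x_j + b) = 1
    return w, b

def predict(w, b, X_new):
    """Hard-margin SVM prediction: sign(w.x + b)."""
    return np.sign(X_new @ w + b)
```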
Reminder: What if the data is not linearly separable?
- Use features of features, features of features of features, ...
Feature space can get really large really quickly!
\phi(x) = \big( x^{(1)}, \ldots, x^{(n)},\ x^{(1)} x^{(2)},\ x^{(1)} x^{(3)},\ \ldots,\ e^{x^{(1)}},\ \ldots \big)
Higher order polynomials
[Plot: number of monomial terms vs. number of input dimensions, for d = 2, 3, 4]
m = number of input features, d = degree of polynomial
The number of terms grows fast! For d = 6, m = 100: about 1.6 billion terms
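To verify the count (my own arithmetic check, assuming we count monomials of degree exactly d in m variables, of which there are C(m + d − 1, d)):

```python
from math import comb

m, d = 100, 6
# monomials of degree exactly d in m variables: C(m + d - 1, d)
print(comb(m + d - 1, d))   # 1,609,344,100 -- about 1.6 billion terms
```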
Dual formulation only depends on dot-products of the features!
First, we introduce a feature mapping φ and write the dual using φ(x_j) · φ(x_k) in place of x_j · x_k.
Next, replace the dot product with an equivalent kernel function: K(x_j, x_k) = φ(x_j) · φ(x_k).
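A sketch of what this buys us in code: everything is written against a kernel function K, so φ(x) is never formed explicitly (the function names and the idea of passing K as an argument are my own illustration):

```python
import numpy as np

def gram_matrix(X, kernel):
    """Kernel (Gram) matrix K[j, k] = kernel(x_j, x_k); this is all the dual needs."""
    n = X.shape[0]
    return np.array([[kernel(X[j], X[k]) for k in range(n)] for j in range(n)])

def kernel_predict(alpha, b, X_train, y_train, x_new, kernel):
    """Dual prediction: sign( sum_j alpha_j y_j K(x_j, x_new) + b )."""
    scores = np.array([kernel(x_j, x_new) for x_j in X_train])
    return np.sign(np.dot(alpha * y_train, scores) + b)
```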
Polynomial kernel
Polynomials of degree exactly d
d = 1:
  \phi(u) \cdot \phi(v) = (u_1, u_2) \cdot (v_1, v_2) = u_1 v_1 + u_2 v_2 = u \cdot v

d = 2:
  \phi(u) \cdot \phi(v) = (u_1^2,\ u_1 u_2,\ u_2 u_1,\ u_2^2) \cdot (v_1^2,\ v_1 v_2,\ v_2 v_1,\ v_2^2)
                        = u_1^2 v_1^2 + 2 u_1 v_1 u_2 v_2 + u_2^2 v_2^2
                        = (u_1 v_1 + u_2 v_2)^2
                        = (u \cdot v)^2

For any d (we will skip the proof):
  \phi(u) \cdot \phi(v) = (u \cdot v)^d
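A quick numeric check of the d = 2 identity (the explicit map below is the one from the derivation; the concrete vectors are my own illustration):

```python
import numpy as np

def phi2(x):
    """Explicit degree-2 monomial map for 2-d input: (x1^2, x1*x2, x2*x1, x2^2)."""
    return np.array([x[0]**2, x[0]*x[1], x[1]*x[0], x[1]**2])

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi2(u) @ phi2(v))   # explicit feature-space dot product
print((u @ v) ** 2)        # kernel evaluation (u.v)^2 -- same value: 1.0
```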
Common kernels
- Polynomials of degree exactly d:  K(u, v) = (u · v)^d
- Polynomials of degree up to d:  K(u, v) = (u · v + 1)^d
- Gaussian kernels:  K(u, v) = exp(−‖u − v‖² / 2σ²)
- Sigmoid:  K(u, v) = tanh(η u · v + ν)
- And many others: very active area of research!
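A minimal sketch of these kernels as NumPy functions (the parameter names sigma, eta, nu and default values are my own; any of these can be plugged into the kernelized prediction sketch above):

```python
import numpy as np

def poly_exact(u, v, d=3):
    """Polynomials of degree exactly d: K(u, v) = (u.v)^d."""
    return (u @ v) ** d

def poly_up_to(u, v, d=3):
    """Polynomials of degree up to d: K(u, v) = (u.v + 1)^d."""
    return (u @ v + 1.0) ** d

def gaussian(u, v, sigma=1.0):
    """Gaussian (RBF) kernel: K(u, v) = exp(-||u - v||^2 / (2*sigma^2))."""
    diff = u - v
    return np.exp(-(diff @ diff) / (2.0 * sigma ** 2))

def sigmoid(u, v, eta=1.0, nu=0.0):
    """Sigmoid kernel: K(u, v) = tanh(eta * u.v + nu)."""
    return np.tanh(eta * (u @ v) + nu)
```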