SLIDE 1

Support Vector Machines & Kernels Lecture 5

David Sontag New York University

Slides adapted from Luke Zettlemoyer and Carlos Guestrin

SLIDE 2

Support Vector Machines

QP form:

min_{w,b} ½||w||² + C Σⱼ ξⱼ   s.t. yⱼ(w·xⱼ + b) ≥ 1 − ξⱼ and ξⱼ ≥ 0 for all j

More “natural” form:

min_{w,b} (λ/2)||w||² + (1/n) Σⱼ max(0, 1 − yⱼ(w·xⱼ + b))

The first term is the regularization term; the second is the empirical loss.

Equivalent if λ = 1/(nC).
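To make the “natural” form concrete, here is a minimal NumPy sketch of that objective; the function and variable names are our own, not from the slides:

    import numpy as np

    def svm_objective(w, b, X, y, lam):
        """Regularized hinge-loss form of the SVM objective:
        (lam/2) ||w||^2 + (1/n) sum_j max(0, 1 - y_j (w . x_j + b))."""
        margins = y * (X @ w + b)               # y_j (w . x_j + b) for every example
        hinge = np.maximum(0.0, 1.0 - margins)  # per-example hinge loss
        return 0.5 * lam * np.dot(w, w) + hinge.mean()

Setting lam = 1/(nC) makes the minimizer match the QP form above.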

SLIDE 3

Subgradient method

SLIDE 4

Subgradient method

Step size: η_t
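As a sketch of what these two slides describe, the following NumPy loop runs subgradient descent on the regularized hinge-loss objective; the step-size choice η_t = 1/(λt) is an assumption (it is the one Pegasos uses on the next slides), and the bias term is omitted for simplicity:

    import numpy as np

    def subgradient_descent(X, y, lam, T):
        """Subgradient descent on (lam/2)||w||^2 + (1/n) sum_j hinge_j(w)."""
        n, d = X.shape
        w = np.zeros(d)
        for t in range(1, T + 1):
            margins = y * (X @ w)
            active = margins < 1               # examples whose hinge loss is nonzero
            # one valid subgradient of the objective at w:
            g = lam * w - (y[active, None] * X[active]).sum(axis=0) / n
            w -= g / (lam * t)                 # step size eta_t = 1/(lam * t)
        return w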

SLIDE 5

Stochastic subgradient


SLIDE 6

PEGASOS

Each iteration: a subgradient step on a subset A_t of the examples, followed by a projection step.

– A_t = S (the whole training set): subgradient method
– |A_t| = 1: stochastic gradient
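A compact sketch of the Pegasos update with |A_t| = 1 (one random example per step, then projection onto the ball of radius 1/√λ); the code follows the simplest variant and omits the bias term:

    import numpy as np

    def pegasos(X, y, lam, T, seed=0):
        """Stochastic subgradient step + projection, one example per iteration."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for t in range(1, T + 1):
            i = rng.integers(n)                # pick one random example (|A_t| = 1)
            eta = 1.0 / (lam * t)              # step size
            if y[i] * (X[i] @ w) < 1:          # hinge loss active at current w
                w = (1.0 - eta * lam) * w + eta * y[i] * X[i]
            else:
                w = (1.0 - eta * lam) * w
            norm = np.linalg.norm(w)           # projection: keep ||w|| <= 1/sqrt(lam)
            if norm > 1.0 / np.sqrt(lam):
                w /= norm * np.sqrt(lam)
        return w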

SLIDE 7

Run-Time of Pegasos

  • Choosing |A_t| = 1:

Run-time required for Pegasos to find an ε-accurate solution with probability ≥ 1 − δ: Õ(n / (λε))

  • Run-time does not depend on #examples
  • Depends on “difficulty” of problem (λ and ε)

n = # of features

SLIDE 8

Experiments

  • 3 datasets (provided by Joachims)

– Reuters CCAT (800K examples, 47k features)
– Physics ArXiv (62k examples, 100k features)
– Covertype (581k examples, 54 features)

Training time (in seconds):

Dataset         Pegasos   SVM-Perf   SVM-Light
Reuters         2         77         20,075
Covertype       6         85         25,514
Astro-Physics   2         5          80

SLIDE 9

What’s Next!

  • Learn one of the most interesting and exciting recent advancements in machine learning

– The “kernel trick”
– High-dimensional feature spaces at no extra cost

  • But first, a detour

– Constrained optimization!

SLIDE 10

Constrained optimization

Minimize x²:
– No constraint: x* = 0
– x ≥ −1: x* = 0
– x ≥ 1: x* = 1

How do we solve with constraints?  Lagrange Multipliers!!!

SLIDE 11

Lagrange multipliers – Dual variables

Rewrite the constraint x ≥ b as x − b ≥ 0, add a Lagrange multiplier α ≥ 0 (a new constraint), and introduce the Lagrangian (objective):

L(x, α) = x² − α(x − b)

We will solve:

min_x max_{α≥0} L(x, α)

Why is this equivalent?

  • min is fighting max!

x < b ⇒ (x − b) < 0 ⇒ max_{α≥0} −α(x − b) = ∞

  • min won’t let this happen!

x > b, α ≥ 0 ⇒ (x − b) > 0 ⇒ max_{α≥0} −α(x − b) = 0, α* = 0

  • min is cool with 0, and L(x, α) = x² (original objective)

x = b ⇒ α can be anything, and L(x, α) = x² (original objective)

The min on the outside forces max to behave, so constraints will be satisfied.
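A tiny numeric check of this reasoning for minimizing x² subject to x ≥ 1 (the grids below are arbitrary finite stand-ins for x ∈ ℝ and α ≥ 0):

    import numpy as np

    def inner_max(x, b, alphas):
        """max over alpha >= 0 of L(x, alpha) = x^2 - alpha * (x - b)."""
        return np.max(x**2 - alphas * (x - b))

    b = 1.0
    alphas = np.linspace(0.0, 100.0, 1001)   # finite stand-in for alpha >= 0
    xs = np.linspace(-2.0, 2.0, 401)
    vals = [inner_max(x, b, alphas) for x in xs]
    print(xs[int(np.argmin(vals))])          # ~1.0: the outer min respects x >= b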

SLIDE 12

Dual SVM derivation (1) – the linearly separable case (hard margin SVM)

Original optimization problem:

min_{w,b} ½||w||²   s.t. yⱼ(w·xⱼ + b) ≥ 1 for all j

Rewrite the constraints as yⱼ(w·xⱼ + b) − 1 ≥ 0 and add one Lagrange multiplier αⱼ ≥ 0 per example.

Lagrangian:

L(w, b, α) = ½||w||² − Σⱼ αⱼ [yⱼ(w·xⱼ + b) − 1]

Our goal now is to solve:

min_{w,b} max_{α≥0} L(w, b, α)

SLIDE 13

Dual SVM derivation (2) – the linearly separable case (hard margin SVM)

Swap min and max:

(Primal)  min_{w,b} max_{α≥0} L(w, b, α)      (Dual)  max_{α≥0} min_{w,b} L(w, b, α)

Slater’s condition from convex optimization guarantees that these two optimization problems are equivalent!

SLIDE 14

Dual SVM derivation (3) – the linearly separable case (hard margin SVM)

Can solve for optimal w, b as function of α:

∂L/∂w = w − Σⱼ αⱼyⱼxⱼ = 0   ⇒   w = Σⱼ αⱼyⱼxⱼ

∂L/∂b = −Σⱼ αⱼyⱼ = 0   ⇒   Σⱼ αⱼyⱼ = 0

Substituting these values back in (and simplifying), we obtain:

(Dual)   max_{α≥0} Σⱼ αⱼ − ½ Σᵢ Σⱼ αᵢαⱼ yᵢyⱼ (xᵢ·xⱼ)   s.t. Σⱼ αⱼyⱼ = 0

The sums run over all training examples; the inputs enter only through the dot products xᵢ·xⱼ, and the αᵢ are scalars.
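One practical consequence, sketched in NumPy: since w = Σⱼ αⱼyⱼxⱼ, a new point can be classified using only dot products with the training inputs (the function name is ours):

    import numpy as np

    def dual_predict(alpha, y, X_train, b, x):
        """sign(sum_j alpha_j y_j (x_j . x) + b): training inputs enter
        only through the dot products x_j . x."""
        return np.sign(np.sum(alpha * y * (X_train @ x)) + b)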

SLIDE 15

Reminder: What if the data is not linearly separable?

Use features of features of features of features….

Feature space can get really large really quickly!

φ(x) = ( x(1), …, x(n), x(1)x(2), x(1)x(3), …, e^{x(1)}, … )ᵀ

SLIDE 16

Higher order polynomials

[Plot: number of monomial terms vs. number of input dimensions, for degrees d = 2, 3, 4]

m = number of input features, d = degree of polynomial. The number of terms grows fast! For d = 6 and m = 100: about 1.6 billion terms.
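The count is a standard stars-and-bars calculation, which a few lines of Python confirm (num_monomials is our name):

    from math import comb

    def num_monomials(m, d):
        """Number of monomials of degree exactly d in m variables:
        C(m + d - 1, d), i.e. multisets of size d drawn from m symbols."""
        return comb(m + d - 1, d)

    print(num_monomials(100, 6))   # 1609344100 -- about 1.6 billion terms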

SLIDE 17

Dual formulation only depends on dot-products of the features!

First, we introduce a feature mapping φ and replace each dot product between examples with φ(xᵢ)·φ(xⱼ). Next, replace the dot product with an equivalent kernel function:

K(xᵢ, xⱼ) = φ(xᵢ)·φ(xⱼ)
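In code, the substitution only changes how the Gram matrix is filled; here is a sketch of the kernelized dual objective (the names are ours):

    import numpy as np

    def dual_objective(alpha, y, K):
        """Dual SVM objective with Gram matrix K[i, j] = K(x_i, x_j):
        sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j y_i y_j K[i, j]."""
        return alpha.sum() - 0.5 * alpha @ ((np.outer(y, y) * K) @ alpha)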

SLIDE 18

Polynomial kernel

Polynomials of degree exactly d

d = 1:

φ(u)·φ(v) = (u₁, u₂)·(v₁, v₂) = u₁v₁ + u₂v₂ = u·v

d = 2:

φ(u)·φ(v) = (u₁², u₁u₂, u₂u₁, u₂²)·(v₁², v₁v₂, v₂v₁, v₂²)
          = u₁²v₁² + 2u₁v₁u₂v₂ + u₂²v₂²
          = (u₁v₁ + u₂v₂)²
          = (u·v)²

For any d (we will skip the proof):

φ(u)·φ(v) = (u·v)^d
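The d = 2 identity is easy to spot-check numerically (the test vectors are arbitrary):

    import numpy as np

    def phi2(x):
        """Explicit degree-2 feature map from the slide: (x1^2, x1x2, x2x1, x2^2)."""
        return np.array([x[0]**2, x[0]*x[1], x[1]*x[0], x[1]**2])

    u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    print(phi2(u) @ phi2(v))       # 1.0
    print((u @ v) ** 2)            # 1.0 -- same value, no explicit phi needed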

SLIDE 19

Common kernels

  • Polynomials of degree exactly d
  • Polynomials of degree up to d
  • Gaussian kernels
  • Sigmoid
  • And many others: very active area of research!
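For reference, minimal implementations of the kernels listed above in their standard textbook forms (the hyperparameter names are ours):

    import numpy as np

    def poly_exact(u, v, d):            # polynomials of degree exactly d
        return (u @ v) ** d

    def poly_up_to(u, v, d):            # polynomials of degree up to d
        return (u @ v + 1) ** d

    def gaussian(u, v, sigma):          # Gaussian (RBF) kernel
        return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

    def sigmoid_kernel(u, v, eta, nu):  # sigmoid kernel
        return np.tanh(eta * (u @ v) + nu)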