Support Vector Machines Part 1
CS 760@UW-Madison
Goals for the lecture
you should understand the following concepts:
- the margin
- the linear support vector machine
- the primal and dual formulations of SVM learning
- support vectors
Training data (x_i, y_i), 1 ≤ i ≤ n, drawn i.i.d. from a distribution D.
Hypothesis: y = sign(wᵀx).
[Figure: two classes (+1 and -1) separated by the hyperplane (w*)ᵀx = 0, with (w*)ᵀx > 0 on the Class +1 side and (w*)ᵀx < 0 on the Class -1 side.]
Assume perfect separation between the two classes.
[Figure: the same training data separated by three different hyperplanes w_1, w_2, w_3. Same empirical loss; different test/expected loss.]
[Figures: w_1, w_3, and w_2 each shown on new test data.]
[Figure: w_2, the separator with a large margin.]
We are going to derive the expression for the margin using a geometric argument. We need two geometric facts:
Fact 1: a point x has distance |f_w(x)| / ||w|| to the hyperplane f_w(x) = wᵀx = 0.
Fact 2: a point x has distance |f_{w,b}(x)| / ||w|| to the hyperplane f_{w,b}(x) = wᵀx + b = 0.
Recall that the (signed) length of the projection of x onto a unit vector u is uᵀx.
Fact 1: a point x has distance |f_w(x)| / ||w|| to the hyperplane f_w(x) = wᵀx = 0.
Proof: the unit normal of the hyperplane is w/||w||. The signed distance from x to the hyperplane is the projection of x onto this unit normal, (w/||w||)ᵀx = wᵀx / ||w|| = f_w(x) / ||w||; its absolute value |f_w(x)| / ||w|| is the distance.
Fact 2: a point x has distance |f_{w,b}(x)| / ||w|| to the hyperplane f_{w,b}(x) = wᵀx + b = 0.
Proof: write x = x_p + r · w/||w||, where x_p is the projection of x onto the hyperplane; then |r| is the distance. Since wᵀx_p + b = 0, we get wᵀx + b = (wᵀx_p + b) + r · wᵀw/||w|| = r ||w||, so r = (wᵀx + b) / ||w|| = f_{w,b}(x) / ||w||, and the distance is |f_{w,b}(x)| / ||w||.
The notation in the figure below is y(x) = wᵀx + w₀.
Figure from Pattern Recognition and Machine Learning, Bishop
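A quick numeric check of these two facts, as a minimal sketch assuming NumPy (the vectors are made up for illustration):

```python
import numpy as np

w = np.array([3.0, 4.0])   # normal vector of the hyperplane
b = 2.0                    # offset
x = np.array([1.0, 1.0])   # an arbitrary point

# Fact 1: distance of x to the hyperplane w^T x = 0
dist_no_offset = abs(w @ x) / np.linalg.norm(w)

# Fact 2: distance of x to the hyperplane w^T x + b = 0
dist_with_offset = abs(w @ x + b) / np.linalg.norm(w)

print(dist_no_offset, dist_with_offset)   # 1.4 and 1.8
```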
The margin is the distance from the hyperplane to the closest training point:
γ = min_i |f_{w,b}(x_i)| / ||w||
More precisely, given the classifier (w, b), and recalling that y_i ∈ {+1, -1}, we define the margin to be
γ = min_i y_i f_{w,b}(x_i) / ||w||
If (w, b) is incorrect on some x_i, the margin is negative.
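A minimal sketch of computing this margin for a given (w, b), assuming NumPy and a small made-up dataset:

```python
import numpy as np

def margin(w, b, X, y):
    """Margin of the classifier (w, b) on data X (n x d) with labels y in {+1, -1}.
    Negative if (w, b) misclassifies some instance."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

# Made-up toy data: two points per class.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])
print(margin(np.array([1.0, 1.0]), 0.0, X, y))   # 4 / sqrt(2), distance of the closest point
```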
Maximum-margin objective:
max_{w,b} γ = max_{w,b} min_i y_i f_{w,b}(x_i) / ||w|| = max_{w,b} min_i y_i (wᵀx_i + b) / ||w||
The margin is unchanged if (w, b) is rescaled by any c > 0: y_i (c wᵀx_i + c b) / ||c w|| = y_i (wᵀx_i + b) / ||w||.
So we can fix the scale by requiring y_{i*} (wᵀx_{i*} + b) = 1, where x_{i*} is the point closest to the hyperplane.
Then y_i (wᵀx_i + b) ≥ 1 for all i, and equality holds for at least one i.
Under this normalization the margin is 1/||w||.
Maximizing 1/||w|| is equivalent to the primal formulation:
min_{w,b} (1/2) ||w||²
s.t. y_i (wᵀx_i + b) ≥ 1, ∀i
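This is a quadratic program, so an off-the-shelf solver can handle it directly; a minimal sketch assuming the cvxpy package and made-up toy data:

```python
import cvxpy as cp
import numpy as np

# Made-up, linearly separable toy data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, d = X.shape

w = cp.Variable(d)
b = cp.Variable()

# min (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
```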
How do we solve for the optimal w*? Recall the problem:
min_{w,b} (1/2) ||w||²
s.t. y_i (wᵀx_i + b) ≥ 1, ∀i
The Lagrangian is
L(w, b, α) = (1/2) ||w||² - Σ_i α_i [y_i (wᵀx_i + b) - 1]
where the α_i are the Lagrange multipliers.
Setting the derivatives to zero:
∂L/∂w = 0  ⇒  w = Σ_i α_i y_i x_i    (1)
∂L/∂b = 0  ⇒  0 = Σ_i α_i y_i    (2)
Substituting (1) and (2) back into the Lagrangian gives
L(w, b, α) = Σ_i α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j x_iᵀx_j    (3)
combined with the constraints 0 = Σ_i α_i y_i and α_i ≥ 0.
This gives the dual problem:
max_α Σ_i α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j x_iᵀx_j
s.t. Σ_i α_i y_i = 0, α_i ≥ 0
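A minimal sketch of solving this dual numerically (cvxpy assumed, same made-up toy data), then recovering w from equation (1) and b from a support vector:

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

alpha = cp.Variable(n)

# max  sum_i alpha_i - (1/2) sum_ij alpha_i alpha_j y_i y_j x_i^T x_j
# (the quadratic term equals ||X^T (alpha * y)||^2)
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y)))
constraints = [cp.sum(cp.multiply(alpha, y)) == 0, alpha >= 0]
cp.Problem(objective, constraints).solve()

w = X.T @ (alpha.value * y)          # equation (1): w = sum_i alpha_i y_i x_i
sv = alpha.value > 1e-6              # support vectors have alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)       # from the active constraints y_i (w^T x_i + b) = 1
print(alpha.value, w, b)
```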
The solution is w = Σ_i α_i y_i x_i, and the prediction on a new x is sign(Σ_i α_i y_i x_iᵀx + b).
Both the dual problem and the prediction only depend on inner products between instances.
The training instances with α_i > 0 are called support vectors; they lie on the margin boundary, i.e., y_i (wᵀx_i + b) = 1.
If we delete the instances with α_i = 0, the solution is unchanged: it is determined by the support vectors alone, not by the other instances.
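Library SVM solvers expose exactly these quantities; a sketch using scikit-learn's SVC, where a very large C approximates the hard-margin case (toy data made up):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard margin

print(clf.support_)                # indices of the support vectors (alpha_i > 0)
print(clf.support_vectors_)        # the support vectors themselves
print(clf.dual_coef_)              # alpha_i * y_i for each support vector
print(clf.coef_, clf.intercept_)   # w and b for the linear kernel
```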
More generally, consider constrained optimization with equality constraints:
min_x f(x)
s.t. h_i(x) = 0, 1 ≤ i ≤ l
The Lagrangian is L(x, β) = f(x) + Σ_i β_i h_i(x), where the β_i are called Lagrange multipliers.
To solve, set the partial derivatives to zero: ∂L/∂x_i = 0 and ∂L/∂β_i = 0.
With both inequality and equality constraints:
min_x f(x)
s.t. g_i(x) ≤ 0, 1 ≤ i ≤ k
     h_i(x) = 0, 1 ≤ i ≤ l
The generalized Lagrangian is L(x, α, β) = f(x) + Σ_i α_i g_i(x) + Σ_i β_i h_i(x), where the α_i and β_i are called Lagrange multipliers.
Consider the quantity θ_P(x) := max_{α,β: α_i ≥ 0} L(x, α, β).
θ_P(x) = f(x) if x satisfies all the constraints, and +∞ if x does not.
Therefore min_x f(x) = min_x θ_P(x) = min_x max_{α,β: α_i ≥ 0} L(x, α, β).
Define the primal value p* := min_x f(x) = min_x max_{α,β: α_i ≥ 0} L(x, α, β)
and the dual value d* := max_{α,β: α_i ≥ 0} min_x L(x, α, β).
Weak duality always holds: d* ≤ p*.
When does strong duality d* = p* hold?
When it does, there exist x*, α*, β* such that p* = L(x*, α*, β*) = d*.
Moreover, x*, α*, β* satisfy the Karush-Kuhn-Tucker (KKT) conditions:
∂L/∂x_i = 0
α_i g_i(x) = 0   (dual complementarity)
g_i(x) ≤ 0, h_i(x) = 0   (primal constraints)
α_i ≥ 0   (dual constraints)
See Boyd & Vandenberghe, Convex Optimization, for more on Lagrangian duality.
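A small worked example (not from the slides) makes the primal/dual machinery concrete: take f(x) = x² with the single constraint g(x) = 1 - x ≤ 0. The Lagrangian is L(x, α) = x² + α(1 - x). Minimizing over x gives ∂L/∂x = 2x - α = 0, so x = α/2 and the dual function is α - α²/4. Maximizing over α ≥ 0 gives α* = 2 and d* = 1. The primal optimum is x* = 1 with p* = 1, so d* = p*, and the KKT conditions hold: ∂L/∂x = 0, α* g(x*) = 2 · 0 = 0, g(x*) ≤ 0, α* ≥ 0.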
Recall the hard-margin formulation:
min_{w,b} (1/2) ||w||²
s.t. y_i (wᵀx_i + b) ≥ 1, ∀i
If the data are not linearly separable, no (w, b) satisfies all the constraints and this formulation will fail.
Introduce slack variables ξ_i to tolerate errors:
min_{w,b,ξ} (1/2) ||w||² + C Σ_i ξ_i
s.t. y_i (wᵀx_i + b) ≥ 1 - ξ_i, ξ_i ≥ 0, ∀i
The parameter C trades off maximizing the margin against minimizing the slack.
Figure from Ben-Hur & Weston, Methods in Molecular Biology 2010
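A sketch of how C controls this trade-off in practice, using scikit-learn's SVC on made-up, non-separable data (the C argument corresponds to the C in the objective above):

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, non-separable toy data: one point of each class sits on the wrong side.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.5, -1.0],
              [-2.0, -2.0], [-3.0, -1.0], [1.0, 1.5]])
y = np.array([1, 1, 1, -1, -1, -1])

for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Larger C penalizes slack more heavily: fewer margin violations, smaller margin.
    print(C, clf.n_support_, clf.coef_, clf.intercept_)
```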
The slack ξ_i equals the hinge loss max(0, 1 - y_i (wᵀx_i + b)); compare with the squared loss and the cross-entropy loss.
[Figure: loss (error) when y = 1 as a function of the model output, comparing the 0/1 loss, hinge loss, and squared loss.]
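A minimal NumPy sketch of the losses in this comparison, evaluated on a grid of model outputs for a positive instance (y = 1):

```python
import numpy as np

f = np.linspace(-2, 2, 9)    # model output w^T x + b
y = 1                        # true label of a positive instance

zero_one = (np.sign(f) != y).astype(float)   # 0/1 loss
hinge = np.maximum(0.0, 1 - y * f)           # hinge loss (the SVM slack)
squared = (y - f) ** 2                       # squared loss

for row in zip(f, zero_one, hinge, squared):
    print(row)
```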
SVMs can also be applied in regression tasks.
The ε-insensitive loss function specifies that a training instance is well explained if the model's prediction is within ε of y_i.
[Figure: the ε-tube around the regression function, bounded by the lines (wᵀx + b) - y = ε and y - (wᵀx + b) = ε.]
Introduce slack variables to tolerate errors:
min_{w,b,ξ,ξ*} (1/2) ||w||² + C Σ_i (ξ_i + ξ_i*)
s.t. (wᵀx_i + b) - y_i ≤ ε + ξ_i,
     y_i - (wᵀx_i + b) ≤ ε + ξ_i*,
     ξ_i, ξ_i* ≥ 0, ∀i
The slack variables allow predictions for some training instances to be off by more than ε.
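A sketch of ε-insensitive regression with scikit-learn's SVR, whose epsilon and C arguments play the roles of ε and C above (1-D data made up):

```python
import numpy as np
from sklearn.svm import SVR

# Made-up 1-D regression data, roughly y = 2x + 1 with small noise.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

reg = SVR(kernel="linear", epsilon=0.5, C=10.0).fit(X, y)

# Instances whose prediction falls within epsilon of y_i incur no loss; the rest use slack.
print(reg.coef_, reg.intercept_)
print(reg.predict(X))
```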
Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Elad Hazan, Tom Dietterich, and Pedro Domingos.