Support Vector Machines Part 1 CS 760@UW-Madison
Goals for the lecture
you should understand the following concepts
• the margin
• the linear support vector machine
• the primal and dual formulations of SVM learning
• support vectors
• Optional: variants of SVM
• Optional: Lagrange multipliers
Motivation
Linear classification
$(w^*)^\top x = 0$ (the separating hyperplane, with normal vector $w^*$)
$(w^*)^\top x > 0$: Class +1
$(w^*)^\top x < 0$: Class -1
Assume perfect separation between the two classes
Attempt
• Given training data $(x_i, y_i), 1 \le i \le n$, i.i.d. from distribution $D$
• Hypothesis $y = \text{sign}(f_w(x)) = \text{sign}(w^\top x)$
  • $y = +1$ if $w^\top x > 0$
  • $y = -1$ if $w^\top x < 0$
• Let's assume that we can optimize to find $w$
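As a quick illustration (not from the slides), the hypothesis $y = \text{sign}(w^\top x)$ is a one-liner in numpy; the weight vector and toy points below are made up for the example:

```python
import numpy as np

def predict(w, X):
    """Linear classifier: y = sign(w^T x) for each row x of X."""
    return np.sign(X @ w)

# Hypothetical weight vector and toy points, chosen so the signs are easy to check
w = np.array([1.0, -1.0])
X = np.array([[2.0, 1.0],    # w^T x =  1 > 0  ->  predict +1
              [1.0, 3.0]])   # w^T x = -2 < 0  ->  predict -1
print(predict(w, X))         # [ 1. -1.]
```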
Multiple optimal solutions?
(figure: Class +1 and Class -1 separated by three candidate hyperplanes $w_1$, $w_2$, $w_3$)
Same on empirical loss; different on test/expected loss
What about $w_1$? (figure: new test data near the boundary fall on the wrong side of $w_1$)
What about $w_3$? (figure: likewise fragile on new test data)
Most confident: $w_2$ (figure: new test data still classified correctly)
Intuition: margin (figure: $w_2$ leaves a large margin between Class +1 and Class -1)
Margin
Margin
We are going to prove the following expression for the margin using a geometric argument.
• Lemma 1: $x$ has distance $\frac{|f_w(x)|}{\|w\|}$ to the hyperplane $f_w(x) = w^\top x = 0$
• Lemma 2: $x$ has distance $\frac{|f_{w,b}(x)|}{\|w\|}$ to the hyperplane $f_{w,b}(x) = w^\top x + b = 0$
Need two geometric facts:
• $w$ is orthogonal to the hyperplane $f_{w,b}(x) = w^\top x + b = 0$
• Let $u$ be a direction (i.e., a unit vector). Then the length of the projection of $x$ on $u$ is $u^\top x$
Margin
• Lemma 1: $x$ has distance $\frac{|f_w(x)|}{\|w\|}$ to the hyperplane $f_w(x) = w^\top x = 0$
Proof:
• $w$ is orthogonal to the hyperplane
• The unit direction is $\frac{w}{\|w\|}$
• The projection of $x$ onto it is $\left(\frac{w}{\|w\|}\right)^\top x = \frac{f_w(x)}{\|w\|}$
Margin: with bias
• Lemma 2: $x$ has distance $\frac{|f_{w,b}(x)|}{\|w\|}$ to the hyperplane $f_{w,b}(x) = w^\top x + b = 0$
Proof:
• Write $x = x_\perp + r \frac{w}{\|w\|}$, where $x_\perp$ lies on the hyperplane; then $|r|$ is the distance
• Multiply both sides by $w^\top$ and add $b$
• Left-hand side: $w^\top x + b = f_{w,b}(x)$
• Right-hand side: $w^\top x_\perp + r \frac{w^\top w}{\|w\|} + b = 0 + r \|w\|$
• So $r = \frac{f_{w,b}(x)}{\|w\|}$, and $|r|$ is the claimed distance
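Lemma 2 can be sanity-checked numerically. This sketch (with an arbitrary $w$, $b$, and point $x$ chosen for the example) compares the formula $|f_{w,b}(x)| / \|w\|$ against the distance obtained by explicitly projecting $x$ onto the hyperplane:

```python
import numpy as np

w = np.array([3.0, 4.0])   # ||w|| = 5
b = -10.0
x = np.array([4.0, 2.0])   # f_{w,b}(x) = 3*4 + 4*2 - 10 = 10

# Lemma 2: distance = |f_{w,b}(x)| / ||w||
dist_formula = abs(w @ x + b) / np.linalg.norm(w)   # 10 / 5 = 2

# Geometric check: x - r*w lies on the hyperplane for r = (w^T x + b) / (w^T w);
# the distance is then the length of the step we took.
r = (w @ x + b) / (w @ w)
x_perp = x - r * w
assert abs(w @ x_perp + b) < 1e-12         # x_perp is on the hyperplane
dist_geometric = np.linalg.norm(x - x_perp)
print(dist_formula, dist_geometric)        # both 2.0
```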
Margin: with bias
The notation here is: $y(x) = w^\top x + w_0$
Figure from Pattern Recognition and Machine Learning, Bishop
Support Vector Machine (SVM)
SVM: objective
• Absolute margin over all training data points:
$\gamma = \min_i \frac{|f_{w,b}(x_i)|}{\|w\|}$
• Since we only want correct $f_{w,b}$, and recall $y_i \in \{+1, -1\}$, we define the margin to be
$\gamma = \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|}$
• If $f_{w,b}$ is incorrect on some $x_i$, the margin is negative
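As an illustrative sketch (toy data assumed), this margin definition translates directly into numpy:

```python
import numpy as np

def margin(w, b, X, y):
    """gamma = min_i y_i (w^T x_i + b) / ||w||; negative if any point is misclassified."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

# Toy dataset: two positives and one negative, separated by w^T x = 0
X = np.array([[2.0, 2.0], [3.0, 0.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
w = np.array([1.0, 1.0]); b = 0.0
print(margin(w, b, X, y))   # closest point is (-1,-1): 2/sqrt(2) ~ 1.414
```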
SVM: objective
• Maximize the margin over all training data points:
$\max_{w,b} \gamma = \max_{w,b} \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|} = \max_{w,b} \min_i \frac{y_i (w^\top x_i + b)}{\|w\|}$
• A bit complicated …
SVM: simplified objective
• Observation: when $(w, b)$ is scaled by a factor $c$, the margin is unchanged:
$\frac{y_i (c w^\top x_i + c b)}{\|c w\|} = \frac{y_i (w^\top x_i + b)}{\|w\|}$
• Let's consider a fixed scale such that $y_{i^*} (w^\top x_{i^*} + b) = 1$, where $x_{i^*}$ is the point closest to the hyperplane
SVM: simplified objective
• Let's consider a fixed scale such that $y_{i^*} (w^\top x_{i^*} + b) = 1$, where $x_{i^*}$ is the point closest to the hyperplane
• Now we have for all data $y_i (w^\top x_i + b) \ge 1$, and for at least one $i$ the equality holds
• Then the margin over all training points is $\frac{1}{\|w\|}$
SVM: simplified objective
• Optimization simplified to
$\min_{w,b} \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1, \; \forall i$
• How to find the optimum $\hat{w}^*$?
• Solved by the Lagrange multiplier method
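For intuition, the primal QP above can be solved on a toy dataset with an off-the-shelf solver; this sketch uses scipy's SLSQP method (the dataset and starting point are assumptions for the example):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable dataset (an assumption for this illustration)
X = np.array([[2.0, 2.0], [2.0, 0.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Pack the decision variables as theta = (w_1, w_2, b)
objective = lambda t: 0.5 * (t[0]**2 + t[1]**2)            # (1/2)||w||^2
constraints = [{'type': 'ineq',                            # y_i (w^T x_i + b) - 1 >= 0
                'fun': (lambda t, i=i: y[i] * (X[i] @ t[:2] + t[2]) - 1.0)}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), method='SLSQP', constraints=constraints)
w, b = res.x[:2], res.x[2]
print(w.round(3), round(b, 3))   # roughly w = (1, 0), b = -1: the hyperplane x_1 = 1
```

The closest positive point is $(2,0)$ and the closest negative point is $(0,0)$, so the max-margin hyperplane sits halfway between them at $x_1 = 1$ with margin $1/\|w\| = 1$.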
SVM: optimization
SVM: optimization
• Optimization (Quadratic Programming):
$\min_{w,b} \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1, \; \forall i$
• Generalized Lagrangian:
$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \frac{1}{2} \|w\|^2 - \sum_i \alpha_i [y_i (w^\top x_i + b) - 1]$
where $\boldsymbol{\alpha}$ is the vector of Lagrange multipliers
SVM: optimization
• KKT conditions:
$\frac{\partial \mathcal{L}}{\partial w} = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i$ (1)
$\frac{\partial \mathcal{L}}{\partial b} = 0 \;\Rightarrow\; 0 = \sum_i \alpha_i y_i$ (2)
• Plug into $\mathcal{L}$:
$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,k} \alpha_i \alpha_k y_i y_k x_i^\top x_k$ (3)
combined with $0 = \sum_i \alpha_i y_i$, $\alpha_i \ge 0$
SVM: optimization
• Reduces to the dual problem:
$\max_{\boldsymbol{\alpha}} \sum_i \alpha_i - \frac{1}{2} \sum_{i,k} \alpha_i \alpha_k y_i y_k x_i^\top x_k \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \; \alpha_i \ge 0$
Only depends on inner products!
• Since $w = \sum_i \alpha_i y_i x_i$, we have $w^\top x + b = \sum_i \alpha_i y_i x_i^\top x + b$
Support Vectors
• the final solution is a sparse linear combination of the training instances
• those instances with $\alpha_i > 0$ are called support vectors
• they lie on the margin boundary
• the solution is NOT changed if we delete the instances with $\alpha_i = 0$
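The sparsity of the dual solution can be seen on a toy example. The sketch below solves the dual with scipy's SLSQP (dataset assumed for illustration), then recovers $w$, $b$, and the support vectors from $\boldsymbol{\alpha}$:

```python
import numpy as np
from scipy.optimize import minimize

# Toy dataset (assumed): solve the dual and read off the support vectors
X = np.array([[2.0, 2.0], [2.0, 0.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
G = (y[:, None] * X) @ (y[:, None] * X).T       # G_ik = y_i y_k x_i^T x_k

neg_dual = lambda a: 0.5 * a @ G @ a - a.sum()  # minimize the negated dual objective
res = minimize(neg_dual, x0=np.zeros(len(y)), method='SLSQP',
               bounds=[(0, None)] * len(y),                           # alpha_i >= 0
               constraints=[{'type': 'eq', 'fun': lambda a: a @ y}])  # sum_i alpha_i y_i = 0
alpha = res.x

w = (alpha * y) @ X                   # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                     # support vectors: alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)        # each support vector satisfies y_i (w^T x_i + b) = 1
print(alpha.round(3), w.round(3), round(b, 3))
# alpha is roughly (0, 0.5, 0.5, 0): only (2,0) and (0,0) end up as support vectors,
# and deleting the other two points would not change (w, b)
```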
Optional: Lagrange Multiplier
Lagrangian
• Consider the optimization problem:
$\min_x f(x) \quad \text{s.t.} \quad h_j(x) = 0, \; \forall 1 \le j \le l$
• Lagrangian:
$\mathcal{L}(x, \boldsymbol{\beta}) = f(x) + \sum_j \beta_j h_j(x)$
where the $\beta_j$'s are called Lagrange multipliers
Lagrangian
• Consider the optimization problem:
$\min_x f(x) \quad \text{s.t.} \quad h_j(x) = 0, \; \forall 1 \le j \le l$
• Solved by setting the derivatives of the Lagrangian to 0:
$\frac{\partial \mathcal{L}}{\partial x_i} = 0; \quad \frac{\partial \mathcal{L}}{\partial \beta_j} = 0$
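A worked example (not from the slides): minimize $f(x) = x_1^2 + x_2^2$ subject to $h(x) = x_1 + x_2 - 1 = 0$. Setting the derivatives of the Lagrangian $\mathcal{L}(x, \beta) = x_1^2 + x_2^2 + \beta (x_1 + x_2 - 1)$ to 0 gives a linear system:

```python
import numpy as np

# Stationarity conditions:
#   dL/dx_1 = 2 x_1 + beta = 0
#   dL/dx_2 = 2 x_2 + beta = 0
#   dL/dbeta = x_1 + x_2 - 1 = 0
# A linear system in (x_1, x_2, beta):
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
rhs = np.array([0.0, 0.0, 1.0])
x1, x2, beta = np.linalg.solve(A, rhs)
print(x1, x2, beta)   # 0.5 0.5 -1.0: the closest point on the line x_1 + x_2 = 1 to the origin
```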
Generalized Lagrangian
• Consider the optimization problem:
$\min_x f(x) \quad \text{s.t.} \quad g_i(x) \le 0, \; \forall 1 \le i \le k; \quad h_j(x) = 0, \; \forall 1 \le j \le l$
• Generalized Lagrangian:
$\mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta}) = f(x) + \sum_i \alpha_i g_i(x) + \sum_j \beta_j h_j(x)$
where the $\alpha_i$'s and $\beta_j$'s are called Lagrange multipliers
Generalized Lagrangian
• Consider the quantity:
$\theta_P(x) := \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}: \alpha_i \ge 0} \mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta})$
• Why? $\theta_P(x) = f(x)$ if $x$ satisfies all the constraints, and $\theta_P(x) = +\infty$ if $x$ does not
• So minimizing $f(x)$ is the same as minimizing $\theta_P(x)$:
$\min_x f(x) = \min_x \theta_P(x) = \min_x \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}: \alpha_i \ge 0} \mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta})$
Lagrange duality
• The primal problem:
$p^* := \min_x f(x) = \min_x \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}: \alpha_i \ge 0} \mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta})$
• The dual problem:
$d^* := \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}: \alpha_i \ge 0} \min_x \mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta})$
• Always true (weak duality): $d^* \le p^*$
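A small worked example (chosen for illustration): for $\min_x x^2$ subject to $1 - x \le 0$, the inner minimization of the Lagrangian can be done in closed form, and here the dual attains the primal value:

```python
# Primal: min_x x^2  s.t.  g(x) = 1 - x <= 0, so p* = 1 at x = 1.
# Lagrangian: L(x, alpha) = x^2 + alpha (1 - x), alpha >= 0.
# Inner minimization over x: dL/dx = 2x - alpha = 0  ->  x = alpha / 2,
# giving the dual function d(alpha) = alpha - alpha^2 / 4.
def dual(alpha):
    return alpha - alpha**2 / 4.0

# Maximize over alpha >= 0 on a grid: d* = 1 at alpha = 2, so d* = p*
# (the problem is convex and strictly feasible, e.g. x = 2).
alphas = [i / 100.0 for i in range(0, 401)]
d_star = max(dual(a) for a in alphas)
print(d_star)   # 1.0, attained at alpha = 2
```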
Lagrange duality
• The primal problem:
$p^* := \min_x f(x) = \min_x \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}: \alpha_i \ge 0} \mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta})$
• The dual problem:
$d^* := \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}: \alpha_i \ge 0} \min_x \mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta})$
• Interesting case: when do we have $d^* = p^*$?
Lagrange duality
• Theorem: under proper conditions, there exist $x^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*$ such that
$d^* = \mathcal{L}(x^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*) = p^*$
• Moreover, $x^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*$ satisfy the Karush-Kuhn-Tucker (KKT) conditions:
$\frac{\partial \mathcal{L}}{\partial x_i} = 0$ (stationarity)
$\alpha_i g_i(x) = 0$ (dual complementarity)
$g_i(x) \le 0, \; h_j(x) = 0$ (primal constraints)
$\alpha_i \ge 0$ (dual constraints)
Lagrange duality
• What are the proper conditions?
• One set of conditions (Slater's conditions):
• $f$ and the $g_i$ convex, the $h_j$ affine, and there exists $x$ satisfying all $g_i(x) < 0$
• There exist other sets of conditions
• Check textbooks, e.g., Convex Optimization by Boyd and Vandenberghe
Optional: Variants of SVM
Hard-margin SVM
• Optimization (Quadratic Programming):
$\min_{w,b} \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1, \; \forall i$