SLIDE 1 Maria-Florina Balcan
03/25/2015
- Support Vector Machines (SVMs).
- Semi-Supervised SVMs.
- Semi-Supervised Learning.
SLIDE 2
Support Vector Machines (SVMs).
One of the most theoretically well motivated and practically most effective classification algorithms in machine learning.
Directly motivated by Margins and Kernels!
SLIDE 3 Geometric Margin
Definition: The margin of example x w.r.t. a linear separator w is the distance from x to the plane w · x = 0.
If ||w|| = 1, the margin of x w.r.t. w is |x · w|.
[Figure: margins of examples x1 and x2 w.r.t. the separator w]
WLOG homogeneous linear separators [w0 = 0].
SLIDE 4 Geometric Margin
Definition: The margin of example x w.r.t. a linear separator w is the distance from x to the plane w · x = 0.
Definition: The margin γ_w of a set of examples S w.r.t. a linear separator w is the smallest margin over points x ∈ S.
Definition: The margin γ of a set of examples S is the maximum γ_w over all linear separators w.
[Figure: a set of positive and negative examples separated with margin γ]
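To make these definitions concrete, here is a minimal numpy sketch that computes the margin of a small labeled set w.r.t. a fixed separator; the data and the choice of w are made up for illustration.

```python
import numpy as np

# Illustrative labeled examples (labels in {-1, +1}) and a candidate homogeneous separator w.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([1.0, 1.0])

# Margin of each example x_i w.r.t. w: its distance to the plane w . x = 0
# (signed so that correctly classified points get positive margin).
per_example = y * (X @ w) / np.linalg.norm(w)

# Margin gamma_w of the set S w.r.t. w: the smallest margin over points in S.
gamma_w = per_example.min()
print("per-example margins:", per_example, "margin of S w.r.t. w:", gamma_w)
```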
SLIDE 5 Margin: Important Theme in ML
Both sample complexity and algorithmic implications.
Sample/Mistake Bound complexity:
- If the margin is large, the number of mistakes the Perceptron makes is small (independent of the dimension of the space)!
- If the margin γ is large and the algorithm produces a large margin classifier, then the amount of data needed depends only on R/γ [Bartlett & Shawe-Taylor '99].
Algorithmic Implications:
- Suggests searching for a large margin classifier… SVMs
[Figure: a set of examples separated with margin γ]
SLIDE 6 Support Vector Machines (SVMs)
Directly optimize for the maximum margin separator: SVMs
First, assume we know a lower bound on the margin γ.
Input: γ, S={(x1, y1), …, (xm, ym)}; Output: w, a separator of margin γ over S
Find: some w where:
- ||w|| = 1
- For all i, y_i w · x_i ≥ γ
Realizable case, where the data is linearly separable by margin γ.
[Figure: a set of examples separated with margin γ]
SLIDE 7 Support Vector Machines (SVMs)
Directly optimize for the maximum margin separator: SVMs
E.g., search for the best possible γ.
Input: S={(x1, y1), …, (xm, ym)}; Output: maximum margin separator over S
Find: some w and the maximum γ where:
- ||w|| = 1
- For all i, y_i w · x_i ≥ γ
[Figure: a set of examples separated with margin γ]
SLIDE 8 Support Vector Machines (SVMs)
Directly optimize for the maximum margin separator: SVMs
Input: S={(x1, y1), …, (xm, ym)}; Maximize γ under the constraints:
- ||w|| = 1
- For all i, y_i w · x_i ≥ γ
[Figure: a set of examples separated with margin γ]
SLIDE 9 Support Vector Machines (SVMs)
Directly optimize for the maximum margin separator: SVMs
Input: S={(x1, y1), …, (xm, ym)}; Maximize γ under the constraints:
- ||w|| = 1
- For all i, y_i w · x_i ≥ γ
This is a constrained optimization problem: an objective function plus constraints.
- Famous example of constrained optimization: linear programming, where the objective fn is linear and the constraints are linear (in)equalities.
SLIDE 10 Support Vector Machines (SVMs)
Directly optimize for the maximum margin separator: SVMs
Input: S={(x1, y1), …, (xm, ym)}; Maximize γ under the constraints:
- ||w|| = 1
- For all i, y_i w · x_i ≥ γ
The constraint ||w|| = 1 is non-linear.
In fact, it's even non-convex: if w1 and w2 both satisfy ||w|| = 1, their midpoint (w1 + w2)/2 in general does not.
[Figure: a set of examples separated with margin γ]
SLIDE 11 Support Vector Machines (SVMs)
Directly optimize for the maximum margin separator: SVMs
Input: S={(x1, y1), …, (xm, ym)}; Maximize γ under the constraints:
- ||w|| = 1
- For all i, y_i w · x_i ≥ γ
Let w' = w/γ. Then maximizing γ is equivalent to minimizing ||w'||² (since ||w'||² = 1/γ²). So, dividing both sides by γ and writing in terms of w', we get:
Input: S={(x1, y1), …, (xm, ym)}; Minimize ||w'||² under the constraint:
- For all i, y_i w' · x_i ≥ 1
[Figure: the separator w' with margin lines w' · x = −1 and w' · x = 1]
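For completeness, a short worked version of this rescaling step, using only the quantities already defined above:

```latex
% Rescaling: from (||w|| = 1, y_i w.x_i >= gamma) to the norm-minimization form, with w' := w / gamma.
\[
  y_i\, w \cdot x_i \;\ge\; \gamma
  \iff
  y_i\, \frac{w}{\gamma} \cdot x_i \;\ge\; 1
  \iff
  y_i\, w' \cdot x_i \;\ge\; 1,
  \qquad
  \|w'\| \;=\; \frac{\|w\|}{\gamma} \;=\; \frac{1}{\gamma}.
\]
% Hence maximizing gamma subject to the margin constraints is the same as
% minimizing ||w'||^2 = 1/gamma^2 subject to y_i w'.x_i >= 1.
```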
SLIDE 12 Support Vector Machines (SVMs)
Directly optimize for the maximum margin separator: SVMs
Input: S={(x1, y1), …, (xm, ym)}; argmin_w ||w||² s.t.:
- For all i, y_i w · x_i ≥ 1
This is a constrained optimization problem.
- The objective is convex (quadratic)
- All constraints are linear
- Can solve efficiently (in poly time) using standard quadratic programming (QP) software
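For concreteness, a minimal sketch of this hard-margin QP using the cvxpy modeling library; the toy data here are made up for illustration, and any QP solver would do.

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data (illustrative only), labels in {-1, +1}.
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(X.shape[1])

# Hard-margin SVM: minimize ||w||^2  s.t.  y_i (w . x_i) >= 1 for all i.
constraints = [cp.multiply(y, X @ w) >= 1]
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
problem.solve()

w_hat = w.value
print("w:", w_hat, "geometric margin:", 1.0 / np.linalg.norm(w_hat))
```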
SLIDE 13 Support Vector Machines (SVMs)
Question: what if the data isn't perfectly linearly separable?
Issue 1: we now have two objectives:
- maximize the margin
- minimize the # of misclassifications.
Ans 1: Let's optimize their sum: minimize
||w||² + C(# misclassifications)
where C is some tradeoff constant.
Issue 2: This is computationally hard (NP-hard) [Guruswami-Raghavendra '06], even if we didn't care about the margin and only minimized the # of mistakes.
[Figure: points with margin lines w · x = −1 and w · x = 1]
SLIDE 14 Support Vector Machines (SVMs) [same content as Slide 13]
SLIDE 15 Support Vector Machines (SVMs)
Question: what if the data isn't perfectly linearly separable?
Replace "# mistakes" with an upper bound called "hinge loss".
Previous formulation (hard margin):
Input: S={(x1, y1), …, (xm, ym)}; Minimize ||w'||² under the constraint:
- For all i, y_i w' · x_i ≥ 1
New formulation (soft margin):
Input: S={(x1, y1), …, (xm, ym)}; argmin_{w, ξ1, …, ξm} ||w||² + C Σ_i ξ_i s.t.:
- For all i, y_i w · x_i ≥ 1 − ξ_i
- ξ_i ≥ 0 (the ξ_i are "slack variables")
[Figure: the hard-margin and soft-margin separators with lines w · x = −1 and w · x = 1]
SLIDE 16 Support Vector Machines (SVMs)
Question: what if the data isn't perfectly linearly separable?
Replace "# mistakes" with an upper bound called "hinge loss":
ℓ(w, x, y) = max(0, 1 − y w · x)
Input: S={(x1, y1), …, (xm, ym)}; argmin_{w, ξ1, …, ξm} ||w||² + C Σ_i ξ_i s.t.:
- For all i, y_i w · x_i ≥ 1 − ξ_i
- ξ_i ≥ 0 (the ξ_i are "slack variables")
C controls the relative weighting between the twin goals of making ||w||² small (margin is large) and ensuring that most examples have functional margin ≥ 1.
[Figure: points with margin lines w · x = −1 and w · x = 1]
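A minimal sketch of the soft-margin objective in the same style (again with cvxpy and made-up toy data; C is the tradeoff constant from the slide):

```python
import cvxpy as cp
import numpy as np

# Toy data with one point on the wrong side (illustrative only), labels in {-1, +1}.
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.5], [0.5, 0.2]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
C = 1.0

w = cp.Variable(X.shape[1])
xi = cp.Variable(X.shape[0], nonneg=True)  # slack variables xi_i >= 0

# Soft-margin SVM: min ||w||^2 + C * sum_i xi_i  s.t.  y_i (w . x_i) >= 1 - xi_i.
constraints = [cp.multiply(y, X @ w) >= 1 - xi]
problem = cp.Problem(cp.Minimize(cp.sum_squares(w) + C * cp.sum(xi)), constraints)
problem.solve()

# At the optimum each xi_i equals the hinge loss max(0, 1 - y_i w . x_i).
print("hinge losses:", np.maximum(0.0, 1.0 - y * (X @ w.value)))
```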
SLIDE 17 Support Vector Machines (SVMs)
Question: what if the data isn't perfectly linearly separable?
Replace the number of mistakes in ||w||² + C(# misclassifications) with the hinge loss upper bound:
ℓ(w, x, y) = max(0, 1 − y w · x)
Input: S={(x1, y1), …, (xm, ym)}; argmin_{w, ξ1, …, ξm} ||w||² + C Σ_i ξ_i s.t.:
- For all i, y_i w · x_i ≥ 1 − ξ_i
- ξ_i ≥ 0
[Figure: points with margin lines w · x = −1 and w · x = 1]
SLIDE 18 Support Vector Machines (SVMs)
Question: what if the data isn't perfectly linearly separable?
Replace "# mistakes" with an upper bound called "hinge loss":
ℓ(w, x, y) = max(0, 1 − y w · x)
Input: S={(x1, y1), …, (xm, ym)}; argmin_{w, ξ1, …, ξm} ||w||² + C Σ_i ξ_i s.t.:
- For all i, y_i w · x_i ≥ 1 − ξ_i
- ξ_i ≥ 0
Σ_i ξ_i is the total amount we have to move the points to get them on the correct side of the lines w · x = +1/−1, where the distance between the lines w · x = 0 and w · x = 1 counts as "1 unit".
[Figure: points with margin lines w · x = −1 and w · x = 1]
SLIDE 19
What if the data is far from being linearly separable?
Example: [image] vs [image]
No good linear separator in pixel representation.
SVM philosophy: "Use a Kernel".
SLIDE 20 Support Vector Machines (SVMs)
Primal form:
Input: S={(x1, y1), …, (xm, ym)}; argmin_{w, ξ1, …, ξm} ||w||² + C Σ_i ξ_i s.t.:
- For all i, y_i w · x_i ≥ 1 − ξ_i
- ξ_i ≥ 0
Which is equivalent to:
Lagrangian Dual:
Input: S={(x1, y1), …, (xm, ym)}; argmin_α (1/2) Σ_i Σ_j y_i y_j α_i α_j (x_i · x_j) − Σ_i α_i s.t.:
- For all i, 0 ≤ α_i ≤ C
- Σ_i y_i α_i = 0
SLIDE 21 SVMs (Lagrangian Dual)
Input: S={(x1, y1), …, (xm, ym)}; argmin_α (1/2) Σ_i Σ_j y_i y_j α_i α_j (x_i · x_j) − Σ_i α_i s.t.:
- For all i, 0 ≤ α_i ≤ C
- Σ_i y_i α_i = 0
- The final classifier is: w = Σ_i α_i y_i x_i
- The points x_i for which α_i ≠ 0 are called the "support vectors".
[Figure: points with margin lines w · x = −1 and w · x = 1]
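A minimal sketch of solving this dual numerically, here with scipy's general-purpose SLSQP solver and made-up toy data (in practice a dedicated QP solver or an SMO-style method would be used):

```python
import numpy as np
from scipy.optimize import minimize

# Toy data (illustrative only), labels in {-1, +1}.
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 1.0
m = len(y)

# Q_ij = y_i y_j (x_i . x_j); the dual objective is 1/2 a^T Q a - sum(a).
Q = np.outer(y, y) * (X @ X.T)
dual_objective = lambda a: 0.5 * a @ Q @ a - a.sum()

res = minimize(
    dual_objective,
    x0=np.zeros(m),
    method="SLSQP",
    bounds=[(0.0, C)] * m,                                  # 0 <= alpha_i <= C
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],   # sum_i y_i alpha_i = 0
)

alpha = res.x
w = (alpha * y) @ X                     # final classifier: w = sum_i alpha_i y_i x_i
support = np.where(alpha > 1e-6)[0]     # support vectors: alpha_i != 0
print("alpha:", alpha, "w:", w, "support vector indices:", support)
```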
SLIDE 22 Kernelizing the Dual SVMs
Input: S={(x1, y1), …, (xm, ym)}; argmin_α (1/2) Σ_i Σ_j y_i y_j α_i α_j (x_i · x_j) − Σ_i α_i s.t.:
- For all i, 0 ≤ α_i ≤ C
- Σ_i y_i α_i = 0
- Replace x_i · x_j with K(x_i, x_j).
- The final classifier is: w = Σ_i α_i y_i x_i
- The points x_i for which α_i ≠ 0 are called the "support vectors".
- With a kernel, classify x using Σ_i α_i y_i K(x, x_i).
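A sketch of the kernelized version: relative to the dual code above, the only changes are the Gram matrix and the prediction rule. The RBF kernel and its bandwidth here are illustrative choices, not prescribed by the slides.

```python
import numpy as np

# Illustrative RBF kernel; any positive semi-definite kernel K works.
def rbf_kernel(a, b, sigma=1.0):
    return np.exp(-np.linalg.norm(a - b) ** 2 / (2.0 * sigma ** 2))

# In the dual, replace x_i . x_j with K(x_i, x_j):
#   Q[i, j] = y[i] * y[j] * rbf_kernel(X[i], X[j])
# and solve for alpha exactly as before.

def predict(x_new, X, y, alpha):
    # Classify x using sign( sum_i alpha_i y_i K(x, x_i) ); no explicit w is formed.
    score = sum(a * yi * rbf_kernel(x_new, xi) for a, yi, xi in zip(alpha, y, X))
    return np.sign(score)
```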
SLIDE 23
Support Vector Machines (SVMs).
One of the most theoretically well motivated and practically most effective classification algorithms in machine learning.
Directly motivated by Margins and Kernels!
SLIDE 24 What you should know
- The importance of margins in machine learning.
- The primal form of the SVM optimization problem.
- The dual form of the SVM optimization problem.
- Kernelizing SVM.
- Think about how it's related to Regularized Logistic Regression.
SLIDE 25
Modern (Partially) Supervised Machine Learning
Interaction for Learning
SLIDE 26
Classic Paradigm Insufficient Nowadays
Modern applications: massive amounts of raw data. Only a tiny fraction can be annotated by human experts.
Billions of webpages; images; protein sequences.
SLIDE 27 Semi-Supervised Learning
[Diagram: raw data → Expert Labeler assigns labels ("face" / "not face") → labeled data → Classifier]
SLIDE 28 Active Learning
[Diagram: raw data → the learner selects examples and queries the Expert Labeler ("face" / "not face") → Classifier]
SLIDE 29 Semi-Supervised Learning
Prominent paradigm of the past 15 years in Machine Learning.
- Most applications have lots of unlabeled data, but labeled data is rare or expensive:
- Web page and document classification
- Computer Vision
- Computational Biology
- …
SLIDE 30 Semi-Supervised Learning
[Diagram: Data Source → unlabeled examples → Expert / Oracle labels a few of them → Learning Algorithm sees labeled and unlabeled examples and outputs a classifier]
Labeled examples: S_l = {(x1, y1), …, (x_{m_l}, y_{m_l})}, with x_i drawn i.i.d. from D and y_i = c*(x_i).
Unlabeled examples: S_u = {x1, …, x_{m_u}}, drawn i.i.d. from D.
Goal: output h with small error over D, where err_D(h) = Pr_{x~D}(h(x) ≠ c*(x)).
SLIDE 31 Key Insight
Semi-supervised learning: no querying, just lots of additional unlabeled data. A bit puzzling, since it is unclear what unlabeled data can do for you….
Unlabeled data is useful if we have beliefs not only about the form of the target, but also about its relationship with the underlying distribution.
SLIDE 32 Combining Labeled and Unlabeled Data
- Several methods have been developed to try to use unlabeled data to improve performance, e.g.:
  - Transductive SVM [Joachims '99]
  - Co-training [Blum & Mitchell '98]
  - Graph-based methods [B&C01], [ZGL03]
Test of time awards at ICML!
Workshops [ICML '03, ICML '05, …]
Books:
- Semi-Supervised Learning, O. Chapelle, B. Scholkopf and A. Zien (eds), MIT Press, 2006
- Introduction to Semi-Supervised Learning, Zhu & Goldberg, Morgan & Claypool, 2009
SLIDE 33 Example of "typical" assumption: Margins
- The separator goes through low density regions of the space / has large margin.
  - assume we are looking for a linear separator
  - belief: there should exist one with large separation
[Figure: labeled data only vs. the SVM separator vs. the Transductive SVM separator]
SLIDE 34 Transductive Support Vector Machines
Optimize for the separator with large margin wrt labeled and unlabeled data. [Joachims '99]
Input: S_l={(x1, y1), …, (x_{m_l}, y_{m_l})}, S_u={x1, …, x_{m_u}}
Find a labeling y_u of the unlabeled sample and w s.t. w separates both labeled and unlabeled data with maximum margin:
argmin_w ||w||² s.t.:
- y_i w · x_i ≥ 1, for all i ∈ {1, …, m_l}
- y_u w · x_u ≥ 1, for all u ∈ {1, …, m_u}
- y_u ∈ {−1, 1} for all u ∈ {1, …, m_u}
[Figure: separator with margin lines w · x = −1 and w · x = 1]
SLIDE 35 Transductive Support Vector Machines
Optimize for the separator with large margin wrt labeled and unlabeled data. [Joachims '99]
Input: S_l={(x1, y1), …, (x_{m_l}, y_{m_l})}, S_u={x1, …, x_{m_u}}
Find a labeling y_u of the unlabeled sample and w s.t. w separates both labeled and unlabeled data with maximum margin:
argmin_w ||w||² + C Σ_i ξ_i + C Σ_u ξ_u s.t.:
- y_i w · x_i ≥ 1 − ξ_i, for all i ∈ {1, …, m_l}
- y_u w · x_u ≥ 1 − ξ_u, for all u ∈ {1, …, m_u}
- y_u ∈ {−1, 1} for all u ∈ {1, …, m_u}
[Figure: separator with margin lines w · x = −1 and w · x = 1]
SLIDE 36 Transductive Support Vector Machines
Optimize for the separator with large margin wrt labeled and unlabeled data.
Input: S_l={(x1, y1), …, (x_{m_l}, y_{m_l})}, S_u={x1, …, x_{m_u}}
argmin_w ||w||² + C Σ_i ξ_i + C Σ_u ξ_u s.t.:
- y_i w · x_i ≥ 1 − ξ_i, for all i ∈ {1, …, m_l}
- y_u w · x_u ≥ 1 − ξ_u, for all u ∈ {1, …, m_u}
- y_u ∈ {−1, 1} for all u ∈ {1, …, m_u}
NP-hard….. Convex only after you have guessed the labels… too many possible guesses…
SLIDE 37 Transductive Support Vector Machines
Optimize for the separator with large margin wrt labeled and unlabeled data.
Heuristic (Joachims), high-level idea:
- First maximize the margin over the labeled points.
- Use this separator to give initial labels to the unlabeled points.
- Try flipping labels of unlabeled points to see if doing so can increase the margin.
Keep going until no more improvements. Finds a locally-optimal solution.
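A highly simplified sketch of this label-flipping idea, using scikit-learn's LinearSVC as the inner solver. This is an illustrative toy version (the function name, the objective re-evaluation, and the greedy single-point flips are ours), not Joachims' actual TSVM implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC

def tsvm_flip_heuristic(X_l, y_l, X_u, C=1.0, max_iters=20):
    """Toy sketch: fit an SVM on labeled data, label the unlabeled data, then flip labels greedily."""
    # Step 1: maximize the margin over the labeled points only (labels in {-1, +1}).
    clf = LinearSVC(C=C).fit(X_l, y_l)
    # Step 2: initial labels for the unlabeled points from this separator.
    y_u = clf.predict(X_u)

    def objective(y_u_guess):
        # Soft-margin objective on labeled + currently guessed unlabeled data.
        X = np.vstack([X_l, X_u])
        y = np.concatenate([y_l, y_u_guess])
        c = LinearSVC(C=C).fit(X, y)
        margins = y * c.decision_function(X)
        return 0.5 * np.sum(c.coef_ ** 2) + C * np.sum(np.maximum(0.0, 1.0 - margins))

    best = objective(y_u)
    # Step 3: flip labels of unlabeled points while doing so improves the objective.
    for _ in range(max_iters):
        improved = False
        for j in range(len(y_u)):
            y_try = y_u.copy()
            y_try[j] = -y_try[j]
            val = objective(y_try)
            if val < best:
                best, y_u, improved = val, y_try, True
        if not improved:
            break
    return y_u
```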
SLIDE 38
Experiments [Joachims '99]
SLIDE 39 Transductive Support Vector Machines
[Figure: a helpful distribution (highly compatible: the large-margin partition agrees with the labels) vs. non-helpful distributions (1/γ² clusters, all partitions separable by large margin)]
SLIDE 40 Semi-Supervised Learning
Prominent paradigm of the past 15 years in Machine Learning.
Key Insight: Unlabeled data is useful if we have beliefs not only about the form of the target, but also about its relationship with the underlying distribution.
Prominent techniques:
- Transductive SVM [Joachims '99]
- Co-training [Blum & Mitchell '98]
- Graph-based methods [B&C01], [ZGL03]