Support Vector Machines (SVMs). Semi-Supervised Learning.


SLIDE 1

Maria-Florina Balcan

03/25/2015

  • Support Vector Machines (SVMs).
  • Semi-Supervised SVMs.
  • Semi-Supervised Learning.
SLIDE 2

Support Vector Machines (SVMs).

One of the most theoretically well motivated and practically most effective classification algorithms in machine learning.

Directly motivated by Margins and Kernels!

SLIDE 3

Geometric Margin

Definition: The margin of example x w.r.t. a linear separator w is the distance from x to the plane w β‹… x = 0.

[Figure: two examples x1 and x2, each with its margin shown as the distance to the plane w β‹… x = 0]

If β€–wβ€– = 1, the margin of x w.r.t. w is |x β‹… w|.

WLOG, we consider homogeneous linear separators [w0 = 0].

SLIDE 4

Geometric Margin

Definition: The margin of example x w.r.t. a linear separator w is the distance from x to the plane w β‹… x = 0.

Definition: The margin Ξ΄_w of a set of examples S w.r.t. a linear separator w is the smallest margin over points x ∈ S.

Definition: The margin Ξ΄ of a set of examples S is the maximum Ξ΄_w over all linear separators w.

[Figure: positive and negative examples separated by w, with margin Ξ΄ on each side]

SLIDE 5

Margin: Important Theme in ML

Both sample complexity and algorithmic implications.

Sample/Mistake Bound complexity:
  • If the margin is large, the number of mistakes Perceptron makes is small (independent of the dimension of the space)!
  • If the margin Ξ΄ is large and the algorithm produces a large margin classifier, then the amount of data needed depends only on R/Ξ΄ [Bartlett & Shawe-Taylor '99].

[Figure: positive and negative examples separated by w, with margin Ξ΄ on each side]

Algorithmic Implications: suggests searching for a large margin classifier… SVMs
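For reference, the guarantee behind the first bullet is the classic Perceptron mistake bound (Novikoff), stated here for completeness: if β€–xiβ€– ≀ R for all i and some w* with β€–w*β€– = 1 satisfies yi (w* β‹… xi) β‰₯ Ξ΄ for all i, then the Perceptron makes at most (R/Ξ΄)Β² mistakes, with no dependence on the dimension of the space.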

SLIDE 6

Support Vector Machines (SVMs)

Directly optimize for the maximum margin separator: SVMs

First, assume we know a lower bound on the margin Ξ΄.

Input: Ξ΄, S={(x1, y1), …, (xm, ym)}; Output: w, a separator of margin Ξ΄ over S

Find: some w where:
  • β€–wβ€–Β² = 1
  • For all i, yi (w β‹… xi) β‰₯ Ξ΄

Realizable case, where the data is linearly separable by margin Ξ΄.

[Figure: positive and negative examples separated by w, with margin Ξ΄ on each side]

SLIDE 7

Support Vector Machines (SVMs)

Directly optimize for the maximum margin separator: SVMs

Input: S={(x1, y1), …, (xm, ym)}; Output: maximum margin separator over S

Find: some w and maximum Ξ΄ where:
  • β€–wβ€–Β² = 1
  • For all i, yi (w β‹… xi) β‰₯ Ξ΄

E.g., search for the best possible Ξ΄.

[Figure: positive and negative examples separated by w, with margin Ξ΄ on each side]
SLIDE 8

Support Vector Machines (SVMs)

Directly optimize for the maximum margin separator: SVMs

Input: S={(x1, y1), …, (xm, ym)}; Maximize Ξ΄ under the constraints:
  • β€–wβ€–Β² = 1
  • For all i, yi (w β‹… xi) β‰₯ Ξ΄

[Figure: positive and negative examples separated by w, with margin Ξ΄ on each side]
SLIDE 9

Support Vector Machines (SVMs)

Directly optimize for the maximum margin separator: SVMs

Input: S={(x1, y1), …, (xm, ym)}; Maximize Ξ΄ under the constraints:
  • β€–wβ€–Β² = 1
  • For all i, yi (w β‹… xi) β‰₯ Ξ΄

This is a constrained optimization problem (objective function + constraints).

Famous example of constrained optimization: linear programming, where the objective function is linear and the constraints are linear (in)equalities.

SLIDE 10

Support Vector Machines (SVMs)

Directly optimize for the maximum margin separator: SVMs

Input: S={(x1, y1), …, (xm, ym)}; Maximize Ξ΄ under the constraints:
  • β€–wβ€–Β² = 1
  • For all i, yi (w β‹… xi) β‰₯ Ξ΄

This constraint is non-linear. In fact, it's even non-convex: if w1 and w2 both satisfy β€–wβ€–Β² = 1, their average (w1 + w2)/2 in general does not.

[Figure: unit vectors w1 and w2 and their average (w1 + w2)/2, which falls inside the unit sphere]

SLIDE 11

Support Vector Machines (SVMs)

Directly optimize for the maximum margin separator: SVMs

Input: S={(x1, y1), …, (xm, ym)}; Maximize Ξ΄ under the constraints:
  • β€–wβ€–Β² = 1
  • For all i, yi (w β‹… xi) β‰₯ Ξ΄

Let w' = w/Ξ΄; then maximizing Ξ΄ is equivalent to minimizing β€–w'β€–Β² (since β€–w'β€–Β² = 1/δ²). So, dividing both sides of each constraint by Ξ΄ and writing in terms of w', we get:

Input: S={(x1, y1), …, (xm, ym)}; Minimize β€–w'β€–Β² under the constraints:
  • For all i, yi (w' β‹… xi) β‰₯ 1

[Figure: the same data with separator w' and the lines w' β‹… x = βˆ’1 and w' β‹… x = 1]

SLIDE 12

Support Vector Machines (SVMs)

Directly optimize for the maximum margin separator: SVMs

Input: S={(x1, y1), …, (xm, ym)}; argminw β€–wβ€–Β² s.t.:
  • For all i, yi (w β‹… xi) β‰₯ 1

This is a constrained optimization problem.
  • The objective is convex (quadratic)
  • All constraints are linear
  • Can solve efficiently (in poly time) using standard quadratic programming (QP) software
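As a concrete illustration of the last bullet, here is a minimal sketch that feeds exactly this QP to a standard solver. The use of the cvxpy modeling library and the toy data are assumptions for illustration; the deck does not name any particular QP package.

```python
# Minimal sketch (assumptions: cvxpy is installed; X, y are hypothetical
# toy data). Solves: argmin_w ||w||^2  s.t.  y_i (w . x_i) >= 1 for all i.
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(X.shape[1])
objective = cp.Minimize(cp.sum_squares(w))      # ||w||^2: convex quadratic
constraints = [cp.multiply(y, X @ w) >= 1]      # y_i (w . x_i) >= 1: linear
cp.Problem(objective, constraints).solve()      # standard QP solve

w_opt = w.value
print("w =", w_opt, "  margin =", 1.0 / np.linalg.norm(w_opt))
```

Since this is the rescaled formulation from the previous slide, the geometric margin of the returned separator is 1/β€–wβ€– (recall β€–w'β€– = 1/Ξ΄).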

SLIDE 13

Support Vector Machines (SVMs)

Question: what if the data isn't perfectly linearly separable?

Issue 1: we now have two objectives:
  • maximize the margin
  • minimize the # of misclassifications.

Ans 1: optimize their sum: minimize
  β€–wβ€–Β² + C (# misclassifications),
where C is some tradeoff constant.

Issue 2: this is computationally hard, NP-hard [Guruswami-Raghavendra '06], even if we didn't care about the margin and just minimized the # of mistakes.

[Figure: separator w with the lines w β‹… x = βˆ’1 and w β‹… x = 1]

SLIDE 15

Support Vector Machines (SVMs)

Question: what if the data isn't perfectly linearly separable?

Replace "# mistakes" with an upper bound called the "hinge loss".

Input: S={(x1, y1), …, (xm, ym)}; argminw,ΞΎ1,…,ΞΎm β€–wβ€–Β² + C Ξ£i ΞΎi s.t.:
  • For all i, yi (w β‹… xi) β‰₯ 1 βˆ’ ΞΎi, with ΞΎi β‰₯ 0

The ΞΎi are "slack variables".

(Compare the separable case: minimize β€–w'β€–Β² s.t. for all i, yi (w' β‹… xi) β‰₯ 1.)

[Figure: separator w with the lines w β‹… x = βˆ’1 and w β‹… x = 1]

SLIDE 16

Support Vector Machines (SVMs)

Question: what if the data isn't perfectly linearly separable?

Replace "# mistakes" with an upper bound called the "hinge loss":
  l(w, x, y) = max(0, 1 βˆ’ y (w β‹… x))

Input: S={(x1, y1), …, (xm, ym)}; argminw,ΞΎ1,…,ΞΎm β€–wβ€–Β² + C Ξ£i ΞΎi s.t.:
  • For all i, yi (w β‹… xi) β‰₯ 1 βˆ’ ΞΎi, with ΞΎi β‰₯ 0 (the ΞΎi are "slack variables")

C controls the relative weighting between the twin goals of making β€–wβ€–Β² small (margin is large) and ensuring that most examples have functional margin β‰₯ 1.

[Figure: separator w with the lines w β‹… x = βˆ’1 and w β‹… x = 1]
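To make the "upper bound" claim concrete, here is a small numpy sketch (the helper names are hypothetical) checking pointwise that the hinge loss dominates the 0/1 mistake indicator:

```python
import numpy as np

def hinge_loss(w, X, y):
    """Per-example hinge loss l(w, x_i, y_i) = max(0, 1 - y_i (w . x_i)).
    At the optimum of the QP above, xi_i equals exactly this quantity."""
    return np.maximum(0.0, 1.0 - y * (X @ w))

def zero_one_mistakes(w, X, y):
    """0/1 loss: 1 if the example is misclassified, else 0."""
    return (y * (X @ w) <= 0).astype(float)

# Hinge >= 0/1 pointwise, so the sum of slacks upper-bounds the # of mistakes.
w = np.array([0.5, -0.2])
X = np.random.randn(10, 2)
y = np.sign(np.random.randn(10))
assert np.all(hinge_loss(w, X, y) >= zero_one_mistakes(w, X, y))
```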

SLIDE 17

Support Vector Machines (SVMs)

Question: what if the data isn't perfectly linearly separable?

Replace the number of mistakes with the hinge loss: in
  β€–wβ€–Β² + C (# misclassifications)
the mistake count is replaced by Ξ£i l(w, xi, yi), where l(w, x, y) = max(0, 1 βˆ’ y (w β‹… x)).

Input: S={(x1, y1), …, (xm, ym)}; argminw,ΞΎ1,…,ΞΎm β€–wβ€–Β² + C Ξ£i ΞΎi s.t.:
  • For all i, yi (w β‹… xi) β‰₯ 1 βˆ’ ΞΎi, with ΞΎi β‰₯ 0

[Figure: separator w with the lines w β‹… x = βˆ’1 and w β‹… x = 1]
SLIDE 18

Support Vector Machines (SVMs)

Question: what if the data isn't perfectly linearly separable?

Replace "# mistakes" with the upper bound "hinge loss": l(w, x, y) = max(0, 1 βˆ’ y (w β‹… x)).

Input: S={(x1, y1), …, (xm, ym)}; argminw,ΞΎ1,…,ΞΎm β€–wβ€–Β² + C Ξ£i ΞΎi s.t.:
  • For all i, yi (w β‹… xi) β‰₯ 1 βˆ’ ΞΎi, with ΞΎi β‰₯ 0

Ξ£i ΞΎi is the total amount we have to move the points to get them on the correct side of the lines w β‹… x = +1/βˆ’1, where the distance between the lines w β‹… x = 0 and w β‹… x = 1 counts as "1 unit".

[Figure: separator w with the lines w β‹… x = βˆ’1 and w β‹… x = 1]

SLIDE 19

What if the data is far from being linearly separable?

Example: [image] vs. [image]

No good linear separator in pixel representation.

SVM philosophy: "Use a Kernel"

SLIDE 20

Support Vector Machines (SVMs)

Primal form:

Input: S={(x1, y1), …, (xm, ym)}; argminw,ΞΎ1,…,ΞΎm β€–wβ€–Β² + C Ξ£i ΞΎi s.t.:
  • For all i, yi (w β‹… xi) β‰₯ 1 βˆ’ ΞΎi, with ΞΎi β‰₯ 0

Which is equivalent to:

Lagrangian Dual:

Input: S={(x1, y1), …, (xm, ym)}; argminΞ± (1/2) Ξ£i Ξ£j yi yj Ξ±i Ξ±j (xi β‹… xj) βˆ’ Ξ£i Ξ±i s.t.:
  • For all i, 0 ≀ Ξ±i ≀ C
  • Ξ£i yi Ξ±i = 0
SLIDE 21

SVMs (Lagrangian Dual)

Input: S={(x1, y1), …, (xm, ym)}; argminΞ± (1/2) Ξ£i Ξ£j yi yj Ξ±i Ξ±j (xi β‹… xj) βˆ’ Ξ£i Ξ±i s.t.:
  • For all i, 0 ≀ Ξ±i ≀ C
  • Ξ£i yi Ξ±i = 0

  • Final classifier is: w = Ξ£i Ξ±i yi xi
  • The points xi for which Ξ±i β‰  0 are called the "support vectors"

[Figure: separator w with the lines w β‹… x = βˆ’1 and w β‹… x = 1]

SLIDE 22

Kernelizing the Dual SVMs

Input: S={(x1, y1), …, (xm, ym)}; argminΞ± (1/2) Ξ£i Ξ£j yi yj Ξ±i Ξ±j (xi β‹… xj) βˆ’ Ξ£i Ξ±i s.t.:
  • For all i, 0 ≀ Ξ±i ≀ C
  • Ξ£i yi Ξ±i = 0

Replace xi β‹… xj with K(xi, xj).

  • Final classifier is: w = Ξ£i Ξ±i yi xi
  • The points xi for which Ξ±i β‰  0 are called the "support vectors"
  • With a kernel, classify x using the sign of Ξ£i Ξ±i yi K(x, xi)
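A short sketch of the kernelized prediction rule; the RBF kernel and its gamma parameter are illustrative choices, not something the slide specifies:

```python
import numpy as np

def rbf_kernel(u, v, gamma=0.5):
    # One common choice of K; any positive semidefinite kernel works here.
    return np.exp(-gamma * np.sum((u - v) ** 2))

def kernel_classify(x, X_train, y_train, alpha, gamma=0.5):
    """Classify x via sign(sum_i alpha_i y_i K(x, x_i)).
    Only support vectors (alpha_i != 0) contribute, so the sum is sparse."""
    score = sum(a * yi * rbf_kernel(x, xi, gamma)
                for a, yi, xi in zip(alpha, y_train, X_train)
                if a > 1e-6)
    return 1 if score >= 0 else -1
```

Note that w itself is never formed: the classifier is expressed entirely through kernel evaluations against the support vectors, which is what makes the kernel trick possible.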

SLIDE 23

Support Vector Machines (SVMs).

One of the most theoretically well motivated and practically most effective classification algorithms in machine learning.

Directly motivated by Margins and Kernels!

SLIDE 24

What you should know
  • The importance of margins in machine learning.
  • The primal form of the SVM optimization problem.
  • The dual form of the SVM optimization problem.
  • Kernelizing SVM.
  • Think about how it's related to Regularized Logistic Regression.

SLIDE 25

Modern (Partially) Supervised Machine Learning

Using Unlabeled Data and Interaction for Learning

SLIDE 26

Classic Paradigm Insufficient Nowadays

Modern applications: massive amounts of raw data. Only a tiny fraction can be annotated by human experts.

[Images: billions of webpages, images, protein sequences]

SLIDE 27

Semi-Supervised Learning

[Diagram: raw data β†’ expert labeler β†’ labeled data ("face" / "not face") β†’ classifier]

SLIDE 28

Active Learning

[Diagram: raw data β†’ learner queries the expert labeler for labels ("face" / "not face") on selected examples β†’ classifier]

SLIDE 29

Semi-Supervised Learning

Prominent paradigm of the past 15 years in Machine Learning.

Most applications have lots of unlabeled data, but labeled data is rare or expensive:
  • Web page, document classification
  • Computer Vision
  • Computational Biology
  • …
SLIDE 30

Semi-Supervised Learning

[Diagram: a data source provides examples; an expert/oracle labels a few; the learning algorithm receives both labeled and unlabeled examples and outputs a classifier]

Labeled examples: Sl={(x1, y1), …, (xml, yml)}, with each xi drawn i.i.d. from D and yi = c*(xi).
Unlabeled examples: Su={x1, …, xmu}, drawn i.i.d. from D.

Goal: output h with small error over D: errD(h) = Pr_{x∼D}(h(x) β‰  c*(x)).
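As a small concrete rendering of the goal, errD(h) can be estimated by the average disagreement on a fresh labeled sample (a hypothetical snippet, not part of the deck):

```python
import numpy as np

def empirical_error(h, X, y):
    """Monte Carlo estimate of err_D(h) = Pr_{x~D}[h(x) != c*(x)],
    where (X, y) is a sample with x_i ~ D and y_i = c*(x_i)."""
    return float(np.mean(h(X) != y))

# Example: a linear classifier h(x) = sign(w . x) evaluated on held-out data:
# err = empirical_error(lambda X: np.sign(X @ w), X_test, y_test)
```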

SLIDE 31

Key Insight

Semi-supervised learning: no querying, just lots of additional unlabeled data. A bit puzzling, since it is unclear what unlabeled data can do for you….

Unlabeled data is useful if we have beliefs not only about the form of the target, but also about its relationship with the underlying distribution.

SLIDE 32

Combining Labeled and Unlabeled Data

Several methods have been developed to try to use unlabeled data to improve performance, e.g.:
  – Transductive SVM [Joachims '99]
  – Co-training [Blum & Mitchell '98]
  – Graph-based methods [B&C01], [ZGL03]

Test of time awards at ICML!

Workshops [ICML '03, ICML '05, …]

Books:
  • Semi-Supervised Learning, O. Chapelle, B. Scholkopf and A. Zien (eds), MIT Press, 2006
  • Introduction to Semi-Supervised Learning, Zhu & Goldberg, Morgan & Claypool, 2009

SLIDE 33

Example of "typical" assumption: Margins

The separator goes through low density regions of the space / has large margin:
  – assume we are looking for a linear separator
  – belief: there should exist one with large separation

[Figure: three panels of + and βˆ’ points: labeled data only; the SVM separator; the Transductive SVM separator]

SLIDE 34

Transductive Support Vector Machines

Optimize for the separator with large margin wrt labeled and unlabeled data. [Joachims '99]

Input: Sl={(x1, y1), …, (xml, yml)}, Su={x1, …, xmu}

argminw β€–wβ€–Β² s.t.:
  • yi (w β‹… xi) β‰₯ 1, for all i ∈ {1, …, ml}
  • yu (w β‹… xu) β‰₯ 1, for all u ∈ {1, …, mu}
  • yu ∈ {βˆ’1, 1} for all u ∈ {1, …, mu}

Find a labeling of the unlabeled sample and a w such that w separates both labeled and unlabeled data with maximum margin.

[Figure: separator w' with the lines w' β‹… x = βˆ’1 and w' β‹… x = 1]

SLIDE 35

Transductive Support Vector Machines

Optimize for the separator with large margin wrt labeled and unlabeled data. [Joachims '99]

Input: Sl={(x1, y1), …, (xml, yml)}, Su={x1, …, xmu}

argminw β€–wβ€–Β² + C Ξ£i ΞΎi + C Ξ£u ΞΎu s.t.:
  • yi (w β‹… xi) β‰₯ 1 βˆ’ ΞΎi, for all i ∈ {1, …, ml}
  • yu (w β‹… xu) β‰₯ 1 βˆ’ ΞΎu, for all u ∈ {1, …, mu}
  • yu ∈ {βˆ’1, 1} for all u ∈ {1, …, mu}

Find a labeling of the unlabeled sample and a w such that w separates both labeled and unlabeled data with maximum margin.

[Figure: separator w' with the lines w' β‹… x = βˆ’1 and w' β‹… x = 1]

SLIDE 36

Transductive Support Vector Machines

Optimize for the separator with large margin wrt labeled and unlabeled data.

Input: Sl={(x1, y1), …, (xml, yml)}, Su={x1, …, xmu}

argminw β€–wβ€–Β² + C Ξ£i ΞΎi + C Ξ£u ΞΎu s.t.:
  • yi (w β‹… xi) β‰₯ 1 βˆ’ ΞΎi, for all i ∈ {1, …, ml}
  • yu (w β‹… xu) β‰₯ 1 βˆ’ ΞΎu, for all u ∈ {1, …, mu}
  • yu ∈ {βˆ’1, 1} for all u ∈ {1, …, mu}

NP-hard….. Convex only after you have guessed the labels, and there are too many possible guesses.
SLIDE 37

Transductive Support Vector Machines

Optimize for the separator with large margin wrt labeled and unlabeled data.

Heuristic (Joachims), high-level idea (see the sketch below):
  • First maximize the margin over the labeled points.
  • Use the resulting separator to give initial labels to the unlabeled points.
  • Try flipping labels of unlabeled points to see if doing so can increase the margin.

Keep going until no more improvements. Finds a locally-optimal solution.
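A rough sketch of this label-flipping loop, with scikit-learn's LinearSVC as a stand-in margin learner; this simplification omits details of Joachims' actual procedure (e.g. flips that preserve the class ratio, and the slack terms in the objective) and is only an illustration of the idea:

```python
import numpy as np
from sklearn.svm import LinearSVC  # stand-in margin learner (assumption)

def tsvm_heuristic(Xl, yl, Xu, C=1.0, max_rounds=20):
    """Greedy TSVM-style heuristic: label the unlabeled points with an SVM
    trained on labeled data, then flip guessed labels while the margin grows."""
    def fit(yu):
        return LinearSVC(C=C).fit(np.vstack([Xl, Xu]),
                                  np.concatenate([yl, yu]))

    def norm_sq(model):
        w = model.coef_.ravel()
        return float(w @ w)        # smaller ||w||^2 means larger margin

    yu = LinearSVC(C=C).fit(Xl, yl).predict(Xu)   # initial guessed labels
    best = norm_sq(fit(yu))
    for _ in range(max_rounds):
        improved = False
        for i in range(len(yu)):
            yu[i] = -yu[i]                        # try flipping one label
            val = norm_sq(fit(yu))
            if val < best:
                best, improved = val, True        # keep the flip
            else:
                yu[i] = -yu[i]                    # revert
        if not improved:
            break                                 # locally optimal labeling
    return fit(yu)
```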

SLIDE 38

Experiments [Joachims '99]

[Figure: experimental results from Joachims '99]

SLIDE 39

Transductive Support Vector Machines

Helpful distribution: highly compatible [figure: two well-separated clusters of + and βˆ’ examples]

Non-helpful distributions: 1/δ² clusters, all partitions separable by large margin.

SLIDE 40

Semi-Supervised Learning

Prominent paradigm of the past 15 years in Machine Learning.

Key Insight: unlabeled data is useful if we have beliefs not only about the form of the target, but also about its relationship with the underlying distribution.

Prominent techniques:
  – Transductive SVM [Joachims '99]
  – Co-training [Blum & Mitchell '98]
  – Graph-based methods [B&C01], [ZGL03]