SLIDE 1

Kernel Methods

CE-717: Machine Learning
Sharif University of Technology, Fall 2019
Soleymani

SLIDE 2

Not linearly separable data

- Noisy data or overlapping classes (we discussed this case: soft margin)
- Near linearly separable
- Non-linear decision surface
- Transform to a new feature space

[Figure: data in the $(x_1, x_2)$ plane, before and after the transformation]

SLIDE 3

Nonlinear SVM

- Assume a transformation $\phi: \mathbb{R}^d \to \mathbb{R}^m$ on the feature space:
  $\mathbf{x} \to \boldsymbol{\phi}(\mathbf{x})$
- Find a hyperplane in the transformed feature space:
  $\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) + w_0 = 0$
- $\{\phi_1(\mathbf{x}), \dots, \phi_m(\mathbf{x})\}$: set of basis functions (or features), with $\phi_i(\mathbf{x}): \mathbb{R}^d \to \mathbb{R}$ and $\boldsymbol{\phi}(\mathbf{x}) = [\phi_1(\mathbf{x}), \dots, \phi_m(\mathbf{x})]$

[Figure: the mapping $\boldsymbol{\phi}$ takes points from the $(x_1, x_2)$ plane to the $(\phi_1(\mathbf{x}), \phi_2(\mathbf{x}))$ plane, where the classes become linearly separable]

SLIDE 4

Soft-margin SVM in a transformed space: Primal problem

- Primal problem:

  $\min_{\mathbf{w}, w_0} \ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{n=1}^{N} \xi_n$
  s.t. $y^{(n)}\left(\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}^{(n)}) + w_0\right) \ge 1 - \xi_n, \quad n = 1, \dots, N$
  $\xi_n \ge 0$

- $\mathbf{w} \in \mathbb{R}^m$: the weights that must be found
- If $m \gg d$ (very high dimensional feature space), then there are many more parameters to learn

SLIDE 5

Soft-margin SVM in a transformed space: Dual problem

- Optimization problem:

  $\max_{\boldsymbol{\alpha}} \ \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m y^{(n)} y^{(m)} \boldsymbol{\phi}(\mathbf{x}^{(n)})^T \boldsymbol{\phi}(\mathbf{x}^{(m)})$
  subject to $\sum_{n=1}^{N} \alpha_n y^{(n)} = 0$
  $0 \le \alpha_n \le C, \quad n = 1, \dots, N$

- If we have the inner products $\boldsymbol{\phi}(\mathbf{x}^{(i)})^T \boldsymbol{\phi}(\mathbf{x}^{(j)})$, only $\boldsymbol{\alpha} = [\alpha_1, \dots, \alpha_N]$ needs to be learned
- It is not necessary to learn $m$ parameters, as opposed to the primal problem

SLIDE 6

Classifying new data

$\hat{y} = \text{sign}\left(w_0 + \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})\right)$
where $\mathbf{w} = \sum_{\alpha_n \ne 0} \alpha_n y^{(n)} \boldsymbol{\phi}(\mathbf{x}^{(n)})$
and $w_0 = y^{(s)} - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}^{(s)})$ for any support vector $\mathbf{x}^{(s)}$

SLIDE 7

Kernel SVM

- Learns a linear decision boundary in a high-dimensional space without explicitly working on the mapped data
- Let $\boldsymbol{\phi}(\mathbf{x})^T \boldsymbol{\phi}(\mathbf{x}') = K(\mathbf{x}, \mathbf{x}')$ (kernel)
- Example: $\mathbf{x} = [x_1, x_2]$ and second-order $\boldsymbol{\phi}$:

  $\boldsymbol{\phi}(\mathbf{x}) = [1, x_1, x_2, x_1^2, x_2^2, x_1 x_2]$
  $K(\mathbf{x}, \mathbf{x}') = 1 + x_1 x_1' + x_2 x_2' + x_1^2 x_1'^2 + x_2^2 x_2'^2 + x_1 x_2 x_1' x_2'$

SLIDE 8

Kernel trick

- Compute $K(\mathbf{x}, \mathbf{x}')$ without transforming $\mathbf{x}$ and $\mathbf{x}'$
- Example: Consider $K(\mathbf{x}, \mathbf{x}') = (1 + \mathbf{x}^T \mathbf{x}')^2 = (1 + x_1 x_1' + x_2 x_2')^2$

  $= 1 + 2 x_1 x_1' + 2 x_2 x_2' + x_1^2 x_1'^2 + x_2^2 x_2'^2 + 2 x_1 x_1' x_2 x_2'$

  This is an inner product in:

  $\boldsymbol{\phi}(\mathbf{x}) = [1, \sqrt{2}\, x_1, \sqrt{2}\, x_2, x_1^2, x_2^2, \sqrt{2}\, x_1 x_2]$
  $\boldsymbol{\phi}(\mathbf{x}') = [1, \sqrt{2}\, x_1', \sqrt{2}\, x_2', x_1'^2, x_2'^2, \sqrt{2}\, x_1' x_2']$

SLIDE 9

Polynomial kernel: Degree two

- We instead use $K(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T \mathbf{x}' + 1)^2$, which corresponds to:

  $\boldsymbol{\phi}(\mathbf{x}) = [1, \sqrt{2}\, x_1, \dots, \sqrt{2}\, x_d, x_1^2, \dots, x_d^2, \sqrt{2}\, x_1 x_2, \dots, \sqrt{2}\, x_1 x_d, \sqrt{2}\, x_2 x_3, \dots, \sqrt{2}\, x_{d-1} x_d]$

  for the $d$-dimensional feature space $\mathbf{x} = [x_1, \dots, x_d]^T$

SLIDE 10

Polynomial kernel

- This can similarly be generalized to $d$-dimensional $\mathbf{x}$ and $\phi$'s that are polynomials of order $p$:

  $K(\mathbf{x}, \mathbf{x}') = (1 + \mathbf{x}^T \mathbf{x}')^p = (1 + x_1 x_1' + x_2 x_2' + \dots + x_d x_d')^p$

- Example: SVM boundary for a polynomial kernel
  - $w_0 + \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) = 0$
  - $\Rightarrow w_0 + \sum_{\alpha_i \ne 0} \alpha_i y^{(i)} \boldsymbol{\phi}(\mathbf{x}^{(i)})^T \boldsymbol{\phi}(\mathbf{x}) = 0$
  - $\Rightarrow w_0 + \sum_{\alpha_i \ne 0} \alpha_i y^{(i)} k(\mathbf{x}^{(i)}, \mathbf{x}) = 0$
  - $\Rightarrow w_0 + \sum_{\alpha_i \ne 0} \alpha_i y^{(i)} \left(1 + \mathbf{x}^{(i)T} \mathbf{x}\right)^p = 0$

  The boundary is a polynomial of order $p$

SLIDE 11

Why kernel?

- Kernel functions $K$ can indeed be computed efficiently, with a cost proportional to $d$ (the dimensionality of the input) instead of $m$.
- Example: consider the second-order polynomial transform:

  $\boldsymbol{\phi}(\mathbf{x}) = [1, x_1, \dots, x_d, x_1^2, x_1 x_2, \dots, x_d x_d]^T$
  $\boldsymbol{\phi}(\mathbf{x})^T \boldsymbol{\phi}(\mathbf{x}') = 1 + \sum_{i=1}^{d} x_i x_i' + \sum_{i=1}^{d} \sum_{j=1}^{d} x_i x_j x_i' x_j'$

  Since $\sum_{i=1}^{d} x_i x_i' \times \sum_{j=1}^{d} x_j x_j' = (\mathbf{x}^T \mathbf{x}')^2$:

  $\boldsymbol{\phi}(\mathbf{x})^T \boldsymbol{\phi}(\mathbf{x}') = 1 + \mathbf{x}^T \mathbf{x}' + (\mathbf{x}^T \mathbf{x}')^2$

  Here $m = 1 + d + d^2$, so evaluating the kernel directly costs $O(d)$ rather than $O(m)$.

SLIDE 12

Gaussian or RBF kernel

- If $K(\mathbf{x}, \mathbf{x}')$ is an inner product in some transformed space of $\mathbf{x}$, it is a valid kernel:

  $K(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{\sigma^2}\right)$

- Take the one-dimensional case with $\sigma = 1$:

  $K(x, x') = \exp\left(-(x - x')^2\right) = \exp(-x^2) \exp(-x'^2) \exp(2 x x') = \exp(-x^2) \exp(-x'^2) \sum_{k=0}^{\infty} \frac{2^k x^k x'^k}{k!}$

  The expansion has infinitely many terms, so the corresponding feature space is infinite-dimensional.
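A quick numerical sanity check of this expansion, truncating the infinite sum after a few terms (a sketch; the truncation depth of 20 is an arbitrary choice of ours):

```python
import numpy as np
from math import factorial

x, xp = 0.7, -0.4
exact = np.exp(-(x - xp) ** 2)

# Truncated feature-space expansion: exp(2*x*x') = sum_k (2*x*x')^k / k!
series = sum((2 * x * xp) ** k / factorial(k) for k in range(20))
approx = np.exp(-x ** 2) * np.exp(-xp ** 2) * series
print(np.isclose(exact, approx))  # True: the series converges quickly
```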

SLIDE 13

Some common kernel functions

- Linear: $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T \mathbf{x}'$
- Polynomial: $k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T \mathbf{x}' + 1)^p$
- Gaussian: $k(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{\sigma^2}\right)$
- Sigmoid: $k(\mathbf{x}, \mathbf{x}') = \tanh(a\, \mathbf{x}^T \mathbf{x}' + b)$
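All four are straightforward to implement directly; a minimal sketch (the function and parameter names p, sigma, a, b are ours):

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp

def polynomial_kernel(x, xp, p=2):
    return (x @ xp + 1) ** p

def gaussian_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / sigma ** 2)

def sigmoid_kernel(x, xp, a=1.0, b=0.0):
    # Note: positive semi-definite only for some choices of a and b
    return np.tanh(a * (x @ xp) + b)
```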

SLIDE 14

Kernel formulation of SVM

- Optimization problem:

  $\max_{\boldsymbol{\alpha}} \ \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m y^{(n)} y^{(m)} k(\mathbf{x}^{(n)}, \mathbf{x}^{(m)})$
  subject to $\sum_{n=1}^{N} \alpha_n y^{(n)} = 0$
  $0 \le \alpha_n \le C, \quad n = 1, \dots, N$

  where $k(\mathbf{x}^{(n)}, \mathbf{x}^{(m)})$ replaces the inner product $\boldsymbol{\phi}(\mathbf{x}^{(n)})^T \boldsymbol{\phi}(\mathbf{x}^{(m)})$.

- The quadratic term is built from the matrix

  $\mathbf{Q} = \begin{bmatrix} y^{(1)} y^{(1)} K(\mathbf{x}^{(1)}, \mathbf{x}^{(1)}) & \cdots & y^{(1)} y^{(N)} K(\mathbf{x}^{(1)}, \mathbf{x}^{(N)}) \\ \vdots & \ddots & \vdots \\ y^{(N)} y^{(1)} K(\mathbf{x}^{(N)}, \mathbf{x}^{(1)}) & \cdots & y^{(N)} y^{(N)} K(\mathbf{x}^{(N)}, \mathbf{x}^{(N)}) \end{bmatrix}$

SLIDE 15

Classifying new data

$\hat{y} = \text{sign}\left(w_0 + \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})\right)$
where $\mathbf{w} = \sum_{\alpha_n \ne 0} \alpha_n y^{(n)} \boldsymbol{\phi}(\mathbf{x}^{(n)})$ and $w_0 = y^{(s)} - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}^{(s)})$

In kernel form, substituting $\mathbf{w}$:

$\hat{y} = \text{sign}\left(w_0 + \sum_{\alpha_n \ne 0} \alpha_n y^{(n)} k(\mathbf{x}^{(n)}, \mathbf{x})\right)$
$w_0 = y^{(s)} - \sum_{\alpha_n \ne 0} \alpha_n y^{(n)} k(\mathbf{x}^{(n)}, \mathbf{x}^{(s)})$

SLIDE 16

Gaussian kernel

- Example: SVM boundary for a Gaussian kernel
  - Places a Gaussian function around each data point:

    $w_0 + \sum_{\alpha_i \ne 0} \alpha_i y^{(i)} \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}^{(i)}\|^2}{\sigma^2}\right) = 0$

- SVM + Gaussian kernel can classify any arbitrary training set
  - Training error is zero when $\sigma \to 0$
  - All samples become support vectors (likely overfitting)
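This overfitting regime is easy to reproduce; a sketch using scikit-learn's SVC (scikit-learn parameterizes the RBF kernel as $\exp(-\gamma \|\mathbf{x} - \mathbf{x}'\|^2)$, so a small $\sigma$ corresponds to a large gamma; the random dataset here is purely illustrative):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = rng.integers(0, 2, size=40)  # random labels: no real structure to learn

clf = SVC(kernel="rbf", gamma=1000.0, C=1e6).fit(X, y)
print(clf.score(X, y))        # ~1.0: zero training error even on random labels
print(clf.n_support_.sum())   # close to 40: nearly every sample is a support vector
```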

SLIDE 17

Hard-margin example

- For a narrow Gaussian (small $\sigma$), even the protection of a large margin cannot suppress overfitting.

[Y. Abu-Mostafa et al., 2012]

SLIDE 18

SVM Gaussian kernel: Example

This example has been adapted from Zisserman's slides.

$f(\mathbf{x}) = w_0 + \sum_{\alpha_i \ne 0} \alpha_i y^{(i)} \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}^{(i)}\|^2}{2\sigma^2}\right)$
SLIDES 19-24

SVM Gaussian kernel: Example (cont'd)

(Figures only; these examples have been adapted from Zisserman's slides.)

SLIDE 25

Kernel trick: Idea

- Kernel trick → extension of many well-known algorithms to kernel-based ones
  - by substituting the dot product with the kernel function: $k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^T \boldsymbol{\phi}(\mathbf{x}')$
  - $k(\mathbf{x}, \mathbf{x}')$ gives the dot product of $\mathbf{x}$ and $\mathbf{x}'$ in the transformed space
- Idea: when the input vectors appear only in the form of dot products, we can use the kernel trick
  - Solve the problem without explicitly mapping the data
  - Explicit mapping is expensive if $\boldsymbol{\phi}(\mathbf{x})$ is very high-dimensional

SLIDE 26

Kernel trick: Idea (Cont'd)

- Instead of using a mapping $\boldsymbol{\phi}: \mathcal{X} \to \mathcal{F}$ to represent $\mathbf{x} \in \mathcal{X}$ by $\boldsymbol{\phi}(\mathbf{x}) \in \mathcal{F}$, a kernel function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is used.
- We specify only an inner product function between points in the transformed space (not their coordinates)
- In many cases, the inner product in the embedding space can be computed efficiently.

SLIDE 27

Constructing kernels

- Construct kernel functions directly
  - Ensure that the function is a valid kernel, i.e., corresponds to an inner product in some feature space
- Example: $k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T \mathbf{x}')^2$
  - Corresponding mapping: $\boldsymbol{\phi}(\mathbf{x}) = [x_1^2, \sqrt{2}\, x_1 x_2, x_2^2]^T$ for $\mathbf{x} = [x_1, x_2]^T$
- We need a way to test whether a kernel is valid without having to construct $\boldsymbol{\phi}(\mathbf{x})$

SLIDE 28

Construct valid kernels

Given valid kernels $k_1$ and $k_2$, each of the following is also a valid kernel:

- $k(\mathbf{x}, \mathbf{x}') = c\, k_1(\mathbf{x}, \mathbf{x}')$, where $c > 0$
- $k(\mathbf{x}, \mathbf{x}') = f(\mathbf{x})\, k_1(\mathbf{x}, \mathbf{x}')\, f(\mathbf{x}')$, where $f(\cdot)$ is any function
- $k(\mathbf{x}, \mathbf{x}') = q(k_1(\mathbf{x}, \mathbf{x}'))$, where $q(\cdot)$ is a polynomial with coefficients $\ge 0$
- $k(\mathbf{x}, \mathbf{x}') = \exp(k_1(\mathbf{x}, \mathbf{x}'))$
- $k(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}') + k_2(\mathbf{x}, \mathbf{x}')$
- $k(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}')\, k_2(\mathbf{x}, \mathbf{x}')$
- $k(\mathbf{x}, \mathbf{x}') = k_3(\boldsymbol{\phi}(\mathbf{x}), \boldsymbol{\phi}(\mathbf{x}'))$, where $\boldsymbol{\phi}(\mathbf{x})$ is a function from $\mathbf{x}$ to $\mathbb{R}^M$ and $k_3(\cdot, \cdot)$ is a valid kernel in $\mathbb{R}^M$
- $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T \mathbf{A} \mathbf{x}'$, where $\mathbf{A}$ is a symmetric positive semi-definite matrix
- $k(\mathbf{x}, \mathbf{x}') = k_a(\mathbf{x}_a, \mathbf{x}'_a) + k_b(\mathbf{x}_b, \mathbf{x}'_b)$ and $k(\mathbf{x}, \mathbf{x}') = k_a(\mathbf{x}_a, \mathbf{x}'_a)\, k_b(\mathbf{x}_b, \mathbf{x}'_b)$, where $\mathbf{x}_a$ and $\mathbf{x}_b$ are variables (not necessarily disjoint) with $\mathbf{x} = (\mathbf{x}_a, \mathbf{x}_b)$, and $k_a$ and $k_b$ are valid kernel functions over their respective spaces. [Bishop]

SLIDE 29

Valid kernel: Necessary & sufficient conditions

- Gram matrix $\mathbf{K}_{N \times N}$: $K_{ij} = k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$
  - Restricting the kernel function to a set of points $\{\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \dots, \mathbf{x}^{(N)}\}$:

  $\mathbf{K} = \begin{bmatrix} k(\mathbf{x}^{(1)}, \mathbf{x}^{(1)}) & \cdots & k(\mathbf{x}^{(1)}, \mathbf{x}^{(N)}) \\ \vdots & \ddots & \vdots \\ k(\mathbf{x}^{(N)}, \mathbf{x}^{(1)}) & \cdots & k(\mathbf{x}^{(N)}, \mathbf{x}^{(N)}) \end{bmatrix}$

- Mercer theorem: the kernel is valid if and only if the kernel matrix is symmetric positive semi-definite (for any choice of data points)
- Any symmetric positive semi-definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some space [Shawe-Taylor & Cristianini 2004]
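The condition can be probed numerically on a sample of points; a sketch (the function name, tolerance, and test kernels are our choices). Passing on one point set is only evidence, not proof: Mercer's condition quantifies over all point sets.

```python
import numpy as np

def is_psd_gram(X, kernel, tol=1e-10):
    """Numerically test that the Gram matrix of `kernel` on X is symmetric PSD."""
    K = np.array([[kernel(a, b) for b in X] for a in X])
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol

X = np.random.default_rng(1).normal(size=(30, 3))
print(is_psd_gram(X, lambda a, b: (a @ b + 1) ** 2))    # True: polynomial kernel is valid
print(is_psd_gram(X, lambda a, b: np.tanh(a @ b - 1)))  # often False: sigmoid is not PSD in general
```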

SLIDE 30

Extending linear methods to kernelized ones

- Kernelized versions of linear methods
  - Linear methods are popular: unique optimal solutions, faster learning algorithms, and simpler analysis
  - However, real-world problems often require nonlinear methods, so we can use kernel-based versions of these linear algorithms
- Replacing inner products with kernels in linear algorithms ⇒ very flexible methods
  - We can operate in the mapped space without ever computing the coordinates of the data in that space

SLIDE 31

Example: kernelized minimum distance classifier

- If $\|\mathbf{x} - \boldsymbol{\mu}_1\| < \|\mathbf{x} - \boldsymbol{\mu}_2\|$, then assign $\mathbf{x}$ to $\mathcal{C}_1$:

  $(\mathbf{x} - \boldsymbol{\mu}_1)^T (\mathbf{x} - \boldsymbol{\mu}_1) < (\mathbf{x} - \boldsymbol{\mu}_2)^T (\mathbf{x} - \boldsymbol{\mu}_2)$
  $-2\, \mathbf{x}^T \boldsymbol{\mu}_1 + \boldsymbol{\mu}_1^T \boldsymbol{\mu}_1 < -2\, \mathbf{x}^T \boldsymbol{\mu}_2 + \boldsymbol{\mu}_2^T \boldsymbol{\mu}_2$

  Substituting the class means $\boldsymbol{\mu}_c = \frac{1}{N_c} \sum_{n \in \mathcal{C}_c} \mathbf{x}^{(n)}$:

  $-\frac{2}{N_1} \sum_{n \in \mathcal{C}_1} \mathbf{x}^T \mathbf{x}^{(n)} + \frac{1}{N_1^2} \sum_{n \in \mathcal{C}_1} \sum_{m \in \mathcal{C}_1} \mathbf{x}^{(n)T} \mathbf{x}^{(m)} < -\frac{2}{N_2} \sum_{n \in \mathcal{C}_2} \mathbf{x}^T \mathbf{x}^{(n)} + \frac{1}{N_2^2} \sum_{n \in \mathcal{C}_2} \sum_{m \in \mathcal{C}_2} \mathbf{x}^{(n)T} \mathbf{x}^{(m)}$

  Replacing every inner product with the kernel:

  $-\frac{2}{N_1} \sum_{n \in \mathcal{C}_1} K(\mathbf{x}, \mathbf{x}^{(n)}) + \frac{1}{N_1^2} \sum_{n \in \mathcal{C}_1} \sum_{m \in \mathcal{C}_1} K(\mathbf{x}^{(n)}, \mathbf{x}^{(m)}) < -\frac{2}{N_2} \sum_{n \in \mathcal{C}_2} K(\mathbf{x}, \mathbf{x}^{(n)}) + \frac{1}{N_2^2} \sum_{n \in \mathcal{C}_2} \sum_{m \in \mathcal{C}_2} K(\mathbf{x}^{(n)}, \mathbf{x}^{(m)})$

SLIDE 32

Which information can be obtained from the kernel?

- Example: we know all pairwise distances

  $d(\boldsymbol{\phi}(\mathbf{x}), \boldsymbol{\phi}(\mathbf{z}))^2 = \|\boldsymbol{\phi}(\mathbf{x}) - \boldsymbol{\phi}(\mathbf{z})\|^2 = k(\mathbf{x}, \mathbf{x}) + k(\mathbf{z}, \mathbf{z}) - 2\, k(\mathbf{x}, \mathbf{z})$

  - Therefore, we also know the distance of points from the center of mass of a set
- Many dimensionality reduction, clustering, and classification methods can be described in terms of pairwise distances.
  - This allows us to introduce kernelized versions of them.
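The full pairwise-distance matrix follows directly from the Gram matrix; a sketch (function name ours; the `maximum` guards against tiny negative values from round-off):

```python
import numpy as np

def feature_space_distances(X, k):
    """Pairwise distances ||phi(x_i) - phi(x_j)|| computed from the kernel alone."""
    K = np.array([[k(a, b) for b in X] for a in X])
    d = np.diag(K)
    return np.sqrt(np.maximum(d[:, None] + d[None, :] - 2 * K, 0.0))
```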

SLIDE 33

Example: Kernel ridge regression

$\min_{\mathbf{w}} \ \sum_{n=1}^{N} \left(\mathbf{w}^T \mathbf{x}^{(n)} - y^{(n)}\right)^2 + \lambda\, \mathbf{w}^T \mathbf{w}$

Setting the gradient to zero:

$\sum_{n=1}^{N} 2\, \mathbf{x}^{(n)} \left(\mathbf{w}^T \mathbf{x}^{(n)} - y^{(n)}\right) + 2 \lambda \mathbf{w} = \mathbf{0} \ \Rightarrow \ \mathbf{w} = \sum_{n=1}^{N} \alpha_n \mathbf{x}^{(n)}$

where $\alpha_n = -\frac{1}{\lambda} \left(\mathbf{w}^T \mathbf{x}^{(n)} - y^{(n)}\right)$

SLIDE 34

Example: Kernel ridge regression (Cont'd)

$\min_{\mathbf{w}} \ \sum_{n=1}^{N} \left(\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}^{(n)}) - y^{(n)}\right)^2 + \lambda\, \mathbf{w}^T \mathbf{w}$

- Dual representation, with $\mathbf{w} = \sum_{n=1}^{N} \alpha_n \boldsymbol{\phi}(\mathbf{x}^{(n)}) = \boldsymbol{\Phi}^T \boldsymbol{\alpha}$:

  $J(\boldsymbol{\alpha}) = \boldsymbol{\alpha}^T \boldsymbol{\Phi} \boldsymbol{\Phi}^T \boldsymbol{\Phi} \boldsymbol{\Phi}^T \boldsymbol{\alpha} - 2\, \boldsymbol{\alpha}^T \boldsymbol{\Phi} \boldsymbol{\Phi}^T \mathbf{y} + \mathbf{y}^T \mathbf{y} + \lambda\, \boldsymbol{\alpha}^T \boldsymbol{\Phi} \boldsymbol{\Phi}^T \boldsymbol{\alpha}$
  $J(\boldsymbol{\alpha}) = \boldsymbol{\alpha}^T \mathbf{K} \mathbf{K} \boldsymbol{\alpha} - 2\, \boldsymbol{\alpha}^T \mathbf{K} \mathbf{y} + \mathbf{y}^T \mathbf{y} + \lambda\, \boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha}$

  $\nabla_{\boldsymbol{\alpha}} J(\boldsymbol{\alpha}) = \mathbf{0} \ \Rightarrow \ \boldsymbol{\alpha} = \left(\mathbf{K} + \lambda \mathbf{I}_N\right)^{-1} \mathbf{y}$
slide-35
SLIDE 35

Example: Kernel ridge regression (Contโ€™d)

35

} Prediction for new ๐’š:

๐‘” ๐’š = ๐’™2๐œš ๐’š = ๐œท2๐šพ๐œš ๐’š = ๐ฟ(๐’š(,), ๐’š) โ‹ฎ ๐ฟ(๐’š(H), ๐’š)

2

๐‘ณ + ๐œ‡๐‘ฑH j,๐’›

๐’™ = ๐šพ2๐œท

SLIDE 36

Kernels for structured data

- Kernels can also be defined on general types of data
  - Kernel functions do not need to be defined over vectors; we just need a symmetric positive semi-definite Gram matrix
- Thus, many algorithms can work with general (non-vectorial) data
  - Kernels exist to embed strings, trees, graphs, ...
  - This may be even more important than nonlinearity: it yields kernel-based versions of classical learning algorithms for recognition of structured data

SLIDE 37

Kernel function for objects

- Sets: an example of a kernel function on sets:

  $k(A_1, A_2) = 2^{|A_1 \cap A_2|}$

- Strings: the inner product of the feature vectors for two strings can be defined as, e.g., a sum over all common subsequences, weighted according to their frequency of occurrence and lengths

[Figure: two example strings, AEGATEAGG and EGTEAGAEGATG, with common subsequences highlighted]
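The set kernel above is a one-liner; a sketch (the slide gives only the formula, so this implementation is ours):

```python
def set_kernel(A, B):
    """k(A1, A2) = 2^|A1 ∩ A2|: the number of subsets the two sets have in common."""
    return 2 ** len(set(A) & set(B))

print(set_kernel({1, 2, 3}, {2, 3, 5}))  # intersection {2, 3} -> 2**2 = 4
```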

SLIDE 38

Kernel trick advantages: summary

- Operating in the mapped space without ever computing the coordinates of the data in that space
- Besides vectors, we can introduce kernel functions for structured data (graphs, strings, etc.)
- Much of the geometry of the data in the embedding space is contained in all pairwise dot products
- In many cases, the inner product in the embedding space can be computed efficiently.

SLIDE 39

Resources

- C. Bishop, "Pattern Recognition and Machine Learning", Sections 6.1-6.2 and 7.1.
- Y. S. Abu-Mostafa et al., "Learning from Data", Chapter 8.