SLIDE 1

Kernel Methods

CE-717: Machine Learning
Sharif University of Technology, Fall 2019
Soleymani

SLIDE 2

Not linearly separable data

- Noisy data or overlapping classes (we discussed this case: soft margin)
- Near linearly separable
- Non-linear decision surface
- Transform to a new feature space

[Figure: data in the $(x_1, x_2)$ plane, before and after the transformation]

SLIDE 3

Nonlinear SVM

- Assume a transformation $\phi: \mathbb{R}^d \to \mathbb{R}^m$ on the feature space:
  $\mathbf{x} \to \boldsymbol{\phi}(\mathbf{x})$
- Find a hyperplane in the transformed feature space:
  $\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) + w_0 = 0$
- $\{\phi_1(\mathbf{x}), \dots, \phi_m(\mathbf{x})\}$: set of basis functions (or features), with $\phi_i(\mathbf{x}): \mathbb{R}^d \to \mathbb{R}$ and $\boldsymbol{\phi}(\mathbf{x}) = [\phi_1(\mathbf{x}), \dots, \phi_m(\mathbf{x})]$

[Figure: the mapping $\boldsymbol{\phi}$ takes points from the $(x_1, x_2)$ plane to the $(\phi_1(\mathbf{x}), \phi_2(\mathbf{x}))$ plane, where the classes become linearly separable]

SLIDE 4

Soft-margin SVM in a transformed space: Primal problem

- Primal problem:

  $\min_{\mathbf{w}, w_0} \ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{n=1}^{N} \xi_n$
  s.t. $y^{(n)}\left(\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}^{(n)}) + w_0\right) \ge 1 - \xi_n, \quad n = 1, \dots, N$
  $\xi_n \ge 0$

- $\mathbf{w} \in \mathbb{R}^m$: the weights that must be found
- If $m \gg d$ (very high dimensional feature space), then there are many more parameters to learn

SLIDE 5

Soft-margin SVM in a transformed space: Dual problem

- Optimization problem:

  $\max_{\boldsymbol{\alpha}} \ \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m y^{(n)} y^{(m)} \boldsymbol{\phi}(\mathbf{x}^{(n)})^T \boldsymbol{\phi}(\mathbf{x}^{(m)})$
  subject to $\sum_{n=1}^{N} \alpha_n y^{(n)} = 0$
  $0 \le \alpha_n \le C, \quad n = 1, \dots, N$

- If we have the inner products $\boldsymbol{\phi}(\mathbf{x}^{(i)})^T \boldsymbol{\phi}(\mathbf{x}^{(j)})$, only $\boldsymbol{\alpha} = [\alpha_1, \dots, \alpha_N]$ needs to be learned
- It is not necessary to learn $m$ parameters, as opposed to the primal problem

SLIDE 6

Classifying new data

$\hat{y} = \text{sign}\left(w_0 + \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})\right)$
where $\mathbf{w} = \sum_{\alpha_n \ne 0} \alpha_n y^{(n)} \boldsymbol{\phi}(\mathbf{x}^{(n)})$
and $w_0 = y^{(s)} - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}^{(s)})$ for any support vector $\mathbf{x}^{(s)}$

SLIDE 7

Kernel SVM

- Learns a linear decision boundary in a high-dimensional space without explicitly working on the mapped data
- Let $\boldsymbol{\phi}(\mathbf{x})^T \boldsymbol{\phi}(\mathbf{x}') = K(\mathbf{x}, \mathbf{x}')$ (kernel)
- Example: $\mathbf{x} = [x_1, x_2]$ and second-order $\boldsymbol{\phi}$:

  $\boldsymbol{\phi}(\mathbf{x}) = [1, x_1, x_2, x_1^2, x_2^2, x_1 x_2]$
  $K(\mathbf{x}, \mathbf{x}') = 1 + x_1 x_1' + x_2 x_2' + x_1^2 x_1'^2 + x_2^2 x_2'^2 + x_1 x_2 x_1' x_2'$

SLIDE 8

Kernel trick

- Compute $K(\mathbf{x}, \mathbf{x}')$ without transforming $\mathbf{x}$ and $\mathbf{x}'$
- Example: Consider $K(\mathbf{x}, \mathbf{x}') = (1 + \mathbf{x}^T \mathbf{x}')^2 = (1 + x_1 x_1' + x_2 x_2')^2$

  $= 1 + 2 x_1 x_1' + 2 x_2 x_2' + x_1^2 x_1'^2 + x_2^2 x_2'^2 + 2 x_1 x_1' x_2 x_2'$

  This is an inner product in:

  $\boldsymbol{\phi}(\mathbf{x}) = [1, \sqrt{2}\, x_1, \sqrt{2}\, x_2, x_1^2, x_2^2, \sqrt{2}\, x_1 x_2]$
  $\boldsymbol{\phi}(\mathbf{x}') = [1, \sqrt{2}\, x_1', \sqrt{2}\, x_2', x_1'^2, x_2'^2, \sqrt{2}\, x_1' x_2']$

SLIDE 9

Polynomial kernel: Degree two

- We instead use $K(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T \mathbf{x}' + 1)^2$, which corresponds to:

  $\boldsymbol{\phi}(\mathbf{x}) = [1, \sqrt{2}\, x_1, \dots, \sqrt{2}\, x_d, x_1^2, \dots, x_d^2, \sqrt{2}\, x_1 x_2, \dots, \sqrt{2}\, x_1 x_d, \sqrt{2}\, x_2 x_3, \dots, \sqrt{2}\, x_{d-1} x_d]$

  for the $d$-dimensional feature space $\mathbf{x} = [x_1, \dots, x_d]^T$

SLIDE 10

Polynomial kernel

- This can similarly be generalized to $d$-dimensional $\mathbf{x}$ and $\phi$'s that are polynomials of order $p$:

  $K(\mathbf{x}, \mathbf{x}') = (1 + \mathbf{x}^T \mathbf{x}')^p = (1 + x_1 x_1' + x_2 x_2' + \dots + x_d x_d')^p$

- Example: SVM boundary for a polynomial kernel
  - $w_0 + \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) = 0$
  - $\Rightarrow w_0 + \sum_{\alpha_i \ne 0} \alpha_i y^{(i)} \boldsymbol{\phi}(\mathbf{x}^{(i)})^T \boldsymbol{\phi}(\mathbf{x}) = 0$
  - $\Rightarrow w_0 + \sum_{\alpha_i \ne 0} \alpha_i y^{(i)} k(\mathbf{x}^{(i)}, \mathbf{x}) = 0$
  - $\Rightarrow w_0 + \sum_{\alpha_i \ne 0} \alpha_i y^{(i)} \left(1 + \mathbf{x}^{(i)T} \mathbf{x}\right)^p = 0$

  The boundary is a polynomial of order $p$

SLIDE 11

Why kernel?

- Kernel functions $K$ can indeed be computed efficiently, with a cost proportional to $d$ (the dimensionality of the input) instead of $m$.
- Example: consider the second-order polynomial transform:

  $\boldsymbol{\phi}(\mathbf{x}) = [1, x_1, \dots, x_d, x_1^2, x_1 x_2, \dots, x_d x_d]^T$
  $\boldsymbol{\phi}(\mathbf{x})^T \boldsymbol{\phi}(\mathbf{x}') = 1 + \sum_{i=1}^{d} x_i x_i' + \sum_{i=1}^{d} \sum_{j=1}^{d} x_i x_j x_i' x_j'$

  Since $\sum_{i=1}^{d} x_i x_i' \times \sum_{j=1}^{d} x_j x_j' = (\mathbf{x}^T \mathbf{x}')^2$:

  $\boldsymbol{\phi}(\mathbf{x})^T \boldsymbol{\phi}(\mathbf{x}') = 1 + \mathbf{x}^T \mathbf{x}' + (\mathbf{x}^T \mathbf{x}')^2$

  Here $m = 1 + d + d^2$, so evaluating the kernel directly costs $O(d)$ rather than $O(m)$.

SLIDE 12

Gaussian or RBF kernel

- If $K(\mathbf{x}, \mathbf{x}')$ is an inner product in some transformed space of $\mathbf{x}$, it is a valid kernel:

  $K(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{\sigma^2}\right)$

- Take the one-dimensional case with $\sigma = 1$:

  $K(x, x') = \exp\left(-(x - x')^2\right) = \exp(-x^2) \exp(-x'^2) \exp(2 x x') = \exp(-x^2) \exp(-x'^2) \sum_{k=0}^{\infty} \frac{2^k x^k x'^k}{k!}$

  The expansion has infinitely many terms, so the corresponding feature space is infinite-dimensional.
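A quick numerical sanity check of this expansion, truncating the infinite sum after a few terms (a sketch; the truncation depth of 20 is an arbitrary choice of ours):

```python
import numpy as np
from math import factorial

x, xp = 0.7, -0.4
exact = np.exp(-(x - xp) ** 2)

# Truncated feature-space expansion: exp(2*x*x') = sum_k (2*x*x')^k / k!
series = sum((2 * x * xp) ** k / factorial(k) for k in range(20))
approx = np.exp(-x ** 2) * np.exp(-xp ** 2) * series
print(np.isclose(exact, approx))  # True: the series converges quickly
```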

SLIDE 13

Some common kernel functions

- Linear: $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T \mathbf{x}'$
- Polynomial: $k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T \mathbf{x}' + 1)^p$
- Gaussian: $k(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{\sigma^2}\right)$
- Sigmoid: $k(\mathbf{x}, \mathbf{x}') = \tanh(a\, \mathbf{x}^T \mathbf{x}' + b)$
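All four are straightforward to implement directly; a minimal sketch (the function and parameter names p, sigma, a, b are ours):

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp

def polynomial_kernel(x, xp, p=2):
    return (x @ xp + 1) ** p

def gaussian_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / sigma ** 2)

def sigmoid_kernel(x, xp, a=1.0, b=0.0):
    # Note: positive semi-definite only for some choices of a and b
    return np.tanh(a * (x @ xp) + b)
```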

SLIDE 14

Kernel formulation of SVM

- Optimization problem:

  $\max_{\boldsymbol{\alpha}} \ \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m y^{(n)} y^{(m)} k(\mathbf{x}^{(n)}, \mathbf{x}^{(m)})$
  subject to $\sum_{n=1}^{N} \alpha_n y^{(n)} = 0$
  $0 \le \alpha_n \le C, \quad n = 1, \dots, N$

  where $k(\mathbf{x}^{(n)}, \mathbf{x}^{(m)})$ replaces the inner product $\boldsymbol{\phi}(\mathbf{x}^{(n)})^T \boldsymbol{\phi}(\mathbf{x}^{(m)})$.

- The quadratic term is built from the matrix

  $\mathbf{Q} = \begin{bmatrix} y^{(1)} y^{(1)} K(\mathbf{x}^{(1)}, \mathbf{x}^{(1)}) & \cdots & y^{(1)} y^{(N)} K(\mathbf{x}^{(1)}, \mathbf{x}^{(N)}) \\ \vdots & \ddots & \vdots \\ y^{(N)} y^{(1)} K(\mathbf{x}^{(N)}, \mathbf{x}^{(1)}) & \cdots & y^{(N)} y^{(N)} K(\mathbf{x}^{(N)}, \mathbf{x}^{(N)}) \end{bmatrix}$

SLIDE 15

Classifying new data

$\hat{y} = \text{sign}\left(w_0 + \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})\right)$
where $\mathbf{w} = \sum_{\alpha_n \ne 0} \alpha_n y^{(n)} \boldsymbol{\phi}(\mathbf{x}^{(n)})$ and $w_0 = y^{(s)} - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}^{(s)})$

In kernel form, substituting $\mathbf{w}$:

$\hat{y} = \text{sign}\left(w_0 + \sum_{\alpha_n \ne 0} \alpha_n y^{(n)} k(\mathbf{x}^{(n)}, \mathbf{x})\right)$
$w_0 = y^{(s)} - \sum_{\alpha_n \ne 0} \alpha_n y^{(n)} k(\mathbf{x}^{(n)}, \mathbf{x}^{(s)})$

SLIDE 16

Gaussian kernel

- Example: SVM boundary for a Gaussian kernel
  - Places a Gaussian function around each data point:

    $w_0 + \sum_{\alpha_i \ne 0} \alpha_i y^{(i)} \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}^{(i)}\|^2}{\sigma^2}\right) = 0$

- SVM + Gaussian kernel can classify any arbitrary training set
  - Training error is zero when $\sigma \to 0$
  - All samples become support vectors (likely overfitting)
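This overfitting regime is easy to reproduce; a sketch using scikit-learn's SVC (scikit-learn parameterizes the RBF kernel as $\exp(-\gamma \|\mathbf{x} - \mathbf{x}'\|^2)$, so a small $\sigma$ corresponds to a large gamma; the random dataset here is purely illustrative):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = rng.integers(0, 2, size=40)  # random labels: no real structure to learn

clf = SVC(kernel="rbf", gamma=1000.0, C=1e6).fit(X, y)
print(clf.score(X, y))        # ~1.0: zero training error even on random labels
print(clf.n_support_.sum())   # close to 40: nearly every sample is a support vector
```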

SLIDE 17

Hard-margin example

- For a narrow Gaussian (small $\sigma$), even the protection of a large margin cannot suppress overfitting.

[Y. Abu-Mostafa et al., 2012]

SLIDE 18

SVM Gaussian kernel: Example

This example has been adapted from Zisserman's slides.

$f(\mathbf{x}) = w_0 + \sum_{\alpha_i \ne 0} \alpha_i y^{(i)} \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}^{(i)}\|^2}{2\sigma^2}\right)$
SLIDES 19-24

SVM Gaussian kernel: Example (cont'd)

(Figures only; these examples have been adapted from Zisserman's slides.)

SLIDE 25

Kernel trick: Idea

- Kernel trick → extension of many well-known algorithms to kernel-based ones
  - by substituting the dot product with the kernel function: $k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^T \boldsymbol{\phi}(\mathbf{x}')$
  - $k(\mathbf{x}, \mathbf{x}')$ gives the dot product of $\mathbf{x}$ and $\mathbf{x}'$ in the transformed space
- Idea: when the input vectors appear only in the form of dot products, we can use the kernel trick
  - Solve the problem without explicitly mapping the data
  - Explicit mapping is expensive if $\boldsymbol{\phi}(\mathbf{x})$ is very high-dimensional

SLIDE 26

Kernel trick: Idea (Cont'd)

- Instead of using a mapping $\boldsymbol{\phi}: \mathcal{X} \to \mathcal{F}$ to represent $\mathbf{x} \in \mathcal{X}$ by $\boldsymbol{\phi}(\mathbf{x}) \in \mathcal{F}$, a kernel function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is used.
- We specify only an inner product function between points in the transformed space (not their coordinates)
- In many cases, the inner product in the embedding space can be computed efficiently.

SLIDE 27

Constructing kernels

- Construct kernel functions directly
  - Ensure that the function is a valid kernel, i.e., corresponds to an inner product in some feature space
- Example: $k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T \mathbf{x}')^2$
  - Corresponding mapping: $\boldsymbol{\phi}(\mathbf{x}) = [x_1^2, \sqrt{2}\, x_1 x_2, x_2^2]^T$ for $\mathbf{x} = [x_1, x_2]^T$
- We need a way to test whether a kernel is valid without having to construct $\boldsymbol{\phi}(\mathbf{x})$

SLIDE 28

Construct valid kernels

Given valid kernels $k_1$ and $k_2$, each of the following is also a valid kernel:

- $k(\mathbf{x}, \mathbf{x}') = c\, k_1(\mathbf{x}, \mathbf{x}')$, where $c > 0$
- $k(\mathbf{x}, \mathbf{x}') = f(\mathbf{x})\, k_1(\mathbf{x}, \mathbf{x}')\, f(\mathbf{x}')$, where $f(\cdot)$ is any function
- $k(\mathbf{x}, \mathbf{x}') = q(k_1(\mathbf{x}, \mathbf{x}'))$, where $q(\cdot)$ is a polynomial with coefficients $\ge 0$
- $k(\mathbf{x}, \mathbf{x}') = \exp(k_1(\mathbf{x}, \mathbf{x}'))$
- $k(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}') + k_2(\mathbf{x}, \mathbf{x}')$
- $k(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}')\, k_2(\mathbf{x}, \mathbf{x}')$
- $k(\mathbf{x}, \mathbf{x}') = k_3(\boldsymbol{\phi}(\mathbf{x}), \boldsymbol{\phi}(\mathbf{x}'))$, where $\boldsymbol{\phi}(\mathbf{x})$ is a function from $\mathbf{x}$ to $\mathbb{R}^M$ and $k_3(\cdot, \cdot)$ is a valid kernel in $\mathbb{R}^M$
- $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T \mathbf{A} \mathbf{x}'$, where $\mathbf{A}$ is a symmetric positive semi-definite matrix
- $k(\mathbf{x}, \mathbf{x}') = k_a(\mathbf{x}_a, \mathbf{x}'_a) + k_b(\mathbf{x}_b, \mathbf{x}'_b)$ and $k(\mathbf{x}, \mathbf{x}') = k_a(\mathbf{x}_a, \mathbf{x}'_a)\, k_b(\mathbf{x}_b, \mathbf{x}'_b)$, where $\mathbf{x}_a$ and $\mathbf{x}_b$ are variables (not necessarily disjoint) with $\mathbf{x} = (\mathbf{x}_a, \mathbf{x}_b)$, and $k_a$ and $k_b$ are valid kernel functions over their respective spaces. [Bishop]

SLIDE 29

Valid kernel: Necessary & sufficient conditions

- Gram matrix $\mathbf{K}_{N \times N}$: $K_{ij} = k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$
  - Restricting the kernel function to a set of points $\{\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \dots, \mathbf{x}^{(N)}\}$:

  $\mathbf{K} = \begin{bmatrix} k(\mathbf{x}^{(1)}, \mathbf{x}^{(1)}) & \cdots & k(\mathbf{x}^{(1)}, \mathbf{x}^{(N)}) \\ \vdots & \ddots & \vdots \\ k(\mathbf{x}^{(N)}, \mathbf{x}^{(1)}) & \cdots & k(\mathbf{x}^{(N)}, \mathbf{x}^{(N)}) \end{bmatrix}$

- Mercer theorem: the kernel is valid if and only if the kernel matrix is symmetric positive semi-definite (for any choice of data points)
- Any symmetric positive semi-definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some space [Shawe-Taylor & Cristianini 2004]
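The condition can be probed numerically on a sample of points; a sketch (the function name, tolerance, and test kernels are our choices). Passing on one point set is only evidence, not proof: Mercer's condition quantifies over all point sets.

```python
import numpy as np

def is_psd_gram(X, kernel, tol=1e-10):
    """Numerically test that the Gram matrix of `kernel` on X is symmetric PSD."""
    K = np.array([[kernel(a, b) for b in X] for a in X])
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol

X = np.random.default_rng(1).normal(size=(30, 3))
print(is_psd_gram(X, lambda a, b: (a @ b + 1) ** 2))    # True: polynomial kernel is valid
print(is_psd_gram(X, lambda a, b: np.tanh(a @ b - 1)))  # often False: sigmoid is not PSD in general
```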

SLIDE 30

Extending linear methods to kernelized ones

- Kernelized versions of linear methods
  - Linear methods are popular: unique optimal solutions, faster learning algorithms, and simpler analysis
  - However, real-world problems often require nonlinear methods, so we can use kernel-based versions of these linear algorithms
- Replacing inner products with kernels in linear algorithms ⇒ very flexible methods
  - We can operate in the mapped space without ever computing the coordinates of the data in that space

SLIDE 31

Example: kernelized minimum distance classifier

- If $\|\mathbf{x} - \boldsymbol{\mu}_1\| < \|\mathbf{x} - \boldsymbol{\mu}_2\|$, then assign $\mathbf{x}$ to $\mathcal{C}_1$:

  $(\mathbf{x} - \boldsymbol{\mu}_1)^T (\mathbf{x} - \boldsymbol{\mu}_1) < (\mathbf{x} - \boldsymbol{\mu}_2)^T (\mathbf{x} - \boldsymbol{\mu}_2)$
  $-2\, \mathbf{x}^T \boldsymbol{\mu}_1 + \boldsymbol{\mu}_1^T \boldsymbol{\mu}_1 < -2\, \mathbf{x}^T \boldsymbol{\mu}_2 + \boldsymbol{\mu}_2^T \boldsymbol{\mu}_2$

  Substituting the class means $\boldsymbol{\mu}_c = \frac{1}{N_c} \sum_{n \in \mathcal{C}_c} \mathbf{x}^{(n)}$:

  $-\frac{2}{N_1} \sum_{n \in \mathcal{C}_1} \mathbf{x}^T \mathbf{x}^{(n)} + \frac{1}{N_1^2} \sum_{n \in \mathcal{C}_1} \sum_{m \in \mathcal{C}_1} \mathbf{x}^{(n)T} \mathbf{x}^{(m)} < -\frac{2}{N_2} \sum_{n \in \mathcal{C}_2} \mathbf{x}^T \mathbf{x}^{(n)} + \frac{1}{N_2^2} \sum_{n \in \mathcal{C}_2} \sum_{m \in \mathcal{C}_2} \mathbf{x}^{(n)T} \mathbf{x}^{(m)}$

  Replacing every inner product with the kernel:

  $-\frac{2}{N_1} \sum_{n \in \mathcal{C}_1} K(\mathbf{x}, \mathbf{x}^{(n)}) + \frac{1}{N_1^2} \sum_{n \in \mathcal{C}_1} \sum_{m \in \mathcal{C}_1} K(\mathbf{x}^{(n)}, \mathbf{x}^{(m)}) < -\frac{2}{N_2} \sum_{n \in \mathcal{C}_2} K(\mathbf{x}, \mathbf{x}^{(n)}) + \frac{1}{N_2^2} \sum_{n \in \mathcal{C}_2} \sum_{m \in \mathcal{C}_2} K(\mathbf{x}^{(n)}, \mathbf{x}^{(m)})$

SLIDE 32

Which information can be obtained from the kernel?

- Example: we know all pairwise distances

  $d(\boldsymbol{\phi}(\mathbf{x}), \boldsymbol{\phi}(\mathbf{z}))^2 = \|\boldsymbol{\phi}(\mathbf{x}) - \boldsymbol{\phi}(\mathbf{z})\|^2 = k(\mathbf{x}, \mathbf{x}) + k(\mathbf{z}, \mathbf{z}) - 2\, k(\mathbf{x}, \mathbf{z})$

  - Therefore, we also know the distance of points from the center of mass of a set
- Many dimensionality reduction, clustering, and classification methods can be described in terms of pairwise distances.
  - This allows us to introduce kernelized versions of them.
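The full pairwise-distance matrix follows directly from the Gram matrix; a sketch (function name ours; the `maximum` guards against tiny negative values from round-off):

```python
import numpy as np

def feature_space_distances(X, k):
    """Pairwise distances ||phi(x_i) - phi(x_j)|| computed from the kernel alone."""
    K = np.array([[k(a, b) for b in X] for a in X])
    d = np.diag(K)
    return np.sqrt(np.maximum(d[:, None] + d[None, :] - 2 * K, 0.0))
```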

SLIDE 33

Example: Kernel ridge regression

$\min_{\mathbf{w}} \ \sum_{n=1}^{N} \left(\mathbf{w}^T \mathbf{x}^{(n)} - y^{(n)}\right)^2 + \lambda\, \mathbf{w}^T \mathbf{w}$

Setting the gradient to zero:

$\sum_{n=1}^{N} 2\, \mathbf{x}^{(n)} \left(\mathbf{w}^T \mathbf{x}^{(n)} - y^{(n)}\right) + 2 \lambda \mathbf{w} = \mathbf{0} \ \Rightarrow \ \mathbf{w} = \sum_{n=1}^{N} \alpha_n \mathbf{x}^{(n)}$

where $\alpha_n = -\frac{1}{\lambda} \left(\mathbf{w}^T \mathbf{x}^{(n)} - y^{(n)}\right)$

SLIDE 34

Example: Kernel ridge regression (Cont'd)

$\min_{\mathbf{w}} \ \sum_{n=1}^{N} \left(\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}^{(n)}) - y^{(n)}\right)^2 + \lambda\, \mathbf{w}^T \mathbf{w}$

- Dual representation, with $\mathbf{w} = \sum_{n=1}^{N} \alpha_n \boldsymbol{\phi}(\mathbf{x}^{(n)}) = \boldsymbol{\Phi}^T \boldsymbol{\alpha}$:

  $J(\boldsymbol{\alpha}) = \boldsymbol{\alpha}^T \boldsymbol{\Phi} \boldsymbol{\Phi}^T \boldsymbol{\Phi} \boldsymbol{\Phi}^T \boldsymbol{\alpha} - 2\, \boldsymbol{\alpha}^T \boldsymbol{\Phi} \boldsymbol{\Phi}^T \mathbf{y} + \mathbf{y}^T \mathbf{y} + \lambda\, \boldsymbol{\alpha}^T \boldsymbol{\Phi} \boldsymbol{\Phi}^T \boldsymbol{\alpha}$
  $J(\boldsymbol{\alpha}) = \boldsymbol{\alpha}^T \mathbf{K} \mathbf{K} \boldsymbol{\alpha} - 2\, \boldsymbol{\alpha}^T \mathbf{K} \mathbf{y} + \mathbf{y}^T \mathbf{y} + \lambda\, \boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha}$

  $\nabla_{\boldsymbol{\alpha}} J(\boldsymbol{\alpha}) = \mathbf{0} \ \Rightarrow \ \boldsymbol{\alpha} = \left(\mathbf{K} + \lambda \mathbf{I}_N\right)^{-1} \mathbf{y}$
slide-35
SLIDE 35

Example: Kernel ridge regression (Contโ€™d)

35

} Prediction for new ๐’š:

๐‘” ๐’š = ๐’™2๐œš ๐’š = ๐œท2๐šพ๐œš ๐’š = ๐ฟ(๐’š(,), ๐’š) โ‹ฎ ๐ฟ(๐’š(H), ๐’š)

2

๐‘ณ + ๐œ‡๐‘ฑH j,๐’›

๐’™ = ๐šพ2๐œท

SLIDE 36

Kernels for structured data

- Kernels can also be defined on general types of data
  - Kernel functions do not need to be defined over vectors; we just need a symmetric positive semi-definite Gram matrix
- Thus, many algorithms can work with general (non-vectorial) data
  - Kernels exist to embed strings, trees, graphs, ...
  - This may be even more important than nonlinearity: it yields kernel-based versions of classical learning algorithms for recognition of structured data

SLIDE 37

Kernel function for objects

- Sets: an example of a kernel function on sets:

  $k(A_1, A_2) = 2^{|A_1 \cap A_2|}$

- Strings: the inner product of the feature vectors for two strings can be defined as, e.g., a sum over all common subsequences, weighted according to their frequency of occurrence and lengths

[Figure: two example strings, AEGATEAGG and EGTEAGAEGATG, with common subsequences highlighted]
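The set kernel above is a one-liner; a sketch (the slide gives only the formula, so this implementation is ours):

```python
def set_kernel(A, B):
    """k(A1, A2) = 2^|A1 ∩ A2|: the number of subsets the two sets have in common."""
    return 2 ** len(set(A) & set(B))

print(set_kernel({1, 2, 3}, {2, 3, 5}))  # intersection {2, 3} -> 2**2 = 4
```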

SLIDE 38

Kernel trick advantages: summary

- Operating in the mapped space without ever computing the coordinates of the data in that space
- Besides vectors, we can introduce kernel functions for structured data (graphs, strings, etc.)
- Much of the geometry of the data in the embedding space is contained in all pairwise dot products
- In many cases, the inner product in the embedding space can be computed efficiently.

SLIDE 39

Resources

- C. Bishop, "Pattern Recognition and Machine Learning", Sections 6.1-6.2 and 7.1.
- Y. S. Abu-Mostafa et al., "Learning from Data", Chapter 8.