ADVANCED MACHINE LEARNING: Kernel PCA


SLIDE 1

ADVANCED MACHINE LEARNING Kernel PCA

SLIDE 2

Overview: Today’s Lecture

  • Brief Recap of Classical Principal Component Analysis (PCA)
  • Derivation of kernel PCA
  • Exercises to develop a geometrical intuition of kernel PCA
SLIDE 3

Principal Component Analysis: Overview

Take samples of two classes (yellow and pink classes). Each image, of size $320 \times 240 \times 3 = 230400$ values, is a high-dimensional vector $x \in \mathbb{R}^{230400}$.

SLIDE 4

Principal Component Analysis: Overview

Project the images onto a lower-dimensional space through a matrix $A$:
$$y = Ax, \qquad x \in \mathbb{R}^{230400},\ y \in \mathbb{R}^{2},\ A \in \mathbb{R}^{2 \times 230400}$$

[Figure: the projected points in 2D, with a separating line between the two classes]
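As a minimal numpy sketch of this projection step (the random $A$ below is only a placeholder; PCA will discover the actual matrix on the following slides):

```python
import numpy as np

N = 230400                  # 320 * 240 * 3 values per flattened image
x = np.random.rand(N)       # a stand-in image as a high-dimensional vector
A = np.random.randn(2, N)   # placeholder projection matrix (PCA chooses the real one)
y = A @ x                   # low-dimensional representation
print(y.shape)              # (2,)
```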

SLIDE 5

Principal Component Analysis: Overview

Project the images onto a lower-dimensional space through a matrix $A$: $y = Ax$, $y \in \mathbb{R}^{2}$, $A \in \mathbb{R}^{2 \times 230400}$.

What is $A$? PCA discovers the matrix $A$.

SLIDE 6

Principal Component Analysis: Overview

[Figure: 2D data cloud with axes $x_1$ and $x_2$, and candidate 1D projection directions]

What is the 2D-to-1D projection that minimizes the reconstruction error? There is an infinite number of choices for the projection matrix $A$ $\Rightarrow$ we need criteria to reduce the choice. Criterion 1: minimum information loss (minimal reconstruction error).

SLIDE 7

Principal Component Analysis: Overview

[Figure: two candidate projection directions on the same 2D data (axes $x_1$, $x_2$); along the best direction the smallest breadth of the data is lost and the largest breadth is conserved; the reconstruction after projection is also shown]

What is the 2D-to-1D projection that minimizes the reconstruction error? An infinite number of choices for the projection matrix $A$ $\Rightarrow$ need criteria to reduce the choice. Criterion 1: minimum information loss (minimal reconstruction error). Criterion 2: equivalently, find the direction with maximum variance.

SLIDE 8
Principal Component Analysis: Overview

1. Compute the covariance matrix of the dataset $X = \left[x^1, x^2, \ldots, x^M\right]$ (the data is centered, i.e. $E[X] = 0$):
$$C = E\!\left[XX^T\right] = \frac{1}{M} XX^T$$

2. Find the eigenvalue decomposition $C = V \Lambda V^T$, where $V = \left[e^1, \ldots, e^N\right]$ is the matrix of eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues $\lambda^1, \ldots, \lambda^N$, ordered so that $\lambda^1 \geq \lambda^2 \geq \ldots \geq \lambda^N$.

3. The eigenvectors form a basis of the space; $e^1$ is aligned with the axis of maximum variance.

4. Project the data onto the eigenvectors. Remove projections with low $\lambda$ (noise).
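A minimal numpy sketch of these four steps (assuming, as on the slide, centered data stored one datapoint per column; the toy data is illustrative):

```python
import numpy as np

def pca(X):
    """Classical PCA. X: (N, M) array, one centered datapoint per column."""
    M = X.shape[1]
    C = (X @ X.T) / M                    # covariance matrix C = (1/M) X X^T
    eigvals, V = np.linalg.eigh(C)       # eigendecomposition C = V diag(eigvals) V^T
    order = np.argsort(eigvals)[::-1]    # sort so that lambda^1 >= lambda^2 >= ...
    return eigvals[order], V[:, order]

# toy example: 200 points in R^3, centered before calling pca
X = np.random.randn(3, 200)
X -= X.mean(axis=1, keepdims=True)
lam, V = pca(X)
Y = V[:, :2].T @ X                       # project onto the 2 leading eigenvectors
```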

SLIDE 9

PCA for Data Compression

[Figure: original image vs. the same image compressed by 90%]

The original image is encoded in $x \in \mathbb{R}^{N}$. The compressed image is $y = A_p x$, $y \in \mathbb{R}^{p}$ with $p \ll N$, where the rows of $A_p$ contain the first $p$ eigenvectors.
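A sketch of this compression step, reusing lam and V from the PCA sketch above (the 10% ratio and the random vector are illustrative assumptions, not values from the slides):

```python
import numpy as np

# assumes lam, V come from the pca() sketch above, fit on a set of flattened images
p = max(1, int(0.1 * V.shape[0]))   # keep ~10% of the components (90% compression)
A_p = V[:, :p].T                    # rows of A_p are the first p eigenvectors

x = np.random.randn(V.shape[0])     # a (centered) image as a vector in R^N
y = A_p @ x                         # compressed code, y in R^p
x_hat = A_p.T @ y                   # reconstruction with minimal squared error
```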

SLIDE 10

PCA for Feature Extraction

Results of decomposition with Principal Component Analysis: eigenvectors.

The main differences across groups of images are encapsulated in the first eigenvectors. Detailed features (e.g. glasses) get encapsulated next, in the following eigenvectors.

SLIDE 11

Principal Component Analysis: Pros & Cons

Limitations:
a) PCA assumes a linear transformation: with centering of the data, one can only do a rotation in space.
b) It fails at finding directions that require a non-linear transformation.

Advantages:
a) The projection through PCA ensures minimal reconstruction error.
b) The projection does not distort the space (it is a rotation in space) $\Rightarrow$ ease of visualization/interpretation: the features that appear in the projections are often visually interpretable.

SLIDE 12

Revisiting the hypotheses of PCA

PCA assumed a linear transformation $\Rightarrow$ Non-linear PCA (Kernel PCA): find a non-linear embedding of the data and then perform linear PCA.

SLIDE 13

Recall: Principle of Kernel Methods (Going Back to Linearity)

Find a non-linear transformation that sends the data into a space where linear computation is again feasible.

SLIDE 14

Kernel PCA: Principle

Determine a transformation which brings out features of the data so as to make subsequent computation easier.

[Figure: data in the original space (axes $x_1$, $x_2$) vs. the same data after lifting into feature space (axes $v_1$, $v_2$)]

Example above: the data becomes linearly separable when using an RBF kernel and projecting onto the first two principal components of kernel PCA.
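This effect can be reproduced with scikit-learn's KernelPCA; a sketch, where the make_circles dataset and the gamma value are illustrative choices rather than taken from the slides:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# two concentric rings: not linearly separable in the original space
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# lift with an RBF kernel and project onto the first two kernel principal components
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)
Z = kpca.fit_transform(X)
# in Z, the two rings can now be separated by a line
```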

SLIDE 15

Kernel PCA: Principle

(Schölkopf et al., Neural Computation, 1998)

Idea: Send the data $X = \{x^i\}_{i=1\ldots M}$, $x^i \in \mathbb{R}^{N}$, into a feature space $H$ through a nonlinear map $\phi$:
$$\{x^1, \ldots, x^M\} \;\mapsto\; \{\phi(x^1), \ldots, \phi(x^M)\}$$
Then perform linear PCA in feature space and project onto the set of eigenvectors in feature space.

[Figure: points $x^1$, $x^2$ in the original space (axes $x_1$, $x_2$) are mapped to $\phi(x^1)$, $\phi(x^2)$ in $H$; $P\phi(x)$ denotes the projection in feature space]

SLIDE 16

Kernel PCA: Principle

[Figure: the original space (axes $x_1$, $x_2$) and the data projected onto the two first principal components $v_1$, $v_2$ in feature space]

Idea: Send the data $X = \{x^i\}_{i=1\ldots M}$, $x^i \in \mathbb{R}^{N}$, into a feature space $H$ through a nonlinear map $\phi$. Perform linear PCA in feature space and project onto the set of eigenvectors in feature space.

Determining $\phi$ is difficult $\Rightarrow$ Kernel Trick.

SLIDE 17

Linear PCA in Feature Space

Sending the data into feature space through $\phi$:
$$\phi: X \to H, \quad x \mapsto \phi(x)$$

Assume that, in feature space $H$, the data are centered:
$$\frac{1}{M}\sum_{i=1}^{M} \phi(x^i) = 0$$

The covariance matrix in feature space is:
$$C_\phi = \frac{1}{M} F F^{T}$$
where the columns of $F$ are composed of $\phi(x^i)$, $i = 1, \ldots, M$.
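In practice one never forms $F$ or $C_\phi$ explicitly; everything goes through the $M \times M$ Gram matrix introduced on the next slides. The sketch below is a standard-practice assumption rather than content from these slides: for raw data, the centering assumed above is achieved by centering the Gram matrix, $K \leftarrow K - 1_M K - K 1_M + 1_M K 1_M$, where $1_M$ is the matrix with all entries $1/M$.

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Gram matrix K_ij = exp(-gamma * ||x^i - x^j||^2). X: (M, N), one point per row."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

def center_gram(K):
    """Center the data in feature space: K <- K - 1_M K - K 1_M + 1_M K 1_M."""
    M = K.shape[0]
    one = np.full((M, M), 1.0 / M)
    return K - one @ K - K @ one + one @ K @ one
```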

SLIDE 18

Linear PCA in Feature Space

As in the original space, the covariance matrix in feature space can be diagonalized; we now have to find the eigenvalues $\lambda^i > 0$ and eigenvectors $v^i$ satisfying:
$$C_\phi v^i = \lambda^i v^i$$

Primal eigenvalue problem: finding the eigenvectors $v$ of $C_\phi$. Not possible in feature space! $\Rightarrow$ Formulate everything as a dot product and use the kernel trick!

SLIDE 19
From Linear PCA to Kernel PCA

Rewriting PCA in terms of dot products: using $C_\phi = \frac{1}{M}\sum_{j=1}^{M} \phi(x^j)\,\phi(x^j)^{T}$ together with $C_\phi v^i = \lambda^i v^i$, we obtain:
$$\lambda^i v^i = \frac{1}{M}\sum_{j=1}^{M} \phi(x^j)\,\langle \phi(x^j), v^i \rangle$$

Hence each eigenvector $v^1, \ldots, v^M$ can be expressed as a linear combination of the images of the datapoints:
$$v^i = \sum_{j=1}^{M} \alpha^i_j\, \phi(x^j), \qquad \alpha^i_j = \frac{1}{\lambda^i M}\,\langle \phi(x^j), v^i \rangle \ \text{(a scalar)}$$

SLIDE 20

Linear PCA in Feature Space

Multiplying the eigenvalue equation $C_\phi v^i = \lambda^i v^i$ by $\phi(x^j)^{T}$ on both sides, we have:
$$\langle \phi(x^j), C_\phi v^i \rangle = \lambda^i \langle \phi(x^j), v^i \rangle, \qquad i, j = 1, \ldots, M$$
with
$$v^i = \sum_{l=1}^{M} \alpha^i_l\, \phi(x^l), \qquad C_\phi v^i = \frac{1}{M}\sum_{k=1}^{M} \phi(x^k)\,\langle \phi(x^k), v^i \rangle$$

SLIDE 21
Linear PCA in Feature Space

Substituting the expansion of $v^i$ into both sides gives, for $j = 1, \ldots, M$:
$$\frac{1}{M^2}\sum_{k=1}^{M}\sum_{l=1}^{M} \alpha^i_l\,\langle \phi(x^j), \phi(x^k)\rangle\,\langle \phi(x^k), \phi(x^l)\rangle = \frac{\lambda^i}{M}\sum_{l=1}^{M} \alpha^i_l\,\langle \phi(x^j), \phi(x^l)\rangle$$

Use the kernel trick: $k(x^i, x^j) = \langle \phi(x^i), \phi(x^j)\rangle =: K_{ij}$, with $K$ the Gram matrix. In matrix form the above reads $K^2 \alpha^i = \lambda^i M K \alpha^i$, which reduces to an eigenvalue problem of the form:
$$K \alpha^i = \lambda^i M \alpha^i$$

Dual eigenvalue problem: finding the dual eigenvectors $\alpha^i$.
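A numpy sketch of solving the dual problem (the toy data, kernel width, and centering step are illustrative assumptions; note that np.linalg.eigh returns the eigenvalues of $K$, so $\lambda^i$ is recovered by dividing by $M$):

```python
import numpy as np

# toy data: M = 50 points in R^2 (illustrative; any dataset works)
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))

# RBF Gram matrix K_ij = exp(-||x^i - x^j||^2), then centering in feature space
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq)
M = K.shape[0]
one = np.full((M, M), 1.0 / M)
K = K - one @ K - K @ one + one @ K @ one

# dual eigenvalue problem K alpha^i = (lambda^i M) alpha^i
eigvals, alphas = np.linalg.eigh(K)     # ascending eigenvalues of K
order = np.argsort(eigvals)[::-1]       # sort descending
eigvals, alphas = eigvals[order], alphas[:, order]
lam = eigvals / M                       # eigenvalues lambda^i of C_phi
keep = lam > 1e-12                      # keep non-zero eigenvalues only
lam, alphas = lam[keep], alphas[:, keep]
```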

SLIDE 22

Linear PCA in Feature Space

$\Rightarrow$ Eigenvalue problem of the form $K \alpha^i = \lambda^i M \alpha^i$, with $K$ the Gram matrix.

The solutions to the dual eigenvalue problem are given by all the eigenvectors $\alpha^1, \ldots, \alpha^M$ with non-zero eigenvalues $\lambda^1, \ldots, \lambda^M$.

Kernel PCA finds at most $M$ eigenvectors, where $M$ is the number of datapoints; typically $M \gg N$, the dimension of each datapoint.

SLIDE 23

Linear PCA in Feature Space

Requesting that the eigenvectors $v^i$ of $C_\phi$ be normalized, i.e. $\langle v^i, v^i \rangle = 1$ for $i = 1, \ldots, M$, is equivalent to asking that the dual eigenvectors $\alpha^1, \ldots, \alpha^M$ satisfy:
$$\langle \alpha^i, \alpha^i \rangle = \frac{1}{\lambda^i M}$$

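Continuing the numpy sketch from the previous slide (alphas, lam, and M as defined there), this normalization is a single rescaling: np.linalg.eigh returns unit-norm columns, i.e. $\langle \alpha^i, \alpha^i \rangle = 1$, so:

```python
# rescale the unit-norm columns returned by eigh so that
# <alpha^i, alpha^i> = 1 / (lambda^i * M), i.e. <v^i, v^i> = 1 in feature space
alphas = alphas / np.sqrt(lam * M)
```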
SLIDE 24

Constructing the kPCA projections

Projection of a query point $x$ onto eigenvector $v^i$ (a sum over all training points):
$$\langle v^i, \phi(x) \rangle = \sum_{j=1}^{M} \alpha^i_j\,\langle \phi(x^j), \phi(x) \rangle = \sum_{j=1}^{M} \alpha^i_j\, k(x^j, x)$$

We cannot see the projection in feature space! We can only compute the projection of each point onto each eigenvector.

Isolines group points with equal projection: all points $x$ s.t. $\langle v^i, \phi(x) \rangle = \text{cst}$.
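A sketch of this projection under the same assumptions as the previous snippets (RBF kernel; alphas are the normalized dual eigenvectors from above):

```python
import numpy as np

def kpca_project(X_query, X_train, alphas, gamma=1.0):
    """<v^i, phi(x)> = sum_j alpha^i_j k(x^j, x) for every query point x.

    X_query: (Q, N), X_train: (M, N), alphas: (M, n_components) normalized
    dual eigenvectors. Assumes the same RBF kernel used to build K.
    """
    sq = np.sum((X_query[:, None, :] - X_train[None, :, :]) ** 2, axis=-1)
    k = np.exp(-gamma * sq)   # (Q, M) kernel values between queries and training set
    # NOTE: if the Gram matrix was centered, k must be centered consistently
    # (omitted here to keep the sketch short).
    return k @ alphas         # (Q, n_components) projections
```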

SLIDE 25

kPCA projections: Exercise

Recall: the projection of a query point $x$ onto eigenvector $v^i$ is $\langle v^i, \phi(x) \rangle = \sum_{j=1}^{M} \alpha^i_j\, k(x^j, x)$, where the $\alpha^i$ are the dual eigenvectors, solutions of the eigenvalue decomposition of $K$.

Consider a 2-dimensional data space with two datapoints and the RBF kernel $k(x, x') = e^{-\|x - x'\|^2}$.
a) How many dual eigenvectors do you have, and what is their dimension?
b) Compute the eigenvectors and draw the isolines for the projections on each eigenvector (a sketch addressing (a) and the eigenvector computation follows below).
c) Repeat (b) for a homogeneous polynomial kernel with $p = 2$: $k(x, x') = \langle x, x' \rangle^2$.
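A hedged sketch for part (a) and the eigenvector computation in (b): with $M = 2$ datapoints the Gram matrix is $2 \times 2$, so there are two dual eigenvectors, each of dimension 2. For $K = [[1, b], [b, 1]]$ with $b = e^{-\|x^1 - x^2\|^2}$, the eigenvectors are proportional to $(1, 1)$ and $(1, -1)$ with eigenvalues $1 + b$ and $1 - b$, which numpy confirms (the two points below are an illustrative choice):

```python
import numpy as np

# two example datapoints in R^2 (any pair works; only their distance matters)
x1, x2 = np.array([0.0, 0.0]), np.array([1.0, 0.0])
b = np.exp(-np.sum((x1 - x2) ** 2))   # off-diagonal RBF value k(x1, x2)
K = np.array([[1.0, b], [b, 1.0]])    # 2 x 2 Gram matrix

eigvals, alphas = np.linalg.eigh(K)   # ascending eigenvalues
print(eigvals)                        # [1 - b, 1 + b]
print(alphas)                         # columns proportional to (1, -1) and (1, 1)
```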

SLIDE 26

kPCA projections: Exercise

Recall: the projection of a query point $x$ onto eigenvector $v^i$ is $\langle v^i, \phi(x) \rangle = \sum_{j=1}^{M} \alpha^i_j\, k(x^j, x)$, where the $\alpha^i$ are the dual eigenvectors, solutions of the eigenvalue decomposition of $K$.

Consider a 2-dimensional data space with three equidistant datapoints and the RBF kernel $k(x, x') = e^{-\|x - x'\|^2}$.
a) How many dual eigenvectors do you have, and what is their dimension?
b) Compute the eigenvectors and draw the isolines for the projections on each eigenvector.
c) Repeat (b) for a homogeneous polynomial kernel with $p = 2$: $k(x, x') = \langle x, x' \rangle^2$.
d) What happens if you take 3 non-equidistant datapoints?

SLIDE 27

Curse of Dimensionality

Kernel PCA is computationally very intensive. Computing the eigenvectors requires an eigenvalue decomposition of the Gram matrix (the kernel matrix is $M \times M$), whose size grows quadratically with the number of datapoints $M$. The cost of computing each projection in the original space also grows linearly with $M$. $\Rightarrow$ A variety of sparse methods have been proposed in the literature.

SLIDE 28

Summary

  • Kernel PCA offers an alternative to standard PCA to determine projections of the data while not assuming a linear transformation.
  • It exploits the kernel trick, replacing the inner product between pairs of datapoints in standard PCA (used to compute the covariance matrix) by the kernel function.
  • The computation of kernel PCA is simple, yet expensive, as it requires an eigenvalue decomposition of a very large matrix.
  • Current research develops sparse versions of the algorithm.