SLIDE 1

CE-717: Machine Learning

Sharif University of Technology Spring 2018

Soleymani

PCA & ICA

SLIDE 2

Dimensionality Reduction: Feature Selection vs. Feature Extraction

- Feature selection
  - Select a subset of a given feature set
- Feature extraction
  - A linear or non-linear transform on the original feature space

Feature selection: $(y_1, \dots, y_d)^T \rightarrow (y_{i_1}, \dots, y_{i_{d'}})^T$ with $d' < d$

Feature extraction: $(y_1, \dots, y_d)^T \rightarrow (z_1, \dots, z_{d'})^T = g(y_1, \dots, y_d)$

SLIDE 3

Feature Extraction

- Mapping of the original data to another space
- The criterion for feature extraction can differ depending on the problem setting
  - Unsupervised task: minimize the information loss (reconstruction error)
  - Supervised task: maximize the class discrimination in the projected space
- Feature extraction algorithms
  - Linear methods
    - Unsupervised: e.g., Principal Component Analysis (PCA)
    - Supervised: e.g., Linear Discriminant Analysis (LDA)
      - Also known as Fisher's Discriminant Analysis (FDA)
  - Non-linear methods
    - Supervised: e.g., MLP neural networks
    - Unsupervised: e.g., autoencoders

SLIDE 4

Feature Extraction

- Unsupervised feature extraction: the input is the $N \times d$ data matrix

$$\boldsymbol{Y} = \begin{bmatrix} y_1^{(1)} & \cdots & y_d^{(1)} \\ \vdots & \ddots & \vdots \\ y_1^{(N)} & \cdots & y_d^{(N)} \end{bmatrix}$$

  and the output is a mapping $g: \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$, or only the transformed data

$$\boldsymbol{Y}' = \begin{bmatrix} y_1'^{(1)} & \cdots & y_{d'}'^{(1)} \\ \vdots & \ddots & \vdots \\ y_1'^{(N)} & \cdots & y_{d'}'^{(N)} \end{bmatrix}$$

- Supervised feature extraction: the input is the data matrix $\boldsymbol{Y}$ together with the labels $\boldsymbol{Z} = \begin{bmatrix} z^{(1)} \\ \vdots \\ z^{(N)} \end{bmatrix}$, and the output is again a mapping $g: \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$ or only the transformed data $\boldsymbol{Y}'$.

SLIDE 5

Unsupervised Feature Reduction

- Visualization and interpretation: projection of high-dimensional data onto 2D or 3D
- Data compression: efficient storage, communication, and retrieval
- Pre-processing: improve accuracy by reducing the number of features
  - As a preprocessing step to reduce dimensions for supervised learning tasks
  - Helps avoid overfitting
- Noise removal
  - E.g., "noise" in images introduced by minor lighting variations or slightly different imaging conditions

SLIDE 6

Linear Transformation

- For a linear transformation, we find an explicit mapping $g(\boldsymbol{y}) = \boldsymbol{B}^T \boldsymbol{y}$ that can also transform new data vectors.

Original data $\boldsymbol{y} \in \mathbb{R}^d$, reduced data $\boldsymbol{y}' \in \mathbb{R}^{d'}$:

$$\boldsymbol{y}' = \boldsymbol{B}^T \boldsymbol{y}, \qquad \boldsymbol{B}^T \in \mathbb{R}^{d' \times d}, \quad d' < d$$

SLIDE 7

Linear Transformation

- Linear transformations are simple mappings:

$$\boldsymbol{y}' = \boldsymbol{B}^T \boldsymbol{y}, \qquad y_k' = \boldsymbol{b}_k^T \boldsymbol{y}, \quad k = 1, \dots, d'$$

$$\boldsymbol{B} = \begin{bmatrix} b_{11} & \cdots & b_{1d'} \\ \vdots & \ddots & \vdots \\ b_{d1} & \cdots & b_{dd'} \end{bmatrix}$$

SLIDE 8

Linear Dimensionality Reduction

- Unsupervised
  - Principal Component Analysis (PCA)
  - Independent Component Analysis (ICA)
  - Singular Value Decomposition (SVD)
  - Multi-Dimensional Scaling (MDS)
  - Canonical Correlation Analysis (CCA)
  - ...

SLIDE 9

Principal Component Analysis (PCA)

- Also known as the Karhunen-Loève (KL) transform
- Principal Components (PCs): orthogonal vectors that are ordered by the fraction of the total information (variation) in the corresponding directions
- Find the directions along which the data approximately lie
  - When the data is projected onto the first PC, the variance of the projected data is maximized

SLIDE 10

Principal Component Analysis (PCA)

- The "best" linear subspace (i.e., the one giving the least reconstruction error of the data):
  - Find the mean-reduced (centered) data
  - The axes are rotated to new (principal) axes such that:
    - Principal axis 1 has the highest variance, ..., principal axis i has the i-th highest variance
    - The principal axes are uncorrelated: the covariance between each pair of principal axes is zero
- Goal: reduce the dimensionality of the data while preserving as much of the variation present in the dataset as possible
- PCs can be found as the "best" eigenvectors of the covariance matrix of the data points

SLIDE 11

Principal components

- If the data has a Gaussian distribution $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, the direction of largest variance is given by the eigenvector of $\boldsymbol{\Sigma}$ corresponding to the largest eigenvalue of $\boldsymbol{\Sigma}$.

[Figure: data cloud with principal directions $\boldsymbol{w}_1$ and $\boldsymbol{w}_2$]

SLIDE 12

Example: random direction

SLIDE 13

Example: principal component

SLIDE 14

Covariance Matrix

$$\boldsymbol{\mu}_{\boldsymbol{y}} = \begin{bmatrix} \mu_1 \\ \vdots \\ \mu_d \end{bmatrix} = \begin{bmatrix} E(y_1) \\ \vdots \\ E(y_d) \end{bmatrix}, \qquad \boldsymbol{\Sigma} = E\left[ (\boldsymbol{y} - \boldsymbol{\mu}_{\boldsymbol{y}})(\boldsymbol{y} - \boldsymbol{\mu}_{\boldsymbol{y}})^T \right]$$

- ML estimate of the covariance matrix from data points $\{\boldsymbol{y}^{(i)}\}_{i=1}^{N}$:

$$\widehat{\boldsymbol{\Sigma}} = \frac{1}{N} \sum_{i=1}^{N} \left( \boldsymbol{y}^{(i)} - \widehat{\boldsymbol{\mu}} \right) \left( \boldsymbol{y}^{(i)} - \widehat{\boldsymbol{\mu}} \right)^T = \frac{1}{N} \widetilde{\boldsymbol{Y}}^T \widetilde{\boldsymbol{Y}}$$

where $\widehat{\boldsymbol{\mu}} = \frac{1}{N} \sum_{i=1}^{N} \boldsymbol{y}^{(i)}$ and the mean-centered data matrix is

$$\widetilde{\boldsymbol{Y}} = \begin{bmatrix} \widetilde{\boldsymbol{y}}^{(1)T} \\ \vdots \\ \widetilde{\boldsymbol{y}}^{(N)T} \end{bmatrix} = \begin{bmatrix} (\boldsymbol{y}^{(1)} - \widehat{\boldsymbol{\mu}})^T \\ \vdots \\ (\boldsymbol{y}^{(N)} - \widehat{\boldsymbol{\mu}})^T \end{bmatrix}$$

SLIDE 15

PCA: Steps

- Input: $N \times d$ data matrix $\boldsymbol{Y}$ (each row contains a $d$-dimensional data point)
- $\widehat{\boldsymbol{\mu}} = \frac{1}{N} \sum_{i=1}^{N} \boldsymbol{y}^{(i)}$
- $\widetilde{\boldsymbol{Y}} \leftarrow$ the mean of the data points is subtracted from every row of $\boldsymbol{Y}$
- $\boldsymbol{\Sigma} = \frac{1}{N} \widetilde{\boldsymbol{Y}}^T \widetilde{\boldsymbol{Y}}$ (covariance matrix)
- Compute the eigenvalues and eigenvectors of $\boldsymbol{\Sigma}$
- Pick the $d'$ eigenvectors corresponding to the largest eigenvalues and put them in the columns of $\boldsymbol{B} = [\boldsymbol{w}_1, \dots, \boldsymbol{w}_{d'}]$ (first PC through $d'$-th PC)
- $\boldsymbol{Y}' = \widetilde{\boldsymbol{Y}} \boldsymbol{B}$
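To make the steps concrete, here is a minimal NumPy sketch of this procedure (not from the slides; names such as `pca` and `d_prime` are illustrative):

```python
import numpy as np

def pca(Y, d_prime):
    """Project an (N x d) data matrix onto its first d_prime principal components."""
    mu = Y.mean(axis=0)                       # mean of each feature
    Y_centered = Y - mu                       # subtract the mean from every row
    cov = Y_centered.T @ Y_centered / len(Y)  # (1/N) * Y~^T Y~, the d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: the covariance matrix is symmetric
    order = np.argsort(eigvals)[::-1]         # sort eigenvalues in decreasing order
    B = eigvecs[:, order[:d_prime]]           # columns are the first d' PCs
    return Y_centered @ B, B, eigvals[order], mu

# Usage on random data: reduce 5-dimensional points to 2 dimensions.
Y = np.random.randn(200, 5) @ np.random.randn(5, 5)
Y_proj, B, eigvals, mu = pca(Y, d_prime=2)
print(Y_proj.shape)  # (200, 2)
```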

SLIDE 16

Find principal components

- Assume that the data is centered.
- Find the vector $\boldsymbol{w}$ that maximizes the sample variance of the projected data:

$$\arg\max_{\boldsymbol{w}} \; \frac{1}{N} \sum_{n=1}^{N} \left( \boldsymbol{w}^T \boldsymbol{y}^{(n)} \right)^2 = \frac{1}{N} \boldsymbol{w}^T \boldsymbol{Y}^T \boldsymbol{Y} \boldsymbol{w} \qquad \text{s.t. } \boldsymbol{w}^T \boldsymbol{w} = 1$$

- Lagrangian:

$$L(\boldsymbol{w}, \lambda) = \boldsymbol{w}^T \boldsymbol{Y}^T \boldsymbol{Y} \boldsymbol{w} - \lambda \boldsymbol{w}^T \boldsymbol{w}$$

$$\frac{\partial L}{\partial \boldsymbol{w}} = 0 \;\Rightarrow\; 2\boldsymbol{Y}^T \boldsymbol{Y} \boldsymbol{w} - 2\lambda \boldsymbol{w} = 0 \;\Rightarrow\; \boldsymbol{Y}^T \boldsymbol{Y} \boldsymbol{w} = \lambda \boldsymbol{w}$$

SLIDE 17

Find principal components

- For symmetric matrices, there exists a set of mutually orthogonal eigenvectors.
- Let $\boldsymbol{w}_1, \dots, \boldsymbol{w}_d$ denote the eigenvectors of $\boldsymbol{Y}^T \boldsymbol{Y}$, such that:

$$\boldsymbol{w}_j^T \boldsymbol{w}_k = 0 \quad \forall j \neq k, \qquad \boldsymbol{w}_j^T \boldsymbol{w}_j = 1 \quad \forall j$$

SLIDE 18

Find principal components

$$\boldsymbol{Y}^T \boldsymbol{Y} \boldsymbol{w} = \lambda \boldsymbol{w} \;\Rightarrow\; \boldsymbol{w}^T \boldsymbol{Y}^T \boldsymbol{Y} \boldsymbol{w} = \lambda \boldsymbol{w}^T \boldsymbol{w} = \lambda$$

- $\lambda$ is the amount of variance along the found direction $\boldsymbol{w}$ (the "energy" along that dimension).
- Sort the eigenvalues: $\lambda_1 \geq \lambda_2 \geq \lambda_3 \geq \cdots$
  - The first PC $\boldsymbol{w}_1$ is the eigenvector of the sample covariance matrix $\boldsymbol{Y}^T \boldsymbol{Y}$ associated with the largest eigenvalue.
  - The second PC $\boldsymbol{w}_2$ is the eigenvector associated with the second largest eigenvalue.
  - And so on...

SLIDE 19

Another Interpretation: Least Squares Error

- PCs are linear least squares fits to the samples, each orthogonal to the previous PCs:
  - The first PC is a minimum-distance fit to a vector in the original feature space.
  - The second PC is a minimum-distance fit to a vector in the plane perpendicular to the first PC.
  - ...

SLIDE 20

Least Squares Error and Maximum Variance Views Are Equivalent (1-dim Interpretation)

- When the data are mean-removed:
  - Minimizing the sum of squared distances to the line is equivalent to maximizing the sum of squares of the projections onto that line (Pythagoras).

[Figure: a data point, its projection onto a line through the origin, and the residual]
SLIDE 21

Two interpretations

- Maximum variance subspace:

$$\arg\max_{\boldsymbol{w}} \; \sum_{n=1}^{N} \left( \boldsymbol{w}^T \boldsymbol{y}^{(n)} \right)^2 = \boldsymbol{w}^T \boldsymbol{Y}^T \boldsymbol{Y} \boldsymbol{w}$$

- Minimum reconstruction error:

$$\arg\min_{\boldsymbol{w}} \; \sum_{n=1}^{N} \left\| \boldsymbol{y}^{(n)} - \left( \boldsymbol{w}^T \boldsymbol{y}^{(n)} \right) \boldsymbol{w} \right\|^2$$

[Figure: a point $\boldsymbol{y}$, its projection $(\boldsymbol{w}^T\boldsymbol{y})\boldsymbol{w}$ onto the line through the origin, and the residual]
- blue² + red² = green²; green² is fixed (it is the data), so maximizing red² (the projection) is equivalent to minimizing blue² (the residual).

SLIDE 22

PCA: Uncorrelated Features

$$\boldsymbol{y}' = \boldsymbol{B}^T \boldsymbol{y} \;\Rightarrow\; \boldsymbol{S}_{\boldsymbol{y}'} = E\left[\boldsymbol{y}' \boldsymbol{y}'^T\right] = E\left[\boldsymbol{B}^T \boldsymbol{y} \boldsymbol{y}^T \boldsymbol{B}\right] = \boldsymbol{B}^T E\left[\boldsymbol{y} \boldsymbol{y}^T\right] \boldsymbol{B} = \boldsymbol{B}^T \boldsymbol{S}_{\boldsymbol{y}} \boldsymbol{B}$$

- If $\boldsymbol{B} = [\boldsymbol{b}_1, \dots, \boldsymbol{b}_d]$, where $\boldsymbol{b}_1, \dots, \boldsymbol{b}_d$ are orthonormal eigenvectors of $\boldsymbol{S}_{\boldsymbol{y}}$:

$$\boldsymbol{S}_{\boldsymbol{y}'} = \boldsymbol{B}^T \boldsymbol{S}_{\boldsymbol{y}} \boldsymbol{B} = \boldsymbol{B}^T \left( \boldsymbol{B} \boldsymbol{\Lambda} \boldsymbol{B}^T \right) \boldsymbol{B} = \boldsymbol{\Lambda} \;\Rightarrow\; E\left[ y_j' y_k' \right] = 0 \quad \forall j \neq k, \; j,k = 1,\dots,d$$

  then mutually uncorrelated features are obtained.
- Completely uncorrelated features avoid information redundancies.

SLIDE 23

Reconstruction

$$\boldsymbol{y}' = \begin{bmatrix} \boldsymbol{w}_1^T \boldsymbol{y} \\ \vdots \\ \boldsymbol{w}_{d'}^T \boldsymbol{y} \end{bmatrix} = \begin{bmatrix} \boldsymbol{w}_1^T \\ \vdots \\ \boldsymbol{w}_{d'}^T \end{bmatrix} \boldsymbol{y} = \boldsymbol{B}^T \boldsymbol{y}, \qquad \boldsymbol{B} = [\boldsymbol{w}_1, \dots, \boldsymbol{w}_{d'}]$$

- Incorporating all eigenvectors in $\boldsymbol{B} = [\boldsymbol{w}_1, \dots, \boldsymbol{w}_d]$:

$$\boldsymbol{y}' = \boldsymbol{B}^T \boldsymbol{y} \;\Rightarrow\; \boldsymbol{B}\boldsymbol{y}' = \boldsymbol{B}\boldsymbol{B}^T \boldsymbol{y} = \boldsymbol{y} \;\Rightarrow\; \boldsymbol{y} = \boldsymbol{B}\boldsymbol{y}'$$

- Therefore, if $d' = d$ then $\boldsymbol{y}$ can be reconstructed exactly from $\boldsymbol{y}'$.

SLIDE 24

PCA Derivation: Relation between Eigenvalues and Variances

- The $k$-th largest eigenvalue of $\boldsymbol{S}_{\boldsymbol{y}}$ is the variance along the $k$-th PC:

$$\mathrm{var}\left( y_k' \right) = \boldsymbol{w}_k^T \boldsymbol{S}_{\boldsymbol{y}} \boldsymbol{w}_k = \lambda_k$$

SLIDE 25

PCA Derivation: Mean Square Error Approximation

- Incorporating only the $d'$ eigenvectors corresponding to the largest eigenvalues, $\boldsymbol{B} = [\boldsymbol{b}_1, \dots, \boldsymbol{b}_{d'}]$ ($d' < d$), minimizes the MSE between $\boldsymbol{y}$ and $\widehat{\boldsymbol{y}} = \boldsymbol{B}\boldsymbol{y}'$:

$$
\begin{aligned}
K(\boldsymbol{B}) &= E\left[ \left\| \boldsymbol{y} - \widehat{\boldsymbol{y}} \right\|^2 \right] = E\left[ \left\| \boldsymbol{y} - \boldsymbol{B}\boldsymbol{y}' \right\|^2 \right] = E\left[ \Big\| \sum_{k=d'+1}^{d} y_k' \boldsymbol{b}_k \Big\|^2 \right] \\
&= E\left[ \sum_{k=d'+1}^{d} \sum_{l=d'+1}^{d} y_k' \, \boldsymbol{b}_k^T \boldsymbol{b}_l \, y_l' \right] = E\left[ \sum_{k=d'+1}^{d} y_k'^2 \right] = \sum_{k=d'+1}^{d} E\left[ y_k'^2 \right] \\
&= \sum_{k=d'+1}^{d} \boldsymbol{b}_k^T E\left[ \boldsymbol{y}\boldsymbol{y}^T \right] \boldsymbol{b}_k = \sum_{k=d'+1}^{d} \boldsymbol{b}_k^T \boldsymbol{S}_{\boldsymbol{y}} \boldsymbol{b}_k = \sum_{k=d'+1}^{d} \lambda_k
\end{aligned}
$$

- That is, the MSE equals the sum of the $d - d'$ smallest eigenvalues.
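A quick numeric sanity check of this identity (my own sketch; variable names are illustrative): the average squared reconstruction error of centered data projected onto the top $d'$ eigenvectors equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))
Y = Y - Y.mean(axis=0)                       # centered data
S = Y.T @ Y / len(Y)                         # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

d_prime = 3
B = eigvecs[:, :d_prime]                     # keep the top d' eigenvectors
Y_hat = (Y @ B) @ B.T                        # reconstruction B y'
mse = np.mean(np.sum((Y - Y_hat) ** 2, axis=1))

print(np.isclose(mse, eigvals[d_prime:].sum()))  # True: sum of dropped eigenvalues
```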

SLIDE 26

PCA Derivation: Mean Square Error Approximation

- In general, it can also be shown that this MSE is smaller than that of any other approximation of $\boldsymbol{y}$ by any $d'$-dimensional orthonormal basis.
  - This result can be obtained without first assuming that the axes are eigenvectors of the correlation matrix.
- If the data is mean-centered in advance, the correlation matrix $\boldsymbol{S}_{\boldsymbol{y}} = E[\boldsymbol{y}\boldsymbol{y}^T]$ and the covariance matrix $\boldsymbol{\Sigma}_{\boldsymbol{y}}$ are the same.
  - However, in the correlation version, when $\boldsymbol{\Sigma}_{\boldsymbol{y}} \neq \boldsymbol{S}_{\boldsymbol{y}}$, the approximation is not, in general, a good one (although it is still a minimum-MSE solution).

SLIDE 27

PCA on Faces: "Eigenfaces"

- ORL face database

[Figure: some images from the ORL database]

SLIDE 28

PCA on Faces: "Eigenfaces"

- For the eigenfaces, "gray" = 0, "white" > 0, "black" < 0.

[Figure: the average face and the 1st through 6th PCs (eigenfaces)]

SLIDE 29

PCA on Faces:

- Feature vector: $[y_1', y_2', \dots, y_{d'}']$, where $y_i' = \boldsymbol{w}_i^T \boldsymbol{y}$ is the projection of $\boldsymbol{y}$ on the $i$-th PC.
- $\boldsymbol{y}$ is a $112 \times 92 = 10304$-dimensional vector containing the pixel intensities of the image.
- Reconstruction as the average face plus a weighted sum of eigenfaces:

$$\widehat{\boldsymbol{y}} = \bar{\boldsymbol{y}} + \sum_{i=1}^{d'} y_i' \, \boldsymbol{w}_i = \bar{\boldsymbol{y}} + \sum_{i=1}^{d'} \left( \boldsymbol{w}_i^T \boldsymbol{y} \right) \boldsymbol{w}_i$$

[Figure: a face image expressed as the average face $+\, y_1' \times$ (eigenface 1) $+\, y_2' \times$ (eigenface 2) $+ \cdots$]
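A sketch of this reconstruction in NumPy (my own illustration, not the ORL pipeline from the slides; `faces` is assumed to be an N × 10304 array of flattened 112 × 92 images, and the coefficients are computed on the mean-subtracted image, as is usual for eigenfaces):

```python
import numpy as np

def eigenface_reconstruct(faces, query, d_prime):
    """Reconstruct `query` from the mean face plus its top-d' eigenface coefficients."""
    mean_face = faces.mean(axis=0)
    centered = faces - mean_face
    # Eigenvectors of the covariance matrix via SVD of the centered data matrix.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    W = Vt[:d_prime].T                       # columns w_1..w_{d'} (the eigenfaces)
    coeffs = W.T @ (query - mean_face)       # y'_i = w_i^T (y - mean)
    return mean_face + W @ coeffs            # mean face + sum_i y'_i w_i

# Hypothetical usage with random stand-in data of the ORL image size:
faces = np.random.rand(400, 112 * 92)
recon = eigenface_reconstruct(faces, faces[0], d_prime=64)
print(recon.shape)  # (10304,)
```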

SLIDE 30

PCA on Faces: Reconstructed Face

[Figure: reconstructed faces for d' = 1, 2, 4, 8, 16, 32, 64, 128, 256, together with the original image]

SLIDE 31

Dimensionality reduction by PCA

- Data may lie near a linear subspace of the high-dimensional input space.
- Only keep the data projections onto principal components with large eigenvalues.
- Plot the eigenvalues (variances of the principal components) against their indices (a "scree plot").

[Figure: plot of the variance of PC 1 through PC 4 (and beyond) against the PC index]
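As an illustration (my own sketch), the eigenvalue spectrum and the cumulative fraction of variance explained can be computed as follows; a common rule of thumb is to keep enough PCs to cover, say, 90–95% of the variance.

```python
import numpy as np

rng = np.random.default_rng(2)
Y = rng.normal(size=(300, 10)) @ np.diag([5, 4, 3, 1, 1, .5, .5, .2, .1, .05])
Y = Y - Y.mean(axis=0)

eigvals = np.linalg.eigvalsh(Y.T @ Y / len(Y))[::-1]   # PC variances, largest first
explained = np.cumsum(eigvals) / eigvals.sum()         # cumulative fraction of variance

for k, frac in enumerate(explained, start=1):
    print(f"first {k:2d} PCs explain {frac:6.1%} of the variance")

d_prime = int(np.searchsorted(explained, 0.95) + 1)    # smallest d' covering 95%
print("chosen d' =", d_prime)
```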

SLIDE 32

PCA: Summary

- The global optimum is found by an eigenvector method.
- No parameter tuning.
- However, it is limited to:
  - second-order statistics
  - linear projections

SLIDE 33

PCA and LDA: Drawbacks

- PCA drawback: an excellent information-packing transform does not necessarily lead to good class separability.
  - The directions of maximum variance may be useless for classification purposes.
- LDA drawbacks:
  - Singularity or under-sampled problem (when $N < d$)
    - Example: gene expression data, images, text documents
  - Can reduce the dimension only to $d' \leq C - 1$ (unlike PCA), where $C$ is the number of classes.

[Figure: a two-class dataset comparing the PCA and LDA projection directions]

SLIDE 34

PCA vs. LDA

- Although LDA often provides more suitable features for classification tasks, PCA might outperform LDA in some situations:
  - when there is a lot of unlabeled data but no or only a small amount of labeled data
  - when the number of samples per class is small (overfitting problem of LDA)
  - when the number of desired features is more than $C - 1$
  - when the training data non-uniformly sample the underlying distribution
- Semi-supervised feature extraction
  - E.g., PCA+LDA, Regularized LDA, Local Fisher Discriminant Analysis (LFDA)

SLIDE 35

Kernel PCA

- Nonlinear extension of PCA: map the data to a feature (Hilbert) space, $\boldsymbol{y} \rightarrow \boldsymbol{\phi}(\boldsymbol{y})$, and run PCA there:

$$\boldsymbol{C} = \frac{1}{N} \sum_{i=1}^{N} \boldsymbol{\phi}\!\left(\boldsymbol{y}^{(i)}\right) \boldsymbol{\phi}\!\left(\boldsymbol{y}^{(i)}\right)^T$$

- All eigenvectors of $\boldsymbol{C}$ lie in the span of the mapped data points:

$$\boldsymbol{C}\boldsymbol{w} = \lambda \boldsymbol{w}, \qquad \boldsymbol{w} = \sum_{i=1}^{N} \alpha_i \boldsymbol{\phi}\!\left(\boldsymbol{y}^{(i)}\right)$$

- After some algebra, we obtain an eigenvalue problem on the $N \times N$ kernel matrix $\boldsymbol{K}$, with $K_{ij} = \boldsymbol{\phi}(\boldsymbol{y}^{(i)})^T \boldsymbol{\phi}(\boldsymbol{y}^{(j)})$:

$$\boldsymbol{K}\boldsymbol{\alpha} = N \lambda \boldsymbol{\alpha}$$

SLIDE 36

Kernel PCA

- Kernel extension of PCA: useful when the data (approximately) lies on a lower-dimensional non-linear space.

SLIDE 37

Laplacian eigenmap, LLE, etc.

- These methods need not predefine a kernel; they construct a graph on the data points.

[M. Belkin et al., Laplacian Eigenmaps]

SLIDE 38

Autoencoder

- Cost function: $\sum_{n=1}^{N} \left\| \boldsymbol{y}^{(n)} - \widehat{\boldsymbol{y}}^{(n)} \right\|^2$, where $\widehat{\boldsymbol{y}}^{(n)}$ is the network's reconstruction of the input $\boldsymbol{y}^{(n)}$.

[Figure: encoder-decoder network mapping $\boldsymbol{y}$ to $\widehat{\boldsymbol{y}}$ through a low-dimensional bottleneck]

SLIDE 39

Uncorrelated and Independent

- Gaussian:
  - Independent ⟺ Uncorrelated
- Non-Gaussian:
  - Independent ⇒ Uncorrelated
  - Uncorrelated ⇏ Independent

Uncorrelated: $\mathrm{cov}(Y_1, Y_2) = 0$. Independent: $P(Y_1, Y_2) = P(Y_1)\,P(Y_2)$.
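A tiny numeric illustration (my own): $Y_1$ uniform on $[-1, 1]$ and $Y_2 = Y_1^2$ are uncorrelated but clearly dependent; the dependence only shows up in higher-order statistics.

```python
import numpy as np

rng = np.random.default_rng(4)
y1 = rng.uniform(-1, 1, size=100_000)
y2 = y1 ** 2                                  # deterministic function of y1: fully dependent

cov = np.mean((y1 - y1.mean()) * (y2 - y2.mean()))
print(round(cov, 4))                          # ~0: uncorrelated

# Dependence appears in higher-order statistics, e.g. E[y1^2 y2] != E[y1^2] E[y2]:
print(round(np.mean(y1**2 * y2), 3), round(np.mean(y1**2) * np.mean(y2), 3))  # ~0.2 vs ~0.11
```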

SLIDE 40

ICA: Cocktail party problem

- Cocktail party problem
  - $d$ speakers are speaking simultaneously, and any microphone records only an overlapping combination of these voices.
    - Each microphone records a different combination of the speakers' voices.
  - Using these $n \geq d$ microphone recordings, can we separate out the original $d$ speakers' speech signals?
- Mixing matrix $\boldsymbol{B}$: $\boldsymbol{y} = \boldsymbol{B}\boldsymbol{s}$
- Unmixing matrix $\boldsymbol{B}^{-1}$: $\boldsymbol{s} = \boldsymbol{B}^{-1}\boldsymbol{y}$

where $s_l^{(t)}$ is the sound that speaker $l$ was uttering at time $t$, $y_j^{(t)}$ is the acoustic reading recorded by microphone $j$ at time $t$, and

$$y_j = \sum_{l=1}^{d} b_{jl}\, s_l$$
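As a sketch of the unmixing step (my own example, using scikit-learn's `FastICA` rather than the algorithm derived on the later slides), two independent non-Gaussian sources are mixed with a known matrix and then recovered up to permutation, sign, and scaling:

```python
import numpy as np
from sklearn.decomposition import FastICA   # assumes scikit-learn is installed

rng = np.random.default_rng(5)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sign(np.sin(3 * t)),           # source 1: square wave
          rng.uniform(-1, 1, len(t))]       # source 2: uniform noise (non-Gaussian)

B = np.array([[2.0, 3.0],
              [2.0, 1.0]])                  # mixing matrix (as in the slides' example)
Y = S @ B.T                                 # microphone recordings y = B s

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(Y)                # recovered sources (up to order and scale)
print(np.corrcoef(S_hat.T, S.T).round(2))   # each recovered source correlates with one true source
```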

SLIDE 41

ICA Assumptions

- The sources are independent of each other.
- The sources have non-Gaussian distributions.
  - (A Gaussian density is completely rotationally symmetric, which makes the mixing matrix unidentifiable, as shown on the following slides.)

SLIDE 42

Example

- Two independent sources with uniform densities:

$$p(s_i) = \begin{cases} \dfrac{1}{2\sqrt{3}} & \text{if } |s_i| \leq \sqrt{3} \\ 0 & \text{otherwise} \end{cases}$$

- mixed with $\boldsymbol{B} = \begin{bmatrix} 2 & 3 \\ 2 & 1 \end{bmatrix}$ to give the observed mixtures $(y_1, y_2)$.

[Figure: scatter of the observed mixtures $(y_1, y_2)$]

[Hyvärinen and Oja, Independent Component Analysis: Algorithms and Applications, 2000.]

SLIDE 43

ICA: Ambiguities

- We cannot determine the order of the independent components.
  - There is no way to distinguish between $\boldsymbol{B}\boldsymbol{s}$ and $(\boldsymbol{B}\boldsymbol{Q}^{-1})(\boldsymbol{Q}\boldsymbol{s})$, where $\boldsymbol{Q}$ is a permutation matrix.
- There is no way to recover the correct scaling of the sources.
  - There is no way to distinguish between $\boldsymbol{B}\boldsymbol{s}$ and $(\boldsymbol{B}/\alpha)(\alpha\boldsymbol{s})$.
- A Gaussian distribution $N(\boldsymbol{0}, \boldsymbol{I})$ of the sources is a further, more serious ambiguity (next slide).

SLIDE 44

Gaussian distribution $N(\boldsymbol{0}, \boldsymbol{I})$ of sources

$$\boldsymbol{y} = \boldsymbol{B}\boldsymbol{s}, \qquad \boldsymbol{y}' = \boldsymbol{B}'\boldsymbol{s}, \qquad \boldsymbol{B}' = \boldsymbol{B}\boldsymbol{R}, \quad \boldsymbol{R}\boldsymbol{R}^T = \boldsymbol{R}^T\boldsymbol{R} = \boldsymbol{I}$$

$$\Rightarrow\; E\left[\boldsymbol{y}'\boldsymbol{y}'^T\right] = E\left[\boldsymbol{B}\boldsymbol{R}\boldsymbol{s}\boldsymbol{s}^T\boldsymbol{R}^T\boldsymbol{B}^T\right] = \boldsymbol{B}\boldsymbol{B}^T = E\left[\boldsymbol{y}\boldsymbol{y}^T\right]$$

- There is no way to tell whether the sources were mixed using $\boldsymbol{B}$ or $\boldsymbol{B}'$:
  - there is an arbitrary rotational component that cannot be determined from the data, and thus we cannot recover the original sources.

SLIDE 45

Gaussian distribution $N(\boldsymbol{0}, \boldsymbol{I})$ of sources

- Any orthonormal transformation of $N(\boldsymbol{0}, \boldsymbol{I})$ sources has exactly the same distribution, so the mixing matrix $\boldsymbol{B}$ is not identifiable.
  - This is because the multivariate standard normal distribution is rotationally symmetric.
- As long as the sources are not Gaussian, it is possible, given enough data, to recover the $d$ independent sources.

SLIDE 46

ICA: Assumptions

- The density of the observations follows from the density of the sources by the change of variables

$$p_{\boldsymbol{y}}(\boldsymbol{y}) = p_{\boldsymbol{s}}(\boldsymbol{W}\boldsymbol{y})\,|\boldsymbol{W}|, \qquad \boldsymbol{W} = \boldsymbol{B}^{-1} = \begin{bmatrix} \boldsymbol{w}_1^T \\ \vdots \\ \boldsymbol{w}_d^T \end{bmatrix}, \quad s_i = \boldsymbol{w}_i^T \boldsymbol{y}$$

- Consider the assumption that the sources are independent:

$$p(\boldsymbol{s}) = \prod_{i=1}^{d} p(s_i)$$

SLIDE 47

ICA: Assumptions

- To define a density for each source, we can specify its cdf $F$:
  - $p_s(s) = F'(s)$
  - A cdf is a monotonic function that increases from zero to one.
- The sigmoid function $\sigma(s) = 1/(1 + e^{-s})$ is a reasonable choice for the cdf of the sources.
  - Hence, it is assumed that $p(s) = \sigma'(s)$.

SLIDE 48

ICA: Log likelihood

$$
\begin{aligned}
\ell(\boldsymbol{W}) &= \sum_{t=1}^{N} \log p_{\boldsymbol{y}}\!\left(\boldsymbol{y}^{(t)}\right)
= \sum_{t=1}^{N} \left[ \log p_{\boldsymbol{s}}\!\left(\boldsymbol{W}\boldsymbol{y}^{(t)}\right) + \log |\boldsymbol{W}| \right] \\
&= \sum_{t=1}^{N} \left[ \sum_{k=1}^{d} \log \sigma'\!\left(\boldsymbol{w}_k^T \boldsymbol{y}^{(t)}\right) + \log |\boldsymbol{W}| \right]
\end{aligned}
$$

- Stochastic gradient ascent on $\ell(\boldsymbol{W})$, using $\nabla_{\boldsymbol{W}} |\boldsymbol{W}| = |\boldsymbol{W}|\left(\boldsymbol{W}^{-1}\right)^T$:

$$
\boldsymbol{W} \leftarrow \boldsymbol{W} + \alpha \left(
\begin{bmatrix}
1 - 2\sigma\!\left(\boldsymbol{w}_1^T \boldsymbol{y}^{(t)}\right) \\
\vdots \\
1 - 2\sigma\!\left(\boldsymbol{w}_d^T \boldsymbol{y}^{(t)}\right)
\end{bmatrix}
\boldsymbol{y}^{(t)T} + \left(\boldsymbol{W}^{-1}\right)^T
\right)
$$

SLIDE 49

Independent Component Analysis (ICA)

- PCA:
  - The transformed dimensions are uncorrelated with each other.
  - Orthogonal linear transform.
  - Uses only second-order statistics (i.e., the covariance matrix).
- ICA:
  - The transformed dimensions are as independent as possible.
  - Non-orthogonal linear transform.
  - Higher-order statistics can also be used.

SLIDE 50

Summary

- PCA is a linear dimensionality reduction method that finds an orthonormal basis (minimizing the MSE of reconstruction).
- ICA finds a linear transformation that makes the components as independent as possible.

SLIDE 51

Resources

- C. Bishop, "Pattern Recognition and Machine Learning", Chapter 12.
- A. Hyvärinen and E. Oja, "Independent Component Analysis: Algorithms and Applications", Neural Networks, 2000.