SLIDE 1

CE-717: Machine Learning

Sharif University of Technology Spring 2018

Soleymani

PCA & ICA

SLIDE 2

Dimensionality Reduction: Feature Selection vs. Feature Extraction

- Feature selection
  - Select a subset of a given feature set
- Feature extraction
  - A linear or non-linear transform on the original feature space

Feature selection: $(y_1, \dots, y_d)^T \rightarrow (y_{i_1}, \dots, y_{i_{d'}})^T$ with $d' < d$

Feature extraction: $(y_1, \dots, y_d)^T \rightarrow (z_1, \dots, z_{d'})^T = g(y_1, \dots, y_d)$

SLIDE 3

Feature Extraction

- Mapping of the original data to another space
- The criterion for feature extraction can differ depending on the problem setting
  - Unsupervised task: minimize the information loss (reconstruction error)
  - Supervised task: maximize the class discrimination in the projected space
- Feature extraction algorithms
  - Linear methods
    - Unsupervised: e.g., Principal Component Analysis (PCA)
    - Supervised: e.g., Linear Discriminant Analysis (LDA)
      - Also known as Fisher's Discriminant Analysis (FDA)
  - Non-linear methods
    - Supervised: e.g., MLP neural networks
    - Unsupervised: e.g., autoencoders

SLIDE 4

Feature Extraction

- Unsupervised feature extraction: the input is the $N \times d$ data matrix

$$\boldsymbol{Y} = \begin{bmatrix} y_1^{(1)} & \cdots & y_d^{(1)} \\ \vdots & \ddots & \vdots \\ y_1^{(N)} & \cdots & y_d^{(N)} \end{bmatrix}$$

  and the output is a mapping $g: \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$, or only the transformed data

$$\boldsymbol{Y}' = \begin{bmatrix} y_1'^{(1)} & \cdots & y_{d'}'^{(1)} \\ \vdots & \ddots & \vdots \\ y_1'^{(N)} & \cdots & y_{d'}'^{(N)} \end{bmatrix}$$

- Supervised feature extraction: the input is the data matrix $\boldsymbol{Y}$ together with the labels $\boldsymbol{Z} = \begin{bmatrix} z^{(1)} \\ \vdots \\ z^{(N)} \end{bmatrix}$, and the output is again a mapping $g: \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$ or only the transformed data $\boldsymbol{Y}'$.

SLIDE 5

Unsupervised Feature Reduction

- Visualization and interpretation: projection of high-dimensional data onto 2D or 3D
- Data compression: efficient storage, communication, and retrieval
- Pre-processing: improve accuracy by reducing the number of features
  - As a preprocessing step to reduce dimensions for supervised learning tasks
  - Helps avoid overfitting
- Noise removal
  - E.g., "noise" in images introduced by minor lighting variations or slightly different imaging conditions

SLIDE 6

Linear Transformation

- For a linear transformation, we find an explicit mapping $g(\boldsymbol{y}) = \boldsymbol{B}^T \boldsymbol{y}$ that can also transform new data vectors.

Original data $\boldsymbol{y} \in \mathbb{R}^d$, reduced data $\boldsymbol{y}' \in \mathbb{R}^{d'}$:

$$\boldsymbol{y}' = \boldsymbol{B}^T \boldsymbol{y}, \qquad \boldsymbol{B}^T \in \mathbb{R}^{d' \times d}, \quad d' < d$$

SLIDE 7

Linear Transformation

- Linear transformations are simple mappings:

$$\boldsymbol{y}' = \boldsymbol{B}^T \boldsymbol{y}, \qquad y_k' = \boldsymbol{b}_k^T \boldsymbol{y}, \quad k = 1, \dots, d'$$

$$\boldsymbol{B} = \begin{bmatrix} b_{11} & \cdots & b_{1d'} \\ \vdots & \ddots & \vdots \\ b_{d1} & \cdots & b_{dd'} \end{bmatrix}$$

SLIDE 8

Linear Dimensionality Reduction

- Unsupervised
  - Principal Component Analysis (PCA)
  - Independent Component Analysis (ICA)
  - Singular Value Decomposition (SVD)
  - Multi-Dimensional Scaling (MDS)
  - Canonical Correlation Analysis (CCA)
  - ...

SLIDE 9

Principal Component Analysis (PCA)

- Also known as the Karhunen-Loève (KL) transform
- Principal Components (PCs): orthogonal vectors that are ordered by the fraction of the total information (variation) in the corresponding directions
- Find the directions along which the data approximately lie
  - When the data is projected onto the first PC, the variance of the projected data is maximized

SLIDE 10

Principal Component Analysis (PCA)

- The "best" linear subspace (i.e., the one giving the least reconstruction error of the data):
  - Find the mean-reduced (centered) data
  - The axes are rotated to new (principal) axes such that:
    - Principal axis 1 has the highest variance, ..., principal axis i has the i-th highest variance
    - The principal axes are uncorrelated: the covariance between each pair of principal axes is zero
- Goal: reduce the dimensionality of the data while preserving as much of the variation present in the dataset as possible
- PCs can be found as the "best" eigenvectors of the covariance matrix of the data points

SLIDE 11

Principal components

- If the data has a Gaussian distribution $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, the direction of largest variance is given by the eigenvector of $\boldsymbol{\Sigma}$ corresponding to the largest eigenvalue of $\boldsymbol{\Sigma}$.

[Figure: data cloud with principal directions $\boldsymbol{w}_1$ and $\boldsymbol{w}_2$]

SLIDE 12

Example: random direction

SLIDE 13

Example: principal component

SLIDE 14

Covariance Matrix

$$\boldsymbol{\mu}_{\boldsymbol{y}} = \begin{bmatrix} \mu_1 \\ \vdots \\ \mu_d \end{bmatrix} = \begin{bmatrix} E(y_1) \\ \vdots \\ E(y_d) \end{bmatrix}, \qquad \boldsymbol{\Sigma} = E\left[ (\boldsymbol{y} - \boldsymbol{\mu}_{\boldsymbol{y}})(\boldsymbol{y} - \boldsymbol{\mu}_{\boldsymbol{y}})^T \right]$$

- ML estimate of the covariance matrix from data points $\{\boldsymbol{y}^{(i)}\}_{i=1}^{N}$:

$$\widehat{\boldsymbol{\Sigma}} = \frac{1}{N} \sum_{i=1}^{N} \left( \boldsymbol{y}^{(i)} - \widehat{\boldsymbol{\mu}} \right) \left( \boldsymbol{y}^{(i)} - \widehat{\boldsymbol{\mu}} \right)^T = \frac{1}{N} \widetilde{\boldsymbol{Y}}^T \widetilde{\boldsymbol{Y}}$$

where $\widehat{\boldsymbol{\mu}} = \frac{1}{N} \sum_{i=1}^{N} \boldsymbol{y}^{(i)}$ and the mean-centered data matrix is

$$\widetilde{\boldsymbol{Y}} = \begin{bmatrix} \widetilde{\boldsymbol{y}}^{(1)T} \\ \vdots \\ \widetilde{\boldsymbol{y}}^{(N)T} \end{bmatrix} = \begin{bmatrix} (\boldsymbol{y}^{(1)} - \widehat{\boldsymbol{\mu}})^T \\ \vdots \\ (\boldsymbol{y}^{(N)} - \widehat{\boldsymbol{\mu}})^T \end{bmatrix}$$

SLIDE 15

PCA: Steps

- Input: $N \times d$ data matrix $\boldsymbol{Y}$ (each row contains a $d$-dimensional data point)
- $\widehat{\boldsymbol{\mu}} = \frac{1}{N} \sum_{i=1}^{N} \boldsymbol{y}^{(i)}$
- $\widetilde{\boldsymbol{Y}} \leftarrow$ the mean of the data points is subtracted from every row of $\boldsymbol{Y}$
- $\boldsymbol{\Sigma} = \frac{1}{N} \widetilde{\boldsymbol{Y}}^T \widetilde{\boldsymbol{Y}}$ (covariance matrix)
- Compute the eigenvalues and eigenvectors of $\boldsymbol{\Sigma}$
- Pick the $d'$ eigenvectors corresponding to the largest eigenvalues and put them in the columns of $\boldsymbol{B} = [\boldsymbol{w}_1, \dots, \boldsymbol{w}_{d'}]$ (first PC through $d'$-th PC)
- $\boldsymbol{Y}' = \widetilde{\boldsymbol{Y}} \boldsymbol{B}$
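To make the steps concrete, here is a minimal NumPy sketch of this procedure (not from the slides; names such as `pca` and `d_prime` are illustrative):

```python
import numpy as np

def pca(Y, d_prime):
    """Project an (N x d) data matrix onto its first d_prime principal components."""
    mu = Y.mean(axis=0)                       # mean of each feature
    Y_centered = Y - mu                       # subtract the mean from every row
    cov = Y_centered.T @ Y_centered / len(Y)  # (1/N) * Y~^T Y~, the d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: the covariance matrix is symmetric
    order = np.argsort(eigvals)[::-1]         # sort eigenvalues in decreasing order
    B = eigvecs[:, order[:d_prime]]           # columns are the first d' PCs
    return Y_centered @ B, B, eigvals[order], mu

# Usage on random data: reduce 5-dimensional points to 2 dimensions.
Y = np.random.randn(200, 5) @ np.random.randn(5, 5)
Y_proj, B, eigvals, mu = pca(Y, d_prime=2)
print(Y_proj.shape)  # (200, 2)
```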

SLIDE 16

Find principal components

- Assume that the data is centered.
- Find the vector $\boldsymbol{w}$ that maximizes the sample variance of the projected data:

$$\arg\max_{\boldsymbol{w}} \; \frac{1}{N} \sum_{n=1}^{N} \left( \boldsymbol{w}^T \boldsymbol{y}^{(n)} \right)^2 = \frac{1}{N} \boldsymbol{w}^T \boldsymbol{Y}^T \boldsymbol{Y} \boldsymbol{w} \qquad \text{s.t. } \boldsymbol{w}^T \boldsymbol{w} = 1$$

- Lagrangian:

$$L(\boldsymbol{w}, \lambda) = \boldsymbol{w}^T \boldsymbol{Y}^T \boldsymbol{Y} \boldsymbol{w} - \lambda \boldsymbol{w}^T \boldsymbol{w}$$

$$\frac{\partial L}{\partial \boldsymbol{w}} = 0 \;\Rightarrow\; 2\boldsymbol{Y}^T \boldsymbol{Y} \boldsymbol{w} - 2\lambda \boldsymbol{w} = 0 \;\Rightarrow\; \boldsymbol{Y}^T \boldsymbol{Y} \boldsymbol{w} = \lambda \boldsymbol{w}$$

SLIDE 17

Find principal components

- For symmetric matrices, there exists a set of mutually orthogonal eigenvectors.
- Let $\boldsymbol{w}_1, \dots, \boldsymbol{w}_d$ denote the eigenvectors of $\boldsymbol{Y}^T \boldsymbol{Y}$, such that:

$$\boldsymbol{w}_j^T \boldsymbol{w}_k = 0 \quad \forall j \neq k, \qquad \boldsymbol{w}_j^T \boldsymbol{w}_j = 1 \quad \forall j$$

SLIDE 18

Find principal components

$$\boldsymbol{Y}^T \boldsymbol{Y} \boldsymbol{w} = \lambda \boldsymbol{w} \;\Rightarrow\; \boldsymbol{w}^T \boldsymbol{Y}^T \boldsymbol{Y} \boldsymbol{w} = \lambda \boldsymbol{w}^T \boldsymbol{w} = \lambda$$

- $\lambda$ is the amount of variance along the found direction $\boldsymbol{w}$ (the "energy" along that dimension).
- Sort the eigenvalues: $\lambda_1 \geq \lambda_2 \geq \lambda_3 \geq \cdots$
  - The first PC $\boldsymbol{w}_1$ is the eigenvector of the sample covariance matrix $\boldsymbol{Y}^T \boldsymbol{Y}$ associated with the largest eigenvalue.
  - The second PC $\boldsymbol{w}_2$ is the eigenvector associated with the second largest eigenvalue.
  - And so on...

SLIDE 19

Another Interpretation: Least Squares Error

- PCs are linear least squares fits to the samples, each orthogonal to the previous PCs:
  - The first PC is a minimum-distance fit to a vector in the original feature space.
  - The second PC is a minimum-distance fit to a vector in the plane perpendicular to the first PC.
  - ...

SLIDE 20

Least Squares Error and Maximum Variance Views Are Equivalent (1-dim Interpretation)

- When the data are mean-removed:
  - Minimizing the sum of squared distances to the line is equivalent to maximizing the sum of squares of the projections onto that line (Pythagoras).

[Figure: a data point, its projection onto a line through the origin, and the residual]
SLIDE 21

Two interpretations

- Maximum variance subspace:

$$\arg\max_{\boldsymbol{w}} \; \sum_{n=1}^{N} \left( \boldsymbol{w}^T \boldsymbol{y}^{(n)} \right)^2 = \boldsymbol{w}^T \boldsymbol{Y}^T \boldsymbol{Y} \boldsymbol{w}$$

- Minimum reconstruction error:

$$\arg\min_{\boldsymbol{w}} \; \sum_{n=1}^{N} \left\| \boldsymbol{y}^{(n)} - \left( \boldsymbol{w}^T \boldsymbol{y}^{(n)} \right) \boldsymbol{w} \right\|^2$$

[Figure: a point $\boldsymbol{y}$, its projection $(\boldsymbol{w}^T\boldsymbol{y})\boldsymbol{w}$ onto the line through the origin, and the residual]
- blue² + red² = green²; green² is fixed (it is the data), so maximizing red² (the projection) is equivalent to minimizing blue² (the residual).

SLIDE 22

PCA: Uncorrelated Features

$$\boldsymbol{y}' = \boldsymbol{B}^T \boldsymbol{y} \;\Rightarrow\; \boldsymbol{S}_{\boldsymbol{y}'} = E\left[\boldsymbol{y}' \boldsymbol{y}'^T\right] = E\left[\boldsymbol{B}^T \boldsymbol{y} \boldsymbol{y}^T \boldsymbol{B}\right] = \boldsymbol{B}^T E\left[\boldsymbol{y} \boldsymbol{y}^T\right] \boldsymbol{B} = \boldsymbol{B}^T \boldsymbol{S}_{\boldsymbol{y}} \boldsymbol{B}$$

- If $\boldsymbol{B} = [\boldsymbol{b}_1, \dots, \boldsymbol{b}_d]$, where $\boldsymbol{b}_1, \dots, \boldsymbol{b}_d$ are orthonormal eigenvectors of $\boldsymbol{S}_{\boldsymbol{y}}$:

$$\boldsymbol{S}_{\boldsymbol{y}'} = \boldsymbol{B}^T \boldsymbol{S}_{\boldsymbol{y}} \boldsymbol{B} = \boldsymbol{B}^T \left( \boldsymbol{B} \boldsymbol{\Lambda} \boldsymbol{B}^T \right) \boldsymbol{B} = \boldsymbol{\Lambda} \;\Rightarrow\; E\left[ y_j' y_k' \right] = 0 \quad \forall j \neq k, \; j,k = 1,\dots,d$$

  then mutually uncorrelated features are obtained.
- Completely uncorrelated features avoid information redundancies.

SLIDE 23

Reconstruction

$$\boldsymbol{y}' = \begin{bmatrix} \boldsymbol{w}_1^T \boldsymbol{y} \\ \vdots \\ \boldsymbol{w}_{d'}^T \boldsymbol{y} \end{bmatrix} = \begin{bmatrix} \boldsymbol{w}_1^T \\ \vdots \\ \boldsymbol{w}_{d'}^T \end{bmatrix} \boldsymbol{y} = \boldsymbol{B}^T \boldsymbol{y}, \qquad \boldsymbol{B} = [\boldsymbol{w}_1, \dots, \boldsymbol{w}_{d'}]$$

- Incorporating all eigenvectors in $\boldsymbol{B} = [\boldsymbol{w}_1, \dots, \boldsymbol{w}_d]$:

$$\boldsymbol{y}' = \boldsymbol{B}^T \boldsymbol{y} \;\Rightarrow\; \boldsymbol{B}\boldsymbol{y}' = \boldsymbol{B}\boldsymbol{B}^T \boldsymbol{y} = \boldsymbol{y} \;\Rightarrow\; \boldsymbol{y} = \boldsymbol{B}\boldsymbol{y}'$$

- Therefore, if $d' = d$ then $\boldsymbol{y}$ can be reconstructed exactly from $\boldsymbol{y}'$.

SLIDE 24

PCA Derivation: Relation between Eigenvalues and Variances

- The $k$-th largest eigenvalue of $\boldsymbol{S}_{\boldsymbol{y}}$ is the variance along the $k$-th PC:

$$\mathrm{var}\left( y_k' \right) = \boldsymbol{w}_k^T \boldsymbol{S}_{\boldsymbol{y}} \boldsymbol{w}_k = \lambda_k$$

SLIDE 25

PCA Derivation: Mean Square Error Approximation

- Incorporating only the $d'$ eigenvectors corresponding to the largest eigenvalues, $\boldsymbol{B} = [\boldsymbol{b}_1, \dots, \boldsymbol{b}_{d'}]$ ($d' < d$), minimizes the MSE between $\boldsymbol{y}$ and $\widehat{\boldsymbol{y}} = \boldsymbol{B}\boldsymbol{y}'$:

$$
\begin{aligned}
K(\boldsymbol{B}) &= E\left[ \left\| \boldsymbol{y} - \widehat{\boldsymbol{y}} \right\|^2 \right] = E\left[ \left\| \boldsymbol{y} - \boldsymbol{B}\boldsymbol{y}' \right\|^2 \right] = E\left[ \Big\| \sum_{k=d'+1}^{d} y_k' \boldsymbol{b}_k \Big\|^2 \right] \\
&= E\left[ \sum_{k=d'+1}^{d} \sum_{l=d'+1}^{d} y_k' \, \boldsymbol{b}_k^T \boldsymbol{b}_l \, y_l' \right] = E\left[ \sum_{k=d'+1}^{d} y_k'^2 \right] = \sum_{k=d'+1}^{d} E\left[ y_k'^2 \right] \\
&= \sum_{k=d'+1}^{d} \boldsymbol{b}_k^T E\left[ \boldsymbol{y}\boldsymbol{y}^T \right] \boldsymbol{b}_k = \sum_{k=d'+1}^{d} \boldsymbol{b}_k^T \boldsymbol{S}_{\boldsymbol{y}} \boldsymbol{b}_k = \sum_{k=d'+1}^{d} \lambda_k
\end{aligned}
$$

- That is, the MSE equals the sum of the $d - d'$ smallest eigenvalues.
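A quick numeric sanity check of this identity (my own sketch; variable names are illustrative): the average squared reconstruction error of centered data projected onto the top $d'$ eigenvectors equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))
Y = Y - Y.mean(axis=0)                       # centered data
S = Y.T @ Y / len(Y)                         # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

d_prime = 3
B = eigvecs[:, :d_prime]                     # keep the top d' eigenvectors
Y_hat = (Y @ B) @ B.T                        # reconstruction B y'
mse = np.mean(np.sum((Y - Y_hat) ** 2, axis=1))

print(np.isclose(mse, eigvals[d_prime:].sum()))  # True: sum of dropped eigenvalues
```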

SLIDE 26

PCA Derivation: Mean Square Error Approximation

- In general, it can also be shown that this MSE is smaller than that of any other approximation of $\boldsymbol{y}$ by any $d'$-dimensional orthonormal basis.
  - This result can be obtained without first assuming that the axes are eigenvectors of the correlation matrix.
- If the data is mean-centered in advance, the correlation matrix $\boldsymbol{S}_{\boldsymbol{y}} = E[\boldsymbol{y}\boldsymbol{y}^T]$ and the covariance matrix $\boldsymbol{\Sigma}_{\boldsymbol{y}}$ are the same.
  - However, in the correlation version, when $\boldsymbol{\Sigma}_{\boldsymbol{y}} \neq \boldsymbol{S}_{\boldsymbol{y}}$, the approximation is not, in general, a good one (although it is still a minimum-MSE solution).

SLIDE 27

PCA on Faces: "Eigenfaces"

- ORL face database

[Figure: some images from the ORL database]

SLIDE 28

PCA on Faces: "Eigenfaces"

- For the eigenfaces, "gray" = 0, "white" > 0, "black" < 0.

[Figure: the average face and the 1st through 6th PCs (eigenfaces)]

SLIDE 29

PCA on Faces:

- Feature vector: $[y_1', y_2', \dots, y_{d'}']$, where $y_i' = \boldsymbol{w}_i^T \boldsymbol{y}$ is the projection of $\boldsymbol{y}$ on the $i$-th PC.
- $\boldsymbol{y}$ is a $112 \times 92 = 10304$-dimensional vector containing the pixel intensities of the image.
- Reconstruction as the average face plus a weighted sum of eigenfaces:

$$\widehat{\boldsymbol{y}} = \bar{\boldsymbol{y}} + \sum_{i=1}^{d'} y_i' \, \boldsymbol{w}_i = \bar{\boldsymbol{y}} + \sum_{i=1}^{d'} \left( \boldsymbol{w}_i^T \boldsymbol{y} \right) \boldsymbol{w}_i$$

[Figure: a face image expressed as the average face $+\, y_1' \times$ (eigenface 1) $+\, y_2' \times$ (eigenface 2) $+ \cdots$]
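A sketch of this reconstruction in NumPy (my own illustration, not the ORL pipeline from the slides; `faces` is assumed to be an N × 10304 array of flattened 112 × 92 images, and the coefficients are computed on the mean-subtracted image, as is usual for eigenfaces):

```python
import numpy as np

def eigenface_reconstruct(faces, query, d_prime):
    """Reconstruct `query` from the mean face plus its top-d' eigenface coefficients."""
    mean_face = faces.mean(axis=0)
    centered = faces - mean_face
    # Eigenvectors of the covariance matrix via SVD of the centered data matrix.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    W = Vt[:d_prime].T                       # columns w_1..w_{d'} (the eigenfaces)
    coeffs = W.T @ (query - mean_face)       # y'_i = w_i^T (y - mean)
    return mean_face + W @ coeffs            # mean face + sum_i y'_i w_i

# Hypothetical usage with random stand-in data of the ORL image size:
faces = np.random.rand(400, 112 * 92)
recon = eigenface_reconstruct(faces, faces[0], d_prime=64)
print(recon.shape)  # (10304,)
```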

SLIDE 30

PCA on Faces: Reconstructed Face

[Figure: reconstructed faces for d' = 1, 2, 4, 8, 16, 32, 64, 128, 256, together with the original image]

SLIDE 31

Dimensionality reduction by PCA

- Data may lie near a linear subspace of the high-dimensional input space.
- Only keep the data projections onto principal components with large eigenvalues.
- Plot the eigenvalues (variances of the principal components) against their indices (a "scree plot").

[Figure: plot of the variance of PC 1 through PC 4 (and beyond) against the PC index]
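As an illustration (my own sketch), the eigenvalue spectrum and the cumulative fraction of variance explained can be computed as follows; a common rule of thumb is to keep enough PCs to cover, say, 90–95% of the variance.

```python
import numpy as np

rng = np.random.default_rng(2)
Y = rng.normal(size=(300, 10)) @ np.diag([5, 4, 3, 1, 1, .5, .5, .2, .1, .05])
Y = Y - Y.mean(axis=0)

eigvals = np.linalg.eigvalsh(Y.T @ Y / len(Y))[::-1]   # PC variances, largest first
explained = np.cumsum(eigvals) / eigvals.sum()         # cumulative fraction of variance

for k, frac in enumerate(explained, start=1):
    print(f"first {k:2d} PCs explain {frac:6.1%} of the variance")

d_prime = int(np.searchsorted(explained, 0.95) + 1)    # smallest d' covering 95%
print("chosen d' =", d_prime)
```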

SLIDE 32

PCA: Summary

- The global optimum is found by an eigenvector method.
- No parameter tuning.
- However, it is limited to:
  - second-order statistics
  - linear projections

SLIDE 33

PCA and LDA: Drawbacks

- PCA drawback: an excellent information-packing transform does not necessarily lead to good class separability.
  - The directions of maximum variance may be useless for classification purposes.
- LDA drawbacks:
  - Singularity or under-sampled problem (when $N < d$)
    - Example: gene expression data, images, text documents
  - Can reduce the dimension only to $d' \leq C - 1$ (unlike PCA), where $C$ is the number of classes.

[Figure: a two-class dataset comparing the PCA and LDA projection directions]

SLIDE 34

PCA vs. LDA

- Although LDA often provides more suitable features for classification tasks, PCA might outperform LDA in some situations:
  - when there is a lot of unlabeled data but no or only a small amount of labeled data
  - when the number of samples per class is small (overfitting problem of LDA)
  - when the number of desired features is more than $C - 1$
  - when the training data non-uniformly sample the underlying distribution
- Semi-supervised feature extraction
  - E.g., PCA+LDA, Regularized LDA, Local Fisher Discriminant Analysis (LFDA)

SLIDE 35

Kernel PCA

- Nonlinear extension of PCA: map the data to a feature (Hilbert) space, $\boldsymbol{y} \rightarrow \boldsymbol{\phi}(\boldsymbol{y})$, and run PCA there:

$$\boldsymbol{C} = \frac{1}{N} \sum_{i=1}^{N} \boldsymbol{\phi}\!\left(\boldsymbol{y}^{(i)}\right) \boldsymbol{\phi}\!\left(\boldsymbol{y}^{(i)}\right)^T$$

- All eigenvectors of $\boldsymbol{C}$ lie in the span of the mapped data points:

$$\boldsymbol{C}\boldsymbol{w} = \lambda \boldsymbol{w}, \qquad \boldsymbol{w} = \sum_{i=1}^{N} \alpha_i \boldsymbol{\phi}\!\left(\boldsymbol{y}^{(i)}\right)$$

- After some algebra, we obtain an eigenvalue problem on the $N \times N$ kernel matrix $\boldsymbol{K}$, with $K_{ij} = \boldsymbol{\phi}(\boldsymbol{y}^{(i)})^T \boldsymbol{\phi}(\boldsymbol{y}^{(j)})$:

$$\boldsymbol{K}\boldsymbol{\alpha} = N \lambda \boldsymbol{\alpha}$$

SLIDE 36

Kernel PCA

- Kernel extension of PCA: useful when the data (approximately) lies on a lower-dimensional non-linear space.

SLIDE 37

Laplacian eigenmap, LLE, etc.

- These methods need not predefine a kernel; they construct a graph on the data points.

[M. Belkin et al., Laplacian Eigenmaps]

SLIDE 38

Autoencoder

- Cost function: $\sum_{n=1}^{N} \left\| \boldsymbol{y}^{(n)} - \widehat{\boldsymbol{y}}^{(n)} \right\|^2$, where $\widehat{\boldsymbol{y}}^{(n)}$ is the network's reconstruction of the input $\boldsymbol{y}^{(n)}$.

[Figure: encoder-decoder network mapping $\boldsymbol{y}$ to $\widehat{\boldsymbol{y}}$ through a low-dimensional bottleneck]

SLIDE 39

Uncorrelated and Independent

- Gaussian:
  - Independent ⟺ Uncorrelated
- Non-Gaussian:
  - Independent ⇒ Uncorrelated
  - Uncorrelated ⇏ Independent

Uncorrelated: $\mathrm{cov}(Y_1, Y_2) = 0$. Independent: $P(Y_1, Y_2) = P(Y_1)\,P(Y_2)$.
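A tiny numeric illustration (my own): $Y_1$ uniform on $[-1, 1]$ and $Y_2 = Y_1^2$ are uncorrelated but clearly dependent; the dependence only shows up in higher-order statistics.

```python
import numpy as np

rng = np.random.default_rng(4)
y1 = rng.uniform(-1, 1, size=100_000)
y2 = y1 ** 2                                  # deterministic function of y1: fully dependent

cov = np.mean((y1 - y1.mean()) * (y2 - y2.mean()))
print(round(cov, 4))                          # ~0: uncorrelated

# Dependence appears in higher-order statistics, e.g. E[y1^2 y2] != E[y1^2] E[y2]:
print(round(np.mean(y1**2 * y2), 3), round(np.mean(y1**2) * np.mean(y2), 3))  # ~0.2 vs ~0.11
```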

SLIDE 40

ICA: Cocktail party problem

- Cocktail party problem
  - $d$ speakers are speaking simultaneously, and any microphone records only an overlapping combination of these voices.
    - Each microphone records a different combination of the speakers' voices.
  - Using these $n \geq d$ microphone recordings, can we separate out the original $d$ speakers' speech signals?
- Mixing matrix $\boldsymbol{B}$: $\boldsymbol{y} = \boldsymbol{B}\boldsymbol{s}$
- Unmixing matrix $\boldsymbol{B}^{-1}$: $\boldsymbol{s} = \boldsymbol{B}^{-1}\boldsymbol{y}$

where $s_l^{(t)}$ is the sound that speaker $l$ was uttering at time $t$, $y_j^{(t)}$ is the acoustic reading recorded by microphone $j$ at time $t$, and

$$y_j = \sum_{l=1}^{d} b_{jl}\, s_l$$
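As a sketch of the unmixing step (my own example, using scikit-learn's `FastICA` rather than the algorithm derived on the later slides), two independent non-Gaussian sources are mixed with a known matrix and then recovered up to permutation, sign, and scaling:

```python
import numpy as np
from sklearn.decomposition import FastICA   # assumes scikit-learn is installed

rng = np.random.default_rng(5)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sign(np.sin(3 * t)),           # source 1: square wave
          rng.uniform(-1, 1, len(t))]       # source 2: uniform noise (non-Gaussian)

B = np.array([[2.0, 3.0],
              [2.0, 1.0]])                  # mixing matrix (as in the slides' example)
Y = S @ B.T                                 # microphone recordings y = B s

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(Y)                # recovered sources (up to order and scale)
print(np.corrcoef(S_hat.T, S.T).round(2))   # each recovered source correlates with one true source
```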

SLIDE 41

ICA Assumptions

- The sources are independent of each other.
- The sources have non-Gaussian distributions.
  - (A Gaussian density is completely rotationally symmetric, which makes the mixing matrix unidentifiable, as shown on the following slides.)

SLIDE 42

Example

- Two independent sources with uniform densities:

$$p(s_i) = \begin{cases} \dfrac{1}{2\sqrt{3}} & \text{if } |s_i| \leq \sqrt{3} \\ 0 & \text{otherwise} \end{cases}$$

- mixed with $\boldsymbol{B} = \begin{bmatrix} 2 & 3 \\ 2 & 1 \end{bmatrix}$ to give the observed mixtures $(y_1, y_2)$.

[Figure: scatter of the observed mixtures $(y_1, y_2)$]

[Hyvärinen and Oja, Independent Component Analysis: Algorithms and Applications, 2000.]

SLIDE 43

ICA: Ambiguities

- We cannot determine the order of the independent components.
  - There is no way to distinguish between $\boldsymbol{B}\boldsymbol{s}$ and $(\boldsymbol{B}\boldsymbol{Q}^{-1})(\boldsymbol{Q}\boldsymbol{s})$, where $\boldsymbol{Q}$ is a permutation matrix.
- There is no way to recover the correct scaling of the sources.
  - There is no way to distinguish between $\boldsymbol{B}\boldsymbol{s}$ and $(\boldsymbol{B}/\alpha)(\alpha\boldsymbol{s})$.
- A Gaussian distribution $N(\boldsymbol{0}, \boldsymbol{I})$ of the sources is a further, more serious ambiguity (next slide).

SLIDE 44

Gaussian distribution $N(\boldsymbol{0}, \boldsymbol{I})$ of sources

$$\boldsymbol{y} = \boldsymbol{B}\boldsymbol{s}, \qquad \boldsymbol{y}' = \boldsymbol{B}'\boldsymbol{s}, \qquad \boldsymbol{B}' = \boldsymbol{B}\boldsymbol{R}, \quad \boldsymbol{R}\boldsymbol{R}^T = \boldsymbol{R}^T\boldsymbol{R} = \boldsymbol{I}$$

$$\Rightarrow\; E\left[\boldsymbol{y}'\boldsymbol{y}'^T\right] = E\left[\boldsymbol{B}\boldsymbol{R}\boldsymbol{s}\boldsymbol{s}^T\boldsymbol{R}^T\boldsymbol{B}^T\right] = \boldsymbol{B}\boldsymbol{B}^T = E\left[\boldsymbol{y}\boldsymbol{y}^T\right]$$

- There is no way to tell whether the sources were mixed using $\boldsymbol{B}$ or $\boldsymbol{B}'$:
  - there is an arbitrary rotational component that cannot be determined from the data, and thus we cannot recover the original sources.

SLIDE 45

Gaussian distribution $N(\boldsymbol{0}, \boldsymbol{I})$ of sources

- Any orthonormal transformation of $N(\boldsymbol{0}, \boldsymbol{I})$ sources has exactly the same distribution, so the mixing matrix $\boldsymbol{B}$ is not identifiable.
  - This is because the multivariate standard normal distribution is rotationally symmetric.
- As long as the sources are not Gaussian, it is possible, given enough data, to recover the $d$ independent sources.

SLIDE 46

ICA: Assumptions

- The density of the observations follows from the density of the sources by the change of variables

$$p_{\boldsymbol{y}}(\boldsymbol{y}) = p_{\boldsymbol{s}}(\boldsymbol{W}\boldsymbol{y})\,|\boldsymbol{W}|, \qquad \boldsymbol{W} = \boldsymbol{B}^{-1} = \begin{bmatrix} \boldsymbol{w}_1^T \\ \vdots \\ \boldsymbol{w}_d^T \end{bmatrix}, \quad s_i = \boldsymbol{w}_i^T \boldsymbol{y}$$

- Consider the assumption that the sources are independent:

$$p(\boldsymbol{s}) = \prod_{i=1}^{d} p(s_i)$$

SLIDE 47

ICA: Assumptions

- To define a density for each source, we can specify its cdf $F$:
  - $p_s(s) = F'(s)$
  - A cdf is a monotonic function that increases from zero to one.
- The sigmoid function $\sigma(s) = 1/(1 + e^{-s})$ is a reasonable choice for the cdf of the sources.
  - Hence, it is assumed that $p(s) = \sigma'(s)$.

SLIDE 48

ICA: Log likelihood

$$
\begin{aligned}
\ell(\boldsymbol{W}) &= \sum_{t=1}^{N} \log p_{\boldsymbol{y}}\!\left(\boldsymbol{y}^{(t)}\right)
= \sum_{t=1}^{N} \left[ \log p_{\boldsymbol{s}}\!\left(\boldsymbol{W}\boldsymbol{y}^{(t)}\right) + \log |\boldsymbol{W}| \right] \\
&= \sum_{t=1}^{N} \left[ \sum_{k=1}^{d} \log \sigma'\!\left(\boldsymbol{w}_k^T \boldsymbol{y}^{(t)}\right) + \log |\boldsymbol{W}| \right]
\end{aligned}
$$

- Stochastic gradient ascent on $\ell(\boldsymbol{W})$, using $\nabla_{\boldsymbol{W}} |\boldsymbol{W}| = |\boldsymbol{W}|\left(\boldsymbol{W}^{-1}\right)^T$:

$$
\boldsymbol{W} \leftarrow \boldsymbol{W} + \alpha \left(
\begin{bmatrix}
1 - 2\sigma\!\left(\boldsymbol{w}_1^T \boldsymbol{y}^{(t)}\right) \\
\vdots \\
1 - 2\sigma\!\left(\boldsymbol{w}_d^T \boldsymbol{y}^{(t)}\right)
\end{bmatrix}
\boldsymbol{y}^{(t)T} + \left(\boldsymbol{W}^{-1}\right)^T
\right)
$$

SLIDE 49

Independent Component Analysis (ICA)

- PCA:
  - The transformed dimensions are uncorrelated with each other.
  - Orthogonal linear transform.
  - Uses only second-order statistics (i.e., the covariance matrix).
- ICA:
  - The transformed dimensions are as independent as possible.
  - Non-orthogonal linear transform.
  - Higher-order statistics can also be used.

SLIDE 50

Summary

- PCA is a linear dimensionality reduction method that finds an orthonormal basis (minimizing the MSE of reconstruction).
- ICA finds a linear transformation that makes the components as independent as possible.

SLIDE 51

Resources

- C. Bishop, "Pattern Recognition and Machine Learning", Chapter 12.
- A. Hyvärinen and E. Oja, "Independent Component Analysis: Algorithms and Applications", Neural Networks, 2000.