PCA & ICA
CE-717: Machine Learning
Sharif University of Technology, Spring 2018
Soleymani

Dimensionality Reduction: Feature Selection vs. Feature Extraction
- Feature selection
  - Select a subset of a given feature set
- Feature extraction
  - A linear or non-linear transform on the original feature space
- Feature Selection ($d' < d$): $(y_1, \dots, y_d) \rightarrow (y_{i_1}, \dots, y_{i_{d'}})$
- Feature Extraction: $(y_1, \dots, y_d) \rightarrow (z_1, \dots, z_{d'}) = f(y_1, \dots, y_d)$
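To make the distinction concrete, here is a minimal NumPy sketch (the data, the selected column indices, and the transform matrix are all made-up placeholders): feature selection keeps a subset of the original columns unchanged, while linear feature extraction mixes all columns into new ones.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 5))      # N=100 samples, d=5 original features

# Feature selection: keep a subset of the original columns (d' = 2)
selected = [0, 3]                  # hypothetical chosen feature indices
Y_sel = Y[:, selected]             # shape (100, 2); columns are unchanged

# Feature extraction: a linear map z = A^T y mixing all columns (d' = 2)
A = rng.normal(size=(5, 2))        # arbitrary illustrative transform
Y_ext = Y @ A                      # shape (100, 2); new features are combinations

print(Y_sel.shape, Y_ext.shape)
```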
- The criterion for feature extraction can differ based on the problem:
  - Unsupervised task: minimize the information loss (reconstruction error)
  - Supervised task: maximize the class discrimination in the projected space
- Linear methods
  - Unsupervised: e.g., Principal Component Analysis (PCA)
  - Supervised: e.g., Linear Discriminant Analysis (LDA)
    - Also known as Fisher's Discriminant Analysis (FDA)
- Non-linear methods
  - Supervised: e.g., MLP neural networks
  - Unsupervised: e.g., autoencoders
Feature Extraction
$$Y = \begin{bmatrix} y_1^{(1)} & \cdots & y_d^{(1)} \\ \vdots & \ddots & \vdots \\ y_1^{(N)} & \cdots & y_d^{(N)} \end{bmatrix} \;\longrightarrow\; Y' = \begin{bmatrix} y_1'^{(1)} & \cdots & y_{d'}'^{(1)} \\ \vdots & \ddots & \vdots \\ y_1'^{(N)} & \cdots & y_{d'}'^{(N)} \end{bmatrix}$$
A mapping $f: \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$, or equivalently $Z = \begin{bmatrix} z^{(1)} \\ \vdots \\ z^{(N)} \end{bmatrix}$ with $z^{(i)} = f(y^{(i)})$.
- As a preprocessing step to reduce dimensionality for supervised learning tasks
  - Helps to avoid overfitting
- Linear feature extraction: $x' = A^T x$, where $A^T \in \mathbb{R}^{d' \times d}$, $x \in \mathbb{R}^{d}$, $x' \in \mathbb{R}^{d'}$, and $d' < d$
Linear feature extraction methods:
- Principal Component Analysis (PCA)
- Independent Component Analysis (ICA)
- Singular Value Decomposition (SVD)
- Multi-Dimensional Scaling (MDS)
- Canonical Correlation Analysis (CCA)
- …
Principal Component Analysis (PCA)
- Find the directions along which the data approximately lies
- When the data is projected onto the first PC, the variance of the projected data is maximized
- Find the mean-centered data
- The axes are rotated to new (principal) axes such that:
  - Principal axis 1 has the highest variance
  - …
  - Principal axis i has the i-th highest variance
- The principal axes are uncorrelated
  - The covariance between each pair of principal axes is zero
[Figure: example data with the first two principal directions $v_1$ and $v_2$]
- Mean-centered data matrix: $X = \begin{bmatrix} (x^{(1)}-\bar{x})^T \\ \vdots \\ (x^{(N)}-\bar{x})^T \end{bmatrix}$, where $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x^{(i)}$
PCA algorithm (a minimal NumPy sketch follows):
- Compute the sample mean $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x^{(i)}$
- Form the mean-centered data matrix $X$ (as defined above)
- Compute the sample covariance $S = \frac{1}{N} X^T X$
- Calculate the eigenvalues and eigenvectors of $S$
- Pick the $d'$ eigenvectors corresponding to the largest eigenvalues: $A = [v_1, \dots, v_{d'}]$ (first PC through $d'$-th PC)
- Project: $X' = X A$
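Below is a minimal NumPy sketch of these steps (my own illustration of the algorithm above; the function and variable names are not from the slides):

```python
import numpy as np

def pca(X, d_prime):
    """Project the rows of X onto the top d_prime principal components."""
    x_bar = X.mean(axis=0)                   # sample mean
    Xc = X - x_bar                           # mean-centered data
    S = (Xc.T @ Xc) / X.shape[0]             # sample covariance S = (1/N) X^T X
    eigvals, eigvecs = np.linalg.eigh(S)     # eigh: S is symmetric
    order = np.argsort(eigvals)[::-1]        # sort eigenvalues, largest first
    A = eigvecs[:, order[:d_prime]]          # d x d' matrix of top eigenvectors
    return Xc @ A, A, eigvals[order]         # projected data X' = X A

# Tiny usage example with random data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X_proj, A, lam = pca(X, d_prime=2)
print(X_proj.shape)                                 # (200, 2)
print(np.round(np.cov(X_proj, rowvar=False), 3))    # ~diagonal: PCs are uncorrelated
```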
- The first PC $v_1$ is the eigenvector of the sample covariance matrix $S$ associated with the largest eigenvalue
- The 2nd PC $v_2$ is the eigenvector associated with the second largest eigenvalue
- And so on ...
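The slides state this result directly; as a supporting step (my addition, using the same symbols $v$, $S$, $\lambda$), the first PC can be derived by maximizing the projected variance under a unit-norm constraint:

$$\begin{aligned}
&\max_{v}\; \operatorname{var}(v^T x) = v^T S v \quad \text{s.t. } v^T v = 1\\
&\mathcal{L}(v, \lambda) = v^T S v - \lambda\,(v^T v - 1)\\
&\frac{\partial \mathcal{L}}{\partial v} = 2 S v - 2\lambda v = 0 \;\Rightarrow\; S v = \lambda v
\end{aligned}$$

So $v$ must be an eigenvector of $S$, and since the attained variance is $v^T S v = \lambda$, the maximum is reached at the eigenvector with the largest eigenvalue.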
- The first PC is a minimum-distance fit to a vector in the original feature space
- The second PC is a minimum-distance fit to a vector in the plane perpendicular to the first PC
- …
- Minimizing the sum of squared distances to the line is equivalent to maximizing the variance of the projected data
[Figure: a data point $y$, a unit direction $w$, and its projection $w^T y$]
- blue² + red² = green²
- green² is fixed (it is determined by the data)
- So, maximizing red² (the variance of the projection) is equivalent to minimizing blue² (the squared distance to the line)
- Reconstruction error of keeping only the first $d'$ PCs:
$$E = \mathbb{E}\Big[\sum_{i=d'+1}^{d} a_i^2\Big] = \sum_{i=d'+1}^{d} v_i^T S\, v_i = \sum_{i=d'+1}^{d} \lambda_i$$
- i.e., the sum of the $d - d'$ smallest eigenvalues
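A quick numerical sanity check of this identity (my own sketch with synthetic data): the mean squared reconstruction error from keeping $d'$ components equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))
Xc = X - X.mean(axis=0)
S = (Xc.T @ Xc) / X.shape[0]                  # sample covariance

eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

d_prime = 2
A = eigvecs[:, :d_prime]                      # top d' eigenvectors
X_hat = (Xc @ A) @ A.T                        # reconstruction from d' PCs
mse = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))

print(mse, eigvals[d_prime:].sum())           # the two numbers should match
```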
- The same minimum-error solution can be obtained without first assuming that the axes are eigenvectors of the correlation matrix
- However, in the correlation version (when $Rv \neq Sv$), the resulting approximation differs
Example: Some Face Images
[Figure: a face image expressed as the average face plus weighted eigenvector images, from the 1st PC to the 6th PC: $x \approx \bar{x} + a_1 v_1 + a_2 v_2 + \cdots$]
- $\bar{x}$: the average face
- $a_i = v_i^T x$: the projection of $x$ on the $i$-th PC $v_i$
- $x$ is a $112 \times 92 = 10304$-dimensional vector containing the pixel intensities of the image
- Reconstruction: $\hat{x} = \bar{x} + \sum_{i=1}^{d'} a_i v_i = \bar{x} + \sum_{i=1}^{d'} (v_i^T x)\, v_i$
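A small NumPy sketch of this reconstruction (illustrative only: random vectors stand in for the face images, and the coefficients are computed from the mean-centered image, which is the usual convention):

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, d_prime = 10304, 50, 6              # 112*92 pixels, 50 "images", 6 PCs
X = rng.normal(size=(N, D))               # stand-in for flattened face images

x_bar = X.mean(axis=0)                    # the "average face"
Xc = X - x_bar
# The top right singular vectors of the centered data are the PCs.
U, svals, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt[:d_prime].T                        # columns v_1 ... v_{d'}

x = X[0]                                  # one image to reconstruct
a = V.T @ (x - x_bar)                     # a_i = v_i^T (x - x_bar)
x_hat = x_bar + V @ a                     # x_hat = x_bar + sum_i a_i v_i

print(np.linalg.norm(x - x_hat) / np.linalg.norm(x - x_bar))
```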
[Figure: variance of the projected data along each principal component (PC 1, PC 2, PC 3, PC 4, ...)]
PCA limitations:
- Only considers second-order statistics (the covariance)
- Limited to linear projections
PCA vs. LDA
- PCA: the directions of maximum variance may be useless for classification
- LDA: singularity or under-sampled problem (when $N < d$)
  - Example: gene expression data, images, text documents
- LDA can reduce the dimension only to $d' \leq C - 1$ (unlike PCA); see the sketch below
[Figure: comparison of PCA and LDA projection directions on example data]
- PCA may be preferred when:
  - there are many unlabeled data and no or only a small amount of labeled data
  - the number of samples per class is small (overfitting problem of LDA)
  - the number of desired features is more than $C - 1$
  - the training data non-uniformly sample the underlying distribution
- E.g., PCA+LDA, Regularized LDA, Locally FDA (LFDA)
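To see the $d' \leq C - 1$ constraint concretely, a short scikit-learn sketch (the random data and labels are purely illustrative):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = rng.integers(0, 3, size=300)          # C = 3 classes

# LDA can produce at most C - 1 = 2 discriminant directions
lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
print(lda.transform(X).shape)             # (300, 2)

# PCA has no such limit (up to min(N, d) components)
pca = PCA(n_components=10).fit(X)
print(pca.transform(X).shape)             # (300, 10)
```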
- Non-linear dimensionality reduction: the data (approximately) lies on a lower-dimensional non-linear space (manifold)
[Figure from M. Belkin et al., Laplacian Eigenmaps]
Independent Component Analysis (ICA)
- Independent vs. Uncorrelated:
  - Independent ⟹ Uncorrelated
  - Uncorrelated ⇏ Independent
- Uncorrelated: $\mathrm{cov}(y_1, y_2) = 0$
- Independent: $p(y_1, y_2) = p(y_1)\,p(y_2)$
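A quick numerical illustration of the "uncorrelated but not independent" direction (my own example): $y_1$ standard normal and $y_2 = y_1^2$ have (near) zero covariance, yet $y_2$ is a deterministic function of $y_1$.

```python
import numpy as np

rng = np.random.default_rng(0)
y1 = rng.normal(size=100_000)
y2 = y1 ** 2                         # completely determined by y1

print(np.cov(y1, y2)[0, 1])          # ~0: uncorrelated
# ...but clearly not independent: knowing y1 = 2 tells us y2 = 4 exactly,
# so p(y1, y2) != p(y1) p(y2).
```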
Cocktail party problem
- $n$ speakers are speaking simultaneously, and each microphone records a different combination of the speakers' voices
- Using these $m \geq n$ microphone recordings, can we separate the original speech signals? (see the sketch below)
- $s_j(t)$: sound that speaker $j$ was uttering at time $t$
- $y_i(t)$: acoustic reading recorded by microphone $i$ at time $t$
- Mixing model: $y_i(t) = \sum_{j=1}^{n} a_{ij}\, s_j(t)$
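As an illustration of unmixing such recordings, a small sketch using scikit-learn's FastICA (the two synthetic sources and all variable names are mine; the mixing matrix reuses the values from the Hyvärinen and Oja example cited below; sources are recovered only up to permutation and scaling, as discussed next):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)

# Two synthetic, non-Gaussian "speakers"
s1 = np.sign(np.sin(3 * t))                 # square wave
s2 = ((t * 2) % 2) - 1                      # sawtooth wave
S = np.c_[s1, s2]                           # columns are sources s_j(t)

A = np.array([[2.0, 3.0],
              [2.0, 1.0]])                  # mixing matrix (a_ij)
Y = S @ A.T                                 # microphone readings y_i(t) = sum_j a_ij s_j(t)

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(Y)                # estimated sources (up to permutation/scale)

# Correlate estimates with the true sources to see the (permuted) match
print(np.round(np.corrcoef(S.T, S_hat.T)[:2, 2:], 2))
```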
- The Gaussian density is completely symmetric (rotationally invariant)
- Example (two uniform sources): $p(s_i) = \frac{1}{2\sqrt{3}}$ if $|s_i| \leq \sqrt{3}$, and $0$ otherwise
- Mixing with $A = \begin{bmatrix} 2 & 3 \\ 2 & 1 \end{bmatrix}$ gives the observed mixtures $y_1, y_2$
[Hyvärinen and Oja, Independent Component Analysis: Algorithms and Applications, 2000]
ICA ambiguities:
- There is no way to distinguish between $As$ and $(AP^{-1})(Ps)$ for a permutation matrix $P$ (order of the sources)
- There is no way to distinguish between $As$ and $(A/\alpha)(\alpha s)$ (scale of the sources)
- For Gaussian sources there is an arbitrary rotational component that cannot be determined from the data
- ICA goal: recover the sources by estimating the unmixing matrix $W = A^{-1} = \begin{bmatrix} w_1^T \\ \vdots \\ w_n^T \end{bmatrix}$, so that $s = W y$
- The recovered signals satisfy $\hat{s}(t) = s'(t)$, a scaled and permuted version of the true sources
- Hence, it is assumed that $s(t) = s'(t)$
PCA vs. ICA
- PCA
  - The transformed dimensions will be uncorrelated from each other
  - Orthogonal linear transform
  - Only uses second-order statistics (i.e., the covariance matrix)
- ICA
  - The transformed dimensions will be as independent as possible
  - Non-orthogonal linear transform
  - Higher-order statistics can also be used