Aykut Erdem // Hacettepe University // Fall 2019
Lecture 23:
Dimensionality Reduction
BBM406
Fundamentals of Machine Learning
Image credit: Matthew Turk and Alex Pentland
Project Presentations
January 8 and 10, 2020
The suggested outline for the presentations is as follows:
- A statement of the problem (why the problem is interesting and important)
Each group will also prepare an engaging video presentation of their work using online tools such as PowToon, moovly or GoAnimate (due January 12, 2020).
2
Final Reports (Due January 15, 2020)
- An introduction providing a general motivation and briefly discussing the approach(es) that you explored.
- A brief review of related work on the topic.
- A description of the method(s) you employed or proposed, as detailed and specific as possible.
- An evaluation where you analyze the performance of the approach(es) you proposed or explored. You should provide a qualitative and/or quantitative analysis, and comment on your findings. You may also demonstrate the limitations of the approach(es).
- A conclusion summarizing the key results you obtained. You may also suggest possible directions for future work.
Last time… Graph-Theoretic Clustering
Goal: Given data points X1, ..., Xn and similarities W(Xi ,Xj), partition the data into groups so that points in a group are similar and points in different groups are dissimilar.
4
Similarity Graph: G(V, E, W)
V – vertices (data points)
E – edge if similarity > 0
W – edge weights (similarities)
Partition the graph so that edges within a group have large weights and edges across groups have small weights.
Similarity graph
slide by Aarti Singh
Last time… K-Means vs. Spectral Clustering
Assigning data points to clusters according to the Laplacian eigenvectors allows us to find clusters with non-convex boundaries.
5
[Figure: k-means output vs. spectral clustering output]
slide by Aarti Singh
Bottom-Up (agglomerative):
Start with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
slide by Andrew Moore
8
9
Instances (rows) × Features (columns)
        H-WBC    H-RBC    H-Hgb     H-Hct     H-MCV     H-MCH    H-MCHC
A1      8.0000   4.8200   14.1000   41.0000   85.0000   29.0000  34.0000
A2      7.3000   5.0200   14.7000   43.0000   86.0000   29.0000  34.0000
A3      4.3000   4.4800   14.1000   41.0000   91.0000   32.0000  35.0000
A4      7.5000   4.4700   14.9000   45.0000   101.0000  33.0000  33.0000
A5      7.3000   5.5200   15.4000   46.0000   84.0000   28.0000  33.0000
A6      6.9000   4.8600   16.0000   47.0000   97.0000   33.0000  34.0000
A7      7.8000   4.6800   14.7000   43.0000   92.0000   31.0000  34.0000
A8      8.6000   4.8200   15.8000   42.0000   88.0000   33.0000  37.0000
A9      5.1000   4.7100   14.0000   43.0000   92.0000   30.0000  32.0000
10
[Plot: Value vs. Measurement]
slide by Alex Smola
[Plot: H-Bands vs. Person]
12
[Bi-variate plot: C-LDH vs. C-Triglycerides; Tri-variate plot: M-EPI vs. C-LDH vs. C-Triglycerides]
It is difficult to see in 4 or higher dimensional spaces…
slide by Alex Smola
Even 3 dimensions are already difficult. How to extend this?
- Is there a representation better than the coordinate axes?
- Is it really necessary to show all the 53 dimensions?
- What if there are strong correlations between the features?
- How could we find the smallest subspace of the 53-D space that keeps the most information about the original data?
13
slide by Barnabás Póczos and Aarti Singh
Reduce data from 2D to 1D
Motivation II: Data Compression
slide by Andrew Ng
[Plot axes: (inches) and (cm)]
Motivation II: Data Compression
slide by Andrew Ng
Reduce data from 2D to 1D
Motivation II: Data Compression
slide by Andrew Ng
Reduce data from 3D to 2D
Clustering: summarize each data point with a single categorical variable.
Dimensionality reduction: summarize the data with a lower-dimensional real-valued vector.
17
slide by Fereshteh Sadeghi
PCA: Orthogonal projection of the data onto a lower-dimensional linear space that...
19
- The principal component vectors originate from the center of mass.
- Principal component #1 points in the direction of the largest variance.
- Each subsequent principal component is orthogonal to the previous ones, and points in the direction of the largest variance of the residual subspace.
20
slide by Barnabás Póczos and Aarti Singh
slide by Barnabás Póczos and Aarti Singh
Residual subspace
Reconstruction of x from the orthonormal principal directions w1, w2 (2D case):
x = w1(w1^T x) + w2(w2^T x)
25
PCA algorithm II (sample covariance matrix)
Sample covariance matrix:
Σ = (1/m) ∑_{i=1}^{m} (x_i − x̄)(x_i − x̄)^T,  where  x̄ = (1/m) ∑_{i=1}^{m} x_i
slide by Barnabás Póczos and Aarti Singh
Reminder: Eigenvector and Eigenvalue
26
Ax = λx
A: square matrix, x: eigenvector or characteristic vector, λ: eigenvalue or characteristic value
Reminder: Eigenvector and Eigenvalue
27
Ax = λx
Ax − λx = 0
(A − λI)x = 0
Define a new matrix B = A − λI, so that Bx = 0.
If B has an inverse, then x = B⁻¹0 = 0. BUT an eigenvector cannot be zero!
So x will be an eigenvector of A if and only if B does not have an inverse, or equivalently det(B) = 0:
det(A − λI) = 0
Reminder: Eigenvector and Eigenvalue
28
Example 1: Find the eigenvalues of
A = [ 2  −12
      1   −5 ]
det(λI − A) = (λ − 2)(λ + 5) + 12 = λ² + 3λ + 2 = (λ + 1)(λ + 2) = 0
Two eigenvalues: −1, −2.
Note: The roots of the characteristic equation can be repeated. That is, λ1 = λ2 = … = λk. If that happens, the eigenvalue is said to be of multiplicity k.
Example 2: Find the eigenvalues of a 3×3 matrix A with
det(λI − A) = (λ − 2)³ = 0
λ = 2 is an eigenvalue of multiplicity 3.
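As a quick numerical check (not from the original slides), NumPy's eig recovers the eigenvalues of Example 1; the matrix below is the one reconstructed above.

```python
import numpy as np

# Matrix A from Example 1 above
A = np.array([[2.0, -12.0],
              [1.0,  -5.0]])

eigvals, eigvecs = np.linalg.eig(A)   # eigenvectors are the columns of eigvecs
print(np.sort(eigvals))               # approximately [-2., -1.]

# Verify the defining relation A x = lambda x for the first eigenpair
lam, x = eigvals[0], eigvecs[:, 0]
print(np.allclose(A @ x, lam * x))    # True
```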
PCA algorithm II (sample covariance matrix)
29
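A minimal NumPy sketch of PCA algorithm II above (my own illustration, not code from the course): center the data, build the sample covariance matrix from the formula given earlier, and keep the eigenvectors with the largest eigenvalues. The names pca_covariance, X and k are assumptions.

```python
import numpy as np

def pca_covariance(X, k):
    """PCA via the sample covariance matrix.
    X: (m, d) data matrix with one sample per row; k: number of components."""
    x_bar = X.mean(axis=0)                    # sample mean
    Xc = X - x_bar                            # centered data
    Sigma = (Xc.T @ Xc) / X.shape[0]          # (1/m) * sum_i (x_i - x_bar)(x_i - x_bar)^T
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigh since Sigma is symmetric
    order = np.argsort(eigvals)[::-1]         # sort by decreasing variance
    W = eigvecs[:, order[:k]]                 # top-k principal directions, shape (d, k)
    Z = Xc @ W                                # projections (the compressed representation)
    return W, Z, x_bar

# toy usage: 200 correlated 5-D points projected onto 2 principal directions
X = np.random.randn(200, 5) @ np.random.randn(5, 5)
W, Z, mu = pca_covariance(X, k=2)
X_rec = Z @ W.T + mu                          # reconstruction from the 2-D subspace
```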
PCA algorithm III (SVD of the data matrix)
30
Singular Value Decomposition of the centered data matrix X.
X (features × samples) = U S V^T
[Diagram: X = U S V^T; the leading singular values/vectors are significant, the remaining ones correspond to noise.]
PCA algorithm III (SVD of the data matrix)
slide by Barnabás Póczos and Aarti Singh
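Equivalently, PCA algorithm III works directly from the SVD of the centered data matrix, which avoids forming a large covariance matrix. A hedged sketch (samples stored as rows here, whereas the slide stacks them as columns):

```python
import numpy as np

def pca_svd(X, k):
    """PCA from the SVD of the centered data matrix; samples are rows of X."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # Xc = U S V^T
    W = Vt[:k].T                              # rows of V^T are the principal directions
    Z = Xc @ W                                # scores; equals U[:, :k] * S[:k]
    var_explained = S[:k] ** 2 / X.shape[0]   # top eigenvalues of the sample covariance
    return W, Z, var_explained
```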
32
33
34
Can we just use the given 256 × 256 pixels?
slide by Barnabás Póczos and Aarti Singh
Example data set: Images of faces
Famous Eigenface approach
[Turk & Pentland], [Sirovich & Kirby]
Each face x consists of 256 × 256 luminance values (one per pixel location), so x lives in a 256 × 256 = 64K-dimensional space (view it as a 64K-dim vector).
Form the centered data matrix X = [ x1, …, xm ] (m faces, each a column of 256 × 256 real values).
Compute Σ = XX^T. Problem: Σ is 64K × 64K … HUGE!!!
Method A: Build a PCA subspace for each person and check which subspace can reconstruct the test image the best.
Method B: Build one PCA database for the whole dataset and then classify based on the weights.
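A rough sketch of Method B in NumPy (my own reading of "classify based on the weights"; the nearest-neighbour rule and all names are assumptions): project every face onto the shared eigenface basis and label a test face by its closest training face in weight space.

```python
import numpy as np

def classify_method_b(train_faces, train_labels, test_face, W, mean_face):
    """train_faces: (m, d) flattened faces; W: (d, k) eigenfaces; mean_face: (d,)."""
    train_w = (train_faces - mean_face) @ W      # PCA weights of the training faces
    test_w = (test_face - mean_face) @ W         # PCA weights of the test face
    dists = np.linalg.norm(train_w - test_w, axis=1)
    return train_labels[np.argmin(dists)]        # label of the nearest face in weight space
```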
27
slide by Barnabás Póczos and Aarti Singh
Claim: If v is an eigenvector of X^T X with eigenvalue λ, then Xv is an eigenvector of XX^T (with the same eigenvalue λ).
Proof: Let L = X^T X, so L v = λv.
X^T X v = λv
X (X^T X v) = X(λv) = λ Xv
(XX^T)(Xv) = λ(Xv)
slide by Barnabás Póczos and Aarti Singh
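The claim above is exactly what makes eigenfaces computable: eigendecompose the small m × m matrix X^T X and map each eigenvector v to Xv, instead of ever forming the 64K × 64K matrix XX^T. A minimal sketch under the slide's convention of one face per column (names are assumptions):

```python
import numpy as np

def top_eigenfaces(X, k):
    """X: (d, m) centered data, one face per column (d = 64K, m = number of faces).
    Returns the top-k eigenvectors of X X^T without ever building it."""
    L = X.T @ X                                   # small m x m matrix
    eigvals, V = np.linalg.eigh(L)
    order = np.argsort(eigvals)[::-1][:k]
    U = X @ V[:, order]                           # map v -> Xv: eigenvectors of X X^T
    U /= np.linalg.norm(U, axis=0, keepdims=True) # normalize each eigenface
    return U, eigvals[order]
```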
slide by Derek Hoiem
Representation and Reconstruction
38
slide by Derek Hoiem
Principal Components (Method B)
39
slide by Barnabás Póczos and Aarti Singh
Principal Components (Method B)
40
¡fae ¡if ¡ain ¡ih
41
slide by Alex Smola
Happiness subspace (method A)
42
slide by Barnabás Póczos and Aarti Singh
slide by Barnabás Póczos and Aarti Singh
Facial Expression Recognition Movies
44
slide by Barnabás Póczos and Aarti Singh
Facial Expression Recognition Movies
45
slide by Barnabás Póczos and Aarti Singh
Facial Expression Recognition Movies
46
slide by Barnabás Póczos and Aarti Singh
3D objects (heads)
47
slide by Barnabás Póczos and Aarti Singh
49
50
slide by Barnabás Póczos and Aarti Singh
PCA compression: 144D => 60D
51
slide by Barnabás Póczos and Aarti Singh
PCA compression: 144D => 16D
52
slide by Barnabás Póczos and Aarti Singh
16 most important eigenvectors
53
[Figure: the 16 most important eigenvectors, each shown as a 12 × 12 image]
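The 144D => 60D and 144D => 16D compressions above boil down to keeping only the top-k projection coefficients per patch and reconstructing from them. A small sketch, assuming each 12 × 12 patch is flattened into a 144-dimensional row of X:

```python
import numpy as np

def compress_and_reconstruct(X, k):
    """Project 144-D patches onto the top-k principal directions and map back."""
    mu = X.mean(axis=0)
    Xc = X - mu
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T                 # (144, k) basis of "eigen-patches"
    Z = Xc @ W                   # k numbers stored per patch instead of 144
    return Z @ W.T + mu          # reconstruction back in the original 144-D space

# e.g. k = 60 reconstructs the patches almost perfectly; k = 16 gives a visibly smoothed result
```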
slide by Barnabás Póczos and Aarti Singh
[Figure: six 12 × 12 images]
slide by Barnabás Póczos and Aarti Singh
[Figure: three 12 × 12 images]
slide by Barnabás Póczos and Aarti Singh
slide by Barnabás Póczos and Aarti Singh
60 most important eigenvectors
59
slide by Barnabás Póczos and Aarti Singh
http://en.wikipedia.org/wiki/Discrete_cosine_transform
slide by Barnabás Póczos and Aarti Singh
62
x x’ U x
slide by Barnabás Póczos and Aarti Singh
slide by Barnabás Póczos and Aarti Singh
Denoised image using 15 PCA components
64
slide by Barnabás Póczos and Aarti Singh
66
PCA doesn't know labels!
slide by Barnabás Póczos and Aarti Singh
PCA vs. Fisher Linear Discriminant
67
Fisher Linear Discriminant
68
slide by Barnabás Póczos and Aarti Singh
Autoencoders: a neural network whose target outputs are its own inputs.
71
slide by Sanja Fidler
z = f(Wx);  x̂ = g(Vz)
72
slide by Sanja Fidler
z = f(Wx);  x̂ = g(Vz)
min_{W,V} (1/2N) ∑_{n=1}^{N} ||x(n) − x̂(n)||²
73
slide by Sanja Fidler
z = f(Wx);  x̂ = g(Vz)
min_{W,V} (1/2N) ∑_{n=1}^{N} ||x(n) − x̂(n)||²
With linear activations this becomes
min_{W,V} (1/2N) ∑_{n=1}^{N} ||x(n) − VWx(n)||²
74
slide by Sanja Fidler
75
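A toy NumPy sketch of the linear objective above, trained with plain batch gradient descent (the optimizer, step size and names are my assumptions; the slide only states the objective). With linear units the learned code spans the same subspace as the top principal components.

```python
import numpy as np

def train_linear_autoencoder(X, k, lr=0.01, epochs=2000):
    """Minimize (1/2N) * sum_n ||x(n) - V W x(n)||^2 by gradient descent.
    X: (N, d) centered data; W: (k, d) encoder; V: (d, k) decoder."""
    N, d = X.shape
    rng = np.random.default_rng(0)
    W = 0.1 * rng.standard_normal((k, d))
    V = 0.1 * rng.standard_normal((d, k))
    for _ in range(epochs):
        Z = X @ W.T                # codes      z(n) = W x(n)
        X_hat = Z @ V.T            # outputs    x_hat(n) = V z(n)
        E = (X_hat - X) / N        # scaled reconstruction error
        V -= lr * (E.T @ Z)        # gradient dL/dV
        W -= lr * ((E @ V).T @ X)  # gradient dL/dW
    return W, V
```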
slide by Sanja Fidler
Deep autoencoders can give a much more accurate description of the data than PCA.
76
slide by Sanja Fidler
[Figure rows: Real data · 30-d deep autoencoder · 30-d logistic PCA · 30-d PCA]
slide by Sanja Fidler
PCA is based on the covariance matrix only. What if the data is not well described by the covariance matrix?
The only distribution that is fully described by its covariance (with the subtracted mean) is the Gaussian.
Distributions that are far from Gaussian are poorly described by their covariances.
79
slide by Kornel Laskowski and Dave Touretzky
Faithful vs Meaningful Representations
PCA leads to the most faithful representation in a reconstruction-error sense (recall that we trained our autoencoder network using a mean-square error on an input reconstruction layer).
The mean-square error criterion implicitly assumes Gaussianity, since it penalizes datapoints close to the mean less than those that are far away.
80
slide by Kornel Laskowski and Dave Touretzky
A Criterion Stronger than Decorrelation
- Seek components which are statistically independent, rather than just uncorrelated.
- Unlike uncorrelatedness, independence must hold for any functions g1 and g2 (see below).
81
slide by Kornel Laskowski and Dave Touretzky
Independence:  p(ξ1, ξ2, · · · , ξN) = ∏_{i=1}^{N} p(ξi)
Uncorrelatedness:  ⟨ξi ξj⟩ − ⟨ξi⟩⟨ξj⟩ = 0 ,  i ≠ j
Independence implies:  ⟨g1(ξi) g2(ξj)⟩ − ⟨g1(ξi)⟩⟨g2(ξj)⟩ = 0 ,  i ≠ j ,  for any functions g1 and g2
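A small numerical illustration (my own example, not from the slides) of why the last condition is strictly stronger: ξ2 = ξ1² is uncorrelated with ξ1, yet choosing g1(u) = u² exposes the dependence.

```python
import numpy as np

rng = np.random.default_rng(0)
xi1 = rng.standard_normal(100_000)
xi2 = xi1 ** 2                     # a deterministic function of xi1, so clearly NOT independent

# Uncorrelated: <xi1 xi2> - <xi1><xi2> is (up to sampling noise) zero
print(np.mean(xi1 * xi2) - np.mean(xi1) * np.mean(xi2))    # ~ 0

# But with g1(u) = u^2 and g2(u) = u the independence condition fails badly
g1, g2 = xi1 ** 2, xi2
print(np.mean(g1 * g2) - np.mean(g1) * np.mean(g2))        # ~ 2, far from 0
```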
Independent Component Analysis (ICA)
- ICA imposes the stronger requirement of independence, rather than uncorrelatedness.
- No closed-form solution (as for PCA) exists, so ICA is implemented using neural network models.
- Each model defines an objective function to descend/climb in.
- The independent components are directions in N-dimensional space; they need not be orthogonal.
- When are independent components the same as the uncorrelated (principal) components? When the generative distribution is uniquely determined by its first and second moments. This is true of only the Gaussian distribution.
82
slide by Kornel Laskowski and Dave Touretzky
83
slide by Kornel Laskowski and Dave Touretzky
ȳ = 1 / (1 + e^(−W^T ξ̄))
(we’re trying to maximize the enclosed area representing information quantities).
84
slide by Kornel Laskowski and Dave Touretzky
H(p) = entropy of the distribution p of the first neuron's output
H(p|q) = conditional entropy
I(p; q) = H(p) − H(p|q) = H(q) − H(q|p) = mutual information
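A short sketch of these quantities for a discrete joint distribution (the joint table below is made up purely for illustration):

```python
import numpy as np

# Assumed toy joint distribution p(y1, y2) over two binary neuron outputs
P = np.array([[0.4, 0.1],
              [0.1, 0.4]])

def H(p):
    """Entropy in bits of a (possibly multi-dimensional) probability table."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

Hp = H(P.sum(axis=1))        # H(p): entropy of the first output's marginal
Hq = H(P.sum(axis=0))        # H(q)
Hpq = H(P)                   # joint entropy
I = Hp + Hq - Hpq            # I(p; q) = H(p) - H(p|q) = H(q) - H(q|p)
print(Hp, Hq, I)
```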
The sources {sk[t]} and the observed signals {xk[t]} are time series (t is a discrete time index):
xk[t] = A sk[t] + nk[t]
where nk[t] is the noise contribution in the kth signal xk[t], and A is a mixture matrix.
85
slide by Kornel Laskowski and Dave Touretzky
86
slide by Barnabás Póczos and Aarti Singh
ICA estimation:
Sources: s(t)
Mixing / Observation: x(t) = A s(t)
Estimated sources: y(t) = W x(t)
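For the s(t) → x(t) = A s(t) → y(t) = W x(t) pipeline above, scikit-learn's FastICA can estimate the unmixing matrix from the observed mixtures alone. A sketch with synthetic sources (the particular sources and mixing matrix are assumptions):

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]  # two independent sources s(t)
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])                        # mixing matrix
x = s @ A.T                                       # observations x(t) = A s(t)

ica = FastICA(n_components=2, random_state=0)
y = ica.fit_transform(x)        # recovered sources y(t), up to permutation and scaling
W = ica.components_             # estimated unmixing matrix (y = W x after centering)
```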
87
Audio demo by Paris Smaragdis — input mix and extracted speech:
http://paris.cs.illinois.edu/demos/index.html