Machine Learning 2 (DS 4420) - Spring 2020
Dimensionality reduction 2
Byron C. Wallace
Today
- A bit of wrap up on PCA
- Then: Non-linear dimensionality reduction! (SNE/t-SNE)
In Sum: Principal Component Analysis

Data: X = (x_1 · · · x_n) ∈ R^{d×n}

Eigendecompose the covariance: the eigenvalues are \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d).

Idea: Take the top-k eigenvectors to maximize variance.

Why?

Last time, we saw that we can derive this by maximizing the variance in the compressed space. Can also motivate by explicitly minimizing reconstruction error.
Minimizing reconstruction error
Getting the eigenvalues, two ways

S = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^\top = \frac{1}{N} X X^\top

- Direct eigenvalue decomposition of the covariance matrix (sketched in code below)
- Singular Value Decomposition (SVD)
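For concreteness, here is a minimal numpy sketch of the first route, eigendecomposition of the sample covariance. The synthetic data, dimensions, and k below are illustrative assumptions, not values from the slides.

```python
# Method 1 sketch: eigendecomposition of the sample covariance S = (1/N) X X^T.
import numpy as np

rng = np.random.default_rng(0)
d, N = 5, 100
X = rng.normal(size=(d, N))                 # columns are data points
X = X - X.mean(axis=1, keepdims=True)       # center the data

S = (X @ X.T) / N                           # S = (1/N) X X^T
eigvals, eigvecs = np.linalg.eigh(S)        # eigh: S is symmetric
order = np.argsort(eigvals)[::-1]           # sort eigenvalues in decreasing order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
U_k = eigvecs[:, :k]                        # top-k principal directions
Z = U_k.T @ X                               # k x N compressed representation
```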
Singular Value Decomposition

Idea: Decompose the d x n matrix X into
- 1. An n x n basis V (unitary matrix)
- 2. A d x n matrix Σ (diagonal projection)
- 3. A d x d basis U (unitary matrix)

X = U_{d \times d} \Sigma_{d \times n} V^\top_{n \times n}
Applying X = U \Sigma V^\top to a vector x decomposes into three steps:

- 1. Rotation: V^\top x = \sum_{i=1}^{n} \langle v_i, x \rangle \, e_i
- 2. Scaling: \Sigma V^\top x = \sum_{i=1}^{n} s_i \langle v_i, x \rangle \, e_i
- 3. Rotation: U \Sigma V^\top x = \sum_{i=1}^{n} s_i \langle v_i, x \rangle \, u_i
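A quick numerical check of this rotate-scale-rotate reading, using numpy's SVD on a small random matrix (the data here is arbitrary, purely for illustration).

```python
# Check: U @ (s * (V^T x)) equals X @ x, i.e. sum_i s_i <v_i, x> u_i.
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 3
X = rng.normal(size=(d, n))
x = rng.normal(size=n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) V^T

coords = Vt @ x            # step 1 (rotation): coordinates <v_i, x>
scaled = s * coords        # step 2 (scaling): s_i <v_i, x>
result = U @ scaled        # step 3 (rotation): sum_i s_i <v_i, x> u_i

assert np.allclose(result, X @ x)
```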
SVD for PCA

X_{D \times N} = U_{D \times D} \Sigma_{D \times N} V^\top_{N \times N}

S = \frac{1}{N} X X^\top = \frac{1}{N} U \Sigma \underbrace{V^\top V}_{= I_N} \Sigma^\top U^\top = \frac{1}{N} U \Sigma \Sigma^\top U^\top

It turns out the columns of U are the eigenvectors of X X^\top.
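A small sketch verifying this claim numerically: the left singular vectors of X match the eigenvectors of X X^T up to sign, and the squared singular values match its eigenvalues. The random data is just for illustration.

```python
# Left singular vectors of X = eigenvectors of X X^T (up to sign); sigma_i^2 = eigenvalues.
import numpy as np

rng = np.random.default_rng(0)
D, N = 6, 50
X = rng.normal(size=(D, N))
X = X - X.mean(axis=1, keepdims=True)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
eigvals, eigvecs = np.linalg.eigh(X @ X.T)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

assert np.allclose(s**2, eigvals)
for i in range(U.shape[1]):
    assert np.allclose(U[:, i], eigvecs[:, i]) or np.allclose(U[:, i], -eigvecs[:, i])
```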
Computing PCA

Method 1: eigendecomposition. U are the eigenvectors of the covariance matrix C = \frac{1}{n} X X^\top. Computing C already takes O(nd^2) time (very expensive).

Method 2: singular value decomposition (SVD). Find X = U_{d \times d} \Sigma_{d \times n} V^\top_{n \times n} where U^\top U = I_{d \times d}, V^\top V = I_{n \times n}, and \Sigma is diagonal. Computing the top k singular vectors takes only O(ndk).

Relationship between eigendecomposition and SVD: the left singular vectors are the principal components (C = \frac{1}{n} U \Sigma \Sigma^\top U^\top).
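A sketch of the cheaper route: computing only the top-k singular vectors without ever forming the d x d covariance. The randomized_svd helper from scikit-learn is an assumption about the environment, not something named in the slides.

```python
# Method 2 sketch: top-k SVD directly on X, avoiding the O(nd^2) covariance.
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.default_rng(0)
d, n, k = 1000, 500, 10
X = rng.normal(size=(d, n))
X = X - X.mean(axis=1, keepdims=True)

U_k, s_k, Vt_k = randomized_svd(X, n_components=k, random_state=0)
Z = U_k.T @ X                 # k x n scores; approximately diag(s_k) @ Vt_k
```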
Eigen-faces [Turk & Pentland 1991]

- d = number of pixels
- Each x_i ∈ R^d is a face image
- x_{ji} = intensity of the j-th pixel in image i

X_{d \times k} \approx U_{d \times k} Z_{k \times n}, i.e. (x_1 \cdots x_n) \approx U (z_1 \cdots z_n)

Idea: z_i is a more “meaningful” representation of the i-th face than x_i. Can use z_i for nearest-neighbor classification. Much faster: O(dk + nk) time instead of O(dn) when n, d ≫ k.
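A hedged sketch of the eigen-faces pipeline: PCA on face images, then nearest-neighbor lookup in the compressed z space. The Olivetti faces loader from scikit-learn (which downloads the data on first use) stands in for the dataset; it is an assumption, not the dataset from the slides.

```python
# Eigen-faces sketch: project faces to k-dimensional codes, match by nearest neighbor.
import numpy as np
from sklearn.datasets import fetch_olivetti_faces

faces = fetch_olivetti_faces()
X = faces.data.T                              # d x n: each column is a flattened face
mean_face = X.mean(axis=1, keepdims=True)
Xc = X - mean_face

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 50
U_k = U[:, :k]                                # the "eigen-faces"
Z = U_k.T @ Xc                                # k x n codes z_i

# Nearest neighbor of face 0 in the compressed space (excluding itself):
dists = np.linalg.norm(Z - Z[:, [0]], axis=0)
dists[0] = np.inf
print("closest match to image 0:", int(np.argmin(dists)))
```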
Aside: How many components?

- Magnitudes of the eigenvalues indicate the fraction of variance captured.
- Eigenvalues on a face image dataset: [plot of eigenvalue λ_i against component index i]
- Eigenvalues typically drop off sharply, so we don’t need that many.
- Of course, variance isn’t everything...
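A tiny helper illustrating the point above: the fraction of variance captured by the top-k components, straight from the eigenvalues. The 95% threshold is just an example choice.

```python
# Fraction of variance captured by the top-k eigenvalues, and a simple way to pick k.
import numpy as np

def variance_captured(eigvals, k):
    """eigvals: covariance eigenvalues sorted in decreasing order."""
    return eigvals[:k].sum() / eigvals.sum()

def choose_k(eigvals, target=0.95):
    """Smallest k whose components capture at least `target` of the total variance."""
    frac = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(frac, target) + 1)
```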
Wrapping up PCA

- PCA is a linear model for dimensionality reduction which finds a mapping to a lower-dimensional space that maximizes variance.
- We saw that this is equivalent to performing an eigendecomposition on the covariance matrix of X.
- Next time: Auto-encoders and neural compression for non-linear projections.
Non-linear dimensionality reduction
Limitations of Linearity
[Figure: a dataset where PCA is effective vs. one where PCA is ineffective]
Nonlinear PCA

[Figure: the broken (linear) solution vs. the desired solution on a parabolic dataset]

We want the desired solution: S = \{(x_1, x_2) : x_2 = \frac{u_2}{u_1} x_1^2\}

We can get this: S = \{x : \phi(x) = Uz\} with \phi(x) = (x_1^2, x_2)^\top

Linear dimensionality reduction in \phi(x) space ⇔ Nonlinear dimensionality reduction in x space
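A minimal sketch of this trick: lift the data with \phi(x) = (x_1^2, x_2), then run ordinary linear PCA in the lifted space; the top principal direction recovers the parabola's coefficient. The synthetic parabola below is an illustrative assumption.

```python
# Linear PCA in the lifted space phi(x) = (x1^2, x2) finds the parabolic structure.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, size=200)
x2 = 0.5 * x1**2 + 0.01 * rng.normal(size=200)     # points near x2 = 0.5 * x1^2

Phi = np.stack([x1**2, x2])                        # 2 x n lifted data
Phi = Phi - Phi.mean(axis=1, keepdims=True)
eigvals, eigvecs = np.linalg.eigh((Phi @ Phi.T) / Phi.shape[1])
u = eigvecs[:, -1]                                 # top principal direction in phi-space
print(u / u[0])                                    # roughly (1, 0.5): the parabola's coefficient
```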
Idea: Use kernels
Kernel PCA
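A short usage sketch with scikit-learn's KernelPCA; the two-circles dataset, the RBF kernel, and the gamma value are illustrative choices, not prescriptions from the slides.

```python
# Kernel PCA separates data that linear PCA cannot.
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
Z = KernelPCA(n_components=2, kernel="rbf", gamma=10.0).fit_transform(X)
# The two rings, inseparable under linear PCA, pull apart in the kernel PCA coordinates.
```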
Alternatively: t-SNE!
Stochastic Neighbor Embeddings
Borrowing from: Laurens van der Maaten (Delft -> Facebook AI)
Manifold learning
Idea: Perform a non-linear dimensionality reduction in a manner that preserves proximity (but not distances)
Manifold learning
[Figures: PCA on MNIST digits vs. t-SNE on MNIST digits]
Swiss roll
Euclidean distance is not always a good notion of proximity
Non-linear projection
Bad projection: relative position to neighbors changes
Non-linear projection
Intuition: Want to preserve local neighborhood
Stochastic Neighbor Embedding
[Figure: points in the original space and their positions in the map (the low-dimensional embedding)]
SNE to t-SNE (on board)
t-SNE: SNE with a t-Distribution
Similarity in high dimension:

p_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma^2)}{\sum_{k \neq l} \exp(-\|x_k - x_l\|^2 / 2\sigma^2)}

Similarity in low dimension:

q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}

Gradient:

\frac{\partial C}{\partial y_i} = 4 \sum_{j \neq i} (p_{ij} - q_{ij}) (1 + \|y_i - y_j\|^2)^{-1} (y_i - y_j)
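A compact numpy sketch of these three quantities. To keep it short, it uses a single shared σ rather than per-point σ_i chosen via perplexity, so it mirrors the formulas but not the full calibration step; rows of X and Y are points.

```python
# High-dim affinities p_ij, low-dim affinities q_ij (Student-t kernel), and the gradient.
import numpy as np

def p_matrix(X, sigma=1.0):
    """Symmetric high-dimensional affinities p_ij (single shared sigma)."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # ||x_i - x_j||^2
    P = np.exp(-d2 / (2.0 * sigma**2))
    np.fill_diagonal(P, 0.0)
    return P / P.sum()

def q_matrix(Y):
    """Low-dimensional affinities q_ij under the Student-t kernel."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    Q = 1.0 / (1.0 + d2)
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum(), d2

def gradient(P, Y):
    """dC/dy_i = 4 * sum_j (p_ij - q_ij) (1 + ||y_i - y_j||^2)^-1 (y_i - y_j)."""
    Q, d2 = q_matrix(Y)
    W = (P - Q) / (1.0 + d2)
    diffs = Y[:, None, :] - Y[None, :, :]
    return 4.0 * np.einsum("ij,ijk->ik", W, diffs)
```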
Algorithm 1: Simple version of t-Distributed Stochastic Neighbor Embedding.

Data: data set X = {x_1, x_2, ..., x_n}; cost function parameters: perplexity Perp; optimization parameters: number of iterations T, learning rate η, momentum α(t).
Result: low-dimensional data representation Y(T) = {y_1, y_2, ..., y_n}.

begin
  compute pairwise affinities p_{j|i} with perplexity Perp (using Equation 1)
  set p_{ij} = (p_{j|i} + p_{i|j}) / 2n
  sample initial solution Y(0) = {y_1, y_2, ..., y_n} from N(0, 10^{-4} I)
  for t = 1 to T do
    compute low-dimensional affinities q_{ij} (using Equation 4)
    compute gradient δC/δY (using Equation 5)
    set Y(t) = Y(t-1) + η δC/δY + α(t) (Y(t-1) - Y(t-2))
  end
end

In the update, the η δC/δY term is regular gradient descent; the α(t) (Y(t-1) - Y(t-2)) term is “momentum”.
Figure credit: Bisong, 2019
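Building on the p_matrix and gradient helpers sketched above, here is how the update loop in Algorithm 1 might look. The descent sign (-η · gradient) and the 0.5 to 0.8 momentum schedule are stated as assumptions following common t-SNE practice, not a verbatim transcription of the slide.

```python
# Optimization loop sketch for Algorithm 1: gradient step plus momentum term.
import numpy as np

def tsne_simple(X, T=500, eta=100.0, out_dim=2, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    P = p_matrix(X)                               # high-dimensional affinities (fixed)
    Y = rng.normal(scale=1e-2, size=(n, out_dim)) # Y(0) ~ N(0, 1e-4 I)
    Y_prev = Y.copy()
    for t in range(1, T + 1):
        alpha = 0.5 if t < 250 else 0.8           # momentum schedule alpha(t)
        grad = gradient(P, Y)
        Y_next = Y - eta * grad + alpha * (Y - Y_prev)   # gradient step + momentum
        Y_prev, Y = Y, Y_next
    return Y
```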
What’s magical about t?

[Figure: (a) gradient of SNE and (c) gradient of t-SNE, each as a function of the high-dimensional distance and the low-dimensional distance between a pair of points]

Basically, the gradient has nice properties.

A positive gradient -> “attraction” between points.

t-SNE repels points in the low-dimensional space that are dissimilar in the high-dimensional space.

It also strongly attracts points that are nearby in the high-dimensional space.
Let’s see some code
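Before the notebook, a one-cell preview of what off-the-shelf t-SNE looks like with scikit-learn on the small digits dataset; the parameter values are illustrative.

```python
# Off-the-shelf t-SNE on the scikit-learn digits dataset.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)          # 1797 flattened 8x8 digit images
Z = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)
# Z is 1797 x 2; a scatter plot of Z colored by y shows the ten digit classes as clusters.
```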
Another perspective: Auto-encoders
Figure credit: https://stackabuse.com/autoencoders-for-image-reconstruction-in-python-and-keras/