SLIDE 1

Machine Learning 2

DS 4420 - Spring 2020

Dimensionality reduction 2

Byron C Wallace

SLIDE 2

Today

  • A bit of wrap-up on PCA
  • Then: Non-linear dimensionality reduction! (SNE/t-SNE)

SLIDE 3

In Sum: Principal Component Analysis

Data: X = (x_1 · · · x_n) ∈ R^{d×n}

Eigenvectors of the covariance, with eigenvalue matrix Λ = diag(λ_1, λ_2, ..., λ_d)

Idea: Take the top-k eigenvectors to maximize variance

SLIDE 4

Why?

Idea: Take top-k eigenvectors to maximize variance

Last time, we saw that we can derive this by maximizing the variance in the compressed space

SLIDE 5

Why?

Idea: Take top-k eigenvectors to maximize variance

Last time, we saw that we can derive this by maximizing the variance in the compressed space

Can also motivate by explicitly minimizing reconstruction error

SLIDE 6

Minimizing reconstruction error
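
The derivation itself was done on the board and isn't in the deck; as a sketch of the standard objective it refers to (assuming centered data), minimizing the reconstruction error over orthonormal U ∈ R^{d×k} is equivalent to maximizing the projected variance:

    % Standard reconstruction-error formulation of PCA (not transcribed from the board)
    \min_{U:\,U^\top U = I_k} \sum_{n=1}^{N} \left\lVert x_n - U U^\top x_n \right\rVert^2
      \;=\; \sum_{n=1}^{N} \lVert x_n \rVert^2 \;-\; \max_{U:\,U^\top U = I_k} \operatorname{tr}\!\left( U^\top \Big( \sum_{n=1}^{N} x_n x_n^\top \Big) U \right)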

SLIDE 7

Getting the eigenvalues, two ways

S = (1/N) Σ_{n=1}^{N} x_n x_nᵀ = (1/N) X Xᵀ

  • Direct eigenvalue decomposition of the covariance matrix

SLIDE 8

Getting the eigenvalues, two ways

  • Direct eigenvalue decomposition of the covariance matrix

S = (1/N) Σ_{n=1}^{N} x_n x_nᵀ = (1/N) X Xᵀ

  • Singular Value Decomposition (SVD)
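
A minimal NumPy sketch of the first route (not the lecture's code), assuming the columns of X are the data points:

    # PCA via eigendecomposition of the covariance matrix.
    import numpy as np

    def pca_eig(X, k):
        """X is d x n (columns are data points); return top-k components and projections."""
        Xc = X - X.mean(axis=1, keepdims=True)   # center each feature
        S = (Xc @ Xc.T) / Xc.shape[1]            # d x d covariance (the O(n d^2) step)
        eigvals, eigvecs = np.linalg.eigh(S)     # eigh handles the symmetric matrix
        order = np.argsort(eigvals)[::-1]        # eigenvalues in decreasing order
        U = eigvecs[:, order[:k]]                # top-k eigenvectors, d x k
        Z = U.T @ Xc                             # k x n compressed representation
        return U, Z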

SLIDE 9

Singular Value Decomposition

Idea: Decompose the d x n matrix X into

  • 1. An n x n basis V (unitary matrix)
  • 2. A d x n matrix Σ (diagonal projection)
  • 3. A d x d basis U (unitary matrix)

X = U_{d×d} Σ_{d×n} Vᵀ_{n×n}

SLIDE 10

  • 1. Rotation:  Vᵀx = Σ_{i=1}^{n} ⟨v_i, x⟩ e_i
  • 2. Scaling:   S Vᵀx = Σ_{i=1}^{n} s_i ⟨v_i, x⟩ e_i
  • 3. Rotation:  U S Vᵀx = Σ_{i=1}^{n} s_i ⟨v_i, x⟩ u_i

SLIDE 11

SVD for PCA

X_{D×N} = U_{D×D} Σ_{D×N} Vᵀ_{N×N}

S = (1/N) X Xᵀ = (1/N) U Σ VᵀV Σᵀ Uᵀ = (1/N) U Σ Σᵀ Uᵀ   (since VᵀV = I_N)

SLIDE 12

SVD for PCA

X_{D×N} = U_{D×D} Σ_{D×N} Vᵀ_{N×N}

S = (1/N) X Xᵀ = (1/N) U Σ VᵀV Σᵀ Uᵀ = (1/N) U Σ Σᵀ Uᵀ   (since VᵀV = I_N)

It turns out the columns of U are the eigenvectors of XXᵀ
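
A quick numerical check of that claim (a sketch, not lecture code):

    # Left singular vectors of X match the eigenvectors of X Xᵀ, up to sign.
    import numpy as np

    X = np.random.randn(5, 20)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    w, E = np.linalg.eigh(X @ X.T)               # eigendecomposition of X Xᵀ
    E = E[:, np.argsort(w)[::-1]]                # order eigenvectors by decreasing eigenvalue
    print(np.allclose(np.abs(U), np.abs(E)))     # True: same columns up to sign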

SLIDE 13

Computing PCA

Method 1: eigendecomposition. U are the eigenvectors of the covariance matrix C = (1/n) X Xᵀ. Computing C already takes O(nd²) time (very expensive).

Method 2: singular value decomposition (SVD). Find X = U_{d×d} Σ_{d×n} Vᵀ_{n×n} where UᵀU = I_{d×d}, VᵀV = I_{n×n}, and Σ is diagonal. Computing the top k singular vectors takes only O(ndk).

SLIDE 14

Computing PCA

Method 1: eigendecomposition. U are the eigenvectors of the covariance matrix C = (1/n) X Xᵀ. Computing C already takes O(nd²) time (very expensive).

Method 2: singular value decomposition (SVD). Find X = U_{d×d} Σ_{d×n} Vᵀ_{n×n} where UᵀU = I_{d×d}, VᵀV = I_{n×n}, and Σ is diagonal. Computing the top k singular vectors takes only O(ndk).

Relationship between eigendecomposition and SVD: the left singular vectors are the principal components (C = (1/n) U Σ² Uᵀ).
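
A companion sketch of the second route (again, not the lecture's code): the same components from a thin SVD, without ever forming the covariance matrix.

    # PCA via SVD: left singular vectors of the centered data are the components.
    import numpy as np

    def pca_svd(X, k):
        """X is d x n; return top-k components and projections."""
        Xc = X - X.mean(axis=1, keepdims=True)
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # thin SVD, singular values descending
        return U[:, :k], U[:, :k].T @ Xc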

SLIDE 15

Eigen-faces [Turk & Pentland 1991]

  • d = number of pixels
  • Each x_i ∈ R^d is a face image
  • x_{ji} = intensity of the j-th pixel in image i

SLIDE 16

Eigen-faces [Turk & Pentland 1991]

  • d = number of pixels
  • Each x_i ∈ R^d is a face image
  • x_{ji} = intensity of the j-th pixel in image i

X_{d×n} ≈ U_{d×k} Z_{k×n},  with Z = (z_1 · · · z_n)

SLIDE 17

Eigen-faces [Turk & Pentland 1991]

  • d = number of pixels
  • Each x_i ∈ R^d is a face image
  • x_{ji} = intensity of the j-th pixel in image i

X_{d×n} ≈ U_{d×k} Z_{k×n},  with Z = (z_1 · · · z_n)

Idea: z_i is a more “meaningful” representation of the i-th face than x_i
Can use z_i for nearest-neighbor classification

SLIDE 18

Eigen-faces [Turk & Pentland 1991]

  • d = number of pixels
  • Each x_i ∈ R^d is a face image
  • x_{ji} = intensity of the j-th pixel in image i

X_{d×n} ≈ U_{d×k} Z_{k×n},  with Z = (z_1 · · · z_n)

Idea: z_i is a more “meaningful” representation of the i-th face than x_i
Can use z_i for nearest-neighbor classification
Much faster: O(dk + nk) time instead of O(dn) when n, d ≫ k
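
As a sketch of that use (hypothetical variable names X_train and x_query, assuming pca_eig from the earlier sketch): project the training faces and a query face, then match by nearest neighbor in the k-dimensional space.

    # Nearest-neighbor face matching in PCA space instead of raw pixel space.
    import numpy as np

    U, Z = pca_eig(X_train, k=50)                      # X_train: d x n matrix of face images
    z_query = U.T @ (x_query - X_train.mean(axis=1))   # project a new face the same way
    nearest = int(np.argmin(np.linalg.norm(Z - z_query[:, None], axis=0)))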

SLIDE 19

Aside: How many components?

  • Magnitude of eigenvalues indicates the fraction of variance captured.
  • Eigenvalues on a face image dataset:

[Plot: eigenvalue λ_i against component index i, dropping off sharply]

  • Eigenvalues typically drop off sharply, so we don’t need that many.
  • Of course variance isn’t everything...
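
One common way to pick k (a sketch, assuming eigvals holds the covariance eigenvalues as computed inside the earlier pca_eig sketch; the 95% threshold is illustrative):

    # Keep enough components to capture a target fraction of the variance.
    import numpy as np

    eigvals_sorted = np.sort(eigvals)[::-1]                  # largest first
    frac = np.cumsum(eigvals_sorted) / eigvals_sorted.sum()  # cumulative variance fraction
    k = int(np.searchsorted(frac, 0.95)) + 1                 # smallest k reaching 95%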

SLIDE 20

Wrapping up PCA

  • PCA is a linear model for dimensionality reduction which finds a mapping to a lower-dimensional space that maximizes variance
  • We saw that this is equivalent to performing an eigendecomposition on the covariance matrix of X
  • Next time: auto-encoders and neural compression for non-linear projections


SLIDE 22

Non-linear dimensionality reduction

SLIDE 23

Limitations of Linearity

[Figures: an example where PCA is effective vs. one where PCA is ineffective]

SLIDE 24

Nonlinear PCA

[Figures: broken solution vs. desired solution]

We want the desired solution: S = {(x_1, x_2) : x_2 = (u_2/u_1) x_1²}

SLIDE 25

Nonlinear PCA

[Figures: broken solution vs. desired solution]

We want the desired solution: S = {(x_1, x_2) : x_2 = (u_2/u_1) x_1²}

We can get this: S = {x : φ(x) = Uz} with φ(x) = (x_1², x_2)ᵀ

Linear dimensionality reduction in φ(x) space ⇔ Nonlinear dimensionality reduction in x space

Idea: Use kernels
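
A sketch of the explicit-feature-map route above (X_points is a hypothetical n x 2 data array): map each point through φ, then run ordinary linear PCA in φ-space.

    # Linear PCA in phi-space = nonlinear dimensionality reduction in x-space.
    import numpy as np

    def phi(X):                                   # X: n x 2 points
        return np.column_stack([X[:, 0] ** 2, X[:, 1]])

    Phi = phi(X_points)
    Phi = Phi - Phi.mean(axis=0)                  # center in feature space
    _, _, Vt = np.linalg.svd(Phi, full_matrices=False)
    z = Phi @ Vt[0]                               # 1-D projection onto the top component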

SLIDE 26

Kernel PCA
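
The kernel-PCA math isn't included in the deck; as a minimal sketch of the idea (using scikit-learn's KernelPCA, which the slides do not reference), a kernel stands in for the explicit feature map φ:

    # Kernel PCA: PCA in an implicit feature space defined by a kernel.
    from sklearn.datasets import make_circles
    from sklearn.decomposition import KernelPCA

    X, _ = make_circles(n_samples=400, factor=0.3, noise=0.05)   # concentric circles, not linearly separable
    Z = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)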

SLIDE 27

Alternatively: t-SNE!

SLIDE 28

Stochastic Neighbor Embeddings

Borrowing from: Laurens van der Maaten (Delft -> Facebook AI)

SLIDE 29

Manifold learning

Idea: Perform a non-linear dimensionality reduction in a manner that preserves proximity (but not distances)

SLIDE 30

Manifold learning

SLIDE 31

PCA on MNIST digits

SLIDE 32

t-SNE on MNIST Digits

SLIDE 33

Swiss roll

Euclidean distance is not always a good notion of proximity

SLIDE 34

Non-linear projection

Bad projection: relative position to neighbors changes

SLIDE 35

Non-linear projection

Intuition: Want to preserve local neighborhood

SLIDE 36

Stochastic Neighbor Embedding

Original space The map

SLIDE 37

SNE to t-SNE (on board)
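
The board material isn't captured here; for reference, the standard quantities it builds on (from van der Maaten & Hinton's formulation, not transcribed from the board) are the KL-divergence cost that the gradient on the next slide differentiates, and the perplexity used to set each σ_i:

    % Cost function and perplexity used by SNE / t-SNE (standard definitions)
    C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}},
    \qquad \mathrm{Perp}(P_i) = 2^{H(P_i)}, \quad H(P_i) = -\sum_{j} p_{j|i} \log_2 p_{j|i}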

SLIDE 38

t-SNE: SNE with a t-Distribution

Similarity in high dimension:
  p_ij = exp(−‖x_i − x_j‖² / 2σ²) / Σ_{k≠l} exp(−‖x_k − x_l‖² / 2σ²)

Similarity in low dimension:
  q_ij = (1 + ‖y_i − y_j‖²)⁻¹ / Σ_{k≠l} (1 + ‖y_k − y_l‖²)⁻¹

Gradient:
  ∂C/∂y_i = 4 Σ_{j≠i} (p_ij − q_ij) (1 + ‖y_i − y_j‖²)⁻¹ (y_i − y_j)
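
A minimal sketch of these three quantities (not the lecture's code; a single global σ is used for clarity, whereas the full algorithm tunes a σ_i per point via the perplexity):

    # Pairwise affinities and the t-SNE gradient, directly from the formulas above.
    import numpy as np

    def tsne_quantities(X, Y, sigma=1.0):
        """X: n x d high-dim data, Y: n x 2 low-dim map; return p, q, and dC/dY."""
        dx = np.sum((X[:, None] - X[None]) ** 2, axis=-1)   # pairwise ||x_i - x_j||^2
        p = np.exp(-dx / (2 * sigma ** 2))
        np.fill_diagonal(p, 0.0)
        p /= p.sum()                                        # joint distribution p_ij

        dy = np.sum((Y[:, None] - Y[None]) ** 2, axis=-1)   # pairwise ||y_i - y_j||^2
        num = 1.0 / (1.0 + dy)                              # Student-t kernel
        np.fill_diagonal(num, 0.0)
        q = num / num.sum()                                 # joint distribution q_ij

        grad = 4.0 * np.sum(((p - q) * num)[:, :, None] * (Y[:, None] - Y[None]), axis=1)
        return p, q, grad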

SLIDE 39

Algorithm 1: Simple version of t-Distributed Stochastic Neighbor Embedding.

Data: data set X = {x_1, x_2, ..., x_n}; cost function parameters: perplexity Perp; optimization parameters: number of iterations T, learning rate η, momentum α(t).
Result: low-dimensional data representation Y(T) = {y_1, y_2, ..., y_n}.

begin
  compute pairwise affinities p_{j|i} with perplexity Perp (using Equation 1)
  set p_ij = (p_{j|i} + p_{i|j}) / 2n
  sample initial solution Y(0) = {y_1, y_2, ..., y_n} from N(0, 10⁻⁴ I)
  for t = 1 to T do
    compute low-dimensional affinities q_ij (using Equation 4)
    compute gradient δC/δY (using Equation 5)
    set Y(t) = Y(t−1) + η δC/δY + α(t) (Y(t−1) − Y(t−2))
  end
end
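
A sketch of the main loop, reusing tsne_quantities from the earlier sketch (the values of T, η, and the momentum schedule are illustrative, not the lecture's settings):

    # Gradient descent on the t-SNE cost with momentum, as in Algorithm 1.
    import numpy as np

    n, T, eta = X.shape[0], 1000, 200.0
    Y = np.random.normal(0.0, 1e-2, size=(n, 2))        # Y(0) ~ N(0, 1e-4 I): std 1e-2
    Y_prev = Y.copy()
    for t in range(1, T + 1):
        alpha = 0.5 if t < 250 else 0.8                  # momentum schedule alpha(t)
        _, _, grad = tsne_quantities(X, Y)
        Y_new = Y - eta * grad + alpha * (Y - Y_prev)    # step down the cost C, plus momentum
        Y_prev, Y = Y, Y_new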


SLIDE 41

set Y(t) = Y(t−1) + η δC/δY + α(t) (Y(t−1) − Y(t−2))

SLIDE 42

set Y(t) = Y(t−1) + η δC/δY + α(t) (Y(t−1) − Y(t−2))

Regular gradient descent (the η δC/δY term)

SLIDE 43

set Y(t) = Y(t−1) + η δC/δY + α(t) (Y(t−1) − Y(t−2))

“momentum” (the α(t) (Y(t−1) − Y(t−2)) term)

SLIDE 44

Figure credit: Bisong, 2019

SLIDE 45

What’s magical about t?

[Figure (a): Gradient of SNE, plotted against high-dimensional distance and low-dimensional distance]

Basically, the gradient has nice properties

SLIDE 46

What’s magical about t?

[Figure (a): Gradient of SNE, plotted against high-dimensional distance and low-dimensional distance]

Positive gradient —> “attraction” between points

Basically, the gradient has nice properties

SLIDE 47

What’s magical about t?

[Figures: (a) Gradient of SNE and (c) Gradient of t-SNE, each plotted against high-dimensional distance and low-dimensional distance]

Positive gradient —> “attraction” between points

SLIDE 48

What’s magical about t?

[Figures: (a) Gradient of SNE and (c) Gradient of t-SNE, each plotted against high-dimensional distance and low-dimensional distance]

t-SNE repels points in the low-dim space that are different in the high-dim space

SLIDE 49

What’s magical about t?

[Figures: (a) Gradient of SNE and (c) Gradient of t-SNE, each plotted against high-dimensional distance and low-dimensional distance]

Also strongly attracts points nearby in the high-dim space

SLIDE 50

Let’s see some code
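
The code itself isn't in the deck; one way to run this in practice (a sketch using scikit-learn, which the slide does not specify) is t-SNE on the digits dataset, along the lines of the MNIST plots earlier:

    # t-SNE embedding of handwritten digits, colored by class.
    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt

    X, y = load_digits(return_X_y=True)            # 1797 flattened 8x8 digit images
    Y = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)

    plt.scatter(Y[:, 0], Y[:, 1], c=y, cmap="tab10", s=5)
    plt.title("t-SNE on digits")
    plt.show()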

SLIDE 51

Another perspective: Auto-encoders

SLIDE 52

Figure credit: https://stackabuse.com/autoencoders-for-image-reconstruction-in-python-and-keras/