

slide-1
SLIDE 1

Computational Systems Biology Deep Learning in the Life Sciences 6.802

6.874 20.390 20.490 HST.506

David Gifford Lecture 6 February 25, 2019

The Zen of PCA, t-SNE, and Autoencoders

http://mit6874.github.io


slide-2
SLIDE 2

Today: Gene Expression, PCA, t-SNE, autoencoders

  • Gene expression analysis: The Biology of RNA-seq
  • Supervised (Classification) vs. unsupervised (Clustering)
  • Supervised: Differential expression analysis
  • Unsupervised: Embedding into lower dimensional space
  • Linear reduction of dimensionality

– Principal Component Analysis
– Singular Value Decomposition

  • Non-linear dimensionality reduction: embeddings

– t-distributed Stochastic Neighbor Embedding (t-SNE)
– Building intuition: Playing with t-SNE parameters

  • Deep Learning embeddings

– Autoencoders

slide-3
SLIDE 3
  • 1. The biology: RNA-seq data
slide-4
SLIDE 4

RNA-Seq characterizes RNA molecules

[Figure: a gene in the genome with exons A, B, C is transcribed into pre-mRNA (or ncRNA), spliced (here to the A–C isoform), and the mature mRNA is exported from the nucleus to the cytoplasm.]

High-throughput sequencing of RNAs at various stages of processing

Slide courtesy Cole Trapnell

slide-5
SLIDE 5

RNA-Seq: De novo tx reconstruction / quantification

RNA-Seq technology:

  • Sequence short reads from mRNA, map to genome
  • Variations:
    – Count reads mapping to each known gene
    – Reconstruct the transcriptome de novo in each experiment
  • Advantage: digital measurements (read counts), de novo reconstruction

Microarray technology:

  • Synthesize a DNA probe array, measure by complementary hybridization
  • Variations:
    – One long probe per gene
    – Many short probes per gene
    – Tiled k-mers across the genome
  • Advantage: can focus on small regions, even if few molecules / cell

slide-6
SLIDE 6

Expression Analysis Data Matrix

  • Measure ~20,000 genes in 100s of conditions
  • Study the resulting data matrix: m genes × n experiments (columns: Condition 1, Condition 2, Condition 3, …)
  • Ask experiment-similarity questions (compare columns) and gene-similarity questions (compare rows, i.e., the expression profile of a gene)
  • Each experiment measures the expression of thousands of 'spots', typically genes
slide-7
SLIDE 7

Clustering vs. Classification

  • Goal of Classification (supervised learning): extract features from the data that best assign new elements to one or more well-defined classes
  • Goal of Clustering (unsupervised learning): group similar items that likely come from the same category, and in doing so reveal hidden structure
  • Known classes provide independent validation of the groups that emerge

[Figure: gene × condition heat maps from Alizadeh et al., Nature 2000, highlighting proliferation genes in transformed cell lines, B-cell genes in blood cell lines, lymph-node genes in diffuse large B-cell lymphoma (DLBCL), and chronic lymphocytic leukemia.]
slide-8
SLIDE 8

Clustering vs Classification

  • Objects (genes, proteins, …) characterized by one or more features
  • Classification (supervised learning)
    – Have labels for some points
    – Want a "rule" that will accurately assign labels to new points
    – Sub-problem: feature selection
    – Metric: classification accuracy
  • Clustering (unsupervised learning)
    – No labels
    – Group points into clusters based on how "near" they are to one another
    – Identify structure in data
    – Metric: independent validation features

[Figure: scatter plots of genes/proteins plotted by Feature X (brain expression) vs. Feature Y (liver expression).]

slide-9
SLIDE 9

Today: Gene Expression, PCA, t-SNE, autoencoders

  • Gene expression analysis: The Biology of RNA-seq
  • Supervised (Classification) vs. unsupervised (Clustering)
  • Supervised: Differential expression analysis
  • Unsupervised: Embedding into lower dimensional space
  • Linear reduction of dimensionality

– Principal Component Analysis
– Singular Value Decomposition

  • Non-linear dimensionality reduction: embeddings

– t-distributed Stochastic Neighbor Embedding (t-SNE)
– Building intuition: Playing with t-SNE parameters

  • Deep Learning embeddings

– Autoencoders

slide-10
SLIDE 10
  • 2. Supervised learning:

differential gene expression

slide-11
SLIDE 11
slide-12
SLIDE 12
slide-13
SLIDE 13
slide-14
SLIDE 14
slide-15
SLIDE 15
slide-16
SLIDE 16

What is the right distribution for modeling read counts?

Poisson?

slide-17
SLIDE 17

Orange line – DESeq; dashed orange – edgeR; purple – Poisson

Read count data is overdispersed relative to a Poisson; use a Negative Binomial instead

slide-18
SLIDE 18

A Negative Binomial distribution is better (DESeq)

  • i – gene or isoform; p – condition
  • j – sample (experiment); p(j) – condition of sample j
  • m – number of samples
  • K_ij – number of counts for isoform i in experiment j
  • q_ip – average scaled expression for gene i in condition p
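For intuition (not DESeq itself), a small NumPy/SciPy sketch with hypothetical replicate counts, showing why a Negative Binomial, whose variance is μ + αμ², fits overdispersed counts better than a Poisson, whose variance equals its mean:

```python
# A sketch: compare Poisson vs. Negative Binomial fits to replicate read
# counts for one gene. The counts below are hypothetical toy data.
import numpy as np
from scipy import stats

counts = np.array([212, 187, 340, 255, 410, 150, 298, 230])

mu = counts.mean()
var = counts.var(ddof=1)
print(f"mean={mu:.1f}, variance={var:.1f}")     # variance >> mean => overdispersed

# Method-of-moments Negative Binomial: Var = mu + alpha * mu^2
alpha = max((var - mu) / mu**2, 1e-8)           # dispersion estimate
r = 1.0 / alpha                                 # NB "size" parameter
p = r / (r + mu)                                # NB success probability

ll_pois = stats.poisson(mu).logpmf(counts).sum()
ll_nb = stats.nbinom(r, p).logpmf(counts).sum()
print(f"Poisson log-lik = {ll_pois:.1f}, NegBin log-lik = {ll_nb:.1f}")
```

With overdispersed counts the Negative Binomial log-likelihood is substantially higher; DESeq additionally shares dispersion information across genes rather than estimating it per gene.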
slide-19
SLIDE 19
slide-20
SLIDE 20

Hypergeometric test for gene set overlap significance

N – total # of genes: 1000
n1 – # of genes in set A: 20
n2 – # of genes in set B: 30
k – # of genes in both A and B: 3

P(X = 3) ≈ 0.017, P(X ≥ 3) ≈ 0.020
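This overlap can be checked with SciPy's hypergeometric distribution (note that SciPy's parameter names differ from the slide's):

```python
# Significance of observing k = 3 shared genes between set A (20 genes)
# and set B (30 genes) drawn from N = 1000 total genes.
from scipy.stats import hypergeom

N, n1, n2, k = 1000, 20, 30, 3
rv = hypergeom(M=N, n=n1, N=n2)   # SciPy: M = population, n = successes, N = draws

p_exact = rv.pmf(k)       # P(X = 3)
p_tail = rv.sf(k - 1)     # P(X >= 3), the usual enrichment p-value
print(f"P(X = {k}) = {p_exact:.3f}, P(X >= {k}) = {p_tail:.3f}")
# Reproduces the ~0.017 / ~0.020 values on the slide.
```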

slide-21
SLIDE 21

Bonferroni correction

  • The total number of rejections of the null hypothesis over all N tests is denoted R. Pr(R > 0) ≈ Nα
  • Need to set α' = Pr(R > 0) to the required significance level over all tests. Referred to as the experimentwise error rate.
  • With 100 tests, to achieve an overall experimentwise significance level of α' = 0.05: 0.05 = 100α
  • ⇒ α = 0.0005
  • Pointwise significance level of 0.05%
slide-22
SLIDE 22

Example - Genome wide association screens

  • Risch & Merikangas (1996).
  • 100,000 genes.
  • Observe 10 SNPs in each gene.
  • 1 million tests of null hypothesis of no association.
  • To achieve experimentwise significance level of

5%, require pointwise p-value less than 5 x 10-8

slide-23
SLIDE 23

Bonferroni correction - problems

  • Assumes each test of the null hypothesis to be independent
  • If not true, the Bonferroni correction to the significance level is conservative
  • Loss of power to reject the null hypothesis
  • Example: a genome-wide association screen across linked SNPs – correlation between tests due to LD between loci

slide-24
SLIDE 24

Benjamini Hochberg

  • Select a False Discovery Rate α
  • Number of tests is m
  • Sort the p-values P(1) ≤ P(2) ≤ … ≤ P(m) in ascending order (most significant first)
  • Reject hypotheses 1, …, k*, where k* is the largest k such that P(k) ≤ (k / m) α
  • Assumes tests are uncorrelated or positively correlated
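A compact sketch of the Benjamini–Hochberg step-up procedure in plain NumPy (the p-values below are hypothetical):

```python
# Benjamini-Hochberg: control the FDR at level alpha across m tests.
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    order = np.argsort(pvals)                  # ascending: most significant first
    thresholds = alpha * (np.arange(1, m + 1) / m)
    below = pvals[order] <= thresholds
    if not below.any():
        return np.zeros(m, dtype=bool)         # reject nothing
    k_star = np.max(np.nonzero(below)[0])      # largest k with P(k) <= k*alpha/m
    reject = np.zeros(m, dtype=bool)
    reject[order[:k_star + 1]] = True          # reject all hypotheses up to k*
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.21, 0.76]
print(benjamini_hochberg(pvals, alpha=0.05))
```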
slide-25
SLIDE 25

Today: Gene Expression, PCA, t-SNE, autoencoders

  • Gene expression analysis: The Biology of RNA-seq
  • Supervised (Classification) vs. unsupervised (Clustering)
  • Supervised: Differential expression analysis
  • Unsupervised: Embedding into lower dimensional space
  • Linear reduction of dimensionality

– Principal Component Analysis
– Singular Value Decomposition

  • Non-linear dimensionality reduction: embeddings

– t-distributed Stochastic Neighbor Embedding (t-SNE)
– Building intuition: Playing with t-SNE parameters

  • Deep Learning embeddings

– Autoencoders

slide-26
SLIDE 26
  • 3. Unsupervised learning:

dimensionality reduction

slide-27
SLIDE 27

Dimensionality reduction has multiple applications

  • Uses:
    – Data Visualization
    – Data Reduction
    – Data Classification
    – Trend Analysis
    – Factor Analysis
    – Noise Reduction
  • Example questions:
    – How many unique "sub-sets" are in the sample?
    – How are they similar / different?
    – What are the underlying factors that influence the samples?
    – Which time / temporal trends are (anti)correlated?
    – Which measurements are needed to differentiate?
    – How to best present what is "interesting"?
    – To which "sub-set" does this new sample rightfully belong?

slide-28
SLIDE 28

A manifold is a topological space that locally resembles Euclidean space near each point.
A manifold embedding is a structure-preserving mapping of a high-dimensional space into a manifold.
Manifold learning learns a lower-dimensional space that enables a manifold embedding.

slide-29
SLIDE 29

Today: Gene Expression, PCA, t-SNE, autoencoders

  • Gene expression analysis: The Biology of RNA-seq
  • Supervised (Classification) vs. unsupervised (Clustering)
  • Supervised: Differential expression analysis
  • Unsupervised: Embedding into lower dimensional space
  • Linear reduction of dimensionality

– Principal Component Analysis
– Singular Value Decomposition

  • Non-linear dimensionality reduction: embeddings

– t-distributed Stochastic Neighbor Embedding (t-SNE)
– Building intuition: Playing with t-SNE parameters

  • Deep Learning embeddings

– Autoencoders

slide-30
SLIDE 30
  • 4. Principal Component Analysis
slide-31
SLIDE 31

Example data

  • Example: 53 blood and urine measurements (wet chemistry) from 65 people (33 alcoholics, 32 non-alcoholics)
  • Trivariate plot

Sample rows of the data matrix (hematology measurements):

        H-WBC   H-RBC   H-Hgb    H-Hct    H-MCV     H-MCH    H-MCHC
  A1    8.0000  4.8200  14.1000  41.0000   85.0000  29.0000  34.0000
  A2    7.3000  5.0200  14.7000  43.0000   86.0000  29.0000  34.0000
  A3    4.3000  4.4800  14.1000  41.0000   91.0000  32.0000  35.0000
  A4    7.5000  4.4700  14.9000  45.0000  101.0000  33.0000  33.0000
  A5    7.3000  5.5200  15.4000  46.0000   84.0000  28.0000  33.0000
  A6    6.9000  4.8600  16.0000  47.0000   97.0000  33.0000  34.0000
  A7    7.8000  4.6800  14.7000  43.0000   92.0000  31.0000  34.0000
  A8    8.6000  4.8200  15.8000  42.0000   88.0000  33.0000  37.0000
  A9    5.1000  4.7100  14.0000  43.0000   92.0000  30.0000  32.0000

[Figure: trivariate plot of C-Triglycerides vs. C-LDH vs. M-EPI.]

slide-32
SLIDE 32

This is accomplished by rotating the axes. Suppose we have a population measured on p random variables X1,…,Xp. Note that these random variables represent the p-axes of the Cartesian coordinate system in which the population resides. Our goal is to develop a new set of p axes (linear combinations of the original p axes) in the directions of greatest variability:

[Figure: 2-D scatter of X1 vs. X2 with the first principal component drawn as the axis of greatest variability.]

slide-33
SLIDE 33

Data projected onto PC1

slide-34
SLIDE 34
Selecting Principal Components

  • Given m points in an n-dimensional space, for large n, how does one project onto a 1-dimensional space?
  • Formally, minimize the sum of squares of distances to the line.
  • Why sum of squares? Because it allows fast minimization, assuming the line passes through 0.
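A from-scratch NumPy sketch on random toy data: after centering, the line through 0 that minimizes the sum of squared distances (equivalently, maximizes variance) is the top eigenvector of the covariance matrix:

```python
# PCA on toy data: the first principal component is the direction of
# greatest variance, equivalently the line minimizing squared distances.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # toy correlated data

Xc = X - X.mean(axis=0)                 # center so the line passes through 0
cov = (Xc.T @ Xc) / (len(Xc) - 1)       # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order

order = np.argsort(eigvals)[::-1]       # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pc1 = eigvecs[:, 0]                     # axis of greatest variability
scores = Xc @ eigvecs[:, :2]            # project data onto the first two PCs
print("variance explained:", eigvals[:2] / eigvals.sum())
```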

slide-35
SLIDE 35

Linear Algebra Review

  • Eigenvectors (for a square m×m matrix S): Sv = λv, where λ is an eigenvalue and v a (right) eigenvector
  • How many eigenvalues are there at most?
  • Sv = λv ⇔ (S − λI)v = 0 only has a non-zero solution if |S − λI| = 0
  • This is an m-th order equation in λ which can have at most m distinct solutions (roots of the characteristic polynomial) – they can be complex even though S is real

Example

slide-36
SLIDE 36

Eigenvalues & Eigenvectors

  • For symmetric matrices, eigenvectors for distinct eigenvalues are orthogonal:
    Sv{1,2} = λ{1,2} v{1,2}, and λ1 ≠ λ2 ⇒ v1 · v2 = 0
  • All eigenvalues of a real symmetric matrix are real:
    for complex λ, if |S − λI| = 0 and S = Sᵀ, then λ ∈ ℝ
  • All eigenvalues of a positive semidefinite matrix are non-negative:
    if wᵀSw ≥ 0 for all w ∈ ℝⁿ, then Sv = λv ⇒ λ ≥ 0

slide-37
SLIDE 37
Eigen/diagonal Decomposition

  • Let S be a square m×m matrix with m linearly independent eigenvectors (a "non-defective" matrix)
  • Theorem: there exists an eigen decomposition S = U Λ U⁻¹ (cf. the matrix diagonalization theorem)
  • The columns of U are the eigenvectors of S: U = [v1 v2 v3 … vm]
  • Λ = diag(λ1, λ2, λ3, …, λm) is diagonal; its diagonal elements are the eigenvalues of S
  • The decomposition is unique for distinct eigenvalues

slide-38
SLIDE 38
Symmetric Eigen Decomposition

  • If S is a symmetric matrix:
  • Theorem: there exists a (unique) eigen decomposition S = Q Λ Qᵀ
  • where Q is orthogonal:
    – Q⁻¹ = Qᵀ
    – Columns of Q are normalized eigenvectors
    – Columns are orthogonal
    – (everything is real)
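A quick numerical check of these facts with NumPy, using a random symmetric matrix as a stand-in:

```python
# Verify S = Q Lambda Q^T for a random symmetric matrix.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
S = (A + A.T) / 2                         # make it symmetric

lam, Q = np.linalg.eigh(S)                # eigh: for symmetric/Hermitian matrices
print(np.allclose(Q @ np.diag(lam) @ Q.T, S))   # True: S = Q Lambda Q^T
print(np.allclose(Q.T @ Q, np.eye(4)))          # True: Q is orthogonal (Q^-1 = Q^T)
print(np.isrealobj(lam))                        # True: eigenvalues are real
```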

slide-39
SLIDE 39

Today: Gene Expression, PCA, t-SNE, autoencoders

  • Gene expression analysis: The Biology of RNA-seq
  • Supervised (Classification) vs. unsupervised (Clustering)
  • Supervised: Differential expression analysis
  • Unsupervised: Embedding into lower dimensional space
  • Linear reduction of dimensionality

– Principal Component Analysis
– Singular Value Decomposition

  • Non-linear dimensionality reduction: embeddings

– t-distributed Stochastic Neighbor Embedding (t-SNE)
– Building intuition: Playing with t-SNE parameters

  • Deep Learning embeddings

– Autoencoders

slide-40
SLIDE 40
  • 5. Singular value decomposition

(general m x n matrices)

slide-41
SLIDE 41

Singular Value Decomposition

For an m×n matrix A of rank r there exists a factorization (Singular Value Decomposition = SVD):

A = U Σ Vᵀ, where U is m×m, Σ is m×n, and V is n×n

  • The columns of U are orthogonal eigenvectors of AAᵀ
  • The columns of V are orthogonal eigenvectors of AᵀA
  • The eigenvalues λ1 … λr of AAᵀ are also the eigenvalues of AᵀA
  • Σ = diag(σ1, …, σr), where σi = √λi are the singular values

slide-42
SLIDE 42

Geometric interpretation of SVD

Mx = M(x) = U( Σ( V*(x) ) ): a rotation (V*), then scaling (Σ), then rotation (U); even a shearing map decomposes this way

slide-43
SLIDE 43

Singular Value Decomposition

  • Illustration of SVD dimensions and

sparseness

slide-44
SLIDE 44

Singular Value Decomposition-example

  • Let

        A = [  1  -1 ]
            [  0   1 ]
            [  1   0 ]

Thus m = 3, n = 2. Its SVD, A = U Σ Vᵀ, is

        U = [  2/√6    0     1/√3 ]      Σ = [ √3  0 ]      Vᵀ = [ 1/√2  -1/√2 ]
            [ -1/√6   1/√2   1/√3 ]          [  0  1 ]           [ 1/√2   1/√2 ]
            [  1/√6   1/√2  -1/√3 ]          [  0  0 ]

Typically, the singular values are arranged in decreasing order.
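Checking the example numerically with NumPy (the signs of the singular vectors may differ between implementations, which is expected):

```python
# SVD of the 3x2 example matrix; singular values should be sqrt(3) and 1.
import numpy as np

A = np.array([[1., -1.],
              [0.,  1.],
              [1.,  0.]])

U, s, Vt = np.linalg.svd(A)            # full SVD: U is 3x3, Vt is 2x2
print(s)                               # [1.732..., 1.0]

Sigma = np.zeros_like(A)
Sigma[:2, :2] = np.diag(s)             # embed singular values in a 3x2 matrix
print(np.allclose(U @ Sigma @ Vt, A))  # True: A = U Sigma V^T
```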

slide-45
SLIDE 45
Low-rank Approximation

  • SVD can be used to compute optimal low-rank approximations
  • Approximation problem: find Ak of rank k such that

    Ak = argmin over {X : rank(X) = k} of ‖A − X‖_F

    where Ak and X are both m×n matrices. Typically, we want k ≪ r.
  • ‖·‖_F is the Frobenius norm (the Euclidean norm of the matrix viewed as a vector)

slide-46
SLIDE 46
Low-rank Approximation

  • Solution via SVD: set the smallest r − k singular values to zero:

    Ak = U diag(σ1, …, σk, 0, …, 0) Vᵀ

  • In column notation, Ak is a sum of k rank-1 matrices:

    Ak = Σ_{i=1..k} σi ui viᵀ

  • Error: Ak is the best rank-k approximation (Eckart–Young),

    min over {X : rank(X) = k} of ‖A − X‖ = ‖A − Ak‖

    which equals σ_{k+1} in the spectral norm and √(σ_{k+1}² + … + σ_r²) in the Frobenius norm
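A small NumPy sketch of the truncated-SVD approximation and its error on a random toy matrix:

```python
# Best rank-k approximation of A via truncated SVD, with the Eckart-Young error.
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(8, 6))
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # keep only the top-k singular triplets

spec_err = np.linalg.norm(A - A_k, 2)            # spectral-norm error
frob_err = np.linalg.norm(A - A_k, 'fro')        # Frobenius-norm error
print(np.isclose(spec_err, s[k]))                          # True: sigma_{k+1}
print(np.isclose(frob_err, np.sqrt((s[k:] ** 2).sum())))   # True: sqrt of tail sum
```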
slide-47
SLIDE 47
slide-48
SLIDE 48

PCA of MNIST digits
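A hedged sketch of how such a projection can be made with scikit-learn, using its small 8×8 digits set as a stand-in for MNIST (the slide's actual figure may have been produced differently):

```python
# Project digit images onto the first two principal components and plot them.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()                       # 1797 8x8 images, labels 0-9
X_2d = PCA(n_components=2).fit_transform(digits.data)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=digits.target, cmap='tab10', s=8)
plt.colorbar(label='digit')
plt.xlabel('PC1'); plt.ylabel('PC2')
plt.show()   # classes overlap heavily in two linear dimensions
```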

slide-49
SLIDE 49

Today: Gene Expression, PCA, t-SNE, autoencoders

  • Gene expression analysis: The Biology of RNA-seq
  • Supervised (Classification) vs. unsupervised (Clustering)
  • Supervised: Differential expression analysis
  • Unsupervised: Embedding into lower dimensional space
  • Linear reduction of dimensionality

– Principal Component Analysis
– Singular Value Decomposition

  • Non-linear dimensionality reduction: embeddings

– t-distributed Stochastic Neighbor Embedding (t-SNE)
– Building intuition: Playing with t-SNE parameters

  • Deep Learning embeddings

– Autoencoders

slide-50
SLIDE 50
  • 6. Non-linear embeddings: t-SNE
slide-51
SLIDE 51
slide-52
SLIDE 52

Neighborhood not preserved

slide-53
SLIDE 53

Neighborhood preserved

slide-54
SLIDE 54

Measure pairwise distances in high dimensional space

Shannon entropy of Pi
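In t-SNE the high-dimensional similarities are Gaussian conditionals p_{j|i}, with the bandwidth σi of each point tuned so that the Shannon entropy of Pi matches a user-chosen perplexity (Perp(Pi) = 2^H(Pi)). A minimal NumPy sketch of that calibration for one point (toy data, simple bisection; real implementations search over the precision β = 1/2σ² instead):

```python
# Compute p_{j|i} for one point with sigma chosen so that 2^H(P_i) hits a target.
import numpy as np

def calibrate_point(dist2, target_perplexity=30.0, tol=1e-5):
    """dist2: squared distances from this point to every *other* point."""
    lo, hi = 1e-10, 1e10
    for _ in range(100):                        # bisection on sigma
        sigma = (lo + hi) / 2
        p = np.exp(-dist2 / (2 * sigma ** 2))
        p = p / max(p.sum(), 1e-12)
        H = -(p * np.log2(p + 1e-12)).sum()     # Shannon entropy of P_i
        if 2 ** H > target_perplexity:
            hi = sigma                          # too spread out -> shrink sigma
        else:
            lo = sigma
        if abs(2 ** H - target_perplexity) < tol:
            break
    return p, sigma

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
d2 = ((X[1:] - X[0]) ** 2).sum(axis=1)          # squared distances from point 0
p0, sigma0 = calibrate_point(d2, target_perplexity=30.0)
print(sigma0, -(p0 * np.log2(p0 + 1e-12)).sum())  # entropy ~ log2(30)
```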

slide-55
SLIDE 55

We want to choose an embedding that minimizes divergence between low and high dimension similarities

slide-56
SLIDE 56

Low dimensional embedding using a Student t-distribution to avoid overcrowding

Red – Student t-distribution (1 degree of freedom) Blue - Gaussian

slide-57
SLIDE 57

We can use gradient methods to find an embedding

p_ij = similarity in the original (high-dimensional) space; q_ij = similarity in the new (low-dimensional) embedding. The KL divergence D = Σ p_ij log(p_ij / q_ij) is asymmetric: mapping nearby points far apart is penalized heavily (not okay to separate nearby points), while mapping distant points close together is penalized only weakly.

slide-58
SLIDE 58

Interpretation of SNE (left) and t-SNE (right) gradients

slide-59
SLIDE 59

t-SNE of MNIST digits

[Figure: 2-D t-SNE map of MNIST with a well-separated cluster per digit.]
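A sketch of producing such a map with scikit-learn's TSNE (again using the small 8×8 digits set as a stand-in for MNIST; the perplexity and init values are illustrative choices):

```python
# t-SNE embedding of digit images into 2-D.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()
emb = TSNE(n_components=2, perplexity=30, init='pca',
           random_state=0).fit_transform(digits.data)

plt.scatter(emb[:, 0], emb[:, 1], c=digits.target, cmap='tab10', s=8)
plt.colorbar(label='digit')
plt.show()   # digits form much tighter, better-separated clusters than with PCA
```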

slide-60
SLIDE 60

Today: Gene Expression, PCA, t-SNE, autoencoders

  • Gene expression analysis: The Biology of RNA-seq
  • Supervised (Classification) vs. unsupervised (Clustering)
  • Supervised: Differential expression analysis
  • Unsupervised: Embedding into lower dimensional space
  • Linear reduction of dimensionality

– Principal Component Analysis
– Singular Value Decomposition

  • Non-linear dimensionality reduction: embeddings

– t-distributed Stochastic Neighbor Embedding (t-SNE)
– Building intuition: Playing with t-SNE parameters

  • Deep Learning embeddings

– Autoencoders

slide-61
SLIDE 61
  • 7. Playing with t-SNE parameters
slide-62
SLIDE 62

Perplexity matters

https://distill.pub/2016/misread-tsne/ – recommended perplexity range (roughly 5–50) from van der Maaten and Hinton

slide-63
SLIDE 63

Number of steps matter

“pinched”: Not enough steps Too tight Spread again Tight again https://distill.pub/2016/misread-tsne/

slide-64
SLIDE 64

Cluster sizes are not meaningful

Original data: two Gaussians with widely different (10-fold) dispersion. t-SNE loses that notion of distance; by design, it adapts to regional variations in distance. https://distill.pub/2016/misread-tsne/

slide-65
SLIDE 65

Between-cluster distance is not always preserved

[Figure panels: clusters that are equidistant in the original space; the equidistance is captured only at some perplexities.] https://distill.pub/2016/misread-tsne/

slide-66
SLIDE 66

False clusters may appear

https://distill.pub/2016/misread-tsne/

slide-67
SLIDE 67

Relationships are not always preserved

https://distill.pub/2016/misread-tsne/

slide-68
SLIDE 68

Different runs may produce similar results… (but not at very low perplexity)

https://distill.pub/2016/misread-tsne/

slide-69
SLIDE 69

t-SNE of equidistant points

(learning rate) (#neighbors)

slide-70
SLIDE 70

t-SNE of square grid

(learning rate) (#neighbors)

slide-71
SLIDE 71

t-SNE of 3D Knot

(learning rate) (#neighbors)

slide-72
SLIDE 72

Today: Gene Expression, PCA, t-SNE, autoencoders

  • Gene expression analysis: The Biology of RNA-seq
  • Supervised (Classification) vs. unsupervised (Clustering)
  • Supervised: Differential expression analysis
  • Unsupervised: Embedding into lower dimensional space
  • Linear reduction of dimensionality

– Principal Component Analysis
– Singular Value Decomposition

  • Non-linear dimensionality reduction: embeddings

– t-distributed Stochastic Neighbor Embedding (t-SNE)
– Building intuition: Playing with t-SNE parameters

  • Deep Learning embeddings

– Autoencoders

slide-73
SLIDE 73
  • 8. Embedding with Deep learning:

Auto-encoders

slide-74
SLIDE 74

Autoencoder: dimensionality reduction with neural net

  • Tricking a supervised learning algorithm to work in an unsupervised fashion
  • Feed the input as the output function to be learned. But! Constrain model complexity
  • Pretraining with RBMs to learn representations for future supervised tasks. Use RBM output as "data" for training the next layer in the stack
  • After pretraining, "unroll" the RBMs to create a deep autoencoder
  • Fine-tune using backpropagation

[Hinton et al., 2006]
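As a concrete, hypothetical illustration, a minimal fully connected autoencoder in PyTorch; the 784-dimensional input (e.g., a flattened MNIST image) and the layer sizes are illustrative, and it trains end-to-end with backpropagation rather than the RBM pretraining described above:

```python
# A small fully-connected autoencoder: 784 -> 2-D latent code -> 784.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, d_in=784, d_latent=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(d_in, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, d_latent),             # bottleneck = embedding
        )
        self.decoder = nn.Sequential(
            nn.Linear(d_latent, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
            nn.Linear(256, d_in), nn.Sigmoid(),  # reconstruct pixel intensities
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(128, 784)                 # stand-in batch; replace with real images
for step in range(100):                  # training loop sketch
    recon, z = model(x)
    loss = loss_fn(recon, x)             # reconstruction error: output ~ input
    opt.zero_grad(); loss.backward(); opt.step()
```

For a denoising autoencoder, feed a corrupted copy of x to the model but compute the reconstruction loss against the clean x.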

slide-75
SLIDE 75

Autoencoders learn a latent representation of data

slide-76
SLIDE 76

Denoising autoencoders recover signal corrupted by noise

slide-77
SLIDE 77

We can learn manifolds with autoencoders

slide-78
SLIDE 78

http://elf-project.sourceforge.net/autoencoder.html

Auto-encoder learning of MNIST digit data

slide-79
SLIDE 79

Today: Gene Expression, PCA, t-SNE, autoencoders

  • Gene expression analysis: The Biology of RNA-seq
  • Supervised (Classification) vs. unsupervised (Clustering)
  • Supervised: Differential expression analysis
  • Unsupervised: Embedding into lower dimensional space
  • Linear reduction of dimensionality

– Principal Component Analysis
– Singular Value Decomposition

  • Non-linear dimensionality reduction: embeddings

– t-distributed Stochastic Neighbor Embedding (t-SNE)

  • Deep Learning embeddings

– Autoencoders

slide-80
SLIDE 80

FIN - Thank You

slide-81
SLIDE 81

Interesting on-line demos:

http://dpkingma.com/sgvb_mnist_demo/demo_old.html
http://elf-project.sourceforge.net/autoencoder.html
http://vdumoulin.github.io/morphing_faces/online_demo.html