Mining and analysis of genomic and epigenomic data (TCGA) using R


SLIDE 1

Mining and analysis of genomic and epigenomic data (TCGA) using R

Catharina Olsen & Antonio Colaprico
Academic supervisor: Gianluca Bontempi
Machine Learning Group (MLG)
Interuniversity Institute of Bioinformatics in Brussels (IB)²

December 6th & 7th 2016

SLIDE 2

Workshop overview

◮ day 1
  ◮ introduction to R
  ◮ data in biomedical research: NGS, TCGA, downloading and normalization
  ◮ Analyses
    ◮ Differential expression analysis
    ◮ Enrichment analysis
    ◮ Clustering, dendrograms & heatmaps
    ◮ Survival analysis
◮ day 2
  ◮ integrative analysis
  ◮ Command line vs. graphical user interface (introduction to TCGAbiolinksGUI)

SLIDE 3

Analyses

◮ methods for monitoring genome-wide mRNA expression, such as microarrays or RNAseq
◮ allow observing expression levels of the entire genome under many different induced conditions
◮ knowing when and under what conditions a gene or a set of genes is expressed often provides strong clues as to their biological role and function
◮ possible strategies to determine the function of unknown genes
  ◮ clustering algorithms: group together genes with similar expression profiles
  ◮ supervised learning methods: predictive precision can be quantified

SLIDE 4

Patient classification (breast cancer)

◮ Breast cancer is one of the most common malignant tumors affecting women.
◮ patients with the same disease stage can have different treatment responses and overall outcomes
◮ cancer classification has been based primarily on morphological appearance
◮ the strongest predictors for metastasis fail to accurately classify breast tumors according to their clinical behavior
◮ classification has relied on specific biological insights, rather than on systematic and unbiased approaches for recognizing tumor subtypes

SLIDE 5

Breast cancer classification (II)

◮ Chemotherapy or hormonal therapy reduces the risk of distant metastasis by approximately one-third; however, 70-80% of patients receiving this treatment would have survived without it. Also, these therapies frequently have toxic side effects.
◮ Diagnosis of cancer must be accurate in order for the patient to receive the correct treatment and so have the best chance of survival.
◮ The cellular and molecular heterogeneity of breast tumors and the large number of genes potentially involved in controlling cell growth, death and differentiation emphasize the importance of studying multiple genetic alterations.
◮ The development of -omics technology provides the opportunity of correlating genome-wide expression with

SLIDE 6

Breast cancer classification (III)

◮ Systematic investigation of expression patterns of thousands of genes in tumors using DNA microarrays, and their correlation to specific features of phenotypic variation, might provide the basis for an improved taxonomy of cancer.
◮ It is expected that variations in gene expression patterns in different tumors could provide a “molecular portrait” of each tumor, and that the tumors could be classified into subtypes based solely on the differences in expression patterns.

SLIDE 7

Take-home message

A very large number of problems in life science may be formalized as supervised learning problems characterized by

1. the intuition of the existence of a dependence between some input (e.g. genotype) and output (e.g. phenotype);
2. an outcome measurement, also called output, usually quantitative (like gene expression) or categorical (like metastasis or not);
3. a set of features or inputs, also quantitative or categorical, that we wish to use to predict the output;
4. the availability of a finite set of input/output observations.

SLIDE 8

Methods of feature selection

Three main approaches

◮ Filter methods
  ◮ preprocessing methods
  ◮ assess the merits of features from the data alone
  ◮ ignore the effects of the selected feature subset on the performance of the learning algorithm
  ◮ examples: ranking using compression techniques (PCA or clustering), correlation with the output, etc.
◮ Wrapper methods
  ◮ assess subsets of variables according to their usefulness to a given predictor
  ◮ conduct a search for a good subset using the learning algorithm itself as part of the evaluation function
  ◮ boil down to a stochastic state-space search
◮ Embedded methods
  ◮ variable selection as part of the learning procedure
  ◮ usually specific to given learning machines
  ◮ examples: classification trees, random forests, and methods based on regularization techniques (e.g. lasso; see the sketch below)
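Since the lasso is only named here, a minimal embedded-selection sketch may help. It uses the glmnet package on simulated data, so everything below (data, variable names, parameter choices) is illustrative rather than part of the original slides:

library(glmnet)                               # lasso / elastic-net implementation
set.seed(1)
X <- matrix(rnorm(100 * 50), nrow = 100)      # 100 samples, 50 features
y <- rbinom(100, 1, plogis(X[, 1] - X[, 2]))  # outcome driven by 2 features
fit <- cv.glmnet(X, y, family = "binomial", alpha = 1)  # alpha = 1 -> lasso
# features with non-zero coefficients are "selected" as part of the fit itself
which(as.vector(coef(fit, s = "lambda.min"))[-1] != 0)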

SLIDE 9

Pro-con analysis

Filter methods

◮ Pros:
  ◮ easily scale to very high-dimensional datasets
  ◮ computationally simple and fast
  ◮ independent of the classification algorithm
  ◮ feature selection needs to be performed only once; different classifiers can then be evaluated
◮ Cons:
  ◮ they ignore the interaction with the classifier
  ◮ they are often univariate or low-variate, i.e. each feature is considered separately, thereby ignoring feature dependencies

SLIDE 10

Pro-con analysis

Wrapper methods

◮ Pros:
  ◮ interaction between feature subset search and model selection
  ◮ ability to take feature dependencies into account
◮ Cons:
  ◮ higher risk of overfitting than filter techniques
  ◮ very computationally intensive, since the classifier must be rebuilt for every candidate feature subset

SLIDE 11

Pro-con analysis

Embedded methods

◮ Pros: less computationally intensive than wrapper methods
◮ Cons: specific to a given learning machine

SLIDE 12

Unsupervised learning approaches

◮ some filter techniques are based on unsupervised learning
  ◮ there are inputs but no supervising output
  ◮ goal: learn relationships and structure from such data (e.g. groups or clusters)
◮ difficulties
  ◮ choosing a notion of distance in a multivariate space that is appropriate to the problem
  ◮ determining the number of clusters
  ◮ assessing the accuracy of a clustering task (no universally accepted mechanism for performing cross-validation or validating results on an independent data set)
◮ Unsupervised learning is often performed as part of an exploratory data analysis or preliminary data compression.

SLIDE 13

Differential expression analysis

Goal of this session

◮ available data: expression (microarray, RNAseq)
  → a lot of genes, few samples
◮ possible questions include
  ◮ which genes are relevant, i.e. differentially expressed when comparing disease and control samples?
  ◮ what are the co-expression signatures for those genes?
  ◮ what are their functional categories?

SLIDE 14

Goal

Differentially expressed genes & functional enrichment analysis

SLIDE 15

The data

◮ variable names in the 1st column → row.names = 1
◮ sample names in the 1st row → header = TRUE

SLIDE 16

The data

◮ read data

mydata <- read.csv('geneExpressionV2.csv', row.names = 1, header = TRUE)

◮ show dimensions

dim(mydata)
## [1] 2400   64

◮ show sample names

colnames(mydata)
##  [1] "Exp.t1.1" "Exp.t1.2" "Exp.t1.3" "Exp.t1.4" "Exp.t2.1" "Exp.t2.2"
##  [7] "Exp.t2.3" "Exp.t2.4" "Exp.t3.1" "Exp.t3.2" "Exp.t3.3" "Exp.t3.4"
## [13] "Exp.t4.1" "Exp.t4.2" "Exp.t4.3" "Exp.t4.4" "Exp.t5.1" "Exp.t5.2"
## [19] "Exp.t5.3" "Exp.t5.4" "Exp.t6.1" "Exp.t6.2" "Exp.t6.3" "Exp.t6.4"
## [25] "Exp.t7.1" "Exp.t7.2" "Exp.t7.3" "Exp.t7.4" "Exp.t8.1" "Exp.t8.2"
## [31] "Exp.t8.3" "Exp.t8.4" "Ctl.t1.1" "Ctl.t1.2" "Ctl.t1.3" "Ctl.t1.4"
## [37] "Ctl.t2.1" "Ctl.t2.2" "Ctl.t2.3" "Ctl.t2.4" "Ctl.t3.1" "Ctl.t3.2"
## [43] "Ctl.t3.3" "Ctl.t3.4" "Ctl.t4.1" "Ctl.t4.2" "Ctl.t4.3" "Ctl.t4.4"
## [49] "Ctl.t5.1" "Ctl.t5.2" "Ctl.t5.3" "Ctl.t5.4" "Ctl.t6.1" "Ctl.t6.2"
## [55] "Ctl.t6.3" "Ctl.t6.4" "Ctl.t7.1" "Ctl.t7.2" "Ctl.t7.3" "Ctl.t7.4"
## [61] "Ctl.t8.1" "Ctl.t8.2" "Ctl.t8.3" "Ctl.t8.4"

→ columns 1 to 32 are the disease samples, columns 33 to 64 the control samples

SLIDE 17

Differential expression analysis

◮ test how likely it is to make the same observation by chance
◮ compute the significance of differential expression for each of the genes in the data
◮ How?
  ◮ statistical test
  ◮ multiple testing correction

SLIDE 18

Statistical tests

t-test

◮ 2 sets of data with normal distributions
◮ underlying variances are equal
◮ test the null hypothesis that the means of the two groups are equal (H0)
◮ define the null hypothesis
  H0: mean of control group = mean of treatment group

SLIDE 19

Statistical tests

t-test

◮ observations: $X_1, \dots, X_M$ and $Y_1, \dots, Y_N$
◮ sample means: $\bar{X} = \frac{1}{M}\sum_i X_i$, $\bar{Y} = \frac{1}{N}\sum_i Y_i$
◮ sample variance ($= \mathrm{sd}^2$): $s_X^2 = \frac{1}{M-1}\sum_i (X_i - \bar{X})^2$
◮ test statistic

  $$T = \frac{\bar{Y} - \bar{X}}{s_p \cdot \sqrt{\frac{1}{M} + \frac{1}{N}}}, \qquad s_p = \sqrt{\frac{(M-1)s_X^2 + (N-1)s_Y^2}{M+N-2}}$$

◮ compare with the tabulated value for $N + M - 2$ degrees of freedom
◮ if the calculated value > the tabulated value
  → reject the null hypothesis: the difference is unlikely to be due to chance (a numerical check follows below)
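As a sanity check, the statistic can be computed by hand and compared with R's t.test on simulated data (an illustrative sketch, not from the slides; t.test defaults to the Welch variant, so var.equal = TRUE is needed to match the pooled formula above):

set.seed(7)
x <- rnorm(10)                 # M = 10 observations
y <- rnorm(12, mean = 1)       # N = 12 observations
M <- length(x); N <- length(y)
sp <- sqrt(((M - 1) * var(x) + (N - 1) * var(y)) / (M + N - 2))  # pooled sd
(mean(y) - mean(x)) / (sp * sqrt(1 / M + 1 / N))                 # T by hand
t.test(y, x, var.equal = TRUE)$statistic                         # same value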

SLIDE 20

Exercise

T-test

◮ use the R function t.test to test, for each gene, whether the mean of the disease samples and that of the normal samples are significantly different from each other
◮ implement this using
  ◮ a for loop
  ◮ apply

SLIDE 21

Statistical test – R code

◮ using a for loop

pvals <- rep(NA, nrow(mydata))
names(pvals) <- rownames(mydata)
for(i in rownames(mydata)){
  pvals[i] <- t.test(mydata[i, 1:32], mydata[i, 33:64])$p.value
}
print(pvals[1:3])
##     VNG0174G     VNG0602C     VNG0743H
## 5.544254e-05 5.596029e-04 2.352681e-04

◮ using apply

pvals <- apply(mydata, 1,
               function(x){ t.test(x[1:32], x[33:64])$p.value })
print(pvals[1:3])
##     VNG0174G     VNG0602C     VNG0743H
## 5.544254e-05 5.596029e-04 2.352681e-04

SLIDE 22

Multiple testing correction

             H0 is true     H1 is true
accept H0    correct        type II error
reject H0    type I error   correct

◮ Family Wise Error Rate (FWER), e.g. Bonferroni
  ◮ reduces the probability of even one false discovery
  ◮ at the cost of an increased rate of type II errors
  → very conservative, only few genes identified as significant
  → used when we need to be certain that identified genes are truly positive
◮ False Discovery Rate (FDR), e.g. Benjamini-Hochberg (a by-hand sketch follows below)
  ◮ controls the proportion of type I errors (small number of errors in comparison to the number of rejected hypotheses)
  → less stringent
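To make the Benjamini-Hochberg procedure concrete, here is a by-hand sketch (illustrative, not from the slides): sort the p-values, scale the i-th smallest by m/i, and enforce monotonicity from the largest down. It should agree with p.adjust(p, method = "BH"):

bh_by_hand <- function(p){
  m <- length(p)
  o <- order(p)                                   # indices that sort p
  adj <- rev(cummin(rev(p[o] * m / seq_len(m))))  # monotone from the right
  pmin(adj, 1)[order(o)]                          # cap at 1, undo the sorting
}
# check: all.equal(bh_by_hand(p), unname(p.adjust(p, method = "BH")))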

SLIDE 23

Multiple testing correction – R code

◮ computing the adjusted p-values

pvals.bonferroni <- p.adjust(pvals, method = "bonferroni")
pvals.bh <- p.adjust(pvals, method = "BH")

◮ how many genes are considered significant?

print(paste('Uncorrected =', sum(pvals <= 0.05)))
## [1] "Uncorrected = 285"
print(paste('Bonferroni =', sum(pvals.bonferroni <= 0.05)))
## [1] "Bonferroni = 49"
print(paste('Benjamini-Hochberg =', sum(pvals.bh <= 0.05)))
## [1] "Benjamini-Hochberg = 94"

SLIDE 24

Looking at the expression

Benjamini-Hochberg corrected

matplot(t(mydata[pvals.bh < 0.05, c(1:32)]), type = "l")

[line plot of the expression profiles of the significant genes across the 32 disease samples]

→ there are some patterns

SLIDE 25

Looking at the expression

Knowing which genes have similar expression profiles, we can clearly see the patterns:

[four expression-profile panels: cluster no 1 with 23 genes, cluster no 2 with 22 genes, cluster no 3 with 25 genes, cluster no 4 with 11 genes]

SLIDE 26

Looking at the expression

◮ 81 genes spread over these 4 clusters
◮ 13 genes have expression that is not behaving as smoothly

[expression-profile panel: cluster no 5 with 13 genes]

How can we determine which genes have similar expression profiles?

SLIDE 27

Clustering

◮ determine groups of genes or observations with similar patterns (e.g. patterns of gene expression)
◮ require the definition of a distance function between variables and the definition of a distance between clusters
◮ Nearest neighbor clustering: the number of clusters is decided first, then each variable is assigned to a cluster. Example: K-means (see the sketch below)
◮ Agglomerative clustering: bottom-up methods where each variable starts in its own cluster and clusters are successively merged. Example: hierarchical clustering
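A minimal K-means sketch (illustrative, not from the slides), reusing the matrix of significant genes that is constructed on the following slides; centers = 5 mirrors the five clusters used later:

mydata.bh <- mydata[pvals.bh <= 0.05, ]   # the 94 significant genes
set.seed(42)                              # K-means starts from random centers
km <- kmeans(mydata.bh, centers = 5)      # number of clusters fixed up front
table(km$cluster)                         # resulting cluster sizes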

SLIDE 28

Dendrogram

SLIDE 29

Identify the clusters

◮ compute the dissimilarity matrix (default: Euclidean distance)

mydata.bh <- mydata[pvals.bh <= 0.05, ]
dist(mydata.bh)

◮ cluster based on the distances

clust.bh <- hclust(dist(mydata.bh))

SLIDE 30

Visualizing the data

Dendrogram

plot(clust.bh, cex = 0.5)

[cluster dendrogram of the 94 significant genes; hclust(*, "complete") on dist(mydata.bh), heights roughly 10-40]

SLIDE 31

Visualizing the data

Heatmap

heatmap(as.matrix(mydata.bh[, 1:32]), Colv = NA)

[heatmap of the 94 significant genes (rows) across the 32 disease samples (columns), with a row dendrogram]

SLIDE 32

Visualizing the data

More typical color scheme

heatmap(as.matrix(mydata.bh[, 1:32]), Colv = NA,
        col = colorRampPalette(c('green', 'black', 'red'))(256))

[the same heatmap with a green-black-red color scale]

SLIDE 33

Get information about clusters

cutree function

What does the number of clusters depend on?

◮ specify the number of clusters (parameter k)

cuttree.bh <- cutree(clust.bh, k = 5)

◮ or specify the height at which to cut the tree (parameter h)

cuttree.bh.height <- cutree(clust.bh, h = 15)

SLIDE 34

Visualizing the data

Adding cluster information

heatmap(as.matrix(mydata.bh[, 1:32]), Colv = NA,
        col = colorRampPalette(c('green', 'black', 'red'))(256),
        RowSideColors = as.character(cuttree.bh))

[the same heatmap with a colored side bar indicating each gene's cluster]

SLIDE 35

Enrichment of Gene Ontology (GO) terms

◮ GO describes gene products across databases, in a species-independent manner, in terms of their
  ◮ associated biological processes,
  ◮ cellular components, and
  ◮ molecular functions
◮ Goal: retrieve a functional profile of a given gene set

SLIDE 36

Enrichment of Gene Ontology (GO) terms

◮ load data

mygo <- read.csv('halo_GO_annotations.csv', header = TRUE)

◮ some annotations

print(unique(mygo$annotation[1:10]))
## [1] ATP binding
## [2] ATPase activity
## [3] membrane
## [4] tRNA aminoacylation for protein translation
## [5] protein glycosylation
## [6] cytoplasm
## [7] dolichyl-diphosphooligosaccharide-protein glycotransferase activity
## [8] aminoacyl-tRNA ligase activity
## 939 Levels: 'de novo' IMP biosynthetic process ...

SLIDE 37

Enrichment of Gene Ontology (GO) terms

◮ GO terms for annotated genes

myannot <- list()
for(i in unique(mygo$annotation)){
  myannot[[i]] <- mygo$geneId[mygo$annotation == i]
}

◮ the first list element

print(myannot[[unique(mygo$annotation)[1]]])
## [1] VNG0166G VNG0880G
## 1544 Levels: VNG0002G VNG0003C VNG0005H VNG0006G VNG0008G ... VNG7175

◮ interested in the keyword copper

names(myannot)[grep('copper', names(myannot))]
## [1] "copper ion binding"  "copper-exporting ATPase activity"

SLIDE 38

Prepare list of genes by cluster

◮ create a list in which each element i contains the genes belonging to cluster i

myclusts <- list(Clust1 = names(cuttree.bh)[cuttree.bh == 1],
                 Clust2 = names(cuttree.bh)[cuttree.bh == 2],
                 Clust3 = names(cuttree.bh)[cuttree.bh == 3],
                 Clust4 = names(cuttree.bh)[cuttree.bh == 4])

SLIDE 39

Statistics to test for enrichment

◮ What is the chance of observing an enrichment at least this extreme purely by chance?
◮ Tests: Fisher’s exact, hypergeometric
◮ in R

phyper(q, m, n, k, lower.tail = FALSE)

◮ q = overlap between the co-expression cluster genes and the genes from a functional category
◮ m = number of genes in a functional category
◮ n = total number of genes − m
◮ k = number of genes in the co-expression cluster
◮ lower.tail = if TRUE (default), probabilities are P(X ≤ x), otherwise P(X > x) (see the toy example below)
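A toy worked example with invented numbers: 2400 genes in total, a functional category of 40 genes, a co-expression cluster of 25 genes, and an observed overlap of 8. Because lower.tail = FALSE yields P(X > q), passing q − 1 makes the observed overlap itself count as "at least this extreme":

# P(overlap >= 8) under the hypergeometric null
phyper(8 - 1, 40, 2400 - 40, 25, lower.tail = FALSE)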

SLIDE 40

Statistics to test for enrichment - R code

◮ q = overlap between the co-expression cluster genes and the genes from a functional category

length(intersect(myclusts[[clust]], myannot[[set.i]]))

◮ m = number of genes in a functional category

length(myannot[[set.i]])

◮ n = total number of genes − m

2400 - length(myannot[[set.i]])

◮ k = number of genes in the co-expression cluster

length(myclusts[[clust]])

SLIDE 41

Statistics to test for enrichment

hg.pValues <- matrix(ncol = length(myclusts), nrow = length(myannot),
                     dimnames = list(names(myannot), names(myclusts)))
for(clust in names(myclusts)){
  for(set.i in names(myannot)){
    hg.pValues[set.i, clust] <- phyper(
      length(intersect(myclusts[[clust]], myannot[[set.i]])),
      length(myannot[[set.i]]),
      2400 - length(myannot[[set.i]]),
      length(myclusts[[clust]]),
      lower.tail = FALSE)
  }
  # adjust for the total number of tests performed (categories x clusters)
  hg.pValues[, clust] <- p.adjust(hg.pValues[, clust], method = 'bonferroni',
                                  n = length(myannot) * length(myclusts))
}

SLIDE 42

Enriched Terms for Cluster 1

hg.pValues[hg.pValues[, 'Clust1'] <= 0.05, 'Clust1']
##                          amino acid transport
##                                   0.002890711
##                            response to stress
##                                   0.000000000
## amino acid transmembrane transporter activity
##                                   0.002890711
##        methylaspartate ammonia-lyase activity
##                                   0.000000000
##                 cellular iron ion homeostasis
##                                   0.000000000
##                           ferric iron binding
##                                   0.000000000

SLIDE 43

Principal component analysis

◮ one of the most popular methods for linear dimensionality reduction
◮ projects the data from the original space into a lower dimensional space in an unsupervised manner
◮ each of the original dimensions is an axis (other axes can be created as linear combinations of the original axes)
◮ PCA creates a completely new set of orthogonal axes (principal components) along which the original data are highly variable
◮ a large set of correlated variables can be summarized by a smaller number of representative principal components that collectively explain most of the variability in the original set

SLIDE 44

Principal component analysis

◮ The first principal component is the axis through the data along which there is the greatest variation amongst the observations. This corresponds to finding the vector $a = [a_1, \dots, a_n] \in \mathbb{R}^n$ with $\sum_{i=1}^{n} a_i^2 = 1$ such that the variable $z = a_1 x_{\cdot 1} + \dots + a_n x_{\cdot n} = a^T x$ has the largest variance.
◮ It can be shown that the optimal $a$ is the eigenvector of $\mathrm{VAR}(x)$ corresponding to the largest eigenvalue (see the sketch below).
◮ The second principal component is the axis orthogonal to the first that has the greatest variation in the data associated with it; the third p.c. is the axis with the greatest variation along it that is orthogonal to both the first and the second axes; and so forth.
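A quick numerical check of the eigenvector claim on simulated data (illustrative, not from the slides):

set.seed(1)
X <- matrix(rnorm(200 * 5), ncol = 5)         # 200 observations, 5 variables
a.eig <- eigen(cov(scale(X)))$vectors[, 1]    # top eigenvector of VAR(x)
a.pca <- unname(prcomp(X, scale. = TRUE)$rotation[, 1])  # 1st PC loading
all.equal(abs(a.eig), abs(a.pca))             # TRUE: equal up to sign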

SLIDE 45

PCA example

SLIDE 46

PCA: the algorithm

Consider the training input matrix X of size [N, n], where N is typically much smaller than n. PCA consists of the following steps.

◮ The matrix is normalized and transformed into a matrix $\tilde{X}$ such that each column $\tilde{X}[, i]$, $i = 1, \dots, n$, has mean 0 and variance 1.
◮ The Singular Value Decomposition (SVD) of $\tilde{X}$ is computed: $\tilde{X} = U D V^T$, where U is an orthogonal [N, N] matrix, D is an [N, n] rectangular diagonal matrix with diagonal elements $d_1 \ge d_2 \ge \dots \ge d_n$, and V is an orthogonal [n, n] matrix. Note that the columns of V are the eigenvectors of the matrix $\tilde{X}^T \tilde{X}$.

SLIDE 47

PCA: the algorithm

◮ The matrix $\tilde{X}$ can be transformed into a new set of coordinates $Z = \tilde{X} V$, where Z is an [N, n] matrix in which each column is a linear combination of the original features, of diminishing importance.
◮ The first h < n columns of Z (aka eigen-genes) may be chosen to represent the input dataset in a lower dimensional space, as in the sketch below.
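A minimal sketch of these two steps in R (the matrix X is simulated for illustration; note that R's svd returns only the first min(N, n) right singular vectors):

set.seed(2)
X  <- matrix(rnorm(20 * 100), nrow = 20)  # N = 20 observations, n = 100 features
Xt <- scale(X)                            # each column: mean 0, variance 1
s  <- svd(Xt)                             # Xt = U D V^T
Z  <- Xt %*% s$v                          # new coordinates Z = Xt V (= U D)
h  <- 2
Z[, 1:h]                                  # first h columns: low-dim representation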

SLIDE 48

PCA for visualization

◮ It is not easy to visualize n-dimensional data. We would need n(n − 1)/2 scatterplots, and most likely none of them would be informative since each contains just a small fraction of the total information present in the data set.
◮ PCA aims to provide a low-dimensional representation of the data that captures as much of the information as possible.
◮ The idea is that each of the N observations lives in n-dimensional space, but not all of these dimensions are equally interesting.
◮ PCA seeks a small number of dimensions that are as interesting as possible, where "interesting" is measured by the amount that the observations vary along each dimension.

SLIDE 49

PCA for visualization

◮ Another interpretation of the PCs is that they provide low-dimensional linear surfaces that are closest to the observations. For instance, the first PC is the line in n-dimensional space that is closest to the N observations (according to Euclidean distance).
◮ The first two PCs span the plane that is closest to the N observations, in terms of Euclidean distance.
◮ This implies that the first h principal components provide the best h-dimensional approximation (in terms of Euclidean distance) to the observations x.

SLIDE 50

Determining different cancer types

◮ gene expression data on different cancer cell lines
◮ use unsupervised methods (PCA or clustering) to analyze the data
◮ Goal: do the results correspond to the cancer type labels?

SLIDE 51

Data

◮ load data

library(ISLR)
nci.labs <- NCI60$labs
nci.data <- NCI60$data
dim(nci.data)
## [1]   64 6830

◮ cancer types

table(nci.labs)
## nci.labs
##      BREAST         CNS       COLON K562A-repro K562B-repro    LEUKEMIA
##           7           5           7           1           1           6
## MCF7A-repro MCF7D-repro    MELANOMA       NSCLC     OVARIAN    PROSTATE
##           1           1           8           9           6           2
##       RENAL     UNKNOWN
##           9           1

SLIDE 52

Perform & visualize PCA

◮ PCA on scaled data (so that each gene has standard deviation one)

pr.out <- prcomp(nci.data, scale = TRUE)

◮ assign a specific color to each cell line, by cancer type

Cols <- function(vec){
  cols <- rainbow(length(unique(vec)))
  return(cols[as.numeric(as.factor(vec))])
}

SLIDE 53

Perform & visualize PCA

◮ plot the first few principal component score vectors (colors correspond to the cancer types of the cell lines)

par(mfrow = c(1, 2))
plot(pr.out$x[, 1:2], col = Cols(nci.labs), pch = 19, xlab = "Z1", ylab = "Z2")
legend("bottomright", bty = "n", col = unique(Cols(nci.labs)),
       legend = unique(nci.labs), cex = 0.5)  # cex value truncated in the source; 0.5 assumed
plot(pr.out$x[, c(1, 3)], col = Cols(nci.labs), pch = 19, xlab = "Z1", ylab = "Z3")

[scatterplots of Z2 vs. Z1 and Z3 vs. Z1, points colored by cancer type, with a legend listing the 14 cell line labels]

→ indicates that cell lines from the same cancer type tend to have pretty similar gene expression levels

SLIDE 54

Goal

Predict survival

SLIDE 55

Background

◮ Kaplan-Meier estimate
  ◮ measures the fraction of patients living for a certain amount of time after treatment
  ◮ the estimate is a step function with jumps at the observed event times $t_i$:
    $\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_i}{Y_i}\right)$ if $t_1 \le t$, else $\hat{S}(t) = 1$
  ◮ $d_i$ is the number of individuals with an observed event at time $t_i$; $Y_i$ is the number of individuals at risk at time $t_i$ (i.e. individuals who die at time $t_i$ or later); a hand-rolled version is sketched below
◮ Cox proportional hazards model
  ◮ fits a survival model with covariates z to a hazard function of the form $h(t \mid z) = h_0(t)\exp\{\beta' z\}$, where $\beta$ is an unknown vector and $h_0(t)$ is the baseline hazard → estimate $\beta$
  ◮ use the coxph function to fit a Cox PH model to the data
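A hand-rolled version of the Kaplan-Meier estimator (an illustrative sketch, not from the slides), to make the formula concrete; it should agree step for step with survfit():

km_by_hand <- function(time, event){      # event = 1 if death observed
  ti <- sort(unique(time[event == 1]))                       # event times t_i
  di <- sapply(ti, function(t) sum(time == t & event == 1))  # deaths d_i
  Yi <- sapply(ti, function(t) sum(time >= t))               # at risk Y_i
  data.frame(time = ti, surv = cumprod(1 - di / Yi))         # S-hat at t_i
}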

SLIDE 56

R package survival

◮ survival object: Surv()
◮ fitting the Kaplan-Meier estimate: survfit()
◮ comparing different groups (e.g. treatment vs. no treatment, men vs. women)
◮ testing for a difference between groups using the log-rank test: survdiff()

SLIDE 57

Example: Melanom

library(ISwR)
library(survival)
data(melanom)
str(melanom)   # structure of the melanom object
## 'data.frame': 205 obs. of 6 variables:
##  $ no    : int 789 13 97 16 21 469 685 7 932 944 ...
##  $ status: int 3 3 2 3 1 1 1 1 3 1 ...
##  $ days  : int 10 30 35 99 185 204 210 232 232 279 ...
##  $ ulc   : int 1 2 2 2 1 1 1 1 1 1 ...
##  $ thick : int 676 65 134 290 1208 484 516 1288 322 741 ...
##  $ sex   : int 2 2 2 1 2 2 2 2 1 1 ...

SLIDE 58

Example: Melanom

Variables of interest

◮ days: time on study after the operation for malignant melanoma
◮ status: the patient's status at the end of the study
  ◮ 1: dead from malignant melanoma
  ◮ 2: alive at the end of the study
  ◮ 3: dead from other causes (considered censored, thus treated the same as alive)
◮ Surv needs TRUE (death) / FALSE (alive) as the status indicator

mysurv <- with(melanom, Surv(days, status == 1))

SLIDE 59

The created Surv object

print(mysurv)
##   [1]   10+   30+   35+   99+  185  204  210  232  232+  279  295
##  [12]  355+  386  426  469  493+  529  621  629  659  667  718
##  [23]  752  779  793  817  826+  833  858  869  872  967  977
##  [34]  982 1041 1055 1062 1075 1156 1228 1252 1271 1312 1427+
##  [45] 1435 1499+ 1506 1508+ 1510+ 1512+ 1516 1525+ 1542+ 1548 1557+
##  [56] 1560 1563+ 1584 1605+ 1621 1627+ 1634+ 1641+ 1641+ 1648+ 1652+
##  [67] 1654+ 1654+ 1667 1678+ 1685+ 1690 1710+ 1710+ 1726 1745+ 1762+
##  [78] 1779+ 1787+ 1787+ 1793+ 1804+ 1812+ 1836+ 1839+ 1839+ 1854+ 1856+
##  [89] 1860+ 1864+ 1899+ 1914+ 1919+ 1920+ 1927+ 1933 1942+ 1955+ 1956+
## [100] 1958+ 1963+ 1970+ 2005+ 2007+ 2011+ 2024+ 2028+ 2038+ 2056+ 2059+
## [111] 2061 2062 2075+ 2085+ 2102+ 2103 2104+ 2108 2112+ 2150+ 2156+
## [122] 2165+ 2209+ 2227+ 2227+ 2256 2264+ 2339+ 2361+ 2387+ 2388 2403+
## [133] 2426+ 2426+ 2431+ 2460+ 2467 2492+ 2493+ 2521+ 2542+ 2559+ 2565
## [144] 2570+ 2660+ 2666+ 2676+ 2738+ 2782 2787+ 2984+ 3032+ 3040+ 3042
## [155] 3067+ 3079+ 3101+ 3144+ 3152+ 3154+ 3180+ 3182+ 3185+ 3199+ 3228+
## [166] 3229+ 3278+ 3297+ 3328+ 3330+ 3338 3383+ 3384+ 3385+ 3388+ 3402+
## [177] 3441+ 3458+ 3459+ 3459+ 3476+ 3523+ 3667+ 3695+ 3695+ 3776+ 3776+
## [188] 3830+ 3856+ 3872+ 3909+ 3968+ 4001+ 4103+ 4119+ 4124+ 4207+ 4310+
## [199] 4390+ 4479+ 4492+ 4668+ 4688+ 4926+ 5565+

◮ 'x+' after a time means that the patient did not die from melanoma within x days and was then unavailable for further study (censored)
◮ 'x' means that the patient died x days after the operation

SLIDE 60

The created Surv object

print(summary(mysurv))
##       time          status
##  Min.   :  10   Min.   :0.000
##  1st Qu.:1525   1st Qu.:0.000
##  Median :2005   Median :0.000
##  Mean   :2153   Mean   :0.278
##  3rd Qu.:3042   3rd Qu.:1.000
##  Max.   :5565   Max.   :1.000

SLIDE 61

Fit Kaplan-Meier estimator

Compare two groups

Does survival differ between men and women?

myfit <- survfit(Surv(melanom$days, melanom$status == 1) ~ melanom$sex)
print(myfit)
## Call: survfit(formula = Surv(melanom$days, melanom$status == 1) ~ melanom$sex)
##
##                 n events median 0.95LCL 0.95UCL
## melanom$sex=1 126     28     NA      NA      NA
## melanom$sex=2  79     29     NA    2388      NA

SLIDE 62

Visualization

plot(myfit,col=c("red","blue")) legend(x="bottomleft",c("Female","Male"), col=c("red","blue"),lty=1)

1000 2000 3000 4000 5000 0.0 0.2 0.4 0.6 0.8 1.0 Female Male

SLIDE 63

Visualization with confidence intervals

plot(myfit,col=c("red","blue"),conf.int=TRUE) legend(x="bottomleft",c("Female","Male"), col=c("red","blue"),lty=1)

1000 2000 3000 4000 5000 0.0 0.2 0.4 0.6 0.8 1.0 Female Male

SLIDE 64

Comparing survival curves

Log-rank test

survdiff(Surv(days, status == 1) ~ sex, data = melanom)
## Call:
## survdiff(formula = Surv(days, status == 1) ~ sex, data = melanom)
##
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=1 126       28     37.1      2.25      6.47
## sex=2  79       29     19.9      4.21      6.47
##
##  Chisq= 6.5  on 1 degrees of freedom, p= 0.011

SLIDE 65

Exercises

An investigator collected data on the survival of patients with lung cancer at the Mayo Clinic. The investigator would like you, the statistician, to answer the following questions and provide some graphs. The data is located in the survival package under the name cancer.

◮ What is the probability that someone will survive past 300 days?
◮ Is there a difference in the survival rates between males and females? Provide a formal statistical test with a p-value.
◮ Is there a difference in the survival rates for the older half of the group vs. the younger half?

SLIDE 66

Solution i

attach(cancer)
surv.can <- Surv(time, status == 2)   # in this data set, status == 2 codes death
fit.can <- survfit(surv.can ~ 1)
summary(fit.can, time = 300)$surv
## [1] 0.5306081

SLIDE 67

Solution ii

surv.can <- Surv(time, status == 2)
can.bysex <- survfit(surv.can ~ sex)
survdiff(surv.can ~ sex)
## Call:
## survdiff(formula = surv.can ~ sex)
##
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=1 138      112     91.6      4.55      10.3
## sex=2  90       53     73.4      5.68      10.3
##
##  Chisq= 10.3  on 1 degrees of freedom, p= 0.00131

[Kaplan-Meier curves by sex: survival probability (0-1) vs. days (0-1000)]

SLIDE 68

Solution iii

survdiff(surv.can ~ age > median(age))
## Call:
## survdiff(formula = surv.can ~ age > median(age))
##
##                           N Observed Expected (O-E)^2/E (O-E)^2/V
## age > median(age)=FALSE 117       80     88.8     0.865      1.88
## age > median(age)=TRUE  111       85     76.2     1.007      1.88
##
##  Chisq= 1.9  on 1 degrees of freedom, p= 0.17

[Kaplan-Meier curves for the younger and older halves: survival probability (0-1) vs. days (0-1000)]

SLIDE 69

Questions?
