Bioinformatics Chapter 3: Data basing and data mining
BIOINFORMATICS
Kristel Van Steen, PhD2
Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg
kristel.vansteen@ulg.ac.be
Bioinformatics Chapter 1: What it means and doesn’t mean Kristel Van Steen
(UIC Bioinformatics Group)
(http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html)
(http://www.ncbi.nlm.nih.gov/)
Bioinformatics Chapter 3: Data bases and data mining K Van Steen 16
http://www.ncbi.nlm.nih.gov/About/
(http://www.ncbi.nlm.nih.gov/Genbank/index.html)
(http://www.ncbi.nlm.nih.gov/Sitemap/ResourceGuide.html)
(http://www.ncbi.nlm.nih.gov/Sitemap/ResourceGuide.html#SampleRecord)
(http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html)
(http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html#LocusB)
(http://www.ncbi.nlm.nih.gov/Sitemap/Summary/statistics.html#GenBankStats)
(http://www.ncbi.nlm.nih.gov/projects/SNP/)
(http://www.ncbi.nlm.nih.gov/SNP/snp_summary.cgi)
(http://www.ncbi.nlm.nih.gov/sites/entrez?db=snp&cmd=search&term=)
(http://www.ncbi.nlm.nih.gov/snp/limits)
(http://www.embl.org/)
(http://www.ebi.ac.uk/embl/index.html)
(http://www.ddbj.nig.ac.jp/ddbjingtop-e.html)
(S Star slide: Ping)
(http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim)
(http://www.ncbi.nlm.nih.gov/Omim/omimhelp.html#MainFeatures)
(http://www.genome.jp/kegg/)
(http://www.genome.ad.jp/kegg/pathway.html)
(http://www.genome.ad.jp/kegg-bin/resize_map.cgi?map=hsa05310&scale=0.67)
(http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/about.html)
(http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap)
(http://www.ensembl.org/index.html)
(http://www.ensembl.org/Homo_sapiens/Info/Index)
(http://www.ncbi.nlm.nih.gov/)
(http://www.ncbi.nlm.nih.gov/sites/gquery)
SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10 Class
1    2    1    1    1    1    1    2    1    1     2
1    1    1    2    1    1    1    2    1    2     2
1    1    2    1    1    1    1    1    1    …
Opening C:/PROGRA~1/R/R-27~1.2/library/ALL/doc/ALLintro.pdf

## Install and load the ALL data package. These are the calls shown on the slide;
## getBioC() and biocLite() have since been retired, and current Bioconductor
## releases install packages with BiocManager::install("ALL").
source("http://www.bioconductor.org/getBioC.R")
getBioC()
source("http://bioconductor.org/biocLite.R")
biocLite("ALL")
library("ALL")
data("ALL")
class(ALL)
show(ALL)
slotNames(ALL)
## note, slots like exprs and phenoData
## can be accessed by slot accessor "@"
## or by functions like exprs() or pData()
levels(ALL$mol.biol)   ## list different molecular biology types
table(ALL$mol.biol)    ## frequency of these

> slotNames(ALL)
[1] "assayData" "phenoData" "featureData" "experimentData" "annotation" ".__classVersion__"
> table(ALL$mol.biol)
ALL1/AF4  BCR/ABL E2A/PBX1      NEG   NUP-98  p15/p16
      10       37        5       74        1        1
## let's only select two molecular types:
selSamples <- ALL$mol.biol %in% c("ALL1/AF4", "E2A/PBX1")
ALLs <- ALL[, selSamples]
show(ALLs)
ALLs$mol.biol <- factor(ALLs$mol.biol)
ALLs$mol.biol

> show(ALLs)
ExpressionSet (storageMode: lockedEnvironment)
assayData: 12625 features, 15 samples
  element names: exprs
phenoData
  sampleNames: 04006, 08018, ..., LAL5 (15 total)
  varLabels and varMetadata description:
    cod: Patient ID
    diagnosis: Date of diagnosis
    ...: ...
    date last seen: date patient was last seen
    (21 total)
featureData
  featureNames: 1000_at, 1001_at, ..., AFFX-YEL024w/RIP1_at (12625 total)
  fvarLabels and fvarMetadata description: none
experimentData: use 'experimentData(object)'
  pubMedIds: 14684422 16243790
Annotation: hgu95av2
> ALLs$mol.biol <- factor(ALLs$mol.biol)
> ALLs$mol.biol
 [1] ALL1/AF4 E2A/PBX1 ALL1/AF4 ALL1/AF4 ALL1/AF4 ALL1/AF4 E2A/PBX1 ALL1/AF4 E2A/PBX1 ALL1/AF4 ALL1/AF4 ALL1/AF4 E2A/PBX1
[14] ALL1/AF4 E2A/PBX1
Levels: ALL1/AF4 E2A/PBX1
## add molecular biology type to colnames of samples
colnames(exprs(ALLs))
colnames(exprs(ALLs)) <- paste(ALLs$mol.biol, colnames(exprs(ALLs)))
colnames(exprs(ALLs))

> colnames(exprs(ALLs))
 [1] "04006" "08018" "15004" "16004" "19005" "24005" "24019" "26008" "28003" "28028" "28032" "31007" "36001" "63001" "LAL5"
> colnames(exprs(ALLs)) <- paste(ALLs$mol.biol, colnames(exprs(ALLs)))
> colnames(exprs(ALLs))
 [1] "ALL1/AF4 04006" "E2A/PBX1 08018" "ALL1/AF4 15004" "ALL1/AF4 16004" "ALL1/AF4 19005" "ALL1/AF4 24005" "E2A/PBX1 24019"
 [8] "ALL1/AF4 26008" "E2A/PBX1 28003" "ALL1/AF4 28028" "ALL1/AF4 28032" "ALL1/AF4 31007" "E2A/PBX1 36001" "ALL1/AF4 63001"
[15] "E2A/PBX1 LAL5"

hist(exprs(ALLs))
hist(ALLs@exprs)   ## direct slot access as on the slide; the slot list above has no "exprs" slot, so prefer exprs(ALLs)
The comparison of BCR/ABL to NEG is difficult, and the error rates are typically quite high.
Other class comparisons are easier to distinguish and the error rates should be smaller.
> class(ALLfilt_bcrneg)
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"
(http://gim.unmc.edu/dxtests/ROC1.htm)
(http://www.medcalc.be/manual/roc.php)
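library("Biobase")   ## provides rowQ() and rowMedians()

## interquartile range of each row, taken from order statistics
rowIQRs = function(eSet) {
  numSamp = ncol(eSet)
  lowQ = rowQ(eSet, floor(0.25 * numSamp))
  upQ = rowQ(eSet, ceiling(0.75 * numSamp))
  upQ - lowQ
}

## center each row at its median and scale it by its IQR
standardize = function(x) (x - rowMedians(x)) / rowIQRs(x)
exprs(ALLfilt_bcrneg) = standardize(exprs(ALLfilt_bcrneg))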
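As a rough illustration of what this standardization does, the following base R sketch centers each row of a small made-up matrix at its median and scales it by its interquartile range. It is only a sketch under stated assumptions: apply() with median() and IQR() stands in for the Biobase helpers, so the quantile convention may differ slightly from rowIQRs(), and the names toy and standardize_rows are illustrative.

```r
# Toy expression matrix: 3 genes (rows) x 5 samples (columns); values are made up.
toy <- rbind(g1 = c(1, 2, 3, 4, 5),
             g2 = c(10, 20, 30, 40, 50),
             g3 = c(2, 2, 4, 6, 6))

# Row-wise robust standardization: subtract the row median, divide by the row IQR.
# (Illustrative helper; IQR() uses R's default quantile type, unlike rowQ().)
standardize_rows <- function(x) {
  med <- apply(x, 1, median)
  iqr <- apply(x, 1, IQR)
  (x - med) / iqr   # vectors of length nrow(x) recycle down the columns
}

toy_std <- standardize_rows(toy)
print(round(toy_std, 2))
```

After this step every row has median 0 and IQR 1, so genes measured on very different scales (g1 and g2 here) become directly comparable.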
## Euclidean distances between samples (dist() compares rows, hence t())
eucD = dist(t(exprs(ALLfilt_bcrneg)))
eucM = as.matrix(eucD)
dim(eucM)

library("RColorBrewer")
hmcol = colorRampPalette(brewer.pal(10, "RdBu"))(256)
hmcol = rev(hmcol)
heatmap(eucM, sym=TRUE, col=hmcol, distfun=as.dist)
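The following base R sketch, on simulated data, checks what the distance step computes: dist() works on rows, so the matrix is transposed to obtain sample-by-sample Euclidean distances. The matrix x and its dimension names are made up for illustration.

```r
set.seed(1)
# Simulated expression matrix: 4 genes (rows) x 3 samples (columns).
x <- matrix(rnorm(12), nrow = 4,
            dimnames = list(paste0("gene", 1:4), paste0("s", 1:3)))

# dist() compares rows, so transpose to compare samples.
d <- as.matrix(dist(t(x)))

# Euclidean distance between samples s1 and s2 computed by hand, for comparison.
manual <- sqrt(sum((x[, "s1"] - x[, "s2"])^2))

print(d)
print(manual)
```

The result is a symmetric samples-by-samples matrix with zeros on the diagonal, which is the kind of input a heatmap with sym=TRUE expects.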
a heatmap of the between-sample distances in our example data
Negs = which(ALLfilt_bcrneg$mol.biol == "NEG")
Bcr = which(ALLfilt_bcrneg$mol.biol == "BCR/ABL")
set.seed(1969)
S1 = sample(Negs, 20, replace=FALSE)
S2 = sample(Bcr, 20, replace=FALSE)
TrainInd = c(S1, S2)
TestInd = setdiff(1:79, TrainInd)
> Negs
 [1]  2  4  5  6  7  8 11 12 14 19 22 24 26 28 31 35 37 38 39 43 44 45 46 49 50
[26] 51 52 54 55 56 57 58 61 62 65 66 67 68 70 74 75 77
> ALLfilt_bcrneg$mol.biol
 [1] BCR/ABL NEG     BCR/ABL NEG     NEG     NEG     NEG     NEG     BCR/ABL
[10] BCR/ABL NEG     NEG     BCR/ABL NEG     BCR/ABL BCR/ABL BCR/ABL BCR/ABL
[19] NEG     BCR/ABL BCR/ABL NEG     BCR/ABL NEG     BCR/ABL NEG     BCR/ABL
[28] NEG     BCR/ABL BCR/ABL NEG     BCR/ABL BCR/ABL BCR/ABL NEG     BCR/ABL
[37] NEG     NEG     NEG     BCR/ABL BCR/ABL BCR/ABL NEG     NEG     NEG
[46] NEG     BCR/ABL BCR/ABL NEG     NEG     NEG     NEG     BCR/ABL NEG
[55] NEG     NEG     NEG     NEG     BCR/ABL BCR/ABL NEG     NEG     BCR/ABL
[64] BCR/ABL NEG     NEG     NEG     NEG     BCR/ABL NEG     BCR/ABL BCR/ABL
[73] BCR/ABL NEG     NEG     BCR/ABL NEG     BCR/ABL BCR/ABL
Levels: BCR/ABL NEG
(www.wikipedia.org)
> krun = MLearn(mol.biol ~ ., data=ALLfilt_bcrneg, knnI(k=1,l=0), TrainInd)
> krun
MLInterfaces classification output container
The call was:
MLearn(formula = mol.biol ~ ., data = ALLfilt_bcrneg, method = knnI(k = 1, l = 0), trainInd = TrainInd)
Predicted outcome distribution for test set:
BCR/ABL     NEG
     17      22
…
> names(RObject(krun))
[1] "traindat" "ans"      "traincl"
> confuMat(krun)
         predicted
given     BCR/ABL NEG
  BCR/ABL      10   7
  NEG           7  15
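To make the nearest-neighbour rule with k = 1 concrete, here is a small base R sketch on made-up two-dimensional data. The helper nn1 is hypothetical and is not the knnI interface of MLInterfaces; it simply assigns each test point the label of its closest training point and then cross-tabulates given against predicted labels.

```r
# Hypothetical helper: label each test row with the class of its nearest training row.
nn1 <- function(train, test, cl) {
  pred <- apply(test, 1, function(z) {
    d <- sqrt(colSums((t(train) - z)^2))   # Euclidean distance to every training row
    as.character(cl[which.min(d)])
  })
  factor(pred, levels = levels(cl))
}

# Made-up training data: two points per class.
train <- rbind(c(0, 0), c(0, 1), c(5, 5), c(6, 5))
cl <- factor(c("NEG", "NEG", "BCR/ABL", "BCR/ABL"))

# Made-up test data with known labels; the third point lies close to the NEG group.
test <- rbind(c(0.2, 0.4), c(5.4, 4.8), c(1, 1))
truth <- factor(c("NEG", "BCR/ABL", "BCR/ABL"), levels = levels(cl))

pred <- nn1(train, test, cl)
conf <- table(given = truth, predicted = pred)
print(conf)
```

The off-diagonal cells of the table count the misclassified test points, in the same layout as the confuMat() output.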
> ldarun = MLearn(mol.biol ~ ., data=ALLfilt_bcrneg, ldaI, TrainInd)
> ldarun
MLInterfaces classification output container
The call was:
MLearn(formula = mol.biol ~ ., data = ALLfilt_bcrneg, method = ldaI, trainInd = TrainInd)
Predicted outcome distribution for test set:
BCR/ABL     NEG
     12      27
> names(RObject(ldarun))
 [1] "prior"   "counts"  "means"   "scaling" "lev"     "svd"     "N"
 [8] "call"    "terms"   "xlevels"
> confuMat(ldarun)
         predicted
given     BCR/ABL NEG
  BCR/ABL      10   7
  NEG           2  20
> dldarun = MLearn(mol.biol ~ ., data=ALLfilt_bcrneg, dldaI, TrainInd)
Loading required package: sma
> dldarun
MLInterfaces classification output container
The call was:
MLearn(formula = mol.biol ~ ., data = ALLfilt_bcrneg, method = dldaI, trainInd = TrainInd)
Predicted outcome distribution for test set:
BCR/ABL     NEG
     21      18
> names(RObject(dldarun))
[1] "traindat" "ans"      "traincl"
> confuMat(dldarun)
         predicted
given     BCR/ABL NEG
  BCR/ABL      13   4
  NEG           8  14
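The three test-set confusion matrices reported above (kNN, LDA, DLDA) can be summarized as misclassification rates. The sketch below only retypes the counts from that output; make_cm is an illustrative helper name.

```r
# Build a 2 x 2 confusion matrix (rows = given class, columns = predicted class).
make_cm <- function(v) {
  matrix(v, nrow = 2, byrow = TRUE,
         dimnames = list(given = c("BCR/ABL", "NEG"),
                         predicted = c("BCR/ABL", "NEG")))
}

# Counts copied from the confuMat() output for the 39 test samples.
cms <- list(knn  = make_cm(c(10, 7, 7, 15)),
            lda  = make_cm(c(10, 7, 2, 20)),
            dlda = make_cm(c(13, 4, 8, 14)))

# Misclassification rate = 1 - (correctly classified / total).
err <- sapply(cms, function(cm) 1 - sum(diag(cm)) / sum(cm))
print(round(err, 3))
```

On this particular split the rates are 14/39 for kNN, 9/39 for LDA and 12/39 for DLDA; with only 39 test samples such differences should be read cautiously.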
(adapted from http://www.travelnotes.de/rays/fortran/snoopy.gif)
         statistic         dm   p.value
41654_at  -1.01298 -0.1983496 0.3174765
> BNf = ALLfilt_bcrneg[fNtt,]
> knnf = MLearn(mol.biol ~ ., data=BNf, knnI(k=1,l=0), TrainInd)
> confuMat(knnf)
         predicted
given     BCR/ABL NEG
  BCR/ABL      14   3
  NEG           1  21
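The code above restricts the data to the features in fNtt, whose construction is not shown here. One plausible way to build such a list, assumed for this sketch, is to rank genes by a two-sample t statistic computed on the training samples only, so that the test samples play no role in feature selection. The data are simulated and all names (x, y, train_ind, top5) are illustrative.

```r
set.seed(42)
n_genes <- 200
n_samp <- 40

# Simulated expression matrix with 5 planted informative genes (g1..g5).
x <- matrix(rnorm(n_genes * n_samp), nrow = n_genes,
            dimnames = list(paste0("g", 1:n_genes), NULL))
y <- factor(rep(c("BCR/ABL", "NEG"), each = n_samp / 2))
x[1:5, y == "NEG"] <- x[1:5, y == "NEG"] + 5

# Training set: 10 samples per class; the rest would be held out for testing.
train_ind <- c(1:10, 21:30)
grp <- y[train_ind]
x_train <- x[, train_ind]

# Welch t statistic per gene, using training samples only.
t_stat <- apply(x_train, 1, function(g) {
  unname(t.test(g[grp == "NEG"], g[grp == "BCR/ABL"])$statistic)
})

# Keep the genes with the largest absolute t statistic.
top5 <- names(sort(abs(t_stat), decreasing = TRUE))[1:5]
print(top5)

# Reduced data set (all samples, selected genes), ready for a classifier.
x_sel <- x[top5, ]
```

Selecting features on all samples before splitting would let information from the test set leak into the classifier and make the test error look better than it is.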
> BNx = ALLfilt_bcrneg[1:1000,]
> knnXval1 = MLearn(mol.biol~., data=BNx, knn.cvI(k=1,l=0), trainInd=1:ncol(BNx))
> confuMat(knnXval1)
         predicted
given     BCR/ABL NEG
  BCR/ABL      32   5
  NEG          16  26

knnXval1 = MLearn(mol.biol~., data=BNx, knnI(k=1,l=0), xvalSpec("LOO"))
> lk3f1 = MLearn(mol.biol~., data=BNx, knnI(k=1), xvalSpec("LOO", fsFun=fs.absT(50)))
> confuMat(lk3f1)
         predicted
given     BCR/ABL NEG
  BCR/ABL      33   4
  NEG           8  34
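Leave-one-out cross-validation with a 1-nearest-neighbour classifier can be sketched in a few lines of base R: each sample is classified by its nearest neighbour among all the other samples. The helper loo_1nn and the data are made up for illustration and do not reproduce the MLInterfaces machinery (xvalSpec, knn.cvI).

```r
# Hypothetical helper: leave-one-out 1-NN predictions for samples in the rows of x.
loo_1nn <- function(x, y) {
  d <- as.matrix(dist(x))
  diag(d) <- Inf                      # a sample may not be its own neighbour
  nn <- apply(d, 1, which.min)        # index of the nearest other sample
  factor(as.character(y[nn]), levels = levels(y))
}

# Made-up data: two tight groups plus one BCR/ABL sample lying nearer the NEG group.
x <- rbind(c(0, 0), c(0, 1), c(1, 0),
           c(5, 5), c(5, 6), c(6, 5),
           c(2.4, 2.4))
y <- factor(c("NEG", "NEG", "NEG",
              "BCR/ABL", "BCR/ABL", "BCR/ABL", "BCR/ABL"))

pred <- loo_1nn(x, y)
print(table(given = y, predicted = pred))

# Leave-one-out estimate of the misclassification rate.
loo_error <- mean(pred != y)
print(loo_error)
```

Every sample serves once as a test case, so no data are set aside permanently; any feature selection would have to be repeated inside each left-out fold, which is the role of the fsFun argument in the call above.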
library("randomForest")
set.seed(123)
rf1 = MLearn(mol.biol~., data=ALLfilt_bcrneg, randomForestI, TrainInd, ntree=1000, mtry=55, importance=TRUE)
rf2 = MLearn(mol.biol~., data=ALLfilt_bcrneg, randomForestI, TrainInd, ntree=1000, mtry=10, importance=TRUE)
> confuMat(rf1,"train")
         predicted
given     BCR/ABL NEG
  BCR/ABL      20   0
  NEG           0  20
> confuMat(rf1,"test")
         predicted
given     BCR/ABL NEG
  BCR/ABL      12   5
  NEG           5  17
> confuMat(rf2,"train")
         predicted
given     BCR/ABL NEG
  BCR/ABL      20   0
  NEG           0  20
> confuMat(rf2,"test")
         predicted
given     BCR/ABL NEG
  BCR/ABL      12   5
  NEG           5  17
opar = par(no.readonly=TRUE)   ## save the graphics settings that par(opar) restores below
par(las=2)
impV1 = getVarImp(rf1)
plot(impV1, n=15)
par(opar)
par(las=2, mar=c(7,5,4,2))
impV2 = getVarImp(rf2)
plot(impV2, n=15)
par(opar)
library("genefilter")   ## provides nsFilter()
Bcell = grep("^B", ALL$BT)
ALLs = ALL[, Bcell]
types = c("BCR/ABL", "NEG", "ALL1/AF4")
threeG = ALLs$mol.biol %in% types
ALL3g = ALLs[, threeG]
qrange <- function(x) diff(quantile(x, c(0.1, 0.9)))
ALL3gf = nsFilter(ALL3g, var.cutoff=0.75, var.func=qrange)$eset
ALL3gf$mol.biol = factor(ALL3gf$mol.biol)
s1 = table(ALL3gf$mol.biol)
trainN = ceiling(s1/2)
sN = split(1:length(ALL3gf$mol.biol), ALL3gf$mol.biol)

> sN
$`ALL1/AF4`
 [1]  4 24 26 28 35 46 57 59 68 83

$`BCR/ABL`
 [1]  1  3 10 11 14 16 17 18 19 21 22 25 29 31 33 34 37 38 39 41 45 47 48 53 54
[26] 61 67 69 72 73 78 80 81 82 86 88 89

$NEG
 [1]  2  5  6  7  8  9 12 13 15 20 23 27 30 32 36 40 42 43 44 49 50 51 52 55 56
[26] 58 60 62 63 64 65 66 70 71 74 75 76 77 79 84 85 87

trainInd = NULL
testInd = NULL
set.seed(777)
## draw half of each class (rounded up) for training, the rest for testing
for (i in 1:3) {
  trI = sample(sN[[i]], trainN[[i]])
  teI = setdiff(sN[[i]], trI)
  trainInd = c(trainInd, trI)
  testInd = c(testInd, teI)
}
trainSet = ALL3gf[, trainInd]
testSet = ALL3gf[, testInd]
library("ALL")
library("genefilter")   ## provides nsFilter()
data(ALL)
bcell = grep("^B", as.character(ALL$BT))
moltyp = which(as.character(ALL$mol.biol) %in% c("NEG", "BCR/ABL"))
ALL_bcrneg = ALL[, intersect(bcell, moltyp)]
ALL_bcrneg$mol.biol = factor(ALL_bcrneg$mol.biol)
ALLfilt_bcrneg = nsFilter(ALL_bcrneg, var.cutoff=0.75)$eset
iqrs = esApply(es2, 1, IQR)
gvals = scale(t(exprs(es2)), rowMedians(es2), iqrs[featureNames(es2)])
manDist = dist(gvals, method="manhattan")
hmcol = colorRampPalette(brewer.pal(10, "RdBu"))(256)
hmcol = rev(hmcol)
heatmap(as.matrix(manDist), sym=TRUE, col=hmcol, distfun=function(x) as.dist(x))
library(MASS)
cols = ifelse(es2$mol.biol == "BCR/ABL", "black", "goldenrod")
sam1 = sammon(manDist, trace=FALSE)
plot(sam1$points, col=cols, xlab="Dimension 1", ylab="Dimension 2")
library(hopach)
mD = as.matrix(manDist)
silEst = silcheck(mD, diss=TRUE)
silEst
[1] 3.0000000 0.1126571
d2 = as.matrix(dist(t(gvals), method="man"))
silEstG = silcheck(d2, diss=TRUE)
silEstG
[1] 3.0000000 0.1122456
Hc1 = hclust(manDist)
Hc2 = hclust(manDist, method="single")
Hc3 = hclust(manDist, method="ward")   ## this method is named "ward.D" since R 3.1.0
Hc4 = hclust(manDist)

par(mfrow=c(2,1))
plot(Hc1, ann=FALSE)
title(main="Complete Linkage", cex.main=2)
plot(Hc2, ann=FALSE)
title(main="Single Linkage", cex.main=2)
cph1 = cophenetic(Hc1)
cor1 = cor(manDist, cph1)
cor1
plot(manDist, cph1, pch="/", col="blue")
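The cophenetic correlation measures how faithfully a dendrogram preserves the original pairwise distances. The base R sketch below, on simulated points forming two groups, compares complete and single linkage in this respect; the data and object names are made up for illustration.

```r
set.seed(10)
# Simulated data: two groups of 10 points in two dimensions.
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 4), ncol = 2))
d <- dist(pts, method = "manhattan")

hc_complete <- hclust(d, method = "complete")
hc_single <- hclust(d, method = "single")

# Cophenetic distance = height at which two observations are first merged.
# Its correlation with the input distances summarizes the fit of each tree.
coph <- c(complete = cor(d, cophenetic(hc_complete)),
          single = cor(d, cophenetic(hc_single)))
print(round(coph, 3))
```

A value close to 1 indicates that the tree distorts the input distances little; it does not by itself say that the clusters are biologically meaningful.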
km2 = kmeans(gvals, centers=2, nstart=5)
kmx = kmeans(gvals, centers=2, nstart=25)
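Because k-means starts from random centers, different runs can end in different local optima; the nstart argument repeats the algorithm and keeps the best solution. The sketch below, on simulated and well-separated data, runs k-means with two values of nstart and cross-tabulates the resulting partitions. Cluster labels are arbitrary, so agreement shows up as one non-zero cell per row rather than as a diagonal.

```r
set.seed(2)
# Simulated data: two well-separated groups of 20 points each.
pts <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))

km_a <- kmeans(pts, centers = 2, nstart = 5)
km_b <- kmeans(pts, centers = 2, nstart = 25)

# Cross-tabulate the two partitions; labels 1 and 2 may be swapped between runs.
agree <- table(km_a$cluster, km_b$cluster)
print(agree)

# Total within-cluster sum of squares: the criterion that nstart helps to minimize.
print(c(km_a$tot.withinss, km_b$tot.withinss))
```

On harder data the run with more random starts can reach a lower within-cluster sum of squares, which is the reason for comparing the two fits.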
> km2$cluster
01005 01010 03002 04007 04008 04010 04016 06002 08001 08011 08012 08024 09008
    2     1     2     2     1     1     2     2     2     2     2     2     1
09017 11005 12006 12007 12012 12019 12026 14016 15001 15005 16009 20002 22009
    2     2     2     2     2     2     2     2     2     2     1     1     2
22010 22011 22013 24001 24008 24010 24011 24017 24018 24022 25003 25006 26001
    2     2     2     2     2     2     2     1     1     2     2     1     2
26003 27003 27004 28001 28005 28006 28007 28019 28021 28023 28024 28031 28035
    2     2     2     2     2     2     2     1     1     2     2     2     1
28036 28037 28042 28043 28044 28047 30001 31011 33005 36002 37013 43001 43004
    2     1     2     2     2     2     1     2     1     2     2     2     2
43007 43012 48001 49006 57001 62001 62002 62003 64001 64002 65005 68001 68003
    2     2     1     2     2     2     2     2     2     1     2     1     2
84004
    2
library(cluster)   ## provides pam()
pam2 = pam(manDist, k=2, diss=TRUE)
pam3 = pam(manDist, k=3, diss=TRUE)

## check that both results list the samples in the same order, then cross-tabulate
all(names(km2$cluster) == names(pam2$clustering))
pam2km = table(km2$cluster, pam2$clustering)
pam2km
set.seed(123)
library(kohonen)
s1 = som(gvals, grid=somgrid(4,4))
names(s1)
s2 = som(gvals, grid=somgrid(4,4), alpha=c(1,0.1), rlen=1000)
s3 = som(gvals, grid=somgrid(4,4,topo="hexagonal"), alpha=c(1,0.1), rlen=1000)
whGP = table(s3$unit.classif)
whGP
set.seed(777)
library(class)
s4 = SOM(gvals, grid=somgrid(4,4,topo="hexagonal"))
SOMgp = knn1(s4$code, gvals, 1:nrow(s4$code))
table(SOMgp)
SOMgp

> SOMgp
 [1] 12 16  3 10 12 16 12 12  3 12  3  7 16  3  3  4 12 12  4 12  4 12 15 16 12
[26] 14  9 10 12 10  1  4 14 12 15  3 15 16 12 10 16 10 16  3 16 10 16 16 16 10
[51] 16 16 10 16  3  3 16  9 16  3 16 15 11  4 16  2  3  4 12  7  4  3 15 12 12
[76] 12 16  3  1
Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
cD = dist(s4$code)
> cD
       1     2     3     4     5     6     7     8
2  0.000
3  0.857 0.857
4  1.573 1.573 1.813
5  0.000 0.000 0.857 1.573
6  0.000 0.000 0.857 1.573 0.000
7  0.839 0.839 1.132 1.571 0.839 0.839
8  1.182 1.182 1.558 1.796 1.182 1.182 1.219
9  0.839 0.839 1.132 1.571 0.839 0.839 0.000 1.219
10 0.857 0.857 0.000 1.813 0.857 0.857 1.132 1.558
11 1.182 1.182 1.558 1.796 1.182 1.182 1.219 0.000
12 2.669 2.669 3.132 3.046 2.669 2.669 2.888 2.565
13 1.182 1.182 1.558 1.796 1.182 1.182 1.219 0.000
14 1.182 1.182 1.558 1.796 1.182 1.182 1.219 0.000
15 1.573 1.573 1.813 0.000 1.573 1.573 1.571 1.796
16 2.176 2.176 2.445 2.648 2.176 2.176 2.167 2.290
       9    10    11    12    13    14    15
range from 1 to -1, with
silpam2 = silhouette(pam2)
plot(silpam2, main="")
> silpam2[silpam2[,"sil_width"]<0,]
      cluster neighbor   sil_width
57001       1        2 -0.03928263
12006       2        1 -0.01466478
28043       2        1 -0.02104596
14016       2        1 -0.03727049
68003       2        1 -0.04262153
43001       2        1 -0.04936700
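Silhouette widths can be computed by hand, which helps in reading negative values such as those listed above. For observation i, let a be its mean distance to the other members of its own cluster and b the smallest mean distance to the members of any other cluster; the width is (b - a) / max(a, b). The base R sketch below applies this standard definition to made-up one-dimensional data; sil_width is an illustrative helper, not the silhouette() function of the cluster package.

```r
# Hypothetical helper: silhouette width of every observation,
# given a distance object d and a vector of cluster labels cl.
sil_width <- function(d, cl) {
  d <- as.matrix(d)
  sapply(seq_along(cl), function(i) {
    own <- cl == cl[i]
    own[i] <- FALSE                       # exclude the observation itself
    if (!any(own)) return(0)              # singleton cluster: width set to 0
    a <- mean(d[i, own])                  # mean distance within own cluster
    others <- setdiff(unique(cl), cl[i])
    b <- min(sapply(others, function(k) mean(d[i, cl == k])))  # nearest other cluster
    (b - a) / max(a, b)
  })
}

# Made-up data: the last point (5) is assigned to cluster 2 but lies nearer cluster 1.
x <- c(0, 1, 2, 10, 11, 12, 5)
cl <- c(1, 1, 1, 2, 2, 2, 2)

sw <- sil_width(dist(x), cl)
print(round(sw, 3))
```

The last observation gets a negative width because it is on average closer to the neighbouring cluster (b = 4) than to its own (a = 6), which is the same reading that applies to the samples with negative sil_width.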
library("genefilter")   ## provides rowttests()
rtt = rowttests(ALLfilt_bcrneg, "mol.biol")
ordtt = order(rtt$p.value)   ## assumed step, not shown on the slide: rank genes by t-test p-value
esTT = ALLfilt_bcrneg[ordtt[1:50],]
pairs(t(exprs(esTT)[1:5,]),col=ifelse(esTT$mol.biol=="NEG", "green", "blue"))
pc = prcomp(t(exprs(esTT)))
pairs(pc$x[,1:5], col=ifelse(esTT$mol.biol=="NEG", "green", "blue"))
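As a small illustration of what prcomp() returns, the sketch below runs a principal component analysis on simulated data with samples in rows (as in the transposed expression matrix above) and computes the share of variance carried by each component. The data and the names x and var_explained are made up.

```r
set.seed(3)
# Simulated data: 30 samples (rows) x 4 genes (columns); gene2 nearly duplicates gene1.
x <- matrix(rnorm(30 * 4), ncol = 4,
            dimnames = list(NULL, paste0("gene", 1:4)))
x[, 2] <- x[, 1] + rnorm(30, sd = 0.1)

pc <- prcomp(x)

# Proportion of total variance explained by each principal component.
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
print(round(var_explained, 3))

# pc$x holds the sample scores; its columns are mutually uncorrelated.
print(round(cor(pc$x[, 1], pc$x[, 2]), 6))
```

Plotting the first few columns of pc$x against each other, as pairs() does above, therefore shows the samples along directions of decreasing variance.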
(Nature, May 18, 2000 issue)
chromosome of Down's syndrome, which is the most frequent neonatal disorder. Sequencing chromosome 21 has revealed the existence of 11 genes within the essential region of Down's syndrome (upper panel). It is supposed that the
related to the symptoms of Down's syndrome, such as mental retardation. In addition, we determined the sequence in the corresponding region of the mouse genome (bottom panel) and conducted a comparative study. Although 10 genes were well conserved in the mouse genome, a gene designated DSCR9 was found only in the human genome.