Assessment of batch effects in TCGA gene expression data Nianxiang Zhang BCB, MDACC 7/14/10
Outline
• TCGA data used
• Batch effects in TCGA data
• Identification of batch effects
  – Algorithms
    • PCA and hierarchical clustering
    • Correlation of correlations (CR)
  – Batch effects in TCGA gene expression data on GBM and ovarian cancer
• Adjustment for batch effects
  – Methods of batch effects adjustment
  – Adjustment for batch effects in TCGA gene expression data on GBM and ovarian cancer
  – Comparison of adjustment methods
• Implications
Data
Level 3 gene expression data on GBM and ovarian cancer
• 3 platforms
  – Affymetrix U133a
  – Agilent
  – Affymetrix Exon array
• GBM
  – 11 batches, 372 tumor samples
• OV
  – 13 batches, 511 samples (30 samples excluded)
Batch Effects in TCGA
• TCGA data are collected in multiple batches
• TCGA data come from multiple platforms, analyses, and institutions
• Batch effects can be very important for biological and clinical predictions
OV Data Distribution By Batch
Ovarian Cancer Data
GBM Data
Identification of Batch Effects
• Standard techniques
  – Principal component analysis (PCA)
  – Clustering analysis (1 − Pearson correlation as the distance metric, Ward linkage)
• Correlation of correlations (CR)
  – A scalar index of the similarity of batches in terms of gene-gene interactions
    • CR = 1 if batches are identical
    • CR = 0 if batches are uncorrelated
Calculation of Correlation of Correlations (CR)
• U ij denotes the correlation of genes i and j in batch 1
• V ij denotes the correlation of genes i and j in batch 2
• CR is the Pearson correlation between the U ij and the V ij, taken over all gene pairs i < j (Scherf, …. Weinstein, Nature Genetics 2000; 24:236)
• A permutation test of CR provides the statistical significance of batch effects
Visualization of the Correlation of Correlations Calculation (for 4 genes and batches consisting of 4 and 3 samples)
• Batch 1: Genes 1 to 4, with pairwise correlations such as R 12 = Corr(1,2)
• Batch 2: Genes 1 to 4, with pairwise correlations such as R’ 12 = Corr(1,2)
• CR = Corr[(R 12, R 13, R 14, R 23, R 24, R 34), (R’ 12, R’ 13, R’ 14, R’ 23, R’ 24, R’ 34)]
• That is, calculate a scalar quantity: the correlation between the vector of 6 correlation coefficients for Batch 1 and the vector of 6 correlation coefficients for Batch 2
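The 4-gene calculation on this slide can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library: the function names (`pearson`, `gene_pair_correlations`, `correlation_of_correlations`) and the toy expression values are invented here and are not taken from the presentation.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def gene_pair_correlations(batch):
    """batch: one row per gene, each row holding that gene's values across samples.
    Returns the correlations of all gene pairs (i < j) in a fixed order."""
    return [pearson(batch[i], batch[j])
            for i, j in combinations(range(len(batch)), 2)]


def correlation_of_correlations(batch1, batch2):
    """CR: correlation between the two vectors of gene-gene correlations."""
    return pearson(gene_pair_correlations(batch1),
                   gene_pair_correlations(batch2))


# Toy data (invented): 4 genes; 4 samples in batch 1 and 3 samples in batch 2.
batch1 = [[1.0, 2.0, 3.0, 4.5],
          [2.1, 3.9, 6.2, 8.0],
          [5.0, 3.0, 4.0, 1.0],
          [0.5, 0.7, 0.2, 0.9]]
batch2 = [[2.0, 1.0, 3.0],
          [4.2, 2.1, 5.8],
          [1.0, 4.0, 2.0],
          [0.3, 0.8, 0.6]]

cr = correlation_of_correlations(batch1, batch2)
print(len(gene_pair_correlations(batch1)), "gene pairs, CR =", round(cr, 3))
```

The two batches may have different numbers of samples, because only the gene-pair correlation vectors (6 values each for 4 genes) are compared.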
Permutation test of CRs
We scramble the batch labels of the samples in two batches and calculate the CR between the two permuted batches; repeating this gives the distribution of CR under H 0. The actual CR between the two batches is then compared with this distribution (two-sided p).
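A minimal sketch of such a label-scrambling test follows, assuming each batch is stored as a list of samples (one list of gene values per sample). The slide does not say how the two-sided p-value is computed; doubling the smaller tail fraction, the `+1` smoothing, the number of permutations, and the simulated data are all assumptions made for illustration.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def gene_pair_correlations(samples):
    """samples: one list of gene values per sample."""
    genes = list(zip(*samples))  # transpose to one tuple per gene
    return [pearson(genes[i], genes[j])
            for i, j in combinations(range(len(genes)), 2)]


def cr(samples1, samples2):
    return pearson(gene_pair_correlations(samples1),
                   gene_pair_correlations(samples2))


def permutation_test(samples1, samples2, n_perm=200, seed=0):
    """Scramble batch labels n_perm times and compare the actual CR
    with the resulting null distribution."""
    rng = random.Random(seed)
    observed = cr(samples1, samples2)
    pooled = list(samples1) + list(samples2)
    n1 = len(samples1)
    null = []
    for _ in range(n_perm):
        rng.shuffle(pooled)                       # scramble batch labels
        null.append(cr(pooled[:n1], pooled[n1:]))
    # Assumed two-sided p: twice the smaller tail fraction, capped at 1.
    lower = (1 + sum(v <= observed for v in null)) / (n_perm + 1)
    upper = (1 + sum(v >= observed for v in null)) / (n_perm + 1)
    return observed, min(1.0, 2 * min(lower, upper)), null


# Simulated data (invented): 5 genes driven by one latent factor whose
# loadings differ between the two batches.
data_rng = random.Random(1)


def make_sample(loadings):
    f = data_rng.gauss(0, 1)
    return [w * f + data_rng.gauss(0, 1) for w in loadings]


batch1 = [make_sample([2, 2, 0, -2, 0]) for _ in range(8)]
batch2 = [make_sample([0, -2, 2, 0, 2]) for _ in range(8)]

observed, p, null = permutation_test(batch1, batch2)
print("actual CR:", round(observed, 3), "two-sided p:", round(p, 3))
```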
PCA: GBM data (figure labels: Batch 16, 20; Batch 16)
GBM:Affy
GBM:Agilent
GBM:Exon
Tests for batch effects using CR:GBM
Q-values for testing batch effects in GBM data
PCA-Ovarian data Batch 9, 11
Ovarian-Affymetrix
Ovarian-Agilent
Ovarian-Exon Batch 9, 11
Tests for batch effects using CR:OV
Q-values for batch effects in OV data
Batch effects in unified OV gene expression data (figure panels: Unadjusted Affy U133a Data; Unified Gene Expression Data)
Adjustment of Batch Effects
• Empirical Bayes (ComBat)
  – Parametric prior (EBP)
  – Nonparametric prior (EBNP)
• Median Polish
  – Overall (MP)
  – Within each batch (MPB)
• ANOVA
  – Naïve ANOVA (AN)
  – With variance shrinkage (WAN)
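To make the within-batch idea concrete, here is a rough sketch of Tukey's median polish applied separately to each batch of a genes × samples matrix. The presentation does not spell out how MPB was implemented, so treating it as "run a median polish on each batch's sub-matrix and keep the residuals" is an assumption, as are the function names, the sweep order, the fixed iteration count, and the toy data.

```python
from statistics import median


def median_polish(matrix, n_iter=10):
    """Tukey median polish: alternately sweep out column and row medians
    and return the residuals."""
    resid = [list(row) for row in matrix]
    n_cols = len(resid[0])
    for _ in range(n_iter):
        # Sweep out column (sample) medians.
        for c in range(n_cols):
            m = median(row[c] for row in resid)
            for row in resid:
                row[c] -= m
        # Sweep out row (gene) medians.
        for row in resid:
            m = median(row)
            for k in range(n_cols):
                row[k] -= m
    return resid


def polish_within_batches(matrix, batches):
    """matrix: genes x samples; batches: one batch label per sample (column)."""
    adjusted = [[0.0] * len(batches) for _ in matrix]
    for label in sorted(set(batches)):
        cols = [c for c, b in enumerate(batches) if b == label]
        sub = [[row[c] for c in cols] for row in matrix]
        for g, row in enumerate(median_polish(sub)):
            for k, c in enumerate(cols):
                adjusted[g][c] = row[k]
    return adjusted


# Toy data (invented): 3 genes x 6 samples; batch B is shifted for genes 0 and 2.
matrix = [
    [5.0, 5.2, 4.9, 7.1, 6.8, 7.0],
    [3.0, 3.1, 2.8, 3.0, 3.3, 2.9],
    [8.0, 7.7, 8.1, 6.1, 5.9, 6.2],
]
batches = ["A", "A", "A", "B", "B", "B"]

adjusted = polish_within_batches(matrix, batches)
for row in adjusted:
    print([round(v, 2) for v in row])
```

Because the last sweep removes row medians, each gene ends up with a median of zero within every batch, which removes batch-specific offsets but also any genuine biological difference between batches.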
Batch effect adjustment
GBM:Affy
GBM Agilent
GBM:Exon data
Effect of batch effects adjustment on gene expression
Assessment of batch effects with adjustments: CDF of p-values from the permutation tests of batch effect in TCGA ovarian cancer Affymetrix gene expression data, batch 9 and 11-15. [Figure: cumulative probability (0.0 to 1.0) against p-value (0.0 to 1.0), with a reference line at p-value = 0.05; curves for U, EBP, EBNP, MP, MPB, AN, and WAN; the unadjusted-data curve is annotated.]
Assessment of batch effects with adjustments: CDF of p-values from the permutation tests of batch effect in TCGA ovarian cancer Affymetrix gene expression data. [Figure: same axes, reference line, and curves; the unadjusted-data and MPB-adjusted-data curves are annotated.]
Assessment of batch effects with adjustments (OV): CDF of p-values from the permutation tests of batch effect in TCGA ovarian cancer Affymetrix gene expression data, batch 9 and 11 to 15. [Figure: same axes, reference line, and curves; the unadjusted-data and MPB-adjusted-data curves are annotated.]
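Curves like these are empirical cumulative distribution functions of permutation-test p-values. The sketch below shows how such a curve could be tabulated; the p-values are made up for illustration and are not the TCGA results, and `ecdf` is a hypothetical helper name.

```python
def ecdf(p_values, grid):
    """Fraction of p-values at or below each threshold in grid."""
    n = len(p_values)
    return [sum(p <= t for p in p_values) / n for t in grid]


# Made-up p-values from hypothetical pairwise batch comparisons.
p_unadjusted = [0.001, 0.004, 0.01, 0.02, 0.03, 0.2, 0.5, 0.9]
p_adjusted = [0.04, 0.15, 0.3, 0.45, 0.6, 0.7, 0.85, 0.95]

grid = [0.05, 0.5, 1.0]
print("unadjusted:", ecdf(p_unadjusted, grid))  # [0.625, 0.875, 1.0]
print("adjusted:  ", ecdf(p_adjusted, grid))    # [0.125, 0.5, 1.0]
```

The value of the curve at 0.05 is the fraction of comparisons that would be called significant at that level. When no batch effect is present, p-values are expected to be roughly uniform, so the curve should lie close to the diagonal.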
Association of Clinical Outcomes with Batches: overall survival by batch (TCGA ovarian cancer data). [Figure: two panels, each labeled Batch 9; reported p-values P<0.001 and P=0.018.]
Implications
• Assessments based on the correlation of correlations (CR) index can be used to identify batch effects in TCGA data. This is complemented by principal component analysis and hierarchical clustering.
• Batch effects exist in TCGA GBM and ovarian cancer data.
• Be cautious when adjusting for batch effects.
  – The batch differences may be technical or biological
    • We do not want to correct biological differences
    • We do want to correct technical differences (bias)
  – Some methods may over-massage the data
• The impact of batch effects on clinical predictions from the data remains to be determined.
Thank you!