Biweight Correlation as a Measure of Distance between Genes on a Microarray


Aya Mitani, Pitzer College ’06. Advisor: Professor Johanna Hardin, Pomona College. April 29, 2006.


SLIDE 1

Biweight Correlation as a Measure of Distance between Genes on a Microarray

Aya Mitani, Pitzer College ’06
Advisor: Professor Johanna Hardin, Pomona College
April 29, 2006

SLIDE 2

About microarray

  • Small chip
  • Contains thousands of probes
  • Measures mRNA activity in a particular cell type
  • Contains control and treatment samples
  • Expression level is measured from light intensity

SLIDE 3

[Figure only; no text recovered]

SLIDE 4

[Figure only; no text recovered]

SLIDE 5

Problems with microarray data

  • Noisy data
  • Needs a robust estimate of correlation
  • Pearson correlation is often used
  • A single outlier can greatly affect the Pearson correlation
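
The outlier problem can be seen in a small sketch. The expression values below are hypothetical (they are not from the talk), and `pearson` is a plain textbook implementation written for this illustration.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical expression values for two genes that move together exactly.
gene_a = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
gene_b = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
r_clean = pearson(gene_a, gene_b)

# Corrupt a single measurement of gene_b with a far-off value.
gene_b_noisy = gene_b[:-1] + [-5.0]
r_noisy = pearson(gene_a, gene_b_noisy)

print(f"clean: {r_clean:.3f}, with one outlier: {r_noisy:.3f}")
```

With these made-up numbers, one corrupted value out of eight is enough to turn a perfect positive correlation into a negative one.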

SLIDE 6

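Last summer: M-estimation

A weighted average, with points farther from the center given less weight:

$$d_i = \sqrt{(x_i - \tilde{\mu})' \, \tilde{\Sigma}^{-1} (x_i - \tilde{\mu})} \tag{1}$$

$$\tilde{\mu} = \frac{\sum_i w(d_i)\, x_i}{\sum_i w(d_i)} \tag{2}$$

$$\tilde{\Sigma} = \frac{\sum_i w(d_i)\,(x_i - \tilde{\mu})(x_i - \tilde{\mu})'}{\sum_i w(d_i)} \tag{3}$$

Tukey’s biweight:

$$w(d_i) = \begin{cases} \left(1 - (d_i/c)^2\right)^2 & d_i \le c \\ 0 & d_i > c \end{cases}$$

Use the Minimum Covariance Determinant (MCD) for the initial estimates of $\mu$ and $\Sigma$.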
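
One reweighting pass through equations (1) to (3) can be sketched for two genes as follows. This is a minimal illustration under stated assumptions, not the talk's implementation: the weight is taken as (1 − (d/c)²)² inside the cutoff, which equals 1 at the center and decreases with distance; the cutoff `c = 4.0`, the data, and the starting center and scatter are hypothetical (the talk starts from MCD, which is not computed here).

```python
import math

def biweight_weight(d, c):
    """Assumed biweight weight: 1 at d = 0, decreasing to 0 at d >= c."""
    if d >= c:
        return 0.0
    return (1.0 - (d / c) ** 2) ** 2

def mahalanobis(pt, mu, cov):
    """Equation (1) for a 2-D point; cov is ((sxx, sxy), (sxy, syy))."""
    (sxx, sxy), (_, syy) = cov
    det = sxx * syy - sxy * sxy
    dx, dy = pt[0] - mu[0], pt[1] - mu[1]
    # Quadratic form with the closed-form inverse of a 2x2 matrix.
    q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(max(q, 0.0))

def reweight_once(points, mu, cov, c):
    """One pass through equations (1)-(3); returns (weights, mu, cov)."""
    w = [biweight_weight(mahalanobis(p, mu, cov), c) for p in points]
    total = sum(w)
    mx = sum(wi * p[0] for wi, p in zip(w, points)) / total
    my = sum(wi * p[1] for wi, p in zip(w, points)) / total
    sxx = sum(wi * (p[0] - mx) ** 2 for wi, p in zip(w, points)) / total
    syy = sum(wi * (p[1] - my) ** 2 for wi, p in zip(w, points)) / total
    sxy = sum(wi * (p[0] - mx) * (p[1] - my) for wi, p in zip(w, points)) / total
    return w, (mx, my), ((sxx, sxy), (sxy, syy))

# Hypothetical data: five points near the line y = x and one gross outlier.
points = [(0.0, 0.1), (0.5, 0.4), (1.0, 1.1), (-0.5, -0.6), (-1.0, -0.9), (6.0, -6.0)]
start_mu = (0.0, 0.0)                    # assumed starting center
start_cov = ((1.0, 0.0), (0.0, 1.0))     # assumed starting scatter (identity)

weights, mu, cov = reweight_once(points, start_mu, start_cov, c=4.0)
print([round(w, 3) for w in weights])
```

The outlier lies beyond the cutoff and receives weight zero, so it contributes nothing to the updated center and scatter.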

SLIDE 7

Plot of the biweight weight function (w)

[Figure: weight (0.0 to 1.0) plotted against distance (1 to 4)]

SLIDE 8

Biweight Correlation Coefficient

$$\mathrm{bwc}_{jk} = \frac{\sigma_{jk}}{\sqrt{\sigma_{jj}\,\sigma_{kk}}}$$

where $\sigma_{jk}$ is the biweight estimate of the covariance of gene $j$ and gene $k$, and $\sigma_{jj}$ is the biweight estimate of the variance of gene $j$.

Goal: measure the correlation (similarity or difference) between two genes.

SLIDE 9

[Figure: biweight correlation plotted against Pearson correlation]

SLIDE 10

[Figure: two scatter plots of gene pairs, Gene 14 with Gene 86 and Gene 26 with Gene 11]

SLIDE 11

Further work to be done

  • Computational time
  • Biweight correlation on clean data

SLIDE 12

This Spring

  • Matrix correlation vs Pair by pair correlation
  • One-step M-estimation
  • Median vs MCD
  • Biweight correlation good for clean data?

SLIDE 13

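Instead of computing the correlation pair by pair, compute the whole correlation matrix at once from the biweight covariance matrix:

$$d_i = \sqrt{(x_i - \tilde{\mu})' \, \tilde{\Sigma}^{-1} (x_i - \tilde{\mu})} \tag{4}$$

$$\tilde{\mu} = \frac{\sum_i w(d_i)\, x_i}{\sum_i w(d_i)} \tag{5}$$

$$\tilde{\Sigma} = \frac{\sum_i w(d_i)\,(x_i - \tilde{\mu})(x_i - \tilde{\mu})'}{\sum_i w(d_i)} \tag{6}$$

$$\begin{pmatrix} \text{mat.bwc}_{11} & \cdots & \text{mat.bwc}_{1n} \\ \text{mat.bwc}_{21} & \cdots & \text{mat.bwc}_{2n} \\ \vdots & \ddots & \vdots \\ \text{mat.bwc}_{n1} & \cdots & \text{mat.bwc}_{nn} \end{pmatrix} = \begin{pmatrix} \sqrt{\sigma_{11}} & & \\ & \ddots & \\ & & \sqrt{\sigma_{nn}} \end{pmatrix}^{-1} \begin{pmatrix} \sigma_{11} & \cdots & \sigma_{1n} \\ \sigma_{21} & \cdots & \sigma_{2n} \\ \vdots & \ddots & \vdots \\ \sigma_{n1} & \cdots & \sigma_{nn} \end{pmatrix} \begin{pmatrix} \sqrt{\sigma_{11}} & & \\ & \ddots & \\ & & \sqrt{\sigma_{nn}} \end{pmatrix}^{-1}$$

Is $\text{mat.bwc}_{jk} = \text{bwc}_{jk}$?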
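
The scaling step on this slide, turning a covariance matrix into a correlation matrix, can be sketched as below. The 3-gene covariance matrix is hypothetical and `cov_to_corr` is a helper named for this illustration. Entry (j, k) of the result is the covariance divided by the product of the two standard deviations, so the scaling itself cannot make the matrix version differ from the pairwise one; any difference would have to come from the covariance estimates, since the weights depend on distances computed over all genes jointly rather than two at a time.

```python
import math

def cov_to_corr(cov):
    """Scale a covariance matrix to a correlation matrix.

    Computes D^{-1} S D^{-1}, where D is the diagonal matrix of
    standard deviations sqrt(s_jj).
    """
    n = len(cov)
    sd = [math.sqrt(cov[j][j]) for j in range(n)]
    return [[cov[j][k] / (sd[j] * sd[k]) for k in range(n)] for j in range(n)]

# Hypothetical covariance matrix for three genes (symmetric, positive definite).
cov = [
    [4.0, 2.0, 0.6],
    [2.0, 9.0, -1.5],
    [0.6, -1.5, 1.0],
]
corr = cov_to_corr(cov)
for row in corr:
    print([round(v, 4) for v in row])
```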

SLIDE 14

[Figure: matrix correlation plotted against pair-by-pair correlation, 10 genes]

SLIDE 15

One-step M-estimation

[Figure: one-step estimates plotted against converged estimates, 20 genes]

Converged M-estimation took 10 to 25 iterations on average (11 seconds to compute 190 pairs of genes).
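
The trade-off between a fixed number of steps and full convergence can be sketched with a step cap on the iteration. Everything here is an assumption made for illustration: the weight form (1 − (d/c)²)², the cutoff `c = 4.0`, the tolerance, the synthetic data, and the use of a median/MAD start in place of MCD to keep the sketch free of dependencies. Timings from the talk are not reproduced.

```python
import math
from statistics import median

def weight(d, c):
    """Assumed biweight weight: 1 at d = 0, falling to 0 at d >= c."""
    return 0.0 if d >= c else (1.0 - (d / c) ** 2) ** 2

def distance(p, mu, cov):
    """Mahalanobis distance of a 2-D point; cov is ((sxx, sxy), (sxy, syy))."""
    (sxx, sxy), (_, syy) = cov
    det = sxx * syy - sxy * sxy
    dx, dy = p[0] - mu[0], p[1] - mu[1]
    q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(max(q, 0.0))

def mad(values):
    """Median absolute deviation around the median."""
    m = median(values)
    return median([abs(v - m) for v in values])

def biweight_corr(points, c=4.0, max_steps=100, tol=1e-8):
    """Return (correlation, steps used) after at most max_steps reweightings."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mu = (median(xs), median(ys))
    cov = ((mad(xs) ** 2, 0.0), (0.0, mad(ys) ** 2))   # assumed diagonal start
    steps = 0
    while steps < max_steps:
        w = [weight(distance(p, mu, cov), c) for p in points]
        total = sum(w)
        if total == 0.0:        # every point is beyond the cutoff; stop
            break
        mx = sum(wi * p[0] for wi, p in zip(w, points)) / total
        my = sum(wi * p[1] for wi, p in zip(w, points)) / total
        sxx = sum(wi * (p[0] - mx) ** 2 for wi, p in zip(w, points)) / total
        syy = sum(wi * (p[1] - my) ** 2 for wi, p in zip(w, points)) / total
        sxy = sum(wi * (p[0] - mx) * (p[1] - my) for wi, p in zip(w, points)) / total
        change = max(abs(mx - mu[0]), abs(my - mu[1]), abs(sxx - cov[0][0]),
                     abs(syy - cov[1][1]), abs(sxy - cov[0][1]))
        mu, cov = (mx, my), ((sxx, sxy), (sxy, syy))
        steps += 1
        if change < tol:        # estimates stopped moving: treat as converged
            break
    (sxx, sxy), (_, syy) = cov
    return sxy / math.sqrt(sxx * syy), steps

# Synthetic pair of genes: 21 points near y = x plus one gross outlier.
inliers = [(i / 10, i / 10 + (0.1 if i % 2 == 0 else -0.1)) for i in range(-10, 11)]
points = inliers + [(4.0, -4.0)]

r_converged, steps_used = biweight_corr(points)
r_one_step, _ = biweight_corr(points, max_steps=1)
r_five_step, _ = biweight_corr(points, max_steps=5)
print(steps_used, round(r_one_step, 4), round(r_five_step, 4), round(r_converged, 4))
```

Capping `max_steps` at 1, 3, 5, or 10 gives the one-step and few-step variants; leaving the cap high lets the loop stop on the tolerance instead.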

SLIDE 16

Few-step

[Figure: 3-step estimates plotted against converged estimates, 20 genes; 3.5 seconds]

[Figure: 5-step estimates plotted against converged estimates, 20 genes; 5.5 seconds]

SLIDE 17

[Figure: scatter plot of Gene 11 and Gene 18]

SLIDE 18

10-step

[Figure: 10-step estimates plotted against converged estimates, 20 genes; 8 seconds]

SLIDE 19

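Median instead of MCD

  • Median for $\tilde{\mu}$
  • Median absolute deviation (MAD) for $\tilde{\Sigma}$

$$\mathrm{MAD}(X) = \operatorname{median}_i \left| x_i - \operatorname{median}_j(x_j) \right|$$

If the iteration converges, the starting estimate makes no difference.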
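
A median and MAD starting estimate for two genes might look like the sketch below. The slide does not say how the full matrix is built from the MADs, so the diagonal form (squared MADs on the diagonal, zero off the diagonal) is an assumption, as are the helper names and the data.

```python
from statistics import median

def mad(values):
    """Median absolute deviation as defined on the slide (no scaling constant)."""
    m = median(values)
    return median([abs(v - m) for v in values])

def median_mad_start(points):
    """Starting center and scatter for a list of 2-D points.

    Assumption: the scatter is diagonal, with squared MADs on the diagonal.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mu = (median(xs), median(ys))
    cov = ((mad(xs) ** 2, 0.0), (0.0, mad(ys) ** 2))
    return mu, cov

# Hypothetical data: four well-behaved points and one gross outlier.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0), (100.0, -50.0)]
mu, cov = median_mad_start(points)
print(mu, cov)
```

Both the median and the MAD ignore how far the outlier lies from the rest, which is what makes them usable as a cheap robust start.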

SLIDE 20

[Figure: converged estimates from the median start plotted against converged estimates from the MCD start, 20 genes; 7 seconds]

SLIDE 21

Few-step median

[Figure: median 3-step estimates plotted against MCD converged estimates, 20 genes; 1.5 seconds]

[Figure: median 5-step estimates plotted against MCD converged estimates, 20 genes; 2.5 seconds]

SLIDE 22

[Figure: scatter plot of Gene 17 and Gene 7]

SLIDE 23

10-step median

[Figure: median 10-step estimates plotted against MCD converged estimates, 20 genes; 5 seconds]

SLIDE 24

10-step median vs. 5-step MCD

[Figure: median 10-step estimates plotted against MCD converged estimates, 20 genes; 5 seconds]

[Figure: 5-step estimates plotted against converged estimates, 20 genes; 5.5 seconds]

SLIDE 25

Biweight correlation on clean data

How biased and how variable is it compared to Pearson correlation?

[Figure: distributions of Pearson correlation (axis 0.6 to 1.0) and biweight correlation (axis 0.5 to 1.0). Values printed on the figure: 0.7636, 0.8482, 0.7850, 0.7523, 0.8541, 0.7945]
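
A question like this is usually answered by simulation. The sketch below is one possible harness, not the study from the talk: it draws clean bivariate normal samples with a known correlation and reports the mean and standard deviation of an estimator across replicates. The true correlation 0.8, the sample size, the number of replicates, and the seed are all hypothetical, and only Pearson correlation is run; any other estimator taking two sequences could be passed as `estimator` for comparison.

```python
import math
import random
from statistics import mean, stdev

def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def simulate(estimator, rho, n, reps, seed=0):
    """Mean and standard deviation of an estimator on clean normal data."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        xs, ys = [], []
        for _ in range(n):
            z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
            xs.append(z1)
            # Mixing z1 and z2 gives a pair with correlation rho.
            ys.append(rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        estimates.append(estimator(xs, ys))
    return mean(estimates), stdev(estimates)

avg, spread = simulate(pearson, rho=0.8, n=50, reps=200)
print(f"mean estimate: {avg:.3f}, standard deviation: {spread:.3f}")
```

Bias shows up as the gap between the mean estimate and the true correlation, and variability as the standard deviation across replicates.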

SLIDE 26

What makes the difference?

[Figure: biweight correlation plotted against Pearson correlation for multivariate normal data]

SLIDE 27

[Figure: four scatter plots of row pairs, each titled with the difference bw − pearson: row2 and row11 (0.1166), row2 and row16 (0.1108), row6 and row17 (0.0523), row5 and row15 (0.0003)]

SLIDE 28

Concluding remarks

  • Biweight correlation is unbiased and about as variable as Pearson correlation
  • Initializing $\tilde{\mu}$ and $\tilde{\Sigma}$ with the median and median absolute deviation is as robust as using MCD estimators
  • Initializing $\tilde{\mu}$ and $\tilde{\Sigma}$ with the median and median absolute deviation is faster than using MCD estimators
  • Depending on how robust the result needs to be, computation time can be shortened by limiting the number of iterations
  • Generally, 5 or more iterations are recommended
