SLIDE 1

Model Misspecification due to Site Specific Rate Heterogeneity: how is tree inference affected?

Stephen Crotty

School of Mathematical Sciences, University of Adelaide

October, 2013

Stephen Crotty (School of Math. Sci.) Model Misspecification due to SSRH October, 2013 1 / 21

SLIDE 2

What is Site Specific Rate Heterogeneity (SSRH)?

[Figure: four-taxon tree on A, B, C, D; branch labels 0.1, 0.4, 0.4, 0.4, 0.4]

The model contains 3 site types:

  • Invariable sites
  • Variable sites
  • Switching sites

[Figure, shown alongside the switching sites: four-taxon tree on A, B, C, D; branch labels 0.1, 0.4, 0.3, 0.3, 0.1, 0.1, 0.4]

SLIDE 6

Why should we care about SSRH?

[Images: Tasmanian Pygmy Possum, Tasmanian Native Hen, Tasmanian Devil]

What’s up Doc? Devil Facial Tumour Syndrome

SLIDE 15

Experimental Procedure

1. Data was simulated using the program LineageSpecificSeqgen¹.
2. The Phylip² software package was used to perform tree inference using the maximum parsimony (MP), neighbour joining (NJ) and maximum likelihood (ML) methods.
3. A theoretical analysis of each method was carried out in an effort to understand their performance.

¹ Source: L. Shavit Grievink, D. Penny, M. D. Hendy, and B. R. Holland. BMC Evolutionary Biology, 8:317, 2008.
² http://evolution.genetics.washington.edu/phylip/

SLIDE 18

Simulation Parameters

[Figure: four-taxon tree on A, B, C, D; branch labels 0.1, 0.4, 0.3, 0.3, 0.1, 0.1, 0.4]

  • pinv = 80%
  • pvar = 20%
  • pswitch = 0, 1, 2, . . . , 100%
  • 100,000 base pairs
  • Jukes-Cantor substitution model
  • 100 replications

SLIDE 21

Maximum Parsimony

Site pattern analysis predicts the asymptotic failure point of MP to be pswitch = 26.56%.

[Plot: Correct Tree Inferred (%) against pswitch (%)]

SLIDE 23

Neighbour Joining

[Plot: Correct Tree Inferred (%) against pswitch (%), with curves for MP and NJ]

SLIDE 24

Neighbour Joining - why the recovery?

The neighbour joining algorithm

r = number of taxa.
$D_{ij}$ = JC distance between taxa i and j.

$Q_{ij} = (r - 2) D_{ij} - \sum_{k=1}^{r} D_{ik} - \sum_{k=1}^{r} D_{jk}$

Q is the matrix used by the NJ algorithm: the pair of taxa with the smallest $Q_{ij}$ are joined together and the process is repeated.
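The Q matrix definition can be sketched in a few lines of Python. This is a minimal illustration, not the Phylip implementation: the function names `nj_q_matrix` and `first_join` and the example distances are assumptions made for the sketch, and only the first joining step is shown.

```python
def nj_q_matrix(D):
    """Q_ij = (r - 2) * D_ij - sum_k D_ik - sum_k D_jk for a symmetric distance matrix D."""
    r = len(D)                                   # number of taxa
    row_sums = [sum(row) for row in D]           # sum_k D_ik for each taxon i
    return [[(r - 2) * D[i][j] - row_sums[i] - row_sums[j] if i != j else 0.0
             for j in range(r)] for i in range(r)]


def first_join(D):
    """Return the pair (i, j), i < j, with the smallest Q_ij."""
    Q = nj_q_matrix(D)
    r = len(D)
    pairs = [(i, j) for i in range(r) for j in range(i + 1, r)]
    return min(pairs, key=lambda ij: Q[ij[0]][ij[1]])


# Hypothetical distances (taxa order A, B, C, D), additive on the tree AB|CD
# with terminal branches 0.4 and internal branch 0.1.
D = [
    [0.0, 0.8, 0.9, 0.9],
    [0.8, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.8],
    [0.9, 0.9, 0.8, 0.0],
]
print(first_join(D))
```

On four taxa the two cherries of a tree always have equal Q values, so joining A with B or C with D describes the same topology AB|CD.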

SLIDE 25

The Q matrix for a 4-taxa tree

$Q_{AB} = (4 - 2) D_{AB} - \sum_{k \in \{B,C,D\}} D_{Ak} - \sum_{k \in \{A,C,D\}} D_{Bk} = -(D_{AC} + D_{AD} + D_{BC} + D_{BD})$

Similarly,

$Q_{AD} = -(D_{AB} + D_{AC} + D_{BD} + D_{CD})$

and,

$Q_{AC} = -(D_{AB} + D_{AD} + D_{BC} + D_{CD})$

SLIDE 26

Digression - what tree might we infer?

$\min(Q_{AB}, Q_{AD}, Q_{AC}) = Q_{AB} \implies$ AB|CD

$\min(Q_{AB}, Q_{AD}, Q_{AC}) = Q_{AD} \implies$ AD|BC

$\min(Q_{AB}, Q_{AD}, Q_{AC}) = Q_{AC} \implies$ AC|BD

Dropping AC|BD, which is never inferred here, leaves two cases:

$Q_{AB} < Q_{AD} \implies$ AB|CD

$Q_{AD} < Q_{AB} \implies$ AD|BC

SLIDE 28

The Q matrix for a 4-taxa tree

The correct tree (AB|CD) will be inferred given the condition:

$Q_{AB} < Q_{AD} \implies 0 < Q_{AD} - Q_{AB} \implies 0 < D_{AD} + D_{BC} - D_{AB} - D_{CD}$

We now define $C = D_{AD} + D_{BC} - D_{AB} - D_{CD}$ so that the correct tree will be inferred when C > 0.
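As a quick check of this condition, the sketch below computes C and the two Q values from the six pairwise distances and confirms that C equals Q_AD − Q_AB. The function names and the example distances are hypothetical, chosen only to illustrate the sign rule.

```python
def q_values(d):
    """Closed-form Q_AB and Q_AD for four taxa; d maps pair names like 'AB' to distances."""
    q_ab = -(d["AC"] + d["AD"] + d["BC"] + d["BD"])
    q_ad = -(d["AB"] + d["AC"] + d["BD"] + d["CD"])
    return q_ab, q_ad


def critical_quantity(d):
    """C = D_AD + D_BC - D_AB - D_CD."""
    return d["AD"] + d["BC"] - d["AB"] - d["CD"]


def preferred_tree(d):
    """AB|CD when C > 0, AD|BC when C < 0."""
    c = critical_quantity(d)
    if c > 0:
        return "AB|CD"
    if c < 0:
        return "AD|BC"
    return "tie"


# Hypothetical distances, additive on AB|CD (terminal branches 0.4, internal branch 0.1).
d = {"AB": 0.8, "CD": 0.8, "AC": 0.9, "AD": 0.9, "BC": 0.9, "BD": 0.9}
q_ab, q_ad = q_values(d)
print(preferred_tree(d), critical_quantity(d), q_ad - q_ab)
```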

SLIDE 29

Deriving the expected value of C

T = the tree topology
$P_{ij}$ = the proportion of differing sites between taxa i and j

$E[P_{ij}] = f(p_{\text{switch}}, T)$

$E[D_{ij}] = -\frac{3}{4} \ln\left(1 - \frac{4}{3} E[P_{ij}]\right)$

$E[C] = E[D_{AD}] + E[D_{BC}] - E[D_{AB}] - E[D_{CD}]$
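The last two formulas translate directly into code. In this sketch the expected proportions E[P_ij] are passed in as plain numbers, because the function f(pswitch, T) is not given on the slide; `jc_distance` and `expected_c` are illustrative names, not part of any package.

```python
import math


def jc_distance(p):
    """Jukes-Cantor distance -3/4 * ln(1 - 4/3 * p) for a proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the JC correction to be defined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def expected_c(p_ad, p_bc, p_ab, p_cd):
    """E[C] = E[D_AD] + E[D_BC] - E[D_AB] - E[D_CD], with each E[D_ij] obtained from E[P_ij]."""
    return (jc_distance(p_ad) + jc_distance(p_bc)
            - jc_distance(p_ab) - jc_distance(p_cd))


# Hypothetical expected proportions, for illustration only.
print(expected_c(p_ad=0.50, p_bc=0.50, p_ab=0.45, p_cd=0.45))
```

A positive result corresponds to the region where NJ is expected to recover AB|CD, and a negative result to the region where it is not.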

SLIDE 30

Expected value of C

[Plot: NJ critical quantity E[C] against pswitch (%); vertical axis from −0.01 to 0.01]

SLIDE 31

Neighbour Joining

[Plot: Correct Tree Inferred (%) against pswitch (%), with three regions marked in order: E[C] > 0, E[C] < 0, E[C] > 0]

SLIDE 32

Maximum Likelihood

[Plot: Correct Tree Inferred (%) against pswitch (%), with curves for MP, NJ and ML]

SLIDE 33

Why is this important?

  • Traditional methods of phylogenetic inference may be compromised by SSRH.
  • Diagnostic tools need to be developed to help identify the presence and extent of SSRH in sequence data.
  • Data-driven model checking will be the focus of my PhD going forward.

SLIDE 36

Acknowledgements

I would like to thank my supervisory team for their input and guidance:

  • Prof. Nigel Bean - University of Adelaide
  • Dr Lars Jermiin - CSIRO
  • Dr Barbara Holland - University of Tasmania
  • Dr Jono Tuke - University of Adelaide

SLIDE 37

That’s all folks!

Questions?

SLIDE 38

Why is the AC|BD tree never inferred? I’m glad you asked!

$Q_{AB} - Q_{AC} = D_{AB} + D_{CD} - D_{AC} - D_{BD}$

[Figure: each of the four distances drawn as a path on the tree A, B, C, D; the graphical reduction ends with the difference being ≤ 0]
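One way to read the graphical argument is the following worked step. It rests on an assumption that is not stated on the slide: that the four distances are additive on a tree with split AB|CD, with terminal branch lengths a, b, c, d and internal branch length t ≥ 0 (these symbols are introduced here for illustration).

```latex
\begin{align*}
D_{AB} + D_{CD} &= (a + b) + (c + d), \\
D_{AC} + D_{BD} &= (a + t + c) + (b + t + d), \\
Q_{AB} - Q_{AC} &= D_{AB} + D_{CD} - D_{AC} - D_{BD} = -2t \le 0 .
\end{align*}
```

Under that assumption Q_AC is never strictly smaller than Q_AB, so AC|BD is never the preferred join.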

SLIDE 43

How was the MP crash point derived? I’m glad you asked!

Parsimony length of each site pattern (taxa in the order A, B, C, D):

Site pattern    Correct tree    Incorrect tree
xxxx            0               0
xxxy            1               1
xxyx            1               1
xyxx            1               1
yxxx            1               1
xxyy            1               2
xyyx            2               1
xyxy            2               2
xxyz            2               2
xyzx            2               2
xyxz            2               2
yxxz            2               2
yxzx            2               2
yzxx            2               2
wxyz            3               3

Consider site pattern xxyy (A = x, B = x, C = y, D = y):

[Figure: the pattern on the correct tree AB|CD, then placed on the incorrect tree AD|BC with A = x, D = y, B = x, C = y]

$P(xxyy) = f(T)$

$P(xyyx) = g(p_{\text{switch}}, T)$

The failure point of MP is given by finding pswitch such that: P(xxyy) = P(xyyx)
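The table entries can be reproduced with a small Fitch-parsimony sketch for four taxa. The helper names `cherry` and `quartet_length` are illustrative, and the two trees are taken to be AB|CD (correct) and AD|BC (incorrect), as on the earlier slides.

```python
def cherry(a, b):
    """Fitch step for two leaves: return (candidate state set, number of changes)."""
    if a == b:
        return {a}, 0
    return {a, b}, 1


def quartet_length(pattern, split):
    """Parsimony length of a 4-character site pattern (taxa order A, B, C, D).

    split names the two cherries of the quartet tree, e.g. ("AB", "CD") for AB|CD.
    """
    state = dict(zip("ABCD", pattern))
    left, right = split
    s1, c1 = cherry(state[left[0]], state[left[1]])
    s2, c2 = cherry(state[right[0]], state[right[1]])
    # One extra change falls on the internal branch when the two sets share no state.
    return c1 + c2 + (0 if s1 & s2 else 1)


patterns = ["xxxx", "xxxy", "xxyx", "xyxx", "yxxx", "xxyy", "xyyx", "xyxy",
            "xxyz", "xyzx", "xyxz", "yxxz", "yxzx", "yzxx", "wxyz"]
for p in patterns:
    print(p, quartet_length(p, ("AB", "CD")), quartet_length(p, ("AD", "BC")))
```

Only xxyy and xyyx score differently on the two trees, which is why the comparison of P(xxyy) with P(xyyx) decides which tree MP prefers.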

SLIDE 49

How was the MP crash point derived? I’m glad you asked!

[Plot: Proportion of Sites against pswitch (%), with curves for P(xxyy) and P(xyyx); vertical axis from 0.010 to 0.020]