Genomic Exploration of the Hemiascomycetous Yeasts



SLIDE 1

Genomic Exploration of the Hemiascomycetous Yeasts

3rd Workshop on Algorithms in Bioinformatics, Moscow, 2008-10-08

David J. Sherman

U. Bordeaux, France

LaBRI CNRS & INRIA team “MAGNOME”

SLIDE 2

Comparative genomics

Is certainly about comparison

But is also about the genomes

  • Which ones do we sequence?
  • And what do we do after that?

SLIDE 3

A caricature

[Diagram labels: Data; Push button; Results]

  • Algorithmic, otherwise not interesting
  • Biological, otherwise not interesting
  • A solved problem
  • The hard part (twice)

SLIDE 4

Hemiascomycetous yeasts

Eukaryotic genomes

Small and compact

Experimental model

Biotechnological interest

  • beer, wine, bread
  • assimilate hydrocarbons, tannin extracts
  • hormones and vaccines

Medical interest

Biodiversity

Systems

Understand mechanisms of molecular evolution

Genome redundancy

Ortho-/paralog divergence

Expansion and contraction of universal families

Tandem duplications

Block duplication and rearrangement

Conservation of synteny

SLIDE 5

Comparison of evolutionary range of Hemiascomycetes and Chordates

[Figure: species placed on a scale from 50 to 100. Scale: average % of amino-acid identity between complete sets of orthologous proteins.]

Chordates: Homo sapiens, Mus musculus (Mammals); Gallus gallus (Birds); Takifugu rubripes, Tetraodon nigroviridis (Fishes); Ciona intestinalis (Urochordates)

Hemiascomycetes: Saccharomyces cerevisiae, Saccharomyces paradoxus, Saccharomyces uvarum (Saccharomyces sensu stricto); Candida glabrata; Kluyveromyces lactis; Debaryomyces hansenii; Yarrowia lipolytica

Dujon (2006) Trends in Genetics 22: 375-387

SLIDE 6

Génolevures Sequencing Projects

Génolevures 1

  • 13 species, partial 0.2-0.4X
  • Souciet et al 2000 [21 papers] FEBS Letters 487

Génolevures 2

  • 4 species complete 12X
  • Dujon, Sherman et al 2004 Nature 430
  • Sherman et al 2006 NAR 34

Génolevures 3

  • 3 species complete 12X
  • 2 species complete 7-12X

Génolevures 4

  • 4 + 5 + 5 close species, NGS
SLIDE 7

[Figure: species table and evolutionary events. The placement of each event on the figure is not recoverable from the extracted text.]

Species                     Nb of chrom.   Genome Size (Mb)
Yarrowia lipolytica         6              20.5
Candida albicans            8              14.9
Debaryomyces hansenii       7              12.2
Ashbya gossypii             7              9.2
Kluyveromyces lactis        6              10.6
Kluyveromyces waltii        8              10.7
Candida glabrata            13             12.3
Saccharomyces cerevisiae    16             12.1

Events shown:

  • high rate of intron loss
  • whole genome duplication
  • non universal genetic code
  • expansion of sugar-utilisation genes
  • expansion of gene families encoding lipases, extracellular proteases etc...
  • expansion of gene families encoding lipases, extracellular proteases, allantoin and allantoate transporters etc...
  • short centromeres
  • triplicated mating-type cassettes
  • HO endonuclease
  • loss of sex (twice)
  • loss of GAL genes (three times)
  • loss of HO
  • degradation of HO
  • loss of class II Transposons and non-LTR retroposons
  • loss of all active type I retroposons
  • loss of active Ty5
  • retrotransposon labels: Ty4, Ty1 / 2, Ty5, Tca2, Ty3
  • post-duplication gene loss (twice)

Dujon (2006) Trends in Genetics 22: 375-387

SLIDE 8

Genomic data for complete genomes

Complete genomes sequenced by the Génoscope

What is complete?

  • Sequence subtelomere to subtelomere
  • Fully assembled chromosomes
  • Careful manual annotation

What can you do with a complete sequence?

  • Track chromosomal rearrangements
  • Analyze species- or clade-specific gain or loss
  • Measure expansion and contraction of protein families
  • Look for long-range correlations
SLIDE 9

What’s next?

Genome Annotation

  • Magus annotation system
  • Simultaneous annotation of putative homologs

Classification into protein families

  • Consensus ensemble clustering

Comparative maps

  • Discovering synteny
  • Identifying orthologs

And what do we do after that?

SLIDE 10

Let’s avoid teleology

Genomes are thrown together from bits and pieces of things that worked, once

Genome annotation has to reflect reality and not expectations

It is hard, painstaking work

It is not fully automatic

Good tools help

  • R. Greaves
SLIDE 11

The Annotation Process

[Diagram labels: Genomic DNA; Gene models; Protein-coding genes; RNA genes and other elements; Curated genes; Homolog groups; Protein families; Annotated genome; Simultaneous gene annotation; Integration; Systematic comparison and consensus; Predictive methods; Curation updates; Transcript sequencing; Algorithmic sequence analysis; Classification; Complementary analyses; Magus]

Legend: GÉNOLEVURES technique; GÉNOLEVURES result; Predictive methods; External technology; External data source

SLIDE 12

The “big iron”

Production

Redundant, high-availability servers

  • 3 web
  • 1 database
  • Mini-cluster

Storage

  • 11 Tbyte RAID

Dinkum-thinkum

74 cores 4 Gbyte

  • IBM, Dell
  • x86_64

Rocks + bio roll

  • HMMER, NCBI BLAST, ClustalW, EMBOSS, Glimmer, Fasta, MrBayes, Phylip, T_Coffee, MPI-Blast, GROMACS

GenCore 6

[Architecture diagram labels: Web users; Web Service Bus; Genome Browser; U.I. components; Fast browser database; Genomes database; Rule checker; Rules; Computed results KB; Alignments & DB search; In silico predictions]

SLIDE 13

Browsing a genome region

SLIDE 14

Viewing a Locus on a Genome

SLIDE 15

Validating a Gene Model

SLIDE 16

Annotating Homolog Groups

SLIDE 17

Protein families

Multi-species groups of related proteins

Phylogenetic relationship → functional similarity

Diversity of in silico results

Need to calibrate or train methods for different phylogenetic groups

New algorithm for consensus clustering that is efficient in practice

SLIDE 18

What’s the goal?

[Diagram: from Complete genomes to Protein families. Labels: Blast; Smith-Waterman; homeomorphy (twice); E-val threshold; Partition ∏1, ∏2, ∏3, ∏4, ..., ∏n]

SLIDE 19

Reconciling different in silico predictions

[Diagram labels: Proteomes; Blast & SW sequence alignments; Homeomorphic and Nonhomeomorphic Alignments; partition (five times)]

Agreement between partitions

  • Confusion matrix
  • Distance between partitions, that is, a shortest path in a graph of fusions/fissions

NP-complete
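As an illustration of the first bullet, here is a minimal Python sketch of a confusion matrix between two partitions of the same proteins. The function names and the toy data are hypothetical. The second function is a simple pair-counting disagreement measure, used here only as a stand-in: it is not the fusion/fission distance named on the slide.

```python
from collections import Counter


def confusion_matrix(part_a, part_b):
    """Count shared proteins for every pair of clusters, one from each partition.

    part_a, part_b: dicts mapping a protein id to its cluster label.
    Only proteins present in both partitions are counted.
    """
    shared = part_a.keys() & part_b.keys()
    return Counter((part_a[p], part_b[p]) for p in shared)


def pair_disagreements(part_a, part_b):
    """Count protein pairs grouped together in one partition but apart in the other.

    A simple stand-in measure, NOT the fusion/fission distance of the slide.
    """
    shared = sorted(part_a.keys() & part_b.keys())
    count = 0
    for i, p in enumerate(shared):
        for q in shared[i + 1:]:
            same_a = part_a[p] == part_a[q]
            same_b = part_b[p] == part_b[q]
            if same_a != same_b:
                count += 1
    return count


# Hypothetical toy data: five proteins clustered by two methods.
blast = {"p1": "A", "p2": "A", "p3": "B", "p4": "B", "p5": "C"}
sw = {"p1": "X", "p2": "X", "p3": "X", "p4": "Y", "p5": "Z"}

matrix = confusion_matrix(blast, sw)
for (label_a, label_b), n in sorted(matrix.items()):
    print(label_a, label_b, n)
print("disagreeing pairs:", pair_disagreements(blast, sw))
```

A cell with a large count means two clusters largely agree; a row spread over several columns signals a cluster that the other method splits.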

SLIDE 20

Median partitions by consensus clustering

[Diagram labels: Proteomes; Blast & SW sequence alignments; Homeomorphic and Nonhomeomorphic Alignments; Partition ∏1, ∏2, ∏3, ∏4, ..., ∏n; consensus]

Compute a median partition ∏ minimizing [formula missing]
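The objective itself did not survive extraction. For orientation only, a common textbook formulation of a median partition is given below; it is an assumption and may differ from the formula on the original slide. Here d is some distance between partitions (for instance the fusion/fission distance of the previous slide) and n is the number of input partitions.

```latex
\Pi^{*} = \operatorname*{arg\,min}_{\Pi} \sum_{i=1}^{n} d(\Pi, \Pi_i)
```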

SLIDE 21

Construction and algorithm

FRel_{i,j}: encodes the confusion matrix

Define a similarity measure based on the components c_i

R_k: maximal conflict regions

Select c_i in each R_k by MDC (min. disjoint cover)

NP-complete

SLIDE 22

Efficient heuristic

Relaxation: admit inexact cover

(Not all proteins are in families)

Resolve conflicts by election + policy

For each comp. C
    for each ci ∈ C compute Si and Di
    each p votes for ci in order Di ↑ and Si ↓
    take the winning ci in order so as to cover the most proteins p

[Figure labels: Conflict graph; Conflict regions]
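The voting procedure above can be sketched in Python under several assumptions. The slide does not define Si and Di, so they are treated as precomputed scores where a lower Di and a higher Si are better. The function `elect_cover`, the tie-breaking by name, and the toy data are hypothetical. This is one possible reading of the heuristic, not the authors' implementation.

```python
from collections import Counter


def elect_cover(candidates):
    """Pick disjoint candidate groups inside one conflict region by protein votes.

    candidates: dict name -> (members, S, D), where members is a set of
    protein ids and S, D are precomputed scores (assumed: lower D and
    higher S are better).
    Returns (chosen candidate names, set of covered proteins).
    """
    proteins = set().union(*(members for members, _, _ in candidates.values()))

    # Each protein votes for its best candidate: lowest D, then highest S.
    votes = Counter()
    for p in proteins:
        options = [name for name, (members, _, _) in candidates.items() if p in members]
        best = min(options, key=lambda n: (candidates[n][2], -candidates[n][1], n))
        votes[best] += 1

    # Take winners by decreasing vote count, skipping any candidate that
    # overlaps one already accepted. The cover may be inexact.
    chosen, covered = [], set()
    for name, _ in sorted(votes.items(), key=lambda kv: (-kv[1], kv[0])):
        members = candidates[name][0]
        if covered.isdisjoint(members):
            chosen.append(name)
            covered |= members
    return chosen, covered


# Hypothetical conflict region with three overlapping candidates.
region = {
    "c1": ({"p1", "p2", "p3"}, 0.9, 0.1),
    "c2": ({"p3", "p4"}, 0.7, 0.3),
    "c3": ({"p4", "p5"}, 0.8, 0.2),
}
chosen, covered = elect_cover(region)
print(chosen, sorted(covered))
```

Because overlapping candidates are skipped rather than repaired, some proteins can remain uncovered, which matches the relaxation stated on the slide (not all proteins are in families).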

SLIDE 23

family subgroups

SLIDE 24

SLIDE 25

Correlated gain and loss in networks and metabolic pathways

SLIDE 26

Construct a PSSM for each family

[Diagram, "Construction" and "Validation" panels. Labels: Family GL2 fasta; PSI blast; PSSM; Proteomes GL2; PSI blast; Comparison]

TP, TN, FP, FN and worst E-val

4384 families as follows

  • 4240 where FN = 0: FP med 0.0, avg 3.7, max 302; Ev med 6e-78, max 9e-6
  • 144 where FN > 0: FP med 4.5, avg 33, max 307
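The validation counts can be illustrated with a short Python sketch. It assumes that search hits are available as a mapping from protein id to E-value, and that "worst E-val" means the largest E-value among true positives; both are assumptions, and the function name and toy data are hypothetical.

```python
def confusion_counts(family_members, hits, all_proteins):
    """Compare the hits of one family's PSSM against known family membership.

    family_members: set of protein ids belonging to the family
    hits: dict protein id -> E-value for proteins found by the search
    all_proteins: set of every protein id that was searched
    """
    found = set(hits)
    tp = found & family_members                 # members that were found
    fp = found - family_members                 # non-members that were found
    fn = family_members - found                 # members that were missed
    tn = all_proteins - found - family_members  # non-members correctly ignored
    # Assumed meaning of "worst E-val": largest E-value among true positives.
    worst = max((hits[p] for p in tp), default=None)
    return {"TP": len(tp), "FP": len(fp), "FN": len(fn), "TN": len(tn),
            "worst_evalue": worst}


# Hypothetical toy data: six proteins, one family of three.
proteome = {"a", "b", "c", "d", "e", "f"}
family = {"a", "b", "c"}
search_hits = {"a": 1e-80, "b": 3e-40, "d": 2e-6}

result = confusion_counts(family, search_hits, proteome)
print(result)
```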

SLIDE 27

Build a PSSM for each family and use it to improve gene prediction

[Diagram labels: Family GL2 fasta; PSI blast; PSSM*; ORF translations; PSI blast; Candidates; filtering; Per-family size and E-value criteria; Loci assigned to families]

*PSSM: position-specific scoring matrix for PSIBLAST
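The filtering step can be sketched as follows. This assumes that "size" means the length of the ORF translation in residues and that each locus is assigned to the passing family with the lowest E-value. The `Criteria` class, the thresholds, and the family and locus names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Criteria:
    """Hypothetical per-family acceptance thresholds."""
    min_len: int       # shortest acceptable ORF translation, in residues
    max_len: int       # longest acceptable ORF translation, in residues
    max_evalue: float  # weakest acceptable E-value


def assign_loci(candidates, criteria):
    """Filter candidate hits and assign each locus to its best family.

    candidates: iterable of (locus, family, length, evalue) tuples
    criteria: dict family -> Criteria
    Returns a dict locus -> family.
    """
    best = {}
    for locus, family, length, evalue in candidates:
        c = criteria.get(family)
        if c is None:
            continue  # no criteria known for this family
        if not (c.min_len <= length <= c.max_len) or evalue > c.max_evalue:
            continue  # fails the size or E-value criterion
        if locus not in best or evalue < best[locus][1]:
            best[locus] = (family, evalue)
    return {locus: family for locus, (family, _) in best.items()}


# Hypothetical thresholds and candidate hits.
criteria = {
    "GLR.0001": Criteria(200, 400, 1e-20),
    "GLR.0002": Criteria(50, 150, 1e-10),
}
candidates = [
    ("locusA", "GLR.0001", 310, 1e-60),  # passes
    ("locusB", "GLR.0001", 90, 1e-30),   # too short for the family
    ("locusC", "GLR.0002", 120, 1e-5),   # E-value too weak
    ("locusA", "GLR.0002", 100, 1e-12),  # passes, but weaker than the first hit
]
assigned = assign_loci(candidates, criteria)
print(assigned)
```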

SLIDE 28

Comparison with KOGs

Project families on S. cerevisiae

Select intersection and compare

3625 proteins (~2500 families)

  • identities: 1901
  • split: 159 (4 GLS, 42 GLR, 113 GLC)
  • merge: 117 (6 GLS, 70 GLR, 79 GLC)
  • messy: 25 (2 GLS, 13 GLR, 23 GLC)
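One way such a comparison could be computed is sketched below. The slide does not define the four categories, so the sketch assumes these meanings: each connected group of overlapping families is an identity (one family on each side), a split (one family on the first side, several on the second), a merge (several on the first, one on the second), or messy (several on both). The function name and the toy labels are hypothetical.

```python
from collections import defaultdict


def compare_classifications(fam_a, fam_b):
    """Classify how families of two classifications overlap.

    fam_a, fam_b: dicts protein id -> family label.
    Only proteins present in both classifications are compared.
    Returns a dict relation -> number of connected overlap groups.
    """
    shared = fam_a.keys() & fam_b.keys()
    # Bipartite overlap graph: an A-family and a B-family are linked
    # when they share at least one protein.
    adj = defaultdict(set)
    for p in shared:
        a, b = ("A", fam_a[p]), ("B", fam_b[p])
        adj[a].add(b)
        adj[b].add(a)

    seen, counts = set(), defaultdict(int)
    for start in adj:
        if start in seen:
            continue
        # Collect one connected component by depth-first search.
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        n_a = sum(1 for side, _ in comp if side == "A")
        n_b = len(comp) - n_a
        if n_a == 1 and n_b == 1:
            counts["identity"] += 1
        elif n_a == 1:
            counts["split"] += 1
        elif n_b == 1:
            counts["merge"] += 1
        else:
            counts["messy"] += 1
    return dict(counts)


# Hypothetical toy data: six proteins in two classifications.
gl = {"p1": "GLR.1", "p2": "GLR.1", "p3": "GLR.2",
      "p4": "GLR.2", "p5": "GLR.3", "p6": "GLR.4"}
kog = {"p1": "KOG1", "p2": "KOG1", "p3": "KOG2",
       "p4": "KOG3", "p5": "KOG4", "p6": "KOG4"}

relations = compare_classifications(gl, kog)
print(relations)
```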

SLIDE 29

Comparison with KOGs

SLIDE 30

Comparison of GLR.3294 with PIRSF (UniProt) 017196 and 017667

SLIDE 31

Comparison of GLR.3292 with PIRSF 017297 and 016767

SLIDE 32

Comparative maps

Despite similarity in size, gene content, ecological niche, yeast genomes are highly rearranged.

In general, synteny is poorly conserved.

In part:

  • evolutionary distance
  • artifact of WGD

SLIDE 33

Comparative maps

But, if we focus on the Saccharomycetaceae that did not undergo a recent whole genome duplication

Protoploid genomes

  • homogeneous
  • low redundancy
  • less reshuffling

SLIDE 34

SLIDE 35

SLIDE 36

Syntenic homologs are orthologs
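To make the idea concrete, here is a minimal Python sketch that keeps a homolog pair only when it is supported by a neighboring homolog pair in both genomes. The window size, the support rule, the function name, and the toy gene names are assumptions for illustration; the slides do not describe the method actually used.

```python
def syntenic_pairs(order_a, order_b, homologs, window=2):
    """Keep homolog pairs that have a neighboring homolog pair in both genomes.

    order_a, order_b: dicts gene -> (chromosome, index along the chromosome)
    homologs: set of (gene_a, gene_b) pairs
    window: maximum index distance for two genes to count as neighbors
    """
    def near(pos1, pos2):
        return pos1[0] == pos2[0] and abs(pos1[1] - pos2[1]) <= window

    kept = set()
    for ga, gb in homologs:
        for ha, hb in homologs:
            if ha == ga or hb == gb:
                continue  # support must come from different genes on both sides
            if near(order_a[ga], order_a[ha]) and near(order_b[gb], order_b[hb]):
                kept.add((ga, gb))
                break
    return kept


# Hypothetical toy data: a2 has two homologs, but only b2 sits in a
# conserved neighborhood.
order_a = {"a1": ("chr1", 0), "a2": ("chr1", 1), "a3": ("chr1", 2)}
order_b = {"b1": ("chrX", 0), "b2": ("chrX", 1), "b3": ("chrX", 2),
           "b9": ("chrY", 0)}
homologs = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a2", "b9")}

orthologs = syntenic_pairs(order_a, order_b, homologs)
print(sorted(orthologs))
```

The pair (a2, b9) is dropped because no other homolog pair lies near both a2 and b9, while the three collinear pairs support each other.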

SLIDE 37

So, in conclusion

Comparative genomics works if you pay attention to the data

  • High-quality, complete genomes
  • Chosen from interesting phylogenetic groups

Building tools and analyses works if you have a plan

  • Genome annotation
  • Protein families and subgroups
  • Syntenic blocks and common markers

Many opportunities for further work

http://genolevures.org/ ≡ http://cbi.labri.fr/Genolevures/

SLIDE 38

Acknowledgments and support

Bordeaux

  • Macha Nikolski CNRS
  • Tiphaine Martin CNRS
  • Pascal Durrens CNRS
  • David Sherman INRIA
  • Géraldine Jean
  • Hayssam Soueidan
  • Nicolás Loira
  • Adrien Goëffon
  • Julie Bourbeillon
  • Rodrigo Assar

Génolevures

  • Jean-Luc Souciet
  • Bernard Dujon
  • Claude Gaillardin
  • Christian Marck
  • Eric Westhof
  • Cécile Neuvéglise
  • Cécile Fairhead
  • André Goffeau
  • Philippe Baret
  • Ed Louis
  • Mark Johnston

CNRS GDR 2354 Génolevures

CNRS UMR 5800 LaBRI

INRIA team MAGNOME