Discovery of Transcription-Regulating Regions in Genes - Wyeth Wasserman



slide-1
SLIDE 1

Discovery of Transcription-Regulating Regions in Genes

Wyeth Wasserman

Centre for Molecular Medicine and Therapeutics
Children’s and Women’s Hospital
University of British Columbia

slide-2
SLIDE 2

CMMT

Overview

  • Bioinformatics for detection of transcription factor binding sites
  • The Specificity Problem
  • Methods to enhance specificity of discrimination algorithms
  • Pattern discovery for the analysis of regulatory sequences in sets of co-expressed genes

  • Methods to enhance sensitivity of discovery algorithms
  • Current activities
slide-3
SLIDE 3

Layers of Complexity in Metazoan Transcription

slide-4
SLIDE 4

CMMT

Transcription Simplified

[Figure: simplified transcription schematic with the labels TATA, URE, URF and Pol-II]

slide-5
SLIDE 5

Teaching a computer to find TFBS…

slide-6
SLIDE 6

Representing Binding Sites for a TF

  • A single site
      AAGTTAATGA
  • A set of sites represented as a consensus
      VDRTWRWWSHD (IUPAC degenerate DNA)
  • A matrix describing a set of sites

      A  14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
      C   3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
      G   4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
      T   0  2  0 21 20  0  1 20  1  4 13 17  0  6  4

Set of binding sites:

AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA
GAGTTAATAA CAGTTATTCA GAGTTAATAA CAGTTAATCA
AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA
AAGTTAATGA AAGTTAACGA AAATTAATGA GAGTTAATGA
AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA
AAGTAAATGA AAGTTAATGA AAGTTAATGA AAATTAATGA
AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
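A minimal sketch of how a set of aligned sites can be turned into a count matrix like the one on this slide. The function name `count_matrix` and the choice of the first five listed sites as sample input are illustrative assumptions, not part of the original slides.

```python
from collections import Counter

BASES = "ACGT"

def count_matrix(sites):
    """Return {base: [count at each position]} for equal-length aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    # One Counter per alignment column.
    columns = [Counter(site[i] for site in sites) for i in range(width)]
    return {base: [column[base] for column in columns] for base in BASES}

# First five sites from the slide's list, used only as sample input.
sites = ["AAGTTAATGA", "CAGTTAATAA", "GAGTTAAACA", "CAGTTAATTA", "GAGTTAATAA"]
pfm = count_matrix(sites)
for base in BASES:
    print(base, pfm[base])
```

Each column of the resulting matrix sums to the number of sites, which is the "depth" that later weighting steps rely on.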

slide-7
SLIDE 7

CMMT

PFMs to PWMs

One would like to add the following features to the model:

  1. Correcting for the base frequencies in DNA
  2. Weighting for the confidence (depth) in the pattern
  3. Converting to log-scale probability for easy arithmetic

f matrix

      A   5    0    1    0    0
      C   0    2    2    4    0
      G   0    3    1    0    4
      T   0    0    1    1    1

w matrix

      A   1.6 -1.7 -0.2 -1.7 -1.7
      C  -1.7  0.5  0.5  1.3 -1.7
      G  -1.7  1.0 -0.2 -1.7  1.3
      T  -1.7 -1.7 -0.2 -0.2 -0.2

$$w(b,i) = \log\left(\frac{f(b,i) + s(N)}{p(b)}\right)$$

Example score: TGCTG = 0.9
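The conversion can be sketched in a few lines. The slide does not spell out s(N), p(b) or the log base, so the choices here are assumptions: a total pseudocount of sqrt(N) shared across bases in proportion to the background, a uniform background of 0.25 per base, counts normalized to frequencies before the ratio is taken, and log base 2. With these assumptions the sketch reproduces the w matrix values and the TGCTG score shown on the slide to one decimal place.

```python
import math

BASES = "ACGT"

def pfm_to_pwm(pfm, background=None):
    """Convert a count matrix {base: [counts]} into log-odds weights."""
    background = background or {b: 0.25 for b in BASES}  # assumed uniform p(b)
    width = len(pfm["A"])
    n = sum(pfm[b][0] for b in BASES)   # number of sites (depth)
    pseudo = math.sqrt(n)               # assumed total pseudocount s(N)
    pwm = {b: [] for b in BASES}
    for i in range(width):
        for b in BASES:
            # Corrected probability of base b at position i.
            p = (pfm[b][i] + pseudo * background[b]) / (n + pseudo)
            pwm[b].append(math.log2(p / background[b]))
    return pwm

def score(pwm, seq):
    """Sum the position-specific weights along a candidate site."""
    return sum(pwm[base][i] for i, base in enumerate(seq))

pfm = {"A": [5, 0, 1, 0, 0], "C": [0, 2, 2, 4, 0],
       "G": [0, 3, 1, 0, 4], "T": [0, 0, 1, 1, 1]}
pwm = pfm_to_pwm(pfm)
print(round(score(pwm, "TGCTG"), 1))  # 0.9 under the assumptions above
```

Because the weights are logarithms, scoring a site is a sum over positions, which is the "easy arithmetic" named in point 3.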

slide-8
SLIDE 8

CMMT

Performance of Profiles

  • 95% of predicted sites bound in vitro (Tronche 1997)
  • MyoD binding sites predicted about once every 600 bp (Fickett 1995)
  • The Futility Theorem
      – Nearly 100% of predicted transcription factor binding sites have no function in vivo

slide-9
SLIDE 9

CMMT

A 1 kbp promoter screened with collection of TF profiles

slide-10
SLIDE 10

CMMT

Phylogenetic Footprinting for better specificity

70,000,000 years of evolution reveals most regulatory regions.

slide-11
SLIDE 11

CMMT

Phylogenetic Footprinting to Identify Functional Segments

[Figure: % identity plotted against 200 bp window start position (human sequence). Actin gene compared between human and mouse with DPB.]
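A plot of this kind can be approximated with a sliding window over a pairwise alignment. The sketch below is an illustration under stated assumptions: the input is two already-aligned strings of equal length, gap columns ("-") count as mismatches, and window starts are reported in alignment coordinates rather than human sequence coordinates. The function name `window_identity` is hypothetical.

```python
def window_identity(aln_a, aln_b, window=200, step=1):
    """Return [(start, percent_identity)] for each window of an alignment.

    aln_a and aln_b are aligned sequences of equal length; '-' marks a gap.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    # True where the column holds the same base in both sequences.
    matches = [a == b and a != "-" for a, b in zip(aln_a, aln_b)]
    results = []
    for start in range(0, len(matches) - window + 1, step):
        identical = sum(matches[start:start + window])
        results.append((start, 100.0 * identical / window))
    return results

# Toy alignment with a small window so the output is easy to check by eye.
for start, pct in window_identity("ACGTAC", "ACGAAC", window=3):
    print(start, round(pct, 1))
```

Windows whose identity stands well above the surrounding sequence are the candidate conserved segments.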

slide-12
SLIDE 12

CMMT

Phylogenetic Footprinting (2)

[Figure: FoxC2. % identity (0% to 100%) plotted against start position of 200 bp window (1000 to 7000).]

slide-13
SLIDE 13

CMMT

Recall...

slide-14
SLIDE 14

CMMT

The 1kbp promoter screen with phylogenetic footprinting

slide-15
SLIDE 15

CMMT

Choosing the “right” species...

[Figure: three pairwise comparisons: human vs. cow, human vs. mouse, human vs. chicken]

slide-16
SLIDE 16

CMMT

ConSite (www.phylofoot.org)

Now driven by the ORCA Aligner

slide-17
SLIDE 17

CMMT

Performance: Human vs. Mouse

  • Testing set: 40 experimentally defined sites in 15 well-studied genes (replicated with 100+ site set)
  • SENSITIVITY: 85-95% of defined sites detected with conservation filter
  • SELECTIVITY: only 11-16% of total predictions retained
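The two figures reported here can be computed from a list of predictions once a conservation filter is applied. The sketch below uses toy positions and identities, and the 70% identity threshold is an arbitrary assumption for illustration; neither the data nor the threshold comes from the slides.

```python
def filter_by_conservation(predictions, identity_at, threshold=70.0):
    """Keep predicted site positions whose local % identity meets the threshold."""
    return [pos for pos in predictions if identity_at(pos) >= threshold]

def sensitivity_and_retention(predictions, retained, known_sites):
    """Return (fraction of known sites still detected, fraction of predictions kept)."""
    known = set(known_sites)
    sensitivity = len(known & set(retained)) / len(known)
    retention = len(retained) / len(predictions)
    return sensitivity, retention

# Toy data: predicted site position -> % identity of the window containing it.
identity = {10: 35.0, 50: 88.0, 120: 42.0, 300: 91.0, 410: 20.0}
predictions = sorted(identity)
retained = filter_by_conservation(predictions, identity.get)
sens, kept = sensitivity_and_retention(predictions, retained, known_sites=[50, 300])
print(retained, sens, kept)
```

A useful filter keeps the first number high while driving the second one down.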

slide-18
SLIDE 18

CMMT

Emerging Issues

  • Multiple sequence comparisons
      – Incorporate phylogenetic trees
      – Visualization
  • Analysis of closely related species
      – Phylogenetic shadowing
  • Genome rearrangements
      – Inversion compatible alignment algorithm
  • Higher order models of TFBS
slide-19
SLIDE 19

CMMT

Regulatory Modules for better specificity

TFs do NOT act in isolation

slide-20
SLIDE 20

Layers of Complexity in Metazoan Transcription

slide-21
SLIDE 21

CMMT

Liver regulatory modules

slide-22
SLIDE 22

CMMT

Models for Liver TFs…

(10 second slide for 3 months of work)

HNF1 C/EBP HNF3 HNF4

slide-23
SLIDE 23

CMMT

Statistically Significant Clusters of Sites

  • Can we identify dense clusters of sites that are statistically significant?
  • Diverse methods have been introduced over the past few years… Berman; Markstein; Frith; Noble; Wagner;…
  • In the best cases, we have enough data to train a discriminant function
  • Rare to have sufficient data
  • For general purpose, we identify statistically significant clusters of TFBS
  • Non-trivial to correct for non-random properties of DNA
      – Most difficulty comes from local direct repeats

slide-24
SLIDE 24

CMMT

MSCAN

(collaboration with Jens Lagergren)

  • MSCAN allows users to submit any set of TF profiles
  • Calculates significance for each site based on local sequence characteristics
  • Calculates cluster significance using a dynamic programming approach
  • Approximately 1 significant liver cluster / 18 000 bp in human genome sequence
  • Filters out “significant” clusters of sites that contain local repeats
  • Identification of non-random characteristics in DNA

http://mscan.cgb.ki.se

slide-25
SLIDE 25

CMMT

Training predictive models for modules

  • MSCAN and similar methods assume that any combination of sites is meaningful
  • Reality: Some factors critical, others secondary
  • An alternative is to teach the computer which combinations are better
  • Limited by small size of positive training set
  • Our original method: Logistic Regression Analysis
  • Recent method from Frith et al: Hidden Markov Model (COMET)

slide-26
SLIDE 26

CMMT

Liver regulatory modules

slide-27
SLIDE 27

CMMT

Logistic Regression Analysis

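[Diagram: four inputs are each multiplied by a weight (α1, α2, α3, α4) and the products are summed (Σ) to give the “logit”.]

Optimize α vector to maximize the distance between output values for positive and negative training data.

Output value is:

$$p(x) = \frac{e^{\text{logit}}}{1 + e^{\text{logit}}}$$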
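The output function can be sketched as follows. The weights and input scores below are made-up numbers for illustration, and the fitting of the α vector to training data is not shown; only the weighted sum and the logistic output from the slide are implemented.

```python
import math

def logit(scores, alphas):
    """Weighted sum of the input scores (one weight per input)."""
    if len(scores) != len(alphas):
        raise ValueError("need one weight per input score")
    return sum(a * s for a, s in zip(alphas, scores))

def probability(scores, alphas):
    """p(x) = e^logit / (1 + e^logit), evaluated without overflow."""
    z = logit(scores, alphas)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical weights and per-factor scores for one sequence window.
alphas = [0.8, 0.5, 1.2, 0.3]
scores = [1.0, 0.0, 2.0, -1.0]
print(round(probability(scores, alphas), 3))
```

The output lies between 0 and 1, so a single threshold on p(x) separates windows called as modules from the rest.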

slide-28
SLIDE 28

CMMT

PERFORMANCE

  • Liver (Genome Research, 2001)
      – At 1 hit per 35 kbp, identifies 60% of modules
      – Limited to genes expressed late in liver development

LRA Models do not account for multiple sites for the same TF*

*Frith et al.’s COMET and CISTER algorithms circumvent this problem

slide-29
SLIDE 29

CMMT

UDPGT1 (Gilbert’s Syndrome)

[Figure: liver module model score plotted against “window” position in sequence (100 to 5840), with wildtype and mutant series.]

slide-30
SLIDE 30

CMMT

Making better predictions

  • Profiles make far too many false predictions to have predictive value in isolation
  • Phylogenetic footprinting eliminates about 90% of false predictions
  • Detection of clusters of binding sites offers better predictive performance, especially through trained discriminant functions

slide-31
SLIDE 31

CMMT

Active Issues

  • Significance of clusters of sites
  • Segmentation of DNA into regions of different composition

  • Methods using training to find clusters
  • Where to place weights?
  • Lack of large reference collections of modules
  • Limited profile databases
slide-32
SLIDE 32

CMMT

de novo Discovery of TF Binding Sites
slide-33
SLIDE 33

CMMT

Pattern Discovery

slide-34
SLIDE 34

CMMT

Pattern Discovery Methods

  • Exhaustive
      – e.g. “Moby Dick” (Bussemaker, Li & Siggia)
      – Identify over-represented oligomers in comparison of “+” and “-” (or complete) promoter collections
  • Monte Carlo/Gibbs Sampling
      – e.g. AnnSpec (Workman & Stormo)
      – Identify strong patterns in “+” promoter collection vs. background model of expected sequence characteristics
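The exhaustive idea of comparing oligomer frequencies between “+” and “-” promoter collections can be illustrated with a simple frequency ratio. This is not the “Moby Dick” algorithm itself, only a minimal sketch of over-representation; the pseudocount of 1 and the toy sequences are assumptions.

```python
from collections import Counter

def kmer_counts(seqs, k):
    """Count every overlapping k-mer across a collection of sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def enrichment(positive, negative, k, pseudo=1.0):
    """Rank k-mers by their frequency ratio in the '+' set vs. the '-' set."""
    pos, neg = kmer_counts(positive, k), kmer_counts(negative, k)
    pos_total, neg_total = sum(pos.values()), sum(neg.values())
    ratios = {}
    for kmer, count in pos.items():
        f_pos = (count + pseudo) / (pos_total + pseudo)
        f_neg = (neg[kmer] + pseudo) / (neg_total + pseudo)  # pseudocount avoids /0
        ratios[kmer] = f_pos / f_neg
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)

positive = ["ACGTACGT", "TTACGTTT"]   # toy "+" promoters
negative = ["TTTTTTTT", "GGGGGGGG"]   # toy "-" promoters
for kmer, ratio in enrichment(positive, negative, k=4)[:3]:
    print(kmer, round(ratio, 2))
```

Real collections would also need strand handling and a significance test, which this sketch leaves out.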

slide-35
SLIDE 35

CMMT

Yeast Regulatory Sequence Analysis (YRSA) system

slide-36
SLIDE 36

CMMT

Tests of YRSA System

  • PDR3-regulated genes from array study
  • Classic cell-cycle array data re-clustered by Getz et al
  • DNA-damage response partially mediated by MCB

slide-37
SLIDE 37

CMMT

Yeast genomes are ideal for such studies

Metazoan genomes are far from ideal

slide-38
SLIDE 38

CMMT

Biochemical complexity enables greater complexity in regulation

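[Figure: schematic comparing a yeast gene (ORF A, 500 bp scale) with a human gene (exons 1, 2 and 3, 20 000 bp scale); regulatory elements are marked “GO”, three for yeast and nine for human.]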

slide-39
SLIDE 39

CMMT

Applied Pattern Discovery is Acutely Sensitive to Noise

[Figure: pattern similarity vs. true MEF2 profile (10 to 18) plotted against sequence length (100 to 600), for true Mef2 binding sites.]

slide-40
SLIDE 40

CMMT

Four Approaches to Improve Sensitivity

  • Better background models
      – Higher-order properties of DNA
  • Phylogenetic Footprinting
      – Human:Mouse comparison eliminates ~75% of sequence
  • Regulatory Modules
      – Architectural rules
  • Limit the types of binding profiles allowed
      – TFBS patterns are NOT random

slide-41
SLIDE 41

CMMT

Phylogenetic Footprinting to Identify Conserved Regions

Bayes Block Aligner (Lawrence Group) ORCA

slide-42
SLIDE 42

CMMT

Skeletal Muscle Genes

  • One of the most extensively studied tissues for transcriptional regulation
      – 45 genes partially analyzed
      – 26 genes with orthologous genomic sequence from human and rodent
  • Five primary classes of transcription factors
      – Principal: Myf (myoD), Mef2, SRF
      – Secondary: Sp1 (G/C rich patches), Tef (subset of skeletal muscle types)

slide-43
SLIDE 43

CMMT

de novo Discovery of Skeletal Muscle Transcription Factor Binding Sites

Mef2-Like SRF-Like Myf-Like

slide-44
SLIDE 44

CMMT

Pattern discovery methods using biochemical constraints

slide-45
SLIDE 45

CMMT

Some profile constraints have been explored…

  • Segmentation of informative columns
  • Palindromic patterns
slide-46
SLIDE 46

CMMT

Our Hypothesis

  • Point 1: Structurally-related DNA binding domains interact with similar target sequences
  • Exceptions exist (e.g. Zn-fingers)
  • Point 2: There are a finite number of binding domains used in human TFs
  • Approximately 20-25
  • Idea: We could use the shared binding properties for each family to focus pattern detection methods
  • Constrain the range of patterns sought
slide-47
SLIDE 47

CMMT

Comparison of profiles requires alignment and a scoring function

  • Scoring function based on sum of squared differences
  • Align frequency matrices with modified Needleman-Wunsch algorithm
  • Calculate empirical p-values based on simulated set of matrices

[Figure: distribution of scores (frequency vs. score)]
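A simplified sketch of profile comparison with a sum-of-squared-differences column score. Two things here are assumptions rather than the method on the slide: the alignment is ungapped (each offset is tried in turn, instead of the modified Needleman-Wunsch algorithm), and the column similarity is taken as 2 minus the squared difference, so identical columns score 2 and maximally different columns score 0.

```python
def column_ssd(col_a, col_b):
    """Sum of squared differences between two frequency columns [fA, fC, fG, fT]."""
    return sum((a - b) ** 2 for a, b in zip(col_a, col_b))

def best_ungapped_alignment(prof_a, prof_b, min_overlap=3):
    """Slide prof_b along prof_a; return (best_score, offset) or None.

    Profiles are lists of columns whose four frequencies sum to 1.
    offset is the index in prof_a aligned to the first column of prof_b.
    """
    best = None
    first = -(len(prof_b) - min_overlap)
    last = len(prof_a) - min_overlap
    for offset in range(first, last + 1):
        total, overlap = 0.0, 0
        for j, col_b in enumerate(prof_b):
            i = offset + j
            if 0 <= i < len(prof_a):
                total += 2.0 - column_ssd(prof_a[i], col_b)
                overlap += 1
        if overlap >= min_overlap and (best is None or total > best[0]):
            best = (total, offset)
    return best

# Toy profiles: prof_b equals prof_a without its first column.
prof_a = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
prof_b = prof_a[1:]
print(best_ungapped_alignment(prof_a, prof_b))
```

An empirical p-value for an observed score could then be estimated as the fraction of simulated matrix pairs that score at least as high.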

slide-48
SLIDE 48

CMMT

Intra-family comparisons more similar than inter-family

[Diagram: TF Database (JASPAR); COMPARE; Match to bHLH]

  • Jackknife Test: 87% correct
  • Independent Test Set: 93% correct

slide-49
SLIDE 49

CMMT

slide-50
SLIDE 50

CMMT

FBPs enhance sensitivity of pattern detection
slide-51
SLIDE 51
slide-52
SLIDE 52

CMMT

APPLICATION:

Cancer Protection Response

  • Detoxification-related enzymes are induced by compounds present in broccoli
  • Arrays, SSH and hard work have defined a set of responsive genes
  • A known element mediates the response (Antioxidant Responsive Element)
  • Controversy over the type of mediating leucine zipper TF
  • NF-E2/Maf or Jun/Fos
slide-53
SLIDE 53

CMMT

Gibbs Sampling

Application (2)

Problem: Given a set of co-regulated genes, determine the common TFBS. Classify the mediating TF. We expect a leucine zipper-type TF.

[Diagram: Gibbs with FBP Prior; New TF Motif; Classify: Maf (p<0.02), Jun (p<0.98)]

slide-54
SLIDE 54

CMMT

CURRENT ACTIVITY

de novo Analysis of Regulatory Modules

(Collaboration with Chip Lawrence (Wadsworth))

slide-55
SLIDE 55

CMMT

Focus on regulatory modules for pattern detection

[Diagram: workflow with the steps Cluster Genes by Expression; Identify and Model Contributing TFs (shown with an example count matrix); Predictive Models]

slide-56
SLIDE 56

[Diagram: module model. Caption: Analyze co-regulated genes to define circuit characteristics. Panels: General Circuit Properties; Specific Gene Features. Labels: Binding Profiles; Number of Sites Distribution; Width Distributions (Sum of Separations); Separation Distribution (Default = Uniform); Neighbor Interactions. Symbols shown: β, κ, γ, αij, µi, µj. Scale marks: 250, 3, 100bp.]

slide-57
SLIDE 57

CMMT

Discovery performance

  • Approximately 50% of annotated TFBS are detected in the training set sequences of 25 genes
  • Only 40% of predicted TFBS are annotated
  • We suspect that most of the un-annotated sites will turn out to be functional. This needs to be determined.

slide-58
SLIDE 58

CMMT

[Diagram: module with τ=5 binding sites, annotated with the symbols β, γ1 to γ4, α1 to α4 and µ1 to µ5]

  • B. Based on defined parameters, discover additional modules
slide-59
SLIDE 59

CMMT

Assessing predictive performance

  • Tested “forward” implementation for specificity on initial set of 50 human non-muscle genes.
  • Only 1 significant regulatory region predicted.
      – Region is closest to a skeletal muscle specific gene on the opposite strand
  • 0 false predictions in this initial 250 kb screen
  • Currently repeating on larger collection
slide-60
SLIDE 60

CMMT

A related tangent...

slide-61
SLIDE 61

CMMT

The Integrated Module Sampler

Gene1 Gene2 Gene3 Gene4 Gene5

Calls to ensEMBL Calls to GeneLynx Calls to BlastZ Module Sampler

slide-62
SLIDE 62

CMMT

Conclusions

  • Evolution drives understanding in biology
      – Phylogenetic Footprinting
  • Biochemistry inspires Bioinformatics
      – Regulatory Modules
      – Familial Binding Profiles
  • Analysis of regulatory sequences is improving
      – Given sets of orthologous genes, one can predict regulatory regions
      – Given sets of co-regulated genes, it is possible to infer the binding profiles for critical transcription factors
  • Much more work is needed…
slide-63
SLIDE 63

THANKS!

Wasserman Group – CMMT: Dave Arenillas, Jochen Brumm, Danielle Kemmer, Jonathan Lim, Chris Walsh

Wasserman Group – Karolinska: Albin Sandelin, Raf Podowski, Wynand Alkema

Collaborating Trainees: Malin Andersson (KTH), Öjvind Johansson (UCSD)

Collaborators: Chip Lawrence (Wadsworth), William Thompson (Wadsworth), Boris Lenhard (K.I.), Jens Lagergren (SBC), Christer Höög (K.I.), Brenda Gallie (OCI), Jacob Odeberg (KTH), Niclas Jareborg (AZ), William Hayes (AZ)

Group Alumni: Elena Herzog, Annette Höglund, William Krivan, Boris Lenhard, Luis Mendoza

Support: CIHR, CGDN, Merck-Frosst, BC Children’s & Women’s Hospitals, Pharmacia, EC–Marie Curie, KI-Funder

slide-64
SLIDE 64

CMMT

What will a computational biologist do with a scoring function?

Build a similarity tree!

slide-65
SLIDE 65

CMMT

The matrix tree:

slide-66
SLIDE 66

CMMT

Compare with consensus for both classes - CANNTG

bHLH-Zip domain bHLH domain