CSE182-L16 Non-coding RNA
Biol. Data analysis: Review
- Protein Sequence Analysis
- Sequence Analysis
- Gene Finding
- Assembly
Much other analysis is possible
- Protein Sequence Analysis
- Sequence Analysis
- Gene Finding
- Assembly
- ncRNA
- Genomic Analysis / Pop. Genetics
A Static picture of the cell is insufficient
- Each cell is continuously active:
– Genes are being transcribed into RNA
– RNA is translated into proteins
– Proteins are post-translationally modified and transported
– Proteins perform various cellular functions
- Can we probe the cell dynamically?
– Gene regulation
– Proteomic profiling
– Transcript profiling
ncRNA gene finding
- The gene is transcribed but not translated.
- What are the clues to non-coding genes?
– Look for signals selecting the start of transcription and translation. Many non-coding genes (tRNA, for example) are transcribed by Pol III.
– Non-coding genes have structure. Look for genomic sequences that fold into an RNA structure.
- Structure: Given a sequence, what is the structure into which it can fold with minimum energy?
tRNA structure
RNA structure: Basics
- Key: RNA is single-stranded. Think of a string over 4 letters: A, C, G, and U.
- The complementary bases form pairs.
- Base-pairing defines a secondary structure. The base-pairing is usually non-crossing.
RNA structure: pseudoknots
Sometimes, unpaired bases in loops form ‘crossing pairs’. These are pseudoknots
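The crossing condition can be made concrete with a short check. This is a minimal sketch, assuming base-pairs are given as tuples (i, j) of 1-based positions; the function names are hypothetical and not from the lecture. Two pairs (i, j) and (k, l) cross when i < k < j < l.

```python
from itertools import combinations


def is_crossing(p, q):
    """Return True if base-pairs p and q cross (i < k < j < l)."""
    # Normalize each pair so the smaller position comes first,
    # then order the two pairs by their left endpoint.
    (i, j), (k, l) = sorted([tuple(sorted(p)), tuple(sorted(q))])
    return i < k < j < l


def has_pseudoknot(pairs):
    """Return True if any two base-pairs in the structure cross."""
    return any(is_crossing(p, q) for p, q in combinations(pairs, 2))


# Nested and side-by-side pairs: a valid secondary structure.
nested = [(1, 6), (2, 3), (4, 5)]
# Pairs (1, 4) and (2, 6) interleave: a pseudoknot.
knotted = [(1, 4), (2, 6)]

print(has_pseudoknot(nested))
print(has_pseudoknot(knotted))
```

The check compares all pairs of base-pairs, so it is quadratic in the number of pairs; that is acceptable for illustration on short sequences.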
RNA structure prediction
- Any set of non-crossing base-pairs defines a secondary structure.
- Abstract Question:
– Given an RNA string, find a structure that maximizes the number of non-crossing base-pairs
– Incorporate the true energetics of folding
– Incorporate pseudoknots
A combinatorial problem
- Input:
- A string over A,C,G,U
- A pairs with U, C pairs with G
- Output:
- A subset of possible base-pairs of maximum size such that
- No two base-pairs intersect
- How can we compute this set efficiently?
RNA structure
- Nussinov’s algorithm
1. Score B for every base-pair. No penalty for loops. No pseudoknots.
2. Let W(i,j) be the score of the best structure of the subsequence from i to j.
for i = n down to 1 {
    for j = i+1 to n {
        compute W(i,j) using the recurrence
    }
}

$$
W(i,j) = \max
\begin{cases}
B(r_i, r_j) + W(i+1,\, j-1), \\
W(i,\, j-1), \\
W(i+1,\, j), \\
\max_{i \le k < j} \big[ W(i,k) + W(k+1,\, j) \big]
\end{cases}
$$
Obtaining RNA structure
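for i = n down to 1 {
    for j = i+1 to n {
        compute W(i,j) using the recurrence, and record in S(i,j) which case gave the maximum
    }
}

$$
W(i,j) = \max
\begin{cases}
B(r_i, r_j) + W(i+1,\, j-1) & (1) \\
W(i,\, j-1) & (2) \\
W(i+1,\, j) & (3) \\
\max_{i \le k < j} \big[ W(i,k) + W(k+1,\, j) \big] & (4)
\end{cases}
$$

if (1) { S(i,j) = / }
else if (2) { S(i,j) = | }
else if (3) { S(i,j) = - }
else { S(i,j) = k }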
Obtaining RNA Structure
Procedure print_RNA(i,j) {
    if (i >= j) { return; }
    if (S(i,j) = /) {
        print “(i,j)”;
        print_RNA(i+1,j-1);
    } else if (S(i,j) = -) {
        print_RNA(i+1,j);
    } else if (S(i,j) = |) {
        print_RNA(i,j-1);
    } else {
        k = S(i,j);
        print_RNA(i,k);
        print_RNA(k+1,j);
    }
}
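The fill and the traceback can be sketched together in Python. This is an illustrative sketch, not the lecture's own code: it assumes a uniform score B = 1 per base-pair, allows only A-U and C-G pairs, and imposes no minimum hairpin loop size (matching "no penalty for loops"). The names `can_pair`, `nussinov`, and `_traceback` are hypothetical. Indices are 0-based inside the code, and pairs are reported 1-based.

```python
def can_pair(a, b):
    """A pairs with U, C pairs with G (no G-U wobble in this sketch)."""
    return {a, b} in ({"A", "U"}, {"C", "G"})


def nussinov(seq, B=1):
    """Return (best score, list of 1-based base-pairs) for seq."""
    n = len(seq)
    if n == 0:
        return 0, []
    # W[i][j] = best score on seq[i..j]; entries with j <= i stay 0.
    W = [[0] * n for _ in range(n)]
    # S[i][j] records the winning case: "/", "|", "-", or a split point k.
    S = [[None] * n for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n):
            cands = []
            if can_pair(seq[i], seq[j]):
                cands.append((B + W[i + 1][j - 1], "/"))    # case (1)
            cands.append((W[i][j - 1], "|"))                # case (2)
            cands.append((W[i + 1][j], "-"))                # case (3)
            for k in range(i, j):                           # case (4)
                cands.append((W[i][k] + W[k + 1][j], k))
            # max() keeps the first maximal candidate, so ties are
            # resolved in the order (1), (2), (3), (4).
            W[i][j], S[i][j] = max(cands, key=lambda c: c[0])
    pairs = []
    _traceback(S, 0, n - 1, pairs)
    return W[0][n - 1], pairs


def _traceback(S, i, j, pairs):
    """Follow the choices stored in S, collecting base-pairs."""
    if i >= j:
        return
    choice = S[i][j]
    if choice == "/":
        pairs.append((i + 1, j + 1))  # report 1-based positions
        _traceback(S, i + 1, j - 1, pairs)
    elif choice == "-":
        _traceback(S, i + 1, j, pairs)
    elif choice == "|":
        _traceback(S, i, j - 1, pairs)
    else:
        _traceback(S, i, choice, pairs)
        _traceback(S, choice + 1, j, pairs)


score, pairs = nussinov("ACGAUU")
print(score, sorted(pairs))  # 3 [(1, 6), (2, 3), (4, 5)]
```

For the six-base string ACGAUU, the sketch pairs position 1 with 6, 2 with 3, and 4 with 5. The split in case (4) makes the fill cubic in the sequence length.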
RNA structure: example
Sequence: A C G A U U (positions 1 to 6)

Values of W(i,j), consistent with B = 1 per base-pair (entries not shown on the slide are left blank):

i \ j   3   4   5   6
1       1   1   2   3
2       1   1   2   2
3               1   1
4               1   1
RNA Structure: Details
Base-pairing & Loops
- Base-pairs arise from complementary nucleotides
- Single-stranded
- Stack is when 2 base-pairs are contiguous
- Loops arise when there are unpaired bases.
- They are characterized by the number of base-pairs that close them.
- Hairpin: closed by 1 base-pair
- Bulge/Interior Loops (2 base-pairs)
- Multiple Internal loops (k base-pairs)
Scoring Loops, multi-loops
- Zuker-Turner Energy Rules
- http://www.bioinfo.rpi.edu/~zukerm/rna/energy/node2.html
- Stacking Energies
- Energy for Bulges and Interior Loops
- Energy for Multi-loops
Other tricks for obtaining structure
- Alignment and Covariance
RNA: unsolved problems
- The structure problem is still unsolved.
– De novo prediction does not work as well.
– Co-variance models require prior alignment.
- Many undiscovered non-coding genes
– miRNA and others have only just been discovered.
– Very hard to detect signal for these genes
– Random sequence folds into low energy structures.
Other ncRNA: miRNA
- ncRNA ~22 nt in length
- Pairs to sites within the 3’ UTR, specifying translational repression.
- Similar to siRNA (involved in RNAi)
- Unlike siRNAs, miRNAs do not need perfect base complementarity
- Until recently, no computational techniques to predict miRNA
- Most predictions based on cloning small RNAs from size-fractionated samples
Gene Regulation
Gene expression
- The expression of transcripts and protein in the cell is not static. It changes in response to signals.
- The expression can be measured using microarrays.
- What causes the change in expression?
Transcriptional machinery
- RNA polymerase II (Pol II) scans the genome, initiating transcription, and terminating it.
- The same machinery is used for every gene, so while Pol II is required, it is not sufficient to confer specificity.
TF binding
- Other transcription factors interact with the core machinery and upstream DNA to provide specificity.
- TFs bind to TF binding sites, which are clustered in upstream enhancer and promoter elements.
- The enhancer elements may be located many kb upstream of the core promoter.
Upstream elements Transcription factors
TF binding sites
- TF binding sites are a weak signal (about 10 bp with 5 bp conserved).
- If two genes are co-regulated, they are likely to share binding sites.
- Discovery of binding site motifs is an important research problem.
g1 TGAGGAG
g2 TCAGGAG
g3 TCAGGTG
g4 TGAGGTG
g5 TCAGGTG
http://www.gene-regulation.com/pub/databases.html#transfac
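As an illustration of what a binding site motif looks like, the five example sites can be summarized column by column. This is a minimal sketch, assuming the sites are already aligned and of equal length; the function names are hypothetical, and real motif discovery has to find the sites in unaligned upstream sequence first.

```python
from collections import Counter

# The five example sites from genes g1..g5.
sites = ["TGAGGAG", "TCAGGAG", "TCAGGTG", "TGAGGTG", "TCAGGTG"]


def column_counts(sites):
    """Count the bases seen at each aligned position."""
    return [Counter(column) for column in zip(*sites)]


def consensus(sites):
    """Most frequent base at each position."""
    return "".join(c.most_common(1)[0][0] for c in column_counts(sites))


def conserved_positions(sites):
    """0-based positions where every site has the same base."""
    return [i for i, c in enumerate(column_counts(sites)) if len(c) == 1]


print(consensus(sites))            # TCAGGTG
print(conserved_positions(sites))  # [0, 2, 3, 4, 6]
```

Five of the seven positions are identical across all sites, and two vary, which is the kind of weak, partly conserved signal described above.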
Discovering TF binding sites
- Identification of these TF binding sites/switches is critical.
- Requires identification of co-regulated genes (genes containing the same set of switches).
- How do we find co-regulated genes?
Idea1: Use orthologous genes from different species
ACGGCAGCTCGCCGCCGCGC
||||| || ||||||| ||
ACGGC-GGGCGCCGCCCCGC

ACGGCAGCTCGCCGCCGC-C
| || | ||||||| |
AGTGC-GGGCGCCGCCTCAT

ACGGC-GC-TCGCCGCCGCGC
| | | || | |
AT-ACGAAGTAGCGG-ATGGT
1. The species are too close (e.g., humans and chimps). Binding and non-binding sites are both conserved.
2. The species are distant. Binding sites are conserved but not other sequence.
3. The species are very distant. Even binding sites are not conserved. The genes have alternative regulators.
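One rough way to see the three regimes is to compute the fraction of identical columns in each of the example alignments. This is an illustrative sketch only: percent identity is an assumption introduced here as a simple measure, not a method from the lecture, and `percent_identity` is a hypothetical helper.

```python
def percent_identity(a, b):
    """Fraction of alignment columns where both rows have the same base."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # A column counts as a match only if the bases agree and it is not a gap.
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches / len(a)


# The three example alignments, in the order they appear.
alignments = [
    ("ACGGCAGCTCGCCGCCGCGC", "ACGGC-GGGCGCCGCCCCGC"),
    ("ACGGCAGCTCGCCGCCGC-C", "AGTGC-GGGCGCCGCCTCAT"),
    ("ACGGC-GC-TCGCCGCCGCGC", "AT-ACGAAGTAGCGG-ATGGT"),
]

for top, bottom in alignments:
    print(f"{percent_identity(top, bottom):.2f}")
```

Identity drops from the first alignment to the third. A single whole-alignment number does not say where the conserved columns are, so it cannot by itself separate binding sites from background sequence.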
Idea2: Measure expression of genes
- Northern Blot:
– Quantitative expression of a few genes
Microarray
- Expression level of all genes
Protein Expression using MS
Pathways
- Proteins interact to transduce signal, catalyze reactions, etc.
- The interactions can be captured in a database.
- Queries on this database are about looking for interesting sub-graphs in a large graph.
Biological databases in NAR
- http://www3.oup.co.uk/nar/database/c
- 548 databases in various categories
- Examples: Rfam, GenBank, SwissProt, Stanford microarray db, PDB, KEGG, dbSNP/OMIM/SeattleSNPs, SWISS-2DPAGE
Summary
- Biological databases cannot be understood without understanding the data, and the tools for querying and accessing these data.
- While database technology (XML, relational and OO databases, text formats) is used to store these data, its use is (often) transparent for bioinformatics people.
- In this course, we looked at various data-streams, and pointed to databases that store these data-streams.
- Nucleic Acids Research brings out a database issue every January.