Characterizing transcriptomes using NGS data (T. Källman, BILS/SciLifeLab)



SLIDE 1

Characterizing transcriptomes using NGS data

T. Källman

BILS/SciLifeLab/Uppsala University

May 2015

20150521 1/38

SLIDE 2

Outline

1. The transcriptome
2. RNA sequence technologies
3. RNA-seq analysis
  • Mapping-based approach
  • Tools for working with NGS alignments
  • Gene expression from RNA-seq
  • De novo assembly

SLIDE 3

The transcriptome

The Central Dogma

[Figure: the flow from DNA to mRNA to protein. Labels: promoter region, TATA box, 5' untranslated region, exon, intron, ATG start codon (methionine), stop codons (UGA, UAA, UAG), 5' cap, 3' poly-A tail. Arrows: transcription and mRNA processing, translation, post-translational modification (PO4 and S-S labels on the protein), leading to the active protein.]

SLIDE 4

The transcriptome

A more complex view

SLIDE 5

The transcriptome

Transcriptomes vs genomes

  • Dynamic: not the same across tissues and time points
  • Smaller sequence space
  • Less repetitive (but large gene families can be found)
  • Fairly stable in size? (e.g. 2-4 fold variation among eukaryotes, whereas genome size can vary 1000-fold)
  • Genes are often expressed as multiple splice variants
  • RNA often originates from only one strand

SLIDE 6

RNA sequence technologies

NGS data

SLIDE 7

RNA sequence technologies

Machine output

SLIDE 8

RNA sequence technologies

Machine output

SLIDE 9

RNA sequence technologies

Sequence quality

Phred quality scores: Q = -10 × log10(P), where P is the probability that the base call is wrong (high Q = high probability that the base is correct).
A Phred quality score of 20 means that the base is called incorrectly 1 time out of 100.
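The relation between Q and P can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names are made up here, and the ASCII offset of 33 is an assumption (older Illumina pipelines used an offset of 64).

```python
def phred_to_error_prob(q):
    """Convert a Phred score Q to the probability P that the base call is wrong."""
    return 10 ** (-q / 10)


def decode_quality(quality_string, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores.

    The offset (33 here) is an assumption; it depends on the
    sequencing software that produced the file.
    """
    return [ord(char) - offset for char in quality_string]


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
```

With an offset of 33, the character `5` decodes to Q = 20 and `I` to Q = 40.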

SLIDE 10

RNA sequence technologies

Paired-end (PE) sequencing

SLIDE 11

RNA sequence technologies

Paired-end reads

File format
  • Two files are created
  • The order of the reads is identical in both files, and the read names are the same except for the end (/1 or /2)
  • The way reads are named has changed over time, so read names depend on the software version

@61DFRAAXX100204:1:100:10494:3070/1
AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT
+
ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCBC?CCCCCCCCC@@CACCCCCA
@61DFRAAXX100204:1:100:10494:3070/2
ATCCAAGTTAAAACAGAGGCCTGTGACAGACTCTTGGCCCATCGTGTTGATA
+
_^_a^cccegcgghhgZc`ghhc^egggd^_[d]defcdfd^Z^OXWaQ^ad
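A quick sanity check on paired-end files is that both files list the mates in the same order. The sketch below is only an illustration under stated assumptions: it expects plain four-line FASTQ records and the older `/1` and `/2` naming shown above, and the helper names are hypothetical.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (name, sequence, quality) from an iterable of FASTQ lines.

    Assumes exactly four lines per record (no wrapped sequences).
    """
    iterator = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(iterator, 4)]
        if len(record) < 4:
            return
        header, sequence, _separator, quality = record
        yield header[1:], sequence, quality


def mate_base_name(name):
    """Strip a trailing /1 or /2 (assumed naming scheme) from a read name."""
    if name.endswith(("/1", "/2")):
        return name[:-2]
    return name


def check_pairs(lines_1, lines_2):
    """Count pairs whose names agree; raise ValueError on the first mismatch.

    Note: zip stops at the shorter input, so unequal file lengths
    are not detected by this sketch.
    """
    pairs = 0
    for (name_1, _, _), (name_2, _, _) in zip(read_fastq(lines_1), read_fastq(lines_2)):
        if mate_base_name(name_1) != mate_base_name(name_2):
            raise ValueError(f"mates out of order: {name_1} vs {name_2}")
        pairs += 1
    return pairs
```

In practice the two arguments would be open file handles for the two FASTQ files; newer naming schemes would need a different `mate_base_name`.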

SLIDE 12

RNA sequence technologies

Paired-end data

SLIDE 13

RNA sequence technologies

Stranded or not

SLIDE 14

RNA-seq analysis

Two main routes for analysis

Haas & Zody (2010), Nature Biotechnology 28, 421–423
SLIDE 15

RNA-seq analysis Mapping based approach

Aligning short reads from RNA to genomes

  • If available, map to the genome sequence
  • If there is no genome sequence, one can also map to a transcriptome reference
  • Make use of available genome annotation (GTF, GFF, BED files)

SLIDE 16

RNA-seq analysis Mapping based approach

Aligning short reads from RNA to genomes

  • Large number of programs available: STAR, TopHat, Subread, etc.
  • Important feature: allow for spliced mapping

SLIDE 17

RNA-seq analysis Mapping based approach

Aligning short reads from RNA to genomes

After mapping, perform QC of the output
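One of the simplest QC numbers is the fraction of reads that mapped. The sketch below is an illustration rather than a replacement for dedicated QC tools: it reads SAM text lines and uses the FLAG bits defined in the SAM specification (0x4 unmapped, 0x100 secondary, 0x800 supplementary). The function name is hypothetical.

```python
def mapping_summary(sam_lines):
    """Return (total_reads, mapped_reads) from an iterable of SAM text lines.

    Header lines (starting with '@') are skipped, and secondary and
    supplementary alignments are ignored so each read end is counted once.
    """
    total = 0
    mapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x900:  # 0x100 secondary or 0x800 supplementary
            continue
        total += 1
        if not flag & 0x4:  # 0x4 set means the read is unmapped
            mapped += 1
    return total, mapped
```

The input could, for example, be the text output of `samtools view -h alignment.bam` piped into the script.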

SLIDE 18

RNA-seq analysis Mapping based approach

Example workflow

  • TopHat: aligns reads to the genome (allows for spliced read mapping)
  • Cufflinks: extracts transcripts from spliced read alignments
  • Cuffmerge: merges the results of multiple Cufflinks runs
  • Cuffdiff: detects differential gene expression
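The four steps above can be sketched as a list of command lines. This is a hedged outline loosely following the cited protocol, not a tested pipeline: the function only builds the commands and does not run them, all file and directory names are hypothetical, and `assemblies.txt` is assumed to be a text file listing the Cufflinks `transcripts.gtf` outputs, one per line.

```python
def workflow_commands(bowtie_index, genome_fasta, samples):
    """Build TopHat/Cufflinks/Cuffmerge/Cuffdiff command lines (not executed).

    samples maps a condition label to a list of (reads_1.fq, reads_2.fq) pairs.
    """
    commands = []
    bams = {}
    for condition, replicates in samples.items():
        bams[condition] = []
        for number, (fastq_1, fastq_2) in enumerate(replicates, start=1):
            tophat_out = f"{condition}_R{number}_thout"
            cufflinks_out = f"{condition}_R{number}_clout"
            # Step 1: spliced alignment of the read pair to the genome index
            commands.append(["tophat", "-o", tophat_out, bowtie_index, fastq_1, fastq_2])
            bam = f"{tophat_out}/accepted_hits.bam"
            # Step 2: transcript assembly from the alignments
            commands.append(["cufflinks", "-o", cufflinks_out, bam])
            bams[condition].append(bam)
    # Step 3: merge assemblies (assemblies.txt is a hypothetical list of GTF paths)
    commands.append(["cuffmerge", "-s", genome_fasta, "assemblies.txt"])
    # Step 4: differential expression; replicates are comma-separated per condition
    commands.append(
        ["cuffdiff", "-o", "diff_out", "-L", ",".join(bams), "merged_asm/merged.gtf"]
        + [",".join(paths) for paths in bams.values()]
    )
    return commands
```

Each returned list could be passed to `subprocess.run` once the tools are installed; check the options against the installed versions before use.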

Trapnell et al. (2012), Nature Protocols 7, 562–578
SLIDE 19

RNA-seq analysis Mapping based approach

Tophat

1. Efficient and fast alignment of the reads to the genome using Bowtie2
2. Create a database of putative splice junctions from the reads mapped in step 1
3. Map the reads that did not map in step 1, using the splice junction information
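To see why spliced mapping is needed, consider a read that spans an exon-exon junction: it does not match the genome as one contiguous piece, but both halves match on either side of an intron. The toy function below illustrates that idea with exact string matching only. It is not TopHat's algorithm, and the sequences and the `min_anchor` parameter are invented for the example.

```python
def find_spliced_alignment(read, genome, min_anchor=4):
    """Return (left_start, split, right_start) for the first exact split match.

    The read is cut at every position that leaves at least min_anchor
    bases on each side; the right part must start downstream of the
    left part with a gap (the putative intron). Returns None if no
    split alignment exists.
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_start = genome.find(left)
        while left_start != -1:
            right_start = genome.find(right, left_start + split + 1)
            if right_start != -1:
                return left_start, split, right_start
            left_start = genome.find(left, left_start + 1)
    return None


if __name__ == "__main__":
    exon_1, intron, exon_2 = "TTGACCATGC", "GTAAGTCCCCCCAG", "AGCTTGGCAT"
    genome = exon_1 + intron + exon_2
    read = "CATGC" + "AGCTT"  # last 5 bases of exon 1 joined to first 5 of exon 2
    print(genome.find(read))  # -1: no contiguous match
    print(find_spliced_alignment(read, genome))
```

Real aligners tolerate mismatches and use additional evidence, such as canonical GT-AG intron boundaries, to place the junction.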

SLIDE 20

RNA-seq analysis Mapping based approach

Cufflinks

SLIDE 21

RNA-seq analysis Mapping based approach

Cuffdiff

  • Program that estimates expression levels and identifies differentially expressed genes from NGS alignments
  • Basically uses the read data to estimate dispersion parameters (the amount of deviation from a Poisson distribution)
  • Genes whose patterns deviate from these expectations are called differentially expressed between treatments
  • Also works for detecting differential expression of isoforms
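The idea of "deviation from a Poisson distribution" can be illustrated with replicate counts for a single gene. Under a Poisson model the variance equals the mean; under a negative binomial model the variance is mean + dispersion × mean². The sketch below uses a simple method-of-moments estimate and is not Cuffdiff's actual model; the function name and the counts are made up.

```python
from statistics import mean, variance


def estimate_dispersion(counts):
    """Method-of-moments dispersion for replicate counts of one gene.

    Assumes variance = mean + dispersion * mean**2 (negative binomial).
    A value of 0 means the counts are no more variable than Poisson.
    Needs at least two replicates.
    """
    m = mean(counts)
    v = variance(counts)  # sample variance across replicates
    if m == 0:
        return 0.0
    return max(0.0, (v - m) / (m * m))


if __name__ == "__main__":
    print(estimate_dispersion([98, 103, 99]))   # close to Poisson
    print(estimate_dispersion([50, 100, 150]))  # clearly overdispersed
```

With only a few replicates such per-gene estimates are very noisy, which is one reason tools pool information across genes.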

SLIDE 22

RNA-seq analysis Tools for working with ngs alignments

Samtools

  • Program for working with NGS alignment files (SAM, BAM, CRAM)
  • Can be used to view data, calculate basic statistics, extract subsets of alignments and convert between file formats
  • http://www.htslib.org

SLIDE 23

RNA-seq analysis Tools for working with ngs alignments

Picard

  • A set of Java command-line tools with the same (or similar) functionality as Samtools
  • Note that even though they largely aim to do the same things, Picard and Samtools do not always generate compatible files
  • http://broadinstitute.github.io/picard/

SLIDE 24

RNA-seq analysis Tools for working with ngs alignments

Samtools tview, a text-based alignment viewer

$ samtools tview alignment.bam target.fasta

SLIDE 25

RNA-seq analysis Tools for working with ngs alignments

IGV: Integrative Genomics Viewer

SLIDE 26

RNA-seq analysis Tools for working with ngs alignments

IGV: Integrative Genomics Viewer

SLIDE 27

RNA-seq analysis Gene expression from RNA-seq

From counts to gene expression

SLIDE 28

RNA-seq analysis Gene expression from RNA-seq

From counts to gene expression

SLIDE 29

RNA-seq analysis Gene expression from RNA-seq

Not all reads are the same

from: http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
SLIDE 30

RNA-seq analysis Gene expression from RNA-seq

Normalized expression Values

Transcript-mapped read counts are normalized for both the length of the transcript and the total sequencing depth.
Count data are hence converted to reads (or fragments) per kilobase of transcript length per million mapped reads (RPKM or FPKM).
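As a worked example of this normalization, the function below computes RPKM from a raw count. It is a minimal sketch with an invented function name and invented numbers; FPKM is computed the same way with fragments (read pairs) in place of reads.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    Equivalent to count / (length in kb) / (total mapped reads in millions).
    """
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # 500 reads on a 2 kb transcript in a library of 10 million mapped reads
    print(rpkm(500, 2000, 10_000_000))
```

Here 500 reads / 2 kb / 10 million reads gives an RPKM of 25.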

SLIDE 31

RNA-seq analysis Gene expression from RNA-seq

Experimental design

SLIDE 32

RNA-seq analysis Gene expression from RNA-seq

Experimental design

  • Count reads (convert to RPKM/FPKM?)
  • Genes with a small number of reads (= low RPKM/FPKM values) are often non-significant
  • Remember that fold change is not the same as significance

Gene     Condition 1   Condition 2   Fold change   Significant?
Gene A   1             2             2-fold        No
Gene B   100           200           2-fold        Yes
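The difference between fold change and significance can be made concrete with a simple test on the two counts. The sketch below is a deliberate simplification and not the method of any particular tool: it assumes equal sequencing depth in both conditions and pure Poisson sampling noise (no biological variation), so that under the null hypothesis each read falls in either condition with probability 0.5.

```python
from math import comb


def two_sided_binomial_p(count_1, count_2):
    """Exact two-sided binomial p-value for two counts under a 50/50 null.

    Assumes equal library sizes and no biological variability,
    so it only illustrates the effect of count size.
    """
    n = count_1 + count_2
    k = min(count_1, count_2)
    # Probability of a split at least as extreme as k out of n, in one tail
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    print(two_sided_binomial_p(1, 2))      # Gene A: 2-fold, no evidence
    print(two_sided_binomial_p(100, 200))  # Gene B: 2-fold, strong evidence
```

Both genes show the same 2-fold change, but only the counts for Gene B are large enough to rule out sampling noise. Real analyses also need replicates to account for biological variation.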

SLIDE 33

RNA-seq analysis de-novo assembly

Major challenges in relation to genome assembly

  • Genes are expressed at different levels, hence coverage is uneven among genes
  • Many genes are expressed as several isoforms
  • As sequencing depth increases, the number of detected loci increases (what is actually expressed?)
  • Sequence errors from highly expressed genes might be seen more often than "true" sequences from lowly expressed genes

SLIDE 34

RNA-seq analysis de-novo assembly

Several programs available

  • SOAPdenovo-Trans
  • Oases
  • Trans-ABySS
  • Trinity
All of them use de Bruijn graphs to cope with the data, and many of them have been developed from a genome assembly program.
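For readers unfamiliar with de Bruijn graphs, the sketch below builds one from a set of reads: every k-mer becomes an edge from its first k-1 bases to its last k-1 bases. This is a bare-bones illustration with invented names; real assemblers also track k-mer coverage, handle reverse complements and prune error-induced branches.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph as {(k-1)-mer: set of successor (k-1)-mers}."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    for node, successors in de_bruijn_graph(["ACGT", "CGTA"], k=3).items():
        print(node, "->", sorted(successors))
```

Overlapping reads share k-mers, so they collapse onto the same path; alternative isoforms and sequencing errors show up as branches in the graph.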

SLIDE 35

RNA-seq analysis de-novo assembly

Trinity

SLIDE 36

RNA-seq analysis de-novo assembly

Trinity

SLIDE 37

RNA-seq analysis de-novo assembly

Summary - with ref.

  • Map to the genome with a mapper that allows for spliced alignment
  • If novel transcripts are of interest: use a method that can reconstruct transcripts from mapped reads (Cufflinks, Scripture or Bayesembler)
  • NB! In well-annotated genomes most reads should map to known genes
  • If the interest is expression of known genes/exons: use the available annotation for the analysis
  • Replicate, replicate....!

SLIDE 38

RNA-seq analysis de-novo assembly

Summary - without ref.

  • Assemble using your favourite assembler
  • Spend lots of time assessing the results (compare to related species, look for ORFs, etc.)
  • Often a large number of partial transcripts (hence often a large number of contigs)
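Looking for ORFs in assembled contigs can be sketched as follows. This is a simplified illustration with an invented function name: it scans only the three forward reading frames, requires an ATG start and an in-frame stop codon, and therefore misses ORFs on the reverse strand and ORFs truncated at the contig ends, which are common in partial transcripts.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(sequence):
    """Return the longest ATG-to-stop ORF in the three forward frames.

    Returns an empty string if no complete ORF is found.
    """
    sequence = sequence.upper()
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = sequence[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None
    return best


if __name__ == "__main__":
    print(longest_orf("CCATGAAATTTTAGGG"))
```

To cover both strands, the same function could be applied to the reverse complement of each contig.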
