A Quick Guide to the Analytics Behind Genomic Testing
Elaine Gee, PhD
Director, Bioinformatics
ARUP Laboratories
Learning Objectives
– Catalogue various types of bioinformatics analyses that support clinical genomic testing
– Enumerate types of variant classes
– Describe algorithmic methods for variant detection by NGS
– Compare and contrast germline and somatic clinical bioinformatics pipeline methodologies
– Discuss the infrastructure complexity required to support analytics for NGS testing at scale in the cloud
– Explain validation strategies for bringing best-in-class pipelines into clinical production
The Human Reference Genome
~3B base pairs, structured into 23 chromosome pairs
– 3,098,825,702 base pairs
– 20,805 coding genes
– 14,181 pseudogenes
– 196,501 gene transcripts
Why Genomic Testing?
KRAS G12D
cancer deaths are from lung cancer. ~222,500 new cases of lung cancer in the U.S. in 2017.
Short-Read Sequencers
– Illumina
– Ion-Torrent
Long-Read Sequencers
– PacBio
– NanoPore
– 10X
– Nanostring
Genomic Testing
Types of NGS Testing: Somatic & Germline
Types of NGS Testing: cfDNA and ctDNA
– Non-Invasive Prenatal Testing (NIPT): Trisomy 21 (Down Syndrome)
– Liquid Biopsy: Non-small cell lung cancer EGFR
Types of NGS Testing: Infectious Disease
Types of NGS Testing: RNA-Seq
Role of Clinical Bioinformatics
– Build pipelines
– Provide supplemental information for clinical interpretation and quality control
Other computationally heavy analytics are involved in evaluating:
– Design of new panels
– Identification of genetic patterns in patient cohorts
– Discovery of gene pathways
Steps in a bioinformatics pipeline:
Variant Calling Pipeline
Step 1: Sample Demultiplexing
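As a rough illustration of what demultiplexing involves, the sketch below assigns each read to a sample by comparing its index (barcode) read against a sample sheet, tolerating one mismatch. The barcodes, sample names, and the one-mismatch tolerance are hypothetical choices for this example, not values from the talk or from any vendor tool.

```python
from collections import defaultdict

# Hypothetical sample sheet: index (barcode) sequence -> sample name.
SAMPLE_SHEET = {"ACGTAC": "sample_A", "TTGCAA": "sample_B"}


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index_read, max_mismatches=1):
    """Return the one sample whose barcode is within max_mismatches, else 'undetermined'."""
    hits = [
        name
        for barcode, name in SAMPLE_SHEET.items()
        if len(barcode) == len(index_read)
        and hamming(barcode, index_read) <= max_mismatches
    ]
    # Ambiguous or unmatched index reads are not assigned to any sample.
    return hits[0] if len(hits) == 1 else "undetermined"


def demultiplex(reads):
    """Group (index_read, sequence) pairs by assigned sample."""
    bins = defaultdict(list)
    for index_read, sequence in reads:
        bins[assign_sample(index_read)].append(sequence)
    return dict(bins)


reads = [("ACGTAC", "ATCCTG"), ("ACGTAA", "ATCCCTG"), ("GGGGGG", "ATCCTG")]
by_sample = demultiplex(reads)
```

Here the second read carries a single barcode mismatch and is still assigned to sample_A, while the third matches nothing and falls into the undetermined bin.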
Step 2: Read Alignment
Read Alignment
SAM Format
[Figure: example short reads, e.g. ATCCTG and ATCCCTG, stacked in an alignment]
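To make the SAM format concrete, here is a minimal sketch that splits one alignment line into the 11 mandatory tab-separated fields defined by the SAM specification. The record itself (read name, chromosome, position) is invented for illustration; a production pipeline would more likely use an established library than hand-rolled parsing.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str  # read name
    flag: int   # bitwise flags (0x10 = reverse strand, 0x400 = PCR/optical duplicate)
    rname: str  # reference sequence name
    pos: int    # 1-based leftmost mapping position
    mapq: int   # mapping quality
    cigar: str  # alignment operations, e.g. 3M1I3M
    rnext: str  # reference name of the mate
    pnext: int  # position of the mate
    tlen: int   # observed template length
    seq: str    # read bases
    qual: str   # ASCII-encoded base qualities


def parse_sam_line(line):
    """Parse the 11 mandatory SAM fields; optional tags are ignored."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


# Hypothetical record: a 7-base read with a 1-base insertion (3M1I3M).
line = "read1\t0\tchr12\t100\t60\t3M1I3M\t*\t0\t0\tATCCCTG\tIIIIIII"
rec = parse_sam_line(line)
is_duplicate = bool(rec.flag & 0x400)
```

The CIGAR string 3M1I3M says the read matches three reference bases, carries one inserted base, then matches three more, which is one way the extra C in ATCCCTG could be represented.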
Step 3: BAM Polishing Steps
– PCR Duplicate Removal
– Base Quality Score Recalibration
Q30 Phred base quality score → 99.9% base call accuracy → 1 error in 1,000 bases
[Figure: reference vs. read at a homopolymer run (C C C C C); label "+1%"]
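The Q30 figure follows from the standard Phred definition, Q = -10 log10(P), where P is the probability that a base call is wrong. The short sketch below works through that arithmetic; the function names are illustrative, and the +33 ASCII offset assumes the common Sanger/Illumina 1.8+ quality encoding.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability back to a Phred quality score."""
    return -10 * math.log10(p)


def decode_quality_string(qual, offset=33):
    """Decode an ASCII quality string (assumed Phred+33) into integer scores."""
    return [ord(c) - offset for c in qual]


p = phred_to_error_prob(30)         # error probability at Q30
accuracy = 1 - p                    # corresponding base call accuracy
scores = decode_quality_string("?I")  # '?' and 'I' under Phred+33
```

With these definitions, Q30 corresponds to an error probability of about 0.001, matching the 1/1000 figure above.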
Step 4: Variant Calling by Class
SNV/Insertion/Deletion
(GATK Unified Genotyper, LoFreq)
(GATK Haplotype Caller)
(Graph Genome)
Duplications/Structural variants
(Manta, DELLY, CREST)
(ITD Assembler)
correction + principal component analysis (XHMM)
Example Variant Calling Algorithms
Example KRAS G12D Variant Call
Step 5: Variant Annotations
The annotated variant includes:
– Polymorphism
– Synonymous
– Non-synonymous
– Frameshift
The VCF variant includes:
– information (INFO) and individual format (FORMAT) fields
– filter flags
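To show where the VCF fields mentioned here live in a record, the following sketch parses one VCF data line into its fixed columns, INFO key/value pairs, FILTER flags, and per-sample FORMAT fields. The record is entirely made up (coordinates, depth, and allele fraction are not real KRAS values), and the parser covers only the simple cases needed for illustration.

```python
def parse_vcf_line(line):
    """Parse one VCF data line into a dict (simplified, illustration only)."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = cols[:8]

    info_fields = {}
    for item in info.split(";"):
        if item in ("", "."):
            continue
        key, sep, value = item.partition("=")
        # INFO keys without '=' are presence/absence flags.
        info_fields[key] = value if sep else True

    record = {
        "chrom": chrom,
        "pos": int(pos),
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),
        "qual": None if qual == "." else float(qual),
        # PASS or '.' means no filter flag was raised.
        "filters": [] if flt in (".", "PASS") else flt.split(";"),
        "info": info_fields,
        "samples": [],
    }
    if len(cols) > 9:
        keys = cols[8].split(":")  # FORMAT column names the per-sample fields
        record["samples"] = [dict(zip(keys, s.split(":"))) for s in cols[9:]]
    return record


# Hypothetical record, not real coordinates or values.
line = "chr12\t100\t.\tC\tT\t50\tPASS\tDP=200;AF=0.25\tGT:AD:DP\t0/1:150,50:200"
rec = parse_vcf_line(line)
```

Annotation tools then attach consequence labels such as synonymous, non-synonymous, or frameshift to records like this one, typically by adding further INFO entries.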
VCF variant Annotated variant
Sample Report QC metrics
Step 6: QC Calculations
Sequencing Run Level
– Cluster density
– Base call quality score
– Fragment size
Sample Level
– Depth coverage
– Uniformity
– Mapping quality
– Duplication rate
Variant Level
– Novel variants
– Known variants
– Transition-to-transversion ratio
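As an example of a variant-level QC metric, the sketch below computes the transition-to-transversion (Ts/Tv) ratio from a list of SNVs. Transitions are purine-to-purine (A<->G) or pyrimidine-to-pyrimidine (C<->T) changes; everything else is a transversion. The input list is invented, and the function names are illustrative only.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref, alt):
    """True when both bases are purines or both are pyrimidines."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(snvs):
    """Return Ts/Tv over (ref, alt) pairs, or None if there are no transversions."""
    ts = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        # Skip anything that is not a simple single-base substitution.
        if ref == alt or ref not in BASES or alt not in BASES:
            continue
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    return ts / tv if tv else None


# Hypothetical SNVs: three transitions and two transversions.
snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("G", "T")]
ratio = ts_tv_ratio(snvs)
```

A ratio far from what is expected for the assay type can hint at an excess of false-positive calls, which is why it is tracked as a QC metric.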
Sample-Level QC Metrics for Targeted Capture
[Figure: read depth over on-target and off-target regions; labels: Gene, Exon, Intronic regions]
– Minimum Depth of Coverage
– Uniformity
– Mapping Quality
– Duplication Rate
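The sample-level metrics listed above can be sketched from per-base depths over the target regions. The thresholds below are assumptions for illustration: a minimum depth of 100x, and uniformity defined as the percentage of target bases covered at 0.2x the mean depth or more. Laboratories define these cutoffs per assay, so treat both as placeholders.

```python
from statistics import mean


def coverage_metrics(depths, min_depth=100, uniformity_fraction=0.2):
    """Summarize per-base target depths (thresholds are illustrative assumptions)."""
    avg = mean(depths)
    n = len(depths)
    return {
        "mean_depth": avg,
        "min_depth": min(depths),
        # Percentage of target bases meeting the minimum depth requirement.
        "pct_at_min_depth": 100 * sum(d >= min_depth for d in depths) / n,
        # Percentage of target bases at or above a fraction of the mean depth.
        "uniformity_pct": 100 * sum(d >= uniformity_fraction * avg for d in depths) / n,
    }


def duplication_rate(duplicate_reads, total_reads):
    """Fraction of reads flagged as duplicates."""
    return duplicate_reads / total_reads


# Hypothetical per-base depths across a tiny target region.
depths = [120, 150, 90, 200, 10, 150]
metrics = coverage_metrics(depths)
dup = duplication_rate(duplicate_reads=150, total_reads=1000)
```

In this toy input the single position at 10x drags down both the minimum depth and the uniformity figure, which is the kind of dropout these metrics are meant to surface.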
Compute Infrastructure for Data Processing
[Figure: pipeline jobs, Job 1 through Job 5]
Data Storage Infrastructure
BCL (500–550 GB); FASTQs (12–15 GB)
Database; Object Storage “hot”; Archive “cold”
99.9% availability; 99.999999999% durability
Raw output for a single run; Exome ~150x; In-house; Cloud based
FASTQs, BAMs, VCFs
Bioinformatics HiSeq 4000
Bioinformatics Pipeline Validation
– 17 recommendation statements
– 59 variants tested in each variant class

– Positive percent agreement (PPA)
– Positive predictive value (PPV)
– Reproducibility
– Allelic fraction lower limit of detection
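Two of the validation metrics above reduce to simple ratios once pipeline calls are compared against a truth set: PPA = TP / (TP + FN) and PPV = TP / (TP + FP). The sketch below shows that comparison on a toy example; the variants are invented, and matching on exact (chrom, pos, ref, alt) tuples is a simplifying assumption, since real comparisons usually need variant normalization first.

```python
def compare_calls(truth, called):
    """Count true positives, false positives, and false negatives between two call sets."""
    tp = len(truth & called)
    fp = len(called - truth)
    fn = len(truth - called)
    return tp, fp, fn


def ppa(tp, fn):
    """Positive percent agreement: fraction of truth variants that were detected."""
    return tp / (tp + fn)


def ppv(tp, fp):
    """Positive predictive value: fraction of reported variants that are real."""
    return tp / (tp + fp)


# Hypothetical truth set and pipeline calls as (chrom, pos, ref, alt) tuples.
truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
         ("chr2", 50, "G", "GA"), ("chr3", 10, "T", "C")}
called = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
          ("chr2", 50, "G", "GA"), ("chr4", 5, "A", "T")}

tp, fp, fn = compare_calls(truth, called)
agreement = ppa(tp, fn)
predictive_value = ppv(tp, fp)
```

Computing these per variant class (SNV, insertion/deletion, structural variant, and so on) rather than in aggregate keeps a weak class from hiding behind a strong one.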
Summary
– Catalogue various types of bioinformatics analyses that support clinical genomic testing
– Enumerate types of variant classes
– Describe algorithmic methods for variant detection by NGS
– Compare and contrast germline and somatic clinical bioinformatics pipeline methodologies
– Discuss the infrastructure complexity required to support analytics for NGS testing at scale in the cloud
– Explain validation strategies for bringing best-in-class pipelines into clinical production
Elaine Gee, PhD
Director of Bioinformatics
ARUP Laboratories
elaine.gee@aruplab.com