Amyotrophic lateral sclerosis Dr Natalie Twine | Transformational - - PowerPoint PPT Presentation




SLIDE 1

Using Big Data technologies to uncover genetic causes of Amyotrophic lateral sclerosis

HEALTH & BIOSECURITY

11 October 2017 Dr Natalie Twine | Transformational Bioinformatics | @nat_twine

SLIDE 2

Genomics will outpace other BigData disciplines

Stephens et al. PLOS Biology 2015

Astronomy Twitter YouTube Genomics

Dr Natalie Twine | Big Data technologies to understand ALS | @nat_twine

SLIDE 3

Population-scale genomic data analysis requires BigData solutions

                          | Desktop compute | High-performance compute cluster | Hadoop/Spark compute cluster
Focus                     | small data      | Compute-intensive                | Data-intensive
Fault tolerant            | No              | No                               | Yes
Node-bound                | Yes             | Yes                              | No
Parallelization           | 10 CPU          | 100 CPU                          | 1000 CPU
Parallelization procedure | bespoke         | bespoke                          | standardized

CSIRO solution

SLIDE 4

Amyotrophic lateral sclerosis (ALS)

  • ALS is a devastating motor neurone disease
  • Leads to death within 3 years
  • Affects more than 200,000 people worldwide
  • Causes largely unknown – genetic component
  • Project MinE - sequencing 15,000 ALS genomes worldwide

SLIDE 5

What is our hypothesis?

  • Sporadic cases are potentially related but separated over generations
  • ALS is reported to be 5% familial and 95% sporadic
  • Familial component is potentially higher than 5%
  • Australia is a small population and disease is late onset

What are our aims?

  • Uncover hidden patient relationships to increase detection power
  • Identify novel disease-causing variants
  • Datasets available:
    – Exomes (Familial, n=137)
    – WGS (Sporadic, n=800)
    – Project MinE WGS (Sporadic, n=15,000)

[Figure labels: GOI, SNP A, SNP B, SNP C]

SLIDE 6

Existing methods for measuring relatedness

  • PLINK (Chang CC et al. GigaScience, 2015)
  • KING (Manichaikul A et al. Bioinformatics, 2010)
  • SNPduo, ERSA, GRAB, XIBD (etc.)
  • Limitations of these tools
    – Designed to identify and remove relatives as part of GWAS workflow
    – Identifying more distant relatives is challenging
    – Tools effective at distant relationship detection are SLOW
  • We want to expand existing family structures
    – Identify more distant relationships with confidence

SLIDE 7

Relatedness between ALS patients using KING

  • KING identifies close relationships
  • 172 Familial and Sporadic ALS Exomes

Each blue dot represents a relationship between a pair of ALS patients.
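
To make the degree labels concrete, here is a minimal Python sketch of how a KING kinship coefficient is commonly binned into a degree of relationship. The cut-offs follow the 2^-(d + 1.5) boundaries described in the KING documentation for duplicates through 3rd degree. The function name and label strings are hypothetical, and this is an illustration only, not the pipeline used to produce the plot.

```python
# Labels for d = 0..3; the lower boundary for each label is 2 ** -(d + 1.5):
# about 0.354, 0.177, 0.0884 and 0.0442.
LABELS = ["duplicate/MZ twin", "1st degree", "2nd degree", "3rd degree"]


def king_degree(kinship: float) -> str:
    """Bin a kinship coefficient into a relationship label (hypothetical helper)."""
    for d, label in enumerate(LABELS):
        if kinship > 2 ** -(d + 1.5):
            return label
    # Below the 3rd-degree boundary the estimate becomes hard to separate
    # from unrelated pairs, which is the difficulty the talk focuses on.
    return "more distant or unrelated"


if __name__ == "__main__":
    for phi in (0.5, 0.25, 0.125, 0.0625, 0.01):
        print(phi, "->", king_degree(phi))
```

The expected kinship for parent-child or full siblings is 0.25, for 2nd degree 0.125 and for 3rd degree 0.0625, so each value falls well inside its bin.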

SLIDE 8

Relatedness between ALS patients using KING

n=172 (137 Familial and 35 Sporadic)

  • KING identified 8 novel relationships at 3rd degree
  • 3 we have ruled out as false positives.
  • 5 are potentially real: they cannot be classified as false positives because no mutation status is known.

Degree of relationship | Number | True positives | False positives | Unknown
Duplicates             | 6      | 6 (100%)       |                 |
1st degree             | 33     | 33 (100%)      |                 |
2nd degree             | 23     | 23 (100%)      |                 |
3rd degree             | 27     | 12 (44%)       | 9 (33%)         | 6 (22%)
4th degree             | 1310   | n/a            | n/a             | n/a
5th degree             | 7852   | n/a            | n/a             | n/a

SLIDE 9

How can we improve on this result?

  • True positive rate at 3rd degree is 44%; this needs to improve
  • Whole Genome Sequencing >> Exome (SNP density)
  • WGS (n=800) cohort has 42 million variants
  • High-density data -> more informative
  • BUT existing tools struggle or fail with Big Data volumes
  • We are implementing relatedness testing in VariantSpark to identify novel relationships in:
    – 800 WGS Sporadic and Familial ALS
    – 15,000 WGS samples (Project MinE)
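
A rough sense of why data volume is the bottleneck: relatedness is computed per pair of samples, and the number of pairs grows quadratically with cohort size. The sketch below only does the arithmetic for the cohort sizes quoted in these slides; the helper name is hypothetical.

```python
def n_pairs(n_samples: int) -> int:
    """Number of distinct sample pairs: n * (n - 1) / 2."""
    return n_samples * (n_samples - 1) // 2


if __name__ == "__main__":
    cohorts = [("Exomes", 172), ("WGS", 800), ("Project MinE WGS", 15_000)]
    for label, n in cohorts:
        print(f"{label}: {n:,} samples -> {n_pairs(n):,} pairs")
    # Exomes: 172 samples -> 14,706 pairs
    # WGS: 800 samples -> 319,600 pairs
    # Project MinE WGS: 15,000 samples -> 112,492,500 pairs

    # Genotype calls in the WGS cohort: 800 samples x 42 million variants.
    print(f"{800 * 42_000_000:,} genotype calls")  # 33,600,000,000
```

Every pair has to be compared across all variants, so going from the exome cohort to Project MinE multiplies the pair count by several thousand before the higher variant density is even considered.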

SLIDE 10

BMC Genomics 2015, 16:1052 PMID: 26651996 (IF=4)

  • Bringing BigLearning to genomics applications.
  • VariantSpark learns from 3000 individuals and 80 million mutations in under 30 minutes
  • Association testing
  • Clustering
  • Classification

[Figure labels: Speed, Accuracy]

SLIDE 11

Using VariantSpark to identify relatedness

  • Data-driven rather than model-driven approach
  • VariantSpark can handle 80 million variants x 3000 individuals
  • What is the genetic distance between samples? (allele sharing)
  • Euclidean distance
  • Identity by descent (IBD) (as in PLINK)
  • Sliding window for IBD segments (as in ERSA)
  • Include data from 1000 Genomes as controls (family and ancestry known)
  • Approaches currently being tested for feasibility
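
As a small illustration of the first distance measure listed above, the sketch below computes Euclidean distance and a simple allele-sharing proportion between genotype vectors. It assumes genotypes are encoded as alternate-allele counts (0, 1 or 2) per variant; the sample names and genotype values are made up for the example, and none of this is the VariantSpark implementation.

```python
import math
from itertools import combinations


def euclidean(g1, g2):
    """Euclidean distance between two genotype vectors (0/1/2 encoding)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(g1, g2)))


def allele_sharing(g1, g2):
    """Proportion of alleles shared identical-by-state across sites.

    At each site the pair shares 2 - |a - b| alleles out of 2.
    """
    shared = sum(2 - abs(a - b) for a, b in zip(g1, g2))
    return shared / (2 * len(g1))


# Toy data (hypothetical): three samples, six variants.
genotypes = {
    "s1": [0, 1, 2, 1, 0, 2],
    "s2": [0, 1, 2, 1, 1, 2],
    "s3": [2, 0, 0, 1, 2, 0],
}

if __name__ == "__main__":
    for a, b in combinations(genotypes, 2):
        d = euclidean(genotypes[a], genotypes[b])
        s = allele_sharing(genotypes[a], genotypes[b])
        print(f"{a}-{b}: distance={d:.3f} sharing={s:.3f}")
```

Smaller distance and higher sharing both suggest closer relatedness, but neither is a calibrated IBD estimate; that is what the PLINK-style and segment-based approaches in the list add.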

SLIDE 12

Testing using different distance measures

Exomes (n=137 Familial ALS)

  • Euclidean distance performs well until 4th degree
  • PLINK (IBD) performs better than other distances

[Figure: Euclidean distance panel plots distance against degree of relationship (1, 2, 3, 4, 5, UR); (IBD) panel plots accuracy against degree of relationship.]

SLIDE 13

Effective methods are compute intensive

Ramstetter et al., Genetics 2017

  • IBD segment based methods most accurate for more distant relatives
  • BUT they are also the most compute-intensive (a.k.a. SLOW)
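
To show where the cost comes from, here is a deliberately simplified Python sketch of segment detection for one pair of samples. It treats a run of sites with no opposing homozygotes (0 versus 2) as a candidate shared segment. This heuristic is a stand-in chosen for brevity; it is not the method of ERSA or of any tool compared by Ramstetter et al., and the function name and `min_sites` parameter are hypothetical.

```python
def shared_segments(g1, g2, min_sites=3):
    """Return (start, end) index ranges, end exclusive, of candidate segments.

    A site is 'compatible' when the two genotypes (0/1/2 encoding) are not
    opposing homozygotes, i.e. the pair could share at least one allele.
    """
    n = min(len(g1), len(g2))
    segments = []
    start = None
    for i in range(n):
        compatible = abs(g1[i] - g2[i]) < 2
        if compatible and start is None:
            start = i                      # open a new run
        elif not compatible and start is not None:
            if i - start >= min_sites:     # keep runs that are long enough
                segments.append((start, i))
            start = None
    if start is not None and n - start >= min_sites:
        segments.append((start, n))        # close a run that reaches the end
    return segments


if __name__ == "__main__":
    a = [0, 1, 2, 2, 0, 1, 1, 0, 2]
    b = [0, 1, 1, 2, 2, 1, 0, 0, 1]
    print(shared_segments(a, b))  # [(0, 4), (5, 9)]
```

Even this naive scan touches every site once per pair, so the total work scales with the number of pairs times the number of variants; real segment-based methods add phasing or statistical modelling on top.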

SLIDE 14

Next steps in tool development

Identify novel relationships in Sporadic ALS WGS (n=800)
Proof of principle cohort: Familial ALS WGS (n=89)

  • Simulate a large pedigree using whole genome data
  • Implement these methods in VariantSpark
    – speed and scalability
  • Measure performance of different distance measures
    – sensitivity and specificity (AUC)
  • Calculate genetic distance between each sample
  • Generate relationship degree metrics from simulated cohort
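
The performance measures named above could be computed along the following lines once a simulated pedigree supplies the true relationships. This is a minimal sketch assuming each pair has a distance score (smaller means more related) and a known related/unrelated label; the function names and the toy values are hypothetical, and both classes are assumed to be non-empty.

```python
def sensitivity_specificity(distances, related, threshold):
    """Call a pair related when its distance is <= threshold."""
    tp = fn = fp = tn = 0
    for d, is_related in zip(distances, related):
        predicted = d <= threshold
        if is_related and predicted:
            tp += 1
        elif is_related:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def auc(distances, related):
    """Probability that a related pair has a smaller distance than an
    unrelated pair (ties count as half)."""
    pos = [d for d, r in zip(distances, related) if r]
    neg = [d for d, r in zip(distances, related) if not r]
    wins = sum(1.0 if p < n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy values (hypothetical), not results from the ALS cohorts.
    distances = [0.2, 0.4, 0.5, 0.7, 0.9]
    related = [True, True, False, True, False]
    print(sensitivity_specificity(distances, related, threshold=0.6))
    print(auc(distances, related))
```

Sensitivity and specificity depend on the chosen threshold, while AUC summarises ranking quality across all thresholds, which makes it convenient for comparing distance measures against each other.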
SLIDE 15

VariantSpark application – genetic association

Bone Mineral Density (BMD) as the phenotype; 1,936 individuals with 7.2 million variants (imputed from array).

  • Joint-loci analysis (machine learning - random forest)
  • Replicate known BMD genes identified by traditional GWAS (single loci regression).
  • Amplify signal over traditional methods so smaller cohorts give robust insights
  • Random forests identify interactions of 2 or more loci
  • We will use this methodology to identify novel & modulating ALS variants

SLIDE 16

In summary: BigLearning to understand ALS

  • Novel disease-causing variants
  • Preventative measures
  • Identify related individuals
  • Personalised treatment

SLIDE 17

Natalie Twine

Transformational Bioinformatics

Denis Bauer Oscar Luo Rob Dunne Piotr Szul

Team

Aidan O’Brien Laurence Wilson

Adrian White Mia Champion

Collaborators News Software Kaitao Lai

Ian Blair Kelly Williams Emily McCann Jenn Fifita
