SLIDE 1

Using Big Data technologies to uncover genetic causes of Amyotrophic lateral sclerosis

HEALTH & BIOSECURITY

11 October 2017 Dr Natalie Twine | Transformational Bioinformatics | @nat_twine

SLIDE 2

Genomics will outpace other BigData disciplines

Stephens et al. PLOS Biology 2015

[Figure labels: Astronomy, Twitter, YouTube, Genomics]

Dr Natalie Twine | Big Data technologies to understand ALS| @nat_twine

SLIDE 3

Population-scale genomic data analysis requires BigData solutions

                          | Desktop compute | High-performance compute cluster | Hadoop/Spark compute cluster
Focus                     | small data      | Compute-intensive                | Data-intensive
Fault tolerant            | No              | No                               | Yes
Node-bound                | Yes             | Yes                              | No
Parallelization           | 10 CPU          | 100 CPU                          | 1000 CPU
Parallelization procedure | bespoke         | bespoke                          | standardized

CSIRO solution

SLIDE 4

ALS is a devastating motor neurone disease

Amyotrophic lateral sclerosis (ALS)

Leads to death within 3 years
Affects more than 200,000 people worldwide
Causes largely unknown – genetic component
Project MinE - sequencing 15,000 ALS genomes worldwide

SLIDE 5

What is our hypothesis?

Sporadic cases are potentially related but separated over generations
ALS is reported to be 5% familial and 95% sporadic
Familial component is potentially higher than 5%
Australia is a small population and disease is late onset

What are our aims?

Uncover hidden patient relationships to increase detection power
Identify novel disease-causing variants

Datasets available:

– Exomes (Familial, n=137)
– WGS (Sporadic, n=800)
– Project MinE WGS (Sporadic, n=15,000)

[Figure labels: GOI, SNP A, SNP B, SNP C]

SLIDE 6

Existing methods for measuring relatedness

PLINK (Chang CC et al. GigaScience, 2015)
KING (Manichaikul A et al. Bioinformatics, 2010)
SNPduo, ERSA, GRAB, XIBD (etc.)

Limitations of these tools

– Designed to identify and remove relatives as part of GWAS workflow
– Identifying more distant relatives is challenging
– Tools effective at distant relationship detection are SLOW

We want to expand existing family structures

– Identify more distant relationships with confidence

SLIDE 7

Relatedness between ALS patients using KING

KING identifies close relationships
172 Familial and Sporadic ALS Exomes

Each blue dot represents a relationship between a pair of ALS patients.
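KING reports a kinship coefficient for each pair of samples and infers the relationship from fixed ranges. The sketch below maps a coefficient to a category using the ranges given in the KING documentation (>0.354, 0.177 to 0.354, 0.0884 to 0.177, 0.0442 to 0.0884). The function name is illustrative, and anything below the 3rd-degree range is simply labelled as unresolved here; how this talk assigned 4th and 5th degree is not stated in the slides.

```python
def degree_from_kinship(phi: float) -> str:
    """Map a KING kinship coefficient to a relationship category.

    Cut-offs follow the inference ranges in the KING documentation;
    each boundary is 2 ** -(k + 0.5) for k = 1, 2, 3, 4.
    """
    if phi > 0.354:
        return "duplicate/MZ twin"
    if phi > 0.177:
        return "1st degree"
    if phi > 0.0884:
        return "2nd degree"
    if phi > 0.0442:
        return "3rd degree"
    # Hypothetical label: more distant pairs are not resolved in this sketch.
    return "unrelated or more distant"


# Expected kinship values for textbook relationships.
for label, phi in [("parent-child", 0.25), ("half-sibs", 0.125), ("first cousins", 0.0625)]:
    print(label, "->", degree_from_kinship(phi))
```

The expected coefficients for parent-child (0.25), half-siblings (0.125) and first cousins (0.0625) fall in the middle of the 1st, 2nd and 3rd degree ranges respectively, which is why close relationships are easy to call and distant ones are not.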

SLIDE 8

Relatedness between ALS patients using KING

n=172 (137 Familial and 35 Sporadic)

KING identified 8 novel relationships at 3rd degree
3 we have ruled out as false positives
5 are potentially real, as they cannot be classified as false positives (mutation status unknown)

Degree of relationship | Number | True positives | False positives | Unknown
Duplicates             | 6      | 6 (100%)       |                 |
1st degree             | 33     | 33 (100%)      |                 |
2nd degree             | 23     | 23 (100%)      |                 |
3rd degree             | 27     | 12 (44%)       | 9 (33%)         | 6 (22%)
4th degree             | 1310   | n/a            | n/a             | n/a
5th degree             | 7852   | n/a            | n/a             | n/a
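The percentages in the 3rd-degree row can be reproduced from the counts. The sketch below does that arithmetic and, as an additional illustration not shown on the slide, computes the true positive share among pairs whose status is resolved (unknowns excluded).

```python
# Counts for the 3rd-degree row of the table above.
third_degree = {"true positives": 12, "false positives": 9, "unknown": 6}

total_pairs = sum(third_degree.values())
shares = {
    label: round(100 * count / total_pairs)
    for label, count in third_degree.items()
}

# Illustrative extra: share of true positives among resolved pairs only.
resolved = third_degree["true positives"] + third_degree["false positives"]
precision_resolved = third_degree["true positives"] / resolved

print(total_pairs, shares)
print(f"true positives among resolved pairs: {precision_resolved:.0%}")
```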

SLIDE 9

How can we improve on this result?

True positive rate at 3rd degree is 44%; this needs to improve
Whole Genome Sequencing >> Exome (SNP density)
WGS (n=800) cohort has 42 million variants
High-density data -> more informative
BUT existing tools struggle/fail with Big Data volumes
We are implementing relatedness testing in VariantSpark to identify novel relationships in
– 800 WGS Sporadic and Familial ALS
– 15,000 WGS samples (Project MinE)
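One reason relatedness testing becomes a Big Data problem is that the number of sample pairs grows quadratically with cohort size. The sketch below counts the pairwise comparisons for the three cohort sizes mentioned in this talk and the number of genotype calls in the 800-sample WGS cohort; the variable names are illustrative.

```python
from math import comb

# Cohort sizes quoted on the slides.
cohorts = {"ALS exomes": 172, "ALS WGS": 800, "Project MinE WGS": 15_000}

# Every unordered pair of samples needs a relatedness estimate.
pairs = {name: comb(n, 2) for name, n in cohorts.items()}

# Samples x variants for the WGS cohort (42 million variants).
genotype_calls_wgs = 800 * 42_000_000

for name, count in pairs.items():
    print(f"{name}: {count:,} pairs")
print(f"ALS WGS genotype matrix: {genotype_calls_wgs:,} calls")
```

Going from 172 to 15,000 samples raises the pair count from about fifteen thousand to over a hundred million, before the increase in variants per sample is taken into account.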

SLIDE 10

BMC Genomics 2015, 16:1052 PMID: 26651996 (IF=4)

Bringing BigLearning to genomics applications.
VariantSpark learns from 3000 individuals and 80 million mutations in under 30 minutes
– Association testing
– Clustering
– Classification

[Figure labels: Speed, Accuracy]

SLIDE 11

Using VariantSpark to identify relatedness

Data-driven rather than model-driven approach
VariantSpark can handle 80 million variants x 3000 individuals
What is the genetic distance between samples? (allele sharing)
– Euclidean distance
– Identity by descent (IBD) (as in PLINK)
– Sliding window for IBD segments (as in ERSA)
Include data from 1000 Genomes as controls (family and ancestry known)
Approaches currently being tested for feasibility
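Two of the simplest distance measures listed above can be sketched on toy data. The code assumes genotypes are coded as the count of alternate alleles (0, 1 or 2) per variant; the function names and the sample vectors are made up for illustration, and this is not the VariantSpark implementation. Note that the allele-sharing measure here is identity by state (IBS); PLINK's IBD estimates are derived from IBS together with allele frequencies, which this sketch does not attempt.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two genotype vectors coded 0/1/2."""
    if len(a) != len(b):
        raise ValueError("genotype vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ibs_proportion(a, b):
    """Mean proportion of alleles shared identical by state (0.0 to 1.0)."""
    if len(a) != len(b):
        raise ValueError("genotype vectors must have the same length")
    # At each site two samples share 2 - |x - y| of their 2 alleles.
    shared = sum(2 - abs(x - y) for x, y in zip(a, b))
    return shared / (2 * len(a))


# Toy genotypes (hypothetical): b duplicates a, c differs at most sites.
sample_a = [0, 1, 2, 1, 0, 2]
sample_b = [0, 1, 2, 1, 0, 2]
sample_c = [2, 1, 0, 1, 2, 0]

print(euclidean_distance(sample_a, sample_b), ibs_proportion(sample_a, sample_b))
print(euclidean_distance(sample_a, sample_c), ibs_proportion(sample_a, sample_c))
```

Both measures need only one pass over the genotype vectors per pair, which is what makes them easy to distribute; segment-based IBD methods need positional information along the genome and are correspondingly more expensive.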

SLIDE 12

Testing using different distance measures

Exomes (n=137 Familial ALS)

Euclidean distance performs well until 4th degree
PLINK (IBD) performs better than other distances

[Figure: distance by degree of relationship (1, 2, 3, 4, 5, UR) for Euclidean distance and IBD; accuracy by degree of relationship]

SLIDE 13

Effective methods are compute intensive

Ramstetter et al., Genetics 2017

IBD segment based methods most accurate for more distant relatives
BUT they are also the most compute-intensive (a.k.a. SLOW)

SLIDE 14

Next steps in tool development

Identify novel relationships in Sporadic ALS WGS (n=800)
Proof-of-principle cohort: Familial ALS WGS (n=89)

Simulate a large pedigree using whole genome data
Implement these methods in VariantSpark
-> speed and scalability

Measure performance of different distance measures
-> sensitivity and specificity (AUC)

Calculate genetic distance between each sample
Generate relationship degree metrics from simulated cohort
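Measuring a distance metric against a simulated pedigree comes down to asking how well the distance separates truly related pairs from unrelated ones. The sketch below computes sensitivity and specificity at a chosen threshold, and AUC as the probability that a related pair has a smaller distance than an unrelated pair. The distances and labels are made-up toy values, and the function names are illustrative.

```python
def sensitivity_specificity(distances, is_related, threshold):
    """Call a pair 'related' when its distance is <= threshold."""
    tp = fn = tn = fp = 0
    for distance, related in zip(distances, is_related):
        called_related = distance <= threshold
        if related and called_related:
            tp += 1
        elif related:
            fn += 1
        elif called_related:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def auc(distances, is_related):
    """AUC via pairwise comparison: related pairs should have smaller distances."""
    related = [d for d, r in zip(distances, is_related) if r]
    unrelated = [d for d, r in zip(distances, is_related) if not r]
    wins = sum(
        1.0 if r < u else 0.5 if r == u else 0.0
        for r in related
        for u in unrelated
    )
    return wins / (len(related) * len(unrelated))


# Toy data (hypothetical): one distance per sample pair, with the true label.
distances = [0.2, 0.4, 0.5, 0.7, 0.9, 1.0]
is_related = [True, True, False, True, False, False]

print(sensitivity_specificity(distances, is_related, threshold=0.6))
print(auc(distances, is_related))
```

The pairwise AUC is quadratic in the number of pairs and is used here only because it is easy to read; a rank-based computation gives the same value and would be the choice for cohort-scale data.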

SLIDE 15

VariantSpark application – genetic association

Bone Mineral Density (BMD) as the phenotype; 1,936 individuals with 7.2 million variants (imputed from array).

Joint-loci analysis (machine learning - random forest)

Replicate known BMD genes identified by traditional GWAS (single-locus regression).

Amplify signal over traditional methods so smaller cohorts give robust insights

Random forest identifies interactions between 2 or more loci

We will use this methodology to identify novel & modulating ALS variants
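Why joint-loci analysis can find signal that single-locus testing misses is easiest to see on an extreme toy case. In the sketch below the phenotype is the exclusive-or of carrier status at two loci, so each locus alone has zero covariance with the phenotype while the pair of loci predicts it perfectly. This is a hand-built illustration with made-up data, not a random forest and not the VariantSpark method.

```python
from itertools import product
from statistics import mean

# Toy cohort (hypothetical): carrier status at loci A and B, 25 copies of
# each combination; phenotype is 1 only when exactly one locus is carried.
samples = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2)] * 25

locus_a = [s[0] for s in samples]
locus_b = [s[1] for s in samples]
phenotype = [s[2] for s in samples]


def covariance(xs, ys):
    """Population covariance between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    return mean((x - mx) * (y - my) for x, y in zip(xs, ys))


# Single-locus view: neither locus is associated with the phenotype.
cov_a = covariance(locus_a, phenotype)
cov_b = covariance(locus_b, phenotype)

# Joint view: predict the majority phenotype within each (A, B) cell.
cells = {}
for a, b, y in samples:
    cells.setdefault((a, b), []).append(y)
majority = {cell: round(mean(ys)) for cell, ys in cells.items()}
joint_accuracy = mean(majority[(a, b)] == y for a, b, y in samples)

print(cov_a, cov_b, joint_accuracy)
```

Real interactions are rarely this clean, but the same logic applies: a tree that splits on locus A and then on locus B can isolate the four cells, whereas a per-locus regression only ever sees the marginal effect.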

SLIDE 16

In summary: BigLearning to understand ALS

Novel disease-causing variants
Preventative measures
Identify related individuals
Personalised treatment

SLIDE 17

Natalie Twine

Transformational Bioinformatics

Denis Bauer Oscar Luo Rob Dunne Piotr Szul

Team

Aidan O’Brien Laurence Wilson

Adrian White Mia Champion

Collaborators News Software Kaitao Lai

Ian Blair Kelly Williams Emily McCann Jenn Fifita
