[PPT] - Genome-wide association studies Fernando Rivadeneira MD PhD 1,2 1 PowerPoint Presentation

SLIDE 1

Genome-wide association studies

Fernando Rivadeneira MD PhD (1,2)

(1) Department of Internal Medicine

(2) Department of Epidemiology

SNPs and Diseases Molecular School of Medicine Monday, November 12th, 2018

SLIDE 2

Topic outline

Rationale GWAS Approach
Technology and QC
Study design
Study populations
Test for association
Population Stratification
Imputation (next talk)
Power
Phenotype definition
Follow-up GWAS signals

SLIDE 3

SLIDE 4

What is linkage disequilibrium (LD)?

Co-occurrence of alleles at distinct/adjacent loci more frequently than expected from the allele frequencies and recombination rate

Allelic association depends on:

1) physical distance (debate?)
2) population history of sample
3) age of mutation/allele

[Figure: haplotype diagram with labels SNP2-C, SNP1-G or A, SNP3-A and mutations G→A, C→T, G→A]

SLIDE 5

Identifying common variants associated with common traits and diseases usually relies on the principles of:

CD/CV (common disease, common variant)

Linkage disequilibrium mapping

[Figure: common, complex (association) traits; disease variant D (SNP, DIP, CNV) with markers M1 (SNP1) and M2 (SNP2)]

SLIDE 6

Linkage disequilibrium (LD) is the basis of the haplotype block structure

What is a haplotype?

Linear, ordered arrangement of alleles on a chromosome
Combination of alleles of different polymorphisms on a single chromosome

[Figure: region in LD, from ancestor to present-day chromosomes]

SLIDE 7

Genetic variation is structured into blocks of high LD:

SLIDE 8

LD statistics in practice

r² is inversely related to the sample size of genetic association studies (factor 1/r²)
r² = 1.0: 1,000 cases, 1,000 controls
r² = 0.80: 1,250 cases, 1,250 controls

D′ is related to recombination history
D′ ~ 1: no recombination
D′ < 1 (0.8): historical recombination

D′ and r² are complementary
D′ = 1 when r² is low (e.g. 0.02)
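The relation between D′ and r² can be checked numerically. The sketch below is a minimal illustration using the standard definitions of D, D′ and r² for two biallelic loci; the function name `ld_stats` and the example frequencies are assumptions for illustration, not values from the slides.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, D_prime, r2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A (locus 1) and B (locus 2)
    p_a, p_b: frequencies of allele A and allele B (both loci must be polymorphic)
    """
    d = p_ab - p_a * p_b  # deviation of the haplotype frequency from independence
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# Hypothetical example: a rare allele (2%) that always sits on a common (50%) background
d, d_prime, r2 = ld_stats(p_ab=0.02, p_a=0.02, p_b=0.5)
print(f"D = {d:.4f}, D' = {d_prime:.2f}, r2 = {r2:.3f}")
```

In this toy case no recombinant haplotype exists, so D′ is 1, while r² is only about 0.02 because the allele frequencies differ so much.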

SLIDE 9

Haplotype structure in the absence of recombination

In the absence of recombination, the shape of the tree and where mutations fall on it determine patterns of haplotype structure

Two mutations on the same branch will be in complete association; mutations on different branches will have lower and often low association

[Figure: example trees with r² = 0.04 and r² = 1]

SLIDE 10

LD information makes it possible to select variants that “tag” variation in haplotypes

SNPs (alleles): SNP 1 (A/T), SNP 2 (G/A), SNP 3 (G/C), SNP 4 (T/C), SNP 5 (G/C), SNP 6 (A/C)

Tags: SNP 1, SNP 3, SNP 6 (3 in total)
Test for association: SNP 1, SNP 3, SNP 6

[Figure: haplotype table with three groups of SNPs in high r²]

After Carlson et al. (2004) AJHG 74:106
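Tag selection by pairwise r² can be sketched as a greedy binning procedure in the spirit of the cited approach. This is a minimal sketch and not the published software: the haplotypes in `toy` are hypothetical (the slide's haplotype table did not survive extraction), and the names `r2`, `greedy_tags` and the threshold of 0.8 are assumptions.

```python
def r2(x, y):
    """Squared correlation between two SNPs coded 0/1 per chromosome."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    return cov * cov / (vx * vy)  # assumes both SNPs are polymorphic


def greedy_tags(snps, threshold=0.8):
    """Greedily pick tags; returns {tag: [SNPs captured at r2 >= threshold]}."""
    untagged = set(snps)
    bins = {}
    while untagged:
        def captured(tag):
            return {s for s in untagged if r2(snps[tag], snps[s]) >= threshold}
        # the candidate capturing the most untagged SNPs becomes the next tag
        best = max(sorted(untagged), key=lambda t: len(captured(t)))
        bins[best] = sorted(captured(best))
        untagged -= set(bins[best])
    return bins


# Hypothetical haplotypes (8 chromosomes): SNPs 1+2, 3+5 and 4+6 are perfectly correlated
toy = {
    "SNP1": [0, 0, 0, 0, 1, 1, 1, 1], "SNP2": [0, 0, 0, 0, 1, 1, 1, 1],
    "SNP3": [0, 0, 1, 1, 0, 0, 1, 1], "SNP5": [0, 0, 1, 1, 0, 0, 1, 1],
    "SNP4": [0, 1, 0, 1, 0, 1, 0, 1], "SNP6": [0, 1, 0, 1, 0, 1, 0, 1],
}
print(greedy_tags(toy))
```

On this toy data three tags are enough to cover all six SNPs. The sketch breaks ties by name, so it picks SNP 4 where the slide shows SNP 6; at r² = 1 the two choices are equivalent.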

SLIDE 11

LD information makes it possible to select variants that “tag” variation in haplotypes

Tags: SNP 1, SNP 3 (2 in total)
Test for association:
SNP 1 captures SNPs 1+2
SNP 3 captures SNPs 3+5
“AG” haplotype captures SNPs 4+6

[Figure: haplotype table for SNP 1 (A/T), SNP 2 (G/A), SNP 3 (G/C), SNP 4 (T/C), SNP 5 (G/C), SNP 6 (A/C)]

Tags in a multi-marker test should be in high LD in order to avoid overfitting

SLIDE 12

Properties underlying the haplotype-block structure

Regions of extensive linkage disequilibrium and reduced haplotype diversity
Within a block SNPs are not independent
Haplotype-tag SNPs (htSNPs) are the subset of SNPs that can capture most of the haplotype diversity

SLIDE 13

Genetic architecture fully determined by allele frequency and penetrance (effect size) of variants

Rivadeneira & Makitie TEM 2016

SLIDE 14

Genetic architecture of traits

[Figure: effect size (small to big) against frequency of the genetic variant (rare to common). Panel labels: rare, monogenic (linkage); common, complex (association); probably real (impossible to identify with current methods); few examples]

Modified from McCarthy et al., Nat Rev Genet 2008

Genome-wide association (GWA) combines the strongest properties of linkage (hypothesis-free) and association (power) designs

Hypothesis-free approach

SLIDE 15

Genome-wide association (GWA) has been facilitated by the advent of:

Of 3,000,000,000 bases in the human genome:
~10,000,000 positions show variation
~4,000,000 catalogued as common variation
~2,200,000 in CEU
~80-90% are captured by typing 500K markers

*from Mark McCarthy

SLIDE 16

SLIDE 17

SLIDE 18

Microarray technology makes it possible to genotype hundreds of thousands of SNPs per individual in a single effort…

[Figure: genotype calls (AA, AB, BB) per individual for SNP1, SNP2, SNP3 ... SNP500,000]

SLIDE 19

… which in the setting of large epidemiological studies allows the simultaneous testing of 2.5 million (imputed) markers for association with traits

SLIDE 20

This first step of the GWA approach is merely a hypothesis-generating phase (with very few exceptions)

[Figure: association results for SNP1 ... SNP500,000 plotted across chromosomes 1 to X]

SLIDE 21

The crucial step is replication, which allows building up evidence for association (genome-wide significance)

p < 0.05 threshold results in ~20,000 hypotheses
Follow-up set of top SNPs
Meta-analysis of full datasets

[Figure: association results for SNP1 ... SNP500,000 plotted across chromosomes 1 to X]

SLIDE 22

Only a selected number of SNPs is expected to achieve REPLICATION reaching a genome-wide significant level (i.e. 5 × 10⁻⁸)

Population stratification

SLIDE 23

Quality Control Genotyping

SLIDE 24

Rotterdam Study datasets QC methods description

SNP QC:
MAF > 1%
Call rate > 98%
pHWE > 1 × 10⁻⁶

GT SNPs:
512,849 RS-I
466,389 RS-II
514,073 RS-III

Imputed SNPs: 2,543,887

Sample QC (exclusions):
Sample call rate < 98%
Missing DNA
Gender mismatch
Excess autosomal heterozygosity
Duplicates or family relations (IBS > 97%)
Ethnic outliers (IBS distances > 4 SD)
Missing traits
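The three SNP-level filters on this slide (MAF, call rate, HWE p-value) can be sketched from genotype counts. This is a simplified illustration, not the Rotterdam Study pipeline: the function names are hypothetical, and the HWE test uses a 1-df chi-square approximation, whereas dedicated tools often use an exact test.

```python
import math


def hwe_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) test for Hardy-Weinberg equilibrium from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df


def passes_snp_qc(n_aa, n_ab, n_bb, n_missing,
                  min_maf=0.01, min_call_rate=0.98, min_hwe_p=1e-6):
    """Apply the slide's thresholds: MAF > 1%, call rate > 98%, pHWE > 1e-6."""
    called = n_aa + n_ab + n_bb
    if called == 0:
        return False
    call_rate = called / (called + n_missing)
    freq_a = (2 * n_aa + n_ab) / (2 * called)
    maf = min(freq_a, 1 - freq_a)
    return (maf > min_maf and call_rate > min_call_rate
            and hwe_pvalue(n_aa, n_ab, n_bb) > min_hwe_p)


# Hypothetical counts: a well-behaved SNP and one with no heterozygotes at all
print(passes_snp_qc(360, 480, 160, n_missing=0))
print(passes_snp_qc(500, 0, 500, n_missing=0))
```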

SLIDE 25

SLIDE 26

Types of study designs for common variants

Relatedness base
  Family (extended pedigrees, pedigrees, trios, sibs)
  Unrelated individuals
Sampling base
  Population-based
  Disease oriented (case/control, proband families)
Epidemiological base
  Case/control
  Cross-sectional
  Cohort (follow-up)
Phenotype base
  Case enrichment
  Extreme truncates
  Super/shared controls
Genetics base
  Genetic load enrichment
  Isolates (extended LD)
  Ethnicity
  Admixture
Genotype platform base
  Staged approach (Gen)
  Joint analysis (Imp)

SLIDE 27

Example types of GWA studies

Disease-oriented case/control studies
– WTCCC, FUSION

Disease-oriented population-based studies
– FRAMINGHAM HEART STUDY

Population-based studies
– ROTTERDAM STUDY
– Generation R STUDY

Mega-GWAS
– UKBIOBANK
– MVP

SLIDE 28

Rotterdam Study

CHARGE

GEnetic Factors of OSteoporosis

GENETIC INVESTIGATIONS OF ANTHROPOMETRIC TRAITS

Most (if not all) GWA activities occur within CONSORTIA summing tens to hundreds of thousands of participants

SLIDE 29

SLIDE 30

Different genetic models influence the power of the analysis but are difficult to determine a priori

=> To avoid multiple testing problems the first genetic analyses are usually run using additive models which preserve power across different scenarios
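An additive model codes each genotype as the count of one allele (0, 1 or 2) and tests a single slope. The sketch below is a minimal illustration for a quantitative trait using ordinary least squares; the function name, the choice of B as the counted allele and the toy data are assumptions, and a real analysis would include covariates.

```python
ADDITIVE_CODE = {"AA": 0, "AB": 1, "BB": 2}  # count of the B allele (assumed coding)


def additive_test(dosages, trait):
    """Least-squares fit of trait = a + b * dosage; returns (beta, se, z)."""
    n = len(dosages)
    mx = sum(dosages) / n
    my = sum(trait) / n
    sxx = sum((x - mx) ** 2 for x in dosages)
    sxy = sum((x - mx) * (y - my) for x, y in zip(dosages, trait))
    beta = sxy / sxx                      # per-allele effect
    alpha = my - beta * mx
    rss = sum((y - alpha - beta * x) ** 2 for x, y in zip(dosages, trait))
    se = (rss / (n - 2) / sxx) ** 0.5     # standard error of the slope
    return beta, se, beta / se


# Hypothetical data for six individuals
genotypes = ["AA", "AA", "AB", "AB", "BB", "BB"]
trait = [1.0, 1.2, 1.5, 1.7, 2.0, 2.2]
beta, se, z = additive_test([ADDITIVE_CODE[g] for g in genotypes], trait)
print(f"beta = {beta:.3f}, SE = {se:.3f}, z = {z:.2f}")
```

A single 1-df test per SNP keeps the number of tests down, which is the multiple-testing argument made on the slide.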

SLIDE 31

Statistical Methods

Traits:
Disease state or QT in natural units
QT -> standardized age-adjusted residuals from gender-stratified regression: Trait = α + β₁·Age + β₂·Age²

Imputation:
MACH, IMPUTE, BIM-BAM, PLINK
r² > 0.3, ratio Obs/Exp variance > 0.01, MAF > 0.01, HWE?
Minor allele from HapMap CEU (+) strand => reference

Analysis:
Performed by each cohort: MACH2QTL/BIN, SNPTEST, ProbABEL, PLINK
Adjustment for population stratification => genomic control λ < 1.05, corrected SE = SE × √λ

Meta-analysis:
METAL, PLINK, MetABEL: inverse variance weighted
Standard: fixed effects
Heterogeneity: random effects for variants with I² > 50%

Significance:
GWS α < 5 × 10⁻⁸ after double GC correction
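The inverse-variance weighted fixed-effects step and the I² heterogeneity measure can be sketched for a single variant as follows. This is a minimal illustration of the standard formulas, not the code of METAL, PLINK or MetABEL; the function name and the per-cohort numbers are hypothetical.

```python
import math


def fixed_effects_meta(betas, ses):
    """Inverse-variance weighted fixed-effects meta-analysis of one variant.

    Returns (pooled beta, pooled SE, z, I2 in percent).
    """
    weights = [1 / se ** 2 for se in ses]
    beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    # Cochran's Q and the I2 statistic derived from it
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return beta, se, beta / se, i2


# Hypothetical per-cohort estimates for the same SNP
beta, se, z, i2 = fixed_effects_meta([0.10, 0.12, 0.08], [0.02, 0.03, 0.025])
print(f"beta = {beta:.4f}, SE = {se:.4f}, z = {z:.2f}, I2 = {i2:.0f}%")
```

Variants whose I² exceeds the 50% cut-off named on the slide would then be re-analysed with a random-effects model, which is not shown here.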

SLIDE 32

SLIDE 33

SLIDE 34

Population Admixture

Admixture is the presence of (undetected) population substructure… by itself not a problem for association studies

SLIDE 35

Population Stratification

If disease prevalence or the distribution of the trait studied is associated with population substructure… it becomes a problem for association studies

SPURIOUS ASSOCIATION OF TRAIT WITH GENETIC ANCESTRY MARKERS

SLIDE 36

Methods to control for population stratification include: filtering out IBS outliers, genomic control (statistic/lambda) or principal components corrections
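The genomic control statistic can be sketched from genome-wide test statistics: lambda is the median observed chi-square divided by the median expected under the null, and standard errors are inflated by √λ. The function names are hypothetical, and leaving λ below 1 uncorrected is a common convention assumed here rather than something stated on the slides.

```python
import math
import statistics

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df


def genomic_control_lambda(z_scores):
    """Inflation factor: median observed chi-square over its null expectation."""
    chi2 = [z * z for z in z_scores]
    return statistics.median(chi2) / CHI2_1DF_MEDIAN


def gc_correct_se(se, lam):
    """Corrected SE = SE * sqrt(lambda); lambda below 1 is left uncorrected (assumption)."""
    return se * math.sqrt(max(lam, 1.0))


# Hypothetical z-scores from a (very small) scan
z_scores = [0.3, -1.1, 0.8, -0.5, 2.4, 0.7, -0.9, 1.6, -0.2]
lam = genomic_control_lambda(z_scores)
print(f"lambda = {lam:.3f}, corrected SE = {gc_correct_se(0.02, lam):.4f}")
```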

SLIDE 37

GWAS in admixed populations are prone to population structure, resulting in an increase in false-positive findings

In Rotterdam Study datasets... managed with exclusion of ~2-5% of the population
In the Generation R Study ~50% of participants are of non-Northern European ancestry
Early deviations denote spurious results… all genome associated
Correction for 4-20 PCs does the trick to correct for stratification

[Figure: observed vs. expected plots for RED HAIR COLOR in Generation R, contrasting true association with high stratification]

SLIDE 38

SLIDE 39

SLIDE 40

SLIDE 41

Study design will influence several of the factors determining the power of GWAS

                              TRUTH
GWA Study                     H0: No Association    HA: Association
Reject H0 (Association)       Alpha (α) error       OK
Accept H0 (No association)    OK                    Beta (β) error

Power (1-β) of a GWA study will depend on:

FIXED FACTORS
Allele frequency
Effect size
Linkage disequilibrium

MODIFIABLE FACTORS
Phenotype definition
Alpha level
Sample size
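How these factors combine can be sketched with a normal approximation for a quantitative trait under an additive model. This is an illustrative approximation, not the calculation used in the talk: the function name, the parameterisation of effect size in trait SD units per allele, and the example values are all assumptions.

```python
from statistics import NormalDist


def gwas_power(n, maf, beta_sd, r2=1.0, alpha=5e-8):
    """Approximate power of a two-sided 1-df test for one SNP.

    n: sample size; maf: allele frequency of the causal variant
    beta_sd: per-allele effect in trait SD units (assumed parameterisation)
    r2: LD between the typed marker and the causal variant
    """
    var_explained = 2 * maf * (1 - maf) * beta_sd ** 2
    ncp = n * r2 * var_explained / (1 - var_explained)  # non-centrality parameter
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    shift = ncp ** 0.5
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)


# Hypothetical scenario: MAF 30%, effect of 0.05 SD per allele
for n in (10_000, 50_000, 100_000):
    print(n, round(gwas_power(n, maf=0.3, beta_sd=0.05), 3),
          round(gwas_power(n, maf=0.3, beta_sd=0.05, r2=0.8), 3))
```

Allele frequency, effect size and LD enter through the non-centrality parameter, while the alpha level and sample size are the arguments the investigator can change.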

SLIDE 42

SLIDE 43

Power of a study is proportional to the degree of linkage disequilibrium between marker and real genetic variant

Sample size needs to be increased by a factor of 1/r²
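As a worked example of this factor, using the figures given earlier in the deck (1,000 cases at r² = 1.0 versus 1,250 cases at r² = 0.80), with the subscripts being labels introduced here for illustration:

```latex
N_{\text{marker}} = \frac{N_{\text{causal}}}{r^{2}}, \qquad
\frac{1000}{0.80} = 1250
```

Here N_causal is the sample size that would suffice if the causal variant itself were typed, and N_marker is the sample size needed when testing a marker in LD with it.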

SLIDE 44

SLIDE 45

SLIDE 46

How many tests should be taken into account?

Of 3,000,000,000 bases in the human genome:
~10,000,000 positions show variation
~4,000,000 catalogued as common variation
~2,200,000 in CEU
~80-90% are captured by typing 500K markers

Approaches: phased approach (“Genotyping”), joint meta-analysis (“Imputation”), rare AND common (“Sequencing”)

Significance threshold: 5 × 10⁻⁹, 5 × 10⁻⁸, 1 × 10⁻⁷, 5 × 10⁻⁷

[Figure: genotype calls (AA, AB, BB) per individual for SNP1, SNP2, SNP3 ... SNP500,000]
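The four thresholds on this slide are consistent with a Bonferroni correction of a 5% family-wise error rate for different numbers of independent tests. The test counts below are assumptions chosen to reproduce those thresholds; the slide does not state which count belongs to which approach.

```python
def bonferroni_threshold(n_tests, family_alpha=0.05):
    """Per-test significance threshold for a given family-wise alpha."""
    return family_alpha / n_tests


# Assumed numbers of independent tests (illustrative only)
for n_tests in (10_000_000, 1_000_000, 500_000, 100_000):
    print(f"{n_tests:>10,d} tests -> p < {bonferroni_threshold(n_tests):.0e}")
```

One million independent tests gives the conventional genome-wide significance level of 5 × 10⁻⁸.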

SLIDE 47

SLIDE 48

A real life example…

SLIDE 49

SLIDE 50

SLIDE 51

SLIDE 52

SLIDE 53

SLIDE 54

SLIDE 55

SLIDE 56

http://www.nealelab.is/blog/2017/7/19/rapid-gwas-of-thousands-of-phenotypes-for-337000-samples-in-the-uk-biobank
https://data.broadinstitute.org/alkesgroup/UKBB/
https://biobankengine.stanford.edu

SLIDE 57

SLIDE 58

Pimping up your GWAS studies

SLIDE 59

Diverse approaches to follow up GWAS findings:

LD-Hub
Pathway Analysis
Animal models
FineMap

SLIDE 60

Genetic correlations with other traits

SLIDE 61

Gene Prioritization and biological relevance of the variants

FINEMAP: efficient variable selection using summary data from genome-wide association studies

Requirements:
file with z-scores of the variants (beta effect / SE)
file with the correlations between the variants
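The two inputs named here can be computed from summary statistics and reference genotypes. The sketch below only derives the quantities themselves (z = beta / SE and pairwise correlations); it does not reproduce FINEMAP's file formats, and the function names and toy data are assumptions.

```python
def z_scores(betas, ses):
    """z-score per variant: effect estimate divided by its standard error."""
    return [b / se for b, se in zip(betas, ses)]


def correlation(x, y):
    """Pearson correlation between two genotype (dosage) vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5  # assumes both variants are polymorphic


def ld_matrix(genotypes):
    """Square matrix of pairwise correlations between variants."""
    return [[correlation(a, b) for b in genotypes] for a in genotypes]


# Hypothetical summary statistics and dosages for three variants in five samples
print(z_scores([0.10, -0.05, 0.02], [0.02, 0.025, 0.02]))
dosages = [[0, 1, 2, 1, 0], [0, 1, 2, 1, 1], [2, 1, 0, 0, 1]]
for row in ld_matrix(dosages):
    print([round(v, 2) for v in row])
```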

SLIDE 62

ENCODE ANALYSIS

MCF-7 CTCF ChIA-PET

SLIDE 63

Gene Prioritization and biological relevance

DEPICT: Data-driven Expression-Prioritized Integration for Complex Traits

SLIDE 64

Animal models: The mouse Phenotype Consortium

http://www.mousephenotype.org/

SLIDE 65

SLIDE 66

SLIDE 67

SLIDE 68

SLIDE 69

SLIDE 70

SLIDE 71

SNP2GENE: MAGMA analysis

SLIDE 72

Take home messages

The GWAS hypothesis-free approach based on HapMap/1000GP (through imputation) has been and will continue to be successful

Optimal study design is crucial for successfully applying the GWAS approach (control for stratification and other biases)

Replication of GWAS signals through meta-analysis achieves the highest level of evidence for true associations

Increasing sample size in high-quality sample collections will continue to be the favored approach

Effect sizes in complex diseases are modest, but this is not related to the biological relevance of the identified genes

As discoveries emerge, biologic clustering of identified loci becomes evident, revealing new biology and many translational opportunities

SLIDE 73