Genome-wide association studies - Fernando Rivadeneira MD PhD



slide-1
SLIDE 1

Genome-wide association studies

Fernando Rivadeneira MD PhD1,2

1Department of Internal Medicine

2Department of Epidemiology

SNPs and Diseases Molecular School of Medicine Monday, November 12th, 2018

slide-2
SLIDE 2
Topic outline

  • Rationale GWAS Approach
  • Technology and QC
  • Study design
  • Study populations
  • Test for association
  • Population Stratification
  • Imputation (next talk)
  • Power
  • Phenotype definition
  • Follow-up GWAS signals

slide-3
SLIDE 3

slide-4
SLIDE 4

What is linkage disequilibrium (LD)?

  • Co-occurrence of alleles at distinct/adjacent loci more frequently than expected from the allele frequencies and recombination rate
  • Allelic association depends on:
      1) physical distance (debate?)
      2) population history of the sample
      3) age of the mutation/allele

[Figure: alleles at three SNPs (SNP1: G or A; SNP2: C; SNP3: A) with mutations G→A, C→T and G→A]

slide-5
SLIDE 5

Identifying common variants associated with common traits and diseases typically relies on the principles of:

  • CD/CV (common disease/common variant): common, complex traits are studied by association
  • Linkage disequilibrium mapping

[Figure: disease variant D (SNP, DIP, CNV) and markers M1 (SNP1) and M2 (SNP2)]

slide-6
SLIDE 6

Linkage disequilibrium (LD) is the basis of the haplotype block structure

What is a haplotype?

  • Linear, ordered arrangement of alleles on a chromosome
  • Combination of alleles of different polymorphisms on a single chromosome

[Figure: ancestor and present-day chromosomes, with the region in LD marked]

slide-7
SLIDE 7

Genetic variation is structured into blocks of high LD:
slide-8
SLIDE 8
LD Statistics in practice

  • r² is inversely related to the sample size required in genetic association studies (factor 1/r²)
      – r² = 1.0: 1,000 cases and 1,000 controls
      – r² = 0.80: 1,250 cases and 1,250 controls
  • D´ is related to recombination history
      – D´ ~ 1: no recombination
      – D´ < 1 (e.g. 0.8): historical recombination
  • D´ and r² are complementary
      – D´ can equal 1 even when r² is low (e.g. 0.02)
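The statistics on this slide can be computed from haplotype and allele frequencies. The sketch below uses the standard definitions D = p_AB - p_A·p_B, D´ = |D| / D_max and r² = D² / (p_A(1 - p_A)·p_B(1 - p_B)). The function name and the example frequencies are illustrative assumptions, not values taken from the lecture.

```python
def ld_stats(p_ab: float, p_a: float, p_b: float) -> tuple[float, float, float]:
    """Return (D, D_prime, r2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A (locus 1) and allele B (locus 2)
    p_a, p_b: frequencies of allele A and allele B
    """
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom <= 0:
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b
    # D_max depends on the sign of D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / denom
    return d, d_prime, r2


# Hypothetical example: a rare allele (2%) that always sits on the same
# haplotype as a common allele (50%) at the second locus.
d, d_prime, r2 = ld_stats(p_ab=0.02, p_a=0.02, p_b=0.5)
print(f"D = {d:.4f}, D' = {d_prime:.2f}, r2 = {r2:.3f}")
```

In this hypothetical case D´ is 1 (no evidence of recombination between the two sites) while r² is only about 0.02, which illustrates why the two measures are complementary.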

slide-9
SLIDE 9
Haplotype structure in the absence of recombination

  • In the absence of recombination, the shape of the tree and where mutations fall on it determine patterns of haplotype structure
  • Two mutations on the same branch will be in complete association; mutations on different branches will have lower, and often low, association

[Figure: example trees with r² = 0.04 and r² = 1]

slide-10
SLIDE 10

LD information allows selecting variants that “tag” the variation in haplotypes

SNPs: 1 (A/T), 2 (G/A), 3 (G/C), 4 (T/C), 5 (G/C), 6 (A/C)

  • Tags: SNP 1, SNP 3, SNP 6 (3 in total)
  • Test for association: SNP 1, SNP 3, SNP 6

[Figure: haplotypes across the six SNPs, with three groups marked “high r²”]

After Carlson et al. (2004) AJHG 74:106

slide-11
SLIDE 11

LD information allows selecting variants that “tag” the variation in haplotypes

SNPs: 1 (A/T), 2 (G/A), 3 (G/C), 4 (T/C), 5 (G/C), 6 (A/C)

  • Tags: SNP 1, SNP 3 (2 in total)
  • Test for association:
      – SNP 1 captures SNPs 1+2
      – SNP 3 captures SNPs 3+5
      – the “AG” haplotype captures SNPs 4+6
  • Tags in a multi-marker test should be in high LD in order to avoid overfitting

[Figure: haplotypes across the six SNPs]
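Single-marker tag selection can be sketched as a greedy procedure over pairwise r². This is a simplified illustration in the spirit of r²-based tagging, not the exact algorithm of Carlson et al.; the haplotype matrix, the 0/1 allele coding and the 0.8 threshold are all assumptions made for the example, and the multi-marker (haplotype) tests mentioned on the slide are not covered.

```python
from itertools import combinations


def r2_from_haplotypes(haps, i, j):
    """Pairwise r2 between SNP i and SNP j from phased haplotypes coded 0/1."""
    n = len(haps)
    p_i = sum(h[i] for h in haps) / n
    p_j = sum(h[j] for h in haps) / n
    p_ij = sum(h[i] * h[j] for h in haps) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return 0.0  # monomorphic SNP: no LD information
    d = p_ij - p_i * p_j
    return d * d / denom


def greedy_tags(haps, threshold=0.8):
    """Map each chosen tag SNP (0-based index) to the SNPs it captures."""
    m = len(haps[0])
    partners = {i: {i} for i in range(m)}
    for i, j in combinations(range(m), 2):
        if r2_from_haplotypes(haps, i, j) >= threshold:
            partners[i].add(j)
            partners[j].add(i)
    untagged = set(range(m))
    tags = {}
    while untagged:
        # Pick the untagged SNP that captures the most untagged SNPs
        # (ties go to the lowest index).
        best = max(sorted(untagged), key=lambda s: len(partners[s] & untagged))
        tags[best] = sorted(partners[best] & untagged)
        untagged -= partners[best]
    return tags


# Hypothetical haplotypes over six SNPs (columns), not the ones on the slide.
haps = [
    (0, 0, 0, 0, 0, 0),
    (1, 1, 0, 0, 0, 0),
    (0, 0, 1, 0, 1, 0),
    (1, 1, 1, 1, 1, 1),
    (0, 0, 0, 1, 0, 1),
    (1, 1, 1, 0, 1, 0),
]
tags = greedy_tags(haps)
for tag, captured in tags.items():
    print(f"SNP {tag + 1} captures SNPs {[c + 1 for c in captured]}")
```

With this toy data three tags cover all six SNPs, so only three association tests are needed instead of six.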
slide-12
SLIDE 12
Properties underlying the haplotype-block structure

  • Regions of extensive linkage disequilibrium and reduced haplotype diversity
  • Within a block, SNPs are not independent
  • Haplotype-tag SNPs (htSNPs) are the subset of SNPs that can capture most of the haplotype diversity

slide-13
SLIDE 13

Genetic architecture fully determined by allele frequency and penetrance (effect size) of variants

Rivadeneira & Makitie TEM 2016

slide-14
SLIDE 14

Genetic architecture of traits

[Figure: effect size (small to big) against frequency of the genetic variant (rare to common)]

  • Rare variants, big effect: rare, monogenic (linkage)
  • Common variants, small effect: common, complex (association)
  • Rare variants, small effect: probably real (impossible to identify with current methods)
  • Common variants, big effect: few examples

Modified from McCarthy et al., Nat Rev Genet 2008

Genome-wide association (GWA) combines the strongest properties of linkage (hypothesis-free) and association (power) designs

Hypothesis-free approach

slide-15
SLIDE 15

Genome-wide association (GWA) has been facilitated by the advent of:

Of 3,000,000,000 bases in the human genome:

  • ~10,000,000 positions show variation
  • ~4,000,000 are catalogued as common variation
  • ~2,200,000 in CEU
  • ~80-90% are captured by typing 500K markers

*from Mark McCarthy

slide-16
SLIDE 16
slide-17
SLIDE 17

slide-18
SLIDE 18

Microarray technology allows genotyping hundreds of thousands of SNPs per individual in a single effort…

[Figure: genotype calls (AA, AB, BB) per individual for SNP1, SNP2, SNP3 … SNP500,000]

slide-19
SLIDE 19

… which in the setting of large epidemiological studies allows the simultaneous testing of 2.5 million (imputed) markers for association with traits


slide-20
SLIDE 20

This first step of the GWA approach is merely a hypothesis-generating phase (with very few exceptions)

[Figure: genotype calls (AA, AB, BB) for SNP1 … SNP500,000 and association results plotted by chromosome (1 to X)]

slide-21
SLIDE 21

The crucial step is replication, which allows building up evidence for association (genome-wide significance)

  • A p<0.05 threshold results in ~20,000 hypotheses
  • Follow-up set of top SNPs
  • Meta-analysis of full datasets

[Figure: genotype calls for SNP1 … SNP500,000 and association results plotted by chromosome (1 to X)]

slide-22
SLIDE 22

Only a selected number of SNPs is expected to achieve REPLICATION, reaching a genome-wide significant level (i.e. 5 × 10⁻⁸)

Population stratification

slide-23
SLIDE 23

Quality Control Genotyping

slide-24
SLIDE 24

Rotterdam Study datasets: QC methods description

SNP filters:
  • MAF > 1%
  • Call rate > 98%
  • pHWE > 1 × 10⁻⁶

Genotyped (GT) SNPs:
  • RS-I: 512,849
  • RS-II: 466,389
  • RS-III: 514,073

Imputed SNPs: 2,543,887

Sample exclusions:
  • Sample call rate < 98%
  • Missing DNA
  • Gender mismatch
  • Excess autosomal heterozygosity
  • Duplicates or family relations (IBS > 97%)
  • Ethnic outliers (IBS distances > 4 SD)
  • Missing traits
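The SNP-level filters listed for the Rotterdam Study (MAF > 1%, call rate > 98%, HWE p > 1 × 10⁻⁶) can be sketched for a single SNP as below. The function name, the genotype-count interface and the use of a 1-df chi-square test for HWE are assumptions for illustration; production pipelines such as PLINK typically use an exact HWE test instead.

```python
import math


def snp_qc(n_aa, n_ab, n_bb, n_missing,
           maf_min=0.01, call_rate_min=0.98, hwe_p_min=1e-6):
    """Apply MAF, call-rate and HWE filters to one SNP given genotype counts."""
    n_called = n_aa + n_ab + n_bb
    if n_called == 0:
        raise ValueError("no called genotypes")
    call_rate = n_called / (n_called + n_missing)
    p = (2 * n_aa + n_ab) / (2 * n_called)  # frequency of allele A
    maf = min(p, 1 - p)
    # Expected genotype counts under Hardy-Weinberg equilibrium
    expected = [n_called * p * p, 2 * n_called * p * (1 - p), n_called * (1 - p) ** 2]
    observed = [n_aa, n_ab, n_bb]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    hwe_p = math.erfc(math.sqrt(chi2 / 2))
    passed = maf > maf_min and call_rate > call_rate_min and hwe_p > hwe_p_min
    return {"maf": maf, "call_rate": call_rate, "hwe_p": hwe_p, "passed": passed}


# Hypothetical counts: one well-behaved SNP, one with only heterozygotes
# (a typical sign of a genotyping artefact).
good = snp_qc(n_aa=250, n_ab=500, n_bb=250, n_missing=5)
bad = snp_qc(n_aa=0, n_ab=1000, n_bb=0, n_missing=0)
print(good["passed"], bad["passed"])
```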

slide-25
SLIDE 25

slide-26
SLIDE 26
Types of study designs for common variants

  • Relatedness base
      – Family (extended pedigrees, pedigrees, trios, sibs)
      – Unrelated individuals
  • Sampling base
      – Population-based
      – Disease-oriented (case/control, proband families)
  • Epidemiological base
      – Case/control
      – Cross-sectional
      – Cohort (follow-up)
  • Phenotype base
      – Case enrichment
      – Extreme truncates
      – Super/shared controls
  • Genetics base
      – Genetic load enrichment
      – Isolates (extended LD)
      – Ethnicity
      – Admixture
  • Genotype platform base
      – Staged approach (genotyping)
      – Joint analysis (imputation)

slide-27
SLIDE 27

Example types of GWA studies

  • Disease-oriented case/control studies
      – WTCCC, FUSION
  • Disease-oriented population-based studies
      – FRAMINGHAM HEART STUDY
  • Population-based studies
      – ROTTERDAM STUDY
      – Generation R STUDY
  • Mega-GWAS
      – UK BIOBANK
      – MVP

slide-28
SLIDE 28

Rotterdam Study

CHARGE

GEnetic Factors of OSteoporosis

GENETIC INVESTIGATIONS OF ANTHROPOMETRIC TRAITS

Most (if not all) GWA activities occur within CONSORTIA, adding up to tens to hundreds of thousands of participants

slide-29
SLIDE 29

slide-30
SLIDE 30

Different genetic models influence the power of the analysis but are difficult to determine a priori

=> To avoid multiple-testing problems, the first genetic analyses are usually run using additive models, which preserve power across different scenarios
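The difference between genetic models comes down to how a genotype is turned into a predictor. A minimal sketch, assuming genotypes are stored as the number of copies of the tested allele (0, 1 or 2); the function name is hypothetical:

```python
def encode_genotype(n_alleles: int, model: str = "additive") -> int:
    """Code a genotype (0, 1 or 2 copies of the tested allele) under a genetic model."""
    if n_alleles not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1 or 2 allele copies")
    if model == "additive":
        return n_alleles            # each extra copy adds the same effect
    if model == "dominant":
        return int(n_alleles >= 1)  # one copy is enough
    if model == "recessive":
        return int(n_alleles == 2)  # two copies are required
    raise ValueError(f"unknown model: {model}")


for model in ("additive", "dominant", "recessive"):
    print(model, [encode_genotype(g, model) for g in (0, 1, 2)])
```

Testing only the additive coding keeps the analysis to one test per SNP instead of three, which is the multiple-testing argument made on the slide.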

slide-31
SLIDE 31

Statistical Methods

  • Traits: disease state or QT in natural units
      – QT → standardized age-adjusted residuals from gender-stratified regression: Trait = α + β₁·Age + β₂·Age²
  • Imputation: MACH, IMPUTE, BIM-BAM, PLINK
      – r² > 0.3, ratio Obs/Exp variance > 0.01, MAF > 0.01, HWE?
      – Minor allele from HapMap CEU (+) strand => reference
  • Analysis: performed by each cohort with MACH2QTL/BIN, SNPTEST, ProbABEL, PLINK
      – Adjustment for population stratification => genomic control λ < 1.05, corrected SE = SE × √λ
  • Meta-analysis: METAL, PLINK, MetABEL
      – Inverse variance weighted
      – Standard: fixed effects
      – Heterogeneity: random effects for variants with I² > 50%
  • Significance: GWS α < 5 × 10⁻⁸ after double GC correction
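The inverse-variance weighted fixed-effects meta-analysis named on this slide combines per-cohort effect estimates as β = Σwᵢβᵢ / Σwᵢ with wᵢ = 1/SEᵢ², and I² = max(0, (Q - df)/Q) summarises heterogeneity. The sketch below is a generic textbook implementation for a single variant, not the code of METAL or any other listed tool; the cohort estimates are made up.

```python
import math


def fixed_effects_meta(betas, ses):
    """Inverse-variance weighted fixed-effects meta-analysis of one variant.

    Returns (pooled beta, pooled SE, two-sided p-value, I2 in percent).
    """
    if len(betas) != len(ses) or not betas:
        raise ValueError("need one SE per beta")
    weights = [1.0 / se ** 2 for se in ses]
    w_sum = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / w_sum
    se = math.sqrt(1.0 / w_sum)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    # Cochran's Q and I2 heterogeneity
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return beta, se, p, i2


# Hypothetical per-cohort estimates for one SNP
beta, se, p, i2 = fixed_effects_meta(betas=[0.10, 0.08, 0.12], ses=[0.03, 0.02, 0.04])
print(f"beta = {beta:.3f}, SE = {se:.3f}, p = {p:.2e}, I2 = {i2:.0f}%")
```

Following the slide, a variant whose I² exceeds 50% would be re-analysed with a random-effects model, which is not implemented here.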

slide-32
SLIDE 32

slide-33
SLIDE 33


slide-34
SLIDE 34

Population Admixture

Admixture is the presence of (undetected) population substructure… by itself not a problem for association studies

slide-35
SLIDE 35

Population Stratification

If disease prevalence or the distribution of the trait studied is associated with population substructure… it becomes a problem for association studies

SPURIOUS ASSOCIATION OF TRAIT WITH GENETIC ANCESTRY MARKERS

slide-36
SLIDE 36

Methods to control for population stratification include: filtering out IBS outliers, genomic control (statistic/lambda) or principal components corrections
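Genomic control can be illustrated with a few lines: λ is the median of the observed 1-df chi-square statistics divided by the median expected under the null (about 0.455), and standard errors are inflated by √λ, matching the "corrected SE = SE × √λ" rule in the statistical methods of this lecture. The function names are hypothetical, and leaving results untouched when λ < 1 is a common convention assumed here rather than something stated in the slides.

```python
import math
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 df, derived from the normal quantile
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2  # about 0.455


def genomic_control_lambda(betas, ses):
    """Genomic inflation factor from per-SNP effect estimates and standard errors."""
    chi2 = [(b / s) ** 2 for b, s in zip(betas, ses)]
    return median(chi2) / CHI2_1DF_MEDIAN


def gc_correct_ses(ses, lam):
    """Inflate standard errors by sqrt(lambda); no correction when lambda < 1."""
    factor = math.sqrt(max(lam, 1.0))
    return [s * factor for s in ses]
```

In practice λ is estimated from all genome-wide test statistics of a cohort, and the corrected standard errors are what enter the meta-analysis.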
slide-37
SLIDE 37

  • In Rotterdam Study datasets, stratification is managed with exclusion of ~2-5% of the population
  • In the Generation R Study, ~50% of participants are of non-Northern European ancestry
  • Early deviations denote spurious results… all genome associated
  • Correction for 4-20 PCs does the trick to correct for stratification

[Figure: plots of observed against expected results for RED HAIR COLOR in Generation R; panels labelled “True association” and “High stratification”]

GWAS in admixed populations are prone to population structure, resulting in an increase in false positive findings

slide-38
SLIDE 38

slide-39
SLIDE 39


slide-40
SLIDE 40

slide-41
SLIDE 41

Study design will influence several of the factors determining the power of GWAS

                              TRUTH
GWA Study                     H0: No Association    HA: Association
Reject H0 (Association)       Alpha (α) error       OK
Accept H0 (No association)    OK                    Beta (β) error

Power (1-β) of a GWA study will depend on:

FIXED FACTORS
  • Allele frequency
  • Effect size
  • Linkage disequilibrium

MODIFIABLE FACTORS
  • Phenotype definition
  • Alpha level
  • Sample size
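How these factors combine can be sketched with a normal approximation for a 1-df test on a quantitative trait. The assumptions, none of which come from the slides, are: the non-centrality parameter is approximately N × (variance explained by the causal variant) × r², where r² is the LD between the typed marker and the causal variant, and the variance explained is small. The function name and the numbers are illustrative.

```python
from statistics import NormalDist


def gwas_power(n, variance_explained, r2=1.0, alpha=5e-8):
    """Approximate power of a two-sided 1-df association test.

    n: sample size
    variance_explained: fraction of trait variance explained by the causal variant
    r2: LD between the typed marker and the causal variant
    alpha: significance threshold
    """
    nd = NormalDist()
    ncp = n * variance_explained * r2   # approximate non-centrality parameter
    shift = ncp ** 0.5                  # mean of the Z statistic under HA
    z_crit = nd.inv_cdf(1 - alpha / 2)
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)


# Hypothetical variant explaining 0.5% of trait variance
for n in (5_000, 10_000, 20_000):
    print(n, round(gwas_power(n, 0.005), 3), round(gwas_power(n, 0.005, r2=0.8), 3))
```

Under this approximation a marker with r² = 0.8 needs 1/0.8 = 1.25 times the sample size to reach the power of the causal variant itself, for example 12,500 instead of 10,000 participants.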
slide-42
SLIDE 42

slide-43
SLIDE 43

Power of a study depends on the degree of linkage disequilibrium (r²) between the marker and the real genetic variant

Sample size needs to be increased by a factor of 1/r²

slide-44
SLIDE 44

slide-45
SLIDE 45


slide-46
SLIDE 46

How many tests should be taken into account?

Of 3,000,000,000 bases in the human genome:

  • ~10,000,000 positions show variation
  • ~4,000,000 are catalogued as common variation
  • ~2,200,000 in CEU
  • ~80-90% are captured by typing 500K markers

Approaches:

  • Phased approach: “Genotyping”
  • Joint meta-analysis: “Imputation”
  • Rare AND common: “Sequencing”

Significance threshold: 5 × 10⁻⁹, 5 × 10⁻⁸, 1 × 10⁻⁷, 5 × 10⁻⁷
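Each of these thresholds can be read as a Bonferroni correction of a 5% family-wise error rate for a different number of independent tests. The test counts below are round numbers assumed for illustration; the slide does not state which count belongs to which approach, although 5 × 10⁻⁸ is the conventional threshold corresponding to roughly one million independent common-variant tests.

```python
def bonferroni_threshold(n_tests: int, family_alpha: float = 0.05) -> float:
    """Per-test significance threshold that keeps the family-wise error rate at family_alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return family_alpha / n_tests


# Assumed numbers of independent tests (not taken from the slide)
for n_tests in (10_000_000, 1_000_000, 500_000, 100_000):
    print(f"{n_tests:>10,} tests -> threshold {bonferroni_threshold(n_tests):.0e}")
```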

slide-47
SLIDE 47

slide-48
SLIDE 48

A real life example…

slide-49
SLIDE 49

slide-50
SLIDE 50


slide-51
SLIDE 51

slide-52
SLIDE 52

slide-53
SLIDE 53

slide-54
SLIDE 54

slide-55
SLIDE 55
slide-56
SLIDE 56

http://www.nealelab.is/blog/2017/7/19/rapid-gwas-of-thousands-of-phenotypes-for-337000-samples-in-the-uk-biobank

https://data.broadinstitute.org/alkesgroup/UKBB/

https://biobankengine.stanford.edu

slide-57
SLIDE 57
slide-58
SLIDE 58

Pimping up your GWAS studies

slide-59
SLIDE 59

Diverse approaches to follow-up GWAS findings:

  • LD-Hub
  • Pathway Analysis
  • Animal models
  • FineMap

slide-60
SLIDE 60

Genetic correlations with other traits

slide-61
SLIDE 61

Gene Prioritization and biological relevance of the variants

FINEMAP: efficient variable selection using summary data from genome-wide association studies

Requirements:

  • file with z-scores of the variants (beta effect / SE)
  • file with the correlations between the variants
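The z-scores required as input are simply the effect estimate divided by its standard error. A minimal sketch with made-up summary statistics; the SNP identifiers and values are hypothetical, and the exact input file layout expected by FINEMAP is deliberately not reproduced here:

```python
import math


def z_score(beta: float, se: float) -> float:
    """z-score of a variant: effect estimate divided by its standard error."""
    if se <= 0:
        raise ValueError("standard error must be positive")
    return beta / se


def two_sided_p(z: float) -> float:
    """Two-sided p-value of a z-score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical GWAS summary statistics: (SNP id, beta, SE)
summary_stats = [
    ("rs_hypothetical_1", 0.12, 0.02),
    ("rs_hypothetical_2", -0.05, 0.02),
    ("rs_hypothetical_3", 0.01, 0.02),
]
z_scores = {snp: z_score(beta, se) for snp, beta, se in summary_stats}
for snp, z in z_scores.items():
    print(f"{snp}\tz = {z:.2f}\tp = {two_sided_p(z):.2e}")
```

The second input, the matrix of correlations between variants, is the signed LD (r, not r²) estimated from a reference panel or from the study genotypes.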

slide-62
SLIDE 62

ENCODE ANALYSIS

MCF-7 CTCF ChIA-PET

slide-63
SLIDE 63

Gene Prioritization and biological relevance

  • DEPICT: Data-driven Expression-Prioritized Integration for Complex Traits
slide-64
SLIDE 64

Animal models: the International Mouse Phenotyping Consortium

http://www.mousephenotype.org/

slide-65
SLIDE 65
slide-66
SLIDE 66
slide-67
SLIDE 67
slide-68
SLIDE 68
slide-69
SLIDE 69
slide-70
SLIDE 70
slide-71
SLIDE 71


SNP2GENE: MAGMA analysis

slide-72
SLIDE 72
Take home messages

  • The GWAS hypothesis-free approach based on HapMap/1000GP (through imputation) has been and will continue to be successful
  • Optimal study design is crucial for successfully applying the GWAS approach (control for stratification and other biases)
  • Replication of GWAS signals through meta-analysis achieves the highest level of evidence for true associations
  • Increasing sample size in high-quality sample collections will continue to be the favored approach
  • Effect sizes in complex diseases are modest, but this is not related to the biological relevance of the identified genes
  • As discoveries emerge, biologic clustering of identified loci becomes evident, revealing new biology and many translational opportunities

slide-73
SLIDE 73

Acknowledgements