Searching for the genetic basis of complex traits in humans and primates
Vasily Ramensky
UCLA Center for Neurobehavioral Genetics
22/03/16
University of California Los Angeles Center for Neurobehavioral Genetics
- - 35-year project aimed to decrease the global economic
and health impact of depression by 50% by 2050
- - 100,000 individuals to be enrolled
- - The largest UCLA research initiative thus far, with an
anticipated budget of $525 million for the first 10 years
Projects
Finnish Metabolic Sequencing: genetic basis of quantitative metabolic traits in the Finnish population
- - Targeted gene sequencing in >6,000 NFBC1966 members
- - Whole exome sequencing in 20,000 individuals
Vervet monkeys: non-human primates in biomedical research
- - Whole genome sequencing of >700 members of Vervet Research Colony
Tourette Syndrome: genetic basis of Tourette Syndrome
- - Exome and targeted sequencing of >100 members of large TS pedigrees
- - GWAS studies of large TS cohorts
Bipolar disorder: genetic factors that contribute to risk for bipolar disorder
- - Whole genome sequencing of 450 members of large pedigrees from
Colombia and Costa Rica with a severe form of bipolar disorder
Finnish Metabolic Sequencing
- - Founder population, settled Northern Finland in the 1600s
- - Genetic isolate, homogeneous in genetic and environmental
background, enriched in potentially damaging variants
- - Birth cohort: no age as a confounder; longitudinal data
- - Quantitative heritable traits: body mass index, fasting serum
concentrations of lipids, glucose and insulin, inflammation (CRP), blood pressure
Finnish Metabolic Sequencing
Northern Finland Birth Cohort 1966
Finnish Metabolic Sequencing
GWAS in NFBC66: Sabatti et al., 2009
31 associations with 6 traits, 9 of them previously unreported
Identified loci explained little of trait variability => contribution of rare variants?
Genetic architecture of complex traits
- - 78 genes in 6,121 samples, 17 loci on 10 chromosomes
- - 2,234 variants, 76% with MAF <= 0.5%
- - Single-variant tests: variants with MAF > 0.1% in an additive
genetic model
- - Gene-level tests: missense variants with MAF < 1%
- - Goal: new single-variant signals independent from GWAS,
or associations at the gene level
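The two frequency thresholds above can be illustrated with a minimal sketch. It assumes diploid samples and a simple per-variant record; the `Variant` fields and the gene names used with it are hypothetical, and the actual association tests are not shown.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str              # hypothetical gene label
    consequence: str       # e.g. "missense", "synonymous"
    alt_allele_count: int  # alternative alleles observed
    n_samples: int         # diploid samples with a genotype call


def minor_allele_frequency(v: Variant) -> float:
    """Frequency of the less common allele among 2N chromosomes."""
    alt_freq = v.alt_allele_count / (2 * v.n_samples)
    return min(alt_freq, 1.0 - alt_freq)


def split_for_tests(variants):
    """Select variants for single-variant and gene-level tests.

    Thresholds follow the slide: MAF > 0.1% for single-variant tests,
    missense variants with MAF < 1% for gene-level tests.
    """
    single = [v for v in variants if minor_allele_frequency(v) > 0.001]
    by_gene = {}
    for v in variants:
        if v.consequence == "missense" and minor_allele_frequency(v) < 0.01:
            by_gene.setdefault(v.gene, []).append(v)
    return single, by_gene
```

Note that a missense variant with MAF between 0.1% and 1% falls into both sets under these thresholds.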
Finnish Metabolic Sequencing
Targeted sequencing in NFBC66 and FUSION
Why?
- - Insertions and deletions
- - Epistatic interactions
- - Compound heterozygotes
- - Testing all rare missense variants
- - Non-coding regulatory variants
Tourette Syndrome
- - An inherited disorder, childhood onset (prevalence 0.4-3.8%)
- - Multiple physical (motor) and vocal tics
- - Linkage studies of large families: genetic signal on chr2p
- - No significant associations for coding exome variants
- - Exome + targeted non-coding regions on chr2p in 109
individuals from 15 large TS families (65 affected, 35 not affected, 9 unknown)
- - Genotyping of candidate variants in >700 individuals from
sib-pair families (UCLA)
- - GWAS studies in multiple cohorts
Tourette Syndrome
Candidate variants in the chr2p region
Pos, Mbp | Region | dbSNP AAF | Idx | Segregation Aff (Fam) | Chi2 | Epigenomic info
59.1 | FLJ30838 FunSeq enhancer | 0.91% | 5 | 9 (4) | 0.30 | Enh H9 Neuronal Progen Cells (REMC)
60.5 | AC007381 Intron | 0.78% | 2 | 8 (3) | 0.04 | Fetal Brain (REMC)
60.8 | N/A | 9.4% | | 30 (10) | 0.001 | LBL enh
// Idx: conserv. mammals, primates, CADD, DANN, fatHMM-mkl
Jeremiah Scharf, Dongmei Yu
Tourette Syndrome
Figure: LINC01122 and BCL11A expression in BrainSpan (RNA-seq in 524 prenatal and postnatal samples), by time point and brain region
Tourette Syndrome
Figure: normalized expression, (X-Xmean)/Xstdev, of LINC01122 vs. BCL11A in BrainSpan (RNA-seq in 524 prenatal and postnatal samples; time points, brain regions); Rcorr=0.723
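The normalization and correlation named on this slide can be sketched as follows. This is a minimal illustration, assuming the sample standard deviation (n-1 denominator) and Pearson correlation; the slide does not state which variants were actually used.

```python
import math


def zscore(values):
    """Normalize to (X - Xmean) / Xstdev, using the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return [(x - mean) / sd for x in values]


def pearson(xs, ys):
    """Pearson correlation as the scaled dot product of two z-scored profiles."""
    zx, zy = zscore(xs), zscore(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)
```

Here `xs` and `ys` would be the expression values of two genes across the same samples; both profiles must be non-constant, otherwise the standard deviation is zero.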
Tourette Syndrome
Annotation of “anonymous” lincRNA
1) Search for genes coexpressed with query Q:
- - Threshold: genes with Rcorr > R0
- - Forward: genes in Q’s top x% // contaminated by
“promiscuous” genes
- - Reverse: genes for which Q is in top x%
- - Reverse-back-reverse (Gene’s best friends by Sasha
Favorov)
2) Check enriched GO terms for top ranked genes
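The difference between forward and reverse ranking can be made concrete with a small sketch. It assumes Pearson correlation over a dictionary of expression profiles and covers only the forward and reverse selections; it does not implement the reverse-back-reverse ("Gene's best friends") procedure, and all function names are illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranked_partners(expr, gene):
    """All other genes, sorted by decreasing correlation with `gene`."""
    others = [g for g in expr if g != gene]
    return sorted(others, key=lambda g: pearson(expr[gene], expr[g]), reverse=True)


def top_count(n_partners, top_fraction):
    """Number of partners that make up the top x% (at least one)."""
    return max(1, math.ceil(n_partners * top_fraction))


def forward_hits(expr, query, top_fraction):
    """Genes that are in the query's own top x%."""
    ranked = ranked_partners(expr, query)
    return ranked[:top_count(len(ranked), top_fraction)]


def reverse_hits(expr, query, top_fraction):
    """Genes for which the query is in *their* top x%."""
    hits = []
    for gene in expr:
        if gene == query:
            continue
        ranked = ranked_partners(expr, gene)
        if query in ranked[:top_count(len(ranked), top_fraction)]:
            hits.append(gene)
    return hits
```

A gene that correlates with almost everything tends to show up in many forward lists, but the query will rarely be among its own top partners, which is why the reverse list is less affected by such genes.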
Tourette Syndrome
GO annotations for reverse and forward ranks
Sequencing in the VRC
N ~ 2 x 10^4
Vervet Research Colony
Non-human primates vs. humans and rodents
- - Low sequence divergence, syntenic blocks
- - Phenotypic similarity (brain/behavior, infectious
diseases, metabolism)
- - Invasive studies are possible
- - Controlled environment
- - Longitudinal approaches are possible
Sequencing in the VRC
Examples of available phenotypes:
- - Brain and behavior: MRI, CSF monoamines, novelty
seeking, intruder challenge, anxiety, mother-infant interaction, sleep/circadian rhythms, cortisol, oxytocin
- - Metabolism and growth: lipids, glycemic measures,
adipokines/leptin, vitamin D, morphometry (BMI)
- - Microbiome at multiple body sites
- - Life history traits and disease history
- - RNA-seq: eQTLs from multiple tissues
Sequencing in the VRC
Non-human primates vs. humans and rodents
- - No reference datasets (dbSNP, ENCODE, etc.)
- - Not all tools work for highly inbred populations
Sequencing in the VRC
Blue: Founders. Orange: sequenced monkeys, size ~ coverage
Sequencing in the VRC
- - WGS of >700 samples with varying coverage (1-30x)
- - Reference genome C. sabaeus 1.1: 29 + 2 chromosomes
Workflow:
- - Raw variant calling with GATK, genotype refinement in trios
- - Postprocessing: genotype conflicts, Mendelian errors, low quality
- - Phasing with Beagle in 99 samples: 82 high-coverage (HC) + 17 low-coverage (LC)
- - Phasing and imputation in 620 LC samples, with the 99 as reference haplotypes
- - Postprocessing: Mendelian errors, QC, quality flags
- - Two independent call sets: 16.7 million SNVs genome-wide,
1.3 million extended exome SNVs and indels
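As an illustration of the Mendelian error checks in the postprocessing steps, here is a minimal sketch for a single biallelic autosomal site. Genotypes are coded as alternative-allele counts (0, 1, 2); the coding and function names are assumptions for this example and do not reproduce what GATK or Beagle do internally.

```python
def transmittable_alleles(genotype):
    """Alleles (0 = ref, 1 = alt) that a parent with this alt-allele count can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def is_mendelian_consistent(father, mother, child):
    """True if the child's genotype can arise from one allele of each parent."""
    return any(
        a + b == child
        for a in transmittable_alleles(father)
        for b in transmittable_alleles(mother)
    )


def mendelian_error_rate(trio_genotypes):
    """Fraction of (father, mother, child) genotype triples that are inconsistent.

    Triples with missing calls are assumed to have been removed beforehand.
    """
    trios = list(trio_genotypes)
    if not trios:
        return 0.0
    errors = sum(not is_mendelian_consistent(f, m, c) for f, m, c in trios)
    return errors / len(trios)
```

For example, two homozygous-reference parents (0, 0) cannot have a heterozygous child (1), so that triple counts as an error.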
Sequencing in the VRC
NR annotation | Variants | %
Upstream-1000 | 325,968 | 23.8
Downstream-1000 | 284,953 | 20.8
Intron | 174,171 | 12.7
3-UTR | 167,523 | 12.2
Non-coding | 144,099 | 10.5
5-UTR | 102,395 | 7.5
Synon | 79,477 | 5.8
Missense | 75,436 | 5.5
Coding-exon-indel | 10,325 | 0.8
Stop-gain | 1,514 | 0.1
Donor | 1,352 | 0.1
Acceptor | 1,191 | 0.1
Stop-loss | 187 | 0.0
Total | 1,368,591 |

Type | Variants | %
COMPLEX | 50,502 | 3.7
DEL | 133,861 | 9.8
INS | 69,993 | 5.1
SNV | 1,114,235 | 81.4
Sequencing in the VRC
Variant annotation
Alternative allele count distributions by type
Sequencing in the VRC
Constrained human genes in vervets
- - ExAC: exomes in 60,706 humans
- - 3,230 genes depleted of PTVs (protein-truncating
variants: indels, splice site, stop gain)
- - 3,118 constrained genes (96.5%) have vervet orthologs
- - Of them, 1,256 vervet genes harbor 2,212 PTVs (out of
13,665 in total)
- - Genes with multiple PTVs: not constrained in vervets?
Genes with few PTVs: check respective phenotypes
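The counting behind these numbers can be sketched as below. This is an illustration only: the input structures, the function name, and the `many` cutoff separating "multiple" from "few" PTVs are hypothetical and not taken from the talk.

```python
from collections import Counter


def summarize_ptvs(constrained_human_genes, human_to_vervet, ptv_genes, many=3):
    """Tally vervet PTVs that fall in orthologs of constrained human genes.

    constrained_human_genes: iterable of human gene IDs
    human_to_vervet: dict mapping a human gene ID to its vervet ortholog
    ptv_genes: one vervet gene ID per observed PTV
    many: hypothetical cutoff for "multiple PTVs"
    """
    # Constrained genes without a vervet ortholog are dropped here.
    orthologs = {
        human_to_vervet[g] for g in constrained_human_genes if g in human_to_vervet
    }
    # One entry per PTV, so the counter holds PTVs per ortholog.
    counts = Counter(g for g in ptv_genes if g in orthologs)
    return {
        "orthologs": len(orthologs),
        "genes_with_ptv": len(counts),
        "ptvs_in_orthologs": sum(counts.values()),
        "multiple": sorted(g for g, n in counts.items() if n >= many),
        "few": sorted(g for g, n in counts.items() if n < many),
    }
```

The "multiple" list corresponds to candidates for genes not constrained in vervets, and the "few" list to genes whose carriers would be checked for phenotypes.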
Sequencing in the VRC
Unconstrained genes with many PTVs
Sequencing in the VRC
Constrained genes with many PTVs
Sequencing in the VRC
Alt allele counts for PTVs
Sequencing in the VRC
New methods to interpret genome variation
New methods to interpret variation
Protein-truncating variants: why are they tolerated?
Data:
- - ExAC: ~60,000 human exomes
- - Vervets: ~15,000 PTVs in 719 exomes
- - Available microexon data
Approach
- - Protein structure: models and features
New methods to interpret variation
Good old missense variants
Motivation?
- - Prediction targeted at specific protein families
- - Need to explain the mechanism
- - Account for intragenic compensation
- - Traditional training sets need revision
Data:
- - New and emerging: NGS-based (ExAC)
- - Old and forgotten: functional experiments
// How does an impact on biochemical function translate to the clinical and population levels?
New methods to interpret variation
Compensated pathogenic deviation
- - A source of prediction errors for existing methods
- - Fundamental mechanism of protein evolution and resistance
development for pathogens
Data:
- - Protein mutation databases: functional effect of M1, M1+M2…
- - Literature-based
New methods to interpret variation
Non-coding variation
Data
- - Genome sequence markup: genes and their elements
- - Population-based variant frequencies (dbSNP, WGS)
- - Genotype-phenotype associations (ClinVar, eQTLs, GWAS)
- - Comparative genomics: conservation
- - TF binding sites: experimental (ChIP-seq) and predicted
- - Epigenomics data (REMC, ENCODE)
Problems
- - Training sets
- - Tissue specificity
New methods to interpret variation
Non-coding genes: lincRNAs
- - ~1/3 are specific to human lineage
- - large fraction is brain-specific
- - known to regulate neighboring protein-coding genes
- - involved in gene expression regulation
// Derrien et al. (2012) Genome Res
Q: Can we attempt a more systematic annotation of non-coding RNA genes?
- - Data: large-scale RNA-seq datasets (BrainSpan)
- - Method: Gene’s best friends: thoughtful analysis of gene