My P value is lower than your P value! Beyond GWAS in livestock genomics
Joanna Szyda
Motivation P value based inference
Motivation „Biology emerges from pathways, not from single genes” Eric Lander
Motivation Combine various sources of information
Outline
Data set 1 Illustration of methodology and biological conclusions
Combine selected sources of information
Data set 2 Illustration of the available genetic variability
ARSBFGLBAC10172 4408169577_E B B 0.8830
ARSBFGLBAC1020 4408169577_E A B 0.8990
ARSBFGLBAC10245 4408169577_E B B 0.6582
ARSBFGLBAC10345 4408169577_E A B 0.9092
ARSBFGLBAC10365 4408169577_E B B 0.8021
ARSBFGLBAC10375 4408169577_E B B 0.8858
ARSBFGLBAC10591 4408169577_E A A 0.8670
ARSBFGLBAC10793 4408169577_E B B 0.8722
ARSBFGLBAC10867 4408169577_E A A 0.9316
ARSBFGLBAC10919 4408169577_E A B 0.7805
ARSBFGLBAC10952 4408169577_E A B 0.9314
ARSBFGLBAC10960 4408169577_E A B 0.5666
ARSBFGLBAC10975 4408169577_E A B 0.8665
ARSBFGLBAC10986 4408169577_E A B 0.8687
ARSBFGLBAC10993 4408169577_E B B 0.8146
ARSBFGLBAC11000 4408169577_E A A 0.9135
ARSBFGLBAC11003 4408169577_E A A 0.9454
ARSBFGLBAC11007 4408169577_E B B 0.9106
ARSBFGLBAC11025 4408169577_E B B 0.8742
ARSBFGLBAC11028 4408169577_E A A 0.8534
ARSBFGLBAC11034 4408169577_E B B 0.5769
ARSBFGLBAC11039 4408169577_E B B 0.8987
Data set 1
2 601 HF bulls (black-white & red-white); pedigree: 10 355 individuals
SNP: Illumina 50 K chip; SNP positions; pairwise LD
Phenotype: deregressed national EBV; complex inheritance mode
Gene: genomic position (Ensembl); Gene Ontology terms (GO); metabolic pathways (KEGG)
Data set 1 SNP effect estimation
Data set 1 gene networks identify physiological processes underlying complex traits + corresponding genes
SNP associations detected: LHX8, HEPHL1, DHX34, FBP2, TANC2, AP1B1
Data set 1 network construction for PY
Data set 1 network validation
SNP effect estimation → Gene effect estimation → Gene selection → Network construction → Functional information
EBV permutation × 100
Data set 1 testing functional features For each GO / KEGG:
Odds for the original (non-permuted) data
Odds for permuted data
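The comparison of odds between the original and the permuted data can be sketched as an odds ratio with a confidence interval. The slides do not say how the interval was obtained, so the Wald interval on the log scale used here is an assumption, as are the 2x2 layout of the counts and the function name `odds_ratio_ci`. The counts in the example are made up for illustration.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for one GO term / KEGG pathway.

    Assumed 2x2 layout (not taken from the slides):
      a: network genes annotated with the term, original data
      b: network genes not annotated with the term, original data
      c: network genes annotated with the term, permuted data
      d: network genes not annotated with the term, permuted data
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four counts must be positive")
    odds_original = a / b
    odds_permuted = c / d
    ratio = odds_original / odds_permuted
    # Standard error of log(odds ratio) for a 2x2 table
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ratio) - z * se_log)
    upper = math.exp(math.log(ratio) + z * se_log)
    return ratio, lower, upper


# Hypothetical counts, for illustration only
ratio, lower, upper = odds_ratio_ci(12, 88, 30, 4970)
print(f"OR = {ratio:.1f}, CI: {lower:.1f}-{upper:.1f}")
```

An interval that excludes 1 suggests that the term is over-represented in the network built from the original data compared with the networks built from permuted EBVs.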
Data set 1 results
Significant KEGG pathways for PY (examples)
CI: 8.8-51.7 → P<0.00001: protein degradation, tissue regression, inflammation
CI: 3.0-11.4 → P=0.00005: development of mammary epithelium
CI: 7.5-245 → P=0.00588: NADPH production in tissues engaged in biosynthesis
Data set 1 trait similarity identify similarities between complex traits
Data set 1 trait similarity: similarity between two traits, based on the GO terms / genes in their networks
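Data set 1 similarity metrics
Cosine metric: $s^{\cos}_{ij} = \frac{n_{ij}}{\sqrt{n_i \, n_j}}$
Jaccard metric: $s^{J}_{ij} = \frac{n_{ij}}{n_i + n_j - n_{ij}}$
$n_{ij}$: number of GO / genes in networks for trait i and j
$n_i$: number of GO / genes in a network for trait i
$n_j$: number of GO / genes in a network for trait j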
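A minimal sketch of the two metrics for a pair of traits, assuming each network is represented as a set of gene names or GO terms. It uses the standard set-based forms: cosine as the shared count divided by the square root of the product of the two set sizes, and Jaccard as the shared count divided by the size of the union. The function names and the gene labels are hypothetical.

```python
import math


def cosine_similarity(items_i, items_j):
    """Set-based cosine: shared items / sqrt(size of set i * size of set j)."""
    if not items_i or not items_j:
        return 0.0
    shared = len(items_i & items_j)
    return shared / math.sqrt(len(items_i) * len(items_j))


def jaccard_similarity(items_i, items_j):
    """Jaccard: shared items / items present in at least one of the two sets."""
    union = len(items_i | items_j)
    if union == 0:
        return 0.0
    return len(items_i & items_j) / union


# Hypothetical gene sets for two traits
network_trait_i = {"g1", "g2", "g3", "g4"}
network_trait_j = {"g3", "g4", "g5"}

cos_ij = cosine_similarity(network_trait_i, network_trait_j)
jac_ij = jaccard_similarity(network_trait_i, network_trait_j)
print(round(cos_ij, 3), round(jac_ij, 3))
```

Both metrics range from 0 (no shared items) to 1 (identical sets); for sets of unequal size the cosine value is never lower than the Jaccard value.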
Data set 1 results
[Figure: Similarity between traits. Similarity (scale 0.0 to 0.7) for the trait pairs PY-FY, PY-MY, PY-SCS, PY-STA, FY-MY, FY-SCS, FY-STA, MY-SCS, MY-STA, SCS-STA; series: genes cosine, genes Jaccard, GO Jaccard.]
Data set 2 DNA sequence There are much more informative data available for this purpose
@HWI-1KL157:67:D2AGFACXX:1:2316:10694:65033 2:N:0:
CTATTACACGCCCCCGAAGCTCTAGCGGGTGTTCTCACGCACCCAAGGCATCCTCAACCACCACCATTTCTG
+
CCCFFADFHHGHHJJGGIIG@HIIFEHIJ;@F@DGGGGCCEB8BCDDDDBACDDCDDDBDDBDDDBDDDEE
@HWI-1KL157:67:D2AGFACXX:1:2316:10671:65034 2:N:0:
AGTGTATTACTGTCTTTGCACTCTTTAATCCTAGGTGACTTTTGGGGGTTCAGTATCAGATAGAGAACATATT
+
?@@ADDDDHDBFHCEHIIBHEHEEHEH>BF?EFHCHFGFGFHH@HIG:6@=CGICAGG=7@@CHG===7
@HWI-1KL157:67:D2AGFACXX:1:2316:10609:65040 2:N:0:
CTGGAGTGGGTATCCTTTCCCTTATCCAGGTTATCTTCCCAACCCAGGGATTGAACCCAGGTATCCTGGATT
+
@CCFDD2AFHDH<AFHII4CGIIJIJJGGIGIIJIIIJJJIHHIJJJIJEFGGICHHGGIIIHEHIHHGHHHFFFFFDDDDDD
@HWI-1KL157:67:D2AGFACXX:1:2316:10717:65046 2:N:0:
TACTCAAAAGAATCTGTGTTTAGACAGTTTAGAACATCTCCTACCTCTCACAGTTGGGAGGCTCTGAACAAT
+
@@@DD;DDHDBCFBEGGDHGHI<FBHIAEHE@GGEEFFHGDGIHGIGIIGBGGFGHIAFEGGHGIIIIIIEHH
@HWI-1KL157:67:D2AGFACXX:1:2316:10507:65046 2:N:0:
GAAGAAAAACTGTGTTTATGTCTCGAACATAATAAAGTCAACATGGATTATGTTAACTGTAATTGTACATCTA
+
@@@DDDDBHHHHBDBBHBHH3ACHHIIGBHIGCHGHGHIHHEGHII?4BFBDHHIGIDGDGFCCBF@FHI
@HWI-1KL157:67:D2AGFACXX:1:2316:10653:65048 2:N:0:
TATTGAAAACCTACCTACTAGGTAAATCTTAAGTAGGTTTAATCATGTCCACGTTTCCACTTGTTCACTCATTC
Data set 2 DNA sequence
32 HF cows, paternal half-sibs
whole genome DNA sequence: Illumina HiSeq
variant calling: FreeBayes, GATK, Samtools, CNVnator
alignment: UMD3.1 reference genome; BWA, Smalt
Data set 2 genomic variability describe genetic variability
basis for complex trait modelling
Data set 2 averaged coverage
[Figure: Genome averaged coverage for each cow. x-axis: cow ID (1 to 32); y-axis: coverage (ticks 2 to 18).]
Data set 2 coverage along the genome
Chromosome-wise coverage for a particular cow: BTA01: ȳ = 8.56; BTA10: ȳ = 8.03; BTA20: ȳ = 8.14; BTX: ȳ = 8.60
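Chromosome-wise averages of this kind can be computed from per-position depth records. The sketch below assumes tab-separated input with chromosome, position and depth columns, which is the layout written by `samtools depth`; the slides do not say how coverage was actually calculated, and the function name and example records are hypothetical.

```python
from collections import defaultdict


def mean_coverage_by_chromosome(lines):
    """Average depth per chromosome from 'chrom<TAB>pos<TAB>depth' records.

    Only positions present in the input are averaged, so positions that a
    depth tool leaves out (for example zero-depth sites) are not counted.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        chrom, _position, depth = line.split("\t")[:3]
        totals[chrom] += int(depth)
        counts[chrom] += 1
    return {chrom: totals[chrom] / counts[chrom] for chrom in totals}


# Hypothetical records, for illustration only
example = [
    "BTA01\t1\t8",
    "BTA01\t2\t10",
    "BTA01\t3\t9",
    "BTX\t1\t7",
    "BTX\t2\t9",
]
means = mean_coverage_by_chromosome(example)
print(means)
```

Averaging the same records without grouping by chromosome gives a genome-averaged coverage for the cow.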
Data set 2 SNPs
[Figure: Total number of identified SNPs per cow. x-axis: cow ID (1 to 32); y-axis: # SNP (ticks 1 000 000 to 7 000 000). Value shown on the slide: 663 223.]
Data set 2 SNPs
[Figure: Total number of identified SNPs by chromosome (BTA). Panels: total number of SNPs per BTA (ticks 100 000 to 1 000 000); % of SNPs with 3 alleles per BTA (ticks 0.5 to 1); % of SNPs with 4 alleles per BTA (ticks 0.002 to 0.008).]
Data set 2 SNPs
[Figure: Missense SNPs. Panels: number of missense SNPs (ticks 50 to 300) and missense SNP density (ticks 0.001 to 0.006) for three gene categories: HK = Housekeeping, SS = Strong Selection, NS = Neutral to Selection.]
Data set 2 SNPs
Housekeeping: beta Actin; Beta-2-microglobulin; Glyceraldehyde-3-phosphate; Hydroxymethylbilane synthase; beta Heat shock 90kDa protein 1; Ubiquitin C
Strong Selection: diacylglycerol O-acyltransferase 1; alpha 6 integrin; ADP-ribosylation factor-like 4A; bone morphogenetic protein 4; myeloid differentiation primary response
Neutral to Selection: URI1 prefoldin-like chaperone; low density lipoprotein receptor-related protein; ATP/GTP binding protein 1; ankyrin repeat domain 32; spectrin repeat containing, nuclear envelope 2
Data set 2 SNPs
Missense SNPs
Model: [response] = category + gene(category)
category: F, P = 0.230; gene(category): F, P = 0.008
category: F, P < 10^-4; gene(category): F, P < 10^-4
House keeping & Strong Selection; Neutral to selection
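A nested model of the form category + gene(category) can be illustrated with a small analysis-of-variance sketch. Everything beyond the model formula is an assumption: the slides do not state the response variable, the unit of observation, or the error term used for each F test. Here both effects are tested against the residual mean square, the function name `nested_anova` is hypothetical, and the numbers are invented. P values would additionally need the F distribution (for example `scipy.stats.f.sf`), which is left out to keep the block dependency-free.

```python
def nested_anova(data):
    """F statistics for y = category + gene(category).

    data: {category: {gene: [observations]}}
    Both effects are tested against the residual mean square (an assumption).
    """
    values = [y for genes in data.values() for obs in genes.values() for y in obs]
    n_total = len(values)
    grand_mean = sum(values) / n_total
    n_categories = len(data)
    n_genes = sum(len(genes) for genes in data.values())

    ss_category = ss_gene = ss_residual = 0.0
    for genes in data.values():
        cat_values = [y for obs in genes.values() for y in obs]
        cat_mean = sum(cat_values) / len(cat_values)
        # Between-category sum of squares
        ss_category += len(cat_values) * (cat_mean - grand_mean) ** 2
        for obs in genes.values():
            gene_mean = sum(obs) / len(obs)
            # Genes within category, then observations within gene
            ss_gene += len(obs) * (gene_mean - cat_mean) ** 2
            ss_residual += sum((y - gene_mean) ** 2 for y in obs)

    df_category = n_categories - 1
    df_gene = n_genes - n_categories
    df_residual = n_total - n_genes
    ms_residual = ss_residual / df_residual
    return {
        "F_category": (ss_category / df_category) / ms_residual,
        "F_gene_in_category": (ss_gene / df_gene) / ms_residual,
        "df": (df_category, df_gene, df_residual),
    }


# Invented observations for two categories with two genes each
toy = {
    "HK": {"geneA": [1, 2, 3], "geneB": [2, 3, 4]},
    "SS": {"geneC": [5, 6, 7], "geneD": [8, 9, 10]},
}
result = nested_anova(toy)
print(result)
```

If genes were treated as a random sample within each category, the category effect would instead be tested against the gene(category) mean square, which usually gives a more conservative test.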
Data set 2 indels
[Figure: Total number of identified indels per cow. x-axis: cow ID (1 to 32); y-axis: # insertion / deletion (ticks 50 000 to 400 000). Ranges shown on the slide: 85 058 – 343 649 and 86 711 – 330 103.]
Data set 2 CNV
[Figure: Total number of identified CNV per cow. x-axis: cow ID (1 to 6); y-axis labelled "# insertion / deletion" (ticks 5 000 to 25 000). Ranges shown on the slide: 2 527 – 3 046 and 15 432 – 19 661.]
Data set 2 CNV CNV length
5 506 bp
200 – 182 300 bp
7 408 bp
Data set 2 DNA sequence The data are prone to technical error
Data set 2 SNPs Mononucleotide polymorphisms
r(coverage, #SNP) = 0.39
y = 2614400 + 1069104 ln(x)
[Figure: # SNPs (ticks 1 000 000 to 8 000 000) against coverage averaged along the genome (ticks 10 to 70), with data from Stothard et al. (2012), Kõks et al. (2013) and Kõks et al. (2014).]
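The logarithmic curve on this slide can be evaluated, and refitted to other data, with a few lines of code. Reading x as genome-averaged coverage and y as the number of SNPs is inferred from the axis labels, and ordinary least squares on ln(x) is an assumption about how the curve was fitted. The function names are hypothetical.

```python
import math

# Coefficients as printed on the slide: y = 2614400 + 1069104 ln(x)
INTERCEPT = 2_614_400
SLOPE = 1_069_104


def expected_snps(coverage):
    """Number of SNPs predicted by the slide's curve at a given coverage."""
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return INTERCEPT + SLOPE * math.log(coverage)


def fit_log_curve(coverages, snp_counts):
    """Least-squares fit of y = a + b * ln(x); returns (a, b)."""
    logs = [math.log(x) for x in coverages]
    n = len(logs)
    mean_x = sum(logs) / n
    mean_y = sum(snp_counts) / n
    sxx = sum((lx - mean_x) ** 2 for lx in logs)
    sxy = sum((lx - mean_x) * (y - mean_y) for lx, y in zip(logs, snp_counts))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Points generated from the curve itself, so the fit recovers its coefficients
xs = [5, 10, 20, 40]
ys = [expected_snps(x) for x in xs]
a, b = fit_log_curve(xs, ys)
print(round(a), round(b))
```

Because the curve is logarithmic, each doubling of coverage adds the same expected number of SNPs, 1069104 × ln 2, so the gain per extra unit of coverage shrinks as coverage grows.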
Data set 2 SNPs
[Figure: Common and private SNPs per cow. x-axis: cow ID (1 to 32); y-axis: # SNP (ticks 50 000 to 250 000); series: all, FreeBayes, GATK, Samtools.]
Data set 2 SNPs Common and private SNPs’ likelihood
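Splitting variants into common and private ones comes down to set operations. The slides do not define the two terms, so this sketch assumes that a common SNP is present in every set and a private SNP in exactly one set; the sets may stand for cows or for variant callers. The function name, the cow labels and the variant tuples are hypothetical.

```python
def common_and_private(variant_sets):
    """Split variants into those shared by all sets and those unique to one.

    variant_sets: {label: set of (chromosome, position, alternative allele)}
    Returns (common, private), where private maps each label to its own set.
    """
    labels = list(variant_sets)
    if not labels:
        return set(), {}
    common = set.intersection(*(variant_sets[label] for label in labels))
    private = {}
    for label in labels:
        # Everything seen under any other label
        others = set().union(
            *(variant_sets[other] for other in labels if other != label)
        )
        private[label] = variant_sets[label] - others
    return common, private


# Hypothetical calls for three cows
calls = {
    "cow1": {("1", 100, "A"), ("1", 200, "T"), ("2", 50, "G")},
    "cow2": {("1", 100, "A"), ("2", 50, "G"), ("3", 10, "C")},
    "cow3": {("1", 100, "A"), ("1", 200, "T")},
}
common, private = common_and_private(calls)
print(common, private)
```

Variants found in some but not all sets fall into neither group, so the common and private counts generally do not add up to the total.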
Data set 2 technical variability - alignment
[Figure: BWA versus Smalt. Panels: aligned reads (% aligned reads, ticks 91 to 100) and called SNPs (# SNPs, ticks 3 450 000 to 3 495 000).]
General conclusions summarising
General conclusions
Functional approach
General conclusions
…
technical-based software-based a variant a fixed data point = a variable
DNA sequence
Contribution:
Magdalena Frąszczak, Riccardo Giannico, Stanisław Kamiński, Magda Mielczarek, Giulietta Minozzi, Ezequiel Nicolazzi, Tomasz Suchocki, Katarzyna Wojdak-Maksymiec, Andrzej Żarnecki
Institutions:
Wroclaw University of Environmental and Life Sciences; National Research Institute of Animal Production; Parco Tecnologico Padano; University of Warmia and Mazury; West Pomeranian University of Technology; Poznań Supercomputing and Networking Center; NADIR, The Network of Animal Disease Infectiology Research
External software:
Bisogenet (Martin et al. 2010), Cytoscape (Shannon et al. 2003)
PLINK (Purcell et al. 2007)
BWA (Li and Durbin 2010), Smalt
FreeBayes (Garrison and Marth 2012), GATK (McKenna et al. 2010), Samtools (Li et al. 2009)
CNVnator (Abyzov et al. 2011)
Thank you for your attention
In the land of Ignacy Misztal and Janusz Jamrozik