My P value is lower than your P value! Beyond GWAS in livestock

SLIDE 1

My P value is lower than your P value! Beyond GWAS in livestock genomics

Joanna Szyda

SLIDE 2

Motivation: P value based inference

SLIDE 3

Motivation: "Biology emerges from pathways, not from single genes" (Eric Lander)

SLIDE 4

Motivation

Combine various sources of biological information

Use computational resources (data analysis)
Use brain  (biological conclusions)

SLIDE 5

Outline

Data set 1: illustration of methodology and biological conclusions
[Example: SNP genotype records]
Combine selected sources of information

Data set 2: illustration of the available genetic variability
[Example: FASTQ sequence reads]

SLIDE 6

Data Set 1: SNP

ARSBFGLBAC10172 4408169577_E B B 0.8830
ARSBFGLBAC1020 4408169577_E A B 0.8990
ARSBFGLBAC10245 4408169577_E B B 0.6582
ARSBFGLBAC10345 4408169577_E A B 0.9092
ARSBFGLBAC10365 4408169577_E B B 0.8021
ARSBFGLBAC10375 4408169577_E B B 0.8858
ARSBFGLBAC10591 4408169577_E A A 0.8670
ARSBFGLBAC10793 4408169577_E B B 0.8722
ARSBFGLBAC10867 4408169577_E A A 0.9316
ARSBFGLBAC10919 4408169577_E A B 0.7805
ARSBFGLBAC10952 4408169577_E A B 0.9314
ARSBFGLBAC10960 4408169577_E A B 0.5666
ARSBFGLBAC10975 4408169577_E A B 0.8665
ARSBFGLBAC10986 4408169577_E A B 0.8687
ARSBFGLBAC10993 4408169577_E B B 0.8146
ARSBFGLBAC11000 4408169577_E A A 0.9135
ARSBFGLBAC11003 4408169577_E A A 0.9454
ARSBFGLBAC11007 4408169577_E B B 0.9106
ARSBFGLBAC11025 4408169577_E B B 0.8742
ARSBFGLBAC11028 4408169577_E A A 0.8534
ARSBFGLBAC11034 4408169577_E B B 0.5769
ARSBFGLBAC11039 4408169577_E B B 0.8987

SLIDE 7

Data Set 1: SNP
2 601 HF bulls: black-white & red-white; pedigree of 10 355 individuals
SNP: Illumina 50 K chip; SNP positions; pairwise LD
Phenotype: deregressed national EBV; complex inheritance mode
Gene: genomic position (Ensembl); Gene Ontology terms (GO); metabolic pathways (KEGG)

SLIDE 8

Data set 1: SNP effect estimation

y: deregressed EBV for protein yield
µ: general mean
q: additive SNP effects
Z: design matrix of SNP genotypes coded as { -1, 0, 1 }
e: residual
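The symbols listed here suggest a linear model of the form y = µ + Zq + e. The slide's own equation did not survive extraction, so that form is an assumption. A minimal sketch for one SNP, using ordinary least squares and made-up numbers, might look like the following; the talk may well have fitted all SNPs jointly, and the function names and the choice of which homozygote is coded -1 are hypothetical.

```python
from statistics import mean


def code_genotype(genotype: str) -> int:
    """Code an A/B genotype as -1, 0 or 1 (number of B alleles minus one).

    Which homozygote maps to -1 is an arbitrary choice made for this sketch.
    """
    return genotype.count("B") - 1


def fit_single_snp(y, genotypes):
    """Least-squares estimates of mu and q in y = mu + z*q + e for one SNP."""
    z = [code_genotype(g) for g in genotypes]
    z_bar, y_bar = mean(z), mean(y)
    sxy = sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, y))
    sxx = sum((zi - z_bar) ** 2 for zi in z)
    q = sxy / sxx              # additive SNP effect
    mu = y_bar - q * z_bar     # general mean
    return mu, q


# Hypothetical deregressed EBVs for six bulls and their genotypes at one SNP.
y = [10.0, 12.0, 14.0, 10.5, 12.5, 13.5]
genotypes = ["AA", "AB", "BB", "AA", "AB", "BB"]
mu, q = fit_single_snp(y, genotypes)
```

With this coding, q is the expected change in the phenotype per additional B allele.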

SLIDE 9

Data set 1: gene networks
Identify physiological processes underlying complex traits + corresponding genes

SLIDE 10

Data set 1: gene effect estimation

46 267 SNP estimates
varying LD to causal variants
multiple testing correction
only the most significant associations detected

4 345 gene estimates
SNPs within / close to genes
better interpretation

6 "major" genes for PY: LHX8, HEPHL1, DHX34, FBP2, TANC2, AP1B1
BTA: 3, 8, 17, 18, 19, 29
… find the other genes

[Figure: log10P plotted by SNP]
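To see why only the strongest associations survive a multiple testing correction, consider a worked example. The slide does not say which correction was applied; Bonferroni and a family-wise error rate of 0.05 are assumptions made purely for illustration.

```python
import math


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


ALPHA = 0.05  # assumed family-wise error rate, not stated on the slide

snp_threshold = bonferroni_threshold(ALPHA, 46_267)   # SNP estimates
gene_threshold = bonferroni_threshold(ALPHA, 4_345)   # gene estimates

# The same thresholds on the -log10(P) scale commonly used for plotting.
snp_minus_log10 = -math.log10(snp_threshold)
gene_minus_log10 = -math.log10(gene_threshold)
```

Under these assumptions a SNP needs P below roughly 1.1e-6, whereas a gene-level test needs P below roughly 1.2e-5, so testing fewer, gene-level units lowers the bar an association has to clear.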

SLIDE 11

Data set 1: network construction for PY

44 genes
660 GO
75 KEGG

SLIDE 12

Data set 1: network validation

Pipeline: SNP effect estimation, gene effect estimation, gene selection, network construction
Functional information: GO, KEGG
EBV permutation × 100
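The permutation step can be sketched as follows. The idea is to shuffle the EBVs across animals, which breaks any real link between genotype and phenotype, and then rerun the whole pipeline on each shuffled data set. The function name and the seed are hypothetical, and the pipeline itself is not reproduced here.

```python
import random


def permuted_phenotypes(y, n_permutations=100, seed=0):
    """Yield shuffled copies of the phenotype vector, one per permutation.

    The genotypes stay in place, so each shuffled vector represents data
    with no true genotype-phenotype association.
    """
    rng = random.Random(seed)
    for _ in range(n_permutations):
        shuffled = list(y)
        rng.shuffle(shuffled)
        yield shuffled


# Hypothetical EBVs for four animals; each permutation would be passed
# through SNP estimation, gene selection and network construction.
ebv = [1.0, 2.0, 3.0, 4.0]
permutations = list(permuted_phenotypes(ebv, n_permutations=100, seed=1))
```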

SLIDE 13

Data set 1: testing functional features
For each GO / KEGG:
Odds for the original data
Odds for permuted data
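One common way to compare two odds is an odds ratio with a Wald confidence interval on the log scale. The slides do not state how the odds were defined or compared, so the sketch below is an assumption, and all counts in it are invented.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio (a/b) / (c/d) with a Wald confidence interval.

    a, b: genes with / without the term in the original-data network
    c, d: genes with / without the term, pooled over permuted-data networks
    z:    normal quantile; 1.96 gives an approximate 95% interval
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Invented counts: 8 of 44 network genes carry the term in the original data,
# 40 of 4 400 do across 100 permuted networks.
odds_ratio, lower, upper = odds_ratio_ci(8, 36, 40, 4360)
```

A term would be called over-represented when the whole interval lies above 1.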

SLIDE 14

Data set 1: results
Significant KEGG pathways for PY (examples)

Lysosome (bta04142)
CI: 8.8-51.7 → P<0.00001
protein degradation, tissue regression, inflammation

Cell cycle (bta04110)
CI: 3.0-11.4 → P=0.00005
development of mammary epithelium

Pentose phosphate (bta00030)
CI: 7.5-245 → P=0.00588
NADPH production in tissues engaged in biosynthesis

SLIDE 15

Data set 1: trait similarity
Identify similarities between complex traits

SLIDE 16

Data set 1: trait similarity
[Diagram: trait similarity, GO / genes compared with GO / genes]

SLIDE 17

Data set 1: similarity metrics
Cosine metric and Jaccard metric, both computed from:
Nij: number of GO / genes shared by the networks for traits i and j
Ni: number of GO / genes in the network for trait i
Nj: number of GO / genes in the network for trait j
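The formulas themselves were lost in extraction. For sets, the standard definitions are cosine = Nij / sqrt(Ni · Nj) and Jaccard = Nij / (Ni + Nj − Nij); the sketch below assumes those standard forms, and the gene identifiers in it are placeholders rather than real network members.

```python
import math


def cosine_similarity(set_i, set_j):
    """Set-based cosine similarity: Nij / sqrt(Ni * Nj)."""
    n_ij = len(set_i & set_j)
    return n_ij / math.sqrt(len(set_i) * len(set_j))


def jaccard_similarity(set_i, set_j):
    """Jaccard similarity: Nij / (Ni + Nj - Nij)."""
    n_ij = len(set_i & set_j)
    return n_ij / (len(set_i) + len(set_j) - n_ij)


# Placeholder networks for two traits: Ni = 4, Nj = 9, Nij = 2.
network_i = {"g1", "g2", "g3", "g4"}
network_j = {"g3", "g4", "g5", "g6", "g7", "g8", "g9", "g10", "g11"}

cosine = cosine_similarity(network_i, network_j)    # 2 / sqrt(36)
jaccard = jaccard_similarity(network_i, network_j)  # 2 / 11
```

Both metrics range from 0 (no shared elements) to 1 (identical sets); Jaccard penalises differences in network size more strongly than cosine does.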

SLIDE 18

Data set 1: results
Similarity between traits
[Figure: similarity (0.0 to 0.7) for the trait pairs (PY, FY), (PY, MY), (PY, SCS), (PY, STA), (FY, MY), (FY, SCS), (FY, STA), (MY, SCS), (MY, STA), (SCS, STA); series: genes cosine, genes Jaccard, GO Jaccard]

SLIDE 19

Data Set 2: DNA sequence
There is much more informative data to do it

SLIDE 20

Data Set 2: DNA sequence

@HWI-1KL157:67:D2AGFACXX:1:2316:10694:65033 2:N:0:
CTATTACACGCCCCCGAAGCTCTAGCGGGTGTTCTCACGCACCCAAGGCATCCTCAACCACCACCATTTCTG
+
CCCFFADFHHGHHJJGGIIG@HIIFEHIJ;@F@DGGGGCCEB8BCDDDDBACDDCDDDBDDBDDDBDDDEE
@HWI-1KL157:67:D2AGFACXX:1:2316:10671:65034 2:N:0:
AGTGTATTACTGTCTTTGCACTCTTTAATCCTAGGTGACTTTTGGGGGTTCAGTATCAGATAGAGAACATATT
+
?@@ADDDDHDBFHCEHIIBHEHEEHEH>BF?EFHCHFGFGFHH@HIG:6@=CGICAGG=7@@CHG===7
@HWI-1KL157:67:D2AGFACXX:1:2316:10609:65040 2:N:0:
CTGGAGTGGGTATCCTTTCCCTTATCCAGGTTATCTTCCCAACCCAGGGATTGAACCCAGGTATCCTGGATT
+
@CCFDD2AFHDH<AFHII4CGIIJIJJGGIGIIJIIIJJJIHHIJJJIJEFGGICHHGGIIIHEHIHHGHHHFFFFFDDDDDD
@HWI-1KL157:67:D2AGFACXX:1:2316:10717:65046 2:N:0:
TACTCAAAAGAATCTGTGTTTAGACAGTTTAGAACATCTCCTACCTCTCACAGTTGGGAGGCTCTGAACAAT
+
@@@DD;DDHDBCFBEGGDHGHI<FBHIAEHE@GGEEFFHGDGIHGIGIIGBGGFGHIAFEGGHGIIIIIIEHH
@HWI-1KL157:67:D2AGFACXX:1:2316:10507:65046 2:N:0:
GAAGAAAAACTGTGTTTATGTCTCGAACATAATAAAGTCAACATGGATTATGTTAACTGTAATTGTACATCTA
+
@@@DDDDBHHHHBDBBHBHH3ACHHIIGBHIGCHGHGHIHHEGHII?4BFBDHHIGIDGDGFCCBF@FHI
@HWI-1KL157:67:D2AGFACXX:1:2316:10653:65048 2:N:0:
TATTGAAAACCTACCTACTAGGTAAATCTTAAGTAGGTTTAATCATGTCCACGTTTCCACTTGTTCACTCATTC
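The reads on this slide are in FASTQ format: four lines per read (an `@` header, the bases, a `+` separator, and one quality character per base). A minimal sketch of reading such records follows. It assumes Phred+33 quality encoding, which is consistent with the characters shown but is not stated on the slide, and the tiny example record is invented.

```python
def parse_fastq(text):
    """Yield (header, sequence, qualities) from 4-line FASTQ records.

    Qualities are decoded assuming Phred+33 (quality = ASCII code - 33).
    A trailing incomplete record is ignored.
    """
    lines = [line.strip() for line in text.strip().splitlines()]
    for start in range(0, len(lines) - 3, 4):
        header, sequence, separator, quality = lines[start:start + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {start + 1}")
        yield header[1:], sequence, [ord(ch) - 33 for ch in quality]


# Invented two-read example, not taken from the data set.
example = "@read1\nACGT\n+\nII5!\n@read2\nGGCA\n+\nJJJJ\n"
records = list(parse_fastq(example))
mean_quality = [sum(q) / len(q) for _, _, q in records]
```

Per-read mean quality like this is a simple first check before alignment and variant calling.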

SLIDE 21

Data Set 2: DNA sequence
32 HF cows: paternal half-sib
whole genome DNA sequence: Illumina HiSeq
variant calling: FreeBayes, GATK, Samtools, CNVnator
alignment: UMD3.1 reference genome; BWA, Smalt

SLIDE 22

Data set 2: genomic variability
Describe genetic variability on the DNA level
basis for complex trait modelling

SLIDE 23

Data set 2: averaged coverage
Genome averaged coverage for each cow
[Figure: coverage (2 to 18) by Cow ID (1 to 32)]
min: 5
max: 17

SLIDE 24

Data set 2: coverage along the genome
Chromosomewise coverage for a particular cow
BTA01: ȳ = 8.56
BTA10: ȳ = 8.03
BTA20: ȳ = 8.14
BTX: ȳ = 8.60

SLIDE 25

Data set 2: SNPs
Total number of identified SNPs
[Figure: # SNP (1 000 000 to 7 000 000) by Cow ID (1 to 32)]
min: 2 063 811 (0.08% of genome)
max: 6 117 976 (0.23% of genome)
sd: 663 223
sd-32: 216 861
χ² P < 10⁻⁴
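The "% of genome" figures can be reproduced with simple arithmetic. The genome length used below, about 2.67 Gb for the UMD3.1 assembly, is an assumed round figure and is not given on the slide.

```python
# Assumed approximate length of the UMD3.1 bovine assembly, in base pairs.
GENOME_LENGTH_BP = 2_670_000_000


def percent_of_genome(n_snps, genome_length_bp=GENOME_LENGTH_BP):
    """Share of genome positions that are SNPs, in percent."""
    return 100.0 * n_snps / genome_length_bp


min_pct = percent_of_genome(2_063_811)  # cow with the fewest SNPs
max_pct = percent_of_genome(6_117_976)  # cow with the most SNPs
```

Rounded to two decimals these give 0.08% and 0.23%, matching the values on the slide under the assumed genome length.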

SLIDE 26

Data set 2: SNPs
Total number of identified SNPs: 15 272 427
99.16% biallelic
[Figure: total number of SNPs (100 000 to 1 000 000) by BTA]
[Figure: % of SNPs with 3 alleles (0.5 to 1) by BTA]
[Figure: % of SNPs with 4 alleles (0.002 to 0.008) by BTA]

SLIDE 27

Data set 2: SNPs
Missense SNPs
[Figure: number of missense SNPs (50 to 300) and missense SNP density (0.001 to 0.006) for three gene categories: Housekeeping (HK), Strong Selection (SS), Neutral to Selection (NS)]

SLIDE 28

Data Set 2: SNPs

Housekeeping
beta Actin, Beta-2-microglobulin, Glyceraldehyde-3-phosphate, Hydroxymethylbilane synthase, beta Heat shock 90kDa protein 1, Ubiquitin C

Strong Selection
diacylglycerol O-acyltransferase 1, alpha 6 integrin, ADP-ribosylation factor-like 4A, bone morphogenetic protein 4, myeloid differentiation primary response

Neutral to Selection
URI1 prefoldin-like chaperone, low density lipoprotein receptor-related protein, ATP/GTP binding protein 1, ankyrin repeat domain 32, spectrin repeat containing, nuclear envelope 2

SLIDE 29

Data set 2: SNPs
Missense SNPs
[Figure: number of missense SNPs (50 to 300) and missense SNP density (0.001 to 0.006) for three gene categories: Housekeeping (HK), Strong Selection (SS), Neutral to Selection (NS)]

SLIDE 30

Data set 2: SNPs
Missense SNPs

ANOVA: SNP density = category + gene(category)
category F → P = 0.230
gene(category) F → P = 0.008

ANOVA: #SNP = category + gene(category)
category F → P < 10⁻⁴
gene(category) F → P < 10⁻⁴

Housekeeping & Strong Selection
Neutral to Selection

SLIDE 31

Data set 2: indels
Total number of identified indels
[Figure: # insertion / deletion (50 000 to 400 000) by cow ID (1 to 32)]
insertion: 85 058 – 343 649
deletion: 86 711 – 330 103
χ² P < 10⁻⁴

SLIDE 32

Data set 2: CNV
Total number of identified CNV
[Figure: # insertion / deletion (5 000 to 25 000) by cow ID (1 to 6)]
insertion: 2 527 – 3 046
deletion: 15 432 – 19 661
χ² P < 10⁻⁴

SLIDE 33

Data set 2: CNV
CNV length

deletion: 200 – 1 074 600 bp
mean: 5 506 bp
ANOVA P < 10⁻⁴

insertion: 200 – 182 300 bp
mean: 7 408 bp
ANOVA P < 10⁻⁴

SLIDE 34

Data Set 2: DNA sequence
The data is prone to technical error

SLIDE 35

Data set 2: SNPs
Mononucleotide polymorphisms
significant correlation: coverage vs. # SNPs
r(coverage, #SNP) = 0.39
y = 2614400 + 1069104 ln(x)
[Figure: # SNPs (1 000 000 to 8 000 000) against coverage averaged along the genome (10 to 70), including values from Stothard et al. (2012), Kõks et al. (2013), Kõks et al. (2014)]
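The fitted curve can be evaluated directly. The sketch below assumes, from the axis labels, that x is the genome-averaged coverage and y the number of SNPs; the function name is hypothetical.

```python
import math

INTERCEPT = 2_614_400  # coefficients as quoted on the slide
SLOPE = 1_069_104


def predicted_snps(coverage: float) -> float:
    """Number of SNPs predicted by y = 2614400 + 1069104 ln(x)."""
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return INTERCEPT + SLOPE * math.log(coverage)


# Predictions at the lowest and highest genome-averaged coverage in this data set.
at_min_coverage = predicted_snps(5)
at_max_coverage = predicted_snps(17)
```

Because the relationship is logarithmic, each doubling of coverage adds the same number of predicted SNPs (about 741 000), so the gain per extra unit of coverage shrinks as coverage grows.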

SLIDE 36

Data set 2: SNPs
Common and private SNPs
[Figure: # SNP (50 000 to 250 000) by cow ID (1 to 32); series: all, FreeBayes, GATK, Samtools]
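One way to split calls into common and private sets is plain set arithmetic over the sites each caller reports. The sketch below assumes that "common" means called by all three programs and "private" means called by exactly one; the slide does not define the terms, and the sites are invented.

```python
def common_and_private(calls):
    """Split variant calls into common and caller-private sites.

    calls: dict mapping caller name -> set of (chromosome, position) tuples.
    Returns (sites called by every caller, dict of sites unique to each caller).
    """
    common = set.intersection(*calls.values())
    private = {}
    for name, sites in calls.items():
        others = set().union(*(s for n, s in calls.items() if n != name))
        private[name] = sites - others
    return common, private


# Invented call sets for one cow.
calls = {
    "FreeBayes": {("1", 100), ("1", 200), ("2", 50)},
    "GATK": {("1", 100), ("2", 50), ("3", 7)},
    "Samtools": {("1", 100), ("1", 200)},
}
common, private = common_and_private(calls)
```

Sites reported by two of the three callers fall into neither group here, which is a design choice of this sketch rather than something the slides specify.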

SLIDE 37

Data set 2: SNPs
Common and private SNPs' likelihood

SLIDE 38

Data set 2: technical variability - alignment
[Figure: Aligned reads, % aligned reads (91 to 100), BWA vs. Smalt]
[Figure: Called SNPs, # SNPs (3 450 000 to 3 495 000), BWA vs. Smalt]

SLIDE 39

General conclusions summarising

SLIDE 40

General conclusions

Functional approach
identify influential, but less significant genes
closer to trait physiology
use more than one source of information
requires full genomic resolution = sequence
centered on human / mouse data

SLIDE 41

General conclusions

DNA sequence
various polymorphisms: SNPs, indels, CNVs
expensive
soon affordable: 1000 Bull Genomes, Gene2Farm, …
subjected to error: technical-based, software-based
a variant ≠ a fixed data point = a variable
variants = differences with a particular genome

SLIDE 42

Contribution:

Magdalena Frąszczak
Riccardo Giannico
Stanisław Kamiński
Magda Mielczarek
Giulietta Minozzi
Ezequiel Nicolazzi
Tomasz Suchocki
Katarzyna Wojdak-Maksymiec
Andrzej Żarnecki

Institutions:

Wroclaw University of Environmental and Life Sciences
National Research Institute of Animal Production
Parco Tecnologico Padano
University of Warmia and Mazury
West Pomeranian University of Technology
Poznań Supercomputing – Networking Center
NADIR, The Network of Animal Disease Infectiology Research

SLIDE 43

External software:

Gene networks: Bisogenet (Martin et al. 2010), Cytoscape (Shannon et al. 2003)
LD: PLINK (Purcell et al. 2007)
Sequence alignment: BWA (Li and Durbin 2010), Smalt
SNP & indel calling: FreeBayes (Garrison and Marth 2012), GATK (McKenna et al. 2010), Samtools (Li et al. 2009)
CNV calling: CNVnator (Abyzov et al. 2011)

SLIDE 44

Thank you for your attention

SLIDE 45

In the land of Ignacy Misztal and Janusz Jamrozik

SLIDE 46

2015

SLIDE 47


EAAP Warsaw Poland