My P value is lower than your P value! Beyond GWAS in livestock - - PowerPoint PPT Presentation




slide-1
SLIDE 1

My P value is lower than your P value! Beyond GWAS in livestock genomics

Joanna Szyda

slide-2
SLIDE 2

Motivation

P value based inference

slide-3
SLIDE 3

Motivation „Biology emerges from pathways, not from single genes” Eric Lander

slide-4
SLIDE 4

Motivation

  • Combine various sources of biological information

  • Use computational resources (data analysis)
  • Use brain  (biological conclusions)
slide-5
SLIDE 5

Outline

Data set 1 → Illustration of methodology and biological conclusions

  • Combine selected sources of information

[Excerpt: SNP genotype records]

Data set 2 → Illustration of the available genetic variability

[Excerpt: FASTQ sequence reads]

slide-6
SLIDE 6

Data set 1 → SNP

ARSBFGLBAC10172 4408169577_E B B 0.8830
ARSBFGLBAC1020 4408169577_E A B 0.8990
ARSBFGLBAC10245 4408169577_E B B 0.6582
ARSBFGLBAC10345 4408169577_E A B 0.9092
ARSBFGLBAC10365 4408169577_E B B 0.8021
ARSBFGLBAC10375 4408169577_E B B 0.8858
ARSBFGLBAC10591 4408169577_E A A 0.8670
ARSBFGLBAC10793 4408169577_E B B 0.8722
ARSBFGLBAC10867 4408169577_E A A 0.9316
ARSBFGLBAC10919 4408169577_E A B 0.7805
ARSBFGLBAC10952 4408169577_E A B 0.9314
ARSBFGLBAC10960 4408169577_E A B 0.5666
ARSBFGLBAC10975 4408169577_E A B 0.8665
ARSBFGLBAC10986 4408169577_E A B 0.8687
ARSBFGLBAC10993 4408169577_E B B 0.8146
ARSBFGLBAC11000 4408169577_E A A 0.9135
ARSBFGLBAC11003 4408169577_E A A 0.9454
ARSBFGLBAC11007 4408169577_E B B 0.9106
ARSBFGLBAC11025 4408169577_E B B 0.8742
ARSBFGLBAC11028 4408169577_E A A 0.8534
ARSBFGLBAC11034 4408169577_E B B 0.5769
ARSBFGLBAC11039 4408169577_E B B 0.8987

slide-7
SLIDE 7

Data set 1 → SNP

  • 2 601 HF bulls → black-white & red-white → pedigree: 10 355 individuals
  • SNP → Illumina 50 K chip → SNP positions → pairwise LD
  • Phenotype → deregressed national EBV → complex inheritance mode
  • Gene → genomic position (Ensembl) → Gene Ontology terms (GO) → metabolic pathways (KEGG)

slide-8
SLIDE 8

Data set 1 → SNP effect estimation

Model: y = µ + Zq + e

  • y: deregressed EBV for protein yield
  • µ: general mean
  • q: additive SNP effects
  • Z: design matrix of SNP genotypes coded as { -1, 0, 1 }
  • e: residual
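
A minimal sketch of how SNP effects could be estimated under a model with these terms, y = µ + Zq + e. It assumes that q is treated as a random effect and solved by ridge regression (SNP-BLUP style mixed-model equations). The function name, the shrinkage parameter `lam` and the simulated data are illustrative assumptions, not the settings used in the study.

```python
import numpy as np

def estimate_snp_effects(y, Z, lam):
    """Ridge (SNP-BLUP style) estimates for y = mu + Z q + e.

    y   : (n,) phenotypes, e.g. deregressed EBVs
    Z   : (n, m) genotypes coded -1, 0, 1
    lam : assumed shrinkage parameter (residual variance / SNP variance)
    """
    n, m = Z.shape
    X = np.ones((n, 1))  # design column for the general mean
    # Mixed-model equations: mean is fixed, SNP effects are shrunk by lam
    lhs = np.block([[X.T @ X, X.T @ Z],
                    [Z.T @ X, Z.T @ Z + lam * np.eye(m)]])
    rhs = np.concatenate([X.T @ y, Z.T @ y])
    sol = np.linalg.solve(lhs, rhs)
    return sol[0], sol[1:]  # (mu_hat, q_hat)

# Simulated toy data: 200 animals, 50 SNPs, two SNPs with real effects
rng = np.random.default_rng(1)
Z = rng.integers(-1, 2, size=(200, 50)).astype(float)
q_true = np.zeros(50)
q_true[[3, 17]] = [2.0, -1.5]
y = 10.0 + Z @ q_true + rng.normal(0.0, 0.5, size=200)

mu_hat, q_hat = estimate_snp_effects(y, Z, lam=1.0)
top_two = np.argsort(-np.abs(q_hat))[:2]
print(round(float(mu_hat), 2), sorted(top_two.tolist()))
```

With real data the number of SNPs exceeds the number of animals, which is why some form of shrinkage is needed; the value of `lam` would then come from estimated variance components rather than being fixed by hand.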
slide-9
SLIDE 9

Data set 1 → gene networks

Identify physiological processes underlying complex traits + corresponding genes

slide-10
SLIDE 10
Data set 1 → gene effect estimation

SNP level

  • 46 267 SNP estimates
  • varying LD to causal variants
  • multiple testing correction
  • only the most significant associations detected

Gene level

  • 4 345 gene estimates
  • SNPs within / close to genes
  • better interpretation

[Figure: -log10P by SNP, with genes LHX8, HEPHL1, DHX34, FBP2, TANC2, AP1B1 marked]

  • 6 „major” genes for PY
  • BTA: 3, 8, 17, 18, 19, 29
  • … find the other genes
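
To illustrate why multiple testing correction leaves only the strongest associations, here is a small worked example. It assumes a Bonferroni correction and a family-wise error rate of 0.05; the slides do not state which correction was used, so both are assumptions. The test counts are the ones given on the slide.

```python
import math

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under a Bonferroni correction."""
    return alpha / n_tests

alpha = 0.05       # assumed family-wise error rate
n_snps = 46_267    # SNP estimates (from the slide)
n_genes = 4_345    # gene estimates (from the slide)

snp_threshold = bonferroni_threshold(alpha, n_snps)
gene_threshold = bonferroni_threshold(alpha, n_genes)

print(f"SNP level:  P < {snp_threshold:.2e} (-log10 = {-math.log10(snp_threshold):.2f})")
print(f"gene level: P < {gene_threshold:.2e} (-log10 = {-math.log10(gene_threshold):.2f})")
```

Under these assumptions the gene-level analysis runs roughly ten times fewer tests, so its per-test threshold is about ten times less stringent.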
slide-11
SLIDE 11

Data set 1 → network construction for PY

  • 44 genes
  • 660 GO
  • 75 KEGG
slide-12
SLIDE 12

Data set 1 → network validation

SNP effect estimation → Gene effect estimation → Gene selection → Network construction → Functional information (GO, KEGG)

EBV permutation × 100
slide-13
SLIDE 13

Data set 1 → testing functional features

For each GO / KEGG:

Odds ratio = odds for the original data / odds for permuted data
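
A hedged sketch of how such a comparison of odds could be computed for one GO term or KEGG pathway. It assumes that the "odds" are the odds that a network gene is annotated with the term, that one ratio is formed per permuted network, and that a percentile interval summarises the ratios. The 0.5 continuity correction and all counts are hypothetical; the slides do not give the exact definition.

```python
import numpy as np

def odds(k, n):
    """Odds that a network gene carries the term (k annotated out of n genes).

    A 0.5 continuity correction is added (an assumption) so that permuted
    networks with zero annotated genes do not cause division by zero.
    """
    k = np.asarray(k, dtype=float)
    n = np.asarray(n, dtype=float)
    return (k + 0.5) / (n - k + 0.5)

def odds_ratios(k_orig, n_orig, k_perm, n_perm):
    """Ratio of the original odds to the odds in each permuted network."""
    return odds(k_orig, n_orig) / odds(k_perm, n_perm)

# Hypothetical counts: 8 of 44 network genes carry the term in the original
# data; three permuted networks are shown instead of 100 to keep it short.
ratios = odds_ratios(8, 44, k_perm=[1, 2, 0], n_perm=[40, 50, 45])
low, high = np.percentile(ratios, [2.5, 97.5])
print(ratios.round(2), round(float(low), 2), round(float(high), 2))
```

An interval that lies entirely above 1 would indicate that the term is more frequent in the original network than expected under permutation.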

slide-14
SLIDE 14

Data set 1 → results

Significant KEGG pathways for PY (examples)

  • Lysosome (bta04142)
    CI: 8.8-51.7 → P < 0.00001
    → protein degradation, tissue regression, inflammation

  • Cell cycle (bta04110)
    CI: 3.0-11.4 → P = 0.00005
    → development of mammary epithelium

  • Pentose phosphate (bta00030)
    CI: 7.5-245 → P = 0.00588
    → NADPH production in tissues engaged in biosynthesis

slide-15
SLIDE 15

Data set 1 → trait similarity

Identify similarities between complex traits

slide-16
SLIDE 16

Data set 1 → trait similarity

[Figure: trait similarity derived from the GO / genes of two trait networks]

slide-17
SLIDE 17

Data set 1 → similarity metrics

Cosine metric and Jaccard metric, both defined in terms of:

  • Nij: number of GO / genes shared by the networks for traits i and j
  • Ni: number of GO / genes in the network for trait i
  • Nj: number of GO / genes in the network for trait j
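
The formulas themselves did not survive in this transcript. The sketch below uses the standard set-based forms of the two metrics, cosine = Nij / sqrt(Ni · Nj) and Jaccard = Nij / (Ni + Nj - Nij), on the assumption that these are the ones intended. The gene sets are hypothetical.

```python
import math

def cosine_similarity(n_ij, n_i, n_j):
    """Set-based cosine similarity: shared items over the geometric mean of sizes."""
    return n_ij / math.sqrt(n_i * n_j)

def jaccard_similarity(n_ij, n_i, n_j):
    """Jaccard similarity: shared items over the size of the union."""
    return n_ij / (n_i + n_j - n_ij)

# Hypothetical gene networks for two traits
genes_trait_i = {"LHX8", "FBP2", "TANC2", "AP1B1"}
genes_trait_j = {"FBP2", "TANC2", "DGAT1"}

n_i, n_j = len(genes_trait_i), len(genes_trait_j)
n_ij = len(genes_trait_i & genes_trait_j)

cos = cosine_similarity(n_ij, n_i, n_j)
jac = jaccard_similarity(n_ij, n_i, n_j)
print(round(cos, 3), round(jac, 3))
```

Both metrics range from 0 (no shared items) to 1 (identical networks); for the same pair of sets the Jaccard value is never larger than the cosine value.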

slide-18
SLIDE 18

Data set 1 → results

Similarity between traits

[Figure: similarity (scale 0.0 to 0.7) for the trait pairs PY-FY, PY-MY, PY-SCS, PY-STA, FY-MY, FY-SCS, FY-STA, MY-SCS, MY-STA, SCS-STA; series: genes cosine, genes Jaccard, GO Jaccard]

slide-19
SLIDE 19

Data set 2 → DNA sequence

There is much more informative data to do it

slide-20
SLIDE 20

Data set 2 → DNA sequence

@HWI-1KL157:67:D2AGFACXX:1:2316:10694:65033 2:N:0:
CTATTACACGCCCCCGAAGCTCTAGCGGGTGTTCTCACGCACCCAAGGCATCCTCAACCACCACCATTTCTG
+
CCCFFADFHHGHHJJGGIIG@HIIFEHIJ;@F@DGGGGCCEB8BCDDDDBACDDCDDDBDDBDDDBDDDEE
@HWI-1KL157:67:D2AGFACXX:1:2316:10671:65034 2:N:0:
AGTGTATTACTGTCTTTGCACTCTTTAATCCTAGGTGACTTTTGGGGGTTCAGTATCAGATAGAGAACATATT
+
?@@ADDDDHDBFHCEHIIBHEHEEHEH>BF?EFHCHFGFGFHH@HIG:6@=CGICAGG=7@@CHG===7
@HWI-1KL157:67:D2AGFACXX:1:2316:10609:65040 2:N:0:
CTGGAGTGGGTATCCTTTCCCTTATCCAGGTTATCTTCCCAACCCAGGGATTGAACCCAGGTATCCTGGATT
+
@CCFDD2AFHDH<AFHII4CGIIJIJJGGIGIIJIIIJJJIHHIJJJIJEFGGICHHGGIIIHEHIHHGHHHFFFFFDDDDDD
@HWI-1KL157:67:D2AGFACXX:1:2316:10717:65046 2:N:0:
TACTCAAAAGAATCTGTGTTTAGACAGTTTAGAACATCTCCTACCTCTCACAGTTGGGAGGCTCTGAACAAT
+
@@@DD;DDHDBCFBEGGDHGHI<FBHIAEHE@GGEEFFHGDGIHGIGIIGBGGFGHIAFEGGHGIIIIIIEHH
@HWI-1KL157:67:D2AGFACXX:1:2316:10507:65046 2:N:0:
GAAGAAAAACTGTGTTTATGTCTCGAACATAATAAAGTCAACATGGATTATGTTAACTGTAATTGTACATCTA
+
@@@DDDDBHHHHBDBBHBHH3ACHHIIGBHIGCHGHGHIHHEGHII?4BFBDHHIGIDGDGFCCBF@FHI
@HWI-1KL157:67:D2AGFACXX:1:2316:10653:65048 2:N:0:
TATTGAAAACCTACCTACTAGGTAAATCTTAAGTAGGTTTAATCATGTCCACGTTTCCACTTGTTCACTCATTC
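
The reads above are in FASTQ format: four lines per read (header, bases, separator, per-base quality string). A minimal sketch of reading such records is given below. It assumes the Phred+33 quality encoding, which is usual for this style of Illumina header but is not stated on the slide. The function names and the toy record are illustrative.

```python
from statistics import mean

def parse_fastq(lines):
    """Yield (read_id, sequence, quality_string) from 4-line FASTQ records.

    An incomplete trailing record (fewer than 4 lines) is ignored.
    """
    records = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(records) - 3, 4):
        header, seq, plus, qual = records[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        yield header[1:], seq, qual

def phred_scores(qual, offset=33):
    """Convert a quality string to Phred scores (offset 33 is an assumption)."""
    return [ord(ch) - offset for ch in qual]

# Toy record, not taken from the data set
example = ["@read1 2:N:0:", "ACGTACGT", "+", "IIIIHHH#"]
parsed = list(parse_fastq(example))
for read_id, seq, qual in parsed:
    scores = phred_scores(qual)
    print(read_id, len(seq), round(mean(scores), 1))
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so a score of 40 means one expected error in 10 000 base calls.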

slide-21
SLIDE 21

Data set 2 → DNA sequence

  • 32 HF cows → paternal half-sibs
  • whole genome DNA sequence → Illumina HiSeq
  • variant calling → FreeBayes, GATK, Samtools, CNVnator
  • alignment → UMD3.1 reference genome → BWA, Smalt

slide-22
SLIDE 22

Data set 2 → genomic variability

Describe genetic variability on the DNA level
→ basis for complex trait modelling

slide-23
SLIDE 23

Data set 2 → averaged coverage

Genome averaged coverage for each cow

[Figure: coverage (y-axis, 2 to 18) by cow ID (x-axis, 1 to 32)]

  • min: 5
  • max: 17

slide-24
SLIDE 24

Data set 2 → coverage along the genome

Chromosome-wise coverage for a particular cow

  • BTA01: ȳ = 8.56
  • BTA10: ȳ = 8.03
  • BTA20: ȳ = 8.14
  • BTX: ȳ = 8.60

slide-25
SLIDE 25

Data set 2 → SNPs

Total number of identified SNPs

[Figure: # SNP (y-axis, up to 7 000 000) by cow ID (x-axis, 1 to 32)]

  • min: 2 063 811 → 0.08% of genome
  • max: 6 117 976 → 0.23% of genome
  • sd: 663 223
  • sd-32: 216 861
  • χ² P < 10⁻⁴
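
The genome percentages can be checked with a short calculation. The genome length of about 2.67 Gb is an assumption (an approximate size for the UMD3.1 assembly); the slide does not state the denominator that was used.

```python
GENOME_LENGTH_BP = 2.67e9  # assumed approximate length of the reference genome

def percent_of_genome(n_snps, genome_length=GENOME_LENGTH_BP):
    """Share of genome positions that are SNPs, in percent."""
    return 100.0 * n_snps / genome_length

pct_min = percent_of_genome(2_063_811)  # cow with the fewest SNPs
pct_max = percent_of_genome(6_117_976)  # cow with the most SNPs
print(f"{pct_min:.2f}% {pct_max:.2f}%")
```

Under this assumption the results round to the 0.08% and 0.23% reported on the slide.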
slide-26
SLIDE 26

Data set 2 → SNPs

Total number of identified SNPs

  • 15 272 427
  • 99.16% biallelic

[Figures: total number of SNPs by BTA; % of SNPs with 3 alleles by BTA; % of SNPs with 4 alleles by BTA]

slide-27
SLIDE 27

Data set 2 → SNPs

Missense SNPs

[Figures: number of missense SNPs and missense SNP density for three gene categories: Housekeeping (HK), Strong Selection (SS), Neutral to Selection (NS)]

slide-28
SLIDE 28

Data set 2 → SNPs

Housekeeping
→ beta Actin; Beta-2-microglobulin; Glyceraldehyde-3-phosphate; Hydroxymethylbilane synthase; beta Heat shock 90kDa protein 1; Ubiquitin C

Strong Selection
→ diacylglycerol O-acyltransferase 1; alpha 6 integrin; ADP-ribosylation factor-like 4A; bone morphogenetic protein 4; myeloid differentiation primary response

Neutral to Selection
→ URI1 prefoldin-like chaperone; low density lipoprotein receptor-related protein; ATP/GTP binding protein 1; ankyrin repeat domain 32; spectrin repeat containing, nuclear envelope 2

slide-29
SLIDE 29

Data set 2 → SNPs

Missense SNPs

[Figures repeated from slide 27: number of missense SNPs and missense SNP density for Housekeeping (HK), Strong Selection (SS), Neutral to Selection (NS)]

slide-30
SLIDE 30

Data set 2 → SNPs

Missense SNPs

  • ANOVA: SNP density = category + gene(category)
    category F → P = 0.230
    gene(category) F → P = 0.008

  • ANOVA: #SNP = category + gene(category)
    category F → P < 10⁻⁴
    gene(category) F → P < 10⁻⁴

Groups: Housekeeping & Strong Selection; Neutral to Selection

slide-31
SLIDE 31

Data set 2 → indels

Total number of identified indels

[Figure: # insertion / deletion (y-axis, up to 400 000) by cow ID (x-axis, 1 to 32)]

  • insertion: 85 058 – 343 649
  • deletion: 86 711 – 330 103
  • χ² P < 10⁻⁴
slide-32
SLIDE 32

Data set 2 → CNV

Total number of identified CNV

[Figure: # insertion / deletion (y-axis, up to 25 000) by cow ID (x-axis, 1 to 6)]

  • insertion: 2 527 – 3 046
  • deletion: 15 432 – 19 661
  • χ² P < 10⁻⁴
slide-33
SLIDE 33

Data set 2 → CNV

CNV length

  • deletion: 200 – 1 074 600 bp
  • mean: 5 506 bp
  • ANOVA P < 10⁻⁴

  • insertion: 200 – 182 300 bp
  • mean: 7 408 bp
  • ANOVA P < 10⁻⁴
slide-34
SLIDE 34

Data set 2 → DNA sequence

The data is prone to technical error

slide-35
SLIDE 35

Data set 2 → SNPs

Mononucleotide polymorphisms

  • significant correlation: coverage vs. # SNPs
  • r(coverage, #SNP) = 0.39
  • y = 2 614 400 + 1 069 104 ln(x)

[Figure: # SNPs (y-axis, up to 8 000 000) by coverage averaged along the genome (x-axis, 10 to 70), with values from Stothard et al. (2012), Kõks et al. (2013), Kõks et al. (2014)]
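
The fitted curve can be evaluated directly to see what it implies. The sketch below uses the coefficients from the slide; the coverage values 5 and 17 are the lowest and highest genome-averaged coverage reported for data set 2, used here only as illustration points. The function name is an assumption.

```python
import math

def predicted_snp_count(coverage):
    """Fitted curve from the slide: y = 2 614 400 + 1 069 104 * ln(x)."""
    return 2_614_400 + 1_069_104 * math.log(coverage)

for cov in (5, 17):
    print(cov, round(predicted_snp_count(cov)))

# Extra SNPs implied by each doubling of coverage: slope * ln(2)
gain_per_doubling = 1_069_104 * math.log(2)
print(round(gain_per_doubling))
```

Because the relationship is logarithmic, each doubling of coverage adds the same number of SNPs under this curve (about 741 000), so the gain per additional unit of coverage shrinks as coverage grows.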

slide-36
SLIDE 36

Data set 2 → SNPs

Common and private SNPs

[Figure: # SNP (y-axis, up to 250 000) by cow ID (x-axis, 1 to 32); series: all, FreeBayes, GATK, Samtools]
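
A minimal sketch of how calls from several variant callers could be split into common and private SNPs. It assumes that "common" means reported by all callers and "private" means reported by exactly one caller; the slide does not spell out these definitions. The variant tuples are hypothetical.

```python
def common_and_private(calls):
    """Split variant sets into variants shared by all callers and caller-specific ones.

    calls: dict mapping caller name -> set of (chromosome, position, alt) tuples
    """
    common = set.intersection(*calls.values())
    private = {}
    for name, variants in calls.items():
        # Union of everything reported by the other callers
        others = set().union(*(v for k, v in calls.items() if k != name))
        private[name] = variants - others
    return common, private

# Hypothetical calls for one cow
calls = {
    "FreeBayes": {("1", 100, "A"), ("1", 250, "T"), ("2", 40, "G")},
    "GATK": {("1", 100, "A"), ("2", 40, "G"), ("3", 7, "C")},
    "Samtools": {("1", 100, "A"), ("2", 40, "G")},
}

common, private = common_and_private(calls)
print(len(common), {name: len(v) for name, v in private.items()})
```

Variants reported by two of the three callers fall into neither group in this sketch; whether to count them separately is a design choice.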

slide-37
SLIDE 37

Data set 2 → SNPs

Common and private SNPs’ likelihood

slide-38
SLIDE 38

Data set 2 → technical variability - alignment

[Figures: % aligned reads (scale 91 to 100) and # called SNPs (scale 3 450 000 to 3 495 000) for BWA and Smalt]

slide-39
SLIDE 39

General conclusions summarising

slide-40
SLIDE 40

General conclusions

Functional approach

  • identify influential, but less significant genes
  • closer to trait physiology
  • use more than one source of information
  • requires full genomic resolution = sequence
  • centered on human / mouse data

slide-41
SLIDE 41

General conclusions

DNA sequence

  • various polymorphisms: SNPs, indels, CNVs
  • expensive
  • soon affordable: 1000 Bull Genomes, Gene2Farm
  • subject to error: technical-based, software-based
  • a variant ≠ a fixed data point; a variant = a variable
  • variants = differences with a particular genome

slide-42
SLIDE 42

Contribution:

Magdalena Frąszczak, Riccardo Giannico, Stanisław Kamiński, Magda Mielczarek, Giulietta Minozzi, Ezequiel Nicolazzi, Tomasz Suchocki, Katarzyna Wojdak-Maksymiec, Andrzej Żarnecki

Institutions:

  • Wroclaw University of Environmental and Life Sciences
  • National Research Institute of Animal Production
  • Parco Tecnologico Padano
  • University of Warmia and Mazury
  • West Pomeranian University of Technology
  • Poznań Supercomputing – Networking Center
  • NADIR, The Network of Animal Disease Infectiology Research

slide-43
SLIDE 43

External software:

  • Gene networks: Bisogenet (Martin et al. 2010), Cytoscape (Shannon et al. 2003)
  • LD: PLINK (Purcell et al. 2007)
  • Sequence alignment: BWA (Li and Durbin 2010), Smalt
  • SNP & indel calling: FreeBayes (Garrison and Marth 2012), GATK (McKenna et al. 2010), Samtools (Li et al. 2009)
  • CNV calling: CNVnator (Abyzov et al. 2011)

slide-44
SLIDE 44

Thank you for your attention

slide-45
SLIDE 45

In the land of Ignacy Misztal and Janusz Jamrozik

slide-46
SLIDE 46

2015

slide-47
SLIDE 47

EAAP Warsaw Poland