

SLIDE 1

Bayesian matrix factorization for drug-target activity prediction

Yves Moreau

University of Leuven – ESAT-STADIUS, SymBioSys Center for Computational Biology

SLIDE 2

[Figure: number of new drugs per billion US$ (log scale, from 100 down to 0.1), declining from 1950 to 2010. Scannell et al. 2012]

SLIDE 3
SLIDE 4

The curse of attrition…

Phase success rates (Hay et al. 2014):

• Phase 1 to Phase 2: 64%
• Phase 2 to Phase 3: 32%
• Phase 3 to NDA/BLA: 60%
• NDA/BLA to approval: 83%

SLIDE 5

…mainly due to safety and efficacy issues

[Figure: causes of failure between Phase 2 and submission in 2011 and 2012 — efficacy, safety, other. Arrowsmith & Miller 2013]

SLIDE 6

Chemoinformatics

• Goal: estimate the interaction between compounds and protein targets
• Activity measured by high-throughput screening
• Activity depends on the match between the shape of the compound and the shape of the protein
• 3D modeling is challenging

[Figure: does the compound (e.g., Viagra) bind the enzyme (e.g., ACE2)?]

SLIDE 7

Drug–target activities

• IC50 – concentration of compound needed for half-maximal inhibition
• pIC50 = −log10(IC50)
• EC50 – concentration of compound needed for half-maximal effect
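For concreteness, a one-line version of the pIC50 conversion above (a minimal sketch of ours; IC50 must first be expressed in molar units):

```python
import numpy as np

# pIC50 = -log10(IC50 in molar); example IC50 values given in nM
ic50_nM = np.array([200.0, 50.0, 1000.0])
pic50 = -np.log10(ic50_nM * 1e-9)   # 200 nM -> pIC50 ≈ 6.7
```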

SLIDE 8

High-throughput screening

• Hit discovery in early drug discovery
• Identify compounds active against a protein drug target of interest
• Activity measured by high-throughput screening
• Activity = “scarce” data

[Figure: sparse compound × target IC50 matrix — millions of compounds, thousands of targets, 1–2% fill rate]

SLIDE 9

Molecular fingerprints

• High-dimensional fingerprints of 2D compound structures; sparse vectors
• Circular fingerprints (MNA, MPD, ECFP): each fingerprint represents a central atom and its neighbors
• Key-based fingerprints (FP2, MACCS): a bit string represents the presence or absence of particular substructures
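As an illustration, both fingerprint families can be computed with RDKit (our choice of tool; the slides do not name a library):

```python
from rdkit import Chem
from rdkit.Chem import AllChem, MACCSkeys

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin

# Circular: ECFP4-like Morgan fingerprint (radius 2), sparse bit vector
ecfp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

# Key-based: MACCS keys, each bit flags a predefined substructure
maccs = MACCSkeys.GenMACCSKeys(mol)
print(ecfp.GetNumOnBits(), maccs.GetNumOnBits())
```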

SLIDE 10

Quantitative Structure–Activity Relationship (QSAR)

• Finds an optimal model α based on predictive features
• IC50(x) = α1·x1 + α2·x2 + … + αF·xF
• Minimize error loss (PLS, ridge regression); see the sketch below
• Good performance if enough training examples
• Does not share information across tasks!

[Figure: per-target regression from compound fingerprint bits to IC50 values]
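A minimal single-task QSAR sketch along these lines, using scikit-learn ridge regression on random stand-in data (all data and values here are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge

# One ridge model per protein target, mapping fingerprint bits x
# to activity (in practice pIC50) via the weights alpha.
X = np.random.randint(0, 2, size=(500, 2048)).astype(float)  # fingerprints
y = np.random.randn(500)                    # activities for ONE target

model = Ridge(alpha=1.0).fit(X, y)          # squared error + L2 penalty
y_hat = model.predict(X[:5])                # IC50(x) = a1*x1 + ... + aF*xF
```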

SLIDE 11

Multitask learning

• From fingerprints and available activities, predict missing activities
• Approaches:
  1. Supervised learning per target (QSAR)
  2. Matrix factorization (Netflix style)
  3. MF + supervised (Macau)

[Figure: 3M compounds × 1,500 targets IC50 matrix (1–2% fill rate) with compound features (6K–4M)]

SLIDE 12

The Netflix Challenge

• Goal: predict user movie ratings
• 480K users, 18K movies
• 100 million ratings
• 1% fill rate
• → Predict the 99% missing ratings
• How can this work?

[Figure: 480K × 18K ratings matrix, almost all entries unknown]

SLIDE 13

Factor analysis

• Low-rank approximation of the full matrix: Y ≈ U · V, with loadings U and factors V

SLIDE 14

Factor analysis

• Row-wise view: Y_i· ≈ U_i· · V (each row of Y from the corresponding row of loadings U and the factor matrix V)

SLIDE 15

Factor analysis

• Individual response (= row) modeled as an individual mixture (= loadings) of a small number of latent responses (= factors):

Y_i· ≈ U_i1 · V_1· + U_i2 · V_2· + … + U_ik · V_k·
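A tiny NumPy check of this view (our own illustration): each row of a rank-3 matrix is a loading-weighted mixture of the three factor rows.

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.standard_normal((100, 3))   # loadings: one mixture per row
V = rng.standard_normal((3, 40))    # factors: 3 latent response profiles
Y = U @ V                           # full matrix, rank 3 by construction

row0 = U[0, 0] * V[0] + U[0, 1] * V[1] + U[0, 2] * V[2]
assert np.allclose(Y[0], row0)      # row = mixture of latent responses
```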

SLIDE 16

Alternating Least Squares

[Figure: row-wise factorization Y_i· ≈ U_i· · V with the loadings U_i· unknown]

SLIDE 17

Alternating Least Squares

• If V were known, U could be found by linear regression

[Figure: Y ≈ U · V with the loadings U unknown]

SLIDE 18

Alternating Least Squares

• If U were known, V could also be found by linear regression

[Figure: Y ≈ U · V with the factors V unknown]

SLIDE 19

Scarce matrix factorization

• Only observed values are used in the regressions (see the ALS sketch below):

min_{U,V} ‖W ⊙ (Y − U·V)‖², where W is the mask of observed entries

[Figure: Y ≈ U* · V* fitted on observed entries only]
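A compact NumPy sketch of ALS restricted to observed entries (our own minimal version, with a small ridge term `lam` added for numerical stability; not the production algorithm):

```python
import numpy as np

def als_scarce(Y, W, k=8, n_iters=50, lam=0.1, seed=0):
    """Alternate ridge regressions for U and V over observed entries only."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = rng.standard_normal((n, k))
    V = rng.standard_normal((k, m))
    for _ in range(n_iters):
        for i in range(n):                      # update each loading row
            obs = W[i] > 0
            A = V[:, obs] @ V[:, obs].T + lam * np.eye(k)
            U[i] = np.linalg.solve(A, V[:, obs] @ Y[i, obs])
        for j in range(m):                      # update each factor column
            obs = W[:, j] > 0
            A = U[obs].T @ U[obs] + lam * np.eye(k)
            V[:, j] = np.linalg.solve(A, U[obs].T @ Y[obs, j])
    return U, V                                 # predict missing: U @ V
```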

SLIDE 20

Scarce matrix factorization

• Once the factors are obtained, the other entries can be predicted: Ŷ = U* · V*

SLIDE 21

Uncertainty

• Given scarce data, is a single solution (U*, V*) meaningful?

[Figure: the point estimate Ŷ = U* · V*]

SLIDE 22

Bayesian modeling

• Given the uncertainty from scarce data, Bayesian inference is desirable
• Instead of the point estimate (U*, V*) = argmin_{U,V} ‖W ⊙ (Y − U·V)‖², we want to consider the Bayesian posterior distribution p(U, V | Y)
• The posterior predictive distribution p(Ŷ | Y) is more informative than any optimal estimator

SLIDE 23

Ordinary least squares

• ALS involves successive regressions solved by OLS

[Figure: row-wise regression Y_i· ≈ U_i· · V]

SLIDE 24

Ordinary least squares

• Model: y = X′β + ε (setup = transposed version of the previous notation, so X is features × samples)
• Solution: β̂ = (XX′)⁻¹ X y
• If the noise is Gaussian, then OLS is the maximum-likelihood estimate

SLIDE 25

Block Gibbs sampler

• The Gibbs sampler is a Markov Chain Monte Carlo (MCMC) method
• MCMC for model inference generates samples from complex posterior distributions of model parameters by iteratively sampling from simpler distributions
• The following scheme is a block Gibbs sampler:

U^(i+1) ~ p(U | V^(i), Y)
V^(i+1) ~ p(V | U^(i+1), Y)
For i sufficiently large, (U^(i), V^(i)) ~ p(U, V | Y)

• Under mild ergodicity conditions, after burn-in the samples are dependent draws from the joint posterior
• Similar to alternating least squares, but global rather than local optimization

SLIDE 26

Markov Chain Monte Carlo

• We do not get the posterior distribution analytically, only samples from it
• Samples are sufficient to characterize the posterior distribution
  • e.g., average solutions to get the posterior mean estimate
  • e.g., marginal variance of individual predictions to characterize uncertainty

SLIDE 27

Bayesian linear regression

• Model: y = X′β + ε
• The distribution of β as a function of the data X and y can be modeled as a multivariate Gaussian distribution over β
• Assume a Gaussian prior for β and an inverse-gamma prior for the noise parameter ρ

SLIDE 28

Bayesian linear regression

• Then the posterior distribution of β is also Gaussian, by application of Bayes’ rule:

Λn = XX′ + Λ0,  μn = Λn⁻¹ (Xy + Λ0 μ0)

• If Λ0 = 0 and μ0 = 0, then the solution for μn is identical to OLS!
• The posterior mean μn is similar to the ridge regression solution
• The precision matrix Λn characterizes the variance of the solution (see the sketch below)
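In code, the posterior update is two lines of NumPy (our sketch, in the slides’ transposed setup where X is features × samples):

```python
import numpy as np

def blr_posterior(X, y, Lambda0, mu0):
    """Gaussian posterior of Bayesian linear regression:
    Lambda_n = X X' + Lambda0, mu_n = Lambda_n^-1 (X y + Lambda0 mu0)."""
    Lambda_n = X @ X.T + Lambda0
    mu_n = np.linalg.solve(Lambda_n, X @ y + Lambda0 @ mu0)
    return mu_n, Lambda_n   # with Lambda0 = 0, mu0 = 0 this is exactly OLS
```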

SLIDE 29

GAMBLR trick

• Executing the Gibbs sampler requires sampling repeatedly from posterior Gaussian distributions (which change every time U and V change)
• Sampling from a multivariate Gaussian distribution: with ε ~ N(0, I) and A such that Σ = AA′, z = μ + Aε ~ N(μ, Σ)
• For Bayesian linear regression, augment the data with the prior:

X̃ = [X  L0],  ỹ = [y ; L0′μ0],  with Λ0 = L0L0′
so that μn = (X̃X̃′)⁻¹ X̃ỹ and Λn = X̃X̃′

• This has the same form as OLS!
• It can be shown that z = (X̃X̃′)⁻¹ X̃(ỹ + σ·ε) ~ N(μn, σ²Λn⁻¹)

SLIDE 30

GAMBLR trick

• This means that we can sample from the posterior Gaussian distribution by solving a linear regression on the original data plus injected noise!
• Running the Gibbs sampler then only amounts to solving a sequence of linear regressions with variable noise injection (see the sketch below)
• Linear regression is one of the best-studied problems in numerical analysis
  • Fast algorithms
  • Scalable code
  • One multivariate regression per row or column of Y at each iteration step, hence easy parallelization
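A minimal NumPy sketch of this noise-injection sampler (our own; the function and variable names are ours, following the formulas on the previous slide):

```python
import numpy as np

def sample_posterior_weights(X, y, L0, mu0, sigma, rng):
    """One posterior draw for Bayesian linear regression via noise injection.

    Transposed setup as on the slides: X is (features x samples),
    prior precision Lambda0 = L0 @ L0.T, prior mean mu0."""
    Xt = np.hstack([X, L0])                 # augmented data, (d, n+d)
    yt = np.concatenate([y, L0.T @ mu0])    # augmented targets, (n+d,)
    eps = rng.standard_normal(yt.shape)     # injected N(0, I) noise
    # OLS on noise-injected targets = draw from N(mu_n, sigma^2 Lambda_n^-1)
    return np.linalg.solve(Xt @ Xt.T, Xt @ (yt + sigma * eps))

# Averaging many draws recovers mu_n; their spread reflects Lambda_n^-1.
```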

SLIDE 31

Matrix factorization

  • One of the best approaches in the Netflix challenge
  • Prediction of ratings for viewer–movie pairs
  • Does not use features, only matrix values
  • Two popular versions:
    • Probabilistic Matrix Factorization (PMF) = maximum likelihood
    • Bayesian PMF (BPMF) = Bayesian inference
SLIDE 32

Netflix comparison (PMF vs. BPMF)

• Data: 100M ratings from 480K users, 18K movies
• BPMF has an advantage for users with few ratings

SLIDE 33

Motivation for Bayesian PMF

• PMF gives point estimates
• Problematic for compounds that have only a few samples
• We are interested in the uncertainty of the estimates

[Figure: example IC50 data set from ChEMBL with 15K compounds]

SLIDE 34

Bayesian PMF

SLIDE 35

Gibbs sampling

• Iteratively samples each parameter
• Obtains posterior samples of the model (e.g., sample 200 models after burn-in)
• Using the samples, one can also measure uncertainty (see the sketch below)
• Related to alternating least squares
• Blocked Gibbs sampler with large blocks, good sampling behavior
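For instance, the posterior mean and a per-entry uncertainty can be read off a stack of sampled models like this (our illustration; the array shapes are assumptions):

```python
import numpy as np

# U_samples: (nsamples, n_compounds, num_latent) post-burn-in Gibbs draws
# V_samples: (nsamples, num_latent, n_targets)
def posterior_predict(U_samples, V_samples):
    """Posterior mean prediction and per-entry spread from Gibbs samples."""
    preds = np.einsum('sik,skj->sij', U_samples, V_samples)
    return preds.mean(axis=0), preds.std(axis=0)
```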

SLIDE 36

ChEMBL: PMF vs. Bayesian PMF

• ChEMBL public data set of assay activities
• Classified IC50: 15,118 compounds, 344 proteins, 59,451 values
• Discretization at 200 nM; 20% test set
• BPMF outperforms PMF
• Does not use features, only matrix values

[Figure: test classification error, PMF vs. BPMF]

SLIDE 37

ChEMBL: BPMF vs. ridge regression

• 15K compounds, 344 proteins, 200 nM threshold, 20% test set
• Vary the number of latent dimensions
• Matrix factorization is not as good as QSAR, but it does capture information

SLIDE 38

BPMF (relation view)

• Model: 2 entities, 1 relation
• Latent variables are learned from the IC50 data

[Figure: compound and protein entities linked by the IC50 relation]

SLIDE 39

Macau

Can we get the best of both worlds?

• Model: 2 entities, 1 relation + features for compounds
• Latent variables are learned together with the feature weights β_comp

[Figure: compound latents U and protein latents V linked by the IC50 relation; fingerprints enter through β_comp]

SLIDE 40
SLIDE 41

Results on ChEMBL

• 15K compounds, 344 proteins, 200 nM threshold, 20% test set
• Compound features improve performance
• Multitask modeling improves performance

SLIDE 42
SLIDE 43
SLIDE 44

Industrial scaling (J&J data)

• ~2M compounds, ~1K targets, tens of millions of activities
• Compute nodes: dual Xeon E5-2699 v3
• Fingerprint 1: 6,000 features
  • Latent dimension = 30; direct solver on a single node
  • 40 s per Gibbs sampling pass; 1,000 iterations (800 burn-in) = ½ day
• Fingerprint 2: 4,000,000 features
  • Sparsity of X: 0.002%; latent dimension = 30; iterative solver on 15 nodes
  • 600 s per Gibbs sampling pass; 1,000 iterations (800 burn-in) = 1 week

SLIDE 45

Single-task vs. multitask learning

• SVM using scikit-learn
  • Separate classifier for every assay
  • Hyperparameters by nested CV, for each assay separately
  • Linear kernel (a Gaussian kernel has equivalent performance but does not scale)
• Macau classification using TensorFlow
  • Non-Bayesian approach (optimization)
  • Multitask learning
  • Hidden representation size: 1,000
  • Model parameters chosen by the ChEMBL experiments

SLIDE 46

Nested clustered cross-validation

• Chemical series effect: all members of a series should be either in the training set or in the test set (see the sketch below)
• Clustering at Tanimoto similarity > 0.7
• Nested cross-validation for hyperparameter tuning
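A sketch of such a clustered split (our own minimal version: naive single-linkage merging at Tanimoto > 0.7, then scikit-learn’s GroupKFold to keep each cluster entirely in train or test):

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs
from sklearn.model_selection import GroupKFold

def tanimoto_clusters(smiles, cutoff=0.7):
    """Naive single-linkage clustering of compounds at Tanimoto > cutoff."""
    mols = [Chem.MolFromSmiles(s) for s in smiles]
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, 2048) for m in mols]
    labels = list(range(len(fps)))   # each compound starts as its own cluster
    for i in range(len(fps)):
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[i + 1:])
        for j, s in enumerate(sims, start=i + 1):
            if s > cutoff:           # merge cluster of j into cluster of i
                old, new = labels[j], labels[i]
                labels = [new if lbl == old else lbl for lbl in labels]
    return np.array(labels)

smiles = ["CCO", "CCN", "c1ccccc1", "c1ccccc1C", "CC(=O)O"]
groups = tanimoto_clusters(smiles)
X, y = np.random.rand(len(smiles), 8), np.random.randint(0, 2, len(smiles))
for train_idx, test_idx in GroupKFold(n_splits=2).split(X, y, groups):
    pass  # fit on train_idx, evaluate on test_idx; series never straddle
```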

SLIDE 47

AUC per assay

• Mean AUC over assays: Macau 0.886, SVM 0.840
• Out of 712 assays (at p < 0.01): Macau wins 382, SVM wins 0, 330 ties

SLIDE 48

Variational Bayes

• Gibbs sampling = “old”; variational Bayes (VB) popular
• Hierarchical blindness in VB:
  • Ignored covariance between γ and the latents u
  • Poor variance estimates
• The u_i covariance increases when side information is present

SLIDE 49

Empirical comparison: ChEMBL

• 15K compounds, 346 proteins, ~60K activity measurements (pIC50)
• 20% test set
• Sparse high-dimensional side information (~100K features)
• Macau drastically outperforms VBMFSI
SLIDE 50

Repurposing High-Content Imaging data

SLIDE 51

Classical high-content imaging

SLIDE 52

Repurposing imaging assays

• High-throughput imaging (= high-content screening)
• 500K compounds, 600 drug targets, 10M activities (30% fill rate)
• Glucocorticoid receptor assay phenotypic screen

  • Feature extraction from images with CellProfiler

Simm et al., Repurposing High-Throughput Image Assays Enables Biological Activity Prediction for Drug Discovery, Cell Chemical Biology (2018)

SLIDE 53

Application

• Oncology drug discovery project

  • Active project
  • Initial screen = 0.725% hit rate (submicromolar)
  • Kinase target
  • No known direct relation to glucocorticoid receptor
  • Rank unscreened compounds with imaging data
  • Test top 342 compounds
  • 141 submicromolar hits (41% hit rate)
  • 60x enrichment
SLIDE 54

Application

• Central nervous system project

  • Active project
  • Initial screen = 0.088% hit rate
  • Enzyme target
  • No known direct relation to glucocorticoid receptor
  • Rank unscreened compounds with imaging data
  • Some additional ADME filtering
  • Select 141 compounds
  • 37 submicromolar hits (22.7% hit rate)
  • 250x enrichment
SLIDE 55

Imaging data improves chemical diversity

• Similar or better hit rates than with structure fingerprints
• BUT high chemical diversity (biologically driven vs. chemically driven selection)

[Figure: chemical diversity of hits in the Oncology and CNS projects]

SLIDE 56

Imaging assays for drug discovery

• 500K compounds, 600 targets, 10M activities (30% fill rate)
• Glucocorticoid receptor assay phenotypic screen
• Evaluate predictivity using clustered cross-validation
• Macau predictive for 37% of assays (CV AUC > 0.7), highly predictive for 5% of assays (CV AUC > 0.9)
  • Assays not related to the original screen!
• Here: a single imaging assay; future: build a systematic library of imaging assays

SLIDE 57

Macau

• Generic package, open source
• OpenMP/C++ with a Python wrapper library

  • https://github.com/jaak-s/macau
  • Factorization with and without side information
  • Real valued and binary matrices (normal and probit noise)
  • Supports tensors (alpha)
  • Univariate and multivariate Gibbs sampler
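A usage sketch following the example in the project README (the file names are the README’s demo data; exact argument names may differ across versions):

```python
import macau
import scipy.io

# Demo data from the macau repository (ChEMBL IC50 subset)
ic50 = scipy.io.mmread("chembl-IC50-346targets.mm")
ecfp = scipy.io.mmread("chembl-IC50-compound-feat.mm")

# Factorization with compound side information; 20% of cells held out
result = macau.macau(Y=ic50,            # sparse activity matrix
                     Ytest=0.2,         # fraction of cells used as test set
                     side=[ecfp, None], # features for rows, none for columns
                     num_latent=32,
                     precision=5.0,     # observation noise precision
                     burnin=400,
                     nsamples=1600)
```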
SLIDE 58

Deep Macau

• Combine deep learning and matrix factorization
• Deep learning allows capturing nonlinear effects
• Matrix factorization allows item-level reasoning
• Instead of only transforming features into a prediction, learn a latent representation of each entity

[Figure: activity prediction for entity IDs via learned latent representations]

SLIDE 59

Privacy-Preserving Machine Learning


SLIDE 60

Privacy-preserving modeling

• Partners want to model data jointly across multiple partners
• The partners DO NOT want to disclose the original data to each other
• The partners are willing to disclose some derived data
• How can you model data jointly without disclosing it? → Privacy-preserving modeling

SLIDE 61

Privacy-preserving sum

[Figure: four partners, each holding a secret value — can they compute the sum without revealing the values?]

SLIDE 62

Privacy-preserving sum

• Each partner i holds a secret S_i (range 0–10) and draws a private random mask R_i (< 100); an extra random R0 seeds the protocol
• Each partner publishes only its masked value S_i + R_i
• The masks are accumulated separately as a running sum: R0, R0+R1, R0+R2, …, R0+⋯+R4
• Adding R0 to the masked values yields SUM(S1:S4) + SUM(R0:R4); subtracting the accumulated mask sum SUM(R0:R4) reveals only SUM(S1:S4)

SLIDE 63

Privacy-preserving sum

• What we are calculating (see the sketch below):

R0 + (S1+R1) + (S2+R2) + (S3+R3) + (S4+R4) − ((((R0+R1)+R2)+R3)+R4) = S1 + S2 + S3 + S4
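The same protocol in a few lines of Python (our toy version; a real deployment would pass the masked values and running mask sum along a ring of partners or over secure channels):

```python
import random

secrets = [7, 3, 9, 2]                             # S1..S4, private
masks = [random.randrange(100) for _ in secrets]   # R1..R4, private
r0 = random.randrange(100)                         # R0, seeds the protocol

masked = [s + r for s, r in zip(secrets, masks)]   # the only shared values
mask_total = r0 + sum(masks)                       # accumulated privately

assert r0 + sum(masked) - mask_total == sum(secrets)
print(r0 + sum(masked) - mask_total)               # 21, no S_i exposed
```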

SLIDE 64

Single-party Macau

[Figure: Partner 1 factorizes its activity matrix Y1 ≈ U1 · V1ᵀ, with ECFP features X1 mapped to the compound latents through β1]

SLIDE 65

Independent parties

[Figure: Partner 1 and Partner 2 each factorize their own matrix (Y1 ≈ U1 · V1ᵀ with ECFP features X1 and weights β1; Y2 ≈ U2 · V2ᵀ with X2 and β2) — no information is shared]

SLIDE 66

Privacy-preserving broker

[Figure: the broker holds the shared compound latents U, ECFP features X, and weights β; each partner i privately holds its own activities Y_i and target latents V_i]

Initialization: the broker receives X from each partner and aligns them

Iteration (see the sketch below):
  1. Partners privately update V
  2. Partners send contributions for U to the broker
  3. Broker computes and shares U
  4. Broker updates β
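A schematic toy version of one such round (our own single-machine sketch using ALS-style updates; a real system would add secure aggregation and encryption between parties):

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, n_compounds = 16, 128, 50        # latent dim, feature dim, shared rows

class Partner:
    def __init__(self, n_targets):
        self.Y = rng.random((n_compounds, n_targets))  # private activities
        self.V = rng.standard_normal((n_targets, K))   # private target latents

    def update_V(self, U):      # step 1: private ALS-style update of V
        self.V = np.linalg.lstsq(U, self.Y, rcond=None)[0].T

    def U_contribution(self):   # step 2: contributions for the shared U update
        return self.V.T @ self.V, self.Y @ self.V

partners = [Partner(10), Partner(20)]
X = rng.random((n_compounds, D))       # compound features, aligned by broker
U = rng.standard_normal((n_compounds, K))

for _ in range(10):
    for p in partners:
        p.update_V(U)                                   # 1. private V updates
    stats = [p.U_contribution() for p in partners]      # 2. send contributions
    A = sum(s[0] for s in stats) + 0.1 * np.eye(K)      # 3. broker solves U
    U = sum(s[1] for s in stats) @ np.linalg.inv(A)
    beta = np.linalg.lstsq(X, U, rcond=None)[0]         # 4. broker updates beta
```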
SLIDE 67

MachinE Learning Ledger Orchestration for Drug DiscoverY (MELLODDY)

This project has received funding from the Innovative Medicines Initiative 2 Joint Undertaking under grant agreement N° 831472. This Joint Undertaking receives support from the European Union’s Horizon 2020 research and innovation programme and EFPIA.

[Figure: consortium logos — pharma partners and public partners]

Innovative Medicines Initiative · 10 pharma partners · €18,000,000 · June 2019 – May 2022

SLIDE 68

Conclusions

• Fully Bayesian matrix factorization with side information
• Multitask learning with tasks tied by matrix factorization
• Scalable, parallelizable full MCMC
• Particularly attractive when:
  • Modeling prediction uncertainty
  • The target matrix is scarce
  • The feature matrix is sparse
• State-of-the-art performance on chemogenomic tasks
SLIDE 69

Adam Arany · Jaak Simm · Hugo Ceulemans · Jörg Wegner · Pooya Zakeri · Edward De Brouwer · Daniele Parisi · You?

  • Postdoc/PhD
  • Deep learning
  • Privacy-preserving ML
  • Chemoinformatics
  • EHR