

SLIDE 1

Bayesian matrix factorization for drug-target activity prediction

Yves Moreau

University of Leuven – ESAT-STADIUS, SymBioSys Center for Computational Biology

SLIDE 2

[Figure: number of new drugs per billion US$ (log scale, from 100 down to 0.1), declining from 1950 to 2010. Scannell et al. 2012]

SLIDE 3
SLIDE 4

The curse of attrition…

Phase success rates (Hay et al. 2014):

• Phase 1 to Phase 2: 64%
• Phase 2 to Phase 3: 32%
• Phase 3 to NDA/BLA: 60%
• NDA/BLA to approval: 83%

SLIDE 5

…mainly due to safety and efficacy issues

[Figure: causes of failure between Phase 2 and submission in 2011 and 2012 — efficacy, safety, other. Arrowsmith & Miller 2013]

SLIDE 6

Chemoinformatics

• Goal: estimate the interaction between compounds and protein targets
• Activity measured by high-throughput screening
• Activity depends on the match between the shape of the compound and the shape of the protein
• 3D modeling is challenging

[Figure: does the compound (e.g., Viagra) bind the enzyme (e.g., ACE2)?]

SLIDE 7

Drug–target activities

• IC50 – concentration of compound needed for half-maximal inhibition
• pIC50 = −log10(IC50)
• EC50 – concentration of compound needed for half-maximal effect
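For concreteness, a one-line version of the pIC50 conversion above (a minimal sketch of ours; IC50 must first be expressed in molar units):

```python
import numpy as np

# pIC50 = -log10(IC50 in molar); example IC50 values given in nM
ic50_nM = np.array([200.0, 50.0, 1000.0])
pic50 = -np.log10(ic50_nM * 1e-9)   # 200 nM -> pIC50 ≈ 6.7
```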

SLIDE 8

High-throughput screening

• Hit discovery in early drug discovery
• Identify compounds active against a protein drug target of interest
• Activity measured by high-throughput screening
• Activity = “scarce” data

[Figure: sparse compound × target IC50 matrix — millions of compounds, thousands of targets, 1–2% fill rate]

SLIDE 9

Molecular fingerprints

• High-dimensional fingerprints of 2D compound structures; sparse vectors
• Circular fingerprints (MNA, MPD, ECFP): each fingerprint represents a central atom and its neighbors
• Key-based fingerprints (FP2, MACCS): a bit string represents the presence or absence of particular substructures
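As an illustration, both fingerprint families can be computed with RDKit (our choice of tool; the slides do not name a library):

```python
from rdkit import Chem
from rdkit.Chem import AllChem, MACCSkeys

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin

# Circular: ECFP4-like Morgan fingerprint (radius 2), sparse bit vector
ecfp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

# Key-based: MACCS keys, each bit flags a predefined substructure
maccs = MACCSkeys.GenMACCSKeys(mol)
print(ecfp.GetNumOnBits(), maccs.GetNumOnBits())
```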

SLIDE 10

Quantitative Structure–Activity Relationship (QSAR)

• Finds an optimal model α based on predictive features
• IC50(x) = α1·x1 + α2·x2 + … + αF·xF
• Minimize error loss (PLS, ridge regression); see the sketch below
• Good performance if enough training examples
• Does not share information across tasks!

[Figure: per-target regression from compound fingerprint bits to IC50 values]
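A minimal single-task QSAR sketch along these lines, using scikit-learn ridge regression on random stand-in data (all data and values here are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge

# One ridge model per protein target, mapping fingerprint bits x
# to activity (in practice pIC50) via the weights alpha.
X = np.random.randint(0, 2, size=(500, 2048)).astype(float)  # fingerprints
y = np.random.randn(500)                    # activities for ONE target

model = Ridge(alpha=1.0).fit(X, y)          # squared error + L2 penalty
y_hat = model.predict(X[:5])                # IC50(x) = a1*x1 + ... + aF*xF
```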

SLIDE 11

Multitask learning

• From fingerprints and available activities, predict missing activities
• Approaches:
  1. Supervised learning per target (QSAR)
  2. Matrix factorization (Netflix style)
  3. MF + supervised (Macau)

[Figure: 3M compounds × 1,500 targets IC50 matrix (1–2% fill rate) with compound features (6K–4M)]

SLIDE 12

The Netflix Challenge

• Goal: predict user movie ratings
• 480K users, 18K movies
• 100 million ratings
• 1% fill rate
• → Predict the 99% missing ratings
• How can this work?

[Figure: 480K × 18K ratings matrix, almost all entries unknown]

SLIDE 13

Factor analysis

• Low-rank approximation of the full matrix: Y ≈ U · V, with loadings U and factors V

SLIDE 14

Factor analysis

• Row-wise view: Y_i· ≈ U_i· · V (each row of Y from the corresponding row of loadings U and the factor matrix V)

SLIDE 15

Factor analysis

• Individual response (= row) modeled as an individual mixture (= loadings) of a small number of latent responses (= factors):

Y_i· ≈ U_i1 · V_1· + U_i2 · V_2· + … + U_ik · V_k·
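A tiny NumPy check of this view (our own illustration): each row of a rank-3 matrix is a loading-weighted mixture of the three factor rows.

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.standard_normal((100, 3))   # loadings: one mixture per row
V = rng.standard_normal((3, 40))    # factors: 3 latent response profiles
Y = U @ V                           # full matrix, rank 3 by construction

row0 = U[0, 0] * V[0] + U[0, 1] * V[1] + U[0, 2] * V[2]
assert np.allclose(Y[0], row0)      # row = mixture of latent responses
```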

SLIDE 16

Alternating Least Squares

[Figure: row-wise factorization Y_i· ≈ U_i· · V with the loadings U_i· unknown]

SLIDE 17

Alternating Least Squares

• If V were known, U could be found by linear regression

[Figure: Y ≈ U · V with the loadings U unknown]

SLIDE 18

Alternating Least Squares

• If U were known, V could also be found by linear regression

[Figure: Y ≈ U · V with the factors V unknown]

SLIDE 19

Scarce matrix factorization

• Only observed values are used in the regressions (see the ALS sketch below):

min_{U,V} ‖W ⊙ (Y − U·V)‖², where W is the mask of observed entries

[Figure: Y ≈ U* · V* fitted on observed entries only]
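A compact NumPy sketch of ALS restricted to observed entries (our own minimal version, with a small ridge term `lam` added for numerical stability; not the production algorithm):

```python
import numpy as np

def als_scarce(Y, W, k=8, n_iters=50, lam=0.1, seed=0):
    """Alternate ridge regressions for U and V over observed entries only."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = rng.standard_normal((n, k))
    V = rng.standard_normal((k, m))
    for _ in range(n_iters):
        for i in range(n):                      # update each loading row
            obs = W[i] > 0
            A = V[:, obs] @ V[:, obs].T + lam * np.eye(k)
            U[i] = np.linalg.solve(A, V[:, obs] @ Y[i, obs])
        for j in range(m):                      # update each factor column
            obs = W[:, j] > 0
            A = U[obs].T @ U[obs] + lam * np.eye(k)
            V[:, j] = np.linalg.solve(A, U[obs].T @ Y[obs, j])
    return U, V                                 # predict missing: U @ V
```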

SLIDE 20

Scarce matrix factorization

• Once the factors are obtained, the other entries can be predicted: Ŷ = U* · V*

SLIDE 21

Uncertainty

• Given scarce data, is a single solution (U*, V*) meaningful?

[Figure: the point estimate Ŷ = U* · V*]

SLIDE 22

Bayesian modeling

• Given the uncertainty from scarce data, Bayesian inference is desirable
• Instead of the point estimate (U*, V*) = argmin_{U,V} ‖W ⊙ (Y − U·V)‖², we want to consider the Bayesian posterior distribution p(U, V | Y)
• The posterior predictive distribution p(Ŷ | Y) is more informative than any optimal estimator

SLIDE 23

Ordinary least squares

• ALS involves successive regressions solved by OLS

[Figure: row-wise regression Y_i· ≈ U_i· · V]

SLIDE 24

Ordinary least squares

• Model: y = X′β + ε (setup = transposed version of the previous notation, so X is features × samples)
• Solution: β̂ = (XX′)⁻¹ X y
• If the noise is Gaussian, then OLS is the maximum-likelihood estimate

SLIDE 25

Block Gibbs sampler

• The Gibbs sampler is a Markov Chain Monte Carlo (MCMC) method
• MCMC for model inference generates samples from complex posterior distributions of model parameters by iteratively sampling from simpler distributions
• The following scheme is a block Gibbs sampler:

U^(i+1) ~ p(U | V^(i), Y)
V^(i+1) ~ p(V | U^(i+1), Y)
For i sufficiently large, (U^(i), V^(i)) ~ p(U, V | Y)

• Under mild ergodicity conditions, after burn-in the samples are dependent draws from the joint posterior
• Similar to alternating least squares, but global rather than local optimization

SLIDE 26

Markov Chain Monte Carlo

• We do not get the posterior distribution analytically, only samples from it
• Samples are sufficient to characterize the posterior distribution
  • e.g., average solutions to get the posterior mean estimate
  • e.g., marginal variance of individual predictions to characterize uncertainty

SLIDE 27

Bayesian linear regression

• Model: y = X′β + ε
• The distribution of β as a function of the data X and y can be modeled as a multivariate Gaussian distribution over β
• Assume a Gaussian prior for β and an inverse-gamma prior for the noise parameter ρ

SLIDE 28

Bayesian linear regression

• Then the posterior distribution of β is also Gaussian, by application of Bayes’ rule:

Λn = XX′ + Λ0,  μn = Λn⁻¹ (Xy + Λ0 μ0)

• If Λ0 = 0 and μ0 = 0, then the solution for μn is identical to OLS!
• The posterior mean μn is similar to the ridge regression solution
• The precision matrix Λn characterizes the variance of the solution (see the sketch below)
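In code, the posterior update is two lines of NumPy (our sketch, in the slides’ transposed setup where X is features × samples):

```python
import numpy as np

def blr_posterior(X, y, Lambda0, mu0):
    """Gaussian posterior of Bayesian linear regression:
    Lambda_n = X X' + Lambda0, mu_n = Lambda_n^-1 (X y + Lambda0 mu0)."""
    Lambda_n = X @ X.T + Lambda0
    mu_n = np.linalg.solve(Lambda_n, X @ y + Lambda0 @ mu0)
    return mu_n, Lambda_n   # with Lambda0 = 0, mu0 = 0 this is exactly OLS
```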

SLIDE 29

GAMBLR trick

• Executing the Gibbs sampler requires sampling repeatedly from posterior Gaussian distributions (which change every time U and V change)
• Sampling from a multivariate Gaussian distribution: with ε ~ N(0, I) and A such that Σ = AA′, z = μ + Aε ~ N(μ, Σ)
• For Bayesian linear regression, augment the data with the prior:

X̃ = [X  L0],  ỹ = [y ; L0′μ0],  with Λ0 = L0L0′
so that μn = (X̃X̃′)⁻¹ X̃ỹ and Λn = X̃X̃′

• This has the same form as OLS!
• It can be shown that z = (X̃X̃′)⁻¹ X̃(ỹ + σ·ε) ~ N(μn, σ²Λn⁻¹)

SLIDE 30

GAMBLR trick

• This means that we can sample from the posterior Gaussian distribution by solving a linear regression on the original data plus injected noise!
• Running the Gibbs sampler then only amounts to solving a sequence of linear regressions with variable noise injection (see the sketch below)
• Linear regression is one of the best-studied problems in numerical analysis
  • Fast algorithms
  • Scalable code
  • One multivariate regression per row or column of Y at each iteration step, hence easy parallelization
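A minimal NumPy sketch of this noise-injection sampler (our own; the function and variable names are ours, following the formulas on the previous slide):

```python
import numpy as np

def sample_posterior_weights(X, y, L0, mu0, sigma, rng):
    """One posterior draw for Bayesian linear regression via noise injection.

    Transposed setup as on the slides: X is (features x samples),
    prior precision Lambda0 = L0 @ L0.T, prior mean mu0."""
    Xt = np.hstack([X, L0])                 # augmented data, (d, n+d)
    yt = np.concatenate([y, L0.T @ mu0])    # augmented targets, (n+d,)
    eps = rng.standard_normal(yt.shape)     # injected N(0, I) noise
    # OLS on noise-injected targets = draw from N(mu_n, sigma^2 Lambda_n^-1)
    return np.linalg.solve(Xt @ Xt.T, Xt @ (yt + sigma * eps))

# Averaging many draws recovers mu_n; their spread reflects Lambda_n^-1.
```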

SLIDE 31

Matrix factorization

  • One of the best approaches in the Netflix challenge
  • Prediction of ratings for viewer–movie pairs
  • Does not use features, only matrix values
  • Two popular versions:
    • Probabilistic Matrix Factorization (PMF) = maximum likelihood
    • Bayesian PMF (BPMF) = Bayesian inference
SLIDE 32

Netflix comparison (PMF vs. BPMF)

• Data: 100M ratings from 480K users, 18K movies
• BPMF has an advantage for users with few ratings

SLIDE 33

Motivation for Bayesian PMF

• PMF gives point estimates
• Problematic for compounds that have only a few samples
• We are interested in the uncertainty of the estimates

[Figure: example IC50 data set from ChEMBL with 15K compounds]

SLIDE 34

Bayesian PMF

SLIDE 35

Gibbs sampling

• Iteratively samples each parameter
• Obtains posterior samples of the model (e.g., sample 200 models after burn-in)
• Using the samples, one can also measure uncertainty (see the sketch below)
• Related to alternating least squares
• Blocked Gibbs sampler with large blocks, good sampling behavior
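For instance, the posterior mean and a per-entry uncertainty can be read off a stack of sampled models like this (our illustration; the array shapes are assumptions):

```python
import numpy as np

# U_samples: (nsamples, n_compounds, num_latent) post-burn-in Gibbs draws
# V_samples: (nsamples, num_latent, n_targets)
def posterior_predict(U_samples, V_samples):
    """Posterior mean prediction and per-entry spread from Gibbs samples."""
    preds = np.einsum('sik,skj->sij', U_samples, V_samples)
    return preds.mean(axis=0), preds.std(axis=0)
```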

SLIDE 36

ChEMBL: PMF vs. Bayesian PMF

• ChEMBL public data set of assay activities
• Classified IC50: 15,118 compounds, 344 proteins, 59,451 values
• Discretization at 200 nM; 20% test set
• BPMF outperforms PMF
• Does not use features, only matrix values

[Figure: test classification error, PMF vs. BPMF]

SLIDE 37

ChEMBL: BPMF vs. ridge regression

• 15K compounds, 344 proteins, 200 nM threshold, 20% test set
• Vary the number of latent dimensions
• Matrix factorization is not as good as QSAR, but it does capture information

SLIDE 38

BPMF (relation view)

• Model: 2 entities, 1 relation
• Latent variables are learned from the IC50 data

[Figure: compound and protein entities linked by the IC50 relation]

SLIDE 39

Macau

Can we get the best of both worlds?

• Model: 2 entities, 1 relation + features for compounds
• Latent variables are learned together with the feature weights β_comp

[Figure: compound latents U and protein latents V linked by the IC50 relation; fingerprints enter through β_comp]

SLIDE 40
SLIDE 41

Results on ChEMBL

• 15K compounds, 344 proteins, 200 nM threshold, 20% test set
• Compound features improve performance
• Multitask modeling improves performance

SLIDE 42
SLIDE 43
SLIDE 44

Industrial scaling (J&J data)

• ~2M compounds, ~1K targets, tens of millions of activities
• Compute nodes: dual Xeon E5-2699 v3
• Fingerprint 1: 6,000 features
  • Latent dimension = 30; direct solver on a single node
  • 40 s per Gibbs sampling pass; 1,000 iterations (800 burn-in) = ½ day
• Fingerprint 2: 4,000,000 features
  • Sparsity of X: 0.002%; latent dimension = 30; iterative solver on 15 nodes
  • 600 s per Gibbs sampling pass; 1,000 iterations (800 burn-in) = 1 week

SLIDE 45

Single-task vs. multitask learning

• SVM using scikit-learn
  • Separate classifier for every assay
  • Hyperparameters by nested CV, for each assay separately
  • Linear kernel (a Gaussian kernel has equivalent performance but does not scale)
• Macau classification using TensorFlow
  • Non-Bayesian approach (optimization)
  • Multitask learning
  • Hidden representation size: 1,000
  • Model parameters chosen by the ChEMBL experiments

SLIDE 46

Nested clustered cross-validation

• Chemical series effect: all members of a series should be either in the training set or in the test set (see the sketch below)
• Clustering at Tanimoto similarity > 0.7
• Nested cross-validation for hyperparameter tuning
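A sketch of such a clustered split (our own minimal version: naive single-linkage merging at Tanimoto > 0.7, then scikit-learn’s GroupKFold to keep each cluster entirely in train or test):

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs
from sklearn.model_selection import GroupKFold

def tanimoto_clusters(smiles, cutoff=0.7):
    """Naive single-linkage clustering of compounds at Tanimoto > cutoff."""
    mols = [Chem.MolFromSmiles(s) for s in smiles]
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, 2048) for m in mols]
    labels = list(range(len(fps)))   # each compound starts as its own cluster
    for i in range(len(fps)):
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[i + 1:])
        for j, s in enumerate(sims, start=i + 1):
            if s > cutoff:           # merge cluster of j into cluster of i
                old, new = labels[j], labels[i]
                labels = [new if lbl == old else lbl for lbl in labels]
    return np.array(labels)

smiles = ["CCO", "CCN", "c1ccccc1", "c1ccccc1C", "CC(=O)O"]
groups = tanimoto_clusters(smiles)
X, y = np.random.rand(len(smiles), 8), np.random.randint(0, 2, len(smiles))
for train_idx, test_idx in GroupKFold(n_splits=2).split(X, y, groups):
    pass  # fit on train_idx, evaluate on test_idx; series never straddle
```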

SLIDE 47

AUC per assay

• Mean AUC over assays: Macau 0.886, SVM 0.840
• Out of 712 assays (at p < 0.01): Macau wins 382, SVM wins 0, 330 ties

SLIDE 48

Variational Bayes

• Gibbs sampling = “old”; variational Bayes (VB) popular
• Hierarchical blindness in VB:
  • Ignored covariance between γ and the latents u
  • Poor variance estimates
• The u_i covariance increases when side information is present

SLIDE 49

Empirical comparison: ChEMBL

• 15K compounds, 346 proteins, ~60K activity measurements (pIC50)
• 20% test set
• Sparse high-dimensional side information (~100K features)
• Macau drastically outperforms VBMFSI
SLIDE 50

Repurposing High-Content Imaging data

SLIDE 51

Classical high-content imaging

SLIDE 52

Repurposing imaging assays

• High-throughput imaging (= high-content screening)
• 500K compounds, 600 drug targets, 10M activities (30% fill rate)
• Glucocorticoid receptor assay phenotypic screen

  • Feature extraction from images with CellProfiler

Simm et al., Repurposing High-Throughput Image Assays Enables Biological Activity Prediction for Drug Discovery, Cell Chemical Biology (2018)

SLIDE 53

Application

• Oncology drug discovery project

  • Active project
  • Initial screen = 0.725% hit rate (submicromolar)
  • Kinase target
  • No known direct relation to glucocorticoid receptor
  • Rank unscreened compounds with imaging data
  • Test top 342 compounds
  • 141 submicromolar hits (41% hit rate)
  • 60x enrichment
SLIDE 54

Application

• Central nervous system project

  • Active project
  • Initial screen = 0.088% hit rate
  • Enzyme target
  • No known direct relation to glucocorticoid receptor
  • Rank unscreened compounds with imaging data
  • Some additional ADME filtering
  • Select 141 compounds
  • 37 submicromolar hits (22.7% hit rate)
  • 250x enrichment
SLIDE 55

Imaging data improves chemical diversity

• Similar or better hit rates than with structure fingerprints
• BUT high chemical diversity (biologically driven vs. chemically driven selection)

[Figure: chemical diversity of hits in the Oncology and CNS projects]

SLIDE 56

Imaging assays for drug discovery

• 500K compounds, 600 targets, 10M activities (30% fill rate)
• Glucocorticoid receptor assay phenotypic screen
• Evaluate predictivity using clustered cross-validation
• Macau predictive for 37% of assays (CV AUC > 0.7), highly predictive for 5% of assays (CV AUC > 0.9)
  • Assays not related to the original screen!
• Here: a single imaging assay; future: build a systematic library of imaging assays

SLIDE 57

Macau

• Generic package, open source
• OpenMP/C++ with a Python wrapper library

  • https://github.com/jaak-s/macau
  • Factorization with and without side information
  • Real valued and binary matrices (normal and probit noise)
  • Supports tensors (alpha)
  • Univariate and multivariate Gibbs sampler
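A usage sketch following the example in the project README (the file names are the README’s demo data; exact argument names may differ across versions):

```python
import macau
import scipy.io

# Demo data from the macau repository (ChEMBL IC50 subset)
ic50 = scipy.io.mmread("chembl-IC50-346targets.mm")
ecfp = scipy.io.mmread("chembl-IC50-compound-feat.mm")

# Factorization with compound side information; 20% of cells held out
result = macau.macau(Y=ic50,            # sparse activity matrix
                     Ytest=0.2,         # fraction of cells used as test set
                     side=[ecfp, None], # features for rows, none for columns
                     num_latent=32,
                     precision=5.0,     # observation noise precision
                     burnin=400,
                     nsamples=1600)
```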
SLIDE 58

Deep Macau

• Combine deep learning and matrix factorization
• Deep learning allows capturing nonlinear effects
• Matrix factorization allows item-level reasoning
• Instead of only transforming features into a prediction, learn a latent representation of each entity

[Figure: activity prediction for entity IDs via learned latent representations]

SLIDE 59

Privacy-Preserving Machine Learning


SLIDE 60

Privacy-preserving modeling

• Partners want to model data jointly across multiple partners
• The partners DO NOT want to disclose the original data to each other
• The partners are willing to disclose some derived data
• How can you model data jointly without disclosing it? → Privacy-preserving modeling

SLIDE 61

Privacy-preserving sum

[Figure: four partners, each holding a secret value — can they compute the sum without revealing the values?]

SLIDE 62

Privacy-preserving sum

• Each partner i holds a secret S_i (range 0–10) and draws a private random mask R_i (< 100); an extra random R0 seeds the protocol
• Each partner publishes only its masked value S_i + R_i
• The masks are accumulated separately as a running sum: R0, R0+R1, R0+R2, …, R0+⋯+R4
• Adding R0 to the masked values yields SUM(S1:S4) + SUM(R0:R4); subtracting the accumulated mask sum SUM(R0:R4) reveals only SUM(S1:S4)

SLIDE 63

Privacy-preserving sum

• What we are calculating (see the sketch below):

R0 + (S1+R1) + (S2+R2) + (S3+R3) + (S4+R4) − ((((R0+R1)+R2)+R3)+R4) = S1 + S2 + S3 + S4
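The same protocol in a few lines of Python (our toy version; a real deployment would pass the masked values and running mask sum along a ring of partners or over secure channels):

```python
import random

secrets = [7, 3, 9, 2]                             # S1..S4, private
masks = [random.randrange(100) for _ in secrets]   # R1..R4, private
r0 = random.randrange(100)                         # R0, seeds the protocol

masked = [s + r for s, r in zip(secrets, masks)]   # the only shared values
mask_total = r0 + sum(masks)                       # accumulated privately

assert r0 + sum(masked) - mask_total == sum(secrets)
print(r0 + sum(masked) - mask_total)               # 21, no S_i exposed
```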

SLIDE 64

Single-party Macau

[Figure: Partner 1 factorizes its activity matrix Y1 ≈ U1 · V1ᵀ, with ECFP features X1 mapped to the compound latents through β1]

SLIDE 65

Independent parties

[Figure: Partner 1 and Partner 2 each factorize their own matrix (Y1 ≈ U1 · V1ᵀ with ECFP features X1 and weights β1; Y2 ≈ U2 · V2ᵀ with X2 and β2) — no information is shared]

SLIDE 66

Privacy-preserving broker

[Figure: the broker holds the shared compound latents U, ECFP features X, and weights β; each partner i privately holds its own activities Y_i and target latents V_i]

Initialization: the broker receives X from each partner and aligns them

Iteration (see the sketch below):
  1. Partners privately update V
  2. Partners send contributions for U to the broker
  3. Broker computes and shares U
  4. Broker updates β
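A schematic toy version of one such round (our own single-machine sketch using ALS-style updates; a real system would add secure aggregation and encryption between parties):

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, n_compounds = 16, 128, 50        # latent dim, feature dim, shared rows

class Partner:
    def __init__(self, n_targets):
        self.Y = rng.random((n_compounds, n_targets))  # private activities
        self.V = rng.standard_normal((n_targets, K))   # private target latents

    def update_V(self, U):      # step 1: private ALS-style update of V
        self.V = np.linalg.lstsq(U, self.Y, rcond=None)[0].T

    def U_contribution(self):   # step 2: contributions for the shared U update
        return self.V.T @ self.V, self.Y @ self.V

partners = [Partner(10), Partner(20)]
X = rng.random((n_compounds, D))       # compound features, aligned by broker
U = rng.standard_normal((n_compounds, K))

for _ in range(10):
    for p in partners:
        p.update_V(U)                                   # 1. private V updates
    stats = [p.U_contribution() for p in partners]      # 2. send contributions
    A = sum(s[0] for s in stats) + 0.1 * np.eye(K)      # 3. broker solves U
    U = sum(s[1] for s in stats) @ np.linalg.inv(A)
    beta = np.linalg.lstsq(X, U, rcond=None)[0]         # 4. broker updates beta
```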
SLIDE 67

MachinE Learning Ledger Orchestration for Drug DiscoverY (MELLODDY)

This project has received funding from the Innovative Medicines Initiative 2 Joint Undertaking under grant agreement N° 831472. This Joint Undertaking receives support from the European Union’s Horizon 2020 research and innovation programme and EFPIA.

[Figure: consortium logos — pharma partners and public partners]

Innovative Medicines Initiative · 10 pharma partners · €18,000,000 · June 2019 – May 2022

SLIDE 68

Conclusions

• Fully Bayesian matrix factorization with side information
• Multitask learning with tasks tied by matrix factorization
• Scalable, parallelizable full MCMC
• Particularly attractive when:
  • Modeling prediction uncertainty
  • The target matrix is scarce
  • The feature matrix is sparse
• State-of-the-art performance on chemogenomic tasks
SLIDE 69

Adam Arany · Jaak Simm · Hugo Ceulemans · Jörg Wegner · Pooya Zakeri · Edward De Brouwer · Daniele Parisi · You?

  • Postdoc/PhD
  • Deep learning
  • Privacy-preserving ML
  • Chemoinformatics
  • EHR