SLIDE 1

Doubly Stochastic Variational Inference for Neural Processes with Hierarchical Latent Variables

Q. Wang & Herke van Hoof

Amsterdam Machine Learning Lab

ICML 2020


SLIDE 5

Highlights in this Work

A systematic revisit of SPs with an Implicit Latent Variable Model

◮ conceptualization of latent SP models
◮ a clearer understanding of SPs with LVMs

A novel exchangeable SP within a Hierarchical Bayesian Framework

◮ formalization of a hierarchical SP
◮ a plausible approximate inference method

Competitive performance on extensive Uncertainty-aware Applications

◮ high-dimensional regression on simulators and real-world datasets
◮ classification and o.o.d. detection on image datasets

SLIDE 6

Outline of this Talk

1. Motivation for SPs
2. Study of SPs with LVMs
3. NPs with Hierarchical Latent Variables
4. Experiments and Applications

SLIDE 7

Motivation for SPs


SLIDE 12

Why Do We Need Stochastic Processes?

The stochastic process (SP) is a mathematical tool for describing distributions over functions (figure from [1]).

◮ Flexible in handling correlations among samples: significant for non-i.i.d. datasets;
◮ Quantifies uncertainty in risk-sensitive applications: e.g. forecasting $p(s_{t+1} \mid s_t, a_t)$ in autonomous driving [2];
◮ Models distributions instead of point estimates: works as a generative model for further realizations [3].

SLIDE 15

Two Consistencies in Exchangeable SPs

Some required properties for an exchangeable stochastic process $\rho$ [4]:

Marginalization Consistency. For any finite collection of random variables $\{y_1, y_2, \dots, y_{N+M}\}$, the probability is unchanged after marginalizing out a subset:

$$\int \rho_{x_{1:N+M}}(y_{1:N+M}) \, dy_{N+1:N+M} = \rho_{x_{1:N}}(y_{1:N}) \tag{1.1}$$

Exchangeability Consistency. Any permutation $\pi$ of the set of variables does not change the joint probability:

$$\rho_{x_{1:N}}(y_{1:N}) = \rho_{x_{\pi(1:N)}}(y_{\pi(1:N)}) \tag{1.2}$$

With these two conditions, an exchangeable SP can be induced (by the Kolmogorov Extension Theorem).
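
As a quick illustration (not part of the original slides), the sketch below checks both consistencies numerically for a zero-mean GP with an RBF kernel, the classic exchangeable SP; for a Gaussian, marginalizing out points reduces to deleting rows/columns of the kernel matrix. The kernel, jitter, and test sizes are arbitrary choices:

```python
import numpy as np
from scipy.stats import multivariate_normal

def rbf_kernel(x, length_scale=1.0):
    # squared-exponential kernel with a small jitter for numerical stability
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2) + 1e-8 * np.eye(len(x))

def gp_joint_pdf(x, y):
    # zero-mean GP finite-dimensional density: rho_x(y) = N(y; 0, K(x, x))
    return multivariate_normal(np.zeros(len(x)), rbf_kernel(x)).pdf(y)

rng = np.random.default_rng(0)
x, y = rng.uniform(-2, 2, size=5), rng.normal(size=5)

# (1.2) exchangeability: permuting the (x_i, y_i) pairs leaves rho unchanged
p = rng.permutation(5)
print(np.isclose(gp_joint_pdf(x, y), gp_joint_pdf(x[p], y[p])))          # True

# (1.1) marginalization: integrating out y_{4:5} of the 5-point joint just
# deletes the corresponding rows/columns of K, i.e. the 3-point sub-GP density
print(np.isclose(gp_joint_pdf(x[:3], y[:3]),
                 multivariate_normal(np.zeros(3),
                                     rbf_kernel(x)[:3, :3]).pdf(y[:3])))  # True
```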

SLIDE 23

SPs in Progress and Primary Concerns

Crucial properties for SPs:

◮ Scalability to large-scale datasets → optimization/computational bottlenecks
◮ Flexibility in distributions → non-Gaussian or multi-modal properties
◮ Extension to high dimensions → correlations among or across inputs/outputs

Analysis of GPs and NPs:

◮ Gaussian Processes (GPs) → less scalable, with computational complexity $O(N^3)$; less flexible, restricted to Gaussian distributions
◮ Neural Processes (NPs) → more scalable, with computational complexity $O(N)$; more flexible, with no explicit distributional form

SLIDE 24

Study of SPs with LVMs


SLIDE 29

Deep Latent Variable Model as SPs

Here we present an implicit Latent Variable Model for SPs.

Generation paradigm with (potentially correlated) latent variables:

$$z_i = \underbrace{\phi(x_i)}_{\text{deterministic term}} + \underbrace{\epsilon(x_i)}_{\text{stochastic term}} \tag{2.1}$$

where $z_i$ is an index-dependent latent variable, and

$$y_i = \underbrace{\varphi(x_i, z_i)}_{\text{obs. transformation}} + \underbrace{\zeta_i}_{\text{obs. noise}} \tag{2.2}$$

Predictive distribution in SPs: let the context and target input be $C = \{(x_i, y_i) \mid i = 1, 2, \dots, N\}$ and $x_T$; the computation

$$p_\theta(z_T \mid x_C, y_C, x_T) = \frac{p(z_C, z_T)}{\int p(z_C, z_T) \, dz_C}, \qquad y_T \sim p(y_T \mid x_T, z_T, \zeta) \tag{2.3}$$

is mostly intractable.
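
To make the generation paradigm concrete, here is a minimal sketch of ancestral sampling from (2.1)-(2.2); `sin`, `tanh`, and the noise scales are illustrative stand-ins for the learned maps $\phi$, $\epsilon$, $\varphi$ and the noise $\zeta$, not the model itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_realization(x):
    """One function draw from the generation paradigm (2.1)-(2.2)."""
    phi = np.sin(x)                          # deterministic term phi(x_i)
    eps = 0.1 * rng.normal(size=x.shape)     # stochastic term eps(x_i)
    z = phi + eps                            # index-dependent latent variable (2.1)
    zeta = 0.05 * rng.normal(size=x.shape)   # observation noise zeta_i
    return np.tanh(x * z) + zeta             # obs. transformation + obs. noise (2.2)

x = np.linspace(-3, 3, 100)
draws = np.stack([sample_realization(x) for _ in range(10)])  # 10 realizations
```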

SLIDE 33

Gaussian Processes & Neural Processes

The NP family approximates SPs in the form of LVMs.

GP as an exchangeable SP with latent variables:

$$\rho_x(y) = \int \mathcal{N}(y;\, z,\, \tau^{-1} I) \, \underbrace{\mathcal{N}(z;\, m(x),\, K(\cdot, \cdot))}_{\text{l.v.}} \, dz \tag{2.4}$$

NP as an exchangeable SP with a global latent variable:

$$\rho_{x_{1:N+M}}(y_{1:N+M}) = \int \prod_{i=1}^{N+M} \underbrace{p(y_i \mid x_i, z_G)}_{\text{trans.}} \, \underbrace{p(z_G)}_{\text{global l.v.}} \, dz_G \tag{2.5}$$

Remark

Some other models, such as Hierarchical GPs [5] and Deep GPs [6], [7], can also be expressed with LVMs.
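
The following sketch draws functions from the NP form (2.5): a single global $z_G$ is sampled once and then decoded at every input, which is what makes the outputs coherent as one function draw. The random-feature decoder is a hypothetical stand-in for $p(y_i \mid x_i, z_G)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def np_style_sample(x, z_dim=4):
    """Ancestral sample from Eq. (2.5): one global z_G, then conditionally
    independent outputs given z_G."""
    z_G = rng.normal(size=z_dim)             # global latent variable z_G ~ p(z_G)
    features = np.sin(np.outer(x, np.linspace(0.5, 2.0, z_dim)))
    return features @ z_G + 0.05 * rng.normal(size=x.shape)  # p(y_i | x_i, z_G)

x = np.linspace(-3, 3, 100)
curves = np.stack([np_style_sample(x) for _ in range(10)])  # coherent function draws
```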

SLIDE 37

Inference for Neural Processes

A general ELBO with a context prior in NP models [1]:

$$\ln p(y_T \mid x_C, y_C, x_T) \geq \mathbb{E}_{q_\phi}\Big[\ln \underbrace{p_\theta(y_T \mid x_T, z_G)}_{\text{data likelihood}}\Big] - D_{\mathrm{KL}}\Big[\underbrace{q_\phi(z_G \mid x_C, y_C, x_T, y_T)}_{\text{global posterior}} \,\Big\|\, \underbrace{p(z_G \mid x_C, y_C)}_{\text{global prior}}\Big] \tag{2.6}$$

Statistics of the context are invariant to the order of the set instances, e.g. pooling of element-wise embeddings:

$$r_i = h_\theta(x_i, y_i), \qquad r = \bigoplus_{i=1}^{N} r_i, \qquad p_\theta(z_C \mid x_C, y_C) = \mathcal{N}\big(z_C \mid [f_\mu(r), f_\sigma(r)]\big) \tag{2.7}$$

(Figure: permutation-invariant encoder with element-wise MLP embeddings, pooling, and sampling, followed by a decoder.)
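
Below is a hedged PyTorch sketch of the permutation-invariant encoder in (2.7); mean pooling realizes the $\bigoplus$ operator (one common choice), and the bounded-sigma trick and layer sizes are assumptions rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class SetEncoder(nn.Module):
    """Sketch of Eq. (2.7): r_i = h_theta(x_i, y_i), r = pool_i r_i,
    z ~ N(f_mu(r), f_sigma(r)). Mean pooling gives order invariance."""
    def __init__(self, x_dim=1, y_dim=1, r_dim=64, z_dim=32):
        super().__init__()
        self.h = nn.Sequential(nn.Linear(x_dim + y_dim, r_dim), nn.ReLU(),
                               nn.Linear(r_dim, r_dim))
        self.f_mu = nn.Linear(r_dim, z_dim)
        self.f_sigma = nn.Linear(r_dim, z_dim)

    def forward(self, x, y):                     # x: (N, x_dim), y: (N, y_dim)
        r_i = self.h(torch.cat([x, y], dim=-1))  # element-wise embeddings r_i
        r = r_i.mean(dim=0)                      # pooling -> order-invariant statistic
        sigma = 0.1 + 0.9 * torch.sigmoid(self.f_sigma(r))  # bounded std (assumed trick)
        return torch.distributions.Normal(self.f_mu(r), sigma)

enc = SetEncoder()
x, y = torch.randn(10, 1), torch.randn(10, 1)
perm = torch.randperm(10)
print(torch.allclose(enc(x, y).loc, enc(x[perm], y[perm]).loc))  # True: order-invariant
```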

SLIDE 38

NPs with Hierarchical Latent Variables


SLIDE 43

Extending NPs from A Hierarchical Bayes Perspective

Our work starts with two motivations:

◮ Hierarchical Bayesian structures → more expressiveness.
◮ Involving local latent variables → revealing local dependencies across inputs/outputs in high-dimensional cases.

As a result, a hierarchical LVM is induced as the Doubly Stochastic Variational Neural Process (DSVNP):

$$\rho_{x_{1:N+M}}(y_{1:N+M}) = \int \prod_{i=1}^{N+M} p(y_i \mid z_G, z_i, x_i) \, p(z_i \mid x_i, z_G) \, p(z_G) \, dz_{1:N+M} \, dz_G \tag{3.1}$$

Remark

DSVNP satisfies Marginalization and Exchangeability Consistency, so it is a new exchangeable SP.
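
Extending the earlier NP sampling sketch, the following illustrates ancestral sampling from (3.1): one global $z_G$, then a local $z_i$ per index, then the output. All maps are hypothetical stand-ins for the networks $p(z_i \mid x_i, z_G)$ and $p(y_i \mid z_G, z_i, x_i)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def dsvnp_style_sample(x, z_dim=4):
    """Ancestral sample from the hierarchical form (3.1)."""
    z_G = rng.normal(size=z_dim)                               # global latent variable
    features = np.sin(np.outer(x, np.linspace(0.5, 2.0, z_dim)))
    z_local = features @ z_G + 0.3 * rng.normal(size=x.shape)  # z_i ~ p(z_i | x_i, z_G)
    return np.tanh(z_local) + 0.05 * rng.normal(size=x.shape)  # y_i ~ p(y_i | z_G, z_i, x_i)

x = np.linspace(-3, 3, 100)
curves = np.stack([dsvnp_style_sample(x) for _ in range(10)])
```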

SLIDE 46

Approximate Inference for DSVNP

Exact inference for this hierarchical LVM is mostly intractable, hence approximate inference is used. Evidence lower bound for DSVNP:

$$\begin{aligned} \ln p(y_* \mid x_C, y_C, x_*) \geq\; & \mathbb{E}_{q_{\phi_{1,1}}} \mathbb{E}_{q_{\phi_{2,1}}} \big[\ln p(y_* \mid z_G, z_*, x_*)\big] \\ & - \mathbb{E}_{q_{\phi_{1,1}}}\big[D_{\mathrm{KL}}\big[q_{\phi_{2,1}}(z_* \mid z_G, x_*, y_*) \,\big\|\, p_{\phi_{2,2}}(z_* \mid z_G, x_*)\big]\big] \\ & - D_{\mathrm{KL}}\big[q_{\phi_{1,1}}(z_G \mid x_C, y_C, x_T, y_T) \,\big\|\, p_{\phi_{1,2}}(z_G \mid x_C, y_C)\big] \end{aligned} \tag{3.2}$$

(Figure: graphical models with the generative process drawn in black and the recognition models in blue/pink.)
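
A one-sample, reparameterized estimate of (3.2) can be sketched as follows; the doubly stochastic structure shows up as nested `rsample` calls, and the distribution-building arguments are placeholders for the encoder, conditional prior, and decoder networks (a sketch, not the authors' code):

```python
import torch
from torch.distributions import Normal, kl_divergence

def dsvnp_elbo_1sample(q_zG, p_zG, q_z_star_fn, p_z_star_fn, loglik_fn, y_star):
    """One-sample (doubly stochastic) estimate of the ELBO in Eq. (3.2)."""
    z_G = q_zG.rsample()                            # outer layer: global latent
    q_z, p_z = q_z_star_fn(z_G), p_z_star_fn(z_G)   # local posterior/prior given z_G
    z_star = q_z.rsample()                          # inner layer: local latent
    rec = loglik_fn(z_G, z_star, y_star)            # ln p(y* | z_G, z*, x*)
    kl_local = kl_divergence(q_z, p_z).sum()        # KL[q(z*|z_G,x*,y*) || p(z*|z_G,x*)]
    kl_global = kl_divergence(q_zG, p_zG).sum()     # KL[q(z_G|C,T) || p(z_G|C)]
    return rec - kl_local - kl_global               # reparameterized, hence differentiable

# Toy usage with fixed Gaussians standing in for the networks:
q_g = Normal(torch.zeros(8), torch.ones(8))
p_g = Normal(torch.zeros(8), torch.ones(8))
elbo = dsvnp_elbo_1sample(
    q_g, p_g,
    q_z_star_fn=lambda zG: Normal(zG[:4], torch.ones(4)),
    p_z_star_fn=lambda zG: Normal(torch.zeros(4), torch.ones(4)),
    loglik_fn=lambda zG, z, y: Normal(z.sum(), torch.tensor(1.0)).log_prob(y),
    y_star=torch.tensor(0.3),
)
```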

SLIDE 50

Training and Testing in Practice

Similar to NPs, DSVNP is trained with stochastic gradient variational Bayes (SGVB) [8], and training scales by sampling random context points. Testing/forecasting uses the priors and Monte Carlo estimates:

$$p(y_* \mid x_C, y_C, x_*) \approx \frac{1}{KS} \sum_{k=1}^{K} \sum_{s=1}^{S} p_\theta\big(y_* \mid x_*, z_*^{(s)}, z_G^{(k)}\big) \tag{3.3}$$

with latent variables sampled from the prior networks as $z_G^{(k)} \sim p_{\phi_{1,2}}(z_G \mid x_C, y_C)$ and $z_*^{(s)} \sim p_{\phi_{2,2}}(z_* \mid z_G^{(k)}, x_*)$.
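
A sketch of the Monte Carlo predictive estimate (3.3), computed in log space for numerical stability; the prior and decoder arguments are placeholders for the trained networks, and K = S = 16 is an arbitrary choice:

```python
import torch

def mc_predictive_log_prob(y_star, p_zG, p_z_star_fn, decoder_fn, K=16, S=16):
    """log (1/KS) sum_{k,s} p_theta(y* | x*, z_*^{(s)}, z_G^{(k)}), Eq. (3.3)."""
    log_terms = []
    for _ in range(K):
        z_G = p_zG.sample()                    # z_G^{(k)} ~ p_{phi_{1,2}}(z_G | x_C, y_C)
        for _ in range(S):
            z_star = p_z_star_fn(z_G).sample() # z_*^{(s)} ~ p_{phi_{2,2}}(z_* | z_G, x_*)
            log_terms.append(decoder_fn(z_G, z_star).log_prob(y_star))
    stacked = torch.stack(log_terms)           # (K*S,) decoder log-densities
    return torch.logsumexp(stacked, dim=0) - torch.log(torch.tensor(float(K * S)))
```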

SLIDE 51

Experiments and Applications


SLIDE 55

Toy Experiments

Discoveries in 1-D simulation experiments, in terms of fitting errors and uncertainty quantification (UQ):

◮ Epistemic uncertainty on a single curve: NP/AttnNP → over-confident in some regions.
◮ Interpolation on curves of an SP: AttnNP ≻ DSVNP ≻ NP ≻ CNP (fitting/UQ performance).
◮ Extrapolation on curves of an SP: tough for all models in fitting; NP/AttnNP → over-confident; DSVNP → better UQ.

(Figure panels: (a) CNP (b) NP (c) AttnNP (d) DSVNP)

SLIDE 59

Multi-output Regression: Simulation/Real-world Dataset

Investigations on (1) system identification on cart-pole transitions [9] and (2) regression on real-world datasets:

◮ System identification: MSE and NLL are not in accordance; DSVNP & CNP → better UQ; DSVNP & AttnNP → lower fitting error.
◮ High-dimensional regression: hierarchical latent variables advance performance significantly.

SLIDE 62

Classification with Uncertainty Quantification

Observations in image classification and out-of-distribution (o.o.d.) detection, based on cumulative distributions of predictive entropies:

◮ MNIST: no significant difference in classification performance or o.o.d. detection (all above 99%); DSVNP → better o.o.d. detection on FMNIST/KMNIST; MC-Dropout more robust to Gaussian/Uniform noise.
◮ CIFAR10: DSVNP (86.3%) ≻ MC-Dropout/CNP ≻ AttnNP/NP ≻ NN (classification performance); DSVNP → best entropy distributions on in-domain data and most robust to Rademacher noise.

SLIDE 63

Future Work

SLIDE 66

Some Challenging and Promising Directions

◮ More effective inference methods for the proposed hierarchical SPs
◮ More expressive context latent variables using higher-order statistics
◮ More exploration of uncertainty-aware decision-making problems

SLIDE 67

Thanks for Listening

SLIDE 68
[1] M. Garnelo, J. Schwarz, D. Rosenbaum, F. Viola, D. J. Rezende, S. Eslami, and Y. W. Teh, "Neural processes," arXiv preprint arXiv:1807.01622, 2018.
[2] M. Deisenroth and C. E. Rasmussen, "PILCO: A model-based and data-efficient approach to policy search," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 465–472.
[3] F. P. Casale, A. Dalca, L. Saglietti, J. Listgarten, and N. Fusi, "Gaussian process prior variational autoencoders," in Advances in Neural Information Processing Systems, 2018, pp. 10369–10380.
[4] D. Heath and W. Sudderth, "De Finetti's theorem on exchangeable variables," The American Statistician, vol. 30, no. 4, pp. 188–189, 1976.
[5] S. Park and S. Choi, "Hierarchical Gaussian process regression," in Proceedings of the 2nd Asian Conference on Machine Learning, 2010, pp. 95–110.
[6] A. Damianou and N. Lawrence, "Deep Gaussian processes," in Artificial Intelligence and Statistics, 2013, pp. 207–215.
[7] Z. Dai, A. Damianou, J. González, and N. Lawrence, "Variational auto-encoded deep Gaussian processes," arXiv preprint arXiv:1511.06455, 2015.

SLIDE 69
[8] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.
[9] Y. Gal, R. McAllister, and C. E. Rasmussen, "Improving PILCO with Bayesian neural network dynamics models," in Data-Efficient Machine Learning Workshop, ICML, vol. 4, 2016, p. 34.