SLIDE 1

Doubly Stochastic Variational Inference for Neural Processes with Hierarchical Latent Variables

Q. Wang & Herke van Hoof

Amsterdam Machine Learning Lab

ICML 2020


SLIDE 5

Highlights in this Work

A systematic revisit of SPs with an Implicit Latent Variable Model

◮ conceptualization of latent SP models
◮ a clearer understanding of SPs with LVMs

A novel exchangeable SP within a Hierarchical Bayesian Framework

◮ formalization of a hierarchical SP
◮ a plausible approximate inference method

Competitive performance on extensive Uncertainty-aware Applications

◮ high-dimensional regression on simulators and real-world datasets
◮ classification and o.o.d. detection on image datasets

SLIDE 6

Outline of this Talk

1. Motivation for SPs
2. Study of SPs with LVMs
3. NPs with Hierarchical Latent Variables
4. Experiments and Applications

SLIDE 7

Motivation for SPs


SLIDE 12

Why Do We Need Stochastic Processes?

The stochastic process (SP) is a mathematical tool for describing distributions over functions (figure from [1]).

◮ Flexible in handling correlations among samples: significant for non-i.i.d. datasets;
◮ Quantifies uncertainty in risk-sensitive applications: e.g. forecasting $p(s_{t+1} \mid s_t, a_t)$ in autonomous driving [2];
◮ Models distributions instead of point estimates: works as a generative model for further realizations [3].

SLIDE 15

Two Consistencies in Exchangeable SPs

Some required properties for an exchangeable stochastic process $\rho$ [4]:

Marginalization Consistency. For any finite collection of random variables $\{y_1, y_2, \dots, y_{N+M}\}$, the probability is unchanged after marginalizing out a subset:

$$\int \rho_{x_{1:N+M}}(y_{1:N+M}) \, dy_{N+1:N+M} = \rho_{x_{1:N}}(y_{1:N}) \tag{1.1}$$

Exchangeability Consistency. Any permutation $\pi$ of the set of variables does not change the joint probability:

$$\rho_{x_{1:N}}(y_{1:N}) = \rho_{x_{\pi(1:N)}}(y_{\pi(1:N)}) \tag{1.2}$$

With these two conditions, an exchangeable SP can be induced (by the Kolmogorov Extension Theorem).
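
As a quick illustration (not part of the original slides), the sketch below checks both consistencies numerically for a zero-mean GP with an RBF kernel, the classic exchangeable SP; for a Gaussian, marginalizing out points reduces to deleting rows/columns of the kernel matrix. The kernel, jitter, and test sizes are arbitrary choices:

```python
import numpy as np
from scipy.stats import multivariate_normal

def rbf_kernel(x, length_scale=1.0):
    # squared-exponential kernel with a small jitter for numerical stability
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2) + 1e-8 * np.eye(len(x))

def gp_joint_pdf(x, y):
    # zero-mean GP finite-dimensional density: rho_x(y) = N(y; 0, K(x, x))
    return multivariate_normal(np.zeros(len(x)), rbf_kernel(x)).pdf(y)

rng = np.random.default_rng(0)
x, y = rng.uniform(-2, 2, size=5), rng.normal(size=5)

# (1.2) exchangeability: permuting the (x_i, y_i) pairs leaves rho unchanged
p = rng.permutation(5)
print(np.isclose(gp_joint_pdf(x, y), gp_joint_pdf(x[p], y[p])))          # True

# (1.1) marginalization: integrating out y_{4:5} of the 5-point joint just
# deletes the corresponding rows/columns of K, i.e. the 3-point sub-GP density
print(np.isclose(gp_joint_pdf(x[:3], y[:3]),
                 multivariate_normal(np.zeros(3),
                                     rbf_kernel(x)[:3, :3]).pdf(y[:3])))  # True
```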

SLIDE 23

SPs in Progress and Primary Concerns

Crucial properties for SPs:

◮ Scalability to large-scale datasets → optimization/computational bottlenecks
◮ Flexibility in distributions → non-Gaussian or multi-modal properties
◮ Extension to high dimensions → correlations among or across inputs/outputs

Analysis of GPs and NPs:

◮ Gaussian Processes (GPs) → less scalable, with computational complexity $O(N^3)$; less flexible, restricted to Gaussian distributions
◮ Neural Processes (NPs) → more scalable, with computational complexity $O(N)$; more flexible, with no explicit distributional form

SLIDE 24

Study of SPs with LVMs


SLIDE 29

Deep Latent Variable Model as SPs

Here we present an implicit Latent Variable Model for SPs.

Generation paradigm with (potentially correlated) latent variables:

$$z_i = \underbrace{\phi(x_i)}_{\text{deterministic term}} + \underbrace{\epsilon(x_i)}_{\text{stochastic term}} \tag{2.1}$$

where $z_i$ is an index-dependent latent variable, and

$$y_i = \underbrace{\varphi(x_i, z_i)}_{\text{obs. transformation}} + \underbrace{\zeta_i}_{\text{obs. noise}} \tag{2.2}$$

Predictive distribution in SPs: let the context and target input be $C = \{(x_i, y_i) \mid i = 1, 2, \dots, N\}$ and $x_T$; the computation

$$p_\theta(z_T \mid x_C, y_C, x_T) = \frac{p(z_C, z_T)}{\int p(z_C, z_T) \, dz_C}, \qquad y_T \sim p(y_T \mid x_T, z_T, \zeta) \tag{2.3}$$

is mostly intractable.
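
To make the generation paradigm concrete, here is a minimal sketch of ancestral sampling from (2.1)-(2.2); `sin`, `tanh`, and the noise scales are illustrative stand-ins for the learned maps $\phi$, $\epsilon$, $\varphi$ and the noise $\zeta$, not the model itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_realization(x):
    """One function draw from the generation paradigm (2.1)-(2.2)."""
    phi = np.sin(x)                          # deterministic term phi(x_i)
    eps = 0.1 * rng.normal(size=x.shape)     # stochastic term eps(x_i)
    z = phi + eps                            # index-dependent latent variable (2.1)
    zeta = 0.05 * rng.normal(size=x.shape)   # observation noise zeta_i
    return np.tanh(x * z) + zeta             # obs. transformation + obs. noise (2.2)

x = np.linspace(-3, 3, 100)
draws = np.stack([sample_realization(x) for _ in range(10)])  # 10 realizations
```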

SLIDE 33

Gaussian Processes & Neural Processes

The NP family approximates SPs in the form of LVMs.

GP as an exchangeable SP with latent variables:

$$\rho_x(y) = \int \mathcal{N}(y;\, z,\, \tau^{-1} I) \, \underbrace{\mathcal{N}(z;\, m(x),\, K(\cdot, \cdot))}_{\text{l.v.}} \, dz \tag{2.4}$$

NP as an exchangeable SP with a global latent variable:

$$\rho_{x_{1:N+M}}(y_{1:N+M}) = \int \prod_{i=1}^{N+M} \underbrace{p(y_i \mid x_i, z_G)}_{\text{trans.}} \, \underbrace{p(z_G)}_{\text{global l.v.}} \, dz_G \tag{2.5}$$

Remark

Some other models, such as Hierarchical GPs [5] and Deep GPs [6], [7], can also be expressed with LVMs.
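
The following sketch draws functions from the NP form (2.5): a single global $z_G$ is sampled once and then decoded at every input, which is what makes the outputs coherent as one function draw. The random-feature decoder is a hypothetical stand-in for $p(y_i \mid x_i, z_G)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def np_style_sample(x, z_dim=4):
    """Ancestral sample from Eq. (2.5): one global z_G, then conditionally
    independent outputs given z_G."""
    z_G = rng.normal(size=z_dim)             # global latent variable z_G ~ p(z_G)
    features = np.sin(np.outer(x, np.linspace(0.5, 2.0, z_dim)))
    return features @ z_G + 0.05 * rng.normal(size=x.shape)  # p(y_i | x_i, z_G)

x = np.linspace(-3, 3, 100)
curves = np.stack([np_style_sample(x) for _ in range(10)])  # coherent function draws
```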

SLIDE 37

Inference for Neural Processes

A general ELBO with a context prior in NP models [1]:

$$\ln p(y_T \mid x_C, y_C, x_T) \geq \mathbb{E}_{q_\phi}\Big[\ln \underbrace{p_\theta(y_T \mid x_T, z_G)}_{\text{data likelihood}}\Big] - D_{\mathrm{KL}}\Big[\underbrace{q_\phi(z_G \mid x_C, y_C, x_T, y_T)}_{\text{global posterior}} \,\Big\|\, \underbrace{p(z_G \mid x_C, y_C)}_{\text{global prior}}\Big] \tag{2.6}$$

Statistics of the context are invariant to the order of the set instances, e.g. pooling of element-wise embeddings:

$$r_i = h_\theta(x_i, y_i), \qquad r = \bigoplus_{i=1}^{N} r_i, \qquad p_\theta(z_C \mid x_C, y_C) = \mathcal{N}\big(z_C \mid [f_\mu(r), f_\sigma(r)]\big) \tag{2.7}$$

(Figure: permutation-invariant encoder with element-wise MLP embeddings, pooling, and sampling, followed by a decoder.)
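
Below is a hedged PyTorch sketch of the permutation-invariant encoder in (2.7); mean pooling realizes the $\bigoplus$ operator (one common choice), and the bounded-sigma trick and layer sizes are assumptions rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class SetEncoder(nn.Module):
    """Sketch of Eq. (2.7): r_i = h_theta(x_i, y_i), r = pool_i r_i,
    z ~ N(f_mu(r), f_sigma(r)). Mean pooling gives order invariance."""
    def __init__(self, x_dim=1, y_dim=1, r_dim=64, z_dim=32):
        super().__init__()
        self.h = nn.Sequential(nn.Linear(x_dim + y_dim, r_dim), nn.ReLU(),
                               nn.Linear(r_dim, r_dim))
        self.f_mu = nn.Linear(r_dim, z_dim)
        self.f_sigma = nn.Linear(r_dim, z_dim)

    def forward(self, x, y):                     # x: (N, x_dim), y: (N, y_dim)
        r_i = self.h(torch.cat([x, y], dim=-1))  # element-wise embeddings r_i
        r = r_i.mean(dim=0)                      # pooling -> order-invariant statistic
        sigma = 0.1 + 0.9 * torch.sigmoid(self.f_sigma(r))  # bounded std (assumed trick)
        return torch.distributions.Normal(self.f_mu(r), sigma)

enc = SetEncoder()
x, y = torch.randn(10, 1), torch.randn(10, 1)
perm = torch.randperm(10)
print(torch.allclose(enc(x, y).loc, enc(x[perm], y[perm]).loc))  # True: order-invariant
```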

SLIDE 38

NPs with Hierarchical Latent Variables


SLIDE 43

Extending NPs from A Hierarchical Bayes Perspective

Our work starts with two motivations:

◮ Hierarchical Bayesian structures → more expressiveness.
◮ Involving local latent variables → revealing local dependencies across inputs/outputs in high-dimensional cases.

As a result, a hierarchical LVM is induced as the Doubly Stochastic Variational Neural Process (DSVNP):

$$\rho_{x_{1:N+M}}(y_{1:N+M}) = \int \prod_{i=1}^{N+M} p(y_i \mid z_G, z_i, x_i) \, p(z_i \mid x_i, z_G) \, p(z_G) \, dz_{1:N+M} \, dz_G \tag{3.1}$$

Remark

DSVNP satisfies Marginalization and Exchangeability Consistency, so it is a new exchangeable SP.
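
Extending the earlier NP sampling sketch, the following illustrates ancestral sampling from (3.1): one global $z_G$, then a local $z_i$ per index, then the output. All maps are hypothetical stand-ins for the networks $p(z_i \mid x_i, z_G)$ and $p(y_i \mid z_G, z_i, x_i)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def dsvnp_style_sample(x, z_dim=4):
    """Ancestral sample from the hierarchical form (3.1)."""
    z_G = rng.normal(size=z_dim)                               # global latent variable
    features = np.sin(np.outer(x, np.linspace(0.5, 2.0, z_dim)))
    z_local = features @ z_G + 0.3 * rng.normal(size=x.shape)  # z_i ~ p(z_i | x_i, z_G)
    return np.tanh(z_local) + 0.05 * rng.normal(size=x.shape)  # y_i ~ p(y_i | z_G, z_i, x_i)

x = np.linspace(-3, 3, 100)
curves = np.stack([dsvnp_style_sample(x) for _ in range(10)])
```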

SLIDE 46

Approximate Inference for DSVNP

Exact inference for this hierarchical LVM is mostly intractable, hence approximate inference is used. Evidence lower bound for DSVNP:

$$\begin{aligned} \ln p(y_* \mid x_C, y_C, x_*) \geq\; & \mathbb{E}_{q_{\phi_{1,1}}} \mathbb{E}_{q_{\phi_{2,1}}} \big[\ln p(y_* \mid z_G, z_*, x_*)\big] \\ & - \mathbb{E}_{q_{\phi_{1,1}}}\big[D_{\mathrm{KL}}\big[q_{\phi_{2,1}}(z_* \mid z_G, x_*, y_*) \,\big\|\, p_{\phi_{2,2}}(z_* \mid z_G, x_*)\big]\big] \\ & - D_{\mathrm{KL}}\big[q_{\phi_{1,1}}(z_G \mid x_C, y_C, x_T, y_T) \,\big\|\, p_{\phi_{1,2}}(z_G \mid x_C, y_C)\big] \end{aligned} \tag{3.2}$$

(Figure: graphical models with the generative process drawn in black and the recognition models in blue/pink.)
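
A one-sample, reparameterized estimate of (3.2) can be sketched as follows; the doubly stochastic structure shows up as nested `rsample` calls, and the distribution-building arguments are placeholders for the encoder, conditional prior, and decoder networks (a sketch, not the authors' code):

```python
import torch
from torch.distributions import Normal, kl_divergence

def dsvnp_elbo_1sample(q_zG, p_zG, q_z_star_fn, p_z_star_fn, loglik_fn, y_star):
    """One-sample (doubly stochastic) estimate of the ELBO in Eq. (3.2)."""
    z_G = q_zG.rsample()                            # outer layer: global latent
    q_z, p_z = q_z_star_fn(z_G), p_z_star_fn(z_G)   # local posterior/prior given z_G
    z_star = q_z.rsample()                          # inner layer: local latent
    rec = loglik_fn(z_G, z_star, y_star)            # ln p(y* | z_G, z*, x*)
    kl_local = kl_divergence(q_z, p_z).sum()        # KL[q(z*|z_G,x*,y*) || p(z*|z_G,x*)]
    kl_global = kl_divergence(q_zG, p_zG).sum()     # KL[q(z_G|C,T) || p(z_G|C)]
    return rec - kl_local - kl_global               # reparameterized, hence differentiable

# Toy usage with fixed Gaussians standing in for the networks:
q_g = Normal(torch.zeros(8), torch.ones(8))
p_g = Normal(torch.zeros(8), torch.ones(8))
elbo = dsvnp_elbo_1sample(
    q_g, p_g,
    q_z_star_fn=lambda zG: Normal(zG[:4], torch.ones(4)),
    p_z_star_fn=lambda zG: Normal(torch.zeros(4), torch.ones(4)),
    loglik_fn=lambda zG, z, y: Normal(z.sum(), torch.tensor(1.0)).log_prob(y),
    y_star=torch.tensor(0.3),
)
```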

SLIDE 50

Training and Testing in Practice

Similar to NPs, DSVNP is trained with stochastic gradient variational Bayes (SGVB) [8], and training scales by sampling random context points. Testing/forecasting uses the priors and Monte Carlo estimates:

$$p(y_* \mid x_C, y_C, x_*) \approx \frac{1}{KS} \sum_{k=1}^{K} \sum_{s=1}^{S} p_\theta\big(y_* \mid x_*, z_*^{(s)}, z_G^{(k)}\big) \tag{3.3}$$

with latent variables sampled from the prior networks as $z_G^{(k)} \sim p_{\phi_{1,2}}(z_G \mid x_C, y_C)$ and $z_*^{(s)} \sim p_{\phi_{2,2}}(z_* \mid z_G^{(k)}, x_*)$.
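
A sketch of the Monte Carlo predictive estimate (3.3), computed in log space for numerical stability; the prior and decoder arguments are placeholders for the trained networks, and K = S = 16 is an arbitrary choice:

```python
import torch

def mc_predictive_log_prob(y_star, p_zG, p_z_star_fn, decoder_fn, K=16, S=16):
    """log (1/KS) sum_{k,s} p_theta(y* | x*, z_*^{(s)}, z_G^{(k)}), Eq. (3.3)."""
    log_terms = []
    for _ in range(K):
        z_G = p_zG.sample()                    # z_G^{(k)} ~ p_{phi_{1,2}}(z_G | x_C, y_C)
        for _ in range(S):
            z_star = p_z_star_fn(z_G).sample() # z_*^{(s)} ~ p_{phi_{2,2}}(z_* | z_G, x_*)
            log_terms.append(decoder_fn(z_G, z_star).log_prob(y_star))
    stacked = torch.stack(log_terms)           # (K*S,) decoder log-densities
    return torch.logsumexp(stacked, dim=0) - torch.log(torch.tensor(float(K * S)))
```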

SLIDE 51

Experiments and Applications


SLIDE 55

Toy Experiments

Discoveries in 1-D simulation experiments, in terms of fitting errors and uncertainty quantification (UQ):

◮ Epistemic uncertainty on a single curve: NP/AttnNP → over-confident in some regions.
◮ Interpolation on curves of an SP: AttnNP ≻ DSVNP ≻ NP ≻ CNP (fitting/UQ performance).
◮ Extrapolation on curves of an SP: tough for all models in fitting; NP/AttnNP → over-confident; DSVNP → better UQ.

(Figure panels: (a) CNP (b) NP (c) AttnNP (d) DSVNP)

SLIDE 59

Multi-output Regression: Simulation/Real-world Dataset

Investigations on (1) system identification on cart-pole transitions [9] and (2) regression on real-world datasets:

◮ System identification: MSE and NLL are not in accordance; DSVNP & CNP → better UQ; DSVNP & AttnNP → lower fitting error.
◮ High-dimensional regression: hierarchical latent variables advance performance significantly.

SLIDE 62

Classification with Uncertainty Quantification

Observations in image classification and out-of-distribution (o.o.d.) detection, based on cumulative distributions of predictive entropies:

◮ MNIST: no significant difference in classification performance or o.o.d. detection (all above 99%); DSVNP → better o.o.d. detection on FMNIST/KMNIST; MC-Dropout more robust to Gaussian/Uniform noise.
◮ CIFAR10: DSVNP (86.3%) ≻ MC-Dropout/CNP ≻ AttnNP/NP ≻ NN (classification performance); DSVNP → best entropy distributions on in-domain data and most robust to Rademacher noise.

SLIDE 63

Future Work

SLIDE 66

Some Challenging and Promising Directions

◮ More effective inference methods for the proposed hierarchical SPs
◮ More expressive context latent variables using higher-order statistics
◮ More exploration of uncertainty-aware decision-making problems

SLIDE 67

Thanks for Listening

SLIDE 68
[1] M. Garnelo, J. Schwarz, D. Rosenbaum, F. Viola, D. J. Rezende, S. Eslami, and Y. W. Teh, "Neural processes," arXiv preprint arXiv:1807.01622, 2018.
[2] M. Deisenroth and C. E. Rasmussen, "PILCO: A model-based and data-efficient approach to policy search," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 465–472.
[3] F. P. Casale, A. Dalca, L. Saglietti, J. Listgarten, and N. Fusi, "Gaussian process prior variational autoencoders," in Advances in Neural Information Processing Systems, 2018, pp. 10369–10380.
[4] D. Heath and W. Sudderth, "De Finetti's theorem on exchangeable variables," The American Statistician, vol. 30, no. 4, pp. 188–189, 1976.
[5] S. Park and S. Choi, "Hierarchical Gaussian process regression," in Proceedings of the 2nd Asian Conference on Machine Learning, 2010, pp. 95–110.
[6] A. Damianou and N. Lawrence, "Deep Gaussian processes," in Artificial Intelligence and Statistics, 2013, pp. 207–215.
[7] Z. Dai, A. Damianou, J. González, and N. Lawrence, "Variational auto-encoded deep Gaussian processes," arXiv preprint arXiv:1511.06455, 2015.

SLIDE 69
[8] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.
[9] Y. Gal, R. McAllister, and C. E. Rasmussen, "Improving PILCO with Bayesian neural network dynamics models," in Data-Efficient Machine Learning Workshop, ICML, vol. 4, 2016, p. 34.