Connectivity-Optimized Representation Learning via Persistent Homology (ICML 2019): Presentation Transcript


SLIDE 1

Connectivity-Optimized Representation Learning via Persistent Homology

Christoph D. Hofer, Roland Kwitt, Mandar Dixit, Marc Niethammer
University of Salzburg / Microsoft / UNC Chapel Hill

ICML | 2019, Long Beach

SLIDE 2

Unsupervised representation learning

◮ Robust to perturbations of the input
◮ Ability to reconstruct (→ prevalence of autoencoders)
◮ Useful for downstream tasks (e.g., clustering or classification)
◮ etc.

Q: What makes a good representation?

SLIDE 3

Unsupervised representation learning

fθ : X → Z

◮ Robust to perturbations of the input
◮ Ability to reconstruct (→ prevalence of autoencoders)
◮ Useful for downstream tasks (e.g., clustering or classification)
◮ etc.

Q: What makes a good representation?

Common idea: Control (/or enforce) properties of (/on) the latent representations in Z.

[Autoencoder diagram: x → Encoder → latent space Z → Decoder (gφ : Z → X) → x̂; reconstruction loss Rec[x, x̂]]
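For reference, a minimal PyTorch sketch of such an encoder/decoder pair with a reconstruction loss (layer sizes, the MSE loss, and all names are illustrative assumptions, not the paper's architecture):

import torch
import torch.nn as nn

# Minimal autoencoder sketch: f_theta : X -> Z (encoder), g_phi : Z -> X (decoder).
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 2))
decoder = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 784))

x = torch.randn(32, 784)                      # a batch of inputs from X
z = encoder(x)                                # latent codes in Z
x_hat = decoder(z)                            # reconstructions
rec_loss = nn.functional.mse_loss(x_hat, x)   # Rec[x, x_hat]
rec_loss.backward()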

SLIDE 4

Unsupervised representation learning

fθ : X → Z

◮ Robust to perturbations of the input
◮ Ability to reconstruct (→ prevalence of autoencoders)
◮ Useful for downstream tasks (e.g., clustering or classification)
◮ etc.

Q: What makes a good representation?

Common idea: Control (/or enforce) properties of (/on) the latent representations in Z.

Contractive AEs [Rifai et al., ICML ’11]: add a regularizer (+ Reg) to the reconstruction loss Rec[x, x̂].

[Autoencoder diagram: x → Encoder → latent space Z → Decoder (gφ : Z → X) → x̂]

SLIDE 5

Unsupervised representation learning

fθ : X → Z

◮ Robust to perturbations of the input
◮ Ability to reconstruct (→ prevalence of autoencoders)
◮ Useful for downstream tasks (e.g., clustering or classification)
◮ etc.

Q: What makes a good representation?

Common idea: Control (/or enforce) properties of (/on) the latent representations in Z.

Denoising AEs [Vincent et al., JMLR ’10]: perturb, or zero-out, parts of the input before encoding and reconstruct the clean x via Rec[x, x̂].

[Autoencoder diagram: perturbed x → Encoder → latent space Z → Decoder (gφ : Z → X) → x̂]

SLIDE 6

Unsupervised representation learning

fθ : X → Z

◮ Robust to perturbations of the input
◮ Ability to reconstruct (→ prevalence of autoencoders)
◮ Useful for downstream tasks (e.g., clustering or classification)
◮ etc.

Q: What makes a good representation?

Common idea: Control (/or enforce) properties of (/on) the latent representations in Z.

Sparse AEs [Makhzani & Frey, ICLR ’14]: add a regularizer (+ Reg) to the reconstruction loss Rec[x, x̂].

[Autoencoder diagram: x → Encoder → latent space Z → Decoder (gφ : Z → X) → x̂]

SLIDE 7

Unsupervised representation learning

fθ : X → Z

◮ Robust to perturbations of the input
◮ Ability to reconstruct (→ prevalence of autoencoders)
◮ Useful for downstream tasks (e.g., clustering or classification)
◮ etc.

Q: What makes a good representation?

Common idea: Control (/or enforce) properties of (/on) the latent representations in Z.

Adversarial AEs [Makhzani et al., ICLR ’16]: enforce distributional properties of the latent codes through adversarial training.

[Autoencoder diagram: x → Encoder → latent space Z → Decoder (gφ : Z → X) → x̂; reconstruction loss Rec[x, x̂]]

(by far not exhaustive)

SLIDE 8

Motivating (toy) example

We aim to control properties of the latent space, but from a topological point of view!

SLIDE 9

Motivating (toy) example

Assume, we want to do Kernel Density Estimation (KDE) in the latent space Z.

Bandwidth selection: Scott’s rule [Scott, 1992]

[Figure: latent samples (zi) and the resulting Gaussian KDE]

We aim to control properties of the latent space, but from a topological point of view!

SLIDE 10

Motivating (toy) example

Assume, we want to do Kernel Density Estimation (KDE) in the latent space Z.

Bandwidth selection (Scott’s rule [Scott, 1992]) can be challenging, as the scaling greatly differs!

[Figure: two sets of latent samples (zi) with very different scales and their Gaussian KDEs]

We aim to control properties of the latent space, but from a topological point of view!
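As a minimal illustration of this KDE step (a sketch using SciPy's gaussian_kde with Scott's rule; the data and names are stand-ins, not the paper's code):

import numpy as np
from scipy.stats import gaussian_kde

# Stand-in latent codes z_i, shape (n_samples, dim).
z = np.random.randn(500, 2)

# gaussian_kde expects shape (dim, n_samples); bw_method='scott' applies Scott's rule.
kde = gaussian_kde(z.T, bw_method='scott')

# Evaluate the estimated density at the latent points themselves.
density = kde(z.T)
print(density.shape)   # (500,)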

SLIDE 11

Controlling connectivity

[Figure: points in the latent space Z]

Q: How do we capture topological properties and what do we want to control?

SLIDE 12

Controlling connectivity

Vietoris-Rips Persistent Homology (PH)

[Figure: latent space Z with balls of radius r = r1 around each point]

Q: How do we capture topological properties and what do we want to control?

SLIDE 13

Controlling connectivity

Vietoris-Rips Persistent Homology (PH)

[Figure: latent space Z with balls of radius r = r2 around each point]

Q: How do we capture topological properties and what do we want to control?

SLIDE 14

Controlling connectivity

Vietoris-Rips Persistent Homology (PH)

◮ PH tracks topological changes as the ball radius r increases
◮ Connectivity information is captured by 0-dim. persistent homology

[Figure: latent space Z with balls of radius r = r3 around each point]

Q: How do we capture topological properties and what do we want to control?

SLIDE 15

Controlling connectivity

Vietoris-Rips Persistent Homology (PH)

◮ PH tracks topological changes as the ball radius r increases
◮ Connectivity information is captured by 0-dim. persistent homology

What if the points were mapped (z → fθ(z)) into a homogeneous arrangement at scale η/2? This would be beneficial for KDE.

[Figure: latent space Z at radius r = r3 with points in a homogeneous arrangement]

Q: How do we capture topological properties and what do we want to control?
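For intuition: the 0-dim. Vietoris-Rips persistence of a point cloud is tied to its minimum spanning tree, since the finite death times (the distance scales at which components merge) are exactly the MST edge lengths. A small sketch of this computation with SciPy (my illustration, not the paper's implementation):

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

# Stand-in point cloud in the latent space Z.
z = np.random.randn(100, 2)

# Pairwise distance matrix and its minimum spanning tree.
dist = squareform(pdist(z))
mst = minimum_spanning_tree(dist)

# 0-dim. persistence: every point is born at distance 0; components merge (die)
# at the MST edge lengths, so these are the finite death times.
death_times = np.sort(mst.data)
print(death_times[:5])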

SLIDE 16

Connectivity loss

[Autoencoder diagram: encoder fθ : X → Rn, decoder gφ : Rn → X, reconstruction loss Rec[·, ·] between x and x̂]

Q: How can we control topological properties (connectivity properties in particular)?

SLIDE 17

Connectivity loss

Consider batches (x1, . . . , xB): compute the persistent homology (PH) of the encoded batch and add a connectivity loss to the reconstruction loss Rec[·, ·].

[Diagram: batch (x1, . . . , xB) → encoder fθ : X → Rn → PH → connectivity loss; decoder gφ : Rn → X → x̂]

Q: How can we control topological properties (connectivity properties in particular)?

SLIDE 18

Connectivity loss

Consider batches (x1, . . . , xB): compute the PH of the encoded batch and add a connectivity loss Lη, which penalizes deviation from a homogeneous arrangement (with scale η).

[Diagram: batch (x1, . . . , xB) → encoder fθ : X → Rn → PH → Lη; decoder gφ : Rn → X → x̂, reconstruction loss Rec[·, ·]]

Q: How can we control topological properties (connectivity properties in particular)?
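To make Lη concrete, here is a hedged sketch of one way such a penalty could be implemented: obtain the 0-dim. death times of the encoded batch from a minimum spanning tree on the (detached) distance matrix, then penalize their deviation from the scale η on the differentiable distances. This only illustrates the mechanism under my assumptions; the authors' actual implementation is the pershom code shown on the last slide.

import torch
from scipy.sparse.csgraph import minimum_spanning_tree

def connectivity_loss(z, eta):
    # Differentiable pairwise distances of the encoded batch.
    dist = torch.cdist(z, z)
    # Merge edges (0-dim. death times) from an MST on the detached distances.
    mst = minimum_spanning_tree(dist.detach().cpu().numpy()).tocoo()
    i = torch.as_tensor(mst.row, dtype=torch.long)
    j = torch.as_tensor(mst.col, dtype=torch.long)
    death_times = dist[i, j]              # gradients flow through these entries
    # Penalize deviation from a homogeneous arrangement at scale eta.
    return (death_times - eta).abs().sum()

z = torch.randn(100, 2, requires_grad=True)   # stand-in for f_theta(x_1), ..., f_theta(x_B)
loss = connectivity_loss(z, eta=1.0)
loss.backward()                               # gradient signal w.r.t. the latent codes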

SLIDE 19

Connectivity loss

As before, but now a gradient signal flows back from the connectivity loss Lη through PH into the encoder.

[Diagram: batch (x1, . . . , xB) → encoder fθ : X → Rn → PH → Lη (penalize deviation from a homogeneous arrangement, with scale η) → gradient signal; decoder gφ : Rn → X → x̂, reconstruction loss Rec[·, ·]]

Q: How can we control topological properties (connectivity properties in particular)?

SLIDE 20

Connectivity loss

  • Until now, we could not backpropagate through PH

[Diagram: batch (x1, . . . , xB) → encoder fθ : X → Rn → PH → connectivity loss Lη (penalize deviation from a homogeneous arrangement, with scale η) → gradient signal; decoder gφ : Rn → X → x̂, reconstruction loss Rec[·, ·]]

Q: How can we control topological properties (connectivity properties in particular)?

SLIDE 21

From a theoretical perspective, we show . . .
(1) . . . that, under mild conditions, the connectivity loss is differentiable

[Diagram: Enc → PH → connectivity loss → Dec]

SLIDE 22

From a theoretical perspective, we show . . .
(1) . . . that, under mild conditions, the connectivity loss is differentiable
(2) . . . metric-entropy based guidelines for choosing the training batch size B

[Diagram: batch x1, . . . , xB → Enc → PH → connectivity loss → Dec]

SLIDE 23

From a theoretical perspective, we show . . .
(1) . . . that, under mild conditions, the connectivity loss is differentiable
(2) . . . metric-entropy based guidelines for choosing the training batch size B
(3) . . . that “densification” effects occur for sample sizes N larger than the training batch size B (N ≫ B)

[Diagram: samples x1, . . . , xN → Enc → PH → connectivity loss → Dec]

SLIDE 24

From a theoretical perspective, we show . . .
(1) . . . that, under mild conditions, the connectivity loss is differentiable
(2) . . . metric-entropy based guidelines for choosing the training batch size B
(3) . . . that “densification” effects occur for sample sizes N larger than the training batch size B (N ≫ B)

[Diagram: samples x1, . . . , xN → Enc → PH → connectivity loss → Dec]

Intuitively, during training . . .
. . . the reconstruction loss controls what is worth capturing
. . . the connectivity loss controls how to topologically organize the latent space
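Putting the two losses together, a single training step could look roughly as follows. The additive, equally weighted combination and the encoder/decoder sizes are illustrative assumptions on my part, and connectivity_loss refers to the sketch given after Slide 18:

import torch
import torch.nn as nn

# Illustrative encoder/decoder; connectivity_loss is the sketch from Slide 18.
encoder = nn.Sequential(nn.Linear(784, 2))
decoder = nn.Sequential(nn.Linear(2, 784))

x = torch.randn(100, 784)                     # one training batch (B = 100)
z = encoder(x)                                # f_theta(x_1), ..., f_theta(x_B)
x_hat = decoder(z)

# Reconstruction term: what is worth capturing.
# Connectivity term: how the latent space is topologically organized.
loss = nn.functional.mse_loss(x_hat, x) + connectivity_loss(z, eta=1.0)
loss.backward()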

SLIDE 25

Experiments – Task: One-class learning

The autoencoder (fθ, gφ) is trained with the reconstruction loss Rec[·, ·] plus the connectivity loss (with fixed scale η, computed via PH). It is trained only once, on auxiliary unlabeled data (e.g., on CIFAR-10 without labels).

SLIDE 26

Experiments – Task: One-class learning

Rec[·, ·]

fθ gφ

+ Connectivity loss (with fixed scale η)

PH

Trained only once (e.g., on CIFAR-10 without labels) Auxiliary unlabled data One-class samples

r = η/2

KDE-inspired one-class "learning"

SLIDE 27

Experiments – Task: One-class learning

The autoencoder (fθ, gφ) is trained with the reconstruction loss Rec[·, ·] plus the connectivity loss (with fixed scale η, computed via PH). It is trained only once, on auxiliary unlabeled data (e.g., on CIFAR-10 without labels).

KDE-inspired one-class "learning": encode the one-class samples with fθ and place balls of radius r = η/2 around them.

Computation of a one-class score: encode a test sample with fθ and count the #samples falling into balls of radius η, anchored at the one-class instances.

[Figure: encoded in-class vs. out-of-class test samples relative to the one-class balls]
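A small sketch of this counting score (function and variable names are my own illustrative choices; the radius is a parameter, e.g. η/2 or η as on the slides):

import torch

def one_class_score(z_query, z_anchor, r):
    # Distances between encoded query samples and the encoded one-class instances.
    dist = torch.cdist(z_query, z_anchor)
    # Count, per query, how many anchored balls of radius r contain it.
    return (dist <= r).sum(dim=1)

# Stand-ins for f_theta applied to the one-class instances and to test samples.
z_anchor = torch.randn(50, 2)
z_test = torch.randn(10, 2)
scores = one_class_score(z_test, z_anchor, r=0.5)   # higher count = more likely in-class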

SLIDE 28

Results – Task: One-class learning

CIFAR-10 (AE trained on CIFAR-100); training batch size: B = 100

[Bar chart: mean AUROC (∅ AUROC), axis 0.5 to 0.8, for DAGMM, DSEBM, OC-SVM (CAE), ADT, Deep-SVDD, and Ours-120]

ADT [Golan & El-Yaniv, NIPS ’18], DAGMM [Zong et al., ICLR ’18], DSEBM [Zhai et al., ICML ’16], Deep-SVDD [Ruff et al., ICML ’18]

SLIDE 29

Results – Task: One-class learning

CIFAR-10 (AE trained on CIFAR-100); training batch size: B = 100

[Bar chart: mean AUROC (∅ AUROC), axis 0.5 to 0.8, for DAGMM, DSEBM, OC-SVM (CAE), ADT, Deep-SVDD, and Ours-120; low-sample-size comparison of ADT-120, ADT-500, ADT-1,000 vs. Ours-120 (+7 points)]

ADT [Golan & El-Yaniv, NIPS ’18], DAGMM [Zong et al., ICLR ’18], DSEBM [Zhai et al., ICML ’16], Deep-SVDD [Ruff et al., ICML ’18]

SLIDE 30

Results – Task: One-class learning

CIFAR-20 (AE trained on CIFAR-10); training batch size: B = 100

[Bar chart: mean AUROC (∅ AUROC), axis 0.5 to 0.8, for DAGMM, DSEBM, ADT, Deep-SVDD, Ours-120, and OC-SVM (CAE)]

ADT [Golan & El-Yaniv, NIPS ’18], DAGMM [Zong et al., ICLR ’18], DSEBM [Zhai et al., ICML ’16], Deep-SVDD [Ruff et al., ICML ’18]

SLIDE 31

Results – Task: One-class learning

CIFAR-20 (AE trained on CIFAR-10); training batch size: B = 100

[Bar chart: mean AUROC (∅ AUROC), axis 0.5 to 0.8, for DAGMM, DSEBM, ADT, Deep-SVDD, Ours-120, and OC-SVM (CAE); low-sample-size comparison of ADT-120, ADT-500, ADT-1,000 vs. Ours-120 (+6 points)]

ADT [Golan & El-Yaniv, NIPS ’18], DAGMM [Zong et al., ICLR ’18], DSEBM [Zhai et al., ICML ’16], Deep-SVDD [Ruff et al., ICML ’18]

SLIDE 32

Results – Task: One-class learning

CIFAR-100 (AE trained on CIFAR-10); training batch size: B = 100

[Bar chart: mean AUROC (∅ AUROC), axis 0.5 to 0.8; low-sample-size comparison of ADT-120 vs. Ours-120 (+4 points)]

ADT [Golan & El-Yaniv, NIPS ’18], DAGMM [Zong et al., ICLR ’18], DSEBM [Zhai et al., ICML ’16], Deep-SVDD [Ruff et al., ICML ’18]

SLIDE 33

Results – Task: One-class learning

ImageNet (i.e., evaluation of 1,000 one-class models); training batch size: B = 100; low-sample size

[Bar chart: mean AUROC (∅ AUROC), axis 0.5 to 0.8, for Ours-120 using one AE trained on CIFAR-10 and using one AE trained on CIFAR-100]

ADT [Golan & El-Yaniv, NIPS ’18], DAGMM [Zong et al., ICLR ’18], DSEBM [Zhai et al., ICML ’16], Deep-SVDD [Ruff et al., ICML ’18]

SLIDE 34

PyTorch code available: https://github.com/c-hofer/COREL_icml2019

Come see our poster #83 at 6.30pm (Pacific Ballroom)

import torch
import chofer_torchex.pershom as pershom

# Random batch of 10 points in R^5; gradients will flow back into it.
batch = torch.randn(10, 5, requires_grad=True)
batch = batch.to('cuda')

# 0-dim. Vietoris-Rips persistence (l1 metric); returns non-essential and essential classes.
non_ess, ess = pershom.vr_persistence_l1(batch, 0, 0)

# Example loss on the death times (second barcode column); backpropagates through PH.
example_loss = non_ess[:, 1].sum()
example_loss.backward()