Connectivity-Optimized Representation Learning via Persistent Homology
Christoph D. Hofer, Roland Kwitt (University of Salzburg), Mandar Dixit (Microsoft), Marc Niethammer (UNC Chapel Hill)
ICML 2019, Long Beach
Q: What makes a good representation?
◮ Robust to perturbations of the input
◮ Ability to reconstruct (→ prevalence of autoencoders)
◮ Useful for downstream tasks (e.g., clustering, or classification)
◮ etc.

Common idea: Control (or enforce) properties of (or on) the latent representations in Z.

[Figure: autoencoder; the input x is mapped by the Encoder into the latent space Z and by the Decoder to the reconstruction x̂, trained with a reconstruction loss Rec[x, x̂].]

Examples (by far not exhaustive):
◮ Contractive AE's [Rifai et al., ICML '11]: + Reg
◮ Denoising AE's [Vincent et al., JMLR '10]: perturb, or zero-out, the input
◮ Sparse AE's [Makhzani & Frey, ICLR '14]: + Reg
◮ Adversarial AE's [Makhzani et al., ICLR '16]
A minimal autoencoder sketch is given below.
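To make the autoencoder recipe concrete, here is a minimal PyTorch sketch of the plain (unregularized) variant; the layer sizes and the MSE reconstruction loss are illustrative assumptions, not the paper's architecture.

import torch
import torch.nn as nn

# toy encoder/decoder for 32x32x3 images with a 2-D latent space (sizes are illustrative)
encoder = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 128), nn.ReLU(), nn.Linear(128, 2))
decoder = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 32 * 32 * 3))

x = torch.randn(16, 3, 32, 32)                 # a toy batch
z = encoder(x)                                 # latent representations in Z
x_hat = decoder(z).view_as(x)                  # reconstruction x̂
rec_loss = nn.functional.mse_loss(x_hat, x)    # Rec[x, x̂]
rec_loss.backward()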
We aim to control properties of the latent space, but from a topological point of view!

Assume we want to do Kernel Density Estimation (KDE) in the latent space Z.
Bandwidth selection: Scott's rule [Scott, 1992].

[Figure: data (zi) and the corresponding Gaussian KDE, shown for two latent spaces of very different scale.]
Bandwidth selection can be challenging, as the scaling of the latent representations greatly differs! (See the KDE sketch below.)
Q: How do we capture topological properties and what do we want to control?

Vietoris-Rips Persistent Homology (PH)
[Figure: latent space Z with balls of growing radius r = r1 < r2 < r3 around the latent points.]
◮ PH tracks topological changes as the ball radius r increases
◮ Connectivity information is captured by 0-dim. persistent homology

What if the encoding z → fθ(z) produced a homogeneous arrangement, i.e., one in which the points connect at a common ball radius r = η/2? Such an arrangement is beneficial for KDE. (A small 0-dim. PH sketch follows below.)
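A small sketch of how 0-dim. Vietoris-Rips persistence of a point cloud can be computed: all 0-dim. classes are born at radius 0, and the death times of the non-essential classes coincide with the edge lengths of a minimum spanning tree (one component never dies). The scipy-based computation and the L1 metric are illustrative choices, not the paper's GPU implementation.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
z = rng.normal(size=(10, 2))                        # toy latent points

dist = squareform(pdist(z, metric="cityblock"))     # pairwise L1 distances
mst = minimum_spanning_tree(dist)                   # connectivity structure of the point cloud
death_times = np.sort(mst.data)                     # 0-dim. death times (all births are 0)
print(death_times)                                  # radii at which components merge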
Q: How can we control topological properties (connectivity properties in particular)?

Consider batches (x1, . . . , xB).
[Figure: a batch is encoded, reconstructed with Rec[·, ·], and the 0-dim. PH of the encoded batch feeds an additional connectivity loss Lη.]
◮ The connectivity loss Lη penalizes deviation from a homogeneous arrangement (with scale η)
◮ The gradient signal of Lη is backpropagated through the PH computation into the encoder
(A sketch of such a connectivity loss is given below.)
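One plausible form of such a connectivity loss, sketched with the pershom function shown on the code slide at the end: penalize how far the 0-dim. death times of the encoded batch deviate from the scale η. The exact loss used in the paper may differ, and the indexing of non_ess follows the code slide.

import torch
import chofer_torchex.pershom as pershom

def connectivity_loss(z_batch, eta):
    # 0-dim. persistent homology of the latent batch (L1-based Vietoris-Rips)
    non_ess, ess = pershom.vr_persistence_l1(z_batch, 0, 0)
    death_times = non_ess[:, 1]               # death times, indexed as on the code slide
    return (death_times - eta).abs().sum()    # deviation from a homogeneous arrangement at scale eta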
From a theoretical perspective, we show . . .
(1) . . . that under mild conditions, the connectivity loss is differentiable
(2) . . . metric-entropy based guidelines for choosing the training batch size B
(3) . . . that "densification" effects occur for sample sizes N much larger than the training batch size B (N ≫ B)
Intuitively, during training . . .
. . . the reconstruction loss Rec[·, ·] controls what is worth capturing
. . . the connectivity loss (with fixed scale η) controls how to topologically organize the latent space
[Figure: encoder fθ and decoder gφ trained jointly with Rec[·, ·] and the PH-based connectivity loss; see the training-step sketch below.]
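A hypothetical training step combining both terms, reusing the connectivity_loss sketch from above; the names f_theta and g_phi, the MSE reconstruction loss, and the unweighted sum of the two terms are assumptions, not the paper's exact setup.

import torch

def train_step(f_theta, g_phi, optimizer, x, eta):
    # reconstruction decides *what* to capture, connectivity decides *how* to organize Z
    z = f_theta(x)                                   # latent batch z_1, ..., z_B
    x_hat = g_phi(z)                                 # reconstructions
    loss = torch.nn.functional.mse_loss(x_hat, x) + connectivity_loss(z, eta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()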
Application: KDE-inspired one-class "learning"
◮ The autoencoder (fθ, gφ) with the connectivity loss (fixed scale η) is trained only once, on auxiliary unlabeled data (e.g., on CIFAR-10 without labels)
◮ One-class samples are then simply encoded with fθ (no further training)
[Figure: encoded one-class samples in the latent space, each surrounded by a ball of radius r = η/2.]
◮ Computation of a one-class score: count the #samples falling into balls of radius η, anchored at the one-class instances (a code sketch follows below)
[Figure: resulting scores for in-class vs. out-of-class samples.]
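A sketch of the one-class score described above: for each encoded test point, count how many encoded one-class instances lie within distance η, i.e., how many of the η-balls anchored at the one-class instances contain it. The Euclidean metric and the function name are assumptions.

import torch

def one_class_score(z_test, z_one_class, eta):
    # pairwise distances between encoded test points and encoded one-class samples
    d = torch.cdist(z_test, z_one_class)     # shape (n_test, n_one_class)
    return (d <= eta).float().sum(dim=1)     # higher count suggests in-class, lower count out-of-class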
Results: mean AUROC for one-class classification (training batch size B = 100)

Baselines: ADT [Golan & El-Yaniv, NIPS '18], DAGMM [Zong et al., ICLR '18], DSEBM [Zhai et al., ICML '16], Deep-SVDD [Ruff et al., ICML '18], OC-SVM (CAE). Suffixes such as "Ours-120" or "ADT-120/500/1,000" denote the low-sample-size regime with the given number of one-class samples.

◮ CIFAR-10 (AE trained on CIFAR-100): [bar chart, mean AUROC for DAGMM, DSEBM, OC-SVM (CAE), ADT, Deep-SVDD, and Ours-120]; in the low-sample-size regime (ADT-120/500/1,000 vs. Ours-120), +7 points
◮ CIFAR-20 (AE trained on CIFAR-10): [bar chart, mean AUROC for the same baselines and Ours-120]; in the low-sample-size regime, +6 points
◮ CIFAR-100 (AE trained on CIFAR-10): [bar chart, mean AUROC for ADT-120 and Ours-120, low-sample-size regime]; +4 points
◮ ImageNet (i.e., evaluation of 1,000 one-class models): [bar chart, mean AUROC for Ours-120, low-sample-size regime], using one AE trained on CIFAR-10 and one AE trained on CIFAR-100
PyTorch code available! https://github.com/c-hofer/COREL_icml2019

import torch
import chofer_torchex.pershom as pershom

# toy batch of 10 points in R^5, created directly on the GPU so that
# gradients accumulate in this (leaf) tensor
batch = torch.randn(10, 5, device='cuda', requires_grad=True)

# 0-dim. Vietoris-Rips persistent homology (L1-based) of the batch
non_ess, ess = pershom.vr_persistence_l1(batch, 0, 0)

# example loss on the death times; gradients flow back to the input points
example_loss = non_ess[:, 1].sum()
example_loss.backward()