A Review of Regularized Optimal Transport Marco Cuturi Joint work - - PowerPoint PPT Presentation



SLIDE 1

A Review of Regularized Optimal Transport

Marco Cuturi

Joint work with many people, including: G. Peyré, A. Genevay (ENS), A. Doucet (Oxford), J. Solomon (MIT), J.D. Benamou, N. Bonneel, F. Bach, L. Nenna (INRIA), G. Carlier (Dauphine).
SLIDE 2

What is Optimal Transport?

A geometric toolbox to compare probability measures supported on a metric space.

Monge, Kantorovich, Dantzig, Wasserstein, Brenier, McCann, Villani, Otto.

SLIDE 3

What is Optimal Transport?

A geometric toolbox to compare probability measures supported on a metric space.

[Figure: examples of probability measures — empirical measures µ, ν; color histograms h1, h2; bags of features; statistical models pθ, pθ′; brain activation maps.]

SLIDE 4

(Repeat of Slide 3.)

SLIDE 5

What is Optimal Transport?

A geometric toolbox to compare probability measures supported on a metric space.

[Figure: two statistical models pθ, pθ′ seen as points of P(Ω).]

SLIDE 6

What is Optimal Transport?

Wasserstein Distance: W(pθ, pθ′), a distance between pθ and pθ′ in P(Ω).

SLIDE 7

What is Optimal Transport?

[McCann’95] Interpolant between pθ and pθ′ in P(Ω).

SLIDE 8

What is Optimal Transport?

[Figure: three measures pθ, pθ′, pθ′′ in P(Ω).]

SLIDE 9

What is Optimal Transport?

Wasserstein Barycenter [Agueh’11] of pθ, pθ′, pθ′′ in P(Ω).

SLIDE 10

OT and data-analysis

  • Key developments in (applied) maths since the ’90s: [McCann’95], [JKO’98], [Benamou’98], [Gangbo’98], [Ambrosio’06], [Villani’03/’09].
  • Key developments in TCS / graphics since the ’00s: [Rubner’98], [Indyk’03], [Naor’07], [Andoni’15].
  • Small to no impact in large-scale data analysis so far: computationally heavy; the Wasserstein distance is not differentiable.

SLIDE 11

Today’s talk: Entropy Regularized OT

  • Very fast compared to usual approaches, GPGPU parallel.
  • Differentiable, important if we want to use OT distances as loss functions.
  • Can be automatically differentiated: a simple iterative process, compatible with DL toolboxes.
  • OT can become a building block in ML.
SLIDE 12

Background: OT Geometry

Consider (Ω, D), a metric probability space. Let µ, ν be probability measures in P(Ω).

  • [Monge’81] problem: find a map T : Ω → Ω attaining

    inf_{T#µ=ν} ∫ D(x, T(x)) µ(dx)

SLIDE 13

(Same slide, illustrated on a discrete measure: each Dirac δx is mapped to T(x).)

SLIDE 14

[Kantorovich’42] Relaxation

  • Instead of maps T : Ω → Ω, consider probabilistic maps, i.e. couplings P ∈ P(Ω × Ω):

    Π(µ, ν) := {P ∈ P(Ω × Ω) | ∀A, B ⊂ Ω, P(A × Ω) = µ(A), P(Ω × B) = ν(B)}
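In the discrete case, a coupling is simply a nonnegative matrix whose row sums give µ and whose column sums give ν. A quick NumPy check of the two marginal constraints; the weights below are made up for the example:

```python
import numpy as np

# A coupling between two discrete measures is a nonnegative matrix whose
# marginals are the two measures. The weights below are made up.
mu = np.array([0.2, 0.4, 0.4])         # weights of mu on 3 points
nu = np.array([0.5, 0.5])              # weights of nu on 2 points

# The independent (product) coupling mu x nu always belongs to Pi(mu, nu).
P = np.outer(mu, nu)

assert np.allclose(P.sum(axis=1), mu)  # P(A x Omega) = mu(A)
assert np.allclose(P.sum(axis=0), nu)  # P(Omega x B) = nu(B)
```

The product coupling is the "no-information" element of Π(µ, ν); optimal transport searches this polytope for the cheapest coupling instead.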

SLIDE 15

[Kantorovich’42] Relaxation

Π(µ, ν) := {P ∈ P(Ω × Ω) | ∀A, B ⊂ Ω, P(A × Ω) = µ(A), P(Ω × B) = ν(B)}

[Figure: a coupling P(x, y) between two discrete measures µ(x), ν(y) supported on {−1, 1, 2, 3, 4}.]

SLIDE 16

(Same slide, with a second, different coupling between the same marginals.)

SLIDE 17

Couplings

[Figure: the first coupling P(x, y), with marginals µ(x), ν(y).]

SLIDE 18

Couplings

[Figure: a second coupling between the same marginals.]

SLIDE 19

Wasserstein Distance

  • Def. For p ≥ 1, the p-Wasserstein distance between µ, ν in P(Ω) is

    Wp(µ, ν) := ( inf_{P∈Π(µ,ν)} E_P[D(X, Y)^p] )^{1/p}.

SLIDE 20

Wasserstein between 2 Diracs

On (Ω, D):

    W_p^p(δx, δy) = D(x, y)^p

SLIDE 21

Wasserstein on Uniform Measures

µ = (1/n) Σ_{i=1}^n δ_{x_i},  ν = (1/n) Σ_{j=1}^n δ_{y_j} on (Ω, D).

SLIDE 22

Wasserstein on Uniform Measures

For a permutation σ, the cost of the induced assignment is

    C(σ) = (1/n) Σ_{i=1}^n D(x_i, y_{σ_i})^p

SLIDE 23

Optimal Assignment ⊂ Wasserstein

For µ = (1/n) Σ_{i=1}^n δ_{x_i} and ν = (1/n) Σ_{j=1}^n δ_{y_j} on (Ω, D):

    W_p^p(µ, ν) = min_{σ∈S_n} C(σ)
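This reduction to an assignment can be checked on a tiny example. The points below and the choice p = 2 are illustrative, and brute force over S_n is only for the demo (real solvers use the Hungarian algorithm or network flow):

```python
import itertools

import numpy as np

# For uniform measures on n points each, W_p^p is the value of an optimal
# assignment: min over permutations sigma of (1/n) sum_i D(x_i, y_sigma(i))^p.
rng = np.random.default_rng(0)
n, p = 4, 2
x = rng.normal(size=(n, 2))
y = rng.normal(size=(n, 2))

# Cost matrix M_ij = D(x_i, y_j)^p, with D the Euclidean distance.
M = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2) ** p

# Brute force over S_n (fine only for tiny n).
W_pp = min(M[range(n), sigma].mean()
           for sigma in itertools.permutations(range(n)))
print(W_pp)
```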

SLIDE 24

Wasserstein on Empirical Measures

µ = Σ_{i=1}^n a_i δ_{x_i},  ν = Σ_{j=1}^m b_j δ_{y_j} on (Ω, D).

SLIDE 25

Wasserstein on Empirical Measures

Consider µ = Σ_{i=1}^n a_i δ_{x_i} and ν = Σ_{j=1}^m b_j δ_{y_j}. Define

    U(a, b) := {P ∈ R_+^{n×m} | P 1_m = a, P^T 1_n = b}
    M_XY := [D(x_i, y_j)^p]_{ij}

[Figure: the n×m matrices P and M_XY, highlighting the row-sum constraint P 1_m = a.]

SLIDE 26

(Same slide, highlighting the column-sum constraint P^T 1_n = b.)

SLIDE 27

Wasserstein on Empirical Measures

Consider µ = Σ_{i=1}^n a_i δ_{x_i} and ν = Σ_{j=1}^m b_j δ_{y_j}.

  • Def. Optimal Transport Problem

    W_p^p(µ, ν) = min_{P∈U(a,b)} ⟨P, M_XY⟩

SLIDE 28

Discrete OT Problem

[Figure: the polytope U(a, b) and the cost matrix M_XY.]

SLIDE 29

Discrete OT Problem

[Figure: an optimal vertex P⋆ of U(a, b).]

SLIDE 30

Discrete OT Problem

  • Def. Dual OT problem

    W_p^p(µ, ν) = max_{α∈R^n, β∈R^m : α_i+β_j ≤ D(x_i,y_j)^p} α^T a + β^T b
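One consequence worth checking numerically is weak duality: any dual-feasible (α, β) yields a lower bound α^T a + β^T b ≤ ⟨P, M_XY⟩ for every feasible coupling P. A sketch with made-up data; the crude dual pair below is just one feasible choice, not the optimum:

```python
import numpy as np

# Weak duality: if alpha_i + beta_j <= M_ij for all i, j, then
# alpha^T a + beta^T b <= <P, M> for every coupling P in U(a, b).
rng = np.random.default_rng(1)
n, m = 3, 4
a = np.full(n, 1 / n)
b = np.full(m, 1 / m)
M = rng.uniform(1.0, 2.0, size=(n, m))       # illustrative cost matrix

P = np.outer(a, b)                           # independence coupling: feasible
alpha = M.min(axis=1) / 2                    # one crude dual-feasible choice:
beta = (M - alpha[:, None]).min(axis=0)      # beta_j = min_i (M_ij - alpha_i)

assert np.all(alpha[:, None] + beta[None, :] <= M + 1e-12)
print(alpha @ a + beta @ b, (P * M).sum())   # lower bound, primal value
```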

SLIDE 31

Discrete OT Problem

An O(n³ log n) network flow solver is used in practice.

Note: flow/PDE formulations [Beckman’61]/[Benamou’98] can be used for p = 1 / p = 2 with a sparse-graph metric / the Euclidean metric.

SLIDES 32–42

Discrete OT Problem (illustrated step by step)

  • An O(n³ log n) network flow solver is used in practice.
  • The optimal solution P⋆ is unstable and not always unique (the optimal set {P⋆} can be a whole face of the polytope U(a, b)).
  • W_p^p(µ, ν) is not differentiable.

SLIDE 43

Entropic Regularization [Wilson’62]

  • Def. Regularized Wasserstein, γ ≥ 0:

    Wγ(µ, ν) := min_{P∈U(a,b)} ⟨P, M_XY⟩ − γE(P),  where E(P) := − Σ_{i,j=1}^{n,m} P_ij log P_ij.

Note: the optimal solution is unique because of the strong concavity of the entropy.

SLIDE 44

[Figure: the regularized optimum Pγ between µ and ν, moving toward the interior of the polytope as γ grows.]

SLIDE 45

Fast & Scalable Algorithm

  • Prop. If Pγ := argmin_{P∈U(a,b)} ⟨P, M_XY⟩ − γE(P), then ∃! u ∈ R_+^n, v ∈ R_+^m (up to scaling) such that

    Pγ = diag(u) K diag(v),  K := e^{−M_XY/γ}.

SLIDE 46

Proof sketch, via the Lagrangian:

    L(P, α, β) = Σ_ij [P_ij M_ij + γ P_ij log P_ij] + α^T(P 1 − a) + β^T(P^T 1 − b)
    ∂L/∂P_ij = M_ij + γ(log P_ij + 1) + α_i + β_j
    (∂L/∂P_ij = 0)  ⇒  P_ij = e^{−(α_i/γ + 1/2)} e^{−M_ij/γ} e^{−(β_j/γ + 1/2)} = u_i K_ij v_j.

SLIDE 47

Fast & Scalable Algorithm

  • [Sinkhorn’64] fixed-point iterations for (u, v):

    u ← a / (K v),  v ← b / (K^T u)

  • O(nm) complexity per iteration, GPGPU parallel [C’13].
  • O(n^{d+1}) if Ω = {1, …, n}^d and D^p is separable [S..C..’15].
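The two updates above fit in a few lines of NumPy; problem sizes, γ and the random cost matrix below are illustrative:

```python
import numpy as np

# Sinkhorn fixed-point iterations u <- a/(Kv), v <- b/(K^T u) for
# entropy-regularized OT, with K = exp(-M/gamma).
rng = np.random.default_rng(0)
n, m, gamma = 5, 6, 0.1
a = np.full(n, 1 / n)
b = np.full(m, 1 / m)
M = rng.uniform(size=(n, m))

K = np.exp(-M / gamma)
v = np.ones(m)
for _ in range(1000):
    u = a / (K @ v)
    v = b / (K.T @ u)

P = np.diag(u) @ K @ np.diag(v)                   # P_gamma = diag(u) K diag(v)
assert np.allclose(P.sum(axis=1), a, atol=1e-6)   # row marginals match a
assert np.allclose(P.sum(axis=0), b, atol=1e-6)   # column marginals match b
print((P * M).sum())                              # regularized transport cost
```

Each iteration only needs two matrix-vector products, which is what makes the scheme GPGPU-friendly.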

SLIDE 48

Very Fast EMD Approx. Solver

[Figure: average execution time per distance (in s., log scale) vs. histogram dimension (64 to 4096), comparing FastEMD, Rubner’s emd, and CPU/GPU Sinkhorn at γ = 0.02 and γ = 0.1.]

  • Note. (Ω, D) is a random graph with the shortest-path metric; histograms are sampled uniformly on the simplex; Sinkhorn tolerance 10⁻².

SLIDE 49

Regularization ⤑ Differentiability

For µ = Σ_{i=1}^n a_i δ_{x_i} and ν = Σ_{j=1}^m b_j δ_{y_j} on (Ω, D):

    Wγ((a, X), (b, Y)) = min_{P∈U(a,b)} ⟨P, M_XY⟩ − γE(P)

SLIDES 50–51

What happens when the weights move, a ← a + ∆a?

    Wγ((a + ∆a, X), (b, Y)) = Wγ((a, X), (b, Y)) + ??

SLIDES 52–53

What happens when the support moves, X ← X + ∆X?

    Wγ((a, X + ∆X), (b, Y)) = Wγ((a, X), (b, Y)) + ??

SLIDE 54

Crucial for “min data + W” problems:

  • Quantization, k-means problem [Lloyd’82]:

    min_{µ∈P(R^d), |supp µ|=k} W_2^2(µ, ν_data)

  • [McCann’95] Interpolant:

    min_{µ∈P(Ω)} (1 − t) W_2^2(µ, ν1) + t W_2^2(µ, ν2)

  • [JKO’98] PDEs as gradient flows in (P(Ω), W):

    µ_{t+1} = argmin_{µ∈P(Ω)} J(µ) + λ_t W_p^p(µ, µ_t)

SLIDE 55

Any (ML) problem involving a KL or L2 loss between (parameterized) histograms or probability measures can easily be Wasserstein-ized if we can differentiate W efficiently.

SLIDE 56

  • 1. Differentiability of Regularized OT

  • Def. Dual regularized OT Problem

    Wγ(µ, ν) = max_{α,β} α^T a + β^T b − γ (e^{α/γ})^T K e^{β/γ}

  • Prop. [CD’14] Wγ(µ, ν) is
    1. convex w.r.t. a (Danskin), with ∇_a Wγ = α⋆ = γ log(u);
    2. decreased, when p = 2 and Ω = R^d, by the update X ← Y P^T diag(a^{−1}).
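The gradient formula ∇_a Wγ = γ log(u) (up to an additive constant, which is irrelevant for mass-preserving perturbations of a) can be sanity-checked against a finite difference. A sketch; sizes, γ, iteration count and ε below are illustrative:

```python
import numpy as np

# Finite-difference check of grad_a W_gamma = gamma*log(u), up to an additive
# constant that cancels along mass-preserving directions (sum of d is 0).
rng = np.random.default_rng(0)
n, m, gamma = 4, 5, 0.2
b = np.full(m, 1 / m)
M = rng.uniform(size=(n, m))
K = np.exp(-M / gamma)

def reg_ot(a, iters=5000):
    """Solve entropy-regularized OT with Sinkhorn; return (value, u)."""
    v = np.ones(m)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]
    return (P * M).sum() + gamma * (P * np.log(P)).sum(), u

a = np.full(n, 1 / n)
W0, u = reg_ot(a)
grad = gamma * np.log(u)

eps = 1e-5
d = np.zeros(n); d[0], d[1] = 1.0, -1.0      # mass-preserving direction
W1, _ = reg_ot(a + eps * d)
print((W1 - W0) / eps, grad @ d)             # the two numbers nearly agree
```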

SLIDE 57

  • 2. Duality for Regularized OT

  • Prop. [CP’16] Writing Hν : a ↦ Wγ(µ, ν):
    1. Hν has a simple Legendre transform:

        H∗ν : g ∈ R^n ↦ γ ( E(b) + b^T log(K e^{g/γ}) )

    2. If A ∈ R^{n×d} and f is convex on R^d,

        min_{a∈Σn} Hν(a) + f(Aa) = max_{g∈R^d} −H∗ν(−A^T g) − f∗(g)
SLIDE 58

  • 3. Stochastic Formulation

    Wγ(µ, ν) = max_{α,β} α^T a + β^T b − γ (e^{α/γ})^T K e^{β/γ}
             = max_α α^T a − γ (log(K^T e^{α/γ}))^T b
             = max_α Σ_{j=1}^m b_j ( α^T a − γ log(K_{·j}^T e^{α/γ}) )
             = max_α Σ_{j=1}^m f_j(α)

  • [GCPB’16] shows how incremental gradient methods can be used to scale this further.
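A minimal sketch of this stochastic view, using plain stochastic gradient ascent (one sampled index j per step) rather than the incremental methods of [GCPB’16]; step sizes, iteration counts and problem data below are illustrative:

```python
import numpy as np

# Semi-dual objective: sum_j b_j f_j(alpha), with
# f_j(alpha) = alpha^T a - gamma * log(K_{.j}^T e^{alpha/gamma}).
rng = np.random.default_rng(0)
n, m, gamma = 4, 5, 0.2
a = np.full(n, 1 / n)
b = np.full(m, 1 / m)
M = rng.uniform(size=(n, m))
K = np.exp(-M / gamma)

def semidual(alpha):
    return alpha @ a - gamma * b @ np.log(K.T @ np.exp(alpha / gamma))

alpha = np.zeros(n)
step = 0.5
for t in range(20000):
    j = rng.choice(m, p=b)              # sample one index j ~ b per step
    w = K[:, j] * np.exp(alpha / gamma)
    grad_j = a - w / w.sum()            # gradient of f_j at alpha
    alpha += step / np.sqrt(t + 1) * grad_j

print(semidual(alpha))                  # should exceed the value at alpha = 0
```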

SLIDE 59

  • 4. Algorithmic Formulation

  • Def. For L ≥ 1, define

    W_L(µ, ν) := ⟨P_L, M_XY⟩, where P_L := diag(u_L) K diag(v_L),
    v_0 = 1_m;  for l ≥ 0,  u_l := a / (K v_l),  v_{l+1} := b / (K^T u_l).

  • Prop. ∂W_L/∂X and ∂W_L/∂a can be computed recursively, in O(L) kernel K × vector products.

SLIDE 60

Algorithmic Formulation of Reg. OT

Example: differentiability w.r.t. a. The iterates’ Jacobians satisfy the recursion

    (∂v_0/∂a)^T = 0_{m×n},
    (∂u_l/∂a)^T x = x / (K v_l) − (∂v_l/∂a)^T K^T ( x ⊙ a / (K v_l)² ),
    (∂v_{l+1}/∂a)^T y = −(∂u_l/∂a)^T K ( y ⊙ b / (K^T u_l)² ).

SLIDE 61

Algorithmic Formulation of Reg. OT

Example: differentiability w.r.t. a. With N := K ⊙ M_XY,

    ∇_a W_L(µ, ν) = (∂u_L/∂a)^T (N v_L) + (∂v_L/∂a)^T (N^T u_L)

SLIDE 62

Algorithmic Formulation of Wasserstein

[Figure]

SLIDES 63–65

[Figure-only slides.]

SLIDE 66

Thanks to these tricks…

  • [Agueh’11] Barycenters [CD’14][BCCNP’15][GCP’15][S..C..’15]
  • [Burger’12] TV gradient flow using duality [CP’16]
  • Dictionary Learning / Latent Factors [RCP’16]
  • [Bigot’15] W-PCA [SC’15]
  • Density fitting / parameter estimation [MMC’16]
  • Inverse problems / Wasserstein regression [BPC’16]

SLIDE 67

Wasserstein Barycenters

Wasserstein Barycenter [Agueh’11]:

    min_{µ∈P(Ω)} Σ_{i=1}^N λ_i W_p^p(µ, ν_i)

[Figure: ν1, ν2, ν3 and their barycenter in P(Ω).]

SLIDE 68

Multimarginal Formulation

  • Exact solution (W2) using MM-OT [Agueh’11].

[Figure: point clouds and their exact W2 barycenter.]

SLIDE 69

Multimarginal Formulation

  • Exact solution (W2) using MM-OT [Agueh’11].
  • If |supp ν_i| = n_i, this is an LP of size (Π_i n_i, Σ_i n_i).

SLIDE 70

Finite Case, LP Formulation

  • When Ω is a finite set with metric M, another LP:

    min_µ Σ_i λ_i W_p^p(µ, ν_i)

SLIDE 71

Finite Case, LP Formulation

  • When Ω is a finite set with metric M, another LP:

    min_{P_1,…,P_N, a} Σ_{i=1}^N λ_i ⟨P_i, M⟩  s.t.  P_i^T 1_n = b_i  ∀i ≤ N,
    P_1 1_n = … = P_N 1_n = a.

  • If |Ω| = n, an LP of size (Nn², (2N − 1)n); unstable.

SLIDE 72

Primal Descent on Regularized W

[CD’14]

    min_{µ∈Q⊂P(Ω)} Σ_{i=1}^N λ_i Wγ(µ, ν_i)

Fast Computation of Wasserstein Barycenters, International Conference on Machine Learning 2014.

SLIDES 73–74

(Repeats of Slide 72.)

SLIDE 75

Wasserstein Barycenter = KL Projections

[BCCNP’15] With K = e^{−M_XY/γ}:

    ⟨P, M_XY⟩ − γE(P) = γ KL(P | K)

    C1 = {P | ∃a, ∀i, P_i 1_m = a}
    C2 = {P | ∀i, P_i^T 1_n = b_i}

    min_a Σ_{i=1}^N λ_i Wγ(a, b_i) = min_{P=[P_1,…,P_N] ∈ C1∩C2} Σ_{i=1}^N λ_i KL(P_i | K)

SLIDE 76

[Figure: alternate KL projections of [K ⋯ K] onto C1 and C2 converge to Pγ.]

SLIDE 77

Wasserstein Barycenter = KL Projections

In MATLAB, the resulting iterations are:

    u=ones(size(B)); % d x N matrix
    while not converged
      v=u.*(K'*(B./(K*u))); % 2(Nd^2) cost
      u=bsxfun(@times,u,exp(log(v)*weights))./v;
    end
    a=mean(v,2);

Iterative Bregman Projections for Regularized Transportation Problems, SIAM J. on Sci. Comp. 2015.
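A NumPy transcription of the same iterations, on a made-up 1-D example (grid, γ, weights and the two input histograms are illustrative):

```python
import numpy as np

# Bregman-projection iterations for the Wasserstein barycenter of N
# histograms B (columns), on a 1-D grid with squared Euclidean cost.
n, N, gamma = 41, 2, 0.1
grid = np.linspace(0, 1, n)
M = (grid[:, None] - grid[None, :]) ** 2      # squared Euclidean cost
K = np.exp(-M / gamma)                        # symmetric here, so K = K'

# Two input histograms: bumps centered at 0.25 and 0.75.
B = np.stack([np.exp(-((grid - 0.25) / 0.1) ** 2),
              np.exp(-((grid - 0.75) / 0.1) ** 2)], axis=1)
B /= B.sum(axis=0)
weights = np.array([0.5, 0.5])

U = np.ones((n, N))
for _ in range(2000):
    V = U * (K.T @ (B / (K @ U)))     # current left marginals of each plan
    a = np.exp(np.log(V) @ weights)   # weighted geometric mean of marginals
    U = U * a[:, None] / V            # rescale so all marginals agree
print(grid[np.argmax(a)])             # the barycenter bump sits between the inputs
```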

SLIDE 78

Application: Graphics

Convolutional Wasserstein Distances: Efficient Optimal Transportation on Geometric Domains, SIGGRAPH’15 [S..C..’15]

SLIDES 79–82

(Repeats of Slide 78.)

SLIDE 83

Inverse Wasserstein Problems

  • Consider the barycenter operator:

    b(λ) := argmin_a Σ_{i=1}^N λ_i Wγ(a, b_i)

  • Address now Wasserstein inverse problems: given a, find

    argmin_{λ∈Σ_N} E(λ) := Loss(a, b(λ))

SLIDE 84

The Wasserstein Simplex

SLIDE 85

Barycenters = Fixed Points

  • Prop. [BCCNP’15] Consider B ∈ Σ_d^N, let U_0 = 1_{d×N}, and for l ≥ 0:

    b_l := exp( log(K^T U_l) λ );
    V_{l+1} := (b_l 1_N^T) / (K^T U_l),  U_{l+1} := B / (K V_{l+1}).

SLIDE 86

Using Truncated Barycenters

  • Instead of using the exact barycenter:  argmin_{λ∈Σ_N} E(λ) := Loss(a, b(λ)),
  • use the L-iterate barycenter:  argmin_{λ∈Σ_N} E^(L)(λ) := Loss(a, b^(L)(λ)).
  • Differentiate using the chain rule:

    ∇E^(L)(λ) = [∂b^(L)]^T (g),  g := ∇Loss(a, ·)|_{b^(L)(λ)}.

SLIDE 87

Gradient / Barycenter Computation

SLIDE 88

Application: Volume Reconstruction

Wasserstein Barycentric Coordinates: Histogram Regression using Optimal Transport, SIGGRAPH’16 [BPC’16]

SLIDES 89–92

Application: Color Grading

Wasserstein Barycentric Coordinates: Histogram Regression using Optimal Transport, SIGGRAPH’16 [BPC’16]

SLIDE 93

Application: Brain Mapping

SLIDE 94

To conclude

  • Entropy regularization is a very effective way to get OT to work as a generic loss.
  • Many recent extensions:
    • [Schmitzer’16]: fast multiscale approaches
    • [ZFMAP’15], [CSPV’16]: unbalanced transport
    • [SPKS’16], [PCS’16]: extensions to Gromov-Wasserstein
    • [FCTR’15]: domain adaptation in ML