

slide-1
SLIDE 1

Can Random Matrices Change the Future of Machine Learning?

Malik TIOMOKO and Romain COUILLET

CentraleSupélec, L2S, University of Paris-Saclay, France
GSTATS IDEX DataScience Chair, GIPSA-lab, University Grenoble-Alpes, France

March 8, 2020

slide-2
SLIDE 2

A long story short...

slide-13
SLIDE 13

Outline

◮ Basics of Random Matrix Theory
  ◮ Motivation: Large Sample Covariance Matrices
  ◮ Spiked Models
◮ Application to Machine Learning


slide-16
SLIDE 16

Context

Baseline scenario: $y_1, \ldots, y_n \in \mathbb{C}^p$ (or $\mathbb{R}^p$) i.i.d. with $E[y_1] = 0$, $E[y_1 y_1^*] = C_p$:

◮ If $y_1 \sim \mathcal{N}(0, C_p)$, the ML estimator for $C_p$ is the sample covariance matrix (SCM)
$$\hat C_p = \frac{1}{n} Y_p Y_p^* = \frac{1}{n} \sum_{i=1}^{n} y_i y_i^*, \qquad Y_p = [y_1, \ldots, y_n] \in \mathbb{C}^{p \times n}.$$
◮ If $n \to \infty$, then, by the strong law of large numbers, $\hat C_p \xrightarrow{a.s.} C_p$, or equivalently, in spectral norm, $\|\hat C_p - C_p\| \xrightarrow{a.s.} 0$.

Random Matrix Regime

◮ No longer valid if $p, n \to \infty$ with $p/n \to c \in (0, \infty)$: $\|\hat C_p - C_p\| \not\to 0$.
◮ For practical $p, n$ with $p \simeq n$, this leads to dramatically wrong conclusions.
◮ Even for $p = n/100$.
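
A minimal numerical sketch of this phenomenon (our own illustration, not from the slides; the sizes below are illustrative assumptions): with $p/n$ fixed, the spectral-norm error of the SCM does not vanish, while for fixed $p$ and growing $n$ it does.

```python
# Sketch (assumed setup): SCM error in spectral norm for C_p = I_p.
import numpy as np

rng = np.random.default_rng(0)

def scm_error(p, n):
    # n i.i.d. samples y_i ~ N(0, I_p), stacked as Y_p in R^{p x n}.
    Y = rng.standard_normal((p, n))
    C_hat = Y @ Y.T / n
    # Spectral-norm distance || C_hat - I_p || (largest singular value).
    return np.linalg.norm(C_hat - np.eye(p), 2)

# Fixed ratio p/n = 1/4: the error does not vanish as p, n grow together.
for p in (50, 200, 800):
    print("p/n = 1/4:", p, scm_error(p, 4 * p))

# Fixed p = 50, growing n: the error does vanish (law of large numbers).
for n in (500, 5000, 50000):
    print("p = 50:   ", n, scm_error(50, n))
```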

slide-22
SLIDE 22

The Marčenko–Pastur law

[Figure: histograms of the eigenvalues of $\hat C_p$ for $c = 1/4$, $C_p = I_p$, with $(p, n) = (50, 200), (100, 400), (250, 1000), (500, 2000), (1000, 4000)$; as the dimensions grow, the histogram stabilizes around the Marčenko–Pastur law.]

slide-28
SLIDE 28

The Marčenko–Pastur law

Definition (Empirical Spectral Density)
The empirical spectral density (e.s.d.) $\mu_p$ of a Hermitian matrix $A_p \in \mathbb{C}^{p \times p}$ is
$$\mu_p = \frac{1}{p} \sum_{i=1}^{p} \delta_{\lambda_i(A_p)}.$$

Theorem (Marčenko–Pastur Law [Marčenko, Pastur '67])
Let $X_p \in \mathbb{C}^{p \times n}$ have i.i.d. zero mean, unit variance entries. As $p, n \to \infty$ with $p/n \to c \in (0, \infty)$, the e.s.d. $\mu_p$ of $\frac{1}{n} X_p X_p^*$ satisfies $\mu_p \xrightarrow{a.s.} \mu_c$ weakly, where
◮ $\mu_c(\{0\}) = \max\{0, 1 - c^{-1}\}$;
◮ on $(0, \infty)$, $\mu_c$ has a continuous density $f_c$ supported on $[(1 - \sqrt{c})^2, (1 + \sqrt{c})^2]$,
$$f_c(x) = \frac{1}{2\pi c x} \sqrt{\big(x - (1 - \sqrt{c})^2\big)\big((1 + \sqrt{c})^2 - x\big)}.$$
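
A quick simulation check of the theorem (our own sketch; the sizes are illustrative assumptions): the eigenvalue histogram of $\frac{1}{n} X_p X_p^*$ tracks the density $f_c$.

```python
# Sketch (assumed setup): empirical spectrum of (1/n) X X^T vs. the MP density f_c.
import numpy as np

rng = np.random.default_rng(0)
p, n = 500, 2000                       # c = 1/4, as in the histograms above
c = p / n
X = rng.standard_normal((p, n))
eigs = np.linalg.eigvalsh(X @ X.T / n)

lo, hi = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
hist, edges = np.histogram(eigs, bins=30, range=(lo, hi), density=True)
centers = (edges[:-1] + edges[1:]) / 2
f_c = np.sqrt((centers - lo) * (hi - centers)) / (2 * np.pi * c * centers)

# The histogram heights should be close to f_c at the bin centers.
print(np.round(np.abs(hist - f_c).max(), 2))
```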

slide-31
SLIDE 31

The Marčenko–Pastur law

[Figure: Marčenko–Pastur density $f_c(x)$ for different limit ratios $c = \lim_{p \to \infty} p/n$: $c = 0.1$, $c = 0.2$, $c = 0.5$.]

slide-34
SLIDE 34

Outline: Basics of Random Matrix Theory — Spiked Models

slide-35
SLIDE 35

Spiked Models

Small rank perturbation: $C_p = I_p + P$, $P$ of low rank.

[Figure: eigenvalues of $\frac{1}{n} Y_p Y_p^T$, with $\mathrm{eig}(C_p) = \{1, \ldots, 1, 2, 3, 4, 5\}$ (eigenvalue 1 of multiplicity $p - 4$), $p = 500$, shown for $p/n = 1/4, 1/2, 1, 2$.]

slide-39
SLIDE 39

Spiked Models

Theorem (Eigenvalues [Baik, Silverstein '06])
Let $Y_p = C_p^{1/2} X_p$, with
◮ $X_p$ with i.i.d. zero mean, unit variance entries, $E[|(X_p)_{ij}|^4] < \infty$;
◮ $C_p = I_p + P$, $P = U \Omega U^*$, where, for $K$ fixed, $\Omega = \mathrm{diag}(\omega_1, \ldots, \omega_K) \in \mathbb{R}^{K \times K}$, with $\omega_1 \ge \ldots \ge \omega_K > 0$.
Then, as $p, n \to \infty$ with $p/n \to c \in (0, \infty)$, denoting $\lambda_m = \lambda_m(\frac{1}{n} Y_p Y_p^*)$ (with $\lambda_m > \lambda_{m+1}$),
$$\lambda_m \xrightarrow{a.s.} \begin{cases} 1 + \omega_m + c\,\dfrac{1 + \omega_m}{\omega_m} \;>\; (1 + \sqrt{c})^2, & \omega_m > \sqrt{c}, \\ (1 + \sqrt{c})^2, & \omega_m \in (0, \sqrt{c}]. \end{cases}$$
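
A small simulation of the phase transition (our own sketch; the sizes and spike values are illustrative assumptions): below $\sqrt{c}$ the top empirical eigenvalue sticks to the bulk edge $(1 + \sqrt{c})^2$, above it jumps to $1 + \omega + c\,(1 + \omega)/\omega$.

```python
# Sketch (assumed setup): top eigenvalue of the SCM under C_p = I_p + omega e_1 e_1^T.
import numpy as np

rng = np.random.default_rng(0)
p, n = 500, 2000                         # c = 1/4, sqrt(c) = 0.5
c = p / n

def top_eigenvalue(omega):
    # Spike along e_1: C_p^{1/2} X_p amounts to scaling the first row by sqrt(1 + omega).
    Y = rng.standard_normal((p, n))
    Y[0, :] *= np.sqrt(1.0 + omega)
    return np.linalg.eigvalsh(Y @ Y.T / n)[-1]

for omega in (0.3, 1.0, 3.0):            # below / above the transition at sqrt(c)
    if omega > np.sqrt(c):
        limit = 1 + omega + c * (1 + omega) / omega
    else:
        limit = (1 + np.sqrt(c)) ** 2
    print(omega, round(top_eigenvalue(omega), 3), round(limit, 3))
```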

slide-41
SLIDE 41

Spiked Models

Theorem (Eigenvectors [Paul '07])
Let $Y_p = C_p^{1/2} X_p$, with
◮ $X_p$ with i.i.d. zero mean, unit variance entries, $E[|(X_p)_{ij}|^4] < \infty$;
◮ $C_p = I_p + P$, $P = U \Omega U^* = \sum_{i=1}^{K} \omega_i u_i u_i^*$, $\omega_1 > \ldots > \omega_K > 0$.
Then, as $p, n \to \infty$ with $p/n \to c \in (0, \infty)$, for $a, b \in \mathbb{C}^p$ deterministic and $\hat u_i$ the eigenvector of $\lambda_i(\frac{1}{n} Y_p Y_p^*)$,
$$a^* \hat u_i \hat u_i^* b - \frac{1 - c\,\omega_i^{-2}}{1 + c\,\omega_i^{-1}}\, a^* u_i u_i^* b \cdot 1_{\omega_i > \sqrt{c}} \xrightarrow{a.s.} 0.$$
In particular,
$$|\hat u_i^* u_i|^2 \xrightarrow{a.s.} \frac{1 - c\,\omega_i^{-2}}{1 + c\,\omega_i^{-1}} \cdot 1_{\omega_i > \sqrt{c}}.$$
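
And the corresponding eigenvector check (our own sketch, same assumed setup as above): above the transition, the squared alignment approaches $\frac{1 - c/\omega^2}{1 + c/\omega}$.

```python
# Sketch (assumed setup): alignment |u_hat_1^* u_1|^2 vs. its limit, spike along e_1.
import numpy as np

rng = np.random.default_rng(1)
p, n = 600, 1800                         # c = 1/3, as in the figure below
c, omega = p / n, 2.0

Y = rng.standard_normal((p, n))
Y[0, :] *= np.sqrt(1.0 + omega)          # C_p = I_p + omega e_1 e_1^T
_, vecs = np.linalg.eigh(Y @ Y.T / n)
u_hat = vecs[:, -1]                      # eigenvector of the largest eigenvalue

print(round(u_hat[0] ** 2, 3),           # empirical |u_hat^* e_1|^2
      round((1 - c / omega**2) / (1 + c / omega), 3))
```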

slide-43
SLIDE 43

Spiked Models

[Figure: simulated versus limiting $|\hat u_1^T u_1|^2$ for $Y_p = C_p^{1/2} X_p$, $C_p = I_p + \omega_1 u_1 u_1^T$, $p/n = 1/3$, varying $\omega_1$; curves for $p = 100, 200, 400$ against the limit $\frac{1 - c/\omega_1^2}{1 + c/\omega_1}$.]

slide-47
SLIDE 47

Other Spiked Models

Similar results hold for multiple matrix models:
◮ $Y_p = \frac{1}{n} (I + P)^{1/2} X_p X_p^* (I + P)^{1/2}$
◮ $Y_p = \frac{1}{n} X_p X_p^* + P$
◮ $Y_p = \frac{1}{n} X_p^* (I + P) X_p$
◮ $Y_p = \frac{1}{n} (X_p + P)^* (X_p + P)$
◮ etc.

slide-48
SLIDE 48

Outline: Application to Machine Learning

slide-49
SLIDE 49

Takeaway Message 1 “RMT Explains Why Machine Learning Intuitions Collapse in Large Dimensions”

slide-50
SLIDE 50

The curse of dimensionality and its consequences

Clustering setting in (not so) large $n, p$:

◮ GMM setting: $x_1^{(a)}, \ldots, x_{n_a}^{(a)} \sim \mathcal{N}(\mu_a, C_a)$, $a = 1, \ldots, k$
◮ Non-trivial task: $\|\mu_a - \mu_b\| = O(1)$, $\mathrm{tr}(C_a - C_b) = O(\sqrt{p})$, $\mathrm{tr}[(C_a - C_b)^2] = O(p)$

Classical method: spectral clustering

◮ Extract and cluster the dominant eigenvectors of $K = \{\kappa(x_i, x_j)\}_{i,j=1}^{n}$, with $\kappa(x_i, x_j) = f\!\left(\frac{1}{p}\|x_i - x_j\|^2\right)$.
◮ Why? Finite-dimensional intuition.

slide-57
SLIDE 57

The curse of dimensionality and its consequences (2)

In reality, here is what happens...

[Figure: kernel $K_{ij} = \exp(-\frac{1}{2p}\|x_i - x_j\|^2)$ and its second eigenvector $v_2$, for $x_i \sim \mathcal{N}(\pm\mu, I_p)$, $\mu = (2, 0, \ldots, 0)^T \in \mathbb{R}^p$.]

Key observation: under the growth rate assumptions,
$$\max_{1 \le i \ne j \le n} \left| \frac{1}{p}\|x_i - x_j\|^2 - \tau \right| \xrightarrow{a.s.} 0, \qquad \tau = \frac{2}{p} \sum_{a=1}^{k} \frac{n_a}{n} \mathrm{tr}\, C_a.$$

◮ This suggests $K \simeq f(\tau) 1_n 1_n^T$!
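
The distance-concentration phenomenon is easy to reproduce (our own sketch; the Gaussian mixture below mimics the slide's example, and the sizes are illustrative assumptions): all normalized pairwise distances, within and across classes, pile up around $\tau$.

```python
# Sketch (assumed setup): concentration of ||x_i - x_j||^2 / p around tau.
import numpy as np

rng = np.random.default_rng(0)
p, n = 800, 400
mu = np.zeros(p); mu[0] = 2.0            # classes N(+mu, I_p) and N(-mu, I_p)
signs = np.where(np.arange(n) < n // 2, 1.0, -1.0)
X = rng.standard_normal((n, p)) + signs[:, None] * mu

# Pairwise squared distances via the Gram matrix, normalized by p.
G = X @ X.T
sq = (np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G) / p
off = sq[~np.eye(n, dtype=bool)]

# Here tau = 2 tr(I_p)/p = 2; all distances sit near it (O(p^{-1/2}) fluctuations).
print(round(off.min(), 2), round(off.mean(), 2), round(off.max(), 2))

# The intra/inter class gap (~ ||2 mu||^2 / p) is buried in those fluctuations.
intra = sq[:n // 2, :n // 2][~np.eye(n // 2, dtype=bool)]
inter = sq[:n // 2, n // 2:]
print(round(intra.mean(), 3), round(inter.mean(), 3), round(off.std(), 3))
```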

slide-63
SLIDE 63

The curse of dimensionality and its consequences (3)

(Major) consequences:

◮ Most machine learning intuitions collapse.
◮ But luckily, concentration of distances allows for Taylor expansion, linearization...

Theorem ([C-Benaych'16] Asymptotic Kernel Behavior)
Under growth rate assumptions, as $p, n \to \infty$, $\|K - \hat K\| \xrightarrow{a.s.} 0$, where
$$\hat K \simeq \underbrace{f(\tau) 1_n 1_n^T}_{O_{\|\cdot\|}(n)} + \frac{1}{p} Z Z^T + J A J^T + *$$
with $J = [j_1, \ldots, j_k] \in \mathbb{R}^{n \times k}$, $j_a = (0, 1_{n_a}, 0)^T$ (the clusters!), and $A \in \mathbb{R}^{k \times k}$ a function of:
◮ $f(\tau)$, $f'(\tau)$, $f''(\tau)$;
◮ $\|\mu_a - \mu_b\|$, $\mathrm{tr}(C_a - C_b)$, $\mathrm{tr}((C_a - C_b)^2)$, for $a, b \in \{1, \ldots, k\}$.

➫ This is a spiked model! We can study it fully!
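
A rough numerical illustration of this spiked structure (our own sketch; the two-class setup and sizes are illustrative assumptions): the top eigenvector of the Gaussian kernel matrix essentially follows $1_n$, and the class information sits in the next (spike) eigenvector.

```python
# Sketch (assumed setup): dominant kernel eigenvectors for a two-class GMM.
import numpy as np

rng = np.random.default_rng(0)
p, n = 400, 400
mu = np.zeros(p); mu[0] = 2.0
y = np.where(np.arange(n) < n // 2, 1.0, -1.0)           # class labels +/- 1
X = rng.standard_normal((n, p)) + y[:, None] * mu

G = X @ X.T
sq = (np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G) / p
K = np.exp(-sq / 2)                                      # K_ij = f(||xi - xj||^2 / p), f(t) = e^{-t/2}

_, vecs = np.linalg.eigh(K)
v1, v2 = vecs[:, -1], vecs[:, -2]                        # top two eigenvectors
print(round(abs(v1 @ np.ones(n)) / np.sqrt(n), 2))       # close to 1: v1 follows 1_n
print(round(abs(v2 @ y) / np.sqrt(n), 2))                # clearly > 0: v2 carries the classes
```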

slide-70
SLIDE 70

Theoretical Findings versus MNIST

[Figure: eigenvalues of $K$ (red) and of $\hat K$ as if for the equivalent Gaussian model (white), MNIST data, $p = 784$, $n = 192$.]

slide-72
SLIDE 72

Theoretical Findings versus MNIST

[Figure: leading four eigenvectors of $K$ for MNIST data (red) and theoretical findings (blue).]

slide-74
SLIDE 74

Theoretical Findings versus MNIST

[Figure: 2D representation of the eigenvectors of $K$ (eigenvector 2 vs. eigenvector 1, eigenvector 3 vs. eigenvector 2) for the MNIST dataset. Theoretical means and 1- and 2-standard deviations in blue; Class 1 in red, Class 2 in black, Class 3 in green.]

slide-76
SLIDE 76

Takeaway Message 2 “RMT Reassesses and Improves Data Processing”

slide-77
SLIDE 77

Improving Kernel Spectral Clustering

Thanks to [C-Benaych'16], it becomes possible to improve kernels:
◮ by "focusing" kernels on the most discriminative statistics: tune $f'(\tau)$, $f''(\tau)$;
◮ by "killing" non-discriminative feature directions.

Example: covariance-based discrimination, kernel $f(t) = \exp(-\frac{1}{2} t)$ versus $f(t) = (t - \tau)^2$ (think about the surprising kernel shape!).

slide-81
SLIDE 81

Another, more striking, example: Semi-Supervised Learning

Semi-supervised learning: a great idea that never worked!

◮ Setting: assume now
  ◮ $x_1^{(a)}, \ldots, x_{n_{a,[l]}}^{(a)}$ already labelled (few),
  ◮ $x_{n_{a,[l]}+1}^{(a)}, \ldots, x_{n_a}^{(a)}$ unlabelled (a lot).
◮ Machine learning original idea: find "scores" $F_{ia}$ for $x_i$ to belong to class $a$,
$$F = \operatorname*{argmin}_{F \in \mathbb{R}^{n \times k}} \sum_{a=1}^{k} \sum_{i,j} K_{ij} \left( F_{ia} D_{ii}^{\alpha} - F_{ja} D_{jj}^{\alpha} \right)^2, \qquad F_{ia}^{[l]} = \delta_{\{x_i \in \mathcal{C}_a\}}.$$
◮ Explicit solution:
$$F^{[u]} = \left( I_{n_{[u]}} - D_{[u]}^{-1-\alpha} K_{[uu]} D_{[u]}^{\alpha} \right)^{-1} D_{[u]}^{-1-\alpha} K_{[ul]} D_{[l]}^{\alpha} F^{[l]},$$
where $D = \mathrm{diag}(K 1_n)$ (degree matrix) and $[ul], [uu], \ldots$ denote the blocks of labelled/unlabelled data.
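
A direct implementation of the explicit score formula (our own sketch; the function name, argument conventions, and shapes are ours, not from the slides):

```python
# Sketch (assumed interface): Laplacian-type SSL scores from a kernel matrix K,
# labelled indices idx_l, and one-hot labels F_l (shape n_l x k).
import numpy as np

def ssl_scores(K, idx_l, F_l, alpha=-1.0):
    n = K.shape[0]
    idx_u = np.setdiff1d(np.arange(n), idx_l)
    d = K.sum(axis=1)                                   # D = diag(K 1_n)
    du_a, du_ma = d[idx_u] ** alpha, d[idx_u] ** (-1.0 - alpha)
    dl_a = d[idx_l] ** alpha
    K_uu = K[np.ix_(idx_u, idx_u)]
    K_ul = K[np.ix_(idx_u, idx_l)]
    # F_u = (I - D_u^{-1-a} K_uu D_u^{a})^{-1} D_u^{-1-a} K_ul D_l^{a} F_l
    A = np.eye(len(idx_u)) - (du_ma[:, None] * K_uu) * du_a[None, :]
    b = ((du_ma[:, None] * K_ul) * dl_a[None, :]) @ F_l
    return np.linalg.solve(A, b)                        # unlabelled scores F^[u]
```

The renormalization and the kernel centering discussed on the following slides can then be applied on top of these raw scores.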

slide-86
SLIDE 86

The finite-dimensional case: What we expect

[Figure: outcome $F$ of the Laplacian algorithm ($\alpha = -1$) for $\mathcal{N}(\pm\mu, I_p)$ with $p = 1$: scores $[F]_{\cdot,1}$ (for $\mathcal{C}_1$) and $[F]_{\cdot,2}$ (for $\mathcal{C}_2$), labelled and unlabelled points of both classes.]

slide-88
SLIDE 88

The reality: What we see!

[Figure: outcome $F$ of the Laplacian algorithm ($\alpha = -1$) for $\mathcal{N}(\pm\mu, I_p)$ with $p = 80$: scores $[F]_{\cdot,1}$ and $[F]_{\cdot,2}$, labelled and unlabelled points of both classes (compare with the previous figure).]

slide-90
SLIDE 90

The reality: What we see! (on MNIST)

[Figure: vectors $[F^{(u)}]_{\cdot,a}$, $a = 1, 2, 3$ (zeros, ones, twos), for 3-class MNIST data, $n = 192$, $p = 784$, $n_l/n = 1/16$, Gaussian kernel.]

slide-93
SLIDE 93

Exploiting RMT to resurrect SSL

Consequences of the finite-dimensional "mismatch":

◮ A priori, the algorithm should not work,
◮ and indeed, "in general", it does not!
◮ But, luckily, after some (not clearly motivated) renormalization (e.g., $\alpha = -1$, $F_{i\cdot} \leftarrow F_{i\cdot}/n_{[l],i}$), it works again...
◮ BUT it does not use the unlabelled data efficiently!

Chapelle, Schölkopf, Zien, "Semi-Supervised Learning", Chapter 4, 2009:
"Our concern is this: it is frequently the case that we would be better off just discarding the unlabeled data and employing a supervised method, rather than taking a semi-supervised route. Thus we worry about the embarrassing situation where the addition of unlabeled data degrades the performance of a classifier."

slide-99
SLIDE 99

Asymptotic Performance Analysis

Theorem ([Mai,C'18] Asymptotic Performance of SSL)
For $x_i \in \mathcal{C}_b$ unlabelled, the score vector $F_{i,\cdot} \in \mathbb{R}^k$ satisfies $F_{i,\cdot} - G_b \to 0$, $G_b \sim \mathcal{N}(m_b, \Sigma_b)$, with $m_b \in \mathbb{R}^k$, $\Sigma_b \in \mathbb{R}^{k \times k}$ functions of
◮ $f(\tau)$, $f'(\tau)$, $f''(\tau)$, $\mu_1, \ldots, \mu_k$, $C_1, \ldots, C_k$;
◮ only $n_l$ (not $n_u$).

[Figure: accuracy as a function of $n_{[u]}/p$, with $n_{[l]}/p = 2$, $c_1 = c_2$, $p = 100$, $-\mu_1 = \mu_2 = [1; 0_{p-1}]$, $\{C\}_{i,j} = 0.1^{|i-j|}$; Laplacian regularization versus (unsupervised) spectral clustering. Graph constructed with $K_{ij} = e^{-\|x_i - x_j\|^2/p}$.]

slide-102
SLIDE 102

Improved SSL

Solution: from RMT calculus (but not from ML intuition!), the solution is to replace $K$ by
$$\tilde K \equiv P K P, \qquad P = I_n - \frac{1}{n} 1_n 1_n^T.$$

Theorem ([Mai,C'19] Asymptotic Performance of Improved SSL)
For $x_i \in \mathcal{C}_b$ unlabelled, the score vector $\tilde F_{i,\cdot} \in \mathbb{R}^k$ satisfies $\tilde F_{i,\cdot} - \tilde G_b \to 0$, $\tilde G_b \sim \mathcal{N}(\tilde m_b, \tilde\Sigma_b)$, with $\tilde m_b \in \mathbb{R}^k$, $\tilde\Sigma_b \in \mathbb{R}^{k \times k}$ functions of
◮ $f(\tau)$, $f'(\tau)$, $f''(\tau)$, $\mu_1, \ldots, \mu_k$, $C_1, \ldots, C_k$;
◮ $n_l$ and $n_u$.

[Figure: same setting as the previous figure — accuracy as a function of $n_{[u]}/p$: Laplacian regularization, (unsupervised) spectral clustering, and centered regularization.]
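
The RMT fix itself is a one-liner (our own sketch; `center_kernel` is a hypothetical helper name): center the kernel before running the same SSL solver.

```python
# Sketch (assumed helper): centered kernel K_tilde = P K P, P = I_n - (1/n) 1_n 1_n^T.
import numpy as np

def center_kernel(K):
    n = K.shape[0]
    P = np.eye(n) - np.ones((n, n)) / n    # projector removing the 1_n direction
    return P @ K @ P

# E.g., feed center_kernel(K) instead of K to the ssl_scores sketch given earlier.
```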

slide-106
SLIDE 106

What about real data?

[Figure: top — distribution of normalized intra- and inter-class pairwise distances for noisy MNIST data (digits 8 and 9), for SNR $= +\infty$ dB, $-5$ dB, $-10$ dB; bottom — average accuracy as a function of $n_{[u]}$ with $n_{[l]} = 10$, Laplacian versus the proposed (centered) method, computed over 1000 random realizations.]

slide-111
SLIDE 111

Experimental evidence: MNIST

Digits                              (0,8)       (2,7)       (6,9)
n_u = 100
  Centered kernel (RMT)             89.5±3.6    89.5±3.4    85.3±5.9
  Iterated centered kernel (RMT)    89.5±3.6    89.5±3.4    85.3±5.9
  Laplacian                         75.5±5.6    74.2±5.8    70.0±5.5
  Iterated Laplacian                87.2±4.7    86.0±5.2    81.4±6.8
  Manifold                          88.0±4.7    88.4±3.9    82.8±6.5
n_u = 1000
  Centered kernel (RMT)             92.2±0.9    92.5±0.8    92.6±1.6
  Iterated centered kernel (RMT)    92.3±0.9    92.5±0.8    92.9±1.4
  Laplacian                         65.6±4.1    74.4±4.0    69.5±3.7
  Iterated Laplacian                92.2±0.9    92.4±0.9    92.0±1.6
  Manifold                          91.1±1.7    91.4±1.9    91.4±2.0

Table: Comparison of classification accuracy (%) on MNIST datasets with $n_l = 10$, computed over 1000 random iterations for $n_u = 100$ and 100 iterations for $n_u = 1000$.

slide-112
SLIDE 112

Experimental evidence: Traffic signs (HOG features)

Class ID                            (2,7)       (9,10)      (11,18)
n_u = 100
  Centered kernel (RMT)             79.0±10.4   77.5±9.2    78.5±7.1
  Iterated centered kernel (RMT)    85.3±5.9    89.2±5.6    90.1±6.7
  Laplacian                         73.8±9.8    77.3±9.5    78.6±7.2
  Iterated Laplacian                83.7±7.2    88.0±6.8    87.1±8.8
  Manifold                          77.6±8.9    81.4±10.4   82.3±10.8
n_u = 1000
  Centered kernel (RMT)             83.6±2.4    84.6±2.4    88.7±9.4
  Iterated centered kernel (RMT)    84.8±3.8    88.0±5.5    96.4±3.0
  Laplacian                         72.7±4.2    88.9±5.7    95.8±3.2
  Iterated Laplacian                83.0±5.5    88.2±6.0    92.7±6.1
  Manifold                          77.7±5.8    85.0±9.0    90.6±8.1

Table: Comparison of classification accuracy (%) on German Traffic Sign datasets with $n_l = 10$, computed over 1000 random iterations for $n_u = 100$ and 100 iterations for $n_u = 1000$.

slide-113
SLIDE 113

Takeaway Message 3 “RMT Also Grasps ‘Real Data’ Processing”

slide-114
SLIDE 114

From i.i.d. to concentrated random vectors

Beyond Gaussian mixtures: the results remain valid for concentrated random vectors.

Definition (Concentrated Random Vector)
$x \in \mathbb{R}^p$ is concentrated if, for all Lipschitz $f : \mathbb{R}^p \to \mathbb{R}$, there exists $m_f \in \mathbb{R}$ such that
$$P\big(|f(x) - m_f| > \varepsilon\big) \le e^{-g(\varepsilon)}, \qquad g \text{ an increasing function}.$$

[Figure: illustration on the sphere of radius $\sqrt{p}$ — Lipschitz observations of $x = (x_1, \ldots, x_p)$, such as $(x_1 + \cdots + x_p)/\sqrt{p}$ and $\|x\|_\infty$, fluctuate at order $O(1)$.]

slide-117
SLIDE 117

From i.i.d. to concentrated random vectors

Theorem ([Louart,C'18] [Seddik,C'19] Kernel Universality)
For $x_i \sim \mathcal{L}(\mu_a, C_a)$ concentrated random vectors, under the conditions of [C-Benaych'16], $\|K - \hat K\| \xrightarrow{a.s.} 0$, where
$$\hat K = f(\tau) 1_n 1_n^T + \frac{1}{p} Z Z^T + J A J^T + *$$
with $A$ only dependent on $f(\tau)$, $f'(\tau)$, $f''(\tau)$, $\mu_1, \ldots, \mu_k$, $C_1, \ldots, C_k$.

➫ Same result as [C-Benaych'16]... Universality of the first two moments!

slide-119
SLIDE 119

Ok... so what?

Key Finding: GAN-generated data are concentrated random vectors!


slide-124
SLIDE 124

Gaussian, GAN, and real data

Results [Seddik,C'19]:

[Figures: GAN images versus real images.]

slide-128
SLIDE 128

Conclusion

Our Research Activities:

The road ahead:
◮ from theory to practice: exploit theory to improve real-data learning;
◮ beyond explicit learning: implicit optimizations, non-convex problems;
◮ ML = representation + stat-learning (VAE, NN dynamics?).

slide-136
SLIDE 136

Our Team: the MIAI "LargeDATA" chair @ University Grenoble-Alpes

slide-137
SLIDE 137

The End

Thank you!

[C-Benaych'16] R. Couillet, F. Benaych-Georges, "Kernel Spectral Clustering of Large Dimensional Data", Electronic Journal of Statistics, vol. 10, no. 1, pp. 1393-1454, 2016.
[Mai,C'18] X. Mai, R. Couillet, "A random matrix analysis and improvement of semi-supervised learning for large dimensional data", Journal of Machine Learning Research, vol. 19, no. 79, pp. 1-27, 2018.
[Louart,C'18] C. Louart, Z. Liao, R. Couillet, "A Random Matrix Approach to Neural Networks", The Annals of Applied Probability, vol. 28, no. 2, pp. 1190-1248, 2018.
[Seddik,C'19] M. Seddik, M. Tamaazousti, R. Couillet, "Kernel Random Matrices of Large Concentrated Data: The Example of GAN-Generated Images", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'19), Brighton, UK, 2019.
H. Tiomoko Ali, R. Couillet, "Improved spectral community detection in large heterogeneous networks", Journal of Machine Learning Research, vol. 18, no. 225, pp. 1-49, 2018.
R. Couillet, M. Tiomoko, S. Zozor, E. Moisan, "Random matrix-improved estimation of covariance matrix distances", Journal of Multivariate Analysis, vol. 174, pp. 104531, 2019.
Z. Liao, R. Couillet, "A Large Dimensional Analysis of Least Squares Support Vector Machines", IEEE Transactions on Signal Processing, vol. 67, no. 4, pp. 1065-1074, 2018.