
SLIDE 1

Learning Statistical Property Testers

Sreeram Kannan, University of Washington, Seattle

SLIDE 2

Collaborators

Arman Rahimzamani, Himanshu Asnani, Sudipto Mukherjee (University of Washington, Seattle)
Rajat Sen (UT Austin)
Karthikeyan Shanmugam (IBM Research)

SLIDE 3

Statistical Property Testing

✤ Closeness testing
✤ Independence testing
✤ Conditional independence testing
✤ Information estimation

SLIDES 4-7

Testing Total Variation Distance

[Figure: histograms of two distributions P and Q]

n samples from P, n samples from Q.

Estimate D_TV(P, Q)?

P and Q can be arbitrary. Search beyond traditional density estimation methods.

SLIDE 8

Testing Total Variation: Prior Art

✤ Lots of work in CS theory on D_TV testing
✤ Based on closeness testing between P and Q
✤ Sample complexity = O(n^a), where n = alphabet size
✤ Curse of dimensionality: if n = 2^d, the complexity is O(2^{ad})

* Chan et al., Optimal algorithms for testing closeness of discrete distributions, SODA 2014
* Sriperumbudur et al., Kernel choice and classifiability for RKHS embeddings of probability distributions, NIPS 2009

SLIDE 9

Classifiers beat the curse of dimensionality

✤ Deep NNs and boosted random forests achieve state-of-the-art performance
✤ Work very well in practice, even when X is high dimensional
✤ Exploit generic inductive bias:
  ✤ Invariance
  ✤ Hierarchical structure
  ✤ Symmetry

Theoretical guarantees lag severely behind practice!

SLIDES 10-14

Distance Estimation via Classification

[Figure: histograms of P and Q]

n samples ∼ P (Label 0), n samples ∼ Q (Label 1). Train a classifier (deep NN, boosted trees, etc.) to tell the two apart.

Classification error of the optimal Bayes classifier f* = 1/2 − (1/2) D_TV(P, Q).

Classification error of any classifier ≥ 1/2 − (1/2) D_TV(P, Q).

Can get p-value control.

* Lopez-Paz et al., Revisiting classifier two-sample tests, ICLR 2017
* Sriperumbudur et al., Kernel choice and classifiability for RKHS embeddings of probability distributions, NIPS 2009
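
To make the recipe concrete, here is a minimal sketch of the classifier-based D_TV estimate (an illustration, not the exact protocol of the papers above; the gradient-boosting model, the 50/50 split, and the toy Gaussians are assumptions):

```python
# Estimate D_TV(P, Q) from samples: label, classify, then read the distance off
# the held-out error via  error ≈ 1/2 - (1/2) D_TV  =>  D_TV ≈ 1 - 2 * error.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

def estimate_dtv(samples_p, samples_q):
    X = np.vstack([samples_p, samples_q])
    y = np.r_[np.zeros(len(samples_p)), np.ones(len(samples_q))]  # Label 0 ~ P, Label 1 ~ Q
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, stratify=y)
    clf = GradientBoostingClassifier().fit(X_tr, y_tr)
    err = np.mean(clf.predict(X_te) != y_te)
    # Any classifier's error >= 1/2 - D_TV/2, so this lower-bounds D_TV (up to sampling noise).
    return max(0.0, 1.0 - 2.0 * err)

# Toy example: P = N(0, I), Q = N(0.5, I) in 10 dimensions.
P = np.random.randn(5000, 10)
Q = np.random.randn(5000, 10) + 0.5
print(estimate_dtv(P, Q))
```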

SLIDES 15-19

Independence Testing

n samples {(x_i, y_i)}_{i=1}^n

H_0: X ⊥ Y (P_CI)  vs.  H_1: X ⊥̸ Y (P)

Classify P (samples from p(x, y)) against P_CI (samples from p(x)p(y)); samples from P_CI are obtained by permutation.

SLIDES 20-26

Independence Testing

n samples {(x_i, y_i)}_{i=1}^n. Split equally:
✤ First half: samples from P (p(x, y)), Label 0.
✤ Second half: the y_i's are permuted, giving samples from P_CI (p(x)p(y)), Label 1.

P-value control.

* Lopez-Paz et al., Revisiting classifier two-sample tests, ICLR 2017
* Sriperumbudur et al., Kernel choice and classifiability for RKHS embeddings of probability distributions, NIPS 2009
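
A minimal sketch of this split-and-permute construction (the random-forest classifier and internal splits are illustrative assumptions; for p-values one would calibrate the statistic, e.g. by repeating with fresh permutations):

```python
# Independence test via classification: classify p(x, y) against p(x)p(y),
# where p(x)p(y) samples are built by permuting the y's in one half.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def independence_stat(x, y, rng=np.random.default_rng(0)):
    # x, y: 2-D arrays of shape (n_total, d_x) and (n_total, d_y).
    n = len(x) // 2
    joint = np.hstack([x[:n], y[:n]])              # first half: (x, y) ~ p(x, y), Label 0
    perm = rng.permutation(np.arange(n, 2 * n))
    product = np.hstack([x[n:2 * n], y[perm]])     # second half: y's permuted, ~ p(x)p(y), Label 1
    X = np.vstack([joint, product])
    labels = np.r_[np.zeros(n), np.ones(n)]
    tr = rng.random(2 * n) < 0.5                   # random train/test split
    clf = RandomForestClassifier().fit(X[tr], labels[tr])
    err = np.mean(clf.predict(X[~tr]) != labels[~tr])
    return 0.5 - err   # near 0 under H0; bounded away from 0 under dependence
```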

SLIDES 27-30

Conditional Independence Testing

n samples {(x_i, y_i, z_i)}_{i=1}^n

H_0: X ⊥ Y | Z (P_CI)  vs.  H_1: X ⊥̸ Y | Z (P)

Classify P (p(x, y, z)) against P_CI (p(z)p(x|z)p(y|z)).

How to get P_CI (p(z)p(x|z)p(y|z))? Given samples ∼ p(x, z), how to emulate p(y|z)?

SLIDE 31

Conditional Independence Testing

H_0: X ⊥ Y | Z (P_CI)  vs.  H_1: X ⊥̸ Y | Z (P)

Classify P (p(x, y, z)) against P_CI (p(z)p(x|z)p(y|z)).

Emulate p(y|z) as q(y|z):
✤ KNN-based methods (a minimal 1-NN instance is sketched below)
✤ Kernel methods
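
As one concrete instance of the KNN-based idea, a hypothetical 1-nearest-neighbour mimic (a sketch only; the actual nearest-neighbour bootstrap in CCIT differs in details):

```python
# Emulate q(y|z): for each z_i, borrow the y of the nearest *other* point in z-space.
# Assumes continuous z with no exact duplicates, so idx[:, 0] is the point itself.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_mimic(y, z):
    nn = NearestNeighbors(n_neighbors=2).fit(z)
    _, idx = nn.kneighbors(z)
    return y[idx[:, 1]]   # approx sample y' ~ p(y|z) at each z_i, decoupled from x_i
```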

SLIDES 32-34

Conditional Independence Testing

Emulating p(y|z) as q(y|z) gives P̃_CI (p(z)p(x|z)q(y|z)); classify P (p(x, y, z)) against P̃_CI.

✤ [KCIT] Zhang et al., Kernel-based conditional independence test and application in causal discovery, UAI 2011
✤ [KCIPT] Doran et al., A permutation-based kernel conditional independence test, UAI 2014
✤ [CCIT] Sen et al., Model-powered conditional independence test, NIPS 2017
✤ [RCIT] Strobl et al., Approximate kernel-based conditional independence tests for fast non-parametric causal discovery, arXiv

✤ Limited to low-dimensional Z. In practice, Z is often high dimensional (e.g., in a graphical model the conditioning set can be the entire rest of the graph).

SLIDES 35-37

Generative Models beat the curse of dimensionality

[Diagram: a generator maps z, in a low-dimensional latent space, to x, in a high-dimensional data space]

✤ Trained on real samples of x
✤ Can generate any number of new samples

SLIDES 38-40

How loose can the estimate be for P̃_CI, i.e., for q(y|z)?

Mimic-and-Classify works as long as the density q(y|z) > 0 whenever p(y, z) > 0.

Mimic functions: GANs, regressors, etc. A novel bias cancellation method makes Mimic-and-Classify work.

SLIDES 41-53

Mimic and Classify

Mimic step:
✤ Split D ∼ p(x, y, z) equally into D1 ∼ p(x, y, z) and D2 ∼ p(x, y, z).
✤ MIMIC: in D2, replace each (x_i, y_i, z_i) by (x_i, y'_i, z_i), where y'_i is generated from z_i (the y_i's are mimicked). This yields dataset D' ∼ p(z)p(x|z)q(y|z).

Classify step:
✤ Form D̃ = D1 ∪ D', with Label 0 on D1 and Label 1 on D'.
✤ Train a classifier on D̃; its classification error E_xyz reflects D(p(x, y, z) ‖ p(x, z)q(y|z)).
✤ Drop x to get D̃_−x; train again; the classification error E_yz reflects D(p(y, z) ‖ p(z)q(y|z)).

Statistic = E_xyz − E_yz: cancels the bias due to q(y|z).
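
A compact sketch putting the two steps together (the classifier, the splits, and the hypothetical knn_mimic helper from the earlier sketch are assumptions, not the paper's exact pipeline):

```python
# Mimic-and-Classify statistic: E_xyz - E_yz cancels the bias from q(y|z).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def holdout_error(features, labels, rng):
    tr = rng.random(len(labels)) < 0.5
    clf = RandomForestClassifier().fit(features[tr], labels[tr])
    return np.mean(clf.predict(features[~tr]) != labels[~tr])

def mimic_and_classify_stat(x, y, z, rng=np.random.default_rng(0)):
    # x, y, z: 2-D arrays with the same number of rows.
    n = len(x) // 2
    d1 = np.hstack([x[:n], y[:n], z[:n]])               # D1 ~ p(x, y, z), Label 0
    y_mimic = knn_mimic(y[n:2 * n], z[n:2 * n])         # mimic q(y|z) on D2
    d_prime = np.hstack([x[n:2 * n], y_mimic, z[n:2 * n]])  # D' ~ p(z)p(x|z)q(y|z), Label 1
    features = np.vstack([d1, d_prime])
    labels = np.r_[np.zeros(n), np.ones(n)]
    e_xyz = holdout_error(features, labels, rng)        # error on full (x, y, z)
    e_yz = holdout_error(features[:, x.shape[1]:], labels, rng)  # drop x: (y, z) only
    return e_xyz - e_yz                                 # ≈ 0 under H0
```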

SLIDES 54-55

Mimic and Classify

Mimic step: works as long as the density q(y|z) > 0 whenever p(y, z) > 0.

Classify step*:

2|E_D[E_xyz] − E_D[E_yz]| = |D_TV(p(z, x, y), p(z)q(y|z)p(x|z)) − D_TV(p(y, z), p(z)q(y|z))|

|E_D[E_xyz] − E_D[E_yz]| = 0 ↔ H_0 is true

*The errors here are the corresponding optimal Bayes classifier errors.

SLIDES 56-59

Mimic and Classify (Theory)

2|E_D[E_xyz] − E_D[E_yz]| = |D_TV(p(z, x, y), p(z)q(y|z)p(x|z)) − D_TV(p(y, z), p(z)q(y|z))|
  ≥ ∫_{y,z} min(p(z)q(y|z), p(z)p(y|z)) (1 − ε(y, z)) d(y, z)

where ε(y, z) = max over couplings π ∈ Π(p(x|z), p(x'|y, z)) of E_π[1{x = x'} | y, z].

Conditional dependence ↔ ε(y, z) < 1 with non-zero probability.

Theorem 1. As long as the density q(y|z) > 0 whenever p(y, z) > 0, conditional dependence implies that 2|E_D[E_xyz] − E_D[E_yz]| > 0.

SLIDES 60-64

Mimic and Classify (Theory)

Conditional independence implies p(x, y, z) = p(z)p(y|z)p(x|z). Then

D_TV(p(z)p(y|z), p(z)q(y|z)) = D_TV(p(x|z)p(z)p(y|z), p(x|z)p(z)q(y|z))
                             = D_TV(p(x, y, z), p(x|z)p(z)q(y|z)),

so the two total variation terms above coincide.

Theorem 2. Conditional independence implies that 2|E_D[E_xyz] − E_D[E_yz]| = 0.

SLIDE 65

Mimic and Classify (Theory)

Combining Theorem 1 and Theorem 2:

Theorem 3. As long as the density q(y|z) > 0 whenever p(y, z) > 0,
|E_D[E_xyz] − E_D[E_yz]| = 0 ↔ H_0 is true.

SLIDES 66-70

Deep Learning based MIMIC Functions

MIMIFY-CGAN:
[Diagram: conditional GAN. The generator G(z, s) takes the conditioning z and a Gaussian latent s and outputs samples ∼ q(y|z); the discriminator D(y, z) maps (y, z) to [0, 1].]

MIMIFY-REG:
Regress to estimate r(z) = E[Y | Z = z]; then ŷ = r(z) + Gaussian noise (or Laplacian noise) gives samples ∼ q(y|z).
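
A minimal sketch of a MIMIFY-REG style mimic function (assumes scalar y and scikit-learn; estimating the noise scale from training residuals is an illustrative choice, not the paper's prescription):

```python
# Fit r(z) ≈ E[Y | Z = z], then emulate q(y|z) as r(z) + Gaussian noise.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def mimic_reg(z_train, y_train, z_new, rng=np.random.default_rng(0)):
    reg = RandomForestRegressor().fit(z_train, y_train)
    noise_std = (y_train - reg.predict(z_train)).std()  # assumed noise scale
    return reg.predict(z_new) + rng.normal(0.0, noise_std, size=len(z_new))
```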

SLIDES 71-73

Experiments

✤ Post-nonlinear noise synthetic experiments: AUROC
✤ Flow-cytometry data
✤ Gene regulatory network inference (DREAM)
slide-74
SLIDE 74

Estimating Information Measures

SLIDES 75-76

Estimating Kullback-Leibler Divergence

[Figure: histograms of P and Q]

n samples ∼ P, n samples ∼ Q. Estimate D_KL(P ‖ Q)?

Curse of dimensionality: sample complexity O(n / log n) with n = 2^d.

SLIDES 77-79

MINE: Neural Network Approximation

n samples ∼ P, n samples ∼ Q.

Donsker-Varadhan dual representation: D_KL(P ‖ Q) = sup_T E_P[T] − log(E_Q[e^T])

  • T ← rich NN class
  • E ← sample averages
  • sup_T ← obtained via stochastic gradient search

* Belghazi et al., MINE: Mutual Information Neural Estimation, ICML 2018
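
A quick numeric sanity check of the Donsker-Varadhan representation on a toy discrete pair: plugging the optimal T* = log(p/q) into the bound makes E_Q[e^T] = 1, so the bound recovers D_KL exactly.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
dkl = np.sum(p * np.log(p / q))                     # direct D_KL(P || Q)
T = np.log(p / q)                                   # optimal T* = log(p/q)
dv = np.sum(p * T) - np.log(np.sum(q * np.exp(T)))  # DV bound evaluated at T*
print(dkl, dv)                                      # identical values
```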

SLIDE 80

MINE is Unstable to Train

[Figure: MINE training curves across runs vs. the true I(X; Y)]

* Belghazi et al., MINE: Mutual Information Neural Estimation, ICML 2018

SLIDE 81

Divergence Estimation via Classification

§ D_KL(p ‖ q) = sup_{f ∈ ℱ} E_{x∼p(x)}[f(x)] − log(E_{x∼q(x)}[e^{f(x)}]), and f*(x) = log(p(x)/q(x)) is the optimum (the log of the RN derivative). Classifiers can estimate f*.

Label 1 for samples x ∼ p(x); Label 0 for samples x ∼ q(x). A classifier trained on these labels estimates the posteriors p(l = 1 | x) and p(l = 0 | x), and

f* = log( p(l = 1 | x) / p(l = 0 | x) ).

Plug in to the DV bound: highly stable training! A true lower bound.
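
A minimal sketch of this classifier route to the DV bound (the MLP, the probability clipping, and the split are assumptions; with balanced classes, the posterior ratio p(l=1|x)/p(l=0|x) estimates p(x)/q(x)):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

def classifier_kl(samples_p, samples_q, eps=1e-6):
    X = np.vstack([samples_p, samples_q])
    y = np.r_[np.ones(len(samples_p)), np.zeros(len(samples_q))]  # Label 1 for p, 0 for q
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, stratify=y)
    clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500).fit(X_tr, y_tr)
    g = np.clip(clf.predict_proba(X_te)[:, 1], eps, 1 - eps)
    ratio = g / (1 - g)            # estimates p(x)/q(x)
    T = np.log(ratio)              # estimates f*
    return T[y_te == 1].mean() - np.log(ratio[y_te == 0].mean())  # DV bound
```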

SLIDE 82

Classifiers require calibration

Requires classifiers that are calibrated, i.e., whose outputs estimate p(l = 1 | x). Well-calibrated neural networks are achievable.

* Lakshminarayanan et al., Simple and scalable predictive uncertainty estimation using deep ensembles, NeurIPS 2017

SLIDE 83

Mutual Information Estimation

I(X; Y) = D_KL(P_XY ‖ P_X P_Y)

Classifier-MI: classify the original samples against permuted samples to estimate I(X; Y).

Theorem 1: the neural-network Classifier-MI estimator is consistent.
Theorem 2: Classifier-MI is a "true" lower bound on MI.
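
A sketch of Classifier-MI reusing the hypothetical classifier_kl helper from the earlier slide (the permutation scheme follows this slide; everything else is an illustrative assumption):

```python
# I(X; Y) = D_KL(P_XY || P_X P_Y): samples from P_X P_Y are built by
# permuting the y's to break the (x, y) pairing.
import numpy as np

def classifier_mi(x, y, rng=np.random.default_rng(0)):
    # x, y: 2-D arrays of shape (n, d_x) and (n, d_y).
    joint = np.hstack([x, y])                  # samples ~ p(x, y)
    y_perm = y[rng.permutation(len(y))]        # break the pairing
    product = np.hstack([x, y_perm])           # approx samples ~ p(x)p(y)
    return classifier_kl(joint, product)
```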

SLIDE 84

Performance: MI Estimation

(x_i, y_i) ∼ N(0, [[1, ρ], [ρ, 1]]);  X = (x_1, x_2, …, x_d), Y = (y_1, y_2, …, y_d), with x_i ⊥ y_j for i ≠ j.

SLIDE 85

Performance: Conditional MI Estimation

Estimate I(X; Y | Z) = D_KL(p_XYZ ‖ p_XZ p_{Y|Z}).

§ Modular estimation: I(X; Y | Z) = I(X; Y, Z) − I(X; Z).

§ Generator + Classifier-MI: use a CGAN, CVAE, or 1-NN on {(y_i, z_i)}_{i=1}^n to produce {(y'_i, z_i)}_{i=1}^n ∼ q(y|z); then Classifier-MI estimates D_KL(p(x, y, z) ‖ p(x, z)q(y|z)) from {(x_i, y_i, z_i)}_{i=1}^n vs. {(x_i, y'_i, z_i)}_{i=1}^n.
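
The modular route is a one-liner given an MI estimator; a sketch reusing the hypothetical classifier_mi from the earlier slide:

```python
import numpy as np

def classifier_cmi(x, y, z):
    # Chain rule: I(X; Y, Z) = I(X; Z) + I(X; Y | Z)
    return classifier_mi(x, np.hstack([y, z])) - classifier_mi(x, z)
```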

SLIDE 86

Performance: CMI

Model I (linear): X ∼ N(0, 1), Z ∼ U(−0.5, 0.5)^{d_z}, ε ∼ N(0, 0.01), Y = X + ε.

Model II (linear): X ∼ N(0, 1), Z ∼ N(0, 1)^{d_z}, ε ∼ N(w^T Z, 0.01), Y = X + ε.

Non-linear models: Z ∼ N(μ, σ I_{d_z}), X = f_1(η_1), Y = f_2(w_1^T X + w_2^T Z + η_2), with f_1, f_2 ∼ {tanh, cos, exp(−|·|)} and η_1, η_2 ∼ N(0, 0.1).

SLIDES 87-89

Performance: CMI

d_z = 20, n = 20,000

[Figures: CMI estimation results for Model 1 (linear), Model 2 (linear), and Model 3 (non-linear)]

SLIDE 90

Performance: Conditional Independence Testing

§ Z ∼ N(μ, σ I_{d_z}), X = cos(a^T Z + η_1), and
Y = cos(b^T Z + η_2) if X ⊥ Y | Z;  Y = cos(c X + b^T Z + η_2) if X ⊥̸ Y | Z;
with a, b ∼ U(0, 1)^{d_z}, ‖a‖ = 1, ‖b‖ = 1, c ∼ U(0, 2), η_i ∼ N(0, 0.25).

SLIDE 91

Performance: Real Data (Flow Cytometry)

§ 50 CI and 50 NI relations.
§ Examples: pkc ⊥ akt | {raf, mek, p38, pka, jnk};  pip3 ⊥ pip2 | {p38, raf, pkc, jnk, erk, akt}
§ CCMI (mean AUROC) = 0.75; CCIT (mean AUROC) = 0.66

SLIDE 92

Open Problems

✤ Closeness testing problems
✤ Beyond D_TV: distance measure estimation using classifiers?
✤ Stable trainability
✤ Time-series data (directed information estimation and testing)
✤ Statistical property testing
✤ Uniformity testing