

slide-1
SLIDE 1

Lecture 3: Dependence measures using RKHS embeddings

MLSS Tübingen, 2015

Arthur Gretton Gatsby Unit, CSML, UCL

slide-2
SLIDE 2

Outline

  • Three or more variable interactions, comparison with conditional dependence testing [Sejdinovic et al., 2013a]
  • Dependence detection in detail, covariance operators
  • Bayesian inference without models, comparison with approximate Bayesian computation (ABC) [Fukumizu et al., 2013]
  • Recent work (2014/2015) (not in this talk, see my webpage)
    – Testing for time series [Chwialkowski and Gretton, 2014, Chwialkowski et al., 2014]
    – Nonparametric adaptive expectation propagation [Jitkrittum et al., 2015]
    – Infinite dimensional exponential families [Sriperumbudur et al., 2014]
    – Adaptive MCMC, and adaptive Hamiltonian Monte Carlo [Sejdinovic et al., 2014, Strathmann et al., 2015]

slide-3
SLIDE 3

Lancaster (3-way) Interactions

slide-4
SLIDE 4

Detecting a higher order interaction

  • How to detect V-structures with pairwise weak individual dependence?

X Y Z

slide-5
SLIDE 5

Detecting a higher order interaction

  • How to detect V-structures with pairwise weak individual dependence?
slide-6
SLIDE 6

Detecting a higher order interaction

  • How to detect V-structures with pairwise weak individual dependence?
  • X ⊥⊥ Y, Y ⊥⊥ Z, X ⊥⊥ Z
  • X, Y i.i.d. ∼ N(0, 1)
  • Z | X, Y ∼ sign(XY) Exp(1/√2)

(Scatter plots: X vs Y, Y vs Z, X vs Z, XY vs Z; V-structure over X, Y, Z)

Faithfulness is violated here.

slide-7
SLIDE 7

V-structure Discovery

X Y Z

Assume X ⊥⊥ Y has been established. A V-structure can then be detected by:

  • Consistent CI test: H0 : X ⊥⊥ Y | Z [Fukumizu et al., 2008, Zhang et al., 2011], or

slide-8
SLIDE 8

V-structure Discovery

X Y Z

Assume X ⊥⊥ Y has been established. A V-structure can then be detected by:

  • Consistent CI test: H0 : X ⊥⊥ Y | Z [Fukumizu et al., 2008, Zhang et al., 2011], or
  • Factorisation test: H0 : (X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X (multiple standard two-variable tests):
    – compute p-values for each of the marginal tests (Y, Z) ⊥⊥ X, (X, Z) ⊥⊥ Y, and (X, Y) ⊥⊥ Z
    – apply the Holm-Bonferroni (HB) sequentially rejective correction (Holm 1979); a minimal sketch follows
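A minimal sketch of the Holm-Bonferroni step, assuming the three marginal p-values have already been computed (the p-values below are made up for illustration):

```python
import numpy as np

def holm_bonferroni_reject(p_values, alpha=0.05):
    """Holm's sequentially rejective correction.

    Returns a boolean array: True where the corresponding null is rejected
    at family-wise error rate alpha.
    """
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                      # most significant first
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        # compare the (rank+1)-th smallest p-value with alpha / (m - rank)
        if p[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break                              # stop at the first acceptance
    return reject

# Illustrative p-values for the three marginal tests
# (Y,Z) vs X, (X,Z) vs Y, (X,Y) vs Z:
p_marginal = [0.004, 0.030, 0.210]
print(holm_bonferroni_reject(p_marginal))      # [ True False False]
```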

slide-9
SLIDE 9

V-structure Discovery (2)

  • How to detect V-structures with pairwise weak (or nonexistent)

dependence?

  • X ⊥⊥ Y, Y ⊥⊥ Z, X ⊥⊥ Z
  • X1, Y1 i.i.d. ∼ N(0, 1)
  • Z1 | X1, Y1 ∼ sign(X1Y1) Exp(1/√2)
  • X2:p, Y2:p, Z2:p i.i.d. ∼ N(0, Ip−1)

(Scatter plots: X1 vs Y1, Y1 vs Z1, X1 vs Z1, X1·Y1 vs Z1; V-structure over X, Y, Z)

Faithfulness is violated here.

slide-10
SLIDE 10

V-structure Discovery (3)

(Figure: null acceptance rate (Type II error) vs. dimension 1–19, for the CI test X ⊥⊥ Y | Z and the two-variable factorisation test; V-structure discovery, Dataset A)

Figure 1: CI test for X ⊥⊥ Y | Z from Zhang et al. (2011), and a factorisation test with a HB correction, n = 500.

slide-11
SLIDE 11

Lancaster Interaction Measure

[Bahadur (1961); Lancaster (1969)] Interaction measure of (X1, . . . , XD) ∼ P is a signed measure ∆P that vanishes whenever P can be factorised in a non-trivial way as a product of its (possibly multivariate) marginal distributions.

  • D = 2 :

∆LP = PXY − PXPY

slide-12
SLIDE 12

Lancaster Interaction Measure

[Bahadur (1961); Lancaster (1969)] Interaction measure of (X1, . . . , XD) ∼ P is a signed measure ∆P that vanishes whenever P can be factorised in a non-trivial way as a product of its (possibly multivariate) marginal distributions.

  • D = 2 :

∆LP = PXY − PXPY

  • D = 3 :

∆LP = PXY Z − PXPY Z − PY PXZ − PZPXY + 2PXPY PZ

slide-13
SLIDE 13

Lancaster Interaction Measure

[Bahadur (1961); Lancaster (1969)] Interaction measure of (X1, . . . , XD) ∼ P is a signed measure ∆P that vanishes whenever P can be factorised in a non-trivial way as a product of its (possibly multivariate) marginal distributions.

  • D = 2 :

∆LP = PXY − PXPY

  • D = 3 :

∆LP = PXY Z − PXPY Z − PY PXZ − PZPXY + 2PXPY PZ

(Diagrams: four graphs over X, Y, Z, one per term below)

∆LP = PXY Z − PXPY Z − PY PXZ − PZPXY + 2PXPY PZ

slide-14
SLIDE 14

Lancaster Interaction Measure

[Bahadur (1961); Lancaster (1969)] Interaction measure of (X1, . . . , XD) ∼ P is a signed measure ∆P that vanishes whenever P can be factorised in a non-trivial way as a product of its (possibly multivariate) marginal distributions.

  • D = 2 :

∆LP = PXY − PXPY

  • D = 3 :

∆LP = PXY Z − PXPY Z − PY PXZ − PZPXY + 2PXPY PZ

(Diagrams: four graphs over X, Y, Z, one per term below)

∆LP = PXY Z − PXPY Z − PY PXZ − PZPXY + 2PXPY PZ = 0

Case of X ⊥⊥ (Y, Z)

slide-15
SLIDE 15

Lancaster Interaction Measure

[Bahadur (1961); Lancaster (1969)] Interaction measure of (X1, . . . , XD) ∼ P is a signed measure ∆P that vanishes whenever P can be factorised in a non-trivial way as a product of its (possibly multivariate) marginal distributions.

  • D = 2 :

∆LP = PXY − PXPY

  • D = 3 :

∆LP = PXY Z − PXPY Z − PY PXZ − PZPXY + 2PXPY PZ

(X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X ⇒ ∆LP = 0.

...so what might be missed?

slide-16
SLIDE 16

Lancaster Interaction Measure

[Bahadur (1961); Lancaster (1969)] Interaction measure of (X1, . . . , XD) ∼ P is a signed measure ∆P that vanishes whenever P can be factorised in a non-trivial way as a product of its (possibly multivariate) marginal distributions.

  • D = 2 :

∆LP = PXY − PXPY

  • D = 3 :

∆LP = PXY Z − PXPY Z − PY PXZ − PZPXY + 2PXPY PZ

∆LP = 0 does not imply (X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X. Example:

P(0, 0, 0) = 0.2   P(0, 0, 1) = 0.1   P(1, 0, 0) = 0.1   P(1, 0, 1) = 0.1
P(0, 1, 0) = 0.1   P(0, 1, 1) = 0.1   P(1, 1, 0) = 0.1   P(1, 1, 1) = 0.2

slide-17
SLIDE 17

A Test using Lancaster Measure

  • The test statistic is an empirical estimate of ‖µκ(∆LP)‖²Hκ, where κ = k ⊗ l ⊗ m:

‖µκ(PXY Z − PXY PZ − · · · )‖²Hκ = ⟨µκPXY Z, µκPXY Z⟩Hκ − 2 ⟨µκPXY Z, µκPXY PZ⟩Hκ + · · ·

slide-18
SLIDE 18

Inner Product Estimators

ν \ ν′    | PXY Z           | PXY PZ        | PXZPY         | PY ZPX        | PXPY PZ
PXY Z     | (K ◦ L ◦ M)++   | ((K ◦ L)M)++  | ((K ◦ M)L)++  | ((M ◦ L)K)++  | tr(K+ ◦ L+ ◦ M+)
PXY PZ    |                 | (K ◦ L)++ M++ | (MKL)++       | (KLM)++       | (KL)++ M++
PXZPY     |                 |               | (K ◦ M)++ L++ | (KML)++       | (KM)++ L++
PY ZPX    |                 |               |               | (L ◦ M)++ K++ | (LM)++ K++
PXPY PZ   |                 |               |               |               | K++ L++ M++

Table 1: V-statistic estimators of ⟨µκν, µκν′⟩Hκ (upper triangle shown; the table is symmetric).

slide-19
SLIDE 19

Inner Product Estimators

ν \ ν′    | PXY Z           | PXY PZ        | PXZPY         | PY ZPX        | PXPY PZ
PXY Z     | (K ◦ L ◦ M)++   | ((K ◦ L)M)++  | ((K ◦ M)L)++  | ((M ◦ L)K)++  | tr(K+ ◦ L+ ◦ M+)
PXY PZ    |                 | (K ◦ L)++ M++ | (MKL)++       | (KLM)++       | (KL)++ M++
PXZPY     |                 |               | (K ◦ M)++ L++ | (KML)++       | (KM)++ L++
PY ZPX    |                 |               |               | (L ◦ M)++ K++ | (LM)++ K++
PXPY PZ   |                 |               |               |               | K++ L++ M++

Table 2: V-statistic estimators of ⟨µκν, µκν′⟩Hκ. Combining the entries,

‖µκ(∆LP)‖²Hκ = (1/n²) (HKH ◦ HLH ◦ HMH)++

Empirical joint central moment in the feature space. (A small numerical sketch follows.)
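A minimal numpy sketch of the plug-in statistic above. The Gaussian kernel choice and the Exp(1/√2) parameterization in the example are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    """Gaussian kernel matrix for the rows of X."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma**2))

def lancaster_statistic(K, L, M):
    """Empirical ||mu_kappa(Delta_L P)||^2 = (1/n^2) (HKH o HLH o HMH)_{++}."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    Kc, Lc, Mc = H @ K @ H, H @ L @ H, H @ M @ H
    # "o" is the entrywise product; the ++ subscript sums all entries
    return (Kc * Lc * Mc).sum() / n**2

# Example in the spirit of Dataset A: X, Y ~ N(0,1), Z = sign(XY) * Exp
# (the exponential scale below is an illustrative assumption)
rng = np.random.default_rng(0)
n = 500
x = rng.standard_normal((n, 1))
y = rng.standard_normal((n, 1))
z = np.sign(x * y) * rng.exponential(scale=1 / np.sqrt(2), size=(n, 1))
stat = lancaster_statistic(gaussian_kernel(x), gaussian_kernel(y), gaussian_kernel(z))
print(stat)   # significance would be assessed with a permutation null
```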

slide-20
SLIDE 20

Example A: factorisation tests

(Figure: null acceptance rate (Type II error) vs. dimension 1–19, for the CI test X ⊥⊥ Y | Z, the Lancaster factorisation test, and the two-variable factorisation test; V-structure discovery, Dataset A)

Figure 2: Factorisation hypothesis: Lancaster statistic vs. a two-variable based test (both with HB correction); test for X ⊥⊥ Y | Z from Zhang et al. (2011), n = 500.

slide-21
SLIDE 21

Example B: Joint dependence can be easier to detect

  • X1, Y1 i.i.d. ∼ N(0, 1)
  • Z1 = X1² + ε w.p. 1/3,  Y1² + ε w.p. 1/3,  X1Y1 + ε w.p. 1/3,  where ε ∼ N(0, 0.1²)
  • X2:p, Y2:p, Z2:p i.i.d. ∼ N(0, Ip−1)
  • dependence of Z on the pair (X, Y) is stronger than on X and Y individually
  • Satisfies faithfulness
slide-22
SLIDE 22

Example B: factorisation tests

(Figure: null acceptance rate (Type II error) vs. dimension 1–19, for the CI test X ⊥⊥ Y | Z, the Lancaster factorisation test, and the two-variable factorisation test; V-structure discovery, Dataset B)

Figure 3: Factorisation hypothesis: Lancaster statistic vs. a two-variable based test (both with HB correction); test for X ⊥⊥ Y | Z from Zhang et al. (2011), n = 500.

slide-23
SLIDE 23

Interaction for D ≥ 4

  • An interaction measure valid for all D (Streitberg, 1990):

∆SP = Σπ (−1)^(|π|−1) (|π| − 1)! JπP

    – For a partition π, Jπ associates to the joint the corresponding factorisation, e.g., J13|2|4P = PX1X3 PX2 PX4.

slide-24
SLIDE 24

Interaction for D ≥ 4

  • An interaction measure valid for all D (Streitberg, 1990):

∆SP = Σπ (−1)^(|π|−1) (|π| − 1)! JπP

    – For a partition π, Jπ associates to the joint the corresponding factorisation, e.g., J13|2|4P = PX1X3 PX2 PX4.

slide-25
SLIDE 25

Interaction for D ≥ 4

  • An interaction measure valid for all D (Streitberg, 1990):

∆SP = Σπ (−1)^(|π|−1) (|π| − 1)! JπP

    – For a partition π, Jπ associates to the joint the corresponding factorisation, e.g., J13|2|4P = PX1X3 PX2 PX4.

(Figure: number of partitions of {1, . . . , D} vs. D; Bell-number growth)

joint central moments (Lancaster interaction) vs. joint cumulants (Streitberg interaction)

slide-26
SLIDE 26

Total independence test

  • Total independence test:

H0 : PXY Z = PXPY PZ   vs.   H1 : PXY Z ≠ PXPY PZ

slide-27
SLIDE 27

Total independence test

  • Total independence test:

H0 : PXY Z = PXPY PZ   vs.   H1 : PXY Z ≠ PXPY PZ

  • For (X1, . . . , XD) ∼ PX, and κ = ⊗_{i=1}^{D} k^(i):

‖µκ(P̂X − ∏_{i=1}^{D} P̂Xi)‖² = ‖µκ(∆tot P̂)‖²
  = (1/n²) Σ_{a=1}^{n} Σ_{b=1}^{n} ∏_{i=1}^{D} K^(i)_ab
  − (2/n^(D+1)) Σ_{a=1}^{n} ∏_{i=1}^{D} Σ_{b=1}^{n} K^(i)_ab
  + (1/n^(2D)) ∏_{i=1}^{D} Σ_{a=1}^{n} Σ_{b=1}^{n} K^(i)_ab.

  • Coincides with the test proposed by Kankainen (1995) using empirical characteristic functions: similar relationship to that between dCov and HSIC [Sejdinovic et al., 2013b]
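A minimal numpy sketch of the D-variable statistic above, assuming the kernel matrices K^(1), ..., K^(D) have already been computed (e.g. with a Gaussian kernel per variable):

```python
import numpy as np

def total_independence_statistic(kernel_mats):
    """V-statistic estimate of || mu_kappa( P_hat - prod_i P_hat_i ) ||^2
    for D variables, given a list of n x n kernel matrices K^(1), ..., K^(D)."""
    n = kernel_mats[0].shape[0]
    D = len(kernel_mats)
    Kprod = np.ones((n, n))
    for K in kernel_mats:
        Kprod *= K                                      # entrywise product over variables
    term1 = Kprod.sum() / n**2
    row_sums = [K.sum(axis=1) for K in kernel_mats]     # sum_b K^(i)_{ab} for each i
    term2 = 2.0 / n**(D + 1) * np.prod(np.stack(row_sums), axis=0).sum()
    term3 = np.prod([K.sum() for K in kernel_mats]) / n**(2 * D)
    return term1 - term2 + term3
```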

slide-28
SLIDE 28

Example B: total independence tests

(Figure: null acceptance rate (Type II error) vs. dimension 1–19 for the total independence test, ∆tot vs. ∆L; Dataset B)

Figure 4: Total independence: ∆tot P̂ vs. ∆L P̂, n = 500.

slide-29
SLIDE 29

Kernel dependence measures - in detail

slide-30
SLIDE 30

MMD for independence: HSIC

!"#$%&'()#)&*+$,#&-"#.&-"%(+*"&/$0#1&2',&

  • "#34%#&'#5#%&"266$#%&-"2'&7"#'&0(//(7$'*&

2'&$'-#%#)8'*&)9#'-:&!"#3&'##,&6/#'-3&(0& #;#%9$)#1&2<(+-&2'&"(+%&2&,23&$0&6())$</#:& =&/2%*#&2'$.2/&7"(&)/$'*)&)/(<<#%1&#;+,#)&2& ,$)8'985#&"(+',3&(,(%1&2',&72'-)&'(-"$'*&.(%#&

  • "2'&-(&0(//(7&"$)&'()#:&!"#3&'##,&2&)$*'$>92'-&

2.(+'-&(0&#;#%9$)#&2',&.#'-2/&)8.+/28(':& !#;-&0%(.&,(*8.#:9(.&2',&6#?$',#%:9(.& @'(7'&0(%&-"#$%&9+%$()$-31&$'-#//$*#'9#1&2',& #;9#//#'-&9(..+'$928('&&)A$//)1&-"#&B252'#)#& <%##,&$)&6#%0#9-&$0&3(+&72'-&2&%#)6(')$5#1&& $'-#%2985#&6#-1&('#&-"2-&7$//&</(7&$'&3(+%&#2%& 2',&0(//(7&3(+&#5#%37"#%#:&

!" #"

Empirical HSIC(PXY , PXPY ): 1 n2 (HKH ◦ HLH)++
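A minimal numpy sketch of the empirical HSIC above. The permutation calibration shown is one standard way to obtain a p-value; it is not spelled out on this slide:

```python
import numpy as np

def hsic_statistic(K, L):
    """Biased (V-statistic) estimate of HSIC: (1/n^2) (HKH o HLH)_{++}."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H
    Lc = H @ L @ H
    return (Kc * Lc).sum() / n**2

def hsic_permutation_pvalue(K, L, n_perm=200, seed=None):
    """p-value from permuting the Y sample (rows/cols of L) under H0: X independent of Y."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    observed = hsic_statistic(K, L)
    null = np.empty(n_perm)
    for b in range(n_perm):
        idx = rng.permutation(n)
        null[b] = hsic_statistic(K, L[np.ix_(idx, idx)])
    return (1 + np.sum(null >= observed)) / (1 + n_perm)
```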

slide-31
SLIDE 31

Covariance to reveal dependence

A more intuitive idea: maximize the covariance of smooth mappings:

COCO(P; F, G) := sup_{‖f‖F=1, ‖g‖G=1} (Ex,y[f(x)g(y)] − Ex[f(x)] Ey[g(y)])

slide-32
SLIDE 32

Covariance to reveal dependence

A more intuitive idea: maximize the covariance of smooth mappings:

COCO(P; F, G) := sup_{‖f‖F=1, ‖g‖G=1} (Ex,y[f(x)g(y)] − Ex[f(x)] Ey[g(y)])

(Scatter plot of X vs Y; correlation: −0.00)

slide-33
SLIDE 33

Covariance to reveal dependence

A more intuitive idea: maximize the covariance of smooth mappings:

COCO(P; F, G) := sup_{‖f‖F=1, ‖g‖G=1} (Ex,y[f(x)g(y)] − Ex[f(x)] Ey[g(y)])

(Scatter plot of X vs Y, correlation −0.00; dependence witness f(x) for X and g(y) for Y)

slide-34
SLIDE 34

Covariance to reveal dependence

A more intuitive idea: maximize the covariance of smooth mappings:

COCO(P; F, G) := sup_{‖f‖F=1, ‖g‖G=1} (Ex,y[f(x)g(y)] − Ex[f(x)] Ey[g(y)])

(Scatter plot of X vs Y, correlation −0.00; dependence witnesses f(x), g(y); f(X) vs g(Y), correlation −0.90, COCO: 0.14)

slide-35
SLIDE 35

Covariance to reveal dependence

A more intuitive idea: maximize the covariance of smooth mappings:

COCO(P; F, G) := sup_{‖f‖F=1, ‖g‖G=1} (Ex,y[f(x)g(y)] − Ex[f(x)] Ey[g(y)])

(Scatter plot of X vs Y, correlation −0.00; dependence witnesses f(x), g(y); f(X) vs g(Y), correlation −0.90, COCO: 0.14)

How do we define covariance in (infinite) feature spaces?

slide-36
SLIDE 36

Covariance to reveal dependence

Covariance in RKHS: Let’s first look at finite linear case. We have two random vectors x ∈ Rd, y ∈ Rd′. Are they linearly dependent?

slide-37
SLIDE 37

Covariance to reveal dependence

Covariance in RKHS: Let's first look at the finite linear case. We have two random vectors x ∈ Rd, y ∈ Rd′. Are they linearly dependent? Compute their covariance matrix (ignore centering):

Cxy = E[xy⊤]

How to get a single "summary" number?

slide-38
SLIDE 38

Covariance to reveal dependence

Covariance in RKHS: Let's first look at the finite linear case. We have two random vectors x ∈ Rd, y ∈ Rd′. Are they linearly dependent? Compute their covariance matrix (ignore centering):

Cxy = E[xy⊤]

How to get a single "summary" number? Solve for vectors f ∈ Rd, g ∈ Rd′:

argmax_{‖f‖=1, ‖g‖=1} f⊤Cxy g = argmax_{‖f‖=1, ‖g‖=1} Exy[(f⊤x)(g⊤y)] = argmax_{‖f‖=1, ‖g‖=1} Ex,y[f(x)g(y)] = argmax_{‖f‖=1, ‖g‖=1} cov(f(x), g(y))

(the maximum singular value of Cxy)

slide-39
SLIDE 39

Challenges in defining feature space covariance

Given features φ(x) ∈ F and ψ(y) ∈ G:

Challenge 1: Can we define a feature space analog to xy⊤? YES:

  • Given f ∈ Rd, g ∈ Rd′, h ∈ Rd′, define the matrix fg⊤ such that (fg⊤)h = f(g⊤h).
  • Given f ∈ F, g ∈ G, h ∈ G, define the tensor product operator f ⊗ g such that (f ⊗ g)h = f ⟨g, h⟩G.
  • Now just set f := φ(x), g := ψ(y), to get xy⊤ → φ(x) ⊗ ψ(y)
slide-40
SLIDE 40

Challenges in defining feature space covariance

Given features φ(x) ∈ F and ψ(y) ∈ G:

Challenge 2: Does a covariance "matrix" (operator) in feature space exist? I.e. is there some CXY : G → F such that

⟨f, CXY g⟩F = Ex,y[f(x)g(y)] = cov(f(x), g(y))

slide-41
SLIDE 41

Challenges in defining feature space covariance

Given features φ(x) ∈ F and ψ(y) ∈ G:

Challenge 2: Does a covariance "matrix" (operator) in feature space exist? I.e. is there some CXY : G → F such that

⟨f, CXY g⟩F = Ex,y[f(x)g(y)] = cov(f(x), g(y))

YES: via a Bochner integrability argument (as with the mean embedding). Under the condition Ex,y[k(x, x) l(y, y)] < ∞, we can define

CXY := Ex,y[φ(x) ⊗ ψ(y)],

which is a Hilbert-Schmidt operator (the sum of its squared singular values is finite).

slide-42
SLIDE 42

REMINDER: functions revealing dependence

COCO(P; F, G) := sup_{‖f‖F=1, ‖g‖G=1} (Ex,y[f(x)g(y)] − Ex[f(x)] Ey[g(y)])

(Scatter plots: X vs Y, correlation −0.00; dependence witnesses f(x), g(y); f(X) vs g(Y), correlation −0.90, COCO: 0.14)

How do we compute this from finite data?

slide-43
SLIDE 43

Empirical covariance operator

The empirical covariance given z := ((xi, yi))_{i=1}^{n} (now include centering):

ĈXY := (1/n) Σ_{i=1}^{n} φ(xi) ⊗ ψ(yi) − µ̂x ⊗ µ̂y,   where µ̂x := (1/n) Σ_{i=1}^{n} φ(xi).

More concisely,

ĈXY = (1/n) X H Y⊤,

where H = In − n⁻¹ 1n, 1n is an n × n matrix of ones, and

X = [φ(x1) . . . φ(xn)],   Y = [ψ(y1) . . . ψ(yn)].

Define the kernel matrices Kij = (X⊤X)ij = k(xi, xj) and Lij = l(yi, yj).

slide-44
SLIDE 44

Functions revealing dependence

Optimization problem:

COCO(z; F, G) := max ⟨f, ĈXY g⟩F   subject to ‖f‖F ≤ 1, ‖g‖G ≤ 1.

Assume

f = Σ_{i=1}^{n} αi [φ(xi) − µ̂x] = X H α,   g = Σ_{j=1}^{n} βj [ψ(yj) − µ̂y] = Y H β.

The associated Lagrangian is

L(f, g, λ, γ) = f⊤ ĈXY g − (λ/2)(‖f‖²F − 1) − (γ/2)(‖g‖²G − 1).
slide-45
SLIDE 45

Covariance to reveal dependence

  • Empirical COCO(z; F, G) is the largest eigenvalue γ of the generalized eigenvalue problem

[ 0  (1/n) K̃L̃ ; (1/n) L̃K̃  0 ] [α; β] = γ [ K̃  0 ; 0  L̃ ] [α; β].

K̃ and L̃ are matrices of inner products between centred observations in the respective feature spaces:

K̃ = HKH,   where H = I − (1/n) 1 1⊤.

slide-46
SLIDE 46

Covariance to reveal dependence

  • Empirical COCO(z; F, G) is the largest eigenvalue γ of the generalized eigenvalue problem

[ 0  (1/n) K̃L̃ ; (1/n) L̃K̃  0 ] [α; β] = γ [ K̃  0 ; 0  L̃ ] [α; β].

K̃ and L̃ are matrices of inner products between centred observations in the respective feature spaces:

K̃ = HKH,   where H = I − (1/n) 1 1⊤.

  • Mapping function for x (a small numerical sketch follows):

f(x) = Σ_{i=1}^{n} αi [ k(xi, x) − (1/n) Σ_{j=1}^{n} k(xj, x) ]
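A minimal numpy sketch of the empirical COCO. Rather than solving the generalized eigenvalue problem directly, it uses the equivalent value COCO = (1/n)·√λmax(K̃L̃), which follows from substituting a = K̃^{1/2}α, b = L̃^{1/2}β into the constrained problem; that shortcut is my own rewriting, not taken from the slide:

```python
import numpy as np

def empirical_coco(K, L):
    """Empirical COCO from kernel matrices K (on x) and L (on y)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc, Lc = H @ K @ H, H @ L @ H            # centred kernel matrices
    # Kc @ Lc is similar to the symmetric psd matrix Kc^{1/2} Lc Kc^{1/2},
    # so its largest eigenvalue is real and non-negative.
    lam_max = np.max(np.linalg.eigvals(Kc @ Lc).real)
    return np.sqrt(max(lam_max, 0.0)) / n
```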

slide-47
SLIDE 47

Hard-to-detect dependence

(Figure: a smooth density and a rough density of the form Px,y ∝ 1 + sin(ωx) sin(ωy), each shown with 500 samples)

The density takes the form: Px,y ∝ 1 + sin(ωx) sin(ωy)

slide-48
SLIDE 48

Hard-to-detect dependence

  • Example: sinusoids of increasing frequency

(Figure: COCO (empirical average, 1500 samples) vs. frequency ω of the non-constant density component, ω = 1, . . . , 6)

slide-49
SLIDE 49

Hard-to-detect dependence

COCO vs frequency of perturbation from independence.

slide-50
SLIDE 50

Hard-to-detect dependence

COCO vs frequency of perturbation from independence. Case of ω = 1

(Scatter of X vs Y, correlation 0.27; dependence witnesses f, g; f(X) vs g(Y), correlation −0.50, COCO: 0.09)

slide-51
SLIDE 51

Hard-to-detect dependence

COCO vs frequency of perturbation from independence. Case of ω = 2

(Scatter of X vs Y, correlation 0.04; dependence witnesses f, g; f(X) vs g(Y), correlation 0.51, COCO: 0.07)

slide-52
SLIDE 52

Hard-to-detect dependence

COCO vs frequency of perturbation from independence. Case of ω = 3

(Scatter of X vs Y, correlation 0.03; dependence witnesses f, g; f(X) vs g(Y), correlation −0.45, COCO: 0.03)

slide-53
SLIDE 53

Hard-to-detect dependence

COCO vs frequency of perturbation from independence. Case of ω = 4

(Scatter of X vs Y, correlation 0.03; dependence witnesses f, g; f(X) vs g(Y), correlation 0.21, COCO: 0.02)

slide-54
SLIDE 54

Hard-to-detect dependence

COCO vs frequency of perturbation from independence. Case of ω = ??

(Scatter of X vs Y, correlation 0.00; dependence witnesses f, g; f(X) vs g(Y), correlation −0.13, COCO: 0.02)

slide-55
SLIDE 55

Hard-to-detect dependence

COCO vs frequency of perturbation from independence. Case of uniform noise! This bias will decrease with increasing sample size.

(Scatter of X vs Y, correlation 0.00; dependence witnesses f, g; f(X) vs g(Y), correlation −0.13, COCO: 0.02)

slide-56
SLIDE 56

Hard-to-detect dependence

COCO vs frequency of perturbation from independence.

  • As dependence is encoded at higher frequencies, the smooth mappings f, g achieve lower linear covariance.
  • Even for independent variables, COCO will not be zero at finite sample sizes, since some mild linear dependence will be induced by f, g (bias).
  • This bias will decrease with increasing sample size.
slide-57
SLIDE 57

More functions revealing dependence

  • Can we do better than COCO?
slide-58
SLIDE 58

More functions revealing dependence

  • Can we do better than COCO?
  • A second example with zero correlation

(Scatter of X vs Y, correlation 0; dependence witnesses f, g; f(X) vs g(Y), correlation −0.80, COCO: 0.11)

slide-59
SLIDE 59

More functions revealing dependence

  • Can we do better than COCO?
  • A second example with zero correlation

(Scatter of X vs Y, correlation 0; second dependence witnesses f2(x), g2(y))

slide-60
SLIDE 60

More functions revealing dependence

  • Can we do better than COCO?
  • A second example with zero correlation

(Scatter of X vs Y, correlation 0; second dependence witnesses f2, g2; f2(X) vs g2(Y), correlation −0.37, COCO2: 0.06)

slide-61
SLIDE 61

Hilbert-Schmidt Independence Criterion

  • Given γi := COCOi(z; F, G), define the Hilbert-Schmidt Independence Criterion (HSIC) [ALT05, NIPS07a, JMLR10]:

HSIC(z; F, G) := Σ_{i=1}^{n} γi²

slide-62
SLIDE 62

Hilbert-Schmidt Independence Criterion

  • Given γi := COCOi(z; F, G), define the Hilbert-Schmidt Independence Criterion (HSIC) [ALT05, NIPS07a, JMLR10]:

HSIC(z; F, G) := Σ_{i=1}^{n} γi²

  • In the limit of infinite samples:

HSIC(P; F, G) := ‖Cxy‖²HS = ⟨Cxy, Cxy⟩HS
  = Ex,x′,y,y′[k(x, x′) l(y, y′)] + Ex,x′[k(x, x′)] Ey,y′[l(y, y′)] − 2 Ex,y[ Ex′[k(x, x′)] Ey′[l(y, y′)] ]

where x′ is an independent copy of x, and y′ a copy of y. HSIC is identical to MMD(PXY, PXPY).

slide-63
SLIDE 63

When does HSIC determine independence?

Theorem: When kernels k and l are each characteristic, then HSIC = 0 iff Px,y = PxPy [Gretton, 2015]. Weaker than MMD condition (which requires a kernel characteristic on X × Y to distinguish Px,y from Qx,y).

slide-64
SLIDE 64

Intuition: why characteristic needed on both X and Y

Question: Wouldn't it be enough just to use a rich mapping from X to Y, e.g. via ridge regression with characteristic F:

f∗ = arg min_{f∈F} ( EXY (Y − ⟨f, φ(X)⟩F)² + λ‖f‖²F ),
slide-65
SLIDE 65

Intuition: why characteristic needed on both X and Y

Question: Wouldn't it be enough just to use a rich mapping from X to Y, e.g. via ridge regression with characteristic F:

f∗ = arg min_{f∈F} ( EXY (Y − ⟨f, φ(X)⟩F)² + λ‖f‖²F ),

Counterexample: a density symmetric about the x-axis, s.t. p(x, y) = p(x, −y)

(Scatter of X vs Y, correlation −0.00)
slide-66
SLIDE 66

Energy Distance and the MMD

slide-67
SLIDE 67

Energy distance and MMD

Distance between probability distributions:

Energy distance [Baringhaus and Franz, 2004, Székely and Rizzo, 2004, 2005]:

DE(P, Q) = EP‖X − X′‖^q + EQ‖Y − Y′‖^q − 2 EP,Q‖X − Y‖^q,   0 < q ≤ 2

Maximum mean discrepancy [Gretton et al., 2007, Smola et al., 2007, Gretton et al., 2012]:

MMD²(P, Q; F) = EP k(X, X′) + EQ k(Y, Y′) − 2 EP,Q k(X, Y)

slide-68
SLIDE 68

Energy distance and MMD

Distance between probability distributions:

Energy distance [Baringhaus and Franz, 2004, Székely and Rizzo, 2004, 2005]:

DE(P, Q) = EP‖X − X′‖^q + EQ‖Y − Y′‖^q − 2 EP,Q‖X − Y‖^q,   0 < q ≤ 2

Maximum mean discrepancy [Gretton et al., 2007, Smola et al., 2007, Gretton et al., 2012]:

MMD²(P, Q; F) = EP k(X, X′) + EQ k(Y, Y′) − 2 EP,Q k(X, Y)

Energy distance is MMD with a particular kernel!

[Sejdinovic et al., 2013b]

slide-69
SLIDE 69

Distance covariance and HSIC

Distance covariance (0 < q, r ≤ 2) [Feuerverger, 1993, Székely et al., 2007]:

V²(X, Y) = EXY EX′Y′[ ‖X − X′‖^q ‖Y − Y′‖^r ] + EX EX′‖X − X′‖^q EY EY′‖Y − Y′‖^r − 2 EXY[ EX′‖X − X′‖^q EY′‖Y − Y′‖^r ]

Hilbert-Schmidt Independence Criterion [Gretton et al., 2005, Smola et al., 2007, Gretton et al., 2008, Gretton and Gyorfi, 2010]: Define an RKHS F on X with kernel k, and an RKHS G on Y with kernel l. Then

HSIC(PXY, PXPY) = EXY EX′Y′[ k(X, X′) l(Y, Y′) ] + EX EX′ k(X, X′) EY EY′ l(Y, Y′) − 2 EX′Y′[ EX k(X, X′) EY l(Y, Y′) ].
slide-70
SLIDE 70

Distance covariance and HSIC

Distance covariance (0 < q, r ≤ 2) [Feuerverger, 1993, Székely et al., 2007]:

V²(X, Y) = EXY EX′Y′[ ‖X − X′‖^q ‖Y − Y′‖^r ] + EX EX′‖X − X′‖^q EY EY′‖Y − Y′‖^r − 2 EXY[ EX′‖X − X′‖^q EY′‖Y − Y′‖^r ]

Hilbert-Schmidt Independence Criterion [Gretton et al., 2005, Smola et al., 2007, Gretton et al., 2008, Gretton and Gyorfi, 2010]: Define an RKHS F on X with kernel k, and an RKHS G on Y with kernel l. Then

HSIC(PXY, PXPY) = EXY EX′Y′[ k(X, X′) l(Y, Y′) ] + EX EX′ k(X, X′) EY EY′ l(Y, Y′) − 2 EX′Y′[ EX k(X, X′) EY l(Y, Y′) ].

Distance covariance is HSIC with particular kernels!

[Sejdinovic et al., 2013b]

slide-71
SLIDE 71

Semimetrics and Hilbert spaces

Theorem [Berg et al., 1984, Lemma 2.1, p. 74]: Let ρ : Z × Z → R be a semimetric (no triangle inequality required) on Z. Let z0 ∈ Z, and denote kρ(z, z′) = ρ(z, z0) + ρ(z′, z0) − ρ(z, z′). Then kρ is positive definite (and, via Moore-Aronszajn, defines a unique RKHS) iff ρ is of negative type. Call kρ a distance-induced kernel.

Negative type: The semimetric space (Z, ρ) is said to have negative type if ∀n ≥ 2, z1, . . . , zn ∈ Z, and α1, . . . , αn ∈ R with Σ_{i=1}^{n} αi = 0,

Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj ρ(zi, zj) ≤ 0.   (1)

slide-72
SLIDE 72

Semimetrics and Hilbert spaces

Theorem [Berg et al., 1984, Lemma 2.1, p. 74]: Let ρ : Z × Z → R be a semimetric (no triangle inequality required) on Z. Let z0 ∈ Z, and denote kρ(z, z′) = ρ(z, z0) + ρ(z′, z0) − ρ(z, z′). Then kρ is positive definite (and, via Moore-Aronszajn, defines a unique RKHS) iff ρ is of negative type. Call kρ a distance-induced kernel.

Special case: Z ⊆ Rd and ρq(z, z′) = ‖z − z′‖^q. Then ρq is a valid semimetric of negative type for 0 < q ≤ 2.
slide-73
SLIDE 73

Semimetrics and Hilbert spaces

Theorem [Berg et al., 1984, Lemma 2.1, p. 74]: Let ρ : Z × Z → R be a semimetric (no triangle inequality required) on Z. Let z0 ∈ Z, and denote kρ(z, z′) = ρ(z, z0) + ρ(z′, z0) − ρ(z, z′). Then kρ is positive definite (and, via Moore-Aronszajn, defines a unique RKHS) iff ρ is of negative type. Call kρ a distance-induced kernel.

Special case: Z ⊆ Rd and ρq(z, z′) = ‖z − z′‖^q. Then ρq is a valid semimetric of negative type for 0 < q ≤ 2.

Energy distance is MMD with a distance-induced kernel. Distance covariance is HSIC with distance-induced kernels. (A small sketch of kρ follows.)
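A minimal numpy sketch of the distance-induced kernel kρ above; plugging the resulting kernel matrices into the MMD/HSIC estimators earlier in the lecture recovers (up to constants) the energy distance and distance covariance:

```python
import numpy as np

def distance_induced_kernel(Z, z0=None, q=1.0):
    """Kernel k_rho(z, z') = rho(z, z0) + rho(z', z0) - rho(z, z'),
    with rho(z, z') = ||z - z'||^q for 0 < q <= 2 (negative type).
    Returns the kernel matrix on the rows of Z."""
    if z0 is None:
        z0 = np.zeros(Z.shape[1])               # any fixed reference point works
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2) ** q   # pairwise rho
    d0 = np.linalg.norm(Z - z0, axis=1) ** q                         # rho(z_i, z0)
    return d0[:, None] + d0[None, :] - d
```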

slide-74
SLIDE 74

Two-sample testing benchmark

Two-sample testing example in 1-D:

(Figure: density P(X) vs. three candidate densities Q(X))

slide-75
SLIDE 75

Two-sample test, MMD with distance kernel

Obtain more powerful tests on this problem when q = 1 (exponent of distance) Key:

  • Gaussian kernel
  • q = 1
  • Best: q = 1/3
  • Worst: q = 2
slide-76
SLIDE 76

Nonparametric Bayesian inference using distribution embeddings

slide-77
SLIDE 77

Motivating Example: Bayesian inference without a model

  • 3600 downsampled frames of 20 × 20 RGB pixels (Yt ∈ [0, 1]^1200)
  • 1800 training frames, remaining for test.
  • Gaussian noise added to Yt.

Challenges:

  • No parametric model of camera dynamics (only samples)
  • No parametric model of map from camera angle to image (only samples)
  • Want to do filtering: Bayesian inference
slide-78
SLIDE 78

ABC: an approach to Bayesian inference without a model

Bayes' rule: P(y|x) = P(x|y) π(y) / ∫ P(x|y) π(y) dy

  • P(x|y) is the likelihood
  • π(y) is the prior

One approach: Approximate Bayesian Computation (ABC)

slide-79
SLIDE 79

ABC: an approach to Bayesian inference without a model

Approximate Bayesian Computation (ABC):

(Figure, ABC demonstration: samples y ∼ π(Y) from the prior and x ∼ P(X | y) from the likelihood)

slide-80
SLIDE 80

ABC: an approach to Bayesian inference without a model

Approximate Bayesian Computation (ABC):

(Figure, ABC demonstration: prior and likelihood samples, with the observation x∗ marked)

slide-81
SLIDE 81

ABC: an approach to Bayesian inference without a model

Approximate Bayesian Computation (ABC):

(Figure, ABC demonstration: prior and likelihood samples, with the observation x∗ marked)

slide-82
SLIDE 82

ABC: an approach to Bayesian inference without a model

Approximate Bayesian Computation (ABC):

(Figure, ABC demonstration: prior and likelihood samples, with the observation x∗ marked)

slide-83
SLIDE 83

ABC: an approach to Bayesian inference without a model

Approximate Bayesian Computation (ABC):

(Figure, ABC demonstration: prior and likelihood samples, the observation x∗, and the resulting approximate posterior sample P̂(Y | x∗))

slide-84
SLIDE 84

ABC: an approach to Bayesian inference without a model

Approximate Bayesian Computation (ABC):

(Figure, ABC demonstration: prior and likelihood samples, the observation x∗, and the resulting approximate posterior sample P̂(Y | x∗))

Needed: distance measure D, tolerance parameter τ.

slide-85
SLIDE 85

ABC: an approach to Bayesian inference without a model

Bayes' rule: P(y|x) = P(x|y) π(y) / ∫ P(x|y) π(y) dy

  • P(x|y) is the likelihood
  • π(y) is the prior

ABC generates a sample from P(Y | x∗) as follows:

  1. generate a sample yt from the prior π,
  2. generate a sample xt from P(X | yt),
  3. if D(x∗, xt) < τ, accept y = yt; otherwise reject,
  4. go to (1).

In step (3), D is a distance measure, and τ is a tolerance parameter. (A minimal sketch follows.)
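A minimal sketch of steps (1)–(4); the prior, likelihood, distance, and tolerance in the example are toy placeholders, not taken from the slides:

```python
import numpy as np

def abc_rejection(x_star, sample_prior, sample_likelihood, distance, tol, n_draws=1000, seed=None):
    """Rejection ABC: keep y_t whenever D(x*, x_t) < tol."""
    rng = np.random.default_rng(seed)
    accepted = []
    for _ in range(n_draws):
        y_t = sample_prior(rng)                 # step 1: draw from the prior
        x_t = sample_likelihood(y_t, rng)       # step 2: simulate data given y_t
        if distance(x_star, x_t) < tol:         # step 3: accept if close enough
            accepted.append(y_t)
    return np.array(accepted)                   # approximate sample from P(Y | x*)

# Toy example: y ~ N(0, 3^2) prior, x | y ~ N(y, 1)
posterior_sample = abc_rejection(
    x_star=1.5,
    sample_prior=lambda rng: rng.normal(0.0, 3.0),
    sample_likelihood=lambda y, rng: rng.normal(y, 1.0),
    distance=lambda a, b: abs(a - b),
    tol=0.2,
)
```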

slide-86
SLIDE 86

Motivating example 2: simple Gaussian case

  • p(x, y) is N((0, 1d⊤)⊤, V) with V a randomly generated covariance

Posterior mean on x: ABC vs kernel approach

(Figure: CPU time (sec) vs. mean square error in 6 dimensions, for KBI, COND, and ABC at increasing sample sizes)

slide-87
SLIDE 87

Bayes again

Bayes' rule: P(y|x) = P(x|y) π(y) / ∫ P(x|y) π(y) dy

  • P(x|y) is the likelihood
  • π is the prior

How would this look with kernel embeddings?

slide-88
SLIDE 88

Bayes again

Bayes' rule: P(y|x) = P(x|y) π(y) / ∫ P(x|y) π(y) dy

  • P(x|y) is the likelihood
  • π is the prior

How would this look with kernel embeddings? Define an RKHS G on Y with feature map ψy and kernel l(y, ·). We need a conditional mean embedding: for all g ∈ G,

EY|x∗ g(Y) = ⟨g, µP(y|x∗)⟩G

This will be obtained by RKHS-valued ridge regression.

slide-89
SLIDE 89

Ridge regression and the conditional feature mean

Ridge regression from X := Rd to a finite vector output Y := Rd′ (these could be d′ nonlinear features of y). Define training data

X = [x1 . . . xm] ∈ Rd×m,   Y = [y1 . . . ym] ∈ Rd′×m
slide-90
SLIDE 90

Ridge regression and the conditional feature mean

Ridge regression from X := Rd to a finite vector output Y := Rd′ (these could be d′ nonlinear features of y). Define training data

X = [x1 . . . xm] ∈ Rd×m,   Y = [y1 . . . ym] ∈ Rd′×m

Solve

Ă = arg min_{A∈Rd′×d} ( ‖Y − AX‖² + λ‖A‖²HS ),   where ‖A‖²HS = tr(A⊤A) = Σ_{i=1}^{min{d,d′}} γ²A,i

slide-91
SLIDE 91

Ridge regression and the conditional feature mean

Ridge regression from X := Rd to a finite vector output Y := Rd′ (these could be d′ nonlinear features of y). Define training data

X = [x1 . . . xm] ∈ Rd×m,   Y = [y1 . . . ym] ∈ Rd′×m

Solve

Ă = arg min_{A∈Rd′×d} ( ‖Y − AX‖² + λ‖A‖²HS ),   where ‖A‖²HS = tr(A⊤A) = Σ_{i=1}^{min{d,d′}} γ²A,i

Solution: Ă = CY X (CXX + mλI)−1

slide-92
SLIDE 92

Ridge regression and the conditional feature mean

Prediction at a new point x:

y∗ = Ăx = CY X (CXX + mλI)−1 x = Σ_{i=1}^{m} βi(x) yi

where β(x) = (K + λmI)−1 [k(x1, x) . . . k(xm, x)]⊤, K := X⊤X, and k(x1, x) = x1⊤x.

slide-93
SLIDE 93

Ridge regression and the conditional feature mean

Prediction at a new point x:

y∗ = Ăx = CY X (CXX + mλI)−1 x = Σ_{i=1}^{m} βi(x) yi

where β(x) = (K + λmI)−1 [k(x1, x) . . . k(xm, x)]⊤, K := X⊤X, and k(x1, x) = x1⊤x.

What if we do everything in kernel space?

slide-94
SLIDE 94

Ridge regression and the conditional feature mean

Recall our setup:

  • Given training pairs: (xi, yi) ∼ PXY
  • F on X with feature map φx and kernel k(x, ·)
  • G on Y with feature map ψy and kernel l(y, ·)

We define the covariance between feature maps:

CXX = EX(φX ⊗ φX),   CXY = EXY(φX ⊗ ψY)

and matrices of feature-mapped training data

X = [φx1 . . . φxm],   Y := [ψy1 . . . ψym]
slide-95
SLIDE 95

Ridge regression and the conditional feature mean

Objective [Weston et al. (2003), Micchelli and Pontil (2005), Caponnetto and De Vito (2007), Grunewalder et al. (2012, 2013)]:

Ă = arg min_{A∈HS(F,G)} ( EXY ‖ψY − AφX‖²G + λ‖A‖²HS ),   ‖A‖²HS = Σ_{i=1}^{∞} γ²A,i

Solution, same as in the vector case:

Ă = CY X (CXX + mλI)−1

Prediction at a new x, using kernels:

Ăφx = [ψy1 . . . ψym] (K + λmI)−1 [k(x1, x) . . . k(xm, x)]⊤ = Σ_{i=1}^{m} βi(x) ψyi,   where Kij = k(xi, xj)
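A minimal numpy sketch of the conditional mean embedding prediction above; here K is the kernel matrix on the training inputs, k_x the kernel evaluations at a test point, and g_vals the vector (g(y1), . . . , g(ym)) — all names are illustrative:

```python
import numpy as np

def conditional_mean_weights(K, k_x, lam):
    """beta(x) = (K + lam*m*I)^{-1} [k(x_1,x), ..., k(x_m,x)]^T."""
    m = K.shape[0]
    return np.linalg.solve(K + lam * m * np.eye(m), k_x)

def conditional_expectation(g_vals, K, k_x, lam):
    """Estimate E[g(Y) | X = x] as sum_i beta_i(x) g(y_i),
    i.e. <g, mu_{Y|x}> with mu_{Y|x} = sum_i beta_i(x) psi_{y_i}."""
    return g_vals @ conditional_mean_weights(K, k_x, lam)
```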

slide-96
SLIDE 96

Ridge regression and the conditional feature mean

How is the loss ‖Y − AX‖²G relevant to the conditional expectation of some EY|x g(Y)? Define [Song et al. (2009), Grunewalder et al. (2013)]:

µY|x := Aφx

slide-97
SLIDE 97

Ridge regression and the conditional feature mean

How is the loss ‖Y − AX‖²G relevant to the conditional expectation of some EY|x g(Y)? Define [Song et al. (2009), Grunewalder et al. (2013)]:

µY|x := Aφx

We need A to have the property

EY|x g(Y) ≈ ⟨g, µY|x⟩G = ⟨g, Aφx⟩G = ⟨A∗g, φx⟩F = (A∗g)(x)

slide-98
SLIDE 98

Ridge regression and the conditional feature mean

How is the loss ‖Y − AX‖²G relevant to the conditional expectation of some EY|x g(Y)? Define [Song et al. (2009), Grunewalder et al. (2013)]:

µY|x := Aφx

We need A to have the property

EY|x g(Y) ≈ ⟨g, µY|x⟩G = ⟨g, Aφx⟩G = ⟨A∗g, φx⟩F = (A∗g)(x)

A natural risk function for the conditional mean:

L(A, PXY) := sup_{‖g‖≤1} EX [ (EY|X g(Y))(X) − (A∗g)(X) ]²,

where (EY|X g(Y))(·) is the target and (A∗g)(·) is the estimator.

slide-99
SLIDE 99

Ridge regression and the conditional feature mean

The squared-loss risk provides an upper bound on the natural risk:

L(A, PXY) ≤ EXY ‖ψY − AφX‖²G

slide-100
SLIDE 100

Ridge regression and the conditional feature mean

The squared-loss risk provides an upper bound on the natural risk:

L(A, PXY) ≤ EXY ‖ψY − AφX‖²G

Proof: Jensen and Cauchy-Schwarz.

L(A, PXY) := sup_{‖g‖≤1} EX [ (EY|X g(Y))(X) − (A∗g)(X) ]²
  ≤ EXY sup_{‖g‖≤1} [ g(Y) − (A∗g)(X) ]²

slide-101
SLIDE 101

Ridge regression and the conditional feature mean

The squared-loss risk provides an upper bound on the natural risk:

L(A, PXY) ≤ EXY ‖ψY − AφX‖²G

Proof: Jensen and Cauchy-Schwarz.

L(A, PXY) := sup_{‖g‖≤1} EX [ (EY|X g(Y))(X) − (A∗g)(X) ]²
  ≤ EXY sup_{‖g‖≤1} [ g(Y) − (A∗g)(X) ]²
  = EXY sup_{‖g‖≤1} [ ⟨g, ψY⟩G − ⟨A∗g, φX⟩F ]²

slide-102
SLIDE 102

Ridge regression and the conditional feature mean

The squared-loss risk provides an upper bound on the natural risk:

L(A, PXY) ≤ EXY ‖ψY − AφX‖²G

Proof: Jensen and Cauchy-Schwarz.

L(A, PXY) := sup_{‖g‖≤1} EX [ (EY|X g(Y))(X) − (A∗g)(X) ]²
  ≤ EXY sup_{‖g‖≤1} [ g(Y) − (A∗g)(X) ]²
  = EXY sup_{‖g‖≤1} [ ⟨g, ψY⟩G − ⟨g, AφX⟩G ]²

slide-103
SLIDE 103

Ridge regression and the conditional feature mean

The squared-loss risk provides an upper bound on the natural risk:

L(A, PXY) ≤ EXY ‖ψY − AφX‖²G

Proof: Jensen and Cauchy-Schwarz.

L(A, PXY) := sup_{‖g‖≤1} EX [ (EY|X g(Y))(X) − (A∗g)(X) ]²
  ≤ EXY sup_{‖g‖≤1} [ g(Y) − (A∗g)(X) ]²
  = EXY sup_{‖g‖≤1} ⟨g, ψY − AφX⟩²G

slide-104
SLIDE 104

Ridge regression and the conditional feature mean

The squared-loss risk provides an upper bound on the natural risk:

L(A, PXY) ≤ EXY ‖ψY − AφX‖²G

Proof: Jensen and Cauchy-Schwarz.

L(A, PXY) := sup_{‖g‖≤1} EX [ (EY|X g(Y))(X) − (A∗g)(X) ]²
  ≤ EXY sup_{‖g‖≤1} [ g(Y) − (A∗g)(X) ]²
  = EXY sup_{‖g‖≤1} ⟨g, ψY − AφX⟩²G
  ≤ EXY ‖ψY − AφX‖²G

slide-105
SLIDE 105

Ridge regression and the conditional feature mean

The squared-loss risk provides an upper bound on the natural risk:

L(A, PXY) ≤ EXY ‖ψY − AφX‖²G

Proof: Jensen and Cauchy-Schwarz.

L(A, PXY) := sup_{‖g‖≤1} EX [ (EY|X g(Y))(X) − (A∗g)(X) ]²
  ≤ EXY sup_{‖g‖≤1} [ g(Y) − (A∗g)(X) ]²
  = EXY sup_{‖g‖≤1} ⟨g, ψY − AφX⟩²G
  ≤ EXY ‖ψY − AφX‖²G

If we assume EY[g(Y)|X = x] ∈ F, then the upper bound is tight (next slide).

slide-106
SLIDE 106

Conditions for ridge regression = conditional mean

The conditional mean is obtained by ridge regression when EY[g(Y)|X = x] ∈ F.

Given a function g ∈ G, assume EY|X[g(Y)|X = ·] ∈ F. Then

CXX EY|X[g(Y)|X = ·] = CXY g.

Why this is useful:

EY|X[g(Y)|X = x] = ⟨EY|X[g(Y)|X = ·], φx⟩F = ⟨C−1XX CXY g, φx⟩F = ⟨g, CY X C−1XX φx⟩G,

where CY X C−1XX is the regression operator.
slide-107
SLIDE 107

Conditions for ridge regression = conditional mean

The conditional mean is obtained by ridge regression when EY[g(Y)|X = x] ∈ F.

Given a function g ∈ G, assume EY|X[g(Y)|X = ·] ∈ F. Then CXX EY|X[g(Y)|X = ·] = CXY g.

Proof [Fukumizu et al., 2004]: For all f ∈ F, by definition of CXX,

⟨f, CXX EY|X[g(Y)|X = ·]⟩F = cov( f, EY|X[g(Y)|X = ·] ) = EX( f(X) EY|X[g(Y)|X] ) = EXY( f(X) g(Y) ) = ⟨f, CXY g⟩,

by definition of CXY.

slide-108
SLIDE 108

Kernel Bayes’ law

  • Prior: Y ∼ π(y)
  • Likelihood: (X|y) ∼ P(x|y) with some joint P(x, y)
slide-109
SLIDE 109

Kernel Bayes’ law

  • Prior: Y ∼ π(y)
  • Likelihood: (X|y) ∼ P(x|y), with some joint P(x, y)
  • Joint distribution: Q(x, y) = P(x|y) π(y)

Warning: Q ≠ P (change of measure from P(y) to π(y))

  • Marginal for x: Q(x) := ∫ P(x|y) π(y) dy.
  • Bayes' law: Q(y|x) = P(x|y) π(y) / Q(x)

slide-110
SLIDE 110

Kernel Bayes’ law

  • Posterior embedding via the usual conditional update:

µQ(y|x) = CQ(y,x) C−1Q(x,x) φx.

slide-111
SLIDE 111

Kernel Bayes’ law

  • Posterior embedding via the usual conditional update:

µQ(y|x) = CQ(y,x) C−1Q(x,x) φx.

  • Given the mean embedding of the prior: µπ(y)
  • Define the marginal covariance:

CQ(x,x) = ∫ (φx ⊗ φx) P(x|y) π(y) dx = C(xx)y C−1yy µπ(y)

  • Define the cross-covariance:

CQ(y,x) = ∫ (ψy ⊗ φx) P(x|y) π(y) dx = C(yx)y C−1yy µπ(y).

slide-112
SLIDE 112

Kernel Bayes’ law: consistency result

  • How to compute a posterior expectation from data?
  • Given samples: {(xi, yi)}_{i=1}^{n} from Pxy, and {uj}_{j=1}^{n} from the prior π.
  • Want to compute E[g(Y)|X = x] for g in G.
  • For any x ∈ X,

| gy⊤ RY|X kX(x) − E[g(Y)|X = x] | = Op(n^(−4/27)),   (n → ∞),

where
    – gy = (g(y1), . . . , g(yn))⊤ ∈ Rn,
    – kX(x) = (k(x1, x), . . . , k(xn, x))⊤ ∈ Rn,
    – RY|X is learned from the samples, and contains the uj.

Smoothness assumptions:

  • π/pY ∈ R(C1/2Y Y), where pY is the p.d.f. of PY,
  • E[g(Y)|X = ·] ∈ R(C2Q(xx)).

slide-113
SLIDE 113

Experiment: Kernel Bayes’ law vs EKF

slide-114
SLIDE 114

Experiment: Kernel Bayes’ law vs EKF

  • Compare with the extended Kalman filter (EKF) on a camera orientation task
  • 3600 downsampled frames of 20 × 20 RGB pixels (Yt ∈ [0, 1]^1200)
  • 1800 training frames, remaining for test.
  • Gaussian noise added to Yt.
slide-115
SLIDE 115

Experiment: Kernel Bayes’ law vs EKF

  • Compare with the extended Kalman filter (EKF) on a camera orientation task
  • 3600 downsampled frames of 20 × 20 RGB pixels (Yt ∈ [0, 1]^1200)
  • 1800 training frames, remaining for test.
  • Gaussian noise added to Yt.

Average MSE and standard errors (10 runs):

            | KBR (Gauss)    | KBR (Tr)       | Kalman (9 dim.) | Kalman (Quat.)
σ² = 10−4   | 0.210 ± 0.015  | 0.146 ± 0.003  | 1.980 ± 0.083   | 0.557 ± 0.023
σ² = 10−3   | 0.222 ± 0.009  | 0.210 ± 0.008  | 1.935 ± 0.064   | 0.541 ± 0.022

slide-116
SLIDE 116

Co-authors

  • From UCL:

– Luca Baldassarre – Steffen Grunewalder – Guy Lever – Sam Patterson – Massimiliano Pontil – Dino Sejdinovic

  • External:

– Karsten Borgwardt, MPI – Wicher Bergsma, LSE – Kenji Fukumizu, ISM – Zaid Harchaoui, INRIA – Bernhard Schoelkopf, MPI – Alex Smola, CMU/Google – Le Song, Georgia Tech – Bharath Sriperumbudur, Cambridge

slide-117
SLIDE 117

Selected references

Characteristic kernels and mean embeddings:

  • Smola, A., Gretton, A., Song, L., Schoelkopf, B. (2007). A Hilbert space embedding for distributions. ALT.
  • Sriperumbudur, B., Gretton, A., Fukumizu, K., Schoelkopf, B., Lanckriet, G. (2010). Hilbert space embeddings and metrics on probability measures. JMLR.
  • Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR.

Two-sample, independence, conditional independence tests:

  • Gretton, A., Fukumizu, K., Teo, C., Song, L., Schoelkopf, B., Smola, A. (2008). A kernel statistical test of independence. NIPS.
  • Fukumizu, K., Gretton, A., Sun, X., Schoelkopf, B. (2008). Kernel measures of conditional dependence.
  • Gretton, A., Fukumizu, K., Harchaoui, Z., Sriperumbudur, B. (2009). A fast, consistent kernel two-sample test. NIPS.
  • Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR.

Energy distance, relation to kernel distances:

  • Sejdinovic, D., Sriperumbudur, B., Gretton, A., Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics.

Three way interaction:

  • Sejdinovic, D., Gretton, A., and Bergsma, W. (2013). A Kernel Test for Three-Variable Interactions. NIPS.
slide-118
SLIDE 118

Selected references (continued)

Conditional mean embedding, RKHS-valued regression:

  • Weston, J., Chapelle, O., Elisseeff, A., Schölkopf, B., and Vapnik, V. (2003). Kernel Dependency Estimation. NIPS.
  • Micchelli, C., and Pontil, M. (2005). On Learning Vector-Valued Functions. Neural Computation.
  • Caponnetto, A., and De Vito, E. (2007). Optimal Rates for the Regularized Least-Squares Algorithm. Foundations of Computational Mathematics.
  • Song, L., Huang, J., Smola, A., Fukumizu, K. (2009). Hilbert Space Embeddings of Conditional Distributions. ICML.
  • Grunewalder, S., Lever, G., Baldassarre, L., Patterson, S., Gretton, A., Pontil, M. (2012). Conditional mean embeddings as regressors. ICML.
  • Grunewalder, S., Gretton, A., Shawe-Taylor, J. (2013). Smooth operators. ICML.

Kernel Bayes rule:

  • Song, L., Fukumizu, K., Gretton, A. (2013). Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine.
  • Fukumizu, K., Song, L., Gretton, A. (2013). Kernel Bayes' rule: Bayesian inference with positive definite kernels. JMLR.

slide-119
SLIDE 119
slide-120
SLIDE 120

Kernel CCA: Definition

  • There exists a factorization of Cxy such that [Baker, 1973]

Cxy = C1/2xx Vxy C1/2yy,   ‖Vxy‖S ≤ 1

  • Regularized empirical estimate of the spectral norm [JMLR07]:

‖V̂xy‖S := sup_{f∈F, g∈G} ⟨f, Ĉxy g⟩F   subject to   ⟨f, (Ĉxx + εnI) f⟩F = 1,  ⟨g, (Ĉyy + εnI) g⟩G = 1

    – First canonical correlate

slide-121
SLIDE 121

Kernel CCA: Definition

  • There exists a factorization of Cxy such that [Baker, 1973]

Cxy = C1/2xx Vxy C1/2yy,   ‖Vxy‖S ≤ 1

  • Regularized empirical estimate of the spectral norm [JMLR07]:

‖V̂xy‖S := sup_{f∈F, g∈G} ⟨f, Ĉxy g⟩F   subject to   ⟨f, (Ĉxx + εnI) f⟩F = 1,  ⟨g, (Ĉyy + εnI) g⟩G = 1

    – First canonical correlate

  • Regularized empirical estimate of the HS norm [NIPS07b]:

NOCCO(z; F, G) := ‖V̂xy‖²HS = tr(R̃y R̃x),   R̃x := K̃x (K̃x + nεnIn)−1
slide-122
SLIDE 122

Kernel CCA: Illustration

  • Ring-shaped density, first eigenvalue

(Scatter of X vs Y, correlation −0.00; dependence witnesses f(x), g(y); f(X) vs g(Y), correlation: 1.00)

slide-123
SLIDE 123

Kernel CCA: Illustration

  • Ring-shaped density, third eigenvalue

(Scatter of X vs Y, correlation −0.00; dependence witnesses f(x), g(y); f(X) vs g(Y), correlation: 0.97)

slide-124
SLIDE 124

NOCCO: HS Norm of Normalized Cross Covariance

  • Define NOCCO as NOCCO := ‖Vxy‖²HS
  • For characteristic kernels, the population NOCCO is the mean-square contingency, independent of the RKHS:

NOCCO = ∫∫_{X×Y} ( pxy(x, y) / (px(x) py(y)) − 1 )² px(x) py(y) dµ(x) dµ(y).

    – µ(x) and µ(y) are Lebesgue measures on X and Y; Pxy is absolutely continuous w.r.t. µ(x) × µ(y), with density pxy and marginal densities px and py.

  • Convergence result: assume the regularization εn satisfies εn → 0 and ε³n n → ∞. Then

‖V̂xy − Vxy‖HS → 0 in probability.
slide-125
SLIDE 125

References

  • C. Baker. Joint measures and cross-covariance operators. Transactions of the American Mathematical Society, 186:273–289, 1973.
  • L. Baringhaus and C. Franz. On a new multivariate two-sample test. J. Multivariate Anal., 88:190–206, 2004.
  • C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer, New York, 1984.
  • K. Chwialkowski and A. Gretton. A kernel independence test for random processes. ICML, 2014.
  • K. Chwialkowski, D. Sejdinovic, and A. Gretton. A wild bootstrap for degenerate kernel tests. NIPS, 2014.
  • A. Feuerverger. A consistent test for bivariate dependence. International Statistical Review, 61(3):419–433, 1993.
  • K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73–99, 2004.
  • K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems 20, pages 489–496, Cambridge, MA, 2008. MIT Press.
  • K. Fukumizu, L. Song, and A. Gretton. Kernel Bayes' rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research, 14:3753–3783, 2013.
  • A. Gretton and L. Gyorfi. Consistent nonparametric tests of independence. Journal of Machine Learning Research, 11:1391–1423, 2010.
  • A. Gretton, O. Bousquet, A. J. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In S. Jain, H. U. Simon, and E. Tomita, editors, Proceedings of the International Conference on Algorithmic Learning Theory, pages 63–77. Springer-Verlag, 2005.
  • A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems 15, pages 513–520, Cambridge, MA, 2007. MIT Press.

slide-126
SLIDE 126
  • A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, and A. J. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20, pages 585–592, Cambridge, MA, 2008. MIT Press.
  • A. Gretton, K. Borgwardt, M. Rasch, B. Schoelkopf, and A. Smola. A kernel two-sample test. JMLR, 13:723–773, 2012.
  • A. Gretton. A simpler condition for consistency of a kernel independence test. Technical Report 1501.06103, arXiv, 2015.
  • W. Jitkrittum, A. Gretton, N. Heess, S. M. A. Eslami, B. Lakshminarayanan, D. Sejdinovic, and Z. Szabó. Kernel-based just-in-time learning for passing expectation propagation messages. UAI, 2015.
  • D. Sejdinovic, A. Gretton, and W. Bergsma. A kernel test for three-variable interactions. In NIPS, 2013a.
  • D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics, 41(5):2263–2702, 2013b.
  • D. Sejdinovic, H. Strathmann, M. Lomeli Garcia, C. Andrieu, and A. Gretton. Kernel adaptive Metropolis-Hastings. ICML, 2014.
  • A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Proceedings of the International Conference on Algorithmic Learning Theory, volume 4754, pages 13–31. Springer, 2007.
  • B. Sriperumbudur, K. Fukumizu, A. Gretton, and A. Hyvärinen. Density estimation in infinite dimensional exponential families. Technical Report 1312.3516, ArXiv e-prints, 2014.
  • H. Strathmann, D. Sejdinovic, S. Livingstone, Z. Szabó, and A. Gretton. Gradient-free Hamiltonian Monte Carlo with efficient kernel exponential families. arXiv, 2015.
  • G. Székely and M. Rizzo. Testing for equal distributions in high dimension. InterStat, 5, 2004.
  • G. Székely and M. Rizzo. A new test for multivariate normality. J. Multivariate Anal., 93:58–80, 2005.
  • G. Székely, M. Rizzo, and N. Bakirov. Measuring and testing dependence by correlation of distances. Ann. Stat., 35(6):2769–2794, 2007.
  • K. Zhang, J. Peters, D. Janzing, and B. Schölkopf. Kernel-based conditional independence test and application in causal discovery. In 27th Conference on Uncertainty in Artificial Intelligence, pages 804–813, 2011.