Part III: Latent Tree Models. Le Song, ICML 2012 Tutorial on Spectral Algorithms for Latent Variable Models.

SLIDE 1

Spectral Algorithms for Latent Variable Models Part III: Latent Tree Models

ICML 2012 Tutorial on Spectral Algorithms for Latent Variable Models, Edinburgh, UK

Joint work with Mariya Ishteva, Ankur Parikh, Eric Xing, Byron Boots, Geoff Gordon, Alex Smola and Kenji Fukumizu

Le Song

SLIDE 2

Latent Tree Graphical Models

• Graphical model: nodes represent variables, edges represent conditional independence relations.
• Latent tree graphical models: latent and observed variables are arranged in a tree structure.
• Many real world applications, e.g., time-series prediction, topic modeling.

[Figure: a latent tree with observed variables Y1, ..., Y6 and latent variables Y7, ..., Y10, and a hidden Markov model with observed variables Y1, ..., Y6 and a latent chain Y7, ..., Y12.]

SLIDE 3

Scope of This Tutorial

Estimating the marginal probability of the observed variables
• Spectral HMMs (Hsu et al. COLT'09)
• Kernel spectral HMMs (Song et al. ICML'10)
• Spectral latent trees (Parikh et al. ICML'11, Song et al. NIPS'11)
• Spectral dimensionality reduction for HMMs (Foster et al., arXiv)
• More recent: Cohen et al. ACL'12, Balle et al. ICML'12

Estimating latent parameters
• PCA approach (Mossel & Roch AOAP'06)
• PCA and SVD approaches (Anandkumar et al. COLT'12, arXiv)

Estimating the structure of latent variable models
• Recursive grouping (Choi et al. JMLR'11)
• Spectral short quartet (Anandkumar et al. NIPS'11)

SLIDE 4

Challenge of Estimating the Marginal of the Observed Variables

Exponential number of entries in Q(Y1, Y2, ..., Y6):
• For discrete variables taking o possible values, Q has O(o^6) entries!

The latent tree reduces the number of parameters:

Q(Y1, ..., Y6) = Σ_{y7,y8,y9,y10} Q(Y1, ..., Y6, y7, ..., y10)
             = Σ_{y7,y8,y9,y10} Q(y10) Q(y7|y10) Q(Y1|y7) Q(Y2|y7) Q(y8|y10) Q(Y3|y8) Q(Y4|y8) Q(y9|y10) Q(Y5|y9) Q(Y6|y9)

• The root factor Q(Y10) has O(o) parameters; each conditional such as Q(Y7|Y10) has O(o^2).
• The latent tree needs only O(9o^2) parameters in total, a significant saving over O(o^6)!

[Figure: the latent tree over Y1, ..., Y10 with one factor attached to each node and edge.]

SLIDE 5

EM Algorithm for Parameter Estimation

We do not observe the latent variables, yet we need to estimate the corresponding parameters, e.g., Q(Y7|Y10) and Q(Y1|Y7).
Expectation maximization: maximize the likelihood of the observations,

max Π_{j=1}^{n} Q(y1^(j), ..., y6^(j))

Drawbacks: local maxima, slow convergence, difficult to analyze.

[Figure: n training samples (y1^(j), ..., y6^(j)), j = 1, ..., n, observed at the leaves of the latent tree over Y1, ..., Y10.]

Goal of spectral algorithms: estimate the marginal in a local-minimum-free fashion.

SLIDE 6

Key Features of Spectral Algorithms

Represent the joint probability table of the observed variables with a low rank factorization, without ever forming the joint table in the computation!
  • E.g., Q_{{1,...,e};{e+1,...,2e}} = Reshape(Q(Y1, ..., Y2e), {1,...,e}) is a matrix of size o^e × o^e.
  • Represent it by low rank factors to avoid exponential blowup.
  • Use clever decomposition techniques to avoid directly using all entries of the table.
  • Use singular value decomposition.
SLIDE 7

Tensor View of the Marginal Probability

Marginal probability table 𝓣 = Q(Y1, Y2, ..., Y6):
• Each discrete variable takes o possible values {1, ..., o}.
• 𝓣 is a 6-way table, or 6th order tensor; each dimension is labeled by one variable.
• The value of a variable is the index into the corresponding dimension, so 6 indexes are needed to access a single entry.
• Q(Y1 = 1, Y2 = 4, ..., Y6 = 3) is the entry 𝓣[1, 4, ..., 3].

Running examples: [Figure: the latent tree over Y1, ..., Y10 and the hidden Markov model over Y1, ..., Y12.]

SLIDE 8

Reshaping a Tensor into Matrices

Reshape(𝓣, 𝒟): the multi-index over the dimensions in 𝒟 is mapped into the row index, and the remaining dimensions into the column index.
  • E.g., let 𝓣 = Q(Y1, Y2, Y3) be a 3rd order tensor with o = 3.
  • Q_{{2};{1,3}} = Reshape(𝓣, {2}) turns the dimension of Y2 into the row index, and the multi-index over (Y1, Y3) into the column index.

[Figure: the 3 × 3 × 3 tensor 𝓣 sliced along the dimension of Y3 (slices Y3 = 1, 2, 3); the slices laid side by side form the 3 × 9 matrix with rows indexed by Y2 and columns by (Y1, Y3).]

SLIDE 9

Reshaping a 6th Order Tensor

Q_{{1,2,3};{4,5,6}} = Reshape(Q(Y1, ..., Y6), {1,2,3})

  • Each entry of the reshaped matrix is the probability of a unique assignment to Y1, ..., Y6, e.g., Q(2,3,1,2,1,2).

[Figure: the o^3 × o^3 matrix with rows indexed by the multi-index over (Y1, Y2, Y3) and columns by the multi-index over (Y4, Y5, Y6).]

SLIDE 10

Reshaping According to the Latent Tree Structure

For the marginal 𝓣 = Q(Y1, Y2, ..., Y6) of a latent tree model, reshape it according to the edges in the tree:
• Q_{{1};{2,3,4,5,6}} = Reshape(𝓣, {1})
• Q_{{1,2};{3,4,5,6}} = Reshape(𝓣, {1,2})
• Q_{{1,2,3,4};{5,6}} = Reshape(𝓣, {1,2,3,4})

[Figure: each reshaping corresponds to cutting an edge of the latent tree over Y1, ..., Y10, splitting the observed leaves into a row group and a column group.]

SLIDE 11

Low Rank Structure after Reshaping

The size of Q_{{1,2};{3,4,5,6}} is o^2 × o^4, but its rank is just o:

Q(Y1, Y2, ..., Y6) = Σ_{y7,y10} Q(y7, y10) Q(Y1, Y2 | y7) Q(Y3, Y4, Y5, Y6 | y10)

Use matrix multiplications to express the summation over Y7 and Y10:

Q_{{1,2};{3,4,5,6}} = Q_{{1,2}|{7}} Q_{{7};{10}} (Q_{{3,4,5,6}|{10}})^⊤

where
Q_{{1,2}|{7}} := Reshape(Q(Y1, Y2 | Y7), {1,2})
Q_{{3,4,5,6}|{10}} := Reshape(Q(Y3, Y4, Y5, Y6 | Y10), {3,4,5,6})

[Figure: the o^2 × o^4 matrix Q_{{1,2};{3,4,5,6}} factors as (o^2 × o)(o × o)(o × o^4).]

SLIDE 12

Low Rank Structure of the Latent Tree Model

Q_{{3,4};{1,2,5,6}} = Q_{{3,4}|{8}} Q_{{8};{10}} (Q_{{1,2,5,6}|{10}})^⊤    (o^2 × o^4 factors as (o^2 × o)(o × o)(o × o^4))

Q_{{1};{2,3,4,5,6}} = Q_{{1}|{7}} diag(Q(Y7)) (Q_{{2,3,4,5,6}|{7}})^⊤    (o × o^5 factors as (o × o)(o × o)(o × o^5))

All these reshapings are low rank, with rank o.

[Figure: the latent tree over Y1, ..., Y10 with the corresponding edge cuts highlighted.]

SLIDE 13

Low Rank Structure of Hidden Markov Models

Q_{{1,2};{3,4,5,6}} = Q_{{1,2}|{8}} Q_{{8};{9}} (Q_{{3,4,5,6}|{9}})^⊤    (o^2 × o^4 factors as (o^2 × o)(o × o)(o × o^4))

Q_{{1,2,3};{4,5,6}} = Q_{{1,2,3}|{9}} Q_{{9};{10}} (Q_{{4,5,6}|{10}})^⊤    (o^3 × o^3 factors as (o^3 × o)(o × o)(o × o^3))

[Figure: the hidden Markov model with observed Y1, ..., Y6 and latent chain Y7, ..., Y12; each reshaping cuts one edge of the chain.]

SLIDE 14

Key Features of Spectral Algorithms

Represent the joint probability table of the observed variables with a low rank factorization, without ever forming the joint table in the computation!
  • E.g., Q_{{1,...,e};{e+1,...,2e}} = Reshape(Q(Y1, ..., Y2e), {1,...,e}) is a matrix of size o^e × o^e.
  • Represent it by low rank factors to avoid exponential blowup.
  • Use clever decomposition techniques to avoid directly using all entries of the table.
  • Use singular value decomposition.
SLIDE 15

Key Theorem

• Q will be the reshaped joint probability table.
• B and C will be marginalization operators.
• Theorem 1 will be applied recursively.
• It recovers several existing spectral algorithms as special cases.

Theorem 1: Let Q be of size n × o with rank l, B of size o × l with rank l, and C of size l × n with rank l. If CQB is invertible, then Q = QB (CQB)^{-1} CQ.

SLIDE 16

Marginalization Operators B and C

Computing the marginal probability of a subset of variables can be expressed as a matrix product:

Q(Y1, Y2, Y3, Y4) = Σ_{y5,y6} Q(Y1, Y2, Y3, Y4, y5, y6)

Q_{{1,2,3};{4}} = Q_{{1,2,3};{4,5,6}} B, where B = 1_o ⊗ 1_o ⊗ I_o

[Figure: B is the o^3 × o matrix formed as the Kronecker product of two all-ones vectors 1_o (each o × 1) and the o × o identity I_o; multiplying by B sums out Y5 and Y6.]

SLIDE 17

Zoom into the Marginalization Operation

[Figure: multiplying the 27 × 27 matrix Q_{{1,2,3};{4,5,6}} by 1_3 ⊗ 1_3 ⊗ I_3 sums out Y5 and Y6, leaving the 27 × 3 matrix Q_{{1,2,3};{4}}.]

SLIDE 18

Applying Theorem 1 to the Latent Tree Model

Let
• Q = Q_{{1,2};{3,4,5,6}}
• B = 1_o ⊗ 1_o ⊗ 1_o ⊗ I_o
• C = (I_o ⊗ 1_o)^⊤

Then
• Q_{{1,2};{3,4,5,6}} B = Q_{{1,2};{3}}
• C Q_{{1,2};{3,4,5,6}} = Q_{{2};{3,4,5,6}}
• C Q_{{1,2};{3,4,5,6}} B = Q_{{2};{3}}

Finally, using Q = QB (CQB)^{-1} CQ:

Q_{{1,2};{3,4,5,6}} = Q_{{1,2};{3}} (Q_{{2};{3}})^{-1} Q_{{2};{3,4,5,6}}

[Figure: the latent tree over Y1, ..., Y10 with the cut corresponding to Q_{{1,2};{3,4,5,6}}.]

SLIDE 19

Latent Tree Decomposition

Q_{{1,2};{3,4,5,6}} = Q_{{1,2};{3}} (Q_{{2};{3}})^{-1} Q_{{2};{3,4,5,6}}

Decompose: [Figure: the full tree marginal Q_{{1,2};{3,4,5,6}} splits into three pieces: Q_{{1,2};{3}} supported on the subtree around Y1, Y2, Y3; the small inverse (Q_{{2};{3}})^{-1}; and Q_{{2};{3,4,5,6}} supported on the subtree around Y3, ..., Y6.]
SLIDE 20

Applying Theorem 1 to Hidden Markov Models

Let
• Q = Q_{{1,2,3};{4,5,6}}
• B = 1_o ⊗ 1_o ⊗ I_o
• C = (I_o ⊗ 1_o ⊗ 1_o)^⊤

Then
• Q_{{1,2,3};{4,5,6}} B = Q_{{1,2,3};{4}}
• C Q_{{1,2,3};{4,5,6}} = Q_{{3};{4,5,6}}
• C Q_{{1,2,3};{4,5,6}} B = Q_{{3};{4}}

Finally, using Q = QB (CQB)^{-1} CQ:

Q_{{1,2,3};{4,5,6}} = Q_{{1,2,3};{4}} (Q_{{3};{4}})^{-1} Q_{{3};{4,5,6}}

[Figure: the hidden Markov model over Y1, ..., Y12 with the cut corresponding to Q_{{1,2,3};{4,5,6}}.]

SLIDE 21

Hidden Markov Model Decomposition

Q_{{1,2,3};{4,5,6}} = Q_{{1,2,3};{4}} (Q_{{3};{4}})^{-1} Q_{{3};{4,5,6}}

Decompose: [Figure: the HMM marginal Q_{{1,2,3};{4,5,6}} splits into Q_{{1,2,3};{4}}, (Q_{{3};{4}})^{-1}, and Q_{{3};{4,5,6}}, each supported on a sub-chain of the model.]

SLIDE 22

Recursive Decomposition of the Latent Tree

• Q_{{1,2};{3,4,5,6}} = Q_{{1,2};{3}} (Q_{{2};{3}})^{-1} Q_{{2};{3,4,5,6}}
• Reshape Q_{{2};{3,4,5,6}} into Q_{{2,3,4};{5,6}}, then decompose again: Q_{{2,3,4};{5,6}} = Q_{{2,3,4};{5}} (Q_{{4};{5}})^{-1} Q_{{4};{5,6}}
• Reshape Q_{{2,3,4};{5}} into Q_{{3,4};{2,5}} and decompose: Q_{{3,4};{2,5}} = Q_{{3,4};{2}} (Q_{{4};{2}})^{-1} Q_{{4};{2,5}}

[Figure: at each step the corresponding subtree of the latent tree over Y1, ..., Y10 is highlighted.]

SLIDE 23

Recursive Decomposition of the HMM

• Q_{{1,2,3};{4,5,6}} = Q_{{1,2,3};{4}} (Q_{{3};{4}})^{-1} Q_{{3};{4,5,6}}
• Reshape Q_{{1,2,3};{4}} into Q_{{1,2};{3,4}} and decompose: Q_{{1,2};{3,4}} = Q_{{1,2};{3}} (Q_{{2};{3}})^{-1} Q_{{2};{3,4}}
• Reshape Q_{{3};{4,5,6}} into Q_{{3,4};{5,6}} and decompose: Q_{{3,4};{5,6}} = Q_{{3,4};{5}} (Q_{{4};{5}})^{-1} Q_{{4};{5,6}}

[Figure: each factor is supported on a sub-chain of the HMM over Y1, ..., Y12.]

SLIDE 24

One Entry of the Joint Probability Table of the HMM

Fix some observations:
• Fixing Y3 = y3, Q_{{2};y3;{4}} := Q(Y2, y3, Y4) is a matrix.
• Fixing Y1 = y1 and Y2 = y2, Q_{y1;y2;{3}} := Q(y1, y2, Y3) is a vector.

Q(y1, y2, y3, y4, y5, y6) = Q_{y1;y2;{3}} (Q_{{2};{3}})^{-1} Q_{{2};y3;{4}} (Q_{{3};{4}})^{-1} Q_{{3};y4;{5}} (Q_{{4};{5}})^{-1} Q_{{4};y5;y6}

[Figure: the factors of the chain, each supported on a short window of the HMM.]

SLIDE 25

Connection to Foster et al.

Q(y1, y2, y3, y4, y5, y6) = Q_{y1;y2;{3}} (Q_{{2};{3}})^{-1} Q_{{2};y3;{4}} (Q_{{3};{4}})^{-1} Q_{{3};y4;{5}} (Q_{{4};{5}})^{-1} Q_{{4};y5;y6}

Introduce a variable Y0 into Q_{y1;y2;{3}}:

Q_{y1;y2;{3}} = 1^⊤ Q_{{0};y1;{2}} (Q_{{1};{2}})^{-1} Q_{{1};y2;{3}}
             = 1^⊤ Q_{{0};{1}} (Q_{{0};{1}})^{-1} Q_{{0};y1;{2}} (Q_{{1};{2}})^{-1} Q_{{1};y2;{3}}
             = (Q_{{1}})^⊤ (Q_{{0};{1}})^{-1} Q_{{0};y1;{2}} (Q_{{1};{2}})^{-1} Q_{{1};y2;{3}}

Do similar things to Q_{{4};y5;y6}. Assuming the chain is time homogeneous,

(Q_{{0};{1}})^{-1} = (Q_{{1};{2}})^{-1},    Q_{{1,2};{3}} = Q_{{2,3};{4}}

[Figure: the HMM chain extended with an extra node Y0 at the front.]

SLIDE 26

What if the Number of Hidden States l < the Number of Observed States o?

Let Q = Q_{{1,2,3};{4,5,6}}, B = 1_o ⊗ 1_o ⊗ I_o, C = (I_o ⊗ 1_o ⊗ 1_o)^⊤, and use Q = QB (CQB)^{-1} CQ:

Q_{{1,2,3};{4,5,6}} = Q_{{1,2,3};{4}} (Q_{{3};{4}})^{-1} Q_{{3};{4,5,6}}

But Q_{{3};{4}}, of size o × o, has rank only l and is not invertible!

Singular value decomposition: Q_{{3};{4}} = V_l Σ_l W_l^⊤

Solution: use a further projection such that CQB is invertible. Let B = (1_o ⊗ 1_o ⊗ I_o) W_l and C = V_l^⊤ (I_o ⊗ 1_o ⊗ 1_o)^⊤. Then

Q_{{1,2,3};{4,5,6}} = Q_{{1,2,3};{4}} W_l (V_l^⊤ Q_{{3};{4}} W_l)^{-1} V_l^⊤ Q_{{3};{4,5,6}}

SLIDE 27

Connection to Hsu et al.

Two equivalent forms of applying the further projections V_l and W_l, where Q_{{3};{4}} = V_l Σ_l W_l^⊤ is the thin SVD:

(V_l^⊤ Q_{{3};{4}} W_l)^{-1} V_l^⊤ Q_{{3};{4,5,6}} = (Q_{{3};{4}} W_l)^† Q_{{3};{4,5,6}}

Applying the projected inverses throughout the chain:

Q(y1, ..., y6) = Q_{y1;y2;{3}} W_l (Q_{{2};{3}} W_l)^† Q_{{2};y3;{4}} W_l (Q_{{3};{4}} W_l)^† Q_{{3};y4;{5}} W_l (Q_{{4};{5}} W_l)^† Q_{{4};y5;y6}

This has the observable operator form of Hsu et al.: b_1^⊤ C_{y1} ⋯ C_{y6} b_∞

SLIDE 28

Proof of Theorem 1

SVD: Q = V_l Σ_l W_l^⊤ + V_⊥ 0 W_⊥^⊤

Assume
• B = (W_l, W_⊥) (D; E), with D of size l × l and invertible
• C = (F, G) (V_l, V_⊥)^⊤, with F of size l × l and invertible

Plug the above B and C into QB (CQB)^{-1} CQ: then QB = V_l Σ_l D, CQ = F Σ_l W_l^⊤ and CQB = F Σ_l D, so

QB (CQB)^{-1} CQ = V_l Σ_l D (F Σ_l D)^{-1} F Σ_l W_l^⊤ = V_l Σ_l W_l^⊤ = Q

Theorem 1: Let Q be a rank l matrix of size n × o, B a rank l matrix of size o × l, and C a rank l matrix of size l × n; then Q = QB (CQB)^{-1} CQ whenever CQB is invertible.

SLIDE 29

Finite Sample Estimator

Given n iid samples, estimate the pairwise and triplet marginals directly. Using the one-of-o encoding, e.g., φ(y = 1) = (1, 0, ..., 0)^⊤ and φ(y = 2) = (0, 1, 0, ..., 0)^⊤:

Q̂_{{1};{2};{3}} = (1/n) Σ_{j=1}^{n} φ(y1^(j)) ⊗ φ(y2^(j)) ⊗ φ(y3^(j))

Q̂_{{1};{2}} = (1/n) Σ_{j=1}^{n} φ(y1^(j)) φ(y2^(j))^⊤

Q̂_{{1}} = (1/n) Σ_{j=1}^{n} φ(y1^(j))

[Figure: the n training samples (y1^(j), ..., y6^(j)), j = 1, ..., n, observed at the leaves of the latent tree over Y1, ..., Y10.]

SLIDE 30

Sample Complexity Analysis

Error in the estimates Q̂_{{1};{2};{3}} and Q̂_{{1};{2}} propagates through the recursive decomposition:

Q(y1, y2, y3, y4, y5, y6) = Q_{y1;y2;{3}} (Q_{{2};{3}})^{-1} Q_{{2};y3;{4}} (Q_{{3};{4}})^{-1} Q_{{3};y4;{5}} (Q_{{4};{5}})^{-1} Q_{{4};y5;y6}

The error depends on the smallest singular value of the inverted terms, e.g., Q_{{1};{2}}.

Spectral algorithms:
• Use SVD for the further projection.
• The error depends on the singular values.

SLIDE 31

Synthetic Data

[Figure: experimental results on synthetic data.]

SLIDE 32

Stock Trend Prediction Data

• 59 stocks, 6800 samples; learn the latent structure first and then estimate the marginal.
• MAP prediction task: ŷ_j = argmax Q(Y_j | y1, y2, ..., y_{j−1}) (query the j-th variable).
• Absolute error |ŷ_j − y_j^⋆|.
• Also compared with the Chow-Liu tree (a fully observed model).

[Figure: prediction error comparison.]

SLIDE 33

Non-discrete, Non-Gaussian Case

The previous approach is all about discrete variables. Real world data can be continuous, with multimodal distributions and other rich statistical features. Replace discrete probabilities by kernel embeddings of distributions:
• Kernel k(y, y′) = ⟨φ(y), φ(y′)⟩, e.g., exp(−s ‖y − y′‖^2)
• Expected feature of a distribution: μ_Y = E_Y[φ(Y)] (can be an infinite dimensional feature)
• The one-of-o feature of the discrete case is a special case.

[Figure: example densities Q(Y), Q(Y, Z) and Q(Y, Z, W) with multimodal shapes.]

SLIDE 34

Kernel Embedding and Covariance Operator

μ_Y := E_Y[φ(Y)]
C_YZ := E_YZ[φ(Y) ⊗ φ(Z)]
C_YZW := E_YZW[φ(Y) ⊗ φ(Z) ⊗ φ(W)]

Discrete vs. kernel embedding:
• Q(Y): o × 1 vs. ∞ × 1
• Q(Y, Z): o × o vs. ∞ × ∞
• Q(Y, Z, W): o × o × o vs. ∞ × ∞ × ∞
SLIDE 35

Kernel Embedding with Finite Samples

Embed the joint distribution Q(Y, Z) in the joint feature space:

C_YZ = E_YZ[φ(Y) ⊗ φ(Z)] ≈ Ĉ_YZ = (1/n) Σ_{j=1}^{n} φ(y_j) ⊗ φ(z_j)

Use the finite sample mean to approximate the expectation, then apply the recursive low rank decomposition.
SLIDE 36

How to Deal with Infinite Features?

Kernel trick: never explicitly compute the features; always turn the computation into inner products k(y, y′) = ⟨φ(y), φ(y′)⟩.

E.g., kernel singular value decomposition of

Ĉ_YZ = (1/n) Σ_{j=1}^{n} φ(y_j) ⊗ φ(z_j) = V Σ W^⊤

• Run kernel principal component analysis on Ĉ_YZ Ĉ_YZ^⊤.
• Each left singular vector lies in the span of the data: v = Σ_{j=1}^{n} β_j φ(y_j).
• Solve a generalized eigenvalue problem K G K β = λ K β, with kernel matrices K_jk = k(y_j, y_k) and G_jk = k(z_j, z_k).
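To keep the check finite, the sketch below uses the one-of-o encoding as the feature map, so the "kernel" is just an indicator and the embedding Ĉ can be formed explicitly; the setup and data are my own toy choice. The nonzero generalized eigenvalues of (KGK, K) coincide with the eigenvalues of GK, which are n^2 times the squared singular values of Ĉ:

```python
import numpy as np
rng = np.random.default_rng(7)
o, n = 4, 50
y, z = rng.integers(o, size=n), rng.integers(o, size=n)

K = (y[:, None] == y[None, :]).astype(float)   # K_jk = k(y_j, y_k), indicator kernel
G = (z[:, None] == z[None, :]).astype(float)   # G_jk = k(z_j, z_k)

# Nonzero generalized eigenvalues of K G K beta = lam K beta equal eig(G K).
lam = np.sort(np.linalg.eigvals(G @ K).real)[::-1]

eye = np.eye(o)
C_hat = eye[y].T @ eye[z] / n                  # the embedding, explicit here
sig = np.linalg.svd(C_hat, compute_uv=False)
assert np.allclose(lam[:o], (n * sig) ** 2)    # lam_i = (n * sigma_i)^2
```

With a genuinely infinite feature space the same eigenvalue problem is solved using only the n × n kernel matrices, never the features themselves.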

SLIDE 37

Video and Slot Car Sensor Prediction

[Figure: experimental results on video and slot car sensor data.]

SLIDE 38

Demographic Feature Prediction

• 50 variables, 1400 samples; learn the latent structure first and then run the spectral algorithm.
• Compared to a Gaussian latent variable model and a Gaussian copula model (NPN); absolute error |ŷ − y^⋆|.

[Figure: prediction error comparison.]

SLIDE 39

Summary and Future Directions

Spectral algorithms are a consequence of the low rank structure of latent variable models:
• Q = QB (CQB)^{-1} CQ, applied recursively
• Better low rank approximations?
• What if the latent variable model is the wrong model?

Estimating latent parameters
• PCA approach (Mossel & Roch AOAP'06)
• PCA and SVD approaches (Anandkumar et al. COLT'12, arXiv)

Estimating the structure of latent variable models
• Recursive grouping (Choi et al. JMLR'11)
• Spectral short quartet (Anandkumar et al. NIPS'11)

SLIDE 40

Questions?

Thanks