A Method of Moments for Mixture Models and Hidden Markov Models


SLIDE 1

A Method of Moments for Mixture Models and Hidden Markov Models

Anima Anandkumar@ Daniel Hsu# Sham M. Kakade#

@University of California, Irvine #Microsoft Research, New England

SLIDE 2

Outline

  • 1. Latent class models and parameter estimation
  • 2. Multi-view method of moments
  • 3. Some applications
  • 4. Concluding remarks
SLIDE 3

  • 1. Latent class models and parameter estimation

SLIDE 8

Latent class models / multi-view mixture models

Random vectors h ∈ {e1, e2, . . . , ek} ⊂ Rk and x1, x2, . . . , xℓ ∈ Rd.

[Graphical model: hidden variable h with arrows to the observed views x1, x2, . . . , xℓ]

◮ Bag-of-words clustering model: k = number of topics, d = vocabulary size, h = topic of the document, x1, x2, . . . , xℓ ∈ {e1, e2, . . . , ed} = words in the document.

◮ Multi-view clustering: k = number of clusters, ℓ = number of views (e.g., audio, video, text); views assumed to be conditionally independent given the cluster.

◮ Hidden Markov model (ℓ = 3): past, present, and future observations are conditionally independent given the present hidden state.

◮ etc.

SLIDE 11

Parameter estimation task

Model parameters: mixing weights and conditional means
  wj := Pr[h = ej], j ∈ [k];
  µv,j := E[xv | h = ej] ∈ Rd, v ∈ [ℓ], j ∈ [k].

Goal: given i.i.d. copies of (x1, x2, . . . , xℓ), estimate the matrix of conditional means Mv := [µv,1 | µv,2 | · · · | µv,k] for each view v ∈ [ℓ], and the mixing weights w := (w1, w2, . . . , wk).

Unsupervised learning, as h is not observed.

This talk: a very general and computationally efficient method-of-moments estimator for w and the Mv.

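To make the setup concrete, here is a small numpy sketch (my own illustration, not part of the talk) that samples i.i.d. copies of (x1, . . . , xℓ) from a discrete latent class model with given mixing weights w and per-view conditional-mean matrices; all names and the toy parameters are illustrative.

```python
import numpy as np

def sample_latent_class(w, Ms, n, rng=None):
    """Draw n i.i.d. copies of (x_1, ..., x_ell) from a discrete latent class model.

    w  : (k,) mixing weights, w_j = Pr[h = e_j]
    Ms : list of (d, k) matrices; column j of Ms[v] is Pr[x_v = . | h = e_j]
    Returns hidden labels h (shape (n,)) and views X (shape (ell, n)) as word indices.
    """
    rng = np.random.default_rng(rng)
    k = len(w)
    h = rng.choice(k, size=n, p=w)                      # hidden class per sample
    X = np.stack([
        np.array([rng.choice(M.shape[0], p=M[:, j]) for j in h])
        for M in Ms
    ])                                                  # views drawn independently given h
    return h, X

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k, d, ell = 3, 6, 3
    w = np.array([0.5, 0.3, 0.2])
    Ms = [rng.dirichlet(np.ones(d), size=k).T for _ in range(ell)]  # (d, k), columns sum to 1
    h, X = sample_latent_class(w, Ms, n=5, rng=rng)
    print(h)
    print(X)
```
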
SLIDE 14

Some barriers to efficient estimation

Cryptographic barrier: HMM parameter estimation is as hard as learning parity functions with noise (Mossel-Roch, ’06).

Statistical barrier: mixtures of Gaussians in R1 can require exp(Ω(k)) samples to estimate, even if the components are Ω(1/k)-separated (Moitra-Valiant, ’10).

Practitioners typically resort to local search heuristics (EM), which are plagued by slow convergence and inaccurate local optima.

SLIDE 17

Making progress: Gaussian mixture model

Gaussian mixture model: the problem becomes easier if we assume a large minimum separation between the component means (Dasgupta, ’99):

  sep := min_{i≠j} ‖µi − µj‖ / max{σi, σj}.

◮ sep = Ω(d^c): interpoint distance-based methods / EM (Dasgupta, ’99; Dasgupta-Schulman, ’00; Arora-Kannan, ’00)

◮ sep = Ω(k^c): first use PCA to project down to k dimensions (Vempala-Wang, ’02; Kannan-Salmasian-Vempala, ’05; Achlioptas-McSherry, ’05)

◮ No minimum separation requirement: method of moments, but exp(Ω(k)) running time / sample size (Kalai-Moitra-Valiant, ’10; Belkin-Sinha, ’10; Moitra-Valiant, ’10)

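For concreteness, a tiny sketch (my own, not from the slides) that evaluates this separation statistic for a toy mixture; the means and per-component standard deviations are made-up inputs.

```python
import numpy as np

def separation(mus, sigmas):
    """sep = min over i != j of ||mu_i - mu_j|| / max(sigma_i, sigma_j)."""
    k = len(mus)
    return min(np.linalg.norm(mus[i] - mus[j]) / max(sigmas[i], sigmas[j])
               for i in range(k) for j in range(k) if i != j)

mus = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])   # toy component means
sigmas = np.array([1.0, 2.0, 1.0])                     # toy component scales
print(separation(mus, sigmas))                         # 1.5 for this toy example
```
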
SLIDE 19

Making progress: hidden Markov models

Hardness reductions create HMMs where different states may have near-identical output and next-state distributions.

[Figure: bar charts of the output distributions Pr[xt = · | ht = e1] and Pr[xt = · | ht = e2], which are nearly identical]

Can avoid these instances if we assume transition and output parameter matrices are full-rank.

◮ d = k: eigenvalue decompositions (Chang, ’96; Mossel-Roch, ’06)

◮ d ≥ k: subspace ID + observable operator model (Hsu-Kakade-Zhang, ’09)

SLIDE 22

What we do

This work: Concept of “full rank” parameter matrices is generic and very powerful; adapt Chang’s method for more general mixture models.

◮ Non-degeneracy condition for latent class model: Mv has full column rank (∀v ∈ [ℓ]), and w > 0.

◮ New efficient learning results for:
  ◮ Certain Gaussian mixture models, with no minimum separation requirement and poly(k) sample / computational complexity
  ◮ HMMs with discrete or continuous output distributions (e.g., Gaussian mixture outputs)

SLIDE 23
  • 2. Multi-view method of moments
SLIDE 27

Simplified model and low-order statistics

Simplification: Mv ≡ M (same conditional means for all views).

If xv ∈ {e1, e2, . . . , ed} (discrete outputs), then Pr[xv = ei | h = ej] = Mi,j, i ∈ [d], j ∈ [k].

So pair-wise and triple-wise statistics are:
  Pairsi,j := Pr[x1 = ei ∧ x2 = ej], i, j ∈ [d]
  Triplesi,j,κ := Pr[x1 = ei ∧ x2 = ej ∧ x3 = eκ], i, j, κ ∈ [d].

Notation: for η = (η1, η2, . . . , ηd) ∈ Rd,
  Triplesi,j(η) := Σκ ηκ Pr[x1 = ei ∧ x2 = ej ∧ x3 = eκ] (sum over κ ∈ [d]), i, j ∈ [d].

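In this discrete setting, the Pairs and Triples(η) statistics can be estimated by simple counting; a minimal sketch (my own illustration) with words encoded as integer indices:

```python
import numpy as np

def empirical_pairs_triples(x1, x2, x3, d, eta):
    """Empirical estimates of Pairs and Triples(eta) for discrete views.

    x1, x2, x3 : integer arrays of shape (n,), word indices in [0, d)
    eta        : (d,) weighting vector
    """
    n = len(x1)
    pairs = np.zeros((d, d))
    triples_eta = np.zeros((d, d))
    np.add.at(pairs, (x1, x2), 1.0)                 # counts of (x1 = i, x2 = j)
    np.add.at(triples_eta, (x1, x2), eta[x3])       # eta-weighted counts over x3
    return pairs / n, triples_eta / n

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n = 5, 10_000
    x1, x2, x3 = rng.integers(0, d, size=(3, n))    # placeholder data, not from a real model
    eta = rng.standard_normal(d)
    P, T = empirical_pairs_triples(x1, x2, x3, d, eta)
    print(P.sum())                                  # ~1, since Pairs is a joint probability table
```
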
SLIDE 28

Algebraic structure in moments

[Graphical model: hidden variable h with arrows to the observed views x1, x2, . . . , xℓ]

By conditional independence of x1, x2, x3 given h,
  Pairs = M diag(w) M⊤
  Triples(η) = M diag(M⊤η) diag(w) M⊤.

(Low-rank matrix factorizations, but M is not necessarily orthonormal.)

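Both factorizations can be verified numerically by building the population moments directly from a toy (w, M); a short sketch (my own illustration, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 3, 6
w = rng.dirichlet(np.ones(k))                 # mixing weights
M = rng.dirichlet(np.ones(d), size=k).T       # (d, k), column j = Pr[x_v = . | h = e_j]
eta = rng.standard_normal(d)

# Population moments, summing over the hidden class h = e_j (mu_j = j-th column of M):
#   Pairs        = sum_j w_j mu_j mu_j^T
#   Triples(eta) = sum_j w_j <eta, mu_j> mu_j mu_j^T
pairs = sum(w[j] * np.outer(M[:, j], M[:, j]) for j in range(k))
triples_eta = sum(w[j] * (eta @ M[:, j]) * np.outer(M[:, j], M[:, j]) for j in range(k))

# Factorized forms from the slide:
assert np.allclose(pairs, M @ np.diag(w) @ M.T)
assert np.allclose(triples_eta, M @ np.diag(M.T @ eta) @ np.diag(w) @ M.T)
print("factorizations verified")
```
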
SLIDE 33

Developing a method of moments

For simplicity, assume d = k (all matrices are square). Recall:
  Pairs = M diag(w) M⊤
  Triples(η) = M diag(M⊤η) diag(w) M⊤
and therefore
  Triples(η) Pairs^{-1} = M diag(M⊤η) M^{-1},
a diagonalizable matrix of the form V Λ V^{-1}, where V = M (eigenvectors) and Λ = diag(M⊤η) (eigenvalues).

(If d > k, use SVD to reduce the dimension.)

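A minimal numpy sketch (my own, assuming d = k and exact population moments) of this observation: the eigenvectors of Triples(η) Pairs^{-1} are the columns of M, so a single eigendecomposition recovers them up to scaling and permutation.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 4
w = rng.dirichlet(np.ones(k))
M = rng.dirichlet(np.ones(k), size=k).T      # square (k, k), full rank w.h.p.
eta = rng.standard_normal(k)

pairs = M @ np.diag(w) @ M.T
triples_eta = M @ np.diag(M.T @ eta) @ np.diag(w) @ M.T

# Triples(eta) Pairs^{-1} = M diag(M^T eta) M^{-1}: eigenvectors are the columns of M.
B = triples_eta @ np.linalg.inv(pairs)
eigvals, eigvecs = np.linalg.eig(B)

# Each eigenvector equals some column of M up to scale; rescaling each column to sum to 1
# works here because the columns of M are probability vectors.  Match columns by eigenvalue,
# since the true eigenvalue for column j is (M^T eta)_j.
M_hat = np.real(eigvecs) / np.real(eigvecs).sum(axis=0, keepdims=True)
order = np.argsort(np.real(eigvals))
truth = np.argsort(M.T @ eta)
print(np.allclose(M_hat[:, order], M[:, truth], atol=1e-6))   # should print True
```
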
SLIDE 34

Plug-in estimator

  • 1. Obtain empirical estimates of Pairs and Triples.

  • 2. Compute the matrix U of k orthonormal left singular vectors via a rank-k SVD of the empirical Pairs.

  • 3. Randomly pick a unit vector θ ∈ Rk.

  • 4. Compute the right eigenvectors v1, v2, . . . , vk of (U⊤ Triples(U θ) U)(U⊤ Pairs U)^{-1}, using the empirical estimates throughout, and return [U v1 | U v2 | · · · | U vk] as the conditional mean parameter estimates (up to scaling). In general, the proper scaling can be determined from the eigenvalues.

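A compact sketch of these four steps on synthetic discrete data (my own rendering; the sampler, sizes, and variable names are illustrative). The output approximates the columns of M only up to column scaling, column permutation, and sampling noise.

```python
import numpy as np

def plugin_estimator(x1, x2, x3, d, k, rng=None):
    """Plug-in estimator for the shared conditional-mean matrix M (discrete views)."""
    rng = np.random.default_rng(rng)
    n = len(x1)

    # 1. Empirical Pairs (d x d) and Triples (d x d x d).
    pairs = np.zeros((d, d))
    triples = np.zeros((d, d, d))
    np.add.at(pairs, (x1, x2), 1.0)
    np.add.at(triples, (x1, x2, x3), 1.0)
    pairs /= n
    triples /= n

    # 2. k orthonormal left singular vectors of the empirical Pairs.
    U = np.linalg.svd(pairs)[0][:, :k]

    # 3. Random unit vector theta in R^k.
    theta = rng.standard_normal(k)
    theta /= np.linalg.norm(theta)

    # 4. Right eigenvectors of (U^T Triples(U theta) U)(U^T Pairs U)^{-1}.
    triples_eta = np.einsum('ijk,k->ij', triples, U @ theta)     # Triples(U theta)
    A = (U.T @ triples_eta @ U) @ np.linalg.inv(U.T @ pairs @ U)
    _, V = np.linalg.eig(A)
    return U @ np.real(V)        # columns estimate columns of M, up to scaling/permutation

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k, d, n = 3, 8, 200_000
    w = np.array([0.5, 0.3, 0.2])                       # toy mixing weights
    M = rng.dirichlet(np.ones(d), size=k).T             # toy conditional means (d x k)
    h = rng.choice(k, size=n, p=w)                      # hidden classes
    cdf = np.cumsum(M, axis=0)
    draw = lambda: (rng.random(n)[None, :] > cdf[:, h]).sum(axis=0)   # one view per call
    M_hat = plugin_estimator(draw(), draw(), draw(), d, k, rng=rng)
    M_hat /= M_hat.sum(axis=0, keepdims=True)           # rescale columns to sum to 1
    print(np.round(M_hat, 2))                           # ~ columns of M, permuted, noisy
    print(np.round(M, 2))
```
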
SLIDE 36

Accuracy guarantee

Theorem (discrete outputs)

Assume the non-degeneracy condition holds. If Pairs and Triples are empirical frequencies obtained from a random sample of size
  poly(k, σmin(M)^{-1}, wmin^{-1}) / ǫ²,
then with high probability, there exists a permutation matrix Π such that the estimate M̂ returned by the plug-in estimator satisfies ‖M̂ Π − M‖ ≤ ǫ.

Role of non-degeneracy: σmin(M)^{-1} and wmin^{-1} enter the sample complexity bound.

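Because the guarantee is stated up to a column permutation Π, measuring the error of an estimate means minimizing over permutations; for small k a brute-force search is enough, as in this illustrative sketch (not from the paper):

```python
import numpy as np
from itertools import permutations

def error_up_to_permutation(M_hat, M):
    """min over permutation matrices Pi of the spectral norm ||M_hat Pi - M||."""
    k = M.shape[1]
    return min(np.linalg.norm(M_hat[:, list(perm)] - M, ord=2)
               for perm in permutations(range(k)))

M = np.eye(3)
M_hat = np.eye(3)[:, [2, 0, 1]] + 0.01       # column-permuted and slightly perturbed
print(error_up_to_permutation(M_hat, M))      # small, despite the column shuffle
```
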
SLIDE 43

Additional details (see paper)

◮ Can also obtain an estimate for the mixing weights w.

◮ General setting: different conditional mean matrices for different views; some non-discrete observed variables.

◮ Similar sample complexity bound for models with continuous but subgaussian (or log-concave, etc.) xv’s.

◮ Delicate alignment issue: how to make sure the columns of M1 are in the same order as the columns of M2?

◮ Solution: reuse eigenvectors whenever possible and align based on eigenvalues.

◮ Many variants possible (e.g., symmetrization to only deal with orthogonal eigenvectors) — easy to design once you see the structure.

SLIDE 44
  • 3. Some applications
SLIDE 47

Mixtures of axis-aligned Gaussians

Mixture of axis-aligned Gaussians in Rn, with component means µ1, µ2, . . . , µk ∈ Rn; no minimum separation requirement.

[Graphical model: hidden variable h with arrows to the coordinates x1, x2, . . . , xn]

Assumptions:

◮ non-degeneracy: component means span a k-dimensional subspace.

◮ incoherence condition: component means not perfectly aligned with the coordinate axes — similar to the spreading condition of (Chaudhuri-Rao, ’08).

Then, randomly partitioning the coordinates into ℓ ≥ 3 views guarantees (w.h.p.) that non-degeneracy holds in all ℓ views.

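A small sketch (my own illustration) of the reduction just described: randomly partition the coordinates into ℓ = 3 groups. For an axis-aligned Gaussian mixture the coordinates are conditionally independent given h, so the resulting groups behave as three independent views and the multi-view estimator applies.

```python
import numpy as np

def random_views(X, num_views=3, rng=None):
    """Randomly partition the coordinates of X (samples x dims) into `num_views` views."""
    rng = np.random.default_rng(rng)
    perm = rng.permutation(X.shape[1])
    groups = np.array_split(perm, num_views)
    return [X[:, g] for g in groups], groups

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((5, 10))        # stand-in for mixture samples in R^10
    views, groups = random_views(X, rng=rng)
    print([v.shape for v in views], [g.tolist() for g in groups])
```
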
SLIDE 49

Hidden Markov models

[Diagram: the HMM h1 → h2 → h3 with observations x1, x2, x3, redrawn as a latent class model in which x1, x2, x3 are conditionally independent views given the hidden state h (= h2)]
slide-50
SLIDE 50

Bag-of-words clustering model

Mi,j = Pr[see word i in article|article topic is j].

◮ Corpus: New York Times (from UCI), 300000 articles. ◮ Vocabulary size: d = 102660 words. ◮ Chose k = 50. ◮ For each topic j, show top 10 words i ordered by

Mi,j value.

slide-51
SLIDE 51

Bag-of-words clustering model

Mi,j = Pr[see word i in article | article topic is j].

◮ Corpus: New York Times (from UCI), 300000 articles.
◮ Vocabulary size: d = 102660 words.
◮ Chose k = 50.
◮ For each topic j, show the top 10 words i ordered by Mi,j value.

sales economic consumer major home indicator weekly order claim scheduled
run inning hit game season home right games dodger left
school student teacher program official public children high education district
drug patient million company doctor companies percent cost program health
player tiger_wood won shot play round win tournament tour right

SLIDE 52

Bag-of-words clustering model

palestinian israel israeli yasser_arafat peace israeli israelis leader official attack
tax cut percent bush billion plan bill taxes million congress
cup minutes oil water add tablespoon food teaspoon pepper sugar
point game team shot play laker season half lead games
yard game play season team touchdown quarterback coach defense quarter

SLIDE 53

Bag-of-words clustering model

percent stock market fund investor companies analyst money investment economy
al_gore campaign president george_bush bush clinton vice presidential million democratic
car race driver team won win racing track season lap
book children ages author read newspaper web writer written sales
taliban attack afghanistan official military u_s united_states terrorist war bin

SLIDE 54

Bag-of-words clustering model

com www site web sites information online mail internet telegram
court case law lawyer federal government decision trial microsoft right
show network season nbc cb program television series night new_york
film movie director play character actor show movies million part
music song group part new_york company million band show album

etc.

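Lists like the ones above are obtained by sorting each column of the estimated M; a minimal sketch (with a toy vocabulary and a random matrix, not the actual New York Times model):

```python
import numpy as np

def top_words_per_topic(M, vocab, top=10):
    """For each topic j, return the `top` words i with the largest M[i, j]."""
    return [[vocab[i] for i in np.argsort(M[:, j])[::-1][:top]] for j in range(M.shape[1])]

vocab = ["run", "inning", "hit", "game", "tax", "cut", "percent", "billion"]
rng = np.random.default_rng(0)
M = rng.dirichlet(np.ones(len(vocab)), size=2).T     # toy 2-topic conditional-mean matrix
for j, words in enumerate(top_words_per_topic(M, vocab, top=4)):
    print(f"topic {j}: {' '.join(words)}")
```
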
SLIDE 55
  • 4. Concluding remarks
SLIDE 63

Concluding remarks

Take-home messages:

◮ Some provably hard parameter estimation problems become easy after ruling out “degenerate” cases.

◮ Algebraic structure of moments can be exploited using simple eigendecomposition techniques.

Some follow-up works (see arXiv reports):

◮ Mixtures of (single-view) spherical Gaussians — non-degeneracy, without the incoherence condition.

◮ Latent Dirichlet Allocation (joint with Dean Foster and Yi-Kai Liu).

◮ Dynamic parsing models (joint with Percy Liang) — need a new trick to handle unobserved random tree structure (e.g., PCFGs, dependency parsing trees).

The end. Thanks!

SLIDE 66
  • 6. Connections to other moment methods
SLIDE 69

Connections to other moment methods

Basic recipe:

◮ Express moments of the observable variables as a system of polynomials in the desired parameters.

◮ Solve the system of polynomials for the desired parameters.

Pros:

◮ Very general technique; does not even require explicit specification of the likelihood.

◮ Example: learn the vertices of a convex polytope from random samples (Gravin-Lasserre-Pasechnik-Robins, ’12) — a very powerful generalization of Prony’s method.

Cons:

◮ Typically require high-order moments, which are difficult to estimate.

◮ Computationally prohibitive to solve general systems of multivariate polynomials.

SLIDE 70
  • 7. Moments
SLIDE 74

Simplified model and low-order moments

Simplification: Mv ≡ M (same conditional means for all views).

By conditional independence of x1, x2, x3 given h,

  Pairs := E[x1 ⊗ x2] = E[(M h) ⊗ (M h)] = M diag(w) M⊤

  Triples := E[x1 ⊗ x2 ⊗ x3] = E[(M h) ⊗ (M h) ⊗ (M h)] = E[h ⊗ h ⊗ h](M, M, M)

  Triples(η) := E[⟨η, x1⟩ (x2 ⊗ x3)] = E[⟨M⊤η, h⟩ ((M h) ⊗ (M h))] = M diag(M⊤η) diag(w) M⊤.

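The identity Triples = E[h ⊗ h ⊗ h](M, M, M) says the observed third moment is the (diagonal) hidden third moment with M applied along each mode; a short einsum check on population quantities (my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 3, 5
w = rng.dirichlet(np.ones(k))
M = rng.dirichlet(np.ones(d), size=k).T                  # (d, k)

# E[h (x) h (x) h] puts mass w_j on e_j (x) e_j (x) e_j, i.e. a diagonal tensor.
H3 = np.zeros((k, k, k))
H3[np.arange(k), np.arange(k), np.arange(k)] = w

# Applying M along each mode gives Triples = sum_j w_j mu_j (x) mu_j (x) mu_j.
triples_multilinear = np.einsum('abc,ia,jb,kc->ijk', H3, M, M, M)
triples_direct = sum(w[j] * np.einsum('i,j,k->ijk', M[:, j], M[:, j], M[:, j])
                     for j in range(k))
print(np.allclose(triples_multilinear, triples_direct))   # should print True
```
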
SLIDE 75
  • 8. Symmetric plug-in estimator
SLIDE 76

Symmetric plug-in estimator

  • 1. Obtain empirical estimates of Pairs and Triples.

  • 2. Compute the matrix U of k orthonormal left singular vectors via a rank-k SVD of the empirical Pairs; set W := U (U⊤ Pairs U)^{-1/2} and B := U (U⊤ Pairs U)^{1/2}.

  • 3. Randomly pick a unit vector θ ∈ Rk.

  • 4. Compute the right eigenvectors v1, v2, . . . , vk of W⊤ Triples(W θ) W, using the empirical estimates throughout, and return [B v1 | B v2 | · · · | B vk] as the conditional mean parameter estimates (up to scaling).

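A minimal numpy sketch of these steps (my own illustration), run on exact population moments so that the recovery is exact; with empirical estimates the same steps apply up to sampling noise.

```python
import numpy as np

def sym_power(S, power):
    """Symmetric matrix power S^power via eigendecomposition (S assumed positive definite)."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * vals**power) @ vecs.T

rng = np.random.default_rng(0)
k, d = 3, 6
w = rng.dirichlet(np.ones(k))
M = rng.dirichlet(np.ones(d), size=k).T                       # (d, k), full column rank w.h.p.

pairs = M @ np.diag(w) @ M.T                                  # population moments
triples = sum(w[j] * np.einsum('i,j,k->ijk', M[:, j], M[:, j], M[:, j]) for j in range(k))

U = np.linalg.svd(pairs)[0][:, :k]                            # top-k left singular vectors
S = U.T @ pairs @ U
W = U @ sym_power(S, -0.5)                                    # whitening map
B = U @ sym_power(S, +0.5)                                    # un-whitening map

theta = rng.standard_normal(k)
theta /= np.linalg.norm(theta)
triples_eta = np.einsum('ijk,k->ij', triples, W @ theta)      # Triples(W theta)
A = W.T @ triples_eta @ W                                     # symmetric k x k matrix
_, V = np.linalg.eigh(A)

M_hat = B @ V                                                 # columns = sqrt(w_j) * mu_j, up to sign/permutation
M_hat /= M_hat.sum(axis=0, keepdims=True)                     # rescale columns to probability vectors
print(np.round(M_hat, 3))                                     # matches columns of M, permuted
print(np.round(M, 3))
```
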
SLIDE 77

Symmetric plug-in estimator

Recall: W := U (U⊤ Pairs U)^{-1/2}, B := U (U⊤ Pairs U)^{1/2}. Then

  Triples(W, W, W) = Σi∈[k] λi (vi ⊗ vi ⊗ vi)

where [v1 | v2 | · · · | vk] = (U⊤ Pairs U)^{-1/2} (U⊤ M diag(w)^{1/2}) is orthogonal.

Therefore B vi is the i-th column of M scaled by √wi.

SLIDE 78
  • 9. Hidden Markov models
SLIDE 80

Hidden Markov models

[Diagram: HMM with hidden states h1, h2, h3, . . . , hℓ and observations x1, x2, x3, . . . , xℓ]

Parameters (π, T, O):
  Pr[h1 = ei] = πi, i ∈ [k]
  Pr[ht+1 = ei | ht = ej] = Ti,j, i, j ∈ [k]
  E[xt | ht = ej] = O ej, j ∈ [k].

As a latent class model (with hidden variable h = h2):
  w := T π
  M1 := O diag(π) T⊤ diag(T π)^{-1}
  M2 := O
  M3 := O T.

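A small numpy sketch (my own illustration) of this mapping, starting from made-up HMM parameters (π, T, O) and checking that the resulting M1 and M3 have probability-vector columns:

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 3, 5
pi = rng.dirichlet(np.ones(k))                       # initial state distribution
T = rng.dirichlet(np.ones(k), size=k).T              # T[i, j] = Pr[h_{t+1} = e_i | h_t = e_j]
O = rng.dirichlet(np.ones(d), size=k).T              # O[:, j] = E[x_t | h_t = e_j]

# View (x_1, x_2, x_3) as a latent class model with hidden variable h = h_2:
w  = T @ pi                                          # distribution of h_2
M1 = O @ np.diag(pi) @ T.T @ np.linalg.inv(np.diag(T @ pi))   # E[x_1 | h_2], via Bayes rule
M2 = O                                               # E[x_2 | h_2]
M3 = O @ T                                           # E[x_3 | h_2]

print(np.allclose(M1.sum(axis=0), 1.0), np.allclose(M3.sum(axis=0), 1.0))  # True True
```
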
SLIDE 81
  • 10. Comparison to HKZ
SLIDE 82

Comparison to previous spectral methods

◮ Previous works for estimating observable operator models for HMMs and other sequence / fixed-tree models (Hsu-Kakade-Zhang, ’09; Langford-Salakhutdinov-Zhang, ’09; Siddiqi-Boots-Gordon, ’10; Song et al, ’10; Foster et al, ’11; Parikh et al, ’11; Song et al, ’11; Cohen et al, ’12; Balle et al, ’12; etc.)

◮ Based on regression idea: best prediction of xt+1 given history x≤t.

◮ Observable operator model (Jaeger, ’00) provides a way to predict further ahead: xt+1, xt+2, . . . .

◮ This work: Eigendecomposition method is rather different — looks for skewed directions using third-order moments. (Related to looking for kurtotic directions using fourth-order moments, like ICA.)

◮ Can recover actual HMM parameters (transition and emission matrices).