Beyond the graphical Lasso: Structure learning via inverse covariance estimation (PowerPoint PPT presentation)

SLIDE 1

Beyond the graphical Lasso: Structure learning via inverse covariance estimation

Po-Ling Loh

UC Berkeley Department of Statistics

ICML Workshop on Covariance Selection and Graphical Model Structure Learning

June 26, 2014

Joint work with Martin Wainwright (UC Berkeley) & Peter Bühlmann (ETH Zürich)

SLIDES 2-3

Outline

1. Introduction
2. Generalized inverse covariances
3. Linear structural equation models
4. Corrupted data

SLIDES 4-6

Undirected graphical models

Undirected graph G = (V, E); joint distribution of (X₁, …, X_p), where |V| = p

[Figure: chain of nodes X₁, X₂, X₃, …, X_p]

Markov property: (s, t) ∉ E ⟹ X_s ⊥⊥ X_t | X_{V∖{s,t}}

[Figure: node sets A and B separated by S]

More generally, X_A ⊥⊥ X_B | X_S when S ⊆ V separates A from B

SLIDE 7

Directed graphical models

Directed acyclic graph G = (V, E)

[Figure: DAG on nodes X₁, X₂, X₃, …, X_p]

Markov property: X_j ⊥⊥ X_{Nondesc(j)} | X_{Pa(j)}, for all j

SLIDES 8-11

Structure learning

Goal: Edge recovery from n samples {(X_1^{(i)}, X_2^{(i)}, …, X_p^{(i)})}_{i=1}^n

High-dimensional setting: p ≫ n; assume deg(G) ≤ d

Sources of corruption: non-i.i.d. observations, contamination by noise/missing data

Note: Structure learning is generally harder for directed graphs (topological order unknown)

SLIDES 12-13

Graphical Lasso

When (X₁, …, X_p) ∼ N(0, Σ), well-known fact: (Σ⁻¹)_{st} = 0 ⟺ (s, t) ∉ E

This establishes statistical consistency of the graphical Lasso (Yuan & Lin '07):

Θ̂ ∈ argmin_{Θ ≻ 0} { trace(Σ̂Θ) − log det(Θ) + λ Σ_{s≠t} |Θ_{st}| }
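As a concrete illustration, here is a minimal sketch of fitting this estimator with scikit-learn's GraphicalLasso (an assumption: scikit-learn is available, and its alpha parameter plays the role of λ; the chain-structured Θ* below is a made-up example, not from the talk):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)

# Hypothetical ground truth: a chain-structured (tridiagonal) precision matrix
p = 10
Theta_star = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
Sigma_star = np.linalg.inv(Theta_star)

# Draw n Gaussian samples and fit the graphical Lasso
n = 500
X = rng.multivariate_normal(np.zeros(p), Sigma_star, size=n)
model = GraphicalLasso(alpha=0.05).fit(X)   # alpha plays the role of lambda
Theta_hat = model.precision_

# Off-diagonal nonzeros of Theta_hat give the estimated edge set
est_edges = (np.abs(Theta_hat) > 1e-4) & ~np.eye(p, dtype=bool)
print(np.argwhere(np.triu(est_edges)))      # expect the chain pairs (j, j+1)
```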

SLIDES 14-17

Some observations

The only sample-based quantity is Σ̂:

Θ̂ ∈ argmin_{Θ ≻ 0} { trace(Σ̂Θ) − log det(Θ) + λ Σ_{s≠t} |Θ_{st}| }

Although the graphical Lasso is a penalized Gaussian MLE, it can always be used to estimate Θ from Σ̂, since

(Σ*)⁻¹ = argmin_{Θ ≻ 0} { trace(Σ*Θ) − log det(Θ) }

We extend the graphical Lasso to discrete-valued data (undirected case) and linear structural equation models (directed case)
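The second display is just the population first-order condition: the gradient of trace(Σ*Θ) − log det(Θ) is Σ* − Θ⁻¹, which vanishes at Θ = (Σ*)⁻¹, and the objective is strictly convex over the positive-definite cone. A small numerical sanity check (a sketch; the random Σ* is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4
A = rng.normal(size=(p, p))
Sigma_star = A @ A.T + p * np.eye(p)    # an arbitrary covariance matrix
Theta0 = np.linalg.inv(Sigma_star)      # claimed minimizer

def f(Theta):
    # Unpenalized objective: trace(Sigma* Theta) - log det(Theta)
    return np.trace(Sigma_star @ Theta) - np.linalg.slogdet(Theta)[1]

# Perturbing Theta0 in any symmetric direction should increase f,
# consistent with Theta0 being the unique minimizer of a strictly
# convex function on the positive-definite cone.
for _ in range(5):
    D = rng.normal(size=(p, p))
    D = (D + D.T) / 2
    D /= np.linalg.norm(D, 2)           # keep the perturbation small
    for t in (1e-3, 1e-2):
        assert f(Theta0 + t * D) > f(Theta0)
print("trace(Sigma* Theta) - log det(Theta) is minimized at inv(Sigma*)")
```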

SLIDES 18-20

Theory for graphical Lasso

If ‖Σ̂ − Σ*‖_max ≲ √(log p / n) and λ ≍ √(log p / n), then

‖Θ̂ − Θ*‖_max ≲ √(log p / n) + λ

Deviation condition holds w.h.p. for various ensembles (e.g., sub-Gaussian)

Thresholding Θ̂ at level √(log p / n) yields the correct support

SLIDE 21

Outline

1. Introduction
2. Generalized inverse covariances
3. Linear structural equation models
4. Corrupted data

SLIDES 22-24

Non-Gaussian distributions

(Liu et al. '09, '12): (X₁, …, X_p) follows a nonparanormal distribution if (f₁(X₁), …, f_p(X_p)) ∼ N(0, Σ), with the f_j's monotone and differentiable

Then (i, j) ∉ E iff Θ_{ij} = 0

In the general non-Gaussian setting, the relationship between the entries of Θ = Σ⁻¹ and the edges of G is unknown
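In the nonparanormal case, a standard practical recipe (the rank-based SKEPTIC estimator of Liu et al. '12) replaces the sample covariance with an estimate built from Kendall's τ via the Gaussian-copula identity ρ = sin(πτ/2). A hedged sketch, assuming SciPy and scikit-learn are available:

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.covariance import graphical_lasso

def skeptic_correlation(X):
    """Rank-based correlation estimate, invariant to monotone marginal
    transforms: Sigma_hat[j, k] = sin(pi/2 * kendalltau(X_j, X_k))."""
    n, p = X.shape
    S = np.eye(p)
    for j in range(p):
        for k in range(j + 1, p):
            tau, _ = kendalltau(X[:, j], X[:, k])
            S[j, k] = S[k, j] = np.sin(np.pi * tau / 2)
    return S

# Gaussian chain data pushed through monotone marginal transforms f_j
rng = np.random.default_rng(2)
p, n = 5, 2000
Theta_star = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
Z = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Theta_star), size=n)
X = np.column_stack([np.exp(Z[:, 0]), Z[:, 1] ** 3, Z[:, 2],
                     np.tanh(Z[:, 3]), Z[:, 4]])

S_hat = skeptic_correlation(X)
_, Theta_hat = graphical_lasso(S_hat, alpha=0.05)  # plug into the glasso
print(np.round(Theta_hat, 2))   # support should match the chain
```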

SLIDES 25-26

Discrete graphical models

Assume the X_i's take values in a discrete set {0, 1, …, m − 1}

Our results:
- Establish a relationship between augmented inverse covariance matrices and edge structure
- New algorithms for structure learning in discrete graphs

SLIDES 27-28

An illustrative example

Binary Ising model:

P_θ(x₁, …, x_p) ∝ exp( Σ_{s∈V} θ_s x_s + Σ_{(s,t)∈E} θ_{st} x_s x_t ),

with θ ∈ R^{p + (p choose 2)} and (x₁, …, x_p) ∈ {0, 1}^p

SLIDES 29-31

An illustrative example

Ising models with θ_s = 0.1, θ_{st} = 2

[Figure: 4-node chain X₁–X₂–X₃–X₄ and 4-node loop X₁–X₂–X₃–X₄–X₁]

Θ_chain =
[  9.80  −3.59    0      0
  −3.59  34.30  −4.77    0
    0    −4.77  34.30  −3.59
    0      0    −3.59   9.80 ]

Θ_loop =
[ 51.37  −5.37  −0.17  −5.37
  −5.37  51.37  −5.37  −0.17
  −0.17  −5.37  51.37  −5.37
  −5.37  −0.17  −5.37  51.37 ]

Θ is graph-structured for the chain, but not for the loop
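For p = 4 these matrices can be reproduced by brute force, since the Ising distribution can be enumerated over all 2⁴ = 16 states (a self-contained NumPy sketch, exact computation with no sampling):

```python
import numpy as np
from itertools import product

def ising_precision(edges, p=4, theta_s=0.1, theta_st=2.0):
    """Exact inverse covariance of a binary Ising model, computed by
    enumerating all 2^p states (x_1, ..., x_p) in {0,1}^p."""
    states = np.array(list(product([0, 1], repeat=p)), dtype=float)
    logw = theta_s * states.sum(axis=1)
    for (s, t) in edges:
        logw += theta_st * states[:, s] * states[:, t]
    w = np.exp(logw)
    w /= w.sum()                         # P_theta(x) for every state x
    mean = w @ states
    cov = (states - mean).T @ ((states - mean) * w[:, None])
    return np.linalg.inv(cov)

chain = [(0, 1), (1, 2), (2, 3)]
loop = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(np.round(ising_precision(chain), 2))  # zeros exactly at non-edges
print(np.round(ising_precision(loop), 2))   # all entries nonzero
```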

SLIDES 32-33

An illustrative example

[Figure: chain (Θ graph-structured) vs. loop (Θ not graph-structured)]

However, letting Γ_aug = Cov(X₁, X₂, X₃, X₄, X₁X₃)⁻¹ for the loop:

Γ_aug ∝
[  115   −2   109   −2  −114
    −2    5    −2    0     1
   109   −2   114   −2  −114
    −2    0    −2    5     1
  −114    1  −114    1   119 ]

(note the zero entry pairing X₂ with X₄)
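The same brute-force computation verifies the augmented matrix: append the statistic X₁X₃ to the state vector before taking the covariance, and the entry pairing X₂ with X₄ vanishes (a self-contained sketch):

```python
import numpy as np
from itertools import product

# Loop X1-X2-X3-X4-X1, augmented with the statistic X1*X3
theta_s, theta_st = 0.1, 2.0
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
states = np.array(list(product([0, 1], repeat=4)), dtype=float)
logw = theta_s * states.sum(axis=1)
for (s, t) in edges:
    logw += theta_st * states[:, s] * states[:, t]
w = np.exp(logw)
w /= w.sum()

# Augmented feature vector phi = (X1, X2, X3, X4, X1*X3)
phi = np.column_stack([states, states[:, 0] * states[:, 2]])
mean = w @ phi
cov = (phi - mean).T @ ((phi - mean) * w[:, None])
Gamma_aug = np.linalg.inv(cov)

# The (X2, X4) entry is zero: nodes 2 and 4 share no clique in the
# triangulation {1,2,3}, {1,3,4} of the loop
print(np.round(Gamma_aug, 4))
```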

   

SLIDES 34-38

Notation

Assume (X₁, …, X_p) ∈ {0, …, m − 1}^p

For any subset U ⊆ V, associate a vector φ_U of sufficient statistics
- Ex: When m = 2 and U = {1, 2}, φ_U = (x₁, x₂, x₁x₂)
- Ex: When U = {1}, φ_U = (I{x₁ = 1}, …, I{x₁ = m − 1})

In general: each clique C ∈ C has (m − 1)^{|C|} indicators of nonzero states

SLIDES 39-43

Augmented covariance matrices

- Triangulate G
- Form a junction tree with separator sets
- Let S₊ = nodes + separator sets

[Figure: loop G; triangulated graph with chord (1,3); junction tree with cliques {1,2,3}, {1,3,4} and separator {1,3}; augmented matrix Cov(φ_{S₊}) indexed by X₁, X₂, X₃, X₄, X₁X₃]

Theorem

The inverse covariance matrix Γ of {φ_U : U ∈ S₊}, from any junction tree triangulation, is graph-structured: Γ_{A,B} ≠ 0 only if A and B are contained in a common clique

SLIDES 44-45

Example: Binary Ising model

[Figure: loop G; triangulation with chord (1,3); junction tree {1,2,3} – {1,3} – {1,3,4}; augmented matrix Cov(φ_{S₊}) over X₁, X₂, X₃, X₄, X₁X₃]

Γ = (Cov(φ_{S₊}))⁻¹ ∝
[  115   −2   109   −2  −114
    −2    5    −2    0     1
   109   −2   114   −2  −114
    −2    0    −2    5     1
  −114    1  −114    1   119 ]

The statistics included in φ_{S₊} depend on the triangulation:

[Figure: alternative triangulation with chord (2,4); junction tree {1,2,4} – {2,4} – {2,3,4}; augmented matrix over X₁, X₂, X₃, X₄, X₂X₄]

SLIDES 46-48

Consequences for trees

When there exists a triangulation with singleton separator sets, S₊ = {1, …, p}

Corollary

When G has only singleton separators, the inverse covariance matrix of the sufficient statistics on the nodes is graph-structured

[Figure: tree on X₁, X₂, X₃, X₄, and the graph-structured matrix (Cov(X₁, …, X_p))⁻¹]

SLIDES 49-51

Proof sketch

Based on the exponential family representation of the density:

q_θ(x₁, …, x_p) = exp( Σ_{C∈C} ⟨θ_C, I_C⟩ − Φ(θ) )

(Cov_θ[I(X)])⁻¹ = ∇²Φ*(μ), where Φ*(μ) := sup_{θ∈R^D} { ⟨μ, θ⟩ − Φ(θ) }

Relationship between Φ* and entropy:

−Φ*(μ) = H(q_{θ(μ)}(x)) = −Σ_x q_{θ(μ)}(x) log q_{θ(μ)}(x)

SLIDES 52-53

Proof sketch

Junction tree theorem:

q(x₁, …, x_p) = Π_{C∈C} q_C(x_C) / Π_{S∈S} q_S(x_S),

so

H(q) = Σ_{C∈C} H_C(q_C) − Σ_{S∈S} H_S(q_S)

Then take the Hessian: each entropy term depends only on the mean parameters of its own clique or separator, so the Hessian of −Φ* vanishes between statistics that share no common clique

SLIDES 54-57

Structure learning

Plug the sample covariance matrix of the augmented vector into the graphical Lasso, then compute supp(Θ̂):

Θ̂ ∈ argmin_{Θ ≻ 0} { trace(Σ̂Θ) − log det(Θ) + λ Σ_{s≠t} |Θ_{st}| }

When the graph has singleton separators, the ordinary graphical Lasso suffices

Corollary

For binary Ising models with singleton separators, the graphical Lasso succeeds w.h.p. when n ≳ d² log p

Group graphical Lasso for m > 2, with similar theoretical guarantees
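A sketch of the resulting pipeline for a chain (singleton separators), with exact sampling by state enumeration; the regularization level and threshold constant below are illustrative choices, not values from the talk:

```python
import numpy as np
from itertools import product
from sklearn.covariance import graphical_lasso

rng = np.random.default_rng(3)

# Sample a 4-node chain Ising model by enumerating its 16 states
p, theta_s, theta_st = 4, 0.1, 2.0
edges = [(0, 1), (1, 2), (2, 3)]
states = np.array(list(product([0, 1], repeat=p)), dtype=float)
logw = theta_s * states.sum(axis=1)
for (s, t) in edges:
    logw += theta_st * states[:, s] * states[:, t]
w = np.exp(logw)
w /= w.sum()
X = states[rng.choice(len(states), size=5000, p=w)]

# Plug the sample covariance into the graphical Lasso, then threshold
n = X.shape[0]
Sigma_hat = np.cov(X, rowvar=False)
_, Theta_hat = graphical_lasso(Sigma_hat, alpha=0.01)
tau = np.sqrt(np.log(p) / n)            # thresholding level from the theory
est = (np.abs(Theta_hat) > tau) & ~np.eye(p, dtype=bool)
print(sorted(map(tuple, np.argwhere(np.triu(est)))))  # expect the chain edges
```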

SLIDE 58

Problem

However, the augmented vector depends on the structure of the graph . . .

[Figure: the two triangulations of the loop (chords (1,3) and (2,4)) yield different augmented matrices, over X₁, X₂, X₃, X₄, X₁X₃ and over X₁, X₂, X₃, X₄, X₂X₄ respectively]

SLIDES 59-61

Beyond graphical Lasso

Nodewise method: recovers the neighborhood N(s) for any fixed s ∈ V

Form a junction tree by fully connecting all nodes in V∖s

[Figure: graph G with node s and neighborhood N(s); triangulated graph; junction tree with cliques s ∪ N(s) and V∖s, separated by N(s)]

SLIDES 62-63

Inference for general graphs

By the theorem, the inverse covariance matrix over the nodes and the sufficient statistics of N(s) exposes the neighbors of s

[Figure: vertex set V with s and N(s), augmented by sufficient statistics of d-wise subsets of V∖s]

The same result holds for the matrix augmented by all d-subsets of V∖s

SLIDES 64-69

Nodewise algorithm

For each s ∈ V:
- Regress the sufficient statistics of X_s against the sufficient statistics of all subsets of V∖s of size ≤ d, using the Lasso
- Threshold the entries of the regression vector to obtain N̂(s)

Combine the estimates N̂(s) with AND/OR rules to recover the edges of the graph

Method succeeds w.h.p. for n ≳ 2^d log p

Can incorporate noisy/missing data into the Lasso-based regression

SLIDE 70

Simulations for nodewise method

[Figure: success probability vs. rescaled sample size n/log p, averaged over 500 trials. Left: Erdős–Rényi graph (d ≈ 3), p = 64, 128, 256. Right: grid-shaped graph (d = 4), p = 64, 144, 256.]

SLIDE 71

Outline

1. Introduction
2. Generalized inverse covariances
3. Linear structural equation models
4. Corrupted data

SLIDES 72-74

Linear structural equation models

[Figure: DAG on X₁, X₂, X₃, …, X_p]

Markov property: X_j ⊥⊥ X_{Nondesc(j)} | X_{Pa(j)}

Linear SEM: X_j = b_jᵀ X_{Pa(j)} + ε_j, with ε_j ⊥⊥ X_{Pa(j)}

Equivalently, X = BᵀX + ε, where X, ε ∈ R^p and B ∈ R^{p×p} is strictly upper triangular

Goal: Learn the support of B (B_{jk} ≠ 0 iff j → k is an edge in the DAG)
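A small sketch of drawing samples from such a model (the coefficients in B and the noise scales below are made-up illustrations):

```python
import numpy as np

rng = np.random.default_rng(5)
p, n = 5, 10000

# Strictly upper-triangular B encodes the DAG: B[j, k] != 0 iff j -> k
B = np.zeros((p, p))
B[0, 1], B[1, 2], B[0, 3], B[2, 4], B[3, 4] = 1.0, -0.8, 0.6, 0.7, -0.5

sigma = np.array([1.0, 0.8, 1.2, 0.9, 1.1])   # noise standard deviations
eps = rng.normal(size=(n, p)) * sigma

# X = B^T X + eps  <=>  X = (I - B^T)^{-1} eps; with rows as samples this
# reads X = eps @ inv(I - B)
X = eps @ np.linalg.inv(np.eye(p) - B)
```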

SLIDES 75-77

Inverse covariance matrix

Denote Cov(ε) = diag(σ₁², …, σ_p²) and Θ = Cov(X)⁻¹

Theorem

The inverse covariance matrix of X is given by

Θ_{jk} = −σ_k⁻² B_{jk} + Σ_{ℓ>k} σ_ℓ⁻² B_{jℓ} B_{kℓ},  for all j < k

Θ_{jj} = σ_j⁻² + Σ_{ℓ>j} σ_ℓ⁻² B_{jℓ}²,  for all j

⟹ Θ_{jk} ≠ 0 only when j → k is an edge or j and k are both parents of a common child ℓ
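Since X = (I − Bᵀ)⁻¹ε, the theorem follows from Θ = (I − B) Ω⁻¹ (I − B)ᵀ with Ω = diag(σ₁², …, σ_p²): expanding the product entrywise and using the strict upper-triangularity of B reproduces the two displays. A quick numerical check of this (with an arbitrary random B):

```python
import numpy as np

rng = np.random.default_rng(6)
p = 5

# Arbitrary strictly upper-triangular coefficients and noise variances
B = np.triu(rng.normal(size=(p, p)) * (rng.random((p, p)) < 0.4), k=1)
sigma2 = rng.uniform(0.5, 2.0, size=p)

# Closed form: Theta = (I - B) diag(1/sigma2) (I - B)^T
I = np.eye(p)
Theta = (I - B) @ np.diag(1 / sigma2) @ (I - B).T

# Entrywise formulas from the theorem
Theta_f = np.zeros((p, p))
for j in range(p):
    Theta_f[j, j] = 1 / sigma2[j] + sum(
        B[j, l] ** 2 / sigma2[l] for l in range(j + 1, p))
    for k in range(j + 1, p):
        Theta_f[j, k] = -B[j, k] / sigma2[k] + sum(
            B[j, l] * B[k, l] / sigma2[l] for l in range(k + 1, p))
        Theta_f[k, j] = Theta_f[j, k]

print(np.max(np.abs(Theta - Theta_f)))   # agreement up to float rounding
```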

SLIDES 78-80

Consequence for structure learning

Faithfulness assumption:

−σ_k⁻² B_{jk} + Σ_{ℓ>k} σ_ℓ⁻² B_{jℓ} B_{kℓ} = 0  only if  B_{jk} = 0 and B_{jℓ} B_{kℓ} = 0 for all ℓ > k

Under the faithfulness assumption, M(G) = supp(Θ), where M(G) is the moralized graph

Apply the graphical Lasso to estimate the moralized graph

SLIDES 81-83

Graphical Lasso for preprocessing

Score-based approaches for learning DAGs may be sped up given a superstructure such as the skeleton or moralized graph (Perrier et al. '08, Ordyniak & Szeider '12)

For linear SEMs, first apply the graphical Lasso to learn the moralized graph

Can also accommodate systematically corrupted data (next section)

SLIDE 84

Outline

1. Introduction
2. Generalized inverse covariances
3. Linear structural equation models
4. Corrupted data

SLIDES 85-88

Systematically corrupted data

Observe corrupted samples {(Z_1^{(i)}, Z_2^{(i)}, …, Z_p^{(i)})}_{i=1}^n, where Z^{(i)} is a noisy version of X^{(i)}

Examples:
- Additive noise: Z^{(i)} = X^{(i)} + W^{(i)}, with W^{(i)} ⊥⊥ X^{(i)}
- Missing data: Z_j^{(i)} = X_j^{(i)} with probability 1 − α, and ⋆ (missing) with probability α

Goal: Structure learning based on {(Z_1^{(i)}, Z_2^{(i)}, …, Z_p^{(i)})}_{i=1}^n

SLIDES 89-91

Modified graphical Lasso

Idea: Construct a surrogate for Σ̂ based on the corrupted samples {Z^{(i)}}_{i=1}^n

Additive noise:

Σ̂ = ZᵀZ/n − Σ_w

Missing data: Let Z̃_j^{(i)} = Z_j^{(i)} / (1 − α) if Z_j^{(i)} is observed, and 0 otherwise; use

Σ̂ = Z̃ᵀZ̃/n − α · diag(Z̃ᵀZ̃/n)
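A sketch of both surrogates for mean-zero data (the data-generating choices in the example are arbitrary; NaN marks a missing entry):

```python
import numpy as np

def surrogate_cov_additive(Z, Sigma_w):
    """Surrogate for Cov(X) when Z = X + W with known noise covariance."""
    n = Z.shape[0]
    return Z.T @ Z / n - Sigma_w

def surrogate_cov_missing(Z, alpha):
    """Surrogate for Cov(X) when each entry is missing independently with
    probability alpha; missing entries of Z are np.nan."""
    Zt = np.where(np.isnan(Z), 0.0, Z) / (1 - alpha)  # rescale, zero-fill
    n = Zt.shape[0]
    M = Zt.T @ Zt / n
    return M - alpha * np.diag(np.diag(M))

# Example: mask 10% of the entries of mean-zero Gaussian samples
rng = np.random.default_rng(7)
n, p, alpha = 5000, 4, 0.1
Sigma_x = np.eye(p) + 0.5
X = rng.multivariate_normal(np.zeros(p), Sigma_x, size=n)
Z = np.where(rng.random((n, p)) < alpha, np.nan, X)
print(np.round(surrogate_cov_missing(Z, alpha), 2))  # approx Sigma_x
```

Either surrogate can then be plugged into the graphical Lasso objective in place of the usual sample covariance; note that such surrogates need not be positive semidefinite, which is part of what the corrupted-data theory must handle.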

SLIDES 92-93

Theory for graphical Lasso

If ‖Σ̂ − Σ*‖_max ≲ √(log p / n) and λ ≍ √(log p / n), then

‖Θ̂ − Θ*‖_max ≲ √(log p / n) + λ

The deviation condition can be established w.h.p. for the modified estimators with corrupted data

SLIDE 94

Simulation study

Graphical Lasso for the dinosaur graph: probability of success in recovering 15 edges vs. rescaled sample size (with missing data)

[Figure: success probability vs. n/log p, averaged over 1000 trials, for missing-data levels ρ = 0, 0.05, 0.1, 0.15, 0.2]

SLIDES 95-97

Summary

Significance of the inverse covariance matrix for non-Gaussian data:
- For discrete variables, the inverse of the augmented covariance matrix is graph-structured
- For linear SEMs, the support of the inverse covariance is the moralized graph

Use the graphical Lasso to estimate the (augmented) inverse

Nodewise method for general discrete-valued graphs

Modifications for corrupted data

SLIDE 98

Open questions

- A computationally tractable method for structure learning in general discrete graphs
- Robustness results: inverse covariance matrices of approximately Gaussian and/or approximately tree-structured graphs
- More general analysis of inverse covariances via the exponential family representation

SLIDE 99

References

- P. Loh and P. Bühlmann (2013). High-dimensional learning of linear causal networks via inverse covariance estimation. arXiv preprint.
- P. Loh and M.J. Wainwright (2012). High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. Annals of Statistics.
- P. Loh and M.J. Wainwright (2013). Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses. Annals of Statistics.
