SLIDE 1

LOCAL and GLOBAL INDEPENDENCE of PARAMETERS in DISCRETE BAYESIAN GRAPHICAL MODELS

Jacek Wesołowski (GUS & Politechnika Warszawska, Warszawa)
XLII Konferencja "STATYSTYKA MATEMATYCZNA", Będlewo, Nov. 28 - Dec. 2, 2016
with H. Massam (York Univ., Toronto)

SLIDE 2

Plan

1. Introduction
2. Markovian structure imposed by DAGs
3. Local and global independence vs. HD law

SLIDE 3

1. Introduction
2. Markovian structure imposed by DAGs
3. Local and global independence vs. HD law

SLIDE 4

Discrete model

Let X = (X_v, v ∈ V) be a random vector assuming values in I = ×_{v∈V} I_v, where #(I_v) < ∞, v ∈ V. We write p(i) := P_X(i) = P(X = i), i ∈ I. Let X_1, ..., X_n be iid with distribution P_X. Let M_i = Σ_{j=1}^n I(X_j = i), i ∈ I. Then M = (M_i, i ∈ I) has a multinomial distribution, i.e.

P(M = m) = (n! / ∏_{i∈I} m_i!) ∏_{i∈I} p(i)^{m_i},  m = (m_i, i ∈ I),  Σ_{i∈I} m_i = n.
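The count vector M is easy to simulate. A minimal numpy sketch (the cells and the probability vector are illustrative), assuming binary coordinates so that I = {0,1}^2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Cells of I = {0,1}^2 and a probability vector pi = (p(i), i in I)
cells = [(0, 0), (0, 1), (1, 0), (1, 1)]
p = np.array([0.1, 0.2, 0.3, 0.4])

n = 1000
# Draw X_1, ..., X_n iid from P_X and form the counts M_i = sum_j I(X_j = i)
draws = rng.choice(len(cells), size=n, p=p)
M = np.bincount(draws, minlength=len(cells))

# M = (M_i, i in I) is multinomial(n, pi); the counts add up to n
assert M.sum() == n
```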

SLIDE 5

Dirichlet law as an a priori distribution

The Bayesian approach means that one imposes some distribution on π = (p(i), i ∈ I). Since the only restrictions on π are p(i) ≥ 0, i ∈ I, and Σ_{i∈I} p(i) = 1, we need a probability measure supported on a unit simplex of proper dimension. A random vector (Y_1, ..., Y_r) has a (classical) Dirichlet distribution D(α_i, i = 1, ..., r) if the density of the distribution of (Y_1, ..., Y_{r−1}) has the form

f(y_1, ..., y_{r−1}) = [Γ(α) / ∏_{i=1}^r Γ(α_i)] ∏_{i=1}^r y_i^{α_i − 1} I_{T_r}(y),

where α = Σ_{i=1}^r α_i and y_r = 1 − y_1 − ... − y_{r−1}.

SLIDE 6

Dirichlet conjugacy and moments

If π = (p(i), i ∈ I) has a Dirichlet distribution D(α_i, i ∈ I), then the a posteriori law is also Dirichlet: π|M ∼ D(α_i + M_i, i ∈ I).

Exercise: Prove conjugacy of the Dirichlet law using only the form of its joint moments

E ∏_{i∈I} p(i)^{r_i} = ∏_{i∈I} (α_i)_{r_i} / (α)_r,

where r = Σ_{i∈I} r_i and (a)_s = Γ(a+s)/Γ(a).

Note that in this case the moments uniquely determine the distribution.
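The exercise can be checked numerically: by Bayes' rule on moments, E[∏ p(i)^{r_i} | M = m] = E ∏ p(i)^{r_i + m_i} / E ∏ p(i)^{m_i}, and this ratio coincides with the moment of D(α_i + m_i). A sketch (all numbers are illustrative):

```python
from math import gamma, prod, isclose

def poch(a, s):
    # Pochhammer symbol (a)_s = Gamma(a + s) / Gamma(a)
    return gamma(a + s) / gamma(a)

def dir_moment(alpha, r):
    # E prod_i p(i)^{r_i} = prod_i (alpha_i)_{r_i} / (alpha)_r for D(alpha)
    return prod(poch(a, ri) for a, ri in zip(alpha, r)) / poch(sum(alpha), sum(r))

alpha = [1.5, 2.0, 0.5]   # prior Dirichlet parameters (illustrative)
m = [3, 1, 2]             # observed multinomial counts
r = [2, 0, 1]             # moment multi-index

# Bayes rule on moments: E[prod p^r | M = m] = E[prod p^{r+m}] / E[prod p^m]
posterior_moment = (dir_moment(alpha, [ri + mi for ri, mi in zip(r, m)])
                    / dir_moment(alpha, m))

# Moment of the claimed posterior D(alpha_i + m_i)
conjugate_moment = dir_moment([a + mi for a, mi in zip(alpha, m)], r)

assert isclose(posterior_moment, conjugate_moment)
```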

SLIDE 7

Example

Let X = (X_1, X_2, X_3) assume values in I = {0,1}^3. Obviously,

P(X = i) = P(X_1 = i_1) P(X_2 = i_2 | X_1 = i_1) P(X_3 = i_3 | X_1 = i_1, X_2 = i_2).

This is different from the Markov structure imposed by

P(X = i) = P(X_1 = i_1) P(X_2 = i_2 | X_1 = i_1) P(X_3 = i_3 | X_2 = i_2),

associated with the ordered graph 1 → 2 → 3 and equivalent to

p(000) p(101) = p(100) p(001)   (1)

and

p(010) p(111) = p(110) p(011).   (2)
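The equivalence can be checked mechanically: any joint law built from the 1 → 2 → 3 factorization satisfies (1) and (2). A sketch with arbitrary (illustrative) conditional tables:

```python
from math import isclose

# P(X1), P(X2|X1), P(X3|X2) for binary variables (illustrative numbers)
p1 = {0: 0.3, 1: 0.7}
p2g1 = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.2, 1: 0.8}}   # p2g1[i1][i2]
p3g2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}   # p3g2[i2][i3]

def p(i1, i2, i3):
    # Markov factorization for the ordered graph 1 -> 2 -> 3
    return p1[i1] * p2g1[i1][i2] * p3g2[i2][i3]

# (1): p(000)p(101) = p(100)p(001)
assert isclose(p(0, 0, 0) * p(1, 0, 1), p(1, 0, 0) * p(0, 0, 1))
# (2): p(010)p(111) = p(110)p(011)
assert isclose(p(0, 1, 0) * p(1, 1, 1), p(1, 1, 0) * p(0, 1, 1))
```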

SLIDE 8

Example, cont.

Conditions (1) and (2) are equivalent to each of the Markov structures imposed by the two other ordered graphs with skeleton 1 − 2 − 3:

1 ← 2 ← 3, i.e. P(X = i) = P(X_1 = i_1 | X_2 = i_2) P(X_2 = i_2 | X_3 = i_3) P(X_3 = i_3);

1 ← 2 → 3, i.e. P(X = i) = P(X_1 = i_1 | X_2 = i_2) P(X_2 = i_2) P(X_3 = i_3 | X_2 = i_2).

SLIDE 9

Example, cont.: prior on π

We seek a convenient prior on π, which is a probability measure on a (5-dimensional) manifold in [0, ∞)^8 described by the equations

x_1 + ... + x_8 = 1,  x_1 x_2 = x_3 x_4,  x_5 x_6 = x_7 x_8.

Some Dirichlet-like distribution would be fine!

SLIDE 10

Example, cont.: one more ordered graph

The graph 1 → 2 ← 3 introduces a different Markov structure:

P(X = i) = P(X_1 = i_1) P(X_2 = i_2 | X_1 = i_1, X_3 = i_3) P(X_3 = i_3).

Equivalently,

(p(000)+p(010))(p(101)+p(111)) = (p(100)+p(110))(p(001)+p(011)).

So we seek a probability measure on a (6-dimensional) manifold in [0, ∞)^8 defined by

x_1 + ... + x_8 = 1,  (x_1 + x_5)(x_2 + x_6) = (x_3 + x_7)(x_4 + x_8).

SLIDE 11

1. Introduction
2. Markovian structure imposed by DAGs
3. Local and global independence vs. HD law

SLIDE 12

DAG

For a graph G = (V, E) define a DAG (directed acyclic graph) with skeleton G by changing all unordered edges in E into arrows in an acyclic way. A DAG can be identified with a parent function p : V → 2^V defined by p(v) = {w ∈ V : w → v}, v ∈ V, and having the "acyclicity" property: ∀ k ≥ 1, {v} ∩ p^k(v) = ∅. We will also use another function, q : V → 2^V, defined by q(v) = {v} ∪ p(v), v ∈ V.
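The parent-function view is easy to code. A sketch (helper name hypothetical) that checks the acyclicity property {v} ∩ p^k(v) = ∅ by iterating the parent map to its closure:

```python
def is_acyclic(parent):
    """Check that for every v and every k >= 1, v is not in p^k(v),
    where p^k iterates the parent function over sets of vertices."""
    for v in parent:
        ancestors = set(parent[v])          # p^1(v)
        while True:
            nxt = ancestors | {w for u in ancestors for w in parent[u]}
            if nxt == ancestors:            # closure reached
                break
            ancestors = nxt
        if v in ancestors:
            return False
    return True

# DAG 1 -> 2 -> 3 as a parent function p(v) = {w : w -> v}
chain = {1: set(), 2: {1}, 3: {2}}
# A directed cycle 1 -> 2 -> 3 -> 1
cycle = {1: {3}, 2: {1}, 3: {2}}

assert is_acyclic(chain)
assert not is_acyclic(cycle)
```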

SLIDE 13

p-Markov model

Let p be a DAG with a chordal skeleton G = (V, E). X (or π = (p(i), i ∈ I)) is called p-Markov iff

p(i) = P(X = i) = ∏_{v∈V} p^{v|p(v)}_{i_v | i_{p(v)}},  ∀ i ∈ I,

where p^{v|p(v)}_{i_v | i_{p(v)}} := P(X_v = i_v | X_{p(v)} = i_{p(v)}). Note that

p^{v|p(v)}_{m|n} = p^{q(v)}_{(n,m)} / p^{p(v)}_n,  m ∈ I_v, n ∈ I_{p(v)},

where p^A_n = Σ_{j∈I_{V\A}} p((j, n)) = P(X_A = n), n ∈ I_A, A ⊂ V.

SLIDE 14

Moral DAGs

A DAG p with chordal skeleton G = (V, E) is moral if ∀ v ∈ V the subgraph induced in G by p(v) ⊂ V is complete. π is p′-Markov for a moral DAG p′ with a chordal skeleton G iff π is p-Markov with respect to any moral DAG p with the same skeleton G. The family of DAGs with skeleton 1 − 2 − 3 splits into:

the moral DAGs 1 → 2 → 3, 1 ← 2 ← 3, 1 ← 2 → 3;

an immoral DAG 1 → 2 ← 3.

SLIDE 15

Cliques and separators

Let G = (V, E) be a chordal graph. Any maximal induced complete subgraph is called a clique. Denote by C the set of cliques of G. A perfect ordering of cliques is a numbering C_1, ..., C_K of the elements of C such that

∀ j = 2, ..., K ∃ i < j : S_j := C_j ∩ (∪_{l=1}^{j−1} C_l) ⊂ C_i.

S = {(S_1 = ∅), S_j, j = 2, ..., K} is called the set of separators.
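For a concrete chordal graph the separators S_j = C_j ∩ (C_1 ∪ ... ∪ C_{j−1}) are straightforward to compute. A sketch for the chain 1 − 2 − 3 − 4, whose clique ordering below is assumed perfect:

```python
def separators(cliques):
    # S_j = C_j ∩ (C_1 ∪ ... ∪ C_{j-1}); S_1 = ∅ by convention
    seps, seen = [set()], set(cliques[0])
    for C in cliques[1:]:
        seps.append(set(C) & seen)
        seen |= set(C)
    return seps

# Cliques of the chain 1 - 2 - 3 - 4, in a perfect ordering
cliques = [{1, 2}, {2, 3}, {3, 4}]
print(separators(cliques))  # [set(), {2}, {3}]
```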

SLIDE 16

G-Markov model

For a chordal G = (V, E) we say that X (or π) is G-Markov if

p(i) = ∏_{C∈C} p^C(i_C) / ∏_{S∈S} p^S(i_S),  i ∈ I,

where p^A(i_A) = P(X_A = i_A) and X_A = (X_v, v ∈ A), for A ⊂ V. Equivalently, X (or π) is p-Markov for (any) moral DAG p with skeleton G, i.e.

p(i) = ∏_{v∈V} p^{v|p(v)}_{i_v | i_{p(v)}},  i ∈ I.

Equivalently, X_w ⊥ X_v | X_{V\{w,v}} whenever {w, v} ∉ E.
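For the chain 1 − 2 − 3 the clique/separator formula and the DAG factorization can be cross-checked numerically: with cliques {1,2}, {2,3} and separator {2}, any chain-factorized joint satisfies p(i) = p^{12} p^{23} / p^{2}. A sketch with illustrative conditional tables:

```python
from itertools import product
from math import isclose

# Joint of binary (X1, X2, X3) built from the DAG 1 -> 2 -> 3 (illustrative)
p1 = [0.3, 0.7]
p2g1 = [[0.6, 0.4], [0.2, 0.8]]
p3g2 = [[0.9, 0.1], [0.5, 0.5]]
joint = {(a, b, c): p1[a] * p2g1[a][b] * p3g2[b][c]
         for a, b, c in product(range(2), repeat=3)}

def marg(keep):
    # p^A(i_A): sum the joint over the coordinates outside A
    out = {}
    for i, pi in joint.items():
        key = tuple(i[v] for v in keep)
        out[key] = out.get(key, 0.0) + pi
    return out

p12, p23, p2 = marg((0, 1)), marg((1, 2)), marg((1,))

# G-Markov factorization over cliques {1,2}, {2,3} and separator {2}
for (a, b, c), pi in joint.items():
    assert isclose(pi, p12[(a, b)] * p23[(b, c)] / p2[(b,)])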

SLIDE 17

Dawid & Lauritzen, Ann. Statist. (1993)

Assume that π is G-Markov, where G = (V, E) is a chordal graph. We say that π has a hyper-Dirichlet distribution, HD(α^C_m, m ∈ I_C, C ∈ C), iff its moments are

E ∏_{i∈I} p(i)^{r_i} = ∏_{C∈C} ∏_{m∈I_C} (α^C_m)_{r^C_m} / ∏_{S∈S} ∏_{n∈I_S} (α^S_n)_{r^S_n},

where for S ∋ S ⊂ C ∈ C,

α^S_n = Σ_{m∈I_{C\S}} α^C_{(m,n)},  n ∈ I_S,

and r^A_m = Σ_{n∈I_{V\A}} r_{(m,n)}, m ∈ I_A.

SLIDE 18

HD distribution

Equivalently, for any moral DAG p (with skeleton G), in the decomposition

p(i) = ∏_{v∈V} p^{v|p(v)}_{i_v | i_{p(v)}},  i ∈ I,

the vectors of conditional probabilities (p^{v|p(v)}_{i_v | i_{p(v)}}, i_v ∈ I_v), i_{p(v)} ∈ I_{p(v)}, v ∈ V, are independent and have classical Dirichlet distributions D(α^{v|p(v)}_{i_v | i_{p(v)}}, i_v ∈ I_v). Then ∀ C ∈ C and ∀ i_C ∈ I_C,

α^C_{i_C} = α^{v|p(v)}_{i_v | i_{p(v)}}  whenever C = {v} ∪ p(v) = q(v).
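This equivalent description gives a direct sampler: draw the conditional vectors independently from classical Dirichlet laws and multiply along a moral DAG. A numpy sketch for the chain DAG 1 → 2 → 3 with hypothetical hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hyperparameters alpha^{v|p(v)} for the DAG 1 -> 2 -> 3 (illustrative)
a1 = [1.0, 2.0]                       # for the root vector (p^1_0, p^1_1)
a2g1 = [[1.0, 1.0], [2.0, 3.0]]       # one row per value of X1
a3g2 = [[0.5, 0.5], [1.5, 2.5]]       # one row per value of X2

# Independent classical Dirichlet draws for each conditional vector
p1 = rng.dirichlet(a1)
p2g1 = np.array([rng.dirichlet(a) for a in a2g1])
p3g2 = np.array([rng.dirichlet(a) for a in a3g2])

# The induced pi = (p(i), i in I) is then hyper-Dirichlet distributed
pi = np.einsum('a,ab,bc->abc', p1, p2g1, p3g2)
assert np.isclose(pi.sum(), 1.0)
```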

SLIDE 19

Multinomial mixture

Let X_1, ..., X_m be observations on X and

M = (M_i = Σ_{k=1}^m I(X_k = i), i ∈ I).

The conditional law of M given π is a multinomial distribution with parameters m and π = (p(i), i ∈ I).

SLIDE 20

HD as a conjugate prior law

Th. If the a priori law of π is HD(α^C_m, m ∈ I_C, C ∈ C), then the posterior law of π|M is also hyper-Dirichlet, HD(α^C_m + M^C_m, m ∈ I_C, C ∈ C), where

M^C_m = Σ_{n∈I_{V\C}} M_{(m,n)},  m ∈ I_C.
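The clique-marginal counts M^C_m are plain sums of the full count table. A numpy sketch for three binary variables and the clique C = {1, 2} (counts are illustrative):

```python
import numpy as np

# Full count table M_i indexed by i = (i1, i2, i3) (illustrative counts)
M = np.array([[[5, 1], [2, 0]],
              [[3, 4], [1, 2]]])

# M^C_m = sum over the coordinates outside C; here C = {1, 2}, so sum out axis 2
M_C = M.sum(axis=2)

print(M_C)          # counts indexed by (i1, i2)
assert M_C.sum() == M.sum()
```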

SLIDE 21

Proof

The generalized Bayes rule reads

E[∏_{i∈I} p(i)^{r_i} | M = m] = E[∏_{i∈I} p(i)^{r_i} · (m!/∏_{i∈I} m_i!) ∏_{i∈I} p(i)^{m_i}] / E[(m!/∏_{i∈I} m_i!) ∏_{i∈I} p(i)^{m_i}] = E ∏_{i∈I} p(i)^{r_i + m_i} / E ∏_{i∈I} p(i)^{m_i}.

Apply the moment formula for the HD distribution in the numerator and the denominator:

E[∏_{i∈I} p(i)^{r_i} | M = m] = ∏_{C∈C} ∏_{j∈I_C} [(α^C_j)_{r^C_j + m^C_j} / (α^C_j)_{m^C_j}] · ∏_{S∈S} ∏_{n∈I_S} [(α^S_n)_{m^S_n} / (α^S_n)_{r^S_n + m^S_n}],

where m^A_n = Σ_{j∈I_{V\A}} m_{(n,j)}, n ∈ I_A, A ⊂ V.

SLIDE 22

Proof, cont.

Since (a)_{b+c} / (a)_b = (a + b)_c, the last formula gives

E[∏_{i∈I} p(i)^{r_i} | M = m] = ∏_{C∈C} ∏_{j∈I_C} (α^C_j + m^C_j)_{r^C_j} / ∏_{S∈S} ∏_{n∈I_S} (α^S_n + m^S_n)_{r^S_n}.
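The key step (a)_{b+c} / (a)_b = (a + b)_c is just the functional equation of Γ; a quick numeric check (values illustrative):

```python
from math import gamma, isclose

def poch(a, s):
    # Pochhammer symbol (a)_s = Gamma(a + s) / Gamma(a)
    return gamma(a + s) / gamma(a)

a, b, c = 2.5, 3, 4
# (a)_{b+c} / (a)_b = Gamma(a+b+c)/Gamma(a+b) = (a+b)_c
assert isclose(poch(a, b + c) / poch(a, b), poch(a + b, c))
```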

SLIDE 23

p-Dirichlet and P-Dirichlet distributions

Let p be a moral DAG with a chordal skeleton G = (V, E). A G-Markov random vector π has a p-Dirichlet law iff the random vectors (p^{v|p(v)}_{m|n}, m ∈ I_v), n ∈ I_{p(v)}, v ∈ V, have (classical) Dirichlet laws and are independent. Let P be a family of moral DAGs with a chordal skeleton G = (V, E). We say that a G-Markov π has a P-Dirichlet distribution if it has a p-Dirichlet law ∀ p ∈ P.

SLIDE 24

HD as a special P-Dirichlet law

Let P be the family of all moral DAGs with the chordal skeleton G. If a G-Markov π has a P-Dirichlet distribution, then π has an HD distribution.

Question: Can we have a similar description of the HD law through a smaller family P?

SLIDE 25

p-perfect ordering of cliques

Let p be a moral DAG with a (chordal) skeleton G = (V, E). A perfect ordering of cliques o = (C_1, ..., C_K) is called p-perfect (notation: o_p) if

∀ ℓ = 1, ..., K ∃ v ∈ C_ℓ \ S_ℓ : S_ℓ = p(v).

Lemma. For any moral DAG p there exists a p-perfect ordering of cliques.

SLIDE 26

Pairing a separator with a clique

For S ∈ S, C ∈ C such that S ⊂ C, we say that S and C are paired by a perfect ordering of cliques o = (C_1, ..., C_K) (notation: S →_o C) if

∃ ℓ ∈ {1, ..., K} : S = S_ℓ and C = C_ℓ.

We say that a family P of moral DAGs (with a chordal skeleton G = (V, E)) is a pairing family if ∀ S ∈ S, C ∈ C such that S ⊂ C, ∃ p ∈ P : S →_{o_p} C.

SLIDE 27

HD as a P-Dirichlet law

Th. (MW'16) Let P be a family of moral DAGs with a chordal skeleton G = (V, E). Assume that P is a pairing family and ∪_{p∈P} p(V) = S. If L is a P-Dirichlet law, then L is necessarily a hyper-Dirichlet distribution.

Of course, p(V) = {p(v), v ∈ V}.

SLIDE 28

1. Introduction
2. Markovian structure imposed by DAGs
3. Local and global independence vs. HD law

SLIDE 29

Independencies

If π (G-Markov wrt a chordal G = (V, E)) has a hyper-Dirichlet distribution, then for any moral DAG p with skeleton G:

the random vectors (p^{v|p(v)}_{i|n}, i ∈ I_v, n ∈ I_{p(v)}), v ∈ V, are independent (global independence of parameters, notation: GI(p));

for an arbitrary fixed v ∈ V the random vectors (p^{v|p(v)}_{i|n}, i ∈ I_v), n ∈ I_{p(v)}, are independent (local independence of parameters, notation: LI(p)).

SLIDE 30

Heckerman, Geiger and Chickering (1995); Geiger and Heckerman (1997)

Let G = (V, E) be a complete graph with d vertices. Any DAG (all such DAGs are moral) is uniquely determined by an ordering of vertices: p = (v_1, ..., v_d) iff #p(v_j) = j − 1, j = 1, ..., d. For a complete graph, they proved that (under some smoothness assumptions for densities) the independence conditions GI and LI wrt the DAGs p = (1, 2, 3, ..., d − 1, d) and p′ = (d, 1, 2, ..., d − 2, d − 1) imply that the distribution of π is necessarily classical hyper-Dirichlet.

SLIDE 31

Separation and characterization of P-Dirichlet

We say that a family of moral DAGs P (with a chordal skeleton G = (V, E)) is a separating family if ∀ v ∈ V ∃ p, p′ ∈ P : p(v) ≠ p′(v).

Th. (MW'16) Let π be G-Markov, where G = (V, E) is chordal. Let P be a separating family of moral DAGs with skeleton G. If ∀ p ∈ P the independence conditions GI(p) and LI(p) hold, then π has a P-Dirichlet distribution.

SLIDE 32

Characterization of the hyper-Dirichlet law

Cor. 0 Let P be a pairing and separating family of moral DAGs (with a chordal skeleton G = (V, E)) satisfying ∪_{p∈P} p(V) = S. If ∀ p ∈ P the independence conditions GI(p) and LI(p) hold, then π has a hyper-Dirichlet distribution.

SLIDE 33

The case of a chain

For a chain G = 1 − 2 − ... − d consider two DAGs

p = 1 → 2 → ... → d  and  p′ = 1 ← 2 ← ... ← d.

Cor. 1 If the random vectors (p^{j|j−1}_{ℓ|k}, ℓ ∈ I_j), k ∈ I_{j−1}, j = 1, ..., d (I_0 = ∅), are jointly independent and the random vectors (p^{j|j+1}_{ℓ|k}, ℓ ∈ I_j), k ∈ I_{j+1}, j = 1, ..., d (I_{d+1} = ∅), are also jointly independent, then π has a hyper-Dirichlet distribution.
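The two DAGs p and p′ parameterize the same π: the backward conditionals follow from the forward ones by Bayes' rule. A sketch for d = 3 with illustrative numbers:

```python
from itertools import product
from math import isclose

# Forward parameterization p = 1 -> 2 -> 3 (illustrative numbers)
p1 = [0.3, 0.7]
p2g1 = [[0.6, 0.4], [0.2, 0.8]]
p3g2 = [[0.9, 0.1], [0.5, 0.5]]
joint = {(a, b, c): p1[a] * p2g1[a][b] * p3g2[b][c]
         for a, b, c in product(range(2), repeat=3)}

# Marginals needed for the backward DAG p' = 1 <- 2 <- 3
p3 = [sum(v for k, v in joint.items() if k[2] == c) for c in range(2)]
p2 = [sum(v for k, v in joint.items() if k[1] == b) for b in range(2)]
p23 = {(b, c): sum(v for k, v in joint.items() if k[1:] == (b, c))
       for b, c in product(range(2), repeat=2)}
p12 = {(a, b): sum(v for k, v in joint.items() if k[:2] == (a, b))
       for a, b in product(range(2), repeat=2)}

# Backward conditionals via Bayes' rule: p(X2|X3) and p(X1|X2)
p2g3 = {(b, c): p23[(b, c)] / p3[c] for b, c in product(range(2), repeat=2)}
p1g2 = {(a, b): p12[(a, b)] / p2[b] for a, b in product(range(2), repeat=2)}

# Both parameterizations reproduce the same joint law
for (a, b, c), v in joint.items():
    assert isclose(v, p1g2[(a, b)] * p2g3[(b, c)] * p3[c])
```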

SLIDE 34

The case of a tree

Let T = (V, E) be a tree, i.e. a connected undirected graph without cycles. A vertex v ∈ V is a leaf if it has only one neighbour. Let L ⊂ V denote the set of leaves of the tree T. For a DAG p with skeleton T, a vertex v ∈ V is called a root if p(v) = ∅. If p is a moral DAG (with skeleton T), then the (unique) root vertex v determines the DAG uniquely (notation: p_v).

Cor. 2 Assume that π has the independence properties GI(p_v) and LI(p_v) ∀ v ∈ L. Then π has a hyper-Dirichlet distribution.

SLIDE 35

The case of a complete graph

Recall that any DAG with skeleton G being a complete graph (all such DAGs are moral) is uniquely determined by an ordering of vertices: (v_1, ..., v_d) means that #p(v_j) = j − 1, j = 1, ..., d. For two such DAGs, p = (v_1, ..., v_d) and p′ = (v′_1, ..., v′_d), consider the condition

∀ j = 2, ..., d  p(v_j) ≠ p′(v′_j).   (3)

Cor. 3 Assume that π has the independence properties GI and LI wrt p and p′ satisfying (3). Then π has a classical Dirichlet distribution.

SLIDE 36

Heckerman, Geiger and Chickering (1995), revisited

For a complete graph, HGC assumed the independence conditions GI and LI wrt the DAGs p = (1, 2, 3, ..., d − 1, d) and p′ = (d, 1, 2, ..., d − 2, d − 1). Note that for j = 2, 3, ..., d, d ∈ p′(v′_j) and d ∉ p(v_j), that is, condition (3) is satisfied. Consequently, the HGC characterization is an immediate consequence of Cor. 3. Therefore it holds without the regularity assumptions made in HGC (1995).

SLIDE 37

Literature

[1] ANDERSSON, S.A., MADIGAN, D., PERLMAN, M.D. (1997) A characterization of Markov equivalence classes for acyclic digraphs. Ann. Statist. 25, 505-541.
[2] DAWID, A.P., LAURITZEN, S.L. (1993) Hyper-Markov laws in the statistical analysis of decomposable graphical models. Ann. Statist. 21, 1272-1317.
[3] GEIGER, D., HECKERMAN, D. (1997) A characterization of the Dirichlet distribution through global and local parameter independence. Ann. Statist. 25, 1344-1369.
[4] HECKERMAN, D., GEIGER, D., CHICKERING, D.M. (1995) Learning Bayesian networks: The combination of knowledge and statistical data. Mach. Learn. 20, 197-243.
[5] LAURITZEN, S.L. (1996) Graphical Models. Oxford Univ. Press.
[6] MASSAM, H., WESOŁOWSKI, J. (2016) A new prior for discrete DAG models with a restricted set of directions. Ann. Statist. 44, 1010-1037 (with Supplement).