SLIDE 1

LOCAL and GLOBAL INDEPENDENCE of PARAMETERS in DISCRETE BAYESIAN GRAPHICAL MODELS

Jacek Wesołowski (GUS & Politechnika Warszawska, Warszawa)
XLII Konferencja "STATYSTYKA MATEMATYCZNA", Będlewo, Nov. 28 - Dec. 2, 2016
with H. Massam (York Univ., Toronto)

SLIDE 2

Plan

1. Introduction
2. Markovian structure imposed by DAGs
3. Local and global independence vs. HD law

SLIDE 3

1. Introduction
2. Markovian structure imposed by DAGs
3. Local and global independence vs. HD law

SLIDE 4

Discrete model

Let X = (X_v, v ∈ V) be a random vector assuming values in I = ×_{v∈V} I_v, where #(I_v) < ∞, v ∈ V. We write p(i) := P_X(i) = P(X = i), i ∈ I. Let X_1, ..., X_n be iid with distribution P_X. Let M_i = Σ_{j=1}^n I(X_j = i), i ∈ I. Then M = (M_i, i ∈ I) has a multinomial distribution, i.e.

P(M = m) = (n! / ∏_{i∈I} m_i!) ∏_{i∈I} p(i)^{m_i},  m = (m_i, i ∈ I),  Σ_{i∈I} m_i = n.
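The count vector M is easy to simulate. A minimal numpy sketch (the cells and the probability vector are illustrative), assuming binary coordinates so that I = {0,1}^2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Cells of I = {0,1}^2 and a probability vector pi = (p(i), i in I)
cells = [(0, 0), (0, 1), (1, 0), (1, 1)]
p = np.array([0.1, 0.2, 0.3, 0.4])

n = 1000
# Draw X_1, ..., X_n iid from P_X and form the counts M_i = sum_j I(X_j = i)
draws = rng.choice(len(cells), size=n, p=p)
M = np.bincount(draws, minlength=len(cells))

# M = (M_i, i in I) is multinomial(n, pi); the counts add up to n
assert M.sum() == n
```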

SLIDE 5

Dirichlet law as an a priori distribution

The Bayesian approach means that one imposes some distribution on π = (p(i), i ∈ I). Since the only restrictions on π are p(i) ≥ 0, i ∈ I, and Σ_{i∈I} p(i) = 1, we need a probability measure supported on a unit simplex of proper dimension. A random vector (Y_1, ..., Y_r) has a (classical) Dirichlet distribution D(α_i, i = 1, ..., r) if the density of the distribution of (Y_1, ..., Y_{r−1}) has the form

f(y_1, ..., y_{r−1}) = [Γ(α) / ∏_{i=1}^r Γ(α_i)] ∏_{i=1}^r y_i^{α_i − 1} I_{T_r}(y),

where α = Σ_{i=1}^r α_i and y_r = 1 − y_1 − ... − y_{r−1}.

SLIDE 6

Dirichlet conjugacy and moments

If π = (p(i), i ∈ I) has a Dirichlet distribution D(α_i, i ∈ I), then the a posteriori law is also Dirichlet: π|M ∼ D(α_i + M_i, i ∈ I).

Exercise: Prove conjugacy of the Dirichlet law using only the form of its joint moments

E ∏_{i∈I} p(i)^{r_i} = ∏_{i∈I} (α_i)_{r_i} / (α)_r,

where r = Σ_{i∈I} r_i and (a)_s = Γ(a+s)/Γ(a).

Note that in this case the moments uniquely determine the distribution.
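The exercise can be checked numerically: by Bayes' rule on moments, E[∏ p(i)^{r_i} | M = m] = E ∏ p(i)^{r_i + m_i} / E ∏ p(i)^{m_i}, and this ratio coincides with the moment of D(α_i + m_i). A sketch (all numbers are illustrative):

```python
from math import gamma, prod, isclose

def poch(a, s):
    # Pochhammer symbol (a)_s = Gamma(a + s) / Gamma(a)
    return gamma(a + s) / gamma(a)

def dir_moment(alpha, r):
    # E prod_i p(i)^{r_i} = prod_i (alpha_i)_{r_i} / (alpha)_r for D(alpha)
    return prod(poch(a, ri) for a, ri in zip(alpha, r)) / poch(sum(alpha), sum(r))

alpha = [1.5, 2.0, 0.5]   # prior Dirichlet parameters (illustrative)
m = [3, 1, 2]             # observed multinomial counts
r = [2, 0, 1]             # moment multi-index

# Bayes rule on moments: E[prod p^r | M = m] = E[prod p^{r+m}] / E[prod p^m]
posterior_moment = (dir_moment(alpha, [ri + mi for ri, mi in zip(r, m)])
                    / dir_moment(alpha, m))

# Moment of the claimed posterior D(alpha_i + m_i)
conjugate_moment = dir_moment([a + mi for a, mi in zip(alpha, m)], r)

assert isclose(posterior_moment, conjugate_moment)
```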

SLIDE 7

Example

Let X = (X_1, X_2, X_3) assume values in I = {0,1}^3. Obviously,

P(X = i) = P(X_1 = i_1) P(X_2 = i_2 | X_1 = i_1) P(X_3 = i_3 | X_1 = i_1, X_2 = i_2).

This is different from the Markov structure imposed by

P(X = i) = P(X_1 = i_1) P(X_2 = i_2 | X_1 = i_1) P(X_3 = i_3 | X_2 = i_2),

associated with the ordered graph 1 → 2 → 3 and equivalent to

p(000) p(101) = p(100) p(001)   (1)

and

p(010) p(111) = p(110) p(011).   (2)
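The equivalence can be checked mechanically: any joint law built from the 1 → 2 → 3 factorization satisfies (1) and (2). A sketch with arbitrary (illustrative) conditional tables:

```python
from math import isclose

# P(X1), P(X2|X1), P(X3|X2) for binary variables (illustrative numbers)
p1 = {0: 0.3, 1: 0.7}
p2g1 = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.2, 1: 0.8}}   # p2g1[i1][i2]
p3g2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}   # p3g2[i2][i3]

def p(i1, i2, i3):
    # Markov factorization for the ordered graph 1 -> 2 -> 3
    return p1[i1] * p2g1[i1][i2] * p3g2[i2][i3]

# (1): p(000)p(101) = p(100)p(001)
assert isclose(p(0, 0, 0) * p(1, 0, 1), p(1, 0, 0) * p(0, 0, 1))
# (2): p(010)p(111) = p(110)p(011)
assert isclose(p(0, 1, 0) * p(1, 1, 1), p(1, 1, 0) * p(0, 1, 1))
```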

SLIDE 8

Example, cont.

Conditions (1) and (2) are equivalent to each of the Markov structures imposed by the two other ordered graphs with skeleton 1 − 2 − 3:

1 ← 2 ← 3, i.e. P(X = i) = P(X_1 = i_1 | X_2 = i_2) P(X_2 = i_2 | X_3 = i_3) P(X_3 = i_3);

1 ← 2 → 3, i.e. P(X = i) = P(X_1 = i_1 | X_2 = i_2) P(X_2 = i_2) P(X_3 = i_3 | X_2 = i_2).

SLIDE 9

Example, cont.: prior on π

We seek a convenient prior on π, which is a probability measure on a (5-dimensional) manifold in [0, ∞)^8 described by the equations

x_1 + ... + x_8 = 1,  x_1 x_2 = x_3 x_4,  x_5 x_6 = x_7 x_8.

Some Dirichlet-like distribution would be fine!

SLIDE 10

Example, cont.: one more ordered graph

The graph 1 → 2 ← 3 introduces a different Markov structure:

P(X = i) = P(X_1 = i_1) P(X_2 = i_2 | X_1 = i_1, X_3 = i_3) P(X_3 = i_3).

Equivalently,

(p(000)+p(010))(p(101)+p(111)) = (p(100)+p(110))(p(001)+p(011)).

So we seek a probability measure on a (6-dimensional) manifold in [0, ∞)^8 defined by

x_1 + ... + x_8 = 1,  (x_1 + x_5)(x_2 + x_6) = (x_3 + x_7)(x_4 + x_8).

SLIDE 11

1. Introduction
2. Markovian structure imposed by DAGs
3. Local and global independence vs. HD law

SLIDE 12

DAG

For a graph G = (V, E) define a DAG (directed acyclic graph) with skeleton G by changing all unordered edges in E into arrows in an acyclic way. A DAG can be identified with a parent function p : V → 2^V defined by p(v) = {w ∈ V : w → v}, v ∈ V, and having the "acyclicity" property: ∀ k ≥ 1, {v} ∩ p^k(v) = ∅. We will also use another function, q : V → 2^V, defined by q(v) = {v} ∪ p(v), v ∈ V.
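The parent-function view is easy to code. A sketch (helper name hypothetical) that checks the acyclicity property {v} ∩ p^k(v) = ∅ by iterating the parent map to its closure:

```python
def is_acyclic(parent):
    """Check that for every v and every k >= 1, v is not in p^k(v),
    where p^k iterates the parent function over sets of vertices."""
    for v in parent:
        ancestors = set(parent[v])          # p^1(v)
        while True:
            nxt = ancestors | {w for u in ancestors for w in parent[u]}
            if nxt == ancestors:            # closure reached
                break
            ancestors = nxt
        if v in ancestors:
            return False
    return True

# DAG 1 -> 2 -> 3 as a parent function p(v) = {w : w -> v}
chain = {1: set(), 2: {1}, 3: {2}}
# A directed cycle 1 -> 2 -> 3 -> 1
cycle = {1: {3}, 2: {1}, 3: {2}}

assert is_acyclic(chain)
assert not is_acyclic(cycle)
```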

SLIDE 13

p-Markov model

Let p be a DAG with a chordal skeleton G = (V, E). X (or π = (p(i), i ∈ I)) is called p-Markov iff

p(i) = P(X = i) = ∏_{v∈V} p^{v|p(v)}_{i_v | i_{p(v)}},  ∀ i ∈ I,

where p^{v|p(v)}_{i_v | i_{p(v)}} := P(X_v = i_v | X_{p(v)} = i_{p(v)}). Note that

p^{v|p(v)}_{m|n} = p^{q(v)}_{(n,m)} / p^{p(v)}_n,  m ∈ I_v, n ∈ I_{p(v)},

where p^A_n = Σ_{j∈I_{V\A}} p((j, n)) = P(X_A = n), n ∈ I_A, A ⊂ V.

SLIDE 14

Moral DAGs

A DAG p with chordal skeleton G = (V, E) is moral if ∀ v ∈ V the subgraph induced in G by p(v) ⊂ V is complete. π is p′-Markov for a moral DAG p′ with a chordal skeleton G iff π is p-Markov with respect to any moral DAG p with the same skeleton G. The family of DAGs with skeleton 1 − 2 − 3 splits into:

the moral DAGs 1 → 2 → 3, 1 ← 2 ← 3, 1 ← 2 → 3;

an immoral DAG 1 → 2 ← 3.

SLIDE 15

Cliques and separators

Let G = (V, E) be a chordal graph. Any maximal induced complete subgraph is called a clique. Denote by C the set of cliques of G. A perfect ordering of cliques is a numbering C_1, ..., C_K of the elements of C such that

∀ j = 2, ..., K ∃ i < j : S_j := C_j ∩ (∪_{l=1}^{j−1} C_l) ⊂ C_i.

S = {(S_1 = ∅), S_j, j = 2, ..., K} is called the set of separators.
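For a concrete chordal graph the separators S_j = C_j ∩ (C_1 ∪ ... ∪ C_{j−1}) are straightforward to compute. A sketch for the chain 1 − 2 − 3 − 4, whose clique ordering below is assumed perfect:

```python
def separators(cliques):
    # S_j = C_j ∩ (C_1 ∪ ... ∪ C_{j-1}); S_1 = ∅ by convention
    seps, seen = [set()], set(cliques[0])
    for C in cliques[1:]:
        seps.append(set(C) & seen)
        seen |= set(C)
    return seps

# Cliques of the chain 1 - 2 - 3 - 4, in a perfect ordering
cliques = [{1, 2}, {2, 3}, {3, 4}]
print(separators(cliques))  # [set(), {2}, {3}]
```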

SLIDE 16

G-Markov model

For a chordal G = (V, E) we say that X (or π) is G-Markov if

p(i) = ∏_{C∈C} p^C(i_C) / ∏_{S∈S} p^S(i_S),  i ∈ I,

where p^A(i_A) = P(X_A = i_A) and X_A = (X_v, v ∈ A), for A ⊂ V. Equivalently, X (or π) is p-Markov for (any) moral DAG p with skeleton G, i.e.

p(i) = ∏_{v∈V} p^{v|p(v)}_{i_v | i_{p(v)}},  i ∈ I.

Equivalently, X_w ⊥ X_v | X_{V\{w,v}} whenever {w, v} ∉ E.
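For the chain 1 − 2 − 3 the clique/separator formula and the DAG factorization can be cross-checked numerically: with cliques {1,2}, {2,3} and separator {2}, any chain-factorized joint satisfies p(i) = p^{12} p^{23} / p^{2}. A sketch with illustrative conditional tables:

```python
from itertools import product
from math import isclose

# Joint of binary (X1, X2, X3) built from the DAG 1 -> 2 -> 3 (illustrative)
p1 = [0.3, 0.7]
p2g1 = [[0.6, 0.4], [0.2, 0.8]]
p3g2 = [[0.9, 0.1], [0.5, 0.5]]
joint = {(a, b, c): p1[a] * p2g1[a][b] * p3g2[b][c]
         for a, b, c in product(range(2), repeat=3)}

def marg(keep):
    # p^A(i_A): sum the joint over the coordinates outside A
    out = {}
    for i, pi in joint.items():
        key = tuple(i[v] for v in keep)
        out[key] = out.get(key, 0.0) + pi
    return out

p12, p23, p2 = marg((0, 1)), marg((1, 2)), marg((1,))

# G-Markov factorization over cliques {1,2}, {2,3} and separator {2}
for (a, b, c), pi in joint.items():
    assert isclose(pi, p12[(a, b)] * p23[(b, c)] / p2[(b,)])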

SLIDE 17

Dawid & Lauritzen, Ann. Statist. (1993)

Assume that π is G-Markov, where G = (V, E) is a chordal graph. We say that π has a hyper-Dirichlet distribution, HD(α^C_m, m ∈ I_C, C ∈ C), iff its moments are

E ∏_{i∈I} p(i)^{r_i} = ∏_{C∈C} ∏_{m∈I_C} (α^C_m)_{r^C_m} / ∏_{S∈S} ∏_{n∈I_S} (α^S_n)_{r^S_n},

where for S ∋ S ⊂ C ∈ C,

α^S_n = Σ_{m∈I_{C\S}} α^C_{(m,n)},  n ∈ I_S,

and r^A_m = Σ_{n∈I_{V\A}} r_{(m,n)}, m ∈ I_A.

SLIDE 18

HD distribution

Equivalently, for any moral DAG p (with skeleton G), in the decomposition

p(i) = ∏_{v∈V} p^{v|p(v)}_{i_v | i_{p(v)}},  i ∈ I,

the vectors of conditional probabilities (p^{v|p(v)}_{i_v | i_{p(v)}}, i_v ∈ I_v), i_{p(v)} ∈ I_{p(v)}, v ∈ V, are independent and have classical Dirichlet distributions D(α^{v|p(v)}_{i_v | i_{p(v)}}, i_v ∈ I_v). Then ∀ C ∈ C and ∀ i_C ∈ I_C,

α^C_{i_C} = α^{v|p(v)}_{i_v | i_{p(v)}}  whenever C = {v} ∪ p(v) = q(v).
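This equivalent description gives a direct sampler: draw the conditional vectors independently from classical Dirichlet laws and multiply along a moral DAG. A numpy sketch for the chain DAG 1 → 2 → 3 with hypothetical hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hyperparameters alpha^{v|p(v)} for the DAG 1 -> 2 -> 3 (illustrative)
a1 = [1.0, 2.0]                       # for the root vector (p^1_0, p^1_1)
a2g1 = [[1.0, 1.0], [2.0, 3.0]]       # one row per value of X1
a3g2 = [[0.5, 0.5], [1.5, 2.5]]       # one row per value of X2

# Independent classical Dirichlet draws for each conditional vector
p1 = rng.dirichlet(a1)
p2g1 = np.array([rng.dirichlet(a) for a in a2g1])
p3g2 = np.array([rng.dirichlet(a) for a in a3g2])

# The induced pi = (p(i), i in I) is then hyper-Dirichlet distributed
pi = np.einsum('a,ab,bc->abc', p1, p2g1, p3g2)
assert np.isclose(pi.sum(), 1.0)
```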

SLIDE 19

Multinomial mixture

Let X_1, ..., X_m be observations on X and

M = (M_i = Σ_{k=1}^m I(X_k = i), i ∈ I).

The conditional law of M given π is a multinomial distribution with parameters m and π = (p(i), i ∈ I).

SLIDE 20

HD as a conjugate prior law

Th. If the a priori law of π is HD(α^C_m, m ∈ I_C, C ∈ C), then the posterior law of π|M is also hyper-Dirichlet, HD(α^C_m + M^C_m, m ∈ I_C, C ∈ C), where

M^C_m = Σ_{n∈I_{V\C}} M_{(m,n)},  m ∈ I_C.
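The clique-marginal counts M^C_m are plain sums of the full count table. A numpy sketch for three binary variables and the clique C = {1, 2} (counts are illustrative):

```python
import numpy as np

# Full count table M_i indexed by i = (i1, i2, i3) (illustrative counts)
M = np.array([[[5, 1], [2, 0]],
              [[3, 4], [1, 2]]])

# M^C_m = sum over the coordinates outside C; here C = {1, 2}, so sum out axis 2
M_C = M.sum(axis=2)

print(M_C)          # counts indexed by (i1, i2)
assert M_C.sum() == M.sum()
```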

SLIDE 21

Proof

The generalized Bayes rule reads

E[∏_{i∈I} p(i)^{r_i} | M = m] = E[∏_{i∈I} p(i)^{r_i} · (m!/∏_{i∈I} m_i!) ∏_{i∈I} p(i)^{m_i}] / E[(m!/∏_{i∈I} m_i!) ∏_{i∈I} p(i)^{m_i}] = E ∏_{i∈I} p(i)^{r_i + m_i} / E ∏_{i∈I} p(i)^{m_i}.

Apply the moment formula for the HD distribution in the numerator and the denominator:

E[∏_{i∈I} p(i)^{r_i} | M = m] = ∏_{C∈C} ∏_{j∈I_C} [(α^C_j)_{r^C_j + m^C_j} / (α^C_j)_{m^C_j}] · ∏_{S∈S} ∏_{n∈I_S} [(α^S_n)_{m^S_n} / (α^S_n)_{r^S_n + m^S_n}],

where m^A_n = Σ_{j∈I_{V\A}} m_{(n,j)}, n ∈ I_A, A ⊂ V.

SLIDE 22

Proof, cont.

Since (a)_{b+c} / (a)_b = (a + b)_c, the last formula gives

E[∏_{i∈I} p(i)^{r_i} | M = m] = ∏_{C∈C} ∏_{j∈I_C} (α^C_j + m^C_j)_{r^C_j} / ∏_{S∈S} ∏_{n∈I_S} (α^S_n + m^S_n)_{r^S_n}.
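The key step (a)_{b+c} / (a)_b = (a + b)_c is just the functional equation of Γ; a quick numeric check (values illustrative):

```python
from math import gamma, isclose

def poch(a, s):
    # Pochhammer symbol (a)_s = Gamma(a + s) / Gamma(a)
    return gamma(a + s) / gamma(a)

a, b, c = 2.5, 3, 4
# (a)_{b+c} / (a)_b = Gamma(a+b+c)/Gamma(a+b) = (a+b)_c
assert isclose(poch(a, b + c) / poch(a, b), poch(a + b, c))
```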

SLIDE 23

p-Dirichlet and P-Dirichlet distributions

Let p be a moral DAG with a chordal skeleton G = (V, E). A G-Markov random vector π has a p-Dirichlet law iff the random vectors (p^{v|p(v)}_{m|n}, m ∈ I_v), n ∈ I_{p(v)}, v ∈ V, have (classical) Dirichlet laws and are independent. Let P be a family of moral DAGs with a chordal skeleton G = (V, E). We say that a G-Markov π has a P-Dirichlet distribution if it has a p-Dirichlet law ∀ p ∈ P.

SLIDE 24

HD as a special P-Dirichlet law

Let P be the family of all moral DAGs with the chordal skeleton G. If a G-Markov π has a P-Dirichlet distribution, then π has an HD distribution.

Question: Can we have a similar description of the HD law through a smaller family P?

SLIDE 25

p-perfect ordering of cliques

Let p be a moral DAG with a (chordal) skeleton G = (V, E). A perfect ordering of cliques o = (C_1, ..., C_K) is called p-perfect (notation: o_p) if

∀ ℓ = 1, ..., K ∃ v ∈ C_ℓ \ S_ℓ : S_ℓ = p(v).

Lemma. For any moral DAG p there exists a p-perfect ordering of cliques.

SLIDE 26

Pairing a separator with a clique

For S ∈ S, C ∈ C such that S ⊂ C, we say that S and C are paired by a perfect ordering of cliques o = (C_1, ..., C_K) (notation: S →_o C) if

∃ ℓ ∈ {1, ..., K} : S = S_ℓ and C = C_ℓ.

We say that a family P of moral DAGs (with a chordal skeleton G = (V, E)) is a pairing family if ∀ S ∈ S, C ∈ C such that S ⊂ C, ∃ p ∈ P : S →_{o_p} C.

SLIDE 27

HD as a P-Dirichlet law

Th. (MW'16) Let P be a family of moral DAGs with a chordal skeleton G = (V, E). Assume that P is a pairing family and ∪_{p∈P} p(V) = S. If L is a P-Dirichlet law, then L is necessarily a hyper-Dirichlet distribution.

Of course, p(V) = {p(v), v ∈ V}.

SLIDE 28

1. Introduction
2. Markovian structure imposed by DAGs
3. Local and global independence vs. HD law

SLIDE 29

Independencies

If π (G-Markov wrt a chordal G = (V, E)) has a hyper-Dirichlet distribution, then for any moral DAG p with skeleton G:

the random vectors (p^{v|p(v)}_{i|n}, i ∈ I_v, n ∈ I_{p(v)}), v ∈ V, are independent (global independence of parameters, notation: GI(p));

for an arbitrary fixed v ∈ V the random vectors (p^{v|p(v)}_{i|n}, i ∈ I_v), n ∈ I_{p(v)}, are independent (local independence of parameters, notation: LI(p)).

SLIDE 30

Heckerman, Geiger and Chickering (1995); Geiger and Heckerman (1997)

Let G = (V, E) be a complete graph with d vertices. Any DAG (all such DAGs are moral) is uniquely determined by an ordering of vertices: p = (v_1, ..., v_d) iff #p(v_j) = j − 1, j = 1, ..., d. For a complete graph, they proved that (under some smoothness assumptions for densities) the independence conditions GI and LI wrt the DAGs p = (1, 2, 3, ..., d − 1, d) and p′ = (d, 1, 2, ..., d − 2, d − 1) imply that the distribution of π is necessarily classical hyper-Dirichlet.

SLIDE 31

Separation and characterization of P-Dirichlet

We say that a family of moral DAGs P (with a chordal skeleton G = (V, E)) is a separating family if ∀ v ∈ V ∃ p, p′ ∈ P : p(v) ≠ p′(v).

Th. (MW'16) Let π be G-Markov, where G = (V, E) is chordal. Let P be a separating family of moral DAGs with skeleton G. If ∀ p ∈ P the independence conditions GI(p) and LI(p) hold, then π has a P-Dirichlet distribution.

SLIDE 32

Characterization of the hyper-Dirichlet law

Cor. 0 Let P be a pairing and separating family of moral DAGs (with a chordal skeleton G = (V, E)) satisfying ∪_{p∈P} p(V) = S. If ∀ p ∈ P the independence conditions GI(p) and LI(p) hold, then π has a hyper-Dirichlet distribution.

SLIDE 33

The case of a chain

For a chain G = 1 − 2 − ... − d consider two DAGs

p = 1 → 2 → ... → d  and  p′ = 1 ← 2 ← ... ← d.

Cor. 1 If the random vectors (p^{j|j−1}_{ℓ|k}, ℓ ∈ I_j), k ∈ I_{j−1}, j = 1, ..., d (I_0 = ∅), are jointly independent and the random vectors (p^{j|j+1}_{ℓ|k}, ℓ ∈ I_j), k ∈ I_{j+1}, j = 1, ..., d (I_{d+1} = ∅), are also jointly independent, then π has a hyper-Dirichlet distribution.
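The two DAGs p and p′ parameterize the same π: the backward conditionals follow from the forward ones by Bayes' rule. A sketch for d = 3 with illustrative numbers:

```python
from itertools import product
from math import isclose

# Forward parameterization p = 1 -> 2 -> 3 (illustrative numbers)
p1 = [0.3, 0.7]
p2g1 = [[0.6, 0.4], [0.2, 0.8]]
p3g2 = [[0.9, 0.1], [0.5, 0.5]]
joint = {(a, b, c): p1[a] * p2g1[a][b] * p3g2[b][c]
         for a, b, c in product(range(2), repeat=3)}

# Marginals needed for the backward DAG p' = 1 <- 2 <- 3
p3 = [sum(v for k, v in joint.items() if k[2] == c) for c in range(2)]
p2 = [sum(v for k, v in joint.items() if k[1] == b) for b in range(2)]
p23 = {(b, c): sum(v for k, v in joint.items() if k[1:] == (b, c))
       for b, c in product(range(2), repeat=2)}
p12 = {(a, b): sum(v for k, v in joint.items() if k[:2] == (a, b))
       for a, b in product(range(2), repeat=2)}

# Backward conditionals via Bayes' rule: p(X2|X3) and p(X1|X2)
p2g3 = {(b, c): p23[(b, c)] / p3[c] for b, c in product(range(2), repeat=2)}
p1g2 = {(a, b): p12[(a, b)] / p2[b] for a, b in product(range(2), repeat=2)}

# Both parameterizations reproduce the same joint law
for (a, b, c), v in joint.items():
    assert isclose(v, p1g2[(a, b)] * p2g3[(b, c)] * p3[c])
```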

SLIDE 34

The case of a tree

Let T = (V, E) be a tree, i.e. a connected undirected graph without cycles. A vertex v ∈ V is a leaf if it has only one neighbour. Let L ⊂ V denote the set of leaves of the tree T. For a DAG p with skeleton T, a vertex v ∈ V is called a root if p(v) = ∅. If p is a moral DAG (with skeleton T), then the (unique) root vertex v determines the DAG uniquely (notation: p_v).

Cor. 2 Assume that π has the independence properties GI(p_v) and LI(p_v) ∀ v ∈ L. Then π has a hyper-Dirichlet distribution.

SLIDE 35

The case of a complete graph

Recall that any DAG with skeleton G being a complete graph (all such DAGs are moral) is uniquely determined by an ordering of vertices: (v_1, ..., v_d) means that #p(v_j) = j − 1, j = 1, ..., d. For two such DAGs, p = (v_1, ..., v_d) and p′ = (v′_1, ..., v′_d), consider the condition

∀ j = 2, ..., d  p(v_j) ≠ p′(v′_j).   (3)

Cor. 3 Assume that π has the independence properties GI and LI wrt p and p′ satisfying (3). Then π has a classical Dirichlet distribution.

SLIDE 36

Heckerman, Geiger and Chickering (1995), revisited

For a complete graph, HGC assumed the independence conditions GI and LI wrt the DAGs p = (1, 2, 3, ..., d − 1, d) and p′ = (d, 1, 2, ..., d − 2, d − 1). Note that for j = 2, 3, ..., d, d ∈ p′(v′_j) and d ∉ p(v_j), that is, condition (3) is satisfied. Consequently, the HGC characterization is an immediate consequence of Cor. 3. Therefore it holds without the regularity assumptions made in HGC (1995).

SLIDE 37

Literature

[1] ANDERSSON, S.A., MADIGAN, D., PERLMAN, M.D. (1997) A characterization of Markov equivalence classes for acyclic digraphs. Ann. Statist. 25, 505-541.
[2] DAWID, A.P., LAURITZEN, S.L. (1993) Hyper-Markov laws in the statistical analysis of decomposable graphical models. Ann. Statist. 21, 1272-1317.
[3] GEIGER, D., HECKERMAN, D. (1997) A characterization of the Dirichlet distribution through global and local parameter independence. Ann. Statist. 25, 1344-1369.
[4] HECKERMAN, D., GEIGER, D., CHICKERING, D.M. (1995) Learning Bayesian networks: The combination of knowledge and statistical data. Mach. Learn. 20, 197-243.
[5] LAURITZEN, S.L. (1996) Graphical Models. Oxford Univ. Press.
[6] MASSAM, H., WESOŁOWSKI, J. (2016) A new prior for discrete DAG models with a restricted set of directions. Ann. Statist. 44, 1010-1037 (with Supplement).