Bayesian Networks Part 2
CS 760@UW-Madison
Goals for the lecture

you should understand the following concepts:
- the parameter learning task for Bayes nets
- the structure learning task for Bayes nets
- maximum likelihood estimation
The parameter learning task

given: the graph structure of a Bayes net (here the alarm network over Burglary, Earthquake, Alarm, JohnCalls, and MaryCalls) and a set of data D:

B E A J M
f f f t f
f t f f f
f f t f t
…

do: infer the parameters of the CPDs.
consider trying to estimate the parameter θ (the probability of heads) of a biased coin from a sequence of flips x_1, …, x_n (1 stands for heads, 0 for tails); the likelihood function for θ is given by:

$$L(\theta; x_1, \dots, x_n) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i} = \theta^{n_h} (1 - \theta)^{n - n_h}$$

where n_h is the number of heads; the maximum likelihood estimate is the value of θ that maximizes this function, namely the empirical fraction of heads, θ = n_h / n.
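to make this concrete, here is a minimal sketch in Python (the flips list is made up for illustration):

```python
# maximum likelihood estimation for a biased coin
# flips: a hypothetical sequence of observations, 1 = heads, 0 = tails
flips = [1, 0, 0, 1, 1, 0, 1, 1]

# the MLE maximizes L(theta) = theta**n_h * (1 - theta)**n_t,
# which is achieved at the empirical fraction of heads
n_h = sum(flips)
theta_mle = n_h / len(flips)
print(theta_mle)  # 0.625
```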
for a Bayes net, the likelihood of a data set D factors according to the network structure:

$$L(\theta; D) = \prod_{d \in D} P\big(x_1^{(d)}, x_2^{(d)}, \dots, x_n^{(d)}\big) = \prod_{d \in D} \prod_{i} P\big(x_i^{(d)} \mid \mathrm{Pa}(X_i)^{(d)}\big)$$

so maximizing the likelihood decomposes into an independent parameter learning problem for each CPD.
now consider estimating the CPD parameters for B and J in the alarm network given the following data set:

B E A J M
f f f t f
f t f f f
f f f t t
t f f f t
f f t t f
f f t f t
f f t t t
f f t t t

$$P(b) = \frac{1}{8} = 0.125 \qquad P(\neg b) = \frac{7}{8} = 0.875$$

$$P(j \mid a) = \frac{3}{4} = 0.75 \quad P(\neg j \mid a) = \frac{1}{4} = 0.25 \quad P(j \mid \neg a) = \frac{2}{4} = 0.5 \quad P(\neg j \mid \neg a) = \frac{2}{4} = 0.5$$
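these estimates are just counts; a minimal sketch in Python, transcribing the table above (variable names are mine):

```python
# maximum likelihood CPD estimates for the alarm-network example above;
# each row is (B, E, A, J, M) with True = t, False = f
data = [
    (False, False, False, True,  False),
    (False, True,  False, False, False),
    (False, False, False, True,  True),
    (True,  False, False, False, True),
    (False, False, True,  True,  False),
    (False, False, True,  False, True),
    (False, False, True,  True,  True),
    (False, False, True,  True,  True),
]

# P(b): the fraction of instances with B = t
p_b = sum(b for b, e, a, j, m in data) / len(data)

# P(j | a): among instances with A = t, the fraction with J = t
j_given_a = [j for b, e, a, j, m in data if a]
p_j_given_a = sum(j_given_a) / len(j_given_a)

print(p_b, p_j_given_a)  # 0.125 0.75
```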
suppose instead our data set was this one:

B E A J M
f f f t f
f t f f f
f f f t t
f f f f t
f f t t f
f f t f t
f f t t t
f f t t t

$$P(b) = \frac{0}{8} = 0 \qquad P(\neg b) = \frac{8}{8} = 1$$

do we really want to set P(b) to 0?
instead of estimating parameters strictly from the data, we could start with some prior belief for each, expressed as pseudocounts of "virtual" instances (m-estimates):

$$P(X = x) = \frac{n_x + p_x m}{\left(\sum_{v \in \mathrm{Values}(X)} n_v\right) + m}$$

where n_v is the number of times value v occurs in the data, m is the number of "virtual" instances, and p_x is the prior probability of value x.
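a minimal sketch of the m-estimate as a Python function (the name m_estimate is mine, not from any library):

```python
def m_estimate(n_x, n_total, p_x, m):
    """m-estimate: smooth the raw count n_x (out of n_total observed
    instances) toward the prior p_x using m 'virtual' instances."""
    return (n_x + p_x * m) / (n_total + m)

print(m_estimate(1, 8, 0.5, 2))  # (1 + 0.5*2) / (8 + 2) = 0.2
```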
now let's estimate the parameters for B from the same data set as above, using m = 4 and p_b = 0.25.
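plugging these into the m-estimate formula above (taking the prior on ¬b to be p_¬b = 1 − p_b = 0.75, since the two values must sum to one):

$$P(b) = \frac{0 + 0.25 \times 4}{8 + 4} = \frac{1}{12} \approx 0.08 \qquad P(\neg b) = \frac{8 + 0.75 \times 4}{8 + 4} = \frac{11}{12} \approx 0.92$$

the smoothed estimate no longer sets P(b) to exactly 0.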
Missing data

some values may be missing at random: whether a value is missing does not depend on the missing data itself (e.g. a test may be skipped because the diagnosis given earlier test results was already clear). other values may be missing systematically: the probability of a value being missing depends on the value itself (e.g. if people are less likely to report poor scores, those with poor scores will be underrepresented in the data).
parameters can still be learned when some values are missing, using expectation-maximization (EM).

Given: a data set with some missing values, the model structure, and initial model parameters

Repeat until convergence:
- E step: using the current model, compute an expectation over the missing values
- M step: update the model parameters with those that maximize the probability of the data, given the expectations
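a schematic sketch of the loop in Python; the model methods used here (posterior_over_missing, fit_from_expected_counts, log_likelihood) are hypothetical placeholders for the model-specific computations:

```python
# schematic EM loop; this is a sketch only, the three model methods
# below are hypothetical placeholders, not from any particular library
def em(model, data, max_iters=100, tol=1e-6):
    prev_ll = float("-inf")
    for _ in range(max_iters):
        # E step: distribution over each instance's missing values
        expectations = [model.posterior_over_missing(x) for x in data]
        # M step: refit parameters from the resulting expected counts
        model = model.fit_from_expected_counts(data, expectations)
        ll = model.log_likelihood(data)
        if ll - prev_ll < tol:  # likelihood no longer improving
            return model
        prev_ll = ll
    return model
```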
suppose we're given the following initial BN and training set (the value of A is missing in every instance):

P(B) = 0.1   P(E) = 0.2

B E  P(A)
t t  0.9
t f  0.6
f t  0.3
f f  0.2

A  P(J)      A  P(M)
t  0.9       t  0.8
f  0.2       f  0.1

B E A J M
f f ? f f
f f ? t f
t f ? t t
f f ? f t
f t ? t f
f f ? f t
t t ? t t
f f ? f f
f f ? t f
f f ? f t
in the E step, we use the current parameters to compute a distribution over the missing value of A for each instance:

B E  A                       J M
f f  t: 0.0069  f: 0.9931    f f
f f  t: 0.2     f: 0.8       t f
t f  t: 0.98    f: 0.02      t t
f f  t: 0.2     f: 0.8       f t
f t  t: 0.3     f: 0.7       t f
f f  t: 0.2     f: 0.8       f t
t t  t: 0.997   f: 0.003     t t
f f  t: 0.0069  f: 0.9931    f f
f f  t: 0.2     f: 0.8       t f
f f  t: 0.2     f: 0.8       f t

for example, the first and second rows are computed as follows:
$$P(a \mid \neg b, \neg e, \neg j, \neg m) = \frac{P(\neg b, \neg e, a, \neg j, \neg m)}{P(\neg b, \neg e, a, \neg j, \neg m) + P(\neg b, \neg e, \neg a, \neg j, \neg m)} = \frac{.9 \times .8 \times .2 \times .1 \times .2}{.00288 + .9 \times .8 \times .8 \times .8 \times .9} = \frac{.00288}{.4176} = .0069$$

$$P(a \mid \neg b, \neg e, j, \neg m) = \frac{P(\neg b, \neg e, a, j, \neg m)}{P(\neg b, \neg e, a, j, \neg m) + P(\neg b, \neg e, \neg a, j, \neg m)} = \frac{.9 \times .8 \times .2 \times .9 \times .2}{.02592 + .9 \times .8 \times .8 \times .2 \times .9} = \frac{.02592}{.1296} = .2$$
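a minimal Python sketch of this E-step computation, hard-coding the initial CPTs above (function and variable names are mine):

```python
# E step for the alarm network: posterior over the missing A given the
# observed B, E, J, M and the current (initial) CPTs
P_B, P_E = 0.1, 0.2
P_A = {(True, True): 0.9, (True, False): 0.6,
       (False, True): 0.3, (False, False): 0.2}   # P(A = t | B, E)
P_J = {True: 0.9, False: 0.2}                     # P(J = t | A)
P_M = {True: 0.8, False: 0.1}                     # P(M = t | A)

def posterior_a(b, e, j, m):
    def joint(a):
        # full joint probability of one completed instance
        p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
        p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
        p *= P_J[a] if j else 1 - P_J[a]
        p *= P_M[a] if m else 1 - P_M[a]
        return p
    pa, pna = joint(True), joint(False)
    return pa / (pa + pna)

print(round(posterior_a(False, False, False, False), 4))  # 0.0069
print(round(posterior_a(False, False, True, False), 4))   # 0.2
```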
in the M step, we re-estimate the probabilities using expected counts:

$$P(a \mid b, e) = \frac{E\#(a, b, e)}{E\#(b, e)}$$

$$P(a \mid b, e) = \frac{.997}{1} \qquad P(a \mid b, \neg e) = \frac{.98}{1} \qquad P(a \mid \neg b, e) = \frac{.3}{1}$$

$$P(a \mid \neg b, \neg e) = \frac{.0069 + .2 + .2 + .2 + .0069 + .2 + .2}{7} \approx 0.145$$

which gives the re-estimated CPD:

B E  P(A)
t t  0.997
t f  0.98
f t  0.3
f f  0.145
re-estimate the probabilities for P(J | A) and P(M | A) in the same way, again using expected counts:

$$P(j \mid a) = \frac{E\#(j, a)}{E\#(a)} = \frac{.2 + .98 + .3 + .997 + .2}{.0069 + .2 + .98 + .2 + .3 + .2 + .997 + .0069 + .2 + .2} \approx 0.81$$

$$P(j \mid \neg a) = \frac{.8 + .02 + .7 + .003 + .8}{.9931 + .8 + .02 + .8 + .7 + .8 + .003 + .9931 + .8 + .8} \approx 0.35$$
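the same ratio of expected counts as a minimal Python sketch, using the E-step posteriors from the table above (variable names are mine):

```python
# M step for P(J | A): ratio of expected counts
post_a = [0.0069, 0.2, 0.98, 0.2, 0.3, 0.2, 0.997, 0.0069, 0.2, 0.2]  # P(A=t | row)
j_obs  = [False, True, True, False, True, False, True, False, True, False]

num = sum(p for p, j in zip(post_a, j_obs) if j)  # expected #(J=t, A=t)
den = sum(post_a)                                 # expected #(A=t)
print(round(num / den, 2))                        # 0.81
```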
the E and M steps are repeated until the parameters converge; each iteration increases the likelihood of the data, and EM converges to a local maximum.

Structure learning

the number of possible network structures is super-exponential in the number of variables, but there are special cases in which structure learning is tractable (e.g. networks with tree DAGs). the Chow-Liu algorithm learns the tree-structured network that maximizes the likelihood of the training data:
1. compute the mutual information between each pair of variables, and use it as the weight of the edge between them (formula below)
2. find a maximum-weight spanning tree over these weighted edges
3. pick a root and direct the edges away from it to obtain the DAG
the mutual information between two variables X and Y, estimated from the training data, is

$$I(X, Y) = \sum_{x \in \mathrm{values}(X)} \sum_{y \in \mathrm{values}(Y)} P(x, y) \log_2 \frac{P(x, y)}{P(x) P(y)}$$
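a minimal sketch of estimating mutual information from paired samples in Python (the function name is mine):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """empirical mutual information I(X, Y) in bits between two
    paired sequences of discrete values"""
    n = len(xs)
    c_xy = Counter(zip(xs, ys))
    c_x, c_y = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in c_xy.items():
        # P(x,y) = c/n, P(x) = c_x[x]/n, P(y) = c_y[y]/n
        mi += (c / n) * math.log2(c * n / (c_x[x] * c_y[y]))
    return mi

print(mutual_information("ttff", "ttff"))  # 1.0 bit (identical variables)
```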
a spanning tree is a tree that connects all vertices in a graph; here we want the spanning tree with maximum total edge weight (a maximum-weight spanning tree, MST).

[figure: an example graph over vertices A-G with weighted edges]

the Chow-Liu algorithm always works with a complete graph, but here we use a non-complete graph as the example for clarity. an MST can be found with Kruskal's algorithm:
given: a graph with vertices V and edges E

Enew ← { }
for each edge (u, v) in E, ordered by weight (from high to low):
    remove (u, v) from E
    if adding (u, v) to Enew does not create a cycle:
        add (u, v) to Enew
return V and Enew, which represent an MST
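a runnable Python version of this sketch (a maximum-weight variant of Kruskal's algorithm, using a simple union-find to detect cycles; names are mine):

```python
def max_spanning_tree(vertices, edges):
    """Kruskal's algorithm, maximum-weight variant.
    edges: list of (weight, u, v) tuples; returns the chosen edges."""
    parent = {v: v for v in vertices}

    def find(v):  # union-find root lookup with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    e_new = []
    for w, u, v in sorted(edges, reverse=True):  # high weight first
        ru, rv = find(u), find(v)
        if ru != rv:  # adding (u, v) does not create a cycle
            parent[ru] = rv
            e_new.append((u, v))
    return e_new

print(max_spanning_tree("ABC", [(3, "A", "B"), (1, "B", "C"), (2, "A", "C")]))
# [('A', 'B'), ('A', 'C')]
```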
[figure, steps i-vi: the MST over A-G is grown by considering edges from highest weight to lowest; each edge is added unless it would create a cycle, until all vertices are connected]
why does this procedure find the tree that maximizes the data likelihood? we're interested in finding the graph G that maximizes the likelihood of the data given the directed edges; if we assume a tree, each node has at most one parent. edge directions don't matter for the likelihood, because mutual information is symmetric, so we can search over undirected trees and orient the edges afterward.
with maximum likelihood parameters, and with I and H computed from the empirical distribution of the data:

$$\log P(D \mid G, \theta_G) = \sum_{d \in D} \sum_i \log_2 P\big(x_i^{(d)} \mid \mathrm{Pa}_G(X_i)\big) = |D| \left( \sum_i I\big(X_i, \mathrm{Pa}_G(X_i)\big) - \sum_i H(X_i) \right)$$

the entropy term $\sum_i H(X_i)$ does not depend on the structure G, so maximizing the log likelihood over tree structures is equivalent to maximizing

$$\sum_{(X_i, X_j) \in \mathrm{edges}(G)} I(X_i, X_j)$$

which is exactly the total edge weight maximized by the spanning-tree construction above.
Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Elad Hazan, Tom Dietterich, and Pedro Domingos.