Learning Bayesian networks: Given structure and completely observed data
Probabilistic Graphical Models
Sharif University of Technology
Spring 2017
Soleymani
Learning problem
- Target: the true distribution $P^*$, which may correspond to a model $\mathcal{M}^* = \langle \mathcal{K}^*, \theta^* \rangle$
- Hypothesis space: a specified class of probabilistic graphical models
- Data: a set of instances sampled from $P^*$
- Learning goal: selecting a model $\mathcal{M}$ that constructs the best approximation to $P^*$ according to a performance metric
Learning tasks on graphical models
- Parameter learning / structure learning
- Completely observable / partially observable data
- Directed model / undirected model
Parameter learning in directed models: Complete data
- We assume that the structure of the model is known
  - i.e., we consider learning parameters for a BN with a given structure
- Goal: estimate the CPDs from a dataset $\mathcal{D} = \{x^{(1)}, \dots, x^{(N)}\}$ of $N$ independent, identically distributed (i.i.d.) training samples
- Each training sample $x^{(n)} = (x_1^{(n)}, \dots, x_m^{(n)})$ is a vector in which every element $x_i^{(n)}$ is known (no missing values, no hidden variables)
Density estimation review
- We use density estimation to solve this learning problem
- Density estimation: estimating the probability density function $p(x)$, given a set of data points $\{x^{(n)}\}_{n=1}^{N}$ drawn from it
- Parametric methods: assume that $p(x)$ has a specific functional form with a number of adjustable parameters
  - MLE and Bayesian estimation
  - MLE: determines a single parameter value $\hat{\theta}$ given $\{x^{(1)}, \dots, x^{(N)}\}$
    - MLE suffers from an overfitting problem
  - Bayesian estimation: maintains a probability distribution $p(\theta)$ over the spectrum of hypotheses
    - Needs a prior distribution on the parameters
Density estimation: Graphical model
- i.i.d. assumption
[Plate diagrams: the parameters $\theta$ are shared across the i.i.d. samples $x^{(n)}$, $n = 1, \dots, N$; in the Bayesian model, hyperparameters $\alpha$ are additionally placed on $\theta$.]
Maximum Likelihood Estimation (MLE)
- The likelihood is the conditional probability of the observations $\mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$ given the value of the parameters $\theta$
- Assuming i.i.d. (independent, identically distributed) samples:
  $p(\mathcal{D} \mid \theta) = p(x^{(1)}, \dots, x^{(N)} \mid \theta) = \prod_{n=1}^{N} p(x^{(n)} \mid \theta)$  (the likelihood of $\theta$ w.r.t. the samples)
- Maximum likelihood estimation:
  $\hat{\theta}_{ML} = \arg\max_\theta p(\mathcal{D} \mid \theta) = \arg\max_\theta \prod_{n=1}^{N} p(x^{(n)} \mid \theta)$
  $\hat{\theta}_{ML} = \arg\max_\theta \sum_{n=1}^{N} \ln p(x^{(n)} \mid \theta)$
- MLE has a closed-form solution for many parametric distributions
MLE: Bernoulli distribution
- Given: $\mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$ with $m$ heads (1) and $N-m$ tails (0), where $p(x \mid \theta) = \theta^x (1-\theta)^{1-x}$, i.e., $p(x = 1 \mid \theta) = \theta$:
  $p(\mathcal{D} \mid \theta) = \prod_{n=1}^{N} p(x^{(n)} \mid \theta) = \prod_{n=1}^{N} \theta^{x^{(n)}} (1-\theta)^{1-x^{(n)}}$
  $\ln p(\mathcal{D} \mid \theta) = \sum_{n=1}^{N} \ln p(x^{(n)} \mid \theta) = \sum_{n=1}^{N} \left\{ x^{(n)} \ln\theta + (1-x^{(n)}) \ln(1-\theta) \right\}$
  $\frac{\partial \ln p(\mathcal{D}\mid\theta)}{\partial\theta} = 0 \;\Rightarrow\; \hat{\theta}_{ML} = \frac{\sum_{n=1}^{N} x^{(n)}}{N} = \frac{m}{N}$
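Below is a minimal NumPy sketch of this closed-form Bernoulli MLE; the data array `D` is a hypothetical set of coin flips, not from the slides.

```python
import numpy as np

# Hypothetical coin-flip data: 1 = heads, 0 = tails.
D = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

N = len(D)        # number of i.i.d. samples
m = D.sum()       # number of heads, m = sum_n x^(n)

theta_ml = m / N  # closed-form MLE: theta_ML = m / N
print(f"theta_ML = p(x=1) = {theta_ml:.3f}")
```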
MLE: Multinomial distribution
- Multinomial distribution (on a variable with $K$ states):
  $p(x \mid \theta) = \prod_{k=1}^{K} \theta_k^{x_k}$
  - Parameter space: $\theta = (\theta_1, \dots, \theta_K)$, $\theta_k \in [0,1]$, $\sum_{k=1}^{K} \theta_k = 1$
  - Variable: 1-of-K coding, $x = (x_1, \dots, x_K)$, $x_k \in \{0,1\}$, $\sum_{k=1}^{K} x_k = 1$, $p(x_k = 1) = \theta_k$
[Figure: the simplex $\theta_1 + \theta_2 + \theta_3 = 1$ with $\theta_k \in [0,1]$ shows the set of valid parameters.]
MLE: Multinomial distribution
$\mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$
$p(\mathcal{D} \mid \theta) = \prod_{n=1}^{N} p(x^{(n)} \mid \theta) = \prod_{n=1}^{N} \prod_{k=1}^{K} \theta_k^{x_k^{(n)}} = \prod_{k=1}^{K} \theta_k^{\sum_{n=1}^{N} x_k^{(n)}}$
Maximize subject to $\sum_k \theta_k = 1$ via the Lagrangian
$\ell(\theta, \lambda) = \ln p(\mathcal{D} \mid \theta) + \lambda \left(1 - \sum_{k=1}^{K} \theta_k\right)$
With the counts $m_k = \sum_{n=1}^{N} x_k^{(n)}$ (so that $\sum_{k=1}^{K} m_k = N$), the solution is
$\hat{\theta}_k = \frac{m_k}{N}, \qquad \lambda = N$
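A small NumPy sketch of this counting solution; the samples are hypothetical and stored as state indices rather than 1-of-K vectors.

```python
import numpy as np

K = 3                              # number of states
# Hypothetical samples of a K-state variable, encoded as state indices 0..K-1
# (equivalent to the 1-of-K vectors x^(n) used on the slide).
D = np.array([0, 2, 1, 0, 0, 2, 1, 0, 2, 0])

m = np.bincount(D, minlength=K)    # m_k = sum_n x_k^(n)  (state counts)
theta_ml = m / len(D)              # Lagrangian solution: theta_k = m_k / N
print(theta_ml, theta_ml.sum())    # the estimates sum to 1
```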
MLE: Gaussian with unknown $\mu$
$\ln p(x^{(n)} \mid \mu) = -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\left(x^{(n)} - \mu\right)^2$
$\frac{\partial \ln p(\mathcal{D}\mid\mu)}{\partial\mu} = 0 \;\Rightarrow\; \sum_{n=1}^{N} \frac{\partial}{\partial\mu} \ln p(x^{(n)} \mid \mu) = 0 \;\Rightarrow\; \sum_{n=1}^{N} \frac{1}{\sigma^2}\left(x^{(n)} - \mu\right) = 0 \;\Rightarrow\; \hat{\mu}_{ML} = \frac{1}{N}\sum_{n=1}^{N} x^{(n)}$
Bayesian approach
- Parameters $\theta$ are treated as random variables with an a priori distribution
  - Utilizes the available prior information about the unknown parameter
  - As opposed to ML estimation, it does not seek a specific point estimate of the unknown parameter vector $\theta$
- The samples $\mathcal{D}$ convert the prior density $p(\theta)$ into a posterior density $p(\theta \mid \mathcal{D})$
- Keeps track of beliefs about the values of $\theta$ and uses these beliefs for reaching conclusions
Maximum A Posteriori (MAP) estimation
- MAP estimation:
  $\hat{\theta}_{MAP} = \arg\max_\theta p(\theta \mid \mathcal{D})$
- Since $p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\, p(\theta)$:
  $\hat{\theta}_{MAP} = \arg\max_\theta p(\mathcal{D} \mid \theta)\, p(\theta)$
- Example of a prior distribution: $p(\theta) = \mathcal{N}(\theta_0, \sigma^2)$
Bayesian approach: Predictive distribution
- Given a set of samples $\mathcal{D} = \{x^{(n)}\}_{n=1}^{N}$, a prior distribution $p(\theta)$ on the parameters, and the form of the distribution $p(x \mid \theta)$
- We find $p(\theta \mid \mathcal{D})$ and use it to obtain the predictive distribution $\hat{p}(x) = p(x \mid \mathcal{D})$ on new data as an estimate of $p(x)$:
  $p(x \mid \mathcal{D}) = \int p(x, \theta \mid \mathcal{D})\, d\theta = \int p(x \mid \mathcal{D}, \theta)\, p(\theta \mid \mathcal{D})\, d\theta = \int p(x \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta$
- Analytical solutions exist for very special forms of the involved functions
Conjugate Priors
- We consider a form of prior distribution that has a simple interpretation as well as some useful analytical properties
- Conjugacy: choose a prior such that the posterior distribution, which is proportional to $p(\mathcal{D} \mid \theta)\, p(\theta)$, has the same functional form as the prior:
  $\forall \alpha, \mathcal{D} \;\; \exists \alpha' \;:\; p(\theta \mid \alpha') \propto p(\mathcal{D} \mid \theta)\, p(\theta \mid \alpha)$  (having the same functional form)
Prior for Bernoulli Likelihood
- Beta distribution over $\theta \in [0,1]$:
  $\mathrm{Beta}(\theta \mid \alpha_1, \alpha_0) \propto \theta^{\alpha_1 - 1}(1-\theta)^{\alpha_0 - 1}$
  $\mathrm{Beta}(\theta \mid \alpha_1, \alpha_0) = \frac{\Gamma(\alpha_0 + \alpha_1)}{\Gamma(\alpha_0)\Gamma(\alpha_1)}\, \theta^{\alpha_1 - 1}(1-\theta)^{\alpha_0 - 1}$
- The Beta distribution is the conjugate prior of the Bernoulli $p(x \mid \theta) = \theta^x (1-\theta)^{1-x}$
  $E[\theta] = \frac{\alpha_1}{\alpha_0 + \alpha_1}$, most probable $\theta$ (mode): $\frac{\alpha_1 - 1}{(\alpha_0 - 1) + (\alpha_1 - 1)}$
Beta distribution
Bernoulli likelihood: Posterior
Given: $\mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$ with $m$ heads (1) and $N-m$ tails (0), where $m = \sum_{n=1}^{N} x^{(n)}$
Likelihood: $p(\mathcal{D} \mid \theta) = \prod_{n=1}^{N} \theta^{x^{(n)}}(1-\theta)^{1-x^{(n)}}$
Prior: $\mathrm{Beta}(\theta \mid \alpha_1, \alpha_0) \propto \theta^{\alpha_1 - 1}(1-\theta)^{\alpha_0 - 1}$
Posterior: $p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\, p(\theta) \propto \theta^{m + \alpha_1 - 1}(1-\theta)^{N - m + \alpha_0 - 1}$
$\Rightarrow p(\theta \mid \mathcal{D}) = \mathrm{Beta}(\theta \mid \alpha_1', \alpha_0')$ with $\alpha_1' = \alpha_1 + m$ and $\alpha_0' = \alpha_0 + N - m$
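A minimal sketch of this conjugate Beta–Bernoulli update in NumPy; the hyperparameters and the three observed flips mirror the example on the next slide, and the variable names are my own.

```python
import numpy as np

# Prior hyperparameters (imaginary heads/tails counts).
alpha1, alpha0 = 2.0, 2.0

# Observed coin flips (the example D = {1, 1, 1}).
D = np.array([1, 1, 1])
N, m = len(D), D.sum()

# Conjugate update: the posterior is again a Beta distribution.
alpha1_post = alpha1 + m        # alpha_1' = alpha_1 + m
alpha0_post = alpha0 + N - m    # alpha_0' = alpha_0 + N - m

theta_map = (alpha1_post - 1) / ((alpha1_post - 1) + (alpha0_post - 1))  # posterior mode
p_heads = alpha1_post / (alpha1_post + alpha0_post)                      # predictive p(x=1 | D)
print(alpha1_post, alpha0_post, theta_map, p_heads)  # 5.0 2.0 0.8 ~0.714
```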
Example
- Bernoulli likelihood $p(x \mid \theta) = \theta^x(1-\theta)^{1-x}$ with prior $\mathrm{Beta}(\theta \mid \alpha_1, \alpha_0)$, $\alpha_0 = \alpha_1 = 2$
- Data: $\mathcal{D} = \{1, 1, 1\}$, so $N = 3$ and $m = 3$
- Posterior: $\mathrm{Beta}(\theta \mid \alpha_1', \alpha_0')$ with $\alpha_1' = 5$, $\alpha_0' = 2$
  $\hat{\theta}_{MAP} = \arg\max_\theta p(\theta \mid \mathcal{D}) = \frac{\alpha_1' - 1}{(\alpha_1' - 1) + (\alpha_0' - 1)} = \frac{4}{5}$
[Plots of the prior $p(\theta)$, the likelihood $p(x = 1 \mid \theta)$, and the posterior $p(\theta \mid \mathcal{D})$.]
Bernoulli: Predictive distribution
- Training samples: $\mathcal{D} = \{x^{(1)}, \dots, x^{(N)}\}$ with $m$ heads
  Prior: $p(\theta) = \mathrm{Beta}(\theta \mid \alpha_1, \alpha_0) \propto \theta^{\alpha_1 - 1}(1-\theta)^{\alpha_0 - 1}$
  Posterior: $p(\theta \mid \mathcal{D}) = \mathrm{Beta}(\theta \mid \alpha_1 + m, \alpha_0 + N - m) \propto \theta^{\alpha_1 + m - 1}(1-\theta)^{\alpha_0 + N - m - 1}$
  Predictive: $p(x \mid \mathcal{D}) = \int p(x \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta = E_{p(\theta \mid \mathcal{D})}[p(x \mid \theta)]$
  $\Rightarrow p(x = 1 \mid \mathcal{D}) = E_{p(\theta \mid \mathcal{D})}[\theta] = \frac{\alpha_1 + m}{\alpha_0 + \alpha_1 + N}$
Dirichlet distribution
$p(\theta \mid \alpha) \propto \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} = \frac{\Gamma(\alpha)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}, \qquad \alpha = \sum_{k=1}^{K} \alpha_k$
$E[\theta_k] = \frac{\alpha_k}{\alpha}$, mode: $\hat{\theta}_k = \frac{\alpha_k - 1}{\alpha - K}$
Input space: $\theta = (\theta_1, \dots, \theta_K)^T$, $\theta_k \in [0,1]$, $\sum_{k=1}^{K} \theta_k = 1$
Dirichlet distribution: Examples
[Dirichlet densities for $\alpha = [0.1, 0.1, 0.1]$, $\alpha = [1, 1, 1]$, and $\alpha = [10, 10, 10]$.] The Dirichlet parameters determine both the prior beliefs and their strength: larger values of $\alpha$ correspond to more confidence in the prior belief (i.e., more imaginary samples).
Dirichlet distribution: Example
[Dirichlet densities for $\alpha = [20, 2, 2]$ and $\alpha = [2, 2, 2]$.]
Multinomial distribution: Prior
- The Dirichlet distribution is the conjugate prior of the multinomial:
  $p(\theta \mid \mathcal{D}, \alpha) \propto p(\mathcal{D} \mid \theta)\, p(\theta \mid \alpha) \propto \prod_{k=1}^{K} \theta_k^{m_k + \alpha_k - 1}$
  $\Rightarrow p(\theta \mid \mathcal{D}, \alpha) = \mathrm{Dir}(\theta \mid \alpha + m)$, where $m = (m_1, \dots, m_K)^T$ are the sufficient statistics of the data
  Prior: $\theta \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_K)$; posterior: $\theta \mid \mathcal{D} \sim \mathrm{Dir}(\alpha_1 + m_1, \dots, \alpha_K + m_K)$
Multinomial: Predictive distribution
- Training samples: $\mathcal{D} = \{x^{(1)}, \dots, x^{(N)}\}$
  Prior: $p(\theta) = \mathrm{Dir}(\theta \mid \alpha_1, \dots, \alpha_K)$; posterior: $p(\theta \mid \mathcal{D}) = \mathrm{Dir}(\theta \mid \alpha_1 + m_1, \dots, \alpha_K + m_K)$
  Predictive: $p(x \mid \mathcal{D}) = \int p(x \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta = E_{p(\theta \mid \mathcal{D})}[p(x \mid \theta)]$
  $\Rightarrow p(x_k = 1 \mid \mathcal{D}) = E_{p(\theta \mid \mathcal{D})}[\theta_k] = \frac{\alpha_k + m_k}{\alpha + N}$
- A larger $\alpha$ means more confidence in our prior:
  $\frac{\alpha_k + m_k}{\alpha + N} = \frac{\alpha}{\alpha + N} \times \frac{\alpha_k}{\alpha} + \frac{N}{\alpha + N} \times \frac{m_k}{N}$
Multinomial: Predictive distribution
$p(x \mid \mathcal{D}) = \mathrm{Multi}\!\left(\frac{\alpha_1 + m_1}{\alpha + N}, \dots, \frac{\alpha_K + m_K}{\alpha + N}\right)$
- Bayesian prediction combines sufficient statistics from imaginary Dirichlet samples and from the real data samples
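A short NumPy sketch of the Dirichlet–multinomial update and predictive probabilities; the hyperparameters and samples below are hypothetical.

```python
import numpy as np

K = 3
alpha = np.array([1.0, 1.0, 1.0])        # Dirichlet hyperparameters (illustrative)

D = np.array([0, 2, 1, 0, 0, 2])         # hypothetical samples, state indices 0..K-1
m = np.bincount(D, minlength=K)          # sufficient statistics m_k
N = len(D)

alpha_post = alpha + m                   # posterior: Dir(alpha + m)
p_pred = alpha_post / alpha_post.sum()   # p(x_k = 1 | D) = (alpha_k + m_k) / (alpha + N)

# The same prediction written as a convex combination of the prior mean and the MLE.
w = alpha.sum() / (alpha.sum() + N)
p_mix = w * (alpha / alpha.sum()) + (1 - w) * (m / N)
print(p_pred, p_mix)                     # identical up to floating point
```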
Example: MLE vs. Bayesian
- Effect of different priors on the smoothing of parameter estimates: $\theta \sim \mathrm{Beta}(1, 1)$ vs. $\theta \sim \mathrm{Beta}(10, 10)$ [Koller & Friedman Book]
Bayesian estimation: Gaussian with unknown $\mu$ (known $\sigma$)
- $p(x \mid \mu) = \mathcal{N}(\mu, \sigma^2)$
- $p(\mu) = \mathcal{N}(\mu_0, \sigma_0^2)$
$p(\mu \mid \mathcal{D}) \propto p(\mathcal{D} \mid \mu)\, p(\mu) = \prod_{n=1}^{N} p(x^{(n)} \mid \mu)\, p(\mu)$
$p(\mu \mid \mathcal{D}) \propto \exp\left\{ -\frac{1}{2}\left[ \left(\frac{N}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{\sum_{n=1}^{N} x^{(n)}}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}\right)\mu \right] \right\}$
Bayesian estimation: Gaussian with unknown $\mu$ (known $\sigma$)
$\Rightarrow p(\mu \mid \mathcal{D}) = \mathcal{N}(\mu_N, \sigma_N^2)$
$\mu_N = \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2} \cdot \frac{\sum_{n=1}^{N} x^{(n)}}{N} + \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\, \mu_0$
$\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}$
Predictive: $p(x \mid \mathcal{D}) = \int p(x \mid \mu)\, p(\mu \mid \mathcal{D})\, d\mu \propto \exp\left\{ -\frac{1}{2} \frac{(x - \mu_N)^2}{\sigma^2 + \sigma_N^2} \right\}$
$\Rightarrow p(x \mid \mathcal{D}) = \mathcal{N}(\mu_N, \sigma^2 + \sigma_N^2)$
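A minimal sketch of this Gaussian-mean posterior and predictive in NumPy; the prior values and the synthetic data are illustrative assumptions.

```python
import numpy as np

sigma2 = 1.0               # known observation variance sigma^2
mu0, sigma02 = 0.0, 4.0    # prior N(mu_0, sigma_0^2); values are illustrative

rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=20)   # hypothetical samples
N = len(D)

# Posterior p(mu | D) = N(mu_N, sigma_N^2) from the formulas on the slide.
mu_N = (N * sigma02 / (N * sigma02 + sigma2)) * D.mean() \
       + (sigma2 / (N * sigma02 + sigma2)) * mu0
sigma_N2 = 1.0 / (1.0 / sigma02 + N / sigma2)

# Predictive distribution for a new point: N(mu_N, sigma^2 + sigma_N^2).
print(mu_N, sigma_N2, sigma2 + sigma_N2)
```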
Conjugate prior for Gaussian distribution
- Known $\mu$, unknown $\sigma^2$:
  - The conjugate prior for the precision $\lambda = 1/\sigma^2$ is a Gamma with shape $a_0$ and rate (inverse scale) $b_0$:
    $\mathrm{Gam}(\lambda \mid a, b) = \frac{1}{\Gamma(a)}\, b^a \lambda^{a-1} \exp(-b\lambda)$
  - The conjugate prior for $\sigma^2$ is an Inverse-Gamma:
    $\mathrm{InvGam}(\sigma^2 \mid a, b) = \frac{1}{\Gamma(a)}\, b^a (\sigma^2)^{-a-1} \exp\!\left(-\frac{b}{\sigma^2}\right)$
- Unknown $\mu$ and unknown $\sigma^2$:
  - The conjugate prior $p(\mu, \lambda)$ is a Normal-Gamma: $p(\mu, \lambda) = \mathcal{N}(\mu \mid \mu_0, (\kappa\lambda)^{-1})\, \mathrm{Gam}(\lambda \mid a_0, b_0)$
- Multivariate case:
  - The conjugate prior $p(\mu, \Lambda)$ is a Normal-Wishart
Bayesian estimation: Summary
- The Bayesian approach treats parameters as random variables
  - Learning is then a special case of inference
- Asymptotically equivalent to MLE
- For some parametric distributions, the posterior has a closed form (obtained from the prior parameters and the sufficient statistics of the data)
Learning in Bayesian networks
- Learning
  - MLE
    - The likelihood decomposes over the nodes' conditional distributions
  - Bayesian
    - We can make some assumptions (global & local parameter independence) to simplify the learning
Likelihood decomposition: Example
Directed factorization causes the likelihood to decompose locally:
$p(x \mid \theta) = p(x_1 \mid \theta_1)\, p(x_2 \mid x_1, \theta_2)\, p(x_3 \mid x_1, \theta_3)\, p(x_4 \mid x_2, x_3, \theta_4)$
[Network: $X_1 \rightarrow X_2$, $X_1 \rightarrow X_3$, and $X_2, X_3 \rightarrow X_4$.]
Decomposition of the likelihood
- If we assume the parameters of each CPD are disjoint (i.e., disjoint parameter sets $\theta_{X_i \mid Pa_{X_i}}$ for different variables) and the data is complete, then the likelihood function decomposes over the nodes:
  $L(\theta; \mathcal{D}) = p(\mathcal{D} \mid \theta) = \prod_{n=1}^{N} \prod_{i \in \mathcal{V}} p\!\left(x_i^{(n)} \mid pa_{x_i}^{(n)}, \theta_i\right) = \prod_{i \in \mathcal{V}} \prod_{n=1}^{N} p\!\left(x_i^{(n)} \mid pa_{x_i}^{(n)}, \theta_i\right) = \prod_{i \in \mathcal{V}} L_i(\theta_i; \mathcal{D})$
  where $L_i(\theta_i; \mathcal{D}) = \prod_{n=1}^{N} p\!\left(x_i^{(n)} \mid pa_{x_i}^{(n)}, \theta_i\right)$ and $\theta_i \equiv \theta_{X_i \mid Pa_{X_i}}$
- Consequently, $\hat{\theta} = \arg\max_\theta L(\theta; \mathcal{D})$ with $\theta = [\theta_1, \dots, \theta_{|\mathcal{V}|}]$ splits into independent problems $\hat{\theta}_i = \arg\max_{\theta_i} L_i(\theta_i; \mathcal{D})$
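To make the decomposition concrete, here is a small sketch (with hypothetical binary data and illustrative parameter values) that evaluates the log-likelihood of the four-node example network as a sum of per-node local terms.

```python
import numpy as np

# Hypothetical complete binary data for the example network
# X1 -> X2, X1 -> X3, {X2, X3} -> X4.  Columns: x1, x2, x3, x4.
D = np.array([[0, 0, 1, 1],
              [1, 1, 0, 1],
              [1, 1, 1, 0],
              [0, 0, 0, 0],
              [1, 0, 1, 1]])

# Parents of each node, given as column indices.
parents = {0: [], 1: [0], 2: [0], 3: [1, 2]}

# Illustrative parameters: one Bernoulli p(X_i = 1 | u) per parent assignment u.
theta = {0: {(): 0.5},
         1: {(0,): 0.2, (1,): 0.9},
         2: {(0,): 0.6, (1,): 0.7},
         3: {(0, 0): 0.1, (0, 1): 0.8, (1, 0): 0.7, (1, 1): 0.5}}

def local_log_likelihood(i):
    """log L_i(theta_i; D) = sum_n log p(x_i^(n) | pa_i^(n), theta_i)."""
    ll = 0.0
    for row in D:
        u = tuple(row[parents[i]])    # parent assignment in this sample
        p1 = theta[i][u]              # p(X_i = 1 | u)
        ll += np.log(p1 if row[i] == 1 else 1.0 - p1)
    return ll

# The global log-likelihood is just the sum of the per-node local terms.
print(sum(local_log_likelihood(i) for i in parents))
```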
Local decomposition of the likelihood: Table-CPDs
- The choices of parameters for different value assignments to the parents are independent of each other; for each parent assignment, consider only the data points that agree with it:
  $L_i(\theta_{X_i \mid Pa_{X_i}}; \mathcal{D}) = \prod_{n=1}^{N} \theta_{x_i^{(n)} \mid pa_{x_i}^{(n)}} = \prod_{u \in Val(Pa_{X_i})} \prod_{x \in Val(X_i)} \theta_{x \mid u}^{\;\sum_{n=1}^{N} \mathbb{1}\left[x_i^{(n)} = x,\; pa_{x_i}^{(n)} = u\right]}$
Local decomposition of the likelihood: Table-CPDs
Define the counts $m_i(x, u) = \sum_{n=1}^{N} \mathbb{1}\left[x_i^{(n)} = x,\; pa_{x_i}^{(n)} = u\right]$.
Maximizing subject to $\forall u: \sum_x \theta_{x \mid u}^{X_i} = 1$ gives
$\hat{\theta}_{x \mid u}^{X_i} = \frac{m_i(x, u)}{\sum_{x'} m_i(x', u)} = \frac{\sum_{n=1}^{N} \mathbb{1}\left[x_i^{(n)} = x,\; pa_{x_i}^{(n)} = u\right]}{\sum_{n=1}^{N} \mathbb{1}\left[pa_{x_i}^{(n)} = u\right]}$
[Table CPD of $X_i$ given $Pa_{X_i}$: each row, indexed by a parent assignment $u_j$, contains the parameters $\theta_{1 \mid u_j}^{X_i}, \dots, \theta_{K \mid u_j}^{X_i}$ and is a multinomial distribution; $\theta_i \equiv \theta_{X_i \mid Pa_{X_i}}$.]
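A minimal counting sketch of this closed-form table-CPD MLE for a single node with one parent; the data tuples and dimensions are hypothetical.

```python
import numpy as np
from collections import Counter

# Hypothetical complete data for one node X with parent set Pa(X) = {U}.
# Each row is (u, x) with u in {0, 1} and x in {0, 1, 2}.
data = [(0, 2), (0, 0), (1, 1), (1, 1), (0, 2), (1, 0), (0, 2), (1, 1)]
K = 3                                           # |Val(X)|

counts = Counter(data)                          # m(x, u): joint counts
parent_counts = Counter(u for u, _ in data)     # m(u) = sum_x m(x, u)

# MLE of the table CPD: theta_{x|u} = m(x, u) / m(u), one multinomial per row u.
cpd = {u: np.array([counts[(u, x)] for x in range(K)]) / parent_counts[u]
       for u in parent_counts}
for u, row in cpd.items():
    print(f"p(X | U={u}) =", row)               # each row sums to 1
```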
Overfitting
- For a large number of parent configurations $|Val(Pa_X)|$, most "buckets" will have very few instances
  - The number of parameters increases exponentially with the number of parents
  - Poor estimation
  - With limited data, overfitting occurs
- We often get better generalization with simpler structures
- Zero-count problem, or the sparse-data problem
  - e.g., consider a naïve Bayes classifier and an email $x$ containing a word that does not occur in any training document of class $c$; then $p(c \mid x) = 0$
ML Example: Naïve Bayes classifier
[Plate diagram of the naïve Bayes model. Training phase: class $c^{(n)}$ with prior parameters $\pi$ and features $x_j^{(n)}$, $j = 1, \dots, d$, with CPD parameters $\theta_j = \theta_{X_j \mid c} = (\theta_{X_j \mid 1}, \dots, \theta_{X_j \mid C})$, $c \in \{1, \dots, C\}$; plate over samples $n = 1, \dots, N$. Test phase: $c$ and $x_1, \dots, x_d$.]
ML Example: Naïve Bayes classifier
$L(\theta; \mathcal{D}) = \prod_{n=1}^{N} p(c^{(n)} \mid \pi) \prod_{j=1}^{d} p\!\left(x_j^{(n)} \mid c^{(n)}, \theta_{X_j \mid c^{(n)}}\right)$
$\hat{\pi}_c = \frac{\sum_{n=1}^{N} \mathbb{1}\left[c^{(n)} = c\right]}{N}$
- Discrete inputs or features ($x_j \in \{1, \dots, K\}$):
  $\hat{\theta}_{k \mid c}^{X_j} = \frac{\sum_{n=1}^{N} \mathbb{1}\left[x_j^{(n)} = k,\; c^{(n)} = c\right]}{\sum_{n=1}^{N} \mathbb{1}\left[c^{(n)} = c\right]} = \hat{p}(x_j = k \mid c)$
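A compact sketch of naïve Bayes MLE by counting, plus prediction via the factorization $p(c \mid x) \propto \pi_c \prod_j p(x_j \mid c)$; the toy dataset and dimensions are hypothetical (note that zero counts would reproduce the sparse-data problem described above).

```python
import numpy as np

# Hypothetical training data: rows are feature vectors with discrete values in {0,..,L-1},
# y holds class labels in {0,..,C-1}.
X = np.array([[0, 1], [1, 1], [0, 0], [1, 0], [0, 1], [1, 1]])
y = np.array([0, 0, 0, 1, 1, 1])
C, d, L = 2, X.shape[1], 2

# Class prior MLE: pi_c = (# samples with class c) / N.
pi = np.bincount(y, minlength=C) / len(y)

# Per-feature, per-class multinomial MLE: theta[j, c, k] = p(x_j = k | c).
theta = np.zeros((d, C, L))
for c in range(C):
    Xc = X[y == c]
    for j in range(d):
        theta[j, c] = np.bincount(Xc[:, j], minlength=L) / len(Xc)

# Prediction: p(c | x) is proportional to pi_c * prod_j p(x_j | c).
x_new = np.array([0, 1])
scores = pi * np.prod([theta[j, :, x_new[j]] for j in range(d)], axis=0)
print(scores / scores.sum())
```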
Recall: Maximum conditional likelihood
- Example: a discriminative classifier
  - Needs to learn the conditional distribution (not the joint distribution)
- Given $\mathcal{D} = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N}$ and a parametric conditional distribution $p(y \mid x; \theta)$, we find:
  $\hat{\theta} = \arg\max_\theta \prod_{n=1}^{N} p(y^{(n)} \mid x^{(n)}; \theta)$
- We will also see maximum conditional likelihood for more general CPDs than tabular ones
Maximum conditional likelihood: Linear regression
[Graphical model: $x \rightarrow y$ with parameters $w$.]
$p(y \mid x, w) = \mathcal{N}(y \mid w^T \phi(x), \sigma^2)$, linear: $\phi(x) = [1, x_1, \dots, x_d]$
$p(\mathcal{Y} \mid \mathcal{X}; w) = \prod_{n=1}^{N} p(y^{(n)} \mid x^{(n)}, w) = \prod_{n=1}^{N} \mathcal{N}\!\left(y^{(n)} \mid w^T \phi(x^{(n)}), \sigma^2\right)$
$\hat{w} = \arg\max_w \prod_{n=1}^{N} \mathcal{N}\!\left(y^{(n)} \mid w^T \phi(x^{(n)}), \sigma^2\right) = (\Phi^T \Phi)^{-1} \Phi^T \mathbf{y}$
$\mathcal{D} = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N}$
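A small NumPy sketch of this closed-form solution on synthetic data; the generating coefficients are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 1-D regression data generated from y = 2 + 3x + noise.
x = rng.uniform(-1, 1, size=50)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=50)

# Design matrix with phi(x) = [1, x]; maximizing the conditional likelihood
# under Gaussian noise gives the normal equations w = (Phi^T Phi)^{-1} Phi^T y.
Phi = np.column_stack([np.ones_like(x), x])
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(w)   # approximately [2, 3]
```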
Maximum conditional likelihood: Logistic regression
[Graphical model: $x \rightarrow y$ with parameters $w$.]
$p(y = 1 \mid x) = \sigma(w^T \phi(x))$, linear LR: $\phi(x) = [1, x_1, \dots, x_d]$
$p(y \mid x, w) = \mathrm{Bern}(y \mid \sigma(w^T \phi(x)))$, where $y \in \{0, 1\}$ and $\sigma(z) = \frac{1}{1 + \exp(-z)}$
$p(\mathcal{Y} \mid \mathcal{X}, w) = \prod_{n=1}^{N} p(y^{(n)} \mid x^{(n)}, w) = \prod_{n=1}^{N} \mathrm{Bern}\!\left(y^{(n)} \mid \sigma(w^T \phi(x^{(n)}))\right) = \prod_{n=1}^{N} \sigma\!\left(w^T \phi(x^{(n)})\right)^{y^{(n)}} \left(1 - \sigma\!\left(w^T \phi(x^{(n)})\right)\right)^{1 - y^{(n)}}$
$\mathcal{D} = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N}$
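Unlike linear regression, this conditional likelihood has no closed-form maximizer; below is a sketch that ascends the conditional log-likelihood by gradient steps on synthetic data (the learning rate and iteration count are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical binary classification data.
x = rng.normal(size=100)
y = (rng.uniform(size=100) < 1.0 / (1.0 + np.exp(-(2.0 * x - 0.5)))).astype(float)

Phi = np.column_stack([np.ones_like(x), x])   # phi(x) = [1, x]
w = np.zeros(2)

# Ascend the conditional log-likelihood
# sum_n { y log sigma(w^T phi) + (1 - y) log(1 - sigma(w^T phi)) }.
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-Phi @ w))        # sigma(w^T phi(x^(n)))
    grad = Phi.T @ (y - p)                    # gradient of the log-likelihood
    w += 0.01 * grad
print(w)
```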
MLE for Bayesian networks: Summary
- For a BN with disjoint sets of parameters in the CPDs, the likelihood decomposes as a product of local likelihood functions, one per node
  - Thus, we can optimize the likelihoods of different nodes separately
- For table CPDs, the local likelihood further decomposes as a product of likelihoods of multinomials, one for each parent combination
- Sparse-data problem of MLE
Bayesian learning for BNs: Global parameter independence
- Global parameter independence assumption:
  $p(\theta) = \prod_{i \in \mathcal{V}} p(\theta_i)$, where $\theta_i \equiv \theta_{X_i \mid Pa_{X_i}}$
- With complete data, the likelihood decomposes as $p(\mathcal{D} \mid \theta) = \prod_{i \in \mathcal{V}} L_i(\theta_i; \mathcal{D})$ with $L_i(\theta_i; \mathcal{D}) = \prod_{n=1}^{N} p\!\left(x_i^{(n)} \mid pa_{x_i}^{(n)}, \theta_i\right)$
- Therefore the posterior also factorizes:
  $p(\theta \mid \mathcal{D}) = \frac{1}{p(\mathcal{D})}\, p(\mathcal{D} \mid \theta)\, p(\theta) = \frac{1}{p(\mathcal{D})} \prod_{i \in \mathcal{V}} L_i(\theta_i; \mathcal{D})\, p(\theta_i) = \prod_{i \in \mathcal{V}} p(\theta_i \mid \mathcal{D})$
  $\Rightarrow$ The posteriors of the $\theta_i$ are independent given complete data and independent priors
[Meta-network: in each sample $n$, the variables $x_1^{(n)}, x_2^{(n)}, x_3^{(n)}$ depend on their own parameter nodes $\theta_1, \theta_2, \theta_3$.]
Bayesian learning for BNs: Global parameter independence
- How can we read this independence directly from the structure of the Bayesian network (over data and parameters) when $p(\theta)$ satisfies global parameter independence?
Bayesian learning for BNs: Global parameter independence
$p(x \mid \mathcal{D}) = \int p(x \mid \theta, \mathcal{D})\, p(\theta \mid \mathcal{D})\, d\theta = \int p(x \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta = \int \prod_{i \in \mathcal{V}} p(x_i \mid pa_{x_i}, \theta_i) \prod_{i \in \mathcal{V}} p(\theta_i \mid \mathcal{D})\, d\theta = \prod_{i \in \mathcal{V}} \int p(x_i \mid pa_{x_i}, \theta_i)\, p(\theta_i \mid \mathcal{D})\, d\theta_i$
- Solve the prediction problem for each CPD independently and then combine the results
- Instances are independent given the parameters (for complete data); $\theta_i \equiv \theta_{X_i \mid Pa_{X_i}}$
- Predictive distribution example: $p(x, y \mid \theta_x, \theta_y) = p(x \mid \theta_x)\, p(y \mid \theta_y)$
Local decomposition for table CPDs
- A prior $p(\theta_{X \mid U})$ satisfies the local parameter independence assumption if:
  $p(\theta_{X \mid U}) = \prod_{u \in Val(U)} p(\theta_{X \mid u})$
- The posteriors $p(\theta_{X \mid u} \mid \mathcal{D})$ and $p(\theta_{X \mid u'} \mid \mathcal{D})$ are independent despite the v-structure on $X$, because $Pa_X$ acts like a multiplexer
- The groups of parameters in different rows of a CPD are independent
[Meta-network: $u^{(n)} \rightarrow x^{(n)} \leftarrow \theta_{X \mid 1}, \dots, \theta_{X \mid |Val(U)|}$; plate over $n = 1, \dots, N$.]
Global and local parameter independence
- Let the data be complete and the CPDs be tabular. If the prior $p(\theta)$ satisfies global and local parameter independence, then:
  $p(\theta \mid \mathcal{D}) = \prod_{i \in \mathcal{V}} \prod_{u \in Val(Pa_{X_i})} p(\theta_{X_i \mid u} \mid \mathcal{D})$
- With a Dirichlet prior per row, $\theta_{X_i \mid u} \sim \mathrm{Dir}\!\left(\alpha_{1 \mid u}^{X_i}, \dots, \alpha_{K \mid u}^{X_i}\right)$, the posterior of each row is
  $\theta_{X_i \mid u} \mid \mathcal{D} \sim \mathrm{Dir}\!\left(\alpha_{1 \mid u}^{X_i} + m_i(1, u), \dots, \alpha_{K \mid u}^{X_i} + m_i(K, u)\right)$
  and the predictive probability is
  $p(x_i = k \mid u, \mathcal{D}) = \frac{\alpha_{k \mid u}^{X_i} + m_i(k, u)}{\sum_{k'=1}^{K} \left(\alpha_{k' \mid u}^{X_i} + m_i(k', u)\right)}$
  where $u \in Val(Pa_{X_i})$ and $K = |Val(X_i)|$
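A minimal sketch of this row-wise Dirichlet update for a single table CPD, reusing the counting setup from the MLE sketch above; the uniform hyperparameters are an illustrative (K2-style) choice.

```python
import numpy as np
from collections import Counter

# Same hypothetical setup as the MLE sketch: node X with one parent U,
# x in {0,..,K-1}, u in {0,..,J-1}.
data = [(0, 2), (0, 0), (1, 1), (1, 1), (0, 2), (1, 0), (0, 2), (1, 1)]
K, J = 3, 2
alpha = np.ones((J, K))             # one Dirichlet prior per CPD row (illustrative)

counts = Counter(data)
m = np.array([[counts[(u, x)] for x in range(K)] for u in range(J)])   # m(x, u)

# Posterior of each row is Dir(alpha_{.|u} + m(., u)); predictive probabilities:
post = alpha + m
pred = post / post.sum(axis=1, keepdims=True)   # p(x | u, D) = (alpha_{x|u} + m(x,u)) / sum_x'(...)
print(pred)
```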
Priors for Bayesian learning
- The prior hyperparameters $\alpha_{X_i \mid Pa_{X_i}}$ must be defined: $\alpha_{x \mid u}$ for all $x \in Val(X_i)$, $u \in Val(Pa_{X_i})$
- K2 prior: a fixed value for all hyperparameters, $\alpha_{x \mid u} = 1$
- We can instead store an equivalent sample size $\alpha$ and a prior distribution $P'$ over the variables in the network: $\alpha_{x_i \mid pa_{x_i}} = \alpha\, P'(x_i, pa_{x_i})$
- BDe prior: define $P'$ as a set of independent marginals over the $X_i$'s
Naïve Bayes Example: Bayesian learning
[Plate diagram of Bayesian naïve Bayes (training phase): Dirichlet hyperparameters on the class prior $\pi$ and on each $\theta_{X_j \mid c}$, $j = 1, \dots, d$, $c \in \{1, \dots, C\}$; plate over samples $n = 1, \dots, N$.]
Naïve Bayes Example: Bayesian learning
- $c \sim \mathrm{Multi}(\pi)$ (i.e., $p(c^{(n)} = c) = \pi_c$)
- Prior: $p(\pi \mid \alpha_1, \dots, \alpha_C) = \mathrm{Dir}(\pi \mid \alpha_1, \dots, \alpha_C)$
- Posterior: $p(\pi \mid \mathcal{D}, \alpha_1, \dots, \alpha_C) = \mathrm{Dir}(\pi \mid \alpha_1 + m_1, \dots, \alpha_C + m_C)$
  $\hat{p}(c^{(n)} = c) = \frac{\alpha_c + m_c}{\sum_{c'=1}^{C} (\alpha_{c'} + m_{c'})}$, where $m_c = \sum_{n=1}^{N} \mathbb{1}\left[c^{(n)} = c\right]$
- Discrete inputs or features ($x_j \in \{1, \dots, K\}$):
  $\hat{p}(x_j = k \mid c) = \frac{\alpha_{k \mid c}^{X_j} + m_j(k, c)}{\sum_{k'=1}^{K} \left(\alpha_{k' \mid c}^{X_j} + m_j(k', c)\right)}$, where $m_j(k, c) = \sum_{n=1}^{N} \mathbb{1}\left[x_j^{(n)} = k,\; c^{(n)} = c\right]$
Global & local independence
- For nodes with no parents, the parameters define a single distribution
  - Bayesian or ML learning can be used as in simple density estimation on a single variable
- More generally, for tabular CPDs there are multiple categorical distributions per node, one for every combination of parent values
  - The learning objective decomposes into multiple terms, one for the subset of training data with each parent configuration
  - Apply independent Bayesian or ML learning to each
Shared parameters
- Sharing CPDs, or sharing parameters within a single CPD:
  $p(X_i \mid Pa_{X_i}, \theta) = p(X_j \mid Pa_{X_j}, \theta)$, e.g., $p(X_2 \mid X_1, \theta^{t}) = p(X_3 \mid X_2, \theta^{t})$
- MLE: for networks with shared CPDs, the sufficient statistics accumulate over all uses of the CPD
  - Aggregate sufficient statistics: collect all the instances generated from the same conditional distribution and combine their sufficient statistics
Markov chain
$p(x_1, \dots, x_T \mid \theta) = p(x_1 \mid \pi) \prod_{t=2}^{T} p(x_t \mid x_{t-1}, A_{t-1,t})$
- Shared parameters:
  $p(x_1, \dots, x_T \mid \theta) = p(x_1 \mid \pi) \prod_{t=2}^{T} p(x_t \mid x_{t-1}, A)$
[Chain: $X_1 \rightarrow X_2 \rightarrow \dots \rightarrow X_{T-1} \rightarrow X_T$.]
Markov chain
- Initial state probability: $\pi_i = p(x_1 = i)$, $1 \le i \le K$
- State transition probability: $A_{ij} = \theta_{j \mid i} = p(x_{t+1} = j \mid x_t = i)$, $1 \le i, j \le K$, with $\sum_{j=1}^{K} A_{ij} = 1$
[Chain: $X_1 \rightarrow X_2 \rightarrow \dots \rightarrow X_T$, with parameters $\pi$ and $A$.]
Markov chain: Parameter learning by MLE
$p(\mathcal{D} \mid \theta) = \prod_{n=1}^{N} \left[ p\!\left(x_1^{(n)} \mid \pi\right) \prod_{t=2}^{T} p\!\left(x_t^{(n)} \mid x_{t-1}^{(n)}, A\right) \right]$
$\hat{A}_{ij} = \frac{\sum_{n=1}^{N} \sum_{t=2}^{T} \mathbb{1}\left[x_{t-1}^{(n)} = i,\; x_t^{(n)} = j\right]}{\sum_{n=1}^{N} \sum_{t=2}^{T} \mathbb{1}\left[x_{t-1}^{(n)} = i\right]}$
$\hat{\pi}_i = \frac{\sum_{n=1}^{N} \mathbb{1}\left[x_1^{(n)} = i\right]}{N}$
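A counting sketch of these MLEs; the state sequences are hypothetical, and sequences of different lengths are allowed because the transition parameters are shared across time.

```python
import numpy as np

K = 2
# Hypothetical state sequences (each a list of states in {0,..,K-1}).
sequences = [[0, 0, 1, 1, 0], [1, 1, 1, 0], [0, 1, 0, 0, 1, 1]]

pi_counts = np.zeros(K)
A_counts = np.zeros((K, K))
for seq in sequences:
    pi_counts[seq[0]] += 1
    for t in range(1, len(seq)):
        A_counts[seq[t - 1], seq[t]] += 1                 # count transitions i -> j

pi_hat = pi_counts / pi_counts.sum()                      # initial-state MLE
A_hat = A_counts / A_counts.sum(axis=1, keepdims=True)    # row-normalized transition MLE
print(pi_hat)
print(A_hat)
```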
Markov chain: Bayesian learning
- Assign a Dirichlet prior to each row of the transition matrix $A$: $A_{i,\cdot} \sim \mathrm{Dir}(\alpha_{i,1}, \dots, \alpha_{i,K})$, where $A_{ij} = p(x_{t+1} = j \mid x_t = i)$
  $p(x_{t+1} = j \mid x_t = i, \mathcal{D}, \alpha_{i,\cdot}) = \frac{\sum_{n=1}^{N} \sum_{t=2}^{T} \mathbb{1}\left[x_{t-1}^{(n)} = i,\; x_t^{(n)} = j\right] + \alpha_{i,j}}{\sum_{n=1}^{N} \sum_{t=2}^{T} \mathbb{1}\left[x_{t-1}^{(n)} = i\right] + \sum_{j'=1}^{K} \alpha_{i,j'}}$
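Continuing the previous sketch, the Bayesian version simply adds the Dirichlet pseudo-counts to each row of transition counts before normalizing; the counts and hyperparameters below are illustrative.

```python
import numpy as np

# Transition counts as computed in the MLE sketch above (illustrative values).
A_counts = np.array([[3.0, 4.0],
                     [3.0, 2.0]])
alpha = np.full((2, 2), 1.0)        # Dir(1, 1) prior on each row (illustrative)

A_bayes = (A_counts + alpha) / (A_counts + alpha).sum(axis=1, keepdims=True)
print(A_bayes)                      # predictive p(x_{t+1} = j | x_t = i, D)
```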
Hyper-parameters in Bayesian approach
- We have already considered parameter independencies and parameter sharing
- Hierarchical priors give us a flexible language for introducing dependencies in the priors over parameters
  - Useful when we have a small number of examples relevant to each parameter but we believe that some parameters are reasonably similar
  - In such situations, hierarchical priors "spread" the effect of the observations among parameters with shared hyper-parameters
- However, when we use hyper-parameters we transform our learning problem into one that includes a hidden variable
Hyper-parameters: Example
The effect of the prior is to shift both $\theta_{\cdot \mid x^0}$ and $\theta_{\cdot \mid x^1}$ to be closer to each other. [Koller & Friedman Book]
Hyper-parameters: Example
[Koller & Friedman Book] If we believe that two variables Y and Z depend on X in a similar (but not identical) way
References
- D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques, MIT Press, 2009.