STAT 339: Naive Bayes Classification
8–10 March 2017
Colin Reimer Dawson
Outline

▸ Naive Bayes Classification
▸ Naive Bayes Classifier Using MLE Parameter Estimates
▸ Bayesian Naive Bayes
▸ Continuous Features
▸ Mixed Feature Types

Back to Supervised Learning...
▸ How can we use Bayesian methods to do classification?
▸ General idea: model the class-conditional distributions p(x ∣ t = c) for each category c, and include a prior distribution over categories. Then use Bayes' rule:

p(tnew = c ∣ xnew) = p(tnew = c) p(xnew ∣ tnew = c) / ∑_{c′} p(tnew = c′) p(xnew ∣ tnew = c′)
Example: Federalist Papers (Mosteller and Wallace, 1963)
▸ Dataset: 12 anonymous essays written to persuade New York to ratify the U.S. Constitution, presumed written by John Jay, James Madison, or Alexander Hamilton.
▸ Training data: known-author Federalist papers
▸ t = author, x = vector of words
▸ Question: how do we model p(x ∣ t)?
The Naive Bayes Assumption
▸ To simplify, make the unrealistic (hence "naive") assumption that the words are all independent given the class (author). That is, for an N-word document,

p(x ∣ t) = ∏_{n=1}^{N} p(xn ∣ t)

▸ The problem then reduces to estimating a word distribution for each author.
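As a sketch of the factorization, the log-likelihood of a document under one author is just a sum of per-word log-probabilities. The word probabilities below are made up for illustration; they are not estimates from the actual Federalist data.

```python
import math

# Hypothetical per-word probabilities p(w | author) for a toy 3-word vocabulary.
theta = {
    "hamilton": {"upon": 0.02, "the": 0.06, "whilst": 0.001},
    "madison":  {"upon": 0.001, "the": 0.06, "whilst": 0.01},
}

def log_likelihood(words, author):
    # Naive Bayes: log p(x | t) = sum over n of log p(x_n | t)
    return sum(math.log(theta[author][w]) for w in words)

doc = ["upon", "the", "upon"]
# "upon" is far more probable under Hamilton in this toy table,
# so his log-likelihood for this document is higher.
print(log_likelihood(doc, "hamilton") > log_likelihood(doc, "madison"))  # True
```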
Categorical and Multinomial Distributions
A given word xn is one of many in a vocabulary. We can model the word distribution for author t as a categorical distribution with parameter vector θt = (θt1, ..., θtW), where θtw = p(xn = w ∣ t):

p(xn ∣ t) = ∏_{w=1}^{W} θtw^{I(xn = w)}

Then

p(x ∣ t, θt) = ∏_{n=1}^{N} ∏_{w=1}^{W} θtw^{I(xn = w)} = ∏_{w=1}^{W} θtw^{ntw}

where ntw is the number of times w appears in documents written by author t. The distribution of the ntw (for fixed t) is a multinomial distribution.
Samples from a Multinomial
Figure: Each line represents one sample from a multinomial distribution over the values {1, 2, ..., 6} with equal probabilities for each category. Sample size is 100.
Maximum Likelihood Estimation
▸ The likelihood and log likelihood for θt:

L(θt) = ∏_{w=1}^{W} θtw^{ntw}

log L(θt) = ∑_{w=1}^{W} ntw log(θtw)
▸ Because the θtw are constrained to sum to 1 for each t,
we can’t just maximize this function freely; we need a constrained optimization technique such as Lagrange multipliers.
▸ Omitting details, we get

θ̂tw = ntw / ∑_{w′=1}^{W} ntw′

i.e., the sample proportions.
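The MLE is simple enough to compute directly: pool all of an author's documents, count each word, and normalize. A minimal sketch with a toy two-document corpus:

```python
from collections import Counter

def mle_theta(docs):
    """MLE for categorical word probabilities: sample proportions n_tw / n_t,
    computed from all documents by a single author."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Two toy documents by the same author: "the" appears 3 times out of 4 words.
theta = mle_theta([["upon", "the"], ["the", "the"]])
print(theta)  # {'upon': 0.25, 'the': 0.75}
```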
Sparse Data
▸ Many of the counts will be zero for a particular author in
the training set. Do we really want to say that these words have probability zero for that author?
▸ Ad-hoc approach: “Laplace smoothing”. Add a small
number to each count to get rid of zeroes.
▸ Bayesian approach: use a prior on the parameter vectors
θt, t = 1,...,T (T being the number of different classes).
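The ad-hoc smoothing fix is a one-liner: add a small constant α to every count before normalizing, so no word gets probability exactly zero. A sketch (the function name and toy counts are illustrative):

```python
def smoothed_theta(counts, vocab, alpha=1.0):
    # Laplace / add-alpha smoothing: add alpha to every count, including
    # words never observed for this author, then normalize.
    total = sum(counts.get(w, 0) for w in vocab) + alpha * len(vocab)
    return {w: (counts.get(w, 0) + alpha) / total for w in vocab}

# "upon" was never observed, but still gets nonzero probability.
theta = smoothed_theta({"the": 3}, vocab=["the", "upon"], alpha=1.0)
print(theta)  # {'the': 0.8, 'upon': 0.2}
```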
Dirichlet-Multinomial Model
▸ The conjugate prior for a multinomial parameter vector is the Dirichlet distribution:

p(θt ∣ α) = [Γ(∑w αw) / ∏w Γ(αw)] ∏_{w=1}^{W} θtw^{αw − 1}

(where all αw > 0 and ∑w θtw = 1)
Mean of a Dirichlet
The mean vector for a Dirichlet is E{θt ∣ α} = (E{θt1 ∣ α}, ..., E{θtW ∣ α}), where

E{θtw0 ∣ α} = [Γ(∑w αw) / ∏w Γ(αw)] ∫ θtw0 ∏_{w=1}^{W} θtw^{αw − 1} dθt

= [Γ(∑w αw) / ∏w Γ(αw)] ∫ ∏_{w=1}^{W} θtw^{αw + I(w = w0) − 1} dθt

= [Γ(∑w αw) / (Γ(αw0) ∏_{w ≠ w0} Γ(αw))] · [Γ(αw0 + 1) ∏_{w ≠ w0} Γ(αw) / Γ(∑w αw + 1)]

= αw0 / ∑w αw
Dirichlet-Multinomial Model
▸ The conjugate prior for a multinomial parameter vector is the Dirichlet distribution:

p(θt ∣ α) = [Γ(∑w αw) / ∏w Γ(αw)] ∏_{w=1}^{W} θtw^{αw − 1}

▸ Together with the multinomial likelihood

p(x ∣ θt) = ∏_{w=1}^{W} θtw^{ntw}

▸ the posterior is Dirichlet:

p(θt ∣ x) = [Γ(N + ∑w αw) / ∏w Γ(αw + ntw)] ∏_{w=1}^{W} θtw^{αw + ntw − 1}
Dirichlet-Multinomial Predictive Distribution
▸ The posterior is Dirichlet:

p(θt ∣ x, t) = [Γ(N + ∑w αw) / ∏w Γ(αw + ntw)] ∏_{w=1}^{W} θtw^{αw + ntw − 1}

▸ The predictive probability that xnew = w0 is then

p(xnew = w0 ∣ x, t) = ∫ p(xnew = w0 ∣ θt) p(θt ∣ x) dθt

= [Γ(N + ∑w αw) / ∏w Γ(αw + ntw)] ∫ θtw0 ∏_{w=1}^{W} θtw^{αw + ntw − 1} dθt

= E{θtw0 ∣ x, t} = (αw0 + ntw0) / (N + ∑w αw)
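The predictive probability is just "smoothed counts": (αw + ntw) / (N + ∑w αw). A minimal sketch with toy counts for one author (the function name and data are illustrative):

```python
def predictive_prob(w, counts, alpha, vocab):
    """Posterior predictive p(x_new = w | x, t) for the Dirichlet-multinomial:
    (alpha_w + n_w) / (N + sum_w alpha_w)."""
    N = sum(counts.get(v, 0) for v in vocab)
    return (alpha[w] + counts.get(w, 0)) / (N + sum(alpha[v] for v in vocab))

counts = {"upon": 2, "the": 8}                         # toy word counts, N = 10
alpha = {"upon": 1.0, "the": 1.0, "whilst": 1.0}        # symmetric prior
vocab = ["upon", "the", "whilst"]

# An unseen word still gets nonzero predictive probability: (1 + 0)/(10 + 3)
print(predictive_prob("whilst", counts, alpha, vocab))  # 1/13 ≈ 0.0769
```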
Summary: Naive Bayes with Categorical Features
▸ We make the "naive Bayes" assumption that the feature dimensions are independent:

p(x ∣ t = c) = ∏_{d=1}^{D} p(xd ∣ t = c)

▸ Each p(xd ∣ t) is a categorical distribution with parameter vector θdt, giving probabilities over the values of xd.
▸ The MLEs are just the training proportions; but to "smooth", we can use a Dirichlet prior, which yields predictive probabilities (integrating out θdt):

p(xnew,d = w ∣ t = c, x) = (αw + ndtw) / (Ndt + ∑w′ αw′)

▸ We can use the same procedure to estimate the prior probabilities p(t).
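Putting the pieces together, here is a sketch of a categorical naive Bayes classifier with Dirichlet (add-α) smoothing, treating each document as a bag of words. All names and the two-document training set are toy illustrations, not the actual Federalist analysis.

```python
import math
from collections import Counter, defaultdict

def train(docs, labels, alpha=1.0):
    """Fit class priors and smoothed word probabilities for each class."""
    vocab = sorted({w for d in docs for w in d})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for d, t in zip(docs, labels):
        word_counts[t].update(d)
    # Prior p(t): class proportions in the training set.
    priors = {t: c / len(docs) for t, c in class_counts.items()}
    # Predictive word probabilities: (alpha + n_tw) / (N_t + alpha * W).
    theta = {}
    for t in class_counts:
        N = sum(word_counts[t].values())
        theta[t] = {w: (word_counts[t][w] + alpha) / (N + alpha * len(vocab))
                    for w in vocab}
    return priors, theta

def classify(doc, priors, theta):
    # Score each class by log p(t) + sum_n log p(x_n | t); return the argmax.
    scores = {t: math.log(priors[t]) + sum(math.log(theta[t][w]) for w in doc)
              for t in priors}
    return max(scores, key=scores.get)

priors, theta = train([["upon", "the"], ["whilst", "the"]],
                      ["hamilton", "madison"])
print(classify(["upon"], priors, theta))  # hamilton
```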
Example: Iris Data
A Generative Model for Continuous Features
Naive Bayes with Continuous Features
The naive Bayes assumption can be made regardless of the individual feature types:

p(x ∣ t = c) = ∏_{d=1}^{D} p(xd ∣ t = c)

For example, suppose xd ∣ t can be modeled as Normal:

p(xd ∣ t) = (1 / √(2πσ²td)) exp{−(xd − µtd)² / (2σ²td)}

Then the joint likelihood function is

L(µ, σ²) = ∏_{n=1}^{N} ∏_{d=1}^{D} (1 / √(2πσ²tnd)) exp{−(xnd − µtnd)² / (2σ²tnd)}
MLE Parameter Estimates
The joint likelihood function is

L(µ, σ²) = ∏_{n=1}^{N} ∏_{d=1}^{D} (1 / √(2πσ²tnd)) exp{−(xnd − µtnd)² / (2σ²tnd)}

When considering the µt0d0, σ²t0d0 part of the gradient, all terms for which tn ≠ t0 and/or d ≠ d0 are constants, so we get

∂ log L / ∂µt0d0 = (1 / σ²t0d0) ∑_{n: tn = t0} (xnd0 − µt0d0)

∂ log L / ∂σ²t0d0 = ∑_{n: tn = t0} (−1 / (2σ²t0d0) + (xnd0 − µt0d0)² / (2(σ²t0d0)²))

which means we can consider each class and each coordinate separately, and the MLEs are the MLEs for the corresponding univariate Normal.
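Setting each gradient to zero gives the familiar univariate answers: the sample mean, and the (biased, 1/n) sample variance, computed separately per class and per feature. A minimal sketch:

```python
def gaussian_mle(xs):
    """MLE for a univariate Normal: sample mean and biased (1/n) variance.
    Applied per class and per feature dimension in Gaussian naive Bayes."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var

mu, var = gaussian_mle([1.0, 2.0, 3.0])
print(mu, var)  # 2.0 0.666...
```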
Naive Bayes Generative Model of Iris Data
Figure: Iris data (Sepal.Width vs. Sepal.Length) for the classes setosa, versicolor, and virginica. Class-conditional densities are shown as bivariate Normals with diagonal covariance matrices; means and variances estimated using MLE.
Classification
Having estimated the parameters, the posterior probability that tnew = c is just

p(tnew = c ∣ xnew, µ̂, σ̂²) = p(t = c) p(xnew ∣ t = c, µ̂, σ̂²) / ∑_{c′} p(t = c′) p(xnew ∣ t = c′, µ̂, σ̂²)

= p(t = c) ∏_{d=1}^{D} N(xnew,d ∣ µ̂cd, σ̂²cd) / ∑_{c′} p(t = c′) ∏_{d=1}^{D} N(xnew,d ∣ µ̂c′d, σ̂²c′d)
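The classification formula above can be sketched directly. The fitted parameters below are hypothetical one-dimensional values chosen for illustration, not MLEs from the actual iris data.

```python
import math

def normal_pdf(x, mu, var):
    # Univariate Normal density N(x | mu, var).
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior(x_new, priors, params):
    """p(t = c | x_new) under a fitted Gaussian naive Bayes model.
    params[c] is a list of (mu_cd, var_cd) pairs, one per feature dimension."""
    joint = {c: priors[c] * math.prod(normal_pdf(x, mu, var)
                                      for x, (mu, var) in zip(x_new, params[c]))
             for c in priors}
    z = sum(joint.values())          # normalizing constant (sum over classes)
    return {c: p / z for c, p in joint.items()}

# Toy 1-D example with hypothetical fitted means and variances.
priors = {"setosa": 0.5, "virginica": 0.5}
params = {"setosa": [(5.0, 0.1)], "virginica": [(6.5, 0.3)]}
post = posterior([5.1], priors, params)
print(max(post, key=post.get))  # setosa
```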