STAT 339: Naive Bayes Classification
8–10 March 2017
Colin Reimer Dawson
Outline

▸ Naive Bayes Classification
▸ Naive Bayes Classifier Using MLE Parameter Estimates
▸ Bayesian Naive Bayes
▸ Continuous Features
▸ Mixed Feature Types

Back to Supervised Learning...
▸ How can we use Bayesian methods to do classification?
▸ General idea: model the class-conditional distributions p(x ∣ t = c) for each category c, and include a prior distribution over categories. Then use Bayes' rule:

p(tnew = c ∣ xnew) = p(tnew = c) p(xnew ∣ tnew = c) / ∑_{c′} p(tnew = c′) p(xnew ∣ tnew = c′)
Example: Federalist Papers (Mosteller and Wallace, 1963)
▸ Dataset: 12 anonymous essays written to persuade New York to ratify the U.S. Constitution, presumed written by John Jay, James Madison, or Alexander Hamilton.
▸ Training data: known-author Federalist papers
▸ t = author, x = vector of words
▸ Question: how do we model p(x ∣ t)?
The Naive Bayes Assumption
▸ To simplify, make the unrealistic (hence "naive") assumption that the words are all independent given the class (author). That is, for an N-word document,

p(x ∣ t) = ∏_{n=1}^{N} p(xn ∣ t)

▸ The problem then reduces to estimating a word distribution for each author.
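As a sketch of the factorization, the log-likelihood of a document under one author is just a sum of per-word log-probabilities. The word probabilities below are made up for illustration; they are not estimates from the actual Federalist data.

```python
import math

# Hypothetical per-word probabilities p(w | author) for a toy 3-word vocabulary.
theta = {
    "hamilton": {"upon": 0.02, "the": 0.06, "whilst": 0.001},
    "madison":  {"upon": 0.001, "the": 0.06, "whilst": 0.01},
}

def log_likelihood(words, author):
    # Naive Bayes: log p(x | t) = sum over n of log p(x_n | t)
    return sum(math.log(theta[author][w]) for w in words)

doc = ["upon", "the", "upon"]
# "upon" is far more probable under Hamilton in this toy table,
# so his log-likelihood for this document is higher.
print(log_likelihood(doc, "hamilton") > log_likelihood(doc, "madison"))  # True
```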
Categorical and Multinomial Distributions
A given word xn is one of many in a vocabulary. We can model the word distribution for author t as a categorical distribution with parameter vector θt = (θt1, ..., θtW), where θtw = p(xn = w ∣ t):

p(xn ∣ t) = ∏_{w=1}^{W} θtw^{I(xn = w)}

Then

p(x ∣ t, θt) = ∏_{n=1}^{N} ∏_{w=1}^{W} θtw^{I(xn = w)} = ∏_{w=1}^{W} θtw^{ntw}

where ntw is the number of times w appears in documents written by author t. The distribution of the ntw (for fixed t) is a multinomial distribution.
Samples from a Multinomial
Figure: Each line represents one sample from a multinomial distribution over the values {1, 2, ..., 6} with equal probabilities for each category. Sample size is 100.
Maximum Likelihood Estimation
▸ The likelihood and log likelihood for θt:

L(θt) = ∏_{w=1}^{W} θtw^{ntw}

log L(θt) = ∑_{w=1}^{W} ntw log(θtw)
▸ Because the θtw are constrained to sum to 1 for each t,
we can’t just maximize this function freely; we need a constrained optimization technique such as Lagrange multipliers.
▸ Omitting details, we get

θ̂tw = ntw / ∑_{w′=1}^{W} ntw′

i.e., the sample proportions.
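The MLE is simple enough to compute directly: pool all of an author's documents, count each word, and normalize. A minimal sketch with a toy two-document corpus:

```python
from collections import Counter

def mle_theta(docs):
    """MLE for categorical word probabilities: sample proportions n_tw / n_t,
    computed from all documents by a single author."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Two toy documents by the same author: "the" appears 3 times out of 4 words.
theta = mle_theta([["upon", "the"], ["the", "the"]])
print(theta)  # {'upon': 0.25, 'the': 0.75}
```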
Sparse Data
▸ Many of the counts will be zero for a particular author in
the training set. Do we really want to say that these words have probability zero for that author?
▸ Ad-hoc approach: “Laplace smoothing”. Add a small
number to each count to get rid of zeroes.
▸ Bayesian approach: use a prior on the parameter vectors
θt, t = 1,...,T (T being the number of different classes).
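The ad-hoc smoothing fix is a one-liner: add a small constant α to every count before normalizing, so no word gets probability exactly zero. A sketch (the function name and toy counts are illustrative):

```python
def smoothed_theta(counts, vocab, alpha=1.0):
    # Laplace / add-alpha smoothing: add alpha to every count, including
    # words never observed for this author, then normalize.
    total = sum(counts.get(w, 0) for w in vocab) + alpha * len(vocab)
    return {w: (counts.get(w, 0) + alpha) / total for w in vocab}

# "upon" was never observed, but still gets nonzero probability.
theta = smoothed_theta({"the": 3}, vocab=["the", "upon"], alpha=1.0)
print(theta)  # {'the': 0.8, 'upon': 0.2}
```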
Dirichlet-Multinomial Model
▸ The conjugate prior for a multinomial parameter vector is the Dirichlet distribution:

p(θt ∣ α) = [Γ(∑w αw) / ∏w Γ(αw)] ∏_{w=1}^{W} θtw^{αw − 1}

(where all αw > 0 and ∑w θtw = 1)
Mean of a Dirichlet
The mean vector for a Dirichlet is E{θt ∣ α} = (E{θt1 ∣ α}, ..., E{θtW ∣ α}), where

E{θtw0 ∣ α} = [Γ(∑w αw) / ∏w Γ(αw)] ∫ θtw0 ∏_{w=1}^{W} θtw^{αw − 1} dθt

= [Γ(∑w αw) / ∏w Γ(αw)] ∫ ∏_{w=1}^{W} θtw^{αw + I(w = w0) − 1} dθt

= [Γ(∑w αw) / (Γ(αw0) ∏_{w ≠ w0} Γ(αw))] · [Γ(αw0 + 1) ∏_{w ≠ w0} Γ(αw) / Γ(∑w αw + 1)]

= αw0 / ∑w αw
Dirichlet-Multinomial Model
▸ The conjugate prior for a multinomial parameter vector is the Dirichlet distribution:

p(θt ∣ α) = [Γ(∑w αw) / ∏w Γ(αw)] ∏_{w=1}^{W} θtw^{αw − 1}

▸ Together with the multinomial likelihood

p(x ∣ θt) = ∏_{w=1}^{W} θtw^{ntw}

▸ the posterior is Dirichlet:

p(θt ∣ x) = [Γ(N + ∑w αw) / ∏w Γ(αw + ntw)] ∏_{w=1}^{W} θtw^{αw + ntw − 1}
Dirichlet-Multinomial Predictive Distribution
▸ The posterior is Dirichlet:

p(θt ∣ x, t) = [Γ(N + ∑w αw) / ∏w Γ(αw + ntw)] ∏_{w=1}^{W} θtw^{αw + ntw − 1}

▸ The predictive probability that xnew = w0 is then

p(xnew = w0 ∣ x, t) = ∫ p(xnew = w0 ∣ θt) p(θt ∣ x) dθt

= [Γ(N + ∑w αw) / ∏w Γ(αw + ntw)] ∫ θtw0 ∏_{w=1}^{W} θtw^{αw + ntw − 1} dθt

= E{θtw0 ∣ x, t} = (αw0 + ntw0) / (N + ∑w αw)
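The predictive probability is just "smoothed counts": (αw + ntw) / (N + ∑w αw). A minimal sketch with toy counts for one author (the function name and data are illustrative):

```python
def predictive_prob(w, counts, alpha, vocab):
    """Posterior predictive p(x_new = w | x, t) for the Dirichlet-multinomial:
    (alpha_w + n_w) / (N + sum_w alpha_w)."""
    N = sum(counts.get(v, 0) for v in vocab)
    return (alpha[w] + counts.get(w, 0)) / (N + sum(alpha[v] for v in vocab))

counts = {"upon": 2, "the": 8}                         # toy word counts, N = 10
alpha = {"upon": 1.0, "the": 1.0, "whilst": 1.0}        # symmetric prior
vocab = ["upon", "the", "whilst"]

# An unseen word still gets nonzero predictive probability: (1 + 0)/(10 + 3)
print(predictive_prob("whilst", counts, alpha, vocab))  # 1/13 ≈ 0.0769
```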
Summary: Naive Bayes with Categorical Features
▸ We make the "naive Bayes" assumption that the feature dimensions are independent:

p(x ∣ t = c) = ∏_{d=1}^{D} p(xd ∣ t = c)

▸ Each p(xd ∣ t) is a categorical distribution with parameter vector θdt, giving probabilities over the values of xd.
▸ The MLEs are just the training proportions; but to "smooth", we can use a Dirichlet prior, which yields predictive probabilities (integrating out θdt):

p(xnew,d = w ∣ t = c, x) = (αw + ndtw) / (Ndt + ∑w′ αw′)

▸ We can use the same procedure to estimate the prior probabilities p(t).
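Putting the pieces together, here is a sketch of a categorical naive Bayes classifier with Dirichlet (add-α) smoothing, treating each document as a bag of words. All names and the two-document training set are toy illustrations, not the actual Federalist analysis.

```python
import math
from collections import Counter, defaultdict

def train(docs, labels, alpha=1.0):
    """Fit class priors and smoothed word probabilities for each class."""
    vocab = sorted({w for d in docs for w in d})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for d, t in zip(docs, labels):
        word_counts[t].update(d)
    # Prior p(t): class proportions in the training set.
    priors = {t: c / len(docs) for t, c in class_counts.items()}
    # Predictive word probabilities: (alpha + n_tw) / (N_t + alpha * W).
    theta = {}
    for t in class_counts:
        N = sum(word_counts[t].values())
        theta[t] = {w: (word_counts[t][w] + alpha) / (N + alpha * len(vocab))
                    for w in vocab}
    return priors, theta

def classify(doc, priors, theta):
    # Score each class by log p(t) + sum_n log p(x_n | t); return the argmax.
    scores = {t: math.log(priors[t]) + sum(math.log(theta[t][w]) for w in doc)
              for t in priors}
    return max(scores, key=scores.get)

priors, theta = train([["upon", "the"], ["whilst", "the"]],
                      ["hamilton", "madison"])
print(classify(["upon"], priors, theta))  # hamilton
```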
Example: Iris Data
A Generative Model for Continuous Features
Naive Bayes with Continuous Features
The naive Bayes assumption can be made regardless of the individual feature types:

p(x ∣ t = c) = ∏_{d=1}^{D} p(xd ∣ t = c)

For example, suppose xd ∣ t can be modeled as Normal:

p(xd ∣ t) = (1 / √(2πσ²td)) exp{−(xd − µtd)² / (2σ²td)}

Then the joint likelihood function is

L(µ, σ²) = ∏_{n=1}^{N} ∏_{d=1}^{D} (1 / √(2πσ²tnd)) exp{−(xnd − µtnd)² / (2σ²tnd)}
MLE Parameter Estimates
The joint likelihood function is

L(µ, σ²) = ∏_{n=1}^{N} ∏_{d=1}^{D} (1 / √(2πσ²tnd)) exp{−(xnd − µtnd)² / (2σ²tnd)}

When considering the µt0d0, σ²t0d0 part of the gradient, all terms for which tn ≠ t0 and/or d ≠ d0 are constants, so we get

∂ log L / ∂µt0d0 = (1 / σ²t0d0) ∑_{n: tn = t0} (xnd0 − µt0d0)

∂ log L / ∂σ²t0d0 = ∑_{n: tn = t0} (−1 / (2σ²t0d0) + (xnd0 − µt0d0)² / (2(σ²t0d0)²))

which means we can consider each class and each coordinate separately, and the MLEs are the MLEs for the corresponding univariate Normal.
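Setting each gradient to zero gives the familiar univariate answers: the sample mean, and the (biased, 1/n) sample variance, computed separately per class and per feature. A minimal sketch:

```python
def gaussian_mle(xs):
    """MLE for a univariate Normal: sample mean and biased (1/n) variance.
    Applied per class and per feature dimension in Gaussian naive Bayes."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var

mu, var = gaussian_mle([1.0, 2.0, 3.0])
print(mu, var)  # 2.0 0.666...
```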
Naive Bayes Generative Model of Iris Data
Figure: Iris data (Sepal.Width vs. Sepal.Length) for the classes setosa, versicolor, and virginica. Class-conditional densities are shown as bivariate Normals with diagonal covariance matrices; means and variances estimated using MLE.
Classification
Having estimated the parameters, the posterior probability that tnew = c is just

p(tnew = c ∣ xnew, µ̂, σ̂²) = p(t = c) p(xnew ∣ t = c, µ̂, σ̂²) / ∑_{c′} p(t = c′) p(xnew ∣ t = c′, µ̂, σ̂²)

= p(t = c) ∏_{d=1}^{D} N(xnew,d ∣ µ̂cd, σ̂²cd) / ∑_{c′} p(t = c′) ∏_{d=1}^{D} N(xnew,d ∣ µ̂c′d, σ̂²c′d)
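The classification formula above can be sketched directly. The fitted parameters below are hypothetical one-dimensional values chosen for illustration, not MLEs from the actual iris data.

```python
import math

def normal_pdf(x, mu, var):
    # Univariate Normal density N(x | mu, var).
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior(x_new, priors, params):
    """p(t = c | x_new) under a fitted Gaussian naive Bayes model.
    params[c] is a list of (mu_cd, var_cd) pairs, one per feature dimension."""
    joint = {c: priors[c] * math.prod(normal_pdf(x, mu, var)
                                      for x, (mu, var) in zip(x_new, params[c]))
             for c in priors}
    z = sum(joint.values())          # normalizing constant (sum over classes)
    return {c: p / z for c, p in joint.items()}

# Toy 1-D example with hypothetical fitted means and variances.
priors = {"setosa": 0.5, "virginica": 0.5}
params = {"setosa": [(5.0, 0.1)], "virginica": [(6.5, 0.3)]}
post = posterior([5.1], priors, params)
print(max(post, key=post.get))  # setosa
```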