SLIDE 1

Text classification I (Naïve Bayes)

CE-324: Modern Information Retrieval

Sharif University of Technology

  • M. Soleymani

Spring 2020

Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

SLIDE 2

Outline

- Text classification
  - definition
  - relevance to information retrieval
- Naïve Bayes classifier

SLIDE 3

Formal definition of text classification

- Document space X
  - Docs are represented in this (typically high-dimensional) space
- Set of classes C = {c_1, c_2, …, c_J}
  - Example: C = {spam, non-spam}
- Training set: a set of labeled docs; each labeled doc ⟨d, c⟩ ∈ X × C
- Using a learning method, we find a classifier γ that maps docs to classes: γ: X → C

SLIDE 4

Examples of using classification in IR systems

- Language identification (classes: English vs. French, etc.)
- Automatic detection of spam pages (spam vs. non-spam)
- Automatic detection of secure pages for safe search
- Topic-specific or vertical search – restrict search to a "vertical" like "related to health" (relevant to vertical vs. not)
- Sentiment detection: is a movie or product review positive or negative (positive vs. negative)
- Exercise: Find examples of uses of text classification in IR

SLIDE 5

Standing queries (Ch. 13)

- The path from IR to text classification:
  - You have an information need to monitor, say:
    - Unrest in the Niger delta region
  - You want to rerun an appropriate query periodically to find new news items on this topic
  - You will be sent new documents that are found
  - I.e., it's not ranking but classification (relevant vs. not relevant)
- Such queries are called standing queries
  - Long used by "information professionals"
  - A modern mass instantiation is Google Alerts

SLIDE 6

[figure]

SLIDE 7

Spam filtering: another text classification task

From: "" <takworlld@hotmail.com>
Subject: real estate is the only way... gem oalvgkay

Anyone can buy real estate with no money down
Stop paying rent TODAY!
There is no need to spend hundreds or even thousands for similar courses
I am 22 years old and I have already purchased 6 properties using the
methods outlined in this truly INCREDIBLE ebook.
Change your life NOW!
=================================================
Click Below to order:
http://www.wholesaledaily.com/sales/nmd.htm
=================================================

SLIDE 8

Categorization/Classification

- Given:
  - A representation of a document d
    - Issue: how to represent text documents.
    - Usually some type of high-dimensional space – bag of words
  - A fixed set of classes: C = {c_1, c_2, …, c_J}
- Determine:
  - The category of d: γ(d) ∈ C
  - γ(d) is a classification function
  - We want to build classification functions ("classifiers").
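The bag-of-words representation mentioned above can be sketched in a couple of lines; this is an illustrative helper (the function name and whitespace tokenization are mine, not from the slides):

```python
from collections import Counter

def bag_of_words(text):
    """Represent a document as unordered term counts (bag of words)."""
    return Counter(text.lower().split())

print(bag_of_words("Chinese Beijing Chinese"))
# Counter({'chinese': 2, 'beijing': 1})
```

Word order is discarded; only per-term frequencies survive, which is exactly what the Naive Bayes model on the following slides consumes.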

SLIDE 9

Classification Methods (1)

- Manual classification
  - Used by the original Yahoo! Directory
  - Looksmart, about.com, ODP, PubMed
  - Accurate when the job is done by experts
  - Consistent when the problem size and team are small
  - Difficult and expensive to scale
    - Means we need automatic classification methods for big problems

SLIDE 10

Classification Methods (2)

- Hand-coded rule-based classifiers
  - One technique used by news agencies, intelligence agencies, etc.
  - Widely deployed in government and enterprise
  - Vendors provide "IDEs" for writing such rules
- Issues:
  - Commercial systems have complex query languages
  - Accuracy can be high if a rule has been carefully refined over time by a subject expert
  - Building and maintaining these rules is expensive

SLIDE 11

Classification Methods (3): Supervised learning

- Given:
  - A document d
  - A fixed set of classes: C = {c_1, c_2, …, c_J}
  - A training set D of documents, each with a label in C
- Determine:
  - A learning method or algorithm which will enable us to learn a classifier γ
  - For a test document d, we assign it the class γ(d) ∈ C

SLIDE 12

Bayes classifier

- A Bayesian classifier is a probabilistic classifier:

  ĉ = argmax_{c_j ∈ C} P(c_j | d) = argmax_{c_j ∈ C} P(d | c_j) P(c_j)

- d = ⟨t_1, …, t_{n_d}⟩
- There are too many parameters P(t_1, …, t_{n_d} | c_j):
  - One for each unique combination of a class and a sequence of words.
  - We would need a very, very large number of training examples to estimate that many parameters.

SLIDE 13

Naïve Bayes assumption

- Naïve Bayes assumption:

  P(d | c_j) = P(t_1, …, t_{n_d} | c_j) = ∏_{k=1}^{n_d} P(t_k | c_j)

- n_d: length of doc d (number of tokens)
- P(t_k | c_j): probability of term t_k occurring in a doc of class c_j
- P(c_j): prior probability of class c_j

SLIDE 14

Naïve Bayes assumption

- Naïve Bayes assumption:

  P(d | c_j) = P(t_1, …, t_{n_d} | c_j) = ∏_{k=1}^{n_d} P(t_k | c_j)

- n_d: length of doc d (number of tokens)
- P(t_k | c_j): probability of term t_k occurring in a doc of class c_j
- P(c_j): prior probability of class c_j
- Equivalent to (language model view):

  P(d | c_j) = ∏_{1 ≤ k ≤ n_d} P(t_k | c_j)

SLIDE 15

Naive Bayes classifier

- Since log is a monotonic function, the class with the highest score does not change:

  ĉ = argmax_{c_j ∈ C} P(d | c_j) P(c_j) = argmax_{c_j ∈ C} P(c_j) ∏_{k=1}^{n_d} P(t_k | c_j)

  ĉ = argmax_{c_j ∈ C} [ log P(c_j) + Σ_{k=1}^{n_d} log P(t_k | c_j) ]

- log(xy) = log(x) + log(y)
- log P(t_k | c_j): a weight that indicates how good an indicator t_k is for c_j
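Summing logs instead of multiplying probabilities also avoids floating-point underflow for long documents; a minimal per-class scoring sketch (all names and the toy parameters are illustrative, not from the slides):

```python
import math

def nb_score(tokens, log_prior, log_condprob):
    """log P(c) + sum over token positions of log P(t_k | c), for one class."""
    return log_prior + sum(log_condprob[t] for t in tokens)

# Toy parameters for a single class over a two-term vocabulary.
lp = math.log(0.5)
lcp = {"good": math.log(0.8), "bad": math.log(0.2)}
s = nb_score(["good", "good", "bad"], lp, lcp)
# s equals log(0.5) + 2*log(0.8) + log(0.2)
```

Computing this score for each class and taking the argmax gives ĉ.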

SLIDE 16

Estimating parameters

- Estimate P̂(c_j) and P̂(t_k | c_j) from training data
- N_j: number of docs in class c_j
- T_{k,j}: number of occurrences of t_k in training docs from class c_j (includes multiple occurrences)

  P̂(c_j) = N_j / N

  P̂(t_k | c_j) = T_{k,j} / Σ_{i=1}^{|V|} T_{i,j}

SLIDE 17

Problem with estimates: Zeros

d: Beijing and Taipei join WTO

P̂(WTO | China) = 0

SLIDE 18

Problem with estimates: Zeros

- For a doc d containing a term t that does not occur in any doc of a class c ⇒ P̂(c | d) = 0
- Thus d cannot be assigned to class c
- We use (add-one smoothing):

  P̂(t | c) = (T_{t,c} + 1) / (Σ_{t′∈V} T_{t′,c} + |V|)

- Instead of:

  P̂(t | c) = T_{t,c} / Σ_{t′∈V} T_{t′,c}
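A quick numeric check of the two estimators (the counts here are illustrative: a term with zero occurrences in a class whose docs contain eight tokens total, over a six-term vocabulary):

```python
def p_hat(t_count, class_total, vocab_size, smooth=True):
    """Estimate P(t|c) from counts; add-one smoothing avoids zero estimates."""
    if smooth:
        return (t_count + 1) / (class_total + vocab_size)
    return t_count / class_total

print(p_hat(0, 8, 6, smooth=False))  # 0.0 -- zeroes out the whole product
print(p_hat(0, 8, 6))                # 1/14, small but nonzero
```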
slide-19
SLIDE 19

Naïve Bayes: summary

- Estimate parameters from the training corpus using add-one smoothing
- For a new doc d = ⟨t_1, …, t_{n_d}⟩, compute for each class:

  log P̂(c_j) + Σ_{k=1}^{n_d} log P̂(t_k | c_j)

- Assign doc d to the class with the largest score

SLIDE 20

Naïve Bayes: example

- Training phase:
  - Estimate parameters of the Naive Bayes classifier
- Test phase:
  - Classifying the test doc

SLIDE 21

Naïve Bayes: example

- Estimating parameters:
  - P̂(c) = 3/4,  P̂(c̄) = 1/4
  - P̂(Chinese | c) = (5+1)/(8+6) = 6/14,  P̂(Chinese | c̄) = (1+1)/(3+6) = 2/9
  - P̂(Tokyo | c) = (0+1)/(8+6) = 1/14,  P̂(Tokyo | c̄) = (1+1)/(3+6) = 2/9
  - P̂(Japan | c) = (0+1)/(8+6) = 1/14,  P̂(Japan | c̄) = (1+1)/(3+6) = 2/9
- Classifying the test doc:
  - P̂(c | d) ∝ 3/4 × (6/14)³ × 1/14 × 1/14 ≈ 0.0003
  - P̂(c̄ | d) ∝ 1/4 × (2/9)³ × 2/9 × 2/9 ≈ 0.0001

c = China, so ĉ = c

SLIDE 22

Naïve Bayes: training
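The training pseudocode on this slide did not survive extraction; below is a sketch in the spirit of IIR's TrainMultinomialNB with add-one smoothing (function and variable names are mine, not the slide's):

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """docs: list of (token_list, class_label) pairs.
    Returns the vocabulary, log priors, and smoothed log conditionals."""
    n = len(docs)
    n_docs = Counter()                  # N_j: docs per class
    term_counts = defaultdict(Counter)  # T_{k,j}: term counts per class
    vocab = set()
    for tokens, c in docs:
        n_docs[c] += 1
        term_counts[c].update(tokens)
        vocab.update(tokens)
    log_prior, log_condprob = {}, {}
    for c in n_docs:
        log_prior[c] = math.log(n_docs[c] / n)
        total = sum(term_counts[c].values())
        log_condprob[c] = {
            t: math.log((term_counts[c][t] + 1) / (total + len(vocab)))
            for t in vocab
        }
    return vocab, log_prior, log_condprob
```

One pass over the training docs collects all counts, so training is linear in the total token count, as slide 24 notes.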

SLIDE 23

Naïve Bayes: test
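The matching test-time routine (ApplyMultinomialNB in IIR's naming) is just the argmax of the log scores; a self-contained sketch with made-up toy parameters:

```python
import math

def apply_multinomial_nb(tokens, log_prior, log_condprob):
    """Return argmax_c [log P(c) + sum_k log P(t_k|c)].
    Tokens outside the vocabulary are skipped."""
    best, best_score = None, float("-inf")
    for c in log_prior:
        s = log_prior[c] + sum(
            log_condprob[c][t] for t in tokens if t in log_condprob[c]
        )
        if s > best_score:
            best, best_score = c, s
    return best

# Toy parameters, not from the slides:
log_prior = {"spam": math.log(0.5), "ham": math.log(0.5)}
log_condprob = {
    "spam": {"free": math.log(0.7), "meeting": math.log(0.3)},
    "ham": {"free": math.log(0.2), "meeting": math.log(0.8)},
}
print(apply_multinomial_nb(["free", "free"], log_prior, log_condprob))  # spam
```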

SLIDE 24

Time complexity of Naive Bayes

- D: training set; V: vocabulary; C: set of classes
- L_ave: average length of a training doc
- L_a: length of the test doc
- M_a: number of distinct terms in the test doc
- Training time: Θ(|D| L_ave + |C||V|); test time: Θ(L_a + |C| M_a)
- Thus: Naive Bayes is linear in the size of the training set (training) and the test doc (testing).
- This is optimal time.

Generally: |C||V| < |D| L_ave

SLIDE 25

Why does Naive Bayes work?

- The independence assumptions do not really hold for docs written in natural language.
- Naive Bayes can work well even though these assumptions are badly violated.
- Classification is about predicting the correct class, not about accurately estimating probabilities.
  - Naive Bayes is terrible for correct estimation . . .
  - but it often performs well at choosing the correct class.

SLIDE 26

Naive Bayes is not so naive

- Naive Bayes has won some bakeoffs (e.g., KDD-CUP 97)
- A good, dependable baseline for text classification (but not the best)
- Optimal if the independence assumptions hold (never true for text, but true for some domains)
- More robust to non-relevant features than some more complex learning methods
- More robust to concept drift (the definition of a class changing over time) than some more complex learning methods
- Very fast
- Low storage requirements

SLIDE 27

Resources

- Chapter 13 of IIR