SLIDE 1

Text classification I (Naïve Bayes)

CE-324: Modern Information Retrieval

Sharif University of Technology

  • M. Soleymani

Spring 2020

Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

SLIDE 2

Outline

- Text classification
  - definition
  - relevance to information retrieval
- Naïve Bayes classifier

SLIDE 3

Formal definition of text classification

- Document space X
  - Docs are represented in this (typically high-dimensional) space
- Set of classes C = {c_1, c_2, …, c_J}
  - Example: C = {spam, non-spam}
- Training set: a set of labeled docs; each labeled doc ⟨d, c⟩ ∈ X × C
- Using a learning method, we find a classifier γ that maps docs to classes: γ: X → C

SLIDE 4

Examples of using classification in IR systems

- Language identification (classes: English vs. French, etc.)
- Automatic detection of spam pages (spam vs. non-spam)
- Automatic detection of secure pages for safe search
- Topic-specific or vertical search – restrict search to a "vertical" like "related to health" (relevant to vertical vs. not)
- Sentiment detection: is a movie or product review positive or negative (positive vs. negative)
- Exercise: Find examples of uses of text classification in IR

SLIDE 5

Standing queries (Ch. 13)

- The path from IR to text classification:
  - You have an information need to monitor, say:
    - Unrest in the Niger delta region
  - You want to rerun an appropriate query periodically to find new news items on this topic
  - You will be sent new documents that are found
  - I.e., it's not ranking but classification (relevant vs. not relevant)
- Such queries are called standing queries
  - Long used by "information professionals"
  - A modern mass instantiation is Google Alerts

SLIDE 6

[figure]

SLIDE 7

Spam filtering: another text classification task

From: "" <takworlld@hotmail.com>
Subject: real estate is the only way... gem oalvgkay

Anyone can buy real estate with no money down
Stop paying rent TODAY!
There is no need to spend hundreds or even thousands for similar courses
I am 22 years old and I have already purchased 6 properties using the
methods outlined in this truly INCREDIBLE ebook.
Change your life NOW!
=================================================
Click Below to order:
http://www.wholesaledaily.com/sales/nmd.htm
=================================================

SLIDE 8

Categorization/Classification

- Given:
  - A representation of a document d
    - Issue: how to represent text documents.
    - Usually some type of high-dimensional space – bag of words
  - A fixed set of classes: C = {c_1, c_2, …, c_J}
- Determine:
  - The category of d: γ(d) ∈ C
  - γ(d) is a classification function
  - We want to build classification functions ("classifiers").
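The bag-of-words representation mentioned above can be sketched in a couple of lines; this is an illustrative helper (the function name and whitespace tokenization are mine, not from the slides):

```python
from collections import Counter

def bag_of_words(text):
    """Represent a document as unordered term counts (bag of words)."""
    return Counter(text.lower().split())

print(bag_of_words("Chinese Beijing Chinese"))
# Counter({'chinese': 2, 'beijing': 1})
```

Word order is discarded; only per-term frequencies survive, which is exactly what the Naive Bayes model on the following slides consumes.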

SLIDE 9

Classification Methods (1)

- Manual classification
  - Used by the original Yahoo! Directory
  - Looksmart, about.com, ODP, PubMed
  - Accurate when the job is done by experts
  - Consistent when the problem size and team are small
  - Difficult and expensive to scale
    - Means we need automatic classification methods for big problems

SLIDE 10

Classification Methods (2)

- Hand-coded rule-based classifiers
  - One technique used by news agencies, intelligence agencies, etc.
  - Widely deployed in government and enterprise
  - Vendors provide "IDEs" for writing such rules
- Issues:
  - Commercial systems have complex query languages
  - Accuracy can be high if a rule has been carefully refined over time by a subject expert
  - Building and maintaining these rules is expensive

SLIDE 11

Classification Methods (3): Supervised learning

- Given:
  - A document d
  - A fixed set of classes: C = {c_1, c_2, …, c_J}
  - A training set D of documents, each with a label in C
- Determine:
  - A learning method or algorithm which will enable us to learn a classifier γ
  - For a test document d, we assign it the class γ(d) ∈ C

SLIDE 12

Bayes classifier

- A Bayesian classifier is a probabilistic classifier:

  ĉ = argmax_{c_j ∈ C} P(c_j | d) = argmax_{c_j ∈ C} P(d | c_j) P(c_j)

- d = ⟨t_1, …, t_{n_d}⟩
- There are too many parameters P(t_1, …, t_{n_d} | c_j):
  - One for each unique combination of a class and a sequence of words.
  - We would need a very, very large number of training examples to estimate that many parameters.

SLIDE 13

Naïve Bayes assumption

- Naïve Bayes assumption:

  P(d | c_j) = P(t_1, …, t_{n_d} | c_j) = ∏_{k=1}^{n_d} P(t_k | c_j)

- n_d: length of doc d (number of tokens)
- P(t_k | c_j): probability of term t_k occurring in a doc of class c_j
- P(c_j): prior probability of class c_j

SLIDE 14

Naïve Bayes assumption

- Naïve Bayes assumption:

  P(d | c_j) = P(t_1, …, t_{n_d} | c_j) = ∏_{k=1}^{n_d} P(t_k | c_j)

- n_d: length of doc d (number of tokens)
- P(t_k | c_j): probability of term t_k occurring in a doc of class c_j
- P(c_j): prior probability of class c_j
- Equivalent to (language model view):

  P(d | c_j) = ∏_{1 ≤ k ≤ n_d} P(t_k | c_j)

SLIDE 15

Naive Bayes classifier

- Since log is a monotonic function, the class with the highest score does not change:

  ĉ = argmax_{c_j ∈ C} P(d | c_j) P(c_j) = argmax_{c_j ∈ C} P(c_j) ∏_{k=1}^{n_d} P(t_k | c_j)

  ĉ = argmax_{c_j ∈ C} [ log P(c_j) + Σ_{k=1}^{n_d} log P(t_k | c_j) ]

- log(xy) = log(x) + log(y)
- log P(t_k | c_j): a weight that indicates how good an indicator t_k is for c_j
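Summing logs instead of multiplying probabilities also avoids floating-point underflow for long documents; a minimal per-class scoring sketch (all names and the toy parameters are illustrative, not from the slides):

```python
import math

def nb_score(tokens, log_prior, log_condprob):
    """log P(c) + sum over token positions of log P(t_k | c), for one class."""
    return log_prior + sum(log_condprob[t] for t in tokens)

# Toy parameters for a single class over a two-term vocabulary.
lp = math.log(0.5)
lcp = {"good": math.log(0.8), "bad": math.log(0.2)}
s = nb_score(["good", "good", "bad"], lp, lcp)
# s equals log(0.5) + 2*log(0.8) + log(0.2)
```

Computing this score for each class and taking the argmax gives ĉ.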

SLIDE 16

Estimating parameters

- Estimate P̂(c_j) and P̂(t_k | c_j) from training data
- N_j: number of docs in class c_j
- T_{k,j}: number of occurrences of t_k in training docs from class c_j (includes multiple occurrences)

  P̂(c_j) = N_j / N

  P̂(t_k | c_j) = T_{k,j} / Σ_{i=1}^{|V|} T_{i,j}

SLIDE 17

Problem with estimates: Zeros

d: Beijing and Taipei join WTO

P̂(WTO | China) = 0

SLIDE 18

Problem with estimates: Zeros

- For a doc d containing a term t that does not occur in any doc of a class c ⇒ P̂(c | d) = 0
- Thus d cannot be assigned to class c
- We use (add-one smoothing):

  P̂(t | c) = (T_{t,c} + 1) / (Σ_{t′∈V} T_{t′,c} + |V|)

- Instead of:

  P̂(t | c) = T_{t,c} / Σ_{t′∈V} T_{t′,c}
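A quick numeric check of the two estimators (the counts here are illustrative: a term with zero occurrences in a class whose docs contain eight tokens total, over a six-term vocabulary):

```python
def p_hat(t_count, class_total, vocab_size, smooth=True):
    """Estimate P(t|c) from counts; add-one smoothing avoids zero estimates."""
    if smooth:
        return (t_count + 1) / (class_total + vocab_size)
    return t_count / class_total

print(p_hat(0, 8, 6, smooth=False))  # 0.0 -- zeroes out the whole product
print(p_hat(0, 8, 6))                # 1/14, small but nonzero
```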
slide-19
SLIDE 19

Naïve Bayes: summary

- Estimate parameters from the training corpus using add-one smoothing
- For a new doc d = ⟨t_1, …, t_{n_d}⟩, compute for each class:

  log P̂(c_j) + Σ_{k=1}^{n_d} log P̂(t_k | c_j)

- Assign doc d to the class with the largest score

SLIDE 20

Naïve Bayes: example

- Training phase:
  - Estimate parameters of the Naive Bayes classifier
- Test phase:
  - Classifying the test doc

SLIDE 21

Naïve Bayes: example

- Estimating parameters:
  - P̂(c) = 3/4,  P̂(c̄) = 1/4
  - P̂(Chinese | c) = (5+1)/(8+6) = 6/14,  P̂(Chinese | c̄) = (1+1)/(3+6) = 2/9
  - P̂(Tokyo | c) = (0+1)/(8+6) = 1/14,  P̂(Tokyo | c̄) = (1+1)/(3+6) = 2/9
  - P̂(Japan | c) = (0+1)/(8+6) = 1/14,  P̂(Japan | c̄) = (1+1)/(3+6) = 2/9
- Classifying the test doc:
  - P̂(c | d) ∝ 3/4 × (6/14)³ × 1/14 × 1/14 ≈ 0.0003
  - P̂(c̄ | d) ∝ 1/4 × (2/9)³ × 2/9 × 2/9 ≈ 0.0001

c = China, so ĉ = c

SLIDE 22

Naïve Bayes: training
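The training pseudocode on this slide did not survive extraction; below is a sketch in the spirit of IIR's TrainMultinomialNB with add-one smoothing (function and variable names are mine, not the slide's):

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """docs: list of (token_list, class_label) pairs.
    Returns the vocabulary, log priors, and smoothed log conditionals."""
    n = len(docs)
    n_docs = Counter()                  # N_j: docs per class
    term_counts = defaultdict(Counter)  # T_{k,j}: term counts per class
    vocab = set()
    for tokens, c in docs:
        n_docs[c] += 1
        term_counts[c].update(tokens)
        vocab.update(tokens)
    log_prior, log_condprob = {}, {}
    for c in n_docs:
        log_prior[c] = math.log(n_docs[c] / n)
        total = sum(term_counts[c].values())
        log_condprob[c] = {
            t: math.log((term_counts[c][t] + 1) / (total + len(vocab)))
            for t in vocab
        }
    return vocab, log_prior, log_condprob
```

One pass over the training docs collects all counts, so training is linear in the total token count, as slide 24 notes.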

SLIDE 23

Naïve Bayes: test
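The matching test-time routine (ApplyMultinomialNB in IIR's naming) is just the argmax of the log scores; a self-contained sketch with made-up toy parameters:

```python
import math

def apply_multinomial_nb(tokens, log_prior, log_condprob):
    """Return argmax_c [log P(c) + sum_k log P(t_k|c)].
    Tokens outside the vocabulary are skipped."""
    best, best_score = None, float("-inf")
    for c in log_prior:
        s = log_prior[c] + sum(
            log_condprob[c][t] for t in tokens if t in log_condprob[c]
        )
        if s > best_score:
            best, best_score = c, s
    return best

# Toy parameters, not from the slides:
log_prior = {"spam": math.log(0.5), "ham": math.log(0.5)}
log_condprob = {
    "spam": {"free": math.log(0.7), "meeting": math.log(0.3)},
    "ham": {"free": math.log(0.2), "meeting": math.log(0.8)},
}
print(apply_multinomial_nb(["free", "free"], log_prior, log_condprob))  # spam
```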

SLIDE 24

Time complexity of Naive Bayes

- D: training set; V: vocabulary; C: set of classes
- L_ave: average length of a training doc
- L_a: length of the test doc
- M_a: number of distinct terms in the test doc
- Training time: Θ(|D| L_ave + |C||V|); test time: Θ(L_a + |C| M_a)
- Thus: Naive Bayes is linear in the size of the training set (training) and the test doc (testing).
- This is optimal time.

Generally: |C||V| < |D| L_ave

SLIDE 25

Why does Naive Bayes work?

- The independence assumptions do not really hold for docs written in natural language.
- Naive Bayes can work well even though these assumptions are badly violated.
- Classification is about predicting the correct class, not about accurately estimating probabilities.
  - Naive Bayes is terrible for correct estimation . . .
  - but it often performs well at choosing the correct class.

SLIDE 26

Naive Bayes is not so naive

- Naive Bayes has won some bakeoffs (e.g., KDD-CUP 97)
- A good, dependable baseline for text classification (but not the best)
- Optimal if the independence assumptions hold (never true for text, but true for some domains)
- More robust to non-relevant features than some more complex learning methods
- More robust to concept drift (the definition of a class changing over time) than some more complex learning methods
- Very fast
- Low storage requirements

SLIDE 27

Resources

- Chapter 13 of IIR