Natural Language Processing (CSE 490U): Text Classification
slide-1
SLIDE 1

Natural Language Processing (CSE 490U): Text Classification

Noah Smith

© 2017 University of Washington, nasmith@cs.washington.edu

January 20–23, 2017

1 / 65

slide-2
SLIDE 2

Text Classification

Input: a piece of text x ∈ V†, usually a document (r.v. X)
Output: a label from a finite set L (r.v. L)

Standard line of attack:

  1. Human experts label some data.
  2. Feed the data to a supervised machine learning algorithm that constructs an automatic classifier classify : V† → L.
  3. Apply classify to as much data as you want!

Note: we assume the texts are segmented already, even the new ones.

2 / 65

slide-3
SLIDE 3

Text Classification: Examples

◮ Library-like subjects (e.g., the Dewey decimal system)
◮ News stories: politics vs. sports vs. business vs. technology ...
◮ Reviews of films, restaurants, products: positive vs. negative
◮ Author attributes: identity, political stance, gender, age, ...
◮ Email, arXiv submissions, etc.: spam vs. not
◮ What is the reading level of a piece of text?
◮ How influential will a scientific paper be?
◮ Will a piece of proposed legislation pass?

Closely related: relevance to a query.

3 / 65

slide-4
SLIDE 4

Evaluation

Accuracy:

A(classify) = p(classify(X) = L)
            = Σ_{x ∈ V†, ℓ ∈ L} p(X = x, L = ℓ) · (1 if classify(x) = ℓ, 0 otherwise)
            = Σ_{x ∈ V†, ℓ ∈ L} p(X = x, L = ℓ) · 1{classify(x) = ℓ}

where p is the true distribution over data. Error is 1 − A.

This is estimated using a test dataset ⟨x̄1, ℓ̄1⟩, . . . , ⟨x̄m, ℓ̄m⟩:

Â(classify) = (1/m) Σ_{i=1}^{m} 1{classify(x̄i) = ℓ̄i}

4 / 65
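As a concrete illustration, here is a minimal sketch (not from the slides) of the test-set estimate Â; `classify` and `test_data` are hypothetical names.

```python
# Minimal sketch: estimate accuracy on a held-out test set.
# `classify` maps a text to a label; `test_data` is a list of (text, gold_label) pairs.
def estimate_accuracy(classify, test_data):
    correct = sum(1 for x, gold in test_data if classify(x) == gold)
    return correct / len(test_data)   # Â = (1/m) Σ 1{classify(x̄i) = ℓ̄i}
```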
slide-5
SLIDE 5

Issues with Test-Set Accuracy

5 / 65

slide-6
SLIDE 6

Issues with Test-Set Accuracy

◮ Class imbalance: if p(L = not spam) = 0.99, then you can get Â ≈ 0.99 by always guessing "not spam."

6 / 65

slide-7
SLIDE 7

Issues with Test-Set Accuracy

◮ Class imbalance: if p(L = not spam) = 0.99, then you can get Â ≈ 0.99 by always guessing "not spam."
◮ Relative importance of classes or cost of error types.

7 / 65

slide-8
SLIDE 8

Issues with Test-Set Accuracy

◮ Class imbalance: if p(L = not spam) = 0.99, then you can get Â ≈ 0.99 by always guessing "not spam."
◮ Relative importance of classes or cost of error types.
◮ Variance due to the test data.

8 / 65

slide-9
SLIDE 9

Evaluation in the Two-Class Case

Suppose we have two classes, and one of them, t ∈ L, is a "target."

◮ E.g., given a query, find relevant documents.

Precision and recall encode the goals of returning a “pure” set of targeted instances and capturing all of them.

A: actually in the target class (L = t)
B: believed to be in the target class (classify(x) = t)
C = A ∩ B: correctly labeled as t

P̂(classify) = |C| / |B| = |A ∩ B| / |B|
R̂(classify) = |C| / |A| = |A ∩ B| / |A|
F̂1(classify) = (2 · P̂ · R̂) / (P̂ + R̂)

9 / 65
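A minimal sketch (not from the slides) of these quantities for a target class t, assuming parallel lists of predicted and gold labels:

```python
# Minimal sketch: precision, recall, and F1 for a target class t.
def precision_recall_f1(preds, golds, t):
    tp = sum(1 for p, g in zip(preds, golds) if p == t and g == t)  # |A ∩ B|
    pred_t = sum(1 for p in preds if p == t)                        # |B|
    gold_t = sum(1 for g in golds if g == t)                        # |A|
    p = tp / pred_t if pred_t else 0.0
    r = tp / gold_t if gold_t else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```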

slide-10
SLIDE 10

Another View: Contingency Table

                 | L = t                    | L ≠ t                    |
classify(X) = t  | C (true positives)       | B \ C (false positives)  | B
classify(X) ≠ t  | A \ C (false negatives)  | (true negatives)         |
                 | A                        |                          |

10 / 65

slide-11
SLIDE 11

Evaluation with > 2 Classes

Macroaveraged precision and recall: let each class be the target and report the average P̂ and R̂ across all classes.

Microaveraged precision and recall: pool all one-vs.-rest decisions into a single contingency table, calculate P̂ and R̂ from that.

11 / 65
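A minimal sketch (not from the slides) contrasting the two averages for precision; `preds`, `golds`, and `labels` are hypothetical names:

```python
# Minimal sketch: macro- vs. micro-averaged precision over one-vs.-rest decisions.
from collections import Counter

def macro_micro_precision(preds, golds, labels):
    tp, fp = Counter(), Counter()
    for p, g in zip(preds, golds):
        if p == g:
            tp[p] += 1
        else:
            fp[p] += 1
    # macro: average of per-class P̂
    per_class = [tp[l] / (tp[l] + fp[l]) if tp[l] + fp[l] else 0.0 for l in labels]
    macro = sum(per_class) / len(labels)
    # micro: pool all decisions into one contingency table, then compute P̂
    micro = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))
    return macro, micro
```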

slide-12
SLIDE 12

Cross-Validation

Remember that Â, P̂, R̂, and F̂1 are all estimates of the classifier's quality under the true data distribution.

◮ Estimates are noisy!

K-fold cross-validation:

◮ Partition the training set into K non-overlapping "folds" x1, . . . , xK.
◮ For i ∈ {1, . . . , K}:
  ◮ Train on x1:n \ xi, using xi as development data.
  ◮ Estimate quality on the ith development set: Âi.
◮ Report the average:

  Â = (1/K) Σ_{i=1}^{K} Âi

and perhaps also the standard error.

12 / 65
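A minimal K-fold cross-validation sketch (not from the slides), assuming hypothetical `train_fn` and `eval_fn` callables:

```python
# Minimal sketch: K-fold cross-validation.
# `train_fn(train)` returns a fitted classifier; `eval_fn(clf, dev)` returns an accuracy estimate.
def cross_validate(data, train_fn, eval_fn, K=5):
    folds = [data[i::K] for i in range(K)]                 # K non-overlapping folds
    scores = []
    for i in range(K):
        dev = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        scores.append(eval_fn(train_fn(train), dev))        # Âi on fold i
    return sum(scores) / K                                  # average Â across folds
```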

slide-13
SLIDE 13

Statistical Significance

Suppose we have two classifiers, classify1 and classify2.

13 / 65

slide-14
SLIDE 14

Statistical Significance

Suppose we have two classifiers, classify1 and classify2. Is classify1 better? The "null hypothesis," denoted H0, is that it isn't. But if Â1 ≫ Â2, we are tempted to believe otherwise.

14 / 65

slide-15
SLIDE 15

Statistical Significance

Suppose we have two classifiers, classify1 and classify2. Is classify1 better? The "null hypothesis," denoted H0, is that it isn't. But if Â1 ≫ Â2, we are tempted to believe otherwise. How much larger must Â1 be than Â2 to reject H0?

15 / 65

slide-16
SLIDE 16

Statistical Significance

Suppose we have two classifiers, classify1 and classify2. Is classify1 better? The "null hypothesis," denoted H0, is that it isn't. But if Â1 ≫ Â2, we are tempted to believe otherwise. How much larger must Â1 be than Â2 to reject H0? Frequentist view: how (im)probable is the observed difference, given H0 = true?

16 / 65

slide-17
SLIDE 17

Statistical Significance

Suppose we have two classifiers, classify1 and classify2. Is classify1 better? The "null hypothesis," denoted H0, is that it isn't. But if Â1 ≫ Â2, we are tempted to believe otherwise. How much larger must Â1 be than Â2 to reject H0? Frequentist view: how (im)probable is the observed difference, given H0 = true? Caution: statistical significance is neither necessary nor sufficient for research significance!

17 / 65

slide-18
SLIDE 18

A Hypothesis Test for Text Classifiers

McNemar (1947)

  1. The null hypothesis: A1 = A2.
  2. Pick significance level α, an "acceptably" high probability of incorrectly rejecting H0.
  3. Calculate the test statistic, k (explained in the next slide).
  4. Calculate the probability of a more extreme value of k, assuming H0 is true; this is the p-value.
  5. Reject the null hypothesis if the p-value is less than α.

The p-value is p(this observation | H0 is true), not the other way around!

18 / 65

slide-19
SLIDE 19

McNemar’s Test: Details

Assumptions: independent (test) samples and binary measurements.

Count test set error patterns:

                        | classify1 is incorrect | classify1 is correct |
classify2 is incorrect  | c00                    | c10                  |
classify2 is correct    | c01                    | c11                  | m · Â2
                        |                        | m · Â1               |

If A1 = A2, then c01 and c10 are each distributed according to Binomial(c01 + c10, 1/2).

test statistic: k = min{c01, c10}

p-value = (1 / 2^{c01 + c10 − 1}) · Σ_{j=0}^{k} (c01 + c10 choose j)

19 / 65
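A minimal sketch (not from the slides) of this exact test, assuming the counts c01 and c10 have already been tallied:

```python
# Minimal sketch: exact McNemar test from the two disagreement counts.
# c01: items classify2 got right and classify1 got wrong; c10: the reverse.
from math import comb

def mcnemar_p_value(c01, c10):
    n = c01 + c10                                   # assumes n > 0
    k = min(c01, c10)                               # test statistic
    tail = sum(comb(n, j) for j in range(k + 1))    # Σ_{j=0}^{k} (n choose j)
    return tail / 2 ** (n - 1)
```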
slide-20
SLIDE 20

Other Tests

Different tests make different assumptions. Sometimes we calculate an interval that would be “unsurprising” under H0 and test whether a test statistic falls in that interval (e.g., t-test and Wald test). In many cases, there is no closed form for estimating p-values, so we use random approximations (e.g., permutation test and paired bootstrap test). If you do lots of tests, you need to correct for that! Read lots more in Smith (2011), appendix B.

20 / 65

slide-21
SLIDE 21

Features in Text Classification

A different representation of the text sequence r.v. X: feature r.v.s. For j ∈ {1, . . . , d}, let Fj be a discrete random variable taking a value in Fj.

◮ Often, these are term (word and perhaps n-gram) frequencies.
◮ Can also be word "presence" features.
◮ Transformations on word frequencies: logarithm, idf weighting
◮ Disjunctions of terms
◮ Clusters
◮ Task-specific lexicons

21 / 65

slide-22
SLIDE 22

Probabilistic Classification

Classification rule:

classify(f) = argmax_{ℓ ∈ L} p(ℓ | f)
            = argmax_{ℓ ∈ L} p(ℓ, f) / p(f)
            = argmax_{ℓ ∈ L} p(ℓ, f)

22 / 65

slide-23
SLIDE 23

Naïve Bayes Classifier

p(L = ℓ, F1 = f1, . . . , Fd = fd) = p(ℓ) ∏_{j=1}^{d} p(Fj = fj | ℓ) = πℓ ∏_{j=1}^{d} θ_{fj | j, ℓ}

Parameters:

◮ π is the "class prior" (it sums to one)
◮ For each feature function j and label ℓ, a distribution over values θ_{∗ | j, ℓ} (sums to one for every j, ℓ pair)

The "bag of words" version of naïve Bayes: Fj = Xj

p(ℓ, x) = πℓ ∏_{j=1}^{|x|} θ_{xj | ℓ}

23 / 65
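A minimal bag-of-words naïve Bayes sketch (not from the slides), with add-1 smoothing as one choice of smoothed relative frequency estimation; all names are hypothetical:

```python
# Minimal sketch: bag-of-words naive Bayes with add-1 smoothing, scored in log space.
# `docs` is a list of (list_of_tokens, label) pairs.
import math
from collections import Counter, defaultdict

def train_nb(docs, alpha=1.0):
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    log_prior = {l: math.log(c / len(docs)) for l, c in label_counts.items()}   # log πℓ
    log_theta = {}
    for l in label_counts:
        total = sum(word_counts[l].values()) + alpha * len(vocab)
        log_theta[l] = {w: math.log((word_counts[l][w] + alpha) / total) for w in vocab}
    return log_prior, log_theta

def classify_nb(tokens, log_prior, log_theta):
    # Words unseen in training are simply ignored here (a simplification).
    def score(l):
        return log_prior[l] + sum(log_theta[l].get(w, 0.0) for w in tokens)
    return max(log_prior, key=score)
```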

slide-24
SLIDE 24

Naïve Bayes: Remarks

◮ Estimation by (smoothed) relative frequency estimation: easy!

24 / 65

slide-25
SLIDE 25

Naïve Bayes: Remarks

◮ Estimation by (smoothed) relative frequency estimation: easy!
◮ For continuous or integer-valued features, use different distributions.

25 / 65

slide-26
SLIDE 26

Naïve Bayes: Remarks

◮ Estimation by (smoothed) relative frequency estimation: easy!
◮ For continuous or integer-valued features, use different distributions.
◮ The bag of words version equates to building a conditional language model for each label.

26 / 65

slide-27
SLIDE 27

Naïve Bayes: Remarks

◮ Estimation by (smoothed) relative frequency estimation: easy!
◮ For continuous or integer-valued features, use different distributions.
◮ The bag of words version equates to building a conditional language model for each label.
◮ The Collins reading assumes a binary version, with Fv indicating whether v ∈ V occurs in x.

27 / 65

slide-28
SLIDE 28

Generative vs. Discriminative Classification

Naïve Bayes is the prototypical generative classifier.

◮ It describes a probabilistic process—"generative story"—for X and L.
◮ But why model X? It's always observed.

Discriminative models instead:

◮ seek to optimize a performance measure, like accuracy, or a computationally convenient surrogate;
◮ do not worry about p(X);
◮ tend to perform better when you have reasonable amounts of data.

28 / 65

slide-29
SLIDE 29

Discriminative Text Classifiers

◮ Multinomial logistic regression (also known as "max ent" and "log-linear")
◮ Support vector machines
◮ Neural networks
◮ Decision trees

I’ll briefly touch on three ways to train a classifier with a linear decision rule.

29 / 65

slide-30
SLIDE 30

Linear Models for Classification

"Linear" decision rule:

ℓ̂ = argmax_{ℓ ∈ L} w · φ(x, ℓ)

where φ : V† × L → R^d. Parameters: w ∈ R^d.

What does this remind you of?

30 / 65

slide-31
SLIDE 31

Linear Models for Classification

"Linear" decision rule:

ℓ̂ = argmax_{ℓ ∈ L} w · φ(x, ℓ)

where φ : V† × L → R^d. Parameters: w ∈ R^d.

What does this remind you of?

Some notational variants define:

◮ wℓ for each ℓ ∈ L
◮ φ : V† → R^d (similar to what we had for naïve Bayes)

31 / 65
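A minimal sketch (not from the slides) of the decision rule, assuming a hypothetical feature function `phi` that returns a sparse dict for (x, ℓ):

```python
# Minimal sketch: the linear decision rule argmax_ℓ w · φ(x, ℓ).
# `w` is a dict of weights; `phi(x, label)` returns a dict of feature -> value.
def classify_linear(x, labels, w, phi):
    def score(label):
        return sum(w.get(f, 0.0) * v for f, v in phi(x, label).items())
    return max(labels, key=score)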

slide-32
SLIDE 32

The Geometric View of Linear Classifiers

Suppose we have instance x, L = {y1, y2, y3, y4}, and there are only two features, φ1 and φ2.

[Figure: the points φ(x, y1), φ(x, y2), φ(x, y3), φ(x, y4) plotted in the (φ1, φ2) plane.]

32 / 65

slide-33
SLIDE 33

The Geometric View of Linear Classifiers

Suppose we have instance x, L = {y1, y2, y3, y4}, and there are only two features, φ1 and φ2.

[Figure: the points φ(x, y1), φ(x, y2), φ(x, y3), φ(x, y4) plotted in the (φ1, φ2) plane, with the separating line below.]

w · φ = w1φ1 + w2φ2 = 0

33 / 65

slide-34
SLIDE 34

The Geometric View of Linear Classifiers

Suppose we have instance x, L = {y1, y2, y3, y4}, and there are only two features, φ1 and φ2.

[Figure: the points φ(x, y1), φ(x, y2), φ(x, y3), φ(x, y4) plotted in the (φ1, φ2) plane.]

score(y3) > score(y1) > score(y4) > score(y2)

34 / 65

slide-35
SLIDE 35

The Geometric View of Linear Classifiers

Suppose we have instance x, L = {y1, y2, y3, y4}, and there are only two features, φ1 and φ2.

[Figure: the points φ(x, y1), φ(x, y2), φ(x, y3), φ(x, y4) plotted in the (φ1, φ2) plane.]

35 / 65

slide-36
SLIDE 36

The Geometric View of Linear Classifiers

Suppose we have instance x, L = {y1, y2, y3, y4}, and there are only two features, φ1 and φ2.

[Figure: the points φ(x, y1), φ(x, y2), φ(x, y3), φ(x, y4) plotted in the (φ1, φ2) plane, with a different weight vector.]

score(y3) > score(y1) > score(y2) > score(y4)

36 / 65

slide-37
SLIDE 37

MLE for Multinomial Logistic Regression

When we discussed log-linear language models, we transformed the score into a probability distribution. Here, that would be:

p(L = ℓ | x) = exp(w · φ(x, ℓ)) / Σ_{ℓ′ ∈ L} exp(w · φ(x, ℓ′))

37 / 65

slide-38
SLIDE 38

MLE for Multinomial Logistic Regression

When we discussed log-linear language models, we transformed the score into a probability distribution. Here, that would be:

p(L = ℓ | x) = exp(w · φ(x, ℓ)) / Σ_{ℓ′ ∈ L} exp(w · φ(x, ℓ′))

MLE can be rewritten as a maximization problem:

ŵ = argmax_w Σ_{i=1}^{n} [ w · φ(xi, ℓi) − log Σ_{ℓ′ ∈ L} exp(w · φ(xi, ℓ′)) ]

where the first term is the "hope" (the score of the correct label) and the second is the "fear" (the log-sum-exp over all labels).

38 / 65

slide-39
SLIDE 39

MLE for Multinomial Logistic Regression

When we discussed log-linear language models, we transformed the score into a probability distribution. Here, that would be:

p(L = ℓ | x) = exp(w · φ(x, ℓ)) / Σ_{ℓ′ ∈ L} exp(w · φ(x, ℓ′))

MLE can be rewritten as a maximization problem:

ŵ = argmax_w Σ_{i=1}^{n} [ w · φ(xi, ℓi) − log Σ_{ℓ′ ∈ L} exp(w · φ(xi, ℓ′)) ]

where the first term is the "hope" (the score of the correct label) and the second is the "fear" (the log-sum-exp over all labels).

Recall from language models:

◮ Be wise and regularize!
◮ Solve with batch or stochastic gradient methods.
◮ wj has an interpretation.

39 / 65
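A minimal sketch (not from the slides) of the conditional probability and per-example log loss, assuming dense numpy feature vectors φ(x, ℓ) stacked in a matrix; all names here are hypothetical:

```python
# Minimal sketch: multinomial logistic regression probabilities and log loss.
# `feats` has shape (num_labels, d), row ℓ holding φ(x, ℓ); `gold` is the index of the correct label.
import numpy as np

def label_log_probs(w, feats):
    scores = feats @ w                       # w · φ(x, ℓ) for every ℓ
    scores = scores - scores.max()           # subtract max for numerical stability
    log_z = np.log(np.exp(scores).sum())     # log Σ_ℓ' exp(w · φ(x, ℓ'))
    return scores - log_z

def log_loss(w, feats, gold):
    return -label_log_probs(w, feats)[gold]  # "fear" minus "hope"
```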

slide-40
SLIDE 40

Log Loss for (x, ℓ)

Another view is to minimize the negated log-likelihood, which is known as "log loss":

log Σ_{ℓ′ ∈ L} exp(w · φ(x, ℓ′)) − w · φ(x, ℓ)

In the binary case, where "score" is the score of the correct label:

[Plot of loss vs. score: in blue is the log loss; in red is the "zero-one" loss (error).]

40 / 65

slide-41
SLIDE 41

“Log Sum Exp”

Consider the "log Σ exp" part of the objective function, with two labels, one whose score is fixed.

[Plot: log(e^x + e^8), log(e^x + e^0), and log(e^x + e^−8) as functions of x.]

41 / 65

slide-42
SLIDE 42

Hard Maximum

Why not use a hard max instead?

[Plot: max(x, 8), max(x, 0), and max(x, −8) as functions of x.]

42 / 65

slide-43
SLIDE 43

Hinge Loss for (x, ℓ)

max_{ℓ′ ∈ L} w · φ(x, ℓ′) − w · φ(x, ℓ)

In the binary case:

[Plot of loss vs. score: in purple is the hinge loss, in blue is the log loss; in red is the "zero-one" loss (error).]

43 / 65

slide-44
SLIDE 44

Minimizing Hinge Loss: Perceptron

max_{ℓ′ ∈ L} w · φ(x, ℓ′) − w · φ(x, ℓ)

44 / 65

slide-45
SLIDE 45

Minimizing Hinge Loss: Perceptron

max_{ℓ′ ∈ L} w · φ(x, ℓ′) − w · φ(x, ℓ)

When two labels are tied, the function is not differentiable.

45 / 65

slide-46
SLIDE 46

Minimizing Hinge Loss: Perceptron

max_{ℓ′ ∈ L} w · φ(x, ℓ′) − w · φ(x, ℓ)

When two labels are tied, the function is not differentiable. But it's still sub-differentiable. Solution: (stochastic) subgradient descent!

46 / 65

slide-47
SLIDE 47

Minimizing Hinge Loss: Perceptron

max_{ℓ′ ∈ L} w · φ(x, ℓ′) − w · φ(x, ℓ)

When two labels are tied, the function is not differentiable. But it's still sub-differentiable. Solution: (stochastic) subgradient descent!

Perceptron algorithm:

◮ For t ∈ {1, . . . , T}:
  ◮ Pick it uniformly at random from {1, . . . , n}.
  ◮ ℓ̂it ← argmax_{ℓ ∈ L} w · φ(xit, ℓ)
  ◮ w ← w − α (φ(xit, ℓ̂it) − φ(xit, ℓit))

47 / 65
slide-48
SLIDE 48

Log Loss and Hinge Loss for (x, ℓ)

log loss:

log Σ_{ℓ′ ∈ L} exp(w · φ(x, ℓ′)) − w · φ(x, ℓ)

hinge loss:

max_{ℓ′ ∈ L} w · φ(x, ℓ′) − w · φ(x, ℓ)

In the binary case, where "score" is the linear score of the correct label:

[Plot of loss vs. score: in purple is the hinge loss, in blue is the log loss; in red is the "zero-one" loss (error).]

48 / 65

slide-49
SLIDE 49

Minimizing Hinge Loss: Perceptron

min_w Σ_{i=1}^{n} [ max_{ℓ′ ∈ L} w · φ(xi, ℓ′) − w · φ(xi, ℓi) ]

Stochastic subgradient descent on the above is called the perceptron algorithm.

◮ For t ∈ {1, . . . , T}:
  ◮ Pick it uniformly at random from {1, . . . , n}.
  ◮ ℓ̂it ← argmax_{ℓ ∈ L} w · φ(xit, ℓ)
  ◮ w ← w − α (φ(xit, ℓ̂it) − φ(xit, ℓit))

49 / 65
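A minimal perceptron training sketch (not from the slides) following the update above; `phi`, `data`, and `labels` are hypothetical names:

```python
# Minimal sketch: perceptron as stochastic subgradient descent on the hinge loss.
# `data` is a list of (x, gold_label) pairs; `phi(x, label)` returns a dict of feature -> value.
import random
from collections import defaultdict

def train_perceptron(data, labels, phi, T=1000, alpha=1.0):
    w = defaultdict(float)
    score = lambda x, l: sum(w[f] * v for f, v in phi(x, l).items())
    for _ in range(T):
        x, gold = random.choice(data)                   # pick i_t uniformly at random
        pred = max(labels, key=lambda l: score(x, l))   # argmax_ℓ w · φ(x, ℓ)
        if pred != gold:                                # update is zero when pred == gold
            for f, v in phi(x, pred).items():           # w ← w − α (φ(x, pred) − φ(x, gold))
                w[f] -= alpha * v
            for f, v in phi(x, gold).items():
                w[f] += alpha * v
    return w
```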
slide-50
SLIDE 50

Error Costs

Suppose that not all mistakes are equally bad. E.g., false positives vs. false negatives in spam detection.

50 / 65

slide-51
SLIDE 51

Error Costs

Suppose that not all mistakes are equally bad. E.g., false positives vs. false negatives in spam detection. Let cost(ℓ, ℓ′) quantify the “badness” of substituting ℓ′ for correct label ℓ.

51 / 65

slide-52
SLIDE 52

Error Costs

Suppose that not all mistakes are equally bad. E.g., false positives vs. false negatives in spam detection.

Let cost(ℓ, ℓ′) quantify the "badness" of substituting ℓ′ for correct label ℓ.

Intuition: estimate the scoring function so that score(ℓi) − score(ℓ̂) ∝ cost(ℓi, ℓ̂).

52 / 65

slide-53
SLIDE 53

General Hinge Loss for (x, ℓ)

max_{ℓ′ ∈ L} [ w · φ(x, ℓ′) + cost(ℓ, ℓ′) ] − w · φ(x, ℓ)

In the binary case, with cost(−1, 1) = 1:

[Plot of loss vs. score: in blue is the general hinge loss; in red is the "zero-one" loss (error).]

53 / 65
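A minimal sketch (not from the slides) of this cost-augmented hinge loss, assuming `scores` maps each label to w · φ(x, ℓ) and `cost` is a hypothetical cost function with cost(ℓ, ℓ) = 0:

```python
# Minimal sketch: the general (cost-augmented) hinge loss for one example.
def general_hinge_loss(scores, gold, cost):
    worst = max(scores[l] + cost(gold, l) for l in scores)  # cost-augmented max over labels
    return worst - scores[gold]                             # nonnegative since cost(gold, gold) = 0
```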

slide-54
SLIDE 54

General Remarks

◮ Text classification: many problems, all solved with supervised learners.
◮ Lexicon features can provide problem-specific guidance.

54 / 65

slide-55
SLIDE 55

General Remarks

◮ Text classification: many problems, all solved with supervised learners.
◮ Lexicon features can provide problem-specific guidance.
◮ Naïve Bayes, log-linear, and linear SVM are all linear methods that tend to work reasonably well, with good features and smoothing/regularization.
◮ You should have a basic understanding of the tradeoffs in choosing among them.

55 / 65

slide-56
SLIDE 56

General Remarks

◮ Text classification: many problems, all solved with supervised learners.
◮ Lexicon features can provide problem-specific guidance.
◮ Naïve Bayes, log-linear, and linear SVM are all linear methods that tend to work reasonably well, with good features and smoothing/regularization.
◮ You should have a basic understanding of the tradeoffs in choosing among them.
◮ Rumor: random forests are widely used in industry when performance matters more than interpretability.

56 / 65

slide-57
SLIDE 57

General Remarks

◮ Text classification: many problems, all solved with supervised learners.
◮ Lexicon features can provide problem-specific guidance.
◮ Naïve Bayes, log-linear, and linear SVM are all linear methods that tend to work reasonably well, with good features and smoothing/regularization.
◮ You should have a basic understanding of the tradeoffs in choosing among them.
◮ Rumor: random forests are widely used in industry when performance matters more than interpretability.
◮ Lots of papers about neural networks, but with hyperparameter tuning applied fairly to linear models, the advantage is not clear (Yogatama et al., 2015).

57 / 65

slide-58
SLIDE 58

Readings and Reminders

◮ Jurafsky and Martin (2016b); Collins (2011); Jurafsky and Martin (2016a)

58 / 65

slide-59
SLIDE 59

References I

Michael Collins. The naive Bayes model, maximum-likelihood estimation, and the EM algorithm, 2011. URL http://www.cs.columbia.edu/~mcollins/em.pdf.

Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(5):265–292, 2001.

Daniel Jurafsky and James H. Martin. Logistic regression (draft chapter), 2016a. URL https://web.stanford.edu/~jurafsky/slp3/7.pdf.

Daniel Jurafsky and James H. Martin. Naive Bayes and sentiment classification (draft chapter), 2016b. URL https://web.stanford.edu/~jurafsky/slp3/6.pdf.

Quinn McNemar. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2):153–157, 1947.

Noah A. Smith. Linguistic Structure Prediction. Synthesis Lectures on Human Language Technologies. Morgan and Claypool, 2011. URL http://www.morganclaypool.com/doi/pdf/10.2200/S00361ED1V01Y201105HLT013.pdf.

Dani Yogatama, Lingpeng Kong, and Noah A. Smith. Bayesian optimization of text representations. In Proc. of EMNLP, 2015. URL http://www.aclweb.org/anthology/D/D15/D15-1251.pdf.

59 / 65

slide-60
SLIDE 60

Extras

60 / 65

slide-61
SLIDE 61

(Linear) Support Vector Machines

A different motivation for the generalized hinge:

ŵ = Σ_{i=1}^{n} Σ_{ℓ ∈ L} αi,ℓ · φ(xi, ℓ)

where only a small number of the αi,ℓ are nonzero.

61 / 65

slide-62
SLIDE 62

(Linear) Support Vector Machines

A different motivation for the generalized hinge:

ŵ = Σ_{i=1}^{n} Σ_{ℓ ∈ L} αi,ℓ · φ(xi, ℓ)

where only a small number of the αi,ℓ are nonzero. Those φ(xi, ℓ) are called "support vectors" because they "support" the decision boundary.

ŵ · φ(x, ℓ′) = Σ_{(i,ℓ) ∈ S} αi,ℓ · φ(xi, ℓ) · φ(x, ℓ′)

See Crammer and Singer (2001) for the multiclass version.

62 / 65

slide-63
SLIDE 63

(Linear) Support Vector Machines

A different motivation for the generalized hinge:

ŵ = Σ_{i=1}^{n} Σ_{ℓ ∈ L} αi,ℓ · φ(xi, ℓ)

where only a small number of the αi,ℓ are nonzero. Those φ(xi, ℓ) are called "support vectors" because they "support" the decision boundary.

ŵ · φ(x, ℓ′) = Σ_{(i,ℓ) ∈ S} αi,ℓ · φ(xi, ℓ) · φ(x, ℓ′)

See Crammer and Singer (2001) for the multiclass version.

Really good tool: SVMlight, http://svmlight.joachims.org

63 / 65

slide-64
SLIDE 64

Support Vector Machines: Remarks

◮ Regularization is critical; squared ℓ2 is most common, and often used in (yet another) motivation around the idea of "maximizing margin" around the hyperplane separator.

64 / 65

slide-65
SLIDE 65

Support Vector Machines: Remarks

◮ Regularization is critical; squared ℓ2 is most common, and often used in (yet another) motivation around the idea of "maximizing margin" around the hyperplane separator.
◮ Often, instead of linear models that explicitly calculate w · φ, these methods are "kernelized" and rearrange all calculations to involve inner-products between φ vectors.
◮ Example:

  K_linear(v, w) = v · w
  K_polynomial(v, w) = (v · w + 1)^p
  K_Gaussian(v, w) = exp(−‖v − w‖² / (2σ²))

◮ Linear kernels are most common in NLP.

65 / 65
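A minimal sketch (not from the slides) of the three kernels listed above, for plain lists of floats; p and sigma are hyperparameters:

```python
# Minimal sketch: the linear, polynomial, and Gaussian (RBF) kernels.
import math

def k_linear(v, w):
    return sum(a * b for a, b in zip(v, w))

def k_polynomial(v, w, p=2):
    return (k_linear(v, w) + 1) ** p

def k_gaussian(v, w, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(v, w))   # ‖v − w‖²
    return math.exp(-sq_dist / (2 * sigma ** 2))
```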