

SLIDE 1

CS 4501 Machine Learning for NLP

Text Classification (I): Logistic Regression

Yangfeng Ji

Department of Computer Science, University of Virginia

SLIDE 2

Overview

1. Problem Definition
2. Bag-of-Words Representation
3. Case Study: Sentiment Analysis
4. Logistic Regression
5. L2 Regularization
6. Demo Code

SLIDE 3

Problem Definition

SLIDE 4

Case I: Sentiment Analysis

[Pang et al., 2002]


SLIDE 5

Case II: Topic Classification

Example topics

• Business
• Arts
• Technology
• Sports
• · · ·

SLIDE 6

Classification

• Input: a text $\boldsymbol{x}$
  • Example: a product review on Amazon
• Output: $y \in \mathcal{Y}$, where $\mathcal{Y}$ is the predefined category set (sample space)
  • Example: $\mathcal{Y} = \{\text{Positive}, \text{Negative}\}$

The pipeline of text classification:¹

  Text → Numeric Vector $\boldsymbol{x}$ → Classifier → Category $y$

¹ In this course, we use $\boldsymbol{x}$ for both a text and its representation, with no distinction.

SLIDE 8

Probabilistic Formulation

With the conditional probability $P(Y \mid \boldsymbol{X})$, the prediction on $Y$ for a given text $\boldsymbol{X} = \boldsymbol{x}$ is

  $\hat{y} = \operatorname{argmax}_{y \in \mathcal{Y}} P(Y = y \mid \boldsymbol{X} = \boldsymbol{x})$   (1)

Or, for simplicity,

  $\hat{y} = \operatorname{argmax}_{y \in \mathcal{Y}} P(y \mid \boldsymbol{x})$   (2)
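A minimal sketch of Eq. (2) in Python, assuming the conditional probabilities have already been computed (the numbers below are made up purely to illustrate the argmax):

```python
# Pick the label with the highest conditional probability, as in Eq. (2).
probs = {"Positive": 0.73, "Negative": 0.27}   # made-up P(y | x) for each y in Y
y_hat = max(probs, key=probs.get)              # argmax over the label set
print(y_hat)                                   # -> "Positive"
```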

SLIDE 10

Key Questions

Recall

• The formulation defined in the previous slide

  $\hat{y} = \operatorname{argmax}_{y \in \mathcal{Y}} P(Y = y \mid \boldsymbol{X} = \boldsymbol{x})$   (3)

• The pipeline of text classification

  Text → Numeric Vector $\boldsymbol{x}$ → Classifier → Category $y$

Building a text classifier is about answering the following two questions:

1. How to represent a text as $\boldsymbol{x}$?
  • Bag-of-words representation
2. How to estimate $P(y \mid \boldsymbol{x})$?
  • Logistic regression models
  • Neural network classifiers

SLIDE 15

Bag-of-Words Representation

SLIDE 16

Bag-of-Words Representation

Example Texts

  Text 1: I love coffee.
  Text 2: I don’t like tea.

Step I: convert a text into a collection of tokens

Tokenized Texts

  Tokenized text 1: I love coffee
  Tokenized text 2: I don t like tea

Step II: build a dictionary/vocabulary

Vocabulary

  {I love coffee don t like tea}

SLIDE 19

Bag-of-Words Representations

Step III: based on the vocab, convert each text into a numeric representation as

Bag-of-Words Representations

              I  love  coffee  don  t  like  tea
  x^(1) =  [  1    1      1     0   0    0    0  ]^T
  x^(2) =  [  1    0      0     1   1    1    1  ]^T

The pipeline of text classification:

  Text → Numeric Vector $\boldsymbol{x}$ (via the bag-of-words representation) → Classifier → Category $y$
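A small hand-rolled sketch of Steps I-III for the two example texts (no particular NLP library is assumed; the tokenizer is a naive letters-only split, which is why don't becomes don t as on the slide):

```python
import re

texts = ["I love coffee.", "I don't like tea."]

# Step I: tokenize (naive split on letter sequences)
tokenized = [re.findall(r"[A-Za-z]+", t) for t in texts]
# [['I', 'love', 'coffee'], ['I', 'don', 't', 'like', 'tea']]

# Step II: build the vocabulary (first-occurrence order)
vocab = []
for tokens in tokenized:
    for tok in tokens:
        if tok not in vocab:
            vocab.append(tok)
# ['I', 'love', 'coffee', 'don', 't', 'like', 'tea']

# Step III: map each text to a binary bag-of-words vector over the vocabulary
def bow_vector(tokens, vocab):
    return [1 if word in tokens else 0 for word in vocab]

x1 = bow_vector(tokenized[0], vocab)   # [1, 1, 1, 0, 0, 0, 0]
x2 = bow_vector(tokenized[1], vocab)   # [1, 0, 0, 1, 1, 1, 1]
```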

SLIDE 21

Preprocessing for Building Vocab

1. Convert all characters to lowercase

  UVa, UVA → uva

2. Map low-frequency words to a special token unk

  Zipf's law: $f(w_t) \propto 1/r_t$, where $r_t$ is the frequency rank of word $w_t$
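A rough sketch of these two preprocessing steps, assuming a simple frequency-count cutoff for deciding which words map to unk (the corpus, the cutoff of 2, and the <unk> spelling are all illustrative choices, not taken from the course demo):

```python
from collections import Counter

corpus = [["UVa", "is", "in", "Charlottesville"],
          ["UVA", "has", "a", "CS", "department"]]

# 1. lowercase every token
corpus = [[tok.lower() for tok in doc] for doc in corpus]

# 2. replace low-frequency words with a special <unk> token
counts = Counter(tok for doc in corpus for tok in doc)
min_freq = 2   # arbitrary cutoff, for illustration only
vocab = {tok for tok, c in counts.items() if c >= min_freq} | {"<unk>"}
corpus = [[tok if tok in vocab else "<unk>" for tok in doc] for doc in corpus]
# "uva" survives (frequency 2); every other token maps to <unk>
```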

SLIDE 23

Information Embedded in BoW Representations

It is critical to keep in mind what information is preserved in a bag-of-words representation:

• Keep:
  • words in texts
• Lose:
  • word order
  • sentence boundary
  • paragraph boundary
  • · · ·

SLIDE 24

Case Study: Sentiment Analysis

SLIDE 25

A Simple Predictor

Consider the following toy example

Tokenized Texts

  Tokenized text 1: I love coffee
  Tokenized text 2: I don t like tea

              I  love  coffee  don  t  like  tea
  x^(1) =  [  1    1      1     0   0    0    0  ]^T
  w_Pos =  [  0    1      0     0   0    1    0  ]^T
  w_Neg =  [  0    0      0     1   0    0    0  ]^T

The prediction of sentiment polarity can be formulated as follows:

  $\boldsymbol{w}_{\text{Pos}}^{\top}\boldsymbol{x} = 1 > \boldsymbol{w}_{\text{Neg}}^{\top}\boldsymbol{x} = 0$   (4)

Essentially, this way of prediction is counting the positive and negative words in the text.
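A tiny sketch of this counting predictor; the two word lists below are the ones implied by the toy $\boldsymbol{w}_{\text{Pos}}$ and $\boldsymbol{w}_{\text{Neg}}$ above and are purely illustrative:

```python
# Toy word-counting sentiment predictor: compare (# positive words) vs. (# negative words).
positive_words = {"love", "like"}   # toy lexicon implied by w_Pos in the table
negative_words = {"don"}            # toy lexicon implied by w_Neg in the table

def predict(tokens):
    pos = sum(tok in positive_words for tok in tokens)   # w_Pos^T x
    neg = sum(tok in negative_words for tok in tokens)   # w_Neg^T x
    return "Pos" if pos > neg else "Neg"

print(predict(["I", "love", "coffee"]))           # Pos (1 > 0, as in Eq. (4))
print(predict(["I", "don", "t", "like", "tea"]))  # Neg, but only because the 1 vs. 1 tie breaks
                                                  # to Neg; this previews the limitation next
```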

SLIDE 29

Another Example

The limitation of word counting

              I  love  coffee  don  t  like  tea
  x^(2) =  [  1    0      0     1   1    1    1  ]^T
  w_Pos =  [  0    1      0     0   0    1    0  ]^T
  w_Neg =  [  0    0      0     1   0    0    0  ]^T

• Different words should contribute differently, e.g., not vs. dislike
• Sentiment word lists are not complete

Example II: Positive

  Din Tai Fung, every time I go eat at anyone of the locations around the King County area, I keep being reminded on why I have to keep coming back to this restaurant. · · ·

SLIDE 32

Logistic Regression

SLIDE 33

Log-linear Models

Directly modeling a linear classifier as

  $h_y(\boldsymbol{x}) = \boldsymbol{w}_y^{\top}\boldsymbol{x} + b_y$   (5)

with

• $\boldsymbol{x} \in \mathbb{N}^{V}$: vector, bag-of-words representation
• $\boldsymbol{w}_y \in \mathbb{R}^{V}$: vector, classification weights associated with label $y$
• $b_y \in \mathbb{R}$: scalar, bias of label $y$ in the training set

About Label Bias

  Consider a case where we have 90 positive examples and 10 negative examples in the training set. With $b_{\text{Pos}} > b_{\text{Neg}}$, a classifier can get 90% of its predictions correct without even resorting to the texts.

SLIDE 35

Logistic Regression

Rewrite the linear decision function in the log-probabilistic form

  $\log P(y \mid \boldsymbol{x}) \propto \underbrace{\boldsymbol{w}_y^{\top}\boldsymbol{x} + b_y}_{h_y(\boldsymbol{x})}$   (6)

Or, the probabilistic form is

  $P(y \mid \boldsymbol{x}) \propto \exp(\boldsymbol{w}_y^{\top}\boldsymbol{x} + b_y)$   (7)

To make sure $P(y \mid \boldsymbol{x})$ is a valid definition of probability, we need to make sure $\sum_{y} P(y \mid \boldsymbol{x}) = 1$:

  $P(y \mid \boldsymbol{x}) = \dfrac{\exp(\boldsymbol{w}_y^{\top}\boldsymbol{x} + b_y)}{\sum_{y' \in \mathcal{Y}} \exp(\boldsymbol{w}_{y'}^{\top}\boldsymbol{x} + b_{y'})}$   (8)

Problem: Show that the classifier based on Eq. 8 is still a linear classifier.

SLIDE 39

Alternative Form

Rewriting $\boldsymbol{x}$ and $\boldsymbol{w}_y$ as

• $\boldsymbol{x}^{\top} = [x_1, x_2, \cdots, x_V, 1]$
• $\boldsymbol{w}_y^{\top} = [w_1, w_2, \cdots, w_V, b_y]$

allows us to have a more concise form

  $P(y \mid \boldsymbol{x}) = \dfrac{\exp(\boldsymbol{w}_y^{\top}\boldsymbol{x})}{\sum_{y' \in \mathcal{Y}} \exp(\boldsymbol{w}_{y'}^{\top}\boldsymbol{x})}$   (9)

Comments:

• $\dfrac{\exp(a)}{\sum_{a'} \exp(a')}$ is the softmax function
• This form works with any size of $\mathcal{Y}$; it does not have to be a binary classification problem.
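A short NumPy sketch of Eq. (9); the weight matrix is random and only meant to show the shape bookkeeping, with the trailing 1 in $\boldsymbol{x}$ absorbing the bias as in the rewriting above:

```python
import numpy as np

V, num_labels = 7, 2                      # vocabulary size, |Y|
rng = np.random.default_rng(0)
W = rng.normal(size=(num_labels, V + 1))  # one row w_y per label; last entry plays the role of b_y

x = np.array([1, 1, 1, 0, 0, 0, 0, 1.0])  # bag-of-words vector with a trailing 1 (Eq. 9)

scores = W @ x                                   # w_y^T x for every label y
probs = np.exp(scores) / np.exp(scores).sum()    # softmax -> P(y | x)
y_hat = int(np.argmax(probs))                    # predicted label index
```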

SLIDE 41

Binary Classifier

Assume $\mathcal{Y} = \{\text{Neg}, \text{Pos}\}$, then the corresponding logistic regression classifier with $Y = \text{Pos}$ is

  $P(Y = \text{Pos} \mid \boldsymbol{x}) = \dfrac{1}{1 + \exp(-\boldsymbol{w}^{\top}\boldsymbol{x})}$   (10)

where $\boldsymbol{w}$ is the only parameter.

• $P(Y = \text{Neg} \mid \boldsymbol{x}) = 1 - P(Y = \text{Pos} \mid \boldsymbol{x})$
• $\dfrac{\exp(z)}{1 + \exp(z)}$ is the sigmoid function

Problem: Show that Eq. 10 is a special form of Eq. 9.
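A quick numerical check (not a proof) of the special-form claim: with two labels, choosing $\boldsymbol{w} = \boldsymbol{w}_{\text{Pos}} - \boldsymbol{w}_{\text{Neg}}$ makes the softmax of Eq. (9) coincide with the sigmoid of Eq. (10). The vectors below are random, just to exercise the identity:

```python
import numpy as np

rng = np.random.default_rng(1)
V = 5
w_pos, w_neg = rng.normal(size=V), rng.normal(size=V)
x = rng.integers(0, 2, size=V).astype(float)   # a random binary bag-of-words vector

# Softmax form (Eq. 9) for the Pos label
softmax_pos = np.exp(w_pos @ x) / (np.exp(w_pos @ x) + np.exp(w_neg @ x))

# Sigmoid form (Eq. 10) with w = w_pos - w_neg
w = w_pos - w_neg
sigmoid_pos = 1.0 / (1.0 + np.exp(-(w @ x)))

print(np.isclose(softmax_pos, sigmoid_pos))    # True
```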

SLIDE 45

Two Questions on Building LR Models

Two questions of building a logistic regression classifier

  $P(y \mid \boldsymbol{x}) = \dfrac{\exp(\boldsymbol{w}_y^{\top}\boldsymbol{x})}{\sum_{y' \in \mathcal{Y}} \exp(\boldsymbol{w}_{y'}^{\top}\boldsymbol{x})}$   (11)

• How to learn the parameters $\boldsymbol{W} = \{\boldsymbol{w}_y\}_{y \in \mathcal{Y}}$?
• Can $\boldsymbol{x}$ be better than the bag-of-words representations?
  • Will be discussed in lecture 04

SLIDE 47

Review: (Log-)Likelihood Function

With a collection of training examples $\{(\boldsymbol{x}^{(i)}, y^{(i)})\}_{i=1}^{m}$, the likelihood function of $\{\boldsymbol{w}_y\}_{y \in \mathcal{Y}}$ is

  $L(\boldsymbol{W}) = \prod_{i=1}^{m} P(y^{(i)} \mid \boldsymbol{x}^{(i)})$   (12)

and the log-likelihood function is

  $\ell(\{\boldsymbol{w}_y\}) = \sum_{i=1}^{m} \log P(y^{(i)} \mid \boldsymbol{x}^{(i)})$   (13)

SLIDE 48

Log-likelihood Function of a LR Model

With the definition of a LR model

  $P(y \mid \boldsymbol{x}) = \dfrac{\exp(\boldsymbol{w}_y^{\top}\boldsymbol{x})}{\sum_{y' \in \mathcal{Y}} \exp(\boldsymbol{w}_{y'}^{\top}\boldsymbol{x})}$   (14)

the log-likelihood function is

  $\ell(\boldsymbol{W}) = \sum_{i=1}^{m} \log P(y^{(i)} \mid \boldsymbol{x}^{(i)})$   (15)
  $\phantom{\ell(\boldsymbol{W})} = \sum_{i=1}^{m} \Big( \boldsymbol{w}_{y^{(i)}}^{\top}\boldsymbol{x}^{(i)} - \log \sum_{y' \in \mathcal{Y}} \exp(\boldsymbol{w}_{y'}^{\top}\boldsymbol{x}^{(i)}) \Big)$   (16)

Given the training examples $\{(\boldsymbol{x}^{(i)}, y^{(i)})\}_{i=1}^{m}$, $\ell(\boldsymbol{W})$ is a function of $\boldsymbol{W} = \{\boldsymbol{w}_y\}$.
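A NumPy sketch of Eq. (16) on made-up data; the max-subtraction inside the log-sum-exp is a standard numerical-stability detail that the slide does not show:

```python
import numpy as np

def log_likelihood(W, X, y):
    """Eq. (16): sum_i [ w_{y_i}^T x_i - log sum_{y'} exp(w_{y'}^T x_i) ].

    W: (num_labels, d) weight matrix, X: (m, d) examples, y: (m,) label indices.
    """
    scores = X @ W.T                              # (m, num_labels); entry [i, y] = w_y^T x_i
    shift = scores.max(axis=1, keepdims=True)     # for a numerically stable log-sum-exp
    log_norm = shift.squeeze(1) + np.log(np.exp(scores - shift).sum(axis=1))
    return float((scores[np.arange(len(y)), y] - log_norm).sum())

# toy usage with made-up data
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(4, 6)).astype(float)   # 4 examples, 6 features
y = np.array([0, 1, 1, 0])
W = rng.normal(size=(2, 6))
print(log_likelihood(W, X, y))
```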

SLIDE 49

Optimization with Gradient

MLE is equivalent to minimizing the Negative Log-Likelihood (NLL)

  $\text{NLL}(\boldsymbol{W}) = -\ell(\boldsymbol{W}) = \sum_{i=1}^{m} \Big( -\boldsymbol{w}_{y^{(i)}}^{\top}\boldsymbol{x}^{(i)} + \log \sum_{y' \in \mathcal{Y}} \exp(\boldsymbol{w}_{y'}^{\top}\boldsymbol{x}^{(i)}) \Big)$

then, the parameter $\boldsymbol{w}_y$ associated with label $y$ can be updated as

  $\boldsymbol{w}_y \leftarrow \boldsymbol{w}_y - \eta \cdot \dfrac{\partial\, \text{NLL}(\{\boldsymbol{w}_y\})}{\partial \boldsymbol{w}_y}, \quad \forall y \in \mathcal{Y}$   (17)

where $\eta$ is called the learning rate.

SLIDE 50

Optimization with Gradient (II)

Two questions answered by the update equation: (1) which direction? (2) how far should it go?

  $\boldsymbol{w}_y \leftarrow \boldsymbol{w}_y - \underbrace{\eta}_{(2)} \cdot \underbrace{\dfrac{\partial\, \text{NLL}(\{\boldsymbol{w}_y\})}{\partial \boldsymbol{w}_y}}_{(1)}$   (18)

[Jurafsky and Martin, 2019]

SLIDE 53

Training Procedure

Steps for parameter estimation, given the current parameters $\{\boldsymbol{w}_y\}$ (see the code sketch below):

1. Compute the derivative $\dfrac{\partial\, \text{NLL}(\{\boldsymbol{w}_y\})}{\partial \boldsymbol{w}_y}, \ \forall y \in \mathcal{Y}$
2. Update the parameters with $\boldsymbol{w}_y \leftarrow \boldsymbol{w}_y - \eta \cdot \dfrac{\partial\, \text{NLL}(\{\boldsymbol{w}_y\})}{\partial \boldsymbol{w}_y}, \ \forall y \in \mathcal{Y}$
3. If not done, return to step 1
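A compact sketch of this loop as full-batch gradient descent on the softmax NLL (the learning rate, step count, and zero initialization are arbitrary illustrative choices):

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def train_lr(X, y, num_labels, eta=0.1, num_steps=200):
    """Gradient descent on NLL(W) for multiclass logistic regression.

    X: (m, d) bag-of-words matrix, y: (m,) label indices.
    """
    m, d = X.shape
    W = np.zeros((num_labels, d))
    Y = np.eye(num_labels)[y]                 # one-hot labels, (m, num_labels)
    for _ in range(num_steps):
        P = softmax(X @ W.T)                  # step 1 input: P(y | x_i) for all i
        grad = (P - Y).T @ X                  # dNLL/dW, stacking dNLL/dw_y as rows
        W -= eta * grad                       # step 2: w_y <- w_y - eta * gradient
    return W
```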

SLIDE 54

Procedure of Building a Classifier

Review: the pipeline of text classification

  Text → Numeric Vector $\boldsymbol{x}$ (bag-of-words representation) → Classifier (logistic regression) → Category $y$

SLIDE 55

L2 Regularization

SLIDE 56

L2 Regularization

The commonly used regularization trick is $L_2$ regularization. For that, we need to redefine the objective function of LR by adding an additional term:

  $\text{Loss}(\boldsymbol{W}) = \underbrace{\sum_{i=1}^{m} \Big( -\boldsymbol{w}_{y^{(i)}}^{\top}\boldsymbol{x}^{(i)} + \log \sum_{y' \in \mathcal{Y}} \exp(\boldsymbol{w}_{y'}^{\top}\boldsymbol{x}^{(i)}) \Big)}_{\text{NLL}} + \underbrace{\frac{\lambda}{2} \cdot \sum_{y \in \mathcal{Y}} \|\boldsymbol{w}_y\|_2^2}_{L_2\ \text{reg}}$   (19)

• $\lambda$ is the regularization parameter

SLIDE 58

L2 Regularization in Gradient Descent

• The gradient of the loss function

  $\dfrac{\partial\, \text{Loss}(\boldsymbol{W})}{\partial \boldsymbol{w}_y} = \dfrac{\partial\, \text{NLL}(\boldsymbol{W})}{\partial \boldsymbol{w}_y} + \lambda \boldsymbol{w}_y$   (20)

• To minimize the loss, we need to update the parameter as

  $\boldsymbol{w}_y \leftarrow \boldsymbol{w}_y - \eta \Big( \dfrac{\partial\, \text{NLL}(\boldsymbol{W})}{\partial \boldsymbol{w}_y} + \lambda \boldsymbol{w}_y \Big) = (1 - \eta\lambda) \cdot \boldsymbol{w}_y - \eta\, \dfrac{\partial\, \text{NLL}(\boldsymbol{W})}{\partial \boldsymbol{w}_y}$   (21)

• Depending on the strength (value) of $\lambda$, the regularization term tries to keep the parameter values close to 0, which to some extent can help avoid overfitting
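In code, Eq. (21) is a one-line change to the update step in the earlier training-loop sketch; a standalone helper, with illustrative default values for $\eta$ and $\lambda$:

```python
def regularized_update(w_y, grad_nll_wy, eta=0.1, lam=0.1):
    """Eq. (21): shrink the weights by (1 - eta*lam), then take the usual NLL gradient step.

    w_y: current weight vector, grad_nll_wy: dNLL/dw_y.
    eta (learning rate) and lam (lambda) are illustrative values, not the demo's settings.
    """
    return (1 - eta * lam) * w_y - eta * grad_nll_wy
```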

SLIDE 61

Learning without Regularization

In the demo code, we chose $\lambda = 1/C = 0.001$ to approximate the case without regularization.

• Training accuracy: 99.89%
• Val accuracy: 52.21%

SLIDE 62

Classification Weights without Regularization

Here are some word features and their classification weights from the previous model without regularization. Positive weights indicate that the word feature contributes to the positive sentiment class, and negative weights indicate the opposite contribution.

                 interesting  pleasure  boring    zoe     write   workings
  Without Reg       0.011      -5.63     1.80    -5.68    -8.20    14.16

• negative: woody allen can write and deliver a one liner as well as anybody .
• positive: soderbergh , like kubrick before him , may not touch the planet ’s skin , but understands the workings of its spirit .

SLIDE 65

Learning with Regularization

We chose $\lambda = 1/C = 10^2$

• Training accuracy: 62.54%
• Val accuracy: 63.17%

SLIDE 66

Classification Weights with Regularization

With regularization, the classification weights make more sense to us

                 interesting  pleasure  boring    zoe      write    workings
  Without Reg       0.011      -5.63     1.80    -5.68     -8.20     14.16
  With Reg          0.16        0.36    -0.21    -0.057    -0.066     0.040

Regularization for Avoiding Overfitting

  Reduce the correlation between the class label and some noisy features.

SLIDE 68

Demo Code

SLIDE 69

Demo

What we are going to review from this demo code

• NLP
  • Bag-of-words representations
  • Text classifiers
• Machine Learning
  • Overfitting
  • L2 regularization
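The actual demo code is not reproduced here; the snippet below is only a sketch of how such a demo is commonly set up with scikit-learn, whose LogisticRegression exposes the inverse regularization strength C (so $\lambda = 1/C$, matching the values quoted on the earlier slides). The texts and labels are placeholders:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder data: in the actual demo these would be the movie-review texts and labels.
train_texts, train_labels = ["i love this movie", "boring and slow"], [1, 0]
val_texts, val_labels = ["a real pleasure", "i don t like it"], [1, 0]

# Bag-of-words representation
vectorizer = CountVectorizer(lowercase=True)
X_train = vectorizer.fit_transform(train_texts)
X_val = vectorizer.transform(val_texts)

# Logistic regression; C = 1/lambda, so C=1000 approximates "no regularization"
# and C=0.01 corresponds to lambda = 100.
for C in (1000, 0.01):
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(X_train, train_labels)
    print(C, clf.score(X_train, train_labels), clf.score(X_val, val_labels))
```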

SLIDE 70

Reference

Jurafsky, D. and Martin, J. (2019). Speech and Language Processing.

Pang, B., Lee, L., and Vaithyanathan, S. (2002). Thumbs up?: Sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, pages 79–86. Association for Computational Linguistics.