SLIDE 1

CS 4501 Machine Learning for NLP

Introduction

Yangfeng Ji

Department of Computer Science, University of Virginia

SLIDE 2

Overview

1. Course Information
2. Basic Linear Algebra
3. Basic Probability Theory
4. Statistical Estimation


SLIDE 4

About Online Lectures

◮ All lectures will be recorded and uploaded to Collab
◮ By default, participants are muted upon entry. If you have a question:
  ◮ Chime in
  ◮ Use the “Raise Hand” feature
  ◮ Send a message via Chat
◮ By default, video is off upon entry
◮ Create a Slack workspace for this course (?)

SLIDE 5

Course Information

SLIDE 6

Course Webpage

http://yangfengji.net/uva-nlp-course/


SLIDE 8

Instructors

◮ Instructor
  ◮ Yangfeng Ji
  ◮ Office hour: TBD
◮ TA
  ◮ Stephanie Schoch
  ◮ Office hour: TBD

SLIDE 9

Clarification

This is not the class for you if you want to

◮ learn programming
◮ learn basic machine learning
◮ learn how to use PyTorch

SLIDE 10

Goal of This Course

1. Explain the fundamental NLP techniques
   ◮ Text classification
   ◮ Language modeling
   ◮ Word embeddings
   ◮ Sequence labeling
   ◮ Machine translation
2. Advanced topics
   ◮ Discourse processing, text generation, interpretability in NLP
3. Opportunities to work on some NLP problems
   ◮ Final project


SLIDE 13

Assignments

◮ No exam
◮ Six homeworks
  ◮ 14% × 6 = 84%
◮ One final project
  ◮ 2–3 students per group
  ◮ Proposal: 4%
  ◮ Final presentation: 6%
  ◮ Final project report: 6%

SLIDE 14

Policy: late penalty

Homework submissions will be accepted up to 72 hours late, with a 20% deduction on the points per 24 hours as a penalty. For example,

◮ Deadline: August 30th, 11:59 PM
◮ Submission timestamp: September 1st, 9:00 AM (≤ 48 hours late)
◮ Original points of the homework: 7
◮ Actual points: $7 \times (1 - 40\%) = 4.2$   (1)

It is usually better if students just turn in what they have on time.

SLIDE 15

Policy: collaboration

◮ Homeworks
  ◮ Collaboration is not encouraged
  ◮ Students are allowed to discuss with their classmates
◮ Final project
  ◮ It should be a team effort

SLIDE 16

Policy: grades


SLIDE 18

Textbooks

◮ Textbook
  ◮ Eisenstein, Natural Language Processing, 2018
◮ Additional textbooks
  ◮ Jurafsky and Martin, Speech and Language Processing, 3rd Edition, 2019
  ◮ Smith, Linguistic Structure Prediction, 2009
  ◮ Shalev-Shwartz and Ben-David, Understanding Machine Learning: From Theory to Algorithms, 2014
  ◮ Goodfellow, Bengio and Courville, Deep Learning, 2016

All free online

SLIDE 19

Piazza

https://piazza.com/virginia/fall2020/cs4501003

◮ course announcements
◮ online Q&A

SLIDE 20

Questions?

SLIDE 21

Basic Linear Algebra


SLIDE 23

Linear Equations

Consider the following system of equations:

$x_1 - x_2 = 1$
$x_1 + 2x_2 = 2$   (2)

Each equation represents a line in the 2-D space with axes $x_1$ and $x_2$.

SLIDE 24

Linear Equations

Consider the following system of equations:

$x_1 - x_2 = 1$
$x_1 + 2x_2 = 2$   (3)

In matrix notation, it can be written in a more compact form

$\mathbf{A}\boldsymbol{x} = \boldsymbol{b}$   (4)

with

$\mathbf{A} = \begin{bmatrix} 1 & -1 \\ 1 & 2 \end{bmatrix}, \quad \boldsymbol{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad \boldsymbol{b} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$   (5)


SLIDE 26

Basic Notations

$\mathbf{A} = \begin{bmatrix} 1 & -1 \\ 1 & 2 \end{bmatrix}, \quad \boldsymbol{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad \boldsymbol{b} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$

◮ $\mathbf{A} \in \mathbb{R}^{m \times n}$: a matrix with $m$ rows and $n$ columns
  ◮ The element in the $i$-th row and $j$-th column is denoted $a_{i,j}$
◮ $\boldsymbol{x} \in \mathbb{R}^n$: a vector with $n$ entries. By convention, an $n$-dimensional vector is often thought of as a matrix with $n$ rows and 1 column, known as a column vector.
  ◮ The $i$-th element is denoted $x_i$

Problem: Solve a matrix-vector multiplication by hand and with PyTorch
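A minimal sketch for the problem above, assuming PyTorch, using $\mathbf{A}$ and $\boldsymbol{b}$ from Eq. (5); the variable names are illustrative.

```python
import torch

# Matrix A and vector b from Eq. (5)
A = torch.tensor([[1.0, -1.0],
                  [1.0,  2.0]])
b = torch.tensor([1.0, 2.0])

# Matrix-vector multiplication: (2, 2) @ (2,) -> (2,)
y = A @ b          # equivalently torch.mv(A, b)
print(y)           # tensor([-1., 5.])
```

By hand: the first entry is 1·1 + (−1)·2 = −1 and the second is 1·1 + 2·2 = 5, matching the output.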

SLIDE 27

ℓ2 Norm

The ℓ2 norm of a vector $\boldsymbol{x} \in \mathbb{R}^n$ is defined as

$\|\boldsymbol{x}\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$   (6)

[Figure: a vector $\boldsymbol{x}$ in the $(x_1, x_2)$ plane, with $\|\boldsymbol{x}\|_2$ as its length]

SLIDE 28

ℓ1 Norm

The ℓ1 norm of a vector $\boldsymbol{x} \in \mathbb{R}^n$ is defined as

$\|\boldsymbol{x}\|_1 = \sum_{i=1}^{n} |x_i|$   (7)
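A quick numerical check of Eqs. (6) and (7), assuming PyTorch; torch.linalg.norm is the built-in counterpart.

```python
import torch

x = torch.tensor([3.0, -4.0])

# l2 norm, Eq. (6): square root of the sum of squares
l2 = torch.sqrt((x ** 2).sum())
print(l2, torch.linalg.norm(x, ord=2))   # both 5.0

# l1 norm, Eq. (7): sum of absolute values
l1 = x.abs().sum()
print(l1, torch.linalg.norm(x, ord=1))   # both 7.0
```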


SLIDE 31

Dot Product

The dot product of $\boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}^n$ is defined as

$\langle \boldsymbol{x}, \boldsymbol{y} \rangle = \boldsymbol{x}^{\mathsf{T}}\boldsymbol{y} = \sum_{i=1}^{n} x_i y_i$   (8)

where $\boldsymbol{x}^{\mathsf{T}}$ is the transpose of $\boldsymbol{x}$.

◮ $\|\boldsymbol{x}\|_2^2 = \langle \boldsymbol{x}, \boldsymbol{x} \rangle$
◮ If $\boldsymbol{x} = (0, \ldots, 0, \underbrace{1}_{x_i}, 0, \ldots, 0)$, then $\langle \boldsymbol{x}, \boldsymbol{y} \rangle = y_i$
◮ If $\boldsymbol{x}$ is a unit vector ($\|\boldsymbol{x}\|_2 = 1$), then $\langle \boldsymbol{x}, \boldsymbol{y} \rangle$ is the projection of $\boldsymbol{y}$ onto the direction of $\boldsymbol{x}$
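A short sketch of Eq. (8) and the projection property, assuming PyTorch; the vectors are illustrative.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([4.0, 5.0, 6.0])

# Dot product, Eq. (8)
print(torch.dot(x, y))                              # 1*4 + 2*5 + 3*6 = 32.0

# <x, x> equals the squared l2 norm
print(torch.dot(x, x), torch.linalg.norm(x) ** 2)   # both 14.0

# Projection of y onto a unit vector u
u = x / torch.linalg.norm(x)   # normalize so that ||u||_2 = 1
print(torch.dot(u, y))         # length of y's projection onto u
```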

SLIDE 32

Frobenius Norm

The Frobenius norm of a matrix $\mathbf{A} = [a_{i,j}] \in \mathbb{R}^{m \times n}$, denoted $\|\cdot\|_F$, is defined as

$\|\mathbf{A}\|_F = \Big( \sum_{i=1}^{m} \sum_{j=1}^{n} a_{i,j}^2 \Big)^{1/2}$   (9)

◮ The Frobenius norm can be interpreted as the ℓ2 norm of a vector when treating $\mathbf{A}$ as a vector of size $mn$.
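A quick check of that interpretation, assuming PyTorch: the Frobenius norm of $\mathbf{A}$ matches the ℓ2 norm of $\mathbf{A}$ flattened.

```python
import torch

A = torch.tensor([[1.0, -1.0],
                  [1.0,  2.0]])

# Frobenius norm, Eq. (9)
fro = torch.linalg.norm(A, ord='fro')

# The same value as the l2 norm of A viewed as a vector of size m*n
flat = torch.linalg.norm(A.reshape(-1), ord=2)
print(fro, flat)   # both sqrt(1 + 1 + 1 + 4) = sqrt(7)
```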


SLIDE 34

Two Special Matrices

◮ The identity matrix, denoted $\mathbf{I} \in \mathbb{R}^{n \times n}$, is a square matrix with ones on the diagonal and zeros everywhere else:

$\mathbf{I} = \begin{bmatrix} 1 & & \\ & \ddots & \\ & & 1 \end{bmatrix}$   (10)

◮ A diagonal matrix, denoted $\mathbf{D} = \mathrm{diag}(d_1, d_2, \ldots, d_n)$, is a matrix where all non-diagonal elements are 0:

$\mathbf{D} = \begin{bmatrix} d_1 & & \\ & \ddots & \\ & & d_n \end{bmatrix}$   (11)

SLIDE 35

Inverse

The inverse of a square matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ is denoted $\mathbf{A}^{-1}$; it is the unique matrix such that

$\mathbf{A}^{-1}\mathbf{A} = \mathbf{I} = \mathbf{A}\mathbf{A}^{-1}$   (12)

◮ Non-square matrices do not have inverses (by definition)
◮ Not all square matrices are invertible
◮ The solution of the linear equations in Eq. (3) is $\boldsymbol{x} = \mathbf{A}^{-1}\boldsymbol{b}$
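A sketch of solving Eq. (3) this way, assuming PyTorch; in practice torch.linalg.solve is preferred over forming the inverse explicitly.

```python
import torch

A = torch.tensor([[1.0, -1.0],
                  [1.0,  2.0]])
b = torch.tensor([1.0, 2.0])

# x = A^{-1} b, computed explicitly ...
x_inv = torch.linalg.inv(A) @ b

# ... and via a linear solver (numerically more stable)
x_solve = torch.linalg.solve(A, b)
print(x_inv, x_solve)   # both [4/3, 1/3]
```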


SLIDE 38

Orthogonal Matrices

◮ Two vectors $\boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}^n$ are orthogonal if $\langle \boldsymbol{x}, \boldsymbol{y} \rangle = 0$
◮ A square matrix $\mathbf{U} \in \mathbb{R}^{n \times n}$ is orthogonal if all its columns are orthogonal to each other and normalized (orthonormal):

$\langle \boldsymbol{u}_i, \boldsymbol{u}_j \rangle = 0, \quad \|\boldsymbol{u}_i\|_2 = 1, \quad \|\boldsymbol{u}_j\|_2 = 1$   (13)

for $i, j \in [n]$ and $i \neq j$
◮ Furthermore, $\mathbf{U}^{\mathsf{T}}\mathbf{U} = \mathbf{I} = \mathbf{U}\mathbf{U}^{\mathsf{T}}$, which further implies $\mathbf{U}^{-1} = \mathbf{U}^{\mathsf{T}}$

Problem: Create special matrices using PyTorch
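A sketch for the problem above: building the identity and a diagonal matrix in PyTorch, then checking orthogonality on a 2-D rotation matrix (the angle is an illustrative choice).

```python
import math
import torch

# Identity matrix, Eq. (10)
I = torch.eye(3)

# Diagonal matrix D = diag(d1, ..., dn), Eq. (11)
D = torch.diag(torch.tensor([1.0, 2.0, 3.0]))

# A 2-D rotation matrix is orthogonal
t = 0.3
U = torch.tensor([[math.cos(t), -math.sin(t)],
                  [math.sin(t),  math.cos(t)]])

# Check U^T U = I, which implies U^{-1} = U^T
print(torch.allclose(U.T @ U, torch.eye(2), atol=1e-6))   # True
```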

SLIDE 39

Symmetric Matrices

A symmetric matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ is defined by

$\mathbf{A}^{\mathsf{T}} = \mathbf{A}$   (14)

or, in other words,

$a_{i,j} = a_{j,i} \quad \forall i, j \in [n]$   (15)

Comments:

◮ The identity matrix $\mathbf{I}$ is symmetric
◮ A diagonal matrix is symmetric

SLIDE 40

Quiz

The identity matrix $\mathbf{I}$ is

◮ a diagonal matrix?
◮ a symmetric matrix?
◮ an orthogonal matrix?

Further reference: [Kolter, 2015]

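A tiny PyTorch check of all three quiz properties (the answer to each is yes):

```python
import torch

I = torch.eye(3)

# Diagonal: rebuilding I from its diagonal leaves it unchanged
print(torch.equal(I, torch.diag(torch.diagonal(I))))   # True

# Symmetric: I^T = I, Eq. (14)
print(torch.equal(I.T, I))                             # True

# Orthogonal: I^T I = I
print(torch.equal(I.T @ I, I))                         # True
```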

SLIDE 42

Basic Probability Theory

SLIDE 43

What is Probability?

The probability of landing heads is 0.52


SLIDE 46

Two interpretations

Frequentist: Probability represents the long-run frequency of an event
◮ If we flip the coin many times, we expect it to land heads about 52% of the time

Bayesian: Probability quantifies our (un)certainty about an event
◮ We believe the coin has a 52% chance of landing heads on the next toss

Comments:
◮ In machine learning, we use both of them.
◮ Depending on convenience, we will choose which one to use

SLIDE 47

Bayesian Interpretation

Example scenarios of the Bayesian interpretation of probability:


SLIDE 49

Binary Random Variables

◮ Event $X$, such as
  ◮ the coin will land heads on the next toss
  ◮ it will rain tomorrow
◮ Sample space of $X$: $\{\text{False}, \text{True}\}$, or for simplicity $\{0, 1\}$
◮ Probability: $P(X = x)$, or $P(x)$ for short

Example: Tossing a coin

Let $X$ be the event that the coin will land heads on the next toss; then the probability from the previous example is

$P(X = 1) = 0.52$   (16)


SLIDE 51

Bernoulli Distribution

Given a binary random variable $X$ with sample space $\{0, 1\}$,

$P(X = x) = \theta^x (1 - \theta)^{1 - x}$

with a single parameter $\theta$ defined as $\theta = P(X = 1)$. [Portrait: Jacob Bernoulli]

Examples: Distribution of Binary Classes

◮ Sentiment classification: {Positive, Negative}
◮ Spam filtering: {True, False}
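A minimal sketch of this distribution with θ = 0.52 from the coin example, assuming PyTorch's torch.distributions:

```python
import torch
from torch.distributions import Bernoulli

# theta = P(X = 1), the probability of heads
dist = Bernoulli(probs=0.52)

# The pmf theta^x (1 - theta)^(1 - x) at x = 1 and x = 0
print(dist.log_prob(torch.tensor(1.0)).exp())   # 0.52
print(dist.log_prob(torch.tensor(0.0)).exp())   # 0.48

# Simulate ten coin tosses
print(dist.sample((10,)))
```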

SLIDE 52

Tossing a Die

How do we define the corresponding random variable?

◮ $X \in \{1, 2, 3, 4, 5, 6\}$
◮ $\boldsymbol{X} \in \{100000, 010000, 001000, 000100, 000010, 000001\}$ (one-hot vectors)


SLIDE 55

Categorical Distribution

The previous random event can be described with a categorical distribution

$P(\boldsymbol{X} = \boldsymbol{x}) = \prod_{k=1}^{6} \theta_k^{x_k}$   (17)

where

◮ $x_k \in \{0, 1\}$, and
◮ $\{\theta_k\}_{k=1}^{6}$ are the parameters of this distribution; $\theta_k$ is the probability of side $k$ showing up.

Example: Multiclass Classification

◮ Topic classification on news: Business, Technology, Sports, Science, etc.

Problem: Pick a random event, define its sample space and probability distribution
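A sketch of Eq. (17) for a fair six-sided die, assuming PyTorch's torch.distributions.OneHotCategorical:

```python
import torch
from torch.distributions import OneHotCategorical

# Parameters theta_1, ..., theta_6 for a fair die
theta = torch.full((6,), 1.0 / 6.0)
dist = OneHotCategorical(probs=theta)

# P(X = x) for the one-hot encoding of side 3, Eq. (17)
x = torch.tensor([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
print(dist.log_prob(x).exp())   # 1/6, about 0.1667

# Simulate three die rolls as one-hot vectors
print(dist.sample((3,)))
```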


SLIDE 58

Joint Probability and Independence

Modeling two random variables together with a joint probability distribution

$P(X, Y)$   (18)

If $X$ and $Y$ are independent,

$P(X, Y) = P(X) \cdot P(Y)$   (19)

where

$P(X) = \sum_{Y} P(X, Y)$   (20)
$P(Y) = \sum_{X} P(X, Y)$   (21)

are two marginal distributions.

◮ $X$: whether it is cloudy
◮ $Y$: whether it will rain

P(X, Y)   X = 0   X = 1
Y = 0     0.35    0.15
Y = 1     0.05    0.45

SLIDE 59

Conditional Probability

The conditional probability of $Y$ given $X$ is

$P(Y \mid X) = \frac{P(X, Y)}{P(X)}$   (22)

Example: document classification

◮ $X$: a document
◮ $Y$: the label of this document


SLIDE 62

Conditional Probability

◮ $X$: whether it is cloudy
◮ $Y$: whether it will rain

P(X, Y)   X = 0   X = 1
Y = 0     0.35    0.15
Y = 1     0.05    0.45

◮ $P(Y \mid X = 1)$:
  ◮ $P(Y = 0 \mid X = 1) = 0.25$
  ◮ $P(Y = 1 \mid X = 1) = 0.75$
◮ $P(Y)$: $P(Y = 0) = P(Y = 1) = 0.5$

Problem: Test whether two random variables are independent
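A sketch of that test for the joint table above, assuming PyTorch: compute the marginals via Eqs. (20) and (21) and compare $P(X)P(Y)$ with $P(X, Y)$.

```python
import torch

# Joint distribution P(X, Y); rows index Y, columns index X
P = torch.tensor([[0.35, 0.15],
                  [0.05, 0.45]])

# Marginals, Eqs. (20) and (21)
P_X = P.sum(dim=0)   # sum over Y -> P(X) = [0.4, 0.6]
P_Y = P.sum(dim=1)   # sum over X -> P(Y) = [0.5, 0.5]

# Independence, Eq. (19), would require P(X, Y) = P(X) * P(Y)
outer = torch.outer(P_Y, P_X)
print(torch.allclose(P, outer))   # False: X and Y are dependent
```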

SLIDE 63

Statistical Estimation

SLIDE 64

Statistics vs. Probability Theory

Statistics is, in a certain sense, the inverse of probability theory.

$P(X = x) = \theta^x (1 - \theta)^{1 - x}$   (23)

◮ Observed: values of random variables
  ◮ Observations of coin tossing: {0, 1, 1, 0, 0, 1, 0}
◮ Unknown: the model parameter $\theta$
◮ Task: infer the model parameter from the observed data

SLIDE 65

Likelihood-based Estimation

For a probability distribution $P(X; \theta)$ with $\theta$ as the unknown parameter, likelihood-based estimation with observations $\{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$ requires two steps:

1. Based on the distribution, define a likelihood function with the observations
2. Optimize the likelihood function to estimate $\theta$


SLIDE 67

Step I: Define the Likelihood Function

With observations $\{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$, the likelihood function is defined as

$L(\theta) = \prod_{i=1}^{n} P(x^{(i)}; \theta)$   (24)

Alternatively, the corresponding log-likelihood function is

$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log P(x^{(i)}; \theta)$   (25)


SLIDE 69

Step II: Maximum Likelihood Estimation

The value of $\theta$ can be estimated by maximizing the (log-)likelihood function:

$\hat{\theta} \leftarrow \operatorname{argmax}_{\theta}\, \ell(\theta)$   (26)

Usually, for simple problems, we can solve this by setting

$\frac{d\ell(\theta)}{d\theta} = \sum_{i=1}^{n} \frac{d \log P(x^{(i)}; \theta)}{d\theta} = 0$   (27)


SLIDE 73

Example: Bernoulli Distribution

Consider a Bernoulli distribution $P(X; \theta)$ with the parameter $\theta = P(X = 1; \theta)$ unknown:

$P(X = x; \theta) = \theta^x (1 - \theta)^{1 - x}$   (28)

With $n$ observations $\{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$, the log-likelihood function is

$\ell(\theta) = \sum_{i=1}^{n} \log P(x^{(i)}; \theta) = \sum_{i=1}^{n} \big\{ x^{(i)} \log \theta + (1 - x^{(i)}) \log(1 - \theta) \big\}$   (29)

Setting $\frac{\partial \ell(\theta)}{\partial \theta} = 0$, we have

$\theta = \frac{\sum_{i=1}^{n} x^{(i)}}{n}$   (30)

Problem: Prove Equation (30)
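A sketch of the proof asked for above, filling in the differentiation step:

```latex
\begin{align*}
\frac{\partial \ell(\theta)}{\partial \theta}
  &= \sum_{i=1}^{n} \left\{ \frac{x^{(i)}}{\theta} - \frac{1 - x^{(i)}}{1 - \theta} \right\}
   = \frac{s}{\theta} - \frac{n - s}{1 - \theta} = 0,
   \qquad \text{where } s = \sum_{i=1}^{n} x^{(i)} \\
% Cross-multiplying and simplifying:
(1 - \theta)\, s &= \theta\,(n - s)
  \;\Longrightarrow\; s - \theta s = \theta n - \theta s
  \;\Longrightarrow\; \theta = \frac{s}{n} = \frac{\sum_{i=1}^{n} x^{(i)}}{n}
\end{align*}
```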


SLIDE 75

Example: Bernoulli Distribution (II)

Assume the $n = 7$ observations are $\{0, 1, 1, 0, 0, 1, 0\}$; then

$\theta = \frac{3}{7}$   (31)

Quiz

How do we maximize the log-likelihood $\ell(\theta)$ if we cannot find a closed-form solution of the equation $\frac{d\ell(\theta)}{d\theta} = 0$?
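One standard answer is numerical optimization, e.g., gradient-based methods. A sketch in PyTorch that recovers θ ≈ 3/7 for the observations above (learning rate and step count are illustrative):

```python
import torch

x = torch.tensor([0., 1., 1., 0., 0., 1., 0.])

# Unconstrained parameter; theta = sigmoid(w) keeps theta in (0, 1)
w = torch.zeros(1, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

for _ in range(2000):
    theta = torch.sigmoid(w)
    # Negative log-likelihood, Eq. (29); minimizing it maximizes l(theta)
    nll = -(x * torch.log(theta) + (1 - x) * torch.log(1 - theta)).sum()
    opt.zero_grad()
    nll.backward()
    opt.step()

print(torch.sigmoid(w).item())   # about 3/7 = 0.4286
```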

SLIDE 76

Reference

Kolter, Z. (2015). Linear algebra review and reference.