SLIDE 1

CS 4501 Machine Learning for NLP

Introduction

Yangfeng Ji

Department of Computer Science, University of Virginia

SLIDE 2

Overview

1. Course Information
2. Basic Linear Algebra
3. Basic Probability Theory
4. Statistical Estimation


SLIDE 4

About Online Lectures

◮ All lectures will be recorded and uploaded to Collab
◮ By default, participants are muted upon entry. If you have a question:
  ◮ Chime in
  ◮ Use the “Raise Hand” feature
  ◮ Send a message via Chat
◮ By default, video is off upon entry
◮ Create a Slack workspace for this course (?)

SLIDE 5

Course Information

SLIDE 6

Course Webpage

http://yangfengji.net/uva-nlp-course/


SLIDE 8

Instructors

◮ Instructor
  ◮ Yangfeng Ji
  ◮ Office hour: TBD
◮ TA
  ◮ Stephanie Schoch
  ◮ Office hour: TBD

SLIDE 9

Clarification

This is not the class for you if you want to

◮ learn programming
◮ learn basic machine learning
◮ learn how to use PyTorch

SLIDE 10

Goal of This Course

1. Explain the fundamental NLP techniques
   ◮ Text classification
   ◮ Language modeling
   ◮ Word embeddings
   ◮ Sequence labeling
   ◮ Machine translation
2. Advanced topics
   ◮ Discourse processing, text generation, interpretability in NLP
3. Opportunities to work on some NLP problems
   ◮ Final project


SLIDE 13

Assignments

◮ No exam
◮ Six homeworks
  ◮ 14% × 6 = 84%
◮ One final project
  ◮ 2–3 students per group
  ◮ Proposal: 4%
  ◮ Final presentation: 6%
  ◮ Final project report: 6%

SLIDE 14

Policy: late penalty

Homework submissions will be accepted up to 72 hours late, with a 20% deduction on the points per 24 hours as a penalty. For example,

◮ Deadline: August 30th, 11:59 PM
◮ Submission timestamp: September 1st, 9:00 AM (≤ 48 hours late)
◮ Original points of the homework: 7
◮ Actual points: $7 \times (1 - 40\%) = 4.2$   (1)

It is usually better if students just turn in what they have on time.

SLIDE 15

Policy: collaboration

◮ Homeworks
  ◮ Collaboration is not encouraged
  ◮ Students are allowed to discuss with their classmates
◮ Final project
  ◮ It should be a team effort

SLIDE 16

Policy: grades


SLIDE 18

Textbooks

◮ Textbook
  ◮ Eisenstein, Natural Language Processing, 2018
◮ Additional textbooks
  ◮ Jurafsky and Martin, Speech and Language Processing, 3rd Edition, 2019
  ◮ Smith, Linguistic Structure Prediction, 2009
  ◮ Shalev-Shwartz and Ben-David, Understanding Machine Learning: From Theory to Algorithms, 2014
  ◮ Goodfellow, Bengio and Courville, Deep Learning, 2016

All free online

SLIDE 19

Piazza

https://piazza.com/virginia/fall2020/cs4501003

◮ course announcements
◮ online Q&A

SLIDE 20

Questions?

SLIDE 21

Basic Linear Algebra


SLIDE 23

Linear Equations

Consider the following system of equations:

$x_1 - x_2 = 1$
$x_1 + 2x_2 = 2$   (2)

Each equation represents a line in the 2-D space with axes $x_1$ and $x_2$.

SLIDE 24

Linear Equations

Consider the following system of equations:

$x_1 - x_2 = 1$
$x_1 + 2x_2 = 2$   (3)

In matrix notation, it can be written in a more compact form

$\mathbf{A}\boldsymbol{x} = \boldsymbol{b}$   (4)

with

$\mathbf{A} = \begin{bmatrix} 1 & -1 \\ 1 & 2 \end{bmatrix}, \quad \boldsymbol{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad \boldsymbol{b} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$   (5)


SLIDE 26

Basic Notations

$\mathbf{A} = \begin{bmatrix} 1 & -1 \\ 1 & 2 \end{bmatrix}, \quad \boldsymbol{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad \boldsymbol{b} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$

◮ $\mathbf{A} \in \mathbb{R}^{m \times n}$: a matrix with $m$ rows and $n$ columns
  ◮ The element in the $i$-th row and $j$-th column is denoted $a_{i,j}$
◮ $\boldsymbol{x} \in \mathbb{R}^n$: a vector with $n$ entries. By convention, an $n$-dimensional vector is often thought of as a matrix with $n$ rows and 1 column, known as a column vector.
  ◮ The $i$-th element is denoted $x_i$

Problem: Solve a matrix-vector multiplication by hand and with PyTorch
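A minimal sketch for the problem above, assuming PyTorch, using $\mathbf{A}$ and $\boldsymbol{b}$ from Eq. (5); the variable names are illustrative.

```python
import torch

# Matrix A and vector b from Eq. (5)
A = torch.tensor([[1.0, -1.0],
                  [1.0,  2.0]])
b = torch.tensor([1.0, 2.0])

# Matrix-vector multiplication: (2, 2) @ (2,) -> (2,)
y = A @ b          # equivalently torch.mv(A, b)
print(y)           # tensor([-1., 5.])
```

By hand: the first entry is 1·1 + (−1)·2 = −1 and the second is 1·1 + 2·2 = 5, matching the output.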

SLIDE 27

ℓ2 Norm

The ℓ2 norm of a vector $\boldsymbol{x} \in \mathbb{R}^n$ is defined as

$\|\boldsymbol{x}\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$   (6)

[Figure: a vector $\boldsymbol{x}$ in the $(x_1, x_2)$ plane, with $\|\boldsymbol{x}\|_2$ as its length]

SLIDE 28

ℓ1 Norm

The ℓ1 norm of a vector $\boldsymbol{x} \in \mathbb{R}^n$ is defined as

$\|\boldsymbol{x}\|_1 = \sum_{i=1}^{n} |x_i|$   (7)
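A quick numerical check of Eqs. (6) and (7), assuming PyTorch; torch.linalg.norm is the built-in counterpart.

```python
import torch

x = torch.tensor([3.0, -4.0])

# l2 norm, Eq. (6): square root of the sum of squares
l2 = torch.sqrt((x ** 2).sum())
print(l2, torch.linalg.norm(x, ord=2))   # both 5.0

# l1 norm, Eq. (7): sum of absolute values
l1 = x.abs().sum()
print(l1, torch.linalg.norm(x, ord=1))   # both 7.0
```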


SLIDE 31

Dot Product

The dot product of $\boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}^n$ is defined as

$\langle \boldsymbol{x}, \boldsymbol{y} \rangle = \boldsymbol{x}^{\mathsf{T}}\boldsymbol{y} = \sum_{i=1}^{n} x_i y_i$   (8)

where $\boldsymbol{x}^{\mathsf{T}}$ is the transpose of $\boldsymbol{x}$.

◮ $\|\boldsymbol{x}\|_2^2 = \langle \boldsymbol{x}, \boldsymbol{x} \rangle$
◮ If $\boldsymbol{x} = (0, \ldots, 0, \underbrace{1}_{x_i}, 0, \ldots, 0)$, then $\langle \boldsymbol{x}, \boldsymbol{y} \rangle = y_i$
◮ If $\boldsymbol{x}$ is a unit vector ($\|\boldsymbol{x}\|_2 = 1$), then $\langle \boldsymbol{x}, \boldsymbol{y} \rangle$ is the projection of $\boldsymbol{y}$ onto the direction of $\boldsymbol{x}$
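A short sketch of Eq. (8) and the projection property, assuming PyTorch; the vectors are illustrative.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([4.0, 5.0, 6.0])

# Dot product, Eq. (8)
print(torch.dot(x, y))                              # 1*4 + 2*5 + 3*6 = 32.0

# <x, x> equals the squared l2 norm
print(torch.dot(x, x), torch.linalg.norm(x) ** 2)   # both 14.0

# Projection of y onto a unit vector u
u = x / torch.linalg.norm(x)   # normalize so that ||u||_2 = 1
print(torch.dot(u, y))         # length of y's projection onto u
```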

SLIDE 32

Frobenius Norm

The Frobenius norm of a matrix $\mathbf{A} = [a_{i,j}] \in \mathbb{R}^{m \times n}$, denoted $\|\cdot\|_F$, is defined as

$\|\mathbf{A}\|_F = \Big( \sum_{i=1}^{m} \sum_{j=1}^{n} a_{i,j}^2 \Big)^{1/2}$   (9)

◮ The Frobenius norm can be interpreted as the ℓ2 norm of a vector when treating $\mathbf{A}$ as a vector of size $mn$.
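A quick check of that interpretation, assuming PyTorch: the Frobenius norm of $\mathbf{A}$ matches the ℓ2 norm of $\mathbf{A}$ flattened.

```python
import torch

A = torch.tensor([[1.0, -1.0],
                  [1.0,  2.0]])

# Frobenius norm, Eq. (9)
fro = torch.linalg.norm(A, ord='fro')

# The same value as the l2 norm of A viewed as a vector of size m*n
flat = torch.linalg.norm(A.reshape(-1), ord=2)
print(fro, flat)   # both sqrt(1 + 1 + 1 + 4) = sqrt(7)
```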


SLIDE 34

Two Special Matrices

◮ The identity matrix, denoted $\mathbf{I} \in \mathbb{R}^{n \times n}$, is a square matrix with ones on the diagonal and zeros everywhere else:

$\mathbf{I} = \begin{bmatrix} 1 & & \\ & \ddots & \\ & & 1 \end{bmatrix}$   (10)

◮ A diagonal matrix, denoted $\mathbf{D} = \mathrm{diag}(d_1, d_2, \ldots, d_n)$, is a matrix where all non-diagonal elements are 0:

$\mathbf{D} = \begin{bmatrix} d_1 & & \\ & \ddots & \\ & & d_n \end{bmatrix}$   (11)

SLIDE 35

Inverse

The inverse of a square matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ is denoted $\mathbf{A}^{-1}$; it is the unique matrix such that

$\mathbf{A}^{-1}\mathbf{A} = \mathbf{I} = \mathbf{A}\mathbf{A}^{-1}$   (12)

◮ Non-square matrices do not have inverses (by definition)
◮ Not all square matrices are invertible
◮ The solution of the linear equations in Eq. (3) is $\boldsymbol{x} = \mathbf{A}^{-1}\boldsymbol{b}$
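A sketch of solving Eq. (3) this way, assuming PyTorch; in practice torch.linalg.solve is preferred over forming the inverse explicitly.

```python
import torch

A = torch.tensor([[1.0, -1.0],
                  [1.0,  2.0]])
b = torch.tensor([1.0, 2.0])

# x = A^{-1} b, computed explicitly ...
x_inv = torch.linalg.inv(A) @ b

# ... and via a linear solver (numerically more stable)
x_solve = torch.linalg.solve(A, b)
print(x_inv, x_solve)   # both [4/3, 1/3]
```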


SLIDE 38

Orthogonal Matrices

◮ Two vectors $\boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}^n$ are orthogonal if $\langle \boldsymbol{x}, \boldsymbol{y} \rangle = 0$
◮ A square matrix $\mathbf{U} \in \mathbb{R}^{n \times n}$ is orthogonal if all its columns are orthogonal to each other and normalized (orthonormal):

$\langle \boldsymbol{u}_i, \boldsymbol{u}_j \rangle = 0, \quad \|\boldsymbol{u}_i\|_2 = 1, \quad \|\boldsymbol{u}_j\|_2 = 1$   (13)

for $i, j \in [n]$ and $i \neq j$
◮ Furthermore, $\mathbf{U}^{\mathsf{T}}\mathbf{U} = \mathbf{I} = \mathbf{U}\mathbf{U}^{\mathsf{T}}$, which further implies $\mathbf{U}^{-1} = \mathbf{U}^{\mathsf{T}}$

Problem: Create special matrices using PyTorch
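A sketch for the problem above: building the identity and a diagonal matrix in PyTorch, then checking orthogonality on a 2-D rotation matrix (the angle is an illustrative choice).

```python
import math
import torch

# Identity matrix, Eq. (10)
I = torch.eye(3)

# Diagonal matrix D = diag(d1, ..., dn), Eq. (11)
D = torch.diag(torch.tensor([1.0, 2.0, 3.0]))

# A 2-D rotation matrix is orthogonal
t = 0.3
U = torch.tensor([[math.cos(t), -math.sin(t)],
                  [math.sin(t),  math.cos(t)]])

# Check U^T U = I, which implies U^{-1} = U^T
print(torch.allclose(U.T @ U, torch.eye(2), atol=1e-6))   # True
```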

SLIDE 39

Symmetric Matrices

A symmetric matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ is defined by

$\mathbf{A}^{\mathsf{T}} = \mathbf{A}$   (14)

or, in other words,

$a_{i,j} = a_{j,i} \quad \forall i, j \in [n]$   (15)

Comments:

◮ The identity matrix $\mathbf{I}$ is symmetric
◮ A diagonal matrix is symmetric

SLIDE 40

Quiz

The identity matrix $\mathbf{I}$ is

◮ a diagonal matrix?
◮ a symmetric matrix?
◮ an orthogonal matrix?

Further reference: [Kolter, 2015]

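A tiny PyTorch check of all three quiz properties (the answer to each is yes):

```python
import torch

I = torch.eye(3)

# Diagonal: rebuilding I from its diagonal leaves it unchanged
print(torch.equal(I, torch.diag(torch.diagonal(I))))   # True

# Symmetric: I^T = I, Eq. (14)
print(torch.equal(I.T, I))                             # True

# Orthogonal: I^T I = I
print(torch.equal(I.T @ I, I))                         # True
```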

SLIDE 42

Basic Probability Theory

SLIDE 43

What is Probability?

The probability of landing heads is 0.52


SLIDE 46

Two interpretations

Frequentist: Probability represents the long-run frequency of an event
◮ If we flip the coin many times, we expect it to land heads about 52% of the time

Bayesian: Probability quantifies our (un)certainty about an event
◮ We believe the coin has a 52% chance of landing heads on the next toss

Comments:
◮ In machine learning, we use both of them.
◮ Depending on convenience, we will choose which one to use

SLIDE 47

Bayesian Interpretation

Example scenarios of the Bayesian interpretation of probability:


SLIDE 49

Binary Random Variables

◮ Event $X$, such as
  ◮ the coin will land heads on the next toss
  ◮ it will rain tomorrow
◮ Sample space of $X$: $\{\text{False}, \text{True}\}$, or for simplicity $\{0, 1\}$
◮ Probability: $P(X = x)$, or $P(x)$ for short

Example: Tossing a coin

Let $X$ be the event that the coin will land heads on the next toss; then the probability from the previous example is

$P(X = 1) = 0.52$   (16)


SLIDE 51

Bernoulli Distribution

Given a binary random variable $X$ with sample space $\{0, 1\}$,

$P(X = x) = \theta^x (1 - \theta)^{1 - x}$

with a single parameter $\theta$ defined as $\theta = P(X = 1)$. [Portrait: Jacob Bernoulli]

Examples: Distribution of Binary Classes

◮ Sentiment classification: {Positive, Negative}
◮ Spam filtering: {True, False}
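A minimal sketch of this distribution with θ = 0.52 from the coin example, assuming PyTorch's torch.distributions:

```python
import torch
from torch.distributions import Bernoulli

# theta = P(X = 1), the probability of heads
dist = Bernoulli(probs=0.52)

# The pmf theta^x (1 - theta)^(1 - x) at x = 1 and x = 0
print(dist.log_prob(torch.tensor(1.0)).exp())   # 0.52
print(dist.log_prob(torch.tensor(0.0)).exp())   # 0.48

# Simulate ten coin tosses
print(dist.sample((10,)))
```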

SLIDE 52

Tossing a Die

How do we define the corresponding random variable?

◮ $X \in \{1, 2, 3, 4, 5, 6\}$
◮ $\boldsymbol{X} \in \{100000, 010000, 001000, 000100, 000010, 000001\}$ (one-hot vectors)


SLIDE 55

Categorical Distribution

The previous random event can be described with a categorical distribution

$P(\boldsymbol{X} = \boldsymbol{x}) = \prod_{k=1}^{6} \theta_k^{x_k}$   (17)

where

◮ $x_k \in \{0, 1\}$, and
◮ $\{\theta_k\}_{k=1}^{6}$ are the parameters of this distribution; $\theta_k$ is the probability of side $k$ showing up.

Example: Multiclass Classification

◮ Topic classification on news: Business, Technology, Sports, Science, etc.

Problem: Pick a random event, define its sample space and probability distribution
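A sketch of Eq. (17) for a fair six-sided die, assuming PyTorch's torch.distributions.OneHotCategorical:

```python
import torch
from torch.distributions import OneHotCategorical

# Parameters theta_1, ..., theta_6 for a fair die
theta = torch.full((6,), 1.0 / 6.0)
dist = OneHotCategorical(probs=theta)

# P(X = x) for the one-hot encoding of side 3, Eq. (17)
x = torch.tensor([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
print(dist.log_prob(x).exp())   # 1/6, about 0.1667

# Simulate three die rolls as one-hot vectors
print(dist.sample((3,)))
```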


SLIDE 58

Joint Probability and Independence

Modeling two random variables together with a joint probability distribution

$P(X, Y)$   (18)

If $X$ and $Y$ are independent,

$P(X, Y) = P(X) \cdot P(Y)$   (19)

where

$P(X) = \sum_{Y} P(X, Y)$   (20)
$P(Y) = \sum_{X} P(X, Y)$   (21)

are two marginal distributions.

◮ $X$: whether it is cloudy
◮ $Y$: whether it will rain

P(X, Y)   X = 0   X = 1
Y = 0     0.35    0.15
Y = 1     0.05    0.45

SLIDE 59

Conditional Probability

The conditional probability of $Y$ given $X$ is

$P(Y \mid X) = \frac{P(X, Y)}{P(X)}$   (22)

Example: document classification

◮ $X$: a document
◮ $Y$: the label of this document


SLIDE 62

Conditional Probability

◮ $X$: whether it is cloudy
◮ $Y$: whether it will rain

P(X, Y)   X = 0   X = 1
Y = 0     0.35    0.15
Y = 1     0.05    0.45

◮ $P(Y \mid X = 1)$:
  ◮ $P(Y = 0 \mid X = 1) = 0.25$
  ◮ $P(Y = 1 \mid X = 1) = 0.75$
◮ $P(Y)$: $P(Y = 0) = P(Y = 1) = 0.5$

Problem: Test whether two random variables are independent
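A sketch of that test for the joint table above, assuming PyTorch: compute the marginals via Eqs. (20) and (21) and compare $P(X)P(Y)$ with $P(X, Y)$.

```python
import torch

# Joint distribution P(X, Y); rows index Y, columns index X
P = torch.tensor([[0.35, 0.15],
                  [0.05, 0.45]])

# Marginals, Eqs. (20) and (21)
P_X = P.sum(dim=0)   # sum over Y -> P(X) = [0.4, 0.6]
P_Y = P.sum(dim=1)   # sum over X -> P(Y) = [0.5, 0.5]

# Independence, Eq. (19), would require P(X, Y) = P(X) * P(Y)
outer = torch.outer(P_Y, P_X)
print(torch.allclose(P, outer))   # False: X and Y are dependent
```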

SLIDE 63

Statistical Estimation

SLIDE 64

Statistics vs. Probability Theory

Statistics is, in a certain sense, the inverse of probability theory.

$P(X = x) = \theta^x (1 - \theta)^{1 - x}$   (23)

◮ Observed: values of random variables
  ◮ Observations of coin tossing: {0, 1, 1, 0, 0, 1, 0}
◮ Unknown: the model parameter $\theta$
◮ Task: infer the model parameter from the observed data

SLIDE 65

Likelihood-based Estimation

For a probability distribution $P(X; \theta)$ with $\theta$ as the unknown parameter, likelihood-based estimation with observations $\{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$ requires two steps:

1. Based on the distribution, define a likelihood function with the observations
2. Optimize the likelihood function to estimate $\theta$


SLIDE 67

Step I: Define the Likelihood Function

With observations $\{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$, the likelihood function is defined as

$L(\theta) = \prod_{i=1}^{n} P(x^{(i)}; \theta)$   (24)

Alternatively, the corresponding log-likelihood function is

$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log P(x^{(i)}; \theta)$   (25)


SLIDE 69

Step II: Maximum Likelihood Estimation

The value of $\theta$ can be estimated by maximizing the (log-)likelihood function:

$\hat{\theta} \leftarrow \operatorname{argmax}_{\theta}\, \ell(\theta)$   (26)

Usually, for simple problems, we can solve this by setting

$\frac{d\ell(\theta)}{d\theta} = \sum_{i=1}^{n} \frac{d \log P(x^{(i)}; \theta)}{d\theta} = 0$   (27)


SLIDE 73

Example: Bernoulli Distribution

Consider a Bernoulli distribution $P(X; \theta)$ with the parameter $\theta = P(X = 1; \theta)$ unknown:

$P(X = x; \theta) = \theta^x (1 - \theta)^{1 - x}$   (28)

With $n$ observations $\{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$, the log-likelihood function is

$\ell(\theta) = \sum_{i=1}^{n} \log P(x^{(i)}; \theta) = \sum_{i=1}^{n} \big\{ x^{(i)} \log \theta + (1 - x^{(i)}) \log(1 - \theta) \big\}$   (29)

Setting $\frac{\partial \ell(\theta)}{\partial \theta} = 0$, we have

$\theta = \frac{\sum_{i=1}^{n} x^{(i)}}{n}$   (30)

Problem: Prove Equation (30)
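A sketch of the proof asked for above, filling in the differentiation step:

```latex
\begin{align*}
\frac{\partial \ell(\theta)}{\partial \theta}
  &= \sum_{i=1}^{n} \left\{ \frac{x^{(i)}}{\theta} - \frac{1 - x^{(i)}}{1 - \theta} \right\}
   = \frac{s}{\theta} - \frac{n - s}{1 - \theta} = 0,
   \qquad \text{where } s = \sum_{i=1}^{n} x^{(i)} \\
% Cross-multiplying and simplifying:
(1 - \theta)\, s &= \theta\,(n - s)
  \;\Longrightarrow\; s - \theta s = \theta n - \theta s
  \;\Longrightarrow\; \theta = \frac{s}{n} = \frac{\sum_{i=1}^{n} x^{(i)}}{n}
\end{align*}
```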


SLIDE 75

Example: Bernoulli Distribution (II)

Assume the $n = 7$ observations are $\{0, 1, 1, 0, 0, 1, 0\}$; then

$\theta = \frac{3}{7}$   (31)

Quiz

How do we maximize the log-likelihood $\ell(\theta)$ if we cannot find a closed-form solution of the equation $\frac{d\ell(\theta)}{d\theta} = 0$?
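One standard answer is numerical optimization, e.g., gradient-based methods. A sketch in PyTorch that recovers θ ≈ 3/7 for the observations above (learning rate and step count are illustrative):

```python
import torch

x = torch.tensor([0., 1., 1., 0., 0., 1., 0.])

# Unconstrained parameter; theta = sigmoid(w) keeps theta in (0, 1)
w = torch.zeros(1, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

for _ in range(2000):
    theta = torch.sigmoid(w)
    # Negative log-likelihood, Eq. (29); minimizing it maximizes l(theta)
    nll = -(x * torch.log(theta) + (1 - x) * torch.log(1 - theta)).sum()
    opt.zero_grad()
    nll.backward()
    opt.step()

print(torch.sigmoid(w).item())   # about 3/7 = 0.4286
```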

SLIDE 76

Reference

Kolter, Z. (2015). Linear algebra review and reference.