IN5550: Neural Methods in Natural Language Processing Lecture 2 - - PowerPoint PPT Presentation



SLIDE 1

IN5550: Neural Methods in Natural Language Processing
Lecture 2: Supervised Machine Learning: from Linear Models to Neural Networks

Andrey Kutuzov, Vinit Ravishankar, Jeremy Barnes, Lilja Øvrelid, Stephan Oepen, & Erik Velldal

University of Oslo

21 January 2020

SLIDE 2

Contents

1. Introduction
2. Basics of supervised machine learning
3. Linear classifiers
4. Training as optimization
5. Limitations of linear models
6. Going deeply non-linear: multi-layered perceptrons
7. Next lecture on January 28
8. Before the next lecture

SLIDES 3-8

Introduction

I am Jeremy Barnes. I will give the next two lectures, covering the following topics:

◮ a review of supervised learning (introducing notation)
◮ linear classifiers and simple feed-forward neural networks
◮ multi-layer neural networks and training

SLIDES 9-14

Introduction

Technicalities

◮ Make sure to familiarize yourself with the course infrastructure.
◮ Check the course page for messages.
◮ Test whether you can access https://github.uio.no/in5550/2020
  ◮ Make sure to update your UiO GitHub profile with your photo, and star the course repository :-)
◮ Most of machine learning revolves around linear algebra.
  ◮ We created a LinAlg cheat sheet for this course.
  ◮ It is linked from the course page and adapted to the notation of [Goldberg, 2017].


SLIDES 16-22

Basics of supervised machine learning

SLIDES 23-28

Basics of supervised machine learning

◮ Input 1: a training set of n training instances x_{1:n} = x_1, x_2, ..., x_n
  ◮ for example, e-mail messages.
◮ Input 2: corresponding 'gold' labels for these instances y_{1:n} = y_1, y_2, ..., y_n
  ◮ for example, whether the message is spam (1) or not (0).
◮ The trained model lets us make label predictions for unseen instances.
◮ Generally: some program for mapping instances to labels.

SLIDES 29-31

Basics of supervised machine learning

Recap on data split

◮ Recall: we want the model to make good predictions for unseen data.
◮ It should not overfit to the seen data.

SLIDES 32-34

Basics of supervised machine learning

Remember: we want models that generalize

◮ Thus, datasets are usually split into:
  1. train data;
  2. validation/development data (optional);
  3. test/held-out data.
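The three-way split above can be sketched in plain Python (a minimal illustration; the 80/10/10 proportions and the helper name are assumptions, not prescribed by the slides):

```python
import random

def train_dev_test_split(instances, labels, dev_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle the data once, then carve off dev and test portions."""
    indices = list(range(len(instances)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(indices) * test_frac)
    n_dev = int(len(indices) * dev_frac)
    test_idx = indices[:n_test]
    dev_idx = indices[n_test:n_test + n_dev]
    train_idx = indices[n_test + n_dev:]
    pick = lambda idx: ([instances[i] for i in idx], [labels[i] for i in idx])
    return pick(train_idx), pick(dev_idx), pick(test_idx)

# 100 toy instances with binary labels:
X = [f"message {i}" for i in range(100)]
y = [i % 2 for i in range(100)]
train, dev, test = train_dev_test_split(X, y)
# train/dev/test sizes: 80 / 10 / 10
```

Shuffling before splitting matters: if the data are ordered (say, all spam first), an unshuffled split would give the model a skewed view of each portion.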

SLIDES 35-39

Basics of supervised machine learning

◮ To recap, we want to find a function which makes good, generalizable predictions for our task.
◮ Searching among all possible functions is infeasible.
◮ To cope with that, we choose an inductive bias...
◮ ...and set some hypothesis class...
◮ ...to search only within this class.

SLIDES 40-44

Basics of supervised machine learning

What do you think a good model would look like for this data?

SLIDE 45

Basics of supervised machine learning

Linear functions: a popular hypothesis class


SLIDES 47-48

Linear classifiers

Simple linear function:

f(x; W, b) = x · W + b    (1)

◮ Function input:
  ◮ feature vector x ∈ R^d_in;
  ◮ each training instance is represented with d_in features;
  ◮ for example, some properties of the documents.
◮ Function parameters θ:
  ◮ matrix W ∈ R^(d_in × d_out)
    ◮ d_out is the dimensionality of the desired prediction (the number of classes)
  ◮ bias vector b ∈ R^d_out
    ◮ the bias 'shifts' the function output in some direction.
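Equation (1) maps a d_in-dimensional feature vector to d_out scores with one matrix-vector product. A minimal NumPy sketch, assuming toy dimensions (d_in = 4, d_out = 3) and randomly initialized parameters:

```python
import numpy as np

d_in, d_out = 4, 3  # illustrative dimensions

rng = np.random.default_rng(0)
W = rng.normal(size=(d_in, d_out))  # weight matrix, part of theta
b = np.zeros(d_out)                 # bias vector, part of theta

def f(x, W, b):
    """Simple linear function: f(x; W, b) = x . W + b"""
    return x @ W + b

x = np.ones(d_in)    # a toy feature vector
scores = f(x, W, b)  # one score per output class; shape (d_out,)
```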

SLIDES 49-51

Linear classifiers

Training of a linear classifier

f(x; W, b) = x · W + b
θ = (W, b)

◮ Training is finding the optimal θ.
◮ 'Optimal' means 'producing predictions ŷ closest to the gold labels y on our n training instances'.
◮ Ideally, ŷ = y.

SLIDES 52-56

Linear classifiers

Here, training instances are represented with 2 features each (x = [x_0, x_1]) and labeled with 2 class labels (y = {black, red}):

◮ Parameters of f(x; W, b) = x · W + b define the line (or hyperplane) separating the instances.
◮ This decision boundary is actually our learned classifier.
◮ NB: the dataset on the plot is linearly separable.
◮ Question: lines with 3 values of b are shown. Which is the best?

SLIDES 57-62

Linear classifiers

How can we represent our data (X)?

◮ Imagine you have a review of a film and you want to know if the reviewer likes the film or hates it.
◮ What are the simplest features that would help you decide this?
  ◮ 'good', 'bad', 'great', 'terrible', etc.
  ◮ maybe actors' names (Meryl Streep, Steven Segal)
◮ The simplest way to represent these words as features is a bag-of-words representation.

SLIDES 63-66

Linear classifiers

Bag of words

◮ Each word from a pre-defined vocabulary D can be a separate feature.
◮ How many times does the word a appear in document i?
  ◮ or a binary flag {1, 0} of whether a appeared in i at all or not.
◮ This scheme is called 'bag of words' (BoW).
◮ For example, if we have 1000 words in the vocabulary:
  ◮ x_i ∈ R^1000
  ◮ x_i = [20, 16, 0, 10, 0, ..., 3]

SLIDES 67-71

Linear classifiers

◮ The bag-of-words feature vector of x can be interpreted as a sum of one-hot vectors (o), one for each token in it:
◮ D extracted from the text above contains 10 words (lowercased): {'-', 'by', 'in', 'most', 'norway', 'road', 'the', 'tourists', 'troll', 'visited'}.
  ◮ o_0 = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
  ◮ o_1 = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
  ◮ etc...
◮ x = [1, 1, 1, 1, 1, 2, 2, 1, 1, 1] ('the' and 'road' are mentioned 2 times)
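The sum-of-one-hots view can be sketched as follows. The vocabulary is the one listed on the slide; the token sequence is a hypothetical reconstruction chosen only to match the stated counts ('the' and 'road' twice, every other vocabulary word once):

```python
import numpy as np

# Vocabulary from the slide, in the order given there.
vocab = ['-', 'by', 'in', 'most', 'norway', 'road', 'the', 'tourists',
         'troll', 'visited']
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """One-hot vector o for a single token."""
    o = np.zeros(len(vocab))
    o[word_to_idx[word]] = 1.0
    return o

def bow(tokens):
    """Bag-of-words vector: the sum of the tokens' one-hot vectors."""
    return sum((one_hot(t) for t in tokens), np.zeros(len(vocab)))

# Hypothetical token sequence consistent with the counts on the slide:
tokens = ['the', 'troll', 'road', 'in', 'norway', '-', 'the', 'most',
          'visited', 'road', 'by', 'tourists']
x = bow(tokens)
# x == [1, 1, 1, 1, 1, 2, 2, 1, 1, 1]
```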

SLIDES 72-76

Linear classifiers

Can we interpret the different parts of a learned model as representations of the data?

◮ Each of the n instances (documents) is represented by a vector of features (x ∈ R^d_in).
◮ Inversely, each feature can be represented by a vector of the instances (documents) it appears in (feature ∈ R^n).
◮ Together these learned representations form the W matrix, part of θ.
  ◮ Thus, it contains data both about the instances and their features (more about this later).
◮ Feature engineering is deciding which features of the instances we will use during training.

SLIDES 77-80

Linear classifiers

[Figure: a weight table with columns 'positive', 'neutral', 'negative' and rows for the features 'great', 'best', 'terrible', 'worst', 'Segal', 'the', 'road' (numeric weights not captured in this transcript).]

SLIDE 81

Linear classifiers

Overview of Linear Models

SLIDES 82-86

Linear classifiers

f(x; W, b) = x · W + b

Output of binary classification

Binary decision (d_out = 1):

◮ 'Is this message spam or not?'
◮ W is a vector, b is a scalar.
◮ The prediction ŷ is also a scalar: either 1 ('yes') or −1 ('no').
◮ NB: the model can output any number, but we convert all negatives to −1 and all positives to 1 (the sign function).

θ = (W ∈ R^d_in, b ∈ R^1)
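The binary decision rule can be sketched as follows; the feature and weight values are made up to produce a raw score of 1.5, so the prediction is sign(1.5) = 1, as in the worked example:

```python
import numpy as np

def predict_binary(x, W, b):
    """Binary linear classifier: sign of x . W + b, in {-1, +1}."""
    score = float(x @ W + b)
    return 1 if score >= 0 else -1

# Toy parameters (illustrative values, not learned):
W = np.array([1.0, 1.0, 0.0, -0.5])
b = 0.5
x = np.array([1.0, 1.0, 0.0, 2.0])
# score = 1 + 1 + 0 - 1 + 0.5 = 1.5, so the prediction is sign(1.5) = 1
y_hat = predict_binary(x, W, b)
```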

SLIDE 87

Linear classifiers

[Figure: a worked example of binary prediction — a binary feature vector is multiplied by a weight vector, the bias 0.5 is added, and the result is sign(1.5) = 1.]

SLIDES 88-93

Linear classifiers

f(x; W, b) = x · W + b

Output of multi-class classification

Multi-class decision (d_out = k):

◮ 'Which of k candidates authored this text?'
◮ W is a matrix, b is a vector of k components.
◮ The prediction ŷ is also a one-hot vector of k components.
◮ The component corresponding to the correct author has the value 1, the others are zeros, for example:

ŷ = [0, 0, 1, 0] (for k = 4)

θ = (W ∈ R^(d_in × d_out), b ∈ R^d_out)
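The multi-class rule can be sketched as follows (toy dimensions and randomly initialized parameters, assumed for illustration): the linear scores are computed as in (1), and the highest-scoring class is encoded as a one-hot vector.

```python
import numpy as np

k, d_in = 4, 5  # illustrative: 4 candidate authors, 5 features

rng = np.random.default_rng(1)
W = rng.normal(size=(d_in, k))  # weight matrix
b = np.zeros(k)                 # bias vector with k components

def predict_one_hot(x, W, b):
    """Pick the highest-scoring class and encode it as a one-hot vector."""
    scores = x @ W + b
    y_hat = np.zeros(k)
    y_hat[np.argmax(scores)] = 1.0
    return y_hat

x = rng.normal(size=d_in)
y_hat = predict_one_hot(x, W, b)  # a one-hot vector such as [0, 0, 1, 0]
```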

SLIDE 94

Linear classifiers

[Figure: a worked example of multi-class prediction — a binary feature vector is multiplied by a weight matrix, giving the class scores [1, 2, 2, 4]; argmax selects the highest-scoring class.]

SLIDES 95-97

Linear classifiers

Log-linear classification

If we care about how confident the classifier is about each decision:

◮ Map the predictions to the range [0, 1]...
◮ ...by a squashing function, for example, the sigmoid:

ŷ = σ(f(x)) = 1 / (1 + e^(−f(x)))    (2)

◮ The result is the probability of the prediction!
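A minimal sketch of equation (2) in plain Python:

```python
import math

def sigmoid(z):
    """Squash a raw score into (0, 1): sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# A raw linear score of 0 maps to probability 0.5;
# large positive scores approach 1, large negative scores approach 0.
p = sigmoid(1.5)
```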

SLIDES 98-101

Linear classifiers

◮ For multi-class cases, log-linear models produce probabilities for all classes, for example: ŷ = [0.4, 0.1, 0.9, 0.5] (for k = 4)
◮ We choose the one with the highest score:

ŷ = argmax_i ŷ[i] = ŷ[2]    (3)

◮ But often it is more convenient to transform the scores into a probability distribution, using the softmax function:

ŷ = softmax(xW + b)    (4)

ŷ[i] = e^((xW + b)[i]) / Σ_j e^((xW + b)[j])    (5)

◮ ŷ = softmax([0.4, 0.1, 0.9, 0.5]) = [0.22, 0.16, 0.37, 0.25]
  ◮ (all scores sum to 1)
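A minimal sketch of the softmax in equations (4)-(5), reproducing the numbers on the slide:

```python
import math

def softmax(scores):
    """Turn raw class scores into a probability distribution."""
    # Subtracting the max is a standard numerical-stability trick;
    # it does not change the result.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([0.4, 0.1, 0.9, 0.5])
# probs is approximately [0.22, 0.16, 0.37, 0.25] and sums to 1
```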

SLIDE 102

Linear classifiers

[Figure: the same worked example with softmax — the class scores from the linear function are passed through softmax, turning them into a probability distribution.]

SLIDE 103

Linear classifiers

Break


SLIDES 105-110

Training as optimization

◮ The goal of training is to find the optimal values of the parameters in θ.
◮ Formally, this means minimizing the loss L(θ) on the training or development dataset.
◮ Conceptually, the loss is a measure of how 'far away' the model predictions ŷ are from the gold labels y.
◮ Formally, it can be any function L(ŷ, y) returning a scalar value:
  ◮ for example, L = (y − ŷ)² (squared error)
◮ It is averaged over all training instances and gives us an estimate of the model's 'fitness'.
◮ θ̂ is the best set of parameters:

θ̂ = argmin_θ L(θ)    (6)
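To make equation (6) concrete, here is a sketch that minimizes a squared-error loss for a one-feature linear model by gradient descent (gradient descent itself is an assumption here; the slides only state the arg-min objective):

```python
# Fit a one-feature linear model f(x) = w*x + b by gradient descent
# on the mean squared error.

def mse(w, b, xs, ys):
    """Mean squared error of the model (w, b) on the data."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data generated by the 'true' function y = 2x + 1:
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0  # initial parameters theta
lr = 0.05        # learning rate
for _ in range(2000):
    # Analytic gradients of the mean squared error w.r.t. w and b:
    grad_w = sum(-2 * x * (y - (w * x + b)) for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(-2 * (y - (w * x + b)) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad_w
    b -= lr * grad_b
# w and b approach 2 and 1, and the loss approaches 0
```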

SLIDES 111-121

Training as optimization

How do you choose a loss function?

1. It depends on your task...
   ◮ regression: mean absolute error, mean squared error, ...
   ◮ classification: hinge loss, cross-entropy, ...
   ◮ ranking: ranking loss, triplet loss
2. A mix of theoretical and practical desires often determines your final choice.
   ◮ For classification, we often use some variant of...
     ◮ ...hinge loss (max margin)
     ◮ ...cross-entropy loss

slide-122
SLIDE 122

Training as optimization

Common loss functions

  • 1. Hinge (binary): L(ŷ, y) = max(0, 1 − y · ŷ)
  • 2. Hinge (multi-class): L(ŷ, y) = max(0, 1 − (ŷ[t] − ŷ[k])), where t is the correct class and k is the highest-scoring incorrect class
  • 3. Log loss: L(ŷ, y) = log(1 + exp(−(ŷ[t] − ŷ[k])))
  • 4. Binary cross-entropy (logistic loss): L(ŷ, y) = −y log ŷ − (1 − y) log(1 − ŷ)
  • 5. Categorical cross-entropy (negative log-likelihood): L(ŷ, y) = −Σᵢ y[i] log(ŷ[i])
  • 6. Ranking losses, etc.

30
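As a concrete reference, the binary hinge, binary cross-entropy, and categorical cross-entropy losses above can be sketched in plain Python (a minimal illustration, not an optimized implementation; in practice these come ready-made in the libraries we will use):

```python
import math

# Binary hinge loss: y in {-1, +1}, y_hat is the raw classifier score.
def hinge_binary(y_hat, y):
    return max(0.0, 1.0 - y * y_hat)

# Binary cross-entropy (logistic loss): y in {0, 1}, y_hat in (0, 1).
def binary_cross_entropy(y_hat, y):
    return -y * math.log(y_hat) - (1.0 - y) * math.log(1.0 - y_hat)

# Categorical cross-entropy: y is a one-hot vector, y_hat a probability vector.
def categorical_cross_entropy(y_hat, y):
    return -sum(yi * math.log(pi) for yi, pi in zip(y, y_hat) if yi > 0)
```

Note how the hinge loss is exactly zero once the score clears the margin, while cross-entropy keeps rewarding more confident correct predictions.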

slide-128
SLIDE 128

Training as optimization

Regularization
◮ Sometimes, so as not to overfit, we pose restrictions on the possible θ.
◮ We would like θ to be not only good at prediction but also not too complex; it should be ‘lean’ and avoid large weights.

Why do you think this is?

◮ We can live with some errors on the training data if that gives us more generalization power.
◮ For that, we minimize both the loss and a regularization term R(θ):

θ̂ = arg min_θ L(θ) + λR(θ)   (7)

◮ The hyperparameter λ is the regularization weight (how important the term is).
◮ Common regularization terms:
  • 1. L2 norm (Gaussian prior or weight decay);
  • 2. L1 norm (sparse prior or lasso).

31
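Equation (7) is easy to make concrete: R(θ) is just a norm of the weight vector, added to the loss with weight λ. A minimal sketch (the helper names are illustrative):

```python
def l2_norm(theta):
    # Squared L2 norm: sum of squared weights (penalized by weight decay).
    return sum(w * w for w in theta)

def l1_norm(theta):
    # L1 norm: sum of absolute weights (encourages sparse solutions).
    return sum(abs(w) for w in theta)

def regularized_objective(loss, theta, lam, reg=l2_norm):
    # Equation (7): L(theta) + lambda * R(theta).
    return loss + lam * reg(theta)
```

With λ = 0 we recover plain loss minimization; larger λ trades training accuracy for smaller weights.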

slide-134
SLIDE 134

Training as optimization

Now we can measure model performance. How can we change our parameters θ to improve it?

32

slide-135
SLIDE 135

Training as optimization

  • 1. We could just randomly change some of the parameters and see if the result is better (the hill-climbing algorithm)...
  • 2. or we could be smarter about it (gradient-based methods).

33

slide-136
SLIDE 136

Training as optimization

Optimizing with gradients
◮ θ̂ = arg min_θ (L(θ) + λR(θ)) is an optimization problem.
◮ Commonly solved using gradient methods:
  • 1. compute the loss,
  • 2. compute the gradient of the loss with respect to the parameters θ
  • 3. (the gradient is the collection of partial derivatives, one for each parameter in θ),
  • 4. move the parameters in the opposite direction of the gradient (to decrease the loss),
  • 5. repeat until an optimum is found (the derivative is 0) or until the pre-defined number of iterations (epochs) is reached.

Convexity
◮ Convex functions: a single optimum point.
◮ Non-convex functions: multiple optimum points.

34
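The five steps above can be sketched as a short loop. Here we minimize the toy convex loss L(θ) = (θ − 3)², whose gradient is 2(θ − 3) (an assumed example, not from the slides), so gradient descent reaches the global minimum at θ = 3:

```python
def gradient_descent(grad, theta, lr=0.1, epochs=100):
    # Repeatedly compute the gradient and move theta in the
    # opposite direction, for a fixed number of epochs.
    for _ in range(epochs):
        theta = theta - lr * grad(theta)
    return theta

# Convex loss L(theta) = (theta - 3)^2 has gradient 2 * (theta - 3).
theta_hat = gradient_descent(lambda t: 2 * (t - 3), theta=0.0)
```

On a non-convex loss the same loop would run just as happily, but could settle in whichever local optimum the initial parameters lead to.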

slide-143
SLIDE 143

Training as optimization

[Figure: gradient descent on an error surface: starting from the initial parameters, compute the gradient, update the parameters, and repeat until the global minimum is reached.]

35

slide-148
SLIDE 148

Training as optimization

Error surfaces of convex and non-convex functions (figure panels: convex function, non-convex function):
◮ Convex functions can easily be minimized with gradient methods, reaching the global optimum.
◮ With non-convex functions, optimization can end up in a local optimum.
◮ Linear and log-linear models as a rule have convex error functions.

36

slide-151
SLIDE 151

Training as optimization

Stochastic gradient descent (SGD)
◮ SGD samples one instance from the training set and computes the error and the gradient on it,
◮ then θ is updated in the opposite direction of the gradient,
◮ the update is scaled by the learning rate η (which can decay over training time),
◮ repeat until convergence.
Instead of one instance, mini-batches can be used (more stable and computationally efficient).

37
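A minimal SGD sketch in plain Python, for an assumed toy model ŷ = w · x with squared error (the data and model here are illustrative, not from the slides):

```python
import random

def sgd(data, w=0.0, lr=0.05, epochs=200, seed=0):
    # Sample one (x, y) instance, compute the gradient of the squared
    # error on it, and update w in the opposite direction, scaled by lr.
    rng = random.Random(seed)
    for _ in range(epochs):
        x, y = rng.choice(data)
        grad = 2 * (w * x - y) * x   # d/dw of (w*x - y)^2
        w -= lr * grad
    return w

# Data generated by y = 2x: SGD should recover w close to 2.
data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0]]
w_hat = sgd(data)
```

Replacing the single sampled instance with the average gradient over a small batch gives mini-batch SGD, which smooths out the noisy per-instance updates.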

slide-157
SLIDE 157

Training as optimization

Other gradient-based optimizers:
◮ Momentum
◮ AdaGrad
◮ RMSProp
◮ Adam
◮ etc.
All implemented in the libraries we are going to use: PyTorch, Scikit-Learn, TensorFlow, Keras, etc.

38
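As one example of these optimizers, here is a single-parameter sketch of the standard Adam update (exponential moving averages of the gradient and its square, with bias correction); this is an illustration of the update rule, not the library implementation:

```python
def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # m and v are running averages of the gradient and its square;
    # t is the 1-based step count used for bias correction.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    theta -= lr * m_hat / (v_hat ** 0.5 + eps)
    return theta, m, v

# On the first step the bias-corrected averages equal the raw gradient,
# so the update size is roughly lr, regardless of the gradient's scale.
theta, m, v = adam_step(0.0, 10.0, 0.0, 0.0, t=1)
```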

slide-158
SLIDE 158

Contents

1

Introduction

2

Basics of supervised machine learning

3

Linear classifiers

4

Training as optimization

5

Limitations of linear models

6

Going deeply non-linear: multi-layered perceptrons

7

Next lecture on January 28

8

Before the next lecture

38

slide-159
SLIDE 159

Limitations of linear models

◮ Linear classifiers are efficient and effective.
◮ They can be used on their own (often enough in practice)...
◮ ...or as building blocks for non-linear neural classifiers.
◮ Unfortunately, linear models can represent only linear relations in the data.

39

slide-162
SLIDE 162

Limitations of linear models

◮ Are there non-linear functions that linear models can’t deal with?
◮ Yes, there are.
◮ One example is the XOR function: it is clearly not linearly separable.

40

slide-166
SLIDE 166

Limitations of linear models

Possible solutions
◮ We can transform the input so that it becomes linearly separable.
◮ Linear transformations will not be able to do this.
◮ We need non-linear transformations.
◮ For example, φ(x1, x2) = [x1 × x2, x1 + x2] maps the instances to another representation and makes the XOR problem linearly separable.

41
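This mapping is easy to check in code: after applying φ, a single linear threshold classifies XOR correctly (the particular separating line below is hand-picked for illustration; any of infinitely many would do):

```python
def phi(x1, x2):
    # The mapping from the slide: phi(x1, x2) = [x1 * x2, x1 + x2].
    return (x1 * x2, x1 + x2)

def linear_classifier(a, b):
    # One hand-picked separating line in the mapped space:
    # predict 1 when -2a + b - 0.5 > 0.
    return 1 if -2 * a + b - 0.5 > 0 else 0

# XOR: label 1 exactly when the two inputs differ.
xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
predictions = {x: linear_classifier(*phi(*x)) for x in xor}
```

In the mapped space, (0,1) and (1,0) collapse onto the same point (0,1), while (0,0) and (1,1) land at (0,0) and (1,2); a straight line now separates the classes.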

slide-170
SLIDE 170

Limitations of linear models

◮ But how do we find a transformation function φ suitable for the task at hand?
◮ Often this implies mapping instances to a higher-dimensional space, making it even more difficult to choose φ manually.
◮ Support Vector Machine (SVM) classifiers handle this to some extent [Cortes and Vapnik, 1995]...
◮ ...but their training time scales linearly with the size of the training data (slow!).

42

slide-174
SLIDE 174

Limitations of linear models

Training mapping functions
◮ Idea: leave it to the algorithm to learn a suitable representation-mapping function!

ŷ = φ(x)W + b   (8)
φ(x) = g(xW′ + b′)   (9)

◮ ...where g is a non-linear activation function, and W′, b′ are its trainable parameters.
◮ The equations above define a simple multi-layer perceptron (MLP): our first neural model.

43
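Equations (8) and (9) can be written out directly. With one hand-chosen set of weights (an illustrative assumption; in practice W, b, W′, b′ are learned) and g = ReLU, this tiny MLP computes XOR:

```python
def relu(z):
    # Elementwise non-linear activation g.
    return [max(0.0, v) for v in z]

def affine(x, W, b):
    # x @ W + b, with W indexed as W[input][output].
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def mlp(x):
    # Equation (9): phi(x) = g(x W' + b'), hand-chosen W', b'.
    W1, b1 = [[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0]
    h = relu(affine(x, W1, b1))
    # Equation (8): y_hat = phi(x) W + b.
    W2, b2 = [[1.0], [-2.0]], [0.0]
    return affine(h, W2, b2)[0]
```

Without the ReLU, the two affine maps would compose into a single linear map, and no choice of weights could represent XOR; the non-linearity is what adds expressive power.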

slide-177
SLIDE 177

Contents

1

Introduction

2

Basics of supervised machine learning

3

Linear classifiers

4

Training as optimization

5

Limitations of linear models

6

Going deeply non-linear: multi-layered perceptrons

7

Next lecture on January 28

8

Before the next lecture

43

slide-178
SLIDE 178

Going deeply non-linear: multi-layered perceptrons

Perceptron with 2 hidden layers

44

slide-179
SLIDE 179

Going deeply non-linear: multi-layered perceptrons

The nature of perceptrons
◮ Input data goes through successive transformations at each layer.
◮ The transformations are linear, but each hidden layer is followed by a non-linear activation.
◮ At the last layer, the prediction ŷ is produced.
◮ The representation functions and the linear classifier are trained simultaneously.
Important: neural networks with hidden layers can theoretically approximate any function [Leshno et al., 1993].

45

slide-184
SLIDE 184

Going deeply non-linear: multi-layered perceptrons

Perceptron with 2 hidden layers

46

slide-185
SLIDE 185

Contents

1

Introduction

2

Basics of supervised machine learning

3

Linear classifiers

4

Training as optimization

5

Limitations of linear models

6

Going deeply non-linear: multi-layered perceptrons

7

Next lecture on January 28

8

Before the next lecture

46

slide-186
SLIDE 186

Next lecture on January 28

Training Deep Neural Networks
◮ More on multi-layer perceptrons and feed-forward neural networks.
◮ Is it really like the brain?
◮ Common activation functions.
◮ Regularizing neural networks with dropout.
◮ Computation graphs.

47

slide-187
SLIDE 187

Contents

1

Introduction

2

Basics of supervised machine learning

3

Linear classifiers

4

Training as optimization

5

Limitations of linear models

6

Going deeply non-linear: multi-layered perceptrons

7

Next lecture on January 28

8

Before the next lecture

47

slide-188
SLIDE 188

Before the next lecture

Obligatory assignment
◮ The first obligatory assignment is already out!
◮ Due February 7.
◮ Look it up on the course page.

Group session
◮ Hands-on: using linear classifiers and MLPs in scikit-learn.
◮ Document classification with BOW features.
◮ Make sure to apply for a Saga account!

48

slide-191
SLIDE 191

References I

Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.
Goldberg, Y. (2017). Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies, 10(1):1–309.
Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867.

49