IN5550: Neural Methods in Natural Language Processing Lecture 2 - - PowerPoint PPT Presentation



SLIDE 1

IN5550: Neural Methods in Natural Language Processing
Lecture 2: Supervised Machine Learning: from Linear Models to Neural Networks

Andrey Kutuzov, Vinit Ravishankar, Jeremy Barnes, Lilja Øvrelid, Stephan Oepen, & Erik Velldal

University of Oslo

21 January 2020

SLIDE 2

Contents

1. Introduction
2. Basics of supervised machine learning
3. Linear classifiers
4. Training as optimization
5. Limitations of linear models
6. Going deeply non-linear: multi-layered perceptrons
7. Next lecture on January 28
8. Before the next lecture

SLIDES 3-8

Introduction

I am Jeremy Barnes. I will give the next two lectures, covering the following topics:

◮ a review of supervised learning (introducing notation)
◮ linear classifiers and simple feed-forward neural networks
◮ multi-layer neural networks and training

SLIDES 9-14

Introduction

Technicalities

◮ Make sure to familiarize yourself with the course infrastructure.
◮ Check the course page for messages.
◮ Test whether you can access https://github.uio.no/in5550/2020
  ◮ Make sure to update your UiO GitHub profile with your photo, and star the course repository :-)
◮ Most of machine learning revolves around linear algebra.
  ◮ We created a LinAlg cheat sheet for this course.
  ◮ It is linked from the course page and adapted to the notation of [Goldberg, 2017].


SLIDES 16-22

Basics of supervised machine learning

SLIDES 23-28

Basics of supervised machine learning

◮ Input 1: a training set of n training instances x_{1:n} = x_1, x_2, ..., x_n
  ◮ for example, e-mail messages.
◮ Input 2: corresponding 'gold' labels for these instances y_{1:n} = y_1, y_2, ..., y_n
  ◮ for example, whether the message is spam (1) or not (0).
◮ The trained model lets us make label predictions for unseen instances.
◮ Generally: some program for mapping instances to labels.

SLIDES 29-31

Basics of supervised machine learning

Recap on data split

◮ Recall: we want the model to make good predictions for unseen data.
◮ It should not overfit to the seen data.

SLIDES 32-34

Basics of supervised machine learning

Remember: we want models that generalize

◮ Thus, datasets are usually split into:
  1. train data;
  2. validation/development data (optional);
  3. test/held-out data.
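The three-way split above can be sketched in plain Python (a minimal illustration; the 80/10/10 proportions and the helper name are assumptions, not prescribed by the slides):

```python
import random

def train_dev_test_split(instances, labels, dev_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle the data once, then carve off dev and test portions."""
    indices = list(range(len(instances)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(indices) * test_frac)
    n_dev = int(len(indices) * dev_frac)
    test_idx = indices[:n_test]
    dev_idx = indices[n_test:n_test + n_dev]
    train_idx = indices[n_test + n_dev:]
    pick = lambda idx: ([instances[i] for i in idx], [labels[i] for i in idx])
    return pick(train_idx), pick(dev_idx), pick(test_idx)

# 100 toy instances with binary labels:
X = [f"message {i}" for i in range(100)]
y = [i % 2 for i in range(100)]
train, dev, test = train_dev_test_split(X, y)
# train/dev/test sizes: 80 / 10 / 10
```

Shuffling before splitting matters: if the data are ordered (say, all spam first), an unshuffled split would give the model a skewed view of each portion.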

SLIDES 35-39

Basics of supervised machine learning

◮ To recap, we want to find a function which makes good, generalizable predictions for our task.
◮ Searching among all possible functions is infeasible.
◮ To cope with that, we choose an inductive bias...
◮ ...and set some hypothesis class...
◮ ...to search only within this class.

SLIDES 40-44

Basics of supervised machine learning

What do you think a good model would look like for this data?

SLIDE 45

Basics of supervised machine learning

Linear functions: a popular hypothesis class


SLIDES 47-48

Linear classifiers

Simple linear function:

f(x; W, b) = x · W + b    (1)

◮ Function input:
  ◮ feature vector x ∈ R^d_in;
  ◮ each training instance is represented with d_in features;
  ◮ for example, some properties of the documents.
◮ Function parameters θ:
  ◮ matrix W ∈ R^(d_in × d_out)
    ◮ d_out is the dimensionality of the desired prediction (the number of classes)
  ◮ bias vector b ∈ R^d_out
    ◮ the bias 'shifts' the function output in some direction.
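Equation (1) maps a d_in-dimensional feature vector to d_out scores with one matrix-vector product. A minimal NumPy sketch, assuming toy dimensions (d_in = 4, d_out = 3) and randomly initialized parameters:

```python
import numpy as np

d_in, d_out = 4, 3  # illustrative dimensions

rng = np.random.default_rng(0)
W = rng.normal(size=(d_in, d_out))  # weight matrix, part of theta
b = np.zeros(d_out)                 # bias vector, part of theta

def f(x, W, b):
    """Simple linear function: f(x; W, b) = x . W + b"""
    return x @ W + b

x = np.ones(d_in)    # a toy feature vector
scores = f(x, W, b)  # one score per output class; shape (d_out,)
```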

SLIDES 49-51

Linear classifiers

Training of a linear classifier

f(x; W, b) = x · W + b
θ = (W, b)

◮ Training is finding the optimal θ.
◮ 'Optimal' means 'producing predictions ŷ closest to the gold labels y on our n training instances'.
◮ Ideally, ŷ = y.

SLIDES 52-56

Linear classifiers

Here, training instances are represented with 2 features each (x = [x_0, x_1]) and labeled with 2 class labels (y = {black, red}):

◮ Parameters of f(x; W, b) = x · W + b define the line (or hyperplane) separating the instances.
◮ This decision boundary is actually our learned classifier.
◮ NB: the dataset on the plot is linearly separable.
◮ Question: lines with 3 values of b are shown. Which is the best?

SLIDES 57-62

Linear classifiers

How can we represent our data (X)?

◮ Imagine you have a review of a film and you want to know if the reviewer likes the film or hates it.
◮ What are the simplest features that would help you decide this?
  ◮ 'good', 'bad', 'great', 'terrible', etc.
  ◮ maybe actors' names (Meryl Streep, Steven Segal)
◮ The simplest way to represent these words as features is a bag-of-words representation.

SLIDES 63-66

Linear classifiers

Bag of words

◮ Each word from a pre-defined vocabulary D can be a separate feature.
◮ How many times does the word a appear in document i?
  ◮ or a binary flag {1, 0} of whether a appeared in i at all or not.
◮ This scheme is called 'bag of words' (BoW).
◮ For example, if we have 1000 words in the vocabulary:
  ◮ x_i ∈ R^1000
  ◮ x_i = [20, 16, 0, 10, 0, ..., 3]

SLIDES 67-71

Linear classifiers

◮ The bag-of-words feature vector of x can be interpreted as a sum of one-hot vectors (o), one for each token in it:
◮ D extracted from the text above contains 10 words (lowercased): {'-', 'by', 'in', 'most', 'norway', 'road', 'the', 'tourists', 'troll', 'visited'}.
  ◮ o_0 = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
  ◮ o_1 = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
  ◮ etc...
◮ x = [1, 1, 1, 1, 1, 2, 2, 1, 1, 1] ('the' and 'road' are mentioned 2 times)
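The sum-of-one-hots view can be sketched as follows. The vocabulary is the one listed on the slide; the token sequence is a hypothetical reconstruction chosen only to match the stated counts ('the' and 'road' twice, every other vocabulary word once):

```python
import numpy as np

# Vocabulary from the slide, in the order given there.
vocab = ['-', 'by', 'in', 'most', 'norway', 'road', 'the', 'tourists',
         'troll', 'visited']
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """One-hot vector o for a single token."""
    o = np.zeros(len(vocab))
    o[word_to_idx[word]] = 1.0
    return o

def bow(tokens):
    """Bag-of-words vector: the sum of the tokens' one-hot vectors."""
    return sum((one_hot(t) for t in tokens), np.zeros(len(vocab)))

# Hypothetical token sequence consistent with the counts on the slide:
tokens = ['the', 'troll', 'road', 'in', 'norway', '-', 'the', 'most',
          'visited', 'road', 'by', 'tourists']
x = bow(tokens)
# x == [1, 1, 1, 1, 1, 2, 2, 1, 1, 1]
```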

SLIDES 72-76

Linear classifiers

Can we interpret the different parts of a learned model as representations of the data?

◮ Each of the n instances (documents) is represented by a vector of features (x ∈ R^d_in).
◮ Inversely, each feature can be represented by a vector of the instances (documents) it appears in (feature ∈ R^n).
◮ Together these learned representations form the W matrix, part of θ.
  ◮ Thus, it contains data both about the instances and their features (more about this later).
◮ Feature engineering is deciding which features of the instances we will use during training.

SLIDES 77-80

Linear classifiers

[Figure: a weight table with columns 'positive', 'neutral', 'negative' and rows for the features 'great', 'best', 'terrible', 'worst', 'Segal', 'the', 'road' (numeric weights not captured in this transcript).]

SLIDE 81

Linear classifiers

Overview of Linear Models

SLIDES 82-86

Linear classifiers

f(x; W, b) = x · W + b

Output of binary classification

Binary decision (d_out = 1):

◮ 'Is this message spam or not?'
◮ W is a vector, b is a scalar.
◮ The prediction ŷ is also a scalar: either 1 ('yes') or −1 ('no').
◮ NB: the model can output any number, but we convert all negatives to −1 and all positives to 1 (the sign function).

θ = (W ∈ R^d_in, b ∈ R^1)
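The binary decision rule can be sketched as follows; the feature and weight values are made up to produce a raw score of 1.5, so the prediction is sign(1.5) = 1, as in the worked example:

```python
import numpy as np

def predict_binary(x, W, b):
    """Binary linear classifier: sign of x . W + b, in {-1, +1}."""
    score = float(x @ W + b)
    return 1 if score >= 0 else -1

# Toy parameters (illustrative values, not learned):
W = np.array([1.0, 1.0, 0.0, -0.5])
b = 0.5
x = np.array([1.0, 1.0, 0.0, 2.0])
# score = 1 + 1 + 0 - 1 + 0.5 = 1.5, so the prediction is sign(1.5) = 1
y_hat = predict_binary(x, W, b)
```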

SLIDE 87

Linear classifiers

[Figure: a worked example of binary prediction — a binary feature vector is multiplied by a weight vector, the bias 0.5 is added, and the result is sign(1.5) = 1.]

SLIDES 88-93

Linear classifiers

f(x; W, b) = x · W + b

Output of multi-class classification

Multi-class decision (d_out = k):

◮ 'Which of k candidates authored this text?'
◮ W is a matrix, b is a vector of k components.
◮ The prediction ŷ is also a one-hot vector of k components.
◮ The component corresponding to the correct author has the value 1, the others are zeros, for example:

ŷ = [0, 0, 1, 0] (for k = 4)

θ = (W ∈ R^(d_in × d_out), b ∈ R^d_out)
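The multi-class rule can be sketched as follows (toy dimensions and randomly initialized parameters, assumed for illustration): the linear scores are computed as in (1), and the highest-scoring class is encoded as a one-hot vector.

```python
import numpy as np

k, d_in = 4, 5  # illustrative: 4 candidate authors, 5 features

rng = np.random.default_rng(1)
W = rng.normal(size=(d_in, k))  # weight matrix
b = np.zeros(k)                 # bias vector with k components

def predict_one_hot(x, W, b):
    """Pick the highest-scoring class and encode it as a one-hot vector."""
    scores = x @ W + b
    y_hat = np.zeros(k)
    y_hat[np.argmax(scores)] = 1.0
    return y_hat

x = rng.normal(size=d_in)
y_hat = predict_one_hot(x, W, b)  # a one-hot vector such as [0, 0, 1, 0]
```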

SLIDE 94

Linear classifiers

[Figure: a worked example of multi-class prediction — a binary feature vector is multiplied by a weight matrix, giving the class scores [1, 2, 2, 4]; argmax selects the highest-scoring class.]

SLIDES 95-97

Linear classifiers

Log-linear classification

If we care about how confident the classifier is about each decision:

◮ Map the predictions to the range [0, 1]...
◮ ...by a squashing function, for example, the sigmoid:

ŷ = σ(f(x)) = 1 / (1 + e^(−f(x)))    (2)

◮ The result is the probability of the prediction!
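A minimal sketch of equation (2) in plain Python:

```python
import math

def sigmoid(z):
    """Squash a raw score into (0, 1): sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# A raw linear score of 0 maps to probability 0.5;
# large positive scores approach 1, large negative scores approach 0.
p = sigmoid(1.5)
```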

SLIDES 98-101

Linear classifiers

◮ For multi-class cases, log-linear models produce probabilities for all classes, for example: ŷ = [0.4, 0.1, 0.9, 0.5] (for k = 4)
◮ We choose the one with the highest score:

ŷ = argmax_i ŷ[i] = ŷ[2]    (3)

◮ But often it is more convenient to transform the scores into a probability distribution, using the softmax function:

ŷ = softmax(xW + b)    (4)

ŷ[i] = e^((xW + b)[i]) / Σ_j e^((xW + b)[j])    (5)

◮ ŷ = softmax([0.4, 0.1, 0.9, 0.5]) = [0.22, 0.16, 0.37, 0.25]
  ◮ (all scores sum to 1)
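A minimal sketch of the softmax in equations (4)-(5), reproducing the numbers on the slide:

```python
import math

def softmax(scores):
    """Turn raw class scores into a probability distribution."""
    # Subtracting the max is a standard numerical-stability trick;
    # it does not change the result.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([0.4, 0.1, 0.9, 0.5])
# probs is approximately [0.22, 0.16, 0.37, 0.25] and sums to 1
```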

SLIDE 102

Linear classifiers

[Figure: the same worked example with softmax — the class scores from the linear function are passed through softmax, turning them into a probability distribution.]

SLIDE 103

Linear classifiers

Break


SLIDES 105-110

Training as optimization

◮ The goal of training is to find the optimal values of the parameters in θ.
◮ Formally, this means minimizing the loss L(θ) on the training or development dataset.
◮ Conceptually, the loss is a measure of how 'far away' the model predictions ŷ are from the gold labels y.
◮ Formally, it can be any function L(ŷ, y) returning a scalar value:
  ◮ for example, L = (y − ŷ)² (squared error)
◮ It is averaged over all training instances and gives us an estimate of the model's 'fitness'.
◮ θ̂ is the best set of parameters:

θ̂ = argmin_θ L(θ)    (6)
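To make equation (6) concrete, here is a sketch that minimizes a squared-error loss for a one-feature linear model by gradient descent (gradient descent itself is an assumption here; the slides only state the arg-min objective):

```python
# Fit a one-feature linear model f(x) = w*x + b by gradient descent
# on the mean squared error.

def mse(w, b, xs, ys):
    """Mean squared error of the model (w, b) on the data."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data generated by the 'true' function y = 2x + 1:
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0  # initial parameters theta
lr = 0.05        # learning rate
for _ in range(2000):
    # Analytic gradients of the mean squared error w.r.t. w and b:
    grad_w = sum(-2 * x * (y - (w * x + b)) for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(-2 * (y - (w * x + b)) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad_w
    b -= lr * grad_b
# w and b approach 2 and 1, and the loss approaches 0
```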

SLIDES 111-121

Training as optimization

How do you choose a loss function?

1. It depends on your task...
   ◮ regression: mean absolute error, mean squared error, ...
   ◮ classification: hinge loss, cross-entropy, ...
   ◮ ranking: ranking loss, triplet loss
2. A mix of theoretical and practical desires often determines your final choice.
   ◮ For classification, we often use some variant of...
     ◮ ...hinge loss (max margin)
     ◮ ...cross-entropy loss

slide-122
SLIDE 122

Training as optimization

Common loss functions

  • 1. Hinge (binary): L(ŷ, y) = max(0, 1 − y · ŷ)
  • 2. Hinge (multi-class): L(ŷ, y) = max(0, 1 − (ŷ[t] − ŷ[k])), where t is the correct class and k is the highest-scoring incorrect class
  • 3. Log loss: L(ŷ, y) = log(1 + exp(−(ŷ[t] − ŷ[k])))
  • 4. Binary cross-entropy (logistic loss): L(ŷ, y) = −y log ŷ − (1 − y) log(1 − ŷ)
  • 5. Categorical cross-entropy (negative log-likelihood): L(ŷ, y) = −Σᵢ y[i] log(ŷ[i])
  • 6. Ranking losses, etc.

30
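As a concrete reference, the binary hinge, binary cross-entropy, and categorical cross-entropy losses above can be sketched in plain Python (a minimal illustration, not an optimized implementation; in practice these come ready-made in the libraries we will use):

```python
import math

# Binary hinge loss: y in {-1, +1}, y_hat is the raw classifier score.
def hinge_binary(y_hat, y):
    return max(0.0, 1.0 - y * y_hat)

# Binary cross-entropy (logistic loss): y in {0, 1}, y_hat in (0, 1).
def binary_cross_entropy(y_hat, y):
    return -y * math.log(y_hat) - (1.0 - y) * math.log(1.0 - y_hat)

# Categorical cross-entropy: y is a one-hot vector, y_hat a probability vector.
def categorical_cross_entropy(y_hat, y):
    return -sum(yi * math.log(pi) for yi, pi in zip(y, y_hat) if yi > 0)
```

Note how the hinge loss is exactly zero once the score clears the margin, while cross-entropy keeps rewarding more confident correct predictions.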

slide-128
SLIDE 128

Training as optimization

Regularization
◮ Sometimes, so as not to overfit, we pose restrictions on the possible θ.
◮ We would like θ to be not only good at prediction but also not too complex; it should be ‘lean’ and avoid large weights.

Why do you think this is?

◮ We can live with some errors on the training data if that gives us more generalization power.
◮ For that, we minimize both the loss and a regularization term R(θ):

θ̂ = arg min_θ L(θ) + λR(θ)   (7)

◮ The hyperparameter λ is the regularization weight (how important the term is).
◮ Common regularization terms:
  • 1. L2 norm (Gaussian prior or weight decay);
  • 2. L1 norm (sparse prior or lasso).

31
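Equation (7) is easy to make concrete: R(θ) is just a norm of the weight vector, added to the loss with weight λ. A minimal sketch (the helper names are illustrative):

```python
def l2_norm(theta):
    # Squared L2 norm: sum of squared weights (penalized by weight decay).
    return sum(w * w for w in theta)

def l1_norm(theta):
    # L1 norm: sum of absolute weights (encourages sparse solutions).
    return sum(abs(w) for w in theta)

def regularized_objective(loss, theta, lam, reg=l2_norm):
    # Equation (7): L(theta) + lambda * R(theta).
    return loss + lam * reg(theta)
```

With λ = 0 we recover plain loss minimization; larger λ trades training accuracy for smaller weights.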

slide-134
SLIDE 134

Training as optimization

Now we can measure model performance. How can we change our parameters θ to improve it?

32

slide-135
SLIDE 135

Training as optimization

  • 1. We could just randomly change some of the parameters and see if the result is better (the hill-climbing algorithm)...
  • 2. or we could be smarter about it (gradient-based methods).

33

slide-136
SLIDE 136

Training as optimization

Optimizing with gradients
◮ θ̂ = arg min_θ (L(θ) + λR(θ)) is an optimization problem.
◮ Commonly solved using gradient methods:
  • 1. compute the loss,
  • 2. compute the gradient of the loss with respect to the parameters θ
  • 3. (the gradient is the collection of partial derivatives, one for each parameter in θ),
  • 4. move the parameters in the opposite direction of the gradient (to decrease the loss),
  • 5. repeat until an optimum is found (the derivative is 0) or until the pre-defined number of iterations (epochs) is reached.

Convexity
◮ Convex functions: a single optimum point.
◮ Non-convex functions: multiple optimum points.

34
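The five steps above can be sketched as a short loop. Here we minimize the toy convex loss L(θ) = (θ − 3)², whose gradient is 2(θ − 3) (an assumed example, not from the slides), so gradient descent reaches the global minimum at θ = 3:

```python
def gradient_descent(grad, theta, lr=0.1, epochs=100):
    # Repeatedly compute the gradient and move theta in the
    # opposite direction, for a fixed number of epochs.
    for _ in range(epochs):
        theta = theta - lr * grad(theta)
    return theta

# Convex loss L(theta) = (theta - 3)^2 has gradient 2 * (theta - 3).
theta_hat = gradient_descent(lambda t: 2 * (t - 3), theta=0.0)
```

On a non-convex loss the same loop would run just as happily, but could settle in whichever local optimum the initial parameters lead to.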

slide-143
SLIDE 143

Training as optimization

[Figure: gradient descent on an error surface: starting from the initial parameters, compute the gradient, update the parameters, and repeat until the global minimum is reached.]

35

slide-148
SLIDE 148

Training as optimization

Error surfaces of convex and non-convex functions (figure panels: convex function, non-convex function):
◮ Convex functions can easily be minimized with gradient methods, reaching the global optimum.
◮ With non-convex functions, optimization can end up in a local optimum.
◮ Linear and log-linear models as a rule have convex error functions.

36

slide-151
SLIDE 151

Training as optimization

Stochastic gradient descent (SGD)
◮ SGD samples one instance from the training set and computes the error and the gradient on it,
◮ then θ is updated in the opposite direction of the gradient,
◮ the update is scaled by the learning rate η (which can decay over training time),
◮ repeat until convergence.
Instead of one instance, mini-batches can be used (more stable and computationally efficient).

37
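A minimal SGD sketch in plain Python, for an assumed toy model ŷ = w · x with squared error (the data and model here are illustrative, not from the slides):

```python
import random

def sgd(data, w=0.0, lr=0.05, epochs=200, seed=0):
    # Sample one (x, y) instance, compute the gradient of the squared
    # error on it, and update w in the opposite direction, scaled by lr.
    rng = random.Random(seed)
    for _ in range(epochs):
        x, y = rng.choice(data)
        grad = 2 * (w * x - y) * x   # d/dw of (w*x - y)^2
        w -= lr * grad
    return w

# Data generated by y = 2x: SGD should recover w close to 2.
data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0]]
w_hat = sgd(data)
```

Replacing the single sampled instance with the average gradient over a small batch gives mini-batch SGD, which smooths out the noisy per-instance updates.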

slide-157
SLIDE 157

Training as optimization

Other gradient-based optimizers:
◮ Momentum
◮ AdaGrad
◮ RMSProp
◮ Adam
◮ etc.
All implemented in the libraries we are going to use: PyTorch, Scikit-Learn, TensorFlow, Keras, etc.

38
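As one example of these optimizers, here is a single-parameter sketch of the standard Adam update (exponential moving averages of the gradient and its square, with bias correction); this is an illustration of the update rule, not the library implementation:

```python
def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # m and v are running averages of the gradient and its square;
    # t is the 1-based step count used for bias correction.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    theta -= lr * m_hat / (v_hat ** 0.5 + eps)
    return theta, m, v

# On the first step the bias-corrected averages equal the raw gradient,
# so the update size is roughly lr, regardless of the gradient's scale.
theta, m, v = adam_step(0.0, 10.0, 0.0, 0.0, t=1)
```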

slide-158
SLIDE 158

Contents

1

Introduction

2

Basics of supervised machine learning

3

Linear classifiers

4

Training as optimization

5

Limitations of linear models

6

Going deeply non-linear: multi-layered perceptrons

7

Next lecture on January 28

8

Before the next lecture

38

slide-159
SLIDE 159

Limitations of linear models

◮ Linear classifiers are efficient and effective.
◮ They can be used on their own (often enough in practice)...
◮ ...or as building blocks for non-linear neural classifiers.
◮ Unfortunately, linear models can represent only linear relations in the data.

39

slide-162
SLIDE 162

Limitations of linear models

◮ Are there non-linear functions that linear models can’t deal with?
◮ Yes, there are.
◮ One example is the XOR function: it is clearly not linearly separable.

40

slide-166
SLIDE 166

Limitations of linear models

Possible solutions
◮ We can transform the input so that it becomes linearly separable.
◮ Linear transformations will not be able to do this.
◮ We need non-linear transformations.
◮ For example, φ(x1, x2) = [x1 × x2, x1 + x2] maps the instances to another representation and makes the XOR problem linearly separable.

41
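This mapping is easy to check in code: after applying φ, a single linear threshold classifies XOR correctly (the particular separating line below is hand-picked for illustration; any of infinitely many would do):

```python
def phi(x1, x2):
    # The mapping from the slide: phi(x1, x2) = [x1 * x2, x1 + x2].
    return (x1 * x2, x1 + x2)

def linear_classifier(a, b):
    # One hand-picked separating line in the mapped space:
    # predict 1 when -2a + b - 0.5 > 0.
    return 1 if -2 * a + b - 0.5 > 0 else 0

# XOR: label 1 exactly when the two inputs differ.
xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
predictions = {x: linear_classifier(*phi(*x)) for x in xor}
```

In the mapped space, (0,1) and (1,0) collapse onto the same point (0,1), while (0,0) and (1,1) land at (0,0) and (1,2); a straight line now separates the classes.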

slide-170
SLIDE 170

Limitations of linear models

◮ But how do we find a transformation function φ suitable for the task at hand?
◮ Often this implies mapping instances to a higher-dimensional space, making it even more difficult to choose φ manually.
◮ Support Vector Machine (SVM) classifiers handle this to some extent [Cortes and Vapnik, 1995]...
◮ ...but their training time scales linearly with the size of the training data (slow!).

42

slide-174
SLIDE 174

Limitations of linear models

Training mapping functions
◮ Idea: leave it to the algorithm to learn a suitable representation-mapping function!

ŷ = φ(x)W + b   (8)
φ(x) = g(xW′ + b′)   (9)

◮ ...where g is a non-linear activation function, and W′, b′ are its trainable parameters.
◮ The equations above define a simple multi-layer perceptron (MLP): our first neural model.

43
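Equations (8) and (9) can be written out directly. With one hand-chosen set of weights (an illustrative assumption; in practice W, b, W′, b′ are learned) and g = ReLU, this tiny MLP computes XOR:

```python
def relu(z):
    # Elementwise non-linear activation g.
    return [max(0.0, v) for v in z]

def affine(x, W, b):
    # x @ W + b, with W indexed as W[input][output].
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def mlp(x):
    # Equation (9): phi(x) = g(x W' + b'), hand-chosen W', b'.
    W1, b1 = [[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0]
    h = relu(affine(x, W1, b1))
    # Equation (8): y_hat = phi(x) W + b.
    W2, b2 = [[1.0], [-2.0]], [0.0]
    return affine(h, W2, b2)[0]
```

Without the ReLU, the two affine maps would compose into a single linear map, and no choice of weights could represent XOR; the non-linearity is what adds expressive power.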

slide-177
SLIDE 177

Contents

1

Introduction

2

Basics of supervised machine learning

3

Linear classifiers

4

Training as optimization

5

Limitations of linear models

6

Going deeply non-linear: multi-layered perceptrons

7

Next lecture on January 28

8

Before the next lecture

43

slide-178
SLIDE 178

Going deeply non-linear: multi-layered perceptrons

Perceptron with 2 hidden layers

44

slide-179
SLIDE 179

Going deeply non-linear: multi-layered perceptrons

The nature of perceptrons
◮ Input data goes through successive transformations at each layer.
◮ The transformations are linear, but each hidden layer is followed by a non-linear activation.
◮ At the last layer, the prediction ŷ is produced.
◮ The representation functions and the linear classifier are trained simultaneously.
Important: neural networks with hidden layers can theoretically approximate any function [Leshno et al., 1993].

45

slide-184
SLIDE 184

Going deeply non-linear: multi-layered perceptrons

Perceptron with 2 hidden layers

46

slide-185
SLIDE 185

Contents

1

Introduction

2

Basics of supervised machine learning

3

Linear classifiers

4

Training as optimization

5

Limitations of linear models

6

Going deeply non-linear: multi-layered perceptrons

7

Next lecture on January 28

8

Before the next lecture

46

slide-186
SLIDE 186

Next lecture on January 28

Training Deep Neural Networks
◮ More on multi-layer perceptrons and feed-forward neural networks.
◮ Is it really like the brain?
◮ Common activation functions.
◮ Regularizing neural networks with dropout.
◮ Computation graphs.

47

slide-187
SLIDE 187

Contents

1

Introduction

2

Basics of supervised machine learning

3

Linear classifiers

4

Training as optimization

5

Limitations of linear models

6

Going deeply non-linear: multi-layered perceptrons

7

Next lecture on January 28

8

Before the next lecture

47

slide-188
SLIDE 188

Before the next lecture

Obligatory assignment
◮ The first obligatory assignment is already out!
◮ Due February 7.
◮ Look it up on the course page.

Group session
◮ Hands-on: using linear classifiers and MLPs in scikit-learn.
◮ Document classification with BOW features.
◮ Make sure to apply for a Saga account!

48

slide-191
SLIDE 191

References I

Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.
Goldberg, Y. (2017). Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies, 10(1):1–309.
Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867.

49