
SLIDE 1

IN5550: Neural Methods in Natural Language Processing Lecture 2 Supervised Machine Learning: from Linear Models to Neural Networks

Andrey Kutuzov, Vinit Ravishankar, Lilja Øvrelid, Stephan Oepen, & Erik Velldal

University of Oslo

24 January 2019


SLIDE 2

Contents

1. Introduction
2. Basics of supervised machine learning
3. Linear classifiers
4. Training as optimization
5. Limitations of linear models
6. Going deeply non-linear: multi-layered perceptrons
7. Next lecture on January 31
8. Before the next lecture

SLIDE 3

Introduction

I am Andrey Kutuzov. I will give the lectures and group sessions in January and February, covering the following topics:

◮ Linear classifiers and simple feed-forward neural networks
◮ Neural language modeling
◮ Dense representations and word embeddings

I am also partially responsible for the first 2 obligatory assignments:

  • 1. Bag of Words Document Classification
  • 2. Word Embedding and Semantic Similarity

SLIDE 4

Introduction

Technicalities

◮ Make sure to familiarize yourself with the course infrastructure.
◮ Check Piazza and the course page for messages.
◮ Test whether you can access https://github.uio.no/in5550/2019
  ◮ Make sure to update your UiO GitHub profile with your photo, and star the course repository :-)
◮ Most of machine learning revolves around linear algebra.
◮ We created a LinAlg cheat sheet for this course.
  ◮ Linked from the course page, adapted for the notation of [Goldberg, 2017].


SLIDE 6

Basics of supervised machine learning

◮ Supervised ML models are trained on example data and produce generalizations.
◮ They are supposed to 'improve with experience'.
◮ Input 1: a training set of n training instances x_1:n = x_1, x_2, ..., x_n
  ◮ for example, e-mail messages.
◮ Input 2: corresponding 'gold' labels for these instances: y_1:n = y_1, y_2, ..., y_n
  ◮ for example, whether the message is spam (1) or not (0).
◮ The trained model allows us to make label predictions for unseen instances.
◮ Generally: some program for mapping instances to labels.

SLIDE 7

Basics of supervised machine learning

Recap on data split

◮ Recall: we want the model to make good predictions for unseen data.
◮ It should not overfit to the seen data.
◮ Thus, the datasets are usually split into:
  • 1. train data;
  • 2. validation/development data (optional);
  • 3. test/held-out data.

SLIDE 8

Basics of supervised machine learning

◮ We want to find a program which makes good predictions for our task.
◮ Searching among all possible programs is infeasible.
◮ To cope with that, we adopt an inductive bias...
◮ ...and fix some hypothesis class...
◮ ...to search only within this class.

A popular hypothesis class: linear functions.


SLIDE 10

Linear classifiers

Simple linear function:

f(x; W, b) = x·W + b    (1)

◮ Function input:
  ◮ feature vector x ∈ R^din;
  ◮ each training instance is represented with din features;
  ◮ for example, some properties of the documents.
◮ Function parameters θ:
  ◮ matrix W ∈ R^(din×dout);
  ◮ dout is the dimensionality of the desired prediction (the number of classes);
  ◮ bias vector b ∈ R^dout;
  ◮ the bias 'shifts' the function output in some direction.
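A minimal NumPy sketch of this scorer (illustrative only; the dimensions din = 3, dout = 2 and the weight values are toy assumptions):

    import numpy as np

    d_in, d_out = 3, 2                       # toy dimensions
    rng = np.random.default_rng(0)
    W = rng.normal(size=(d_in, d_out))       # parameter matrix W
    b = np.zeros(d_out)                      # bias vector b

    def f(x, W, b):
        # the linear scorer of equation (1): x·W + b
        return x @ W + b

    x = np.array([1.0, 0.0, 2.0])            # one instance with d_in features
    print(f(x, W, b))                        # d_out scores, one per class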

SLIDE 11

Linear classifiers

Training of a linear classifier:

f(x; W, b) = x·W + b
θ = (W, b)

◮ Training is finding the optimal θ.
◮ 'Optimal' means 'producing predictions ŷ closest to the gold labels y on our n training instances'.
◮ Ideally, ŷ = y.

SLIDE 12

Linear classifiers

Representing linguistic features

◮ Each of the n instances (documents) is represented by a vector of features (x ∈ R^din).
◮ Inversely, each feature can be represented by a vector of the instances (documents) it appears in (feature ∈ R^n).
◮ Together these learned representations form the matrix W, part of θ.
◮ Thus, W contains data both about the instances and about their features (more about this later).
◮ Feature engineering is deciding which features of the instances we will use during training.

SLIDE 13

Linear classifiers

Here, training instances are represented with 2 features each (x = [x_0, x_1]) and labeled with 2 class labels (y = {black, red}):

◮ The parameters of f(x; W, b) = x·W + b define the line (or hyperplane) separating the instances.
◮ This decision boundary is actually our learned classifier.
◮ NB: the dataset on the plot is linearly separable.
◮ Question: lines with 3 values of b are shown. Which is the best?

SLIDE 14

Linear classifiers

Bag of words

◮ We can have many more features than 2
  ◮ (although this is much harder to visualize).
◮ Each word from a pre-defined vocabulary D can be a separate feature:
  ◮ how many times does the word a appear in the document i?
  ◮ or a binary flag {1, 0}: did a appear in i at all or not?
◮ This scheme is called 'bag of words' (BoW).
◮ For example, if we have 1000 words in the vocabulary:
  ◮ i ∈ R^1000
  ◮ i = [20, 16, 0, 10, 0, ..., 3]

SLIDE 15

Linear classifiers

◮ The Bag-of-Words feature vector of a document i can be interpreted as the sum of one-hot vectors (o), one per token, as in the sketch below:
◮ D extracted from the example text on the slide contains 10 words (lowercased): {'-', 'by', 'in', 'most', 'norway', 'road', 'the', 'tourists', 'troll', 'visited'}.
◮ o_0 = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
◮ o_1 = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
◮ etc...
◮ i = [1, 1, 1, 1, 1, 2, 2, 1, 1, 1] ('the' and 'road' are mentioned 2 times)
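A short sketch of this construction; the token sequence is an assumption, chosen only to reproduce the counts on the slide:

    import numpy as np

    # vocabulary D from the slide, in sorted order
    D = ['-', 'by', 'in', 'most', 'norway', 'road', 'the', 'tourists', 'troll', 'visited']
    index = {w: j for j, w in enumerate(D)}

    def one_hot(word):
        o = np.zeros(len(D), dtype=int)
        o[index[word]] = 1
        return o

    # assumed token sequence (lowercased)
    tokens = ['the', 'troll', 'road', 'in', 'norway', '-', 'the',
              'road', 'most', 'visited', 'by', 'tourists']

    # the BoW vector of the document is the sum of its tokens' one-hot vectors
    i = sum(one_hot(t) for t in tokens)
    print(i)                                 # [1 1 1 1 1 2 2 1 1 1]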

SLIDE 16

Linear classifiers

f(x; W, b) = x·W + b

Output of binary classification

Binary decision (dout = 1):

◮ 'Is this message spam or not?'
◮ W is a vector, b is a scalar.
◮ The prediction ŷ is also a scalar: either 1 ('yes') or −1 ('no').
◮ NB: the model can output any number, but we convert all negatives to −1 and all positives to 1 (the sign function).

θ = (W ∈ R^din, b ∈ R^1)
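A small sketch of the binary case (the weights are toy values):

    import numpy as np

    def predict_binary(x, w, b):
        # the raw score can be any number; we only keep its sign
        score = x @ w + b
        return 1 if score >= 0 else -1       # 1 = 'yes', -1 = 'no'

    w = np.array([0.5, -1.0, 2.0])           # W is a vector when d_out = 1
    b = -0.2                                 # b is a scalar
    x = np.array([1.0, 0.0, 1.0])
    print(predict_binary(x, w, b))           # 1 ('spam')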

SLIDE 17

Linear classifiers

f(x; W, b) = x·W + b

Output of multi-class classification

Multi-class decision (dout = k):

◮ 'Which of k candidates authored this text?'
◮ W is a matrix, b is a vector of k components.
◮ The prediction ŷ is also a one-hot vector of k components.
◮ The component corresponding to the correct author has the value of 1, the others are zeros, for example: ŷ = [0, 0, 1, 0] (for k = 4)

θ = (W ∈ R^(din×dout), b ∈ R^dout)
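And a sketch of turning k raw scores into a one-hot prediction (the scores are toy values):

    import numpy as np

    scores = np.array([0.4, 0.1, 0.9, 0.5])  # raw scores for k = 4 candidates
    y_hat = np.zeros_like(scores)
    y_hat[np.argmax(scores)] = 1             # one-hot: 1 for the best-scoring class
    print(y_hat)                             # [0. 0. 1. 0.]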

SLIDE 18

Linear classifiers

Log-linear classification

If we care about how confident the classifier is about each decision:

◮ Map the predictions to the range [0, 1]...
◮ ...by a squashing function, for example, the sigmoid:

ŷ = σ(f(x)) = 1 / (1 + e^(−f(x)))    (2)

◮ The result is the probability of the prediction!
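Equation (2) as a one-liner in NumPy:

    import numpy as np

    def sigmoid(z):
        # squashes any real-valued score into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    print(sigmoid(0.0))                      # 0.5: right on the decision boundary
    print(sigmoid(4.0))                      # ~0.98: a confident 'yes'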

SLIDE 19

Linear classifiers

◮ For multi-class cases, log-linear models produce probabilities for all classes, for example: ŷ = [0.4, 0.1, 0.9, 0.5] (for k = 4)
◮ We choose the one with the highest score:

ŷ = argmax_i ŷ[i] = ŷ[2]    (3)

◮ But often it is more convenient to transform the scores into a probability distribution, using the softmax function:

ŷ = softmax(x·W + b)
ŷ[i] = e^((x·W + b)[i]) / Σ_j e^((x·W + b)[j])    (4)

◮ ŷ = softmax([0.4, 0.1, 0.9, 0.5]) = [0.22, 0.16, 0.37, 0.25]
◮ (all scores sum to 1)
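A sketch of equation (4) that reproduces the numbers above:

    import numpy as np

    def softmax(scores):
        # subtracting the max is a standard trick for numerical stability;
        # it does not change the result
        e = np.exp(scores - np.max(scores))
        return e / e.sum()

    print(softmax(np.array([0.4, 0.1, 0.9, 0.5])).round(2))
    # [0.22 0.16 0.37 0.25] -- a proper probability distribution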


SLIDE 21

Training as optimization

◮ The goal of training is to find the optimal values of the parameters θ.
◮ Formally, this means minimizing the loss L(θ) on the training or development dataset.
◮ Conceptually, the loss is a measure of how 'far away' the model predictions ŷ are from the gold labels y.
◮ Formally, it can be any function L(ŷ, y) returning a scalar value:
  ◮ for example, L = (y − ŷ)² (square error)
◮ It is averaged over all training instances and gives us an estimate of the model's 'fitness'.
◮ θ̂ is the best set of parameters:

θ̂ = argmin_θ L(θ)    (5)

SLIDE 22

Training as optimization

Common loss functions

  • 1. Hinge (binary): L(ŷ, y) = max(0, 1 − y·ŷ)
  • 2. Hinge (multi-class): L(ŷ, y) = max(0, 1 − (ŷ[t] − ŷ[k]))
  • 3. Log loss: L(ŷ, y) = log(1 + exp(−(ŷ[t] − ŷ[k])))
  • 4. Binary cross-entropy (logistic loss): L(ŷ, y) = −y log ŷ − (1 − y) log(1 − ŷ)
  • 5. Categorical cross-entropy (negative log-likelihood): L(ŷ, y) = −Σ_i y[i] log(ŷ[i])
  • 6. Ranking losses, etc., etc...
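The two cross-entropy losses sketched in NumPy (a small epsilon guards against log(0); the test values are toy assumptions):

    import numpy as np
    EPS = 1e-12

    def binary_cross_entropy(y_hat, y):
        # loss 4: y in {0, 1}, y_hat in (0, 1)
        return -(y * np.log(y_hat + EPS) + (1 - y) * np.log(1 - y_hat + EPS))

    def categorical_cross_entropy(y_hat, y):
        # loss 5: y is a one-hot vector, y_hat a probability distribution
        return -np.sum(y * np.log(y_hat + EPS))

    print(binary_cross_entropy(0.9, 1))      # ~0.105: confident and correct
    print(categorical_cross_entropy(np.array([0.22, 0.16, 0.37, 0.25]),
                                    np.array([0, 0, 1, 0])))   # ~0.99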

SLIDE 23

Training as optimization

Regularization

◮ Sometimes, so as not to overfit, we impose restrictions on the possible θ.
◮ We would like θ to be not only good at predicting, but also not too complex: it should be 'lean' and avoid large weights.
◮ We can live with some errors on the training data if this gives more generalization power.
◮ For that, we minimize both the loss and a regularization term R(θ):

θ̂ = argmin_θ (L(θ) + λR(θ))    (6)

◮ The hyperparameter λ is the regularization weight (how important the term is).
◮ Common regularization terms:
  • 1. L2 norm (Gaussian prior or weight decay);
  • 2. L1 norm (sparse prior or lasso).
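The two common terms, sketched for a weight matrix W:

    import numpy as np

    def l2_term(W):
        # Gaussian prior / weight decay: the sum of squared weights
        return np.sum(W ** 2)

    def l1_term(W):
        # sparse prior / lasso: the sum of absolute weights
        return np.sum(np.abs(W))

    # the regularized objective of equation (6); lam plays the role of lambda
    def objective(loss, W, lam=0.01):
        return loss + lam * l2_term(W)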

SLIDE 24

Training as optimization

Optimizing with gradient

◮ θ̂ = argmin_θ (L(θ) + λR(θ)) is an optimization problem.
◮ Commonly solved using gradient methods (see the sketch below):
  • 1. compute the loss,
  • 2. compute the gradient of the θ parameters with respect to the loss
  • 3. (the gradient here is the collection of partial derivatives, one for each parameter in θ),
  • 4. move the parameters in the opposite direction (to decrease the loss),
  • 5. repeat until the optimum is found (the derivative is 0) or until a pre-defined number of iterations (epochs) is reached.

Convexity

◮ Convex functions: a single optimum point.
◮ Non-convex functions: multiple optimum points.
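A minimal gradient-descent sketch for a 1-D linear model with square error (the data and learning rate are toy choices):

    import numpy as np

    xs = np.array([0.0, 1.0, 2.0, 3.0])      # toy data generated by y = 2x + 1
    ys = 2 * xs + 1

    w, b = 0.0, 0.0                          # theta = (w, b), initialized at zero
    lr = 0.05                                # learning rate

    for epoch in range(500):
        y_hat = w * xs + b
        # partial derivatives of the mean square error w.r.t. w and b
        dw = np.mean(-2 * (ys - y_hat) * xs)
        db = np.mean(-2 * (ys - y_hat))
        w -= lr * dw                         # move against the gradient
        b -= lr * db

    print(round(w, 2), round(b, 2))          # ~2.0 ~1.0: the optimum is found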

SLIDE 25

Training as optimization

Error surfaces of convex and non-convex functions (plots: 'Convex function', 'Non-convex function'):

◮ Convex functions can be easily minimized with gradient methods, reaching the global optimum.
◮ With non-convex functions, optimization can end up in a local optimum.
◮ Linear and log-linear models, as a rule, have convex error functions.

SLIDE 26

Training as optimization

Stochastic gradient descent (SGD)

◮ SGD samples one instance from the training set and computes the gradient of the error on it,
◮ then θ is updated in the opposite direction,
◮ the update is scaled by the learning rate η (which can decay over training time),
◮ repeat until convergence.

Instead of one instance, batches can be used (more stable and computationally efficient); see the sketch below.
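One epoch of minibatch SGD, sketched generically (grad_fn is an assumed stand-in returning dL/dθ for a batch):

    import numpy as np

    def sgd_epoch(theta, X, y, grad_fn, lr=0.01, batch_size=32):
        # one pass over the training set in shuffled minibatches
        idx = np.random.permutation(len(X))
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            # update theta in the direction opposite to the gradient,
            # scaled by the learning rate
            theta = theta - lr * grad_fn(theta, X[batch], y[batch])
        return theta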

SLIDE 27

Training as optimization

Other gradient-based optimizers:

◮ Momentum
◮ AdaGrad
◮ RMSProp
◮ Adam
◮ etc...

All implemented in the libraries we are going to use: PyTorch, Scikit-Learn, TensorFlow, Keras, etc.
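In PyTorch, for instance, swapping optimizers is a one-line change (the model here is a stand-in):

    import torch

    model = torch.nn.Linear(100, 2)          # stand-in linear model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    # ...or, keeping the rest of the training loop unchanged:
    # optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)
    # optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001)
    # optimizer = torch.optim.Adam(model.parameters(), lr=0.001)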


SLIDE 29

Limitations of linear models

◮ Linear classifiers are efficient and effective.
◮ They can be used on their own (often enough in practice)...
◮ ...or as building blocks for non-linear neural classifiers.
◮ Unfortunately, linear models can represent only linear relations in the data.

SLIDE 30

Limitations of linear models

◮ Are there non-linear functions that linear models can't deal with?
◮ Yes, there are.
◮ One example is the XOR function:

It is clearly not linearly separable.

SLIDE 31

Limitations of linear models

Possible solutions

◮ We can transform the input so that it becomes linearly separable.
◮ Linear transformations will not be able to do this.
◮ We need non-linear transformations.

φ(x_1, x_2) = [x_1 × x_2, x_1 + x_2] maps the instances to another representation and makes the XOR problem linearly separable:
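A quick check of this mapping on the four XOR points (using the {0, 1} input encoding; the separating line at the end is one possible choice):

    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # the four XOR inputs
    y = np.array([0, 1, 1, 0])                       # XOR labels

    # the mapping from the slide: phi(x1, x2) = [x1 * x2, x1 + x2]
    phi = np.stack([X[:, 0] * X[:, 1], X[:, 0] + X[:, 1]], axis=1)
    print(phi)                                       # [[0 0] [0 1] [0 1] [1 2]]

    # in the new space one line suffices: predict 1 iff x2 - 2*x1 > 0.5
    print(((phi[:, 1] - 2 * phi[:, 0]) > 0.5).astype(int))   # [0 1 1 0] == y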

SLIDE 32

Limitations of linear models

◮ But how do we find a transformation function φ suitable for the task at hand?
◮ Often this implies mapping the instances to a higher-dimensional space, making it even more difficult to choose φ manually.
◮ Support Vector Machine (SVM) classifiers handle this to some extent... [Cortes and Vapnik, 1995]
◮ ...but they scale linearly in time with the size of the training data (slow!).

SLIDE 33

Limitations of linear models

Training mapping functions

◮ Idea: leave it to the algorithm to learn a suitable representation mapping function!

ŷ = φ(x)·W + b
φ(x) = g(x·W′ + b′)    (7)

◮ ...where g is a non-linear activation function, and W′, b′ are its trainable parameters.
◮ The equation above defines a simple multi-layer perceptron (MLP): our first neural model.
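A NumPy sketch of equation (7), with tanh standing in for g (the dimensions and random initialization are toy choices):

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_hidden, d_out = 10, 5, 2         # toy dimensions

    W1 = rng.normal(size=(d_in, d_hidden))   # W' of the mapping phi
    b1 = np.zeros(d_hidden)                  # b'
    W2 = rng.normal(size=(d_hidden, d_out))  # W of the final linear classifier
    b2 = np.zeros(d_out)                     # b

    def mlp(x):
        phi = np.tanh(x @ W1 + b1)           # phi(x) = g(x·W' + b')
        return phi @ W2 + b2                 # y_hat = phi(x)·W + b

    x = rng.normal(size=d_in)
    print(mlp(x))                            # d_out scores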


SLIDE 35

Going deeply non-linear: multi-layered perceptrons

Perceptron with 2 hidden layers


SLIDE 36

Going deeply non-linear: multi-layered perceptrons

The nature of perceptrons

◮ Input data goes through successive transformations at each layer.
◮ The transformations are linear, but followed by a non-linear activation at each hidden layer.
◮ At the last layer, the prediction ŷ is produced.
◮ The representation functions and the linear classifier are trained simultaneously.

Important: neural networks with hidden layers can theoretically approximate any function [Leshno et al., 1993].

SLIDE 37

Going deeply non-linear: multi-layered perceptrons

Perceptron with 2 hidden layers



SLIDE 39

Next lecture on January 31

Training Deep Neural Networks

◮ More on multi-layer perceptrons and feed-forward neural networks.
◮ Is it really like the brain?
◮ Common activation functions.
◮ Regularizing neural networks with dropout.
◮ Computation graphs.


SLIDE 41

Before the next lecture

Obligatory assignment

◮ The first obligatory assignment will be out today!
◮ Due February 8.
◮ Look it up on the course page.

Group session on January 29

◮ Hands-on: using linear classifiers and MLPs in scikit-learn.
◮ Document classification with BoW features.
◮ Make sure to apply for an Abel account!

Thanks for your attention!

SLIDE 42

References I

Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

Goldberg, Y. (2017). Neural Network Methods for Natural Language Processing. Synthesis Lectures on Human Language Technologies, 10(1):1–309.

Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867.