
SLIDE 1

IN5550: Neural Methods in Natural Language Processing Lecture 2 Supervised Machine Learning: from Linear Models to Neural Networks

Andrey Kutuzov, Vinit Ravishankar, Lilja Øvrelid, Stephan Oepen, & Erik Velldal

University of Oslo

24 January 2019


SLIDE 2

Contents

1. Introduction
2. Basics of supervised machine learning
3. Linear classifiers
4. Training as optimization
5. Limitations of linear models
6. Going deeply non-linear: multi-layered perceptrons
7. Next lecture on January 31
8. Before the next lecture

SLIDE 3

Introduction

I am Andrey Kutuzov. I will give the lectures and group sessions in January and February, covering the following topics:

◮ Linear classifiers and simple feed-forward neural networks
◮ Neural language modeling
◮ Dense representations and word embeddings

I am also partially responsible for the first 2 obligatory assignments:

  • 1. Bag of Words Document Classification
  • 2. Word Embedding and Semantic Similarity

SLIDE 4

Introduction

Technicalities

◮ Make sure to familiarize yourself with the course infrastructure.
◮ Check Piazza and the course page for messages.
◮ Test whether you can access https://github.uio.no/in5550/2019
  ◮ Make sure to update your UiO GitHub profile with your photo, and star the course repository :-)
◮ Most of machine learning revolves around linear algebra.
◮ We created a LinAlg cheat sheet for this course.
  ◮ Linked from the course page, adapted for the notation of [Goldberg, 2017].


SLIDE 6

Basics of supervised machine learning

◮ Supervised ML models are trained on example data and produce generalizations.
◮ They are supposed to 'improve with experience'.
◮ Input 1: a training set of n training instances x_1:n = x_1, x_2, ..., x_n
  ◮ for example, e-mail messages.
◮ Input 2: corresponding 'gold' labels for these instances: y_1:n = y_1, y_2, ..., y_n
  ◮ for example, whether the message is spam (1) or not (0).
◮ The trained model allows us to make label predictions for unseen instances.
◮ Generally: some program for mapping instances to labels.

SLIDE 7

Basics of supervised machine learning

Recap on data split

◮ Recall: we want the model to make good predictions for unseen data.
◮ It should not overfit to the seen data.
◮ Thus, the datasets are usually split into:
  • 1. train data;
  • 2. validation/development data (optional);
  • 3. test/held-out data.

SLIDE 8

Basics of supervised machine learning

◮ We want to find a program which makes good predictions for our task.
◮ Searching among all possible programs is infeasible.
◮ To cope with that, we adopt an inductive bias...
◮ ...and fix some hypothesis class...
◮ ...to search only within this class.

A popular hypothesis class: linear functions.


SLIDE 10

Linear classifiers

Simple linear function:

f(x; W, b) = x·W + b    (1)

◮ Function input:
  ◮ feature vector x ∈ R^din;
  ◮ each training instance is represented with din features;
  ◮ for example, some properties of the documents.
◮ Function parameters θ:
  ◮ matrix W ∈ R^(din×dout);
  ◮ dout is the dimensionality of the desired prediction (the number of classes);
  ◮ bias vector b ∈ R^dout;
  ◮ the bias 'shifts' the function output in some direction.
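A minimal NumPy sketch of this scorer (illustrative only; the dimensions din = 3, dout = 2 and the weight values are toy assumptions):

    import numpy as np

    d_in, d_out = 3, 2                       # toy dimensions
    rng = np.random.default_rng(0)
    W = rng.normal(size=(d_in, d_out))       # parameter matrix W
    b = np.zeros(d_out)                      # bias vector b

    def f(x, W, b):
        # the linear scorer of equation (1): x·W + b
        return x @ W + b

    x = np.array([1.0, 0.0, 2.0])            # one instance with d_in features
    print(f(x, W, b))                        # d_out scores, one per class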

SLIDE 11

Linear classifiers

Training of a linear classifier:

f(x; W, b) = x·W + b
θ = (W, b)

◮ Training is finding the optimal θ.
◮ 'Optimal' means 'producing predictions ŷ closest to the gold labels y on our n training instances'.
◮ Ideally, ŷ = y.

SLIDE 12

Linear classifiers

Representing linguistic features

◮ Each of the n instances (documents) is represented by a vector of features (x ∈ R^din).
◮ Inversely, each feature can be represented by a vector of the instances (documents) it appears in (feature ∈ R^n).
◮ Together these learned representations form the matrix W, part of θ.
◮ Thus, W contains data both about the instances and about their features (more about this later).
◮ Feature engineering is deciding which features of the instances we will use during training.

SLIDE 13

Linear classifiers

Here, training instances are represented with 2 features each (x = [x_0, x_1]) and labeled with 2 class labels (y = {black, red}):

◮ The parameters of f(x; W, b) = x·W + b define the line (or hyperplane) separating the instances.
◮ This decision boundary is actually our learned classifier.
◮ NB: the dataset on the plot is linearly separable.
◮ Question: lines with 3 values of b are shown. Which is the best?

SLIDE 14

Linear classifiers

Bag of words

◮ We can have many more features than 2
  ◮ (although this is much harder to visualize).
◮ Each word from a pre-defined vocabulary D can be a separate feature:
  ◮ how many times does the word a appear in the document i?
  ◮ or a binary flag {1, 0}: did a appear in i at all or not?
◮ This scheme is called 'bag of words' (BoW).
◮ For example, if we have 1000 words in the vocabulary:
  ◮ i ∈ R^1000
  ◮ i = [20, 16, 0, 10, 0, ..., 3]

SLIDE 15

Linear classifiers

◮ The Bag-of-Words feature vector of a document i can be interpreted as the sum of one-hot vectors (o), one per token, as in the sketch below:
◮ D extracted from the example text on the slide contains 10 words (lowercased): {'-', 'by', 'in', 'most', 'norway', 'road', 'the', 'tourists', 'troll', 'visited'}.
◮ o_0 = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
◮ o_1 = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
◮ etc...
◮ i = [1, 1, 1, 1, 1, 2, 2, 1, 1, 1] ('the' and 'road' are mentioned 2 times)
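A short sketch of this construction; the token sequence is an assumption, chosen only to reproduce the counts on the slide:

    import numpy as np

    # vocabulary D from the slide, in sorted order
    D = ['-', 'by', 'in', 'most', 'norway', 'road', 'the', 'tourists', 'troll', 'visited']
    index = {w: j for j, w in enumerate(D)}

    def one_hot(word):
        o = np.zeros(len(D), dtype=int)
        o[index[word]] = 1
        return o

    # assumed token sequence (lowercased)
    tokens = ['the', 'troll', 'road', 'in', 'norway', '-', 'the',
              'road', 'most', 'visited', 'by', 'tourists']

    # the BoW vector of the document is the sum of its tokens' one-hot vectors
    i = sum(one_hot(t) for t in tokens)
    print(i)                                 # [1 1 1 1 1 2 2 1 1 1]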

SLIDE 16

Linear classifiers

f(x; W, b) = x·W + b

Output of binary classification

Binary decision (dout = 1):

◮ 'Is this message spam or not?'
◮ W is a vector, b is a scalar.
◮ The prediction ŷ is also a scalar: either 1 ('yes') or −1 ('no').
◮ NB: the model can output any number, but we convert all negatives to −1 and all positives to 1 (the sign function).

θ = (W ∈ R^din, b ∈ R^1)
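A small sketch of the binary case (the weights are toy values):

    import numpy as np

    def predict_binary(x, w, b):
        # the raw score can be any number; we only keep its sign
        score = x @ w + b
        return 1 if score >= 0 else -1       # 1 = 'yes', -1 = 'no'

    w = np.array([0.5, -1.0, 2.0])           # W is a vector when d_out = 1
    b = -0.2                                 # b is a scalar
    x = np.array([1.0, 0.0, 1.0])
    print(predict_binary(x, w, b))           # 1 ('spam')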

SLIDE 17

Linear classifiers

f(x; W, b) = x·W + b

Output of multi-class classification

Multi-class decision (dout = k):

◮ 'Which of k candidates authored this text?'
◮ W is a matrix, b is a vector of k components.
◮ The prediction ŷ is also a one-hot vector of k components.
◮ The component corresponding to the correct author has the value of 1, the others are zeros, for example: ŷ = [0, 0, 1, 0] (for k = 4)

θ = (W ∈ R^(din×dout), b ∈ R^dout)
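And a sketch of turning k raw scores into a one-hot prediction (the scores are toy values):

    import numpy as np

    scores = np.array([0.4, 0.1, 0.9, 0.5])  # raw scores for k = 4 candidates
    y_hat = np.zeros_like(scores)
    y_hat[np.argmax(scores)] = 1             # one-hot: 1 for the best-scoring class
    print(y_hat)                             # [0. 0. 1. 0.]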

SLIDE 18

Linear classifiers

Log-linear classification

If we care about how confident the classifier is about each decision:

◮ Map the predictions to the range [0, 1]...
◮ ...by a squashing function, for example, the sigmoid:

ŷ = σ(f(x)) = 1 / (1 + e^(−f(x)))    (2)

◮ The result is the probability of the prediction!
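Equation (2) as a one-liner in NumPy:

    import numpy as np

    def sigmoid(z):
        # squashes any real-valued score into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    print(sigmoid(0.0))                      # 0.5: right on the decision boundary
    print(sigmoid(4.0))                      # ~0.98: a confident 'yes'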

SLIDE 19

Linear classifiers

◮ For multi-class cases, log-linear models produce probabilities for all classes, for example: ŷ = [0.4, 0.1, 0.9, 0.5] (for k = 4)
◮ We choose the one with the highest score:

ŷ = argmax_i ŷ[i] = ŷ[2]    (3)

◮ But often it is more convenient to transform the scores into a probability distribution, using the softmax function:

ŷ = softmax(x·W + b)
ŷ[i] = e^((x·W + b)[i]) / Σ_j e^((x·W + b)[j])    (4)

◮ ŷ = softmax([0.4, 0.1, 0.9, 0.5]) = [0.22, 0.16, 0.37, 0.25]
◮ (all scores sum to 1)
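A sketch of equation (4) that reproduces the numbers above:

    import numpy as np

    def softmax(scores):
        # subtracting the max is a standard trick for numerical stability;
        # it does not change the result
        e = np.exp(scores - np.max(scores))
        return e / e.sum()

    print(softmax(np.array([0.4, 0.1, 0.9, 0.5])).round(2))
    # [0.22 0.16 0.37 0.25] -- a proper probability distribution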


SLIDE 21

Training as optimization

◮ The goal of training is to find the optimal values of the parameters θ.
◮ Formally, this means minimizing the loss L(θ) on the training or development dataset.
◮ Conceptually, the loss is a measure of how 'far away' the model predictions ŷ are from the gold labels y.
◮ Formally, it can be any function L(ŷ, y) returning a scalar value:
  ◮ for example, L = (y − ŷ)² (square error)
◮ It is averaged over all training instances and gives us an estimate of the model's 'fitness'.
◮ θ̂ is the best set of parameters:

θ̂ = argmin_θ L(θ)    (5)

SLIDE 22

Training as optimization

Common loss functions

  • 1. Hinge (binary): L(ŷ, y) = max(0, 1 − y·ŷ)
  • 2. Hinge (multi-class): L(ŷ, y) = max(0, 1 − (ŷ[t] − ŷ[k]))
  • 3. Log loss: L(ŷ, y) = log(1 + exp(−(ŷ[t] − ŷ[k])))
  • 4. Binary cross-entropy (logistic loss): L(ŷ, y) = −y log ŷ − (1 − y) log(1 − ŷ)
  • 5. Categorical cross-entropy (negative log-likelihood): L(ŷ, y) = −Σ_i y[i] log(ŷ[i])
  • 6. Ranking losses, etc., etc...
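The two cross-entropy losses sketched in NumPy (a small epsilon guards against log(0); the test values are toy assumptions):

    import numpy as np
    EPS = 1e-12

    def binary_cross_entropy(y_hat, y):
        # loss 4: y in {0, 1}, y_hat in (0, 1)
        return -(y * np.log(y_hat + EPS) + (1 - y) * np.log(1 - y_hat + EPS))

    def categorical_cross_entropy(y_hat, y):
        # loss 5: y is a one-hot vector, y_hat a probability distribution
        return -np.sum(y * np.log(y_hat + EPS))

    print(binary_cross_entropy(0.9, 1))      # ~0.105: confident and correct
    print(categorical_cross_entropy(np.array([0.22, 0.16, 0.37, 0.25]),
                                    np.array([0, 0, 1, 0])))   # ~0.99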

SLIDE 23

Training as optimization

Regularization

◮ Sometimes, so as not to overfit, we impose restrictions on the possible θ.
◮ We would like θ to be not only good at predicting, but also not too complex: it should be 'lean' and avoid large weights.
◮ We can live with some errors on the training data if this gives more generalization power.
◮ For that, we minimize both the loss and a regularization term R(θ):

θ̂ = argmin_θ (L(θ) + λR(θ))    (6)

◮ The hyperparameter λ is the regularization weight (how important the term is).
◮ Common regularization terms:
  • 1. L2 norm (Gaussian prior or weight decay);
  • 2. L1 norm (sparse prior or lasso).
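The two common terms, sketched for a weight matrix W:

    import numpy as np

    def l2_term(W):
        # Gaussian prior / weight decay: the sum of squared weights
        return np.sum(W ** 2)

    def l1_term(W):
        # sparse prior / lasso: the sum of absolute weights
        return np.sum(np.abs(W))

    # the regularized objective of equation (6); lam plays the role of lambda
    def objective(loss, W, lam=0.01):
        return loss + lam * l2_term(W)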

SLIDE 24

Training as optimization

Optimizing with gradient

◮ θ̂ = argmin_θ (L(θ) + λR(θ)) is an optimization problem.
◮ Commonly solved using gradient methods (see the sketch below):
  • 1. compute the loss,
  • 2. compute the gradient of the θ parameters with respect to the loss
  • 3. (the gradient here is the collection of partial derivatives, one for each parameter in θ),
  • 4. move the parameters in the opposite direction (to decrease the loss),
  • 5. repeat until the optimum is found (the derivative is 0) or until a pre-defined number of iterations (epochs) is reached.

Convexity

◮ Convex functions: a single optimum point.
◮ Non-convex functions: multiple optimum points.
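A minimal gradient-descent sketch for a 1-D linear model with square error (the data and learning rate are toy choices):

    import numpy as np

    xs = np.array([0.0, 1.0, 2.0, 3.0])      # toy data generated by y = 2x + 1
    ys = 2 * xs + 1

    w, b = 0.0, 0.0                          # theta = (w, b), initialized at zero
    lr = 0.05                                # learning rate

    for epoch in range(500):
        y_hat = w * xs + b
        # partial derivatives of the mean square error w.r.t. w and b
        dw = np.mean(-2 * (ys - y_hat) * xs)
        db = np.mean(-2 * (ys - y_hat))
        w -= lr * dw                         # move against the gradient
        b -= lr * db

    print(round(w, 2), round(b, 2))          # ~2.0 ~1.0: the optimum is found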

SLIDE 25

Training as optimization

Error surfaces of convex and non-convex functions (plots: 'Convex function', 'Non-convex function'):

◮ Convex functions can be easily minimized with gradient methods, reaching the global optimum.
◮ With non-convex functions, optimization can end up in a local optimum.
◮ Linear and log-linear models, as a rule, have convex error functions.

SLIDE 26

Training as optimization

Stochastic gradient descent (SGD)

◮ SGD samples one instance from the training set and computes the gradient of the error on it,
◮ then θ is updated in the opposite direction,
◮ the update is scaled by the learning rate η (which can decay over training time),
◮ repeat until convergence.

Instead of one instance, batches can be used (more stable and computationally efficient); see the sketch below.
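One epoch of minibatch SGD, sketched generically (grad_fn is an assumed stand-in returning dL/dθ for a batch):

    import numpy as np

    def sgd_epoch(theta, X, y, grad_fn, lr=0.01, batch_size=32):
        # one pass over the training set in shuffled minibatches
        idx = np.random.permutation(len(X))
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            # update theta in the direction opposite to the gradient,
            # scaled by the learning rate
            theta = theta - lr * grad_fn(theta, X[batch], y[batch])
        return theta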

SLIDE 27

Training as optimization

Other gradient-based optimizers:

◮ Momentum
◮ AdaGrad
◮ RMSProp
◮ Adam
◮ etc...

All implemented in the libraries we are going to use: PyTorch, Scikit-Learn, TensorFlow, Keras, etc.
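In PyTorch, for instance, swapping optimizers is a one-line change (the model here is a stand-in):

    import torch

    model = torch.nn.Linear(100, 2)          # stand-in linear model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    # ...or, keeping the rest of the training loop unchanged:
    # optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)
    # optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001)
    # optimizer = torch.optim.Adam(model.parameters(), lr=0.001)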


SLIDE 29

Limitations of linear models

◮ Linear classifiers are efficient and effective.
◮ They can be used on their own (often enough in practice)...
◮ ...or as building blocks for non-linear neural classifiers.
◮ Unfortunately, linear models can represent only linear relations in the data.

SLIDE 30

Limitations of linear models

◮ Are there non-linear functions that linear models can't deal with?
◮ Yes, there are.
◮ One example is the XOR function:

It is clearly not linearly separable.

SLIDE 31

Limitations of linear models

Possible solutions

◮ We can transform the input so that it becomes linearly separable.
◮ Linear transformations will not be able to do this.
◮ We need non-linear transformations.

φ(x_1, x_2) = [x_1 × x_2, x_1 + x_2] maps the instances to another representation and makes the XOR problem linearly separable:
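A quick check of this mapping on the four XOR points (using the {0, 1} input encoding; the separating line at the end is one possible choice):

    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # the four XOR inputs
    y = np.array([0, 1, 1, 0])                       # XOR labels

    # the mapping from the slide: phi(x1, x2) = [x1 * x2, x1 + x2]
    phi = np.stack([X[:, 0] * X[:, 1], X[:, 0] + X[:, 1]], axis=1)
    print(phi)                                       # [[0 0] [0 1] [0 1] [1 2]]

    # in the new space one line suffices: predict 1 iff x2 - 2*x1 > 0.5
    print(((phi[:, 1] - 2 * phi[:, 0]) > 0.5).astype(int))   # [0 1 1 0] == y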

SLIDE 32

Limitations of linear models

◮ But how do we find a transformation function φ suitable for the task at hand?
◮ Often this implies mapping the instances to a higher-dimensional space, making it even more difficult to choose φ manually.
◮ Support Vector Machine (SVM) classifiers handle this to some extent... [Cortes and Vapnik, 1995]
◮ ...but they scale linearly in time with the size of the training data (slow!).

SLIDE 33

Limitations of linear models

Training mapping functions

◮ Idea: leave it to the algorithm to learn a suitable representation mapping function!

ŷ = φ(x)·W + b
φ(x) = g(x·W′ + b′)    (7)

◮ ...where g is a non-linear activation function, and W′, b′ are its trainable parameters.
◮ The equation above defines a simple multi-layer perceptron (MLP): our first neural model.
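A NumPy sketch of equation (7), with tanh standing in for g (the dimensions and random initialization are toy choices):

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_hidden, d_out = 10, 5, 2         # toy dimensions

    W1 = rng.normal(size=(d_in, d_hidden))   # W' of the mapping phi
    b1 = np.zeros(d_hidden)                  # b'
    W2 = rng.normal(size=(d_hidden, d_out))  # W of the final linear classifier
    b2 = np.zeros(d_out)                     # b

    def mlp(x):
        phi = np.tanh(x @ W1 + b1)           # phi(x) = g(x·W' + b')
        return phi @ W2 + b2                 # y_hat = phi(x)·W + b

    x = rng.normal(size=d_in)
    print(mlp(x))                            # d_out scores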


SLIDE 35

Going deeply non-linear: multi-layered perceptrons

Perceptron with 2 hidden layers


SLIDE 36

Going deeply non-linear: multi-layered perceptrons

The nature of perceptrons

◮ Input data goes through successive transformations at each layer.
◮ The transformations are linear, but followed by a non-linear activation at each hidden layer.
◮ At the last layer, the prediction ŷ is produced.
◮ The representation functions and the linear classifier are trained simultaneously.

Important: neural networks with hidden layers can theoretically approximate any function [Leshno et al., 1993].

SLIDE 37

Going deeply non-linear: multi-layered perceptrons

Perceptron with 2 hidden layers



SLIDE 39

Next lecture on January 31

Training Deep Neural Networks

◮ More on multi-layer perceptrons and feed-forward neural networks.
◮ Is it really like the brain?
◮ Common activation functions.
◮ Regularizing neural networks with dropout.
◮ Computation graphs.


SLIDE 41

Before the next lecture

Obligatory assignment

◮ The first obligatory assignment will be out today!
◮ Due February 8.
◮ Look it up on the course page.

Group session on January 29

◮ Hands-on: using linear classifiers and MLPs in scikit-learn.
◮ Document classification with BoW features.
◮ Make sure to apply for an Abel account!

Thanks for your attention!

SLIDE 42

References I

Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

Goldberg, Y. (2017). Neural Network Methods for Natural Language Processing. Synthesis Lectures on Human Language Technologies, 10(1):1–309.

Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867.