

SLIDE 1

SFU NatLangLab

Natural Language Processing

Anoop Sarkar anoopsarkar.github.io/nlp-class

Simon Fraser University

October 25, 2018

SLIDE 2


Natural Language Processing

Anoop Sarkar anoopsarkar.github.io/nlp-class

Simon Fraser University

Part 1: Neural Language Models

The following slides are taken from Tomas Mikolov’s presentation at Google, circa 2010

SLIDE 3

Model description - recurrent NNLM

[Figure: recurrent NNLM architecture, with input w(t), previous hidden state s(t-1), hidden state s(t), output y(t), and weight matrices U, V, W]

Input layer w and output layer y have the same dimensionality as the vocabulary (10K - 200K)
Hidden layer s is orders of magnitude smaller (50 - 1000 neurons)
U is the matrix of weights between input and hidden layer, V is the matrix of weights between hidden and output layer
Without the recurrent weights W, this model would be a bigram neural network language model

SLIDE 4

Model Description - Recurrent NNLM

The output values from neurons in the hidden and output layers are computed as follows:

s(t) = f(Uw(t) + Ws(t−1))   (1)
y(t) = g(Vs(t))             (2)

where f(z) and g(z) are the sigmoid and softmax activation functions (the softmax in the output layer ensures that the outputs form a valid probability distribution, i.e. all outputs are greater than 0 and their sum is 1):

f(z) = 1 / (1 + e^{−z}),   g(z_m) = e^{z_m} / Σ_k e^{z_k}   (3)
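A minimal NumPy sketch of equations (1)-(3). The dimensions and names (V_SIZE, HIDDEN) are illustrative assumptions, not values from the slides.

    import numpy as np

    V_SIZE, HIDDEN = 10000, 100            # vocabulary size, hidden layer size (illustrative)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))    # f(z) from equation (3)

    def softmax(z):
        e = np.exp(z - z.max())            # shifted for numerical stability
        return e / e.sum()                 # g(z) from equation (3)

    U = 0.1 * np.random.randn(HIDDEN, V_SIZE)   # input  -> hidden weights
    W = 0.1 * np.random.randn(HIDDEN, HIDDEN)   # hidden -> hidden (recurrent) weights
    V = 0.1 * np.random.randn(V_SIZE, HIDDEN)   # hidden -> output weights

    def forward(word_id, s_prev):
        # w(t) is a 1-of-V vector, so Uw(t) is just the column of U for the current word
        s_t = sigmoid(U[:, word_id] + W @ s_prev)   # equation (1)
        y_t = softmax(V @ s_t)                      # equation (2)
        return s_t, y_t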

SLIDE 5

Training of RNNLM

The training is performed using Stochastic Gradient Descent (SGD)
We go through all the training data iteratively, and update the weight matrices U, V and W online (after processing every word)
Training is performed in several epochs (usually 5-10)
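A sketch of one pass of this online training loop, reusing the names from the forward-pass sketch above; `corpus` (a list of word ids) is an assumption, and the per-word SGD updates of U, V and W (equations (4)-(10) on the following slides) would go at the marked line.

    def run_epoch(corpus):
        s_prev = np.zeros(HIDDEN)
        log_prob = 0.0
        for t in range(len(corpus) - 1):
            s_t, y_t = forward(corpus[t], s_prev)        # predict w(t+1) from w(t) and s(t-1)
            log_prob += np.log(y_t[corpus[t + 1]])
            # ... online SGD updates of V, U, W go here (see the following slides) ...
            s_prev = s_t
        return np.exp(-log_prob / (len(corpus) - 1))     # perplexity of this pass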

SLIDE 6

Training of RNNLM

The gradient of the error vector in the output layer, eo(t), is computed using the cross-entropy criterion:

eo(t) = d(t) − y(t)   (4)

where d(t) is the target vector that represents the word w(t+1) (encoded as a 1-of-V vector).
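In code, equation (4) is just the difference between the 1-of-V target and the softmax output; a small sketch reusing V_SIZE from the earlier forward-pass example.

    def output_error(y_t, next_word_id):
        d = np.zeros(V_SIZE)
        d[next_word_id] = 1.0        # d(t): 1-of-V encoding of w(t+1)
        return d - y_t               # e_o(t) = d(t) - y(t), equation (4)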

SLIDE 7

Training of RNNLM

Weights V between the hidden layer s(t) and the output layer y(t) are updated as

V(t+1) = V(t) + s(t)eo(t)^T α   (5)

where α is the learning rate.
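A sketch of the update in equation (5), using the shape convention of the earlier forward-pass example (there V maps hidden to output, so the rank-one update is the outer product of e_o(t) and s(t)).

    def update_V(V, s_t, e_o, alpha):
        return V + alpha * np.outer(e_o, s_t)    # equation (5)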

SLIDE 8

Training of RNNLM

Next, gradients of errors are propagated from the output layer to the hidden layer:

eh(t) = dh(eo(t)^T V, t)   (6)

where the error vector is obtained using the function dh(), which is applied element-wise:

dh_j(x, t) = x s_j(t) (1 − s_j(t))   (7)
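Equations (6)-(7) in code: propagate the output error through V and scale element-wise by the sigmoid derivative s(t)(1 − s(t)); shapes follow the earlier sketch.

    def hidden_error(V, e_o, s_t):
        return (V.T @ e_o) * s_t * (1.0 - s_t)   # e_h(t), equations (6)-(7)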

SLIDE 9

Training of RNNLM

Weights U between the input layer w(t) and the hidden layer s(t) are then updated as

U(t+1) = U(t) + w(t)eh(t)^T α   (8)

Note that only one neuron is active at a given time in the input vector w(t). As equation (8) shows, the weight change for neurons with zero activation is zero, so the computation can be sped up by updating only the weights that correspond to the active input neuron.
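A sketch of this sparse update: since w(t) is 1-of-V, only the column of U belonging to the current word changes, which is exactly the speed-up described above.

    def update_U(U, word_id, e_h, alpha):
        U[:, word_id] += alpha * e_h     # only the active input neuron's weights are touched
        return U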

SLIDE 10

Training of RNNLM - Backpropagation Through Time

The recurrent weights W are updated by unfolding them in time and training the network as a deep feedforward neural network. The process of propagating errors back through the recurrent weights is called Backpropagation Through Time (BPTT).

SLIDE 11

Training of RNNLM - Backpropagation Through Time


Figure: Recurrent neural network unfolded as a deep feedforward network, here for 3 time steps back in time.

SLIDE 12

Training of RNNLM - Backpropagation Through Time

Error propagation is done recursively as follows (note that the algorithm requires the hidden-layer states from the previous time steps to be stored):

eh(t−τ−1) = dh(eh(t−τ)^T W, t−τ−1)   (9)

The unfolding can be applied for as many time steps as there are already-seen training examples; however, the error gradients quickly vanish as they are backpropagated in time (in rare cases the errors can explode), so several steps of unfolding are sufficient (this is sometimes referred to as truncated BPTT).
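A sketch of the truncated error recursion in equation (9); `hidden_states[k]` is assumed to hold the stored state s(t−k), so the list needs at least `steps + 1` entries.

    def bptt_errors(W, e_h_t, hidden_states, steps):
        errors = [e_h_t]                               # e_h(t)
        for tau in range(steps):
            s_back = hidden_states[tau + 1]            # s(t - tau - 1)
            e_prev = (W.T @ errors[-1]) * s_back * (1.0 - s_back)   # equation (9)
            errors.append(e_prev)                      # e_h(t - tau - 1)
        return errors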

SLIDE 13

Training of RNNLM - Backpropagation Through Time

The recurrent weights W are updated as

W(t+1) = W(t) + Σ_{z=0}^{T} s(t−z−1) eh(t−z)^T α   (10)

Note that the matrix W is changed in one update at once, and not during backpropagation of errors. It is more computationally efficient to unfold the network after processing several training examples, so that the training complexity does not increase linearly with the number of time steps T for which the network is unfolded in time.
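A sketch of the accumulated update in equation (10), reusing the `errors` list from the BPTT sketch above (errors[z] = e_h(t−z)) and stored states with hidden_states[z+1] = s(t−z−1); the whole change is applied to W at once.

    def update_W(W, errors, hidden_states, alpha):
        dW = np.zeros_like(W)
        for z, e in enumerate(errors):
            dW += np.outer(e, hidden_states[z + 1])    # s(t-z-1) e_h(t-z)^T
        return W + alpha * dW                          # equation (10)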

SLIDE 14

Training of RNNLM - Backpropagation Through Time


Figure: Example of batch mode training. Red arrows indicate how the gradients are propagated through the unfolded recurrent neural network.

SLIDE 15

Extensions: Classes

Computing the full probability distribution over all V words can be very costly, as V can easily be more than 100K. We can instead:

Assign each word from V to a single class
Compute the probability distribution over all classes
Compute the probability distribution over the words that belong to that class

Assignment of words to classes can be trivial: we can use frequency binning (a small sketch follows).
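A sketch of frequency binning: words are assigned to classes so that each class covers roughly an equal share of the total corpus frequency, and the factored probability is then P(w | history) = P(class(w) | history) * P(w | class(w), history). The names (`counts`, `num_classes`) are illustrative.

    import numpy as np

    def frequency_binning(counts, num_classes):
        order = np.argsort(-counts)                    # most frequent words first
        share = counts.sum() / num_classes             # frequency mass per class
        classes = np.zeros(len(counts), dtype=int)
        running, cls = 0.0, 0
        for w in order:
            classes[w] = cls
            running += counts[w]
            if running >= (cls + 1) * share and cls < num_classes - 1:
                cls += 1                               # move on to the next bin
        return classes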

SLIDE 16

Extensions: Classes


Figure: Factorization of the output layer, c(t) is the class layer.

SLIDE 17

Extensions: Classes

By using simple classes, we can achieve speedups of more than 100 times on large data sets, at the cost of a small loss in model accuracy (usually 5-10% in perplexity).

SLIDE 18

Empirical Results

Penn Treebank
  Comparison of advanced language modeling techniques
  Combination
Wall Street Journal
  JHU setup
  Kaldi setup
NIST RT04 Broadcast News speech recognition
Additional experiments: machine translation, text compression

SLIDE 19

Penn Treebank

We have used the Penn Treebank Corpus, with the same vocabulary and data division as other researchers:

Sections 0-20: training data, 930K tokens
Sections 21-22: validation data, 74K tokens
Sections 23-24: test data, 82K tokens
Vocabulary size: 10K

SLIDE 20

Penn Treebank - Comparison

Model                                      Perplexity (individual / +KN5 / +KN5+cache)   Entropy reduction over KN5 / KN5+cache
3-gram, Good-Turing smoothing (GT3)          165.2      -        -                            -       -
5-gram, Good-Turing smoothing (GT5)          162.3      -        -                            -       -
3-gram, Kneser-Ney smoothing (KN3)           148.3      -        -                            -       -
5-gram, Kneser-Ney smoothing (KN5)           141.2      -        -                            -       -
5-gram, Kneser-Ney smoothing + cache         125.7      -        -                            -       -
PAQ8o10t                                     131.1      -        -                            -       -
Maximum entropy 5-gram model                 142.1    138.7    124.5                         0.4%    0.2%
Random clusterings LM                        170.1    126.3    115.6                         2.3%    1.7%
Random forest LM                             131.9    131.3    117.5                         1.5%    1.4%
Structured LM                                146.1    125.5    114.4                         2.4%    1.9%
Within and across sentence boundary LM       116.6    110.0    108.7                         5.0%    3.0%
Log-bilinear LM                              144.5    115.2    105.8                         4.1%    3.6%
Feedforward neural network LM                140.2    116.7    106.6                         3.8%    3.4%
Syntactical neural network LM                131.3    110.0    101.5                         5.0%    4.4%
Recurrent neural network LM                  124.7    105.7     97.5                         5.8%    5.3%
Dynamically evaluated RNNLM                  123.2    102.7     98.0                         6.4%    5.1%
Combination of static RNNLMs                 102.1     95.5     89.4                         7.9%    7.0%
Combination of dynamic RNNLMs                101.0     92.9     90.0                         8.5%    6.9%

SLIDE 24

Penn Treebank - Combination

Model                                       Weight     PPL
3-gram with Good-Turing smoothing (GT3)       -        165.2
5-gram with Kneser-Ney smoothing (KN5)        -        141.2
5-gram with Kneser-Ney smoothing + cache    0.0792     125.7
Maximum entropy model                         -        142.1
Random clusterings LM                         -        170.1
Random forest LM                            0.1057     131.9
Structured LM                               0.0196     146.1
Within and across sentence boundary LM      0.0838     116.6
Log-bilinear LM                               -        144.5
Feedforward NNLM                              -        140.2
Syntactical NNLM                            0.0828     131.3
Combination of static RNNLMs                0.3231     102.1
Combination of adaptive RNNLMs              0.3058     101.0
ALL                                         1           83.5

SLIDE 25

Combination of Techniques (Joshua Goodman, 2001)

Figure from ”A bit of progress in language modeling, extended version” (Goodman, 2001)

SLIDE 26

Empirical Evaluation - JHU WSJ Setup Description

Setup from Johns Hopkins University (results are comparable to other techniques)
Wall Street Journal: read speech, very clean (an easy task for language modeling experiments)
Simple decoder (not state of the art)
36M training tokens, 200K vocabulary
WER results obtained by 100-best list rescoring (a rescoring sketch follows)
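A sketch of n-best list rescoring: each hypothesis keeps its decoder score, its language-model score is recomputed with the RNN LM (optionally interpolated with the Kneser-Ney model), and the best-scoring hypothesis is selected. The function names, weights, and the (words, score) format are assumptions for illustration, not details from the slides.

    def rescore_nbest(hypotheses, rnn_logprob, kn_logprob, lm_weight=1.0, interp=0.5):
        best, best_score = None, float("-inf")
        for words, decoder_score in hypotheses:           # one (word list, score) pair per hypothesis
            lm = interp * rnn_logprob(words) + (1.0 - interp) * kn_logprob(words)
            score = decoder_score + lm_weight * lm        # log-linear combination
            if score > best_score:
                best, best_score = words, score
        return best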

SLIDE 27

Improvements with Increasing Amount of Data

[Figure: entropy per word on the WSJ test data vs. number of training tokens (10^5 to 10^8), for KN5 and KN5+RNN]

The improvement obtained from a single RNN model over the best backoff model increases with more data! However, the size of the hidden layer also needs to increase with more training data.
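The quantity plotted above can be computed directly from the probabilities a model assigns to the test words; a small sketch, where `logprobs` is an assumed list of natural-log probabilities, one per test word.

    import numpy as np

    def entropy_per_word_and_perplexity(logprobs):
        bits = -np.mean(logprobs) / np.log(2.0)    # entropy per word in bits
        return bits, 2.0 ** bits                   # perplexity = 2^entropy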

SLIDE 28

Improvements with Increasing Amount of Data

# words   PPL (KN5)   PPL (+RNN)   WER (KN5)   WER (+RNN)   Entropy improvement [%]   WER improvement [%]
223K         415          333          -            -               3.7                      -
675K         390          298        15.6          13.9             4.5                    10.9
2233K        331          251        14.9          12.9             4.8                    13.4
6.4M         283          200        13.6          11.7             6.1                    14.0
36M          212          133        12.2          10.2             8.7                    16.4

SLIDE 29

Comparison of Techniques - WSJ, JHU Setup

Model                   Dev WER [%]   Eval WER [%]
Baseline - KN5             12.2          17.2
Discriminative LM          11.5          16.9
Joint structured LM          -           16.7
Static RNN                 10.3          14.5
Static RNN + KN            10.2          14.5
Adapted RNN                 9.7          14.2
Adapted RNN + KN            9.7          14.2
3 combined RNN LMs          9.5          13.9

SLIDE 30

Empirical evaluation - Kaldi WSJ setup description

The same test sets as the JHU setup, but lattices obtained with the Kaldi speech recognition toolkit
N-best lists were produced by Stefan Kombrink last summer; currently the best Kaldi baseline is much better
37M training tokens, 20K vocabulary
WER results obtained by 1000-best list rescoring
Results obtained with RNNME models, with up to 4-gram features and a hash size of 2G parameters
Better repeatability of experiments than with the JHU setup

SLIDE 31

Empirical evaluation - Kaldi WSJ setup

Model                           PPL (heldout)   PPL (Eval 92)   WER Eval 92 [%]   WER Eval 93 [%]
GT2                                 167             209              14.6              19.7
GT3                                 105             147              13.0              17.6
KN5                                  87             131              12.5              16.6
KN5 (no count cutoffs)               80             122              12.0              16.6
RNNME-0                              90             129              12.4              17.3
RNNME-10                             81             116              11.9              16.3
RNNME-80                             70             100              10.4              14.9
RNNME-160                            65              95              10.2              14.5
RNNME-320                            62              93               9.8              14.2
RNNME-480                            59              90              10.2              13.7
RNNME-640                            59              89               9.6              14.4
Combination of RNNME models           -               -               9.24             13.23
+ unsupervised adaptation             -               -               9.15             13.11

Results improve with larger hidden layer.

SLIDE 33

Evaluation - Broadcast News Speech Recognition

NIST RT04 Broadcast News speech recognition task
The baseline system is a state-of-the-art setup from IBM based on the Attila decoder: very well tuned, a hard task
87K vocabulary size, 400M training tokens (10x more than the WSJ setups)
IBM has reported that the state-of-the-art LM on this setup is a regularized class-based maxent model (called ”model M”)
NNLMs have been reported to perform about the same as model M (about 0.6% absolute WER reduction), but are computationally complex
We tried class-based RNN and RNNME models...

SLIDE 34

Evaluation - Broadcast News Speech Recognition

Model              WER [%] Single   WER [%] Interpolated
KN4 (baseline)         13.11            13.11
model M                13.1             12.49
RNN-40                 13.36            12.90
RNN-80                 12.98            12.70
RNN-160                12.69            12.58
RNN-320                12.38            12.31
RNN-480                12.21            12.04
RNN-640                12.05            12.00
RNNME-0                13.21            12.99
RNNME-40               12.42            12.37
RNNME-80               12.35            12.22
RNNME-160              12.17            12.16
RNNME-320              11.91            11.90
3xRNN                    -              11.70

Word error rate on the NIST RT04 evaluation set.
Still plenty of space for improvements: adaptation, bigger models, combination of RNN and RNNME, ...
Another myth broken: the maxent model (aka ”model M”) is not more powerful than NNLMs!

SLIDE 35

Empirical Evaluation - Broadcast News Speech Recognition

[Figure: WER on the evaluation set [%] vs. hidden layer size (10^1 to 10^3), for RNN, RNN+KN4, KN4, RNNME and RNNME+KN4]

The improvements increase with more neurons in the hidden layer

SLIDE 36

Empirical Evaluation - Broadcast News Speech Recognition

[Figure: entropy reduction per word over KN4 [bits] vs. hidden layer size (10^1 to 10^3), for RNN+KN4 and RNNME+KN4]

Comparison of entropy improvements obtained from RNN and RNNME models over KN4 model

SLIDE 37

Empirical Evaluation - Entropy

Additional experiments to compare RNN and RNNME:
Randomized order of sentences in the training data (to prevent adaptation)
Comparison of entropy reductions over a KN 5-gram model with no count cutoffs and no pruning

SLIDE 38

Empirical Evaluation - Entropy

[Figure: entropy reduction over KN5 vs. number of training tokens (10^5 to 10^9), for RNN-20 and RNNME-20]

If the hidden layer size is kept constant in the RNN model, the improvements seem to vanish with more data.
RNNME seems to be useful on large data sets.

SLIDE 39

Empirical Evaluation - Entropy

[Figure: entropy reduction over KN5 vs. number of training tokens (10^5 to 10^9), for RNN-20, RNNME-20, RNN-80 and RNNME-80]

SLIDE 40

Additional Experiments: Machine Translation

Machine translation: a task very similar to speech recognition (from the language modeling point of view)
I performed the following experiments while visiting JHU in 2010
Basic RNN models were used (no classes, no BPTT, no ME)
Baseline systems were trained by Zhifei Li and Ziyuan Wang

SLIDE 41

Additional Experiments: Machine Translation

Table: BLEU on IWSLT 2005 Machine Translation task, Chinese to English.

Model                              BLEU
baseline (n-gram)                  48.7
300-best rescoring with RNNs       51.2

About 400K training tokens, small task

SLIDE 42

Additional Experiments: Machine Translation

Table: BLEU and NIST score on NIST MT 05 Machine Translation task, Chinese to English.

Model                               BLEU    NIST
baseline (n-gram)                   33.0    9.03
1000-best rescoring with RNNs       34.7    9.19

RNNs were trained on a subset of the training data (about 17.5M training tokens), with a limited vocabulary

SLIDE 43

Additional Experiments: Text Compression

Compressor            Size [MB]   Bits per character
Original text file     1696.7          8.0
gzip -9                 576.2          2.72
RNNME-0                 273.0          1.29
PAQ8o10t -8             272.1          1.28
RNNME-40                263.5          1.24
RNNME-80                258.7          1.22
RNNME-200               256.5          1.21

PAQ8o10t is a state-of-the-art compression program
Data compressor = Predictor + Arithmetic coding
Task: compression of the normalized text data that was used in the NIST RT04 experiments
The achieved entropy of English text, 1.21 bpc, is already lower than the upper bound of 1.3 bpc estimated by Shannon
Several tricks were used to obtain the results: multiple models with different learning rates, skip-gram features
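The "predictor + arithmetic coding" view above means the compressed size is essentially the model's cross entropy on the data: bits per character is the average negative log2-probability assigned to each character. A small sketch; `char_logprobs` (natural-log probabilities, one per character) is an assumption.

    import numpy as np

    def bits_per_character(char_logprobs):
        return -np.mean(char_logprobs) / np.log(2.0)

    def compressed_size_mb(char_logprobs):
        total_bits = -np.sum(char_logprobs) / np.log(2.0)
        return total_bits / 8.0 / 1e6              # bits -> bytes -> megabytes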

SLIDE 44

Conclusion: Finally Beyond N-grams?

Extensive experiments confirm that n-grams can be significantly beaten at many interesting tasks:

Penn Treebank: perplexity reduced from 141 to 79
WSJ: 21% - 23% relative reduction of WER
Broadcast News Speech Recognition: 11% relative reduction of WER
MT: 1.7 - 2.5 BLEU points
Text compression

Experiments can be easily repeated using the freely available RNNLM tool! But are we any closer to ”intelligent language models”?

SLIDE 45

Data sampled from 4-gram backoff model

OR STUDENT’S IS FROM TEETH PROSECUTORS DO FILLED WITH HER SOME BACKGROUND ON WHAT WAS GOING ON HERE ALUMINUM CANS OF PEACE PIPER SWEAT COLONEL SAYING HAVE ALREADY MADE LAW THAT WOULD PREVENT THE BACTERIA DOWN FOR THE MOST OF IT IN NINETEEN SEVENTY EIGHT WHICH WAS ONE OF A NUMBER OF ISSUES INCLUDING CIVIL SUIT BY THIS TIME NEXT YEAR CRYSTAL FIRMLY AS A HERO OF MINE A PREVIEW THAT THOMAS SEVENTY BODIES AND ASKING QUESTIONS MAYBE ATTORNEY’S OFFICE THEATERS CUT ACROSS THE ELEVENTH AND SUPPORT THEM WITH ELLEN WISEST PULLING DATA GATHERING IN RESPONSE TO AN UNMITIGATED DISPOSITION CONTRACTORS AND AND I’M VERY SORRY FOR THE DEATH OF HER SPOKESWOMAN ONIONS THE FRESH CORN THANKSGIVING CONTROL WHEN I TALKED TO SAID THAT AND THEY THINK WHAT AT LEAST UNTIL AFTER I’M UPSET SO WE INCORPORATED WITH DROPPING EXTRAORDINARY PHONED

SLIDE 46

Data sampled from RNN model

THANKS FOR COMING IN NEXT IN A COUPLE OF MINUTES WHEN WE TAKE A LOOK AT OUR ACCOMPANYING STORY IMAGE GUIDE WHY ARE ANY OF THOSE DETAILS BEING HEARD IN LONDON BUT DEFENSE ATTORNEYS SAY THEY THOUGHT THE CONTACT WAS NOT AIMED DAMAGING AT ANY SUSPECTS THE UNITED NATIONS SECURITY COUNCIL IS NAMED TO WITHIN TWO MOST OF IRAQI ELECTION OFFICIALS IT IS THE MINIMUM TIME A TOTAL OF ONE DETERMINED TO APPLY LIMITS TO THE FOREIGN MINISTERS WHO HAD MORE POWER AND NOW THAN ANY MAN WOULD NAME A CABINET ORAL FIND OUT HOW IMPORTANT HIS DIFFERENT RECOMMENDATION IS TO MAKE WHAT THIS WHITE HOUSE WILL WILL TO BE ADDRESSED ELAINE MATHEWS IS A POLITICAL CORRESPONDENT FOR THE PRESIDENT’S FAMILY WHO FILED A SIMILAR NATIONWIDE OPERATION THAT CAME IN A DEAL THE WEIGHT OF THE CLINTON CERTAINLY OUTRAGED ALL PALESTINIANS IN THE COUNTRY IS DESIGNED TO REVIVE THE ISRAELI TALKS

SLIDE 47

WSJ-Kaldi rescoring

5-gram: IN TOKYO FOREIGN EXCHANGE TRADING YESTERDAY THE UNIT INCREASED AGAINST THE DOLLAR
RNNLM:  IN TOKYO FOREIGN EXCHANGE TRADING YESTERDAY THE YEN INCREASED AGAINST THE DOLLAR

5-gram: SOME CURRENCY TRADERS SAID THE UPWARD REVALUATION OF THE GERMAN MARK WASN’T BIG ENOUGH AND THAT THE MARKET MAY CONTINUE TO RISE
RNNLM:  SOME CURRENCY TRADERS SAID THE UPWARD REVALUATION OF THE GERMAN MARKET WASN’T BIG ENOUGH AND THAT THE MARKET MAY CONTINUE TO RISE

5-gram: MEANWHILE QUESTIONS REMAIN WITHIN THE E. M. S. WEATHERED YESTERDAY’S REALIGNMENT WAS ONLY A TEMPORARY SOLUTION
RNNLM:  MEANWHILE QUESTIONS REMAIN WITHIN THE E. M. S. WHETHER YESTERDAY’S REALIGNMENT WAS ONLY A TEMPORARY SOLUTION

5-gram: MR. PARNES FOLEY ALSO FOR THE FIRST TIME THE WIND WITH SUEZ’S PLANS FOR GENERALE DE BELGIQUE’S WAR
RNNLM:  MR. PARNES SO LATE ALSO FOR THE FIRST TIME ALIGNED WITH SUEZ’S PLANS FOR GENERALE DE BELGIQUE’S WAR

5-gram: HE SAID THE GROUP WAS MARKET IN ITS STRUCTURE AND NO ONE HAD LEADERSHIP
RNNLM:  HE SAID THE GROUP WAS ARCANE IN ITS STRUCTURE AND NO ONE HAD LEADERSHIP

SLIDE 48

Conclusion: Finally Beyond N-grams?

RNN LMs can generate much more meaningful text than n-gram models trained on the same data
Many novel but meaningful sequences of words were generated
RNN LMs are clearly better at modeling the language than n-grams
However, many simple patterns in the language cannot be efficiently described even by RNNs...

SLIDE 49


Acknowledgements

Many slides are borrowed from or inspired by lecture notes by Michael Collins, Chris Dyer, Kevin Knight, Philipp Koehn, Adam Lopez, Graham Neubig and Luke Zettlemoyer from their NLP course materials. All mistakes are my own. A big thank you to all the students who read through these notes and helped me improve them.