Natural Language Processing (CSE 490U): Neural Language Models
Noah Smith
© 2017 University of Washington, nasmith@cs.washington.edu
January 13–18, 2017
Quick Review: a language model is a probability distribution over word sequences built from a vocabulary V.
A neural network is:
◮ Non-linear
◮ Differentiable with respect to its inputs
◮ “Assembled” through a series of affine transformations and nonlinearities
◮ Symbolic/discrete inputs handled through lookups

Its parameters:
◮ Typically a collection of scalars, vectors, and matrices
◮ We often assume they are linearized into R^D
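A minimal sketch of “affine transformations and nonlinearities” composed into a network (the names W1, b1, W2, b2 and all sizes here are illustrative assumptions, not from the slides):

```python
import numpy as np

def feedforward(x, W1, b1, W2, b2):
    """Two-layer network: affine map -> tanh nonlinearity -> affine map."""
    h = np.tanh(W1 @ x + b1)  # affine transformation followed by a nonlinearity
    return W2 @ h + b2        # final affine transformation

rng = np.random.default_rng(0)
x = rng.normal(size=4)                         # a 4-dimensional input
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)  # hypothetical parameter shapes
W2, b2 = rng.normal(size=(2, 8)), np.zeros(2)
y = feedforward(x, W1, b1, W2, b2)             # a 2-dimensional output
```

Every operation here is differentiable, which is what makes gradient-based learning of the parameters possible.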
[Figure: feedforward neural language model. Each history word is looked up in a V × d embedding table; the concatenated embeddings feed a tanh hidden layer of size H; an affine transformation to V scores gives the next-word distribution.]
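The forward pass sketched in the figure can be written in a few lines of numpy. This is a rough sketch: the dimension names V, d, H follow the slides, but the concrete shapes, the concatenation of n history embeddings, and the softmax output are my assumptions about the figure.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

V, d, H, n = 10, 5, 8, 3        # vocab size, embedding dim, hidden dim, history length
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))     # V x d table: one d-dimensional embedding per word
W = rng.normal(size=(n * d, H)) # hidden-layer weights
U = rng.normal(size=(H, V))     # output weights
b, c = np.zeros(H), np.zeros(V)

history = [3, 7, 1]                          # indices of the previous n words
x = np.concatenate([E[j] for j in history])  # lookup handles the discrete inputs
h = np.tanh(x @ W + b)                       # tanh hidden layer
p = softmax(h @ U + c)                       # distribution over the next word
```

The only non-differentiable step is the lookup itself, which is handled by treating each row of E as a parameter vector to be learned.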
◮ Suppose y = xor(x1, x2); this can’t be expressed as a linear function of x1 and x2.
◮ With high-dimensional inputs, there are a lot of conjunctive features to consider.
◮ Neural models seem to smoothly explore lots of such feature conjunctions.

[Plot: mean squared error (y-axis) over training iterations (x-axis) while learning xor.]
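The xor point can be demonstrated directly: no linear function fits it, but one tanh hidden layer can. A gradient-descent sketch under squared error (the hidden size, learning rate, and iteration count are arbitrary choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])            # y = xor(x1, x2)

W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)  # tanh hidden layer
w2 = rng.normal(size=4);      b2 = 0.0          # linear output

def forward(X):
    h = np.tanh(X @ W1 + b1)
    return h, h @ w2 + b2

lr = 0.2
_, out0 = forward(X)
mse0 = ((out0 - y) ** 2).mean()                 # error before training
for _ in range(5000):
    h, out = forward(X)
    err = out - y                               # gradient of MSE w.r.t. out (up to a constant)
    gw2 = h.T @ err / len(y)
    gb2 = err.mean()
    dh = np.outer(err, w2) * (1 - h ** 2)       # backprop through tanh
    gW1 = X.T @ dh / len(y)
    gb1 = dh.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    w2 -= lr * gw2; b2 -= lr * gb2

_, out = forward(X)
mse = ((out - y) ** 2).mean()                   # error after training
```

Tracking mse over iterations reproduces the kind of error-vs-iterations curve the slide plots.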
◮ So any perplexity experiment is evaluating the model and an approximation together.
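For reference, perplexity is 2 raised to the negative mean log2-probability the model assigns to the test words. A minimal sketch with a hypothetical unigram model (the probabilities and test sequence are made up for illustration):

```python
import math

# Hypothetical unigram model; probabilities sum to 1.
p = {"the": 0.5, "cat": 0.25, "sat": 0.25}

test = ["the", "cat", "sat", "the"]
log2_prob = sum(math.log2(p[w]) for w in test)  # log2 probability of the test data
perplexity = 2 ** (-log2_prob / len(test))      # 2^(-average log2 probability)
```

Lower perplexity means the model found the test data less surprising; any approximation used to make the model tractable changes this number along with the model itself.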
[Figure: a recurrent network unrolled over time. Each d-dimensional state is computed from the previous d-dimensional state through a d × d recurrence, applied identically at every time step.]
◮ This has to be learned.
◮ Example: ℓ2-norm of Aj and Tj in the feedforward model.
◮ Individual word embeddings can be clustered, and individual dimensions can be inspected.
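Clustering embeddings can be sketched with plain-numpy k-means. The “embeddings” below are random stand-ins with planted group structure; real ones would be rows of a learned embedding matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in "embeddings": two loose groups of 5-dimensional vectors.
E = np.vstack([rng.normal(0.0, 0.1, size=(10, 5)),
               rng.normal(3.0, 0.1, size=(10, 5))])

k = 2
centers = E[rng.choice(len(E), size=k, replace=False)]  # init from data points
for _ in range(20):                                     # a few Lloyd iterations
    d2 = ((E[:, None, :] - centers[None]) ** 2).sum(-1)
    labels = d2.argmin(axis=1)                          # nearest-center assignment
    centers = np.array([E[labels == j].mean(axis=0) if (labels == j).any()
                        else centers[j] for j in range(k)])
```

With trained embeddings, words landing in the same cluster tend to share distributional behavior, which is one way to inspect what was learned.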
◮ The tth input element xt is processed alongside the previous state st−1 to compute the new state st.
◮ The tth output is a function of the state st.
◮ The same functions are applied at each iteration t.
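The recurrence above, sketched in numpy. The parameter names A, B, c, U and the sigmoid/softmax choices echo the slides' RNN language-model figure; the concrete shapes and one-hot input encoding are my assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

V, d = 6, 4                     # vocabulary size, state dimension
rng = np.random.default_rng(0)
A = rng.normal(size=(d, d))     # d x d recurrence: previous state -> new state
B = rng.normal(size=(d, V))     # input word -> state
c = np.zeros(d)
U = rng.normal(size=(V, d))     # state -> scores over the vocabulary

s = np.zeros(d)                 # initial state
outputs = []
for w in [2, 0, 5]:             # word indices of the input sequence
    x = np.eye(V)[w]                      # one-hot encoding of word w
    s = sigmoid(A @ s + B @ x + c)        # same functions applied at every step
    outputs.append(softmax(U @ s))        # distribution over the next word
```

Note that A, B, c, and U are shared across all time steps; only the state s changes.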
[Figure: recurrent neural network language model. At each step, parameters A, B, and c combine the previous state with the current V-dimensional input through a sigmoid to produce the new state; U maps the state through a softmax to a distribution over V.]
◮ At present, this requires a lot of engineering.
◮ New libraries to help you are coming out all the time.
◮ Many of them use GPUs to speed things up.