SLIDE 1
Natural Language Processing: Foundations and Applications
Lecture 11: Neural networks (2) Tim Van de Cruys & Philippe Muller 2016–2017
SLIDE 2 Introduction
Machine learning for NLP
- Standard approach: linear model trained over high-dimensional
but very sparse feature vectors
- Recently: non-linear neural networks over dense input vectors
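To make the contrast concrete, here is a minimal NumPy sketch of the two input representations; the vocabulary size, word index, and embedding dimension are illustrative assumptions, not course material:

```python
import numpy as np

vocab_size = 50_000   # hypothetical vocabulary size

# Sparse: a one-hot indicator vector, high-dimensional, almost all zeros.
sparse = np.zeros(vocab_size)
sparse[1234] = 1.0    # the feature "word #1234 is present"

# Dense: a low-dimensional embedding looked up from an embedding matrix
# (randomly initialised here; in practice it is trained).
emb_dim = 100
E = np.random.randn(vocab_size, emb_dim) * 0.01
dense = E[1234]       # 100-dimensional dense vector for the same word

print(sparse.shape, dense.shape)  # (50000,) (100,)
```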
SLIDE 3 Neural Network Architectures
Feed-forward neural networks
- Best known, standard neural network approach
- Fully connected layers
- Can be used as a drop-in replacement for typical NLP classifiers (see the sketch below)
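A minimal NumPy sketch of such a feed-forward classifier, with one fully connected hidden layer and a softmax output; all dimensions and the random initialisation are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d_in, d_hid, n_classes = 100, 50, 3
W1, b1 = np.random.randn(d_hid, d_in) * 0.01, np.zeros(d_hid)
W2, b2 = np.random.randn(n_classes, d_hid) * 0.01, np.zeros(n_classes)

x = np.random.randn(d_in)   # dense input vector (e.g. averaged embeddings)
h = np.tanh(W1 @ x + b1)    # fully connected hidden layer + non-linearity
y = softmax(W2 @ h + b2)    # probability distribution over classes
```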
SLIDE 4 Convolutional neural network
Introduction
- Type of feedforward neural network
- Certain layers are not fully connected but locally connected
(convolutional layers, pooling layers)
- the same local cues can appear in different places in the input (cf. vision)
SLIDE 5
Convolutional neural network
Intuition
SLIDE 6
Convolutional neural network
Intuition
SLIDE 7
Convolutional neural network
Intuition
SLIDE 8
Convolutional neural network
Architecture
SLIDE 9 Convolutional neural network
Encoding sentences
How to represent a variable number of features, e.g. the words in a sentence or document?
- Continuous Bag of Words (CBOW): sum the embedding vectors of the
corresponding features
- no ordering info ("not good quite bad" = "not bad quite good")
- Convolutional layer
- 'Sliding window' approach that takes local structure into account
- Combine individual windows to create vector of fixed size
SLIDE 10 Continuous bag of words
Variable number of features
- A feed-forward network assumes a fixed-dimensional input
- How to represent a variable number of features, e.g. the words in a
sentence or document?
- Continuous Bag of Words (CBOW): sum the embedding vectors of the
corresponding features (see the sketch below)
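A sketch of the CBOW encoding in NumPy; the vocabulary, word indices, and dimensions are made up for illustration. Note how any reordering of the words yields exactly the same vector:

```python
import numpy as np

emb_dim, vocab_size = 100, 50_000
E = np.random.randn(vocab_size, emb_dim) * 0.01   # embedding matrix

word_ids = [12, 407, 9001, 3]          # the words of the sentence, as indices
cbow = E[word_ids].sum(axis=0)         # fixed 100-dim vector, whatever the length

# Loss of order: any permutation of word_ids gives the same representation.
assert np.allclose(cbow, E[[3, 9001, 407, 12]].sum(axis=0))
```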
SLIDE 11 Convolutional neural network
Convolutional layer for NLP
- Goal: identify indicative local features (n-grams) in a large
structure and combine them into a fixed-size vector
- Convolution: apply a filter to each window (linear transformation
+ non-linear activation)
- Pooling: combine the window vectors by taking the maximum (max pooling), as sketched below
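A NumPy sketch of convolution plus max pooling over a sentence, under assumed sizes (window width k = 3, 64 filters); this illustrates the generic technique, not the course's exact setup:

```python
import numpy as np

emb_dim, k, n_filters = 100, 3, 64
sent = np.random.randn(10, emb_dim)            # 10 words, each a 100-dim embedding

W = np.random.randn(n_filters, k * emb_dim) * 0.01   # shared filter weights
b = np.zeros(n_filters)

windows = [sent[i:i + k].reshape(-1)           # concatenate k consecutive embeddings
           for i in range(len(sent) - k + 1)]
conv = np.tanh(np.stack(windows) @ W.T + b)    # one n_filters-dim vector per window

pooled = conv.max(axis=0)                      # max pooling: fixed-size sentence vector
print(pooled.shape)                            # (64,)
```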
SLIDE 12
Convolutional neural networks
Architecture for NLP
SLIDE 13 Neural Network Architectures
Recurrent (+ recursive) neural networks
- Handle structured data of arbitrary sizes
- Recurrent networks for sequences
- Recursive networks for trees
SLIDE 14 Recurrent neural network
Introduction
- CBOW: no ordering, no structure
- CNN: improvement, but mostly local patterns
- RNN: represent arbitrarily sized structured input as fixed-size
vectors, while paying attention to its structural properties
SLIDE 15 Recurrent neural network
Model
- x1: input layer (current word)
- a1: hidden layer of current timestep
- a0: hidden layer of previous timestep
- U, W and V: weight matrices
- f(·): element-wise activation function (sigmoid)
- g(·): softmax function to ensure probability distribution
a1 = f(U x1 + W a0)   (1)
y1 = g(V a1)          (2)
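A direct NumPy transcription of equations (1) and (2), unrolled over a short input sequence; the dimensions are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d_in, d_hid, d_out = 100, 50, 10
U = np.random.randn(d_hid, d_in) * 0.01
W = np.random.randn(d_hid, d_hid) * 0.01
V = np.random.randn(d_out, d_hid) * 0.01

xs = [np.random.randn(d_in) for _ in range(5)]   # 5 input word vectors
a = np.zeros(d_hid)                              # a0: initial hidden state
for x in xs:
    a = sigmoid(U @ x + W @ a)                   # eq. (1): new hidden state
    y = softmax(V @ a)                           # eq. (2): distribution at this step
```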
SLIDE 16
Recurrent neural network
Graphical representation
SLIDE 17 Recurrent neural network
Training
- Consider the recurrent neural network as a very deep neural network
with parameters shared across time steps
- Backpropagation through time
- What kind of supervision?
- Acceptor: prediction based on the final state
- Transducer: an output for each input (e.g. language modeling)
- Encoder-decoder: one RNN to encode the sequence into a vector
representation, another RNN to decode it into a sequence (e.g. machine translation)
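A small sketch of where supervision attaches in the first two settings; the rnn_step function and all dimensions are assumptions for illustration:

```python
import numpy as np

d_in, d_hid = 100, 50
U = np.random.randn(d_hid, d_in) * 0.01
W = np.random.randn(d_hid, d_hid) * 0.01

def rnn_step(x, a):
    return np.tanh(U @ x + W @ a)

xs = [np.random.randn(d_in) for _ in range(5)]
a = np.zeros(d_hid)
states = []
for x in xs:
    a = rnn_step(x, a)
    states.append(a)

final_state = states[-1]   # acceptor: supervise one prediction from this vector
per_step = states          # transducer: supervise one prediction per input
```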
SLIDE 18
Recurrent neural network
Training: graphical representation
SLIDE 19 Recurrent neural network
Multi-layer RNN
- Multiple layers of RNNs
- The input of the next layer is the output of the RNN layer below it
- Empirically shown to work better
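A sketch of the stacking idea in NumPy (assumed dimensions): the hidden-state sequence of one layer becomes the input sequence of the next:

```python
import numpy as np

def run_rnn(xs, U, W):
    a = np.zeros(W.shape[0])
    out = []
    for x in xs:
        a = np.tanh(U @ x + W @ a)
        out.append(a)
    return out

d_in, d_hid = 100, 50
xs = [np.random.randn(d_in) for _ in range(5)]

U1, W1 = np.random.randn(d_hid, d_in) * 0.01, np.random.randn(d_hid, d_hid) * 0.01
U2, W2 = np.random.randn(d_hid, d_hid) * 0.01, np.random.randn(d_hid, d_hid) * 0.01

layer1 = run_rnn(xs, U1, W1)       # first RNN layer over the input words
layer2 = run_rnn(layer1, U2, W2)   # second layer reads the states below it
```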
SLIDE 20 Recurrent neural network
Bi-directional RNN
- Feed the input sequence both forward and backward to two different RNNs
- The representation is the concatenation of the forward and backward states
(A & A')
- Represents both history and future
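A NumPy sketch of the bi-directional scheme under assumed dimensions; A and A' on the slide correspond to fwd and bwd below:

```python
import numpy as np

def run_rnn(xs, U, W):
    a = np.zeros(W.shape[0])
    out = []
    for x in xs:
        a = np.tanh(U @ x + W @ a)
        out.append(a)
    return out

d_in, d_hid = 100, 50
xs = [np.random.randn(d_in) for _ in range(5)]

Uf, Wf = np.random.randn(d_hid, d_in) * 0.01, np.random.randn(d_hid, d_hid) * 0.01
Ub, Wb = np.random.randn(d_hid, d_in) * 0.01, np.random.randn(d_hid, d_hid) * 0.01

fwd = run_rnn(xs, Uf, Wf)                    # history: left-to-right states
bwd = run_rnn(xs[::-1], Ub, Wb)[::-1]        # future: right-to-left, re-aligned

bi = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]  # 100-dim per position
```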
SLIDE 21
Concrete RNN architectures
Simple RNN
SLIDE 22 Concrete RNN architectures
LSTM
- Long short-term memory networks
- In practice, simple RNNs are only able to remember a narrow context
(vanishing gradients)
- LSTM: a more complex architecture able to capture long-term
dependencies (see the cell sketch below)
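A minimal LSTM cell in NumPy, following the standard formulation; gate names and dimensions are illustrative and may differ from the course's own notation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d_in, d_hid = 100, 50
rng = np.random.default_rng(0)
# one weight matrix per gate, acting on the concatenation [x; h_prev]
Wi, Wf, Wo, Wg = (rng.normal(0, 0.01, (d_hid, d_in + d_hid)) for _ in range(4))
bi = bf = bo = bg = np.zeros(d_hid)

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    i = sigmoid(Wi @ z + bi)          # input gate: what to write
    f = sigmoid(Wf @ z + bf)          # forget gate: what to keep from c_prev
    o = sigmoid(Wo @ z + bo)          # output gate: what to expose
    g = np.tanh(Wg @ z + bg)          # candidate values
    c = f * c_prev + i * g            # updated memory cell
    h = o * np.tanh(c)                # new hidden state
    return h, c

h, c = np.zeros(d_hid), np.zeros(d_hid)
for x in (rng.normal(size=d_in) for _ in range(5)):
    h, c = lstm_step(x, h, c)
```

The additive update of the memory cell c is what lets gradients flow over long distances instead of vanishing.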
SLIDE 23
Concrete RNN architectures
LSTM
SLIDE 24
Concrete RNN architectures
LSTM
SLIDE 25
Concrete RNN architectures
LSTM
SLIDE 26
Concrete RNN architectures
LSTM
SLIDE 27
Concrete RNN architectures
LSTM
SLIDE 28
Concrete RNN architectures
LSTM
SLIDE 29 Concrete RNN architectures
GRU
- LSTM: effective, but complex and computationally expensive
- GRU: a cheaper alternative that works well in practice
SLIDE 30 Concrete RNN architectures
GRU
- reset gate (r): how much information from the previous hidden
state needs to be included (reset with the current information?)
- update gate (z): controls updates to the hidden state (how much
does the hidden state need to be updated with the current information?)
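A minimal GRU cell in NumPy under one common formulation (the direction of the update interpolation varies between papers); names and dimensions are illustrative:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

d_in, d_hid = 100, 50
rng = np.random.default_rng(0)
Wr, Wz, Wh = (rng.normal(0, 0.01, (d_hid, d_in)) for _ in range(3))
Ur, Uz, Uh = (rng.normal(0, 0.01, (d_hid, d_hid)) for _ in range(3))

def gru_step(x, h_prev):
    r = sigmoid(Wr @ x + Ur @ h_prev)             # reset gate
    z = sigmoid(Wz @ x + Uz @ h_prev)             # update gate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h_prev))  # candidate, with reset history
    return (1 - z) * h_prev + z * h_cand          # interpolate old and new state
```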
SLIDE 31 Recursive neural networks
Introduction
- Generalization of RNNs from sequences to (binary) trees
- A linear transformation + non-linear activation function applied
recursively throughout a tree (sketched below)
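A NumPy sketch of the recursive composition over a binary tree; the nested-tuple tree encoding and all dimensions are illustrative assumptions:

```python
import numpy as np

d = 100
rng = np.random.default_rng(0)
W = rng.normal(0, 0.01, (d, 2 * d))
b = np.zeros(d)

def compose(left, right):
    # same linear transformation + non-linearity at every tree node
    return np.tanh(W @ np.concatenate([left, right]) + b)

def encode(tree):
    # a tree is either a leaf (a word embedding) or a pair of subtrees
    if isinstance(tree, tuple):
        return compose(encode(tree[0]), encode(tree[1]))
    return tree

w1, w2, w3 = (rng.normal(size=d) for _ in range(3))
root = encode(((w1, w2), w3))   # fixed-size vector for the whole tree
```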
SLIDE 32
Application
Image to caption generation
SLIDE 33
Application
Image to caption generation
SLIDE 34
Application
Neural machine translation
SLIDE 35
Application
Neural machine translation
SLIDE 36
Application
Neural dialogue generation (chatbot)
SLIDE 37 Software
- TensorFlow
- Python, C++
- http://www.tensorflow.org
- Theano
- Python
- http://deeplearning.net/software/theano/
- Keras
- Theano/TensorFlow-based modular deep learning library
- Lasagne
- Theano-based deep learning library