Automatic Speech Recognition (CS753)
Lecture 11: Recurrent Neural Network (RNN) Models for ASR
Instructor: Preethi Jyothi
Feb 9, 2017
Recap: Hybrid DNN-HMM Systems
- Instead of GMMs, use scaled DNN posteriors as the HMM observation probabilities
- DNN trained using triphone labels derived from a forced alignment “Viterbi” step
- Forced alignment: Given a training utterance {O, W}, find the most likely sequence of states (and hence triphone state labels) using a set of trained triphone HMM models, M. Here M is constrained by the triphones in W.
[Figure: DNN takes a fixed window of 5 speech frames (39 features per frame) as input and outputs triphone state labels (DNN posteriors)]
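Since the DNN is trained to output posteriors Pr(state | frame), using them as HMM observation probabilities requires dividing out the state priors. A minimal numpy sketch of this scaling (function and variable names, and the eps smoothing, are illustrative, not from the lecture):

```python
import numpy as np

# Convert DNN posteriors Pr(s | o_t) into scaled log-likelihoods:
# log Pr(o_t | s) = log Pr(s | o_t) - log Pr(s) + const.
def scaled_log_likelihoods(dnn_posteriors, state_priors, eps=1e-10):
    # dnn_posteriors: (num_frames, num_states) softmax outputs of the DNN
    # state_priors:   (num_states,) relative frequencies of triphone state
    #                 labels in the forced-alignment training data
    return np.log(dnn_posteriors + eps) - np.log(state_priors + eps)
```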
Recap: Tandem DNN-HMM Systems
- Neural network outputs are used as “features” to train HMM-GMM models
- Use a network with a low-dimensional bottleneck layer and extract features from the bottleneck layer
[Figure: DNN with an input layer, a low-dimensional bottleneck layer and an output layer]
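As a rough sketch, tandem feature extraction just runs each frame through the trained network up to the bottleneck layer; the layer list, tanh activations and names below are assumptions for illustration:

```python
import numpy as np

# Extract tandem features: forward through the DNN up to and including
# the low-dimensional bottleneck layer, and return those activations.
def bottleneck_features(frames, layers, bottleneck_index):
    h = frames                                  # (num_frames, input_dim)
    for W, b in layers[:bottleneck_index + 1]:  # (weight, bias) per layer
        h = np.tanh(h @ W + b)
    return h                                    # low-dimensional features
```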
Feedforward DNNs we’ve seen so far…
- Assume independence among the training instances
- Independent decision made about classifying each individual speech frame
- Network state is completely reset after each speech frame is processed
- This independence assumption fails for data like speech, which has temporal and sequential structure
Recurrent Neural Networks
- Recurrent Neural Networks (RNNs) work naturally with sequential data and process it one element at a time
- HMMs also similarly attempt to model time dependencies. How is it different?
- HMMs are limited by the size of the state space. Inference becomes intractable if the state space grows very large!
- What about RNNs?
RNN definition
Two main equations govern RNNs:

[Figure: an RNN cell with functions H, O, input xt, output yt and hidden state ht, unfolded over time steps t = 1, 2, 3, …]

ht = H(W xt + V ht−1 + b(h))
yt = O(U ht + b(y))

where W, V, U are matrices of input-hidden weights, hidden-hidden weights and hidden-output weights respectively; b(h) and b(y) are bias vectors
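A minimal numpy sketch of these two equations, taking H = tanh and O = softmax (as in the vanilla RNN model later in this lecture); all sizes and names are illustrative:

```python
import numpy as np

# One forward pass of a vanilla RNN over an input sequence xs.
def rnn_forward(xs, W, V, U, b_h, b_y):
    h = np.zeros(V.shape[0])                # h_0: initial hidden state
    ys = []
    for x_t in xs:                          # one element at a time
        h = np.tanh(W @ x_t + V @ h + b_h)  # h_t = H(W x_t + V h_{t-1} + b(h))
        z = U @ h + b_y
        e = np.exp(z - z.max())             # numerically stable softmax
        ys.append(e / e.sum())              # y_t = O(U h_t + b(y))
    return ys
```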
Recurrent Neural Networks
- What about RNNs? Unlike HMMs, RNNs are designed to capture long-range dependencies: the number of distinct network states is exponential in the number of nodes in a hidden layer
Training RNNs
- An unrolled RNN is just a very deep feedforward network
- For a given input sequence:
  - create the unrolled network
  - add a loss function node to the network
  - then, use backpropagation to compute the gradients
- This algorithm is known as backpropagation through time (BPTT)
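A sketch of BPTT for the vanilla rnn_forward above, assuming a squared-error loss at every time step (the loss choice and all names are illustrative, not from the lecture):

```python
import numpy as np

# BPTT: forward through the unrolled network, then walk it in reverse,
# accumulating gradients for the shared parameters at every time step.
def bptt(xs, targets, W, V, U, b_h, b_y):
    T, h_dim = len(xs), V.shape[0]
    hs, ys = {-1: np.zeros(h_dim)}, {}
    # Forward: unroll the network over the whole input sequence
    for t in range(T):
        hs[t] = np.tanh(W @ xs[t] + V @ hs[t - 1] + b_h)
        ys[t] = U @ hs[t] + b_y
    loss = sum(0.5 * np.sum((ys[t] - targets[t]) ** 2) for t in range(T))
    # Backward: gradients flow both to the output and to earlier time steps
    dW, dV, dU = np.zeros_like(W), np.zeros_like(V), np.zeros_like(U)
    db_h, db_y = np.zeros_like(b_h), np.zeros_like(b_y)
    dh_next = np.zeros(h_dim)
    for t in reversed(range(T)):
        dy = ys[t] - targets[t]                  # dLoss/dy_t
        dU += np.outer(dy, hs[t]); db_y += dy
        dh = U.T @ dy + dh_next                  # from output and from step t+1
        dz = (1 - hs[t] ** 2) * dh               # backprop through tanh
        dW += np.outer(dz, xs[t]); dV += np.outer(dz, hs[t - 1]); db_h += dz
        dh_next = V.T @ dz                       # gradient flowing to step t-1
    return loss, (dW, dV, dU, db_h, db_y)
```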
Deep RNNs
- RNNs can be stacked in layers to form deep RNNs
- Empirically shown to perform better than shallow RNNs on ASR [G13]

[Figure: a two-layer deep RNN unrolled over three time steps; the first-layer hidden states ht,1 feed a second layer with hidden states ht,2]
[G13] A. Graves, A. Mohamed, G. Hinton, “Speech Recognition with Deep Recurrent Neural Networks”, ICASSP, 2013.
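A sketch of the stacking idea, reusing rnn_forward from earlier; note that in practice it is usually the hidden states, not the softmax outputs, that feed the next layer, so this is only illustrative:

```python
# Deep (stacked) RNN: each layer's output sequence becomes the input
# sequence of the next layer. `layer_params` is an illustrative list of
# (W, V, U, b_h, b_y) tuples, one per layer.
def deep_rnn_forward(xs, layer_params):
    seq = xs
    for params in layer_params:
        seq = rnn_forward(seq, *params)  # this layer's outputs feed the next
    return seq
```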
Vanilla RNN Model
ht = H(W xt + V ht−1 + b(h))
yt = O(U ht + b(y))

H: element-wise application of the sigmoid or tanh function
O: the softmax function

Vanilla RNNs run into problems of exploding and vanishing gradients.
Exploding/Vanishing Gradients
- In deep networks, gradients in early layers are computed as the product of terms from all the later layers
- This leads to unstable gradients:
  - If the terms in later layers are large enough, gradients in early layers (which are products of these terms) can grow exponentially large: Exploding gradients
  - If the terms in later layers are small, gradients in early layers will tend to exponentially decrease: Vanishing gradients
- To address this problem in RNNs, Long Short-Term Memory (LSTM) units were proposed [HS97]

[HS97] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory”, Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
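A toy numpy demonstration of the instability (the sizes, the scaled orthogonal recurrent matrix and the 50 steps are all illustrative): backpropagating through T time steps multiplies the gradient by the recurrent matrix's transpose T times, so its norm collapses or explodes depending on the matrix's scale.

```python
import numpy as np

np.random.seed(0)
for scale in (0.5, 1.5):
    # Recurrent matrix: a random orthogonal matrix scaled below/above 1
    V = scale * np.linalg.qr(np.random.randn(20, 20))[0]
    g = np.random.randn(20)      # gradient arriving at the last time step
    for _ in range(50):
        g = V.T @ g              # one backprop step through the recurrence
    print(f"scale={scale}: |grad| after 50 steps = {np.linalg.norm(g):.3g}")
    # scale=0.5 -> norm shrinks toward 0 (vanishing);
    # scale=1.5 -> norm blows up (exploding)
```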
Long Short Term Memory Cells
- Memory cell: Neuron that stores information over long time periods
- Forget gate: When on, the memory cell retains its previous contents. Otherwise, the memory cell forgets its contents.
- When the input gate is on, write into the memory cell
- When the output gate is on, read from the memory cell
[Figure: LSTM cell with input, output and forget gates (⊗ denotes gating) controlling writes to, reads from and retention of the memory cell]
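A minimal sketch of one LSTM step implementing these gates, in the standard formulation; the weight layout (each matrix acting on the concatenation [h, x]) is one common convention, assumed here for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# One step of an LSTM cell with input, forget and output gates.
def lstm_step(x_t, h_prev, c_prev, Wi, Wf, Wo, Wc, bi, bf, bo, bc):
    z = np.concatenate([h_prev, x_t])
    i = sigmoid(Wi @ z + bi)    # input gate: controls writes to the cell
    f = sigmoid(Wf @ z + bf)    # forget gate: controls retention of contents
    o = sigmoid(Wo @ z + bo)    # output gate: controls reads from the cell
    c = f * c_prev + i * np.tanh(Wc @ z + bc)   # memory cell update
    h = o * np.tanh(c)          # hidden state exposed to the rest of the net
    return h, c
```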
Bidirectional RNNs
- BiRNNs process the data in both directions with two separate hidden layers
- Outputs from both hidden layers are concatenated at each position
[Figure: a bidirectional RNN over the input “hello world .” with a forward layer (Hf, Of; hidden states h0,f … h3,f) and a backward layer (Hb, Ob; hidden states h3,b … h0,b); the forward and backward outputs yt,f and yt,b are concatenated at each position]
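A sketch of a bidirectional pass built from two vanilla RNNs (rnn_forward as defined earlier; the parameter tuples are illustrative):

```python
import numpy as np

# Bidirectional RNN: run one RNN left-to-right and another right-to-left,
# then concatenate their outputs at each position.
def birnn_forward(xs, fwd_params, bwd_params):
    ys_f = rnn_forward(xs, *fwd_params)              # forward layer
    ys_b = rnn_forward(xs[::-1], *bwd_params)[::-1]  # backward layer, re-aligned
    return [np.concatenate([yf, yb]) for yf, yb in zip(ys_f, ys_b)]
```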
RNN-based ASR system
ASR with RNNs
- Neural networks in ASR systems are typically a single component (the acoustic model) in a complex pipeline
- Limitations:
  1. Frame-level training targets derived from HMM-based alignments
  2. Objective function optimized in NNs is very different from the final evaluation metric
- Goal: Single RNN model that addresses these issues and replaces as much of the speech pipeline as possible [G14]
[G14] A. Graves, N. Jaitly, “Towards End-to-End Speech Recognition with Recurrent Neural Networks”, ICML, 2014.
RNN Architecture
- H was implemented using LSTMs in [G14]. Input: acoustic feature vectors, one per frame; Output: characters + space
- Deep bidirectional LSTM networks were used
[Figure: a bidirectional LSTM unrolled over input frames xt−1, xt, xt+1, with forward (Hf, Of) and backward (Hb, Ob) layers whose outputs combine to give yt−1, yt, yt+1]
Connectionist Temporal Classification (CTC)
- For an input sequence x of length T, Eqn (1) gives the probability of an output transcription y; a is a CTC alignment of y, and B is the function that collapses an alignment into a transcription by removing repeated labels and then blanks
- Given a target transcription y*, the CTC objective function to be minimised is given in Eqn (2)
- Modify the loss function as shown in Eqn (3) to be a better match to the final test criteria; here, L(x, y) is a transcription loss function
- L(x) needs to be minimised: use a Monte Carlo sampling-based algorithm
Pr(y|x) = Σ_{a ∈ B⁻¹(y)} Pr(a|x), where Pr(a|x) = ∏_{t=1…T} Pr(at, t|x)    … (1)

For a target y∗, CTC(x) = −log Pr(y∗|x)    … (2)

L(x) = Σ_y Pr(y|x) L(x, y) = Σ_a Pr(a|x) L(x, B(a))    … (3)
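To make Eqn (1) concrete, here is a brute-force sketch that literally enumerates all alignments; this is only feasible for tiny T, and real CTC training replaces it with a forward-backward dynamic program. Treating label 0 as the blank is an assumption for illustration:

```python
import itertools
import numpy as np

BLANK = 0  # illustrative choice of blank label index

def collapse(a):
    """B: merge repeated labels, then delete blanks."""
    merged = [k for k, _ in itertools.groupby(a)]
    return tuple(k for k in merged if k != BLANK)

# Eqn (1) by brute force: sum Pr(a|x) over all alignments a in B^{-1}(y).
def ctc_prob(y, probs):
    # probs: (T, num_labels) per-frame output distributions Pr(a_t, t | x)
    T, num_labels = probs.shape
    total = 0.0
    for a in itertools.product(range(num_labels), repeat=T):
        if collapse(a) == tuple(y):
            total += np.prod([probs[t, a[t]] for t in range(T)])
    return total
```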
Decoding
- First approximation: For a given test input sequence x, pick the most probable output at each time step
- More accurate decoding uses a search algorithm that also makes use of a dictionary and a language model. (Decoding search algorithms will be discussed in detail in later lectures.)
arg max_y Pr(y|x) ≈ B(arg max_a Pr(a|x))
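A sketch of this first approximation, reusing the collapse function B from the CTC sketch above:

```python
import numpy as np

# Greedy decoding: most probable output per time step, collapsed by B.
def greedy_decode(probs):                 # probs: (T, num_labels)
    best_path = np.argmax(probs, axis=1)  # arg max_a Pr(a|x), frame by frame
    return collapse(best_path)            # B(arg max_a Pr(a|x))
```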
WER results
System     LM               WER
RNN-CTC    Dictionary only  24.0
RNN-CTC    Bigram           10.4
RNN-CTC    Trigram           8.7
RNN-WER    Dictionary only  21.9
RNN-WER    Bigram            9.8
RNN-WER    Trigram           8.2
Baseline   Bigram            9.4
Baseline   Trigram           7.8
[G14] A. Graves, N. Jaitly, “Towards End-to-End Speech Recognition with Recurrent Neural Networks”, ICML, 2014.