Automatic Speech Recognition (CS753) - PowerPoint PPT Presentation



SLIDE 1

Automatic Speech Recognition (CS753)

Lecture 11: Recurrent Neural Network (RNN) Models for ASR

Instructor: Preethi Jyothi
Feb 9, 2017

SLIDE 2

Recap: Hybrid DNN-HMM Systems

  • Instead of GMMs, use scaled DNN posteriors as the HMM observation probabilities
  • DNN trained using triphone labels derived from a forced alignment “Viterbi” step
  • Forced alignment: Given a training utterance {O, W}, find the most likely sequence of states (and hence triphone state labels) using a set of trained triphone HMM models, M. Here M is constrained by the triphones in W. (A toy alignment sketch follows the figure note below.)

[Figure: DNN acoustic model whose input is a fixed window of 5 speech frames (39 features in one frame) and whose outputs are triphone state labels (DNN posteriors)]
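As a toy illustration of the forced-alignment step, the sketch below aligns frames to a linear chain of states with a simple Viterbi-style dynamic program. It is only a sketch: the state chain stands in for the triphone states implied by W, and the per-frame scores are random placeholders rather than real HMM log-likelihoods.

```python
import numpy as np

# Toy forced alignment: frames move monotonically through a linear chain of
# states (stay in the current state or advance by one). Scores are random
# stand-ins for log p(o_t | state); a real system would use trained HMMs (M).
rng = np.random.default_rng(0)
T, S = 8, 3                                    # number of frames, number of states
loglik = rng.normal(size=(T, S))

dp = np.full((T, S), -np.inf)                  # dp[t, s]: best score ending in state s at frame t
back = np.zeros((T, S), dtype=int)             # back-pointer to the chosen predecessor state
dp[0, 0] = loglik[0, 0]
for t in range(1, T):
    for s in range(S):
        stay = dp[t - 1, s]
        advance = dp[t - 1, s - 1] if s > 0 else -np.inf
        back[t, s] = s if stay >= advance else s - 1
        dp[t, s] = max(stay, advance) + loglik[t, s]

# Backtrace from the final state: one state label per frame.
align, s = [], S - 1
for t in range(T - 1, -1, -1):
    align.append(s)
    s = back[t, s]
align.reverse()
print(align)                                   # e.g. a non-decreasing sequence such as [0, 0, 1, 1, 1, 2, 2, 2]
```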

SLIDE 3

Recap: Tandem DNN-HMM Systems

  • Neural network outputs are used as “features” to train HMM-GMM models
  • Use a low-dimensional bottleneck layer representation: extract features from the bottleneck layer


[Figure: network with an input layer, a low-dimensional bottleneck layer, and an output layer]

SLIDE 4

Feedforward DNNs we’ve seen so far…

  • Assume independence among the training instances
  • Independent decision made about classifying each individual speech frame
  • Network state is completely reset after each speech frame is processed
  • This independence assumption fails for data like speech, which has temporal and sequential structure

SLIDE 5

Recurrent Neural Networks

  • Recurrent Neural Networks (RNNs) work naturally with sequential data and process it one element at a time
  • HMMs also attempt to model time dependencies. How are RNNs different?
  • HMMs are limited by the size of the state space. Inference becomes intractable if the state space grows very large!
  • What about RNNs?
SLIDE 6

RNN definition

Two main equations govern RNNs:

[Figure: RNN cell with input x_t, hidden state h_t and output y_t, unfolded over time into copies (x_1, h_0) → (y_1, h_1), (x_2, h_1) → (y_2, h_2), (x_3, h_2) → (y_3, h_3), …]

  h_t = H(W x_t + V h_(t-1) + b^(h))
  y_t = O(U h_t + b^(y))

where W, V, U are matrices of input-hidden weights, hidden-hidden weights and hidden-output weights respectively; b^(h) and b^(y) are bias vectors.
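A minimal sketch of these two equations, with made-up sizes, weights and inputs, and with H = tanh and O = softmax as one common choice (the equations above leave H and O abstract):

```python
import numpy as np

# Unrolled forward pass of the vanilla RNN:
#   h_t = H(W x_t + V h_{t-1} + b^(h)),  y_t = O(U h_t + b^(y))
rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 3, 4, 2, 5
W = rng.normal(scale=0.1, size=(d_h, d_in))    # input-hidden weights
V = rng.normal(scale=0.1, size=(d_h, d_h))     # hidden-hidden weights
U = rng.normal(scale=0.1, size=(d_out, d_h))   # hidden-output weights
b_h, b_y = np.zeros(d_h), np.zeros(d_out)      # bias vectors b^(h), b^(y)
xs = rng.normal(size=(T, d_in))                # toy input sequence

h = np.zeros(d_h)                              # h_0
for t in range(T):
    h = np.tanh(W @ xs[t] + V @ h + b_h)       # H applied element-wise
    z = U @ h + b_y
    y = np.exp(z - z.max()) / np.exp(z - z.max()).sum()   # O = softmax
    print(t, y)
```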

SLIDE 7

Recurrent Neural Networks

  • Recurrent Neural Networks (RNNs) work naturally with sequential data and process it one element at a time
  • HMMs also attempt to model time dependencies. How are RNNs different?
  • HMMs are limited by the size of the state space. Inference becomes intractable if the state space grows very large!
  • What about RNNs? RNNs are designed to capture long-range dependencies, unlike HMMs: the number of distinct network states is exponential in the number of nodes in a hidden layer

SLIDE 8

Training RNNs

  • An unrolled RNN is just a very deep feedforward network
  • For a given input sequence:
  • create the unrolled network
  • add a loss function node to the network
  • then, use backpropagation to compute the gradients
  • This algorithm is known as backpropagation through time (BPTT); a minimal sketch is given below
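A minimal BPTT sketch for the vanilla RNN of the earlier slides. The choices here (biases omitted, H = tanh, a linear output with squared-error loss, random toy data) are assumptions made only for the illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, T = 3, 4, 6
W = rng.normal(scale=0.1, size=(d_h, d_in))   # input-hidden weights
V = rng.normal(scale=0.1, size=(d_h, d_h))    # hidden-hidden weights
U = rng.normal(scale=0.1, size=(1, d_h))      # hidden-output weights
xs, targets = rng.normal(size=(T, d_in)), rng.normal(size=(T, 1))

# Forward: unroll the network over the T steps and attach a loss to each output.
hs, ys = [np.zeros(d_h)], []
for t in range(T):
    hs.append(np.tanh(W @ xs[t] + V @ hs[-1]))
    ys.append(U @ hs[-1])
loss = 0.5 * sum(float((y - tgt) @ (y - tgt)) for y, tgt in zip(ys, targets))

# Backward: ordinary backpropagation on the unrolled network (= BPTT).
dW, dV, dU = np.zeros_like(W), np.zeros_like(V), np.zeros_like(U)
dh_next = np.zeros(d_h)                       # gradient arriving from step t+1
for t in reversed(range(T)):
    dy = ys[t] - targets[t]                   # dLoss/dy_t
    dU += np.outer(dy, hs[t + 1])
    dh = U.T @ dy + dh_next                   # total gradient w.r.t. h_t
    dpre = (1.0 - hs[t + 1] ** 2) * dh        # back through tanh
    dW += np.outer(dpre, xs[t])
    dV += np.outer(dpre, hs[t])               # hs[t] is h_{t-1}
    dh_next = V.T @ dpre                      # pass gradient back to step t-1
print(loss)
```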


SLIDE 9

Deep RNNs

  • RNNs can be stacked in layers to form deep RNNs
  • Empirically shown to perform better than shallow RNNs on ASR [G13] (a stacking sketch follows the figure note below)

[Figure: two-layer deep RNN unfolded over three time steps x_1, x_2, x_3; the first layer’s hidden states h_t,1 are fed as inputs to the second layer’s cells with hidden states h_t,2, which produce the outputs y_t]

[G13] A. Graves, A . Mohamed, G. Hinton, “Speech Recognition with Deep Recurrent Neural Networks”, ICASSP, 2013.
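A sketch of the stacking idea, assuming plain tanh RNN layers rather than the LSTM layers used in [G13]; each layer reads the hidden-state sequence produced by the layer below:

```python
import numpy as np

def deep_rnn(xs, layer_params):
    """Stacked RNN: layer l's input sequence is layer l-1's hidden-state sequence."""
    seq = xs
    for W, V in layer_params:                  # one (input, recurrent) weight pair per layer
        h, out = np.zeros(V.shape[0]), []
        for x in seq:
            h = np.tanh(W @ x + V @ h)
            out.append(h)
        seq = out                              # feed this layer's states upward
    return seq

# Toy usage: 3 frames of 5-dim input through two stacked layers of width 4.
rng = np.random.default_rng(0)
xs = list(rng.normal(size=(3, 5)))
params = [(rng.normal(scale=0.1, size=(4, 5)), rng.normal(scale=0.1, size=(4, 4))),
          (rng.normal(scale=0.1, size=(4, 4)), rng.normal(scale=0.1, size=(4, 4)))]
print(len(deep_rnn(xs, params)), deep_rnn(xs, params)[0].shape)   # 3 states of size 4
```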

SLIDE 10

Vanilla RNN Model

h_t = H(W x_t + V h_(t-1) + b^(h))
y_t = O(U h_t + b^(y))

H: element-wise application of the sigmoid or tanh function
O: the softmax function

Vanilla RNNs run into problems of exploding and vanishing gradients.

SLIDE 11

Exploding/Vanishing Gradients

  • In deep networks, gradients in early layers are computed as the product of terms from all the later layers
  • This leads to unstable gradients:
  • If the terms in later layers are large enough, gradients in early layers (which are the product of these terms) can grow exponentially large: exploding gradients
  • If the terms in later layers are small, gradients in early layers will tend to decrease exponentially: vanishing gradients (a toy demo follows below)
  • To address this problem in RNNs, Long Short-Term Memory (LSTM) units were proposed [HS97]

[HS97] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
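A toy numerical illustration (not from the slides): during BPTT the gradient is multiplied at every step by roughly Vᵀ·diag(1 − h_t²), so its norm shrinks or grows geometrically with the scale of the recurrent weights V.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, T = 50, 30
h = np.tanh(rng.normal(size=d_h))              # a fixed hidden state, for simplicity
for scale in (0.05, 0.3):                      # small vs. large recurrent weights
    V = rng.normal(scale=scale, size=(d_h, d_h))
    grad = np.ones(d_h)
    for _ in range(T):
        grad = V.T @ ((1.0 - h ** 2) * grad)   # one backward step through the recurrence
    print(scale, np.linalg.norm(grad))         # tiny (vanishing) vs. huge (exploding)
```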

SLIDE 12

Long Short Term Memory Cells

  • Memory cell: neuron that stores information over long time periods
  • Forget gate: when on, the memory cell retains its previous contents; otherwise, the memory cell forgets its contents
  • When the input gate is on, write into the memory cell
  • When the output gate is on, read from the memory cell (a one-step sketch follows the figure note below)

[Figure: LSTM memory cell with input gate, output gate and forget gate; ⊗ marks the multiplicative gating points]
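A one-step sketch of the gating just described. Biases and peephole connections are omitted, and the weight shapes and concatenated input are one common formulation assumed here, not necessarily the exact variant in [HS97]:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wf, Wi, Wo, Wc):
    z = np.concatenate([x, h_prev])            # gates look at the input and previous hidden state
    f = sigmoid(Wf @ z)                        # forget gate: keep vs. erase the cell contents
    i = sigmoid(Wi @ z)                        # input gate: how much new content to write
    o = sigmoid(Wo @ z)                        # output gate: how much of the cell to read out
    c = f * c_prev + i * np.tanh(Wc @ z)       # memory cell update
    h = o * np.tanh(c)                         # hidden state exposed to the rest of the network
    return h, c

# Toy usage with made-up sizes: 3-dim input, 4-dim hidden/cell state.
rng = np.random.default_rng(0)
Ws = [rng.normal(scale=0.1, size=(4, 7)) for _ in range(4)]
h, c = lstm_step(rng.normal(size=3), np.zeros(4), np.zeros(4), *Ws)
print(h.shape, c.shape)                        # (4,) (4,)
```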

SLIDE 13

Bidirectional RNNs

  • BiRNNs process the data in both directions with two separate hidden layers
  • Outputs from both hidden layers are concatenated at each position (a sketch follows the figure note below)

[Figure: bidirectional RNN over the input “hello world .”; a forward layer (H_f, O_f) runs left to right over hidden states h_0,f … h_3,f, a backward layer (H_b, O_b) runs right to left over hidden states h_3,b … h_0,b, and the forward and backward outputs y_t,f and y_t,b are concatenated at each position]
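A minimal sketch of the bidirectional idea with plain tanh layers; the sizes and weights are hypothetical, and a real BiRNN would also apply output weights on top of the concatenated states:

```python
import numpy as np

def birnn(xs, Wf, Vf, Wb, Vb, d_h):
    """Run one RNN left-to-right and another right-to-left; concatenate per position."""
    fwd, bwd = [], []
    h = np.zeros(d_h)
    for x in xs:                               # forward layer
        h = np.tanh(Wf @ x + Vf @ h)
        fwd.append(h)
    h = np.zeros(d_h)
    for x in reversed(xs):                     # backward layer
        h = np.tanh(Wb @ x + Vb @ h)
        bwd.append(h)
    bwd.reverse()                              # realign backward states with positions
    return [np.concatenate(pair) for pair in zip(fwd, bwd)]

# Toy usage: 3 positions ("hello", "world", "."), 5-dim features, hidden size 4.
rng = np.random.default_rng(0)
xs = list(rng.normal(size=(3, 5)))
Wf, Wb = rng.normal(scale=0.1, size=(4, 5)), rng.normal(scale=0.1, size=(4, 5))
Vf, Vb = rng.normal(scale=0.1, size=(4, 4)), rng.normal(scale=0.1, size=(4, 4))
print(birnn(xs, Wf, Vf, Wb, Vb, 4)[0].shape)   # (8,): forward + backward states concatenated
```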

SLIDE 14

RNN-based ASR system

CS 753, Feb 9, 2017

SLIDE 15

ASR with RNNs

  • Neural networks in ASR systems are typically a single component (aka acoustic models) in a complex pipeline
  • Limitations:
  • 1. Frame-level training targets derived from HMM-based alignments
  • 2. Objective function optimized in NNs is very different from the final evaluation metric
  • Goal: a single RNN model that addresses these issues and replaces as much of the speech pipeline as possible [G14]

[G14] A. Graves, N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks”, ICML, 2014.

SLIDE 16

RNN Architecture

  • H was implemented using LSTMs in [G14]. Input: acoustic feature vectors, one per frame; Output: characters + space
  • Deep bidirectional LSTM networks were used

[Figure: deep bidirectional LSTM unfolded over inputs x_(t-1), x_t, x_(t+1); forward hidden states (H_f, O_f) and backward hidden states (H_b, O_b) are combined to produce the outputs y_(t-1), y_t, y_(t+1)]

[G14] A. Graves, N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks”, ICML, 2014.

SLIDE 17

Connectionist Temporal Classification (CTC)

  • For an input sequence x of length T, Eqn (1) gives the probability of an output transcription y; a is a CTC alignment of y (a brute-force illustration follows the equations below)
  • Given a target transcription y*, the CTC objective function to be minimised is given in Eqn (2)
  • Modify the loss function as shown in Eqn (3) to be a better match to the final test criteria; here, L(x, y) is a transcription loss function
  • The expected loss L(x) needs to be minimised: use a Monte Carlo sampling-based algorithm

  Pr(y|x) = Σ_{a ∈ B⁻¹(y)} Pr(a|x),  where Pr(a|x) = Π_{t=1..T} Pr(a_t, t|x)    … (1)

  For a target y*:  CTC(x) = −log Pr(y*|x)    … (2)

  L(x) = Σ_y Pr(y|x) L(x, y) = Σ_a Pr(a|x) L(x, B(a))    … (3)
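A brute-force illustration of Eqn (1) and Eqn (2) on a tiny made-up example; real CTC computes the sum with a forward-backward dynamic program, and the label set and per-frame probabilities below are invented:

```python
import itertools
import numpy as np

labels = ["-", "a", "b"]                       # "-" is the CTC blank symbol
T = 4
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(len(labels)), size=T)   # toy Pr(a_t | x), one row per frame

def collapse(alignment):
    """The map B: drop repeated symbols, then drop blanks."""
    return "".join(k for k, _ in itertools.groupby(alignment) if k != "-")

def pr_y_given_x(target):
    """Eqn (1): sum Pr(a | x) over all alignments a of length T with B(a) = target."""
    total = 0.0
    for idxs in itertools.product(range(len(labels)), repeat=T):
        if collapse([labels[i] for i in idxs]) == target:
            total += np.prod([probs[t, i] for t, i in enumerate(idxs)])
    return total

print(-np.log(pr_y_given_x("ab")))             # Eqn (2): CTC loss for the target y* = "ab"
```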

SLIDE 18

Decoding

  • First approximation: for a given test input sequence x, pick the most probable output at each time step (a toy example follows the equation below)
  • More accurate decoding uses a search algorithm that also makes use of a dictionary and a language model. (Decoding search algorithms will be discussed in detail in later lectures.)

  arg max_y Pr(y|x) ≈ B(arg max_a Pr(a|x))
SLIDE 19

WER results

System     LM               WER
RNN-CTC    Dictionary only  24.0
RNN-CTC    Bigram           10.4
RNN-CTC    Trigram           8.7
RNN-WER    Dictionary only  21.9
RNN-WER    Bigram            9.8
RNN-WER    Trigram           8.2
Baseline   Bigram            9.4
Baseline   Trigram           7.8

[G14] A. Graves, N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks”, ICML, 2014.

SLIDE 20

Some erroneous examples produced by the end-to-end RNN

Target: “There’s unrest but we’re not going to lose them to Dukakis”
Output: “There’s unrest but we’re not going to lose them to Dekakis”

Target: “T. W. A. also plans to hang its boutique shingle in airports at Lambert Saint”
Output: “T. W. A. also plans tohing its bootik single in airports at Lambert Saint”