Sequence to Sequence models: Connectionist Temporal Classification

The actual output of the network
[Figure: grid of output probabilities $y_t^{s}$, one row per symbol (/AH/, /B/, /D/, /EH/, /IY/, /F/, /G/) and one column per time $t = 0 \dots 8$, above the inputs $X_0 \dots X_8$; runs of /D/, /IY/, /F/ and /G/ are marked]
• Option 1: Simply select the most probable symbol at each time
  – Merge adjacent repeated symbols, and place the actual emission of the symbol in the final instant
• Problems with Option 1:
  – Cannot distinguish between an extended symbol and repetitions of the symbol
  – The resulting sequence may be meaningless (what word is "GFIYD"?)
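Option 1 can be sketched in a few lines of Python. This is a minimal illustration rather than code from the lecture: the function name `greedy_decode`, the `probs[t][i]` layout and the toy probabilities are all assumptions made for the example.

```python
def greedy_decode(probs, symbols):
    """Greedy decode: probs[t][i] is the probability of symbols[i] at time t.

    Returns a list of (symbol, time) pairs, where `time` is the final
    instant of each run of identical most-probable symbols.
    """
    out = []
    for t, row in enumerate(probs):
        # Most probable symbol at this time step.
        sym = symbols[max(range(len(row)), key=row.__getitem__)]
        if out and out[-1][0] == sym:
            # Same symbol as the previous step: merge, keep the later instant.
            out[-1] = (sym, t)
        else:
            out.append((sym, t))
    return out


# Hypothetical 5-step output over three symbols.
symbols = ["B", "IY", "F"]
probs = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
]
decoded = greedy_decode(probs, symbols)
```

Because adjacent repeats are merged, a genuinely repeated symbol and a single long symbol produce the same decode, which is the ambiguity noted on the slide.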

The actual output of the network
[Figure: the same grid of output probabilities $y_t^{s}$ for /AH/, /B/, /D/, /EH/, /IY/, /F/, /G/ over the inputs $X_0 \dots X_8$]
• Option 2: Impose external constraints on what sequences are allowed
  – E.g. only allow sequences corresponding to dictionary words
  – E.g. sub-symbol units (like in HW1; what were they?)
• We will refer to the process of obtaining an output from the network as decoding

The sequence-to-sequence problem
[Figure: recurrent network over the inputs $X_0 \dots X_9$ producing the symbols /B/ /IY/ /F/ /IY/]
• How do we know when to output symbols?
  – In fact, the network produces outputs at every time
  – Which of these are the real outputs?
• How do we train these models?

Training
[Figure: recurrent network over the inputs $X_0 \dots X_9$ with the targets /B/ /IY/ /F/ /IY/ placed at their end times]
• Given output symbols at the right locations
  – The phoneme /B/ ends at $X_2$, /IY/ at $X_4$, /F/ at $X_6$, /IY/ at $X_9$

[Figure: divergences (Div) between the outputs $Y_2, Y_4, Y_6, Y_9$ and the targets /B/, /IY/, /F/, /IY/; in the second variant a divergence is computed at every time over the inputs $X_0 \dots X_9$]
• Either just define the divergence as:
  $\mathrm{DIV} = \mathrm{Xent}(Y_2, B) + \mathrm{Xent}(Y_4, IY) + \mathrm{Xent}(Y_6, F) + \mathrm{Xent}(Y_9, IY)$
• Or repeat the symbols over their duration:
  $\mathrm{DIV} = \sum_t \mathrm{Xent}(Y_t, \mathrm{symbol}_t) = -\sum_t \log Y(t, \mathrm{symbol}_t)$

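[Figure: a divergence (Div) computed at every time between the output $Y_t$ and the symbol aligned to it, over the inputs $X_0 \dots X_9$]
$\mathrm{DIV} = \sum_t \mathrm{Xent}(Y_t, \mathrm{symbol}_t) = -\sum_t \log Y(t, \mathrm{symbol}_t)$
• The gradient w.r.t. the $t$-th output vector $Y_t$:
  $\nabla_{Y_t} \mathrm{DIV} = \left[\, 0 \;\; 0 \;\; \dots \;\; \frac{-1}{Y(t, \mathrm{symbol}_t)} \;\; \dots \;\; 0 \;\; 0 \,\right]$
  – Zeros except at the component corresponding to the target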
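As a rough sketch of the divergence and its gradient when the alignment is known, assume `Y[t][i]` holds the output probability of symbol index `i` at time `t` and `alignment[t]` is the index of the target symbol at time `t`. Both names, and the toy numbers, are illustrative assumptions.

```python
import math


def alignment_divergence(Y, alignment):
    """DIV = -sum_t log Y(t, symbol_t) for a fixed, known alignment."""
    return -sum(math.log(Y[t][k]) for t, k in enumerate(alignment))


def divergence_gradient(Y, alignment):
    """Gradient of DIV w.r.t. each output vector Y_t.

    Every component is zero except the one for the aligned target symbol,
    which is -1 / Y(t, symbol_t).
    """
    grad = [[0.0] * len(row) for row in Y]
    for t, k in enumerate(alignment):
        grad[t][k] = -1.0 / Y[t][k]
    return grad


# Two time steps, two symbols; symbol 0 is aligned to t=0, symbol 1 to t=1.
Y = [[0.5, 0.5], [0.25, 0.75]]
alignment = [0, 1]
div = alignment_divergence(Y, alignment)
grad = divergence_gradient(Y, alignment)
```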

Problem: No timing information provided
[Figure: the target symbols /B/ /IY/ /F/ /IY/ with unknown positions ("?") over the outputs $Y_0 \dots Y_9$ and the inputs $X_0 \dots X_9$]
• Only the sequence of output symbols is provided for the training data
  – But no indication of which one occurs where
• How do we compute the divergence?
  – And how do we compute its gradient w.r.t. $Y_t$?

Solution 1: Guess the alignment
[Figure: a guessed assignment of the symbols /B/, /IY/, /F/ to the outputs $Y_0 \dots Y_9$ over the inputs $X_0 \dots X_9$]
• Initialize: Assign an initial alignment
  – Either randomly, based on some heuristic, or any other rationale
• Iterate:
  – Train the network using the current alignment
  – Reestimate the alignment for each training instance, using the decoding methods already discussed

Estimating an alignment
• Given:
  – The unaligned $K$-length symbol sequence $S = S_0 \dots S_{K-1}$ (e.g. /B/ /IY/ /F/ /IY/)
  – An $N$-length input ($N \geq K$)
  – And a (trained) recurrent network
• Find:
  – An $N$-length expansion $s_0 \dots s_{N-1}$ comprising the symbols in $S$ in strict order
    • e.g. $S_0 \, S_1 \, S_1 \, S_2 \, S_3 \, S_3 \dots S_{K-1}$
    • i.e. $s_0 = S_0$, $s_1 = S_1$, $s_2 = S_1$, $s_3 = S_2$, $s_4 = S_3$, $s_5 = S_3$, …, $s_{N-1} = S_{K-1}$
    • E.g. /B/ /B/ /IY/ /IY/ /IY/ /F/ /F/ /F/ /F/ /IY/ ..
  – $s_i = S_k \Rightarrow i \geq k$
  – $s_i = S_k,\; s_j = S_l,\; i < j \Rightarrow k \leq l$
• Outcome: an alignment of the target symbol sequence $S_0 \dots S_{K-1}$ to the input $X_0 \dots X_{N-1}$
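The conditions above can be checked mechanically. The following is a small sketch, assuming sequences are plain Python lists of symbol names; `is_expansion` is a hypothetical helper, not part of the lecture. It tracks which target positions the expansion could currently be at, allowing each step either to stay on the same target symbol or to advance by one.

```python
def is_expansion(expansion, target):
    """Return True if `expansion` is a valid expansion of `target`.

    Valid means: it starts at target[0], ends at target[-1], and each step
    either repeats the current target position or moves to the next one.
    """
    if not expansion or not target or expansion[0] != target[0]:
        return False
    reachable = {0}  # target positions consistent with the prefix seen so far
    for sym in expansion[1:]:
        reachable = {
            k
            for j in reachable
            for k in (j, j + 1)  # stay on the same symbol or advance by one
            if k < len(target) and target[k] == sym
        }
        if not reachable:
            return False
    return len(target) - 1 in reachable


target = ["B", "IY", "F", "IY"]
good = ["B", "B", "IY", "IY", "IY", "F", "F", "F", "F", "IY"]
bad = ["AH", "AH", "AH", "D", "D", "AH", "F", "IY", "IY"]
```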

Recall: The actual output of the network
[Figure: grid of output probabilities $y_t^{s}$ for /AH/, /B/, /D/, /EH/, /IY/, /F/, /G/ at times $t = 0 \dots 8$, above the inputs $X_0 \dots X_8$]
• At each time the network outputs a probability for each output symbol

Recall: unconstrained decoding
[Figure: grid of output probabilities $y_t^{s}$ for /AH/, /B/, /D/, /EH/, /IY/, /F/, /G/ at times $t = 0 \dots 8$]
• We find the most likely sequence of symbols
  – (Conditioned on the input $X_0 \dots X_{N-1}$)
• This may not correspond to an expansion of the desired symbol sequence
  – E.g. the unconstrained decode may be /AH/ /AH/ /AH/ /D/ /D/ /AH/ /F/ /IY/ /IY/
    • Contracts to /AH/ /D/ /AH/ /F/ /IY/
  – Whereas we want an expansion of /B/ /IY/ /F/ /IY/

Constraining the alignment: Try 1
[Figure: grid of output probabilities $y_t^{s}$ for /AH/, /B/, /D/, /EH/, /IY/, /F/, /G/ at times $t = 0 \dots 8$]
• Block out all rows that do not include symbols from the target sequence
  – E.g. block out rows that are not /B/, /IY/ or /F/

Blocking out unnecessary outputs
[Figure: reduced grid with only the rows /B/, /IY/, /F/ at times $t = 0 \dots 8$]
• Compute the entire output (for all symbols)
• Copy the output values for the target symbols into the secondary reduced structure

Constraining the alignment: Try 1
[Figure: reduced grid with only the rows /B/, /IY/, /F/ at times $t = 0 \dots 8$, with a decoded path marked]
• Only decode on the reduced grid
  – We are now assured that only the appropriate symbols will be hypothesized
• Problem: This still doesn't assure that the decoded sequence correctly expands the target symbol sequence
  – E.g. the decode shown in the figure is not an expansion of /B/ /IY/ /F/ /IY/
• Still needs additional constraints

Try 2: Explicitly arrange the constructed table
[Figure: table with rows /B/, /IY/, /F/, /IY/ (top to bottom) and columns $t = 0 \dots 8$; the entry in row $r$ at time $t$ is the output probability of that row's symbol]
• Arrange the constructed table so that from top to bottom it has the exact sequence of symbols required
• Note: If a symbol occurs multiple times, we repeat the row in the appropriate location
  – E.g. the row for /IY/ occurs twice, in the 2nd and 4th positions
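Constructing the arranged table amounts to copying rows of the full network output in target order. A minimal sketch, assuming the full output is stored as `probs[t][i]` (probability of `symbols[i]` at time `t`); the function name and the two-step toy output are made up for illustration.

```python
def build_alignment_table(probs, symbols, target):
    """Return table[r][t]: probability at time t of the r-th target symbol.

    Rows follow the target sequence from top to bottom, so a symbol that
    occurs several times in the target gets one (repeated) row per occurrence.
    """
    index_of = {sym: i for i, sym in enumerate(symbols)}
    return [[row[index_of[sym]] for row in probs] for sym in target]


symbols = ["AH", "B", "IY", "F"]
target = ["B", "IY", "F", "IY"]
probs = [
    [0.1, 0.6, 0.2, 0.1],  # t = 0
    [0.1, 0.2, 0.5, 0.2],  # t = 1
]
table = build_alignment_table(probs, symbols, target)
```

Rows 1 and 3 of `table` are identical copies of the /IY/ row, which matches the note about repeated symbols.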

Explicitly constrain alignment
[Figure: table with rows /B/, /IY/, /F/, /IY/ and columns $t = 0 \dots 8$]
• Constrain that the first symbol in the decode must be the top left block
• The last symbol must be the bottom right
• The rest of the symbols must follow a sequence that monotonically travels down from top left to bottom right
  – I.e. never goes up
• This guarantees that the sequence is an expansion of the target sequence
  – /B/ /IY/ /F/ /IY/ in this case

Explicitly constrain alignment
[Figure: graph over the table with rows /B/, /IY/, /F/, /IY/ and columns $t = 0 \dots 8$]
• Compose a graph such that every path in the graph from source to sink represents a valid alignment
  – Which maps on to the target symbol sequence (/B/ /IY/ /F/ /IY/ in the figure)
• Edge scores are 1
• Node scores are the probabilities assigned to the symbols by the neural network
• The "score" of a path is the product of the probabilities of all nodes along the path
• Find the most probable path from source to sink using any dynamic programming algorithm
  – E.g. the Viterbi algorithm

Viterbi algorithm
[Figure: graph over the table with rows /B/, /IY/, /F/, /IY/ and columns $t = 0 \dots 8$]
• At each node, keep track of
  – The best incoming edge
  – The score of the best path from the source to the node
• Dynamically compute the best path from source to sink

Viterbi algorithm
[Figure: graph over the table with rows /B/, /IY/, /F/, /IY/ and columns $t = 0 \dots 8$]
• First, some notation: $y_t^{S(r)}$ is the probability of the target symbol assigned to the $r$-th row at the $t$-th time (given inputs $X_0 \dots X_t$)
  – E.g. $S(0)$ = /B/
    • The scores in the 0th row have the form $y_t^{B}$
  – E.g. $S(1) = S(3)$ = /IY/
    • The scores in the 1st and 3rd rows have the form $y_t^{IY}$
  – E.g. $S(2)$ = /F/
    • The scores in the 2nd row have the form $y_t^{F}$

Viterbi algorithm
[Figure: graph over the table with rows /B/, /IY/, /F/, /IY/ and columns $t = 0 \dots 8$]
• Initialization:
  – $BP(0, i) = \mathrm{null}, \quad i = 0 \dots K-1$
  – $Bscr(0, 0) = y_0^{S(0)}, \quad Bscr(0, i) = -\infty, \quad i = 1 \dots K-1$
• for $t = 1 \dots T-1$ (where $T$ is the number of time steps):
  – $BP(t, 0) = 0; \quad Bscr(t, 0) = Bscr(t-1, 0) \times y_t^{S(0)}$
  – for $l = 1 \dots K-1$:
    • $BP(t, l) = (\text{if } Bscr(t-1, l-1) > Bscr(t-1, l) \text{ then } l-1 \text{ else } l)$
    • $Bscr(t, l) = Bscr(t-1, BP(t, l)) \times y_t^{S(l)}$

Viterbi algorithm
[Figure: graph over the table with rows /B/, /IY/, /F/, /IY/ and columns $t = 0 \dots 8$, with the best path marked]
• Backtrace from the final node: $s(T-1) = S(K-1)$
• for $t = T-1$ downto $1$:
  – $s(t-1) = BP(t, s(t))$, i.e. follow the backpointer stored at time $t$ for the row that $s(t)$ occupies
• Resulting alignment in the example: /B/ /B/ /IY/ /F/ /F/ /IY/ /IY/ /IY/ /IY/
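The forward pass and backtrace of the Viterbi alignment can be put together as follows. This is a sketch under stated assumptions, not the lecture's reference code: `table[r][t]` holds $y_t^{S(r)}$ with rows in target order, all probabilities are strictly positive (as with a softmax output), and the names `viterbi_align`, `score` and `back` are illustrative.

```python
def viterbi_align(table):
    """Best monotonic path through table[r][t]; returns the row index per time."""
    K, T = len(table), len(table[0])
    NEG_INF = float("-inf")  # sentinel score for nodes that cannot be reached
    score = [[NEG_INF] * K for _ in range(T)]
    back = [[None] * K for _ in range(T)]

    # Every path must start in the top-left node.
    score[0][0] = table[0][0]
    for t in range(1, T):
        # The top row can only be reached by staying in the top row.
        back[t][0] = 0
        score[t][0] = score[t - 1][0] * table[0][t]
        for l in range(1, K):
            # Best predecessor: the row above (advance) or the same row (stay).
            back[t][l] = l - 1 if score[t - 1][l - 1] > score[t - 1][l] else l
            score[t][l] = score[t - 1][back[t][l]] * table[l][t]

    # Backtrace from the bottom-right node.
    path = [K - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical table for a 3-symbol target over 4 time steps.
table = [
    [0.90, 0.60, 0.10, 0.10],
    [0.05, 0.30, 0.80, 0.20],
    [0.05, 0.10, 0.10, 0.70],
]
path = viterbi_align(table)
```

For long inputs the products would underflow, so a practical version would more likely accumulate log probabilities; the structure of the recursion stays the same.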

Gradients from the alignment
[Figure: table with rows /B/, /IY/, /F/, /IY/ and columns $t = 0 \dots 8$; estimated alignment /B/ /B/ /IY/ /F/ /F/ /IY/ /IY/ /IY/ /IY/]
$\mathrm{DIV} = \sum_t \mathrm{Xent}(Y_t, \mathrm{symbol}_t^{bestpath}) = -\sum_t \log Y(t, \mathrm{symbol}_t^{bestpath})$
• The gradient w.r.t. the $t$-th output vector $Y_t$:
  $\nabla_{Y_t} \mathrm{DIV} = \left[\, 0 \;\; 0 \;\; \dots \;\; \frac{-1}{Y(t, \mathrm{symbol}_t^{bestpath})} \;\; \dots \;\; 0 \;\; 0 \,\right]$
  – Zeros except at the component corresponding to the target in the estimated alignment

Iterative Estimate and Training
[Figure: an aligned symbol sequence over the outputs $Y_0 \dots Y_9$ and the inputs $X_0 \dots X_9$; flowchart: Initialize alignments → Train model with given alignments → Decode to obtain alignments → back to Train]
• The "decode" and "train" steps may be combined into a single "decode, find alignment, compute derivatives" step for SGD and mini-batch updates

Iterative update
• Option 1:
  – Determine alignments for every training instance
  – Train the model (using SGD or your favorite approach) on the entire training set
  – Iterate
• Option 2:
  – During SGD, for each training instance, find the alignment during the forward pass
  – Use it in the backward pass

Iterative update: Problem
• The approach is heavily dependent on the initial alignment
• Prone to poor local optima
• Alternate solution: Do not commit to an alignment during any pass

The reason for suboptimality
[Figure: table with rows /B/, /IY/, /F/, /IY/ and columns $t = 0 \dots 8$]
• We commit to the single "best" estimated alignment
  – The most likely alignment: $\mathrm{DIV} = -\sum_t \log Y(t, \mathrm{symbol}_t^{bestpath})$
  – This can be way off, particularly in early iterations, or if the model is poorly initialized
• Alternate view: there is a probability distribution over alignments of the target symbol sequence (to the input)
  – Selecting a single alignment is the same as drawing a single sample from this distribution
  – Selecting the most likely alignment is the same as deterministically always drawing the most probable value from the distribution

Averaging over all alignments
[Figure: table with rows /B/, /IY/, /F/, /IY/ and columns $t = 0 \dots 8$]
• Instead of only selecting the most likely alignment, use the statistical expectation over all possible alignments:
  $\mathrm{DIV} = E\left[ -\sum_t \log Y(t, s_t) \right]$
  – Use the entire distribution of alignments
  – This will mitigate the issue of suboptimal selection of alignment

The expectation over all alignments
[Figure: table with rows /B/, /IY/, /F/, /IY/ and columns $t = 0 \dots 8$]
$\mathrm{DIV} = E\left[ -\sum_t \log Y(t, s_t) \right]$
• Using the linearity of expectation:
  $\mathrm{DIV} = -\sum_t E\left[ \log Y(t, s_t) \right]$
  – This reduces to finding the expected divergence at each input:
    $\mathrm{DIV} = -\sum_t \sum_{S \in S_1 \dots S_K} P(s_t = S \mid \mathbf{S}, \mathbf{X}) \, \log Y(t, s_t = S)$
• $P(s_t = S \mid \mathbf{S}, \mathbf{X})$ is the probability of seeing the specific symbol $S$ at time $t$, given that the symbol sequence is an expansion of $\mathbf{S} = S_0 \dots S_{K-1}$ and given the input sequence $\mathbf{X} = X_0 \dots X_{N-1}$
  – We need to be able to compute this
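Once the posterior probabilities are available, the expected divergence is a weighted sum of log probabilities. The sketch below assumes `table[r][t]` holds the output probability of the $r$-th target symbol at time $t$ and `posterior[t][r]` holds $P(s_t = S_r \mid \mathbf{S}, \mathbf{X})$; both layouts and the toy numbers are assumptions for illustration.

```python
import math


def expected_divergence(table, posterior):
    """DIV = -sum_t sum_r posterior[t][r] * log table[r][t]."""
    K, T = len(table), len(table[0])
    return -sum(
        posterior[t][r] * math.log(table[r][t])
        for t in range(T)
        for r in range(K)
        if posterior[t][r] > 0.0  # terms with zero posterior contribute nothing
    )


table = [[0.5, 0.25], [0.5, 0.75]]

# A one-hot posterior reproduces the single-alignment divergence.
hard = expected_divergence(table, [[1.0, 0.0], [0.0, 1.0]])

# A soft posterior spreads the weight at t=1 over both target symbols.
soft = expected_divergence(table, [[1.0, 0.0], [0.5, 0.5]])
```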

A posteriori probabilities of symbols
[Figure: table with rows /B/, /IY/, /F/, /IY/ and columns $t = 0 \dots 8$]
$P(s_t = S_r \mid \mathbf{S}, \mathbf{X}) \propto P(s_t = S_r, \mathbf{S} \mid \mathbf{X})$
• $P(s_t = S_r, \mathbf{S} \mid \mathbf{X})$ is the total probability of all valid paths in the graph for target sequence $\mathbf{S}$ that go through the symbol $S_r$ (the $r$-th symbol in the sequence $S_1 \dots S_K$) at time $t$
• We will compute this using the "forward-backward" algorithm

A posteriori probabilities of symbols
[Figure: table with rows /B/, /IY/, /F/, /IY/ and columns $t = 0 \dots 8$]
• Decompose $P(s_t = S_r, \mathbf{S} \mid \mathbf{X})$ as follows:
  $P(s_t = S_r, \mathbf{S} \mid \mathbf{X}) = \sum_{s_0 \dots s_{t-1} \to S_1 \dots [S_{r-}]} \; \sum_{s_{t+1} \dots s_{N-1} \to [S_{r+}] \dots S_K} P(s_0 \dots s_{t-1}, s_t = S_r, s_{t+1} \dots s_{N-1}, \mathbf{S} \mid \mathbf{X})$
  – $[S_{r+}]$ indicates that $s_{t+1}$ might be either $S_r$ or $S_{r+1}$
  – $[S_{r-}]$ indicates that $s_{t-1}$ might be either $S_r$ or $S_{r-1}$
  $= \sum_{s_0 \dots s_{t-1} \to S_1 \dots [S_{r-}]} \; \sum_{s_{t+1} \dots s_{N-1} \to [S_{r+}] \dots S_K} P(s_0 \dots s_{t-1}, s_t = S_r, s_{t+1} \dots s_{N-1} \mid \mathbf{X})$
  – Because the target symbol sequence $\mathbf{S}$ is implicit in the synchronized sequences $s_0 \dots s_{N-1}$, which are constrained to be expansions of $\mathbf{S}$

A posteriori probabilities of symbols
[Figure: table with rows /B/, /IY/, /F/, /IY/ and columns $t = 0 \dots 8$]
$P(s_t = S_r, \mathbf{S} \mid \mathbf{X}) = \sum_{s_0 \dots s_{t-1} \to S_1 \dots [S_{r-}]} \; \sum_{s_{t+1} \dots s_{N-1} \to [S_{r+}] \dots S_K} P(s_0 \dots s_{t-1}, s_t = S_r, s_{t+1} \dots s_{N-1} \mid \mathbf{X})$
$= \sum_{s_0 \dots s_{t-1} \to S_1 \dots [S_{r-}]} \; \sum_{s_{t+1} \dots s_{N-1} \to [S_{r+}] \dots S_K} P(s_0 \dots s_{t-1}, s_t = S_r \mid \mathbf{X}) \, P(s_{t+1} \dots s_{N-1} \mid s_0 \dots s_{t-1}, s_t = S_r, \mathbf{X})$
• For a recurrent network without feedback from the output we can make the conditional independence assumption:
  $P(s_{t+1} \dots \mid s_0 \dots s_t, \mathbf{X}) = P(s_{t+1} \dots \mid \mathbf{X})$
  so that
  $P(s_t = S_r, \mathbf{S} \mid \mathbf{X}) = \sum_{s_0 \dots s_{t-1} \to S_1 \dots [S_{r-}]} \; \sum_{s_{t+1} \dots s_{N-1} \to [S_{r+}] \dots S_K} P(s_0 \dots s_{t-1}, s_t = S_r \mid \mathbf{X}) \, P(s_{t+1} \dots s_{N-1} \mid s_t = S_r, \mathbf{X})$
• Note: in reality, this assumption is not valid if the hidden states are unknown, but we will make it anyway

Conditional independence
[Figure: dependency graph $\mathbf{X} = X_0 X_1 \dots X_{N-1}$ → $\mathbf{H} = H_0 H_1 \dots H_{N-1}$ → $y_0, y_1, \dots, y_{N-1}$]
• Dependency graph: the input sequence $\mathbf{X} = X_0 X_1 \dots X_{N-1}$ governs the hidden variables $\mathbf{H} = H_0 H_1 \dots H_{N-1}$
• The hidden variables govern the output predictions $y_0, y_1, \dots, y_{N-1}$ individually
• $y_0, y_1, \dots, y_{N-1}$ are conditionally independent given $\mathbf{H}$
• Since $\mathbf{H}$ is deterministically derived from $\mathbf{X}$, $y_0, y_1, \dots, y_{N-1}$ are also conditionally independent given $\mathbf{X}$
  – This wouldn't be true if the relation between $\mathbf{X}$ and $\mathbf{H}$ were not deterministic, or if $\mathbf{X}$ were unknown

A posteriori probabilities of symbols
[Figure: table with rows /B/, /IY/, /F/, /IY/ and columns $t = 0 \dots 8$]
$P(s_t = S_r, \mathbf{S} \mid \mathbf{X}) = \sum_{s_0 \dots s_{t-1} \to S_1 \dots [S_{r-}]} \; \sum_{s_{t+1} \dots s_{N-1} \to [S_{r+}] \dots S_K} P(s_0 \dots s_{t-1}, s_t = S_r \mid \mathbf{X}) \, P(s_{t+1} \dots s_{N-1} \mid \mathbf{X})$
$= \left( \sum_{s_0 \dots s_{t-1} \to S_1 \dots [S_{r-}]} P(s_0 \dots s_{t-1}, s_t = S_r \mid \mathbf{X}) \right) \left( \sum_{s_{t+1} \dots s_{N-1} \to [S_{r+}] \dots S_K} P(s_{t+1} \dots s_{N-1} \mid \mathbf{X}) \right)$

The expectation over all alignments
[Figure: table with rows /B/, /IY/, /F/, /IY/ and columns $t = 0 \dots 8$]
$P(s_t = S_r, \mathbf{S} \mid \mathbf{X}) = \left( \sum_{s_0 \dots s_{t-1} \to S_1 \dots [S_{r-}]} P(s_0 \dots s_{t-1}, s_t = S_r \mid \mathbf{X}) \right) \left( \sum_{s_{t+1} \dots s_{N-1} \to [S_{r+}] \dots S_K} P(s_{t+1} \dots s_{N-1} \mid \mathbf{X}) \right)$
• We will call the first term the forward probability $\alpha(t, r)$
• We will call the second term the backward probability $\beta(t, r)$

Forward algorithm
[Figure: table with rows /B/, /IY/, /F/, /IY/ and columns $t = 0 \dots 8$; the node $(t, r)$ receives edges from $(t-1, r)$ and $(t-1, r-1)$]
$\alpha(t, r) = \sum_{s_0 \dots s_{t-1} \to S_1 \dots [S_{r-}]} P(s_0 \dots s_{t-1}, s_t = S_r \mid \mathbf{X})$
$= \sum_{s_0 \dots s_{t-1} \to S_1 \dots [S_{r-}]} P(s_0 \dots s_{t-1} \mid \mathbf{X}) \, P(s_t = S_r \mid s_0 \dots s_{t-1}, \mathbf{X})$
$= \sum_{s_0 \dots s_{t-1} \to S_1 \dots [S_{r-}]} P(s_0 \dots s_{t-1} \mid \mathbf{X}) \, P(s_t = S_r \mid \mathbf{X})$
$= \left( \sum_{s_0 \dots s_{t-2} \to S_1 \dots [S_{r-}]} P(s_0 \dots s_{t-2}, s_{t-1} = S_r \mid \mathbf{X}) + \sum_{s_0 \dots s_{t-2} \to S_1 \dots [S_{(r-1)-}]} P(s_0 \dots s_{t-2}, s_{t-1} = S_{r-1} \mid \mathbf{X}) \right) P(s_t = S_r \mid \mathbf{X})$
• The first sum is $\alpha(t-1, r)$, the second sum is $\alpha(t-1, r-1)$, and $P(s_t = S_r \mid \mathbf{X}) = y_t^{S(r)}$, so
  $\alpha(t, r) = \left( \alpha(t-1, r) + \alpha(t-1, r-1) \right) y_t^{S(r)}$

Forward algorithm
[Figure: table with rows /B/, /IY/, /F/, /IY/ and columns $t = 0 \dots 8$]
• Initialization:
  – $\alpha(0, 1) = y_0^{S(1)}, \quad \alpha(0, r) = 0, \quad r > 1$
• for $t = 1 \dots T-1$:
  – $\alpha(t, 1) = \alpha(t-1, 1) \, y_t^{S(1)}$
  – for $l = 2 \dots K$:
    • $\alpha(t, l) = \left( \alpha(t-1, l) + \alpha(t-1, l-1) \right) y_t^{S(l)}$
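The forward recursion translates almost line by line into code. The sketch below uses 0-based row indices (the slide counts rows from 1) and assumes `table[r][t]` holds $y_t^{S(r)}$; the function name and the toy table are illustrative.

```python
def forward(table):
    """alpha[t][r]: total probability of all valid partial paths ending at (t, r)."""
    K, T = len(table), len(table[0])
    alpha = [[0.0] * K for _ in range(T)]

    # Paths can only start in the first row.
    alpha[0][0] = table[0][0]
    for t in range(1, T):
        # The first row is only reachable by staying in the first row.
        alpha[t][0] = alpha[t - 1][0] * table[0][t]
        for l in range(1, K):
            # Arrive from the same row (stay) or from the row above (advance).
            alpha[t][l] = (alpha[t - 1][l] + alpha[t - 1][l - 1]) * table[l][t]
    return alpha


# Two target symbols over three time steps.
table = [[0.5, 0.4, 0.1], [0.5, 0.6, 0.9]]
alpha = forward(table)
```

In this example the only valid alignments are rows `[0, 0, 1]` and `[0, 1, 1]`, with probabilities 0.5·0.4·0.9 = 0.18 and 0.5·0.6·0.9 = 0.27, so the bottom-right entry `alpha[2][1]` equals their sum, 0.45.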

In practice..
• The recursion
  $\alpha(t, l) = \left( \alpha(t-1, l) + \alpha(t-1, l-1) \right) y_t^{S(l)}$
  will generally underflow
• Instead we can do it in the log domain:
  $\log \alpha(t, l) = \log\left( e^{\log \alpha(t-1, l)} + e^{\log \alpha(t-1, l-1)} \right) + \log y_t^{S(l)}$
  – This can be computed entirely without underflow
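One common way to evaluate the log-domain recursion without underflow is to factor the larger exponent out of the sum before exponentiating (the "log-sum-exp" trick). The slide does not say which method it has in mind, so treat this as one possible sketch; `log_add` and `log_forward` are hypothetical names, and `table[r][t]` is assumed to hold strictly positive probabilities $y_t^{S(r)}$.

```python
import math

NEG_INF = float("-inf")  # log of zero probability


def log_add(a, b):
    """Return log(exp(a) + exp(b)) without exponentiating large negatives."""
    if a == NEG_INF and b == NEG_INF:
        return NEG_INF
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def log_forward(table):
    """log_alpha[t][r] = log alpha(t, r), computed entirely in the log domain."""
    K, T = len(table), len(table[0])
    log_alpha = [[NEG_INF] * K for _ in range(T)]
    log_alpha[0][0] = math.log(table[0][0])
    for t in range(1, T):
        log_alpha[t][0] = log_alpha[t - 1][0] + math.log(table[0][t])
        for l in range(1, K):
            log_alpha[t][l] = (
                log_add(log_alpha[t - 1][l], log_alpha[t - 1][l - 1])
                + math.log(table[l][t])
            )
    return log_alpha


small = log_forward([[0.5, 0.4, 0.1], [0.5, 0.6, 0.9]])

# 2000 steps at probability 0.01: the plain product would underflow to 0.0.
long_table = [[0.01] * 2000, [0.01] * 2000]
long_result = log_forward(long_table)[-1][-1]
```

With two rows and 2000 steps there are 1999 valid alignments, each of probability 0.01^2000, so the final value should be close to log(1999) + 2000·log(0.01), a number that cannot be represented as a plain probability in double precision.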

Forward algorithm
[Figure: table with rows /B/, /IY/, /F/, /IY/ and columns $t = 0 \dots 8$]
• Initialization:
  – $\hat{\alpha}(0, 1) = 1, \quad \hat{\alpha}(0, r) = 0, \quad r > 1$
  – $\alpha(0, r) = \hat{\alpha}(0, r) \, y_0^{S(r)}, \quad 1 \leq r \leq K$
• for $t = 1 \dots T-1$:
  – $\hat{\alpha}(t, 1) = \alpha(t-1, 1)$
  – for $l = 2 \dots K$:
    • $\hat{\alpha}(t, l) = \alpha(t-1, l) + \alpha(t-1, l-1)$
  – $\alpha(t, r) = \hat{\alpha}(t, r) \, y_t^{S(r)}, \quad 1 \leq r \leq K$

The forward probability
[Figure: table with rows /B/, /IY/, /F/, /IY/ and columns $t = 0 \dots 8$]
$P(s_t = S_r, \mathbf{S} \mid \mathbf{X}) = \left( \sum_{s_0 \dots s_{t-1} \to S_1 \dots [S_{r-}]} P(s_0 \dots s_{t-1}, s_t = S_r \mid \mathbf{X}) \right) \left( \sum_{s_{t+1} \dots s_{N-1} \to [S_{r+}] \dots S_K} P(s_{t+1} \dots s_{N-1} \mid \mathbf{X}) \right)$
$= \alpha(t, r) \sum_{s_{t+1} \dots s_{N-1} \to [S_{r+}] \dots S_K} P(s_{t+1} \dots s_{N-1} \mid \mathbf{X})$
• We will call the first term the forward probability $\alpha(t, r)$
  – We have seen how to compute this
• We will call the second term the backward probability $\beta(t, r)$
  – Let's look at this next

Backward algorithm
[Figure: trellis of network outputs $y_t^{S(r)}$ for the symbol sequence /B/ /IY/ /F/ /IY/ over time steps t = 0, …, 8]
$$
\begin{aligned}
\beta(t, r) &= \sum_{s_{t+1} \ldots s_{N-1} \,\to\, [S_{r+}] \ldots S_K} P(s_{t+1} \ldots s_{N-1} \mid \mathbf{X}) \\
&= \sum_{s_{t+2} \ldots s_{N-1} \,\to\, [S_{r+}] \ldots S_K} P(s_{t+1} = S_r,\ s_{t+2} \ldots s_{N-1} \mid \mathbf{X}) + \sum_{s_{t+2} \ldots s_{N-1} \,\to\, [S_{(r+1)+}] \ldots S_K} P(s_{t+1} = S_{r+1},\ s_{t+2} \ldots s_{N-1} \mid \mathbf{X}) \\
&= P(s_{t+1} = S_r \mid \mathbf{X}) \sum_{s_{t+2} \ldots s_{N-1} \,\to\, [S_{r+}] \ldots S_K} P(s_{t+2} \ldots s_{N-1} \mid s_{t+1} = S_r, \mathbf{X}) + P(s_{t+1} = S_{r+1} \mid \mathbf{X}) \sum_{s_{t+2} \ldots s_{N-1} \,\to\, [S_{(r+1)+}] \ldots S_K} P(s_{t+2} \ldots s_{N-1} \mid s_{t+1} = S_{r+1}, \mathbf{X}) \\
&= P(s_{t+1} = S_r \mid \mathbf{X}) \sum_{s_{t+2} \ldots s_{N-1} \,\to\, [S_{r+}] \ldots S_K} P(s_{t+2} \ldots s_{N-1} \mid \mathbf{X}) + P(s_{t+1} = S_{r+1} \mid \mathbf{X}) \sum_{s_{t+2} \ldots s_{N-1} \,\to\, [S_{(r+1)+}] \ldots S_K} P(s_{t+2} \ldots s_{N-1} \mid \mathbf{X}) \\
&= y_{t+1}^{S(r)}\, \beta(t+1, r) + y_{t+1}^{S(r+1)}\, \beta(t+1, r+1)
\end{aligned}
$$
• The second line splits the sum by the symbol aligned to time t+1: either the path stays at $S_r$ or it advances to $S_{r+1}$.
• The fourth line drops the conditioning on $s_{t+1}$, because the network outputs at different times are conditionally independent given the input $\mathbf{X}$.
• In the last line, $P(s_{t+1} = S_r \mid \mathbf{X}) = y_{t+1}^{S(r)}$ is the network output for symbol $S_r$ at time t+1, and each remaining sum is itself a backward probability at time t+1. 89-92

Backward algorithm
[Figure: trellis of network outputs $y_t^{S(r)}$ for the symbol sequence /B/ /IY/ /F/ /IY/ over time steps t = 0, …, 8]
$$\beta(t, r) = y_{t+1}^{S(r)}\, \beta(t+1, r) + y_{t+1}^{S(r+1)}\, \beta(t+1, r+1)$$ 93

Backward algorithm
[Figure: trellis of network outputs $y_t^{S(r)}$ for the symbol sequence /B/ /IY/ /F/ /IY/ over time steps t = 0, …, 8]
• Initialization: $\beta(T-1, K) = 1$, $\beta(T-1, r) = 0$ for $r < K$
• for $t = T-2$ down to $0$:
    $\beta(t, K) = \beta(t+1, K)\, y_{t+1}^{S(K)}$
    for $r = K-1, \ldots, 1$:
        $\beta(t, r) = y_{t+1}^{S(r)}\, \beta(t+1, r) + y_{t+1}^{S(r+1)}\, \beta(t+1, r+1)$ 94
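The recursion on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's own code: the function name `backward`, the array layout of `y` (one row per time step, one column per symbol in the vocabulary), and the toy numbers are all assumptions made for the example. Indices are 0-based, so the slide's $r = 1 \ldots K$ becomes `r = 0 ... K-1`.

```python
import numpy as np

def backward(y, seq):
    """Backward probabilities beta[t, r] for a target symbol sequence.

    y   : (T, V) array, y[t, v] = network output for symbol v at time t
          (hypothetical layout chosen for this sketch).
    seq : list of K symbol indices, the target sequence S(1)..S(K).
    """
    T, K = y.shape[0], len(seq)
    beta = np.zeros((T, K))
    # Initialization: only the last symbol is allowed at the last time step.
    beta[T - 1, K - 1] = 1.0
    for t in range(T - 2, -1, -1):
        # Last row of the trellis: the path can only stay on S(K).
        beta[t, K - 1] = beta[t + 1, K - 1] * y[t + 1, seq[K - 1]]
        for r in range(K - 2, -1, -1):
            # Either stay on S(r) or advance to S(r+1) at time t+1.
            beta[t, r] = (y[t + 1, seq[r]] * beta[t + 1, r]
                          + y[t + 1, seq[r + 1]] * beta[t + 1, r + 1])
    return beta

# Toy example: T = 3 frames, vocabulary of 2 symbols, target sequence [0, 1].
y = np.array([[0.6, 0.4],
              [0.5, 0.5],
              [0.3, 0.7]])
beta = backward(y, [0, 1])
print(beta)
```

Note that, as defined on the slide, $\beta(t, r)$ does not include the output $y_t^{S(r)}$ at time t itself; it only accounts for times t+1 onward.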

The joint probability
[Figure: trellis of network outputs $y_t^{S(r)}$ for the symbol sequence /B/ /IY/ /F/ /IY/ over time steps t = 0, …, 8]
$$P(s_t = S_r, \mathbf{S} \mid \mathbf{X}) = \alpha(t, r) \sum_{s_{t+1} \ldots s_{N-1} \,\to\, [S_{r+}] \ldots S_K} P(s_{t+1} \ldots s_{N-1} \mid \mathbf{X})$$
• We will call the first term the forward probability $\alpha(t, r)$
• We will call the second term the backward probability $\beta(t, r)$
We can now compute the second term, $\beta(t, r)$. 99

The joint probability
[Figure: trellis of network outputs $y_t^{S(r)}$ for the symbol sequence /B/ /IY/ /F/ /IY/ over time steps t = 0, …, 8]
$$P(s_t = S_r, \mathbf{S} \mid \mathbf{X}) = \alpha(t, r)\, \beta(t, r)$$
• We will call the first term the forward probability $\alpha(t, r)$ (computed by the forward algorithm)
• We will call the second term the backward probability $\beta(t, r)$ (computed by the backward algorithm) 100
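As a rough check of the product form above, the sketch below computes both tables and multiplies them elementwise. The backward pass follows the recursion from the preceding slides. The forward pass is not shown in this part of the deck, so the recursion used here, $\alpha(t, r) = (\alpha(t-1, r) + \alpha(t-1, r-1))\, y_t^{S(r)}$ with $\alpha(0, 1) = y_0^{S(1)}$, is an assumption, as are the function names, the `(T, V)` layout of `y`, and the toy values. If the two tables are consistent, summing $\alpha(t, r)\beta(t, r)$ over r should give the same total $P(\mathbf{S} \mid \mathbf{X})$ at every time step.

```python
import numpy as np

def forward(y, seq):
    """alpha[t, r]: probability of all partial alignments ending at S(r) at time t.
    Assumed recursion (not taken from this section of the slides)."""
    T, K = y.shape[0], len(seq)
    alpha = np.zeros((T, K))
    alpha[0, 0] = y[0, seq[0]]          # paths must start on the first symbol
    for t in range(1, T):
        alpha[t, 0] = alpha[t - 1, 0] * y[t, seq[0]]
        for r in range(1, K):
            alpha[t, r] = (alpha[t - 1, r] + alpha[t - 1, r - 1]) * y[t, seq[r]]
    return alpha

def backward(y, seq):
    """beta[t, r]: probability of all completions from time t+1 onward,
    given that S(r) is aligned to time t (vectorized over r)."""
    T, K = y.shape[0], len(seq)
    beta = np.zeros((T, K))
    beta[T - 1, K - 1] = 1.0
    for t in range(T - 2, -1, -1):
        w = y[t + 1, seq] * beta[t + 1]   # y_{t+1}^{S(r)} * beta(t+1, r) for every r
        beta[t] = w                       # "stay on S(r)" term
        beta[t, :-1] += w[1:]             # "advance to S(r+1)" term
    return beta

def joint(y, seq):
    """joint[t, r] = P(s_t = S_r, S | X) = alpha(t, r) * beta(t, r)."""
    return forward(y, seq) * backward(y, seq)

# Toy example: T = 3 frames, vocabulary of 2 symbols, target sequence [0, 1].
y = np.array([[0.6, 0.4],
              [0.5, 0.5],
              [0.3, 0.7]])
p = joint(y, [0, 1])
print(p)
print(p.sum(axis=1))   # one total per time step; these should all agree
```

Dividing each row of `p` by its row sum would give the posterior probability of each symbol position at that time step, under the same assumptions.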

Deep Learning Sequence to Sequence models: Connectionist Temporal Classification 5 March 2018 1 Sequence-to-sequence modelling Problem: A sequence 1 goes in A different sequence 1 comes out
