ELEN E6884/COMS 86884 Speech Recognition Lecture 2
Michael Picheny, Ellen Eide, Stanley F. Chen IBM T.J. Watson Research Center Yorktown Heights, NY, USA {picheny,eeide,stanchen}@us.ibm.com 15 September 2005
ELEN E6884: Speech Recognition
■ today is picture day!
■ will hand out hardcopies of slides and readings for now
■ main feedback from last lecture
■ Lab 0 due tomorrow
■ Lab 1 out today, due on Friday in two weeks
■ Feature Extraction
■ Brief Break
■ Dynamic Time Warping
■ Capture essential information for sound and word identification
■ Compress information into a manageable form
■ Make it easy to factor out information irrelevant to recognition
■ Would be nice to find features that are i.i.d.
■ Model speech signal with a parsimonious set of parameters that
■ Use some type of function approximation such as Taylor or
■ Ignore pitch
■ Match human perception of frequency bands
■ Ignore other speaker dependent characteristics e.g. vocal tract
■ Incorporate dynamics
■ Improve LPC estimates (works better with “flatter” spectra)
■ Reduce or eliminate DC offsets
■ Mimic equal-loudness contours (higher frequency sounds
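The three goals above are the ones usually given for a pre-emphasis filter, so the sketch below assumes that is the operation being motivated. It applies a first-order difference y[n] = x[n] - alpha * x[n-1]. The function name `pre_emphasize` and the coefficient value 0.97 are assumptions for illustration, not values taken from the slides.

```python
import numpy as np

def pre_emphasize(x, alpha=0.97):
    """First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1].

    alpha = 0.97 is an assumed, commonly used value; the input must be non-empty.
    """
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]                      # no previous sample exists for n = 0
    y[1:] = x[1:] - alpha * x[:-1]   # boosts high frequencies, attenuates DC
    return y

# A constant (pure DC) signal is almost entirely removed after the first sample.
dc_signal = np.ones(5)
filtered = pre_emphasize(dc_signal)
```

With alpha close to 1 the filter has a near-zero response at DC, which is one way to see the "reduce or eliminate DC offsets" point.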
■ Experiments in speech coding suggest that F should be around
■ From last week, we know that both Hamming and Hanning
■ If too long, vocal tract will be non-stationary; smooth out
■ If too short, spectral output will be too variable with respect to
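The window-length trade-off above can be explored with a small framing routine. This is a minimal sketch: the 25 ms frame length and 10 ms shift are assumed defaults (the slides' own values are not recoverable here), and `frame_signal` is a hypothetical helper name. It uses NumPy's `np.hamming` window.

```python
import numpy as np

def frame_signal(x, sample_rate, frame_ms=25.0, shift_ms=10.0):
    """Cut x into overlapping frames and apply a Hamming window to each.

    frame_ms and shift_ms are assumed values chosen for illustration.
    Returns an array of shape (n_frames, frame_len).
    """
    x = np.asarray(x, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    if len(x) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(x) - frame_len) // shift
    frames = np.stack([x[i * shift: i * shift + frame_len] for i in range(n_frames)])
    return frames * np.hamming(frame_len)   # taper frame edges to reduce leakage

# One second of a constant signal at 16 kHz, just to inspect the framing.
frames = frame_signal(np.ones(16000), sample_rate=16000)
```

Changing `frame_ms` makes the trade-off concrete: longer frames average over more of the signal, shorter frames give noisier spectra.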
[Figure 12.28: (a) cepstra and (b) log spectra for sequential segments of voiced … Axes: time (ms) and frequency (kHz).]
$\sum_{j=1}^{p} a[j]\,z^{-j}$
$\sum_{n} x[n-i]\,x[n-j].$
■ The a[j]s themselves have an enormous dynamic range,
■ One can generate the spectrum from the LP coefficients but
■ Can use various transformations,
■ The transformation that works best is the LPC Cepstrum.
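… $\sum_{l=1}^{p} l\,a[l]\,z^{-l-1}$ … $\sum_{j=1}^{p} a[j]\,z^{-j}$ …
… $\sum_{j=1}^{p} a[j]\,z^{-j})$ and equating …
… $\sum_{j=1}^{n-1} \frac{j}{n}\,\tilde{c}[j]$ … $\sum_{j=n-p}^{n-1} \frac{j}{n}\,\tilde{c}[j]$ …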
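The fragments above appear to come from the standard recursion that turns LP coefficients into LPC cepstral coefficients. The sketch below assumes the all-pole model H(z) = 1 / (1 - sum_{j=1}^{p} a[j] z^{-j}); with the opposite sign convention the signs of a[j] flip. Under this assumption the recursion is c[n] = a[n] + sum_{j=1}^{n-1} (j/n) c[j] a[n-j] for n <= p, and c[n] = sum_{j=n-p}^{n-1} (j/n) c[j] a[n-j] for n > p. The function name `lpc_to_cepstrum` is hypothetical.

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """Convert LP coefficients to LPC cepstra c[1..n_ceps].

    Assumes H(z) = 1 / (1 - sum_j a[j] z^-j); a[0] in the array holds a[1].
    """
    a = np.asarray(a, dtype=float)
    p = len(a)
    c = np.zeros(n_ceps + 1)            # c[0] is unused (it would hold ln G)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        # j runs over indices for which a[n - j] exists, i.e. 1 <= n - j <= p
        for j in range(max(1, n - p), n):
            acc += (j / n) * c[j] * a[n - j - 1]
        c[n] = acc
    return c[1:]

# Single-pole check: H(z) = 1 / (1 - 0.5 z^-1) has cepstrum c[n] = 0.5**n / n.
ceps = lpc_to_cepstrum([0.5], n_ceps=4)
```

The single-pole case is a convenient sanity check because ln H(z) = -ln(1 - 0.5 z^-1) has a known power series.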
rising edge: $\frac{k - f(m-1)}{f(m) - f(m-1)}$
falling edge: $\frac{f(m+1) - k}{f(m+1) - f(m)}$
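The two ratios above describe the rising and falling edges of a triangular filter between boundary points f(m-1), f(m), and f(m+1). The sketch below builds a bank of such filters. It is illustrative only: the mel mapping 1127 ln(1 + f/700), the placement of boundaries at equal mel spacing, and the names `hz_to_mel`, `mel_to_hz`, and `mel_filterbank` are assumptions, not details taken from the slides.

```python
import numpy as np

def hz_to_mel(f):
    return 1127.0 * np.log(1.0 + f / 700.0)      # assumed mel-scale formula

def mel_to_hz(m):
    return 700.0 * (np.exp(m / 1127.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Return an (n_filters, n_fft // 2 + 1) matrix of triangular filters."""
    # Boundary points f(0), ..., f(M+1): equally spaced in mel, mapped to FFT bins.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    f = np.floor(n_fft / sample_rate * mel_to_hz(mel_pts)).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, mid, hi = f[m - 1], f[m], f[m + 1]
        for k in range(lo, mid):                 # rising edge
            bank[m - 1, k] = (k - lo) / (mid - lo)
        bank[m - 1, mid] = 1.0                   # peak at the centre bin
        for k in range(mid + 1, hi):             # falling edge
            bank[m - 1, k] = (hi - k) / (hi - mid)
    return bank

bank = mel_filterbank(n_filters=10, n_fft=512, sample_rate=16000)
```

Multiplying a power spectrum by `bank.T` would give one energy per filter; taking logs and a DCT of those energies is the usual route to MFCCs.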
■ Smooth spectral fit that matches higher amplitude components
■ Perceptually based frequency scale (MFCCs)
■ Perceptually based amplitude scale (neither)
■ Stop closures and releases
■ Formant transitions
feature vector at frame $t$: $(y_t, \Delta y_t)$
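Appending a delta term to each frame is one way to capture the dynamics listed above. The sketch below uses a simple central difference, delta y_t = (y_{t+1} - y_{t-1}) / 2, with the first and last frames repeated at the edges. That particular delta formula and the name `add_deltas` are assumptions; the slides may use a different estimate, such as a regression over more frames.

```python
import numpy as np

def add_deltas(y):
    """y: (T, D) static features. Returns (T, 2D) rows [y_t, delta y_t]."""
    y = np.asarray(y, dtype=float)
    padded = np.pad(y, ((1, 1), (0, 0)), mode="edge")   # repeat first and last frame
    delta = (padded[2:] - padded[:-2]) / 2.0            # central difference (assumed form)
    return np.hstack([y, delta])

# Four frames of a one-dimensional feature that rises linearly.
augmented = add_deltas([[0.0], [1.0], [2.0], [3.0]])
```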
■ for each word w, collect a single audio example $A^{raw}_w$
■ to recognize audio signal $A^{raw}$, find word w that minimizes $\text{distance}(A^{raw}, A^{raw}_w)$
■ convert raw audio signals Araw into salient features A
■ example:
■ a simple scenario
■ case 1: A, Aw are all the same length, say, T frames
■ see what works well
■ popular distances
Euclidean distance: $\sqrt{\sum_i (x_i - y_i)^2}$
$L^p$ norm: $\sqrt[p]{\sum_i |x_i - y_i|^p}$
■ whatever
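The frame distances mentioned above can be written in a few lines. This is a minimal sketch; the vector names `x` and `y` and the function names are assumptions chosen for illustration.

```python
import numpy as np

def euclidean_distance(x, y):
    """Square root of the summed squared differences between two frames."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

def lp_distance(x, y, p):
    """L^p norm of the difference; p = 2 reduces to the Euclidean distance."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

d_euclid = euclidean_distance([0.0, 0.0], [3.0, 4.0])
d_city_block = lp_distance([0.0, 0.0], [3.0, 4.0], p=1)
```

Either function can serve as the per-frame distance inside a template-matching or DTW loop; which one works best is an empirical question, as the slide suggests.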
■ what to do?
■ solution 1: make everything the same length
■ do vowels and consonants stretch equally in time?
■ handling silence
■ want a nonlinear alignment scheme!
■ introduce warping functions τ1(t), τ2(t)
■ given a pair of warping functions τ1(t), τ2(t), distance is well-
■ begin at the beginning; end at the end
■ don’t move backwards (monotonicity)
■ don’t move forwards too far (locality)
■ even better: constrain alignment to be comprised of sequence
■ e.g., three possible moves
■ alignment must consist only of segments of these types
■ which (legal) warping function to use to calculate the distance
■ consider all of them; pick the one with the smallest distance
$\min_{\tau_1,\tau_2} \sum_{t=1}^{T} \text{FRAMEDIST}(A_1(\tau_1(t)), A_2(\tau_2(t)))$
■ no: dynamic programming!
■ why the name “dynamic programming”?
■ DTW can be framed as instance of shortest path problem
■ solvable using dynamic programming
■ concepts useful in other speech algorithms (HMM’s, finite-state
■ how can we solve this baffling conundrum?
■ we want shortest paths, so how few O’s can you touch?
■ key observation 1
$d(S) = \min_{S' \to S} \{ d(S') + \text{DISTANCE}(S', S) \}$
■ proposed algorithm
■ key observation 2
■ sort states topologically: number from 1, ..., N
■ d(1) = 0
■ for S = 2, ..., N do
    $d(S) = \min_{S' \to S} \{ d(S') + \text{DISTANCE}(S', S) \}$
■ final answer: d(N)
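The algorithm above can be sketched directly: visit states in topological order and take the minimum over incoming arcs. The function name `shortest_distance` and the example graph below are hypothetical; the graph is not the one drawn on the slide.

```python
def shortest_distance(num_states, arcs):
    """Shortest distance from state 1 to state N in a topologically numbered DAG.

    arcs: iterable of (src, dst, distance) with src < dst.
    """
    incoming = {s: [] for s in range(1, num_states + 1)}
    for src, dst, dist in arcs:
        if not 1 <= src < dst <= num_states:
            raise ValueError("arcs must go from a lower to a higher state number")
        incoming[dst].append((src, dist))

    d = {1: 0.0}                                   # d(1) = 0
    for s in range(2, num_states + 1):
        # d(S) = min over arcs S' -> S of d(S') + DISTANCE(S', S)
        d[s] = min((d[p] + w for p, w in incoming[s]), default=float("inf"))
    return d[num_states]                           # final answer: d(N)

# Hypothetical 4-state graph, made up for illustration.
example_arcs = [(1, 2, 1.0), (1, 3, 4.0), (2, 3, 2.0), (2, 4, 7.0), (3, 4, 1.0)]
best = shortest_distance(4, example_arcs)
```

Each arc is examined exactly once, so the work grows linearly with the number of arcs rather than with the number of paths.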
■ or was the whole dynamic programming discussion just our way
■ align two frame utterance A1 (A-T) with three frame utterance
■ frame distances FRAMEDIST(A1(t1), A2(t2))
■ move set
■ we have the states of the graph; start state, final state
■ a particular τ1(t), τ2(t) represents path from start to final state
■ what are arcs of graph, and what are distances on each arc?
■ at each state of the graph, you can take each move
■ take corresponding frame distance (at arc source)
■ distance associated with alignment is same as distance along
■ yes!
■ sort states topologically
■ for t1 = 1, . . . , 2 do
■ final answer: d(2, 3) + FRAMEDIST(A1(2), A2(3))
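The same dynamic program can be sketched for utterances of any length. Two assumptions are made here because the move-set figure is not recoverable: the three moves are taken to be one step in t1, one step in t2, or one step in both, and the frame distance defaults to Euclidean. As on the slides, each arc is charged the frame distance at its source state, and the final state's frame distance is added at the end. Indices are 0-based, and `dtw_distance` is a hypothetical name.

```python
import numpy as np

def euclidean(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

def dtw_distance(a1, a2, frame_dist=euclidean):
    """DTW distance between frame sequences a1 and a2 (0-based indices)."""
    n1, n2 = len(a1), len(a2)
    d = np.full((n1, n2), np.inf)
    d[0, 0] = 0.0                                  # start state
    moves = ((1, 0), (0, 1), (1, 1))               # assumed move set
    for t1 in range(n1):                           # row-major order is topological
        for t2 in range(n2):
            for m1, m2 in moves:
                p1, p2 = t1 - m1, t2 - m2          # source state of this arc
                if p1 >= 0 and p2 >= 0:
                    cand = d[p1, p2] + frame_dist(a1[p1], a2[p2])
                    d[t1, t2] = min(d[t1, t2], cand)
    # add the frame distance at the final state itself
    return float(d[-1, -1] + frame_dist(a1[-1], a2[-1]))

two_frames = [[0.0], [1.0]]
three_frames = [[0.0], [0.0], [1.0]]
dist = dtw_distance(two_frames, three_frames)
```

In the usage example the first frame of the shorter utterance is stretched over two frames of the longer one, so the two sequences align with zero distance.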
■ let’s simulate this algorithm on a 100Hz human brain
■ what if we want to know which alignment produced the shortest
■ trace back from final state to get best path: (3,2), (2,1), (1, 1)
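Recovering the alignment only requires remembering, for each state, which predecessor achieved the minimum. The sketch below is one way to do that, under the same assumptions as a basic DTW: an assumed move set of one step in either or both utterances, Euclidean frame distances, and 0-based (t1, t2) indices. The name `dtw_with_path` is hypothetical, and the example utterances are made up rather than taken from the slide.

```python
import numpy as np

def dtw_with_path(a1, a2):
    """Return (distance, path); path is a list of 0-based (t1, t2) pairs."""
    a1 = np.asarray(a1, dtype=float)               # shape (T1, D)
    a2 = np.asarray(a2, dtype=float)               # shape (T2, D)
    n1, n2 = len(a1), len(a2)
    cost = np.linalg.norm(a1[:, None, :] - a2[None, :, :], axis=2)
    d = np.full((n1, n2), np.inf)
    d[0, 0] = 0.0
    back = {}                                      # best predecessor of each state
    for i in range(n1):
        for j in range(n2):
            for mi, mj in ((1, 0), (0, 1), (1, 1)):    # assumed move set
                pi, pj = i - mi, j - mj
                if pi >= 0 and pj >= 0 and d[pi, pj] + cost[pi, pj] < d[i, j]:
                    d[i, j] = d[pi, pj] + cost[pi, pj]  # distance charged at arc source
                    back[(i, j)] = (pi, pj)
    path = [(n1 - 1, n2 - 1)]                      # trace back from the final state
    while path[-1] != (0, 0):
        path.append(back[path[-1]])
    path.reverse()
    return float(d[-1, -1] + cost[-1, -1]), path

distance, alignment = dtw_with_path([[0.0], [1.0]], [[0.0], [0.0], [1.0]])
```

Storing one back pointer per state costs the same memory as the distance table, so the traceback adds little overhead.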
■ correct the bias toward larger distances for longer utterances
$\sum_{t=1}^{T} \text{FRAMEDIST}(A_1(\tau_1(t)), A_2(\tau_2(t)))$
■ DTW is effective way to calculate distance between two signals
■ can be extended to multiple training examples per word
■ can be extended to connected speech
■ signal processing and DTW are all you need for simple ASR
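To make the last point concrete, here is a minimal sketch of an isolated-word recognizer: one stored feature template per word, and recognition by smallest DTW distance. It assumes features have already been extracted, uses Euclidean frame distances, and accumulates the frame distance at each visited state, which gives the same total as charging it at the arc source. The names `dtw`, `recognize`, and the toy templates are made up for illustration.

```python
import numpy as np

def dtw(a1, a2):
    """DTW distance with moves (1,0), (0,1), (1,1) and Euclidean frame distance."""
    a1, a2 = np.asarray(a1, dtype=float), np.asarray(a2, dtype=float)
    cost = np.linalg.norm(a1[:, None, :] - a2[None, :, :], axis=2)
    acc = np.full(cost.shape, np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(cost.shape[0]):
        for j in range(cost.shape[1]):
            if i == 0 and j == 0:
                continue
            best = min(acc[i - 1, j] if i > 0 else np.inf,
                       acc[i, j - 1] if j > 0 else np.inf,
                       acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            acc[i, j] = best + cost[i, j]
    return float(acc[-1, -1])

def recognize(features, templates):
    """Return the word whose template is closest to `features` under DTW."""
    return min(templates, key=lambda word: dtw(features, templates[word]))

# Toy one-dimensional "features"; real templates would be MFCC-like vectors.
templates = {"yes": [[0.0], [1.0], [1.0]], "no": [[5.0], [4.0]]}
hypothesis = recognize([[0.0], [0.0], [1.0]], templates)
```

Recognition cost grows with the number of templates, since the input is aligned against every stored word.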
■ training
w , or
■ recognition
w
■ signal processing and dynamic time warping
■ what’s next