8-Speech Recognition
Speech Recognition Concepts
Speech Recognition Approaches
Recognition Theories
Bayes Rule
Simple Language Model
P(A|W)
Network Types
7-Speech Recognition (Cont'd)
HMM Calculating Approaches
Neural Components
Three Basic HMM Problems
Viterbi Algorithm
State Duration Modeling
Training in HMM
Isolated Word Recognition (IWR)
Speaker Dependent, Multiple Speaker, and Speaker Independent systems
Vocabulary Size:
Small: < 20 words
Medium: > 100 and < 1,000 words
Large: > 1,000 and < 10,000 words
Very Large: > 10,000 words
[Diagram: NLP and speech processing. Speech recognition maps speech to text; speech synthesis maps text, via a phone sequence, to speech; speech understanding combines speech recognition with NLP.]
Speech Recognition Approaches:
Bottom-Up Approach
Top-Down Approach
Blackboard Approach
[Diagram: Bottom-Up Approach. Signal Processing → Feature Extraction → Segmentation (Voiced/Unvoiced/Silence) → Sound Classification Rules → Phonotactic Rules → Lexical Access → Language Model → Recognized Utterance, with knowledge sources applied at each stage.]
[Diagram: Top-Down Approach. Feature Analysis feeds a Unit Matching System; lexical, syntactic, and semantic hypotheses are generated from the inventory of recognition units, the word dictionary, the grammar, and the task model; an Utterance Verifier/Matcher produces the Recognized Utterance.]
[Diagram: Blackboard Approach. Environmental, acoustic, lexical, syntactic, and semantic processes communicate through a shared blackboard.]
Articulatory Based Recognition
Uses articulatory system modeling for recognition. This theory has been the most successful so far.
Auditory Based Recognition
Uses auditory system modeling for recognition.
Hybrid Based Recognition
A combination of the above theories.
Motor Theory
Models the intended gestures of the speaker.
Bayes Rule
We have the sequence of acoustic symbols A. Solution: find the most probable word sequence W given A.
A: acoustic symbols; W: word sequence. We should find $\hat{W}$ so that
$$\hat{W} = \arg\max_{W} P(W \mid A)$$
By Bayes rule,
$$P(W \mid A) = \frac{P(A \mid W)\,P(W)}{P(A)}$$
Since $P(A)$ does not depend on $W$,
$$\hat{W} = \arg\max_{W} P(A \mid W)\,P(W)$$
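A minimal sketch of this decoding rule in Python; the candidate sentences and probability values are made up purely for illustration:

```python
# Toy illustration of W_hat = argmax_W P(A|W) * P(W).
# All probability values below are illustrative, not from a real model.

acoustic_likelihood = {          # P(A|W) for one fixed observed A
    "recognize speech": 0.20,
    "wreck a nice beach": 0.25,
}
language_model = {               # P(W)
    "recognize speech": 0.010,
    "wreck a nice beach": 0.0001,
}

def decode():
    # Score each candidate word sequence by P(A|W) * P(W);
    # P(A) is constant over W, so it can be dropped.
    return max(acoustic_likelihood,
               key=lambda w: acoustic_likelihood[w] * language_model[w])

print(decode())  # -> "recognize speech"
```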
Simple Language Model
The probability of a word sequence $W = w_1 w_2 \cdots w_n$ decomposes by the chain rule:
$$P(W) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, w_2, \ldots, w_{n-1})$$
Since the full histories cannot be estimated reliably, they are truncated; for a trigram model:
$$P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})$$
The trigram probabilities are estimated from counts:
$$P(w_3 \mid w_1, w_2) = \frac{\text{number of occurrences of } w_3 \text{ after } w_1 w_2}{\text{total number of occurrences of } w_1 w_2}$$
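A minimal sketch of this count-based trigram estimate in Python; the toy corpus is illustrative:

```python
# Trigram estimation by counting: P(w3 | w1, w2) = C(w1 w2 w3) / C(w1 w2).
# The corpus below is a made-up example.
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def p_trigram(w1, w2, w3):
    """Relative-frequency estimate of P(w3 | w1, w2)."""
    if bigrams[(w1, w2)] == 0:
        return 0.0
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

print(p_trigram("the", "cat", "sat"))  # "the cat" occurs twice, once before "sat" -> 0.5
```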
Remaining difficulties:
Prosody (recognition should be robust to prosodic variation)
Noise (noise should be prevented or compensated)
Spontaneous Speech
Main recognition techniques:
Dynamic Time Warping (DTW)
Hidden Markov Model (HMM)
Artificial Neural Network (ANN)
Hybrid Systems
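As a sketch of the first technique, a minimal dynamic time warping distance in Python; 1-D sequences are assumed here for simplicity, whereas real systems align multidimensional feature vectors:

```python
# Minimal dynamic time warping (DTW) between two feature sequences.
# 1-D sequences are illustrative; real recognizers compare frame-by-frame
# feature vectors (e.g., MFCCs) with a vector distance.
import numpy as np

def dtw_distance(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Allow match, insertion, or deletion steps.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

template = [1.0, 2.0, 3.0, 2.0]
test = [1.0, 1.0, 2.0, 3.0, 3.0, 2.0]
print(dtw_distance(template, test))  # 0.0: the sequences align perfectly
```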
The output of a neural unit is a nonlinear function of the weighted sum of its inputs:
$$y = f\!\left(\sum_{i=1}^{N} w_i x_i\right)$$
Neural Network Types
Perceptron
Time Delay Neural Network (TDNN)
Computational
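A minimal perceptron sketch in Python under the weighted-sum model above; the training task (the OR function), learning rate, and epoch count are illustrative:

```python
# Minimal perceptron: y = f(sum_i w_i * x_i + bias), with a step nonlinearity.
# Training data and hyperparameters below are illustrative.
import numpy as np

def perceptron_output(w, x, bias=0.0):
    return 1 if np.dot(w, x) + bias > 0 else 0

def train(samples, labels, epochs=10, lr=0.1):
    w = np.zeros(len(samples[0]))
    bias = 0.0
    for _ in range(epochs):
        for x, t in zip(samples, labels):
            y = perceptron_output(w, x, bias)
            w += lr * (t - y) * np.asarray(x)   # classic perceptron update
            bias += lr * (t - y)
    return w, bias

# Learn a simple OR function.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
T = [0, 1, 1, 1]
w, b = train(X, T)
print([perceptron_output(w, x, b) for x in X])  # -> [0, 1, 1, 1]
```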
[Figure: multilayer neural network structures with N input units and M output units.]
Hybrid Neural Network and Matched Filter for Recognition
[Figure: pattern classifier; speech → acoustic features → delays → output units.]
The system is simple, but too much computation is required.
It doesn't determine a specific structure for the network.
Regardless of its simplicity, the results are promising.
The training set is large, so training should be done offline.
Different preprocessing techniques are available.
The choice of preprocessing method is an important design decision.
The MFCC method is based on how the human ear perceives sounds.
Compared to other features, MFCC behaves better in noisy environments.
MFCC was developed primarily for speech recognition applications, but it also gives suitable performance in speaker recognition.
The Mel is the auditory unit of the human ear; it is obtained with the help of the following relation:
$$\text{mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)$$
The cepstral coefficients are obtained from the log filter-bank energies by a discrete cosine transform:
$$c_k = \sum_{m=1}^{M} \log(E_m)\,\cos\!\left(\frac{\pi k \,(m - \tfrac{1}{2})}{M}\right), \qquad k = 0, 1, \ldots, M-1$$
Distribution of the filters based on the Mel scale
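A minimal sketch of a mel-spaced triangular filter bank in Python, assuming the mel relation above; the sampling rate, FFT size, and filter count are illustrative:

```python
# Build a mel-spaced triangular filter bank (illustrative parameters).
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters=26, n_fft=512, sample_rate=16000):
    # Filter center frequencies are equally spaced on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0),
                             hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):           # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):          # falling edge of the triangle
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

print(mel_filter_bank().shape)  # (26, 257)
```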
MFCC pipeline: framing → |FFT|² → Mel-scaling → logarithm → IDCT → low-order coefficients (cepstra) → differentiator (delta and delta-delta cepstra)
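A compact end-to-end sketch of this pipeline in Python, reusing the mel_filter_bank() sketch above; the frame length, hop size, and coefficient count are illustrative:

```python
# Minimal MFCC pipeline: framing -> |FFT|^2 -> mel filter bank -> log -> DCT.
# Parameters are illustrative; mel_filter_bank() is the sketch defined above.
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_filters=26, n_ceps=13, n_fft=512):
    window = np.hamming(frame_len)
    fbank = mel_filter_bank(n_filters, n_fft, sample_rate)
    coeffs = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2        # |FFT|^2
        energies = np.log(fbank @ power + 1e-10)              # log mel energies
        ceps = dct(energies, type=2, norm='ortho')[:n_ceps]   # low-order DCT coeffs
        coeffs.append(ceps)
    return np.array(coeffs)

# Example: one second of a synthetic tone.
t = np.arange(16000) / 16000.0
print(mfcc(np.sin(2 * np.pi * 440 * t)).shape)  # (frames, 13)
```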
MFCC properties:
Maps the filter-bank energies in the direction of maximum variance.
Decorrelates the speech features only incompletely.
Good response in clean environments; reduced performance in noisy environments.
Short-term Fourier Transform
Standard way of frequency analysis: decompose the incoming signal into its constituent frequency components:
$$X_m(k) = \sum_{n=0}^{N-1} x(n + mp)\,w(n)\,e^{-j 2\pi n k / N}$$
w(n): windowing function; N: frame length; p: step size
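A minimal STFT sketch in Python matching these definitions; the window choice, frame length, and step size are illustrative:

```python
# Short-term Fourier transform: slide a window along the signal, FFT each frame.
# w(n): window, N: frame length, p: step size, as defined above.
import numpy as np

def stft(x, N=512, p=160):
    w = np.hamming(N)                       # w(n): windowing function
    frames = [x[m * p : m * p + N] * w
              for m in range((len(x) - N) // p + 1)]
    return np.array([np.fft.rfft(f) for f in frames])  # one spectrum per frame

x = np.random.randn(16000)                  # one second at 16 kHz (illustrative)
print(stft(x).shape)                        # (frames, N // 2 + 1)
```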
Related to the masking phenomenon: frequency components within a critical band tend to mask one another, so the ear does not resolve them separately.
Spectral values in adjacent frequency bands are highly correlated.
The correlation results in a Gaussian with a non-diagonal covariance matrix.
Decorrelation is useful to improve the performance of models that assume diagonal covariance.
Cepstrum
Computed as the inverse Fourier transform of the log magnitude of the Fourier transform of the signal.
The log magnitude is real and symmetric, so the transform is equivalent to the Discrete Cosine Transform.
Approximately decorrelates the features.
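A minimal sketch of this computation in Python; the input frame is illustrative:

```python
# Real cepstrum: inverse FFT of the log magnitude of the FFT.
import numpy as np

def real_cepstrum(x):
    spectrum = np.fft.fft(x)
    log_mag = np.log(np.abs(spectrum) + 1e-10)  # real and symmetric
    return np.fft.ifft(log_mag).real            # imaginary part is ~0

x = np.random.randn(512)                         # illustrative signal frame
c = real_cepstrum(x)
print(c[:5])                                     # low-order cepstral coefficients
```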
Principal Component Analysis (PCA)
Mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components (PCs).
Find an orthogonal basis such that the reconstruction error over the training set is minimized.
This turns out to be equivalent to diagonalizing the sample autocovariance matrix.
Complete decorrelation.
Computes the principal dimensions of variability, but does not necessarily provide the optimal discrimination among classes.
Algorithm
Input: M training vectors $x_i \in \mathbb{R}^N$, $i = 1, \ldots, M$.
Covariance matrix:
$$\mathrm{Cov} = \frac{1}{M} \sum_{i=1}^{M} (x_i - \bar{x})(x_i - \bar{x})^T$$
Eigen-decomposition: find the eigenvalues and eigenvectors
$$\mathrm{Cov} \cdot \mathrm{EigVec}_i = \mathrm{EigVal}_i \cdot \mathrm{EigVec}_i, \qquad i = 1, \ldots, N$$
sorted so that $\mathrm{EigVal}_1 \geq \mathrm{EigVal}_2 \geq \cdots \geq \mathrm{EigVal}_N$.
Transform matrix (keep the top R eigenvectors):
$$F = \left[\,\mathrm{EigVec}_1\ \mathrm{EigVec}_2\ \cdots\ \mathrm{EigVec}_R\,\right]$$
Apply transform: input $x$ (N-dim vectors), output $y = F^T x$ (R-dim vectors, $R \leq N$).
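A minimal sketch of this algorithm in Python with numpy; the data and the choice R = 3 are illustrative:

```python
# PCA: diagonalize the sample covariance, keep the top-R eigenvectors,
# project N-dim inputs down to R-dim outputs. Data here is illustrative.
import numpy as np

def pca_transform(X, R):
    """X: (M, N) matrix of M training vectors; returns (F, projected data)."""
    mean = X.mean(axis=0)
    Xc = X - mean                                  # center the data
    cov = (Xc.T @ Xc) / len(X)                     # sample covariance (N, N)
    eigval, eigvec = np.linalg.eigh(cov)           # symmetric eigendecomposition
    order = np.argsort(eigval)[::-1]               # sort by decreasing eigenvalue
    F = eigvec[:, order[:R]]                       # (N, R) transform matrix
    return F, Xc @ F                               # y = F^T (x - mean), per row

X = np.random.randn(100, 10) @ np.random.randn(10, 10)  # correlated 10-dim data
F, Y = pca_transform(X, R=3)
print(Y.shape)                                     # (100, 3)
```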
PCA in speech recognition systems
Linear Discriminant Analysis (LDA)
Find an orthogonal basis such that the separability among classes is maximized.
This also turns out to be a generalized eigenvalue problem.
Complete decorrelation.
Provides the optimal linear separability.
Formant information is crucial for recognition.
Ways to enhance and preserve the formant structure:
Truncating the number of cepstral coefficients (smooths the spectral envelope).
Linear prediction: the peak-hugging property of the LP spectrum.
To capture the temporal features of the spectral envelope:
Delta features: first and second order differences, usually computed by regression over neighboring frames.
Cepstral Mean Subtraction: for normalizing channel effects and adjusting for spectral slope.
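A minimal sketch of both operations in Python; the regression window size K and the input matrix are illustrative:

```python
# Delta features by linear regression over +/- K neighboring frames,
# plus cepstral mean subtraction (CMS). K and the input are illustrative.
import numpy as np

def delta(features, K=2):
    """features: (frames, coeffs); first-order differences by regression."""
    padded = np.pad(features, ((K, K), (0, 0)), mode='edge')
    num = sum(k * (padded[K + k:len(features) + K + k] -
                   padded[K - k:len(features) + K - k])
              for k in range(1, K + 1))
    return num / (2 * sum(k * k for k in range(1, K + 1)))

def cms(features):
    """Subtract the per-coefficient mean to remove channel effects."""
    return features - features.mean(axis=0)

mfccs = np.random.randn(98, 13)          # illustrative MFCC matrix
d1 = delta(mfccs)                         # delta cepstra
d2 = delta(d1)                            # delta-delta cepstra
normalized = cms(mfccs)
print(d1.shape, d2.shape, normalized.shape)
```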
Filtering of the temporal trajectories of some feature components (e.g., RASTA processing).
This is usually a bandpass filter, maintaining the modulation frequencies that are important for speech while suppressing slowly varying channel effects.
Word Pair Model: Specify which word pairs are valid
Entropy of the Source:
$$H = -\lim_{Q \to \infty} \frac{1}{Q} \sum_{w_1, w_2, \ldots, w_Q} P(w_1, w_2, \ldots, w_Q)\,\log P(w_1, w_2, \ldots, w_Q)$$
Assuming independence, $P(w_1, w_2, \ldots, w_Q) = P(w_1)P(w_2)\cdots P(w_Q)$, which gives the first order entropy of the source:
$$H = -\sum_{w \in V} P(w)\,\log P(w)$$
If the source is ergodic, meaning its statistical properties can be completely characterized in a sufficiently long sequence that the source puts out,
$$H = -\lim_{Q \to \infty} \frac{1}{Q} \log P(w_1, w_2, \ldots, w_Q)$$
We often compute H based on a finite but sufficiently large Q:
$$\hat{H} = -\frac{1}{Q} \log P(w_1, w_2, \ldots, w_Q)$$
H is the degree of difficulty that the recognizer encounters, on average, when it is to determine a word from the same source.
Using a language model: if the N-gram language model $P_N(W)$ is used, an estimate of H is:
$$\hat{H} = -\frac{1}{Q} \sum_{i=1}^{Q} \log P_N(w_i \mid w_{i-1}, w_{i-2}, \ldots, w_{i-N+1})$$
In general:
$$\hat{H} = -\frac{1}{Q} \log \hat{P}(w_1, w_2, \ldots, w_Q)$$
Perplexity is defined as:
$$B = 2^{\hat{H}}$$
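A minimal sketch of this estimate in Python under a bigram model; the corpus and test sequence are illustrative and no smoothing is applied:

```python
# Estimate H-hat and perplexity B = 2^H of a test sequence under a bigram model.
# Corpus and test data are illustrative; no smoothing is applied.
import math
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w1, w2):
    return bigrams[(w1, w2)] / unigrams[w1]

test = ["the", "cat", "sat", "on", "the", "mat"]
log_prob = sum(math.log2(p_bigram(w1, w2))
               for w1, w2 in zip(test, test[1:]))
H = -log_prob / (len(test) - 1)     # average negative log-probability per word
print(H, 2 ** H)                     # entropy estimate and perplexity B
```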
Example: for a source with vocabulary V in which all words are equally likely, $H = \log_2 |V|$ and $B = |V|$:
a) B = 8
b) B = 4
Overall recognition system based on subword units