8-Speech Recognition (PowerPoint PPT Presentation)

SLIDE 1

8-Speech Recognition

• Speech Recognition Concepts
• Speech Recognition Approaches
• Recognition Theories
• Bayes' Rule
• Simple Language Model
• P(A|W) Network Types

SLIDE 2

8-Speech Recognition (Cont'd)

• HMM Calculating Approaches
• Neural Components
• Three Basic HMM Problems
• Viterbi Algorithm
• State Duration Modeling
• Training in HMM

SLIDE 3

Recognition Tasks

• Isolated Word Recognition (IWR), Connected Word (CW), and Continuous Speech Recognition (CSR)
• Speaker Dependent, Multiple Speaker, and Speaker Independent
• Vocabulary Size
  - Small: < 20 words
  - Medium: > 100, < 1,000 words
  - Large: > 1,000, < 10,000 words
  - Very Large: > 10,000 words

SLIDE 4

Speech Recognition Concepts

[Diagram: NLP and speech processing. Speech Understanding combines Speech Recognition (speech to text/phone sequence) with NLP; Speech Synthesis combines NLP with phone-sequence generation (text to speech).]

Speech recognition is the inverse of speech synthesis.

SLIDE 5

Speech Recognition Approaches

• Bottom-Up Approach
• Top-Down Approach
• Blackboard Approach

SLIDE 6

Bottom-Up Approach

[Diagram: bottom-up pipeline. Signal Processing → Feature Extraction → Segmentation → Sound Classification Rules → Phonotactic Rules → Lexical Access → Language Model → Recognized Utterance, with knowledge sources such as voiced/unvoiced/silence decisions feeding the stages.]

SLIDE 7

Top-Down Approach

[Diagram: top-down architecture. Feature Analysis feeds a Unit Matching System; Lexical, Syntactic, and Semantic Hypotheses are checked by an Utterance Verifier/Matcher against an inventory of speech recognition units, a word dictionary, a grammar, and a task model, producing the Recognized Utterance.]

SLIDE 8

Blackboard Approach

[Diagram: Environmental, Acoustic, Lexical, Syntactic, and Semantic processes all read from and write to a shared blackboard.]

SLIDE 9

Recognition Theories

• Articulatory-Based Recognition
  - Uses articulatory system modeling for recognition
  - This theory is the most successful so far
• Auditory-Based Recognition
  - Uses auditory system modeling for recognition
• Hybrid Recognition
  - A combination of the above theories
• Motor Theory
  - Models the intended gestures of the speaker

SLIDE 10

Recognition Problem

• We have a sequence of acoustic symbols and want to find the words uttered by the speaker.
• Solution: find the most probable word sequence given the acoustic symbols.

SLIDE 11

Recognition Problem

• A: acoustic symbols
• W: word sequence
• We should find the word sequence $\hat{W}$ such that

$$P(\hat{W} \mid A) = \max_{W} P(W \mid A)$$

SLIDE 12

Bayes' Rule

$$P(x \mid y)\,P(y) = P(x, y)$$

$$P(x \mid y) = \frac{P(y \mid x)\,P(x)}{P(y)} \;\;\Rightarrow\;\; P(W \mid A) = \frac{P(A \mid W)\,P(W)}{P(A)}$$

SLIDE 13

Bayes' Rule (Cont'd)

$$P(\hat{W} \mid A) = \max_{W} P(W \mid A) = \max_{W} \frac{P(A \mid W)\,P(W)}{P(A)}$$

$$\hat{W} = \arg\max_{W} P(W \mid A) = \arg\max_{W} P(A \mid W)\,P(W)$$

SLIDE 14

Simple Language Model

$$W = w_1 w_2 w_3 \cdots w_n$$

$$P(W) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, w_2, \ldots, w_{n-1})$$

Computing this probability is very difficult and requires a very large database, so in practice we use trigram and bigram models.

SLIDE 15

Simple Language Model (Cont'd)

Trigram:

$$P(W) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1}, w_{i-2})$$

Bigram:

$$P(W) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})$$

Monogram (unigram):

$$P(W) \approx \prod_{i=1}^{n} P(w_i)$$
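
As a concrete illustration of the bigram model above, the conditional probabilities can be estimated by maximum likelihood from counts; the toy corpus and function name here are invented for illustration:

```python
from collections import Counter

def bigram_prob(corpus):
    """Maximum-likelihood bigram estimates P(w_i | w_{i-1}) from a token list."""
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    # P(b | a) = count(a, b) / count(a)
    return {(a, b): c / unigrams[a] for (a, b), c in bigrams.items()}

tokens = "the cat sat on the mat the cat ran".split()
p = bigram_prob(tokens)
print(p[("the", "cat")])  # count(the, cat) = 2, count(the) = 3 -> 2/3
```

Note that for every history word the estimated probabilities of its successors sum to one, as a conditional distribution must.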

SLIDE 16

Simple Language Model (Cont'd)

Computing method:

$$P(w_3 \mid w_1, w_2) = \frac{\text{count}(w_1 w_2 w_3)}{\text{count}(w_1 w_2)}$$

i.e., the number of occurrences of $w_3$ after $w_1 w_2$, divided by the total number of occurrences of $w_1 w_2$.

Ad hoc method (interpolation of relative frequencies $f$):

$$P(w_3 \mid w_1, w_2) \approx \alpha_1\, f(w_3 \mid w_1, w_2) + \alpha_2\, f(w_3 \mid w_2) + \alpha_3\, f(w_3)$$

SLIDE 17

Error Production Factors

• Prosody (recognition should be prosody-independent)
• Noise (noise should be prevented)
• Spontaneous speech

SLIDE 18

P(A|W) Computing Approaches

• Dynamic Time Warping (DTW)
• Hidden Markov Model (HMM)
• Artificial Neural Network (ANN)
• Hybrid Systems

SLIDE 19

Dynamic Time Warping

[Figure]

SLIDE 20

Dynamic Time Warping

[Figure]

SLIDE 21

Dynamic Time Warping

[Figure]

SLIDE 22

Dynamic Time Warping

[Figure]

SLIDE 23

Dynamic Time Warping

Search limitations:

• Start and end intervals
• Global limitation
• Local limitation
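
The DTW alignment discussed on these slides is a small dynamic program. A minimal sketch, assuming 1-D sequences, absolute-difference local cost, and only the standard local moves (no global band constraint):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping cost between two 1-D sequences,
    using the standard local moves: diagonal, horizontal, vertical."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

d = dtw_distance([1, 2, 3], [1, 2, 2, 3])
print(d)  # 0.0: the repeated 2 is absorbed by the warping path
```

A global limitation (slide 24) would restrict which (i, j) cells are filled; a local limitation (slide 25) would restrict the allowed moves.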

SLIDE 24

Dynamic Time Warping

Global limitation:

[Figure]

SLIDE 25

Dynamic Time Warping

Local limitation:

[Figure]

SLIDE 26

Artificial Neural Network

Simple computational element of a neural network: inputs $x_1, \ldots, x_{N+1}$ with weights $w_1, \ldots, w_{N+1}$ (the extra $(N+1)$-th input typically serves as a bias), producing

$$y = f\!\left(\sum_{i=1}^{N+1} w_i x_i\right)$$
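
The computational element above, a weighted sum passed through a nonlinearity f, can be sketched in a few lines. The choice of sigmoid for f is an assumption; the slide leaves f unspecified:

```python
import math

def neuron(x, w, f=lambda s: 1.0 / (1.0 + math.exp(-s))):
    """y = f(sum_i w_i * x_i); x and w must have the same length.
    A bias can be modeled by appending a constant input x_{N+1} = 1."""
    return f(sum(wi * xi for wi, xi in zip(w, x)))

y = neuron([1.0, 0.5, 1.0], [0.2, -0.4, 0.0])
print(y)  # weighted sum = 0.2 - 0.2 + 0 = 0, sigmoid(0) = 0.5
```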
SLIDE 27

Artificial Neural Network (Cont'd)

• Neural Network Types
  - Perceptron
  - Time Delay Neural Network (TDNN)

SLIDE 28

Artificial Neural Network (Cont'd)

[Figure: Single Layer Perceptron, mapping inputs $x_1, \ldots, x_{N+1}$ to outputs $y_1, \ldots, y_{M+1}$]

SLIDE 29

Artificial Neural Network (Cont'd)

[Figure: Three Layer Perceptron]

SLIDE 30

2.5.4.2 Neural Network Topologies

SLIDE 31

TDNN

SLIDE 32

2.5.4.6 Neural Network Structures for Speech Recognition

SLIDE 33

2.5.4.6 Neural Network Structures for Speech Recognition

SLIDE 34

Hybrid Methods

• Hybrid neural network and matched filter for recognition

[Diagram: pattern classifier; speech → acoustic features → delays → output units]

SLIDE 35

Neural Network Properties

• The system is simple, but training requires many iterations
• Does not determine a specific structure
• Despite its simplicity, the results are good
• The training set is large, so training should be done offline

SLIDE 36

Pre-processing

• Different preprocessing techniques are employed as the front end of speech recognition systems.
• The choice of preprocessing method depends on the task, the noise level, the modeling tool, etc.

SLIDE 37

SLIDE 38

SLIDE 39

SLIDE 40

SLIDE 41

SLIDE 42

SLIDE 43

The MFCC Method

• MFCC is based on the way the human ear perceives sounds.
• Compared with other methods, MFCC handles features in noisy environments better.
• MFCC was introduced primarily for speech recognition applications, but it also performs well in speaker recognition.
• The unit of human auditory pitch perception is the mel, which is obtained with the help of the following relation: [equation not preserved in the transcript]
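
The mel relation referenced above did not survive extraction; the commonly used form is mel(f) = 2595 · log10(1 + f/700), and the sketch below assumes that form:

```python
import math

def hz_to_mel(f_hz):
    """Standard mel-scale mapping (assumed form: 2595*log10(1 + f/700))."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

m = hz_to_mel(1000.0)
print(m)  # ~1000 mel: 1000 Hz is the scale's reference point
```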

SLIDE 44

Steps of the MFCC Method

Step 1: Map the signal from the time domain to the frequency domain using a short-time FFT:

$$X(m) = \sum_{n=0}^{F-1} z(n)\,w(n)\,W_F^{mn}, \qquad W_F = e^{-j 2\pi / F}, \quad m = 0, \ldots, F-1$$

where $z(n)$ is the speech signal, $w(n)$ is a window function such as the Hamming window, and $F$ is the frame length.

SLIDE 45

Steps of the MFCC Method (Cont'd)

Step 2: Compute the energy in each filter-bank channel. The number of filter banks, $M$, is based on the mel scale. The filter functions of the filter bank are

$$W_k(j), \qquad k = 0, 1, \ldots, M-1$$

SLIDE 46

Filter distribution based on the mel scale

[Figure]

SLIDE 47

Steps of the MFCC Method (Cont'd)

Step 4: Compress the spectrum and apply the DCT to obtain the MFCC coefficients.

In the relation above, $n = 0, \ldots, L$ is the order of the MFCC coefficients.

SLIDE 48

[Diagram: MFCC pipeline. Framing → |FFT|² → Mel-scaling → Logarithm → IDCT → low-order coefficients (cepstra) → differentiator → delta and delta-delta cepstra]
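
The pipeline above can be sketched end to end in NumPy. This is a simplified sketch of one frame only, with triangular mel filters and an explicit DCT-II; a real front end would add pre-emphasis, framing of the whole utterance, and liftering:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr, n_filters=20, n_ceps=12):
    """MFCCs of one windowed frame: |FFT|^2 -> mel filterbank -> log -> DCT."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # Triangular filters whose centers are equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    hz_pts = mel_to_hz(mel_pts)
    energies = np.empty(n_filters)
    for k in range(n_filters):
        lo, c, hi = hz_pts[k], hz_pts[k + 1], hz_pts[k + 2]
        rising = np.clip((freqs - lo) / (c - lo), 0.0, None)
        falling = np.clip((hi - freqs) / (hi - c), 0.0, None)
        energies[k] = np.sum(spectrum * np.minimum(rising, falling).clip(0.0))
    log_e = np.log(energies + 1e-10)       # logarithm stage
    # DCT-II of the log filter-bank energies; keep low-order coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2.0 * n_filters))
    return basis @ log_e

sr = 16000
t = np.arange(400) / sr
frame = np.hamming(400) * np.sin(2 * np.pi * 440.0 * t)  # 25 ms, 440 Hz tone
ceps = mfcc_frame(frame, sr)
print(ceps.shape)  # (12,)
```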

SLIDE 49

Mel-Frequency Cepstral Coefficients (MFCC)

[Figure]

SLIDE 50

Properties of the Mel Cepstrum (MFCC)

• Maps the mel filter-bank energies in the direction that maximizes their variance (using the DCT)
• The speech features become relatively independent of one another (an effect of the DCT)
• Good performance in clean environments
• Reduced performance in noisy environments

SLIDE 51

Time-Frequency Analysis

• Short-Term Fourier Transform
  - The standard way of doing frequency analysis: decompose the incoming signal into its constituent frequency components.
  - w(n): windowing function; N: frame length; p: step size
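
Using the slide's symbols (w(n) windowing function, N frame length, p step size), a minimal short-term Fourier transform sketch; the Hann window is an assumed choice:

```python
import numpy as np

def stft(x, N=256, p=128):
    """Short-term Fourier transform: windowed frames of length N, hop p."""
    w = np.hanning(N)                     # w(n): windowing function
    starts = range(0, len(x) - N + 1, p)  # p: step size
    return np.array([np.fft.rfft(w * x[s:s + N]) for s in starts])

x = np.sin(2 * np.pi * 1000 * np.arange(4096) / 8000.0)  # 1 kHz tone at 8 kHz
S = stft(x)
print(S.shape)  # (31, 129): 31 frames, N//2 + 1 frequency bins each
```

With N = 256 at an 8 kHz sampling rate, each bin spans 8000/256 = 31.25 Hz, so the 1 kHz tone lands exactly in bin 32.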

SLIDE 52

Critical Band Integration

• Related to the masking phenomenon: the threshold of a sinusoid is elevated when its frequency is close to the center frequency of a narrow-band noise.
• Frequency components within a critical band are not resolved: the auditory system interprets the signals within a critical band as a whole.

SLIDE 53

Bark Scale

[Figure]

SLIDE 54

Feature Orthogonalization

• Spectral values in adjacent frequency channels are highly correlated.
• The correlation results in a Gaussian model with many parameters: all the elements of the covariance matrix have to be estimated.
• Decorrelation is useful to improve the parameter estimation.

SLIDE 55

Cepstrum

• Computed as the inverse Fourier transform of the log magnitude of the Fourier transform of the signal.
• The log magnitude is real and symmetric, so the transform is equivalent to the Discrete Cosine Transform.
• The resulting coefficients are approximately decorrelated.
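
The definition above translates directly into NumPy; a minimal real-cepstrum sketch (the small floor inside the log keeps it finite and is an implementation detail, not part of the definition):

```python
import numpy as np

def real_cepstrum(x):
    """Inverse FFT of the log magnitude of the FFT of the signal."""
    spectrum = np.abs(np.fft.fft(x))
    return np.real(np.fft.ifft(np.log(spectrum + 1e-12)))

x = np.random.randn(512)
c = real_cepstrum(x)
print(c.shape)  # (512,)
```

Because the log magnitude is real and symmetric, the cepstrum itself comes out real and symmetric, which is why the slide equates the transform with a DCT.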

SLIDE 56

Principal Component Analysis

• Find an orthogonal basis such that the reconstruction error over the training set is minimized.
• This turns out to be equivalent to diagonalizing the sample autocovariance matrix.
• Complete decorrelation.
• Computes the principal dimensions of variability, but does not necessarily provide the optimal discrimination among classes.

SLIDE 57

Principal Component Analysis (PCA)

A mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components (PCs).

• Find an orthogonal basis such that the reconstruction error over the training set is minimized.
• This is equivalent to diagonalizing the sample autocovariance matrix.
• Complete decorrelation.
• Computes the principal dimensions of variability, but does not necessarily provide the optimal discrimination among classes.

SLIDE 58

PCA (Cont.)

Algorithm:

Input: $M$ vectors $x_i$, each $N$-dimensional. Output: $M$ vectors $y_i$, each $R$-dimensional ($R \le N$), obtained by applying the transform

$$y = F\,x$$

Covariance matrix:

$$\mathrm{Cov} = \frac{1}{M}\sum_{i=1}^{M} (x_i - \bar{x})(x_i - \bar{x})^{T}$$

Compute the eigenvalues $\mathrm{EigVal}_i$ and eigenvectors $\mathrm{EigVec}_i$, $i = 1, \ldots, N$, of the covariance matrix, ordered so that $\mathrm{EigVal}_1 \ge \mathrm{EigVal}_2 \ge \cdots \ge \mathrm{EigVal}_N$.

Transform matrix (rows are the leading eigenvectors):

$$F = \begin{bmatrix} \mathrm{EigVec}_1 \\ \mathrm{EigVec}_2 \\ \vdots \\ \mathrm{EigVec}_R \end{bmatrix}$$
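
The algorithm above maps directly onto NumPy's eigendecomposition; a minimal sketch in which the rows of F are the leading eigenvectors of the sample covariance:

```python
import numpy as np

def pca_transform(X, R):
    """X: M x N data matrix (one N-dim vector per row). Returns M x R projections."""
    Xc = X - X.mean(axis=0)                # subtract the mean vector
    cov = (Xc.T @ Xc) / len(X)             # sample covariance, N x N
    eigval, eigvec = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    F = eigvec[:, ::-1][:, :R].T           # top-R eigenvectors as rows of F
    return Xc @ F.T                        # y = F x for each centered vector

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = pca_transform(X, 2)
print(Y.shape)  # (100, 2)
```

The projected components come out uncorrelated (the covariance of Y is diagonal), with variances given by the leading eigenvalues in decreasing order.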

SLIDE 59

PCA (Cont.)

[Figure: PCA in speech recognition systems]

SLIDE 60

Linear Discriminant Analysis

• Find an orthogonal basis such that the ratio of between-class variance to within-class variance is maximized.
• This also turns out to be a generalized eigenvalue-eigenvector problem.
• Complete decorrelation.
• Provides optimal linear separability under quite restrictive assumptions.

SLIDE 61

PCA vs. LDA

[Figure]

SLIDE 62

Spectral Smoothing

• Formant information is crucial for recognition.
• Enhance and preserve the formant information by:
  - Truncating the number of cepstral coefficients
  - Linear prediction: its peak-hugging property

SLIDE 63

Temporal Processing

To capture the temporal features of the spectral envelope and to provide robustness:

• Delta features: first- and second-order differences; regression
• Cepstral Mean Subtraction: normalizes channel effects and adjusts for spectral slope
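
Both operations above are one-liners on a cepstral feature matrix (frames as rows). A minimal sketch, using simple first differences rather than the regression-based deltas the slide also mentions:

```python
import numpy as np

def cms(C):
    """Cepstral Mean Subtraction: remove the per-coefficient mean over time."""
    return C - C.mean(axis=0)

def delta(C):
    """First-order difference along the time axis (rows = frames)."""
    return np.diff(C, axis=0, prepend=C[:1])

C = np.arange(12.0).reshape(4, 3)   # 4 frames x 3 cepstral coefficients
C_norm = cms(C)
D = delta(C)
print(C_norm.mean(axis=0))          # [0. 0. 0.]
print(D[1])                         # [3. 3. 3.]
```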

SLIDE 64

RASTA (RelAtive SpecTral Analysis)

• Filtering of the temporal trajectories of some function of each of the spectral values, to provide more reliable spectral features.
• This is usually a bandpass filter that maintains the linguistically important spectral envelope modulations (1-16 Hz).

SLIDE 65

SLIDE 66

RASTA-PLP

SLIDE 67

SLIDE 68

SLIDE 69

Language Models for LVCSR

Word Pair Model: specify which word pairs are valid.

$$W = w_1 w_2 \cdots w_Q$$

$$P(W) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2) \cdots P(w_Q \mid w_1, \ldots, w_{Q-1})$$

$$P(w_k \mid w_j) = \begin{cases} 1 & \text{if the pair } (w_j, w_k) \text{ is valid} \\ 0 & \text{otherwise} \end{cases}$$

SLIDE 70

Statistical Language Modeling

$$P(W) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2) \cdots P(w_Q \mid w_1, \ldots, w_{Q-1})$$

With $F(\cdot)$ denoting counts in the training corpus, the trigram, bigram, and unigram estimates are:

$$\hat{P}(w_3 \mid w_1, w_2) = \frac{F(w_1, w_2, w_3)}{F(w_1, w_2)}, \qquad \hat{P}(w_2 \mid w_1) = \frac{F(w_1, w_2)}{F(w_1)}, \qquad \hat{P}(w_1) = \frac{F(w_1)}{\sum_{w} F(w)}$$

SLIDE 71

Perplexity of the Language Model

Entropy of the source:

$$H = -\lim_{Q \to \infty} \frac{1}{Q} \sum_{w_1, \ldots, w_Q} P(w_1, w_2, \ldots, w_Q)\,\log P(w_1, w_2, \ldots, w_Q)$$

First-order entropy of the source:

$$H_1 = -\sum_{w \in V} P(w)\,\log P(w)$$

If the source is ergodic, meaning its statistical properties can be completely characterized by a sufficiently long sequence that the source puts out:

$$H = -\lim_{Q \to \infty} \frac{1}{Q} \log P(w_1, w_2, \ldots, w_Q)$$

Assuming independence, $P(w_1, \ldots, w_Q) = \prod_{i=1}^{Q} P(w_i)$, and $H$ reduces to the first-order entropy.

SLIDE 72

Perplexity of the Language Model (Cont'd)

We often compute $H$ from a finite but sufficiently large $Q$:

$$H = -\frac{1}{Q} \log P(w_1, w_2, \ldots, w_Q)$$

$H$ is the degree of difficulty that the recognizer encounters, on average, when it is to determine a word from the same source. If an N-gram language model $P_N(W)$ is used, an estimate of $H$ is:

$$\hat{H}_p = -\frac{1}{Q} \sum_{i=1}^{Q} \log P(w_i \mid w_{i-1}, \ldots, w_{i-N+1})$$

In general:

$$\hat{H} = -\frac{1}{Q} \log \hat{P}(w_1, w_2, \ldots, w_Q)$$

Perplexity is defined as:

$$B = 2^{\hat{H}} = \hat{P}(w_1, w_2, \ldots, w_Q)^{-1/Q}$$
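
The definitions above, a per-word log probability Ĥ and perplexity B = 2^Ĥ, fit in a few lines; the per-word probabilities fed in here are invented for illustration:

```python
import math

def perplexity(probs):
    """B = 2^H with H = -(1/Q) * sum_i log2 P(w_i | history)."""
    Q = len(probs)
    H = -sum(math.log2(p) for p in probs) / Q
    return 2.0 ** H

# Probabilities some language model assigns to each word of a 4-word sentence
B = perplexity([0.25, 0.25, 0.25, 0.25])
print(B)  # 4.0: uniform uncertainty over 4 choices per word
```

This matches the intuition on the slide: a perplexity of B means the recognizer faces, on average, a B-way choice per word.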

SLIDE 73

Example

$$H = -\sum_{w \in V} P(w)\,\log_2 P(w), \qquad B = 2^{H}$$

a) B = 8    b) B = 4

SLIDE 74

Overall recognition system based on subword units

[Figure]