8-Speech Recognition (PowerPoint PPT Presentation)

SLIDE 1

8-Speech Recognition

• Speech Recognition Concepts
• Speech Recognition Approaches
• Recognition Theories
• Bayes' Rule
• Simple Language Model
• P(A|W) Network Types

SLIDE 2

8-Speech Recognition (Cont'd)

• HMM Calculating Approaches
• Neural Components
• Three Basic HMM Problems
• Viterbi Algorithm
• State Duration Modeling
• Training in HMM

SLIDE 3

Recognition Tasks

• Isolated Word Recognition (IWR), Connected Word (CW), and Continuous Speech Recognition (CSR)
• Speaker Dependent, Multiple Speaker, and Speaker Independent
• Vocabulary Size
  - Small: < 20 words
  - Medium: > 100, < 1,000 words
  - Large: > 1,000, < 10,000 words
  - Very Large: > 10,000 words

SLIDE 4

Speech Recognition Concepts

[Diagram: NLP and speech processing. Speech Understanding combines Speech Recognition (speech to text/phone sequence) with NLP; Speech Synthesis combines NLP with phone-sequence generation (text to speech).]

Speech recognition is the inverse of speech synthesis.

SLIDE 5

Speech Recognition Approaches

• Bottom-Up Approach
• Top-Down Approach
• Blackboard Approach

SLIDE 6

Bottom-Up Approach

[Diagram: bottom-up pipeline. Signal Processing → Feature Extraction → Segmentation → Sound Classification Rules → Phonotactic Rules → Lexical Access → Language Model → Recognized Utterance, with knowledge sources such as voiced/unvoiced/silence decisions feeding the stages.]

SLIDE 7

Top-Down Approach

[Diagram: top-down architecture. Feature Analysis feeds a Unit Matching System; Lexical, Syntactic, and Semantic Hypotheses are checked by an Utterance Verifier/Matcher against an inventory of speech recognition units, a word dictionary, a grammar, and a task model, producing the Recognized Utterance.]

SLIDE 8

Blackboard Approach

[Diagram: Environmental, Acoustic, Lexical, Syntactic, and Semantic processes all read from and write to a shared blackboard.]

SLIDE 9

Recognition Theories

• Articulatory-Based Recognition
  - Uses articulatory system modeling for recognition
  - This theory is the most successful so far
• Auditory-Based Recognition
  - Uses auditory system modeling for recognition
• Hybrid Recognition
  - A combination of the above theories
• Motor Theory
  - Models the intended gestures of the speaker

SLIDE 10

Recognition Problem

• We have a sequence of acoustic symbols and want to find the words uttered by the speaker.
• Solution: find the most probable word sequence given the acoustic symbols.

SLIDE 11

Recognition Problem

• A: acoustic symbols
• W: word sequence
• We should find the word sequence $\hat{W}$ such that

$$P(\hat{W} \mid A) = \max_{W} P(W \mid A)$$

SLIDE 12

Bayes' Rule

$$P(x \mid y)\,P(y) = P(x, y)$$

$$P(x \mid y) = \frac{P(y \mid x)\,P(x)}{P(y)} \;\;\Rightarrow\;\; P(W \mid A) = \frac{P(A \mid W)\,P(W)}{P(A)}$$

SLIDE 13

Bayes' Rule (Cont'd)

$$P(\hat{W} \mid A) = \max_{W} P(W \mid A) = \max_{W} \frac{P(A \mid W)\,P(W)}{P(A)}$$

$$\hat{W} = \arg\max_{W} P(W \mid A) = \arg\max_{W} P(A \mid W)\,P(W)$$

SLIDE 14

Simple Language Model

$$W = w_1 w_2 w_3 \cdots w_n$$

$$P(W) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, w_2, \ldots, w_{n-1})$$

Computing this probability is very difficult and requires a very large database, so in practice we use trigram and bigram models.

SLIDE 15

Simple Language Model (Cont'd)

Trigram:

$$P(W) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1}, w_{i-2})$$

Bigram:

$$P(W) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})$$

Monogram (unigram):

$$P(W) \approx \prod_{i=1}^{n} P(w_i)$$
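
As a concrete illustration of the bigram model above, the conditional probabilities can be estimated by maximum likelihood from counts; the toy corpus and function name here are invented for illustration:

```python
from collections import Counter

def bigram_prob(corpus):
    """Maximum-likelihood bigram estimates P(w_i | w_{i-1}) from a token list."""
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    # P(b | a) = count(a, b) / count(a)
    return {(a, b): c / unigrams[a] for (a, b), c in bigrams.items()}

tokens = "the cat sat on the mat the cat ran".split()
p = bigram_prob(tokens)
print(p[("the", "cat")])  # count(the, cat) = 2, count(the) = 3 -> 2/3
```

Note that for every history word the estimated probabilities of its successors sum to one, as a conditional distribution must.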

SLIDE 16

Simple Language Model (Cont'd)

Computing method:

$$P(w_3 \mid w_1, w_2) = \frac{\text{count}(w_1 w_2 w_3)}{\text{count}(w_1 w_2)}$$

i.e., the number of occurrences of $w_3$ after $w_1 w_2$, divided by the total number of occurrences of $w_1 w_2$.

Ad hoc method (interpolation of relative frequencies $f$):

$$P(w_3 \mid w_1, w_2) \approx \alpha_1\, f(w_3 \mid w_1, w_2) + \alpha_2\, f(w_3 \mid w_2) + \alpha_3\, f(w_3)$$

SLIDE 17

Error Production Factors

• Prosody (recognition should be prosody-independent)
• Noise (noise should be prevented)
• Spontaneous speech

SLIDE 18

P(A|W) Computing Approaches

• Dynamic Time Warping (DTW)
• Hidden Markov Model (HMM)
• Artificial Neural Network (ANN)
• Hybrid Systems

SLIDE 19

Dynamic Time Warping

[Figure]

SLIDE 20

Dynamic Time Warping

[Figure]

SLIDE 21

Dynamic Time Warping

[Figure]

SLIDE 22

Dynamic Time Warping

[Figure]

SLIDE 23

Dynamic Time Warping

Search limitations:

• Start and end intervals
• Global limitation
• Local limitation
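
The DTW alignment discussed on these slides is a small dynamic program. A minimal sketch, assuming 1-D sequences, absolute-difference local cost, and only the standard local moves (no global band constraint):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping cost between two 1-D sequences,
    using the standard local moves: diagonal, horizontal, vertical."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

d = dtw_distance([1, 2, 3], [1, 2, 2, 3])
print(d)  # 0.0: the repeated 2 is absorbed by the warping path
```

A global limitation (slide 24) would restrict which (i, j) cells are filled; a local limitation (slide 25) would restrict the allowed moves.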

SLIDE 24

Dynamic Time Warping

Global limitation:

[Figure]

SLIDE 25

Dynamic Time Warping

Local limitation:

[Figure]

SLIDE 26

Artificial Neural Network

Simple computational element of a neural network: inputs $x_1, \ldots, x_{N+1}$ with weights $w_1, \ldots, w_{N+1}$ (the extra $(N+1)$-th input typically serves as a bias), producing

$$y = f\!\left(\sum_{i=1}^{N+1} w_i x_i\right)$$
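
The computational element above, a weighted sum passed through a nonlinearity f, can be sketched in a few lines. The choice of sigmoid for f is an assumption; the slide leaves f unspecified:

```python
import math

def neuron(x, w, f=lambda s: 1.0 / (1.0 + math.exp(-s))):
    """y = f(sum_i w_i * x_i); x and w must have the same length.
    A bias can be modeled by appending a constant input x_{N+1} = 1."""
    return f(sum(wi * xi for wi, xi in zip(w, x)))

y = neuron([1.0, 0.5, 1.0], [0.2, -0.4, 0.0])
print(y)  # weighted sum = 0.2 - 0.2 + 0 = 0, sigmoid(0) = 0.5
```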
SLIDE 27

Artificial Neural Network (Cont'd)

• Neural Network Types
  - Perceptron
  - Time Delay Neural Network (TDNN)

SLIDE 28

Artificial Neural Network (Cont'd)

[Figure: Single Layer Perceptron, mapping inputs $x_1, \ldots, x_{N+1}$ to outputs $y_1, \ldots, y_{M+1}$]

SLIDE 29

Artificial Neural Network (Cont'd)

[Figure: Three Layer Perceptron]

SLIDE 30

2.5.4.2 Neural Network Topologies

SLIDE 31

TDNN

SLIDE 32

2.5.4.6 Neural Network Structures for Speech Recognition

SLIDE 33

2.5.4.6 Neural Network Structures for Speech Recognition

SLIDE 34

Hybrid Methods

• Hybrid neural network and matched filter for recognition

[Diagram: pattern classifier; speech → acoustic features → delays → output units]

SLIDE 35

Neural Network Properties

• The system is simple, but training requires many iterations
• Does not determine a specific structure
• Despite its simplicity, the results are good
• The training set is large, so training should be done offline

SLIDE 36

Pre-processing

• Different preprocessing techniques are employed as the front end of speech recognition systems.
• The choice of preprocessing method depends on the task, the noise level, the modeling tool, etc.

SLIDE 37

SLIDE 38

SLIDE 39

SLIDE 40

SLIDE 41

SLIDE 42

SLIDE 43

The MFCC Method

• MFCC is based on the way the human ear perceives sounds.
• Compared with other methods, MFCC handles features in noisy environments better.
• MFCC was introduced primarily for speech recognition applications, but it also performs well in speaker recognition.
• The unit of human auditory pitch perception is the mel, which is obtained with the help of the following relation: [equation not preserved in the transcript]
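
The mel relation referenced above did not survive extraction; the commonly used form is mel(f) = 2595 · log10(1 + f/700), and the sketch below assumes that form:

```python
import math

def hz_to_mel(f_hz):
    """Standard mel-scale mapping (assumed form: 2595*log10(1 + f/700))."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

m = hz_to_mel(1000.0)
print(m)  # ~1000 mel: 1000 Hz is the scale's reference point
```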

SLIDE 44

Steps of the MFCC Method

Step 1: Map the signal from the time domain to the frequency domain using a short-time FFT:

$$X(m) = \sum_{n=0}^{F-1} z(n)\,w(n)\,W_F^{mn}, \qquad W_F = e^{-j 2\pi / F}, \quad m = 0, \ldots, F-1$$

where $z(n)$ is the speech signal, $w(n)$ is a window function such as the Hamming window, and $F$ is the frame length.

SLIDE 45

Steps of the MFCC Method (Cont'd)

Step 2: Compute the energy in each filter-bank channel. The number of filter banks, $M$, is based on the mel scale. The filter functions of the filter bank are

$$W_k(j), \qquad k = 0, 1, \ldots, M-1$$

SLIDE 46

Filter distribution based on the mel scale

[Figure]

SLIDE 47

Steps of the MFCC Method (Cont'd)

Step 4: Compress the spectrum and apply the DCT to obtain the MFCC coefficients.

In the relation above, $n = 0, \ldots, L$ is the order of the MFCC coefficients.

SLIDE 48

[Diagram: MFCC pipeline. Framing → |FFT|² → Mel-scaling → Logarithm → IDCT → low-order coefficients (cepstra) → differentiator → delta and delta-delta cepstra]
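
The pipeline above can be sketched end to end in NumPy. This is a simplified sketch of one frame only, with triangular mel filters and an explicit DCT-II; a real front end would add pre-emphasis, framing of the whole utterance, and liftering:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr, n_filters=20, n_ceps=12):
    """MFCCs of one windowed frame: |FFT|^2 -> mel filterbank -> log -> DCT."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # Triangular filters whose centers are equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    hz_pts = mel_to_hz(mel_pts)
    energies = np.empty(n_filters)
    for k in range(n_filters):
        lo, c, hi = hz_pts[k], hz_pts[k + 1], hz_pts[k + 2]
        rising = np.clip((freqs - lo) / (c - lo), 0.0, None)
        falling = np.clip((hi - freqs) / (hi - c), 0.0, None)
        energies[k] = np.sum(spectrum * np.minimum(rising, falling).clip(0.0))
    log_e = np.log(energies + 1e-10)       # logarithm stage
    # DCT-II of the log filter-bank energies; keep low-order coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2.0 * n_filters))
    return basis @ log_e

sr = 16000
t = np.arange(400) / sr
frame = np.hamming(400) * np.sin(2 * np.pi * 440.0 * t)  # 25 ms, 440 Hz tone
ceps = mfcc_frame(frame, sr)
print(ceps.shape)  # (12,)
```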

SLIDE 49

Mel-Frequency Cepstral Coefficients (MFCC)

[Figure]

SLIDE 50

Properties of the Mel Cepstrum (MFCC)

• Maps the mel filter-bank energies in the direction that maximizes their variance (using the DCT)
• The speech features become relatively independent of one another (an effect of the DCT)
• Good performance in clean environments
• Reduced performance in noisy environments

SLIDE 51

Time-Frequency Analysis

• Short-Term Fourier Transform
  - The standard way of doing frequency analysis: decompose the incoming signal into its constituent frequency components.
  - w(n): windowing function; N: frame length; p: step size
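
Using the slide's symbols (w(n) windowing function, N frame length, p step size), a minimal short-term Fourier transform sketch; the Hann window is an assumed choice:

```python
import numpy as np

def stft(x, N=256, p=128):
    """Short-term Fourier transform: windowed frames of length N, hop p."""
    w = np.hanning(N)                     # w(n): windowing function
    starts = range(0, len(x) - N + 1, p)  # p: step size
    return np.array([np.fft.rfft(w * x[s:s + N]) for s in starts])

x = np.sin(2 * np.pi * 1000 * np.arange(4096) / 8000.0)  # 1 kHz tone at 8 kHz
S = stft(x)
print(S.shape)  # (31, 129): 31 frames, N//2 + 1 frequency bins each
```

With N = 256 at an 8 kHz sampling rate, each bin spans 8000/256 = 31.25 Hz, so the 1 kHz tone lands exactly in bin 32.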

SLIDE 52

Critical Band Integration

• Related to the masking phenomenon: the threshold of a sinusoid is elevated when its frequency is close to the center frequency of a narrow-band noise.
• Frequency components within a critical band are not resolved: the auditory system interprets the signals within a critical band as a whole.

SLIDE 53

Bark Scale

[Figure]

SLIDE 54

Feature Orthogonalization

• Spectral values in adjacent frequency channels are highly correlated.
• The correlation results in a Gaussian model with many parameters: all the elements of the covariance matrix have to be estimated.
• Decorrelation is useful to improve the parameter estimation.

SLIDE 55

Cepstrum

• Computed as the inverse Fourier transform of the log magnitude of the Fourier transform of the signal.
• The log magnitude is real and symmetric, so the transform is equivalent to the Discrete Cosine Transform.
• The resulting coefficients are approximately decorrelated.
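
The definition above translates directly into NumPy; a minimal real-cepstrum sketch (the small floor inside the log keeps it finite and is an implementation detail, not part of the definition):

```python
import numpy as np

def real_cepstrum(x):
    """Inverse FFT of the log magnitude of the FFT of the signal."""
    spectrum = np.abs(np.fft.fft(x))
    return np.real(np.fft.ifft(np.log(spectrum + 1e-12)))

x = np.random.randn(512)
c = real_cepstrum(x)
print(c.shape)  # (512,)
```

Because the log magnitude is real and symmetric, the cepstrum itself comes out real and symmetric, which is why the slide equates the transform with a DCT.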

SLIDE 56

Principal Component Analysis

• Find an orthogonal basis such that the reconstruction error over the training set is minimized.
• This turns out to be equivalent to diagonalizing the sample autocovariance matrix.
• Complete decorrelation.
• Computes the principal dimensions of variability, but does not necessarily provide the optimal discrimination among classes.

SLIDE 57

Principal Component Analysis (PCA)

A mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components (PCs).

• Find an orthogonal basis such that the reconstruction error over the training set is minimized.
• This is equivalent to diagonalizing the sample autocovariance matrix.
• Complete decorrelation.
• Computes the principal dimensions of variability, but does not necessarily provide the optimal discrimination among classes.

SLIDE 58

PCA (Cont.)

Algorithm:

Input: $M$ vectors $x_i$, each $N$-dimensional. Output: $M$ vectors $y_i$, each $R$-dimensional ($R \le N$), obtained by applying the transform

$$y = F\,x$$

Covariance matrix:

$$\mathrm{Cov} = \frac{1}{M}\sum_{i=1}^{M} (x_i - \bar{x})(x_i - \bar{x})^{T}$$

Compute the eigenvalues $\mathrm{EigVal}_i$ and eigenvectors $\mathrm{EigVec}_i$, $i = 1, \ldots, N$, of the covariance matrix, ordered so that $\mathrm{EigVal}_1 \ge \mathrm{EigVal}_2 \ge \cdots \ge \mathrm{EigVal}_N$.

Transform matrix (rows are the leading eigenvectors):

$$F = \begin{bmatrix} \mathrm{EigVec}_1 \\ \mathrm{EigVec}_2 \\ \vdots \\ \mathrm{EigVec}_R \end{bmatrix}$$
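
The algorithm above maps directly onto NumPy's eigendecomposition; a minimal sketch in which the rows of F are the leading eigenvectors of the sample covariance:

```python
import numpy as np

def pca_transform(X, R):
    """X: M x N data matrix (one N-dim vector per row). Returns M x R projections."""
    Xc = X - X.mean(axis=0)                # subtract the mean vector
    cov = (Xc.T @ Xc) / len(X)             # sample covariance, N x N
    eigval, eigvec = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    F = eigvec[:, ::-1][:, :R].T           # top-R eigenvectors as rows of F
    return Xc @ F.T                        # y = F x for each centered vector

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = pca_transform(X, 2)
print(Y.shape)  # (100, 2)
```

The projected components come out uncorrelated (the covariance of Y is diagonal), with variances given by the leading eigenvalues in decreasing order.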

SLIDE 59

PCA (Cont.)

[Figure: PCA in speech recognition systems]

SLIDE 60

Linear Discriminant Analysis

• Find an orthogonal basis such that the ratio of between-class variance to within-class variance is maximized.
• This also turns out to be a generalized eigenvalue-eigenvector problem.
• Complete decorrelation.
• Provides optimal linear separability under quite restrictive assumptions.

SLIDE 61

PCA vs. LDA

[Figure]

SLIDE 62

Spectral Smoothing

• Formant information is crucial for recognition.
• Enhance and preserve the formant information by:
  - Truncating the number of cepstral coefficients
  - Linear prediction: its peak-hugging property

SLIDE 63

Temporal Processing

To capture the temporal features of the spectral envelope and to provide robustness:

• Delta features: first- and second-order differences; regression
• Cepstral Mean Subtraction: normalizes channel effects and adjusts for spectral slope
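
Both operations above are one-liners on a cepstral feature matrix (frames as rows). A minimal sketch, using simple first differences rather than the regression-based deltas the slide also mentions:

```python
import numpy as np

def cms(C):
    """Cepstral Mean Subtraction: remove the per-coefficient mean over time."""
    return C - C.mean(axis=0)

def delta(C):
    """First-order difference along the time axis (rows = frames)."""
    return np.diff(C, axis=0, prepend=C[:1])

C = np.arange(12.0).reshape(4, 3)   # 4 frames x 3 cepstral coefficients
C_norm = cms(C)
D = delta(C)
print(C_norm.mean(axis=0))          # [0. 0. 0.]
print(D[1])                         # [3. 3. 3.]
```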

SLIDE 64

RASTA (RelAtive SpecTral Analysis)

• Filtering of the temporal trajectories of some function of each of the spectral values, to provide more reliable spectral features.
• This is usually a bandpass filter that maintains the linguistically important spectral envelope modulations (1-16 Hz).

SLIDE 65

SLIDE 66

RASTA-PLP

SLIDE 67

SLIDE 68

SLIDE 69

Language Models for LVCSR

Word Pair Model: specify which word pairs are valid.

$$W = w_1 w_2 \cdots w_Q$$

$$P(W) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2) \cdots P(w_Q \mid w_1, \ldots, w_{Q-1})$$

$$P(w_k \mid w_j) = \begin{cases} 1 & \text{if the pair } (w_j, w_k) \text{ is valid} \\ 0 & \text{otherwise} \end{cases}$$

SLIDE 70

Statistical Language Modeling

$$P(W) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2) \cdots P(w_Q \mid w_1, \ldots, w_{Q-1})$$

With $F(\cdot)$ denoting counts in the training corpus, the trigram, bigram, and unigram estimates are:

$$\hat{P}(w_3 \mid w_1, w_2) = \frac{F(w_1, w_2, w_3)}{F(w_1, w_2)}, \qquad \hat{P}(w_2 \mid w_1) = \frac{F(w_1, w_2)}{F(w_1)}, \qquad \hat{P}(w_1) = \frac{F(w_1)}{\sum_{w} F(w)}$$

SLIDE 71

Perplexity of the Language Model

Entropy of the source:

$$H = -\lim_{Q \to \infty} \frac{1}{Q} \sum_{w_1, \ldots, w_Q} P(w_1, w_2, \ldots, w_Q)\,\log P(w_1, w_2, \ldots, w_Q)$$

First-order entropy of the source:

$$H_1 = -\sum_{w \in V} P(w)\,\log P(w)$$

If the source is ergodic, meaning its statistical properties can be completely characterized by a sufficiently long sequence that the source puts out:

$$H = -\lim_{Q \to \infty} \frac{1}{Q} \log P(w_1, w_2, \ldots, w_Q)$$

Assuming independence, $P(w_1, \ldots, w_Q) = \prod_{i=1}^{Q} P(w_i)$, and $H$ reduces to the first-order entropy.

SLIDE 72

Perplexity of the Language Model (Cont'd)

We often compute $H$ from a finite but sufficiently large $Q$:

$$H = -\frac{1}{Q} \log P(w_1, w_2, \ldots, w_Q)$$

$H$ is the degree of difficulty that the recognizer encounters, on average, when it is to determine a word from the same source. If an N-gram language model $P_N(W)$ is used, an estimate of $H$ is:

$$\hat{H}_p = -\frac{1}{Q} \sum_{i=1}^{Q} \log P(w_i \mid w_{i-1}, \ldots, w_{i-N+1})$$

In general:

$$\hat{H} = -\frac{1}{Q} \log \hat{P}(w_1, w_2, \ldots, w_Q)$$

Perplexity is defined as:

$$B = 2^{\hat{H}} = \hat{P}(w_1, w_2, \ldots, w_Q)^{-1/Q}$$
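
The definitions above, a per-word log probability Ĥ and perplexity B = 2^Ĥ, fit in a few lines; the per-word probabilities fed in here are invented for illustration:

```python
import math

def perplexity(probs):
    """B = 2^H with H = -(1/Q) * sum_i log2 P(w_i | history)."""
    Q = len(probs)
    H = -sum(math.log2(p) for p in probs) / Q
    return 2.0 ** H

# Probabilities some language model assigns to each word of a 4-word sentence
B = perplexity([0.25, 0.25, 0.25, 0.25])
print(B)  # 4.0: uniform uncertainty over 4 choices per word
```

This matches the intuition on the slide: a perplexity of B means the recognizer faces, on average, a B-way choice per word.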

SLIDE 73

Example

$$H = -\sum_{w \in V} P(w)\,\log_2 P(w), \qquad B = 2^{H}$$

a) B = 8    b) B = 4

SLIDE 74

Overall recognition system based on subword units

[Figure]