May 28, 2023

Bengali Speech Recognition: A Double Layered LSTM-RNN Approach. Md Mahadi Hasan Nahid, Bishwajit Purkaystha, Md Saiful Islam. Department of Computer Science and Engineering, Shahjalal University of Science and Technology, Sylhet-3114.

SLIDE 1

Bengali Speech Recognition: A Double Layered LSTM-RNN Approach

Md Mahadi Hasan Nahid, Bishwajit Purkaystha, Md Saiful Islam Department of Computer Science and Engineering Shahjalal University of Science and Technology, Sylhet-3114.

SLIDE 4

➢ Provides efficient communication between humans and machines
➢ Increases throughput
➢ Is the most natural way of communication

Image sources: [1] www.glasbergen.com [2] www.playbuzz.com

Speech Recognition Is Important

SLIDE 7

➢ Intuitive for humans, not for machines

We say, "Let there be light". Machine hears, [image not preserved]

➢ Different pitches, accents, durations, noise levels

Speech Recognition Is Difficult

SLIDE 10

➢ A recurrent neural network (RNN) architecture is proposed

a double-layered bidirectional long short-term memory (LSTM) network is used

➢ Individual phonemes are detected

to some extent

➢ Results are compared with other methods for Bengali automatic speech recognition

Our Contribution

SLIDE 11

➢ Preprocessing

noise reduction, mel-frequency cepstral coefficient (MFCC) extraction

➢ Training

➢ Postprocessing

Methodology

SLIDE 16

➢ Speech signal is divided into a number of frames

each frame has 13 features (i.e., MFCCs)

➢ An example word: (pronounced as: SOTERO)

S-O-T-E-R-O
5 distinct phones (O occurs twice)
17 frames
SSS-OO-TTT-EEEE-RRR-OO
roughly 3 frames for each phone

Methodology: Preprocessing (1/2)
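
The framing step can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the 16 kHz sample rate, 25 ms window, and 10 ms hop are assumed values that the slides do not state, and `split_into_frames` is a hypothetical helper. Each resulting frame would then be reduced to 13 MFCCs by a feature extractor, which is not shown here.

```python
def split_into_frames(signal, frame_len, hop):
    """Return overlapping frames of `frame_len` samples, one every `hop` samples."""
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    return [
        signal[start:start + frame_len]
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]

# Assumed numbers: 16 kHz audio, 25 ms windows (400 samples), 10 ms hop (160 samples).
signal = [0.0] * 3040  # 190 ms of silence as a stand-in for real speech
frames = split_into_frames(signal, frame_len=400, hop=160)
print(len(frames))  # 17
```

With these assumed settings a 190 ms utterance yields 17 frames, the same count as the example word on the slide.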

SLIDE 17

SSS-OO-TTT-EEEE-RRR-OO

Methodology: Preprocessing (2/2)
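
The relation between the phone sequence and the frame-level labels can be sketched as below. The per-phone frame counts are read directly off the SSS-OO-TTT-EEEE-RRR-OO example; the helper names `expand` and `collapse` are illustrative and not taken from the presentation.

```python
from itertools import groupby

# Frame counts per phone segment, read off the slide's example.
segments = [("S", 3), ("O", 2), ("T", 3), ("E", 4), ("R", 3), ("O", 2)]

def expand(segments):
    """Repeat each phone label once per frame it spans."""
    return [phone for phone, n_frames in segments for _ in range(n_frames)]

def collapse(frame_labels):
    """Merge runs of identical consecutive labels back into one phone each."""
    return [phone for phone, _ in groupby(frame_labels)]

frame_labels = expand(segments)
print(len(frame_labels))                 # 17
print("-".join(collapse(frame_labels)))  # S-O-T-E-R-O
```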

SLIDE 21

➢ Each time step carries a frame
➢ Input layer contains 13 units

one for each coefficient

➢ Next two layers are recurrent

each contains 100 LSTM cells

➢ The final layer is a softmax layer with 30 units

there are 30 phonemes in total

Methodology: Training with LSTM (1/3)
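
To give a feel for the size of such a network, the sketch below counts trainable parameters for the layer sizes on the slide (13 inputs, two recurrent layers of 100 LSTM cells, 30 softmax units). Several things here are assumptions rather than facts from the presentation: a standard LSTM with four gate blocks, biases, and no peephole connections; bidirectional layers with 100 cells per direction; and a softmax fed by both directions. The function `lstm_params` is a hypothetical helper.

```python
def lstm_params(n_in, n_hidden):
    """Weights and biases of one standard LSTM layer (4 gate blocks, no peepholes)."""
    return 4 * (n_hidden * (n_in + n_hidden) + n_hidden)

N_MFCC, N_CELLS, N_PHONEMES = 13, 100, 30  # sizes stated on the slide
N_DIRECTIONS = 2                           # assumption: 100 cells per direction

layer1 = N_DIRECTIONS * lstm_params(N_MFCC, N_CELLS)
layer2 = N_DIRECTIONS * lstm_params(N_DIRECTIONS * N_CELLS, N_CELLS)
softmax = (N_DIRECTIONS * N_CELLS) * N_PHONEMES + N_PHONEMES
total = layer1 + layer2 + softmax
print(layer1, layer2, softmax, total)  # 91200 240800 6030 338030
```

Under a different reading (for example, 100 cells shared across both directions) the counts would change, so the totals are only indicative.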

SLIDE 22

➢ An LSTM cell

exploits context using gates
has a recurrent connection for temporal modeling
has nonlinear squashing activation functions for capturing complex relations

➢ Together, the cells are responsible for associating cepstral coefficients with phones

Methodology: Training with LSTM (2/3)
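
The three properties listed above (gates, a recurrent connection, squashing nonlinearities) can be seen in a single forward step of a textbook LSTM cell. The sketch below is a plain-Python illustration of that standard formulation, not the authors' implementation; the random weights, the hidden size of 4 (instead of 100), and the random stand-in frames are all assumptions made for brevity.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, U, b, x, h):
    """Return W*x + U*h + b using plain Python lists."""
    return [
        sum(w * x_j for w, x_j in zip(W_row, x))
        + sum(u * h_j for u, h_j in zip(U_row, h))
        + b_k
        for W_row, U_row, b_k in zip(W, U, b)
    ]

def init_params(n_in, n_hidden, rng):
    """Small random weights for the three gates and the candidate update."""
    def block():
        W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
        U = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_hidden)]
        return W, U, [0.0] * n_hidden
    return {name: block() for name in ("input", "forget", "output", "candidate")}

def lstm_step(params, x, h_prev, c_prev):
    """Advance one time step: the gates decide what to write, keep, and expose."""
    i = [sigmoid(z) for z in affine(*params["input"], x, h_prev)]
    f = [sigmoid(z) for z in affine(*params["forget"], x, h_prev)]
    o = [sigmoid(z) for z in affine(*params["output"], x, h_prev)]
    g = [math.tanh(z) for z in affine(*params["candidate"], x, h_prev)]
    # Recurrent connection: the new cell state mixes the old state with the candidate.
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    # Squashing: tanh bounds the cell state before the output gate exposes it.
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

rng = random.Random(0)
params = init_params(n_in=13, n_hidden=4, rng=rng)  # 4 cells instead of 100, for brevity
frames = [[rng.gauss(0.0, 1.0) for _ in range(13)] for _ in range(17)]  # stand-in MFCC frames
h, c = [0.0] * 4, [0.0] * 4
for x in frames:
    h, c = lstm_step(params, x, h, c)
print(len(h))  # 4
```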

SLIDE 23

Methodology: Training with LSTM (3/3)

SLIDE 24

➢ A phone found in consecutive frames

is STRONG EVIDENCE that it is present. How many frames are needed?

➢ Depends on threshold T

Methodology: Postprocessing

SLIDE 25

Methodology: Postprocessing

SLIDE 28

➢ An example: (Labeled: Sh,Sh,Sh,O,O,O,L,L,O,O,O)

12 frames

➢ Network output: S,Sh,Sh,O,O,O,L,L,O,O,O
➢ Initial Noise Elimination: _,Sh,Sh,O,O,O,L,L,O,O,O

Methodology: Postprocessing

SLIDE 29

Methodology: Postprocessing

SLIDE 30

Final Output: "ShOLO"

Methodology: Postprocessing
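
The postprocessing idea can be sketched as a run-length filter: a phone is kept only if it appears in at least T consecutive frames, and each surviving run is emitted once. This is one plausible reading of the slides, not the authors' code; the function name `decode` and the value T = 2 are assumptions, since the presentation only says the result depends on the threshold.

```python
from itertools import groupby

def decode(frame_labels, threshold):
    """Keep a phone only if it spans at least `threshold` consecutive frames."""
    phones = []
    for phone, run in groupby(frame_labels):
        if len(list(run)) >= threshold:
            phones.append(phone)  # one symbol per surviving run
    return phones

# Network output from the slide's example; the lone leading "S" is treated as noise.
network_output = ["S", "Sh", "Sh", "O", "O", "O", "L", "L", "O", "O", "O"]
word = "".join(decode(network_output, threshold=2))  # T = 2 is an assumed value
print(word)  # ShOLO
```

A larger threshold would also discard the two-frame runs of "Sh" and "L", which illustrates why the choice of T matters.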

SLIDE 33

➢ Phone Detection Error rate (Perr)
➢ Word Detection Error rate (Werr)
➢ Bengali Real Number Dataset (Nahid et al., 2016)

Analysis: Evaluation Metrics
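
The slides name the two metrics without defining them. The sketch below assumes the simplest reading: an error rate is the fraction of items (frame-level phone labels for Perr, whole words for Werr) that differ from the reference. The function `error_rate` is hypothetical, and the inputs reuse the labeled and predicted frames from the postprocessing example as toy data, so the numbers are not the reported results.

```python
def error_rate(reference, hypothesis):
    """Fraction of positions where the hypothesis differs from the reference."""
    if len(reference) != len(hypothesis):
        raise ValueError("sequences must have the same length")
    if not reference:
        raise ValueError("reference must not be empty")
    mismatches = sum(r != h for r, h in zip(reference, hypothesis))
    return mismatches / len(reference)

# Toy inputs taken from the postprocessing example in this deck.
labeled = ["Sh", "Sh", "Sh", "O", "O", "O", "L", "L", "O", "O", "O"]
output = ["S", "Sh", "Sh", "O", "O", "O", "L", "L", "O", "O", "O"]
p_err = error_rate(labeled, output)       # one wrong frame out of eleven
w_err = error_rate(["ShOLO"], ["ShOLO"])  # the decoded word matches its reference
print(f"{p_err:.3f} {w_err:.3f}")  # 0.091 0.000
```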

SLIDE 34

➢ 28.7% Perr, 13.2% Werr
➢ CMU Sphinx4 incurred 15% Werr (Nahid et al., 2016)

Analysis: How Model Learns

SLIDE 36

➢ Some phones are almost always detected with higher confidence (e.g., 'A', 'O', 'L')

they occurred more frequently
no other phone is pronounced similarly

➢ Some phones are frequently garbled

Analysis: Phone labeling

SLIDE 37

➢ It seems that, on average, there are 5 phones per word in the dataset
➢ Robustness not checked

Analysis: Threshold Affects Learning

SLIDE 38

➢ Phone alignment is very difficult

The same example is introduced thrice, but with different alignments

➢ Only vocabulary words are recognized

114 unique words

➢ Robustness of threshold selection not tested

Limitations

SLIDE 39

➢ An RNN architecture based on phone detection is proposed
➢ Bengali real numbers are recognized with higher accuracy than CMU Sphinx4 (13.2% vs. 15% Werr)
➢ The vocabulary can be increased with larger datasets
➢ How to align phones can be learned too

Conclusion

SLIDE 40

Nahid, M. M. H., Islam, M. A., & Islam, M. S. (2016, May). A noble approach for recognizing Bangla real number automatically using CMU Sphinx4. In Informatics, Electronics and Vision (ICIEV), 2016.

Graves, A., Mohamed, A., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE.

Ravuri, S., & Wegmann, S. (2016). How neural network features and depth modify statistical properties of HMM acoustic models. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE.

References

SLIDE 41

Thank you

SLIDE 42