Bengali Speech Recognition: A Double Layered LSTM-RNN Approach
Md Mahadi Hasan Nahid, Bishwajit Purkaystha, Md Saiful Islam
Department of Computer Science and Engineering
Shahjalal University of Science and Technology, Sylhet-3114
Speech Recognition Is Important
➢ Provides efficient communication between humans and machines
➢ Increases throughput
➢ Is the most natural way of communication
Image sources: [1] www.glasbergen.com [2] www.playbuzz.com
Speech Recognition Is Difficult
➢ Intuitive for humans, not for machines
  We say, "Let there be light".
  Machine hears: [figure]
➢ Different pitches, accents, durations, noise levels
Our Contribution
➢ A recurrent neural network (RNN) architecture proposed
  • Double layered bidirectional long short-term memory (LSTM) used
➢ Individual phonemes detected to some extent
➢ Results compared with other methods on a Bengali automatic speech recognizer
Methodology
➢ Preprocessing
  • Noise reduction, mel-frequency cepstral coefficient (MFCC) extraction
➢ Training
➢ Postprocessing
Methodology: Preprocessing (1/2)
➢ Speech signal is divided into a number of frames
  • Each frame has 13 features (i.e., MFCCs)
➢ An example word: (pronounced as: SOTERO)
  • S-O-T-E-R-O
  • 5 phones
  • 17 frames
  • SSS-OO-TTT-EEEE-RRR-OO
  • Roughly 3 frames for each phone
Methodology: Preprocessing (2/2)
SSS-OO-TTT-EEEE-RRR-OO
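The frame-level alignment shown on the slide can be reproduced with a few lines of Python. This is only a sketch: the function names `expand_to_frames` and `format_alignment` are hypothetical, and the per-phone frame counts are read off the string SSS-OO-TTT-EEEE-RRR-OO rather than computed from audio.

```python
def expand_to_frames(phones, durations):
    """Repeat each phone label once for every frame it spans."""
    if len(phones) != len(durations):
        raise ValueError("need exactly one duration per phone")
    labels = []
    for phone, n_frames in zip(phones, durations):
        labels.extend([phone] * n_frames)
    return labels


def format_alignment(phones, durations):
    """Render the alignment in the slide's notation, e.g. SSS-OO-..."""
    return "-".join(phone * n for phone, n in zip(phones, durations))


# Symbols of S-O-T-E-R-O and the frame counts taken from the slide's example.
phones = ["S", "O", "T", "E", "R", "O"]
durations = [3, 2, 3, 4, 3, 2]

frame_labels = expand_to_frames(phones, durations)
alignment = format_alignment(phones, durations)
print(len(frame_labels), alignment)
```

Each entry of `frame_labels` would be the training target for the 13 MFCCs of the corresponding frame.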
Methodology: Training with LSTM (1/3)
➢ Each time step carries one frame
➢ Input layer contains 13 units
  • One for each coefficient
➢ Next two layers are recurrent
  • Each contains 100 LSTM cells
➢ The final layer is a softmax layer with 30 units
  • There are 30 phonemes in total
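To get a feel for the size of this network, the layer dimensions above can be turned into a rough parameter count. The sketch below rests on assumptions the slides do not state: a standard LSTM without peephole connections, one bias vector per gate, and 100 cells per direction in each bidirectional layer. The helper names are hypothetical.

```python
def lstm_params(n_in, n_hidden):
    """Weights of one LSTM direction: 4 blocks (input, forget and output
    gates plus the cell candidate), each with input weights, recurrent
    weights and one bias vector. Peepholes are assumed absent."""
    return 4 * (n_hidden * (n_in + n_hidden) + n_hidden)


def dense_params(n_in, n_out):
    """Weights and biases of a fully connected (softmax) layer."""
    return n_in * n_out + n_out


N_FEATURES = 13   # MFCCs per frame (from the slides)
N_CELLS = 100     # LSTM cells per recurrent layer (from the slides)
N_PHONES = 30     # softmax units (from the slides)
DIRECTIONS = 2    # assumption: 100 cells in EACH direction

layer1 = DIRECTIONS * lstm_params(N_FEATURES, N_CELLS)
# The second layer sees the concatenated forward and backward outputs.
layer2 = DIRECTIONS * lstm_params(DIRECTIONS * N_CELLS, N_CELLS)
softmax = dense_params(DIRECTIONS * N_CELLS, N_PHONES)
total = layer1 + layer2 + softmax

print(layer1, layer2, softmax, total)
```

If the 100 cells were instead split across both directions, or if the implementation used peepholes or two bias vectors per gate, the totals would differ.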
Methodology: Training with LSTM (2/3)
➢ An LSTM cell
  • exploits context using gates
  • has a recurrent connection for temporal modeling
  • has nonlinear squashing activation functions for capturing complex relations
➢ Together, the cells are responsible for associating cepstral coefficients with phones
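The three ingredients named above (gates, a recurrent connection, squashing functions) can be seen in a single LSTM time step. The following is a minimal pure-Python sketch of the standard LSTM equations, not the authors' implementation; `lstm_step`, the gate keys `"i"`, `"f"`, `"o"`, `"g"`, and the all-zero demo weights are illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One time step of a standard LSTM layer (no peepholes).

    x       input vector (one frame of coefficients)
    h_prev  previous hidden state (the recurrent connection)
    c_prev  previous cell state
    W, U, b dicts keyed by "i", "f", "o", "g" holding input weights,
            recurrent weights and biases for each gate / candidate
    """
    def pre(gate, j):
        # Weighted sum of the current input and the previous hidden state.
        return (sum(w * v for w, v in zip(W[gate][j], x))
                + sum(u * v for u, v in zip(U[gate][j], h_prev))
                + b[gate][j])

    h, c = [], []
    for j in range(len(h_prev)):
        i_gate = sigmoid(pre("i", j))      # how much new content to write
        f_gate = sigmoid(pre("f", j))      # how much old content to keep
        o_gate = sigmoid(pre("o", j))      # how much of the cell to expose
        candidate = math.tanh(pre("g", j))  # squashed new content
        c_j = f_gate * c_prev[j] + i_gate * candidate
        c.append(c_j)
        h.append(o_gate * math.tanh(c_j))
    return h, c


# Demo with 13 inputs and 2 cells; all weights are zero, so every gate is 0.5.
n_in, n_cells = 13, 2
W = {g: [[0.0] * n_in for _ in range(n_cells)] for g in "ifog"}
U = {g: [[0.0] * n_cells for _ in range(n_cells)] for g in "ifog"}
b = {g: [0.0] * n_cells for g in "ifog"}
h, c = lstm_step([0.1] * n_in, [0.0] * n_cells, [1.0, 0.0], W, U, b)
print(h, c)
```

With zero weights the forget gate halves the old cell state, which makes the gating behaviour easy to check by hand.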
Methodology: Training with LSTM (3/3)
Methodology: Postprocessing
➢ A phone found in several consecutive frames is strong evidence that it is present
➢ How many frames? That depends on a threshold T
Methodology: Postprocessing
➢ An example: (Labeled: Sh,Sh,Sh,O,O,O,L,L,O,O,O)
  • 12 frames
➢ Network output: S,Sh,Sh,O,O,O,L,L,O,O,O
➢ Initial noise elimination: _,Sh,Sh,O,O,O,L,L,O,O,O
➢ Final output: "ShOLO"
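The postprocessing idea can be sketched as a run-length filter: group identical consecutive frame labels and keep a phone only if its run is at least T frames long. The function name `collapse_frames` is hypothetical, and T = 2 is an assumption chosen because it reproduces the example; the slides do not state the value used.

```python
from itertools import groupby


def collapse_frames(frame_labels, threshold):
    """Turn per-frame labels into a phone sequence.

    A run of identical labels counts as a detected phone only when it
    spans at least `threshold` consecutive frames; shorter runs are
    treated as noise and dropped.
    """
    phones = []
    for label, run in groupby(frame_labels):
        if len(list(run)) >= threshold:
            phones.append(label)
    return phones


network_output = ["S", "Sh", "Sh", "O", "O", "O", "L", "L", "O", "O", "O"]
T = 2  # assumed threshold, not taken from the slides

phones = collapse_frames(network_output, T)
word = "".join(phones)
print(word)
```

The single leading S frame is discarded as noise, while the two-frame runs of Sh and L survive. This sketch does not merge two kept runs of the same phone that were separated only by a dropped noise run; whether the original system does so is not stated.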
Analysis: Evaluation Metrics
➢ Phone detection error rate (P_err)
➢ Word detection error rate (W_err)
➢ Bengali Real Number Dataset (Nahid et al., 2016)
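The slides name the two metrics but do not give their formulas. One common way to compute such error rates, shown here purely as an assumption, is the edit (Levenshtein) distance between the reference and the recognized sequence divided by the reference length. The function names and the sample sequences below are hypothetical.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions and deletions
    needed to turn the sequence `hyp` into `ref`."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses):
    """Total edit distance divided by total reference length."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return errors / total


# Hypothetical example: one phone out of four is wrong.
reference = [["Sh", "O", "L", "O"]]
recognized = [["S", "O", "L", "O"]]
p_err = error_rate(reference, recognized)
print(p_err)
```

Passing sequences of words instead of phones would give a word-level rate under the same assumed definition.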
Analysis: How the Model Learns
➢ 28.7% P_err, 13.2% W_err
➢ CMU Sphinx4 incurred 15% W_err (Nahid et al., 2016)
Analysis: Phone Labeling
➢ Some phones are almost always detected with high confidence (e.g., 'A', 'O', 'L')
  • They occurred more frequently
  • No other phone is pronounced similarly
➢ Some phones are frequently garbled
Analysis: Threshold Affects Learning
➢ It seems that, on average, there are 5 phones per word in the dataset
➢ Robustness not checked
Limitations
➢ Phone alignment is very difficult
  • Same example introduced thrice, but with different alignment
➢ Only vocabulary words are recognized
  • 114 unique words
➢ Robustness in threshold selection not tested
Conclusion
➢ An RNN architecture based on phone detection is proposed
➢ Bengali real numbers are recognized with higher accuracy than CMU Sphinx4 (13.2% vs. 15% W_err)
➢ The vocabulary can be increased with larger datasets
➢ How to align phones can be learned too
References
• Nahid, M. M. H., Islam, M. A., & Islam, M. S. (2016, May). A noble approach for recognizing Bangla real number automatically using CMU Sphinx4. In Informatics, Electronics and Vision (ICIEV), 2016.
• Graves, A., Mohamed, A., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE.
• Ravuri, S., & Wegmann, S. (2016). How neural network features and depth modify statistical properties of HMM acoustic models. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE.
Thank you
Phoneme List