A Low-Power Text-Dependent Speaker Verification System with Narrow-band Feature Pre-selection and Weighted Dynamic Time Warping
Qing He, Gregory Wornell, Wei Ma
June 21, 2016
MIT - Signals Information and Algorithms Lab
- Texas Instruments
Motivation: Low-Power Wake-up
ADC → Host device (DSP), always ON: >100 mW
– High data acquisition rate to minimize information loss and to enable flexible downstream processing
– Involves many stages of processing on high dimensional data
Low-power voice-authenticated wake-up front-end, always ON: ~50-300 uW
– Early-stage signal dimension reduction with analog components
– Adaptive data acquisition and robust processing
The host device turns on with the wake-up signal from the front-end
System Architecture: A Comparison
Conventional System: Sampling (sampling rate e.g., ~24 kHz) → Windowing → Acoustic feature extraction (MFCC) → Speaker Verification → Accept/Reject
Fast processing of high dimensional features: high power-consumption!
Proposed System: Analog front-end (narrowband filter bank, spectral feature extraction) → Low rate ADC (< 4 kHz) → NBSC → Pattern match (weighted DTW against enrollment samples) → Accept/Reject
Low-rate processing (kHz): low power-consumption!
Spectral Feature Pre-Selection → Speaker-verification Backend → Output
capable of preserving most speech information
Spectral Feature Pre-Selection
Review: Speech Sound Generation
Feature Selection Weighted DTW Experiments
[Figure: (a) Excitation signal; (b) Vocal tract modulation; (c) Speech spectrogram]
– Essential for speech recognition
– Separable in the cepstral domain
Cepstral Representation of Speech
Feature Selection Feature Extraction Command Recognition
[Figure: spectrogram (time (s) vs. frequency (Hz)) and spectral density (frequency (Hz)) showing the harmonics and the vocal tract modulation; an IFFT gives the cepstral coefficients (quefrency, cycle/kHz)]
to the cepstral domain is power expensive!
Question: How to extract without acquiring the full spectrum or performing transformation to the cepstral domain?
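The cepstral representation on this slide can be illustrated with a short sketch. It is not the presenters' implementation: it assumes the textbook real cepstrum (inverse DFT of the log-magnitude spectrum), uses a naive DFT to stay dependency-free, and runs on a synthetic frame. The names `dft`, `real_cepstrum`, `PERIOD`, and the decaying filter `h` are all hypothetical.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (inverse if requested)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def real_cepstrum(frame, floor=1e-12):
    """Real cepstrum: inverse DFT of the log-magnitude spectrum."""
    log_mag = [math.log(max(abs(c), floor)) for c in dft(frame)]
    return [c.real for c in dft(log_mag, inverse=True)]


# Toy voiced frame (assumption): an impulse train with a 16-sample pitch
# period stands in for the excitation, and a short decaying filter stands
# in for the vocal tract. The two are combined by circular convolution.
N, PERIOD = 128, 16
excitation = [1.0 if n % PERIOD == 0 else 0.0 for n in range(N)]
h = [0.6 ** k for k in range(8)]
frame = [sum(h[k] * excitation[(n - k) % N] for k in range(len(h)))
         for n in range(N)]

cep = real_cepstrum(frame)

# Low quefrencies carry the slowly varying (vocal tract) envelope; the
# excitation appears as a peak near the pitch period. Search a window
# around the expected pitch period only.
peak = max(range(PERIOD // 2, 3 * PERIOD // 2), key=lambda q: cep[q])
```

In this toy case the envelope and the excitation land in different quefrency regions, which is the separability the slide refers to. It also shows why the slide calls the transformation expensive: a full spectrum and an inverse transform are needed for every frame.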
Point-Wise Spectral Sampling on the Harmonics
[Figure: spectrogram (time (s) vs. frequency (Hz)), spectrum in dB over frequency (Hz), and cepstral coefficients]
is retained
Narrow-Band Spectral Filtering
[Figure: spectrogram (time (s) vs. frequency (Hz)) and spectra in dB over frequency (Hz)]
Narrow-band Spectral Filtering: Parameters
[Figure: spectrogram (time (s) vs. frequency (Hz)) and spectrum in dB over frequency (Hz)]
where is the narrow-band bandwidth
is the spacing between narrow-bands
is mostly retained
With = 200 Hz, = 800 Hz, = 100 Hz, aliasing at the baseband is attenuated significantly
Narrow-Band Spectral Coefficients (NBSC)
BP Filtering BP Filtering
low-power
Block Diagram of the Proposed System
Analog front-end → Digital back-end: Weighted Dynamic Time-Warping (DTW) against enrollment samples → threshold → Accept/Reject
Weighted Dynamic Time Warping
Overview: Speaker-Verification Systems
Voice-authenticated wake-up:
– Model based: GMM [Reynolds, 2000], i-vectors [Dehak, 2011], DNN [Liu, 2015], HMM [Rosenberg, 1990]
– Template based: DTW [Sakoe, 1978]
Model based: Training data → Model Training → Model; Enrollments → Model Adaptation; Feature Extraction (MFCC) → Model → threshold → Accept/Reject
Template based: Enrollments → Feature Extraction → Distance Measure → threshold → Accept/Reject (no prior model training)
Weighted Dynamic Time Warping
[Figure: reference signal vs. speech input with compress and stretch warping steps, M = 3; small penalty vs. large penalty]
distance plus a penalty term
Distance Matrix Computation
where Cost is a function of the signal magnitude and the # of consecutive warping steps
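The slide's own cost formula did not survive extraction, so the following is only a sketch of one way a penalised DTW of this kind could look. Everything specific is an assumption: the penalty form `weight * k * |frame|` for the k-th consecutive warping step, the reading of M as the maximum number of consecutive warping steps, and the `verify` rule that takes the minimum distance over the enrollments. The names `frame_dist`, `weighted_dtw`, and `verify` are hypothetical.

```python
import math


def frame_dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def weighted_dtw(ref, test, max_warp=3, weight=0.1):
    """DTW distance plus a penalty on consecutive warping steps.

    ref, test: sequences of feature vectors (lists of floats).
    max_warp:  assumed meaning of M, the longest allowed run of
               stretch/compress steps (both kinds share one counter).
    weight:    hypothetical penalty scale; the k-th consecutive warping
               step adds weight * k * (magnitude of the test frame).
    Returns math.inf when no admissible alignment exists.
    """
    n, m = len(ref), len(test)
    inf = math.inf
    # D[i][j][k]: best cost of reaching (i, j) after k consecutive warps.
    D = [[[inf] * (max_warp + 1) for _ in range(m)] for _ in range(n)]
    D[0][0][0] = frame_dist(ref[0], test[0])
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            d = frame_dist(ref[i], test[j])
            mag = math.sqrt(sum(x * x for x in test[j]))
            # Diagonal step: no warping, so the run length resets to 0.
            if i > 0 and j > 0:
                D[i][j][0] = d + min(D[i - 1][j - 1])
            # Stretch/compress step: extends a run of length k - 1.
            for k in range(1, max_warp + 1):
                prev = inf
                if i > 0:
                    prev = min(prev, D[i - 1][j][k - 1])
                if j > 0:
                    prev = min(prev, D[i][j - 1][k - 1])
                if prev < inf:
                    D[i][j][k] = d + weight * k * mag + prev
    return min(D[n - 1][m - 1])


def verify(enrollments, utterance, threshold):
    """Accept when the closest enrollment is within the threshold."""
    score = min(weighted_dtw(e, utterance) for e in enrollments)
    return score <= threshold


# Toy one-dimensional features: the second sequence repeats one frame.
ref = [[0.0], [1.0], [2.0], [1.0]]
stretched = [[0.0], [1.0], [1.0], [2.0], [1.0]]
same = weighted_dtw(ref, ref)
warped = weighted_dtw(ref, stretched)
classical = weighted_dtw(ref, stretched, weight=0.0)
```

With `weight=0.0` the recursion reduces to a classical DTW with a cap on consecutive warping steps. Because the penalty grows with the run length, a long run costs more than the same number of isolated warping steps, which matches the small-penalty and large-penalty cases in the figure.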
Classical vs. Weighted DTW
– Fails to align the signal envelopes
– Less mutation on signal envelope
– Signal envelopes are well aligned
– The shape of T is mutated
System Experiment
Experiment Setup
Data Set:
Passphrase    # of speakers    # of repetitions
Hi Galaxy     40               40
OK Glass      40               20
OK Hua Wei    30               20
Baseline Systems:
• 40-dim MFCC + Classical DTW
sample such that the total SNR is 3dB
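As an illustration of mixing at a target SNR, the sketch below scales a noise signal so that the speech-to-noise power ratio equals a requested value in dB. The slide text is truncated here, so the noise source and the exact mixing procedure used in the experiments are not known. The function `mix_at_snr` and the toy signals are hypothetical, and the two inputs are assumed to have equal length.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mixture has the given SNR in dB."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Solve 10*log10(p_speech / (gain^2 * p_noise)) = snr_db for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    mixed = [s + gain * n for s, n in zip(speech, noise)]
    return mixed, gain


# Toy signals (assumption): equal length, arbitrary values.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
mixed, gain = mix_at_snr(speech, noise, 3.0)

# Check the achieved SNR from the scaled noise power.
p_scaled = sum((gain * n) ** 2 for n in noise) / len(noise)
achieved_db = 10 * math.log10(1.0 / p_scaled)
```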
Parameters:
Summary of Experiment Results
EER [%]         Clean                             Noisy (3dB)
Algorithm       MFCC (40-dim)   NBSC (12 bands)   MFCC (40-dim)   NBSC (8 bands)
Weighted-DTW    0.9             1.1               10.5            5.7
DTW             1.4             1.5               13              6.7
GMM/UBM         2.6             N/A               6.8             N/A
NBSC (12 bands): all features
NBSC (8 bands): features below 2kHz are dropped
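The results are reported as equal error rates (EER), the operating point at which the false-accept rate equals the false-reject rate. The sketch below shows how an EER can be estimated from two lists of distance scores. It assumes lower scores mean a better match, as with a DTW distance. The function `equal_error_rate` and the toy scores are hypothetical and are not taken from the experiments.

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER from distance scores (lower = better match).

    Sweeps every observed score as a threshold and returns the mean of
    the false-accept and false-reject rates where they are closest.
    """
    best_gap, eer = float("inf"), None
    for t in sorted(set(genuine) | set(impostor)):
        # Genuine trials rejected because their distance exceeds t.
        frr = sum(g > t for g in genuine) / len(genuine)
        # Impostor trials accepted because their distance is within t.
        far = sum(i <= t for i in impostor) / len(impostor)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (assumption): one genuine trial overlaps the impostors.
genuine_scores = [0.1, 0.2, 0.3, 0.9]
impostor_scores = [0.8, 1.0, 1.2, 1.4]
eer = equal_error_rate(genuine_scores, impostor_scores)
```

With only a handful of trials the estimate is coarse, because the two error rates can only change in steps of one trial.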
Experiments: Adaptive Band Selection
NBSC features (EER [%])
# of filters             6       8      10      12
Clean                    1.99    1.9    1.54    1.1
Noisy (band selection)   6.8     6.6    6.3     5.7
Noisy (all bands)        15.5    15     15      14.5
System Power Estimation
– Cortex-M0 micro-controller
– Clock-speed: 40MHz
– Decision: every 60 ms

Power (uW)                                            Fixed    Additional per band    Total (12 bands)
Front-end: TI’s 13 band filter-bank features          150      10                     270
Back-end: text-dependent speaker verification         -        <9                     <108
Total                                                                                 <380
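The totals in the power table follow from a fixed term plus a per-band term. The short check below reproduces that arithmetic. The table lists no fixed power for the back-end, so treating it as zero is an assumption, and `total_power_uw` is a hypothetical helper.

```python
def total_power_uw(fixed_uw, per_band_uw, n_bands):
    """Total power in uW: fixed part plus a per-band part."""
    return fixed_uw + per_band_uw * n_bands


N_BANDS = 12

# Front-end: 150 uW fixed plus 10 uW per band.
front_end_uw = total_power_uw(150, 10, N_BANDS)

# Back-end: upper bound of 9 uW per band; fixed part assumed to be 0.
back_end_bound_uw = total_power_uw(0, 9, N_BANDS)

# Upper bound for the whole system, to compare with the <380 uW total.
system_bound_uw = front_end_uw + back_end_bound_uw
```

The front-end comes to 270 uW and the back-end bound to 108 uW, so the system bound is 378 uW, consistent with the <380 uW total on the slide.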
Summary: Low-Power Text-Dependent Speaker Verification
noisy bands)
systems, with low-power implementation
False-Positive under Continuous Running
Feature Selection Feature Extraction Text-dependent speaker verification
OOV false positive [%]; columns: Clean: MFCC (40-dim), NBSC (12 bands); Noisy (3dB): MFCC (40-dim), NBSC (8 bands)
Weighted-DTW: 1.4, 0.6
~1 false-positive per hour in a noisy restaurant
Experiments: Adaptive Band Selection
EER [%]                  NBSC                            MFSC
# of filters             6       8      10      12       13      26
Clean                    1.99    1.9    1.54    1.1      1.95    1.83
Noisy (band selection)   6.8     6.6    6.3     5.7      16.4    17.2
Noisy (all bands)        15.5    15     15      14.5     33.4    33.9
a larger number of filters
Narrow-Band Spectral Filtering
[Figure: spectrogram (time (s) vs. frequency (Hz)), spectrum in dB over frequency (Hz), and cepstral coefficients (quefrency, cycle/kHz)]