
SLIDE 1

A Low-Power Text-Dependent Speaker Verification System with Narrow-band Feature Pre-selection and Weighted Dynamic Time Warping

Qing He, Gregory Wornell, Wei Ma

June 21, 2016

MIT - Signals Information and Algorithms Lab

Texas Instruments

SLIDE 2

Motivation: Low-Power Wake-up

Conventionally, for voice wake up, the host device is always ON

– High data acquisition rate to minimize information loss and to enable flexible downstream processing
– Involves many stages of processing on high-dimensional data


[Figure: ADC feeding the host device (DSP), always ON: >100 mW; low-power front-end, always ON: ~50-300 uW]

Much lower power consumption can be achieved with an application-specific voice-authenticated wake-up front-end

– Early-stage signal dimension reduction with analog components
– Adaptive data acquisition and robust processing

[Figure: the front-end issues a wake-up signal; the host device turns on with the wake-up signal]

SLIDE 3

System Architecture: A Comparison


Conventional System: high power consumption!

[Figure: Sampling (sampling rate e.g., ~24 kHz), Windowing, and Acoustic feature extraction (MFCC) in the Feature Extraction Unit, followed by Speaker Verification and Accept/Reject; fast processing of high-dimensional features]

Proposed System: low power consumption!

[Figure: Analog Front-End with a Narrowband Filter Bank for spectral feature extraction (NBSC), a Low-rate ADC (< 4 kHz), and low-rate processing (kHz): Weighted DTW pattern match against Enrollment Samples, followed by Accept/Reject]

SLIDE 4


Spectral Feature Pre-Selection

A few carefully selected narrow bands are capable of preserving most speech information

[Figure: Spectral Feature Pre-Selection, Speaker-verification Backend, Output]

SLIDE 5


Review: Speech Sound Generation

Feature Selection Weighted DTW Experiments

[Figure panels: (a) Excitation signal; (b) Vocal tract modulation; (c) Speech spectrogram]

Essential for speech recognition
Separable in the cepstral domain

SLIDE 6


Cepstral Representation of Speech

Feature Selection Feature Extraction Command Recognition

[Figure: spectral density vs. frequency (Hz), showing the harmonics and the vocal tract modulation; spectrogram, time (s) vs. frequency (Hz); IFFT to cepstral coefficients vs. quefrency (cycle/kHz)]

Acquiring the entire speech spectrum and transforming it to the cepstral domain is expensive in power!

Question: How can the vocal tract modulation be extracted without acquiring the full spectrum or performing a transformation to the cepstral domain?
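As a rough illustration of the cepstral transformation this slide refers to, the sketch below computes a real cepstrum (inverse DFT of the log-magnitude spectrum) in plain Python. The helper names `dft` and `real_cepstrum` and the toy signal (an impulse plus a delayed, attenuated copy standing in for a periodic excitation) are assumptions made for illustration; they are not taken from the slides.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) DFT; adequate for short illustrative frames."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def real_cepstrum(frame, floor=1e-12):
    """Real cepstrum: inverse DFT of the log-magnitude spectrum."""
    log_mag = [math.log(max(abs(s), floor)) for s in dft(frame)]
    return [c.real for c in dft(log_mag, inverse=True)]


# Toy frame (hypothetical): unit impulse plus a copy delayed by 8 samples.
frame = [0.0] * 64
frame[0] = 1.0
frame[8] = 0.5

cep = real_cepstrum(frame)
# The repetition shows up as a peak at quefrency index 8.
peak = max(range(1, len(frame) // 2 + 1), key=lambda n: cep[n])
print(peak, round(cep[peak], 3))
```

The periodic structure lands at a distinct quefrency, away from the low-quefrency region. That is the sense in which the components are separable in the cepstral domain. The full-spectrum DFT and the inverse transform are exactly the steps the proposed system tries to avoid.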

SLIDE 7


Point-Wise Spectral Sampling on the Harmonics

[Figure: spectrogram, time (s) vs. frequency (Hz); spectrum in dB vs. frequency (Hz), multiplied (×) by point-wise samples at the harmonics; cepstral coefficients]

The vocal tract modulation is retained

SLIDE 8

Narrow-Band Spectral Filtering

[Figure: spectrogram, time (s) vs. frequency (Hz); spectrum in dB vs. frequency (Hz), multiplied (×) by the narrow-band filters]

SLIDE 9

Narrow-band Spectral Filtering: Parameters

[Figure: spectrogram, time (s) vs. frequency (Hz); spectrum in dB vs. frequency (Hz)]

where
… is the narrow-band bandwidth
… is the spacing between narrow bands

The vocal tract modulation is mostly retained

With … = 200 Hz, … = 800 Hz, … = 100 Hz, aliasing at the baseband is attenuated significantly

SLIDE 10


Narrow-Band Spectral Coefficients (NBSC)

[Figure: parallel band-pass (BP) filtering branches]

Narrow-band spectral features retain essential speech information
A small number of filters
Low-rate sampling and simple processing

low-power
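A minimal digital sketch of narrow-band feature extraction, assuming the features are per-frame log-energies inside a few narrow bands. The slides describe an analog front-end, so this only emulates the idea in software. The function names, the band centers, the frame length and the 8 kHz sampling rate are hypothetical; only the 200 Hz bandwidth comes from the slides.

```python
import cmath
import math


def band_energy(frame, fs, f_center, bandwidth):
    """Energy of `frame` in [f_center - bandwidth/2, f_center + bandwidth/2] via a direct DFT."""
    n = len(frame)
    lo, hi = f_center - bandwidth / 2, f_center + bandwidth / 2
    energy = 0.0
    for k in range(n // 2 + 1):
        f = k * fs / n
        if lo <= f <= hi:
            x_k = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            energy += abs(x_k) ** 2
    return energy


def nbsc_features(signal, fs, centers, bandwidth=200.0, frame_len=160, hop=160):
    """One row per frame, one log10 band energy per narrow band."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        feats.append([math.log10(band_energy(frame, fs, fc, bandwidth) + 1e-12)
                      for fc in centers])
    return feats


# Hypothetical test tone at 1 kHz, sampled at 8 kHz.
fs = 8000
tone = [math.sin(2 * math.pi * 1000 * t / fs) for t in range(480)]
features = nbsc_features(tone, fs, centers=[600.0, 1000.0, 1400.0])
print(len(features), [round(v, 1) for v in features[0]])
```

Each frame is reduced to a handful of numbers, one per band, which is what keeps the downstream sampling rate and processing load small.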


SLIDE 11

Block Diagram of the Proposed System


[Figure: analog front-end (narrow-band branches, each with a multiplier ×) feeding the digital back-end; Weighted Dynamic Time-Warping (DTW) against enrollment samples; comparison with a threshold gives Accept/Reject]

Individual bands can be discarded in the presence of noise

SLIDE 12


Weighted Dynamic Time Warping

Voice-authenticated wake-up:

Identifies the user and the passphrase in one shot
User-defined passphrase (~1 s)
Very few enrollment samples (e.g., 3)

[Figure: Spectral Feature Selection, Weighted DTW, Output]

SLIDE 13

Overview: Speaker-Verification Systems

Text-Dependent Speaker-Verification

– Model based: GMM [Reynolds, 2000], i-vectors [Dehak, 2011], DNN [Liu, 2015], HMM [Rosenberg, 1990]

[Figure: training data used for model training; enrollments used for model adaptation; feature extraction (MFCC), model, threshold, Accept/Reject]

– Template based: DTW [Sakoe, 1978]

[Figure: feature extraction, distance measure against enrollments, threshold, Accept/Reject]

No prior model training

SLIDE 14


Weighted Dynamic Time Warping

[Figure: reference signal and speech input, with compress and stretch warping; M = 3 consecutive warping steps; small penalty vs. large penalty]

The distance between two points, one from each signal, is equal to the point-to-point distance plus a penalty term

Penalty scales with the # of consecutive warping steps M
Penalty scales with the signal magnitude
Penalty for warping is low when signal is small
Penalty for warping is high when signal is large


SLIDE 15


Distance Matrix Computation

where Cost is a function of the signal magnitude and the # of consecutive warping steps
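The exact cost formula did not survive in this transcript, so the following is only a sketch of one way a distance matrix with a magnitude- and run-length-dependent warping penalty could be computed. The penalty form `alpha * run * magnitude`, the use of the mean absolute amplitude as "signal magnitude", the greedy tracking of the run length along the locally best path, and all names are assumptions, not the authors' definition. With `alpha = 0` it reduces to classical DTW.

```python
def weighted_dtw(ref, test, alpha=1.0):
    """DTW on 1-D sequences with a warping penalty (illustrative assumption).

    A non-diagonal (warping) step into cell (i, j) adds
    alpha * run * magnitude, where `run` counts consecutive warping steps
    on the locally best path and `magnitude` is the mean absolute amplitude
    of the two samples. Run lengths are tracked greedily, so this is a
    heuristic rather than an exact optimum over all run-length states.
    """
    n, m = len(ref), len(test)
    cost = [[float("inf")] * m for _ in range(n)]
    run = [[0] * m for _ in range(n)]
    cost[0][0] = abs(ref[0] - test[0])
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            base = abs(ref[i] - test[j])
            magnitude = 0.5 * (abs(ref[i]) + abs(test[j]))
            candidates = []
            if i > 0 and j > 0:  # diagonal step: no warping, run resets
                candidates.append((cost[i - 1][j - 1], 0))
            if i > 0:  # vertical step: warping
                r = run[i - 1][j] + 1
                candidates.append((cost[i - 1][j] + alpha * r * magnitude, r))
            if j > 0:  # horizontal step: warping
                r = run[i][j - 1] + 1
                candidates.append((cost[i][j - 1] + alpha * r * magnitude, r))
            best, best_run = min(candidates)
            cost[i][j] = base + best
            run[i][j] = best_run
    return cost[n - 1][m - 1]


ref = [0.0, 0.0, 1.0, 0.0]
quiet_stretch = [0.0, 0.0, 0.0, 1.0, 0.0]  # extra sample where the signal is small
loud_stretch = [0.0, 0.0, 1.0, 1.0, 0.0]   # extra sample where the signal is large
print(weighted_dtw(ref, quiet_stretch), weighted_dtw(ref, loud_stretch))
```

In this toy case, warping inside the silent region costs nothing, while warping across the loud sample is penalized. This matches the qualitative behavior listed on the previous slide.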


SLIDE 16


Classical vs. Weighted DTW

[Figure: alignment results for classical and weighted DTW, annotated as follows]
Fails to align the signal envelopes
Less mutation on signal envelope
Signal envelopes are well aligned
The shape of T is mutated

SLIDE 17

System Experiment

[Figure: Spectral Feature Selection, Weighted Dynamic Time Warping, Output]

SLIDE 18


Experiment Setup

Data Set:

Passphrase    # of speakers    # of repetitions
Hi Galaxy     40               40
OK Glass      40               20
OK Hua Wei    30               20

Baseline Systems:

40-dim MFCC + Classical DTW
40-dim MFCC + GMM-UBM model

Noisy samples:

Wind and car noises are added to each clean sample such that the total SNR is 3 dB
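A small sketch of how noise can be scaled to reach a target SNR before mixing. It assumes SNR is defined as the ratio of average clean power to average noise power over the whole sample; the slides do not say how "total SNR" was measured. The function name and the synthetic signals are hypothetical.

```python
import math


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean power / scaled-noise power equals `snr_db`, then add."""
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + gain * n for c, n in zip(clean, noise)]


# Hypothetical stand-ins for a clean utterance and a noise recording.
clean = [math.sin(2 * math.pi * 5 * t / 200) for t in range(200)]
noise = [((t * 37) % 11 - 5) / 5.0 for t in range(200)]
noisy = mix_at_snr(clean, noise, snr_db=3.0)
```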

Parameters:

# of enrollment samples: 3
Narrow-band spectral coefficients (NBSC) band-width: 200Hz
f0 estimation using autocorrelation method [Rabiner, 1976]
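For readers unfamiliar with autocorrelation-based pitch estimation, here is a basic peak-picking sketch. It is not necessarily the exact procedure of the cited reference; the search range (60-400 Hz), the function name and the test tone are assumptions.

```python
import math


def estimate_f0(frame, fs, f_min=60.0, f_max=400.0):
    """Return fs / lag for the lag that maximizes the frame's autocorrelation."""
    lag_min = int(fs / f_max)
    lag_max = min(int(fs / f_min), len(frame) - 1)
    best_lag, best_val = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        r = sum(frame[t] * frame[t + lag] for t in range(len(frame) - lag))
        if r > best_val:
            best_lag, best_val = lag, r
    return fs / best_lag


# Hypothetical 100 Hz tone sampled at 8 kHz (period = 80 samples).
fs = 8000
frame = [math.sin(2 * math.pi * 100 * t / fs) for t in range(400)]
f0 = estimate_f0(frame, fs)
```

Real speech would typically need voicing detection and some protection against octave errors, which this sketch omits.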


SLIDE 19


Summary of Experiment Results

                 Clean (EER [%])                      Noisy (3 dB) (EER [%])
Algorithm        MFCC (40-dim)    NBSC (12 bands)     MFCC (40-dim)    NBSC (8 bands)
Weighted-DTW     0.9              1.1                 10.5             5.7
DTW              1.4              1.5                 13               6.7
GMM/UBM          2.6              N/A                 6.8              N/A

Note: features below 2 kHz are dropped

Without noise, the NBSC yields accuracy comparable to the MFCC features
At 3 dB SNR, the NBSC yields much better accuracy than the MFCC features
The Weighted-DTW yields better accuracy than the classical DTW for all features
The proposed system yields better accuracy than the GMM/UBM method
– Taking only 3 enrollment samples as prior
– Without prior background model training
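Since the results are reported as equal error rates (EER), a short sketch of how an EER can be estimated from verification scores may help. It assumes distance-like scores (lower means a better match, as with DTW) and a simple threshold sweep; the function name and the sample scores are made up for illustration and are not the experimental data.

```python
def equal_error_rate(genuine, impostor):
    """Sweep thresholds over distance scores and return (eer, threshold).

    A trial is accepted when its distance is <= threshold. The EER is taken
    at the threshold where the false-accept and false-reject rates are closest.
    """
    best = None
    for thr in sorted(set(genuine) | set(impostor)):
        frr = sum(d > thr for d in genuine) / len(genuine)
        far = sum(d <= thr for d in impostor) / len(impostor)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, thr)
    return best[1], best[2]


# Hypothetical distances for genuine and impostor trials.
genuine = [0.1, 0.2, 0.3, 0.4]
impostor = [0.35, 0.5, 0.6, 0.7]
eer, threshold = equal_error_rate(genuine, impostor)
print(eer, threshold)
```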


SLIDE 20


Experiments: Adaptive Band Selection

Features: NBSC (EER [%])

# of filters              6       8       10      12
Clean                     1.99    1.9     1.54    1.1
Noisy (band selection)    6.8     6.6     6.3     5.7
Noisy (all bands)         15.5    15      15      14.5

Accuracy improves as the # of bands increases
Accuracy improves significantly with band selection
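The slides do not give the band-selection criterion. The results table only notes that features below 2 kHz are dropped in the noisy condition. The sketch below shows that fixed rule as one simple possibility; the band centers, the function names and the feature layout (one row per frame, one column per band) are hypothetical.

```python
def select_bands(centers_hz, cutoff_hz=2000.0):
    """Indices of bands whose center frequency is at or above the cutoff."""
    return [i for i, fc in enumerate(centers_hz) if fc >= cutoff_hz]


def apply_band_mask(features, keep):
    """Keep only the selected band columns in every frame."""
    return [[row[i] for i in keep] for row in features]


# Hypothetical layout: 12 bands with centers from 400 Hz to 4800 Hz.
centers = [400.0 * (k + 1) for k in range(12)]
keep = select_bands(centers)
frames = [[float(k) for k in range(12)] for _ in range(3)]
reduced = apply_band_mask(frames, keep)
print(len(keep), len(reduced[0]))
```

Because each band is a separate feature column, dropping bands needs no change to the matching stage other than the reduced feature dimension.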


SLIDE 21

System Power Estimation

Back-end implementation:

– Cortex-M0 micro-controller
– Clock-speed: 40 MHz
– Decision: every 60 ms

                                                   Fixed Power (uW)    Additional Power per Band (uW)    Total Power (12 bands) (uW)
Front-end: TI’s 13 band filter-bank features       150                 10                                270
Back-end: Text-dependent speaker verification                          <9                                <108
Total                                                                                                    <380
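The totals in the table follow from a simple fixed-plus-per-band model, which the sketch below reproduces. The back-end row lists no fixed power, so it is treated as zero here; that is an assumption. The back-end figures are upper bounds, and the slide's "<380" is consistent with rounding the sum up.

```python
def total_power_uw(fixed_uw, per_band_uw, n_bands):
    """Total power in microwatts: fixed part plus a per-band increment."""
    return fixed_uw + per_band_uw * n_bands


n_bands = 12
front_end = total_power_uw(150, 10, n_bands)      # front-end total
back_end_bound = total_power_uw(0, 9, n_bands)    # upper bound; fixed part assumed 0
system_bound = front_end + back_end_bound         # upper bound for the whole system
print(front_end, back_end_bound, system_bound)
```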

SLIDE 22

Summary: Low-Power Text-Dependent Speaker Verification


Early-stage signal dimension reduction
Analog feature extraction front-end
Low-rate sampling and processing
Support for adaptive band selection
Improved robustness to noise (discard noisy bands)
Demonstrated comparable accuracy to existing systems, with low-power implementation

[Figure: Spectral Feature Selection, Weighted Dynamic Time Warping, Output]

SLIDE 23

Questions?


SLIDE 24

Back-up Slides


SLIDE 25


False-Positive under Continuous Running

Feature Selection Feature Extraction Text-dependent speaker verification

OOV False Positive [%], Weighted-DTW algorithm
Feature columns: Clean: MFCC (40-dim), NBSC (12 bands); Noisy (3 dB): MFCC (40-dim), NBSC (8 bands)
Reported values: 1.4, 0.6

Decision threshold is the same as the speaker-verification EER threshold

~1 false-positive per hour in a noisy restaurant

Out-of-vocabulary samples:
50000 samples of 1.2s duration
Short commands, utterances from audio books and conversations

SLIDE 26


Experiments: Adaptive Band Selection


Features                  NBSC (EER [%])                    MFSC (EER [%])
# of filters              6       8       10      12        13      26
Clean                     1.99    1.9     1.54    1.1       1.95    1.83
Noisy (band selection)    6.8     6.6     6.3     5.7       16.4    17.2
Noisy (all bands)         15.5    15      15      14.5      33.4    33.9

Accuracy improves as the # of bands increases
Accuracy improves significantly with band selection
NBSC yields much better performance than the MFSC, which uses a larger number of filters

SLIDE 27

Narrow-Band Spectral Filtering

[Figure: spectrogram, time (s) vs. frequency (Hz); spectrum in dB vs. frequency (Hz), multiplied (×) by the narrow-band filters; resulting cepstral coefficients vs. quefrency (cycle/kHz)]