- Texas Instruments MIT - Signals Information and Algorithms Lab - - PowerPoint PPT Presentation


A Low-Power Text-Dependent Speaker Verification System with Narrow-band Feature Pre-selection and Weighted Dynamic Time Warping Qing He, Gregory Wornell, Wei Ma June 21, 2016 - Texas Instruments MIT - Signals Information and Algorithms Lab


SLIDE 1

A Low-Power Text-Dependent Speaker Verification System with Narrow-band Feature Pre-selection and Weighted Dynamic Time Warping

Qing He, Gregory Wornell, Wei Ma June 21, 2016

MIT - Signals Information and Algorithms Lab

  • Texas Instruments
SLIDE 2

Motivation: Low-Power Wake-up

  • Conventionally, for voice wake-up, the host device is always ON
    – High data acquisition rate to minimize information loss and to enable flexible downstream processing
    – Involves many stages of processing on high-dimensional data

[Diagram: ADC → Host device (DSP), always ON: >100 mW]

  • Much lower power consumption can be achieved with an application-specific voice-authenticated wake-up front-end
    – Early-stage signal dimension reduction with analog components
    – Adaptive data acquisition and robust processing

[Diagram: Low-power front-end, always ON: ~50-300 uW → wake-up signal → Host device (DSP), turned on by the wake-up signal]

SLIDE 3

System Architecture: A Comparison

Conventional System (high power consumption):

[Diagram: Sampling (rate e.g. ~24 kHz) → Windowing → Acoustic feature extraction (MFCC) → Speaker Verification → Accept/Reject; the feature extraction unit requires fast processing of high-dimensional features]

Proposed System (low power consumption):

[Diagram: Analog front-end (narrowband filter bank, spectral feature extraction) → Low-rate ADC (< 4 kHz) → NBSC → Weighted DTW pattern match against enrollment samples → Accept/Reject; low-rate processing (kHz)]

SLIDE 4

Spectral Feature Pre-Selection

  • A few carefully selected narrow-bands are capable of preserving most speech information

[Diagram: Spectral Feature Pre-Selection → Speaker-verification Backend → Output]

SLIDE 5

Review: Speech Sound Generation

Feature Selection Weighted DTW Experiments

[Figure panels: (a) excitation signal; (b) vocal tract modulation; (c) speech spectrogram. Annotations: essential for speech recognition; separable in the cepstral domain]

SLIDE 6

Cepstral Representation of Speech

Feature Selection Feature Extraction Command Recognition

[Figure: spectrogram (time in s vs. frequency in Hz) and spectral density vs. frequency, showing the harmonics and the vocal tract modulation; an IFFT gives the cepstral coefficients vs. quefrency (cycle/kHz)]

  • Acquiring the entire speech spectrum and transforming it to the cepstral domain is power-expensive!

Question: How can the vocal tract modulation be extracted without acquiring the full spectrum or transforming to the cepstral domain?
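As a point of reference for the conventional route this slide calls expensive, here is a minimal sketch of a real cepstrum computed with NumPy. The synthetic frame (an impulse train through a decaying resonance), the sample rate, and every function name are assumptions made for illustration; none of them come from the slides.

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum: inverse FFT of the log-magnitude spectrum."""
    spectrum = np.fft.fft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # small floor avoids log(0)
    return np.fft.ifft(log_mag).real

def synth_voiced_frame(f0=125.0, fs=8000, n=1024):
    """Hypothetical test frame: impulse train (excitation) through a decaying resonance."""
    period = int(round(fs / f0))
    excitation = np.zeros(n)
    excitation[::period] = 1.0
    t = np.arange(64) / fs
    vocal_tract = np.exp(-600.0 * t) * np.cos(2 * np.pi * 700.0 * t)
    frame = np.convolve(excitation, vocal_tract)[:n] * np.hanning(n)
    return frame, period

frame, period = synth_voiced_frame()
cep = real_cepstrum(frame)

# The excitation appears as a peak near the pitch period (high quefrency);
# the vocal-tract envelope sits in the first few (low-quefrency) coefficients.
pitch_quefrency = 32 + int(np.argmax(cep[32:200]))
```

Note that this route needs the full-band spectrum and two transforms per frame, which is the cost the question above is trying to avoid.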

SLIDE 7

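Point-Wise Spectral Sampling on the Harmonics

[Figure: spectrogram (time in s vs. frequency in Hz); the spectrum in dB with point-wise sampling (×) at the harmonics; the resulting cepstral coefficients. Surviving caption fragment: "… is retained"]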

SLIDE 8

Narrow-Band Spectral Filtering

[Figure: spectrogram (time in s vs. frequency in Hz) and the spectrum in dB multiplied (×) by the narrow-band filters]

SLIDE 9

Narrow-band Spectral Filtering: Parameters

[Figure: spectrogram (time in s vs. frequency in Hz) and the narrow-band filtered spectrum in dB]

where … is the narrow-band bandwidth and … is the spacing between narrow-bands; … is mostly retained.

With … = 200 Hz, … = 800 Hz, … = 100 Hz, aliasing at the baseband is attenuated significantly.

SLIDE 10

Narrow-Band Spectral Coefficients (NBSC)

[Diagram: parallel band-pass (BP) filtering branches]

  • Narrow-band spectral features retain essential speech information
  • A small number of filters
  • Low-rate sampling and simple processing

  → Low power
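To make the idea of a few narrow-band features concrete, the sketch below computes per-frame log energies in a handful of narrow bands. It is a digital stand-in, not the analog filter bank of the proposed system: the FFT-based band masks, frame sizes, and center frequencies are illustrative assumptions, and only the 200 Hz bandwidth is taken from the slides.

```python
import numpy as np

def narrowband_features(signal, fs, centers_hz, bandwidth_hz=200.0,
                        frame_len=0.032, hop=0.016):
    """Log energy in a few narrow bands per frame (digital stand-in, assumed design)."""
    n = int(round(frame_len * fs))
    step = int(round(hop * fs))
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    window = np.hanning(n)
    # One boolean mask per band: FFT bins within +/- bandwidth/2 of the center.
    masks = [np.abs(freqs - c) <= bandwidth_hz / 2 for c in centers_hz]
    feats = []
    for start in range(0, len(signal) - n + 1, step):
        power = np.abs(np.fft.rfft(signal[start:start + n] * window)) ** 2
        feats.append([np.log(power[m].sum() + 1e-12) for m in masks])
    return np.array(feats)  # shape: (num_frames, num_bands)

fs = 8000
t = np.arange(fs) / fs                      # one second of signal
tone = np.sin(2 * np.pi * 1000.0 * t)       # hypothetical test input
centers = [200.0, 1000.0, 1800.0, 2600.0]   # illustrative centers, not from the slides
nbsc = narrowband_features(tone, fs, centers)
```

Each frame is reduced to one number per band, so the downstream matcher sees a feature vector whose dimension equals the number of filters.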

SLIDE 11

Block Diagram of the Proposed System

[Block diagram: analog front-end (parallel narrow-band branches with multipliers, ×) → digital back-end: Weighted Dynamic Time-Warping (DTW) against enrollment samples → threshold → Accept/Reject]

  • Individual bands can be discarded in the presence of noise
SLIDE 12

Weighted Dynamic Time Warping

Voice-authenticated wake-up:

  • Identifies the user and the passphrase in one shot
  • User-defined passphrase (~1s)
  • Very few enrollment samples (e.g., 3)

[Diagram: Spectral Feature Selection → Weighted DTW → Output]

SLIDE 13

Overview: Speaker-Verification Systems

  • Text-Dependent Speaker-Verification
    – Model based: GMM [Reynolds, 2000], i-vectors [Dehak, 2011], DNN [Liu, 2015], HMM [Rosenberg, 1990]
      [Diagram: training data → model training → model; enrollments → model adaptation; feature extraction (MFCC) → model → threshold → Accept/Reject]
    – Template based: DTW [Sakoe, 1978]
      [Diagram: feature extraction → distance measure against enrollments → threshold → Accept/Reject; no prior model training]

SLIDE 14

Weighted Dynamic Time Warping

[Figure: reference signal and speech input aligned by compressing and stretching; labels: M = 3, small penalty, large penalty]

  • The distance between two points … and … is equal to the … distance plus a penalty term
  • The penalty scales with the number of consecutive warping steps, M
  • The penalty scales with the signal magnitude
    – The penalty for warping is low when the signal is small
    – The penalty for warping is high when the signal is large

SLIDE 15

Distance Matrix Computation

[Equation: recursive computation of the distance matrix, including a Cost term]

where Cost is a function of the signal magnitude and the number of consecutive warping steps
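The slides do not give the exact cost function, so the following is only a sketch of one way the stated behavior could be realized. It assumes a penalty of `lam * (consecutive warping steps) * (magnitude of the frame being held)`; the function names, the parameter `lam`, and the choice of which signal's magnitude to use are all assumptions. The run length is tracked along the locally best predecessor, which makes this a heuristic rather than an exact optimum.

```python
import numpy as np

def _as_frames(x):
    """Accept 1-D sequences or (num_frames, dims) arrays."""
    x = np.asarray(x, dtype=float)
    return x[:, None] if x.ndim == 1 else x

def weighted_dtw(ref, test, lam=0.1):
    """DTW where each warping step pays lam * run_length * held-frame magnitude (assumed form)."""
    ref, test = _as_frames(ref), _as_frames(test)
    n, m = len(ref), len(test)
    local = np.linalg.norm(ref[:, None, :] - test[None, :, :], axis=2)
    mag_ref = np.linalg.norm(ref, axis=1)
    mag_test = np.linalg.norm(test, axis=1)
    acc = np.full((n, m), np.inf)
    run = np.zeros((n, m), dtype=int)  # consecutive warping steps ending at (i, j)
    acc[0, 0] = local[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:   # diagonal step: no warping, run resets
                candidates.append((acc[i - 1, j - 1], 0))
            if i > 0:             # test frame j is held while ref advances
                r = run[i - 1, j] + 1
                candidates.append((acc[i - 1, j] + lam * r * mag_test[j], r))
            if j > 0:             # ref frame i is held while test advances
                r = run[i, j - 1] + 1
                candidates.append((acc[i, j - 1] + lam * r * mag_ref[i], r))
            best_cost, best_run = min(candidates)
            acc[i, j] = local[i, j] + best_cost
            run[i, j] = best_run
    return acc[-1, -1]

ref = [0, 1, 2, 1, 0]
stretched_in_silence = [0, 0, 1, 2, 1, 0]   # warping where the signal is small
stretched_at_peak = [0, 1, 2, 2, 1, 0]      # warping where the signal is large
cost_silence = weighted_dtw(ref, stretched_in_silence, lam=0.5)
cost_peak = weighted_dtw(ref, stretched_at_peak, lam=0.5)
```

With `lam=0` the recursion reduces to classical DTW. In the two toy inputs, stretching a silent frame is free while stretching the peak is penalized, which mirrors the low/high penalty behavior described on the previous slide.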

SLIDE 16

Classical vs. Weighted DTW

[Figure: alignments under classical and weighted DTW, annotated: "Fails to align the signal envelopes", "Less mutation on signal envelope", "Signal envelopes are well aligned", "The shape of T is mutated"]

SLIDE 17

System Experiment

[Diagram: Spectral Feature Selection → Weighted Dynamic Time Warping → Output]

SLIDE 18

Experiment Setup

Data Set:

Passphrase     # of speakers    # of repetitions
Hi Galaxy      40               40
OK Glass       40               20
OK Hua Wei     30               20

Baseline Systems:

  • 40-dim MFCC + Classical DTW
  • 40-dim MFCC + GMM-UBM model

Noisy samples:

  • Wind and car noises are added to each clean sample such that the total SNR is 3dB
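Mixing noise into a clean sample at a target SNR can be sketched as follows. The function name and the test signals are illustrative assumptions; the slides only state the 3 dB target, not how the mixing was implemented.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean power / noise power matches snr_db, then add it."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + gain * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 200 * np.arange(8000) / 8000)  # hypothetical clean sample
noise = rng.standard_normal(8000)                          # stand-in for wind/car noise
noisy = mix_at_snr(clean, noise, snr_db=3.0)

# Check the SNR actually achieved by the mix.
achieved_db = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
```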

Parameters:

  • # of enrollment samples: 3
  • Narrow-band spectral coefficients (NBSC) band-width: 200Hz
  • f0 estimation using autocorrelation method [Rabiner, 1976]
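A simplified sketch of autocorrelation-based f0 estimation is shown below. It is not the specific procedure of [Rabiner, 1976]; the pitch search range, the function name, and the synthetic frame are assumptions made for illustration.

```python
import numpy as np

def estimate_f0(frame, fs, fmin=60.0, fmax=400.0):
    """Pick the autocorrelation peak within the lag range of plausible pitch."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / fmax)
    lag_max = min(int(fs / fmin), len(acf) - 1)
    lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return fs / lag

fs = 8000
t = np.arange(1024) / fs
# Hypothetical voiced frame: 125 Hz fundamental plus two harmonics.
frame = sum(np.sin(2 * np.pi * 125.0 * k * t) / k for k in (1, 2, 3))
f0 = estimate_f0(frame, fs)
```

An f0 estimate of this kind could be used to place narrow bands relative to the harmonics, although the slides do not spell out how the estimate feeds the front-end.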

SLIDE 19

Summary of Experiment Results

EER [%]          Clean                              Noisy (3dB)
Algorithm        MFCC (40-dim)   NBSC (12 bands)    MFCC (40-dim)   NBSC (8 bands)
Weighted-DTW     0.9             1.1                10.5            5.7
DTW              1.4             1.5                13              6.7
GMM/UBM          2.6             N/A                6.8             N/A

NBSC (8 bands): features below 2kHz are dropped

  • Without noise, the NBSC yields comparable accuracy to the MFCC features
  • At 3dB SNR, the NBSC yields much better accuracy than the MFCC features
  • The Weighted-DTW yields better accuracy than the classical DTW for all features
  • The proposed system yields better accuracy than the GMM/UBM method
    – Taking only 3 enrollment samples as prior
    – Without prior background model training
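The results are reported as equal error rate (EER). A minimal sketch of how an EER can be computed from distance scores is given below, assuming that smaller distances mean better matches; the function name and the toy scores are illustrative and are not data from the experiments.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """EER for distance scores: sweep thresholds and find where FAR and FRR meet."""
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor <= t).mean() for t in thresholds])  # impostors accepted
    frr = np.array([(genuine > t).mean() for t in thresholds])    # genuine trials rejected
    k = int(np.argmin(np.abs(far - frr)))
    return (far[k] + frr[k]) / 2, thresholds[k]

# Toy distances (hypothetical): one genuine trial scores worse than two impostors.
eer, threshold = equal_error_rate([0.1, 0.2, 0.3, 0.9], [0.5, 0.8, 1.0, 1.2])
```

The returned threshold is the operating point at which false accepts and false rejects balance.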

SLIDE 20

Experiments: Adaptive Band Selection

NBSC (EER [%])
# of filters               6       8       10      12
Clean                      1.99    1.9     1.54    1.1
Noisy (band selection)     6.8     6.6     6.3     5.7
Noisy (all bands)          15.5    15      15      14.5

  • Accuracy improves as the # of bands increases
  • Accuracy improves significantly with band selection
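The slides do not describe the band-selection criterion, so the sketch below is purely an assumption: it keeps a band only if its peak level exceeds an estimate of its noise level by a margin. The function name, the 6 dB margin, and the use of leading noise-only frames are all hypothetical.

```python
import numpy as np

def select_bands(log_energy, noise_frames, min_snr_db=6.0):
    """Return indices of bands whose peak-to-noise level is at least min_snr_db.

    log_energy: (num_frames, num_bands) natural-log band energies.
    noise_frames: slice or index array of frames assumed to contain noise only.
    """
    to_db = 10.0 / np.log(10.0)  # converts natural-log energy to dB
    noise_db = log_energy[noise_frames].mean(axis=0) * to_db
    peak_db = log_energy.max(axis=0) * to_db
    return np.flatnonzero(peak_db - noise_db >= min_snr_db)

# Toy energies (hypothetical): band 0 is swamped by low-frequency noise.
energy = np.ones((10, 3))
energy[:, 0] = 100.0     # strong noise floor in band 0
energy[4:, 0] = 120.0    # speech barely rises above it
energy[4:, 1:] = 100.0   # speech stands well above the floor in bands 1 and 2
kept = select_bands(np.log(energy), noise_frames=slice(0, 4))
```

Dropping a band simply removes one column from the feature matrix before matching, so a selection rule of this kind adds little processing.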

SLIDE 21

System Power Estimation

  • Back-end implementation:
    – Cortex-M0 micro-controller
    – Clock speed: 40 MHz
    – Decision: every 60 ms

                                                  Fixed Power (uW)    Additional Power per Band (uW)    Total Power (12 bands) (uW)
Front-end: TI’s 13 band filter-bank features      150                 10                                270
Back-end: Text-dependent speaker verification                         <9                                <108
Total                                                                                                   <380
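The totals in the table can be reproduced with simple arithmetic. The sketch assumes the back-end has no fixed-power component, because that cell is empty in the table, and it treats the "<9" figure as an upper bound; the function name is illustrative.

```python
def total_power_uw(fixed_uw, per_band_uw, num_bands):
    """Total power as a fixed part plus a per-band part, in microwatts."""
    return fixed_uw + per_band_uw * num_bands

NUM_BANDS = 12
front_end_uw = total_power_uw(150, 10, NUM_BANDS)      # 150 + 10 * 12
back_end_bound_uw = total_power_uw(0, 9, NUM_BANDS)    # upper bound; fixed part assumed 0
system_bound_uw = front_end_uw + back_end_bound_uw     # the table rounds this bound to <380
```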

SLIDE 22

Summary: Low-Power Text-Dependent Speaker Verification

  • Early-stage signal dimension reduction
  • Analog feature extraction front-end
  • Low-rate sampling and processing
  • Supports adaptive band selection
  • Improved robustness to noise (discard noisy bands)
  • Demonstrated comparable accuracy to existing systems, with a low-power implementation

[Diagram: Spectral Feature Selection → Weighted Dynamic Time Warping → Output]

SLIDE 23

Questions?

SLIDE 24

Back-up Slides

SLIDE 25

False-Positive under Continuous Running

Feature Selection Feature Extraction Text-dependent speaker verification

OOV false positive [%] for Weighted-DTW: 1.4, 0.6 (table columns: Clean and Noisy (3dB), each with MFCC (40-dim) and NBSC (12 bands clean, 8 bands noisy))

  • Decision threshold is the same as the speaker-verification EER threshold
  • ~1 false positive per hour in a noisy restaurant
  • Out-of-vocabulary samples:
    – 50000 samples of 1.2s duration
    – Short commands, utterances from audio books and conversations

SLIDE 26

Experiments: Adaptive Band Selection

EER [%]                    NBSC                               MFSC
# of filters               6       8       10      12         13      26
Clean                      1.99    1.9     1.54    1.1        1.95    1.83
Noisy (band selection)     6.8     6.6     6.3     5.7        16.4    17.2
Noisy (all bands)          15.5    15      15      14.5       33.4    33.9

  • Accuracy improves as the # of bands increases
  • Accuracy improves significantly with band selection
  • NBSC yields much better performance than the MFSC, which uses a larger number of filters

SLIDE 27

Narrow-Band Spectral Filtering

[Figure: spectrogram (time in s vs. frequency in Hz), the spectrum in dB, narrow-band filtering (×), and the resulting cepstral coefficients vs. quefrency (cycle/kHz)]