The IBM 2016 Speaker Recognition System
Seyed Omid Sadjadi, Sriram Ganapathy, Jason Pelecanos (IBM Research)


SLIDE 1

IBM Research

The IBM 2016 Speaker Recognition System

Seyed Omid Sadjadi, Sriram Ganapathy, Jason Pelecanos

SLIDE 2

Outline

  • Introduction
  • Speaker Recognition System
  • Experimental Setup
  • Results
  • Conclusions
SLIDE 3

Introduction

SLIDE 4

Recent Progress

  • Major advancements in speaker recognition over the past several years
  • State-of-the-art (SOTA) i-vector systems use universal background models (UBMs) to estimate sufficient statistics
  • Previous work:
  • Gaussian mixture model UBM [Reynolds 1997]
  • Phonetically-inspired UBM (PI-UBM) [Omar 2010]
  • DNN-based phonetically-aware UBM [Lei 2014]
  • TDNN-based UBM (full covariance) [Snyder 2015]
  • DNN bottleneck based features are also used in SOTA systems [Heck 1998; Richardson 2015; Matějka 2016]

SLIDE 5

Objectives

  • To share state-of-the-art results on the NIST 2010 SRE
  • To present the key system components that helped us achieve these results:
  • Speaker- and channel-adapted fMLLR based features
  • A DNN acoustic model with a large number of senones (10k)
  • A nearest-neighbor discriminant analysis (NDA) technique
  • To quantify the contribution of each component
SLIDE 6

Speaker Recognition System

SLIDE 7

Speaker Recognition System

  • Our i-vector based speaker recognition system:
  • Speaker- and channel-normalized fMLLR based features
  • i-vectors are estimated using DNN senone posteriors (~10k senones)
  • LDA/NDA based intersession variability compensation

[Pipeline: Speech → SAD → Acoustic Feats. (fMLLR) → Suff. Stats → i-vector Extraction (T matrix) → Dim. Reduc. (LDA/NDA) → PLDA → Score]
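The "Suff. Stats → i-vector Extraction (T matrix)" stage of the pipeline can be sketched with the standard posterior-mean i-vector formula. The following numpy sketch assumes a diagonal-covariance UBM; all names and shapes are illustrative, not the authors' code:

```python
import numpy as np

def extract_ivector(N, F, T, Sigma, means):
    """Posterior-mean i-vector from zeroth/first-order sufficient statistics.

    N:     (C,) soft frame counts per component (or senone).
    F:     (C, d) first-order statistics.
    T:     (C*d, R) total variability matrix.
    Sigma: (C*d,) diagonal UBM covariances, stacked per component.
    means: (C, d) UBM component means.
    """
    C, d = F.shape
    Fc = (F - N[:, None] * means).reshape(-1)     # center the first-order stats
    Nrep = np.repeat(N, d)                        # expand counts to each dim
    TtSinv = T.T / Sigma                          # T^T Sigma^{-1}
    L = np.eye(T.shape[1]) + (TtSinv * Nrep) @ T  # posterior precision matrix
    return np.linalg.solve(L, TtSinv @ Fc)        # posterior mean = i-vector
```

In the paper's setup R would be 500 (the total variability subspace dimension) and C ~10k senones.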

SLIDE 8

Feature-space MLLR (fMLLR)

[Pipeline diagram; fMLLR block highlighted]
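fMLLR applies a per-speaker/channel affine map to the features. A minimal numpy sketch of the application step only (the ML estimation of the transform under the acoustic model is omitted; names are illustrative):

```python
import numpy as np

def apply_fmllr(features, W):
    """Apply a feature-space MLLR (fMLLR/CMLLR) transform.

    features: (T, d) acoustic feature vectors for one speaker/channel.
    W:        (d, d+1) affine transform [A | b], estimated elsewhere.
    Returns the normalized features x_hat = A x + b for every frame.
    """
    T, d = features.shape
    ext = np.hstack([features, np.ones((T, 1))])  # append 1 for the bias term
    return ext @ W.T
```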

SLIDE 9

DNN Senone I-vectors

[Diagram: DNN posteriors over the ~10k senones weight the Baum-Welch (B-W) sufficient statistics; Suff. Stats block highlighted in the pipeline]
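The senone posteriors replace GMM component posteriors when accumulating the Baum-Welch statistics. A minimal numpy sketch (illustrative, not the authors' implementation):

```python
import numpy as np

def baum_welch_stats(posteriors, features):
    """Accumulate sufficient statistics for i-vector extraction.

    posteriors: (T, C) frame-level senone posteriors from the DNN
                (C ~10k in the paper; tiny in any toy example).
    features:   (T, d) acoustic feature vectors (fMLLR-based here).
    Returns zeroth-order stats N (C,) and first-order stats F (C, d).
    """
    N = posteriors.sum(axis=0)    # soft frame counts per senone
    F = posteriors.T @ features   # posterior-weighted feature sums
    return N, F
```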

SLIDE 10

Linear Discriminant Analysis (LDA)

  • LDA assumes unimodal and Gaussian distributions
  • It cannot effectively handle multimodal data
  • It can be rank deficient

[Pipeline diagram; Dim. Reduc. (LDA/NDA) block highlighted]
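The rank limitation is visible directly in a standard LDA computation: the between-class scatter is built from only C global class means, so its rank (and hence the usable projection dimension) is at most C-1. A minimal numpy sketch (illustrative, not the paper's implementation):

```python
import numpy as np

def lda_projection(X, y, dim):
    """Standard LDA: top eigenvectors of Sw^{-1} Sb.

    Sb uses only the C global class means, so rank(Sb) <= C-1,
    which caps `dim` (one of the limitations NDA addresses).
    """
    classes = np.unique(y)
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sb = np.zeros((d, d))
    Sw = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        pc = len(Xc) / len(X)
        Sb += pc * np.outer(mc - mu, mc - mu)           # between-class scatter
        Sw += pc * np.cov(Xc, rowvar=False, bias=True)  # within-class scatter
    # Generalized eigenproblem Sb v = lambda Sw v (ridge keeps Sw invertible).
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(d), Sb))
    order = np.argsort(-evals.real)
    return evecs.real[:, order[:dim]]
```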

SLIDE 11

Nearest Neighbor Discriminant Analysis (NDA)

[Diagram: two classes; LDA uses global class means, NDA uses local k-NN means and emphasizes samples near the class boundary]

NDA between-class scatter:

$$
\mathbf{S}_b = \sum_{i=1}^{C} \sum_{\substack{j=1 \\ j \neq i}}^{C} \sum_{l=1}^{N_i} w_{ij}^{l}\,\bigl(\mathbf{x}_l^i - \mathbf{M}_{ij}^{l}\bigr)\bigl(\mathbf{x}_l^i - \mathbf{M}_{ij}^{l}\bigr)^{T}
$$

with boundary-emphasis weights

$$
w_{ij}^{l} = \frac{\min\bigl\{ d^{\alpha}\bigl(\mathbf{x}_l^i, \mathrm{NN}_K(\mathbf{x}_l^i, i)\bigr),\; d^{\alpha}\bigl(\mathbf{x}_l^i, \mathrm{NN}_K(\mathbf{x}_l^i, j)\bigr) \bigr\}}{d^{\alpha}\bigl(\mathbf{x}_l^i, \mathrm{NN}_K(\mathbf{x}_l^i, i)\bigr) + d^{\alpha}\bigl(\mathbf{x}_l^i, \mathrm{NN}_K(\mathbf{x}_l^i, j)\bigr)}
$$

where $\mathbf{M}_{ij}^{l}$ is the local mean of the $K$ nearest neighbors of $\mathbf{x}_l^i$ in class $j$, $\mathrm{NN}_K(\mathbf{x}, j)$ is its $K$-th nearest neighbor in class $j$, $d(\cdot,\cdot)$ is the distance, and $\alpha$ controls the emphasis on boundary samples. For comparison, the LDA between-class scatter uses the global class means:

$$
\mathbf{S}_b = \sum_{i=1}^{C} p_i\,(\boldsymbol{\mu}_i - \boldsymbol{\mu})(\boldsymbol{\mu}_i - \boldsymbol{\mu})^{T}
$$

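A minimal numpy sketch of the NDA between-class scatter defined on this slide, assuming Euclidean distance and a small illustrative K (not the authors' implementation):

```python
import numpy as np

def nda_between_scatter(X, y, K=3, alpha=2):
    """Nearest-neighbor between-class scatter S_b.

    Replaces LDA's global class means with the local mean of each sample's
    K nearest neighbors from the competing class, weighted so that samples
    near the class boundary dominate the scatter.
    """
    classes = np.unique(y)
    d = X.shape[1]
    Sb = np.zeros((d, d))

    def kth_nn_dist(x, c, exclude_self):
        dists = np.sort(np.linalg.norm(X[y == c] - x, axis=1))
        if exclude_self:
            dists = dists[1:]          # drop the zero distance to x itself
        return dists[min(K, len(dists)) - 1]

    def local_mean(x, c):
        Xc = X[y == c]
        idx = np.argsort(np.linalg.norm(Xc - x, axis=1))[:K]
        return Xc[idx].mean(axis=0)    # mean of x's K nearest neighbors in c

    for i in classes:
        for x in X[y == i]:
            d_own = kth_nn_dist(x, i, exclude_self=True)
            for j in classes:
                if j == i:
                    continue
                d_other = kth_nn_dist(x, j, exclude_self=False)
                M = local_mean(x, j)
                # Weight -> 0.5 near the boundary, -> 0 deep inside class i.
                w = min(d_own**alpha, d_other**alpha) / (
                    d_own**alpha + d_other**alpha + 1e-12)
                diff = x - M
                Sb += w * np.outer(diff, diff)
    return Sb
```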

SLIDE 12

Experimental Setup

SLIDE 13

Data

  • Training Data
  • NIST 2004-2008 SRE (English telephony and microphone data)
  • Switchboard (SWB) cellular Parts I and II, SWB2 Phases II and III
  • Total of 60,178 recordings
  • Evaluation data
  • NIST 2010 SRE (extended evaluation set)

Cond.  Enroll     Test                        Mismatch  #Targets  #Impostors
C1     Int. mic.  Int. mic. (same type)       No         4,034      795,995
C2     Int. mic.  Int. mic. (different type)  Yes       15,084    2,789,534
C3     Int. mic.  Telephony                   Yes        3,989      637,850
C4     Int. mic.  Room microphone             Yes        3,637      756,775
C5     Telephony  Telephony (different type)  Yes        7,169      408,950

SLIDE 14

DNN System Configuration

  • 6 fully connected hidden layers with 2048 units
  • The bottleneck layer has 512 units
  • Trained using 600 hours of speech from Fisher
  • Input is a 9-frame context of 40-D fMLLR feats.
  • Estimates posterior probabilities of 10k senones
  • 2k and 4k posteriors are also explored
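The topology above can be sketched as a schematic numpy forward pass. This is purely illustrative (random weights, sigmoid hidden activations, and the bottleneck placed just before the output layer are all assumptions, not details confirmed by the slide):

```python
import numpy as np

def init_dnn(sizes, seed=0):
    """Randomly initialized weights for a fully connected bottleneck DNN."""
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((m, n)) * 0.01, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def senone_posteriors(layers, x):
    """Forward pass: sigmoid hidden layers, softmax over the senone outputs."""
    for W, b in layers[:-1]:
        x = 1.0 / (1.0 + np.exp(-(x @ W + b)))    # hidden activations
    W, b = layers[-1]
    logits = x @ W + b
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)      # rows sum to 1
```

With the slide's sizes this would be `init_dnn([360] + [2048] * 6 + [512] + [10000])`: a 9-frame context of 40-D fMLLR features on the input, six 2048-unit hidden layers, the 512-unit bottleneck, and ~10k senone posteriors on the output.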
SLIDE 15

Speaker Recognition System Configuration

  • 500-dimensional total variability subspace trained using a subset of 48,325 recordings from NIST SRE, SWBCELL, and SWB2
  • Sufficient statistics are generated using posteriors from:
  • Gender independent 2048-component GMM-UBM (21,207 recordings)
  • DNN with 7 hidden layers and 2k, 4k, or 10k senones
  • MFCCs and fMLLR based features are evaluated
  • LDA/NDA is applied to obtain 250-dimensional feature vectors
  • Gaussian PLDA backend trained with 60,178 speech segments
  • Evaluation metrics: equal error rate (EER), minDCF’08, and minDCF’10
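The EER metric used throughout the results can be computed from the target and impostor score lists with a simple threshold sweep. A minimal numpy sketch (illustrative; not NIST's official scoring tool):

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """EER: the operating point where false-alarm rate equals miss rate."""
    scores = np.concatenate([target_scores, impostor_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(impostor_scores))])
    order = np.argsort(-scores)        # accept highest-scoring trials first
    labels = labels[order]
    # After accepting the top-k trials: false alarms are impostors accepted,
    # misses are targets still rejected.
    fa = np.cumsum(labels == 0) / (labels == 0).sum()
    miss = 1.0 - np.cumsum(labels == 1) / (labels == 1).sum()
    idx = np.argmin(np.abs(fa - miss))
    return (fa[idx] + miss[idx]) / 2.0
```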
SLIDE 16

Results

SLIDE 17

LDA vs NDA (MFCC, 2048-GMM, 10k DNN, C5)

  • NDA outperforms LDA for both GMM and DNN based systems

System        EER [%]  minDCF08  minDCF10
GMM-MFCC-LDA  2.40     0.120     0.439
GMM-MFCC-NDA  1.55     0.076     0.286
DNN-MFCC-LDA  1.02     0.045     0.168
DNN-MFCC-NDA  0.76     0.036     0.147

SLIDE 18

MFCC vs fMLLR (10k DNN, C5)

  • Speaker- and channel-normalized fMLLRs outperform MFCCs

System         EER [%]  minDCF08  minDCF10
DNN-MFCC-LDA   1.02     0.045     0.168
DNN-fMLLR-LDA  0.82     0.032     0.120
DNN-MFCC-NDA   0.76     0.036     0.147
DNN-fMLLR-NDA  0.67     0.028     0.092

SLIDE 19

Impact of #Senones (fMLLR, C5)

  • Using 10k senones gives the best performance
  • NDA consistently outperforms LDA for 2k, 4k, and 10k senones
  • Note: in contrast to DNNs, increasing the number of components in GMMs (beyond 2k, with diagonal covariance matrices) does not improve the results [Lei 2014; Snyder 2015]

System   #Senones  EER [%]  minDCF08  minDCF10
DNN-LDA  2k        1.19     0.054     0.212
DNN-NDA  2k        0.95     0.043     0.166
DNN-LDA  4k        0.98     0.041     0.169
DNN-NDA  4k        0.86     0.033     0.116
DNN-LDA  10k       0.82     0.032     0.120
DNN-NDA  10k       0.67     0.028     0.092

SLIDE 20

DET Plot Performance (C5)

SLIDE 21

System Progression (C5)

System         EER [%]  minDCF08  minDCF10
GMM-MFCC-LDA   2.40     0.120     0.439
GMM-MFCC-NDA   1.55     0.076     0.286
DNN-MFCC-NDA   0.76     0.036     0.147
DNN-fMLLR-NDA  0.67     0.028     0.092

  • Achieved the best published performance (EER = 0.67%) on NIST 2010 SRE (C5)
  • Building upon previous best results:
  • EER = 1.09% [Snyder 2015], gender-dependent (both genders)
  • EER = 0.94% [Matějka 2016], gender-dependent (female trials)
SLIDE 22

Conclusions

SLIDE 23

Conclusions

  • Presented the IBM i-vector speaker recognition system:
  • Speaker- and channel-normalized fMLLR based features may be more effective than raw MFCCs in matched conditions
  • Using a DNN-UBM with 10k senones to partition the acoustic space provides the best performance
  • NDA is more effective than LDA for channel compensation in the i-vector space (with multimodal data)
  • Achieved the best published performance (EER = 0.67%) on NIST 2010 SRE (C5)
  • For further progress on our system, see us at IS-2016