The IBM 2016 Speaker Recognition System
Seyed Omid Sadjadi, Sriram Ganapathy, Jason Pelecanos (IBM Research)


SLIDE 1

IBM Research

The IBM 2016 Speaker Recognition System

Seyed Omid Sadjadi, Sriram Ganapathy, Jason Pelecanos

SLIDE 2

Outline

  • Introduction
  • Speaker Recognition System
  • Experimental Setup
  • Results
  • Conclusions
SLIDE 3

Introduction

SLIDE 4

Recent Progress

  • Major advancements in speaker recognition over the past several years
  • State-of-the-art (SOTA) i-vector systems use universal background models (UBMs) to estimate sufficient statistics
  • Previous work:
  • Gaussian mixture model UBM [Reynolds 1997]
  • Phonetically-inspired UBM (PI-UBM) [Omar 2010]
  • DNN-based phonetically-aware UBM [Lei 2014]
  • TDNN-based UBM (full covariance) [Snyder 2015]
  • DNN bottleneck based features are also used in SOTA systems [Heck 1998; Richardson 2015; Matějka 2016]

SLIDE 5

Objectives

  • To share state-of-the-art results on the NIST 2010 SRE
  • To present the key system components that helped us achieve these results:
  • Speaker- and channel-adapted fMLLR based features
  • A DNN acoustic model with a large number of senones (10k)
  • A nearest-neighbor discriminant analysis (NDA) technique
  • To quantify the contribution of each component
SLIDE 6

Speaker Recognition System

SLIDE 7

Speaker Recognition System

  • Our i-vector based speaker recognition system:
  • Speaker- and channel-normalized fMLLR based features
  • i-vectors are estimated using DNN senone posteriors (~10k senones)
  • LDA/NDA based intersession variability compensation

[Pipeline: Speech → SAD → Acoustic Feats. (fMLLR) → Suff. Stats → i-vector Extraction (T matrix) → Dim. Reduc. (LDA/NDA) → PLDA → Score]
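The "Suff. Stats → i-vector Extraction (T matrix)" stage of the pipeline can be sketched with the standard posterior-mean i-vector formula. The following numpy sketch assumes a diagonal-covariance UBM; all names and shapes are illustrative, not the authors' code:

```python
import numpy as np

def extract_ivector(N, F, T, Sigma, means):
    """Posterior-mean i-vector from zeroth/first-order sufficient statistics.

    N:     (C,) soft frame counts per component (or senone).
    F:     (C, d) first-order statistics.
    T:     (C*d, R) total variability matrix.
    Sigma: (C*d,) diagonal UBM covariances, stacked per component.
    means: (C, d) UBM component means.
    """
    C, d = F.shape
    Fc = (F - N[:, None] * means).reshape(-1)     # center the first-order stats
    Nrep = np.repeat(N, d)                        # expand counts to each dim
    TtSinv = T.T / Sigma                          # T^T Sigma^{-1}
    L = np.eye(T.shape[1]) + (TtSinv * Nrep) @ T  # posterior precision matrix
    return np.linalg.solve(L, TtSinv @ Fc)        # posterior mean = i-vector
```

In the paper's setup R would be 500 (the total variability subspace dimension) and C ~10k senones.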

SLIDE 8

Feature-space MLLR (fMLLR)

[Pipeline diagram; fMLLR block highlighted]
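fMLLR applies a per-speaker/channel affine map to the features. A minimal numpy sketch of the application step only (the ML estimation of the transform under the acoustic model is omitted; names are illustrative):

```python
import numpy as np

def apply_fmllr(features, W):
    """Apply a feature-space MLLR (fMLLR/CMLLR) transform.

    features: (T, d) acoustic feature vectors for one speaker/channel.
    W:        (d, d+1) affine transform [A | b], estimated elsewhere.
    Returns the normalized features x_hat = A x + b for every frame.
    """
    T, d = features.shape
    ext = np.hstack([features, np.ones((T, 1))])  # append 1 for the bias term
    return ext @ W.T
```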

SLIDE 9

DNN Senone I-vectors

[Diagram: DNN posteriors over the ~10k senones weight the Baum-Welch (B-W) sufficient statistics; Suff. Stats block highlighted in the pipeline]
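The senone posteriors replace GMM component posteriors when accumulating the Baum-Welch statistics. A minimal numpy sketch (illustrative, not the authors' implementation):

```python
import numpy as np

def baum_welch_stats(posteriors, features):
    """Accumulate sufficient statistics for i-vector extraction.

    posteriors: (T, C) frame-level senone posteriors from the DNN
                (C ~10k in the paper; tiny in any toy example).
    features:   (T, d) acoustic feature vectors (fMLLR-based here).
    Returns zeroth-order stats N (C,) and first-order stats F (C, d).
    """
    N = posteriors.sum(axis=0)    # soft frame counts per senone
    F = posteriors.T @ features   # posterior-weighted feature sums
    return N, F
```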

SLIDE 10

Linear Discriminant Analysis (LDA)

  • LDA assumes unimodal and Gaussian distributions
  • It cannot effectively handle multimodal data
  • It can be rank deficient

[Pipeline diagram; Dim. Reduc. (LDA/NDA) block highlighted]
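The rank limitation is visible directly in a standard LDA computation: the between-class scatter is built from only C global class means, so its rank (and hence the usable projection dimension) is at most C-1. A minimal numpy sketch (illustrative, not the paper's implementation):

```python
import numpy as np

def lda_projection(X, y, dim):
    """Standard LDA: top eigenvectors of Sw^{-1} Sb.

    Sb uses only the C global class means, so rank(Sb) <= C-1,
    which caps `dim` (one of the limitations NDA addresses).
    """
    classes = np.unique(y)
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sb = np.zeros((d, d))
    Sw = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        pc = len(Xc) / len(X)
        Sb += pc * np.outer(mc - mu, mc - mu)           # between-class scatter
        Sw += pc * np.cov(Xc, rowvar=False, bias=True)  # within-class scatter
    # Generalized eigenproblem Sb v = lambda Sw v (ridge keeps Sw invertible).
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(d), Sb))
    order = np.argsort(-evals.real)
    return evecs.real[:, order[:dim]]
```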

SLIDE 11

Nearest Neighbor Discriminant Analysis (NDA)

[Diagram: two classes; LDA uses global class means, NDA uses local k-NN means and emphasizes samples near the class boundary]

NDA between-class scatter:

$$
\mathbf{S}_b = \sum_{i=1}^{C} \sum_{\substack{j=1 \\ j \neq i}}^{C} \sum_{l=1}^{N_i} w_{ij}^{l}\,\bigl(\mathbf{x}_l^i - \mathbf{M}_{ij}^{l}\bigr)\bigl(\mathbf{x}_l^i - \mathbf{M}_{ij}^{l}\bigr)^{T}
$$

with boundary-emphasis weights

$$
w_{ij}^{l} = \frac{\min\bigl\{ d^{\alpha}\bigl(\mathbf{x}_l^i, \mathrm{NN}_K(\mathbf{x}_l^i, i)\bigr),\; d^{\alpha}\bigl(\mathbf{x}_l^i, \mathrm{NN}_K(\mathbf{x}_l^i, j)\bigr) \bigr\}}{d^{\alpha}\bigl(\mathbf{x}_l^i, \mathrm{NN}_K(\mathbf{x}_l^i, i)\bigr) + d^{\alpha}\bigl(\mathbf{x}_l^i, \mathrm{NN}_K(\mathbf{x}_l^i, j)\bigr)}
$$

where $\mathbf{M}_{ij}^{l}$ is the local mean of the $K$ nearest neighbors of $\mathbf{x}_l^i$ in class $j$, $\mathrm{NN}_K(\mathbf{x}, j)$ is its $K$-th nearest neighbor in class $j$, $d(\cdot,\cdot)$ is the distance, and $\alpha$ controls the emphasis on boundary samples. For comparison, the LDA between-class scatter uses the global class means:

$$
\mathbf{S}_b = \sum_{i=1}^{C} p_i\,(\boldsymbol{\mu}_i - \boldsymbol{\mu})(\boldsymbol{\mu}_i - \boldsymbol{\mu})^{T}
$$

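A minimal numpy sketch of the NDA between-class scatter defined on this slide, assuming Euclidean distance and a small illustrative K (not the authors' implementation):

```python
import numpy as np

def nda_between_scatter(X, y, K=3, alpha=2):
    """Nearest-neighbor between-class scatter S_b.

    Replaces LDA's global class means with the local mean of each sample's
    K nearest neighbors from the competing class, weighted so that samples
    near the class boundary dominate the scatter.
    """
    classes = np.unique(y)
    d = X.shape[1]
    Sb = np.zeros((d, d))

    def kth_nn_dist(x, c, exclude_self):
        dists = np.sort(np.linalg.norm(X[y == c] - x, axis=1))
        if exclude_self:
            dists = dists[1:]          # drop the zero distance to x itself
        return dists[min(K, len(dists)) - 1]

    def local_mean(x, c):
        Xc = X[y == c]
        idx = np.argsort(np.linalg.norm(Xc - x, axis=1))[:K]
        return Xc[idx].mean(axis=0)    # mean of x's K nearest neighbors in c

    for i in classes:
        for x in X[y == i]:
            d_own = kth_nn_dist(x, i, exclude_self=True)
            for j in classes:
                if j == i:
                    continue
                d_other = kth_nn_dist(x, j, exclude_self=False)
                M = local_mean(x, j)
                # Weight -> 0.5 near the boundary, -> 0 deep inside class i.
                w = min(d_own**alpha, d_other**alpha) / (
                    d_own**alpha + d_other**alpha + 1e-12)
                diff = x - M
                Sb += w * np.outer(diff, diff)
    return Sb
```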

SLIDE 12

Experimental Setup

SLIDE 13

Data

  • Training Data
  • NIST 2004-2008 SRE (English telephony and microphone data)
  • Switchboard (SWB) cellular Parts I and II, SWB2 Phases II and III
  • Total of 60,178 recordings
  • Evaluation data
  • NIST 2010 SRE (extended evaluation set)

Cond.  Enroll     Test                        Mismatch  #Targets  #Impostors
C1     Int. mic.  Int. mic. (same type)       No         4,034      795,995
C2     Int. mic.  Int. mic. (different type)  Yes       15,084    2,789,534
C3     Int. mic.  Telephony                   Yes        3,989      637,850
C4     Int. mic.  Room microphone             Yes        3,637      756,775
C5     Telephony  Telephony (different type)  Yes        7,169      408,950

SLIDE 14

DNN System Configuration

  • 6 fully connected hidden layers with 2048 units
  • The bottleneck layer has 512 units
  • Trained using 600 hours of speech from Fisher
  • Input is a 9-frame context of 40-D fMLLR feats.
  • Estimates posterior probabilities of 10k senones
  • 2k and 4k posteriors are also explored
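The topology above can be sketched as a schematic numpy forward pass. This is purely illustrative (random weights, sigmoid hidden activations, and the bottleneck placed just before the output layer are all assumptions, not details confirmed by the slide):

```python
import numpy as np

def init_dnn(sizes, seed=0):
    """Randomly initialized weights for a fully connected bottleneck DNN."""
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((m, n)) * 0.01, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def senone_posteriors(layers, x):
    """Forward pass: sigmoid hidden layers, softmax over the senone outputs."""
    for W, b in layers[:-1]:
        x = 1.0 / (1.0 + np.exp(-(x @ W + b)))    # hidden activations
    W, b = layers[-1]
    logits = x @ W + b
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)      # rows sum to 1
```

With the slide's sizes this would be `init_dnn([360] + [2048] * 6 + [512] + [10000])`: a 9-frame context of 40-D fMLLR features on the input, six 2048-unit hidden layers, the 512-unit bottleneck, and ~10k senone posteriors on the output.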
SLIDE 15

Speaker Recognition System Configuration

  • 500-dimensional total variability subspace trained using a subset of 48,325 recordings from NIST SRE, SWBCELL, and SWB2
  • Sufficient statistics are generated using posteriors from:
  • Gender independent 2048-component GMM-UBM (21,207 recordings)
  • DNN with 7 hidden layers and 2k, 4k, or 10k senones
  • MFCCs and fMLLR based features are evaluated
  • LDA/NDA is applied to obtain 250-dimensional feature vectors
  • Gaussian PLDA backend trained with 60,178 speech segments
  • Evaluation metrics: equal error rate (EER), minDCF’08, and minDCF’10
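The EER metric used throughout the results can be computed from the target and impostor score lists with a simple threshold sweep. A minimal numpy sketch (illustrative; not NIST's official scoring tool):

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """EER: the operating point where false-alarm rate equals miss rate."""
    scores = np.concatenate([target_scores, impostor_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(impostor_scores))])
    order = np.argsort(-scores)        # accept highest-scoring trials first
    labels = labels[order]
    # After accepting the top-k trials: false alarms are impostors accepted,
    # misses are targets still rejected.
    fa = np.cumsum(labels == 0) / (labels == 0).sum()
    miss = 1.0 - np.cumsum(labels == 1) / (labels == 1).sum()
    idx = np.argmin(np.abs(fa - miss))
    return (fa[idx] + miss[idx]) / 2.0
```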
SLIDE 16

Results

SLIDE 17

LDA vs NDA (MFCC, 2048-GMM, 10k DNN, C5)

  • NDA outperforms LDA for both GMM and DNN based systems

System        EER [%]  minDCF08  minDCF10
GMM-MFCC-LDA  2.40     0.120     0.439
GMM-MFCC-NDA  1.55     0.076     0.286
DNN-MFCC-LDA  1.02     0.045     0.168
DNN-MFCC-NDA  0.76     0.036     0.147

SLIDE 18

MFCC vs fMLLR (10k DNN, C5)

  • Speaker- and channel-normalized fMLLRs outperform MFCCs

System         EER [%]  minDCF08  minDCF10
DNN-MFCC-LDA   1.02     0.045     0.168
DNN-fMLLR-LDA  0.82     0.032     0.120
DNN-MFCC-NDA   0.76     0.036     0.147
DNN-fMLLR-NDA  0.67     0.028     0.092

SLIDE 19

Impact of #Senones (fMLLR, C5)

  • Using 10k senones gives the best performance
  • NDA consistently outperforms LDA for 2k, 4k, and 10k senones
  • Note: in contrast to DNNs, increasing the number of components in GMMs (beyond 2k, with diagonal covariance matrices) does not improve the results [Lei 2014; Snyder 2015]

System   #Senones  EER [%]  minDCF08  minDCF10
DNN-LDA  2k        1.19     0.054     0.212
DNN-NDA  2k        0.95     0.043     0.166
DNN-LDA  4k        0.98     0.041     0.169
DNN-NDA  4k        0.86     0.033     0.116
DNN-LDA  10k       0.82     0.032     0.120
DNN-NDA  10k       0.67     0.028     0.092

SLIDE 20

DET Plot Performance (C5)

SLIDE 21

System Progression (C5)

System         EER [%]  minDCF08  minDCF10
GMM-MFCC-LDA   2.40     0.120     0.439
GMM-MFCC-NDA   1.55     0.076     0.286
DNN-MFCC-NDA   0.76     0.036     0.147
DNN-fMLLR-NDA  0.67     0.028     0.092

  • Achieved the best published performance (EER = 0.67%) on NIST 2010 SRE (C5)
  • Building upon previous best results:
  • EER = 1.09% [Snyder 2015], gender-dependent (both genders)
  • EER = 0.94% [Matějka 2016], gender-dependent (female trials)
SLIDE 22

Conclusions

SLIDE 23

Conclusions

  • Presented the IBM i-vector speaker recognition system:
  • Speaker- and channel-normalized fMLLR based features may be more effective than raw MFCCs in matched conditions
  • Using a DNN-UBM with 10k senones to partition the acoustic space provides the best performance
  • NDA is more effective than LDA for channel compensation in the i-vector space (with multimodal data)
  • Achieved the best published performance (EER = 0.67%) on NIST 2010 SRE (C5)
  • For further progress on our system, see us at IS-2016