Uncertainty Modeling without Subspace Methods for Text-Dependent Speaker Recognition


SLIDE 1

Speaker Recognition Task and Features · Two Backends · Experiments

Uncertainty Modeling without Subspace Methods for Text-Dependent Speaker Recognition

Patrick Kenny, Themos Stafylakis, Md. Jahangir Alam and Marcel Kockmann

Odyssey Speaker and Language Recognition Workshop Bilbao, Spain

June, 2016

1 / 19 P. Kenny, T. Stafylakis, J. Alam et al. Uncertainty Modeling without Subspace Methods

SLIDE 2

Uncertainty Modeling in Text-Dependent Speaker Recognition

Large numbers of mixture components are surprisingly effective in text-dependent speaker recognition, where utterances are typically 1 or 2 seconds in duration.

The number of times a mixture component is observed is typically << 1, and it may be 0 (particularly at test time), so observations ought to be treated as noisy in the statistical sense.

Some progress has been made in uncertainty modeling for text-independent speaker recognition with subspace methods (i-vectors, speaker factors), but these are of limited use in text-dependent speaker recognition.

We tackle the problem of uncertainty modeling without resorting to subspace methods.

SLIDE 3

RSR2015 Part III (Random Digits)

Background set (97 speakers) used for JFA and backend training.

Results are reported on the development set.

Enrollment consists of 3 utterances of the 10 digits in random order.

Each test utterance consists of a random string of 5 digits.

Error rates are much higher than on Part I.

Counterintuitively, it is hard to beat a naive GMM/UBM benchmark using HMMs.

We focus on backend modeling with a standard 60-dimensional PLP front end.

SLIDE 4

JFA for Speaker Recognition with Digits

Given a speaker and a collection of enrollment recordings, the recordings are modeled by supervectors of the form

    m + U x_r + D z    (1)

Speakers are characterized by z-vectors (supervector sized); the x-vectors (low-dimensional) model channel effects.

To perform speaker recognition, for each digit d in a test utterance compare the supervectors z_e and z_t, where

    z_e is extracted from the enrollment utterances
    z_t is extracted from the test utterance

z-vectors may be digit-independent (global) or digit-dependent (local).
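As a minimal numerical sketch of the decomposition in (1) (all dimensions and matrix values here are toy assumptions, not the configuration used in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

feat_dim, n_comp = 3, 4        # toy feature dimension and number of mixture components
sv_dim = feat_dim * n_comp     # supervector dimension
x_dim = 2                      # low channel-factor dimension

m = rng.standard_normal(sv_dim)             # UBM mean supervector
U = rng.standard_normal((sv_dim, x_dim))    # channel loading matrix
D = np.diag(rng.uniform(0.5, 1.0, sv_dim))  # diagonal matrix, supervector sized

x_r = rng.standard_normal(x_dim)   # channel factors for recording r (low-dimensional)
z = rng.standard_normal(sv_dim)    # speaker z-vector (supervector sized)

# Supervector for recording r, equation (1): m + U x_r + D z
s_r = m + U @ x_r + D @ z
```

Note that z lives in the full supervector space, which is what distinguishes this setup from low-rank subspace models.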

SLIDE 5

Two Backends

The Joint Density Backend uses point estimates of z_e and z_t.

The Hidden Supervector Backend treats z_e and z_t as latent variables. Inference requires:

    Baum-Welch statistics
    A joint prior distribution (under the same-speaker hypothesis) P(w), where w = (z_e, z_t)
    Calculating the posterior of w given the Baum-Welch statistics

SLIDE 6

Joint Density Backend

The joint distribution for target trials, P_T(z_e, z_t), is modeled by a Gaussian for each mixture component.

There is insufficient data to train full covariance Gaussians, and diagonal Gaussians are obviously incorrect, so we use "semi-diagonal" constraints (see paper).

The Gaussians are estimated by arranging the background set into a collection of target trials.

For non-target trials, assume statistical independence, i.e. P_N(z_e, z_t) = P_T(z_e) × P_T(z_t).

Likelihood ratio for speaker verification:

    ∏ P_T(z_e, z_t) / P_N(z_e, z_t)

where the product ranges over the digits in the test utterance and the mixture components in the UBM.
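As an illustrative sketch of one factor of this ratio (toy dimensions and randomly generated covariances, not the trained backend; full covariances are used here instead of the semi-diagonal constraint):

```python
import numpy as np

def gauss_logpdf(x, mu, cov):
    """Log density of a multivariate Gaussian N(x; mu, cov)."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(cov, diff))

rng = np.random.default_rng(1)
d = 4                                     # toy z-vector dimension

# Target-trial joint Gaussian P_T over w = (z_e, z_t), with correlated blocks
A = rng.standard_normal((2 * d, 2 * d))
cov_T = A @ A.T + 2 * d * np.eye(2 * d)   # well-conditioned toy covariance
mu_T = np.zeros(2 * d)

# Non-target hypothesis: independence, P_N(z_e, z_t) = P_T(z_e) * P_T(z_t),
# obtained by suppressing the cross-correlation blocks
cov_N = cov_T.copy()
cov_N[:d, d:] = 0.0
cov_N[d:, :d] = 0.0

z_e = rng.standard_normal(d)
z_t = rng.standard_normal(d)
w = np.concatenate([z_e, z_t])

# Per-component log-likelihood ratio; the full score sums this over
# the digits in the test utterance and the mixture components in the UBM
llr = gauss_logpdf(w, mu_T, cov_T) - gauss_logpdf(w, mu_T, cov_N)
```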

SLIDE 7

Hidden Supervector Backend

For each mixture component, treat z_e, z_t as a pair of hidden mean vectors which are correlated in the case of a target trial.

Use an "i-vector extractor" to do probability calculations (not to extract factors).

The "i-vector" w is the pair (z_e, z_t), so its dimension is twice that of the acoustic feature vectors.

The i-vector model has full rank, so we can take the total variability matrix to be the identity and shift the burden of modeling the correlation between z_e and z_t to the prior.

The prior cannot be standard normal, so it needs to be estimated.

SLIDE 8

Posterior Calculations

For an i-vector extractor with a non-standard prior,

    Cov(w, w) = (P + Σ_c N_c T_c⊤ T_c)⁻¹

    ⟨w⟩ = Cov(w, w) (P µ + Σ_c T_c⊤ F_c)

where µ is the prior expectation and P the precision. (In the standard case, µ = 0 and P = I.)
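A direct numpy transcription of these two posterior formulas, with randomly generated toy statistics (T_c is kept explicit here even though the paper takes the total variability matrix to be the identity):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_comp = 4, 3               # toy "i-vector" dimension and number of components

P = 2.0 * np.eye(d)            # prior precision (non-standard)
mu = rng.standard_normal(d)    # prior expectation

N = rng.uniform(0.1, 1.0, n_comp)                         # zero-order stats N_c
T = [rng.standard_normal((d, d)) for _ in range(n_comp)]  # T_c (identity in the paper)
F = [rng.standard_normal(d) for _ in range(n_comp)]       # first-order stats F_c

# Cov(w, w) = (P + sum_c N_c T_c' T_c)^{-1}
prec = P + sum(N[c] * T[c].T @ T[c] for c in range(n_comp))
cov_w = np.linalg.inv(prec)

# <w> = Cov(w, w) (P mu + sum_c T_c' F_c)
w_mean = cov_w @ (P @ mu + sum(T[c].T @ F[c] for c in range(n_comp)))
```

Setting mu = 0 and P = I recovers the standard i-vector posterior.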

SLIDE 9

Minimum Divergence Estimation of the Prior

We need to supply the mean µ and precision matrix P that specify the prior distribution of "i-vectors" for same-speaker trials.

Arrange the background set into a collection of target trials indexed by s = 1, ..., S, and let w(s) be the "i-vector" for trial s. Then

    µ = (1/S) Σ_s w(s)

    P⁻¹ = (1/S) Σ_s w(s) w(s)⊤ − µµ⊤

Minor modifications make µ and P digit-dependent or impose semi-diagonal constraints.
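These moment-matching estimates can be sketched as follows (the background "i-vectors" are randomly generated toy data here):

```python
import numpy as np

rng = np.random.default_rng(3)
S, d = 500, 4                  # toy number of background target trials, "i-vector" dim

# One "i-vector" w(s) per target trial (rows), toy data with a nonzero mean
W = rng.standard_normal((S, d)) @ rng.standard_normal((d, d)) + 1.0

# mu = (1/S) sum_s w(s)
mu = W.mean(axis=0)

# P^{-1} = (1/S) sum_s w(s) w(s)'  -  mu mu'
cov = (W.T @ W) / S - np.outer(mu, mu)
P = np.linalg.inv(cov)
```

The second formula is just the biased empirical covariance written via the second moment, which is why this is a minimum divergence (moment matching) estimate.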

SLIDE 10

For the different-speaker hypothesis, treat z_e and z_t as statistically independent. In other words, suppress the cross-correlations in the covariance matrix P⁻¹ that defines the prior under the same-speaker hypothesis.

SLIDE 11

Likelihood Ratio

Given data and a probability model with hidden variables, the evidence is the likelihood of the data calculated by integrating out the hidden variables.

For an i-vector model the integral can be evaluated in closed form (it is a Gaussian integral) and expressed in terms of the Baum-Welch statistics (see paper).

To evaluate the likelihood ratio for a speaker verification trial, evaluate the evidence twice:

    using the prior for the same-speaker hypothesis
    using the prior for the different-speaker hypothesis
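The structure of this score can be illustrated with a simplified linear-Gaussian observation model (a toy stand-in for the Baum-Welch-statistics likelihood in the paper, with made-up dimensions and covariances): when the observation is Gaussian around the hidden w, the Gaussian integral yields an evidence whose covariance is the sum of the observation and prior covariances, and the verification score is a difference of two such log-evidences:

```python
import numpy as np

def gauss_logpdf(x, mu, cov):
    """Log density of a multivariate Gaussian N(x; mu, cov)."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(cov, diff))

rng = np.random.default_rng(5)
d = 4
obs_cov = np.eye(2 * d)                 # toy observation noise around w = (z_e, z_t)

A = rng.standard_normal((2 * d, 2 * d))
prior_same = A @ A.T + np.eye(2 * d)    # same-speaker prior covariance
prior_diff = prior_same.copy()          # different-speaker prior: independence
prior_diff[:d, d:] = 0.0
prior_diff[d:, :d] = 0.0

y = rng.standard_normal(2 * d)          # toy stand-in for the observed statistics

# Gaussian integral: int N(y; w, obs_cov) N(w; 0, prior) dw = N(y; 0, obs_cov + prior)
log_ev_same = gauss_logpdf(y, np.zeros(2 * d), obs_cov + prior_same)
log_ev_diff = gauss_logpdf(y, np.zeros(2 * d), obs_cov + prior_diff)
llr = log_ev_same - log_ev_diff
```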

SLIDE 12

Preparing the Baum-Welch Statistics

For each speaker, we have a collection of (enrollment or test) recordings indexed by r.

For each mixture component c, zero and first order statistics are denoted by N_c^r and F_c^r.

Remove the channel effects from each recording and pool over recordings:

    N_c = Σ_r N_c^r

    F_c = Σ_r (F_c^r − N_c^r U_c x_r)

where x_r is a point estimate of the hidden variable x_r in (1).

This gives one set of "synthetic" statistics per speaker (regardless of the number of recordings).
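The pooling step can be sketched as follows (toy sizes and randomly generated statistics; U_c here is the per-component block of the channel loading matrix from (1)):

```python
import numpy as np

rng = np.random.default_rng(6)
feat_dim, n_comp, n_rec, x_dim = 3, 4, 2, 2   # toy sizes

U = [rng.standard_normal((feat_dim, x_dim)) for _ in range(n_comp)]  # U_c blocks

# Per-recording Baum-Welch statistics and channel-factor point estimates
N_r = rng.uniform(0.0, 1.5, (n_rec, n_comp))         # zero-order N_c^r
F_r = rng.standard_normal((n_rec, n_comp, feat_dim)) # first-order F_c^r
x = rng.standard_normal((n_rec, x_dim))              # point estimates of x_r

# Pool over recordings, removing channel effects:
#   N_c = sum_r N_c^r
#   F_c = sum_r (F_c^r - N_c^r U_c x_r)
N = N_r.sum(axis=0)
F = np.stack([
    sum(F_r[r, c] - N_r[r, c] * (U[c] @ x[r]) for r in range(n_rec))
    for c in range(n_comp)
])
```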

SLIDE 13

“Length Normalization” of the Synthetic Statistics

In the JFA model (1), z_c is a hidden variable. Its posterior covariance and expectation, C_c and ⟨z_c⟩, are given by

    C_c = (I + N_c D_c* D_c)⁻¹

    ⟨z_c⟩ = C_c D_c* F_c

so that

    ⟨‖z_c‖²⟩ = ‖⟨z_c⟩‖² + trace(C_c)    (2)

For each speaker, we scale the synthetic first order statistics so that Σ_c ⟨‖z_c‖²⟩ is the same for all speakers.
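A sketch of this normalization (toy sizes and randomly generated statistics; the common target value is an arbitrary choice here, and since D_c is real, D_c* is just its transpose). The key point is that scaling the first-order statistics scales only the ‖⟨z_c⟩‖² term of (2), while trace(C_c) is unaffected:

```python
import numpy as np

rng = np.random.default_rng(7)
feat_dim, n_comp = 3, 4
target = 20.0    # common value imposed on sum_c <||z_c||^2> (arbitrary toy choice)

D = [np.diag(rng.uniform(0.5, 1.0, feat_dim)) for _ in range(n_comp)]  # diagonal D_c
N = rng.uniform(0.1, 2.0, n_comp)                                      # pooled N_c
F = [rng.standard_normal(feat_dim) for _ in range(n_comp)]             # pooled F_c

# Posterior covariance and mean per component
C = [np.linalg.inv(np.eye(feat_dim) + N[c] * D[c].T @ D[c]) for c in range(n_comp)]
z = [C[c] @ D[c].T @ F[c] for c in range(n_comp)]

trace_part = sum(np.trace(C[c]) for c in range(n_comp))  # unaffected by scaling F
mean_part = sum(z[c] @ z[c] for c in range(n_comp))      # scales as s^2 when F -> s F

# Choose s so that sum_c <||z_c||^2> = s^2 * mean_part + trace_part = target
s = np.sqrt((target - trace_part) / mean_part)
F_scaled = [s * F[c] for c in range(n_comp)]
```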

SLIDE 14

The dominant term in (2) is trace(C_c). An experiment in Appendix A demonstrates its usefulness. The posterior covariance matrix C_c depends critically on the relevance factor.

SLIDE 15

128 Mixture Components, Global z-vectors

      System   norm.?   EER (M/F)    DCF (M/F)
    1 GMM               4.8%/8.0%    0.217/0.356
    2 JDB               4.8%/7.6%    0.219/0.353
    3 HSB      ×        4.5%/6.8%    0.201/0.338
    4 HSB      ✓        3.9%/6.1%    0.177/0.307

Table 1: Results on the development set obtained with 128 Gaussians. The systems are a GMM/UBM system, the Joint Density Backend (JDB) and the Hidden Supervector Backend (HSB), both with global z-vectors. Baum-Welch statistics normalization is indicated by "norm.".

SLIDE 16

512 Components, Global z-vectors

      System   r   EER (M/F)    DCF (M/F)
    1 GMM      2   4.7%/8.2%    0.195/0.336
    2 JDB      2   4.3%/6.1%    0.196/0.288
    5 HSB      1   3.3%/4.6%    0.148/0.234

Table 2: Results on the development set obtained with 512 Gaussians and global z-vectors.

SLIDE 17

512 Components, Local z-vectors

    System                    EER (M/F)    DCF (M/F)
    JDB (component fusion)    3.9%/5.2%    0.184/0.259
    HSB (component fusion)    3.6%/3.9%    0.152/0.197
    HSB (forced alignment)    3.5%/4.0%    0.152/0.197

Table 3: Results on the development set obtained with 512 Gaussians, local z-vectors and digit-dependent backends.

SLIDE 18

Fusion of Local and Global

    Set    System   EER (M/F)    DCF (M/F)
    dev    local    3.7%/3.8%    0.149/0.193
    dev    global   3.2%/4.5%    0.148/0.232
    dev    fusion   2.9%/3.6%    0.131/0.186
    eval   local    2.6%/4.5%    0.134/0.211
    eval   global   2.7%/4.7%    0.140/0.236
    eval   fusion   2.3%/4.0%    0.122/0.192

Table 4: Results on the development and evaluation sets obtained with local and global Hidden Supervector systems, 512 components.

SLIDE 19

Conclusion

Modeling uncertainty yields error rate reductions of up to 25% compared with the Joint Density Backend, consistently across all experiments on the RSR Part III task.

This can be achieved without resorting to subspace methods, although our approach can be seen as applying the same idea as the I-Vector Backend (Interspeech 2015) at the level of individual mixture components.

Unlike the I-Vector Backend, the Hidden Supervector Backend can be configured in a way that makes very modest computational demands: with semi-diagonal constraints on the prior, the run-time linear algebra involves only diagonal matrices.
