SLIDE 1

Outline
• Text-Dependent Speaker Recognition
• Three JFA-Based Classifiers and a Bunch of Anomalies
• Evaluating JFA likelihood ratios with UBM adaptation
• Conclusion

Joint Factor Analysis for Text-Dependent Speaker Verification

Patrick Kenny, Themos Stafylakis, Md. Jahangir Alam, Pierre Ouellet and Marcel Kockmann

Odyssey Speaker and Language Recognition Workshop

June, 2014

1 / 20 P. Kenny, T. Stafylakis, J. Alam et al., Text-Dependent JFA

SLIDE 2

Text-dependent speaker recognition

• Lexical constraints enable speaker verification with short utterances
• The classes to be recognized are speaker-phrase combinations rather than speakers as such
• Speaker-phrase variability cannot generally be modeled using subspace methods (i-vectors or speaker factors)
• Achieving channel robustness is hard
• Left-to-right structure could be exploited, but this would complicate channel modeling
• There is no such thing as a "universal" UBM

SLIDE 3

JFA without speaker factors

• Given parallel recordings of a phrase by a speaker, indexed by r
• Each recording is assumed to be characterized by a GMM whose mean vectors are of the form m_c + U_c x_r + D_c z_c, where the hidden variables x_r and z_c have standard normal priors (c indexes mixture components)
• U x_r models channel variability; D z models speaker-phrase variability
• Typically U is estimated by maximum likelihood II and D_c by relevance MAP
• The prior on z is factorial in the sense that P(z) = ∏_c P(z_c)
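The mean decomposition can be sketched numerically as follows. All dimensions, values, and variable names here are illustrative assumptions, chosen only to show the shapes involved, not the paper's configuration.

```python
import numpy as np

# Sketch of the JFA mean model mu_cr = m_c + U_c x_r + D_c z_c for a single
# mixture component c and a single recording r. Hypothetical dimensions.
F, R_x = 4, 2                        # feature dim, number of channel factors
rng = np.random.default_rng(0)

m_c = rng.normal(size=F)             # UBM mean for component c
U_c = rng.normal(size=(F, R_x))      # channel subspace (estimated by ML-II)
D_c = np.diag(rng.normal(size=F))    # diagonal term for speaker-phrase effects

x_r = rng.normal(size=R_x)           # channel factors, standard normal prior
z_c = rng.normal(size=F)             # speaker-phrase factors, standard normal prior

mu_cr = m_c + U_c @ x_r + D_c @ z_c  # adapted GMM mean for component c
```

Note that x_r is shared across all components of recording r, while z_c varies per component, which is what makes the prior on z factorial.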

SLIDE 4

Vogt’s algorithm [Vogt, 2008]

• Starting from Baum-Welch statistics, the hidden variables can be estimated by alternating between x and z
• This is a variational Bayes algorithm, so it produces a variational lower bound which can be used for likelihood (or evidence) calculations
• For example, speaker verification decisions can be made by Bayesian model selection, in the same way as in PLDA: given enrollment and test utterances, is the data better accounted for by positing one z-vector or two?
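A toy version of the alternating estimation can be written down for a single Gaussian with unit covariance and a synthetic centered statistic. This is a simplified sketch under those assumptions, not the paper's multi-component algorithm.

```python
import numpy as np

# Toy alternating estimation for the model f ≈ m + U x + D z, where f stands
# in for a centered first-order statistic. Dimensions and values are
# illustrative, not the paper's configuration.
rng = np.random.default_rng(1)
F, R_x = 6, 2                       # feature dim, number of channel factors
U = rng.normal(size=(F, R_x))       # channel loading matrix
D = 0.5 * np.eye(F)                 # diagonal speaker-phrase term
m = np.zeros(F)                     # background mean
f = m + U @ rng.normal(size=R_x) + D @ rng.normal(size=F)  # synthetic statistic

x, z = np.zeros(R_x), np.zeros(F)
for _ in range(50):
    # MAP update of x given z: the standard normal prior adds I to the precision
    x = np.linalg.solve(np.eye(R_x) + U.T @ U, U.T @ (f - m - D @ z))
    # MAP update of z given x, symmetrically
    z = np.linalg.solve(np.eye(F) + D.T @ D, D.T @ (f - m - U @ x))

residual = np.linalg.norm(f - m - U @ x - D @ z)
```

Because the joint objective is a strictly convex quadratic, this coordinate ascent converges to the joint MAP estimate; the priors shrink x and z, so the residual does not go to zero.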

SLIDE 5

Zhao and Dong’s algorithm [Zhao, 2010]

• Baum-Welch statistics ought to be collected with GMMs adapted from the UBM (rather than with the UBM itself)
• Introduce extra hidden variables to account for the alignment between frames and mixture components
• Variational Bayes and Bayesian model selection carry over by extending Vogt's algorithm
• Caveat: we are not at liberty to adapt the UBM using some of the hidden variables but not others; this is a problem
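The alignment step referred to above is a standard E-step: posterior responsibilities of each mixture component for each frame, computed with (possibly adapted) means. The sketch below assumes unit covariances and synthetic values for illustration; it is not the paper's implementation.

```python
import numpy as np

# Generic frame-to-component alignment: responsibility of each mixture
# component for each frame under a GMM with the given means and weights.
def responsibilities(frames, means, weights):
    # Squared distance from every frame to every component mean
    d2 = ((frames[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    # log weight + log N(frame | mean, I), dropping the shared constant
    log_post = np.log(weights)[None, :] - 0.5 * d2
    log_post -= log_post.max(axis=1, keepdims=True)   # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

means = np.array([[0.0, 0.0], [3.0, 3.0]])
frames = np.array([[0.0, 0.0], [3.0, 3.0], [1.5, 1.5]])
gamma = responsibilities(frames, means, weights=np.array([0.5, 0.5]))
```

Collecting Baum-Welch statistics with adapted means changes gamma, which is exactly why the alignment hidden variables interact with the other factors.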

SLIDE 6

JFA-Based Classifiers

z-vectors as features [Kenny, 2014]
• UBM adaptation results in a severe degradation
• Maximum likelihood II misbehaves in the absence of UBM adaptation

Bayesian model selection
• UBM adaptation helps for small codebooks (64 Gaussians)
• UBM adaptation hurts for large codebooks (512 Gaussians)

Traditional JFA likelihood ratios (JFA-LLR)
• Works better than Bayesian model selection, even without UBM adaptation
• Can be made to work much better (40% error rate reductions) with careful UBM adaptation

SLIDE 7

Anomalous results I

Bayesian model selection, Vogt vs. Zhao & Dong:

                512 Gaussians           64 Gaussians
                EER     2008 NDCF       EER     2008 NDCF
Vogt            2.2%    0.085           3.6%    0.145
Zhao & Dong     2.7%    0.096           3.4%    0.133

• With 64 Gaussians, UBM adaptation is helpful
• With 512 Gaussians, UBM adaptation is not helpful
• Most experiments were conducted on a "hard" subset of RSR2015 (generally with 64 Gaussians)

SLIDE 8

Anomalous results II

Bayesian model selection versus traditional JFA log likelihood ratios (JFA-LLRs):

                512 Gaussians           64 Gaussians
                EER     2008 NDCF       EER     2008 NDCF
Vogt            2.2%    0.085           3.6%    0.145
Zhao & Dong     2.7%    0.096           3.4%    0.133
JFA-LLR         1.7%    0.065           2.7%    0.110

• Traditional JFA-LLRs work better than Bayesian model selection for both 512 and 64 Gaussians
• No UBM adaptation was used in the traditional JFA-LLR calculation

SLIDE 9

The paper in a nutshell

• JFA-LLR results can be substantially improved (40% reductions in error rates) with careful UBM adaptation
• Adapting to lexical content, to speaker effects in enrollment utterances, and to channel effects in test utterances are all helpful; do not adapt to speaker effects in test utterances
• Maximum likelihood II estimation works properly if LLRs are evaluated with UBM adaptation
• Even so, JFA works better as a feature extractor; in this case, UBM adaptation is not helpful
• The factorial prior is too weak

SLIDE 10

Phrase-dependent background modeling (PBM)

        a-b-c    EER     2008 NDCF
UBM     0-1-1    2.7%    0.110
PBM     1-1-1    2.1%    0.092

• In traditional JFA the numerator of the LLR is evaluated by comparing the test speaker to the "UBM speaker"
• In text-dependent speaker recognition, lexical content introduces a mismatch with the UBM
• Mean supervectors can be made phrase-dependent, with the other JFA parameters shared across phrases
• a-b-c counts alignment iterations in phrase adaptation (a), in processing enrollment utterances (b), and in processing test utterances (c)
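The adaptation mechanism behind phrase-dependent background modeling is relevance MAP. Below is a minimal sketch of mean-only relevance MAP; the function name and the relevance-factor value are illustrative assumptions, not the paper's settings.

```python
import numpy as np

# Mean-only relevance MAP for one mixture component: a count-weighted blend
# of the UBM mean and the data mean. Illustrative sketch.
def relevance_map_mean(m_ubm, N_c, f_c, r=16.0):
    """N_c: zero-order Baum-Welch statistic (soft frame count)
    f_c: first-order Baum-Welch statistic (soft sum of frames)
    r:   relevance factor; larger r keeps the result closer to the UBM
    """
    alpha = N_c / (N_c + r)
    data_mean = f_c / max(N_c, 1e-10)
    return alpha * data_mean + (1.0 - alpha) * m_ubm

m_ubm = np.zeros(2)
# Abundant data: the adapted mean approaches the data mean [1, 2] ...
near_data = relevance_map_mean(m_ubm, N_c=1e6, f_c=np.array([1e6, 2e6]))
# ... no data: it stays at the UBM mean.
near_ubm = relevance_map_mean(m_ubm, N_c=0.0, f_c=np.zeros(2))
```

Running this update per phrase over the phrase's pooled statistics yields phrase-dependent mean supervectors while the remaining JFA parameters stay shared.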

SLIDE 11

Adapting to the channel effects in the test utterance

                                 a-b-c    EER     2008 NDCF
channel factors                  1-1-1    2.1%    0.092
channel factors + adaptation     1-1-5    2.0%    0.086

• Traditional JFA uses the UBM (or PBM) to align the test data and integrates over channel factors
• Eigenchannel modeling as originally conceived involved multiple alignment iterations
• Variational Bayes enables you to do both

SLIDE 12

Adapting to the speaker effects in the enrollment utterances

a-b-c    EER     2008 NDCF
1-1-1    2.1%    0.092
1-5-1    2.0%    0.080

• To evaluate the denominator of the likelihood ratio, collect Baum-Welch statistics with a GMM that has been adapted to the speaker effects in the enrollment utterances
• Do not adapt to the speaker effects in the test utterance

SLIDE 13

Phrase-dependent background modeling – again

a-b-c    EER     2008 NDCF
5-1-1    1.7%    0.076
5-5-1    1.7%    0.070
5-5-5    1.6%    0.066

• Estimating PBMs with 5 iterations of relevance MAP rather than one gives another major improvement

SLIDE 14

Maximum likelihood II works with UBM adaptation

                 a-b-c    EER     2008 NDCF
relevance MAP    5-5-1    1.7%    0.070
diagonal D_c     5-5-1    1.7%    0.069
full D_c         5-5-1    1.7%    0.065

• For relevance MAP, D_c* Σ_c⁻¹ D_c = (1/r) I (r = relevance factor)
• Maximum likelihood II estimation of D only works if multiple alignment iterations are performed in JFA training (and at enrollment time)
• If D_c is taken to be full, it turns out to be of low rank (compare [Hasan, 2013])

SLIDE 15

JFA as a feature extractor works better than JFA-LLR

512 Gaussians          a-b-c    EER     2008 NDCF
JFA-LLR                5-1-1    1.5%    0.062
JFA-LLR                5-5-5    2.0%    0.083
z-vectors              5-1-     1.3%    0.056
z-vectors + NAP        5-1-     1.4%    0.055

• For LLR-based classification, 512 Gaussians without adaptation (5-1-1) performs similarly to 64 Gaussians with adaptation (5-5-5)
• 5-1- indicates that one alignment iteration is used in JFA feature extraction (for both enrollment and test utterances)
• Phrase-dependent background modeling is helpful, but NAP is not needed

SLIDE 16

z-vector results on the full evaluation set

Female trials (columns 1 and 2), male trials (columns 3 and 4); RSR2015 Part I evaluation set, lexically mismatched target trials excluded:

                                  EER      DCF      EER      DCF
Vogt + cosine + no norm           0.92%    0.045    0.61%    0.038
Vogt + cosine + s-norm            0.61%    0.027    0.44%    0.028
Zhao & Dong + cosine + s-norm     0.93%    0.042    0.79%    0.042
GMM/UBM + t-norm                  1.06%    0.045    0.60%    0.034

• 512 Gaussians
• Do not adapt the UBM in extracting z-vectors
• Neither NAP nor PLDA is needed
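Cosine scoring with symmetric score normalization (s-norm), as in the z-vector rows, can be sketched as follows. The z-vector dimension, the cohort construction, and all data below are synthetic assumptions for illustration, not the paper's setup.

```python
import numpy as np

# Cosine scoring of z-vectors with s-norm: z-normalize the raw score against
# cohort scores computed from each side of the trial, then average.
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def s_norm_score(enrol, test, cohort):
    raw = cosine(enrol, test)
    e = np.array([cosine(enrol, c) for c in cohort])  # enrol vs cohort
    t = np.array([cosine(test, c) for c in cohort])   # test vs cohort
    return 0.5 * ((raw - e.mean()) / e.std() + (raw - t.mean()) / t.std())

rng = np.random.default_rng(2)
cohort = rng.normal(size=(50, 16))           # impostor cohort z-vectors
speaker = rng.normal(size=16)                # enrollment z-vector
same = speaker + 0.01 * rng.normal(size=16)  # test from the same speaker
other = rng.normal(size=16)                  # test from a different speaker

target_score = s_norm_score(speaker, same, cohort)
impostor_score = s_norm_score(speaker, other, cohort)
```

The symmetric form makes the score distribution comparable across enrollment and test conditions, which is what allows a single decision threshold.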

SLIDE 17

Conclusions for likelihood ratios

• Adapting the UBM to lexical content, to speaker effects in enrollment utterances, and to channel effects in test utterances are all effective
• Adapting the UBM to speaker effects in test utterances is treacherous
• This accounts for the difference in performance between Bayesian model selection with Zhao & Dong's algorithm and JFA-LLR with UBM adaptation
• In theory, the numerator of the LLR ought to be evaluated by integrating over the speaker population rather than by plugging in the "UBM speaker"

SLIDE 18

• In the case of a single test utterance, the factorial prior allows mixture components to move in statistically independent ways under UBM adaptation: too much freedom
• In the case of multiple enrollment utterances, mixture components are constrained to move in the same way for each utterance, so adaptation to speaker effects in enrollment utterances is OK
• Subspace priors do not suffer from this problem, so adaptation to channel effects in test utterances is OK
• Even though 40% error rate reductions can be achieved by adapting the UBM judiciously in evaluating LLRs, JFA works better as a feature extractor than as a monolithic classifier

SLIDE 19

Conclusions for z-vectors

• UBM adaptation to speaker effects needs to be avoided in extracting z-vectors from test utterances, so it needs to be avoided in extracting z-vectors from enrollment utterances as well
• Fortunately, adaptation appears to be unnecessary if a large UBM (512 Gaussians) is used, assuming no gross mismatches
• Phrase-dependent background modeling works very well: a 50% reduction in error rates compared with ICASSP 2014
• This also provides a simple, effective method for domain adaptation (Interspeech 2014)
• z-vectors are robust to channel effects; NAP or PLDA modeling in the back-end provides no benefit

SLIDE 20

References

[Vogt, 2008] R. J. Vogt and S. Sridharan, "Explicit modeling of session variability for speaker verification," Computer Speech and Language, 2008.
[Zhao, 2010] X. Zhao and Y. Dong, "Variational Bayesian Joint Factor Analysis Models for Speaker Verification," IEEE Trans. ASLP, Mar. 2012.
[Kenny, 2014] P. Kenny, T. Stafylakis, P. Ouellet, and M. J. Alam, "JFA-based front ends for speaker recognition," ICASSP 2014.
[Hasan, 2013] T. Hasan and J. H. L. Hansen, "Acoustic factor analysis for robust speaker verification," IEEE Trans. ASLP, 2013.
