

SLIDE 1

First Investigations on Self Trained Speaker Diarization

Gaël Le Lan 1,2, Sylvain Meignier 2, Delphine Charlet 1, Anthony Larcher 2

1 Orange Labs, France (first.lastname@orange.com)
2 LIUM, Université du Maine, France (first.lastname@lium.univ-lemans.fr)

June 22, 2016

Gaël Le Lan (Orange Labs/LIUM), Self Trained Speaker Diarization, June 22, 2016

SLIDE 2

Context

Cross-recording speaker diarization of French TV archives
Speaker indexing of collections of multiple recordings
Two-pass approach:
  Speaker segmentation and clustering, within each recording
  Cross-recording speaker linking
State-of-the-art speaker recognition framework:
  i-vector/PLDA hierarchical agglomerative clustering

PLDA maximizes inter-speaker variability while minimizing intra-speaker variability. Using the target data itself as training material, how well can we estimate this variability?
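The two quantities PLDA trades off can be made concrete with a small sketch. This is an illustration, not the authors' code: it estimates the between-class (inter-speaker) and within-class (intra-speaker) covariance matrices from labeled i-vectors; the data and function name are hypothetical.

```python
import numpy as np

def class_covariances(ivectors, labels):
    """Between-class and within-class covariances of labeled vectors.

    Between-class covariance captures inter-speaker variability (what PLDA
    wants large); within-class covariance captures intra-speaker variability
    (what PLDA wants small)."""
    ivectors = np.asarray(ivectors, dtype=float)
    labels = np.asarray(labels)
    mu = ivectors.mean(axis=0)
    dim = ivectors.shape[1]
    Sb = np.zeros((dim, dim))  # between-class (inter-speaker)
    Sw = np.zeros((dim, dim))  # within-class (intra-speaker)
    for spk in np.unique(labels):
        x = ivectors[labels == spk]
        mu_s = x.mean(axis=0)
        d = (mu_s - mu)[:, None]
        Sb += len(x) * (d @ d.T)
        Sw += (x - mu_s).T @ (x - mu_s)
    n = len(ivectors)
    return Sb / n, Sw / n

rng = np.random.default_rng(0)
# Two toy "speakers", 5 sessions each, 3-dimensional vectors
X = np.vstack([rng.normal(0, 1, (5, 3)), rng.normal(4, 1, (5, 3))])
y = [0] * 5 + [1] * 5
Sb, Sw = class_covariances(X, y)
```

With well-separated toy speakers, the between-class covariance dominates the within-class one; with impure automatic clusters, the estimate degrades, which is exactly the question the slide raises.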

SLIDE 3

State-of-the-Art Two-Pass Diarization Framework (baseline)

[Diagram, built up over slides 3 to 15] For each recording: frontend → speaker segmentation → i-vector extraction → similarity scoring → speaker clustering. Across recordings: cross-recording similarity scoring → speaker linking → diarization output (speaker clusters).

The Universal Background Model (UBM), Total Variability (TV) matrix and PLDA parameters are trained on external train data, labeled by speaker (baseline, supervised). The target data is unlabeled, creating an acoustic mismatch between train and target conditions.

Gaël Le Lan (Orange Labs/LIUM), Self Trained Speaker Diarization, slide 3 / 15
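The two-pass structure above can be sketched as follows. This is a toy, self-contained illustration of the control flow only (greedy threshold clustering on 2-D vectors stands in for i-vector/PLDA hierarchical clustering; all data and helper names are hypothetical):

```python
import numpy as np

def greedy_cluster(vectors, threshold=0.9):
    """Assign each vector to the first cluster whose representative
    (its first member, length-normalized) is similar enough."""
    reps, labels = [], []
    for v in vectors:
        v = v / np.linalg.norm(v)
        sims = [float(v @ r) for r in reps]
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))
        else:
            labels.append(len(reps))
            reps.append(v)
    return labels, reps

def two_pass_diarization(recordings, threshold=0.9):
    # Pass 1: cluster segments within each recording
    per_rec = [greedy_cluster(rec, threshold) for rec in recordings]
    # Pass 2: link the per-recording cluster representatives across recordings
    all_reps = [r for _, reps in per_rec for r in reps]
    global_labels, _ = greedy_cluster(all_reps, threshold)
    return global_labels

rec_a = [np.array([1.0, 0.0]), np.array([0.99, 0.05]), np.array([0.0, 1.0])]
rec_b = [np.array([0.98, 0.1]), np.array([0.05, 1.0])]
labels = two_pass_diarization([rec_a, rec_b])  # one label per local cluster
```

The point of the two-pass design is visible even in the toy: each recording is resolved locally first, and only cluster-level representatives are compared across recordings.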

SLIDE 16

"Self Trained" Framework

[Diagram] Same per-recording pipeline, but the UBM, TV matrix and PLDA parameters are trained on the unlabeled target data itself (self trained, unsupervised) instead of external labeled train data, avoiding the acoustic mismatch.

SLIDE 17

Adapted Framework

[Diagram] A third variant: the models trained on external labeled train data are adapted with the unlabeled target data (adapted, unsupervised), combining both sources.

SLIDE 18

"Self Trained" Diarization? (1/2)

Goal: avoid acoustic mismatch by using the target data as training material.
Requirements to train an i-vector/PLDA system:
  UBM/TV: clean speech segments (straightforward)
  PLDA: several sessions per speaker, in various acoustic conditions

Are there several speakers appearing in different episodes?
Assuming we know how to effectively cluster the target data, can we train a system with those clusters?

SLIDE 19

Which Data?

200 hours of French broadcast news (drawn from the REPERE, ETAPE and ESTER evaluation campaigns). Two shows selected as target corpora: LCP Info and BFM Story; the train corpus consists of all other recordings.

Corpus                                        LCPtarget   BFMtarget
#Episodes                                     45          42
Episode duration                              25m         60m
Evaluated (labeled) speech duration           10h08m      19h57m
One-time speakers                             127         345
Recurring speakers (2+ occurrences)           93          77
Recurring speakers (3+ occurrences)           48          35
Total speakers                                220         422
One-time speakers, speech proportion          20.12%      44.84%
Recurring (2+) speakers, speech proportion    79.88%      55.16%
Recurring (3+) speakers, speech proportion    67.06%      45.94%
Average speaker time per episode              1m08s       1m58s

Table: Composition of target corpora.

SLIDE 20

Oracle Framework

[Diagram] Same pipeline, but the models are trained on the target data using its reference labels (oracle, supervised): an upper bound on what self training could achieve.

DER (%)                  LCPtarget   BFMtarget
baseline (supervised)    17.72       13.22
oracle (supervised)      10.87       X

(X: PLDA parameter estimation does not converge; see the next slide.)

SLIDE 22

Minimum Requirements for PLDA Parameter Estimation

Oracle experiment:
For the LCPtarget corpus, suitable PLDA parameters can be estimated with a minimum of 37 episodes (40 recurring speakers, appearing in 7.2 episodes on average).
For the BFMtarget corpus, the EM algorithm does not converge, even with all episodes (35 recurring speakers, appearing in 5.45 episodes on average).

SLIDE 23

"Self Training" Experiment

[Diagram] The self-trained system is evaluated with two scoring backends for the within- and cross-recording clustering steps: cosine-based and PLDA-based, with all models trained on the unlabeled target data only.

DER (%)                  LCPtarget   BFMtarget
baseline (supervised)    17.72       13.22
oracle (supervised)      10.87       X
self trained, cosine     29.68       27.62
self trained, PLDA       X           X

SLIDE 25

"Self Trained" Diarization? (2/2)

Using the target data as training material, we can train the full system (UBM, TV matrix, PLDA parameters):
UBM/TV: segments produced by speaker segmentation, keeping only segments of 10+ seconds; the BIC parameters are chosen so that segments can be considered pure (a single speaker per segment).
PLDA: requires several sessions per speaker, from various episodes (3+); perform an i-vector/cosine cross-recording speaker diarization, then use the resulting (impure) speaker clusters to perform i-vector normalization and estimate the PLDA parameters.
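The data-selection rules above can be sketched directly. This is a hedged illustration with hypothetical data structures, not the S4D implementation: long segments are kept for UBM/TV training, and only pseudo-speaker clusters spanning enough distinct episodes are kept for PLDA estimation.

```python
def select_training_material(segments, clusters,
                             min_seg_dur=10.0, min_episodes=3):
    """segments: list of (episode_id, duration_seconds) tuples.
    clusters: dict mapping pseudo-speaker -> list of episode_ids it appears in.
    Returns (segments for UBM/TV training, clusters usable for PLDA)."""
    # UBM/TV: keep only segments of min_seg_dur seconds or more
    ubm_tv = [s for s in segments if s[1] >= min_seg_dur]
    # PLDA: keep only pseudo-speakers seen in min_episodes distinct episodes
    plda = {spk: eps for spk, eps in clusters.items()
            if len(set(eps)) >= min_episodes}
    return ubm_tv, plda

segs = [("ep1", 12.5), ("ep1", 4.0), ("ep2", 30.0)]
clus = {"spkA": ["ep1", "ep2", "ep3"], "spkB": ["ep1", "ep1"]}
ubm_tv, plda = select_training_material(segs, clus)
```

Note the `set(eps)` call: a pseudo-speaker seen several times in a single episode still provides only one acoustic condition, which is not enough for PLDA.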

SLIDE 26

Adaptation Experiment

[Diagram, built up over slides 26 to 29] Same pipeline; the models trained on external labeled train data are adapted with the pseudo-labeled target data.

DER (%)                  LCPtarget   BFMtarget
baseline (supervised)    17.72       13.22
oracle (supervised)      10.87       X
self trained, cosine     29.68       27.62
self trained, PLDA       X           X
adapted                  16.58       12.60
adapted, tgt.            X           X
adapted, both            15.60       11.38
adapted, iter            15.52       11.56
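One common way to combine out-of-domain and in-domain statistics, sketched below, is to linearly interpolate the PLDA covariance matrices estimated on train and target data. This is an assumption for illustration, not necessarily the paper's exact adaptation method; the function and weights are hypothetical.

```python
import numpy as np

def interpolate_plda(Sb_train, Sw_train, Sb_target, Sw_target, alpha=0.5):
    """Linearly interpolate between- and within-class covariances.

    alpha weights the target-domain statistics (alpha=0: train only,
    alpha=1: target only)."""
    Sb = alpha * Sb_target + (1 - alpha) * Sb_train
    Sw = alpha * Sw_target + (1 - alpha) * Sw_train
    return Sb, Sw

# Toy diagonal covariances standing in for real PLDA statistics
Sb_tr, Sw_tr = np.eye(2) * 2.0, np.eye(2) * 1.0
Sb_tg, Sw_tg = np.eye(2) * 4.0, np.eye(2) * 3.0
Sb, Sw = interpolate_plda(Sb_tr, Sw_tr, Sb_tg, Sw_tg, alpha=0.5)
```

The interpolation weight is exactly the "weighting variability between train and target" the conclusions list as future work: tuning alpha trades robustness of the external statistics against in-domain fit.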

SLIDE 30

System Parameters

Toolkit: SIDEKIT for Diarization (S4D) [1]
Frontend: 13 MFCC + ∆ + ∆∆
UBM: GMM with 256 components, diagonal covariance matrices
TV matrix rank: 200
PLDA eigenvoice matrix rank: 100 (no eigenchannel matrix)
Speaker clustering/linking: Connected Components + Hierarchical Agglomerative Clustering
Evaluation metric: Diarization Error Rate (DER), 250 ms collar

[1] http://lium.univ-lemans.fr/sidekit/s4d
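The hierarchical agglomerative clustering stage can be sketched with SciPy. This is an illustrative sketch, not the S4D code: it uses cosine distance between vectors and a hypothetical stopping threshold, whereas the real system scores with PLDA.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def hac_cluster(ivectors, threshold=0.5):
    """Average-link HAC over cosine distances, cut at a distance threshold."""
    dists = pdist(ivectors, metric="cosine")  # condensed distance matrix
    tree = linkage(dists, method="average")   # agglomerative merge tree
    # Stop merging once cluster distance exceeds the threshold
    return fcluster(tree, t=threshold, criterion="distance")

# Four toy 2-D "i-vectors": two near (1, 0) and two near (0, 1)
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = hac_cluster(X, threshold=0.5)
```

The stopping threshold plays the role of the clustering threshold tuned in the real system; with PLDA scoring one would cut on a log-likelihood-ratio threshold instead of a cosine distance.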

SLIDE 31

Recap

baseline: supervised training, external train data
oracle: supervised training, target data
target-only: unsupervised training, target data
target-adapted: unsupervised domain adaptation, train + target data

SLIDE 32

Conclusions

External data is required for bootstrapping.
Unlabeled target data, imperfectly clustered, yields better results when used for domain adaptation.
Future work:
  Improve the adaptation framework (weighting variability between train and target; iterative procedure)
  Bootstrap with a huge unlabeled dataset

SLIDE 33

Data Details

                                               LCPtarget           BFMtarget
                                               sup      unsup      sup      unsup
Data used for UBM/TV training                  9h56m    19h17m     19h50m   39h09m
Average session duration for UBM/TV training   1m08s    4m23s      1m58s    4m10s
Number of speaker classes for PLDA training    47       130        35       190
Average number of sessions per speaker class   7.31     5.25       5.45     4.34
Average session duration                       1m10s    1m25s      2m50s    2m07s

Table: Composition of data used for UBM, TV and PLDA estimation, for both corpora.