University of the Basque Country (EHU) Systems for the NIST 2011 LRE

SLIDE 1
Train and development data System description Analysis of the results Conclusions

University of the Basque Country (EHU) Systems for the NIST 2011 LRE

Mikel Penagarikano, Amparo Varona, Luis Javier Rodríguez-Fuentes, Mireia Diez, Germán Bordel

GTTS, Dept. Electricity and Electronics University of the Basque Country (EHU) Leioa, Spain mikel.penagarikano@ehu.es

NIST 2011 LRE Workshop Atlanta (Georgia), USA December 6-7, 2011

EHU Systems for LRE11 (Atlanta, December 6-7 2011)

SLIDE 2

Outline

1. Train and development data
   - New target languages
   - Data partitioning
2. System description
   - Short description
   - Phonotactic subsystems
   - Acoustic subsystems
   - Backend & Fusion
   - Submission
3. Analysis of the results
   - Subsystem comparison
   - Post-eval analysis
4. Conclusions

SLIDE 3

New target languages

9 new target languages: Arabic Iraqi, Arabic Levantine, Arabic Maghrebi, Arabic MSA, Czech, Lao, Panjabi, Polish, Slovak.

NIST data: 100 30-second segments per new language, randomly split into two halves:

- lre11-train, for training
- lre11-dev, for development/test

Additional data used by the BLZ consortium (BLZ-train)¹:

- Arabic Iraqi: CTS from LDC2006S45
- Arabic Levantine: CTS from LDC2006S29
- Arabic Maghrebi: BN speech from Arrabia TV (Morocco)
- Arabic MSA: BN speech from Kalaka-2 (Al Jazeera)
- Czech: BN speech from the COST278 BN database; telephone speech from LDC2000S89 and LDC2009S02
- Lao: telephone speech from VOA3 (LRE09)
- Panjabi: no data
- Polish: BN speech from Telewizja Polska
- Slovak: BN speech from the COST278 BN database

¹ Broadcast news speech was downsampled to 8 kHz and processed with the Filtering and Noise Adding Tool (FANT) to simulate a telephone channel.

SLIDE 4

Data partitioning

Development: restricted to segments audited by NIST.

- The evaluation set of NIST 2007 LRE
- The evaluation set of NIST 2009 LRE
- lre11-dev

8500 30-second segments in total.

Train: 66 training subsets, including target and non-target languages:

- CTS from previous LREs (18 subsets)
- Narrowband speech (telephone speech?) from VOA/LRE2009 (30 subsets)
- lre11-train (9 subsets)
- BLZ-train (9 subsets)

35000 long (>30-second) segments in total.

SLIDE 5

Short description

High-level subsystems (phonotactic):

- Czech phone-lattice phonotactic SVM
- Hungarian phone-lattice phonotactic SVM
- Russian phone-lattice phonotactic SVM

Low-level subsystems (acoustic):

- Linearized Eigenchannel GMM (Dot-Scoring) with channel-compensated statistics
- Generative iVectors

Common back-end processing:

- Optional ZT-norm
- Generative backend
- Multiclass linear logistic regression
- Minimum expected cost Bayes decisions
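The optional ZT-norm above can be sketched as a Z-norm (per language model, against a cohort of impostor segments) followed by a T-norm (per test segment, against a cohort of impostor models). A minimal NumPy sketch under a simplifying assumption: the T-cohort scores are taken as already Z-normalized, which is not necessarily how the actual EHU system chained the two steps.

```python
import numpy as np

def zt_norm(scores, z_cohort, t_cohort):
    """ZT-norm a (n_models, n_tests) score matrix.

    z_cohort : (n_models, n_z) each language model scored on Z-cohort segments
    t_cohort : (n_t, n_tests)  T-cohort models scored on each test segment
               (assumed already Z-normalized -- a simplification)
    """
    # Z-norm: per-model normalization using Z-cohort statistics
    z = (scores - z_cohort.mean(axis=1, keepdims=True)) \
        / z_cohort.std(axis=1, keepdims=True)
    # T-norm: per-test-segment normalization using T-cohort statistics
    return (z - t_cohort.mean(axis=0, keepdims=True)) \
        / t_cohort.std(axis=0, keepdims=True)
```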

SLIDE 6

Disk failure

Two weeks before the submission deadline, a mechanical disk failure caused the loss of the LRE11 data:

- Indexes (VOA time marks)
- Speech wave files
- Baum-Welch statistics
- Expected counts of n-grams (up to 4-grams)

There was no time to start again (nor money for professional data recovery). Partial copies were found of:

- Channel-compensated Baum-Welch statistics
- Expected counts of 3-grams

The submission was adapted to use the available data (speech signals, statistics, etc.):

- The phonotactic subsystems were limited to 3-grams.
- iVectors were computed in the channel-compensated sufficient-statistics space.

See: Stuck inside of a disk failure

SLIDE 7

Phonotactic subsystems

Common approach to SVM-based phonotactic language recognition:

Phone Decoder → Phone-state Posteriors → Lattice → Expected counts of n-grams → SVM-based Language Models

Freely available software was used in all the stages:

- Phone decoders: TRAPS/NN phone decoders developed by BUT for Czech (CZ), Hungarian (HU) and Russian (RU)
- Phone-state posteriors & lattices: HTK, along with the BUT recipe
- Expected counts of n-grams: the lattice-tool from SRILM
- SVM modeling: LIBLINEAR (a fast linear-only version of libSVM), modified by adding a few lines of code to output regression values instead of class labels

SLIDE 8

Experimental setup

- An energy-based voice activity detector is applied to split signals and remove long-duration non-speech segments.
- The non-phonetic units int (intermittent noise), pau (short pause) and spk (non-speech speaker noise) are mapped to a single non-phonetic unit.
- A ranked (frequency-based) sparse representation is used, involving only the M most frequent features (unigrams + bigrams + ... + n-grams).
- SVM vectors consist of expected counts of phone n-grams extracted from the lattices, converted to frequencies and weighted with regard to their background probabilities as:

w_i = 1 / p(d_i | background)

The SVM language models are trained using an L2-regularized L1-loss support vector classification solver.
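As a toy illustration of the feature construction above, the sketch below turns expected n-gram counts into frequencies over a ranked vocabulary of the M most frequent n-grams and applies the w_i = 1 / p(d_i | background) weighting. The function name and the dictionary-based interface are illustrative assumptions, not the actual EHU code.

```python
def weighted_ngram_vector(expected_counts, vocab, background):
    """Build an SVM feature vector from expected phone n-gram counts.

    expected_counts : dict mapping n-gram -> expected count (from the lattice)
    vocab           : the M most frequent n-grams (ranked sparse representation)
    background      : dict mapping n-gram -> p(d_i | background)
    """
    total = sum(expected_counts.values())
    vec = []
    for ngram in vocab:
        # expected count -> frequency
        freq = expected_counts.get(ngram, 0.0) / total if total > 0 else 0.0
        # weight by w_i = 1 / p(d_i | background)
        vec.append(freq / background[ngram])
    return vec
```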

SLIDE 9

Acoustic subsystems

Both subsystems share the acoustic front end: 7 MFCC + SDC (7-2-3-7) features and a gender-independent 1024-mixture GMM.

Dot-Scoring:
- Statistics extraction → Channel compensation → Dot-Scoring
- Channel matrix estimated using only target-language data
- 500 channels, 10 ML-MD iterations

Generative iVector subsystem:
- iVector extraction → Generative Gaussian Language Models
- Total variability matrix estimated using only target-language data
- 500 dimensions, 10 ML-MD iterations
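A 7-2-3-7 SDC configuration (N cepstra, delta spread d, shift P, k blocks) stacks k shifted delta vectors per frame, giving k*N = 49 dimensions on top of the 7 statics. A minimal sketch, with edge frames clamped to the signal boundaries (an assumption; real front ends differ in their padding policy):

```python
import numpy as np

def sdc(cepstra, N=7, d=2, P=3, k=7):
    """Shifted Delta Cepstra with an N-d-P-k (here 7-2-3-7) configuration.

    cepstra : (T, N) array of MFCC frames.
    For frame t and block i (0..k-1):
        delta_i(t) = c[t + i*P + d] - c[t + i*P - d]
    Out-of-range context frames are clamped to the signal edges.
    Returns (T, k*N) SDC features (stacked with the statics -> N + k*N dims).
    """
    T = len(cepstra)
    out = np.zeros((T, k * N))
    for t in range(T):
        for i in range(k):
            hi = min(t + i * P + d, T - 1)
            lo = max(min(t + i * P - d, T - 1), 0)
            out[t, i * N:(i + 1) * N] = cepstra[hi] - cepstra[lo]
    return out
```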

SLIDE 10

Backend & Fusion

- An independent backend and fusion was estimated for each nominal duration (3, 10 and 30 seconds).
- Both the backend and the fusion were estimated with the FoCal toolkit.
- A ZT-norm was optionally applied to the scores prior to the backend.
- Each subsystem produced 66 scores, which were mapped to 24 target languages by means of a generative Gaussian backend.
  (Discriminative Gaussian backends were tried but showed no improvement on development.)
- Multiclass linear logistic regression fusion was applied.
  (Pairwise and language-family-wise regressions were tried but showed no improvement on development.)
- Minimum expected cost Bayes decisions were made.
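The generative Gaussian backend above can be sketched as one Gaussian per language with a shared covariance, estimated on development score vectors; the decision then reduces to maximum posterior, which is the minimum expected cost Bayes decision under a uniform (0/1) cost. Dimensions are reduced here for illustration (the real backend maps 66 scores to 24 languages), and this is a sketch, not the FoCal implementation.

```python
import numpy as np

def train_gaussian_backend(X, y, n_classes):
    """One mean per class + shared full covariance (generative Gaussian backend)."""
    d = X.shape[1]
    means = np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])
    centered = X - means[y]                       # remove each sample's class mean
    cov = centered.T @ centered / len(X) + 1e-6 * np.eye(d)  # shared covariance
    return means, np.linalg.inv(cov)

def backend_log_likelihoods(X, means, inv_cov):
    """Per-class log-likelihoods (up to a shared constant) of score vectors X."""
    diff = X[:, None, :] - means[None, :, :]      # (n, n_classes, d)
    return -0.5 * np.einsum('ncd,de,nce->nc', diff, inv_cov, diff)

def bayes_decide(llks, log_priors):
    """Minimum expected cost decision under a uniform cost: maximum posterior."""
    return np.argmax(llks + log_priors, axis=1)
```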

SLIDE 11

Submission

One primary and three contrastive systems were submitted, each including all 5 subsystems. The submissions differ in the use of ZT-norm and in the development subsets used to estimate the fusion and calibration parameters for test signals with nominal durations of 10 and 3 seconds.

Table: Main features of the EHU primary and contrastive systems.

System         ZT-norm   Backend & Fusion train dataset
                         30s      10s            3s
Primary        No        dev30    dev10          dev03
Contrastive 1  No        dev30    dev10+dev30    dev03+dev10+dev30
Contrastive 2  Yes       dev30    dev10          dev03
Contrastive 3  Yes       dev30    dev10+dev30    dev03+dev10+dev30

SLIDE 12

Subsystem comparison - 30 seconds

[Figure: cost comparison of the blz, ehu, i3a and l2f primary and contrastive submissions, 30-second test segments.]

SLIDE 13

Subsystem comparison - 10 seconds

[Figure: cost comparison of the blz, ehu, i3a and l2f primary and contrastive submissions, 10-second test segments.]

SLIDE 14

Subsystem comparison - 3 seconds

[Figure: cost comparison of the blz, ehu, i3a and l2f primary and contrastive submissions, 3-second test segments.]

SLIDE 15

ZT-norm & generative/discriminative backend - 30 seconds

SLIDE 16

Phonotactic vs. Acoustic - 30 seconds

SLIDE 17

Greedy selection - 30 seconds

SLIDE 18

Conclusions

- A very competitive submission was obtained, based on state-of-the-art language recognition technology. Data collection may have been the key.
- For 3-second tests, using a larger development set (3, 10 and 30-second segments) increased the robustness of the system.
- Unlike the BLZ submission, the ZT-norm did not provide any improvement.
- The discriminative backend improved only the Dot-Scoring subsystem.
- Third participation, with a great performance improvement: in 2007 the average cost was around 0.30, and in 2009 it was around 0.07.

SLIDE 19

Thank you!
