

SLIDE 1

Multilingual acoustic word embedding models for processing zero-resource languages

ICASSP 2020
Herman Kamper¹, Yevgen Matusevych², Sharon Goldwater²

¹Stellenbosch University, South Africa, ²University of Edinburgh, UK

http://www.kamperh.com/

SLIDE 2

Background: Why acoustic word embeddings?

  • Current speech recognition methods require large labelled data sets
  • Zero-resource speech processing aims to develop methods that can discover linguistic structure from unlabelled speech [Dunbar et al., ASRU’17]
  • Example applications: unsupervised term discovery, query-by-example search
  • Problem: need to compare speech segments of variable duration


SLIDE 3

Acoustic word embeddings

[Figure: variable-duration speech segments X(1) and X(2) are mapped to fixed-dimensional embeddings z(1) and z(2) in an embedding space with z ∈ R^M]

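Once segments are embedded, comparing two segments of different durations reduces to a vector distance in R^M. A minimal sketch (numpy only; the embedding values here are hypothetical placeholders, not output of any real model):

```python
import numpy as np

def cosine_distance(z1, z2):
    """Distance between two fixed-dimensional acoustic word embeddings."""
    return 1.0 - np.dot(z1, z2) / (np.linalg.norm(z1) * np.linalg.norm(z2))

M = 130  # embedding dimensionality used in the paper
rng = np.random.default_rng(0)
z1 = rng.standard_normal(M)   # stand-in embedding of segment X(1)
z2 = rng.standard_normal(M)   # stand-in embedding of segment X(2)

# A single scalar distance, regardless of how long the original segments were.
d = cosine_distance(z1, z2)
```

The point is that the distance computation no longer depends on the segment durations at all; that is what the embedding space buys us.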

SLIDE 4

Example application: Query-by-example search

[Figure: query-by-example search: embed the spoken query to get z(q), embed all segments/utterances in the search database to get z(1), ..., z(N), then return hits by nearest-neighbour search]

[Levin et al., ICASSP’15]
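The search step can be sketched as a nearest-neighbour lookup over normalised embeddings (a hypothetical helper for illustration, not the authors' code; the database here is random placeholder data):

```python
import numpy as np

def query_by_example(z_query, z_database, top_k=5):
    """Rank database segments by cosine similarity to the query embedding.

    z_query: (M,) embedding of the spoken query.
    z_database: (N, M) embeddings of all database segments.
    Returns the indices of the top_k nearest neighbours.
    """
    q = z_query / np.linalg.norm(z_query)
    db = z_database / np.linalg.norm(z_database, axis=1, keepdims=True)
    sims = db @ q                      # cosine similarity to every segment
    return np.argsort(-sims)[:top_k]   # best matches first

rng = np.random.default_rng(1)
z_db = rng.standard_normal((1000, 130))            # N = 1000 segments, M = 130
z_q = z_db[42] + 0.01 * rng.standard_normal(130)   # query close to segment 42
hits = query_by_example(z_q, z_db)                 # segment 42 ranks first
```

In practice an approximate nearest-neighbour index would replace the exhaustive `argsort` for large N, but the interface is the same.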

SLIDE 6

Supervised and unsupervised acoustic embeddings

  • Growing body of work on acoustic word embeddings
  • Supervised and unsupervised methods
  • Unsupervised methods can be applied in zero-resource settings
  • But there is still a large performance gap

[Bar chart: average precision (%) for the CAE-RNN, unsupervised vs supervised]

[Kamper, ICASSP’19]

SLIDE 7

Unsupervised monolingual acoustic word embeddings

[Figure: an encoder-decoder RNN autoencoder encodes the frames x1, x2, ..., xT of a segment X into a fixed embedding and decodes outputs f1, f2, ..., fT to reconstruct X]

[Chung et al., Interspeech’16; Kamper, ICASSP’19]

SLIDE 8

Unsupervised monolingual acoustic word embeddings

[Figure: the correspondence autoencoder (CAE-RNN) encodes the frames x1, x2, ..., xT of segment X and decodes outputs f1, f2, ..., fT′ to reconstruct X′, a discovered pair of X]

[Chung et al., Interspeech’16; Kamper, ICASSP’19]
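To make the encoder side concrete: the embedding z is the encoder RNN's final hidden state, so segments of any duration map to vectors of the same fixed dimensionality. A minimal vanilla-RNN sketch in numpy (untrained, hypothetical weights and dimensions; the actual models also include a decoder that is trained to reconstruct X, or the discovered pair X′ in the CAE-RNN case):

```python
import numpy as np

def rnn_encode(X, W_x, W_h, b):
    """Run a vanilla RNN over frames x1..xT; the final hidden state is z."""
    h = np.zeros(W_h.shape[0])
    for x_t in X:                        # X has shape (T, D): T frames, D features
        h = np.tanh(W_x @ x_t + W_h @ h + b)
    return h                             # fixed-dimensional, independent of T

D, M = 13, 130                           # e.g. 13 MFCCs in, 130-dim embedding out
rng = np.random.default_rng(0)
W_x = 0.1 * rng.standard_normal((M, D))  # hypothetical, untrained weights
W_h = 0.1 * rng.standard_normal((M, M))
b = np.zeros(M)

X_short = rng.standard_normal((40, D))   # a 40-frame segment
X_long = rng.standard_normal((95, D))    # a 95-frame segment
z1 = rnn_encode(X_short, W_x, W_h, b)
z2 = rnn_encode(X_long, W_x, W_h, b)
# Both embeddings have dimensionality M despite the different durations.
```

Real implementations use gated recurrent units and train the weights end-to-end, but the shape of the computation is the same.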

SLIDE 9

Supervised multilingual acoustic word embeddings

[Figure: a supervised multilingual model encodes the frames x1, x2, ..., xT of a segment X into an acoustic word embedding z; training uses labelled words from several languages, e.g. Russian яблоки ("apples") and бежать ("to run"), French pommes and courir, Polish jabłka and biec]


SLIDE 10

Experimental setup

  • Training data: six well-resourced languages
    Czech (CS), French (FR), Polish (PL), Portuguese (PT), Russian (RU), Thai (TH)
  • Test data: six languages treated as zero-resource
    Spanish (ES), Hausa (HA), Croatian (HR), Swedish (SV), Turkish (TR), Mandarin (ZH)
  • Evaluation: same-different isolated word discrimination
  • Embeddings: M = 130 for all models
  • Baselines:
    — Downsampling: 10 equally spaced MFCC frames, flattened into a single vector
    — Dynamic time warping (DTW): alignment cost between test segments
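The two baselines can be sketched in a few lines (a minimal numpy illustration, not the paper's implementation; the segment shapes and 13-dimensional MFCC frames are assumptions):

```python
import numpy as np

def downsample_embed(X, n=10):
    """Downsampling baseline: keep n equally spaced frames and flatten.

    X: (T, D) sequence of MFCC frames; returns a fixed (n * D,) vector.
    """
    T = X.shape[0]
    idx = np.linspace(0, T - 1, n).round().astype(int)
    return X[idx].flatten()

def dtw_cost(X1, X2):
    """DTW baseline: cumulative frame-wise alignment cost between segments."""
    T1, T2 = len(X1), len(X2)
    dist = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=-1)
    acc = np.full((T1 + 1, T2 + 1), np.inf)   # accumulated-cost matrix
    acc[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[T1, T2]

rng = np.random.default_rng(0)
X1 = rng.standard_normal((50, 13))   # two hypothetical MFCC segments
X2 = rng.standard_normal((70, 13))
z = downsample_embed(X1)             # fixed vector: 10 frames x 13 MFCCs = 130
c = dtw_cost(X1, X2)                 # scalar alignment cost
```

Note that with 13 MFCCs per frame, downsampling 10 frames gives a 10 × 13 = 130-dimensional vector, matching the embedding dimensionality M = 130 used for the models, so the comparison is like-for-like.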


SLIDE 11
1. Is multilingual supervised > monolingual unsupervised?

[Bar chart, test results on Spanish: average precision (%) for the DTW and Downsample baselines, the unsupervised CAE-RNN (UTD), and the multilingual CAE-RNN and ClassifierRNN]

SLIDE 12
1. Is multilingual supervised > monolingual unsupervised?

[Bar chart, test results on Hausa: average precision (%) for the DTW and Downsample baselines, the unsupervised CAE-RNN (UTD), and the multilingual CAE-RNN and ClassifierRNN]

SLIDE 13
2. Does training on more languages help?

[Bar chart, development results on Croatian: average precision (%) for the CAE-RNN and ClassifierRNN as the training set grows from RU to RU+CS to RU+CS+FR to the full multilingual set, compared to the unsupervised HR (UTD) system]

SLIDE 14
3. Is the choice of training language important?

Average precision (%) by training language (rows) and evaluation language (columns):

          ES     HA     HR     SV     TR     ZH
   CS   41.6   51.1   41.0   28.7   37.0   42.6
   FR   42.6   41.8   30.4   25.3   32.5   35.8
   PL   41.1   43.7   35.8   25.5   33.7   39.5
   PT   45.9   46.2   36.4   26.6   34.1   39.6
   RU   35.0   39.7   31.3   22.3   29.7   37.1
   TH   28.5   44.5   29.9   17.9   23.6   36.2
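One way to read these results: for each zero-resource evaluation language, pick the single training language with the highest average precision (a small numpy sketch over the numbers above):

```python
import numpy as np

# Average precision (%): rows = training language, columns = evaluation language.
train_langs = ["CS", "FR", "PL", "PT", "RU", "TH"]
eval_langs = ["ES", "HA", "HR", "SV", "TR", "ZH"]
ap = np.array([
    [41.6, 51.1, 41.0, 28.7, 37.0, 42.6],   # CS
    [42.6, 41.8, 30.4, 25.3, 32.5, 35.8],   # FR
    [41.1, 43.7, 35.8, 25.5, 33.7, 39.5],   # PL
    [45.9, 46.2, 36.4, 26.6, 34.1, 39.6],   # PT
    [35.0, 39.7, 31.3, 22.3, 29.7, 37.1],   # RU
    [28.5, 44.5, 29.9, 17.9, 23.6, 36.2],   # TH
])

# Best single training language per evaluation language.
best = {e: train_langs[ap[:, j].argmax()] for j, e in enumerate(eval_langs)}
```

On these numbers, Portuguese is the best single training language for Spanish, while Czech is best for each of the other five evaluation languages, so the choice of training language clearly matters.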



SLIDE 16

Conclusions and future work

Conclusions:

  • Proposed training a supervised multilingual acoustic word embedding model on well-resourced languages and then applying it to zero-resource languages
  • Multilingual CAE-RNN and ClassifierRNN consistently outperform unsupervised models trained on the zero-resource languages

Future work:

  • Different models, both for multilingual and unsupervised training
  • Analysis to understand the difference between the CAE-RNN and ClassifierRNN
  • Does language conditioning help during decoding?


SLIDE 17

https://arxiv.org/abs/2002.02109
https://github.com/kamperh/globalphone_awe