

SLIDE 1

Multilingual acoustic word embedding models for processing zero-resource languages

ICASSP 2020
Herman Kamper¹, Yevgen Matusevych², Sharon Goldwater²

¹Stellenbosch University, South Africa, ²University of Edinburgh, UK

http://www.kamperh.com/

SLIDE 2

Background: Why acoustic word embeddings?

  • Current speech recognition methods require large labelled data sets
  • Zero-resource speech processing aims to develop methods that can discover linguistic structure from unlabelled speech [Dunbar et al., ASRU’17]
  • Example applications: unsupervised term discovery, query-by-example search
  • Problem: need to compare speech segments of variable duration


SLIDE 3

Acoustic word embeddings

[Figure: variable-duration speech segments X(1) and X(2) are mapped to fixed-dimensional embeddings z(1) and z(2) in an embedding space with z ∈ R^M]

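Once segments are embedded, comparing two segments of different durations reduces to a vector distance in R^M. A minimal sketch (numpy only; the embedding values here are hypothetical placeholders, not output of any real model):

```python
import numpy as np

def cosine_distance(z1, z2):
    """Distance between two fixed-dimensional acoustic word embeddings."""
    return 1.0 - np.dot(z1, z2) / (np.linalg.norm(z1) * np.linalg.norm(z2))

M = 130  # embedding dimensionality used in the paper
rng = np.random.default_rng(0)
z1 = rng.standard_normal(M)   # stand-in embedding of segment X(1)
z2 = rng.standard_normal(M)   # stand-in embedding of segment X(2)

# A single scalar distance, regardless of how long the original segments were.
d = cosine_distance(z1, z2)
```

The point is that the distance computation no longer depends on the segment durations at all; that is what the embedding space buys us.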

SLIDE 4

Example application: Query-by-example search

[Figure: query-by-example search: embed the spoken query to get z(q), embed all segments/utterances in the search database to get z(1), ..., z(N), then return hits by nearest-neighbour search]

[Levin et al., ICASSP’15]
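The search step can be sketched as a nearest-neighbour lookup over normalised embeddings (a hypothetical helper for illustration, not the authors' code; the database here is random placeholder data):

```python
import numpy as np

def query_by_example(z_query, z_database, top_k=5):
    """Rank database segments by cosine similarity to the query embedding.

    z_query: (M,) embedding of the spoken query.
    z_database: (N, M) embeddings of all database segments.
    Returns the indices of the top_k nearest neighbours.
    """
    q = z_query / np.linalg.norm(z_query)
    db = z_database / np.linalg.norm(z_database, axis=1, keepdims=True)
    sims = db @ q                      # cosine similarity to every segment
    return np.argsort(-sims)[:top_k]   # best matches first

rng = np.random.default_rng(1)
z_db = rng.standard_normal((1000, 130))            # N = 1000 segments, M = 130
z_q = z_db[42] + 0.01 * rng.standard_normal(130)   # query close to segment 42
hits = query_by_example(z_q, z_db)                 # segment 42 ranks first
```

In practice an approximate nearest-neighbour index would replace the exhaustive `argsort` for large N, but the interface is the same.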

SLIDE 6

Supervised and unsupervised acoustic embeddings

  • Growing body of work on acoustic word embeddings
  • Supervised and unsupervised methods
  • Unsupervised methods can be applied in zero-resource settings
  • But there is still a large performance gap

[Bar chart: average precision (%) for the CAE-RNN, unsupervised vs supervised]

[Kamper, ICASSP’19]

SLIDE 7

Unsupervised monolingual acoustic word embeddings

[Figure: an encoder-decoder RNN autoencoder encodes the frames x1, x2, ..., xT of a segment X into a fixed embedding and decodes outputs f1, f2, ..., fT to reconstruct X]

[Chung et al., Interspeech’16; Kamper, ICASSP’19]

SLIDE 8

Unsupervised monolingual acoustic word embeddings

[Figure: the correspondence autoencoder (CAE-RNN) encodes the frames x1, x2, ..., xT of segment X and decodes outputs f1, f2, ..., fT′ to reconstruct X′, a discovered pair of X]

[Chung et al., Interspeech’16; Kamper, ICASSP’19]
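To make the encoder side concrete: the embedding z is the encoder RNN's final hidden state, so segments of any duration map to vectors of the same fixed dimensionality. A minimal vanilla-RNN sketch in numpy (untrained, hypothetical weights and dimensions; the actual models also include a decoder that is trained to reconstruct X, or the discovered pair X′ in the CAE-RNN case):

```python
import numpy as np

def rnn_encode(X, W_x, W_h, b):
    """Run a vanilla RNN over frames x1..xT; the final hidden state is z."""
    h = np.zeros(W_h.shape[0])
    for x_t in X:                        # X has shape (T, D): T frames, D features
        h = np.tanh(W_x @ x_t + W_h @ h + b)
    return h                             # fixed-dimensional, independent of T

D, M = 13, 130                           # e.g. 13 MFCCs in, 130-dim embedding out
rng = np.random.default_rng(0)
W_x = 0.1 * rng.standard_normal((M, D))  # hypothetical, untrained weights
W_h = 0.1 * rng.standard_normal((M, M))
b = np.zeros(M)

X_short = rng.standard_normal((40, D))   # a 40-frame segment
X_long = rng.standard_normal((95, D))    # a 95-frame segment
z1 = rnn_encode(X_short, W_x, W_h, b)
z2 = rnn_encode(X_long, W_x, W_h, b)
# Both embeddings have dimensionality M despite the different durations.
```

Real implementations use gated recurrent units and train the weights end-to-end, but the shape of the computation is the same.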

SLIDE 9

Supervised multilingual acoustic word embeddings

[Figure: a supervised multilingual model encodes the frames x1, x2, ..., xT of a segment X into an acoustic word embedding z; training uses labelled words from several languages, e.g. Russian яблоки ("apples") and бежать ("to run"), French pommes and courir, Polish jabłka and biec]


SLIDE 10

Experimental setup

  • Training data: six well-resourced languages
    Czech (CS), French (FR), Polish (PL), Portuguese (PT), Russian (RU), Thai (TH)
  • Test data: six languages treated as zero-resource
    Spanish (ES), Hausa (HA), Croatian (HR), Swedish (SV), Turkish (TR), Mandarin (ZH)
  • Evaluation: same-different isolated word discrimination
  • Embeddings: M = 130 for all models
  • Baselines:
    — Downsampling: 10 equally spaced MFCC frames, flattened into a single vector
    — Dynamic time warping (DTW): alignment cost between test segments
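The two baselines can be sketched in a few lines (a minimal numpy illustration, not the paper's implementation; the segment shapes and 13-dimensional MFCC frames are assumptions):

```python
import numpy as np

def downsample_embed(X, n=10):
    """Downsampling baseline: keep n equally spaced frames and flatten.

    X: (T, D) sequence of MFCC frames; returns a fixed (n * D,) vector.
    """
    T = X.shape[0]
    idx = np.linspace(0, T - 1, n).round().astype(int)
    return X[idx].flatten()

def dtw_cost(X1, X2):
    """DTW baseline: cumulative frame-wise alignment cost between segments."""
    T1, T2 = len(X1), len(X2)
    dist = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=-1)
    acc = np.full((T1 + 1, T2 + 1), np.inf)   # accumulated-cost matrix
    acc[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[T1, T2]

rng = np.random.default_rng(0)
X1 = rng.standard_normal((50, 13))   # two hypothetical MFCC segments
X2 = rng.standard_normal((70, 13))
z = downsample_embed(X1)             # fixed vector: 10 frames x 13 MFCCs = 130
c = dtw_cost(X1, X2)                 # scalar alignment cost
```

Note that with 13 MFCCs per frame, downsampling 10 frames gives a 10 × 13 = 130-dimensional vector, matching the embedding dimensionality M = 130 used for the models, so the comparison is like-for-like.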


SLIDE 11
1. Is multilingual supervised > monolingual unsupervised?

[Bar chart, test results on Spanish: average precision (%) for the DTW and Downsample baselines, the unsupervised CAE-RNN (UTD), and the multilingual CAE-RNN and ClassifierRNN]

SLIDE 12
1. Is multilingual supervised > monolingual unsupervised?

[Bar chart, test results on Hausa: average precision (%) for the DTW and Downsample baselines, the unsupervised CAE-RNN (UTD), and the multilingual CAE-RNN and ClassifierRNN]

SLIDE 13
2. Does training on more languages help?

[Bar chart, development results on Croatian: average precision (%) for the CAE-RNN and ClassifierRNN as the training set grows from RU to RU+CS to RU+CS+FR to the full multilingual set, compared to the unsupervised HR (UTD) system]

SLIDE 14
3. Is the choice of training language important?

Average precision (%) by training language (rows) and evaluation language (columns):

          ES     HA     HR     SV     TR     ZH
   CS   41.6   51.1   41.0   28.7   37.0   42.6
   FR   42.6   41.8   30.4   25.3   32.5   35.8
   PL   41.1   43.7   35.8   25.5   33.7   39.5
   PT   45.9   46.2   36.4   26.6   34.1   39.6
   RU   35.0   39.7   31.3   22.3   29.7   37.1
   TH   28.5   44.5   29.9   17.9   23.6   36.2
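One way to read these results: for each zero-resource evaluation language, pick the single training language with the highest average precision (a small numpy sketch over the numbers above):

```python
import numpy as np

# Average precision (%): rows = training language, columns = evaluation language.
train_langs = ["CS", "FR", "PL", "PT", "RU", "TH"]
eval_langs = ["ES", "HA", "HR", "SV", "TR", "ZH"]
ap = np.array([
    [41.6, 51.1, 41.0, 28.7, 37.0, 42.6],   # CS
    [42.6, 41.8, 30.4, 25.3, 32.5, 35.8],   # FR
    [41.1, 43.7, 35.8, 25.5, 33.7, 39.5],   # PL
    [45.9, 46.2, 36.4, 26.6, 34.1, 39.6],   # PT
    [35.0, 39.7, 31.3, 22.3, 29.7, 37.1],   # RU
    [28.5, 44.5, 29.9, 17.9, 23.6, 36.2],   # TH
])

# Best single training language per evaluation language.
best = {e: train_langs[ap[:, j].argmax()] for j, e in enumerate(eval_langs)}
```

On these numbers, Portuguese is the best single training language for Spanish, while Czech is best for each of the other five evaluation languages, so the choice of training language clearly matters.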



SLIDE 16

Conclusions and future work

Conclusions:

  • Proposed training a supervised multilingual acoustic word embedding model on well-resourced languages and then applying it to zero-resource languages
  • Multilingual CAE-RNN and ClassifierRNN consistently outperform unsupervised models trained on the zero-resource languages

Future work:

  • Different models, both for multilingual and unsupervised training
  • Analysis to understand the difference between the CAE-RNN and ClassifierRNN
  • Does language conditioning help during decoding?


SLIDE 17

https://arxiv.org/abs/2002.02109
https://github.com/kamperh/globalphone_awe