Improving Unsupervised Acoustic Word Embeddings using Speaker and Gender Information
Lisa van Staden, Herman Kamper
31 January 2020
Zero-Resource Speech Processing
Popular methods for speech processing rely on transcribed speech. Obtaining transcriptions is expensive and not always possible.
Tasks in Zero-Resource Processing
We don’t always need to predict text labels:
- Query-by-Example Search: search speech using speech.
- Unsupervised Term Discovery: discover repeating patterns in speech.
Speech Segment Comparison
These tasks require comparing speech segments. The conventional method is Dynamic Time Warping (DTW), which finds the best frame-level alignment between two variable-length segments.
- Computationally expensive: the cost grows with the product of the two segment lengths.
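DTW itself can be sketched in a few lines. Below is a minimal NumPy version (the function name and the symmetric step pattern with Euclidean frame distances are illustrative choices, not details from the talk; real systems often also normalise by path length):

```python
import numpy as np

def dtw_cost(x, y):
    """Alignment cost between feature sequences x (Tx, d) and y (Ty, d)
    using dynamic time warping with Euclidean frame distances."""
    Tx, Ty = len(x), len(y)
    # dist[i, j]: Euclidean distance between frame x[i] and frame y[j]
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    # acc[i, j]: cost of the best warping path ending at (i, j)
    acc = np.full((Tx, Ty), np.inf)
    acc[0, 0] = dist[0, 0]
    for i in range(Tx):
        for j in range(Ty):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                acc[i - 1, j] if i > 0 else np.inf,                # step in x
                acc[i, j - 1] if j > 0 else np.inf,                # step in y
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,  # step in both
            )
            acc[i, j] = dist[i, j] + best_prev
    return acc[-1, -1]

# Identical sequences align perfectly, giving zero cost.
a = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])
print(dtw_cost(a, a))  # 0.0
```

The nested loop over all (i, j) frame pairs is exactly where the computational expense comes from.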
Acoustic Word Embeddings
We want to map variable-length speech segments to fixed-dimensional vector representations without using labels. Two segments can then be compared with a single vector distance.
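For example, with fixed-dimensional embeddings the comparison collapses to one vector operation instead of a full DTW alignment. A minimal sketch with made-up embedding values (the vectors below are hypothetical, chosen only to illustrate the idea):

```python
import numpy as np

def cosine_distance(e1, e2):
    """Cosine distance between two fixed-dimensional acoustic word embeddings."""
    return 1.0 - np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2))

# Hypothetical embeddings: two tokens of the same word should lie close together.
emb_cat_1 = np.array([0.9, 0.1, 0.2])
emb_cat_2 = np.array([0.8, 0.2, 0.1])
emb_pan   = np.array([0.1, 0.9, 0.7])
print(cosine_distance(emb_cat_1, emb_cat_2) < cosine_distance(emb_cat_1, emb_pan))  # True
```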
Speaker and Gender Information
Acoustic properties of speech differ between speakers and between genders.
[Figure: tokens of words such as "cat", "bat", "pan" and "pun" from different speakers (Speaker A, Speaker B) and genders (male, female), illustrating how tokens can group by speaker or gender rather than by word.]
We want embeddings that are robust to this speaker and gender variation.
RNN (Correspondence) Autoencoder
[Figure: encoder-decoder RNN built from GRU units. The encoder reads input frames x1 … xT and produces a fixed-dimensional embedding; the decoder reconstructs the input frames (x1' … xT', autoencoder) or the frames of another spoken instance of the same word (y1' … yT', correspondence autoencoder).]
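As a rough sketch of what the encoder computes, here is a single-layer GRU in NumPy that maps a variable-length sequence of frames to a fixed-dimensional embedding (the final hidden state). The weights are randomly initialised and the dimensionalities are illustrative, not the authors' exact architecture:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_encode(x, params):
    """Run a single-layer GRU over the frames of x (T, d_in) and return the
    final hidden state as a fixed-dimensional embedding of size d_h."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    h = np.zeros(Wz.shape[0])
    for x_t in x:
        z = sigmoid(Wz @ x_t + Uz @ h + bz)              # update gate
        r = sigmoid(Wr @ x_t + Ur @ h + br)              # reset gate
        h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h) + bh)  # candidate state
        h = (1.0 - z) * h + z * h_tilde
    return h

rng = np.random.default_rng(0)
d_in, d_h = 13, 8  # e.g. 13 MFCCs in, 8-dimensional embedding out
params = [rng.standard_normal(s) * 0.1
          for s in [(d_h, d_in), (d_h, d_h), (d_h,)] * 3]
# Segments of different lengths map to embeddings of the same dimensionality.
print(gru_encode(rng.standard_normal((20, d_in)), params).shape)  # (8,)
print(gru_encode(rng.standard_normal((35, d_in)), params).shape)  # (8,)
```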
Speaker/Gender Conditioning
[Figure: the same GRU encoder-decoder, but with a speaker or gender vector fed to the decoder at each step alongside the embedding, so the embedding itself has to carry less speaker/gender information.]
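One common way to implement this kind of conditioning, assumed here rather than read off the slides, is to concatenate a one-hot speaker (or gender) vector onto every decoder input frame:

```python
import numpy as np

def condition_decoder_inputs(decoder_inputs, speaker_id, num_speakers):
    """Concatenate a one-hot speaker vector to every decoder input frame.

    decoder_inputs: (T, d) array of per-step decoder inputs.
    Returns a (T, d + num_speakers) array.
    """
    one_hot = np.zeros(num_speakers)
    one_hot[speaker_id] = 1.0
    T = decoder_inputs.shape[0]
    return np.hstack([decoder_inputs, np.tile(one_hot, (T, 1))])

inputs = np.zeros((5, 16))  # 5 decoder steps, 16-dim inputs
out = condition_decoder_inputs(inputs, speaker_id=2, num_speakers=4)
print(out.shape)  # (5, 20)
```

Because the decoder is told the speaker identity directly, reconstructing the speaker's voice no longer forces that information into the embedding.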
Adversarial Training
[Figure: the embedding produced by the encoder feeds both the decoder, which outputs the reconstruction X'/Y', and a classifier, which predicts the speaker or gender p. Training alternates between Turn A, which updates the classifier, and Turn B, which updates the encoder-decoder against the classifier.]
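The slides describe alternating training turns. A closely related, commonly used mechanism for the same goal, shown here as an illustration rather than as the authors' method, is a gradient reversal layer: identity in the forward pass, negated and scaled gradient in the backward pass, so the encoder is pushed to increase the classifier's loss:

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; flips (and scales) the gradient in the
    backward pass, so the encoder is trained to *hurt* the classifier."""

    def __init__(self, lam=1.0):
        self.lam = lam  # trade-off between reconstruction and adversarial terms

    def forward(self, x):
        return x

    def backward(self, grad_from_classifier):
        return -self.lam * grad_from_classifier

grl = GradientReversal(lam=0.5)
emb = np.array([0.2, -0.4])
grad = np.array([1.0, -2.0])   # gradient of the classifier loss w.r.t. emb
print(grl.forward(emb))        # embedding passes through unchanged
print(grl.backward(grad))      # gradient sign flipped and scaled by lam
```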
Speaker/Gender Classifier
[Figure: the classifier maps an embedding z through Linear, ReLU, Dropout and Softmax layers to class probabilities p.]
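A minimal forward pass of this Linear, ReLU, Dropout, Softmax classifier, with random weights for illustration (inverted dropout is one standard implementation and is disabled at evaluation time):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())  # subtract max for numerical stability
    return e / e.sum()

def classify(z, W1, b1, W2, b2, dropout_p=0.5, train=False, rng=None):
    """Linear -> ReLU -> Dropout -> Softmax, mapping an embedding z to
    class probabilities p (over speakers or genders)."""
    h = np.maximum(0.0, W1 @ z + b1)  # Linear + ReLU
    if train:                         # inverted dropout, training only
        mask = (rng.random(h.shape) > dropout_p) / (1.0 - dropout_p)
        h = h * mask
    return softmax(W2 @ h + b2)       # class probabilities

rng = np.random.default_rng(0)
d_emb, d_hidden, n_classes = 8, 16, 2  # e.g. 2 classes for gender
W1, b1 = rng.standard_normal((d_hidden, d_emb)), np.zeros(d_hidden)
W2, b2 = rng.standard_normal((n_classes, d_hidden)), np.zeros(n_classes)
p = classify(rng.standard_normal(d_emb), W1, b1, W2, b2)
print(p)  # a valid probability distribution over the classes
```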
Evaluating Quality of AWEs
Use the same-different task to evaluate AWEs:
- Decide whether two segments are tokens of the same word by thresholding the distance between their AWEs.
- Sweep the threshold and calculate the area under the precision vs recall curve (average precision).
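The evaluation can be sketched as follows: rank all pairs by distance and average the precision at every same-word pair, which corresponds to the area under the precision-recall curve at its operating points (the function name and toy data are illustrative):

```python
import numpy as np

def average_precision(distances, same_word):
    """Average precision for the same-different task.

    distances: distance between each pair of AWEs.
    same_word: True where the pair is two tokens of the same word.
    """
    order = np.argsort(distances)           # most similar pairs first
    labels = np.asarray(same_word)[order]
    # Precision after each pair in the ranking.
    precision = np.cumsum(labels) / np.arange(1, len(labels) + 1)
    return precision[labels].mean()         # average over same-word pairs

# Perfectly separated pairs give an average precision of 1.0.
d = np.array([0.1, 0.2, 0.8, 0.9])
y = np.array([True, True, False, False])
print(average_precision(d, y))  # 1.0
```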
Results
Average precision (%) on the same-different task:

Model          English   Xitsonga
AE-Baseline      25.19      11.65
AE-Top-1         25.53      12.78
AE-Top-2         25.38      11.22
CAE-Baseline     30.18      22.52
CAE-Top-1        30.49      28.98
CAE-Top-2        29.72      22.72
Evaluate Speaker and Gender Predictability
Analyse whether the speaker and gender information in the embeddings has decreased:
- Train a speaker/gender classifier on the embeddings.
- Evaluate its classification accuracy: lower accuracy suggests less residual speaker/gender information.
Average Precision vs Speaker/Gender Predictability
[Figure: average precision plotted against speaker predictability and against gender predictability, for the AE models (average precision roughly 26 to 27%) and the CAE models (average precision roughly 30 to 32%).]
Conclusions
- English data shows marginal improvement by incorporating speaker information.
- The best Xitsonga model shows a 22% improvement.
- It's difficult to remove speaker and gender information.
- Future work ...