Improving Unsupervised Acoustic Word Embeddings using Speaker and Gender Information
Lisa van Staden, Herman Kamper
31 January 2020
Zero-Resource Speech Processing
Popular methods for speech processing rely on transcribed speech. Obtaining transcriptions is expensive and not always possible.
Tasks in Zero-Resource Processing
We don’t always need to predict text labels:
- Query-by-Example Search: search speech using speech.
- Unsupervised Term Discovery: discover repeating patterns in speech.
Speech Segment Comparison
These tasks require comparing speech segments. The conventional method is Dynamic Time Warping (DTW), which finds the best frame-level alignment between two variable-length segments.
- Computationally expensive: the cost grows with the product of the two segment lengths.
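DTW itself can be sketched in a few lines. Below is a minimal NumPy version (the function name and the symmetric step pattern with Euclidean frame distances are illustrative choices, not details from the talk; real systems often also normalise by path length):

```python
import numpy as np

def dtw_cost(x, y):
    """Alignment cost between feature sequences x (Tx, d) and y (Ty, d)
    using dynamic time warping with Euclidean frame distances."""
    Tx, Ty = len(x), len(y)
    # dist[i, j]: Euclidean distance between frame x[i] and frame y[j]
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    # acc[i, j]: cost of the best warping path ending at (i, j)
    acc = np.full((Tx, Ty), np.inf)
    acc[0, 0] = dist[0, 0]
    for i in range(Tx):
        for j in range(Ty):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                acc[i - 1, j] if i > 0 else np.inf,                # step in x
                acc[i, j - 1] if j > 0 else np.inf,                # step in y
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,  # step in both
            )
            acc[i, j] = dist[i, j] + best_prev
    return acc[-1, -1]

# Identical sequences align perfectly, giving zero cost.
a = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])
print(dtw_cost(a, a))  # 0.0
```

The nested loop over all (i, j) frame pairs is exactly where the computational expense comes from.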
Acoustic Word Embeddings
We want to map variable-length speech segments to fixed-dimensional vector representations without using labels. Two segments can then be compared with a single vector distance.
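For example, with fixed-dimensional embeddings the comparison collapses to one vector operation instead of a full DTW alignment. A minimal sketch with made-up embedding values (the vectors below are hypothetical, chosen only to illustrate the idea):

```python
import numpy as np

def cosine_distance(e1, e2):
    """Cosine distance between two fixed-dimensional acoustic word embeddings."""
    return 1.0 - np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2))

# Hypothetical embeddings: two tokens of the same word should lie close together.
emb_cat_1 = np.array([0.9, 0.1, 0.2])
emb_cat_2 = np.array([0.8, 0.2, 0.1])
emb_pan   = np.array([0.1, 0.9, 0.7])
print(cosine_distance(emb_cat_1, emb_cat_2) < cosine_distance(emb_cat_1, emb_pan))  # True
```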
Speaker and Gender Information
Acoustic properties of speech differ between speakers and between genders.
[Figure: tokens of words such as "cat", "bat", "pan" and "pun" from different speakers (Speaker A, Speaker B) and genders (male, female), illustrating how tokens can group by speaker or gender rather than by word.]
We want embeddings that are robust to this speaker and gender variation.
RNN (Correspondence) Autoencoder
[Figure: encoder-decoder RNN built from GRU units. The encoder reads input frames x1 … xT and produces a fixed-dimensional embedding; the decoder reconstructs the input frames (x1' … xT', autoencoder) or the frames of another spoken instance of the same word (y1' … yT', correspondence autoencoder).]
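As a rough sketch of what the encoder computes, here is a single-layer GRU in NumPy that maps a variable-length sequence of frames to a fixed-dimensional embedding (the final hidden state). The weights are randomly initialised and the dimensionalities are illustrative, not the authors' exact architecture:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_encode(x, params):
    """Run a single-layer GRU over the frames of x (T, d_in) and return the
    final hidden state as a fixed-dimensional embedding of size d_h."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    h = np.zeros(Wz.shape[0])
    for x_t in x:
        z = sigmoid(Wz @ x_t + Uz @ h + bz)              # update gate
        r = sigmoid(Wr @ x_t + Ur @ h + br)              # reset gate
        h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h) + bh)  # candidate state
        h = (1.0 - z) * h + z * h_tilde
    return h

rng = np.random.default_rng(0)
d_in, d_h = 13, 8  # e.g. 13 MFCCs in, 8-dimensional embedding out
params = [rng.standard_normal(s) * 0.1
          for s in [(d_h, d_in), (d_h, d_h), (d_h,)] * 3]
# Segments of different lengths map to embeddings of the same dimensionality.
print(gru_encode(rng.standard_normal((20, d_in)), params).shape)  # (8,)
print(gru_encode(rng.standard_normal((35, d_in)), params).shape)  # (8,)
```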
Speaker/Gender Conditioning
[Figure: the same GRU encoder-decoder, but with a speaker or gender vector fed to the decoder at each step alongside the embedding, so the embedding itself has to carry less speaker/gender information.]
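One common way to implement this kind of conditioning, assumed here rather than read off the slides, is to concatenate a one-hot speaker (or gender) vector onto every decoder input frame:

```python
import numpy as np

def condition_decoder_inputs(decoder_inputs, speaker_id, num_speakers):
    """Concatenate a one-hot speaker vector to every decoder input frame.

    decoder_inputs: (T, d) array of per-step decoder inputs.
    Returns a (T, d + num_speakers) array.
    """
    one_hot = np.zeros(num_speakers)
    one_hot[speaker_id] = 1.0
    T = decoder_inputs.shape[0]
    return np.hstack([decoder_inputs, np.tile(one_hot, (T, 1))])

inputs = np.zeros((5, 16))  # 5 decoder steps, 16-dim inputs
out = condition_decoder_inputs(inputs, speaker_id=2, num_speakers=4)
print(out.shape)  # (5, 20)
```

Because the decoder is told the speaker identity directly, reconstructing the speaker's voice no longer forces that information into the embedding.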
Adversarial Training
[Figure: the embedding produced by the encoder feeds both the decoder, which outputs the reconstruction X'/Y', and a classifier, which predicts the speaker or gender p. Training alternates between Turn A, which updates the classifier, and Turn B, which updates the encoder-decoder against the classifier.]
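The slides describe alternating training turns. A closely related, commonly used mechanism for the same goal, shown here as an illustration rather than as the authors' method, is a gradient reversal layer: identity in the forward pass, negated and scaled gradient in the backward pass, so the encoder is pushed to increase the classifier's loss:

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; flips (and scales) the gradient in the
    backward pass, so the encoder is trained to *hurt* the classifier."""

    def __init__(self, lam=1.0):
        self.lam = lam  # trade-off between reconstruction and adversarial terms

    def forward(self, x):
        return x

    def backward(self, grad_from_classifier):
        return -self.lam * grad_from_classifier

grl = GradientReversal(lam=0.5)
emb = np.array([0.2, -0.4])
grad = np.array([1.0, -2.0])   # gradient of the classifier loss w.r.t. emb
print(grl.forward(emb))        # embedding passes through unchanged
print(grl.backward(grad))      # gradient sign flipped and scaled by lam
```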
Speaker/Gender Classifier
[Figure: the classifier maps an embedding z through Linear, ReLU, Dropout and Softmax layers to class probabilities p.]
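A minimal forward pass of this Linear, ReLU, Dropout, Softmax classifier, with random weights for illustration (inverted dropout is one standard implementation and is disabled at evaluation time):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())  # subtract max for numerical stability
    return e / e.sum()

def classify(z, W1, b1, W2, b2, dropout_p=0.5, train=False, rng=None):
    """Linear -> ReLU -> Dropout -> Softmax, mapping an embedding z to
    class probabilities p (over speakers or genders)."""
    h = np.maximum(0.0, W1 @ z + b1)  # Linear + ReLU
    if train:                         # inverted dropout, training only
        mask = (rng.random(h.shape) > dropout_p) / (1.0 - dropout_p)
        h = h * mask
    return softmax(W2 @ h + b2)       # class probabilities

rng = np.random.default_rng(0)
d_emb, d_hidden, n_classes = 8, 16, 2  # e.g. 2 classes for gender
W1, b1 = rng.standard_normal((d_hidden, d_emb)), np.zeros(d_hidden)
W2, b2 = rng.standard_normal((n_classes, d_hidden)), np.zeros(n_classes)
p = classify(rng.standard_normal(d_emb), W1, b1, W2, b2)
print(p)  # a valid probability distribution over the classes
```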
Evaluating Quality of AWEs
Use the same-different task to evaluate AWEs:
- Decide whether two segments are tokens of the same word by thresholding the distance between their AWEs.
- Sweep the threshold and calculate the area under the precision vs recall curve (average precision).
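The evaluation can be sketched as follows: rank all pairs by distance and average the precision at every same-word pair, which corresponds to the area under the precision-recall curve at its operating points (the function name and toy data are illustrative):

```python
import numpy as np

def average_precision(distances, same_word):
    """Average precision for the same-different task.

    distances: distance between each pair of AWEs.
    same_word: True where the pair is two tokens of the same word.
    """
    order = np.argsort(distances)           # most similar pairs first
    labels = np.asarray(same_word)[order]
    # Precision after each pair in the ranking.
    precision = np.cumsum(labels) / np.arange(1, len(labels) + 1)
    return precision[labels].mean()         # average over same-word pairs

# Perfectly separated pairs give an average precision of 1.0.
d = np.array([0.1, 0.2, 0.8, 0.9])
y = np.array([True, True, False, False])
print(average_precision(d, y))  # 1.0
```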
Results
Average precision (%) on the same-different task:

Model          English   Xitsonga
AE-Baseline      25.19      11.65
AE-Top-1         25.53      12.78
AE-Top-2         25.38      11.22
CAE-Baseline     30.18      22.52
CAE-Top-1        30.49      28.98
CAE-Top-2        29.72      22.72
Evaluate Speaker and Gender Predictability
Analyse whether the speaker and gender information in the embeddings has decreased:
- Train a speaker/gender classifier on the embeddings.
- Evaluate its classification accuracy: lower accuracy suggests less residual speaker/gender information.
Average Precision vs Speaker/Gender Predictability
[Figure: average precision plotted against speaker predictability and against gender predictability, for the AE models (average precision roughly 26 to 27%) and the CAE models (average precision roughly 30 to 32%).]
Conclusions
- English data shows marginal improvement by incorporating speaker information.
- The best Xitsonga model shows a 22% improvement.
- It's difficult to remove speaker and gender information.
- Future work ...