
SLIDE 1

Unsupervised neural network based feature extraction using weak top-down constraints

Herman Kamper (1,2), Micha Elsner (3), Aren Jansen (4), Sharon Goldwater (2)

(1) CSTR and (2) ILCC, School of Informatics, University of Edinburgh, UK; (3) Department of Linguistics, The Ohio State University, USA; (4) HLTCOE and CLSP, Johns Hopkins University, USA

ICASSP 2015

SLIDES 2-3

Introduction

◮ Huge amounts of speech audio data are becoming available online.
◮ Even for severely under-resourced and endangered languages (e.g. unwritten), data is being collected.
◮ Generally this data is unlabelled.
◮ We want to build speech technology on available unlabelled data.
◮ Need unsupervised speech processing techniques.


SLIDES 4-11

Example application: query-by-example search

Spoken query: [audio waveform]

What features should we use to represent the speech for such unsupervised tasks?


SLIDES 12-16

Supervised neural network feature extraction

[Diagram: a neural network takes speech frame(s) as input (e.g. MFCCs, filterbanks) and predicts phone states (e.g. "ay", "ey", "k", "v") as output. The lower layers form a feature extractor (learned from data); the top layers form a phone classifier (learned jointly).]

But what if we do not have phone class targets to train our network?

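To make the picture concrete, here is a minimal sketch of this supervised setup, written in PyTorch (not the original implementation; layer sizes and the phone-state count are illustrative assumptions):

```python
# Sketch of supervised feature extraction: the lower layers double as a
# learned feature extractor, the top layer as a phone classifier.
import torch
import torch.nn as nn

n_mfcc, n_hidden, n_phone_states = 39, 100, 48  # illustrative sizes

feature_extractor = nn.Sequential(       # lower layers: learned features
    nn.Linear(n_mfcc, n_hidden), nn.Sigmoid(),
    nn.Linear(n_hidden, n_hidden), nn.Sigmoid(),
)
phone_classifier = nn.Linear(n_hidden, n_phone_states)  # top layer, learned jointly

def forward(frames):                      # frames: (batch, n_mfcc) tensor
    features = feature_extractor(frames)  # use these for downstream tasks
    logits = phone_classifier(features)   # trained with cross-entropy on phone targets
    return features, logits
```

The point of the sketch is the split: once trained, the classifier head is discarded and `feature_extractor` is applied to new speech, which is exactly the part that fails when no phone targets exist.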

SLIDES 17-23

Weak supervision: unsupervised term discovery

Can we use these discovered word pairs to provide us with weak supervision?


SLIDES 24-27

Weak supervision: align the discovered word pairs

Use the correspondence idea from [Jansen et al., 2013]: align the frames of the two words in each discovered pair using dynamic time warping (DTW), giving pairs of corresponding frames.

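A rough sketch of this alignment step, assuming each discovered word segment is a hypothetical (n_frames, n_dims) array of acoustic feature vectors:

```python
# DTW alignment of two word segments, returning frame-level index pairs.
import numpy as np

def dtw_align(seg_a, seg_b):
    """Return list of (i, j) frame index pairs on the optimal DTW path."""
    n, m = len(seg_a), len(seg_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seg_a[i - 1] - seg_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Backtrack from (n, m) to recover the alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

Each (i, j) pair on the path says "frame i of one word corresponds to frame j of the other", which is the weak supervision used below.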

SLIDES 28-30

Autoencoder (AE) neural network

[Diagram: the input is a speech frame; the output target is the same as the input.]

A normal autoencoder neural network is trained to reconstruct its input. This reconstruction criterion can be used to pretrain a deep neural network.

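As a minimal sketch (the network in the paper is a deeper stacked autoencoder; sizes here are illustrative, continuing the PyTorch sketches above):

```python
# A one-hidden-layer autoencoder: output is trained to match the input.
import torch
import torch.nn as nn

n_input, n_hidden = 39, 100  # e.g. a single 39-dimensional MFCC frame

autoencoder = nn.Sequential(
    nn.Linear(n_input, n_hidden), nn.Sigmoid(),  # encoder
    nn.Linear(n_hidden, n_input),                # decoder
)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)

def train_step(frames):                           # frames: (batch, n_input)
    optimizer.zero_grad()
    loss = loss_fn(autoencoder(frames), frames)   # target = the input itself
    loss.backward()
    optimizer.step()
    return loss.item()
```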

SLIDES 31-32

The correspondence autoencoder (cAE)

[Diagram: the input is a frame from one word; the output target is the corresponding frame from the other word in the pair.]

The correspondence autoencoder (cAE) takes a frame from one word and tries to reconstruct the corresponding frame from the other word in the pair. In this way we learn an unsupervised feature extractor using the weak word-pair supervision.

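Reusing the model, loss and optimizer from the autoencoder sketch above, the cAE training step differs in a single line: the target is the DTW-aligned frame from the other word, not the input itself.

```python
def cae_train_step(frames_a, frames_b):
    """frames_a, frames_b: (batch, n_input) tensors of DTW-aligned frame pairs."""
    optimizer.zero_grad()
    # Target is the corresponding frame from the OTHER word in the pair.
    loss = loss_fn(autoencoder(frames_a), frames_b)
    loss.backward()
    optimizer.step()
    return loss.item()
```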

SLIDE 33

Complete unsupervised cAE training algorithm

Starting from an unlabelled speech corpus:

(1) Run unsupervised term discovery to find word pairs.
(2) Align the frames of each word pair.
(3) Train a stacked autoencoder on the corpus (pretraining) and use its weights to initialize the network.
(4) Train the correspondence autoencoder on the aligned frame pairs, yielding the unsupervised feature extractor (sketched below).

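Pulling the pieces together, a high-level sketch of the recipe; `discover_terms`, `pretrain_stacked_autoencoder` and `batches` are hypothetical helpers standing in for steps sketched or described above:

```python
# High-level driver for the four-step recipe (hypothetical helpers).
def train_unsupervised_feature_extractor(corpus):
    pairs = discover_terms(corpus)                    # (1) unsupervised term discovery
    aligned = [dtw_align(a, b) for a, b in pairs]     # (2) frame index pairs per word pair
    pretrain_stacked_autoencoder(corpus)              # (3) pretraining initializes weights
    for frames_a, frames_b in batches(aligned):       # (4) cAE training on frame pairs
        cae_train_step(frames_a, frames_b)            #     (batches gathers actual frames)
    return autoencoder  # an intermediate encoding layer gives the features
```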

SLIDES 34-51

Evaluation of features: the same-different task

[Diagram: a collection of spoken word tokens: "apple", "pie", "grape", "apple", "apple", "like". One "apple" token is treated as the query and the remaining tokens as the terms to search. For each query-term pair, the DTW distance d_i (d_1, d_2, ..., d_N) is computed; if d_i < threshold the pair is predicted "same", otherwise "different", and each prediction is marked correct or incorrect against the true word labels.]
SLIDE 52

Evaluation of features: the same-different task

◮ Each term is treated in turn as the query.
◮ The threshold is varied to obtain a precision-recall curve.
◮ The area under the precision-recall curve is used as the final evaluation metric, referred to as average precision (AP); a computational sketch follows this list.
◮ AP is higher for feature representations which are better able to associate words of the same type and discriminate between words of different types.
◮ AP has been shown to correlate well with phone recognition error rates [Carlin et al., 2011] and has been used in several other unsupervised studies.

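A sketch of computing AP for the same-different task, assuming hypothetical arrays `distances` (the DTW distance for every test pair) and `same_word` (the true labels); scikit-learn's step-wise average precision stands in here for the exact area-under-curve computation:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def same_different_ap(distances, same_word):
    """distances: (n_pairs,) DTW distances; same_word: (n_pairs,) boolean labels."""
    # Smaller distance should mean "same word", so negate distances to get a
    # score where higher means more likely same; sweeping the decision
    # threshold then traces out the precision-recall curve summarized by AP.
    return average_precision_score(same_word, -np.asarray(distances))
```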

SLIDE 53

Baseline: partitioned universal background model

!"##$%& '())#$*(+&

!"(,#+&-#./& 012$(3#.4&

  • .51+&&

6572215+&89:& ;)1<+&=(.>& ?51.&@.5/#2& ?5.**(+& 89:&6572215+2&

6572215+&89:& @.5/#&'(+2A.51+A2& !7BC(.>&8+1A&:(>#)2&

Use posteriorgram features from the partitioned universal background model (UBM) as baseline [Jansen et al., 2013].


SLIDES 54-55

Evaluation

◮ Speech from Switchboard is used for evaluation.
◮ Pretraining data: 23 hours of untranscribed speech.
◮ We consider two sets of word pairs for training the cAE:
  1. 100k gold standard word pairs.
  2. 80k word pairs discovered using unsupervised term discovery (UTD).
◮ Test set for same-different evaluation: 11k word tokens, 60.7M pairs, 3% produced by the same speaker.
◮ Neural network architecture (optimized on development set): 39-dimensional single-frame MFCC input features, 13 layers, 100 hidden units per layer; features are taken from the fourth-last encoding layer (see the sketch below).

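As an illustrative sketch of reading features from an intermediate encoding layer (continuing the PyTorch sketches above; the indexing convention is an assumption, not the released code):

```python
import torch
import torch.nn as nn

# A 13-layer, 100-unit encoder stack on 39-dimensional MFCC input.
dims = [39] + [100] * 13
modules = []
for d_in, d_out in zip(dims[:-1], dims[1:]):
    modules += [nn.Linear(d_in, d_out), nn.Sigmoid()]
encoder = nn.Sequential(*modules)

def extract_features(frames, layer=10):
    """Output of encoding layer `layer` (layer 10 = fourth-last of 13)."""
    with torch.no_grad():
        return encoder[: 2 * layer](frames)  # each layer = (Linear, Sigmoid) pair
```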

SLIDE 56

Comparison with baseline: gold standard word pairs

Features                                                         Average precision
MFCCs with CMVN                                                              0.214
UBM with 1024 components [Jansen et al., 2013]                               0.222
1024-UBM, partitioned into 100 components [Jansen et al., 2013]              0.286
100-unit, 13-layer stacked autoencoder                                       0.215
100-unit, 13-layer correspondence autoencoder                                0.469
Supervised NN, 10 hours [Carlin et al., 2011]                                0.439
Supervised NN, 100 hours [Carlin et al., 2011]                               0.516


SLIDE 57

Evaluation using terms from unsupervised term discovery

Features                                                         Average precision
MFCCs with CMVN                                                              0.214
Best of [Jansen et al., 2013] using gold standard word pairs                 0.286
Correspondence autoencoder trained on gold standard word pairs               0.469
Correspondence autoencoder trained on UTD pairs                              0.341
Supervised NN, 10 hours [Carlin et al., 2011]                                0.439
Supervised NN, 100 hours [Carlin et al., 2011]                               0.516


SLIDE 58

Summary and conclusion

◮ Introduced the correspondence autoencoder (cAE), a novel neural network which can be trained unsupervised on unlabelled speech data.
◮ Evaluated the network in a word discrimination task.
◮ Showed a 64% relative improvement over a previous state-of-the-art GMM system.
◮ Came to within 23% of a supervised baseline.
◮ Future work: apply the cAE in further unsupervised speech processing tasks; how can the correspondence idea be used in other neural network structures?


SLIDE 59

Code

https://github.com/kamperh/speech_correspondence/

SLIDE 60

Choosing the network architecture

[Plot: average precision (AP), 0.20-0.40, against number of hidden layers (5-20), for 50, 100 and 150 hidden units per layer, with the MFCC baseline and the optimum marked.]

Development set cAE performance using gold standard word pairs. Features were taken from the fourth-last to second-last encoding layers.