A New Adaptation Method for Speaker-Model Creation in High-Level Speaker Verification

Shi-Xiong Zhang and Man-Wai Mak


SLIDE 1

S.X. Zhang / Man-Wai Mak
PCM 2007

High-Level Speaker Verification

A New Adaptation Method for Speaker-Model Creation in High-Level Speaker Verification

Shi-Xiong Zhang and Man-Wai Mak
Dept. of Electronic and Information Engineering
The Hong Kong Polytechnic University

SLIDE 2

Outline

  • Introduction to Speaker Verification
  • GMM System and MAP Adaptation
  • New Adaptation Methods for High-Level Speaker Verification
  • Experiments and Results

SLIDE 3

What Is Speaker Verification?

  • To verify the identity of a claimant based on his/her own voice, i.e., to determine whether a person is who he/she claims to be. ("I am Mary." Is this Mary's voice?)

New Adaptation Methods for High-Level Speaker Verification

SLIDE 4

Two Phases of Speaker Verification

Enrollment Phase:

  • Enrollment speech for a target speaker (e.g., Bob or Sally) passes through feature extraction and model training, producing models for that target speaker.

Verification Phase:

  • Given a claimed identity (e.g., Sally), the test speech passes through feature extraction and is verified against the claimed speaker's model, yielding an accept/reject decision.

Modular Representation of Speaker Verification

SLIDE 5

Traditional Speaker Modeling

  • Gaussian mixture model (GMM): the mixture density function is a linear combination of several Gaussian densities:

p(x | Λ^s) = Σ_{i=1}^{M} w_i^s p_i^s(x)

p_i^s(x) = 1 / ( (2π)^{D/2} |Σ_i^s|^{1/2} ) · exp{ −(1/2) (x − μ_i^s)' (Σ_i^s)^{−1} (x − μ_i^s) }

Λ^s = { w_i^s, μ_i^s, Σ_i^s }, i = 1, …, M

[Figure: spectrogram of an utterance (time vs. frequency, 0-4000 Hz) illustrating the low-level features x used for speaker verification.]
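As a concrete illustration, the mixture density above can be evaluated in a few lines of NumPy. This is a hedged sketch, not the system behind the slides: it assumes diagonal covariances, and all function and variable names are invented for this example.

```python
import numpy as np

def gmm_logpdf(x, weights, means, variances):
    """Log of p(x | Lambda) = sum_i w_i N(x; mu_i, Sigma_i)
    for a diagonal-covariance GMM (illustrative sketch only)."""
    x = np.asarray(x, dtype=float)
    D = x.size
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        mu = np.asarray(mu, dtype=float)
        var = np.asarray(var, dtype=float)
        # log N(x; mu, diag(var)) = -1/2 [ D log 2pi + sum log var + sum (x-mu)^2 / var ]
        log_gauss = -0.5 * (D * np.log(2 * np.pi)
                            + np.sum(np.log(var))
                            + np.sum((x - mu) ** 2 / var))
        log_terms.append(np.log(w) + log_gauss)
    # log-sum-exp over mixture components for numerical stability
    return float(np.logaddexp.reduce(log_terms))
```

For two identical equal-weight components this reduces to the log-density of a single Gaussian, which is an easy sanity check.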

SLIDE 6

Verification Based on Speaker and Background GMM Models

  • Universal Background Model (UBM): a large GMM trained to represent the distribution of speaker-independent features:

p(x | Λ^b) = Σ_{i=1}^{M} w_i^b p_i^b(x)

  • The speaker GMM p(x | Λ^s) is used to represent a specific user; it is derived from the UBM by MAP adaptation.
  • Verification: features X_t extracted from the test utterance are scored against both the GMM speaker model and the GMM UBM; the difference of the two log-likelihoods is accumulated (Σ, +/−) into the decision score.

[Figure: feature extraction feeds the GMM speaker model and the GMM UBM; their score difference drives the decision. Low-level features for speaker verification.]
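The speaker-versus-UBM scoring step can be sketched as a frame-averaged log-likelihood ratio. The callables, the zero threshold, and all names here are assumptions for illustration only, not the deck's implementation.

```python
def frame_llr_score(frames, speaker_logpdf, ubm_logpdf):
    """Average over frames of log p(x | speaker) - log p(x | UBM).
    `speaker_logpdf` and `ubm_logpdf` are callables returning log-densities."""
    total = sum(speaker_logpdf(x) - ubm_logpdf(x) for x in frames)
    return total / len(frames)

def decide(score, threshold=0.0):
    """Accept the claimed identity when the score exceeds the threshold."""
    return "accept" if score > threshold else "reject"
```

A positive score means the frames fit the claimed speaker's model better than the background model.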

SLIDE 7

High-Level Features

  • Humans use several levels of perceptual cues for speaker recognition:

Perceptual Cues                                  | Depends on
-------------------------------------------------|-------------------------------------------------
Pronunciations, idiolect (word usage)            | Socio-economic status, education, place of birth
Prosody (rhythm), speed of speech, intonation    | Personality type, parental influence
Acoustic aspects of speech                       | Physical structure of the vocal apparatus

  • High-level cues (learned traits) are difficult to extract automatically; low-level cues (physical traits) are easy to extract automatically.

High-Level Speaker Verification

SLIDE 8

What Are Articulatory Features?

Articulatory features (AFs) are abstract classes that describe the movements and positions of different articulators during speech production.

Two AFs were adopted for Articulatory Feature Conditional Pronunciation Modeling (AFCPM).

[Figure: articulator configurations for /u/ as produced by Speaker 1 and Speaker 2.]

SLIDE 9

Articulatory Feature Conditional Pronunciation Modeling (AFCPM)

Creating Speaker Models and Background Models:

  • The MFCC sequence {X_1, …, X_T} from the utterance is fed to a null-grammar phoneme recognizer, producing the phoneme sequence {q_1, …, q_T}.
  • The same MFCCs are fed to two AF-MLPs, one for manner and one for place, producing the label sequences {l_1^M, …, l_T^M} and {l_1^P, …, l_T^P}.
  • From these streams, one pronunciation model is trained per phoneme, from P^s(m, p | q_1) to P^s(m, p | q_46): the unadapted AFCPM of speaker s.

AFCPM Training: the data-sparseness problem. The few enrollment utterances of a speaker provide too few frames per phoneme to estimate all 46 phoneme-dependent distributions reliably.

SLIDE 10

Contribution of Our Paper

Adaptation & Model Creation: instead of estimating the speaker AFCPMs P^s(m, p | q_1), …, P^s(m, p | q_46) directly from the sparse enrollment data of target speaker s, they are created by adapting the background AFCPMs P^b(m, p | q_1), …, P^b(m, p | q_46) using that enrollment data.

[Figure: enrollment data (spectrogram) for target speaker s, together with the AFCPM of the background model, feeds the adaptation and model-creation step, which outputs the adapted AFCPM of speaker s.]

SLIDE 11

Verification Based on Speaker and Background AFCPM Models

  • Feature extraction yields, for each frame t of the test utterance, a manner label l_t^M, a place label l_t^P, and a phoneme q_t.
  • Each frame is scored against the adapted AFCPM speaker model P^s(m, p | q) (obtained by adaptation) and the AFCPM UBM P^b(m, p | q); the difference of the two log-probabilities is accumulated (Σ, +/−) into the verification score for the decision.

[Figure: feature extraction feeds the AFCPM speaker model and the AFCPM UBM; their score difference drives the decision.]
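A minimal sketch of this AFCPM scoring, assuming each model is stored as a dict mapping phoneme to a dict of (manner, place) probabilities; the layout and the toy values in the test are assumptions, not the paper's implementation.

```python
import math

def afcpm_score(frames, spk_model, bkg_model):
    """frames: (manner, place, phoneme) triples from the test utterance.
    Accumulates log P^s(m, p | q) - log P^b(m, p | q) per frame and averages."""
    score = 0.0
    for m, p, q in frames:
        score += math.log(spk_model[q][(m, p)]) - math.log(bkg_model[q][(m, p)])
    return score / len(frames)
```

Frames whose (manner, place) pattern is more typical of the claimed speaker than of the background push the score positive.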

SLIDE 12

AF Modeling: Traditional MAP Adaptation

The unadapted phoneme-dependent AFCPM of speaker s, P^s(m, p | q), is combined with the phoneme-dependent AFCPM of the background model, P^b(m, p | q), by MAP adaptation to give the adapted AFCPM of speaker s:

P̂^s(m, p | q) = β_q P^s(m, p | q) + (1 − β_q) P^b(m, p | q)

where

β_q = #((*, *, q) in the utterances of speaker s) / [ #((*, *, q) in the utterances of speaker s) + r ]

and r is a fixed relevance factor.

[Figure: spectrogram of the enrollment data of speaker s.]
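Traditional MAP adaptation of one phoneme-dependent AFCPM can be sketched as follows, assuming raw (manner, place) frame counts for the speaker and an already-normalized background model; the default relevance factor r = 10 and all names are illustrative assumptions.

```python
def map_adapt_afcpm(spk_counts, bkg_probs, r=10.0):
    """MAP-adapt one phoneme-dependent AFCPM:
        P_hat^s(m,p|q) = beta_q * P^s(m,p|q) + (1 - beta_q) * P^b(m,p|q)
        beta_q = N_s(q) / (N_s(q) + r)

    spk_counts: {(m, p): count of frames with phoneme q in speaker s's data}
    bkg_probs:  {(m, p): P^b(m, p | q)}, summing to 1 over all (m, p)
    """
    n_q = sum(spk_counts.values())   # #((*,*,q) in speaker s's utterances)
    beta = n_q / (n_q + r)
    # Maximum-likelihood estimate of the speaker's distribution for phoneme q
    ml = {mp: c / n_q for mp, c in spk_counts.items()} if n_q else {}
    return {mp: beta * ml.get(mp, 0.0) + (1.0 - beta) * pb
            for mp, pb in bkg_probs.items()}
```

With no enrollment frames (N_s(q) = 0) the adapted model falls back to the background model; with many frames it approaches the speaker's ML estimate.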

SLIDE 13

Limitation of Traditional MAP Adaptation

SLIDE 14

Limitation of Traditional MAP Adaptation

SLIDE 15

AF Modeling: Proposed New Adaptation Method

The adapted AFCPM of speaker s is created from four models: the unadapted phoneme-dependent AFCPM of speaker s, P^s(m, p | q); the AFCPM of the background model, P^b(m, p | q); and the unadapted phoneme-independent models P^s(m, p | *) and P^b(m, p | *):

P̂^s(m, p | q) = β_q P^s(m, p | q) + (1 − β_q) [ α_q P^b(m, p | q) + (1 − α_q) P^b(m, p | q) · P^s(m, p | *) / P^b(m, p | *) ]

where

α_q = #((*, *, q) in the utterances of all background speakers) / [ #((*, *, q) in the utterances of all background speakers) + r ]

[Figure: spectrogram of the enrollment data of speaker s.]
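The new adaptation rule can be sketched the same way. Because of the ratio term, the bracketed prior need not sum to one, so this sketch renormalizes the adapted distribution at the end; that renormalization, like the names and the default r, is an assumption of this example rather than a detail stated on the slides.

```python
def new_adapt_afcpm(spk_counts_q, bkg_probs_q, spk_pi, bkg_pi, n_bkg_q, r=10.0):
    """Sketch of the proposed adaptation for one phoneme q.

    spk_counts_q: {(m, p): count} for phoneme q in speaker s's data
    bkg_probs_q:  P^b(m, p | q)
    spk_pi:       phoneme-independent P^s(m, p | *)
    bkg_pi:       phoneme-independent P^b(m, p | *)
    n_bkg_q:      #((*,*,q) in the utterances of all background speakers)
    """
    n_q = sum(spk_counts_q.values())
    beta = n_q / (n_q + r)
    alpha = n_bkg_q / (n_bkg_q + r)
    ml = {mp: c / n_q for mp, c in spk_counts_q.items()} if n_q else {}
    adapted = {}
    for mp, pb in bkg_probs_q.items():
        # prior = alpha P^b(m,p|q) + (1-alpha) P^b(m,p|q) P^s(m,p|*) / P^b(m,p|*)
        prior = alpha * pb + (1.0 - alpha) * pb * spk_pi[mp] / bkg_pi[mp]
        adapted[mp] = beta * ml.get(mp, 0.0) + (1.0 - beta) * prior
    z = sum(adapted.values())            # renormalize (assumption, see above)
    return {mp: v / z for mp, v in adapted.items()}
```

With no phoneme-q data from either the speaker or the background pool, the model is the background distribution reshaped by the speaker's phoneme-independent pronunciation pattern.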

SLIDE 16

AF Modeling: Traditional MAP Adaptation

The unadapted phoneme-dependent AFCPM of speaker s is combined with the phoneme-dependent AFCPM of the background model:

P̂^s(m, p | q) = β_q P^s(m, p | q) + (1 − β_q) P^b(m, p | q)

β_q = #((*, *, q) in the utterances of speaker s) / [ #((*, *, q) in the utterances of speaker s) + r ]

[Figure: spectrogram of the enrollment data of speaker s.]

SLIDE 17

Method D: New Adaptation

[Figure, panels (a)-(h): manner-place distributions (manner class m = 1-6, place class p = 1-10) for target speakers 1018 and 1042 and for the female background model: the phoneme-independent distributions P^s(m, p | *) and P^b(m, p | *), the phoneme-dependent distributions P^s(m, p | q = /ch/) and P^b(m, p | q = /ch/), and the adapted P^s(m, p | q = /ch/) of the two speakers; d = 14.1695, r = 0.8013.]

Proposed Adaptation Method for High-Level Speaker Verification

SLIDE 18

Comparing the Created Speaker Models Based on Methods A and D: Conventional MAP Adaptation vs. the Proposed New Adaptation

SLIDE 19

Comparison and Results

[Figure: DET curves (false acceptance rate vs. false rejection rate) comparing MAP adaptation and the proposed method.]

Database | Purpose
---------|--------
SPIDRE   | To train the null-grammar phone recognizer
HTIMIT   | To train the AF-MLPs
NIST99   | To create the background models and mapping functions
NIST00   | To create speaker models and evaluate the performance

  • No. of target speakers: 547
  • No. of speaker trials: 3,135
  • No. of impostor attempts: 30,151
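The error rates plotted on the DET curve can be computed from lists of genuine and impostor trial scores as sketched below; this is illustrative code, not the evaluation tooling used for the experiments.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """False acceptance rate (impostors scoring >= threshold) and
    false rejection rate (genuine speakers scoring < threshold)."""
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    return far, frr
```

Sweeping the threshold traces out the DET curve; the equal error rate is the operating point where the two rates coincide.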
SLIDE 20

Thanks!

Q&A

High-Level Speaker Verification

SLIDE 21

Complementary

High-Level Speaker Verification

SLIDE 22

How Are AFs Extracted?

Feature Extraction:

  • At frame position t, 9 consecutive frames of MFCCs (X_t), centred at frame t (frames t−4, …, t, …, t+4), are input to an MLP.
  • The MLP for the place of articulation has an input layer, a hidden layer, and an output layer giving the posteriors P(Place = Silence | X_t), …, P(Place = Glottal | X_t); a maxnet selects the winning class. The MLP for manner is analogous.
  • The frame-level AF labels are

l_t^M = arg max_m P(Manner = m | X_t)
l_t^P = arg max_p P(Place = p | X_t)

Training (Creating Speaker Models and Background Models):

  • The MFCC sequence {X_1, …, X_T} from the utterance is passed through the null-grammar phoneme recognizer to obtain {q_1, …, q_T}, and through the AF-MLPs for manner and place to obtain the label sequences {l_1^M, …, l_T^M} and {l_1^P, …, l_T^P}.
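The maxnet stage, picking the winning class for each frame, can be sketched as follows; the manner class list comes from the slides, while the posterior array layout and the names are assumptions.

```python
import numpy as np

MANNER_CLASSES = ["Silence", "Vowel", "Stop", "Fricative", "Nasal",
                  "Approximant/Lateral"]

def af_labels(posteriors, classes):
    """posteriors: (T, K) array of MLP outputs P(class_k | X_t).
    Returns l_t = argmax_k P(class_k | X_t) for each frame t."""
    return [classes[k] for k in np.argmax(posteriors, axis=1)]
```

The same function serves for the place MLP with a ten-entry class list.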

SLIDE 23

Phoneme-Dependent AFCPM Training

P^s(m, p | q) = P(Manner = m, Place = p | Phoneme = q, Speaker = s)
             = No. of {l_t^m = m, l_t^p = p, q_t = q} from the data of speaker s
               / No. of {q_t = q} from the data of speaker s

Example:

Frame, t | Phoneme, q_t | Manner label, l_t^m | Place label, l_t^p
1        | /t/          | Vowel               | Low
2        | /t/          | Silence             | Silence
3        | /t/          | Silence             | Silence
4        | /t/          | Silence             | Silence
5        | /t/          | Silence             | Silence
6        | /t/          | Stop                | Coronal
7        | /aa/         | Vowel               | Low

P(Manner = Vowel, Place = Low | /t/) = 1/6,
P(Manner = Silence, Place = Silence | /t/) = 4/6,
P(Manner = Stop, Place = Coronal | /t/) = 1/6,
and all other entries are equal to 0.

Manner classes: Silence, Vowel, Stop, Fricative, Nasal, Approximant/Lateral.
Place classes: Silence, High, Middle, Low, Labial, Dental, Coronal, Palatal, Velar, Glottal.

One such AFCPM is built for each phoneme, from P^s(m, p | q_1) to P^s(m, p | q_46), giving the phoneme-dependent speaker models of speaker s.
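The relative-frequency counting above can be sketched directly; the test reproduces the /t/ example from this slide. Function and variable names are illustrative.

```python
from collections import Counter, defaultdict

def train_afcpm(frames):
    """frames: (manner_label, place_label, phoneme) per frame.
    Returns {phoneme: {(m, p): P(m, p | phoneme)}} by relative-frequency counting."""
    joint = defaultdict(Counter)
    totals = Counter()
    for m, p, q in frames:
        joint[q][(m, p)] += 1
        totals[q] += 1
    return {q: {mp: c / totals[q] for mp, c in cnt.items()}
            for q, cnt in joint.items()}
```

Entries never observed for a phoneme simply do not appear in its dict, which is the zero-probability case the slide mentions and the motivation for the adaptation methods.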

SLIDE 24

Traditional MAP Adaptation (Method A)

P̂^s(m, p | q) = β_q P^s(m, p | q) + (1 − β_q) P^b(m, p | q)

β_q = #((*, *, q) in the utterances of speaker s) / [ #((*, *, q) in the utterances of speaker s) + r ]

SLIDE 25

Proposed Adaptation Methods B and C

  • Adaptation Method B
  • Adaptation Method C

SLIDE 26

Proposed New Adaptation Method

Adaptation Method D:

P̂^s(m, p | q) = β_q P^s(m, p | q) + (1 − β_q) [ α_q P^b(m, p | q) + (1 − α_q) P^b(m, p | q) · P^s(m, p | *) / P^b(m, p | *) ]

α_q = #((*, *, q) in the utterances of all background speakers) / [ #((*, *, q) in the utterances of all background speakers) + r ]