SLIDE 1

Sub-Project I Prosody, Tones and Text-To-Speech Synthesis

Sin-Horng Chen (PI), Chiu-yu Tseng (Co-PI), Yih-Ru Wang (Co-PI), Yuan-Fu Liao (Co-PI), Lin-shan Lee, Hsin-min Wang

SLIDE 2

Outline

Members
Theme of Sub-Project I
Research Roadmap
Current Achievements
Research Infrastructure
Future Direction

SLIDE 3

Members

Chiu-yu Tseng, Professor & Research Fellow (Co-PI), Academia Sinica
Sin-Horng Chen, Professor (PI), NCTU
Yuan-Fu Liao, Assistant Professor (Co-PI), NTUT
Yih-Ru Wang, Associate Professor (Co-PI), NCTU
Hsin-min Wang, Associate Research Fellow, Academia Sinica
Lin-shan Lee, Professor, NTU

SLIDE 4

Theme of Sub-Project I

[Figure: scatter plot of keywords and speakers on Dimension 1 versus Dimension 2, with regions labeled fast speakers, slow speakers, more breaks, and fewer breaks]

Prosody Analysis and Modeling
Tone Behavior and Modeling
Applications in Text-to-speech Synthesis
Applications in Speech/Speaker Recognition

Latent factor-based pitch contour model

Mean model: $Y_n = X_n + \beta_{t_n} + \beta_{pt_n} + \beta_{ft_n} + \beta_{i_n} + \beta_{f_n} + \beta_{p_n}$, $Z_n = \gamma_{s_n} (Y_n + \beta_{s_n})$
Shape model: $Z_n = X_n + b_{tc_n} + b_{q_n} + b_{s_n} + b_{i_n} + b_{f_n}$

Tone Sandhi
Hierarchical modeling of fluent prosody
High performance TTS
Speaker recognition
Prosodic model-based tone recognizer

SLIDE 5

Research Focus

How to analyze and model fluent speech prosody

– Approach 1: Hierarchical modeling of fluent speech prosody

Develop a hierarchical prosody framework of fluent speech
Construct modular acoustic models for: (1) F0 contours, (2) duration patterns, (3) intensity distribution and (4) boundary breaks

– Approach 2: Latent factor analysis-based modeling

Assume that a set of latent affecting factors underlies the observed prosody
Latent factor analysis for syllable duration, pitch contour, energy and inter-syllable coarticulation

Explore the relation between latent factors and syntactic information

How to integrate these two approaches and apply them to

– Text-to-speech synthesis
– Speech/tone/speaker recognition

SLIDE 6

Research Roadmap

[Roadmap diagram with two column labels: Current Achievements and Future Direction]

Automatic prosodic labeling
Prosodic phrase analysis
High performance TTS
Mandarin, Min-south, Hakka
Eigen prosody analysis-based speaker recognition
RNN/VQ-based prosodic modeling
COSPRO corpus/Toolkits
Hierarchical modeling of fluent speech prosody
Corpus-based TTS
Model-based TTS
Language model + pause, PM
Tone modeling and recognition, MLP/RNN
HMM
Model-based tone recognizer
Prosodic model-based speaker recognition
Prosodic cues-dependent LM
Latent factor analysis: duration, pitch mean, shape, inter-syllable coarticulation
Investigation in relation to prosody organization: F0 range and reset, naturalness and measurement, voice quality

SLIDE 7

Hierarchical Prosody Framework of Fluent Speech (1/4)

Hierarchical framework of fluent speech prosody for multi-phrase speech paragraphs

– Hierarchical cross-phrase patterns and contributions are found in all 4 acoustic dimensions.
– Acoustic templates are derived for each prosody level

F0 template
Syllable duration templates and temporal allocation patterns
Intensity distribution patterns
Boundary break patterns

SLIDE 8

Hierarchical Prosody Framework of Fluent Speech (2/4)

The Prosody Hierarchy with Prosodic Boundaries

[Figure: prosody hierarchy tree with the levels Prosodic Group, Breath Group, Prosodic Phrase (initial, middle, and final PP), and Prosodic Word (PW), separated by boundary breaks B2, B3, B4, and B5]

SLIDE 9

Hierarchical Prosody Framework of Fluent Speech (3/4)

F0 cadence of multi-phrase PG (Prosodic Phrase Group)

Tide over Wave and Ripple

Syllable duration cadence of multi-phrase PG

[Figure: syllable duration cadence panels for the PW level, the PPh level, PG-initial PPh, PG-medial PPh, and PG-final PPh, plotted against syllable position]

SLIDE 10

Hierarchical Prosody Framework of Fluent Speech (4/4)

Duration Re-synthesis, F054C

[Figure: duration re-synthesis panels for Initial, Medial, and Final positions on a 50 to 350 scale, with a panel labeled Original]

F0 Re-synthesis, F054C

[Figure: F0 re-synthesis panels showing F0 (Hz) from 50 to 350 for Initial, Middle, and Final positions, with a panel labeled Original]

Cross-speaker synthesis: manipulating Speaker A's duration parameters with Speaker B's

Modified, Original, Original

SLIDE 11

Latent Factor Analysis-based Prosody Modeling (1/3)

Syllable Duration Model

– Multiplicative model: $Z_n = X_n \, \gamma_{t_n} \gamma_{y_n} \gamma_{j_n} \gamma_{l_n} \gamma_{s_n}$
– Additive model: $Z_n = X_n + \gamma_{t_n} + \gamma_{y_n} + \gamma_{j_n} + \gamma_{l_n} + \gamma_{s_n}$

[Figure: relations between prosodic state CFs of initial/final and syllable duration models; companding factor of syllable versus companding factor of initial and final]

mean: 42.3 frames, 43.9 frames
variance: 180 frames², 2.52 frames²
RMSE: 1.93 frames (5 ms/frame)
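
The two syllable duration models can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the companding-factor values are hypothetical, and the factors are passed as a plain tuple in the order (t, y, j, l, s). Only the 5 ms frame shift is taken from the slide.

```python
from math import prod

FRAME_MS = 5.0  # frame shift stated on the slide (5 ms/frame)


def multiplicative_duration(x, factors):
    """Z_n = X_n * gamma_t * gamma_y * gamma_j * gamma_l * gamma_s."""
    return x * prod(factors)


def additive_duration(x, factors):
    """Z_n = X_n + gamma_t + gamma_y + gamma_j + gamma_l + gamma_s."""
    return x + sum(factors)


def normalize_multiplicative(z, factors):
    """Invert the multiplicative model: recover X_n from an observed Z_n."""
    return z / prod(factors)


def frames_to_ms(frames, frame_ms=FRAME_MS):
    """Convert a duration in frames to milliseconds."""
    return frames * frame_ms


# Hypothetical companding factors, in the order (t, y, j, l, s).
mult_cfs = (1.1, 0.9, 1.0, 1.05, 0.95)
add_cfs = (2.0, -1.5, 0.5, 1.0, -0.5)

z_mult = multiplicative_duration(40.0, mult_cfs)  # duration in frames
z_add = additive_duration(40.0, add_cfs)          # duration in frames
x_back = normalize_multiplicative(z_mult, mult_cfs)
rmse_ms = frames_to_ms(1.93)                      # the slide's RMSE in ms
```

With a 5 ms frame, the reported RMSE of 1.93 frames corresponds to roughly 9.65 ms.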

SLIDE 12

Latent Factor Analysis-based Prosody Modeling (2/3)

Syllable Pitch Contour Model

– Mean model: $Y_n = X_n + \beta_{t_n} + \beta_{pt_n} + \beta_{ft_n} + \beta_{i_n} + \beta_{f_n} + \beta_{p_n}$, $Z_n = \gamma_{s_n} (Y_n + \beta_{s_n})$
– Shape model: $Z_n = X_n + b_{tc_n} + b_{q_n} + b_{s_n} + b_{i_n} + b_{f_n}$

[Figure: the patterns of x-3-3; pitch period (ms) versus frame for the tone patterns 033, 133, 233, 333, 433, 533, 020, and 030]

Reconstructed pitch mean
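
As a rough illustration of how the mean model can be used to normalize an observed pitch mean, the sketch below applies and then inverts an additive stage followed by a speaker stage. It assumes the forms Y = X + (sum of six beta terms) and Z = gamma_s * (Y + beta_s); the function names and all numeric values are hypothetical and are not taken from the slides.

```python
def pitch_mean(x, betas):
    """Additive stage: Y_n = X_n + sum of the affecting-factor terms."""
    return x + sum(betas)


def apply_speaker(y, beta_s, gamma_s):
    """Speaker stage (assumed form): Z_n = gamma_s * (Y_n + beta_s)."""
    return gamma_s * (y + beta_s)


def normalize_pitch_mean(z, betas, beta_s, gamma_s):
    """Undo both stages to recover the normalized pitch mean X_n."""
    y = z / gamma_s - beta_s
    return y - sum(betas)


# Hypothetical values; six additive terms as in the mean model.
betas = (0.4, -0.1, 0.05, 0.02, -0.03, 0.1)
x_true = 7.5
z_obs = apply_speaker(pitch_mean(x_true, betas), beta_s=-0.2, gamma_s=1.1)
x_rec = normalize_pitch_mean(z_obs, betas, beta_s=-0.2, gamma_s=1.1)
```

The round trip recovers the starting value, which is the property a compensation step relies on.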

SLIDE 13

Latent Factor Analysis-based Prosody Modeling (3/3)

Inter-syllable coarticulation pitch contour model

[Figure: the relationship of syllable pitch contours and affecting factors]

[Figure: reconstructed pitch contour]

SLIDE 14

Mandarin/Taiwanese TTS

[Figure: block diagram of the TTS system. Input Min-Nan or Chinese text goes to the Text Analyzer, which passes the base-syllable sequence to the Acoustic Inventory and the linguistic features to the RNN-based Prosody Generator. The resulting waveform sequence and prosodic parameters feed the PSOLA Speech Synthesizer, which outputs synthetic speech.]

TTS samples

Model-based TTS: female 1, female 2, female 3, female 4, female 5
Corpus-based TTS: female 1, female 2, female 3, female 4, female 5
Taiwanese

SLIDE 15

Tone Behavior Modeling and Recognition with Inter-Syllabic Features

Gabor-IFAS-based pitch detection

Four inter-syllabic features

– Ratio of duration of adjacent syllables
– Averaged pitch value over a syllable
– Maximum pitch difference within a syllable
– Averaged slope of the pitch contour over a syllable

Context-dependent tone behavior modeling
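
The four inter-syllabic features can be sketched with NumPy as follows. This is a hedged illustration only: the function name, the use of frame counts as durations, the direction of the duration ratio (current over previous syllable), and the sample contour are all assumptions, and the pitch values are taken as already extracted rather than produced by Gabor-IFAS detection.

```python
import numpy as np


def syllable_features(f0, prev_len):
    """Compute four features for one syllable.

    f0: per-frame pitch values of the current syllable.
    prev_len: duration in frames of the preceding syllable (assumed > 0).
    """
    f0 = np.asarray(f0, dtype=float)
    # Mean frame-to-frame change; defined as 0 for a single-frame contour.
    slope = float(np.mean(np.diff(f0))) if len(f0) > 1 else 0.0
    return {
        "duration_ratio": len(f0) / prev_len,
        "mean_pitch": float(f0.mean()),
        "max_pitch_diff": float(f0.max() - f0.min()),
        "mean_slope": slope,
    }


# Hypothetical rising contour of four frames after an eight-frame syllable.
feats = syllable_features([200.0, 210.0, 220.0, 230.0], prev_len=8)
```

A context-dependent tone model would then be trained on such feature vectors together with the neighboring tones, which this sketch does not cover.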

SLIDE 16

Eigen-Prosody Analysis-based Robust Speaker Recognition

[Figure: bar chart of average speaker recognition rate (%)]

Method: MAP-GMM/CMS, +GPD_S, +ML-AKI, +EPA
avg: 60.2, 61.9, 74.9, 79.3
unseen handset: 58.3, 60.5, 69.4, 74.6

Use latent semantic analysis (LSA) to efficiently extract speaker cues that resist handset mismatch from limited training/test data

– Step 1: Automatic prosodic state labeling and speaker-keyword statistics
– Step 2: Eigen-prosody space construction using latent semantic analysis

[Figure: eigen-prosody analysis pipeline. Prosodic features pass through VQ-based prosody modeling and prosody state labeling to give sequences of prosody states; prosody keyword parsing with a dictionary yields prosody keywords; a co-occurrence matrix $A$ is built over speakers; eigen-prosody analysis (SVD, $A = U S V^T$) maps the high dimensional prosody space to the eigen-prosody space.]

[Figure: scatter plot of keywords and speakers on Dimension 1 versus Dimension 2, with regions labeled fast speakers, slow speakers, more breaks, and fewer breaks]

Experimental results on HTIMIT corpus

– Ten different handsets
– 302 speakers
– 7/3 utterances for training/test respectively
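
The two steps of eigen-prosody analysis can be sketched with NumPy: count prosody keywords per speaker into a co-occurrence matrix, then take a truncated SVD, as in standard LSA. The keyword labels, the toy data, the function names, and the fold-in projection for a new speaker are illustrative assumptions rather than details confirmed by the slides.

```python
import numpy as np


def cooccurrence(speaker_keywords, vocab):
    """Build A with one row per prosody keyword and one column per speaker."""
    index = {k: i for i, k in enumerate(vocab)}
    A = np.zeros((len(vocab), len(speaker_keywords)))
    for col, keywords in enumerate(speaker_keywords):
        for k in keywords:
            A[index[k], col] += 1
    return A


def eigen_prosody_space(A, rank):
    """Truncated SVD, A ~ U_k S_k V_k^T, keeping the top `rank` dimensions."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :rank], s[:rank], Vt[:rank, :]


def project(counts, U_k, s_k):
    """LSA-style fold-in of one speaker's keyword counts into the space."""
    return counts @ U_k / s_k


# Hypothetical prosody keywords observed for three speakers.
vocab = ["1-1", "1-2", "2-3"]
speaker_keywords = [["1-1", "1-2", "1-1"], ["2-3", "2-3", "1-2"], ["1-1", "2-3"]]

A = cooccurrence(speaker_keywords, vocab)
U_k, s_k, Vt_k = eigen_prosody_space(A, rank=2)
coords = project(A[:, 0], U_k, s_k)  # coordinates of the first speaker
```

Projecting a training speaker's own column reproduces that speaker's coordinates in `Vt_k`, so enrolled and test speakers can be compared in the same low-dimensional space.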

SLIDE 17

Research Infrastructure (1/2)

Sinica COSPRO and Toolkits: http://www.myet.com/COSPRO/

– 9 sets of Mandarin Chinese fluent speech corpora collected
– Platform developed
– Each corpus was designed to bring out different prosody features involved in fluent speech.
– Annotation processes include labeling and tagging perceived units and boundaries in fluent speech, especially the ultimate unit, the multiple-phrase speech paragraph.
– Framework constructed to bring out speech paragraphs and cross-phrase prosodic relationships characteristic of narrative or discourse organization.

SLIDE 18

Research Infrastructure (2/2)

Tree-Bank Speech Database

– Uttered by a single female speaker
– Short paragraphs, 110,000 syllables
– Sentence-based syntactic tree annotated manually
– Pitch contour and syllable segmentation corrected manually

SLIDE 19

Future Direction (1/5)

Automatic prosodic labeling of Mandarin speech corpus
Analysis of prosodic phrase structure
Model-based tone recognition
High performance TTS
Speech recognition/language modeling using prosodic cues
Prosodic modeling-based robust speaker recognition

SLIDE 20

Future Direction (2/5)

Automatic prosodic labeling of Mandarin Speech corpus

– Goal: To construct a prosody-syntax model by exploiting the relationship between prosodic features and linguistic features, and use it for automatic labeling of various acoustic cues:

Prosodic phrase boundary detection
Inter-syllable/inter-word coarticulation classification
Full/half/sandhi tone labeling for Tone 3
Syllable pronunciation clustering
Homograph determination
The grouping of monosyllabic words with their neighboring words

SLIDE 21

Future Direction (3/5)

Analysis of prosodic phrase structure

– 4-level prosody hierarchy: PW, PPh, BG, PG
– Issues to be studied

Detection and classification of prosodic phrases
Relation between syntactic phrase structure and prosodic phrase structure
Other affecting factors: speaking rate, speaking style, emotion type, spontaneity of speech

Model-based tone recognition

– Current approach

Acoustic feature normalization
Context-dependent tone modeling

– Main idea: Use the above statistics-based prosody models to compensate for the effects of various affecting factors on syllable pitch contour, duration, and energy contour

SLIDE 22

Future Direction (4/5)

High performance TTS

– Applying the sophisticated prosody models

Modular model of fluent speech prosody
Latent factor analysis-based modeling

– Main idea: with important prosodic cues properly labeled, the search for an optimal synthesis unit sequence in a large database can be more efficient.

Consider both linguistic information and acoustic cues
Treat monosyllabic words specially

– Use the above prosody-syntax models to assist in the generation of prosodic information
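
The search for an optimal synthesis unit sequence mentioned above is commonly set up as a dynamic-programming problem over candidate units. The sketch below shows that generic formulation, not the method planned in this project: the cost functions, the representation of a unit as a single pitch value, and all numbers are hypothetical.

```python
def select_units(candidates, target_cost, join_cost):
    """Return (total_cost, path) of the cheapest unit sequence.

    candidates: one list of candidate units per target position.
    target_cost(t, u): mismatch between unit u and the target at position t.
    join_cost(a, b): cost of concatenating unit a followed by unit b.
    """
    # best holds, for each candidate at the current position,
    # the cheapest cost and path ending in that candidate.
    best = [(target_cost(0, u), [u]) for u in candidates[0]]
    for t in range(1, len(candidates)):
        new_best = []
        for u in candidates[t]:
            cost, path = min(
                ((c + join_cost(p[-1], u), p) for c, p in best),
                key=lambda cp: cp[0],
            )
            new_best.append((cost + target_cost(t, u), path + [u]))
        best = new_best
    return min(best, key=lambda cp: cp[0])


# Toy example: units are pitch values, targets are desired pitch values.
targets = [100, 120, 110]
candidates = [[90, 105], [100, 125], [108, 130]]
cost, path = select_units(
    candidates,
    target_cost=lambda t, u: abs(u - targets[t]),
    join_cost=lambda a, b: 0.5 * abs(a - b),
)
```

In this framing, reliable prosodic labels would let the candidate lists be pruned before the search, which is one way the efficiency gain described on the slide could arise.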

SLIDE 23

Future Direction (5/5)

Speech recognition/language modeling using prosodic cues

– Automatic prosodic states labeling
– Prosodic state-dependent acoustic modeling
– Prosodic state-dependent language modeling

Prosodic modeling-based robust speaker recognition

– Automatic prosodic cues labeling
– N-gram language model to learn the prosodic behavior of speakers
– Applying principal component analysis (PCA) to N-gram to find a compact prosodic speaker space
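
One possible reading of the last two items is sketched below: each speaker's prosodic state sequence is summarized as a vector of bigram relative frequencies, and PCA (computed here via SVD) reduces those vectors to a compact speaker space. The choice of bigrams, the function names, and the toy state sequences are assumptions made for illustration.

```python
import numpy as np


def bigram_vector(states, n_states):
    """Flattened matrix of bigram relative frequencies for one speaker."""
    counts = np.zeros((n_states, n_states))
    for a, b in zip(states, states[1:]):
        counts[a, b] += 1
    total = counts.sum()
    return (counts / total).ravel() if total else counts.ravel()


def pca(X, n_components):
    """Project the rows of X onto their top principal components."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    components = Vt[:n_components]
    return (X - mean) @ components.T, components, mean


# Hypothetical prosodic state sequences (two states) for three speakers.
sequences = [[0, 1, 0, 1, 0], [1, 1, 0, 0, 1], [0, 0, 0, 1, 1]]
X = np.stack([bigram_vector(s, n_states=2) for s in sequences])
speaker_space, components, mean = pca(X, n_components=2)
```

A test speaker would be mapped with the same `mean` and `components` and compared with enrolled speakers in the reduced space.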