Sub-Project I Prosody, Tones and Text-To-Speech Synthesis
Sin-Horng Chen (PI), Chiu-yu Tseng (Co-PI), Yih-Ru Wang (Co-PI), Yuan-Fu Liao (Co-PI), Lin-shan Lee, Hsin-min Wang
2
Outline
Members
Theme of Sub-project I
Research Roadmap
Current Achievements
Research Infrastructure
Future Direction
3
Members
Chiu-yu Tseng, Professor & Research Fellow (Co-PI), Academia Sinica
Sin-Horng Chen, Professor (PI), NCTU
Yuan-Fu Liao, Assistant Professor (Co-PI), NTUT
Yih-Ru Wang, Associate Professor (Co-PI), NCTU
Hsin-min Wang, Associate Research Fellow, Academia Sinica
Lin-shan Lee, Professor, NTU
4
Theme of Sub-Project I
[Figure: overview diagram with four blocks: Prosody Analysis and Modeling; Tone Behavior and Modeling; Applications in Text-to-speech Synthesis; Applications in Speech/Speaker Recognition. An inset plot contrasts fast and slow speakers, and more and fewer breaks.]
Latent Factor-based pitch contour model
Mean model: $Y_n = X_n + \beta_{t_n} + \beta_{pt_n} + \beta_{ft_n} + \beta_{i_n} + \beta_{f_n} + \beta_{p_n}$, $Z_n = (Y_n + \beta_{s_n})\,\gamma_{s_n}$
Shape model: $Z_n = X_n + b_{tc_n} + b_{q_n} + b_{s_n} + b_{i_n} + b_{f_n}$
Tone Sandhi
Hierarchical modeling of fluent prosody
High performance TTS
Speaker recognition
Prosodic model-based tone recognizer
5
Research Focus
How to analyze and model fluent speech prosody
– Approach 1: Hierarchical modeling of fluent speech prosody
- Develop a hierarchical prosody framework of fluent speech
- Construct modular acoustic models for: (1) F0 contours, (2) duration patterns, (3) intensity distribution and (4) boundary breaks
– Approach 2: Latent factor analysis-based modeling
- Assume there are some latent affecting factors
- Latent factor analysis for syllable duration, pitch contour, energy and inter-syllable coarticulation
- Explore the relation between latent factors and syntactic information
How to integrate these two approaches and apply them to
– Text-to-speech synthesis
– Speech/tone/speaker recognition
6
Research Roadmap
- Automatic prosodic labeling
- Prosodic phrase analysis
- High performance TTS
[Roadmap diagram labels: Mandarin, Min-south, Hakka; Current Achievements; Future Direction]
- Eigen-prosody analysis-based speaker recognition
- RNN/VQ-based prosodic modeling
- COSPRO corpus/Toolkits
- Hierarchical modeling of fluent speech prosody
- Corpus-based TTS
- Model-based TTS
- Language model+pause, PM
- Tone modeling and recognition, MLP/RNN
- HMM
- Model-based tone recognizer
- Prosodic model-based speaker recognition
- Prosodic cues-dependent LM
- Latent factor analysis: duration, pitch mean, shape, inter-syllable coarticulation
- Investigation in relation to prosody organization: F0 range and reset, naturalness and measurement, voice quality
7
Hierarchical Prosody Framework of Fluent Speech (1/4)
Hierarchical framework of fluent speech prosody for multi-phrase speech paragraphs
– Hierarchical cross-phrase patterns and contributions are found in all 4 acoustic dimensions.
– Acoustic templates are derived for each prosody level:
- F0 template
- Syllable duration templates and temporal allocation patterns
- Intensity distribution patterns
- Boundary break patterns
8
Hierarchical Prosody Framework of Fluent Speech (2/4)
The Prosody Hierarchy with Prosodic Boundaries
[Figure: tree diagram of the prosody hierarchy. Node labels: Prosodic Group, Breath Group, Prosodic Phrase (initial PP, middle PP, final PP), PW. Boundary labels: B2, B3, B4, B5.]
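As a reading aid for the hierarchy above, here is a minimal sketch of how a syllable sequence with break labels could be grouped into units. It assumes that each syllable carries the index of the break that follows it, that a break of index k or higher closes every unit up to level k, and that B2, B3, B4 and B5 close a PW, PPh, BG and PG respectively. That mapping, the `segment` helper, the `B1` placeholder for a plain syllable boundary and the toy labels are all illustrative assumptions, not taken from the slides.

```python
from typing import List

# Assumed mapping from break index to the prosodic unit it closes.
LEVELS = {2: "PW", 3: "PPh", 4: "BG", 5: "PG"}

def segment(syllables: List[str], breaks: List[int], level: int) -> List[List[str]]:
    """Split syllables into units at every break whose index is >= level."""
    units, current = [], []
    for syl, brk in zip(syllables, breaks):
        current.append(syl)
        if brk >= level:          # this break closes a unit at the requested level
            units.append(current)
            current = []
    if current:                   # trailing syllables without a closing break
        units.append(current)
    return units

# Toy input: breaks[i] is the (hypothetical) break index after syllables[i];
# 1 stands for a plain syllable boundary.
syllables = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"]
breaks = [1, 2, 1, 3, 2, 1, 2, 5]

prosodic_words = segment(syllables, breaks, 2)
prosodic_phrases = segment(syllables, breaks, 3)
prosodic_groups = segment(syllables, breaks, 5)

for level, name in LEVELS.items():
    print(name, segment(syllables, breaks, level))
```

Because a higher break also closes all lower units, every PPh boundary in this sketch is automatically a PW boundary, which keeps the units properly nested.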
9
Hierarchical Prosody Framework of Fluent Speech (3/4)
F0 cadence of multi-phrase PG (Prosodic Phrase Group)
Tide over Wave and Ripple
Syllable duration cadence of multi-phrase PG
[Figure: cadence plots at the PW level and at the PPh level, with separate panels for PG-initial, PG-medial and PG-final PPh.]
10
Hierarchical Prosody Framework of Fluent Speech (4/4)
[Figure: Duration Re-synthesis, F054C, and F0 Re-synthesis, F054C, each compared with the original at initial, medial and final positions; F0 axis in Hz, 50 to 350.]
Cross-speaker synthesis: manipulate Speaker A’s duration parameters with Speaker B’s
[Samples: Modified, Original, Original]
11
Latent Factor Analysis-based Prosody Modeling (1/3)
[Figure: Relations between Prosodic State CFs of Initial/Final and Syllable Duration Models. Companding factor of initial and final plotted against companding factor of syllable; both axes span 0.4 to 2.]
Syllable Duration Model
– Multiplicative model: $Z_n = X_n \, \gamma_{t_n} \gamma_{y_n} \gamma_{j_n} \gamma_{l_n} \gamma_{s_n}$
– Additive model: $Z_n = X_n + \gamma_{t_n} + \gamma_{y_n} + \gamma_{j_n} + \gamma_{l_n} + \gamma_{s_n}$
mean: 42.3 frames / 43.9 frames; variance: 180 frames² / 2.52 frames²; RMSE: 1.93 frames (5 ms/frame)
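To make the multiplicative duration model concrete, the sketch below applies companding factors (CFs) to a normalized duration and then inverts them. It is restricted to two illustrative factors, and the factor names (`tone`, `speaker`) and all numeric values are hypothetical, since the slides do not list them. It also checks that taking logarithms turns the product into a sum, so the multiplicative model has the same form as the additive one in the log domain.

```python
import math

# Hypothetical companding factors; the values are made up for illustration.
GAMMA = {
    "tone": {1: 1.05, 2: 1.00, 3: 0.95, 4: 0.90},
    "speaker": {"A": 1.10, "B": 0.92},
}

def observed_duration(x: float, tone: int, speaker: str) -> float:
    """Multiplicative model with two factors: Z = X * gamma_tone * gamma_speaker."""
    return x * GAMMA["tone"][tone] * GAMMA["speaker"][speaker]

def normalized_duration(z: float, tone: int, speaker: str) -> float:
    """Invert the model: divide the observed duration by the product of the CFs."""
    return z / (GAMMA["tone"][tone] * GAMMA["speaker"][speaker])

z = observed_duration(40.0, tone=3, speaker="A")   # durations in frames
x = normalized_duration(z, tone=3, speaker="A")    # recovers the input 40.0

# In the log domain the product becomes a sum of per-factor terms.
log_sum = (math.log(40.0)
           + math.log(GAMMA["tone"][3])
           + math.log(GAMMA["speaker"]["A"]))
print(z, x, math.isclose(math.log(z), log_sum))
```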
12
Latent Factor Analysis-based Prosody Modeling (2/3)
Syllable Pitch Contour Model
– Mean model: $Y_n = X_n + \beta_{t_n} + \beta_{pt_n} + \beta_{ft_n} + \beta_{i_n} + \beta_{f_n} + \beta_{p_n}$, $Z_n = (Y_n + \beta_{s_n})\,\gamma_{s_n}$
– Shape model: $Z_n = X_n + b_{tc_n} + b_{q_n} + b_{s_n} + b_{i_n} + b_{f_n}$
[Figure: The patterns of x-3-3. Pitch period (ms, 6 to 9.5) versus frame (2 to 20) for the patterns 033, 133, 233, 333, 433, 533, 020 and 030.]
[Figure: Reconstructed pitch mean]
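The slides do not say how the additive factor terms of the mean model are estimated. Purely as an illustration, the sketch below swaps in a simple backfitting loop: each factor's effects are re-estimated as the mean residual with the other factor held fixed. It uses only two hypothetical factors (`tone` and `position`) and noise-free toy data in which every tone occurs at every position, so the true effects are recoverable; none of the values come from the source.

```python
from collections import defaultdict
from statistics import mean

def fit_additive(samples, n_iter=20):
    """Backfitting for value = mu + beta_tone[tone] + beta_pos[position].

    samples: list of (value, tone, position) tuples.
    Returns (mu, beta_tone, beta_pos).
    """
    mu = mean(v for v, _, _ in samples)
    beta_t = defaultdict(float)
    beta_p = defaultdict(float)
    for _ in range(n_iter):
        # Re-estimate tone effects with position effects held fixed.
        groups = defaultdict(list)
        for v, t, p in samples:
            groups[t].append(v - mu - beta_p[p])
        for t, residuals in groups.items():
            beta_t[t] = mean(residuals)
        # Re-estimate position effects with tone effects held fixed.
        groups = defaultdict(list)
        for v, t, p in samples:
            groups[p].append(v - mu - beta_t[t])
        for p, residuals in groups.items():
            beta_p[p] = mean(residuals)
    return mu, dict(beta_t), dict(beta_p)

# Hypothetical zero-sum effects on pitch period (ms); values are made up.
true_t = {1: 1.0, 2: -0.5, 3: -1.0, 4: 0.5}
true_p = {"initial": 0.4, "medial": 0.0, "final": -0.4}
samples = [(8.0 + bt + bp, t, p)
           for t, bt in true_t.items()
           for p, bp in true_p.items()]

mu, beta_t, beta_p = fit_additive(samples)
print(mu, beta_t, beta_p)
```

Additive effects are only identified up to a constant per factor; the toy effects sum to zero within each factor, which is why the loop can return them unchanged.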
13
Latent Factor Analysis-based Prosody Modeling (3/3)
Inter-syllable coarticulation pitch contour model
[Figure: The relationship of syllable pitch contours and affecting factors]
[Figure: Reconstructed pitch contour]
14
Mandarin/Taiwanese TTS
Block diagram of TTS system
– Input: Min-Nan or Chinese text
– Text Analyzer: outputs the base-syllable sequence and linguistic features
– Acoustic Inventory: outputs the waveform sequence
– RNN-based Prosody Generator: outputs the prosodic parameters
– PSOLA Speech Synthesizer: outputs the synthetic speech
TTS samples
– Model-based TTS: female 1, female 2, female 3, female 4, female 5
– Corpus-based TTS: female 1, female 2, female 3, female 4, female 5
– Taiwanese
15
Tone Behavior Modeling and Recognition with Inter-Syllabic Features
Gabor-IFAS-based pitch detection
Four inter-syllabic features
– Ratio of duration of adjacent syllables
– Averaged pitch value over a syllable
– Maximum pitch difference within a syllable
– Averaged slope of the pitch contour over a syllable
Context-dependent tone behavior modeling
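The four inter-syllabic features listed above can be sketched as follows. The slides give only the feature names, so the exact definitions here are assumptions: the duration ratio is taken as current over previous syllable, the maximum pitch difference as maximum minus minimum F0, and the averaged slope as the mean frame-to-frame F0 difference. The function name and the sample values are hypothetical.

```python
from typing import Dict, List

def syllable_features(f0: List[float], prev_duration: float,
                      duration: float) -> Dict[str, float]:
    """Compute four features for one syllable.

    f0: pitch values (Hz) at equally spaced frames within the syllable.
    prev_duration, duration: durations (s) of the previous and current syllable.
    """
    mean_pitch = sum(f0) / len(f0)
    max_diff = max(f0) - min(f0)
    # The mean of successive differences telescopes to (last - first) / (n - 1).
    avg_slope = (f0[-1] - f0[0]) / (len(f0) - 1) if len(f0) > 1 else 0.0
    return {
        "duration_ratio": duration / prev_duration,   # assumed: current / previous
        "mean_pitch": mean_pitch,
        "max_pitch_diff": max_diff,                   # assumed: max - min
        "avg_slope": avg_slope,                       # Hz per frame
    }

# Made-up falling contour, loosely resembling a falling tone.
feats = syllable_features([200.0, 190.0, 175.0, 160.0],
                          prev_duration=0.20, duration=0.25)
print(feats)
```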
16
Eigen-Prosody Analysis-based Robust Speaker Recognition
Average speaker recognition rate (%)
System         avg    unseen handset
MAP-GMM/CMS    60.2   58.3
+GPD_S         61.9   60.5
+ML-AKI        74.9   69.4
+EPA           79.3   74.6
- Use latent semantic analysis (LSA) to efficiently extract useful speaker cues to resist handset mismatch from few training/test data
– Step 1: Automatic prosodic state labeling and speaker-keyword statistics
– Step 2: Eigen-prosody space construction using latent semantic analysis
[Figure: processing flow. Prosodic features pass through VQ-based prosody modeling and prosody state labeling to give sequences of prosody states; prosody keyword parsing with a dictionary gives prosody keywords; speaker and keyword counts form the co-occurrence matrix A (high dimensional prosody space); eigen-prosody analysis (SVD), $A = U S V^T$, gives the eigen-prosody space.]
[Figure: keywords and speakers plotted on Dimension 1 versus Dimension 2 of the eigen-prosody space; the layout separates fast from slow speakers and more from fewer breaks.]
- Experimental results on HTIMIT corpus
– Ten different handsets
– 302 speakers
– 7/3 utterances for training/test, respectively
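Step 2 above can be illustrated with a small NumPy sketch of LSA-style eigen-prosody analysis. The co-occurrence counts are invented, the orientation (rows as prosody keywords, columns as speakers) is an assumption, and the cosine scoring at the end is one plausible way to compare a test speaker with enrolled speakers rather than the scoring used in the reported experiments.

```python
import numpy as np

# Hypothetical co-occurrence counts: rows = prosody keywords, columns = speakers.
A = np.array([
    [5, 1, 0],
    [3, 0, 1],
    [0, 4, 1],
    [1, 5, 0],
    [1, 1, 6],
], dtype=float)

# Thin SVD: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                           # eigen-prosody dimensions kept
speaker_coords = (np.diag(s[:k]) @ Vt[:k]).T    # one row per enrolled speaker
keyword_coords = U[:, :k] * s[:k]               # one row per keyword

def project(counts: np.ndarray) -> np.ndarray:
    """Fold a keyword-count vector into the k-dimensional eigen-prosody space."""
    return counts @ U[:, :k]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up keyword counts from a test utterance set.
test_counts = np.array([4, 3, 0, 1, 1], dtype=float)
scores = [cosine(project(test_counts), c) for c in speaker_coords]
best = int(np.argmax(scores))
print(scores, best)
```

Folding in with `U[:, :k]` maps an enrolled speaker's own count column onto that speaker's row of `speaker_coords`, so enrolled and test speakers are compared in the same space.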
17
Research Infrastructure (1/2)
Sinica COSPRO and Toolkits: http://www.myet.com/COSPRO/
– 9 sets of Mandarin Chinese fluent speech corpora collected
– Platform developed
– Each corpus was designed to bring out different prosody features involved in fluent speech.
– Annotation processes include labeling and tagging perceived units and boundaries in fluent speech, especially the ultimate unit, the multi-phrase speech paragraph.
– Framework constructed to bring out speech paragraphs and cross-phrase prosodic relationships characteristic of narrative or discourse organization.
18
Research Infrastructure (2/2)
Tree-Bank Speech Database
– Uttered by a single female speaker
– Short paragraphs, 110,000 syllables
– Sentence-based syntactic tree annotated manually
– Pitch contour and syllable segmentation corrected manually
19
Future Direction (1/5)
Automatic prosodic labeling of Mandarin speech corpus
Analysis of prosodic phrase structure
Model-based tone recognition
High performance TTS
Speech recognition/language modeling using prosodic cues
Prosodic modeling-based robust speaker recognition
20
Future Direction (2/5)
Automatic prosodic labeling of Mandarin Speech corpus
– Goal: To construct a prosody-syntax model by exploiting the relationship of prosodic features and linguistic features, and use it for automatic labeling of various acoustic cues:
- Prosodic phrase boundary detection
- Inter-syllable/inter-word coarticulation classification
- Full/half/sandhi tone labeling for Tone 3
- Syllable pronunciation clustering
- Homograph determination
- The grouping of monosyllabic words with their neighboring words
21
Future Direction (3/5)
Analysis of prosodic phrase structure
– 4-level prosody hierarchy: PW, PPh, BG, PG
– Issues to be studied
- Detection and classification of prosodic phrases
- Relation between syntactic phrase structure and prosodic phrase structure
- Other affecting factors: speaking rate, speaking style, emotion type, spontaneity of speech
Model-based tone recognition
– Current approach
- Acoustic feature normalization
- Context-dependent tone modeling
– Main idea: Use the above statistics-based prosody models to compensate for the effects of various affecting factors on syllable pitch contour, duration, and energy contour
22
Future Direction (4/5)
High performance TTS
– Applying the sophisticated prosody models
- Modular model of fluent speech prosody
- Latent factor analysis-based modeling
– Main idea: with important prosodic cues properly labeled, the search for an optimal synthesis unit sequence in a large database can be more efficient.
- Consider both linguistic information and acoustic cues
- Treat monosyllabic words specially
– Use the above prosody-syntax models to assist in the generation of prosodic information
23