Speech Processing 11-492/18-492

SLIDE 1

Speech Processing 11-492/18-492

Speech Synthesis Evaluation

SLIDE 2

Evaluating Speech Synthesis

• How good is the voice?
  • This voice is a 45.67
• Is voice X better than voice Y?
  • Why?

SLIDE 3

Evaluation

• Objective measures
  • Run a program and get a number
• Subjective measures
  • Have human listeners give a score
• Do objective and subjective scores correlate?
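
One way to check whether an objective measure tracks listener judgments is to compute a correlation coefficient over a set of voices. The sketch below uses the Pearson coefficient; the choice of Pearson and all the numbers are assumptions made for illustration, not something taken from the slides.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)

# Hypothetical data: one objective distortion value and one listener score per voice.
objective = [4.0, 5.0, 6.0, 7.0]    # made-up numbers, lower is better
subjective = [4.0, 3.5, 3.0, 2.5]   # made-up numbers, higher is better
r = pearson(objective, subjective)
print(f"correlation between objective and subjective scores: {r:.2f}")
```

A value near -1 or +1 suggests the program's number could stand in for a listening test on these voices; a value near 0 suggests it could not.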

SLIDE 4

Human Tests

• Synthesis people are warped
  • The more you listen the better it becomes
  • They hear things others don’t
• Non-synthesis people are warped
  • People very sensitive to listening conditions
  • What question do you ask
  • What hardware you play it on
• There are (at least) two orthogonal scales
  • Understandability
  • Naturalness

SLIDE 5

Standard Tests

• DRT: diagnostic rhyme tests
  • Test confusable phones
    • “bat” vs “pat”
  • Good for identifying phone errors
  • Sometimes in carrier sentences
    • Now we will say pat again.
• Unit selection
  • Just include the standard words in the database
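
A two-choice test like DRT can be scored by counting right and wrong answers. The sketch below assumes the commonly used chance-adjusted form, 100 * (right - wrong) / total, so that pure guessing scores about 0; that formula and the response data are assumptions, not taken from the slides. The carrier sentence is the one shown above.

```python
def carrier(word):
    """Embed a test word in the carrier sentence from the slide."""
    return f"Now we will say {word} again."

def drt_score(responses):
    """Chance-adjusted score for (word_played, word_chosen) pairs.

    Assumes every trial offers exactly two choices.
    """
    if not responses:
        raise ValueError("no responses to score")
    right = sum(1 for played, chosen in responses if played == chosen)
    wrong = len(responses) - right
    return 100.0 * (right - wrong) / len(responses)

# Hypothetical listener responses for the bat/pat pair.
responses = [("bat", "bat"), ("pat", "bat"), ("pat", "pat"), ("bat", "bat")]
print(carrier("pat"))
print(drt_score(responses))
```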

SLIDE 6

Standard Tests

• SUS: semantically unpredictable sentences
  • Det adj noun verb det adj noun
  • Automatically filled in with low-frequency words
    • The parklike holders threw the vague vegetables.
    • The simplistic consonants swam the episcopal quartet.
    • The dark geniuses woke the humane emptiness.
    • The masterly serials withdrew the collaborative brochure.
• Test for understandability
  • Ask users to type in what they hear
  • Good for discrimination
  • Very hard for even fluent non-native speakers
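
The "det adj noun verb det adj noun" template can be filled in automatically. This is a minimal sketch: the word lists are tiny hypothetical samples borrowed from the example sentences, whereas a real test would draw low-frequency words from a large part-of-speech-tagged lexicon.

```python
import random

# Hypothetical word lists taken from the example sentences on the slide.
ADJECTIVES = ["parklike", "vague", "simplistic", "episcopal", "humane", "masterly"]
NOUNS = ["holders", "vegetables", "consonants", "quartet", "geniuses", "brochure"]
VERBS = ["threw", "swam", "woke", "withdrew"]

def make_sus(rng):
    """Fill the template: det adj noun verb det adj noun."""
    adj1, adj2 = rng.sample(ADJECTIVES, 2)    # two different adjectives
    noun1, noun2 = rng.sample(NOUNS, 2)       # two different nouns
    verb = rng.choice(VERBS)
    return f"The {adj1} {noun1} {verb} the {adj2} {noun2}."

rng = random.Random(0)  # fixed seed so the same test set can be regenerated
sentences = [make_sus(rng) for _ in range(3)]
for sentence in sentences:
    print(sentence)
```

Because the words are chosen independently, listeners cannot guess a word from its neighbours, which is what makes the test measure understandability.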

SLIDE 7

Standard Tests

• MOS: mean opinion scores
  • Rate quality, naturalness, or “like it” on a 1-5 scale
  • Take the average score
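
Taking the average is a one-liner, but a bare mean hides how much listeners disagree. The sketch below also reports a 95% confidence half-width using a normal approximation; that interval and the ratings are assumptions added for illustration, not part of the slides.

```python
from math import sqrt
from statistics import mean, stdev

def mos(ratings):
    """Return (mean opinion score, 95% confidence half-width) for 1-5 ratings."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    score = mean(ratings)
    # Normal approximation; reasonable only with a fair number of listeners.
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return score, half_width

# Hypothetical ratings from eight listeners for one voice.
ratings = [4, 3, 5, 4, 4, 2, 3, 4]
score, half_width = mos(ratings)
print(f"MOS = {score:.2f} +/- {half_width:.2f}")
```

Two voices whose intervals overlap heavily probably cannot be told apart with this many listeners.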

SLIDE 8

Some experimental problems

• Order of presentation
• Other aids change perception
  • Showing the text makes it much easier
  • Having a talking head “improves” the synthesis
• Hardware quality
  • Some voices better on the telephone
  • Loudspeaker quality (headphone quality)
  • Room acoustics
  • Volume
• Understandability
  • Harder if doing another task
• Personal preference
  • Voice is fully understandable but “creepy”
  • Voice is incomprehensible but “funny”
  • Sounds like my grade school teacher
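
One common way to reduce the order-of-presentation problem is to rotate the order across listener groups so that every system is heard in every position equally often. The cyclic Latin square below is an assumed mitigation sketched for illustration; the slides name the problem but not this remedy, and the system names are hypothetical.

```python
def latin_square(items):
    """Cyclic Latin square: row g is the presentation order for listener group g."""
    n = len(items)
    return [[items[(row + col) % n] for col in range(n)] for row in range(n)]

systems = ["A", "B", "C", "D"]  # hypothetical system labels
orders = latin_square(systems)
for group, order in enumerate(orders, start=1):
    print(f"group {group}: {' '.join(order)}")
```

Each row and each column contains every system exactly once, so no system is always heard first or last.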

SLIDE 9

TTS Evaluation

• How good are your ears?

SLIDE 10

SUS Sentences

• sus_00005
• sus_00012
• sus_00017
• sus_00022

SLIDE 11

SUS Sentences

• The sorrowful premieres sang the ostentation gymnast
• The temperamental gateways forgave the weatherbeaten finalist
• The disruptive billboards blew the sugary endorsement
• The serene adjustments foresaw the acceptable acquisition
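
What a listener types can be compared against sentences like these with a word error rate. The sketch below uses word-level edit distance; scoring SUS this way, the text normalization, and the listener's transcript are all assumptions made for illustration.

```python
def normalize(text):
    """Lowercase, drop punctuation, and split into words."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return kept.split()

def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference sentence is empty")
    prev = list(range(len(hyp) + 1))      # distances from an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

reference = "The disruptive billboards blew the sugary endorsement"
typed = "the disruptive bill boards blew the sugary endorsement"  # hypothetical listener
print(f"WER = {word_error_rate(reference, typed):.2f}")
```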

SLIDE 12

TTS Evaluation

SLIDE 13

TTS Evaluation

• In mud eels are, in mud none are
• A 1918 state constitutional amendment made Massachusetts one of 23 states where citizens can enact laws by plebiscite.
• Which is which
• The numbers are 25 and 34.
• The numbers 20 5 and 34.
• What is the temperature in Pittsburgh

SLIDE 14

Objective Synthesis Tests

• Text analysis
  • How well do you cover NSWs (non-standard words)
  • How well do you cover homographs
• Lexical coverage
  • How often do you see a new word
• Lexical correctness
  • How correct are pronunciations
    • For unseen words
    • For seen words
• Phonetic intelligibility
  • DRT tests
• Semantic intelligibility
  • SUS tests
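
Lexical coverage is one of the measures above that really is "run a program and get a number". The sketch below counts how often a text contains a word missing from the pronunciation lexicon; the tokenizer and the tiny lexicon are hypothetical stand-ins for a real front end and dictionary.

```python
import re

def oov_rate(text, lexicon):
    """Return (fraction of tokens not in the lexicon, sorted list of unseen words)."""
    tokens = re.findall(r"[a-z']+", text.lower())  # crude tokenizer, letters only
    if not tokens:
        raise ValueError("no word tokens found in text")
    unseen = [t for t in tokens if t not in lexicon]
    return len(unseen) / len(tokens), sorted(set(unseen))

# Hypothetical lexicon; a real one would hold tens of thousands of entries.
lexicon = {"what", "is", "the", "temperature", "in"}
rate, unseen_words = oov_rate("What is the temperature in Pittsburgh", lexicon)
print(f"new-word rate: {rate:.2%}, unseen: {unseen_words}")
```

The unseen words are exactly the ones whose pronunciations must be predicted, so they are the natural test set for the "unseen words" correctness check.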

SLIDE 15

Blizzard Challenge

• Annual event from 2005 (15 years plus)
• Distribute large databases of speech
• Participants
  • Build a voice
  • Synthesize a set of sentences
• Listeners
  • Listen and grade results

SLIDE 16

Blizzard Challenge

• 2005: US English synthesis, 4 voices, 1 hour each
  • 4 teams plus “Studio” (human speech)
• 2006: US English: 1 voice: 6 hours and 1 hour
  • 12 teams
• 2007: US English: 1 voice: 9 hours and 1 hour
  • 14 teams
• 2008: UK English: 15 hours: Mandarin 5 hours
  • 19 teams
• 2009: UK English: 15 hours: Mandarin 5 hours
• 2010: UK English 18 hours: Mandarin 6 hours
• 2010-: Audio Books, Indian Languages, Speaking in Noise
• Split between industry and academia
• Split between Asia, Europe, America (mostly Europe and Asia)

SLIDE 17

Listeners

• Three sets of listeners
  • Speech experts (participants)
  • Paid undergrads (native speakers)
  • Volunteers
• Types of tests
  • MOS tests (1-5)
  • SUS tests
  • DRT tests
• About 300 listeners in total

SLIDE 18

Listening

• Web based
  • So everyone did it in a different environment
  • But we got access to more people
• Asked to do it in a quiet office with headphones
• Could listen multiple times

SLIDE 19

Blizzard Challenge Results

• Speech experts
  • Like synthesis better
  • Understand synthesis better
• Volunteers don’t always finish tests
• Undergrads sometimes finish tests
  • (or put in filler answers)
• Results were correlated over different subgroups
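
Agreement between listener subgroups can be checked by comparing how each group ranks the systems. The sketch below uses Spearman rank correlation in its simple no-ties form; the choice of statistic and the scores are assumptions for illustration, not the challenge's actual analysis or data.

```python
def ranks(scores):
    """Rank scores from best (1) to worst; this simple form rejects ties."""
    if len(set(scores)) != len(scores):
        raise ValueError("tied scores are not handled in this sketch")
    ordered = sorted(scores, reverse=True)
    return [ordered.index(s) + 1 for s in scores]

def spearman(a, b):
    """Spearman rank correlation between two groups' per-system scores."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 systems")
    d_squared = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Hypothetical mean scores for four systems from two listener groups.
experts = [4.2, 3.6, 3.1, 2.4]
undergrads = [3.9, 3.0, 3.2, 2.0]
rho = spearman(experts, undergrads)
print(f"rank agreement between groups: {rho:.2f}")
```

Experts may give higher absolute scores than other listeners and still produce the same ordering, which is why ranks are compared here rather than raw scores.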

SLIDE 20

Application Tests

• How does it work *in* the application
  • With real application data
  • A good voice is not noticed
• Have *real* users evaluate it
  • Give them a choice (even if artificial)
• CEO chooses the one they like!

SLIDE 21

Clearer Spoken Output

• In the Let’s Go bus domain
• Lexical choice
  • The next bus is at 10:23
  • The next bus is in 11 minutes
• Prosodic variation
  • The next bus is at 10:23
  • The next bus is at, 10:23.
• Spectral variation
  • Clear articulation (when asked to repeat)
  • The next bus is at, 10:23.
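
The lexical choice above is a decision made before synthesis: the same departure can be worded as a clock time or as a wait. A minimal sketch of generating both wordings follows; the function names and the example date are hypothetical, and this is not the Let’s Go system's code.

```python
from datetime import datetime

def absolute_phrase(departure):
    """Word the departure as a 12-hour clock time."""
    hour = departure.hour % 12 or 12
    return f"The next bus is at {hour}:{departure.minute:02d}"

def relative_phrase(now, departure):
    """Word the departure as a wait in whole minutes."""
    minutes = int((departure - now).total_seconds() // 60)
    if minutes < 0:
        raise ValueError("departure is in the past")
    unit = "minute" if minutes == 1 else "minutes"
    return f"The next bus is in {minutes} {unit}"

# Hypothetical current time chosen so the wait matches the slide's example.
now = datetime(2020, 1, 1, 10, 12)
departure = datetime(2020, 1, 1, 10, 23)
print(absolute_phrase(departure))
print(relative_phrase(now, departure))
```

Which wording is clearer is exactly the kind of question that has to be answered with real users inside the application.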

SLIDE 22

Summary

• TTS evaluation is hard
  • But not impossible
• Clear ways (that are consistent) are available
  • MOS scores
  • SUS
  • Application-based testing
