Speech Processing 15-492/18-492 Speech Synthesis Evaluation


slide-1
SLIDE 1

Speech Processing 15-492/18-492

Speech Synthesis Evaluation

slide-2
SLIDE 2

Evaluating Speech Synthesis

  • How good is the voice?
  • This voice is a 45.67
  • Is voice X better than voice Y?
  • Why?

slide-3
SLIDE 3

Evaluation

  • Objective measures
  • Run a program and get a number
  • Subjective measures
  • Have human listeners assign a score
  • Do objective and subjective scores correlate?

slide-4
SLIDE 4

Human Tests

  • Synthesis people are warped
  • The more you listen the better it becomes
  • They hear things others don’t
  • Non-synthesis people are warped
  • People very sensitive to listening conditions
  • What question do you ask
  • What hardware you play it on
  • There are (at least) two orthogonal scales
  • Understandable
  • Natural

slide-5
SLIDE 5

Standard Tests

  • DRT: diagnostic rhyme tests
  • Test confusable phones
  • “bat” vs “pat”
  • Good for identifying phone errors
  • Sometimes in carrier sentences
  Now we will say pat again.
  • Unit selection
  Just include the standard words in the database

slide-6
SLIDE 6

Standard Tests

  • SUS: Semantically unpredictable sentences
  • Det adj noun verb det adj noun
  • Automatically filled in with low frequency words
  The parklike holders threw the vague vegetables
  The simplistic consonants swam the episcopal quartet
  The dark geniuses woke the humane emptiness.
  The masterly serials withdrew the collaborative brochure
  • Test for understandability
  • Ask users to type in what they hear
  • Good as discrimination
  • Very hard for even fluent non-natives
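
The template-filling idea can be sketched in a few lines of Python. This is only an illustration: the word lists are borrowed from the example sentences on the slide, the names `make_sus` and `TEMPLATE` are hypothetical, and a real test set would draw low-frequency words from a large part-of-speech-tagged lexicon and check agreement between subject and verb.

```python
import random

# Hypothetical word pools, taken from the example sentences on the slide.
# A real SUS generator would sample low-frequency words from a tagged lexicon.
ADJECTIVES = ["parklike", "vague", "simplistic", "episcopal", "dark", "humane"]
NOUNS = ["holders", "vegetables", "consonants", "quartet", "geniuses", "emptiness"]
VERBS = ["threw", "swam", "woke", "withdrew"]

# The slot pattern given on the slide: det adj noun verb det adj noun.
TEMPLATE = ["det", "adj", "noun", "verb", "det", "adj", "noun"]


def make_sus(rng):
    """Fill each template slot with a randomly chosen word from its pool."""
    pools = {"det": ["the"], "adj": ADJECTIVES, "noun": NOUNS, "verb": VERBS}
    words = [rng.choice(pools[slot]) for slot in TEMPLATE]
    # All pool words are lowercase, so capitalize() only affects the first letter.
    return " ".join(words).capitalize()


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the same test set can be regenerated
    for _ in range(3):
        print(make_sus(rng))
```

Because the words are chosen independently, listeners cannot use meaning to guess a word they did not hear clearly, which is what makes the sentences useful for intelligibility testing.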

slide-7
SLIDE 7

Standard tests

  • MOS: mean opinion scores
  • 1-5 quality, naturalness, “like it”
  • Take average score
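
As a minimal sketch of "take average score", the function below averages a list of 1-5 ratings. The confidence interval is an addition not on the slide: it uses a normal approximation (the 1.96 factor), which is an assumption that only holds reasonably well with many ratings. The function name `mean_opinion_score` is hypothetical.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mean, half-width of an approximate 95% confidence interval).

    The interval uses a normal approximation, which is an assumption made
    for this sketch rather than something prescribed by the slides.
    """
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mean = statistics.mean(ratings)
    if len(ratings) < 2:
        return mean, float("nan")  # no spread estimate from a single rating
    stderr = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, 1.96 * stderr


if __name__ == "__main__":
    mos, half_width = mean_opinion_score([4, 5, 3, 4, 4])
    print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

Reporting the interval alongside the mean makes it easier to judge whether voice X really is better than voice Y or whether the difference is within listener noise.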

slide-8
SLIDE 8

Some experimental problems

  • Order of presentation
  • Other aids change perception
  • Showing the text makes it much easier
  • Having a talking head “improves” the synthesis
  • Hardware quality
  • Some voices better on the telephone
  • Loudspeaker quality (headphone quality)
  • Room acoustics
  • Volume
  • Understandability
  • Harder if doing other task
  • Personal preference
  • Voice is fully understandable but “creepy”
  • Voice is incomprehensible but “funny”
  • Sounds like my grade school teacher

slide-9
SLIDE 9

TTS Evaluation

  • How good are your ears?

slide-10
SLIDE 10

SUS Sentences

  • sus_00022
  • sus_00012
  • sus_00005
  • sus_00017

slide-11
SLIDE 11

SUS Sentences

  • The serene adjustments foresaw the acceptable acquisition
  • The temperamental gateways forgave the weatherbeaten finalist
  • The sorrowful premieres sang the ostentatious gymnast
  • The disruptive billboards blew the sugary endorsement
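
One common way to score what listeners type against reference sentences like these is word error rate: the word-level edit distance divided by the number of reference words. The slides do not say which scoring method was used, so treat this as an assumed choice; the helper names `normalize` and `word_error_rate` are hypothetical.

```python
import string


def normalize(text):
    """Lowercase, strip ASCII punctuation, and split into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_error_rate(reference, typed):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(typed)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # word missing from typed text
                           cur[j - 1] + 1,           # extra word in typed text
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "The serene adjustments foresaw the acceptable acquisition"
    print(word_error_rate(ref, "the serene adjustment saw the acceptable acquisition"))
```

Exact word matching is strict: a listener who types "adjustment" for "adjustments" is counted as wrong, so a real evaluation may want to decide up front how to treat spelling variants and typos.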

slide-12
SLIDE 12

TTS Evaluation

slide-13
SLIDE 13

TTS Evaluation

  • In mud eels are, in mud none are
  • A 1918 state constitutional amendment made Massachusetts one of 23 states where citizens can enact laws by plebiscite.
  • Which is which
  • The numbers are 25 and 34.
  • The numbers 20 5 and 34.
  • What is the temperature in Pittsburgh

slide-14
SLIDE 14

Objective Synthesis Tests

  • Text analysis
  • How well do you cover NSWs (non-standard words)
  • How well do you cover homographs
  • Lexical coverage
  • How often do you see a new word
  • Lexical correctness
  • How correct are pronunciations
  • For unseen words
  • For seen words
  • Phonetic intelligibility
  • DRT tests
  • Semantic intelligibility
  • SUS tests
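
Lexical coverage is the easiest of these to turn into "run a program and get a number". The sketch below measures how often a text contains a word missing from the pronunciation lexicon. The tokenizer is deliberately crude (letters and apostrophes only), and `oov_rate` and the tiny lexicon are hypothetical stand-ins for a real dictionary.

```python
import re


def oov_rate(text, lexicon):
    """Return (fraction of tokens not in the lexicon, sorted unseen word types)."""
    # Crude tokenizer for illustration: runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0, []
    unseen = [t for t in tokens if t not in lexicon]
    return len(unseen) / len(tokens), sorted(set(unseen))


if __name__ == "__main__":
    # Toy lexicon; a real one would hold the synthesizer's pronunciation entries.
    lexicon = {"what", "is", "the", "temperature", "in"}
    rate, unseen = oov_rate("What is the temperature in Pittsburgh", lexicon)
    print(f"{rate:.1%} of tokens are new: {unseen}")
```

The unseen words returned here are the ones whose pronunciations come from letter-to-sound rules, so they are a natural sample for checking pronunciation correctness on unseen words.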

slide-15
SLIDE 15

Blizzard Challenge

  • Annual Event from 2005
  • Distribute large databases of speech
  • Participants
  • Build a voice
  • Synthesize a set of sentences
  • Listeners
  • Listen and grade results

slide-16
SLIDE 16

Blizzard Challenge

  • 2005: US English synthesis, 4 voices, 1 hour each
  • 4 teams plus “Studio” (human speech)
  • 2006: US English: 1 voice: 6 hours and 1 hour
  • 12 teams
  • 2007: US English: 1 voice: 9 hours and 1 hour
  • 14 teams
  • 2008: UK English: 15 hours: Mandarin 5 hours
  • 19 teams
  • Split between industry and academia
  • Split between Asia, Europe, Americas.

slide-17
SLIDE 17

Listeners

  • Three sets of listeners
  • Speech experts (participants)
  • Paid undergrads (native speakers)
  • Volunteers
  • Types of tests
  • MOS tests (1-5)
  • SUS tests
  • DRT tests
  • About 300 listeners in total

slide-18
SLIDE 18

Listening

  • Web based
  • So everyone did it in a different environment
  • But we got access to more people
  • Asked to do it in quiet office with headphones
  • Could listen multiple times

slide-19
SLIDE 19

Blizzard Challenge Results

  • Speech Experts
  • Like synthesis better
  • Understand synthesis better
  • Volunteers don’t always finish tests
  • Undergrads sometimes finish tests
  • (or put in filler answers)
  • Results were correlated over different subgroups
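
A claim such as "results were correlated over different subgroups" can be checked with a rank correlation between the per-system scores from two listener groups. The sketch below uses Spearman's rho, which is an assumed choice (the slides do not name a statistic), and the scores in the usage example are invented purely for illustration; they are not Blizzard Challenge results.

```python
import math


def ranks(values):
    """Rank values from 1 (smallest); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    # Invented per-system MOS values for two listener groups (illustration only).
    experts = [3.9, 3.1, 2.4, 3.5]
    undergrads = [3.4, 2.8, 2.0, 3.0]
    print(f"rho = {spearman(experts, undergrads):.2f}")
```

A rank correlation fits the observation that experts rate synthesis higher overall: the groups can disagree on absolute scores yet still order the systems the same way. The same function could be used to compare an objective measure against listener scores.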

slide-20
SLIDE 20

Application Tests

  • How does it work *in* the application
  • With real application data
  • A good voice is not noticed
  • Have *real* users evaluate it
  • Give them a choice (even if artificial)
  • CEO chooses the one they like!

slide-21
SLIDE 21

Clearer Spoken Output

  • In Let’s Go Bus Domain
  • Lexical Choice
  • The next bus is at 10:23
  • The next bus is in 11 minutes
  • Prosodic variation
  • The next bus is at 10:23
  • The next bus is at, 10:23.
  • Spectral variation
  • Clear articulation (when asked to repeat)
  • The next bus is at, 10:23.

slide-22
SLIDE 22

Summary

  • TTS Evaluation is hard
  • But not impossible
  • Clear ways (that are consistent) are available
  MOS scores
  SUS
  Application based testing

slide-23
SLIDE 23
slide-24
SLIDE 24

HW2: TTS

  • Due 3:30pm Monday October 20th
  • Install Festival and Festvox
  • Find 10 errors in each of two different synthesizers
  • Build a voice
  • A Talking Clock
  • A general voice
  • (or both)

slide-25
SLIDE 25