[PPT] - Outliers Detection vs. Control Questions to Ensure Reliable Results PowerPoint Presentation

SLIDE 1

Outliers Detection vs. Control Questions to Ensure Reliable Results in Crowdsourcing. A Speech Quality Assessment Case Study

Rafael Zequeira Jiménez, Laura Fernández Gallardo, Sebastian Möller
Quality and Usability Lab, Technische Universität Berlin

HumL@WWW2018 – 1st International Workshop on Augmenting Intelligence with Humans-in-the-Loop

SLIDE 2

Crowdsourcing

SLIDE 3

Motivation

Speech quality is important for the Quality of Experience (QoE) in:
  audio books
  virtual or robotic conversational agents
The collected ratings can be used to train AI systems to predict the speech quality automatically

SLIDE 4

Motivation

Speech quality experiments are traditionally conducted in the laboratory:
  Professional audio equipment
  Soundproof room
  Limited number of participants

SLIDE 5

Crowdsourcing Study

Conducted a speech quality assessment experiment
Crowd-workers were presented with 20 speech stimuli
Opinion about overall quality gathered on a 5-point scale

SLIDE 6

Speech Material:

Database number 501 from ITU-T Rec. P.863
4 Germans were recorded per condition
200 speech stimuli (9 s long on avg.)
50 degradation conditions:
  narrow- & wide-band
  temporal clipping
  signal-correlated noise
  combinations of these degradations
The database contains quality ratings of the 200 stimuli made by 24 different native German listeners, in accordance with ITU-T Rec. P.800

SLIDE 7

Study Conditions:

Address the study to native Germans
Collect 24 ratings per stimulus from different listeners
Experiment in accordance with ITU-T Rec. P.800

SLIDE 8

Crowdsourcing Platform:

German-based CS platform
Reported 1 million global users in September 2017
Most of their users are from German-speaking countries

SLIDE 9

Crowdsourcing Experiment

Screening task to recruit German listeners
Speech quality assessment task:
  Qualification phase
  Speech quality assessment

SLIDE 10

Crowdsourcing Experiment

Qualification
  consent request
  use of headphones
  audio math trapping question
  5 stimuli as an anchor
Speech Quality Assessment
  introduction
  environment recording (up to 15 s)
  20 stimuli to rate
  2 trapping questions

SLIDE 12

Results

87 workers participated in the study
8 workers failed the Qualification phase
53 unique listeners:
  60.4% males
  96.2% native Germans
  provided 4840 ratings
The collected ratings account for 24 to 26 assessments from different listeners per file

SLIDE 13

Crowdsourcing vs. Laboratory

Spearman’s rank-order correlation:
  rho = 0.864 (p<0.001)
  Monotonic relationship between Lab-MOS and CS-MOS
Root Mean Square Error:
  RMSE = 0.474
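
The two comparison metrics on this slide can be illustrated with a minimal sketch. It assumes the laboratory and crowdsourcing MOS are available as two equally ordered lists with one value per stimulus. The function names and the example values are invented for illustration and are not taken from the study.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rmse(x, y):
    """Root mean square error between paired values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Invented MOS values for five stimuli (not data from the study).
lab_mos = [1.2, 2.5, 3.1, 4.0, 4.6]
cs_mos = [1.5, 2.4, 3.6, 3.9, 4.4]
print(spearman_rho(lab_mos, cs_mos), rmse(lab_mos, cs_mos))
```

Because rho only compares rank orders, it can be high even when the crowdsourcing scores are shifted against the laboratory scores, which is why the RMSE is reported alongside it.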

SLIDE 14

Filtering from unreliable workers

Work in [1] and [2] recommends:
  the use of trapping questions to catch inattentive users
  when a user fails, all of their ratings are discarded
This approach was effective in [1] and slightly improved the results in [2]

[1] B. Naderi, T. Polzehl, I. Wechsung, F. Köster, and S. Möller, “Effect of Trapping Questions on the Reliability of Speech Quality Judgments in a Crowdsourcing Paradigm,” in Interspeech, 2015, pp. 2799–2803.
[2] R. Zequeira Jiménez, L. Fernández Gallardo, and S. Möller, “Scoring Voice Likability using Pair-Comparison: Laboratory vs. Crowdsourcing Approach,” in Ninth International Conference on Quality of Multimedia Experience (QoMEX), 2017, pp. 1–3.

SLIDE 15

Filtering from unreliable workers

A worker is unreliable or untrustworthy when:
  s/he fails the trapping question in the SQAT
  s/he fails the Qualification more than once
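
The two criteria above could be applied as in the following minimal sketch. It assumes that the set of workers who failed an SQAT trapping question and a per-worker count of failed Qualification attempts are already known. The `Rating` record, the function names and the example data are hypothetical and not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rating:
    worker: str    # hypothetical worker id
    stimulus: str  # hypothetical stimulus id
    score: int     # opinion score on the 5-point scale


def unreliable_workers(failed_sqat_trap, qualification_failures):
    """Workers who failed a trapping question in the SQAT, or who
    failed the Qualification more than once."""
    repeated = {w for w, n in qualification_failures.items() if n > 1}
    return set(failed_sqat_trap) | repeated


def discard_workers(ratings, excluded):
    """Drop every rating submitted by an excluded worker."""
    return [r for r in ratings if r.worker not in excluded]


# Invented example data, not taken from the study.
ratings = [
    Rating("A", "s1", 4), Rating("A", "s2", 2),
    Rating("B", "s1", 5), Rating("B", "s2", 1),
    Rating("C", "s1", 3),
]
excluded = unreliable_workers(
    failed_sqat_trap={"B"},
    qualification_failures={"A": 1, "C": 2},
)
kept = discard_workers(ratings, excluded)
print(sorted(excluded), len(kept))
```

Discarding at the worker level removes all ratings of a flagged worker, including ratings that may have been given attentively.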

SLIDE 17

Filtering from unreliable workers

Discarded 320 ratings in total from W4, W5, W7
W6 did not conduct the SQAT
Method: “filtering by trapping question” (F-TQ)
Spearman’s rank-order correlation on 4520 ratings:
  rho = 0.862 (p<0.001)
When discarding all the workers (F-TQ’):
  rho = 0.854 (p<0.001)

SLIDE 18

Outlier Detection

Outliers: ratings above 1.5 times the interquartile range (IQR), depicted by circles
Extreme outliers: ratings at 3.0 IQR or above, depicted by asterisks
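
The two thresholds can be illustrated with a minimal sketch. It assumes the usual boxplot convention that distances are measured from the box edges (Q1 and Q3), and it estimates quartiles by inclusive interpolation. The slide states neither detail, nor how ratings are grouped before the IQR is computed, so all three are assumptions. The function name and the example scores are hypothetical.

```python
import statistics


def classify_ratings(ratings):
    """Label each rating as 'normal', 'outlier' or 'extreme'."""
    # Assumption: quartiles estimated with the "inclusive" method.
    q1, _median, q3 = statistics.quantiles(ratings, n=4, method="inclusive")
    iqr = q3 - q1
    labels = []
    for r in ratings:
        # Distance outside the box; 0 when the rating lies within [Q1, Q3].
        distance = max(q1 - r, r - q3, 0)
        if distance == 0:
            labels.append("normal")
        elif distance >= 3.0 * iqr:
            labels.append("extreme")
        elif distance > 1.5 * iqr:
            labels.append("outlier")
        else:
            labels.append("normal")
    return labels


# Invented ratings of one stimulus on the 5-point scale.
scores = [5, 4, 1, 5, 2, 4, 5, 4, 5]
labels = classify_ratings(scores)
print(list(zip(scores, labels)))
```

On a 5-point scale a rating can only become an extreme outlier when the IQR of its group is small, so the grouping choice strongly affects how many ratings are flagged.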

SLIDE 19

Outlier Detection

Discarded 122 ratings identified as extreme outliers
Method: “filtering by outlier detection 1” (F-OD1)
Spearman’s rank-order correlation:
  rho = 0.863 (p<0.001)
  still not better than the first coefficient, obtained when no data was discarded

SLIDE 20

Outlier Detection 2

Discarded 1480 ratings from 12 workers that were outliers or extreme outliers three times or more [5].
Method: “filtering by outlier detection 2” (F-OD2)
Spearman’s rank-order correlation:
  rho = 0.867 (p<0.001)
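
The worker-level rule described here could look like the following minimal sketch. It assumes a list with one worker id per rating that was already flagged as an outlier or extreme outlier. The function names, the tuple layout and the example data are hypothetical.

```python
from collections import Counter


def repeat_offenders(flagged_workers, min_flags=3):
    """Workers with at least `min_flags` flagged ratings.

    `flagged_workers` holds one worker id per rating that was flagged
    as an outlier or extreme outlier.
    """
    counts = Counter(flagged_workers)
    return {w for w, n in counts.items() if n >= min_flags}


def discard_workers(ratings, excluded):
    """Drop all (worker, stimulus, score) tuples of excluded workers."""
    return [r for r in ratings if r[0] not in excluded]


# Invented example data, not taken from the study.
flagged = ["A", "B", "A", "A", "C", "C"]
ratings = [("A", "s1", 1), ("B", "s1", 4), ("C", "s1", 5), ("A", "s2", 5)]
excluded = repeat_offenders(flagged)
kept = discard_workers(ratings, excluded)
print(sorted(excluded), kept)
```

Unlike F-OD1, which drops only the flagged ratings, this rule drops every rating of a flagged worker, which fits the much larger number of discarded ratings reported for F-OD2.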

[5] B. Naderi, Motivation of Workers on Microtask Crowdsourcing Platforms. Springer, 2018.

SLIDE 21

Alternative Approach

Applied F-OD1 and F-OD2 and discarded 1529 ratings in total.
Identified the outliers made by all the workers that failed the trapping questions, then removed 17 ratings.
Method: F-TQ-OD
Spearman’s rank-order correlation on 3294 ratings:
  rho = 0.868 (p<0.001)
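
The counts on this slide can be cross-checked against the 4840 collected ratings reported in the Results slide. The short sketch below only repeats that arithmetic. The variable names are illustrative.

```python
total_ratings = 4840          # ratings collected in the study
discarded_by_outliers = 1529  # removed by F-OD1 and F-OD2
discarded_trap_outliers = 17  # outliers of workers who failed trapping questions

discarded = discarded_by_outliers + discarded_trap_outliers
remaining = total_ratings - discarded
share_discarded = 100 * discarded / total_ratings

print(discarded, remaining, round(share_discarded, 1))
```

The result matches the 3294 ratings used for the correlation, as well as the 1546 discarded ratings and the 31.9% share listed for F-TQ-OD in the overview slides.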

SLIDE 22

Results Overview

Approach         Ratings discarded   rho      RMSE
(no filtering)   -                   0.864*   0.474
F-TQ             320                 0.862*   0.476
F-TQ’            780                 0.854*   0.480
F-OD1            122                 0.863*   0.477
F-OD2            1480                0.867*   0.474
F-TQ-OD          1546                0.868*   0.479

*p < 0.001

SLIDE 23

Results Comparison

Approach: method, workers discarded, ratings discarded
[6]: gold standard questions, 25%, 75%
[7]: verification questions, 34.3%
F-TQ-OD: trapping question + outliers detection, 22%, 31.9%

[6] T. Hoßfeld, M. Seufert, M. Hirth, T. Zinner, P. Tran-Gia, and R. Schatz, “Quantification of YouTube QoE via Crowdsourcing,” in 2011 IEEE International Symposium on Multimedia, 2011, pp. 494–499.
[7] J. Redi and I. Povoa, “Crowdsourcing for Rating Image Aesthetic Appeal: Better a Paid or a Volunteer Crowd?,” in International ACM Workshop on Crowdsourcing for Multimedia, 2014, pp. 25–30.

SLIDE 24

Discussion

We recommend employing F-OD1 when “high correlation” is not a priority. This is the most cost-effective approach.
We recommend using F-TQ-OD for more accurate results.

SLIDE 25

Conclusion

Successfully adapted a laboratory listening test to crowdsourcing
Obtained a strong and statistically significant Spearman correlation: r=0.868
Tested outlier detection and trapping questions to filter unreliable ratings from the data
Proposed a combination of outlier detection and trapping questions that leads to more accurate results
Further testing is required to determine to which types of experiment our approach can be applied.

SLIDE 26