Outliers Detection vs. Control Questions to Ensure Reliable Results in Crowdsourcing. A Speech Quality Assessment Case Study
Rafael Zequeira Jiménez, Laura Fernández Gallardo, Sebastian Möller
Quality and Usability Lab, Technische Universität
Crowdsourcing
HumL@WWW2018 – 1st International Workshop on Augmenting Intelligence with Humans-in-the-Loop Page 2
Motivation
Speech quality is important for the Quality of Experience (QoE) in:
- audio books
- virtual or robotic conversational agents
* The collected ratings can be used to train AI systems to predict the speech quality automatically
Motivation
Speech quality experiments traditionally conducted in Laboratory
- Professional audio equipment
- Soundproof room
- Limited number of participants
Crowdsourcing Study
- Conducted a speech quality assessment experiment
- Crowd-workers were presented with 20 speech stimuli
- Opinion about overall quality gathered on a 5-point scale
Speech Material:
- Database number 501 from ITU-T Rec. P.863
- 4 Germans were recorded per condition
- 200 speech stimuli (9 s long on avg.)
- 50 degradation conditions:
  - narrow- & wide-band
  - temporal clipping
  - signal-correlated noise
  - combinations of these degradations
- The database contains quality ratings of the 200 stimuli made by 24 different native German listeners, in accordance with ITU-T Rec. P.800
Study Conditions:
- Address the study to native Germans
- Collect 24 ratings per stimulus from different listeners
- Experiment in accordance with the ITU-T Rec. P.800
Crowdsourcing Platform:
- German-based CS platform
- Reported 1 million global users in September 2017
- Most of their users are from German-speaking countries
Crowdsourcing Experiment
- Screening task to recruit German listeners
- Speech quality assessment task:
  - Qualification phase
  - Speech quality assessment
Crowdsourcing Experiment
Qualification
- consent request
- use of headphones
- audio math trapping question
- 5 stimuli as an anchor
Speech Quality Assessment
- introduction
- environment recording up to 15 s
- 20 stimuli to rate
- 2 trapping questions
Results
- 87 workers participated in the study
- 8 workers failed the Qualification phase
- 53 unique listeners:
  - 60,4% males
  - 96,2% native Germans
  - provided 4840 ratings
- the collected ratings account for 24 to 26 assessments from different listeners per file
Crowdsourcing vs. Laboratory
- Spearman’s rank-order correlation:
  - rho = 0,864 (p<0,001)
  - Monotonic relationship between Lab- and CS-MOS
- Root Mean Square Error:
  - RMSE = 0,474
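The two agreement measures on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' analysis script: the function names and the five example MOS values are made up, and the p-value reported on the slide is not computed here. Spearman's rho is obtained as the Pearson correlation of the ranks, with tied values sharing their average rank.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman's rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def rmse(x, y):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# Hypothetical per-stimulus MOS values (illustrative only, not study data).
lab_mos = [1.2, 2.5, 3.1, 4.0, 4.6]
cs_mos = [1.5, 2.4, 3.4, 3.9, 4.4]

print(spearman_rho(lab_mos, cs_mos))
print(rmse(lab_mos, cs_mos))
```

In the study, each vector would hold one MOS per stimulus (200 entries), one from the laboratory database and one from the crowdsourcing ratings.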
Filtering from unreliable workers
- Work in [1] and [2] recommends:
  - the use of trapping questions, to catch inattentive users
  - when a user fails, all of their ratings are discarded
This approach was effective in [1] and slightly improved the results in [2]
[1] B. Naderi, T. Polzehl, I. Wechsung, F. Köster, and S. Möller, “Effect of Trapping Questions on the Reliability of Speech Quality Judgments in a Crowdsourcing Paradigm,” in Interspeech, 2015, pp. 2799–2803.
[2] R. Zequeira Jiménez, L. Fernández Gallardo, and S. Möller, “Scoring Voice Likability using Pair-Comparison: Laboratory vs. Crowdsourcing Approach,” in Ninth International Conference on Quality of Multimedia Experience (QoMEX), 2017, pp. 1–3.
Filtering from unreliable workers
A worker is unreliable or untrustworthy when:
- s/he fails the trapping question in the speech quality assessment task (SQAT)
- s/he fails the Qualification more than once
Filtering from unreliable workers
- Discarded 320 ratings in total from W4, W5, W7
- W6 did not conduct the SQAT
Method: “filtering by trapping question” (F-TQ)
- Spearman’s rank-order correlation on 4520 ratings:
  - rho = 0,862 (p<0,001)
When discarding all the workers (F-TQ’):
  - rho = 0,854 (p<0,001)
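A minimal sketch of worker-level filtering, that is, dropping every rating submitted by a worker who failed a trapping question. The slides do not spell out exactly how F-TQ and F-TQ’ differ in scope, so this only illustrates the general rule; the record layout (`worker`, `stimulus`, `score`) and the example data are hypothetical.

```python
def filter_by_trapping_question(ratings, failed_workers):
    """Keep only ratings from workers who did not fail a trapping question."""
    return [r for r in ratings if r["worker"] not in failed_workers]

# Hypothetical rating records (illustrative only, not study data).
ratings = [
    {"worker": "W1", "stimulus": "s01", "score": 4},
    {"worker": "W4", "stimulus": "s01", "score": 1},
    {"worker": "W1", "stimulus": "s02", "score": 2},
    {"worker": "W5", "stimulus": "s02", "score": 5},
]

# Worker IDs flagged as having failed a trapping question.
failed_workers = {"W4", "W5", "W7"}

kept = filter_by_trapping_question(ratings, failed_workers)
print(len(ratings) - len(kept), "ratings discarded")
```

The MOS per stimulus would then be recomputed from `kept` before correlating with the laboratory scores.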
Outlier Detection
- outliers:
  - ratings above 1,5 interquartile range (IQR)
  - depicted by circles
- extreme outliers:
  - ratings at 3,0 IQR or above
  - depicted by asterisks
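The two thresholds above can be sketched as a small classifier. Several details are assumptions, since the slide does not state them: distances are measured from the box edges (first and third quartile), as in a conventional boxplot; quartiles use the "inclusive" method of Python's `statistics.quantiles`; and the function is meant to be applied to the ratings of one stimulus at a time. The function name and the example scores are hypothetical.

```python
import statistics

def classify_ratings(scores):
    """Label each score 'ok', 'outlier' or 'extreme' using IQR fences."""
    # Assumption: quartiles computed with the inclusive method.
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    labels = []
    for s in scores:
        # Distance beyond the box edge; 0 when the score lies inside the box.
        distance = max(q1 - s, s - q3, 0)
        if distance > 0 and distance >= 3.0 * iqr:
            labels.append("extreme")   # at 3,0 IQR or above
        elif distance > 1.5 * iqr:
            labels.append("outlier")   # above 1,5 IQR
        else:
            labels.append("ok")
    return labels

# Hypothetical ratings of a single stimulus on the 5-point scale.
scores = [4, 5, 1, 4, 5, 2, 4, 5, 4]
print(classify_ratings(scores))
```

For these scores the quartiles are 4 and 5, so the rating of 1 lies three IQRs below the box and the rating of 2 lies two IQRs below it.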
Outlier Detection
- Discarded 122 ratings identified as extreme outliers
Method: “filtering by outlier detection 1” (F-OD1)
- Spearman’s rank-order correlation:
  - rho = 0,863 (p<0,001)
Still not better than the first coefficient, when no data was discarded
Outlier Detection 2
- Discarded 1480 ratings from 12 workers that were outliers or extreme outliers three times or more [5].
Method: “filtering by outlier detection 2” (F-OD2)
- Spearman’s rank-order correlation:
  - rho = 0,867 (p<0,001)
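The worker-level rule of F-OD2 can be sketched as follows: count how many of each worker's ratings were flagged as outliers or extreme outliers, and drop all ratings of workers flagged three times or more. The record layout, the `label` field (assumed to have been filled by an earlier outlier-detection step) and the example data are hypothetical.

```python
from collections import Counter

def filter_outlier_workers(ratings, min_flags=3):
    """Remove all ratings from workers flagged min_flags times or more."""
    # Count flagged ratings (outlier or extreme) per worker.
    flags = Counter(r["worker"] for r in ratings if r["label"] != "ok")
    excluded = {w for w, n in flags.items() if n >= min_flags}
    kept = [r for r in ratings if r["worker"] not in excluded]
    return kept, excluded

# Hypothetical labelled ratings (illustrative only, not study data).
ratings = [
    {"worker": "A", "score": 1, "label": "extreme"},
    {"worker": "A", "score": 5, "label": "outlier"},
    {"worker": "A", "score": 1, "label": "outlier"},
    {"worker": "A", "score": 3, "label": "ok"},
    {"worker": "B", "score": 2, "label": "outlier"},
    {"worker": "B", "score": 4, "label": "ok"},
    {"worker": "C", "score": 4, "label": "ok"},
]

kept, excluded = filter_outlier_workers(ratings)
print(sorted(excluded), len(kept))
```

Note that worker A's unflagged rating is discarded too, which is why this method removes far more ratings than filtering individual extreme outliers.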
[5] B. Naderi, Motivation of Workers on Microtask Crowdsourcing Platforms. Springer, 2018.
Alternative Approach
- Applied F-OD1 and F-OD2 and discarded 1529 ratings in total.
- Identified the outliers made by all the workers that failed the trapping questions. Then removed 17 ratings.
Method: F-TQ-OD
- Spearman’s rank-order correlation on 3294 ratings:
  - rho = 0,868 (p<0,001)
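The counts on this slide can be cross-checked against the 4840 ratings reported on the Results slide. The short calculation below uses only numbers stated in the slides; the variable names are made up for illustration.

```python
total_ratings = 4840          # ratings collected from the 53 listeners
discarded_od = 1529           # removed by F-OD1 and F-OD2 together
discarded_tq_outliers = 17    # outliers from workers who failed trapping questions

discarded_total = discarded_od + discarded_tq_outliers
remaining = total_ratings - discarded_total
share_discarded = 100 * discarded_total / total_ratings

print(discarded_total)            # 1546
print(remaining)                  # 3294
print(round(share_discarded, 1))  # 31.9
```

The result matches the 3294 ratings on which the F-TQ-OD correlation was computed, and the 1546 discarded ratings correspond to 31,9% of the collected data.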
Results Overview
Approach    Ratings discarded    rho       RMSE
none        -                    0,864*    0,474
F-TQ        320                  0,862*    0,476
F-TQ’       780                  0,854*    0,480
F-OD1       122                  0,863*    0,477
F-OD2       1480                 0,867*    0,474
F-TQ-OD     1546                 0,868*    0,479
*p < 0,001
Results Comparison
Approach    Method                                    Workers Discarded    Ratings Discarded
[6]         gold standard questions                   25%                  75%
[7]         verification questions                    -                    34,3%
F-TQ-OD     trapping question + outliers detection    22%                  31,9%
[6] T. Hoßfeld, M. Seufert, M. Hirth, T. Zinner, P. Tran-Gia, and R. Schatz, “Quantification of YouTube QoE via Crowdsourcing,” in 2011 IEEE International Symposium on Multimedia, 2011, pp. 494–499.
[7] J. Redi and I. Povoa, “Crowdsourcing for Rating Image Aesthetic Appeal: Better a Paid or a Volunteer Crowd?,” in International ACM Workshop on Crowdsourcing for Multimedia, 2014, pp. 25–30.
Discussion
- We recommend employing F-OD1 when “high correlation” is not a priority. This is the most cost-effective approach.
- We recommend using F-TQ-OD for more accurate results.
Conclusion
- Successfully adapted a Laboratory listening test to Crowdsourcing
- Obtained a strong and statistically significant Spearman correlation: rho = 0,868
- Tested outlier detection and trapping questions to filter unreliable ratings from the data
- Proposed a combination of outlier detection and trapping questions that leads to more accurate results
- Further testing is required to determine to which types of experiment our approach can be applied.