SLIDE 1

Outliers Detection vs. Control Questions to Ensure Reliable Results in Crowdsourcing. A Speech Quality Assessment Case Study

Rafael Zequeira Jiménez, Laura Fernández Gallardo, Sebastian Möller
Quality and Usability Lab, Technische Universität Berlin

HumL@WWW2018 – 1st International Workshop on Augmenting Intelligence with Humans-in-the-Loop

SLIDE 2

Crowdsourcing

SLIDE 3

Motivation

Speech quality is important for the Quality of Experience (QoE) in:

  • audio books
  • virtual or robotic conversational agents

* The collected ratings can be used to train AI systems to predict the speech quality automatically

SLIDE 4

Motivation

Speech quality experiments traditionally conducted in Laboratory

  • Professional audio equipment
  • Soundproof room
  • Limited number of participants

SLIDE 5

Crowdsourcing Study

  • Conducted a speech quality assessment experiment
  • Crowd-workers were presented with 20 speech stimuli
  • Opinion about overall quality gathered on a 5-point scale

SLIDE 6

Speech Material:

  • Database number 501 from ITU-T Rec. P.863
  • 4 Germans were recorded per condition
  • 200 speech stimuli (9s long on avg.)
  • 50 degradation conditions:
      • narrow- & wide-band
      • temporal clipping
      • signal-correlated noise
      • combinations of these degradations
  • The database contains quality ratings of the 200 stimuli made by 24 different native German listeners, in accordance with ITU-T Rec. P.800

SLIDE 7

Study Conditions:

  • Address the study to native Germans
  • Collect 24 ratings per stimulus from different listeners
  • Experiment in accordance with the ITU-T Rec. P.800
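As a quick sanity check of the design figures on the slides above, the following sketch works out how many ratings the crowdsourcing study needs to collect. The variable names are illustrative, and the task count assumes that every task contains exactly the 20 stimuli mentioned on the Crowdsourcing Study slide and is completed in full.

```python
# Figures taken from the slides; variable names are illustrative only.
conditions = 50             # degradation conditions
speakers_per_condition = 4  # Germans recorded per condition
ratings_per_stimulus = 24   # target number of ratings per stimulus
stimuli_per_task = 20       # stimuli presented to a crowd-worker in one task

stimuli = conditions * speakers_per_condition
target_ratings = stimuli * ratings_per_stimulus

# Assumption: every task is completed in full, so this is a lower bound
# on the number of task completions needed.
min_tasks = target_ratings // stimuli_per_task

print(stimuli, target_ratings, min_tasks)
```

The first value matches the 200 stimuli listed under Speech Material; the other two are derived figures, not numbers stated on the slides.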

SLIDE 8

Crowdsourcing Platform:

  • German-based CS platform
  • Reported 1 million global users in September 2017
  • Most of their users are from German-speaking countries

SLIDE 9

Crowdsourcing Experiment

  • Screening task to recruit German listeners
  • Speech quality assessment task:
      • Qualification phase
      • Speech quality assessment

SLIDE 10

Crowdsourcing Experiment

Qualification

  • consent request
  • use of headphone
  • audio Math trapping question
  • 5 stimuli as an anchor

Speech Quality Assessment

  • introduction
  • environment record up to 15s
  • 20 stimuli to rate
  • 2 trapping questions

SLIDE 12

Results

  • 87 workers participated in the study
  • 8 workers failed the Qualification phase
  • 53 unique listeners:
      • 60,4% males
      • 96,2% native Germans
      • provided 4840 ratings
  • the collected ratings account for 24 to 26 assessments from different listeners per file

SLIDE 13

Crowdsourcing vs. Laboratory

  • Spearman’s rank-order correlation:
      • rho = 0,864 (p<0,001)
      • Monotonic relationship between Lab- and CS-MOS
  • Root Mean Square Error:
      • RMSE=0,474
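The comparison above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis script: it assumes ratings arrive as (stimulus, score) pairs, computes a mean opinion score (MOS) per stimulus, and then compares crowdsourcing and laboratory MOS with Spearman's rho (Pearson correlation of average ranks) and a plain RMSE. The function names and the toy data are invented, the p-value is not computed, and the slides do not say which RMSE variant was used.

```python
from collections import defaultdict
from math import sqrt

def mos_per_stimulus(ratings):
    """ratings: iterable of (stimulus_id, score) -> {stimulus_id: mean score}."""
    by_stimulus = defaultdict(list)
    for stimulus_id, score in ratings:
        by_stimulus[stimulus_id].append(score)
    return {s: sum(v) / len(v) for s, v in by_stimulus.items()}

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

def rmse(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# Toy data, invented for illustration only.
cs_ratings = [("s1", 4), ("s1", 5), ("s2", 2), ("s2", 3), ("s3", 1), ("s3", 1)]
lab_mos = {"s1": 4.0, "s2": 3.0, "s3": 1.5}

cs_mos = mos_per_stimulus(cs_ratings)
stimuli = sorted(lab_mos)
lab = [lab_mos[s] for s in stimuli]
cs = [cs_mos[s] for s in stimuli]
print(spearman_rho(lab, cs), rmse(lab, cs))
```

Spearman's rho only looks at the ordering of the stimuli, which is why it is reported together with the RMSE: two MOS sets can be perfectly rank-correlated and still differ by a constant offset.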

SLIDE 14

Filtering from unreliable workers

  • Work in [1] and [2] recommends:
      • the use of trapping questions to catch inattentive users
      • when a user fails, all of their ratings are discarded

This approach was effective in [1] and slightly improved the results in [2]

[1] B. Naderi, T. Polzehl, I. Wechsung, F. Köster, and S. Möller, “Effect of Trapping Questions on the Reliability of Speech Quality Judgments in a Crowdsourcing Paradigm,” in Interspeech, 2015, pp. 2799–2803.
[2] R. Zequeira Jiménez, L. Fernández Gallardo, and S. Möller, “Scoring Voice Likability using Pair-Comparison: Laboratory vs. Crowdsourcing Approach,” in Ninth International Conference on Quality of Multimedia Experience (QoMEX), 2017, pp. 1–3.

SLIDE 15

Filtering from unreliable workers

A worker is unreliable or untrustworthy when:

  • s/he fails the trapping question in the SQAT
  • s/he fails the Qualification more than once
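The two criteria above translate into a simple worker-level filter. The sketch below is only an illustration under assumed data structures: ratings are (worker, stimulus, score) tuples, and the names `failed_trap` and `qualification_failures` are hypothetical. The slides do not show how the study data was actually stored or processed.

```python
def filter_by_trapping_question(ratings, failed_trap, qualification_failures):
    """Drop every rating from workers judged unreliable.

    ratings: list of (worker_id, stimulus_id, score)
    failed_trap: set of worker ids that failed a trapping question
    qualification_failures: {worker_id: number of failed qualification attempts}
    """
    # A worker is unreliable if either criterion applies.
    unreliable = set(failed_trap) | {
        w for w, n in qualification_failures.items() if n > 1
    }
    kept = [r for r in ratings if r[0] not in unreliable]
    return kept, unreliable

# Toy data, invented for illustration only.
ratings = [
    ("w1", "s1", 4), ("w2", "s1", 1), ("w3", "s1", 5),
    ("w1", "s2", 3), ("w2", "s2", 5),
]
kept, unreliable = filter_by_trapping_question(
    ratings,
    failed_trap={"w2"},
    qualification_failures={"w1": 1, "w3": 2},
)
print(sorted(unreliable), len(kept))
```

Filtering whole workers rather than single ratings follows the recommendation that all ratings of a failing user are discarded; the cost is that correct ratings from those workers are lost too.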

SLIDE 17

Filtering from unreliable workers

  • Discarded 320 ratings in total from W4, W5, W7
  • W6 did not conduct the SQAT

Method: “filtering by trapping question” (F-TQ)

  • Spearman’s rank-order correlation on 4520 ratings:
      • rho = 0,862 (p<0,001)

When discarding all the workers (F-TQ’):

  • rho = 0,854 (p<0,001)

SLIDE 18

Outlier Detection

  • outliers:
      • ratings above 1,5 interquartile range (IQR)
      • depicted by circles
  • extreme outliers:
      • ratings at 3,0 IQR or above
      • depicted by asterisks
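A minimal sketch of this boxplot-style rule is shown below. It rests on several assumptions that the slides do not confirm: distances are measured from the nearest quartile (Q1 or Q3), quartiles use the "inclusive" method of Python's `statistics.quantiles`, and the rule is applied to the ratings of one stimulus at a time. The function name is hypothetical.

```python
from statistics import quantiles

def classify_ratings(scores):
    """Label each score of one stimulus as 'ok', 'outlier' or 'extreme'."""
    # Assumption: inclusive quartiles; other quartile definitions may be in use.
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    labels = []
    for s in scores:
        # Distance from the nearest box edge (0 if the score lies inside the box).
        distance = max(q1 - s, s - q3, 0)
        if distance > 0 and distance >= 3.0 * iqr:
            labels.append("extreme")   # at 3.0 IQR or above
        elif distance > 1.5 * iqr:
            labels.append("outlier")   # above 1.5 IQR
        else:
            labels.append("ok")
    return labels

# Toy ratings for a single stimulus, invented for illustration only.
scores = [4, 5, 1, 4, 5, 2, 4, 5, 4, 5, 5]
print(classify_ratings(scores))
```

With a 5-point scale the IQR is often small, so a single low rating on a well-rated stimulus is quickly flagged; this is one reason to look at how often a worker is flagged rather than at single ratings alone.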

SLIDE 19

Outlier Detection

  • Discarded 122 ratings identified as extreme outliers

Method: “filtering by outlier detection 1” (F-OD1)

  • Spearman’s rank-order correlation:
      • rho = 0,863 (p<0,001)

Still not better than the first coefficient when no data was discarded

SLIDE 20

Outlier Detection 2

  • Discarded 1480 ratings from 12 workers that were outliers or extreme outliers three times or more [5].

Method: “filtering by outlier detection 2” (F-OD2)

  • Spearman’s rank-order correlation:
      • rho = 0,867 (p<0,001)

[5] B. Naderi, Motivation of Workers on Microtask Crowdsourcing Platforms. Springer, 2018.
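The worker-level rule of F-OD2 can be sketched as follows, assuming the outlier flags have already been computed elsewhere. The input format, the function name and the `min_flags` parameter are illustrative choices, not taken from the slides.

```python
from collections import Counter

def filter_workers_by_outlier_count(ratings, flagged, min_flags=3):
    """Drop all ratings of workers flagged at least `min_flags` times.

    ratings: list of (worker_id, stimulus_id, score)
    flagged: set of (worker_id, stimulus_id) pairs marked as outlier
             or extreme outlier
    """
    counts = Counter(worker for worker, _ in flagged)
    dropped_workers = {w for w, n in counts.items() if n >= min_flags}
    kept = [r for r in ratings if r[0] not in dropped_workers]
    return kept, dropped_workers

# Toy data, invented for illustration only.
ratings = [
    ("w1", "s1", 1), ("w1", "s2", 5), ("w1", "s3", 1), ("w1", "s4", 3),
    ("w2", "s1", 4), ("w2", "s2", 2), ("w2", "s3", 4), ("w2", "s4", 3),
]
flagged = {("w1", "s1"), ("w1", "s2"), ("w1", "s3"), ("w2", "s2")}

kept, dropped_workers = filter_workers_by_outlier_count(ratings, flagged)
print(sorted(dropped_workers), len(kept))
```

Note that the rule removes every rating of a dropped worker, including the unflagged ones, which is consistent with 12 workers accounting for 1480 discarded ratings.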

SLIDE 21

Alternative Approach

  • Applied F-OD1 and F-OD2 and discarded 1529 ratings in total.
  • Identified the outliers made by all the workers that failed the trapping questions, then removed 17 ratings.

Method: F-TQ-OD

  • Spearman’s rank-order correlation on 3294 ratings:
      • rho = 0,868 (p<0,001)
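One way to read the combined approach is as a union of discard sets: a rating is dropped if any of the individual filters flags it. The sketch below illustrates that reading with invented rating identifiers; the function name is hypothetical and the slides do not state that the filters were combined exactly this way. Under this reading, ratings flagged by more than one filter are counted once, which would be consistent with the combined F-OD1 and F-OD2 total (1529) being smaller than the sum of the two separate counts (122 + 1480 = 1602).

```python
def combine_discards(all_ratings, *discard_sets):
    """Return (kept, discarded) rating identifiers.

    all_ratings and each discard set hold hashable rating identifiers,
    e.g. (worker_id, stimulus_id) pairs or plain integers.
    """
    discarded = set().union(*discard_sets)
    kept = set(all_ratings) - discarded
    return kept, discarded

# Toy identifiers, invented for illustration only.
all_ratings = range(1, 11)
f_od1 = {1, 2}           # extreme outliers
f_od2 = {2, 3, 4}        # all ratings of frequently flagged workers
tq_outliers = {9}        # outliers made by workers who failed a trapping question

kept, discarded = combine_discards(all_ratings, f_od1, f_od2, tq_outliers)
print(len(discarded), len(kept))
```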

SLIDE 22

Results Overview

Approach   Ratings discarded   rho      RMSE
-          -                   0,864*   0,474
F-TQ       320                 0,862*   0,476
F-TQ’      780                 0,854*   0,480
F-OD1      122                 0,863*   0,477
F-OD2      1480                0,867*   0,474
F-TQ-OD    1546                0,868*   0,479

*p < 0,001

SLIDE 23

Results Comparison

Approach   Method                                   Workers Discarded   Ratings Discarded
[6]        gold standard questions                  25%                 75%
[7]        verification questions                   -                   34,3%
F-TQ-OD    trapping question + outliers detection   22%                 31,9%

[6] T. Hoßfeld, M. Seufert, M. Hirth, T. Zinner, P. Tran-Gia, and R. Schatz, “Quantification of YouTube QoE via Crowdsourcing,” in 2011 IEEE International Symposium on Multimedia, 2011, pp. 494–499.
[7] J. Redi and I. Povoa, “Crowdsourcing for Rating Image Aesthetic Appeal: Better a Paid or a Volunteer Crowd?,” in International ACM Workshop on Crowdsourcing for Multimedia, 2014, pp. 25–30.
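The rating counts reported across the slides can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers that appear in the slides; the variable names are illustrative.

```python
total_ratings = 4840                      # ratings provided by the listeners

# F-TQ: 320 ratings discarded; slide 17 reports the correlation on 4520 ratings.
kept_f_tq = total_ratings - 320

# F-TQ-OD: 1529 ratings from F-OD1 and F-OD2 plus 17 further outliers;
# slide 22 lists 1546 discarded ratings, slide 21 reports 3294 remaining.
discarded_f_tq_od = 1529 + 17
kept_f_tq_od = total_ratings - discarded_f_tq_od

# Share of ratings discarded by F-TQ-OD; slide 23 reports 31,9%.
share_discarded = round(100 * discarded_f_tq_od / total_ratings, 1)

print(kept_f_tq, discarded_f_tq_od, kept_f_tq_od, share_discarded)
```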

SLIDE 24

Discussion

  • We recommend employing F-OD1 in case “high correlation” is not a priority. This is the most cost-effective approach.
  • We recommend using F-TQ-OD for more accurate results.

SLIDE 25

Conclusion

  • Successfully adapted a Laboratory listening test to Crowdsourcing
  • Obtained a strong and statistically significant Spearman correlation: r=0.868
  • Tested outlier detection and trapping questions to filter unreliable ratings out of the data
  • Proposed a combination of outlier detection and trapping questions that leads to more accurate results
  • Further testing is required to determine to which types of experiment our approach can be applied.

SLIDE 26

Thank you for your Attention!

Rafael Zequeira Jiménez
rafael.zequeira@tu-berlin.de
@zequeiraj