A Stepwise Analysis of Aggregated Crowdsourced Labels Describing Multimodal Emotional Behaviors


SLIDE 1

A Stepwise Analysis of Aggregated Crowdsourced Labels Describing Multimodal Emotional Behaviors

Alec Burmania and Carlos Busso
Multimodal Signal Processing (MSP) Lab, Erik Jonsson School of Engineering and Computer Science, The University of Texas at Dallas

SLIDE 2

Labels from Expressive Speech

• Emotional databases rely on labels for classification
• Labels are usually obtained via perceptual evaluations
• Lab setting
  + Allows researchers close control over the subjects
  − Expensive
  − Small demographic distribution
  − Smaller corpus size
• Crowdsourcing
  + Can solve some of the above issues
  + Widely tested and used in perceptual evaluations
  − Raises issues with rater reliability

SLIDE 3

Labels from Expressive Speech

• How do we balance quality and quantity in perceptual evaluations?
• How many labels are enough?
• Crowdsourcing makes these decisions important
• What is the value of an extra evaluator?

Many evaluators & low quality, or few evaluators & high quality?

SLIDE 4

Previous Work

• Burmania et al. (2016) explores the tradeoff between the quality and quantity of emotional annotations for emotion classification
• It explores the concept of effective reliability proposed by Rosenthal [2008]
• For example, it is equivalent to have:
  • 15 annotators with reliability κ = 0.45 (R_SB = 0.92)
  • 10 annotators with reliability κ = 0.54 (R_SB = 0.92)
• Classification performance may be increased via the design of the label collection instead of by maximizing inter-evaluator agreement

• A. Burmania, M. Abdelwahab, and C. Busso, "Tradeoff between quality and quantity of emotional annotations to characterize expressive behaviors," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2016), Shanghai, China, March 2016, pp. 5190-5194.

R_{SB} = \frac{n\kappa}{1 + (n - 1)\kappa}
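As a sanity check of the equivalence claimed above, here is a minimal sketch of the Spearman-Brown effective reliability formula; the function name is mine, not from the paper.

```python
# Rosenthal's effective reliability (Spearman-Brown prophecy formula):
# the reliability of the aggregated label of n annotators, each with
# individual reliability kappa.

def effective_reliability(n: int, kappa: float) -> float:
    return (n * kappa) / (1 + (n - 1) * kappa)

# The two configurations from the slide are indeed equivalent:
print(f"{effective_reliability(15, 0.45):.2f}")  # -> 0.92
print(f"{effective_reliability(10, 0.54):.2f}")  # -> 0.92
```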

SLIDE 5

Motivation

• Compare the value of additional evaluators by analyzing consensus labels
• Derive guidelines for subjective evaluations
• Case study: emotional annotations of the MSP-IMPROV corpus

[Diagram: consensus labels from N evaluators, compared with consensus labels from the same N evaluators plus 1 new evaluator]

SLIDE 6

MSP-IMPROV Corpus

[Video: an example scene]

• Recordings of 12 subjects improvising scenes in pairs (>9 hours, 8,438 turns) [Busso et al., 2017]
• Actors are assigned the context of a scene that they are supposed to act out
• Collected to obtain a corpus with fixed lexical content but different emotions
• Data sets:
  • Target: recorded sentences with fixed lexical content (648)
  • Improvisation: scenes designed to produce the target sentences
  • Interaction: interactions between scenes

• C. Busso, S. Parthasarathy, A. Burmania, M. AbdelWahab, N. Sadoughi, and E. Mower Provost, "MSP-IMPROV: An acted corpus of dyadic interactions to study emotion perception," IEEE Transactions on Affective Computing, vol. 8, no. 1, pp. 67-80, January-March 2017.

SLIDE 7

MSP-IMPROV Corpus

Target sentence: "How can I not?"

Emotion      Scenario
Anger        A lazy friend asks you to skip class
Happiness    Accepting a job offer
Sadness      Taking extra help when you are failing classes
Neutral      Using a coupon at the store

SLIDE 8

MSP-IMPROV Corpus


SLIDE 9

Perceptual Evaluation

• Verify in real time whether a worker is spamming (a code sketch of this idea appears at the end of this slide)
• We focus on a five-class problem (angry, sad, neutral, happy, other)
• The reference set includes the target sentences (648)

[Diagram: Phase A collects the reference set (gold standard); Phase B interleaves reference videos with the data (online quality assessment), tracing each worker's performance in real time]

• A. Burmania, S. Parthasarathy, and C. Busso, "Increasing the reliability of crowdsourcing evaluations using online quality assessment," IEEE Transactions on Affective Computing, vol. 7, no. 4, pp. 374-388, October-December 2016.

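A minimal sketch of the online quality-assessment idea, under my own assumptions about thresholds and bookkeeping (the paper's exact protocol is more elaborate):

```python
# Workers label a mix of unlabeled videos and gold-standard reference
# videos. Accuracy on the reference videos is traced in real time, and
# workers who fall below a threshold are stopped early. The 0.7 threshold
# and all names here are illustrative assumptions, not the paper's values.

def keep_worker(reference_outcomes: list[bool], min_accuracy: float = 0.7,
                min_checks: int = 3) -> bool:
    """Decide whether to keep a worker, given their record of correct (True)
    and incorrect (False) answers on gold-standard reference videos."""
    if len(reference_outcomes) < min_checks:
        return True  # not enough evidence yet to judge the worker
    accuracy = sum(reference_outcomes) / len(reference_outcomes)
    return accuracy >= min_accuracy

print(keep_worker([True, True, False, True]))    # True  (75% accuracy)
print(keep_worker([False, False, True, False]))  # False (25% accuracy)
```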

SLIDE 10

Rater Quality

       5 raters       10 raters      15 raters      20 raters      25 raters
Δθ   # sent    κ    # sent    κ    # sent    κ    # sent    κ    # sent    κ
 5     638  0.572     525  0.558     246  0.515      52  0.488       –      –
10     643  0.532     615  0.522     466  0.501     207  0.459      26  0.455
15     648  0.501     643  0.495     570  0.483     351  0.443     112  0.402
20     648  0.469     648  0.471     619  0.463     510  0.451     182  0.414
25     648  0.452     648  0.450     643  0.450     561  0.440     247  0.416
30     648  0.438     648  0.433     648  0.436     609  0.431     298  0.410
35     648  0.425     648  0.433     648  0.426     619  0.424     346  0.403
40     648  0.420     648  0.427     648  0.425     629  0.423     356  0.402
90     648  0.422     648  0.419     648  0.422     629  0.419     381  0.409

• Increasing agreement (κ) as the filter Δθ becomes stricter
• Constant sample size in the left columns
• Decreasing number of samples meeting the size criterion in the right columns

A sketch of how the κ values can be computed appears below.

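The slide does not give the computation, but κ here is an inter-evaluator agreement coefficient; a minimal sketch of one common choice (Fleiss' kappa, via statsmodels) on toy data:

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = sentences, columns = raters; entries are category ids
# (0 = angry, 1 = sad, 2 = neutral, 3 = happy, 4 = other).
# Toy data for illustration, not labels from the corpus.
labels = np.array([
    [3, 3, 3, 1, 3],
    [0, 0, 2, 0, 0],
    [1, 1, 1, 1, 2],
])
table, _ = aggregate_raters(labels)  # counts per (sentence, category)
print(round(fleiss_kappa(table), 3))
```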

SLIDE 11

Label Groups

• We consider two sets of labels based on their kappa agreement:
  • High agreement group (n = 12)
  • Moderate agreement group (n = 20)


(The slide repeats the table from Slide 10, highlighting the cells that define the high agreement condition and the moderate agreement condition.)

SLIDE 12

Label Aggregation

• Votes are aggregated using majority vote
• Each vote is equally weighted
• Votes are iteratively added in the chronological order in which they were collected
• Due to the majority vote, we establish the following transitions:
  • EmoA → EmoA (no change)
  • EmoA → NA (no agreement: a tie has been established)
  • NA → EmoA (a tie is broken)
  • NA → NA (a tie remains a tie)

[Examples: adding one vote can keep the label (Happiness + one Anger vote → Happiness), create a tie (Happiness + one Sadness vote → No Agreement), break a tie (No Agreement + one Happiness vote → Happiness), or leave a tie unresolved (No Agreement + one Anger vote → No Agreement)]

When adding one vote at a time, we cannot transition directly from one emotion to another! (A minimal code sketch of this stepwise aggregation follows.)
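A minimal sketch of the stepwise aggregation described above, under the assumption that a tie among the top labels yields "NA"; the function names are mine, not from the paper.

```python
from collections import Counter

def majority_label(votes: list[str]) -> str:
    """Equally weighted majority vote; a tie between the top labels
    means no agreement ("NA")."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "NA"
    return counts[0][0]

def stepwise_transitions(votes: list[str]) -> list[tuple[str, str]]:
    """Add votes chronologically and record the (before, after)
    consensus label at each step."""
    return [(majority_label(votes[:n - 1]), majority_label(votes[:n]))
            for n in range(2, len(votes) + 1)]

votes = ["Happiness", "Happiness", "Sadness",
         "Happiness", "Sadness", "Sadness"]
print(stepwise_transitions(votes)[-1])  # ('Happiness', 'NA'): a tie appears
```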
SLIDE 13

Experiments

• Trends in the labels are evaluated iteratively for each added label
• We consider four analyses:
  • Label stability
  • Label changes
  • Frequency of change
  • Adding more than one evaluator
• Five-class problem (angry, sad, neutral, happy, other)!

SLIDE 14


Label Stability

• EmoA → EmoA
• NA → NA
• Observations:
  • After 4 evaluators, the labels are stable
  • At n = 6, less than 10% of the labels change
  • Similar trends for the high and moderate agreement conditions

[Figure: percentage of videos with the same aggregated label before and after adding an additional evaluator, for the moderate and high agreement conditions]

SLIDE 15

Label Changes

• Inverse plots:
  • NA → EmoA
  • EmoA → NA
• Observations:
  • At n = 2, 40-44% of the agreement is lost
  • By n = 3, most of the ties are resolved

[Figure: percentage of videos whose labels changed as we add one extra evaluator (changed labels vs. no agreement), for the moderate and high agreement conditions]

SLIDE 16


Change Frequency

• Example: ~25% of the videos change labels 2 times
• Observations:
  • 45% to 50% of the videos never change labels
  • The trend at even values of m indicates that ties are usually broken
  • About 75% of the sentences change labels fewer than 4 times
  • About 10% of the sentences change labels multiple times

[Figure: percentage of videos whose aggregated label changed m times as we incrementally add evaluators, for the moderate and high agreement conditions]

SLIDE 17

Adding More than One Evaluator

• How different are the aggregated labels when we add more than one evaluator at a time?
  • e.g., 3 versus 5 evaluators, or 5 versus 20
• This analysis does not follow the incremental stepwise approach
  • It takes snapshots at different values of n
• We consider 3, 5, 9, and 20 annotators
• We have an additional transition case (see the example and sketch below):
  • EmoA → EmoB (from one emotion to another)

[Example: with several votes added at once, the aggregated label can move directly from one emotion to another (e.g., Happiness before, Sadness after)]
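A minimal sketch of this snapshot comparison, reusing the majority-vote rule from Slide 12; the names and tallying scheme are my own illustration.

```python
from collections import Counter

def majority_label(votes: list[str]) -> str:
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "NA"  # tie: no agreement
    return counts[0][0]

def compare_snapshots(corpus_votes: list[list[str]],
                      n1: int, n2: int) -> Counter:
    """Tally transition categories between the consensus of the first n1
    votes and the first n2 votes of every video."""
    tally: Counter = Counter()
    for votes in corpus_votes:
        a, b = majority_label(votes[:n1]), majority_label(votes[:n2])
        if a == b:
            tally["NA -> NA" if a == "NA" else "EmoA -> EmoA"] += 1
        elif a == "NA":
            tally["NA -> Emo"] += 1
        elif b == "NA":
            tally["Emo -> NA"] += 1
        else:
            tally["EmoA -> EmoB"] += 1  # only possible with snapshots
    return tally

# One toy video: Sadness leads after 3 votes, Happiness after 5.
print(compare_snapshots([["H", "S", "S", "H", "H", "S", "S"]], 3, 5))
```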

SLIDE 18

Adding More than One Evaluator

• Observations:
  • The labels are very stable, even for 3 versus 20 evaluators (76% overlap in labels)
  • Only a few labels benefit from extra evaluations
  • The higher agreement case shows more stability

Moderate agreement condition (number of videos):

Comparison   EmoA→EmoA   EmoA→EmoB   NA→Emo   Emo→NA   NA→NA
3 vs 5          476          28         39       30      26
3 vs 9          447          59         53       28      12
3 vs 20         455          69         62       10       3
5 vs 9          485          32         42       26      14
5 vs 20         477          57         52        9       4

High agreement condition (number of videos):

Comparison   EmoA→EmoA   EmoA→EmoB   NA→Emo   Emo→NA   NA→NA
3 vs 5          519          25         43       29      22
3 vs 9          498          56         55       19      10
5 vs 9          526          40         43       21       8

SLIDE 19

Discussion

• There is reduced value in additional annotations
  • They help about 10% of the labels
• We can save resources by tracking the consistency of the evaluations
• Five evaluators per sentence resolve most of the ambiguities
  • We observe this trend for both the moderate and high inter-evaluator agreement conditions
• Zhang et al. [2015] proposed stopping the evaluation once agreement is reached (a sketch follows the citation below)
  • If n = 5 and three people agree, stop the evaluation

• Y. Zhang, E. Coutinho, Z. Zhang, C. Quan, and B. Schuller, "Dynamic active learning based on agreement and applied to emotion recognition in spoken interactions," in International Conference on Multimodal Interaction (ICMI 2015), Seattle, WA, USA, November 2015, pp. 275-278.
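An illustrative sketch of this agreement-based stopping rule; the parameter names and defaults are assumptions matching the n = 5, three-agree example on the slide, not Zhang et al.'s exact formulation.

```python
from collections import Counter

def needs_more_labels(votes: list[str], max_raters: int = 5,
                      min_agreement: int = 3) -> bool:
    """Keep collecting labels until min_agreement raters pick the same
    label, or until max_raters labels have been collected."""
    if len(votes) >= max_raters:
        return False  # budget exhausted: stop regardless of agreement
    if votes and Counter(votes).most_common(1)[0][1] >= min_agreement:
        return False  # enough raters agree: stop early
    return True

print(needs_more_labels(["happy", "happy", "happy"]))  # False: 3 agree, stop
print(needs_more_labels(["happy", "sad", "angry"]))    # True: keep evaluating
```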

SLIDE 20

Discussion

• An important exception is when consensus labels are not the goal
  • Training with soft labels [Lotfian and Busso, 2017]
  • Studies of emotion perception
• Emotion perceptual evaluations are complex cognitive tasks
  • We expect higher label stability for simpler behavioral tasks

• R. Lotfian and C. Busso, "Formulating emotion perception as a probabilistic model with application to categorical emotion classification," in International Conference on Affective Computing and Intelligent Interaction (ACII 2017), San Antonio, TX, USA, October 2017.

SLIDE 21

Limitation and Future Work

• Generalizing the observed patterns to other databases:
  • Larger or smaller numbers of classes
  • Different corpora
  • Variability in inter-evaluator agreement
• Use of other aggregation techniques:
  • Entropy-based techniques

SLIDE 22

Questions?

Interested in the MSP-IMPROV database? Come visit us at msp.utdallas.edu and click “Resources”


This work was funded by NSF CAREER award IIS-1453781

SLIDE 23

References


A. Burmania, S. Parthasarathy, and C. Busso, "Increasing the reliability of crowdsourcing evaluations using online quality assessment," IEEE Transactions on Affective Computing, vol. 7, no. 4, pp. 374-388, October-December 2016.

A. Burmania, M. Abdelwahab, and C. Busso, "Tradeoff between quality and quantity of emotional annotations to characterize expressive behaviors," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2016), Shanghai, China, March 2016, pp. 5190-5194.

C. Busso, S. Parthasarathy, A. Burmania, M. AbdelWahab, N. Sadoughi, and E. Mower Provost, "MSP-IMPROV: An acted corpus of dyadic interactions to study emotion perception," IEEE Transactions on Affective Computing, vol. 8, no. 1, pp. 67-80, January-March 2017.

R. Lotfian and C. Busso, "Formulating emotion perception as a probabilistic model with application to categorical emotion classification," in International Conference on Affective Computing and Intelligent Interaction (ACII 2017), San Antonio, TX, USA, October 2017.

R. Rosenthal, "Conducting judgment studies: Some methodological issues," in The New Handbook of Methods in Nonverbal Behavior Research, J. Harrigan, R. Rosenthal, and K. R. Scherer, Eds., pp. 199-234, Oxford University Press, Oxford, UK, May 2008.

Y. Zhang, E. Coutinho, Z. Zhang, C. Quan, and B. Schuller, "Dynamic active learning based on agreement and applied to emotion recognition in spoken interactions," in International Conference on Multimodal Interaction (ICMI 2015), Seattle, WA, USA, November 2015, pp. 275-278.