SLIDE 1

Speaker and Emotion Recognition of TV-Series Data Using Multimodal and Multitask Deep Learning

Sashi Novitasari1, Quoc Truong Do1, Sakriani Sakti1,3, Dessi Lestari2, Satoshi Nakamura1,3

1 Graduate School of Information Science, Nara Institute of Science and Technology
2 Department of Informatics, Bandung Institute of Technology
3 RIKEN AIP
1{sashi.novitasari.si3, do.truong.dj3, ssakti, s-nakamura}@is.naist.jp
2{dessipuji}@informatika.org

SLIDE 2

Outline

  • 1. Introduction
  • 2. Data
  • 3. Model Architectures
  • 4. Features
  • 5. Experiment
  • 6. Conclusion
SLIDE 3
  • I. Introduction
  • Real-life communication involves linguistic and paralinguistic aspects
  • Multimodal and multitask recognition of non-verbal aspects of speech
  • Recognition of a speech signal's speaker and emotion from emotion-rich data
  • Previous works: multimodal or multitask emotion-speaker recognition, but not integrated (Tang et al., 2016; Tian et al., 2016; Vallet et al., 2013)

SLIDE 4
  • II. Data
  • TV-series data → expressive conversation
      ○ Video graphic: facial features
      ○ Audio: acoustic features
      ○ Subtitle: lexical features
  • Language: English
  • Utterance-level annotation (a hypothetical record sketch follows)
      ○ Speaker: 57 names
      ○ Emotion - valence: 3 classes (negative - neutral - positive)
      ○ Emotion - arousal: 3 classes (negative - neutral - positive)
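The slides do not show how an annotated utterance is stored; purely as an illustration of the labeling scheme above, each example can be modeled as one record carrying the three modality sources and the three labels. The field names and example values here are hypothetical, not taken from the corpus:

```python
from dataclasses import dataclass

# Hypothetical utterance record mirroring the annotation scheme above:
# three modality sources plus a speaker label (57 classes) and
# 3-class valence and arousal labels.
VALENCE = ("negative", "neutral", "positive")
AROUSAL = ("negative", "neutral", "positive")

@dataclass
class Utterance:
    video_path: str  # source of facial features
    audio_path: str  # source of acoustic features
    subtitle: str    # source of lexical features
    speaker: str     # one of the 57 character names
    valence: int     # index into VALENCE
    arousal: int     # index into AROUSAL

# Example values are invented for illustration only.
example = Utterance("ep01_utt0001.mp4", "ep01_utt0001.wav",
                    "I can't believe you did that!", "SpeakerA", 0, 2)
```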

SLIDE 5
  • III. Model Architectures
  • Multilayer perceptron models (5 layers; a minimal sketch follows)
  • Multimodal classification
  • Multitask classification
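The slides fix the depth at five layers but leave the widths and activations unstated; here is a minimal PyTorch sketch of such a classifier, with the hidden size, ReLU activations, and input dimension chosen for illustration:

```python
import torch.nn as nn

class MLP5(nn.Module):
    """Five-layer perceptron classifier. Hidden size and activations
    are illustrative assumptions, not taken from the paper."""
    def __init__(self, in_dim: int, n_classes: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),  # logits; pair with CrossEntropyLoss
        )

    def forward(self, x):
        return self.net(x)

# e.g. an acoustic-only speaker model: the INTERSPEECH 2010 feature set
# yields 1582 features per utterance, and there are 57 speakers.
speaker_model = MLP5(in_dim=1582, n_classes=57)
```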

SLIDE 6
  • III. Model Architectures

Multimodal Classification

Two evaluated approaches (contrasted in the sketch below):

  • a. Feature concatenation
  • b. Hierarchical feature fusion
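The two fusion schemes appear only as diagrams in the slides; the PyTorch sketch below contrasts them under stated assumptions: the layer widths, the merge order (acoustic with facial first, then lexical), and the input dimensions are all illustrative, not from the paper.

```python
import torch
import torch.nn as nn

def block(n_in: int, n_out: int) -> nn.Sequential:
    return nn.Sequential(nn.Linear(n_in, n_out), nn.ReLU())

class ConcatFusion(nn.Module):
    """(a) Feature concatenation: all modalities are joined at the input."""
    def __init__(self, dims, n_classes: int, h: int = 256):
        super().__init__()
        self.net = nn.Sequential(block(sum(dims), h), block(h, h),
                                 nn.Linear(h, n_classes))

    def forward(self, acoustic, facial, lexical):
        return self.net(torch.cat([acoustic, facial, lexical], dim=-1))

class HierarchicalFusion(nn.Module):
    """(b) Hierarchical fusion: encode each modality separately, then
    merge step by step. The merge order here is an assumption."""
    def __init__(self, dims, n_classes: int, h: int = 256):
        super().__init__()
        self.encoders = nn.ModuleList([block(d, h) for d in dims])
        self.merge_af = block(2 * h, h)  # acoustic + facial
        self.merge_l = block(2 * h, h)   # (+ lexical)
        self.out = nn.Linear(h, n_classes)

    def forward(self, acoustic, facial, lexical):
        a, f, l = (enc(x) for enc, x in
                   zip(self.encoders, (acoustic, facial, lexical)))
        z = self.merge_af(torch.cat([a, f], dim=-1))
        z = self.merge_l(torch.cat([z, l], dim=-1))
        return self.out(z)

# Input dims are illustrative: IS10 acoustic (1582), facial (e.g. 700),
# averaged Word2Vec lexical (300); 57 speaker classes.
model = HierarchicalFusion(dims=(1582, 700, 300), n_classes=57)
```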

SLIDE 7
  • III. Model Architectures

Multitask Classification

Perform classification on several tasks at once (a minimal sketch follows).
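A minimal sketch of the multitask idea: a shared trunk feeding one classification head per task (speaker, valence, arousal), trained on the sum of per-task cross-entropy losses. The layer widths and the equal loss weighting are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class MultitaskMLP(nn.Module):
    """Shared trunk with one classification head per task."""
    def __init__(self, in_dim: int, h: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, h), nn.ReLU(),
                                   nn.Linear(h, h), nn.ReLU())
        self.heads = nn.ModuleDict({
            "speaker": nn.Linear(h, 57),  # 57 speaker classes
            "valence": nn.Linear(h, 3),   # negative / neutral / positive
            "arousal": nn.Linear(h, 3),
        })

    def forward(self, x):
        z = self.trunk(x)
        return {task: head(z) for task, head in self.heads.items()}

# One training step on dummy data: per-task cross-entropy losses are
# summed with equal weights (the weighting is an assumption).
model, ce = MultitaskMLP(in_dim=1582), nn.CrossEntropyLoss()
x = torch.randn(8, 1582)
targets = {"speaker": torch.randint(0, 57, (8,)),
           "valence": torch.randint(0, 3, (8,)),
           "arousal": torch.randint(0, 3, (8,))}
logits = model(x)
loss = sum(ce(logits[task], targets[task]) for task in logits)
loss.backward()
```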

SLIDE 8
  • IV. Features (an extraction sketch follows the list)
  • 1. Acoustic (main)
      ○ INTERSPEECH 2010 feature configuration
      ○ openSMILE toolkit (Eyben et al., 2010)
  • 2. Lexical
      ○ Average of word vectors
      ○ Pre-trained Google Word2Vec (Mikolov et al., 2013)
  • 3. Facial
      ○ Facial contours and angles
      ○ OpenFace toolkit (Baltrusaitis et al., 2016)
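All three feature streams come from external tools; the sketch below shows one plausible way to drive them from Python. The openSMILE config filename and the OpenFace binary name follow those toolkits' usual distributions but should be verified against your install, and skipping out-of-vocabulary words in the averaging is an assumption the slides do not spell out.

```python
import subprocess
import numpy as np
from gensim.models import KeyedVectors

# 1. Acoustic: openSMILE with the INTERSPEECH 2010 paralinguistics config.
#    The config path varies between openSMILE releases; adjust as needed.
subprocess.run(["SMILExtract", "-C", "config/IS10_paraling.conf",
                "-I", "utt.wav", "-O", "utt_acoustic.csv"], check=True)

# 2. Facial: OpenFace's FeatureExtraction binary (landmarks, pose, AUs).
subprocess.run(["FeatureExtraction", "-f", "utt.mp4",
                "-out_dir", "facial/"], check=True)

# 3. Lexical: average the pre-trained Google News Word2Vec vectors over
#    the subtitle's words; out-of-vocabulary words are skipped here.
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def subtitle_vector(text: str) -> np.ndarray:
    vecs = [w2v[w] for w in text.lower().split() if w in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)
```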

SLIDE 9
  • V. Experiment
SLIDE 10
  • V. Experiment
  • Train set: 2,460 utterances
      ○ Speaker: 57 speakers (imbalanced)
      ○ Valence: negative 31%, neutral 60%, positive 9%
      ○ Arousal: negative 4%, neutral 75%, positive 21%
  • Evaluation set: 300 utterances
      ○ Speaker: 10 speakers, 30 samples each
      ○ Valence: negative 32%, neutral 57%, positive 11%
      ○ Arousal: negative 1%, neutral 78%, positive 21%
  • Compared the performance of unimodal, multimodal, single-task, and multitask models
  • Evaluated by F1-score (%) on the evaluation set (a minimal scoring sketch follows)
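The slides report F1-score (%) but do not state the averaging mode; macro-averaging, which weights each class equally despite the class imbalance noted above, is assumed in this scikit-learn sketch with invented toy labels:

```python
from sklearn.metrics import f1_score

# Toy labels standing in for one task's evaluation-set predictions;
# macro averaging weights each class equally despite the imbalance.
y_true = [0, 1, 1, 2, 1, 0]  # e.g. valence: 0=negative, 1=neutral, 2=positive
y_pred = [0, 1, 2, 2, 1, 1]
print(f"F1 (macro): {100 * f1_score(y_true, y_pred, average='macro'):.1f}%")
```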

SLIDE 11
  • V. Experiment

Result

SLIDE 12

  • V. Experiment

Result: Speaker

F1-scores (%) on the evaluation set

*Multimodal approaches: U - Unimodal, C - Feature concatenation, H - Hierarchical feature fusion. Feature types: A - Acoustic, F - Facial, L - Lexical

SLIDE 13

  • V. Experiment

Result: Emotion

F1-score (%) on the evaluation set

*Multimodal approaches: U - Unimodal, C - Feature concatenation, H - Hierarchical feature fusion. Feature types: A - Acoustic, F - Facial, L - Lexical

SLIDE 14
  • V. Experiment

Result Summary

*Multimodal approaches: U - Unimodal, C - Feature concatenation, H - Hierarchical feature fusion. Feature types: A - Acoustic, F - Facial, L - Lexical

SLIDE 15
  • VI. Conclusion
  • We constructed multimodal and multitask speaker-emotion recognition models using deep learning and TV-series data
  • The multitask model was able to outperform the single-task model, especially when recognizing emotion from acoustic features alone
  • The multimodal-multitask model did not yield a significant improvement (larger data might be needed)

SLIDE 16

Thank You