SLIDE 1

Speaker and Emotion Recognition of TV-Series Data Using Multimodal and Multitask Deep Learning

Sashi Novitasari1, Quoc Truong Do1, Sakriani Sakti1,3, Dessi Lestari2, Satoshi Nakamura1,3

1 Graduate School of Information Science, Nara Institute of Science and Technology
2 Department of Informatics, Bandung Institute of Technology
3 RIKEN AIP
1{sashi.novitasari.si3, do.truong.dj3, ssakti, s-nakamura}@is.naist.jp
2{dessipuji}@informatika.org

SLIDE 2

Outline

  • 1. Introduction
  • 2. Data
  • 3. Model Architectures
  • 4. Features
  • 5. Experiment
  • 6. Conclusion
SLIDE 3
  • I. Introduction
  • Real-life communication involves linguistic and paralinguistic aspects
  • Multimodal and multitask recognition of non-verbal aspects of speech
  • Recognition of a speech signal's speaker and emotion from emotion-rich data
  • Previous works: multimodal or multitask emotion-speaker recognition, but not integrated (Tang et al., 2016; Tian et al., 2016; Vallet et al., 2013)

SLIDE 4
  • II. Data
  • TV-series data → expressive conversation
      ○ Video graphic: facial features
      ○ Audio: acoustic features
      ○ Subtitle: lexical features
  • Language: English
  • Utterance-level annotation (a hypothetical record sketch follows)
      ○ Speaker: 57 names
      ○ Emotion - valence: 3 classes (negative - neutral - positive)
      ○ Emotion - arousal: 3 classes (negative - neutral - positive)
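The slides do not show how an annotated utterance is stored; purely as an illustration of the labeling scheme above, each example can be modeled as one record carrying the three modality sources and the three labels. The field names and example values here are hypothetical, not taken from the corpus:

```python
from dataclasses import dataclass

# Hypothetical utterance record mirroring the annotation scheme above:
# three modality sources plus a speaker label (57 classes) and
# 3-class valence and arousal labels.
VALENCE = ("negative", "neutral", "positive")
AROUSAL = ("negative", "neutral", "positive")

@dataclass
class Utterance:
    video_path: str  # source of facial features
    audio_path: str  # source of acoustic features
    subtitle: str    # source of lexical features
    speaker: str     # one of the 57 character names
    valence: int     # index into VALENCE
    arousal: int     # index into AROUSAL

# Example values are invented for illustration only.
example = Utterance("ep01_utt0001.mp4", "ep01_utt0001.wav",
                    "I can't believe you did that!", "SpeakerA", 0, 2)
```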

SLIDE 5
  • III. Model Architectures
  • Multilayer perceptron models (5 layers; a minimal sketch follows)
  • Multimodal classification
  • Multitask classification
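The slides fix the depth at five layers but leave the widths and activations unstated; here is a minimal PyTorch sketch of such a classifier, with the hidden size, ReLU activations, and input dimension chosen for illustration:

```python
import torch.nn as nn

class MLP5(nn.Module):
    """Five-layer perceptron classifier. Hidden size and activations
    are illustrative assumptions, not taken from the paper."""
    def __init__(self, in_dim: int, n_classes: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),  # logits; pair with CrossEntropyLoss
        )

    def forward(self, x):
        return self.net(x)

# e.g. an acoustic-only speaker model: the INTERSPEECH 2010 feature set
# yields 1582 features per utterance, and there are 57 speakers.
speaker_model = MLP5(in_dim=1582, n_classes=57)
```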

SLIDE 6
  • III. Model Architectures

Multimodal Classification

Two evaluated approaches (contrasted in the sketch below):

  • a. Feature concatenation
  • b. Hierarchical feature fusion
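The two fusion schemes appear only as diagrams in the slides; the PyTorch sketch below contrasts them under stated assumptions: the layer widths, the merge order (acoustic with facial first, then lexical), and the input dimensions are all illustrative, not from the paper.

```python
import torch
import torch.nn as nn

def block(n_in: int, n_out: int) -> nn.Sequential:
    return nn.Sequential(nn.Linear(n_in, n_out), nn.ReLU())

class ConcatFusion(nn.Module):
    """(a) Feature concatenation: all modalities are joined at the input."""
    def __init__(self, dims, n_classes: int, h: int = 256):
        super().__init__()
        self.net = nn.Sequential(block(sum(dims), h), block(h, h),
                                 nn.Linear(h, n_classes))

    def forward(self, acoustic, facial, lexical):
        return self.net(torch.cat([acoustic, facial, lexical], dim=-1))

class HierarchicalFusion(nn.Module):
    """(b) Hierarchical fusion: encode each modality separately, then
    merge step by step. The merge order here is an assumption."""
    def __init__(self, dims, n_classes: int, h: int = 256):
        super().__init__()
        self.encoders = nn.ModuleList([block(d, h) for d in dims])
        self.merge_af = block(2 * h, h)  # acoustic + facial
        self.merge_l = block(2 * h, h)   # (+ lexical)
        self.out = nn.Linear(h, n_classes)

    def forward(self, acoustic, facial, lexical):
        a, f, l = (enc(x) for enc, x in
                   zip(self.encoders, (acoustic, facial, lexical)))
        z = self.merge_af(torch.cat([a, f], dim=-1))
        z = self.merge_l(torch.cat([z, l], dim=-1))
        return self.out(z)

# Input dims are illustrative: IS10 acoustic (1582), facial (e.g. 700),
# averaged Word2Vec lexical (300); 57 speaker classes.
model = HierarchicalFusion(dims=(1582, 700, 300), n_classes=57)
```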

SLIDE 7
  • III. Model Architectures

Multitask Classification

Perform classification on several tasks at once (a minimal sketch follows).
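A minimal sketch of the multitask idea: a shared trunk feeding one classification head per task (speaker, valence, arousal), trained on the sum of per-task cross-entropy losses. The layer widths and the equal loss weighting are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class MultitaskMLP(nn.Module):
    """Shared trunk with one classification head per task."""
    def __init__(self, in_dim: int, h: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, h), nn.ReLU(),
                                   nn.Linear(h, h), nn.ReLU())
        self.heads = nn.ModuleDict({
            "speaker": nn.Linear(h, 57),  # 57 speaker classes
            "valence": nn.Linear(h, 3),   # negative / neutral / positive
            "arousal": nn.Linear(h, 3),
        })

    def forward(self, x):
        z = self.trunk(x)
        return {task: head(z) for task, head in self.heads.items()}

# One training step on dummy data: per-task cross-entropy losses are
# summed with equal weights (the weighting is an assumption).
model, ce = MultitaskMLP(in_dim=1582), nn.CrossEntropyLoss()
x = torch.randn(8, 1582)
targets = {"speaker": torch.randint(0, 57, (8,)),
           "valence": torch.randint(0, 3, (8,)),
           "arousal": torch.randint(0, 3, (8,))}
logits = model(x)
loss = sum(ce(logits[task], targets[task]) for task in logits)
loss.backward()
```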

SLIDE 8
  • IV. Features (an extraction sketch follows the list)
  • 1. Acoustic (main)
      ○ INTERSPEECH 2010 feature configuration
      ○ openSMILE toolkit (Eyben et al., 2010)
  • 2. Lexical
      ○ Average of word vectors
      ○ Pre-trained Google Word2Vec (Mikolov et al., 2013)
  • 3. Facial
      ○ Facial contours and angles
      ○ OpenFace toolkit (Baltrusaitis et al., 2016)
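All three feature streams come from external tools; the sketch below shows one plausible way to drive them from Python. The openSMILE config filename and the OpenFace binary name follow those toolkits' usual distributions but should be verified against your install, and skipping out-of-vocabulary words in the averaging is an assumption the slides do not spell out.

```python
import subprocess
import numpy as np
from gensim.models import KeyedVectors

# 1. Acoustic: openSMILE with the INTERSPEECH 2010 paralinguistics config.
#    The config path varies between openSMILE releases; adjust as needed.
subprocess.run(["SMILExtract", "-C", "config/IS10_paraling.conf",
                "-I", "utt.wav", "-O", "utt_acoustic.csv"], check=True)

# 2. Facial: OpenFace's FeatureExtraction binary (landmarks, pose, AUs).
subprocess.run(["FeatureExtraction", "-f", "utt.mp4",
                "-out_dir", "facial/"], check=True)

# 3. Lexical: average the pre-trained Google News Word2Vec vectors over
#    the subtitle's words; out-of-vocabulary words are skipped here.
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def subtitle_vector(text: str) -> np.ndarray:
    vecs = [w2v[w] for w in text.lower().split() if w in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)
```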

SLIDE 9
  • V. Experiment
SLIDE 10
  • V. Experiment
  • Train set: 2,460 utterances
      ○ Speaker: 57 speakers (imbalanced)
      ○ Valence: negative 31%, neutral 60%, positive 9%
      ○ Arousal: negative 4%, neutral 75%, positive 21%
  • Evaluation set: 300 utterances
      ○ Speaker: 10 speakers, 30 samples each
      ○ Valence: negative 32%, neutral 57%, positive 11%
      ○ Arousal: negative 1%, neutral 78%, positive 21%
  • Compared the performance of unimodal, multimodal, single-task, and multitask models
  • Evaluated by F1-score (%) on the evaluation set (a minimal scoring sketch follows)
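The slides report F1-score (%) but do not state the averaging mode; macro-averaging, which weights each class equally despite the class imbalance noted above, is assumed in this scikit-learn sketch with invented toy labels:

```python
from sklearn.metrics import f1_score

# Toy labels standing in for one task's evaluation-set predictions;
# macro averaging weights each class equally despite the imbalance.
y_true = [0, 1, 1, 2, 1, 0]  # e.g. valence: 0=negative, 1=neutral, 2=positive
y_pred = [0, 1, 2, 2, 1, 1]
print(f"F1 (macro): {100 * f1_score(y_true, y_pred, average='macro'):.1f}%")
```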

SLIDE 11
  • V. Experiment

Result

SLIDE 12

  • V. Experiment

Result: Speaker

F1-scores (%) on the evaluation set

*Multimodal approaches: U - Unimodal, C - Feature concatenation, H - Hierarchical feature fusion. Feature types: A - Acoustic, F - Facial, L - Lexical

SLIDE 13

  • V. Experiment

Result: Emotion

F1-score (%) on the evaluation set

*Multimodal approaches: U - Unimodal, C - Feature concatenation, H - Hierarchical feature fusion. Feature types: A - Acoustic, F - Facial, L - Lexical

SLIDE 14
  • V. Experiment

Result Summary

*Multimodal approaches: U - Unimodal, C - Feature concatenation, H - Hierarchical feature fusion. Feature types: A - Acoustic, F - Facial, L - Lexical

SLIDE 15
  • VI. Conclusion
  • We constructed multimodal and multitask speaker-emotion recognition models using deep learning and TV-series data
  • The multitask model was able to outperform the single-task model, especially when recognizing emotion from acoustic features alone
  • The multimodal-multitask model did not yield a significant improvement (larger data might be needed)

SLIDE 16

Thank You