Part 2: Audio-Visual HRI: Methodology and Applications in Assistive Robotics



SLIDE 1

Computer Vision, Speech Communication & Signal Processing Group, Intelligent Robotics and Automation Laboratory

National Technical University of Athens, Greece (NTUA)

Robot Perception and Interaction Unit,

Athena Research and Innovation Center (Athena RIC)

Petros Maragos and Athanasia Zlatintsi


Tutorial at INTERSPEECH 2018, Hyderabad, India, 2 Sep. 2018

slides: http://cvsp.cs.ntua.gr/interspeech2018

Part 2: Audio-Visual HRI: Methodology and Applications in Assistive Robotics

SLIDE 2

Interspeech 2018 Tutorial: Multimodal Speech & Audio Processing in Audio-Visual Human-Robot Interaction

2A. Audio-Visual HRI: General Methodology

SLIDE 3

Multimodal HRI: Applications and Challenges

Applications: education, entertainment, assistive robotics

Challenges:

• Speech: distance from microphones, noisy acoustic scenes, variabilities
• Visual recognition: noisy backgrounds, motion, variabilities
• Multimodal fusion: incorporation of multiple sensors, integration issues
• Special user groups: elderly users, children

SLIDE 4

Database of the Multimodal Gesture Challenge (in conjunction with ACM ICMI 2013)

  • 20 cultural/ anthropological signs of Italian language
  • ‘vattene’ (get out)
  • ‘vieni qui’ (come here)
  • ‘perfetto’ (perfect)
  • ‘furbo’ (clever)
  • ‘che due palle’ (what a nuisance!)
  • ‘che vuoi’ (what do you want?)
  • ‘d’accordo’ (agreed)
  • ‘sei pazzo’ (you are crazy)
  • ‘combinato’ (combined)
  • ‘freganiente’ (I don’t care)
  • ‘ok’ (ok)
  • ‘cosa ti farei’ (what would I do to you!)
  • ‘basta’ (that’s enough)
  • ‘prendere’ (to take)
  • ‘non ce ne piu’ (there is no more)
  • ‘fame’ (hunger)
  • ‘tanto tempo’ (a long time ago)
  • ‘buonissimo’ (very good)
  • ‘messi d’accordo’ (agreed)
  • ‘sono stufo’ (I am fed up)
  • 22 different users
  • approximately 20 repetitions per user (~1 minute per gesture video)


SLIDE 5

Modalities (example gesture: ‘vieni qui’ / come here): depth, user mask, skeleton, RGB video & audio

Multimodal Gesture Signals from the Kinect Sensor

[S. Escalera, J. Gonzalez, X. Baro, M. Reyes, O. Lopes, I. Guyon, V. Athitsos, and H. Escalante, “Multi-modal gesture recognition challenge 2013: Dataset and results”, Proc. 15th ACM Int’l Conf. Multimodal Interaction, 2013.]

ChaLearn corpus

SLIDE 6

Multimodal Hypothesis Rescoring + Segmental Parallel Fusion

Single-stream models for audio, skeleton and handshape each generate an N-best hypothesis list; the pooled multiple-hypotheses list is rescored and resorted, yielding the best single-stream hypotheses and the best multistream hypothesis; parallel segmental fusion then produces the recognized gesture sequence.

[V. Pitsikalis, A. Katsamanis, S. Theodorakis & P. Maragos, “Multimodal Gesture Recognition via Multiple Hypotheses Rescoring”, JMLR 2015]
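To make the rescoring idea concrete, here is a minimal Python sketch (not the actual JMLR 2015 system): each stream is assumed to return an N-best list of (hypothesis, log-score) pairs; the stream weights and the floor score for missing hypotheses are illustrative.

```python
# Hedged sketch of multistream N-best rescoring: every hypothesis
# proposed by any stream is rescored with a weighted sum of its
# per-stream scores, then the merged list is resorted.

def rescore_nbest(nbest_lists, weights, floor=-100.0):
    """nbest_lists: dict stream -> [(hypothesis, log_score), ...]."""
    hyps = {h for lst in nbest_lists.values() for h, _ in lst}
    combined = []
    for h in hyps:
        score = sum(w * dict(nbest_lists[s]).get(h, floor)
                    for s, w in weights.items())
        combined.append((h, score))
    return sorted(combined, key=lambda x: x[1], reverse=True)

nbest = {
    "audio":     [("vieni qui", -12.3), ("vattene", -14.1)],
    "skeleton":  [("vieni qui", -20.5), ("perfetto", -22.0)],
    "handshape": [("perfetto", -8.2), ("vieni qui", -9.0)],
}
weights = {"audio": 1.0, "skeleton": 0.6, "handshape": 0.4}
print(rescore_nbest(nbest, weights)[0])  # best multistream hypothesis
```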

SLIDE 7

Audio-Visual Fusion & Recognition

• Audio and visual modalities for an A-V gesture word sequence.
• Ground-truth transcriptions (“REF”) and decoding results for audio and 3 different A-V fusion schemes.
• Results ranked top in the ChaLearn ACM 2013 Gesture Challenge (50 teams; 22 users × 20 gesture phrases × 20 repeats).

[ V. Pitsikalis, A. Katsamanis, S. Theodorakis & P. Maragos, JMLR 2015 ]

SLIDE 8

Visual Activity Recognition

• Action: sit-to-stand
• Gestures: come here, come near
• Sign (Greek Sign Language, GSL): Europe

SLIDE 9

Visual action recognition pipeline

Pipeline: video → temporal sliding window → visual feature extraction → classifier → post-processing → recognized sequence (e.g. Sit-to-Stand, Walk)
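A rough illustration of this loop follows; the window and step sizes are illustrative, and the feature extractor and classifier are stand-ins for the dense-trajectory front-end and SVM described on the next slides.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hedged sketch of sliding-window action recognition: extract a feature
# vector per window of frames, classify it, and collect the per-window
# labels; post-processing (later slides) then smooths the sequence.
def sliding_window_recognition(frames, extract, clf, win=30, step=10):
    labels = []
    for start in range(0, len(frames) - win + 1, step):
        feats = extract(frames[start:start + win])
        labels.append(int(clf.predict(feats[None, :])[0]))
    return labels

# toy demo with random "frames" and a dummy linear SVM
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 64))             # 200 frames, 64-dim each
extract = lambda w: w.mean(axis=0)              # stand-in feature extractor
clf = LinearSVC().fit(rng.normal(size=(20, 64)), rng.integers(0, 3, 20))
print(sliding_window_recognition(frames, extract, clf))
```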

SLIDE 10

Visual front-end: video → optical flow → dense trajectories → feature descriptors (illustrated with example videos)

SLIDE 11

Features: Dense Trajectories

1. Feature points are sampled on a regular grid at multiple scales.
2. Feature points are tracked through consecutive video frames.
3. Descriptors are computed in space-time volumes along the trajectories.

[ Wang et al. IJCV 2013 ]
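A rough sketch of steps 1 and 2 using OpenCV's Farneback flow is shown below; descriptor computation along the space-time tubes (step 3) is omitted, the grid step and trajectory length are illustrative, and the real method additionally prunes static and erratic tracks.

```python
import cv2
import numpy as np

# Hedged sketch of dense point sampling and flow-based tracking
# (after Wang et al., IJCV 2013, greatly simplified).
def track_dense_points(gray_frames, grid_step=10, track_len=15):
    h, w = gray_frames[0].shape
    ys, xs = np.mgrid[grid_step // 2:h:grid_step, grid_step // 2:w:grid_step]
    tracks = [[(float(x), float(y))] for x, y in zip(xs.ravel(), ys.ravel())]
    for prev, curr in zip(gray_frames, gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        for tr in tracks:
            x, y = tr[-1]
            xi, yi = int(round(x)), int(round(y))
            if len(tr) < track_len and 0 <= xi < w and 0 <= yi < h:
                dx, dy = flow[yi, xi]        # flow is indexed [row, col]
                tr.append((x + dx, y + dy))  # propagate the point
    return tracks                            # trajectories up to track_len points
```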

SLIDE 12

K-means Clustering and Dictionary

Feature samples → K-means → dictionary (codebook)
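A minimal sketch of this dictionary step, assuming local descriptors stacked from the training videos; the random data and K=256 are placeholders.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Hedged sketch: cluster a sample of local descriptors to obtain a
# codebook of K visual words, used by the encodings on the next slide.
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(5000, 96))   # stand-in for trajectory descriptors
kmeans = MiniBatchKMeans(n_clusters=256, n_init=3, random_state=0)
kmeans.fit(descriptors)
codebook = kmeans.cluster_centers_          # (K, D) dictionary
```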

SLIDE 13

Feature Encoding

• BoF histogram: size K
• VLAD: size K·D (K codewords, D-dimensional descriptors)
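A minimal sketch of both encodings, given a (K, D) codebook such as the one above, makes the size difference concrete: BoF is a K-bin histogram of assignments, while VLAD stacks K D-dimensional residual sums.

```python
import numpy as np

# Hedged sketch of BoF vs VLAD encoding for one video's descriptors.
def assign(desc, codebook):
    d2 = ((desc[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)                 # nearest codeword per descriptor

def bof_encode(desc, codebook):              # output size: K
    idx = assign(desc, codebook)
    h = np.bincount(idx, minlength=len(codebook)).astype(float)
    return h / (h.sum() + 1e-12)

def vlad_encode(desc, codebook):             # output size: K * D
    K, D = codebook.shape
    idx = assign(desc, codebook)
    v = np.zeros((K, D))
    for k in range(K):
        members = desc[idx == k]
        if len(members):
            v[k] = (members - codebook[k]).sum(axis=0)   # residual sum
    v = np.sign(v) * np.sqrt(np.abs(v))                  # power normalization
    return v.ravel() / (np.linalg.norm(v) + 1e-12)       # L2-normalized
```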

SLIDE 14

Visual Action Classification

Train: labeled videos → classifier
Test: unlabeled videos → classifier → labels
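A minimal train/test sketch on per-video encodings; the data here are random stand-ins, and an SVM is used since the slides report SVM scores.

```python
import numpy as np
from sklearn.svm import SVC

# Hedged sketch: train on labeled encoded videos, predict on unlabeled ones.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(40, 512))     # e.g. VLAD vectors of training videos
y_train = rng.integers(0, 4, 40)         # action labels
X_test = rng.normal(size=(10, 512))      # unlabeled videos

clf = SVC(kernel="linear", probability=True).fit(X_train, y_train)
labels = clf.predict(X_test)             # recognized action per video
scores = clf.predict_proba(X_test)       # class probabilities (cf. SVM scores)
```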

SLIDE 15

Temporal Segmentation Results

[Timeline figure: ground truth vs. raw SVM output, and ground truth vs. SVM + filtering + HMM Viterbi smoothing (sketched below); classes: Sit, Walk, Stand, B.M.]
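A minimal sketch of the smoothing stage, assuming per-window class probabilities from the SVM; the "sticky" self-transition probability is illustrative.

```python
import numpy as np

# Hedged sketch: Viterbi decoding over a sticky transition matrix to
# suppress rapid label switching in the per-window SVM outputs.
def viterbi_smooth(frame_probs, stay=0.9):
    T, C = frame_probs.shape
    trans = np.full((C, C), (1.0 - stay) / (C - 1))
    np.fill_diagonal(trans, stay)
    log_p, log_t = np.log(frame_probs + 1e-12), np.log(trans)
    delta = np.empty((T, C))
    back = np.zeros((T, C), dtype=int)
    delta[0] = log_p[0]
    for t in range(1, T):
        cand = delta[t - 1][:, None] + log_t     # cand[i, j]: from state i to j
        back[t] = cand.argmax(axis=0)
        delta[t] = cand.max(axis=0) + log_p[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):                # backtrack the best state path
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```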
SLIDE 16

Action Recognition Results (4 actions, 6 patients): Descriptors + Post-processing Smoothing

• Dense Trajectories + BoF encoding
• Results improve by adding depth and/or more advanced encodings

SLIDE 17

Gesture Recognition

SLIDE 18

Gesture Recognition Challenges

Recognizing human gestural movements is challenging:

• Large variability in gesture performance.
• Some gestures can be performed with either the left or the right hand.

Example commands: “I want to perform a task”, “I want to sit down”, “Park”, “Come closer”.

SLIDE 19

Visual Gesture Classification Pipeline

Class Probabilities (SVM scores)

SLIDE 20

Applying Dense Trajectories on Gesture Data
SLIDE 21

Extended Results on Gesture Recognition

[Bar chart: accuracy (%), 10–80 range, for descriptors trajectories / HOG / HOF / MBH / combined under BoVW, VLAD and Fisher encodings]

Comparisons: multiple descriptors × multiple encodings; mean over patients

MOBOT-I, Task 6a (8 gestures, 8 patients)

SLIDE 22

Visual Synergy: Semantic Segmentation + Gesture Recognition

Foreground/background segmentation + gesture recognition

• A. Guler, N. Kardaris, S. Chandra, V. Pitsikalis, C. Werner, K. Hauer, C. Tzafestas, P. Maragos and I. Kokkinos, “Human Joint Angle Estimation and Gesture Recognition for Assistive Robotic Vision”, ECCV Workshop on Assistive Computer Vision and Robotics, 2016.

Median relative improvement: 9%

SLIDE 23

Spoken Command Recognition

SLIDE 24

Distant Speech Recognition in Voice-enabled Interfaces

Challenges: noise, interfering speech, distant microphones, reverberation

https://dirha.fbk.eu/

SLIDE 25

Smart Home Voice Interface

Example: “Sweet home, listen! Turn on the lights in the living room!”

Main technologies (a minimal VAD sketch follows below):

• Voice Activity Detection
• Acoustic Event Detection
• Speaker Localization
• Speech Enhancement
• Keyword Spotting
• Far-field command recognition
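For a flavor of the simplest of these building blocks, a hedged, minimal energy-based VAD sketch is given below; real systems in this line of work use trained models, and the frame sizes and threshold here are illustrative.

```python
import numpy as np

# Hedged sketch of a minimal energy-based voice activity detector.
def energy_vad(signal, sr=16000, frame_ms=25, hop_ms=10, thresh_db=-35):
    frame, hop = int(sr * frame_ms / 1e3), int(sr * hop_ms / 1e3)
    peak = np.max(np.abs(signal)) + 1e-12
    decisions = []
    for start in range(0, len(signal) - frame, hop):
        x = signal[start:start + frame] / peak     # peak-normalized frame
        db = 10 * np.log10(np.mean(x ** 2) + 1e-12)
        decisions.append(db > thresh_db)           # True = speech-like frame
    return np.array(decisions)
```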

SLIDE 26

DIRHA demo (“spitaki mou”, Greek for “my little home”)

• I. Rodomagoulakis, A. Katsamanis, G. Potamianos, P. Giannoulis, A. Tsiami, P. Maragos, “Room-localized spoken command recognition in multi-room, multi-microphone environments”, Computer Speech & Language, 2017.

• A. Tsiami, I. Rodomagoulakis, P. Giannoulis, A. Katsamanis, G. Potamianos and P. Maragos, “ATHENA: A Greek Multi-Sensory Database for Home Automation Control”, INTERSPEECH 2014.

https://www.youtube.com/watch?v=zf5wSKv9wKs

SLIDE 27

Spoken-Command Recognition Module for HRI

Pipeline: acoustic front-end (MEMS microphone array, delay-and-sum (DS) beamforming, sketched below) → VAD → keyword spotting (“robot” keyword vs. garbage model) → ASR of commands, e.g. “Wo bin ich” (Where am I?), “Hilfe” (Help!), “Gehe rechts” (Go right), ... → ROBOT command.

Built with HTK tools and a Python interface.

• Integrated in ROS, always-listening mode, real-time performance
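The DS beamforming step can be sketched as follows; this uses integer-sample delays and an illustrative linear geometry, whereas a real front end would use fractional delays and calibration.

```python
import numpy as np

# Hedged sketch of delay-and-sum beamforming for a linear mic array:
# delay each channel so the target direction adds coherently, then average.
def delay_and_sum(channels, mic_positions, angle_deg, sr=16000, c=343.0):
    """channels: (M, N) array; mic_positions: positions in meters along
    the array axis; angle_deg: steering angle from broadside."""
    mic_positions = np.asarray(mic_positions, dtype=float)
    theta = np.deg2rad(angle_deg)
    delays = mic_positions * np.sin(theta) / c           # seconds per mic
    shifts = np.round((delays - delays.min()) * sr).astype(int)
    n = channels.shape[1] - shifts.max()
    aligned = np.stack([ch[s:s + n] for ch, s in zip(channels, shifts)])
    return aligned.mean(axis=0)                          # enhanced signal
```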

SLIDE 28

Online Spoken Command Recognition

• Languages: Greek, German, Italian, English
• Speaker-to-microphone distances: 1.5 – 3 m
• Per-channel (ch-1 ... ch-N) segmentation: sil – command / generic speech – sil
• Targeted acoustic scenes
• Microphone arrays: pentagon-shaped ceiling array (Shure), MEMS mic array, Kinect mic array

SLIDE 29

Audio-Visual Fusion for Multimodal Gesture Recognition

SLIDE 30

Multimodal Fusion: Complementarity of Visual and Audio Modalities

Similar audio, distinguishable gesture Distinguishable audio, similar gesture

SLIDE 31

Audio-Visual Fusion: Hypotheses Rescoring

speech & gesture recognition

spoken commands hypotheses

hypothesis normalized score A1 Help 0.2 A2 Stop 0.19 A3 park 0.12 … A19 go straight 0.01

visual gesture hypotheses

Hypothesis normalized score V1 Stop 0.5 V2 go away 0.15 V3 help 0.12 … V19 go straight 0.01

N-best

MAX𝑥 𝑡𝑑𝑝𝑠𝑓 𝐵 𝑥 𝑡𝑑𝑝𝑠𝑓 𝑊

, 𝑥 𝑡𝑑𝑝𝑠𝑓 𝐵 𝑥 𝑡𝑑𝑝𝑠𝑓 𝑊

𝑥, 𝑥 : modality weights

hypothesis combined score F1 Stop 0.205 F2 help 0.196

fusion hypotheses
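A minimal sketch of this fusion rule; the weights below are chosen to reproduce the slide's example numbers (w_A ≈ 0.95, w_V ≈ 0.05) and are not values from the paper.

```python
# Hedged sketch of the weighted late-fusion rule on normalized N-best
# scores; hypotheses missing from a list contribute a zero score.
audio  = {"help": 0.20, "stop": 0.19, "park": 0.12, "go straight": 0.01}
visual = {"stop": 0.50, "go away": 0.15, "help": 0.12, "go straight": 0.01}
w_a, w_v = 0.95, 0.05                      # modality weights (illustrative)

fused = {h: w_a * audio.get(h, 0.0) + w_v * visual.get(h, 0.0)
         for h in set(audio) | set(visual)}
best = max(fused, key=fused.get)
print(best, fused[best])                   # -> stop, combined score ~0.205
```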

SLIDE 32

Offline Multimodal Command Classification

• Leave-one-out experiments (MOBOT-I task 6a data: 8 patients, 8 gestures)
• Unimodal: audio (A) and visual (V)
• Multimodal (AV): N-best list rescoring

[Bar chart: per-patient classification accuracy (%) for A, V and AV (patients p1, p4, p7, p8, p9, p11, p12, p13); two recovered series average 84% and 90%]

Multimodal confusability graph

SLIDE 33

HRI Online Multimodal System Architecture

• ROS-based integration:
  • Spoken command recognition node
  • Activity detection node
  • Gesture classifier node
  • Multimodal fusion node
• Communication using ROS messages (a minimal sketch of this pattern follows below)
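A minimal rospy sketch of the message-passing pattern; the topic names and the plain-string message type are illustrative assumptions, not the project's actual interfaces.

```python
#!/usr/bin/env python
# Hedged sketch of a fusion node subscribing to hypothetical topics
# published by the audio and gesture nodes.
import rospy
from std_msgs.msg import String

latest = {"audio": None, "gesture": None}

def make_cb(modality, pub):
    def cb(msg):
        latest[modality] = msg.data
        if all(latest.values()):            # both modalities have reported
            pub.publish(String(data="fused: %s + %s"
                               % (latest["audio"], latest["gesture"])))
    return cb

if __name__ == "__main__":
    rospy.init_node("multimodal_fusion")
    pub = rospy.Publisher("/fusion/result", String, queue_size=1)
    rospy.Subscriber("/audio/command", String, make_cb("audio", pub))
    rospy.Subscriber("/gesture/class", String, make_cb("gesture", pub))
    rospy.spin()
```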

SLIDE 34

Audio-Gestural Command Recognition

Online processing system – open-source software: http://robotics.ntua.gr/projects/building-multimodal-interfaces

Visual pipeline: front-end + activity detector → gesture classification (background and gesture models) → recognized visual gesture + confidence.

Audio pipeline: front-end → keyword spotting (background and speech models) → recognized audio command + confidence.

Post-processing and fusion → final recognized result.

• N. Kardaris, I. Rodomagoulakis, V. Pitsikalis, A. Arvanitakis and P. Maragos, “A platform for building new human-computer interface systems that support online automatic recognition of audio-gestural commands”, Proc. ACM Multimedia 2016.

SLIDE 35

2B. Audio-Visual HRI: Applications in Assistive Robotics

SLIDE 36

EU Project MOBOT: Motivation

Experiments conducted at the Bethanien Geriatric Center, Heidelberg.

Mobility and cognitive impairments, prevalent in the elderly population, are limiting factors for Activities of Daily Living (ADLs). Intelligent assistive devices (a robotic rollator) aim to provide context-aware and user-adaptive mobility (walking) assistance.

MOBOT rollator

SLIDE 37

Multi-Sensor Data for HRI

Kinect1 RGB data, Kinect1 depth data, MEMS audio data, GoPro RGB data, HD1 camera data, HD2 camera data

SLIDE 38

Action Sample Data and Challenges

• Visual noise caused by intruders
• Multiple subjects in the scene, even at the same depth level
• Frequent and extreme occlusions, missing body parts (e.g. face)
• Significant variation in subjects' pose, actions, visibility, background

[Example clips: Stand-to-Sit for patients P1, P3, P4]

SLIDE 39

Audio-Gestural Command Recognition: Overview of our Multimodal Interface

Pipeline: MEMS linear array → spoken command recognition; Kinect RGB-D camera → visual action/gesture recognition; both feed N-best hypotheses & scores into multimodal late fusion, which outputs the best AV hypothesis on the MOBOT robotic platform.

[I. Rodomagoulakis, N. Kardaris, V. Pitsikalis, E. Mavroudi, A. Katsamanis, A. Tsiami and P. Maragos, ICASSP 2016]
SLIDE 40

Clinical Studies (MOBOT)

• Kalamata – Diaplasis (30 patients)
• Heidelberg – Bethanien (19 patients)

Speech, gestures, and their combination: 3 repetitions of 5 commands

SLIDE 41

Validation experiments (Bethanien, Heidelberg):

Audio-Gestural recognition in action (1/2)


SLIDE 42

EU Project I-SUPPORT: Overview (Gesture & Spoken Command Recognition)

Audio input: channels ch-1 ... ch-M; visual input: dense trajectories of visual motion

SLIDE 43

Audio-Gestural Recognition: Validation Experiments (FSL, Rome)

SLIDE 44

Validation Setup

FSL, Rome and Bethanien, Heidelberg

SLIDE 45

Gesture Recognition

• Data collection: KIT, ICCS-NTUA; pre-validation: FSL, Bethanien
• Challenges: different viewpoints, poor gesture performance, random movements

SLIDE 46

Gesture Recognition – Depth Modality

• Experiments with depth and log-depth streams
• Extraction of dense trajectories performs better on the log-depth stream (a preprocessing sketch follows below)

[Example frames: RGB stream vs. log-depth stream, with dense trajectories]
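A minimal sketch of the log-depth preprocessing assumed here: raw depth (in millimeters) is compressed logarithmically and rescaled to 8-bit for the trajectory front end. The exact transform used in the project may differ.

```python
import numpy as np

# Hedged sketch: logarithmic depth compression keeps contrast in nearby
# structure for the optical-flow front end.
def to_log_depth(depth_mm):
    d = depth_mm.astype(np.float64)
    d[d <= 0] = np.nan                         # mark invalid depth pixels
    log_d = np.log1p(d)
    lo, hi = np.nanmin(log_d), np.nanmax(log_d)
    out = (log_d - lo) / (hi - lo + 1e-12)     # normalize to [0, 1]
    return np.nan_to_num(255 * out).astype(np.uint8)
```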

SLIDE 47

Gesture Offline Classification – Results

• ICCS dataset (24 users, 28 gestures): two different setups, two different streams, different encoding methods, different features

• KIT dataset (8 users, 8/10 gestures): two different setups; average gesture recognition accuracy: legs (8 gestures) 83%, back (10 gestures) 75%

• FSL pre-validation dataset (5 users, 10 gestures): training/fine-tuning of the models for audio-visual gesture recognition; average accuracy for the 5 gestures used in validation: legs 85%, back 75%

SLIDE 48

Multimodal Fusion and Online Integration

• ROS (Robot Operating System) based integration
• Multimodal “late” fusion (validation at Bethanien, Heidelberg); the result is sent as a ROS message to the FSM

SLIDE 49

Validation results

Command Recognition Rate (CRR) = accuracy computed only on well-performed commands.

Round 2, “back” position:
• Without training: gesture-only scenario 59.6%, audio-gestural scenario 86.2%
• With training: gesture-only scenario 68.7%, audio-gestural scenario 79.1%

Bethanien, Heidelberg / FSL, Rome:
• Round 2 (no training, audio-gestural scenario, “legs” position): 83.5%
• Round 1 (no training, audio-gestural scenario): back 73.8% (A)*, legs 84.7%
• Round 1 (no training, audio-gestural scenario): back 87.2%, legs 79.5%

SLIDE 50

I-SUPPORT system video

SLIDE 51

Part 2: Conclusions

Synopsis:

• Multimodal action recognition and human-robot interaction
• Visual action recognition
• Gesture recognition
• Spoken command recognition
• Online multimodal system and applications in assistive robotics

Ongoing work:

• Fusing human localization & pose with activity recognition
• Activities: actions, gestures, spoken commands, gait
• Applications in perception and robotics

Tutorial slides: http://cvsp.cs.ntua.gr/interspeech2018 For more information, demos, and current results: http://cvsp.cs.ntua.gr and http://robotics.ntua.gr

SLIDE 52

APPENDIX

SLIDE 53

More action recognition results: DT + Gabor3D + Depth

MOBOT-I.3b (6 patients, 4 actions)

SLIDE 54

Action Recognition Results: Comparison of HOG (RGB) vs HOD (Depth)

Encoding: BoF-L1

SLIDE 55

VLAD vs BOF comparison

MOBOT-I Scenario 3b (3 actions + BM, 6 patients).

SLIDE 56

Overview: Visual Gesture Recognition

Cues: handshape, hand position & movement

Pipeline: RGB+D → pose estimation (with pose annotations) → spatiotemporal features → statistical modeling + training → classification / recognition

SLIDE 57

GoPro Camera Data

“Help” “I want to stand up”

SLIDE 58

Gesture Classification results on GoPro data

MOBOT-I task 6a (8 patients, 8 gestures)

SLIDE 59

Multimodal Gesture Classification on the MOBOT Task-6a Dataset

• Task 6a: user is sitting & gesturing; 13 patients
• GoPro videos & MEMS audio, aligned with annotations
• Noisy audio-visual scenes
• Different speech & gesture pronunciations

Annotation labels: IS: instructor speaking; PS: patient speaking; BN: background noise