SLIDE 1

Elmar Nöth, Tobias Bocklet, Arnd Gebhard

A System for Speech and 3D Facial Image Acquisition, Modeling and Analysis

Wednesday, 30 May 2012

SLIDE 2

Outline

Motivation: Long-term goal of the project
Patient groups:
Parkinson’s disease (PD) patients
Stroke patients and patients with facial paresis
Speech technology
Facial analysis technology
Results
Summary

SLIDE 3

Necessity of Evaluation

Diagnosis
How intelligible is the patient? (holistic impression)
How strongly does the patient nasalize? (distinct aspect)
Therapy control
Has the situation of the patient improved during therapy?
Comparison of therapy methods
Which therapy method leads to the best results for a group of patients?
Screening
Is the quality of a child’s speech appropriate for its age?
Computer-assisted therapy
Did the patient perform the exercise correctly?

Motivation

SLIDE 5

Long-term Goal of the Project

Provide a telemedical rehabilitation unit for clinical/home use
Support speech analysis and analysis of facial gestures and … (gait, cognitive abilities → open, flexible platform)

Patient groups:
Parkinson’s disease (PD) patients
Stroke patients and patients with facial paresis
Instruct the patient what to do
Evaluate the exercises
Compare with previous sessions
Summarize exercises for therapist

Motivation

SLIDE 6

Outline

Long-term goal of the project
Patient groups:
Parkinson’s disease (PD) patients
Stroke patients and patients with facial paresis
Speech technology
Facial analysis technology
Results
Summary

SLIDE 7

Parkinson’s Disease

Degenerative disorder of the central nervous system
Death of dopamine-containing cells in the substantia nigra
Cause of cell death is unknown
Second most common neurodegenerative disorder (after Alzheimer's disease)
Prevalence ≈ 0.3% (whole population)
More common in the elderly: 1% of people over 60 years, 4% of people over 80 years
Incidence of PD ≈ 8-18 per 100,000 people per year
Onset in most cases after 50 years of age, mean onset ≈ 60 years

Patient Groups

SLIDE 8

Speech-related Symptoms of PD

Hypophonia (soft speech)
Monotonic speech: speech quality tends to be soft, hoarse, and monotonous
Festinating speech: excessively rapid, soft, poorly intelligible speech

Drooling: most likely caused by a weak, infrequent swallow
Dysphagia (impaired ability to swallow)
Dysarthria

Patient Groups

SLIDE 9

Dysarthria

A speech disorder affecting the coordination of muscles in the vocal tract, face, larynx, and respiratory system (dysarthrophonia)
Mostly results from a neurological injury, such as a stroke or other kind of brain injury

Patient Groups

SLIDE 11

Outline

Long-term goal of the project
Patient groups:
Parkinson’s disease (PD) patients
Stroke patients and patients with facial paresis
Speech technology
Facial analysis technology
Results
Summary

SLIDE 12

Speech Technology

Automatic speech processing methods
Word and phoneme recognition
Acoustic speaker modeling
Prosodic analysis
Model of excitation signal
Evaluation measures

SLIDE 13

[Figure labels: features; classification; words with highest probability; word chain]

Word and Phoneme Recognition

Off-the-shelf technology
Semi-continuous HMMs
Easier to adapt with small amounts of data
Comparable results with continuous models
Features: 11 Mel cepstrum coefficients + energy + first derivatives
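
The feature layout above can be illustrated with a small sketch. It assumes that the first derivative is taken over all 12 static values (11 cepstral coefficients plus energy), which would give 24 values per frame; the slides do not state this, nor the derivative window. The function name `add_deltas` and the simple symmetric difference are illustrative choices, not the recognizer's actual front end.

```python
def add_deltas(frames):
    """Append first-order differences to each frame of static features.

    frames: list of equal-length lists (assumed here: 11 cepstral
    coefficients + 1 energy value per frame).
    The delta at frame t is (x[t+1] - x[t-1]) / 2; the first and last
    frame are repeated at the edges (one simple, common choice).
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        delta = [(b - a) / 2.0 for a, b in zip(prev, nxt)]
        out.append(list(frames[t]) + delta)
    return out


# Dummy input: 5 frames with 12 static values each (hypothetical data).
static = [[float(t)] * 12 for t in range(5)]
full = add_deltas(static)
print(len(full[0]))  # number of values per frame after appending deltas
```

Real front ends usually estimate the derivative with a regression over several neighboring frames; the two-point difference here only shows the idea.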

Speech Technology

SLIDE 14

Acoustic Speaker Modeling

Idea:
Acoustic space of speakers can be modeled
Space represents the multidimensional characteristics of the voice of a speaker

Degree of pathology varies in acoustic space
Find characteristics of degree of speech disorder
Approach:
Acoustics modeled by Gaussian Mixture Models (GMMs)
Train Universal Background Model (UBM) with normal speakers
Train GMMs of pathological speakers and transform each into a vector
Perform a classification/regression (depends on the task)
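
The approach above can be sketched as follows. This is a minimal illustration under several assumptions that the slides do not confirm: diagonal covariances, mean-only MAP adaptation of the UBM with a relevance factor of 16, and a vector built from the stacked adapted means only. All function names are hypothetical, and the UBM parameters are hard-coded toy values rather than trained on normal speakers.

```python
import numpy as np


def responsibilities(X, weights, means, variances):
    """Posterior probability of each diagonal Gaussian for every frame."""
    diff = X[:, None, :] - means[None, :, :]              # (frames, N, dims)
    log_pdf = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                      + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_post = np.log(weights) + log_pdf
    log_post -= log_post.max(axis=1, keepdims=True)       # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)


def map_adapt_means(X, weights, means, variances, relevance=16.0):
    """Shift the UBM means toward the speaker's frames (mean-only MAP)."""
    post = responsibilities(X, weights, means, variances)
    n_i = post.sum(axis=0)                                # soft frame counts
    first_moment = post.T @ X / np.maximum(n_i, 1e-10)[:, None]
    alpha = (n_i / (n_i + relevance))[:, None]            # adaptation weight
    return alpha * first_moment + (1.0 - alpha) * means


def supervector(adapted_means):
    """Stack the adapted means into one long vector."""
    return adapted_means.reshape(-1)


# Toy UBM with two densities in two feature dimensions (hypothetical values).
rng = np.random.default_rng(0)
ubm_weights = np.array([0.5, 0.5])
ubm_means = np.array([[0.0, 0.0], [8.0, 8.0]])
ubm_vars = np.ones((2, 2))

# Frames of one speaker, scattered around the first density.
speaker_frames = rng.normal(loc=[1.0, 0.0], scale=0.5, size=(200, 2))
adapted = map_adapt_means(speaker_frames, ubm_weights, ubm_means, ubm_vars)
sv = supervector(adapted)
print(sv.shape)
```

Only the density that actually sees speaker data moves; the other one stays at its UBM position, which is what makes the resulting vectors comparable across speakers.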

Speech Technology

SLIDE 15

Acoustic Speaker Modeling

[Figure: Gaussian density of UBM over features of healthy speakers and Gaussian density of speaker model over features of a pathological speaker; axes: feature dimension 1, feature dimension 2]

Variations of speakers with different degrees of pathology
Can be modeled by adaptation from UBM to GMM

Speech Technology

SLIDE 16

Gaussian densities (i = 1, ..., N) of the speaker model are defined by mean values (m_i) and covariance matrices (K_i)

[Figure: Gaussian densities; axes: feature dimension 1, feature dimension 2]

Concatenation of elements of densities:

m_s = (m_1, m_2, m_3, m_4, m_5, m_6)
K_s = (K_1, K_2, K_3, K_4, K_5, K_6)

Acoustic Speaker Modeling

Speech Technology

SLIDE 17

Acoustic Speaker Modeling

[Figure: points correspond to supervectors (SVs); one cluster of speakers with pathology type 1, one cluster of speakers with pathology type 2]

Discriminate between different types of pathology
Create SVs of speakers
Train some classifier on labeled SVs
Create SV of test speaker
Classify SV of test speaker
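
The four steps above can be sketched with a deliberately simple classifier. The slides only say "some classifier"; the nearest-centroid rule below is a stand-in chosen for brevity, and the two-dimensional vectors and labels are made-up toy data, not real supervectors.

```python
def train_centroids(svs, labels):
    """Average the supervectors of each class into one centroid."""
    sums, counts = {}, {}
    for sv, lab in zip(svs, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(sv)
            counts[lab] = 0
        sums[lab] = [s + v for s, v in zip(sums[lab], sv)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}


def classify(sv, centroids):
    """Return the label of the centroid closest to the test supervector."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lab: sq_dist(sv, centroids[lab]))


# Hypothetical labeled supervectors (real ones have far more dimensions).
train_svs = [[0.1, 0.0], [0.0, 0.2], [1.0, 0.9], [0.9, 1.1]]
train_labels = ["type 1", "type 1", "type 2", "type 2"]

centroids = train_centroids(train_svs, train_labels)
predicted = classify([0.8, 1.0], centroids)
print(predicted)
```

In practice a margin-based classifier such as an SVM is a common choice for supervectors; the workflow (train on labeled SVs, classify the test SV) stays the same.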

Speech Technology

SLIDE 18

Acoustic Speaker Modeling

[Figure: degree of pathology plotted over supervector space]

Estimate degree of pathology:
Train a regression (linear/SVR)
Create SV for a test speaker
Estimate degree of pathology of the test speaker
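
The regression variant can be sketched in the same spirit. The example below assumes an ordinary least-squares fit with a bias term as the "linear" option; the SVR option is not shown. The supervectors and pathology scores are made-up toy values, and `fit_linear` and `predict` are hypothetical names.

```python
import numpy as np


def fit_linear(svs, scores):
    """Least-squares fit of score = w . sv + b; returns [w..., b]."""
    X = np.hstack([np.asarray(svs, dtype=float), np.ones((len(svs), 1))])
    coef, *_ = np.linalg.lstsq(X, np.asarray(scores, dtype=float), rcond=None)
    return coef


def predict(coef, sv):
    """Estimate the degree of pathology for one supervector."""
    return float(np.dot(coef[:-1], sv) + coef[-1])


# Hypothetical training data: scores follow 1 + 2*x1 + 1*x2 exactly.
train_svs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
train_scores = [1.0, 3.0, 2.0, 4.0]

coef = fit_linear(train_svs, train_scores)
estimate = predict(coef, [0.5, 0.5])
print(round(estimate, 3))
```

With real supervectors the dimension usually exceeds the number of speakers, so some form of regularization (or SVR, as named on the slide) would be needed instead of a plain least-squares fit.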

Speech Technology

SLIDE 19

Prosodic Analysis

Prosody: rhythm, intonation, stress, and related attributes
Computation of prosodic features on word level, across several words, across syllable nuclei, or across voiced segments
Computation across several words requires ASR
Computation across syllable nuclei requires syllable detection
Local features:
Pauses before/after segments, signal energy, segment duration, and F0
Calculation of mean, max., min., and std. dev.
Global features: jitter, shimmer, voiced/unvoiced characteristics
→ ≈ 100-200 features per test utterance

Speech Technology
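
The statistics named on the Prosodic Analysis slide can be sketched as follows. The definitions of jitter and shimmer used here (mean absolute difference of consecutive pitch periods or peak amplitudes, relative to their mean) are one common "local" variant; the slides do not say which variant the system uses. All input values are made up for illustration.

```python
import math
import statistics


def summarize(values):
    """Mean, max., min., and standard deviation of one local feature."""
    return {
        "mean": statistics.mean(values),
        "max": max(values),
        "min": min(values),
        "std": statistics.pstdev(values),
    }


def local_perturbation(seq):
    """Mean absolute difference of consecutive values, relative to the mean."""
    diffs = [abs(b - a) for a, b in zip(seq, seq[1:])]
    return (sum(diffs) / len(diffs)) / (sum(seq) / len(seq))


f0_hz = [100.0, 110.0, 120.0]            # hypothetical F0 values of a segment
periods_ms = [8.0, 8.2, 7.9, 8.1]        # hypothetical pitch periods
amplitudes = [0.50, 0.52, 0.49, 0.51]    # hypothetical peak amplitudes

f0_stats = summarize(f0_hz)
jitter = local_perturbation(periods_ms)   # cycle-to-cycle period variation
shimmer = local_perturbation(amplitudes)  # cycle-to-cycle amplitude variation
print(f0_stats["mean"], round(jitter, 4), round(shimmer, 4))
```

Applying `summarize` to each local feature (pauses, energy, duration, F0) over each unit of analysis is what makes the feature count grow to the order of magnitude given on the slide.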

SLIDE 20

Two-Mass Model of the Vocal Folds

Speech Technology

SLIDE 21

Two-Mass Model of the Vocal Folds

Speech Technology

SLIDE 22

Evaluation

Word accuracy (WA) and word correctness (WC)
Calculated features
Features of acoustic speaker models
Features of prosodic analysis
Features of 2-mass model
Correlation (Pearson & Spearman) of calculated features or WA, WC with human listener ratings

Classification based on calculated features
Interpretation of relevant features after feature selection
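
Word correctness and word accuracy can be illustrated with a short sketch. It uses the usual definitions, WC = (N - S - D) / N and WA = (N - S - D - I) / N, with N reference words and S, D, I the substitutions, deletions, and insertions of a minimum edit-distance alignment; the slides do not spell these formulas out, so treat them as an assumption. The recognized word chain is invented.

```python
def align_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) of a minimum-error alignment."""
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = (errors, subs, dels, ins) for reference[:i] vs hypothesis[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)          # only deletions
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)          # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e, s, d, ins = cost[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                match = (e, s, d, ins)
            else:
                match = (e + 1, s + 1, d, ins)
            e, s, d, ins = cost[i - 1][j]
            delete = (e + 1, s, d + 1, ins)
            e, s, d, ins = cost[i][j - 1]
            insert = (e + 1, s, d, ins + 1)
            cost[i][j] = min(match, delete, insert)   # fewest errors wins
    return cost[n][m][1:]


def word_scores(reference, hypothesis):
    """Return (word correctness, word accuracy) as fractions of N."""
    subs, dels, ins = align_counts(reference, hypothesis)
    n = len(reference)
    return (n - subs - dels) / n, (n - subs - dels - ins) / n


reference = "the north wind and the sun".split()
recognized = "the north wind an the sun sun".split()   # hypothetical ASR output
wc, wa = word_scores(reference, recognized)
print(round(wc, 3), round(wa, 3))
```

Because insertions are counted only in WA, WA can become negative for very poor recognition results, while WC stays between 0 and 1.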

Speech Technology

SLIDE 23

Outline

Long-term goal of the project
Patient groups:
Parkinson’s disease (PD) patients
Stroke patients and patients with facial paresis
Speech technology
Facial analysis technology
Results
Summary

SLIDE 24

PD: Increasing inability to express emotions with facial gestures (important for communication)
Dysarthric speech often accompanied by other physical impairments
Facial paresis
Motor handicaps
→ Analysis of facial gestures
Reduced mobility requires therapist to come to patient
High costs
Waste of therapist’s time
→ Telemedical therapy

Analysis of Facial Gestures

Facial Analysis Technology

SLIDE 25

Anger vs. Joy

Facial Analysis Technology

SLIDE 26

Showing Emotions

Reduced Ability to Vary Facial Expressions with PD

SLIDE 27

[Figure panels: unstressed look; lip pursing; closing of eyes; showing the teeth]

Ability to Analyze Sequence of Movements

Dynamic Facial Expressions for Facial Paresis

SLIDE 28

Grading of Facial Paresis

Facial Analysis Technology

Different grading systems are used
Most prominent: grading system by House & Brackmann

[J. House and D. Brackmann: Facial nerve grading system. Otolaryngology - Head and Neck Surgery, 1985]

6 grades:
House I → healthy person
House VI → completely paralyzed half of the patient's face

Grading is based on (subjective) observations by an expert
Problem: objective tracking of the healing process
Solution: automatic system for diagnosis support

SLIDE 29

3D Camera: Principle

Dynamic Analysis of Facial Gestures

SLIDE 30

Time-of-Flight (ToF) 3D Camera

Up to 50 Hz
More than 25k 3D points (176 × 144 pixels)
Eye-safe infrared light / no exposure
Precision for facial images
40 cm: ± 1 mm
80 cm: ± 5 mm
120 cm: ± 15 mm

Dynamic Analysis of Facial Gestures

SLIDE 31

Dynamic Analysis of Facial Gestures

Principles of Kinect

SLIDE 32

Dynamic Analysis of Facial Gestures

Principles of Kinect

SLIDE 33

Control image for the patient

Prototype

[Figure labels: illumination; stereo microphones; ToF camera; webcam]

Dynamic Analysis of Facial Gestures

SLIDE 34

Telemedical System

Framework

SLIDE 35

Measuring the Precision

SLIDE 36

Measuring the Precision

SLIDE 37

Dynamic Parameters: Interface for Therapist

SLIDE 38

Dynamic Parameters: Interface for Therapist

SLIDE 39

Dynamic Parameters: Interface for Therapist

SLIDE 40

Dynamic Parameters: Interface for Therapist

SLIDE 41

Dynamic Parameters: Interface for Therapist

SLIDE 42

Outline

Long-term goal of the project
Patient groups:
Parkinson’s disease (PD) patients
Stroke patients and patients with facial paresis
Speech technology
Facial analysis technology
Results
Summary

SLIDE 43

Hoehn and Yahr Scale

Stage 0: No signs of disease
Stage 1: Unilateral symptoms only
Stage 1.5: Unilateral and axial involvement
Stage 2: Bilateral symptoms; no impairment of balance
Stage 2.5: Mild bilateral disease with recovery on pull test
Stage 3: Balance impairment; mild to moderate disease; physically independent
Stage 4: Severe disability; still able to walk or stand unassisted
Stage 5: Needing a wheelchair or bedridden unless assisted

Parkinson’s Disease

SLIDE 44

Czech Speech Data

Speech Analysis for Parkinson’s Disease

SLIDE 45

Classification PD vs. Control Group

Speech Analysis for Parkinson’s Disease

46 Czech speakers
23 with PD (Hoehn & Yahr 1-2)
23 as control group
Age matched

SLIDE 46

ROC-Curve for T4 and Prosody

Speech Analysis for Parkinson’s Disease

[Figure: ROC curve; axes: % PD, % FA]

Screening of 100,000 people ≥ 60 years → ≈ 1000 people with PD
1% more people found developing PD ≈ 10 people
1% more false alarms (FA) among people without PD ≈ 1000 people
→ Need for cheap, robust screening (e.g. automatic telephone system) + more detailed screening + detailed exam
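
The arithmetic behind this screening argument can be written out as a short sketch. It uses the 1% prevalence for people over 60 given on the Parkinson's Disease slide; the function name and parameter names are hypothetical.

```python
def screening_effect(population, prevalence, delta_detection, delta_false_alarm):
    """Effect of shifting the operating point on the ROC curve.

    delta_detection:   additional fraction of PD patients detected
    delta_false_alarm: additional fraction of healthy people flagged
    """
    with_pd = population * prevalence
    without_pd = population - with_pd
    extra_found = with_pd * delta_detection
    extra_false_alarms = without_pd * delta_false_alarm
    return with_pd, extra_found, extra_false_alarms


with_pd, extra_found, extra_fa = screening_effect(
    population=100_000, prevalence=0.01,
    delta_detection=0.01, delta_false_alarm=0.01,
)
print(round(with_pd), round(extra_found), round(extra_fa))
```

Because people without PD outnumber patients by roughly 99 to 1, one percentage point on the false-alarm axis costs about a hundred times as many people as one percentage point gains on the detection axis, which is the case for a cheap first stage followed by more detailed examinations.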

SLIDE 47

Intelligibility

“The North Wind and the Sun”:
107 words (71 distinct)
Contains all German phonemes
Commonly used by speech therapists and phoneticians

28 patients with dysarthria
recorded during post-stroke rehabilitation
39 to 76 years old
Speech Technology: word recognition, prosody

Speech Analysis for Dysarthric Speech

SLIDE 48

1 Rater vs. Avg. of Other 3 Raters

Speech Analysis for Dysarthric Speech

SLIDE 49

Human raters vs. Word Recognition

Correlation r = −0.84

Speech Analysis for Dysarthric Speech
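
How such a correlation between human ratings and an automatic measure is computed can be sketched as follows. The data are invented; they assume a rating scale on which higher values mean worse intelligibility, which would explain a negative sign, but the slides do not state the scale. Both Pearson and Spearman coefficients are shown because the evaluation slide names both.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


ratings = [1, 2, 3, 4, 5]                  # hypothetical human ratings
word_acc = [80.0, 70.0, 65.0, 40.0, 30.0]  # hypothetical recognition scores

r_pearson = pearson(ratings, word_acc)
r_spearman = spearman(ratings, word_acc)
print(round(r_pearson, 3), round(r_spearman, 3))
```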

SLIDE 50

Automatic Grading of Paresis (2001)

Speech Analysis for Dysarthric Speech

4 grades of facial paresis were defined:

G1: healthy person (corresp. to House I)
G2: weak paresis (corresp. to House II + III)
G3: strong paresis (corresp. to House IV + V)
G4: paralysis (corresp. to House VI)
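
The mapping from the six House grades to the four grades defined here can be written as a small lookup. The dictionary and function names are hypothetical; only the correspondence itself comes from the list above.

```python
# House grade (Roman numeral) -> coarse grade used in the 2001 study
HOUSE_TO_GRADE = {
    "I": "G1",    # healthy person
    "II": "G2",   # weak paresis
    "III": "G2",
    "IV": "G3",   # strong paresis
    "V": "G3",
    "VI": "G4",   # paralysis
}


def coarse_grade(house_grade):
    """Map a House grade such as 'iv' or 'IV' to G1..G4."""
    return HOUSE_TO_GRADE[house_grade.strip().upper()]


print(coarse_grade("III"), coarse_grade("vi"))
```

Merging neighboring House grades reduces the number of classes an automatic system has to separate, at the price of a coarser description of the patient's state.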

Result:

SLIDE 51

New Study of Patients with Paresis (2011)

Speech Analysis for Dysarthric Speech

In 2011 a new study of patients with facial paresis was started
Cooperation with the ENT clinic (HNO-Klinik) in Erlangen
Goal: > 10 patients of every House class to be acquired and analyzed
Current state: 15 patients of different classes acquired

SLIDE 52

Gait Analysis – Sensor Platform

SLIDE 53

Acceleration sensors
Gyroscopes
Wireless transmission of the sensor data

Gait Analysis – Sensor Platform

SLIDE 54

Gait Analysis: 10 m Walking (4 Repetitions)

SLIDE 55

Summary

Provide a telemedical rehabilitation unit for clinical/home use
Patient groups:
Parkinson’s disease (PD) patients
Stroke patients and patients with facial paresis
Speech technology
Facial analysis technology (still images and sequences)
Results
Up to 91% classification PD vs. control group & 0.97 AUC
Correlation of −0.84 between automatic classification and human raters for dysarthric speech using WC & prosody
