

SLIDE 1

Evaluation of Objective Features for Classification of Clinical Depression in Speech by Genetic Programming

Juan Torres1, Ashraf Saad2, Elliot Moore1

1School of Electrical and Computer Engineering
Georgia Institute of Technology, Savannah, GA 31407, USA
juan.torres@gatech.edu, emoore@gtsav.gatech.edu

2Computer Science Department, School of Computing
Armstrong Atlantic State University, Savannah, GA 31419, USA
ashraf@cs.armstrong.edu

SLIDE 2

Clinical Depression Classification

We would like to detect clinical depression by analyzing a patient's speech.

  • Binary decision (two-class) classification problem.
  • Large number of features in the dataset, so feature selection is necessary for:
    - designing a robust classifier, and
    - identifying a small set of useful features, which may in turn provide physiological insight.

SLIDE 3

Speech Database

  • 15 patients (6 male, 9 female)
  • 18 control subjects (9 male, 9 female)
  • Corpus: a 65-sentence short story
  • Observation groupings:
    - G1: 13 observations/speaker (5 sentences each)
    - G2: 5 observations/speaker (13 sentences each)

SLIDE 4

Speech Features

  • Prosodics
  • Vocal tract resonant frequencies (formants)
  • Glottal waveform
  • Teager FM

SLIDE 5

Speech Features (cont.)

Raw features are extracted frame by frame (25-30 ms) and grouped into 10 categories (a framing sketch follows the list):

  • Teager FM (TFM)
  • Glottal Timing (GLT)
  • Formant Bandwidths (FBW)
  • Speaking Rate (SPR)
  • Formant Locations (FMT)
  • Energy Deviation Statistics (EDS), EDS = STD(DFS(Ev))
  • Glottal Spectrum (GLS)
  • Energy Median Statistics (EMS), EMS = MED(DFS(Ev))
  • Glottal Ratios (GLR)
  • Pitch (PCH)
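As a rough illustration of the frame-by-frame extraction (not the authors' code; the 10 ms hop length is my assumption, and the slide only specifies the 25-30 ms frame length):

```python
import numpy as np

def frame_signal(x, fs, frame_ms=25.0, hop_ms=10.0):
    """Split a 1-D signal into overlapping analysis frames of ~25 ms."""
    frame_len = int(fs * frame_ms / 1000.0)
    hop_len = int(fs * hop_ms / 1000.0)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop_len)
    return np.stack([x[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

# Example: 1 s of audio at 8 kHz -> 98 frames of 200 samples each
frames = frame_signal(np.random.randn(8000), fs=8000)
```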

SLIDE 6

Statistics

  • Sentence-level statistics were computed for each raw feature → Direct Feature Statistics (DFS).
  • The same set of statistics was then applied to the DFS over each entire observation → Observation-Level Statistics (OFS).

  Statistic                   Equation
  Average (AVG)               (1/N) * Sum{x_i}
  Median (MED)                50th percentile
  Standard Deviation (STD)    Sqrt(1/(N-1) * Sum{(x_i - Mean(x))^2})
  Minimum (MIN)               5th percentile
  Maximum (MAX)               95th percentile
  Range (RNG)                 MAX - MIN
  Dynamic Range (DRNG)        log10(MAX) - log10(MIN)
  Interquartile Range (IQR)   75th percentile - 25th percentile
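As a minimal sketch (my own, not the authors' implementation), the statistics in the table can be computed for a single feature track as follows:

```python
import numpy as np

def feature_statistics(x):
    """Sentence-level statistics from the table above for one feature track."""
    x = np.asarray(x, dtype=float)
    mn, mx = np.percentile(x, 5), np.percentile(x, 95)  # MIN / MAX as percentiles
    return {
        "AVG": x.mean(),
        "MED": np.median(x),
        "STD": x.std(ddof=1),                           # unbiased (N-1) form
        "MIN": mn,
        "MAX": mx,
        "RNG": mx - mn,
        "DRNG": np.log10(mx) - np.log10(mn),            # assumes a positive-valued feature
        "IQR": np.percentile(x, 75) - np.percentile(x, 25),
    }
```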

SLIDE 7

Final Feature Sets

  • Result: 2000+ distinct features (OFS).
  • Statistical significance tests (ANOVA) were used to initially prune the feature set.
  • Final size: 298-1246 features → a large feature-selection problem.

  Experiment   Observations   Features
  MG1          195            724
  MG2          75             298
  FG1          234            1246
  FG2          90             857
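A plausible sketch of the ANOVA pruning step (the significance level and the per-feature loop are my assumptions; the slide only states that ANOVA was used):

```python
import numpy as np
from scipy.stats import f_oneway

def prune_by_anova(X, y, alpha=0.05):
    """Keep features whose means differ significantly between the two classes.

    X: (n_observations, n_features) array, y: binary labels (0 = control, 1 = patient).
    Returns the column indices of the retained features.
    """
    keep = []
    for j in range(X.shape[1]):
        _, p_value = f_oneway(X[y == 0, j], X[y == 1, j])
        if p_value < alpha:
            keep.append(j)
    return np.array(keep)
```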

SLIDE 8

Feature Selection

Goal: select a (small) group of features that maximizes classifier performance.

Approaches:
  • Filter: optimize a computationally inexpensive fitness function.
  • Wrapper: fitness function = classification performance.

SLIDE 9

Genetic Programming for Classification and FS (GPFS)

Estimate the optimal feature set and the classifier simultaneously → "online approach" (Muni, Pal, Das 2006).

Advantages:
  • Evolutionary search: explores a (potentially) large portion of the feature space.
  • The resulting classifier is a simple algebraic expression (easy to read and interpret).
  • Stochastic: multiple runs should yield different solutions, so given a large number of runs, the frequency with which a feature is selected can be regarded as an approximate measure of its fitness.

SLIDE 10

Genetic Programming

  • The classifier consists of expression trees; a binary decision requires a single tree T.
  • The class is assigned by the algebraic sign of the tree's output (T > 0 → c1, T < 0 → c2).
  • Internal nodes: { +, -, ×, / (protected) }
  • External (leaf) nodes: { features, rnd_dbl(0-10) }
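A minimal sketch of how such a classifier tree could be represented and evaluated (the nested-tuple encoding and the protected-division guard value are my choices, not the authors'):

```python
def protected_div(a, b):
    """'Protected' division: return a neutral value instead of dividing by ~0."""
    return a / b if abs(b) > 1e-9 else 1.0

OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": protected_div}

def evaluate(tree, x):
    """tree: ('+', left, right) | ('feat', index) | ('const', value); x: feature vector."""
    kind = tree[0]
    if kind == "feat":
        return x[tree[1]]
    if kind == "const":
        return tree[1]
    return OPS[kind](evaluate(tree[1], x), evaluate(tree[2], x))

def classify(tree, x):
    """The class is decided by the sign of the tree's output."""
    return "c1" if evaluate(tree, x) > 0 else "c2"

# Example tree computing (x[3] - 2.5) * x[7]
tree = ("*", ("-", ("feat", 3), ("const", 2.5)), ("feat", 7))
print(classify(tree, [0, 0, 0, 4.0, 0, 0, 0, 1.0]))   # -> "c1"
```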

SLIDE 11

Genetic Programming (Cont.)

A large population of classifier trees is evolved over several generations.

  • Population initialization: random trees (height 2-6), ramped half-and-half method (sketched below).
  • Fitness function = classification performance.
  • Evolutionary operators:
    - Reproduction (fitness-proportional selection)
    - Mutation (random selection)
    - Crossover (tournament selection)
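Ramped half-and-half initialization is the standard GP recipe (Koza [9]); the sketch below is my own illustrative version in the tree encoding used above, not the authors' lil-gp configuration:

```python
import random

FUNCTIONS = ["+", "-", "*", "/"]

def random_tree(depth, method, n_features):
    """Grow one random tree; 'full' always reaches max depth, 'grow' may stop early."""
    if depth == 0 or (method == "grow" and random.random() < 0.3):
        if random.random() < 0.5:
            return ("feat", random.randrange(n_features))
        return ("const", random.uniform(0.0, 10.0))        # rnd_dbl(0-10) terminal
    op = random.choice(FUNCTIONS)
    return (op,
            random_tree(depth - 1, method, n_features),
            random_tree(depth - 1, method, n_features))

def ramped_half_and_half(pop_size, n_features, min_depth=2, max_depth=6):
    """Half 'full' trees, half 'grow' trees, with depths ramped over 2-6."""
    population = []
    for i in range(pop_size):
        depth = min_depth + i % (max_depth - min_depth + 1)
        method = "full" if i % 2 == 0 else "grow"
        population.append(random_tree(depth, method, n_features))
    return population
```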

SLIDE 12

Evolutionary Rules for Simultaneous Feature Selection

  • Initial tree generation: the probability of selecting a feature set decreases linearly with feature-set size.
  • Fitness: biased toward trees that use few features (see the sketch after this list).
  • Crossover:
    - Homogeneous: only between parents with the same feature set.
    - Heterogeneous: biased toward selecting parents with similar feature sets.
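The exact fitness formula is not reproduced on the slide; as a loose sketch of the idea (my assumption, in the spirit of Muni, Pal & Das [5]), the bias toward small feature subsets could be folded into the fitness like this:

```python
def gp_fitness(accuracy, n_features_used, n_features_total, penalty_weight):
    """accuracy in [0, 1]; the size penalty shrinks as penalty_weight -> 0
    (the next slide decreases this weight over the generations)."""
    size_penalty = penalty_weight * (n_features_used / n_features_total)
    return accuracy * (1.0 - size_penalty)
```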

SLIDE 13

Dynamic Parameters

  • The fitness bias toward smaller subsets decreases with generations.
  • The probability of heterogeneous crossover decreases with generations.

Motivation: explore the feature space during the first few generations, then gradually concentrate on improving classification performance with the current feature sets.
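A linear decay is one simple way to realize these schedules (the linear form and the start/end values are my assumptions; the slide only states that both quantities decrease):

```python
def linear_schedule(start, end, generation, n_generations):
    """Interpolate from `start` at generation 0 to `end` at the last generation."""
    frac = generation / max(1, n_generations - 1)
    return start + frac * (end - start)

# e.g. penalty_weight = linear_schedule(0.5, 0.0, g, 30)
#      p_heterogeneous_crossover = linear_schedule(0.9, 0.1, g, 30)
```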

SLIDE 14

GP Parameters

  Parameter                                             Value
  Crossover probability                                 0.80
  Reproduction probability                              0.05
  Mutation probability                                  0.15
  Prob. of selecting int./ext. node during crossover    0.8 / 0.2
  Prob. of selecting int./ext. node during mutation     0.7 / 0.3
  Tournament size                                       10
  Number of generations                                 30 for G1 / 20 for G2
  Initial height of trees                               2-6
  Maximum allowed nodes of a tree                       350
  Maximum height of a tree                              12
  Population size                                       3000 for G1 / 2000 for G2

SLIDE 15

GP Results

Classification performance, averaged over 10 runs of leave-one-out cross-validation:

                            Male G1   Male G2   Female G1   Female G2   Mean
  Classification Accuracy   71.2      71.3      84.9        82.2        77.4
  Sensitivity               80.9      74.7      85.4        82.7        80.9
  Specificity               64.8      69.1      84.4        81.8        75.0
  Feature Set Size          18.5      15.3      16.1        14.2        16.0
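For reference, leave-one-out cross-validation of a classifier on a small dataset like this can be sketched as follows (illustrative only, not the authors' evaluation code):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

def loocv_accuracy(make_classifier, X, y):
    """Fit a fresh classifier per fold and score it on the single held-out observation."""
    hits = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = make_classifier()
        clf.fit(X[train_idx], y[train_idx])
        hits.append(clf.predict(X[test_idx])[0] == y[test_idx][0])
    return float(np.mean(hits))
```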

SLIDE 16

Feature Selection Histograms

SLIDE 17

“Best” Features -- Males

  Male - G1                  Male - G2
  GLT: Med((CP)MIN)          GLS: Avg((gSt1000)MAX)
  EDS: Med(MED)              GLR: Max((rCPOP)MIN)
  EDS: Avg(MED)              GLT: Std((CP)MIN)
  EDS: Avg(AVG)              GLR: Med((rCPOP)MIN)
  GLS: Avg((gSt1000)MAX)     EDS: Avg(AVG)
  GLR: Rng((rCPO)IQR)        GLR: Min((rOPO)IQR)
  GLT: Std((OP)IQR)          GLT: IQR((CP)IQR)
  GLS: Med((gSt1000)MAX)     EDS: Avg(MED)
  GLT: DRng((CP)IQR)         PCH: Med(A1)
  GLT: Max((CP)MIN)          GLT: Max((CP)MIN)

SLIDE 18

“Best” Features -- Females

  Female - G1          Female - G2
  EMS: Avg(AVG)        EMS: Med(RNG)
  EMS: Avg(MED)        EMS: Med(MAX)
  EMS: Avg(STD_1)      FBW: Med((bwF3)IQR)
  EMS: Max(MAX)        EMS: Med(MR)
  EMS: Med(AVG)        TFM: Avg(MAX(IQR))
  EMS: Max(STD_1)      EMS: Max(MR)
  EMS: Med(RNG)        EMS: Med(STD)
  EMS: Max(MR)         PCH: IQR(IQR)
  EMS: Med(STD_1)      EMS: Med(STD_1)
  EMS: Med(MR)         EMS: IQR(AVG_1)

SLIDE 19

GP Results (Cont.)

  • GP results were not as good as hoped for.
  • However, the fact that certain features were selected in the final solutions more frequently than others can be regarded as a measure of their usefulness.
  • To test this hypothesis, we train Bayesian classifiers using the 16 features most frequently selected by GP.

SLIDE 20

Naive Bayesian Classification

  • Assign the class Cj with the highest probability given the observation (feature vector X).
  • This posterior can be estimated using Bayes' rule:

      p(Cj | X) = p(X | Cj) P(Cj) / p(X)

  • Under the naive assumption, the class-conditional distribution factors over the individual features:

      p(X | Cj) = Prod_i p(x_i | Cj)
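A compact sketch of the naive-Gaussian variant described on the next slide (my own minimal implementation, not the authors' code): each feature's class-conditional density is a 1-D Gaussian with the sample mean and unbiased variance.

```python
import numpy as np

class GaussianNaiveBayes:
    """Naive Bayes with one 1-D Gaussian per feature and per class."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.prior_ = {c: np.mean(y == c) for c in self.classes_}
        self.mu_ = {c: X[y == c].mean(axis=0) for c in self.classes_}
        self.var_ = {c: X[y == c].var(axis=0, ddof=1) for c in self.classes_}
        return self

    def predict(self, X):
        def log_posterior(c):
            log_lik = -0.5 * (np.log(2 * np.pi * self.var_[c])
                              + (X - self.mu_[c]) ** 2 / self.var_[c]).sum(axis=1)
            return log_lik + np.log(self.prior_[c])

        scores = np.stack([log_posterior(c) for c in self.classes_], axis=1)
        return self.classes_[np.argmax(scores, axis=1)]
```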

SLIDE 21

PDF estimation methods

  • Uniform Bins: a histogram with N uniformly spaced intervals (bins) is computed for each feature and each class using the training data. The optimum value of N was found by exhaustive search.
  • Optimal Threshold: similar to uniform bins with N = 2, but the cutoff threshold between the two bins is chosen separately for each feature.
  • (Naive) Gaussian Assumption: the PDF of each feature and each class is modeled as a 1-D Gaussian density whose mean and variance are taken as the sample mean and (unbiased) variance of the training data.
  • Gaussian Mixtures: each likelihood function p(X | Cj) is modeled as a weighted sum of multivariate Gaussian densities (see the sketch after this list). The expectation-maximization (EM) algorithm is used to estimate the means, covariance matrices, and weights. We use diagonal covariance matrices and limit the number of mixtures to 3 for the G1 experiments and 2 for the G2 experiments in order to reduce the number of parameters to be estimated.
  • Multivariate Gaussian: each class-conditional likelihood function is modeled as a single multivariate Gaussian PDF with a full covariance matrix. Like the GMM, this method does not follow the naive assumption.
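The Gaussian-mixture method could be realized with scikit-learn roughly as below (an assumed re-implementation for illustration; the slides do not specify the authors' tooling): one diagonal-covariance GMM per class, combined with the class priors for the Bayes decision.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_bayes(X, y, n_mixtures=3):
    """Fit one diagonal-covariance GMM per class (3 mixtures for G1, 2 for G2)."""
    models, priors = {}, {}
    for c in np.unique(y):
        models[c] = GaussianMixture(n_components=n_mixtures,
                                    covariance_type="diag").fit(X[y == c])
        priors[c] = np.mean(y == c)
    return models, priors

def predict_gmm_bayes(models, priors, X):
    """Pick the class maximizing log p(X | Cj) + log P(Cj)."""
    classes = sorted(models)
    scores = np.stack([models[c].score_samples(X) + np.log(priors[c])
                       for c in classes], axis=1)
    return np.array(classes)[np.argmax(scores, axis=1)]
```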

SLIDE 22

Results

  Exp         Method              Acc    Sen    Spec
  Male G1     Unif Bin (N = 8)    86.7   83.3   88.9
              Opt Thresh          82.6   82.1   82.9
              Gaussian            87.2   88.5   86.3
              GMM                 88.7   87.2   89.7
              MVG                 84.1   83.3   84.6
  Male G2     Unif Bin (N = 2)    90.7   93.3   88.9
              Opt Thresh          73.3   50.0   88.9
              Gaussian            89.3   93.3   86.7
              GMM                 90.7   90.0   91.1
              MVG                 86.7   80.0   91.1
  Female G1   Unif Bin (N = 9)    88.0   85.5   90.6
              Opt Thresh          78.6   65.8   91.5
              Gaussian            87.2   91.5   82.9
              GMM                 87.6   88.0   87.7
              MVG                 85.5   83.8   87.2
  Female G2   Unif Bin (N = 5)    93.3   93.3   93.3
              Opt Thresh          86.7   75.6   97.8
              Gaussian            91.1   95.6   86.7
              GMM                 88.0   83.3   91.1
              MVG                 92.2   86.7   97.8

Average Improvement: 18.5% (Males), 7.1% (Females)

SLIDE 23

Conclusion

  • GPFS was successful in finding a small set of highly discriminating features.
  • Selection frequency should be measured not just for single features, but also for groups of features.
  • GPFS may be performing feature selection too quickly; it may be beneficial to encourage more exploration of the feature space.

SLIDE 24

References

1. E. Moore, M. Clements, J. Peifer, and L. Weisser, Comparing objective feature statistics of speech for classifying clinical depression. In Proceedings of the 26th Annual Conf. on Eng. in Medicine and Biology, pages 17-20, San Francisco, CA, 2004.
2. E. Moore, M. Clements, J. Peifer, and L. Weisser, Analysis of prosodic variation in speech for clinical depression. In Proceedings of the 25th Annual Conf. on Eng. in Medicine and Biology, pages 2849-2852, Cancun, Mexico, 2003.
3. M. Dash and H. Liu, Feature selection for classification. Intelligent Data Analysis, 1(3):131-156, 1997.
4. D. Muni, N. Pal, and J. Das, A novel approach to design classifiers using genetic programming. IEEE Transactions on Evolutionary Computation, 8(2):183-196, 2003.
5. D. Muni, N. Pal, and J. Das, Genetic programming for simultaneous feature selection and classifier design. IEEE Transactions on Systems, Man and Cybernetics, Part B, 36(1):106-117, 2006.
6. D. Zongker and W. Punch, Lilgp 1.01 User's Manual. Genetic Algorithms and Research Application Group, Michigan State University, East Lansing, MI, 1998. http://garage.cse.msu.edu/software/lil-gp/index.html
7. T.F. Quatieri, Discrete-Time Speech Signal Processing. Prentice Hall, Upper Saddle River, NJ, 2001.

SLIDE 25

References (cont.)

8. G. Zhou, J. Hansen, and J. Kaiser, Nonlinear feature based classification of speech under stress. IEEE Transactions on Speech and Audio Processing, 9(3):201-216, 2001.
9. J.R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA, 1992.
10. R. Duda, P. Hart, and D. Stork, Pattern Classification. Wiley, New York, 2001.
11. C. Elkan, Naive Bayesian learning. Adapted from Technical Report No. CS97-557, Dept. of Computer Science and Engineering, University of California, San Diego, CA, 1997.
12. Y. Yang and G. Webb, On why discretization works for naive-Bayes classifiers. In Proceedings of the 16th Australian Joint Conference on Artificial Intelligence (AI), pages 440-452, Perth, Australia, 2003.
13. M. Wiggins, A. Saad, B. Litt, and G. Vachtsevanos, Genetic algorithm-evolved Bayesian network classifier for medical applications. In Proceedings of the Tenth World Soft Computing Conference, 2005.
14. S. Theodoridis and K. Koutroumbas, Pattern Recognition. Academic Press, San Diego, CA, 1999.
15. H.B. Amor and A. Rettinger, Intelligent exploration for genetic algorithms: using self-organizing maps in evolutionary computation. In Proceedings of the 2005 Conference on Genetic and Evolutionary Computation (GECCO), pages 1531-1538, Washington, D.C., 2005.