SLIDE 1

Data Mining in Bioinformatics Day 3: Feature Selection

Karsten Borgwardt
February 21 to March 4, 2011
Machine Learning & Computational Biology Research Group, MPIs Tübingen

SLIDE 2

What is feature selection?

Abundance of features: Usually, our output variable Y does not depend on all of our input features X.

Why is this? X usually includes all features that could determine Y according to our prior knowledge, but we do not know for sure; in fact, we perform supervised learning to determine this dependence between input variables and output variables. '(Supervised) feature selection' therefore means 'selecting the relevant subset of features' for a particular learning task.

SLIDE 3

Why feature selection?

Reasons for feature selection:
• to detect causal features
• to remove noisy features
• to reduce the set of features that has to be observed (cost, speed, data understanding)

Two modes of feature selection:
• Filter approaches: select interesting features a priori, based on a quality function (information criterion)
• Wrapper approaches: select features that are interesting for one particular classifier

SLIDE 4

Optimisation problem

Combinatorial problem: Given a set of features D and a quality function q, we try to find the subset S of D of cardinality n′ that maximises q:

arg max_{S ⊆ D, |S| = n′} q(S)    (1)

Exponential runtime effort: The computational effort for enumerating all possible subsets grows exponentially and is hence intractable for large D and n′. In practice, we have to find a workaround (a brute-force sketch follows below)!
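
To make the brute-force baseline concrete, here is a minimal sketch of exhaustive enumeration; the quality function q is a placeholder for any of the criteria discussed on the following slides.

from itertools import combinations

def exhaustive_selection(D, q, n_prime):
    """Return the subset S of D with |S| = n_prime that maximises q(S).

    Enumerates all C(|D|, n_prime) candidates, so this is only feasible
    for very small feature sets.
    """
    best_S, best_val = None, float("-inf")
    for S in combinations(D, n_prime):
        val = q(set(S))
        if val > best_val:
            best_S, best_val = set(S), val
    return best_S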

SLIDE 5

Greedy selection

Take the currently best one: Greedy selection is an alternative to exhaustive enumeration. The idea is to iteratively add the currently most informative feature to the selected set, or to remove the currently most uninformative feature from the solution set. These two variants of greedy feature selection are referred to as:

• forward feature selection
• backward elimination

SLIDE 6

Greedy selection

Forward Feature Selection

1: S† ← ∅, S ← D
2: repeat
3:   j* ← arg max_{j ∈ S} q(S† ∪ {j})
4:   S† ← S† ∪ {j*}
5:   S ← S \ {j*}
6: until S = ∅
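
A minimal Python sketch of this loop; q is assumed to be any set function that scores a candidate feature subset (for example, one of the criteria on the following slides), and n_prime optionally stops the loop early.

def forward_selection(D, q, n_prime=None):
    """Greedy forward feature selection.

    D: iterable of candidate features, q: set function to maximise,
    n_prime: optional number of features to select (default: rank all).
    """
    remaining = set(D)
    selected = []  # S-dagger, in the order in which features were added
    while remaining and (n_prime is None or len(selected) < n_prime):
        # add the feature whose inclusion yields the highest quality
        best = max(remaining, key=lambda j: q(set(selected) | {j}))
        selected.append(best)
        remaining.remove(best)
    return selected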

SLIDE 7

Greedy selection

Backward Elimination

1: S† ← ∅, S ← D
2: repeat
3:   j* ← arg max_{j ∈ S} q(S \ {j})
4:   S ← S \ {j*}
5:   S† ← S† ∪ {j*}
6: until S = ∅
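
The corresponding Python sketch for backward elimination; again q is a generic set-quality function, and the returned list records the order in which features were eliminated (the last entries are the most informative ones).

def backward_elimination(D, q):
    """Greedy backward elimination: repeatedly drop the least informative feature."""
    current = set(D)
    eliminated = []  # S-dagger: features in order of removal
    while current:
        # remove the feature whose absence keeps the quality highest
        worst = max(current, key=lambda j: q(current - {j}))
        current.remove(worst)
        eliminated.append(worst)
    return eliminated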

Optimality of greedy selection: Greedy selection is only optimal if q decomposes over the elements of S, that is, if

q(S) = Σ_{X ∈ S} q(X)    (2)

It is near-optimal if q is submodular (more details later); otherwise there is no guarantee of optimality.

SLIDE 8

Correlation Coefficient

Definition: The correlation coefficient ρ_{X,Y} between two random variables X and Y with expected values µ_X and µ_Y and standard deviations σ_X and σ_Y is defined as

ρ_{X,Y} = cov(X, Y) / (σ_X σ_Y)    (3)
        = E[(X − µ_X)(Y − µ_Y)] / (σ_X σ_Y),    (4)

where E is the expected value operator and cov denotes covariance.
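
As a filter criterion, each feature is typically scored by the absolute value of its empirical correlation with the labels. A small NumPy sketch; the data matrix X (one feature per column) and the label vector y are hypothetical placeholders.

import numpy as np

def correlation_scores(X, y):
    """|Pearson correlation| between every column of X (m x n) and the labels y (m,)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    cov = Xc.T @ yc / len(y)            # empirical covariances
    return np.abs(cov / (X.std(axis=0) * y.std()))

# rank features by decreasing |rho|:
# ranking = np.argsort(-correlation_scores(X, y))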

SLIDE 9

Mutual Information

Definition: Given two random variables X and Y, we define the mutual information I as

I(X, Y) = Σ_{y ∈ Y} Σ_{x ∈ X} p(x, y) log( p(x, y) / (p(x) p(y)) ),    (5)

where
• X is the input variable and Y is the output variable,
• p(x, y) is the (joint) probability of observing x and y,
• p(x) and p(y) are the marginal probabilities of observing x and y, respectively,
• log is usually the logarithm with base 2.
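
For discrete features, the double sum in (5) can be evaluated directly from empirical counts; a minimal sketch (base-2 logarithm, so the result is measured in bits).

import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) = c/n, p(a) = px[a]/n, p(b) = py[b]/n
        mi += (c / n) * np.log2(c * n / (px[a] * py[b]))
    return mi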

SLIDE 10

HSIC

Definition: The Hilbert-Schmidt Independence Criterion (HSIC) measures the dependence of two random variables. Given m paired samples of two random variables X and Y, an empirical estimate of the HSIC can be computed as

trace(KHLH),    (6)

where
• K is a kernel on X,
• L is a kernel on Y,
• H is a centering matrix with H(i, j) = δ(i, j) − 1/m.

HSIC(X, Y) = 0 iff X and Y are independent. The larger HSIC(X, Y), the larger the dependence between X and Y.
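
A direct sketch of the empirical estimate trace(KHLH); the Gaussian (RBF) kernel used for both K and L is an assumed choice, and some formulations additionally normalise the trace by 1/(m−1)².

import numpy as np

def rbf_kernel(A, gamma=1.0):
    """Gaussian kernel matrix between the rows of A."""
    sq = np.sum(A ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * A @ A.T))

def hsic(X, Y, gamma=1.0):
    """Empirical HSIC estimate trace(KHLH) for paired samples X (m x d) and Y (m x p)."""
    m = X.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m   # centering matrix
    K, L = rbf_kernel(X, gamma), rbf_kernel(Y, gamma)
    return np.trace(K @ H @ L @ H)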

SLIDE 11

Submodular Functions I

Definition: A set function q on a set D is said to be submodular if

q(S ∪ {X}) − q(S) ≥ q(T ∪ {X}) − q(T)    (7)

for all subsets S ⊆ T ⊆ D and all elements X ∈ D \ T.

This is referred to as the property of 'diminishing returns': if S is a subset of T, then S benefits at least as much from adding X as T does.

SLIDE 12

Submodular Functions II

Near-optimality (Nemhauser, Wolsey, and Fisher, 1978): If q is a submodular, nondecreasing set function and q(∅) = 0, then the greedy algorithm is guaranteed to find a set S such that

q(S) ≥ (1 − 1/e) · max_{|T| = |S|} q(T)    (8)

This means that the solution of greedy selection reaches at least ∼63% of the quality of the optimal solution.

SLIDE 13

Submodular Functions III

Example: Sensor Placement. Imagine our features form a graph G = (D, E) and that the features are possible locations for a sensor. A sensor placed at node v covers v and its neighbourhood N(v), so the quality of a set of locations S is the number of nodes it covers, q(S) = |⋃_{v ∈ S} (N(v) ∪ {v})|. We now want to pick locations in the graph such that our sensors cover as large an area of the graph as possible. q fulfils the following properties:

• q(∅) = 0
• q is nondecreasing
• q is submodular

Hence greedy selection leads to near-optimal sensor placement (see the sketch below)!
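
A small sketch of this coverage problem, assuming the graph is given as an adjacency dictionary mapping each node to the set of its neighbours; because q is nondecreasing and submodular with q(∅) = 0, the greedy loop inherits the (1 − 1/e) guarantee from the previous slide.

def greedy_sensor_placement(adjacency, k):
    """Greedily pick k sensor locations that maximise the number of covered nodes."""
    covered, chosen = set(), []
    candidates = set(adjacency)
    for _ in range(min(k, len(candidates))):
        # marginal coverage gain of placing a sensor at node v
        best = max(candidates, key=lambda v: len((adjacency[v] | {v}) - covered))
        chosen.append(best)
        candidates.remove(best)
        covered |= adjacency[best] | {best}
    return chosen, covered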

SLIDE 14

Wrapper methods

Two flavours:
• Embedded: the selection process is really integrated into the learning algorithm
• Not embedded (wrapper): the learning algorithm is employed as a quality measure

Wrappers:
• Simple wrapper: do prediction using one feature only; use classification accuracy as the measure of quality
• Extend this to groups of features by heuristic search strategies (greedy, Monte Carlo, etc.)

Embedded, typical examples:
• Decision trees
• ℓ0 norm SVM

SLIDE 15

ℓ0 norm SVM

Three steps (see the sketch below):

1. Train a regular linear SVM (using ℓ1-norm or ℓ2-norm regularization).
2. Re-scale the input variables by multiplying them by the absolute values of the components of the weight vector w obtained.
3. Iterate the first two steps until convergence.
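
A sketch of this rescaling loop using scikit-learn's LinearSVC; the fixed number of iterations (instead of a convergence test), the default ℓ2 penalty, and binary labels y (so that the weight vector w is a single row) are simplifying assumptions.

import numpy as np
from sklearn.svm import LinearSVC

def l0_svm_ranking(X, y, n_iter=10):
    """Approximate l0-norm feature selection by iteratively rescaling inputs with |w|."""
    scale = np.ones(X.shape[1])
    for _ in range(n_iter):
        clf = LinearSVC().fit(X * scale, y)    # step 1: train a regular linear SVM
        scale *= np.abs(clf.coef_.ravel())     # step 2: rescale inputs by |w|
    return np.argsort(-scale)                  # features that survive the rescaling come first
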
SLIDE 16

Unsupervised feature selection

Problem setting: Even without a target variable y, we can select features that are informative according to some criterion.

Criteria (Guyon and Elisseeff, 2003):
• Saliency: a feature is salient if it has a high variance or range
• Entropy: a feature has high entropy if the distribution of examples is uniform
• Smoothness: a feature in a time series is smooth if on average its local curvature is moderate
• Density: a feature is in a high-density region if it is highly connected with many other variables
• Reliability: a feature is reliable if the measurement error bars are smaller than the variability
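
A minimal sketch of the first two criteria on a data matrix X (samples in rows); estimating the entropy of a continuous feature via a fixed-width histogram is an assumption, not something prescribed by the slide.

import numpy as np

def saliency(X):
    """Variance of each feature -- the 'saliency' criterion."""
    return X.var(axis=0)

def entropy(X, bins=10):
    """Histogram-based entropy estimate (in bits) of each feature."""
    scores = []
    for col in X.T:
        counts, _ = np.histogram(col, bins=bins)
        p = counts[counts > 0] / counts.sum()
        scores.append(-(p * np.log2(p)).sum())
    return np.array(scores)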

SLIDE 17

Feature Selection in Practice

Catalog of 10 questions by Guyon and Elisseeff

1. Do you have domain knowledge? If yes, construct a better set of ad hoc features.
2. Are your features commensurate? If no, consider normalizing them.
3. Do you suspect interdependence of features? If yes, expand your feature set by constructing conjunctive features or products of features, as much as your computer resources allow.
4. Do you need to prune the input variables (e.g. for cost, speed or data understanding reasons)? If no, construct disjunctive features or weighted sums of features.

SLIDE 18

Feature Selection in Practice

Catalog of 10 questions by Guyon and Elisseeff

5. Do you need to assess features individually (e.g. to understand their influence on the system or because their number is so large that you need to do a first filtering)? If yes, use a variable ranking method; else, do it anyway to get baseline results.
6. Do you need a predictor? If no, stop.
7. Do you suspect your data is 'dirty' (has a few meaningless input patterns and/or noisy outputs or wrong class labels)? If yes, detect the outlier examples using the top ranking variables obtained in step 5 as representation; check and/or discard them.

SLIDE 19

Feature Selection in Practice

Catalog of 10 questions by Guyon and Elisseeff

8. Do you know what to try first? If no, use a linear predictor. Use a forward selection method with the "probe" method as a stopping criterion, or use the ℓ0-norm embedded method. For comparison, following the ranking of step 5, construct a sequence of predictors of the same nature using increasing subsets of features. Can you match or improve performance with a smaller subset? If yes, try a non-linear predictor with that subset.
9. Do you have new ideas, time, computational resources, and enough examples? If yes, compare several feature selection methods, including your new idea, correlation coefficients, backward selection and embedded methods. Use linear and non-linear predictors. Select the best approach with model selection.

SLIDE 20

Feature Selection in Practice

Catalog of 10 questions by Guyon and Elisseeff

10. Do you want a stable solution (to improve performance and/or understanding)? If yes, subsample your data and redo your analysis for several "bootstraps".

SLIDE 21

Number of features

What if we don't know a reasonable choice of n′?

Use the probe method (Bi et al., 2003; Stoppiglia et al., 2003; Tusher et al., 2003):
• Insert 'fake' features (= probes) into the set of features.
• Fake features can be drawn randomly from a Gaussian distribution, or they can be created in a nonparametric manner by randomly shuffling existing features (see the sketch below).
• Stop feature selection when you select the first fake feature, or when the proportion of selected fake features exceeds a certain threshold.

HSIC-based stopping criterion: Stop feature selection when there is no more dependence between the features X and the labels Y (Gretton et al., NIPS 2007).
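
One way to generate the nonparametric probes described above is to append randomly shuffled copies of existing columns of X; a hypothetical sketch.

import numpy as np

def add_probes(X, n_probes, seed=0):
    """Append n_probes 'fake' features created by shuffling randomly chosen columns of X."""
    rng = np.random.default_rng(seed)
    cols = rng.integers(0, X.shape[1], size=n_probes)
    probes = np.column_stack([rng.permutation(X[:, c]) for c in cols])
    is_probe = np.r_[np.zeros(X.shape[1], dtype=bool), np.ones(n_probes, dtype=bool)]
    return np.hstack([X, probes]), is_probe

# run feature selection on the augmented matrix and stop as soon as the
# first feature j with is_probe[j] == True is selected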

SLIDE 22

Revealing examples

Can presumably redundant variables help each other? Noise reduction and consequently better class separation may be obtained by adding variables that are presumably redundant.

How does correlation impact variable redundancy? Perfectly correlated variables are truly redundant in the sense that no additional information is gained by adding them. Very high variable correlation (or anti-correlation) does not mean absence of variable complementarity.

Can a variable that is useless by itself be useful with others? Two variables that are useless by themselves can be useful together.
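
The last point is classically illustrated with XOR-like data: each feature alone is uncorrelated with the label, yet the two features together determine it exactly. A toy check, using scikit-learn's DecisionTreeClassifier purely for illustration.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, 1000)
x2 = rng.integers(0, 2, 1000)
y = x1 ^ x2                      # label depends on x1 and x2 jointly

print(np.corrcoef(x1, y)[0, 1])  # ~0: x1 alone carries no (linear) information
tree_single = DecisionTreeClassifier().fit(x1[:, None], y)
tree_pair = DecisionTreeClassifier().fit(np.column_stack([x1, x2]), y)
print(tree_single.score(x1[:, None], y))               # close to chance (~0.5)
print(tree_pair.score(np.column_stack([x1, x2]), y))   # 1.0: useful only together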

SLIDE 23

References and further reading

References

[1] Isabelle Guyon and André Elisseeff. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3, pages 1157–1182, 2003.

SLIDE 24

The end

See you tomorrow! Next topic: Text Mining