Data Preprocessing

Chris Williams, School of Informatics, University of Edinburgh

Data preparation is a big issue for data mining. Cabena et al (1998) estimate that data preparation accounts for 60% of the effort in a data mining application.

  • Data cleaning
  • Data integration and transformation
  • Data reduction

Reading: Han and Kamber, chapter 3

Why Data Preprocessing?

Data in the real world is dirty. It is:

  • incomplete, e.g. lacking attribute values
  • noisy, e.g. containing errors or outliers
  • inconsistent, e.g. containing discrepancies in codes or names

GIGO (garbage in, garbage out): quality data is needed to get quality results

Major Tasks in Data Preprocessing

[Figure: forms of data preprocessing: cleaning, integration, transformation (attribute values 2, 32, 100, 59, 48 rescaled to 0.02, 0.32, 1.00, 0.59, 0.48) and reduction (attributes A1–A126 and transactions T1–T2000 reduced to a subset)]

  • Data cleaning
  • Data integration
  • Data transformation
  • Data reduction

Figure from Han and Kamber

Data Cleaning Tasks

  • Handle missing values
  • Identify outliers, smooth out noisy data
  • Correct inconsistent data
Missing Data

What happens if input data is missing? Is it missing at random (MAR), or is there a systematic reason for its absence? Let xm denote those values missing, and xp those values that are present. If MAR, some “solutions” are:

– Model P(xm|xp) and average (correct, but hard)
– Replace data with its mean value (?)
– Look for similar (close) input patterns and use them to infer missing values (crude version of a density model)

Reference: R. J. A. Little and D. B. Rubin, Statistical Analysis with Missing Data, Wiley (1987)
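A minimal sketch of the last two “solutions” above, assuming the data is a NumPy array with NaN marking the missing entries; the function names and the toy array are illustrative, not from the lecture.

```python
import numpy as np

def impute_mean(X):
    """Replace each missing entry (NaN) with the mean of its column."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def impute_nearest(X):
    """Fill missing entries from the closest complete row (a crude stand-in for a density model).

    Distances use only the attributes the incomplete row has observed."""
    X = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        present = ~np.isnan(X[i])
        dists = np.linalg.norm(complete[:, present] - X[i, present], axis=1)
        X[i, ~present] = complete[np.argmin(dists), ~present]
    return X

X = np.array([[2.0, 32.0], [np.nan, 100.0], [0.5, np.nan]])
print(impute_mean(X))
print(impute_nearest(X))
```

Mean imputation distorts the attribute's variance and any correlations with other attributes, one reason it is marked with “(?)” above.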

  • Outliers can be detected by clustering, or by combined computer and human inspection

Data Integration

Combines data from multiple sources into a coherent store

  • Entity identification problem: identify real-world entities from multiple data sources, e.g. A.cust-id ≡ B.cust-num
  • Detecting and resolving data value conflicts: for the same real-world entity, attribute values are different, e.g. measurement in different units
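Both points can be illustrated with a small pandas sketch; the tables and column names (cust_id, cust_num, the weight columns) are made up for the example.

```python
import pandas as pd

# Two hypothetical sources describing the same customers, with different key
# names (A.cust-id vs B.cust-num) and weights recorded in different units.
a = pd.DataFrame({"cust_id": [1, 2, 3], "weight_kg": [70.0, 82.5, 65.0]})
b = pd.DataFrame({"cust_num": [2, 3, 4], "weight_lb": [181.5, 143.0, 150.0]})

# Entity identification: declare that cust_id and cust_num refer to the same entity.
merged = a.merge(b, left_on="cust_id", right_on="cust_num", how="outer")

# Data value conflicts: bring both measurements into one unit before comparing.
merged["weight_lb_as_kg"] = merged["weight_lb"] * 0.4536
print(merged)
```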

Data Transformation

  • Normalization, e.g. to zero mean, unit standard deviation (see the sketch after this list):

    new data = (old data − mean) / (std deviation)

  • Max-min normalization to [0, 1]:

    new data = (old data − min) / (max − min)

  • Normalization useful for e.g. k nearest neighbours, or for neural networks
  • New features constructed, e.g. with PCA or with hand-crafted features
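A small NumPy sketch of the two normalizations, run on the attribute values that appear in the preprocessing figure (2, 32, 100, 59, 48); the function names are my own.

```python
import numpy as np

def zscore(x):
    """Zero-mean, unit-standard-deviation normalization: (x - mean) / std."""
    return (x - x.mean()) / x.std()

def min_max(x):
    """Max-min normalization onto [0, 1]: (x - min) / (max - min)."""
    return (x - x.min()) / (x.max() - x.min())

x = np.array([2.0, 32.0, 100.0, 59.0, 48.0])
print(zscore(x))   # mean 0 and standard deviation 1 afterwards
print(min_max(x))  # 0.0 and 1.0 at the extremes, e.g. 32 -> (32 - 2)/98 ≈ 0.31
```

Both rescalings matter for k-nearest-neighbour methods because otherwise attributes with large ranges dominate the distance computation.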

Data Reduction

  • Feature selection: select a minimum set of features x̃ from x so that:

– P(class|x̃) closely approximates P(class|x)
– The classification accuracy does not significantly decrease

  • Data Compression (lossy)
  • PCA, Canonical variates
  • Sampling: choose a representative subset of the data

– Simple random sampling vs stratified sampling (see the sketch after this list)

  • Hierarchical reduction: e.g. country-county-town
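A minimal sketch contrasting simple random with stratified sampling, assuming a class label array; the 90/10 class split and the sample size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = np.array([0] * 900 + [1] * 100)   # a 90/10 class split

# Simple random sampling: the rare class may be badly under-represented.
simple = rng.choice(len(labels), size=50, replace=False)

# Stratified sampling: sample each class in proportion to its frequency.
stratified = np.concatenate([
    rng.choice(np.where(labels == c)[0],
               size=int(round(50 * np.mean(labels == c))),
               replace=False)
    for c in np.unique(labels)
])

print(labels[simple].mean())      # varies from run to run
print(labels[stratified].mean())  # 0.10 by construction
```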

Feature Selection

Usually as part of supervised learning

  • Stepwise strategies
  • (a) Forward selection: start with no features. Add the one which is the best predictor. Then add a second one to maximize performance using the first feature and the new one; and so on until a stopping criterion is satisfied (see the sketch after this list)
  • (b) Backwards elimination: start with all features; delete the feature whose removal reduces performance least, recursively, until a stopping criterion is satisfied

  • Forward selection is unable to anticipate interactions
  • Backward selection can suffer from problems of overfitting
  • They are heuristics to avoid considering all subsets of size k of d features
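A sketch of greedy forward selection as described in (a), written against an abstract score(subset) function (e.g. cross-validated accuracy of whatever classifier is in use); stopping when no candidate improves the score is one simple choice, not the lecture's prescription.

```python
def forward_selection(score, n_features, max_features):
    """Greedy forward selection: repeatedly add the feature that most improves `score`.

    `score(subset)` should return a quality measure (higher is better) for the
    classifier restricted to that feature subset."""
    selected, best_so_far = [], float("-inf")
    while len(selected) < max_features:
        candidates = [f for f in range(n_features) if f not in selected]
        best_score, best_f = max((score(selected + [f]), f) for f in candidates)
        if best_score <= best_so_far:      # stopping criterion: no improvement
            break
        selected.append(best_f)
        best_so_far = best_score
    return selected

# Toy usage: pretend features 1 and 3 are the only informative ones.
print(forward_selection(lambda s: len(set(s) & {1, 3}), n_features=5, max_features=5))
```

Because each feature is added on its own merit given what has already been selected, the procedure cannot anticipate interactions where two features are only useful jointly.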

Descriptive Modelling

Chris Williams, School of Informatics, University of Edinburgh

Descriptive models are a summary of the data

  • Describing data by probability distributions

– Parametric models
– Mixture models
– Non-parametric models
– Graphical models

  • Clustering

– Partition-based Clustering Algorithms
– Hierarchical Clustering
– Probabilistic Clustering using Mixture Models

Reading: HMS, chapter 9

Describing data by probability distributions

  • Parametric models, e.g. single multivariate Gaussian
  • Mixture models, e.g. mixture of Gaussians, mixture of Bernoullis
  • Non-parametric models, e.g. kernel density estimation

f̂(x) = (1/n) ∑_{i=1}^{n} Kh(x − xi)

This does not provide a good summary of the data, and is expensive to compute on large datasets.
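A short NumPy sketch of the kernel density estimate above with a Gaussian kernel Kh; the bandwidth and the two-bump toy data are arbitrary choices.

```python
import numpy as np

def gaussian_kde(x_query, data, h):
    """f_hat(x) = (1/n) * sum_i K_h(x - x_i), with K_h a Gaussian of width h."""
    diffs = (x_query[:, None] - data[None, :]) / h
    kernel_vals = np.exp(-0.5 * diffs**2) / (h * np.sqrt(2 * np.pi))
    return kernel_vals.mean(axis=1)      # the average over the n data points

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(1, 1.0, 100)])
xs = np.linspace(-5, 5, 200)
density = gaussian_kde(xs, data, h=0.3)   # evaluates f_hat at 200 query points
```

The cost point in the slide is visible here: every query point requires a sum over all n data points.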


Probability Distributions: Graphical Models

  • Mixture of Independence Models

[Figure: class variable C with children X1, . . . , X6, which are independent given C]

(also Naive Bayes model)

  • Fitting a given graphical model to data
  • Search over graphical structures

Clustering

Clustering is the partitioning of a data set into groups so that points in one group are similar to each other and are as different as possible from points in other groups

  • Partition-based Clustering Algorithms
  • Hierarchical Clustering
  • Probabilistic Clustering using Mixture Models

Examples

  • Split credit card owners into groups depending on what kinds of purchases they make
  • In biology, can be used to derive plant and animal taxonomies
  • Group documents on the web for information discovery

Defining a partition

  • Clustering algorithm with k groups
  • Mapping c from input example number to group to which it belongs
  • In Rd, assign to group j a cluster centre mj. Choose both c and the mj’s so as to minimize

    ∑_{i=1}^{n} |xi − mc(i)|²

  • Given c, optimization of the mj’s is easy; mj is just the mean of the data vectors assigned to class j
  • Optimization over c: we cannot compute all possible groupings, so the k-means algorithm is used to find a local optimum

k-means algorithm

initialize centres m1, . . . , mk
while (not terminated)
    for i = 1, . . . , n
        calculate |xi − mj|² for all centres
        assign datapoint i to the closest centre
    end for
    recompute each mj as the mean of the datapoints assigned to it
end while

  • This is a batch algorithm.
  • There is also an on-line version, where the centres are updated after each datapoint is seen
  • Also k-medoids: find a representative object for each cluster centre
  • Choice of k?
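A NumPy transcription of the batch algorithm above; initialising the centres from k randomly chosen datapoints and stopping once the centres stop moving are my own choices, since the pseudocode leaves both open.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Batch k-means: alternate assignment to the closest centre and centre re-estimation."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]   # initialise from random datapoints
    for _ in range(n_iters):
        # |x_i - m_j|^2 for every datapoint/centre pair, then assign to the closest centre
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        c = d2.argmin(axis=1)
        # recompute each m_j as the mean of the datapoints assigned to it
        new_centres = np.array([X[c == j].mean(axis=0) if np.any(c == j) else centres[j]
                                for j in range(k)])
        if np.allclose(new_centres, centres):    # terminate once the centres stop moving
            break
        centres = new_centres
    return centres, c

# Toy usage: two well-separated blobs in R^2
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centres, assignments = kmeans(X, k=2)
```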

Hierarchical clustering

for i = 1, . . . , n let Ci = {xi}
while there is more than one cluster left do
    let Ci and Cj be the clusters minimizing the distance D(Ci, Cj) between any two clusters
    Ci = Ci ∪ Cj
    remove cluster Cj
end

  • Results can be displayed as a dendrogram
  • This is agglomerative clustering; divisive techniques are also possible

[Figure: example dendrogram for 17 points (p02–p17) produced by agglomerative clustering]

Distance functions for hierarchical clustering

  • Single link (nearest neighbour)

Dsl(Ci, Cj) = min {d(x, y) | x ∈ Ci, y ∈ Cj}

The distance between the two closest points, one from each cluster. Can lead to “chaining”.

  • Complete link (furthest neighbour)

Dcl(Ci, Cj) = max {d(x, y) | x ∈ Ci, y ∈ Cj}

  • Centroid measure: the distance between clusters is the distance between their centroids
  • Others possible
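Assuming SciPy is available, its hierarchical-clustering routines implement the agglomerative procedure above directly, with the single-link and complete-link measures selected by name; the toy data is made up.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (5, 2)), rng.normal(3, 0.3, (5, 2))])

# Agglomerative clustering with single-link (nearest-neighbour) distance;
# method="complete" or method="centroid" give the other measures from this slide.
Z = linkage(X, method="single")

# Cut the dendrogram into two clusters (scipy.cluster.hierarchy.dendrogram can plot it).
labels = fcluster(Z, t=2, criterion="maxclust")
```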

Probabilistic Clustering

  • Using finite mixture models, trained with EM
  • Can be extended to deal with outliers by using an extra, broad distribution to “mop up” outliers
  • Can be used to cluster non-vectorial data, e.g. mixtures of Markov models for sequences

  • Methods for comparing choice of k
  • Disadvantage: parametric assumption for each component
  • Disadvantage: complexity of EM relative to e.g. k-means
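As a concrete instance, scikit-learn's GaussianMixture fits a finite Gaussian mixture with EM; comparing BIC scores is one common way of choosing k, not necessarily the method the lecture has in mind, and the toy data is made up.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(4, 1.0, (100, 2))])

# Fit mixtures with different numbers of components and compare by BIC (lower is better).
for k in (1, 2, 3, 4):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    print(k, gmm.bic(X))

# The fitted model gives soft, probabilistic cluster assignments (the EM "responsibilities").
best = GaussianMixture(n_components=2, random_state=0).fit(X)
responsibilities = best.predict_proba(X)
```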

Graphical Models: Causality

  • J. Pearl, Causality, Cambridge UP (2000)
  • To really understand causal structure, we need to predict the effect of interventions
  • Semantics of do(X = 1) in a causal belief network, as opposed to conditioning on X = 1

  • Example: smoking and lung cancer

Causal Bayesian Networks

A causal Bayesian network is a Bayesian network in which each arc is interpreted as a direct causal influence between a parent node and a child node, relative to the other nodes in the network. (Gregory Cooper, 1999, section 4)

Causation = behaviour under interventions

[Figure: causal network over the variables Season, Sprinkler, Rain, Wet and Slippery]

An Algebra of Doing

  • Available: algebra of seeing (observation), e.g. what is the chance it rained if we see that the grass is wet? P(rain|wet) = P(wet|rain)P(rain)/P(wet)
  • Needed: algebra of doing, e.g. what is the chance it rained if we make the grass wet? P(rain|do(wet)) = P(rain)
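The contrast can be checked numerically on a toy two-node network Rain → Wet; the probabilities are made up, and the intervention is implemented as the graph surgery described below (remove the arc into the intervened variable).

```python
# Toy network Rain -> Wet with made-up probabilities.
p_rain = 0.3
p_wet_given_rain = {True: 0.9, False: 0.2}   # P(wet | rain), P(wet | no rain)

# Seeing: P(rain | wet) by Bayes' rule.
p_wet = p_rain * p_wet_given_rain[True] + (1 - p_rain) * p_wet_given_rain[False]
p_rain_given_wet = p_rain * p_wet_given_rain[True] / p_wet

# Doing: do(wet) removes the arc Rain -> Wet, so Rain keeps its prior.
p_rain_given_do_wet = p_rain

print(p_rain_given_wet)     # ~0.66: seeing wet grass is evidence of rain
print(p_rain_given_do_wet)  # 0.30: making the grass wet says nothing about rain
```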

Truncated factorization formula

P(x1, . . . , xn | x̂′i) = ∏_{j≠i} P(xj | paj)   if xi = x′i,   and 0 otherwise

Equivalently,

P(x1, . . . , xn | x̂′i) = P(x1, . . . , xn) / P(x′i | pai)   if xi = x′i,   and 0 otherwise


Compare with conditioning:

P(x1, . . . , xn | x′i) = P(x1, . . . , xn) / P(x′i)   if xi = x′i,   and 0 otherwise

Intervention as surgery on graphs

[Figure: the network over Season, Sprinkler, Rain, Wet and Slippery under the intervention Sprinkler = On; the arc into the intervened node is removed]

Controlling confounding bias

We wish to evaluate the effect of X on Y; what other factors Z (known as covariates or confounders) do we need to adjust for?

Simpson’s “paradox”: an event C increases the probability of E in a population p, but decreases the probability of E in every subpopulation. E.g. UC Berkeley was investigated for sex bias (1975). Overall there was a higher rate of admission for males, but in every department there was a slight bias in favour of admitting females.

[Explanation: females applied to more competitive departments where the admission rate was low]

  • Another example: administering a drug gives rise to lower rates of recovery than giving a placebo for both males and females, but overall the drug can appear better
  • What treatment would you give to a patient coming into your office? The apparent answer is “if you know the patient is male or female, don’t give the drug, but if the gender is unknown, do!”. This answer is ridiculous!

  • The correct answer to the question depends not only on the observed probabilities, but also on the assumed causal model. The diagrams below can have the same P(C, E, F), but whether to use the combined or the gender-specific tables depends on the diagram.

[Figure: two causal diagrams relating Treatment (C) and Recovery (E) to a third variable F; when F = Gender, use the gender-specific table; when F = Blood Pressure, use the combined table]