LECTURE 2: DATA (PRE-)PROCESSING
- Dr. Dhaval Patel
CSE, IIT-Roorkee
In Previous Class,
We discussed various types of Data with examples.
In this Class,
We focus on Data pre-processing – "an important milestone of the Data Mining Process"
Mining is not the only step in the analysis process.
Pre-processing: real data is noisy, incomplete and inconsistent.
Data cleaning is required to make sense of the data.
Techniques: Sampling, Dimensionality Reduction, Feature Selection.
Post-processing: make the data actionable and useful to the user:
statistical analysis of importance & visualization.
(Pipeline: Data → Preprocessing → Data Mining → Result → Post-processing)
Topics:
Attribute Values and Attribute Transformation: Normalization (Standardization), Aggregation, Discretization
Sampling, Dimensionality Reduction, Feature subset selection
Distance/Similarity Calculation, Visualization
Data is described using attribute values
Attribute values are numbers or symbols assigned to
an attribute
Distinction between attributes and attribute values
Same attribute can be mapped to different attribute values
Example: height can be measured in feet or meters
Different attributes can be mapped to the same set of
values
Example: attribute values for ID and age are integers, but the properties of the attribute values can be different:
ID has no limit, but age has a maximum and minimum value.
There are different types of attributes
Nominal
Examples: ID numbers, eye color, zip codes
Ordinal
Examples: rankings (e.g., taste of potato chips on a scale
from 1-10), grades, height in {tall, medium, short}
Interval
Examples: calendar dates
Ratio
Examples: length, time, counts
Attribute Level | Allowed Transformation | Comments
Nominal | Any permutation of values | If all employee ID numbers were reassigned, would it make any difference?
Ordinal | An order-preserving change of values, i.e., new_value = f(old_value) where f is a monotonic function | An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3}.
Interval | new_value = a * old_value + b, where a and b are constants | Calendar dates can be converted – financial vs. Gregorian etc.
Ratio | new_value = a * old_value | Length can be measured in meters or feet.
Discrete Attribute: has only a finite or countably infinite set of values.
Examples: zip codes, counts, or the set of words in a collection of documents.
Often represented as integer variables.
Continuous Attribute: has real numbers as attribute values.
Examples: temperature, height, or weight.
Practically, real values can only be measured and represented using a finite number of digits.
Data is described using attribute values. Then, how good is our data w.r.t. these attribute values?
Examples of data quality problems:
Noise and outliers Missing values Duplicate data
Tid | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 10000K | Yes
6 | No | NULL | 60K | No
7 | Yes | Divorced | 220K | NULL
8 | No | Single | 85K | Yes
9 | No | Married | 90K | No
9 | No | Single | 90K | No
The 10000K income: a mistake or a millionaire? The NULL entries: missing values. Tid 9 appearing twice: inconsistent duplicate entries.
Noise refers to modification of original values
Examples: distortion of a person’s voice when talking on
a poor phone and “snow” on television screen
(Figure: two sine waves, the same two sine waves plus noise, and the corresponding frequency plot (FFT).)
Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set.
Reasons for missing values
Information is not collected
(e.g., people decline to give their age and weight)
Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
Handling missing values:
Eliminate data objects
Estimate missing values
Ignore the missing value during analysis
Replace with all possible values (weighted by their probabilities)
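A minimal pandas sketch of the first three options, using a tiny made-up DataFrame modeled on the Refund/Marital Status table above (column names and values are illustrative, not the lecture's dataset):

```python
import pandas as pd

# Made-up records mimicking the table above (NULL becomes None/NaN)
df = pd.DataFrame({
    "marital_status": ["Single", "Married", None, "Divorced"],
    "taxable_income": [125, 100, 60, 220],
})

dropped = df.dropna()                       # eliminate data objects with missing values
filled = df.fillna({"marital_status": df["marital_status"].mode()[0]})  # estimate (mode imputation)
mean_income = df["taxable_income"].mean()   # many analyses simply skip NaN values

print(dropped, filled, mean_income, sep="\n\n")
```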
Data set may include data objects that are duplicates, or almost duplicates, of one another.
Major issue when merging data from heterogeneous sources.
Examples:
Same person with multiple email addresses
Data cleaning
Process of dealing with duplicate data issues
SFU, CMPT 741, Fall 2009, Martin Ester
Binning: sort data and partition into (equi-depth) bins; smooth by bin means, bin medians, bin boundaries, etc.
Regression: smooth by fitting a regression function.
Clustering: detect and remove outliers.
Combined computer and human inspection: detect suspicious values automatically and have a human check them.
Equal-width binning:
Divides the range into N intervals of equal size.
Width of intervals: W = (max - min) / N.
Simple, but outliers may dominate the result.
Equal-depth binning:
Divides the range into N intervals, each containing approximately the same number of records.
Skewed data is also handled well.
Example: customer ages
Equi-width binning: 0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80
Equi-depth binning: 0-22, 22-31, 32-38, 38-44, 44-48, 48-55, 55-62, 62-80
Example: sorted price values 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into three (equi-depth) bins:
Bin 1: 4, 8, 9, 15; Bin 2: 21, 21, 24, 25; Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest integer):
Bin 1: 9, 9, 9, 9; Bin 2: 23, 23, 23, 23; Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
Bin 1: 4, 4, 4, 15; Bin 2: 21, 21, 25, 25; Bin 3: 26, 26, 26, 34
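A small plain-Python sketch of this equi-depth binning and smoothing (the bin size of 4 follows from splitting the 12 values into three bins):

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted

# Partition into three equi-depth bins of 4 values each
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]

# Smoothing by bin means: every value in a bin is replaced by the (rounded) bin mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closest bin boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(bins)       # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```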
Regression: replace noisy values or missing values by values predicted from a model fitted to the data.
Relies on attribute dependencies (maybe wrong!); usable both for smoothing and for handling missing data.
(Figure: fitted regression line y = x + 1; for a point with x = x1 and observed value y1, the smoothed/predicted value is y1'.)
Data has attribute values. Then, can we compare these attribute values?
For example, compare the following two pairs of records:
(1) (5.9 ft, 50 Kg) vs. (2) (4.6 ft, 55 Kg)
(3) (5.9 ft, 50 Kg) vs. (4) (5.6 ft, 56 Kg)
We need Data Transformation to make records with different dimensions (attributes) comparable…
Normalization: scale values to fall within a small, specified range.
min-max normalization
z-score normalization
normalization by decimal scaling
Centralization: based on fitting a distribution to the data; distance functions between distributions, e.g. KL distance; mean centering.
min-max normalization:
v' = (v - min) / (max - min) * (new_max - new_min) + new_min
z-score normalization:
v' = (v - mean) / std_dev
normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
Given the following records, normalize each attribute before comparing: (1) (5.9 ft, 50 Kg), (2) (4.6 ft, 55 Kg) vs. (3) (5.9 ft, 50 Kg), (4) (5.6 ft, 56 Kg)
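A minimal sketch of the three normalizations applied to these height/weight values (plain Python; the [0, 1] target range for min-max and the population standard deviation for z-score are assumptions):

```python
heights = [5.9, 4.6, 5.9, 5.6]   # ft
weights = [50, 55, 50, 56]       # Kg

def min_max(xs, new_min=0.0, new_max=1.0):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) * (new_max - new_min) + new_min for x in xs]

def z_score(xs):
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / std for x in xs]

def decimal_scaling(xs):
    j = len(str(int(max(abs(x) for x in xs))))   # digit count gives the smallest j with max(|v'|) < 1
    return [x / 10 ** j for x in xs]

print(min_max(heights), min_max(weights))
print(z_score(heights), z_score(weights))
print(decimal_scaling(heights), decimal_scaling(weights))
```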
Combining two or more attributes (or objects) into a
single attribute (or object)
Purpose
Data reduction
Reduce the number of attributes or objects
Change of scale
Cities aggregated into regions, states, countries, etc
More “stable” data
Aggregated data tends to have less variability
Motivation for Discretization
Some data mining algorithms only accept categorical
attributes
May improve understandability of patterns
Task
Reduce the number of values for a given continuous attribute
by partitioning the range of the attribute into intervals
Interval labels replace actual attribute values
Methods:
Equal-width (distance) partitioning:
Divides the range into N intervals of equal size (uniform grid).
If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A) / N.
The most straightforward, but outliers may dominate the presentation, and skewed data is not handled well.
Equal-depth (frequency) partitioning:
Divides the range into N intervals, each containing approximately the same number of samples.
Good data scaling, but managing categorical attributes can be tricky.
Given probabilities p1, p2, ..., ps whose sum is 1, Entropy is defined as:
Ent = - Σᵢ pᵢ log(pᵢ)
Entropy measures the amount of randomness or surprise or uncertainty.
Only non-zero probabilities are taken into account.
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
E(S, T) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)
The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g. the information gain Ent(S) - E(T, S) falls below a threshold.
Experiments show that it may reduce data size and improve classification accuracy.
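A compact sketch of one level of this entropy-based split; the numeric values and class labels below are made up purely for illustration:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Find the boundary T that minimizes the weighted entropy E(S, T)."""
    pairs = sorted(zip(values, labels))
    best_t, best_e = None, float("inf")
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2                  # candidate boundary
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

ages = [23, 25, 30, 35, 40, 45, 50, 55]            # toy attribute values
labels = ["N", "N", "N", "Y", "Y", "Y", "N", "N"]  # toy class labels
print(best_split(ages, labels))                    # recurse on each side until the gain is small
```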
Data may be Big. Then, can we make it small by selecting some part of it?
Data Sampling can do this…
“Sampling is the main technique employed for data selection.”
Big Data Sampled Data
Statisticians sample because obtaining the entire set of
data of interest is too expensive or time consuming.
Example: What is the average height of a person in
Ioannina?
We cannot measure the height of everybody
Sampling is used in data mining because processing the
entire set of data of interest is too expensive or time consuming.
Example: We have 1M documents. What fraction has at
least 100 words in common?
Computing number of common words for all pairs requires
10^12 comparisons
The key principle for effective sampling is the following:
Using a sample will work almost as well as using the entire
data sets, if the sample is representative
A sample is representative if it has approximately the same
property (of interest) as the original set of data
Otherwise we say that the sample introduces some bias.
What happens if we take a sample from the university campus to compute the average height of a person at Ioannina?
Simple Random Sampling: there is an equal probability of selecting any particular item.
Sampling without replacement: as each item is selected, it is removed from the population.
Sampling with replacement: objects are not removed from the population as they are selected for the sample.
In sampling with replacement, the same object can be picked up more than once.
Stratified sampling: split the data into several partitions; then draw random samples from each partition.
In sampling with replacement, the same object can be picked up more than once.
This makes analytical computation of probabilities easier.
E.g., we have 100 people, 51 are women (P(W) = 0.51) and 49 are men (P(M) = 0.49). If I pick two persons, what is the probability P(W,W) that both are women?
Sampling with replacement: P(W,W) = 0.51² ≈ 0.260
Sampling without replacement: P(W,W) = 51/100 * 50/99 ≈ 0.258
Stratified sampling: split the data into several groups; then draw random samples from each group.
Ensures that both groups are represented.
Example 1: I want to understand the differences between legitimate and fraudulent credit card transactions. 0.1% of transactions are fraudulent. What happens if I select 1000 transactions at random?
I get 1 fraudulent transaction (in expectation). Not enough to draw any conclusions.
Solution: sample 1000 legitimate and 1000 fraudulent transactions.
Example 2: I want to answer the question: do web pages that are linked have on average more words in common than those that are not? I have 1M pages and 1M links; what happens if I select 10K pairs of pages at random?
Most likely I will not get any links.
Solution: sample 10K random pairs, and 10K links.
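A hedged pandas sketch of stratified sampling for the fraud example; the DataFrame here is synthetic and the column name "is_fraud" is just illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in: 100,000 transactions, roughly 0.1% fraudulent
transactions = pd.DataFrame({
    "amount": rng.uniform(1, 500, 100_000),
    "is_fraud": rng.random(100_000) < 0.001,
})

def stratified_sample(df, strata_col, n_per_stratum, seed=0):
    """Draw up to n_per_stratum random rows from every stratum."""
    return (df.groupby(strata_col, group_keys=False)
              .apply(lambda g: g.sample(n=min(n_per_stratum, len(g)), random_state=seed)))

balanced = stratified_sample(transactions, "is_fraud", 1000)
print(balanced["is_fraud"].value_counts())   # about 1000 per class (fewer if a stratum is small)
```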
Probability Reminder: If an event has probability p of happening and I do N trials, the expected number of times the event occurs is pN
(Figure: the same 2-D data set sampled at 8000, 2000, and 500 points.)
What sample size is necessary to get at least one object from each group?
You have N integers and you want to sample one integer
uniformly at random. How do you do that?
The integers are coming in a stream: you do not know the
size of the stream in advance, and there is not enough memory to store the stream in memory. You can only keep a constant amount of integers in memory
How do you sample? Hint: if the stream ends after reading n integers the last integer in
the stream should have probability 1/n to be selected.
Reservoir Sampling: Standard interview question for many companies
array R[k];    // result (the reservoir)
integer i, j;
// fill the reservoir array
for each i in 1 to k do
    R[i] := S[i]
done;
for each i in k+1 to length(S) do
    j := random(1, i);
    if j <= k then R[j] := S[i] fi
done
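A runnable Python version of the same idea; the reservoir size k and the toy stream are arbitrary choices for illustration:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)        # fill the reservoir with the first k items
        else:
            j = random.randint(1, i)      # item i replaces a reservoir slot with probability k/i
            if j <= k:
                reservoir[j - 1] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))
```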
Do you know the "Fisher-Yates shuffle"?
S is an array with n numbers; a is also an array of size n.
a[0] ← S[0]
for i from 1 to n - 1 do
    r ← random(0 .. i)
    a[i] ← a[r]
    a[r] ← S[i]
Suppose we want to mine the comments/reviews of
people on Yelp and Foursquare.
Today there is an abundance of data online
Facebook, Twitter, Wikipedia, Web, etc…
We can extract interesting information from this data, but first we
need to collect it
Customized crawlers, use of public APIs Additional cleaning/processing to parse out the useful parts Respect of crawling etiquette
(Pipeline: Data Collection → Data → Preprocessing → Data Mining → Result → Post-processing)
Collect all reviews for the top-10 most reviewed
restaurants in NY in Yelp
(thanks to Sahishnu)
Find a few terms that best describe the restaurants. Algorithm?
I heard so many good things about this place so I was pretty juiced to try it. I'm from Cali and I heard Shake Shack is comparable to IN-N-OUT and I gotta say, Shake Shake wins hands down. Surprisingly, the line was short and we waited about 10
the view is breathtaking. Definitely one of my favorite places to eat in NYC.
I'm from California and I must say, Shake Shack is better than IN-N-OUT, all day, err'day.
Would I pay $15+ for a burger here? No. But for the price point they are asking for, this is a definite bang for your buck (though for some, the opportunity cost
the lunch swarm descended and I ordered a shake shack (the special burger with the patty + fried cheese & portabella topping) and a coffee milk shake. The beef patty was very juicy and snugly packed within a soft potato roll. On the downside, I could do without the fried portabella-thingy, as the crispy taste conflicted with the juicy, tender burger. How does shake shack compare with in-and-out or 5-guys? I say a very close tie, and I think it comes down to personal affliations. On the shake side, true to its name, the shake was well churned and very thick and
experience, or perhaps it was the food coma I was slowly falling into. Great place with food at a great price.
Do simple processing to “normalize” the data (remove punctuation,
make into lower case, clear white spaces, other?)
Break into words, keep the most popular words
the 27514 and 14508 i 13088 a 12152 to 10672
ramen 8518 was 8274 is 6835 it 6802 in 6402 for 6145 but 5254 that 4540 you 4366 with 4181 pork 4115 my 3841 this 3487 wait 3184 not 3016 we 2984 at 2980
the 16710 and 9139 a 8583 i 8415 to 7003 in 5363 it 4606
is 4340 burger 432 was 4070 for 3441 but 3284 shack 3278 shake 3172 that 3005 you 2985 my 2514 line 2389 this 2242 fries 2240
are 2142 with 2095 the 16010 and 9504 i 7966 to 6524 a 6370 it 5169
is 4519 sauce 4020 in 3951 this 3519 was 3453 for 3327 you 3220 that 2769 but 2590 food 2497
my 2311 cart 2236 chicken 2220 with 2195 rice 2049 so 1825 the 14241 and 8237 a 8182 i 7001 to 6727
you 4515 it 4308 is 4016 was 3791 pastrami 3748 in 3508 for 3424 sandwich 2928 that 2728 but 2715
this 2099 my 2064 with 2040 not 1655 your 1622 so 1610 have 1585
Most frequent words are stop words
Remove stop words
Stop-word lists can be found online.
a,about,above,after,again,against,all,am,an,and,any,are,aren't,as,at,be,be cause,been,before,being,below,between,both,but,by,can't,cannot,could,could n't,did,didn't,do,does,doesn't,doing,don't,down,during,each,few,for,from,f urther,had,hadn't,has,hasn't,have,haven't,having,he,he'd,he'll,he's,her,he re,here's,hers,herself,him,himself,his,how,how's,i,i'd,i'll,i'm,i've,if,in ,into,is,isn't,it,it's,its,itself,let's,me,more,most,mustn't,my,myself,no, nor,not,of,off,on,once,only,or,other,ought,our,ours,ourselves,out,over,own ,same,shan't,she,she'd,she'll,she's,should,shouldn't,so,some,such,than,tha t,that's,the,their,theirs,them,themselves,then,there,there's,these,they,th ey'd,they'll,they're,they've,this,those,through,to,too,under,until,up,very ,was,wasn't,we,we'd,we'll,we're,we've,were,weren't,what,what's,when,when's ,where,where's,which,while,who,who's,whom,why,why's,with,won't,would,would n't,you,you'd,you'll,you're,you've,your,yours,yourself,yourselves,
Remove stop words
Stop-word lists can be found online.
ramen 8572 pork 4152 wait 3195 good 2867 place 2361 noodles 2279 ippudo 2261 buns 2251 broth 2041 like 1902 just 1896 get 1641 time 1613
really 1437 go 1366 food 1296 bowl 1272 can 1256 great 1172 best 1167 burger 4340 shack 3291 shake 3221 line 2397 fries 2260 good 1920 burgers 1643 wait 1508 just 1412 cheese 1307 like 1204 food 1175 get 1162 place 1159
long 1013 go 995 time 951 park 887 can 860 best 849 sauce 4023 food 2507 cart 2239 chicken 2238 rice 2052 hot 1835 white 1782 line 1755 good 1629 lamb 1422 halal 1343 just 1338 get 1332
like 1096 place 1052 go 965 can 878 night 832 time 794 long 792 people 790 pastrami 3782 sandwich 2934 place 1480 good 1341 get 1251 katz's 1223 just 1214 like 1207 meat 1168
deli 984 best 965 go 961 ticket 955 food 896 sandwiches 813 can 812 beef 768
pickles 699 time 662
Commonly used words in reviews, not so interesting
Important words are the ones that are unique to the document
(differentiating) compared to the rest of the collection
All reviews use the word “like”; this is not interesting. We want the words that characterize the specific restaurant.
Document Frequency DF(x): fraction of documents that contain word x.
DF(x) = D(x) / D
where D(x) is the number of documents that contain word x and D is the total number of documents.
Inverse Document Frequency IDF(x):
IDF(x) = log(1 / DF(x)) = log(D / D(x))
Maximum when the word is unique to one document: IDF(x) = log(D).
Minimum when the word is common to all documents: IDF(x) = 0.
The words that are best for describing a document are the ones with the highest TF-IDF values for the document.
TF(w,d): term frequency of word w in document d.
Number of times that the word appears in the document; a natural measure of the importance of the word for the document.
IDF(w): inverse document frequency; a natural measure of the uniqueness of the word w.
TF-IDF(w,d) = TF(w,d) * IDF(w)
Ordered by TF-IDF
ramen 3057.41761944282 7 akamaru 2353.24196503991 1 noodles 1579.68242449612 5 broth 1414.71339552285 5 miso 1252.60629058876 1 hirata 709.196208642166 1 hakata 591.76436889947 1 shiromaru 587.1591987134 1 noodle 581.844614740089 4 tonkotsu 529.594571388631 1 ippudo 504.527569521429 8 buns 502.296134008287 8 ippudo's 453.609263319827 1 modern 394.839162940177 7 egg 367.368005696771 5 shoyu 352.295519228089 1 chashu 347.690349042101 1 karaka 336.177423577131 1 kakuni 276.310211159286 1 ramens 262.494700601321 1 bun 236.512263803654 6 wasabi 232.366751234906 3 dama 221.048168927428 1 brulee 201.179739054263 2 fries 806.085373301536 7 custard 729.607519421517 3 shakes 628.473803858139 3 shroom 515.779060830666 1 burger 457.264637954966 9 crinkle 398.34722108797 1 burgers 366.624854809247 8 madison 350.939350307801 4 shackburger 292.428306810 1 'shroom 287.823136624256 1 portobello 239.8062489526 2 custards 211.837828555452 1 concrete 195.169925889195 4 bun 186.962178298353 6 milkshakes 174.9964670675 1 concretes 165.786126695571 1 portabello 163.4835416025 1 shack's 159.334353330976 2 patty 152.226035882265 6 ss 149.668031044613 1 patties 148.068287943937 2 cam 105.949606780682 3 milkshake 103.9720770839 5 lamps 99.011158998744 1 lamb 985.655290756243 5 halal 686.038812717726 6 53rd 375.685771863491 5 gyro 305.809092298788 3 pita 304.984759446376 5 cart 235.902194557873 9 platter 139.459903080044 7 chicken/lamb 135.8525204 1 carts 120.274374158359 8 hilton 84.2987473324223 4 lamb/chicken 82.8930633 1 yogurt 70.0078652365545 5 52nd 67.5963923222322 2 6th 60.7930175345658 9 4am 55.4517744447956 5 yellow 54.4470265206673 8 tzatziki 52.9594571388631 1 lettuce 51.3230168022683 8 sammy's 50.656872045869 1 sw 50.5668577816893 3 platters 49.9065970003161 5 falafel 49.4796995212044 4 sober 49.2211422635451 7 moma 48.1589121730374 3 pastrami 1931.94250908298 6 katz's 1120.62356508209 4 rye 1004.28925735888 2 corned 906.113544700399 2 pickles 640.487221580035 4 reuben 515.779060830666 1 matzo 430.583412389887 1 sally 428.110484707471 2 harry 226.323810772916 4 mustard 216.079238853014 6 cutter 209.535243462458 1 carnegie 198.655512713779 3 katz 194.387844446609 7 knish 184.206807439524 1 sandwiches 181.415707218 8 brisket 131.945865389878 4 fries 131.613054313392 7 salami 127.621117258549 3 knishes 124.339595021678 1 delicatessen 117.488967607 2 deli's 117.431839742696 1 carver 115.129254649702 1 brown's 109.441778045519 2 matzoh 108.22149937072 1
TF-IDF takes care of stop words as well: we do not need to remove the stop words since they will get IDF(w) = 0.
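A small sketch of the TF-IDF computation described above, on three made-up one-line "reviews" (the documents are illustrative stand-ins, not the Yelp data):

```python
import math
from collections import Counter

docs = [
    "the ramen broth was rich and the noodles were great",
    "the burger and the fries at the shack were great",
    "the pastrami sandwich at the deli was great",
]
tokens = [d.split() for d in docs]
D = len(docs)

def idf(word):
    d_x = sum(1 for t in tokens if word in t)   # number of documents containing the word
    return math.log(D / d_x)                    # IDF(x) = log(1 / DF(x)) = log(D / D(x))

def tf_idf(doc_tokens):
    tf = Counter(doc_tokens)                    # TF(w, d): raw counts in the document
    return {w: tf[w] * idf(w) for w in tf}

for scores in map(tf_idf, tokens):
    print(sorted(scores.items(), key=lambda kv: -kv[1])[:3])
# Stop words such as "the" occur in every document, get IDF = 0 and drop to the bottom.
```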
When mining real data you often need to make some decisions:
What data should we collect? How much? For how long?
Should we throw out some data that does not seem to be useful? Too frequent data (stop words), too infrequent (errors?), erroneous data, missing data, outliers.
How should we weight the different pieces of data?
Most decisions are application dependent. Some information may be lost, but we can usually live with it (most of the time).
Dealing with real data is hard…
AAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAA AAA
An actual review
Each record has many attributes: useful, useless, or correlated.
Then, can we select some small subset of attributes?
Dimensionality Reduction can do this…
Why?
When dimensionality increases, data becomes increasingly
sparse in the space that it occupies
Curse of Dimensionality: definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful.
Objectives:
Avoid the curse of dimensionality.
Reduce the amount of time and memory required by data mining algorithms.
Observation: certain dimensions are correlated.
Allow data to be more easily visualized.
May help to eliminate irrelevant features or reduce noise.
Techniques:
Principal Component Analysis or Singular Value Decomposition (mapping data to a new space)
Wavelet Transform
Others: supervised and non-linear techniques
Goal is to find a projection that captures the largest
amount of variation in data
Find the eigenvectors of the covariance matrix; the eigenvectors define the new space.
(Figure: 2-D data with the principal eigenvector e drawn against axes x1 and x2.)
Eigenvectors show the directions of the axes of a fitted ellipsoid.
Eigenvalues show the significance of the corresponding axis.
The larger the eigenvalue, the more separation between the mapped data.
For high dimensional data, often only the first few eigenvectors (those with the largest eigenvalues) are significant.
PCA (Principal Component Analysis) is defined as an orthogonal linear transformation that maps the data to a new coordinate system such that the greatest variance comes to lie on the first coordinate, the second greatest variance on the second coordinate, and so on.
Each coordinate in Principal Component Analysis is called a Principal Component.
Ci = bi1 x1 + bi2 x2 + … + bin xn
where Ci is the i-th principal component, bij is the regression coefficient for observed variable j for principal component i, and the xj are the variables/dimensions.
Variance and Covariance; Eigenvectors and Eigenvalues; Principal Component Analysis; Application of PCA in Image Processing
The variance is a measure of how far a set of
numbers is spread out.
The equation of variance is:
var(X) = ( Σᵢ₌₁ⁿ (xᵢ - x̄)² ) / (n - 1)
Covariance is a measure of how much two random variables change together.
The equation of covariance is:
cov(X, Y) = ( Σᵢ₌₁ⁿ (xᵢ - x̄)(yᵢ - ȳ) ) / (n - 1)
The covariance matrix is an n*n matrix where each element can be defined as C(i, j) = cov(Dimᵢ, Dimⱼ).
A covariance matrix over a 2-dimensional dataset is:
C = [ cov(x, x)  cov(x, y) ]
    [ cov(y, x)  cov(y, y) ]
The eigenvectors of a square matrix A are the non-zero vectors x that, after being multiplied by the matrix, remain parallel to the original vector.
For each Eigenvector, the corresponding Eigenvalue is the
factor by which the eigenvector is scaled when multiplied by the matrix.
The vector x is an eigenvector of the matrix A with eigenvalue λ (lambda) if the following equation holds:
A x = λ x
Calculating eigenvalues: solve det(A - λI) = 0.
Calculating eigenvectors: for each eigenvalue λ, solve (A - λI) x = 0.
It turns out that the eigenvectors of the covariance matrix of the data set are the principal components.
The eigenvector with the highest eigenvalue is the first principal component, the one with the 2nd highest eigenvalue is the second principal component, and so on.
1. Adjust the dataset to a zero-mean dataset.
2. Find the covariance matrix M.
3. Calculate the normalized eigenvectors and eigenvalues of M.
4. Sort the eigenvectors according to eigenvalues, from highest to lowest.
5. Form the feature vector F using the transpose of the eigenvectors.
6. Multiply the transposed dataset with F.
Original Data (X, Y):
(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0), (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)
Adjusted Dataset (mean-subtracted; mean = (1.81, 1.91)):
(0.69, 0.49), (-1.31, -1.21), (0.39, 0.99), (0.09, 0.29), (1.29, 1.09), (0.49, 0.79), (0.19, -0.31), (-0.81, -0.81), (-0.31, -0.31), (-0.71, -1.01)
AdjustedDataSet = OriginalDataSet - Mean
The eigenvalues of matrix M are:
eigenvalues = (0.0490833989, 1.28402771)
The normalized eigenvectors with the corresponding eigenvalues are (as columns):
eigenvectors = [ -0.735178656  -0.677873399 ]
               [  0.677873399  -0.735178656 ]
Sorted eigenvectors (highest eigenvalue first) and feature vector:
eigenvectors = [ -0.677873399  -0.735178656 ]
               [ -0.735178656   0.677873399 ]
F = (sorted eigenvectors)ᵀ
FinalData = F x AdjustedDataSetᵀ
(Table: the adjusted data expressed in the principal-component coordinate system; e.g. the second point maps to (1.77758033, 0.142857227) and the ninth to (0.438046137, 0.0177646297).)
FinalData = F x AdjustedDataSetᵀ, keeping only the first principal component.
(Table: the same data projected onto the first principal component only; the surviving values 1.77758033, 0.0991094375, 1.14457216, 0.438046137 and 1.22382056 correspond to the 2nd, 7th, 8th, 9th and 10th points.)
FinalData = F x AdjustedDataSetᵀ
AdjustedDataSetᵀ = F⁻¹ x FinalData; but F⁻¹ = Fᵀ, so AdjustedDataSetᵀ = Fᵀ x FinalData
OriginalDataSet = AdjustedDataSet + Mean
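A numpy sketch of the six steps on the worked example above; it should reproduce the eigenvalues (about 0.0491 and 1.2840), though the eigenvector signs may differ, which only flips the axes:

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

mean = X.mean(axis=0)
adjusted = X - mean                          # 1. zero-mean dataset
M = np.cov(adjusted, rowvar=False)           # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(M)         # 3. normalized eigenvectors and eigenvalues
order = np.argsort(eigvals)[::-1]            # 4. sort by eigenvalue, highest first
F = eigvecs[:, order].T                      # 5. feature vector = transposed eigenvectors
final = F @ adjusted.T                       # 6. FinalData = F x AdjustedDataSet^T

print(eigvals[order])                        # ~[1.2840, 0.0491]
restored = (F.T @ final).T + mean            # inverse transform recovers the original data
print(np.allclose(restored, X))              # True
```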
http://www.cs.mcgill.ca/~sqrt/dimr/dimreduction.html
(Figure: an image reconstructed using all PCs vs. using only 2 PCs.)
Using all principal components (an n x n transformation matrix Z):
[ z11 z12 .. z1n ]   [ x1 ]   [ x'1 ]
[ z21 z22 .. z2n ] . [ x2 ] = [ x'2 ]
[ ..             ]   [ .. ]   [ ..  ]
[ zn1 zn2 .. znn ]   [ xn ]   [ x'n ]
Using only the first 2 principal components (a 2 x n matrix):
[ z11 z12 .. z1n ] . [ x1 .. xn ]ᵀ = [ x'1 ]
[ z21 z22 .. z2n ]                   [ x'2 ]
Decomposes a signal into
different frequency subbands
Applicable to n-dimensional
signals
Data are transformed to
preserve relative distance between objects at different levels of resolution
Allow natural clusters to become
more distinguishable
Used for image compression
Discrete wavelet transform (DWT) for linear signal processing,
multi-resolution analysis
Compressed approximation: store only a small fraction of the
strongest of the wavelet coefficients
Similar to discrete Fourier transform (DFT), but better lossy
compression, localized in space
Method: the length L must be an integer power of 2 (padding with 0's when necessary).
Each transform has 2 functions: smoothing and difference.
They are applied to pairs of data, resulting in two sets of data of length L/2.
The two functions are applied recursively until the desired length is reached.
(Figure: Haar-2 and Daubechies-4 wavelet families.)
Wavelets: A math tool for space-efficient hierarchical
decomposition of functions
S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to S^ = [2¾, -1¼, ½, 0, 0, -1, -1, 0]
Compression: many small detail coefficients can be replaced
by 0’s, and only the significant coefficients are retained
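A tiny sketch of the Haar averaging-and-differencing transform on the example above (plain Python; output ordering is overall average first, then detail coefficients from coarsest to finest):

```python
def haar_dwt(s):
    """One level of averaging and differencing, applied recursively to the averages."""
    s = list(s)
    details = []
    while len(s) > 1:
        avg = [(a + b) / 2 for a, b in zip(s[0::2], s[1::2])]
        diff = [(a - b) / 2 for a, b in zip(s[0::2], s[1::2])]
        details = diff + details        # finer details end up at the back
        s = avg
    return s + details

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]  i.e. [2¾, -1¼, ½, 0, 0, -1, -1, 0]
```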
Another way to reduce dimensionality of data Redundant features
duplicate much or all of the information contained in one or
more other attributes
Example: purchase price of a product and the amount of
sales tax paid
Irrelevant features
contain no information that is useful for the data mining task
at hand
Example: students' ID is often irrelevant to the task of
predicting students' GPA
Abhinna Agarwal, M.Tech. (CSE), guided by …
1-2 openings for M.Tech. Dissertation in the Area of Feature Subset Selection
So far, our trajectory on Data Preprocessing is as follows.
Data has many records. Then, can we find similar records?
Distance and Similarity measures are commonly used…
(Example: objects compared on Shape, Colour, Size, Pattern.)
Similarity: numerical measure of how alike two data objects are. Higher when objects are more alike. Often falls in the range [0, 1].
Dissimilarity: numerical measure of how different two data objects are. Lower when objects are more alike. Minimum dissimilarity is often 0; the upper limit varies.
Proximity refers to a similarity or dissimilarity.
Euclidean Distance:
dist(p, q) = sqrt( Σₖ₌₁ⁿ (pₖ - qₖ)² )
where n is the number of dimensions (attributes) and pₖ and qₖ are, respectively, the kth attributes (components) of data objects p and q.
Standardization is necessary if scales differ.
David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Point 1 is (x1, x2, …, xn); Point 2 is (y1, y2, …, yn).
Euclidean distance:
d = sqrt( (y1 - x1)² + (y2 - x2)² + … + (yn - xn)² )
(Figure: four points p1..p4 plotted in 2-D.)
point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1
Distance Matrix (Euclidean):
        p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0
Minkowski Distance is a generalization of Euclidean Distance:
dist(p, q) = ( Σₖ₌₁ⁿ |pₖ - qₖ|ʳ )^(1/r)
where r is a parameter, n is the number of dimensions (attributes) and pₖ and qₖ are, respectively, the kth attributes (components) of data objects p and q.
r = 1: City block (Manhattan, taxicab, L1 norm) distance.
A common example of this is the Hamming distance, which is just the number of bits that differ between two binary vectors.
r = 2: Euclidean distance (L2 norm).
r → ∞: "supremum" (Lmax norm, L∞ norm) distance.
This is the maximum difference between any component of the vectors.
Example: the L∞ distance between (1, 0, 2) and (6, 0, 3) = ??
Do not confuse r with n, i.e., all these distances are defined for
all numbers of dimensions.
Manhattan distance (aka city-block distance):
Point 1 is (x1, x2, …, xn); Point 2 is (y1, y2, …, yn).
d = |x1 - y1| + |x2 - y2| + … + |xn - yn|
(In case you don't know: |x| is the absolute value of x.)
Chebychev distance:
Point 1 is (x1, x2, …, xn); Point 2 is (y1, y2, …, yn).
d = max{ |x1 - y1|, |x2 - y2|, …, |xn - yn| }
Distance Matrix for the points p1 = (0, 2), p2 = (2, 0), p3 = (3, 1), p4 = (5, 1):
L1      p1   p2   p3   p4
p1      0    4    4    6
p2      4    0    2    4
p3      4    2    0    2
p4      6    4    2    0
L2      p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0
L∞      p1   p2   p3   p4
p1      0    2    3    5
p2      2    0    1    3
p3      3    1    0    2
p4      5    3    2    0
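A numpy check of the three distance matrices above, using the same four points:

```python
import numpy as np

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])    # p1, p2, p3, p4
diff = points[:, None, :] - points[None, :, :]          # all pairwise differences

L1 = np.abs(diff).sum(axis=2)                            # Manhattan / city-block
L2 = np.sqrt((diff ** 2).sum(axis=2))                    # Euclidean
Linf = np.abs(diff).max(axis=2)                          # supremum / Chebychev

print(L1)
print(np.round(L2, 3))
print(Linf)
```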
Each variable contributes independently to the
measure of distance.
May not always be appropriate… e.g., think of a nearest neighbor classifier.
(Example: comparing objects i and j on height and diameter is reasonable, but comparing them on height plus many duplicated height attributes, height2(i) … height100(i), lets the duplicated variable dominate the distance.)
Data Mining Lectures Lecture 2: Data Measurement Padhraic Smyth, UC Irvine
Covariance and correlation measure linear dependence
(distance between variables, not objects)
Assume we have two variables or attributes, X and Y, and n objects.
The sample covariance of X and Y is:
Cov(X, Y) = (1/n) Σᵢ₌₁ⁿ (x(i) - x̄)(y(i) - ȳ)
The covariance is a measure of how X and Y vary together: it will be large and positive if large values of X are associated with large values of Y, and small X with small Y.
Covariance depends on the ranges of X and Y.
Standardize by dividing by the standard deviations.
The linear correlation coefficient is defined as:
ρ(X, Y) = Σᵢ₌₁ⁿ (x(i) - x̄)(y(i) - ȳ) / sqrt( Σᵢ₌₁ⁿ (x(i) - x̄)² · Σᵢ₌₁ⁿ (y(i) - ȳ)² )
(Figure: scatterplot matrix of housing data characteristics: business acreage, nitrous oxide, percentage of large residential lots, average # rooms, median house value.)
Mahalanobis distance:
mahal(x, y) = (x - y)ᵀ Σ⁻¹ (x - y)
where Σ⁻¹ is the inverse covariance matrix, (x - y) is the vector difference in p-dimensional space, and the expression evaluates to a scalar distance.
Benefits:
1. It automatically accounts for the scaling of the coordinate axes.
2. It corrects for correlation between the different features.
Cost:
1. The covariance matrices can be hard to determine accurately.
2. The memory and time requirements grow quadratically, O(p²), rather than linearly with the number of features.
If the covariance matrix is diagonal and isotropic (equal variance in all dimensions), the Mahalanobis distance reduces to the Euclidean distance.
If the covariance matrix is diagonal but non-isotropic (unequal variances), the Mahalanobis distance reduces to a weighted Euclidean distance with weights = inverse variance.
Two outer blue points will have same MH distance to the center blue point
(Figure: scatterplot of Y vs. X. What are the covariance and correlation of (X, Y)? Are X and Y dependent?)
mahalanobis(x, y) = (x - y)ᵀ Σ⁻¹ (x - y), where Σ is the covariance matrix of the input data X:
Σⱼ,ₖ = (1 / (n - 1)) Σᵢ₌₁ⁿ (Xᵢⱼ - X̄ⱼ)(Xᵢₖ - X̄ₖ)
For the red points, the Euclidean distance is 14.7, the Mahalanobis distance is 6.
Covariance Matrix:
Σ = [ 0.3  0.2 ]
    [ 0.2  0.3 ]
A = (0.5, 0.5), B = (0, 1), C = (1.5, 1.5)
Mahal(A, B) = 5
Mahal(A, C) = 4
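A numpy check of this example; note that the Mahal values quoted here are the quadratic form (x - y)ᵀ Σ⁻¹ (x - y), i.e. the squared distance:

```python
import numpy as np

cov = np.array([[0.3, 0.2],
                [0.2, 0.3]])
inv_cov = np.linalg.inv(cov)

def mahal_sq(x, y):
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return d @ inv_cov @ d               # (x - y)^T * Sigma^-1 * (x - y)

A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)
print(mahal_sq(A, B))   # ~5.0
print(mahal_sq(A, C))   # ~4.0
```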
Proportion different (for categorical vectors):
Point 1 is (x1, x2, …, xn); Point 2 is (y1, y2, …, yn).
For each field f: d_f = 1 if x_f ≠ y_f, else d_f = 0.
Proportion different = (d1 + d2 + … + dn) / n
Example: (red, male, big, hot) vs. (green, male, small, hot) differ in 2 of 4 fields, so the proportion different is 0.5.
Jaccard coefficient:
Point 1 is a set A = {bread, cheese, milk, nappies}; Point 2 is a set B = {batteries, cheese}.
Jaccard coefficient = |A ∩ B| / |A ∪ B|: the number of things that appear in both (1: cheese), divided by the total number of different things (5), i.e. 1/5 = 0.2.
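A one-function sketch of the Jaccard coefficient for the two baskets above:

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)       # shared items / all distinct items

basket1 = {"bread", "cheese", "milk", "nappies"}
basket2 = {"batteries", "cheese"}
print(jaccard(basket1, basket2))         # 1 shared item out of 5 distinct -> 0.2
```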
Data vectors are: (colour, manufacturer, top-speed) e.g.: (red, ford, 180) (yellow, toyota, 160) (silver, bugatti, 300) What distance measure will you use?
Data vectors are : (colour, manufacturer, top-speed) e.g.: (dark, ford, high) (medium, toyota, high) (light, bugatti, very-high) What distance measure will you use?
With different types of fields, e.g. p1 = (red, high, 0.5, UK, 12) and p2 = (blue, high, 0.6, France, 15), you could simply define a distance measure for each field individually and add them up.
Similarly, you could divide the vectors into ordinal and numeric parts:
p1a = (red, high, UK), p1b = (0.5, 12)
p2a = (blue, high, France), p2b = (0.6, 15)
and say that dist(p1, p2) = dist(p1a, p2a) + dist(p1b, p2b), using appropriate measures for the two kinds of vector.
Suppose one field varies hugely (standard deviation is 100), and another field varies a tiny amount (standard deviation 0.001) – why is Euclidean distance a bad idea? What can you do?
What is the distance between these two? “Star Trek: Voyager” vs. “Satr Trek: Voyagger”
Normalising fields individually is often a good idea – when a numerical field is normalised, you scale it so that the mean is 0 and the standard deviation is 1.
Edit distance is useful in many applications: see http://www.merriampark.com/ld.htm
If d1 and d2 are two document vectors, then
cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
where • indicates the vector dot product and ||d|| is the length of vector d.
Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 0.3150; distance = 1 - cos(d1, d2) = 0.685
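A numpy check of the cosine example above:

```python
import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))        # 0.315
print(round(1 - cos, 4))    # cosine distance, about 0.685
```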
Nominal variables: a generalization of the binary variable in that they can take more than 2 states, e.g., red, yellow, blue, green.
Method 1: simple matching
d(i, j) = (p - m) / p, where m is the number of matches and p is the total number of variables.
Method 2: use a large number of binary variables
Create a new binary variable for each of the M nominal states.
An ordinal variable can be discrete or continuous; order is important, e.g., rank.
Can be treated like interval-scaled:
replace x_if by its rank r_if ∈ {1, …, M_f}
map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by z_if = (r_if - 1) / (M_f - 1)
compute the dissimilarity using methods for interval-scaled variables.
Distances, such as the Euclidean distance, have some well-known properties:
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)
where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.
A distance that satisfies these properties is a metric.
Similarities also have some well-known properties:
1. s(p, q) = 1 (or maximum similarity) only if p = q.
2. s(p, q) = s(q, p) for all p and q. (Symmetry)
where s(p, q) is the similarity between points (data objects) p and q.