Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule
Stephen D. Bay (1) and Mark Schwabacher (2)
(1) Institute for the Study of Learning and Expertise
sbay@apres.stanford.edu
(2) NASA Ames Research Center
[Figure: an example x at the center of a sphere of radius d, with the examples that fall in the sphere.]
Distance measure determines proximity and scaling.
– Knorr & Ng
– Ramaswamy, Rastogi, & Shim
– Angiulli & Pizzuti, Eskin et al.
These definitions all relate to
– For each example, find its nearest neighbors with a sequential scan
– For each example, find its nearest neighbors with an index tree
– For each example, find its nearest neighbors given that the examples are stored in bins (e.g., cells, clusters)
for more than 5 dimensions (Knorr & Ng)
– For each example, find its nearest neighbors with a sequential scan
– Randomize order of examples
– While performing the sequential scan,
the example cannot be a top outlier
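The randomized scan with pruning can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes Euclidean distance, scores each example by the distance to its kth nearest neighbor, and processes one example at a time. The name `top_outliers` and its parameters are hypothetical.

```python
import heapq
import math
import random


def top_outliers(points, k, n, seed=0):
    """Return the n examples with the largest k-th nearest neighbor distance."""
    data = list(points)
    if len(data) <= k:
        raise ValueError("need more than k examples")
    random.Random(seed).shuffle(data)  # randomize order of examples
    cutoff = 0.0  # score of the weakest top outlier found so far
    top = []      # min-heap of (score, index), at most n entries
    for i, x in enumerate(data):
        nearest = []  # negated distances: a max-heap of the k closest so far
        pruned = False
        for j, y in enumerate(data):  # sequential scan
            if i == j:
                continue
            dist = math.dist(x, y)
            if len(nearest) < k:
                heapq.heappush(nearest, -dist)
            elif dist < -nearest[0]:
                heapq.heapreplace(nearest, -dist)
            # Pruning rule: the running k-th nearest distance only shrinks,
            # so once it drops below the cutoff x cannot be a top outlier.
            if len(nearest) == k and -nearest[0] < cutoff:
                pruned = True
                break
        if pruned:
            continue
        heapq.heappush(top, (-nearest[0], i))
        if len(top) > n:
            heapq.heappop(top)  # drop the weakest candidate
        if len(top) == n:
            cutoff = top[0][0]
    return [(score, data[i]) for score, i in sorted(top, reverse=True)]
```

In this sketch the cutoff stays at zero until n candidates have been found, so no example is pruned before the top-n list is full.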
39 State-gov 77516 Bachelors 13
50 Self-emp-not-inc 83311 Bachelors 13
38 Private 215646 HS-grad 9
53 Private 234721 11th 7
28 Private 338409 Bachelors 13
37 Private 284582 Masters 14
49 Private 160187 9th 5
52 Self-emp-not-inc 209642 HS-grad 9
31 Private 45781 Masters 14
42 Private 159449 Bachelors 13
37 Private 280464 Some-college 10
30 State-gov 141297 Bachelors 13
23 Private 122272 Bachelors 13
32 Private 205019 Assoc-acdm 12
40 Private 121772 Assoc-voc 11
34 Private 245487 7th-8th 4
25 Self-emp-not-inc 176756 HS-grad 9
32 Private 186824 HS-grad 9
38 Private 28887 11th 7
43 Self-emp-not-inc 292175 Masters 14
40 Private 193524 Doctorate 16
54 Private 302146 HS-grad 9
35 Federal-gov 76845 9th 5
43 Private 117037 11th 7
59 Private 109015 HS-grad 9
56 Local-gov 216851 Bachelors 13
19 Private 168294 HS-grad 9
54 ? 180211 Some-college 10
39 Private 367260 HS-grad 9
Sequential scan: d is the distance to the 3rd nearest neighbor for the weakest top outlier
– Time does not include randomization
[Figure: four log-log plots of Total Time versus Size, one each for Corel Histogram, KDDCup 1999, Person, and Normal 30D.]
b = slope of regression fit relating log time to log N
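As an illustration of how such a slope can be obtained, the sketch below fits an ordinary least-squares line to (log N, log time) pairs. The function name `loglog_slope` is hypothetical, and the timings in the usage lines are synthetic values generated from a known exponent, not measurements from these experiments.

```python
import math


def loglog_slope(sizes, times):
    """Least-squares slope b of log10(time) against log10(N)."""
    xs = [math.log10(n) for n in sizes]
    ys = [math.log10(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic timings that grow as N**1.5 (made-up data for illustration only).
sizes = [1e3, 1e4, 1e5, 1e6]
times = [n ** 1.5 for n in sizes]
b = loglog_slope(sizes, times)
```

A slope near 1 indicates near linear scaling in N, while a slope near 2 indicates quadratic scaling.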
[Figure: two plots of Total Time versus K, one for Person and one for Normal 30D.]
1 million records used for both Person and Normal 30D
– Outliers defined by distance to kth neighbor
– Current cutoff distance is d
– Randomization + sequential scan = I.I.D. sampling of pdf
Let p(x) = probability that a randomly drawn example lies within distance d of x, integrating over the sphere of radius d around x:
$p(x) = \int \mathrm{pdf}(x)\, dV$
How many examples do we need to look at?
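For non-outliers, the number of samples follows a negative binomial distribution. Let P(Y = y) be the probability of obtaining the kth success on step y:
$P(Y = y) = \binom{y-1}{k-1}\, p(x)^{k}\, (1 - p(x))^{y-k}$
The expected number of samples with infinite data is
$E[Y] = \sum_{y=k}^{\infty} y\, P(Y = y) = \frac{k}{p(x)}$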
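As a numerical check on the negative binomial model, the sketch below sums y · P(Y = y) up to a large truncation point and compares the result with k / p, the standard closed-form mean of this distribution. The function names and the values of k and p are illustrative assumptions, not figures from the experiments.

```python
import math


def neg_binomial_pmf(y, k, p):
    """Probability that the k-th success occurs exactly on step y."""
    return math.comb(y - 1, k - 1) * p ** k * (1 - p) ** (y - k)


def expected_samples(k, p, y_max):
    """Truncated expectation: sum of y * P(Y = y) for y = k .. y_max."""
    return sum(y * neg_binomial_pmf(y, k, p) for y in range(k, y_max + 1))


k, p = 5, 0.1  # illustrative values only
approx = expected_samples(k, p, y_max=2000)
exact = k / p  # closed-form mean of the negative binomial
```

Because the expectation depends on k and p(x) but not on N, a non-outlier needs roughly the same number of distance computations however large the data set is, provided the cutoff d stays fixed.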
[Figure: Person. Cutoff versus Percent of Data Set Processed, for data set sizes 50K, 100K, 1M, and 5M.]
[Figure: Polynomial scaling b versus relative change in cutoff (50K/5K) as N increases, for Uniform 3D, Household, Covertype, Person, Corel Histogram, Normal 30D, KDDCup, and Mixed 3D.]
[Figure: log-log plot of Total Time versus Size for Uniform 3D.]
Examples drawn from a uniform distribution in 3 dimensions
b = 1.76
[Figure: log-log plot of Total Time versus Size for Mixed 3D.]
Examples drawn from 99% uniform, 1% Gaussian distribution
b = 1.11
[Figure: two plots of Correspondence versus Size of reference set, one for Normal30d-k100a500 and one for Person-k30a30.]
It depends…