Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule (PowerPoint PPT Presentation)


SLIDE 1

Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule

Stephen D. Bay (1) and Mark Schwabacher (2)

(1) Institute for the Study of Learning and Expertise, sbay@apres.stanford.edu

(2) NASA Ames Research Center, Mark.A.Schwabacher@nasa.gov

SLIDE 2

Motivation

Detecting outliers or anomalies is an important KDD task with many practical applications, and fast algorithms are needed for large databases. In this talk, I will:

  – Show that very simple modifications of a basic algorithm lead to extremely good performance
  – Explain why this approach works well
  – Discuss limitations of this approach

SLIDE 3

Distance-Based Outliers

  • The main idea is to find points in low-density regions of the feature space
  • The density near x can be estimated as P(x) ≅ k / (N·V), where
    – V is the total volume within radius d
    – N is the total number of examples
    – k is the number of examples in the sphere
  • The distance measure determines proximity and scaling.
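The density estimate above can be written out as a short sketch. This is an illustrative helper (not from the talk), assuming a Euclidean distance and using the closed-form volume of a ball in `dims` dimensions:

```python
import math

def density_estimate(k, d, N, dims):
    """Estimate P(x) ≅ k / (N * V): k neighbors within radius d
    among N examples, where V is the volume of a `dims`-dimensional
    ball of radius d."""
    V = math.pi ** (dims / 2) / math.gamma(dims / 2 + 1) * d ** dims
    return k / (N * V)

# Example: 5 neighbors within radius 2.0 among 10,000 points in 3-D.
p = density_estimate(k=5, d=2.0, N=10_000, dims=3)
```

A small P(x) (few neighbors inside a large volume) marks a low-density region, which is exactly what the outlier definitions on the next slide formalize.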

SLIDE 4

Outlier Definitions

  • Outliers are the examples for which there are fewer than p other examples within distance d
    – Knorr & Ng
  • Outliers are the top n examples whose distance to the kth nearest neighbor is greatest
    – Ramaswamy, Rastogi, & Shim
  • Outliers are the top n examples whose average distance to the k nearest neighbors is greatest
    – Angiulli & Pizzuti; Eskin et al.

These definitions all relate to P(x) ≅ k / (N·V).

SLIDE 5

Existing Methods

  • Nested Loops (NL)
    – For each example, find its nearest neighbors with a sequential scan
    – O(N²)
  • Index Trees
    – For each example, find its nearest neighbors with an index tree
    – Potentially N log N; in practice can be worse than NL
  • Partitioning Methods
    – For each example, find its nearest neighbors given that the examples are stored in bins (e.g., cells, clusters)
    – Cell-based methods are potentially N, but in practice worse than NL for more than 5 dimensions (Knorr & Ng)
    – Cluster-based methods appear sub-quadratic
SLIDE 6

Our Algorithm

  • Based on nested loops
    – For each example, find its nearest neighbors with a sequential scan
  • Two modifications
    – Randomize the order of examples
      • Can be done with a disk-based algorithm in linear time
    – While performing the sequential scan:
      • Keep track of the closest neighbors found so far
      • Prune an example once the neighbors found so far indicate that it cannot be a top outlier
      • Process examples in blocks
  • Worst case: O(N²) distance computations, O(N²/B) disk accesses
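The randomize-and-prune nested loop can be sketched in a few lines. This is a minimal in-memory Python sketch with hypothetical names, scoring each example by its distance to the kth nearest neighbor; the talk's actual implementation is disk-based and processes examples in blocks:

```python
import heapq
import random

def top_outliers(data, n, k, dist):
    """Top-n outliers scored by distance to the k-th nearest neighbor,
    using the randomized nested loop with the simple pruning rule."""
    data = list(data)
    random.shuffle(data)              # randomize the scan order
    top = []                          # min-heap of (score, index); weakest first
    cutoff = 0.0                      # score of the weakest current top outlier
    for i, x in enumerate(data):
        knn = []                      # negated distances of k closest so far
        pruned = False
        for j, y in enumerate(data):  # sequential scan
            if i == j:
                continue
            d = dist(x, y)
            if len(knn) < k:
                heapq.heappush(knn, -d)
            elif d < -knn[0]:
                heapq.heapreplace(knn, -d)
            # Prune: once the k-th-nearest distance so far drops below
            # the cutoff, x can no longer be a top outlier.
            if len(knn) == k and -knn[0] < cutoff:
                pruned = True
                break
        if not pruned:
            heapq.heappush(top, (-knn[0], i))   # score = k-th NN distance
            if len(top) > n:
                heapq.heappop(top)
            cutoff = top[0][0] if len(top) == n else 0.0
    return [(score, data[idx]) for score, idx in sorted(top, reverse=True)]
```

For example, `top_outliers([1.0, 1.1, 1.2, 1.3, 10.0], n=1, k=2, dist=lambda a, b: abs(a - b))` ranks 10.0 first. Randomization matters because it keeps the cutoff growing quickly regardless of how the data file happens to be ordered.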

SLIDE 7

Pruning

  • Outliers based on distance to the 3rd nearest neighbor (k=3)

[Figure: a sequential scan over census records (age, workclass, fnlwgt, education, education-num, …); d is the distance to the 3rd nearest neighbor of the weakest top outlier.]

An example can be discarded as soon as 3 neighbors within d have been found, since it can no longer be a top outlier.
SLIDE 8

Experimental Setup

  • 6 data sets varying from 68K to 5M examples
  • Mixture of discrete and continuous features (23-55 features)
  • Wall time reported (CPU + I/O)
    – Time does not include randomization
  • No special caching of records
  • Pentium 4, 1.5 GHz, 1 GB RAM
  • Memory footprint ~3 MB
  • Mined top 30 outliers, k=5, block size = 1000, average-distance definition

SLIDE 9

Scaling with N

[Figure: log-log plots of total time versus data set size for Corel Histogram, KDD Cup 1999, Person, and Normal 30D.]

SLIDE 10

Scaling Summary

Slope of a regression fit relating log time to log N, i.e. t = aN^b, so log t = log a + b·log N:

  Data Set          Slope b
  Corel Histogram     1.13
  Covertype           1.25
  KDDCup 1999         1.13
  Household 1990      1.32
  Person 1990         1.16
  Normal 30D          1.15
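The slope b can be recovered from (size, time) measurements with an ordinary least-squares fit in log-log space. A minimal sketch, using synthetic timings (not the talk's data) generated from an exact power law:

```python
import math

def scaling_exponent(sizes, times):
    """Least-squares slope b of log t = log a + b * log N."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Synthetic timings that grow as t = 0.01 * N**1.13.
sizes = [10_000, 100_000, 1_000_000]
times = [0.01 * n ** 1.13 for n in sizes]
b = scaling_exponent(sizes, times)
```

On real timings the fitted b summarizes the observed growth rate the way the table above does.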

SLIDE 11

Scaling with k

[Figure: total time versus k (20-100) for Person and Normal 30D; 1 million records used for both.]

SLIDE 12

Average Case Analysis

Consider the operation of the algorithm at a moment in time:

  – Outliers defined by distance to the kth neighbor
  – Current cutoff distance is d
  – Randomization + sequential scan = i.i.d. sampling of the pdf

Let p(x) be the probability that a randomly drawn example lies within distance d of x:

  p(x) = ∫ pdf(x) dV    (integrated over the sphere of radius d around x)

How many examples do we need to look at?

SLIDE 13

For non-outliers, the number of samples follows a negative binomial distribution. Let P(Y = y) be the probability of obtaining the kth success on step y:

  P(Y = y) = C(y−1, k−1) · p(x)^k · (1 − p(x))^(y−k)

The expectation of the number of samples with infinite data is:

  E[Y] = Σ_{y=k}^{∞} y · P(Y = y) = k / p(x)
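The E[Y] = k / p(x) expectation is easy to check with a quick Monte Carlo simulation (an illustration, not part of the talk): draw Bernoulli trials with success probability p until the kth success, and compare the average stopping step to k / p.

```python
import random

def draws_until_kth_success(p, k, rng):
    """Count i.i.d. draws until the k-th 'neighbor within distance d' appears."""
    draws = successes = 0
    while successes < k:
        draws += 1
        if rng.random() < p:
            successes += 1
    return draws

rng = random.Random(0)
p, k, trials = 0.05, 5, 20_000
mean_y = sum(draws_until_kth_success(p, k, rng) for _ in range(trials)) / trials
# Negative binomial: E[Y] = k / p = 100, so mean_y should land near 100.
```

This is why non-outliers are cheap: for a point in a dense region p(x) is large, so on average only about k / p(x) examples must be scanned before the pruning rule fires.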

SLIDE 14

How does the cutoff change during program execution?

[Figure: cutoff versus percent of the Person data set processed, for sample sizes 50K, 100K, 1M, and 5M.]

slide-15
SLIDE 15

0.2 0.4 0.6 0.8 1 1.2 1 1.2 1.4 1.6 1.8

Uniform 3D Household Covertype Person Corel Histogram Normal 30D KDDCup Mixed 3D

Scaling Rate b Versus Cutoff Ratio

Polynomial scaling b Relative change in cutoff (50K/5K) as N increases

SLIDE 16

Limitations

  • Failure modes
    – examples not in random order
    – examples not independent
    – no outliers in data

SLIDE 17

The method fails when there are no outliers

[Figure: log-log plot of total time versus data set size for Uniform 3D.]

Examples drawn from a uniform distribution in 3 dimensions; b = 1.76.

SLIDE 18

However, the method is efficient if there are at least a few outliers

[Figure: log-log plot of total time versus data set size for Mixed 3D.]

Examples drawn from a 99% uniform, 1% Gaussian distribution; b = 1.11.

SLIDE 19

Future Work

  • Pruning eliminates examples when they cannot be a top outlier. Can we prune examples when they are almost certain to be an outlier?
  • How many examples is enough? Do we need to do the full N² comparisons?
  • How do algorithm settings affect performance, and do they interact with data set characteristics?
  • How do we deal with dependent data points?
SLIDE 20

Summary & Conclusions

  • Presented a nested loop approach to finding distance-based outliers
  • Efficient: allows scaling to larger data sets with millions of examples and many features
  • Easy to implement, and should be the new strawman for research in speeding up distance-based outlier detection

SLIDE 21

Resources

  • Executables available from http://www.isle.org/~sbay
  • Comparison with GritBot on Census data: http://www.isle.org/~sbay/papers/kdd03/
  • Datasets are public and are available by request

SLIDE 22

SLIDE 23

Scaling Summary

Relative cost versus a linear scan (cost divided by N, base-10 logs) for N log N and the observed exponents b = 1.32 and b = 1.13:

  N         100   1000   10000   100000   1000000   10000000
  N log N     2      3       4        5         6          7
  b = 1.32  4.4    9.1    19.1     39.8      83.2      173.8
  b = 1.13  1.8    2.5     3.3      4.5       6.0        8.1
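The table's entries can be regenerated directly (reading each row as cost divided by N, with base-10 logs); a small sketch:

```python
import math

# Relative cost of N log N, N**1.32, and N**1.13 versus a linear scan.
for N in [10 ** e for e in range(2, 8)]:
    n_log_n = math.log10(N)   # (N * log N) / N = log N
    b_132 = N ** 0.32         # N**1.32 / N
    b_113 = N ** 0.13         # N**1.13 / N
    print(f"{N:>10d}  {n_log_n:4.1f}  {b_132:6.1f}  {b_113:4.1f}")
```

Even the worst observed exponent (b = 1.32) stays within a modest constant factor of N log N over this range, which is the point of the comparison.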

SLIDE 24

How big a sample do we need?

[Figure: correspondence versus size of reference set, for Normal30d (k100 a500) and Person (k30 a30).]

It depends…