CS 472 - Nearest Neighbor Learning
Nearest Neighbor Learning (Instance Based Learning)
- Classify based on local similarity
- Ranges from simple nearest neighbor to case-based and analogical reasoning
- Use local information near the current query instance to decide the classification of the new instance
k-Nearest Neighbor Approach
- Simply store all (or some representative subset) of the examples in the training set
- When generalizing to a new instance, measure the distance from the new instance to all the stored instances; the nearest ones vote to decide the class of the new instance
- No need to build a hypothesis ahead of time (lazy vs. eager learning)
  – Fast learning
  – Can be slow during execution and require significant storage
  – Some models index the data or reduce the instances stored to enhance efficiency
k-Nearest Neighbor (cont)
- Naturally supports real-valued attributes
- Typically use Euclidean distance
- Nominal/unknown attributes can simply be given a 0/1 distance (0 if the values match, 1 otherwise; more on other distance metrics later)
- The output class for the query instance is set to the most common class among its k nearest neighbors (could output a confidence/probability)

  dist(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}

  \hat{f}(x_q) = \operatorname{argmax}_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i)), where \delta(a, b) = 1 if a = b, else 0

- k greater than 1 is more noise resistant, but too large a k leads to lower accuracy as less relevant neighbors gain influence (common values: k = 3, k = 5)
  – Usually choose k by cross validation (trying different values for a task)
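A minimal sketch of the voting scheme above in plain Python (the helper names and the tiny data set are illustrative, not from the slides):

```python
import math
from collections import Counter

def euclidean(x, y):
    # dist(x, y) = sqrt(sum_i (x_i - y_i)^2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(train, query, k=3):
    """train is a list of (feature_vector, label) pairs; nothing is built
    ahead of time (lazy learning) -- all the work happens at query time."""
    # Take the k stored instances nearest to the query ...
    neighbors = sorted(train, key=lambda xy: euclidean(xy[0], query))[:k]
    # ... and let them vote: the most common class wins
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), 'A'), ((1.2, 0.8), 'A'), ((5.0, 5.0), 'B'), ((5.5, 4.5), 'B')]
print(knn_classify(train, (1.1, 1.0), k=3))   # 'A'
```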
Decision Surface
- For 1-nn, the decision boundary between the 2 closest points of different classes is linear
Decision Surface
- Combining all the appropriate intersections gives a Voronoi diagram
(Figure: Voronoi diagram under Euclidean distance – each point its own class)
(Figure: the same points under Manhattan distance)
k-Nearest Neighbor (cont)
- Usually do distance-weighted voting, where the strength of a neighbor's influence is inversely proportional to its distance
- The inverse of the distance squared is a common weight
- A Gaussian of the distance is another common weight
- With distance weighting the choice of k is more robust: k could be even and/or larger (even all points if desired), because the more distant points have negligible influence
  \hat{f}(x_q) = \operatorname{argmax}_{v \in V} \sum_{i=1}^{k} w_i \, \delta(v, f(x_i)), where w_i = \frac{1}{\mathrm{dist}(x_q, x_i)^2}
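A sketch of the distance-weighted vote (reuses the illustrative setup from the earlier sketch; the small epsilon guarding against a zero distance is an added assumption):

```python
import math
from collections import defaultdict

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify_weighted(train, query, k=3, eps=1e-9):
    """Each of the k nearest neighbors votes with weight 1 / dist^2, so
    distant neighbors contribute little and k can safely be larger."""
    neighbors = sorted(train, key=lambda xy: euclidean(xy[0], query))[:k]
    votes = defaultdict(float)
    for features, label in neighbors:
        d = euclidean(features, query)
        votes[label] += 1.0 / (d * d + eps)   # w_i = 1 / dist(x_q, x_i)^2
    return max(votes, key=votes.get)
```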
*Challenge Question* - k-Nearest Neighbor
- Assume the following data set
- Assume a new point (2, 6)
  – For the nearest neighbor distance use Manhattan distance
  – What would the output be for 3-nn with no distance weighting? What is the total vote?
  – What would the output be for 3-nn with distance weighting? What is the total vote?
Answer choices (unweighted output, weighted output):
A. A, A
B. A, B
C. B, A
D. B, B
E. None of the above
x    y    Label
1    5    A
?    8    B
9    9    B
10   10   A
*Challenge Question* - k-Nearest Neighbor
- Assume the following data set
- Assume a new point (2, 6)
  – For the nearest neighbor distance use Manhattan distance
  – What would the output be for 3-nn with no distance weighting? What is the total vote?
    B wins with a vote of 2 out of 3
  – What would the output be for 3-nn with distance weighting? What is the total vote?
    A wins with a vote of .25 vs. B's vote of .0625 + .01 = .0725
x    y    Label   Distance      Weighted Vote
1    5    A       1 + 1 = 2     1/2² = .25
?    8    B       2 + 2 = 4     1/4² = .0625
9    9    B       7 + 3 = 10    1/10² = .01
10   10   A       8 + 4 = 12    1/12² = .0069
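A quick check of the arithmetic above, using the Manhattan distances from the table directly:

```python
# (label, Manhattan distance to the query (2, 6)) for the 3 nearest neighbors
neighbors = [('A', 2), ('B', 4), ('B', 10)]

# Unweighted 3-nn: B wins 2 votes to 1
unweighted = {}
for label, _ in neighbors:
    unweighted[label] = unweighted.get(label, 0) + 1

# Distance-weighted 3-nn: w = 1 / dist^2, so A wins .25 to .0625 + .01 = .0725
weighted = {}
for label, d in neighbors:
    weighted[label] = weighted.get(label, 0.0) + 1.0 / d ** 2

print(unweighted, weighted)
```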
Regression with k-nn
- Can do regression by letting the output be the mean of the k nearest neighbors
Weighted Regression with k-nn
- Can do weighted regression by letting the output be the weighted mean of the k nearest neighbors
- For distance-weighted regression, where f(x) is the output value for instance x:

  \hat{f}(x_q) = \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}, where w_i = \frac{1}{\mathrm{dist}(x_q, x_i)^2}
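A sketch of the weighted-mean computation above (illustrative helper names; the epsilon is an added guard against an exact match at distance 0):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_regress_weighted(train, query, k=3, eps=1e-9):
    """train is a list of (feature_vector, target_value) pairs; the output is
    the distance-weighted mean of the k nearest targets, with w_i = 1/dist^2."""
    neighbors = sorted(train, key=lambda xy: euclidean(xy[0], query))[:k]
    num = den = 0.0
    for features, target in neighbors:
        w = 1.0 / (euclidean(features, query) ** 2 + eps)
        num += w * target      # sum_i w_i * f(x_i)
        den += w               # sum_i w_i -- the denominator renormalizes the value
    return num / den
```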
Regression Example
- What is the value of the new instance?
- Assume dist(x_q, n8) = 2, dist(x_q, n5) = 3, dist(x_q, n3) = 4, where n8, n5, n3 are the neighbors whose output values are 8, 5, and 3
- f(x_q) = (8/2² + 5/3² + 3/4²) / (1/2² + 1/3² + 1/4²) = 2.74 / .42 ≈ 6.5
- The denominator renormalizes the value
(Figure: query point x_q with its three nearest neighbors, whose output values are 8, 5, and 3)
k-Nearest Neighbor Homework
- Assume the following training set
- Assume a new point (.5, .2)
  – For all below, use Manhattan distance where required, and show your work
  – What would the output class for 3-nn be with no distance weighting?
  – What would the output class for 3-nn be with squared inverse distance weighting?
  – What would the 3-nn regression value for the point be if we used the regression labels rather than the class labels and used squared inverse distance weighting?
x     y     Class Label   Regression Label
.3    .8    A             .6
-.3   1.6   B             -.3
.9    ?     B             .8
1     1     A             1.2
Attribute Weighting
- One of the main weaknesses of nearest neighbor is irrelevant features, since they can dominate the distance (see the sketch after this list)
  – Example: assume 2 relevant and 10 irrelevant features
- Can create algorithms which weight the attributes (note that backprop and ID3 do higher-order weighting of features)
- Attribute weighting is no longer pure lazy evaluation, since you need to come up with a portion of your hypothesis (the attribute weights) before generalizing
- Still an open area of research
  – Higher-order weighting: 1st-order weighting helps, but not enough
  – Even if all features are relevant, all distances become similar as the number of features increases, since not all features are relevant at the same time, and the currently irrelevant ones can dominate the distance
  – A problem with all pure distance-based techniques; need higher-order weighting to ignore currently irrelevant features
  – What is the best method, etc.? An important research area
  – Dimensionality reduction can be useful (feature pre-processing, PCA, NLDR, etc.)
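A small illustration of irrelevant features swamping the distance, and of hand-set (not learned) attribute weights restoring the comparison; all numbers are made up:

```python
import math, random

def weighted_dist(x, y, w):
    # Weighted Euclidean distance: sqrt(sum_i w_i * (x_i - y_i)^2)
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, y)))

random.seed(0)
noise = lambda: [random.random() for _ in range(10)]
a = [0.1, 0.1] + noise()   # 2 relevant features followed by 10 irrelevant ones
b = [0.1, 0.2] + noise()   # close to a in the relevant features
c = [0.9, 0.9] + noise()   # far from a in the relevant features

uniform  = [1.0] * 12                 # every feature treated equally
informed = [1.0, 1.0] + [0.0] * 10    # irrelevant features weighted out

print(weighted_dist(a, b, uniform),  weighted_dist(a, c, uniform))   # can look similar
print(weighted_dist(a, b, informed), weighted_dist(a, c, informed))  # b is clearly closer
```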
Reduction Techniques
- Create a subset or other representative set of prototype nodes
  – Faster execution, and could even improve accuracy if noisy instances are removed
- Approaches
  – Leave-one-out reduction: drop an instance if it would still be classified correctly without it (see the sketch below)
  – Growth algorithm: only add an instance if it is not already classified correctly
  – Both are order dependent, with similar results
  – More global optimizing approaches
  – Just keep central points: lower accuracy (mostly linear Voronoi decision surface), best space savings
  – Just keep border points: best accuracy (pre-process noisy instances – Drop5)
  – Drop5 (Wilson & Martinez) maintains almost full accuracy with approximately 15% of the original instances
- Wilson, D. R. and Martinez, T. R., "Reduction Techniques for Instance-Based Learning Algorithms," Machine Learning, vol. 38, no. 3, pp. 257-286, 2000.
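A rough sketch of the leave-one-out style reduction described above, for 1-nn (a simplification, not the Drop algorithms from the cited paper):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nn_label(instances, query):
    # 1-nn label of query among instances (a list of (features, label) pairs)
    return min(instances, key=lambda xy: euclidean(xy[0], query))[1]

def leave_one_out_reduce(train):
    """Drop an instance if the remaining kept set still classifies it
    correctly; order dependent, as noted above."""
    kept = list(train)
    i = 0
    while i < len(kept):
        rest = kept[:i] + kept[i + 1:]
        if rest and nn_label(rest, kept[i][0]) == kept[i][1]:
            kept.pop(i)    # still classified correctly without it, so drop it
        else:
            i += 1
    return kept
```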
Distance Metrics
- Wilson, D. R. and Martinez, T. R., "Improved Heterogeneous Distance Functions," Journal of Artificial Intelligence Research, vol. 6, no. 1, pp. 1-34, 1997.
- Normalization of features is critical
- Don't-know (missing) values in novel instances or in the data set
  – Can do some type of imputation and then use the normal distance
  – Or assign a distance (between 0 and 1) for don't-know values
- Original main question: how best to handle nominal features
Value Difference Metric
- Assume an example with 2 output classes (A, B)
- Attribute 1 = Shape (Round, Square, Triangle, etc.)
- 10 total Round instances
  – 6 class A and 4 class B
- 5 total Square instances
  – 3 class A and 2 class B
- Since both attribute values suggest the same probabilities for the output class, the distance between Round and Square would be 0
  – If Triangle and Round suggested very different outputs, Triangle and Round would have a large distance
- The distance between two attribute values is a measure of how similar they are in inferring the output class
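The slide's counts plugged into one common form of the Value Difference Metric, vdm(a, b) = \sum_c |P(c|a) - P(c|b)|^q; the exponent q = 2 and the dictionary-of-counts layout are assumptions for illustration:

```python
def vdm(counts_a, counts_b, q=2):
    """counts_x maps output class -> number of training instances having
    attribute value x and that class; the distance compares the class
    probabilities the two attribute values induce."""
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    classes = set(counts_a) | set(counts_b)
    return sum(abs(counts_a.get(c, 0) / total_a - counts_b.get(c, 0) / total_b) ** q
               for c in classes)

round_counts  = {'A': 6, 'B': 4}   # 10 Round instances: 6 A, 4 B
square_counts = {'A': 3, 'B': 2}   # 5 Square instances: 3 A, 2 B
print(vdm(round_counts, square_counts))   # 0.0 -- both values imply P(A) = .6, P(B) = .4
```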
IVDM
- Distance metrics make a difference
- IVDM also helps deal with the many/irrelevant-feature problem of k-NN, because features only add significantly to the overall distance if their values lead to different outputs
- Two features which tend to lead to the same output probabilities (exactly what irrelevant features should do) will have zero or little distance, while their Euclidean distance could have been significantly larger
- Need to take it further: find distance approaches that take higher-order combinations of features into account in the distance metric
k-Nearest Neighbor Notes
- Note that full leave-one-out CV is easy with k-nn (see the sketch below)
- A very powerful yet simple scheme which does well on many tasks
- Overfitting is handled by just using a larger k
- Struggles with irrelevant inputs
  – Needs better incorporation of feature-weighting schemes
- Issues with distance on very high-dimensional tasks
  – Too many features wash out the effects of the specifically important ones (akin to the irrelevant-feature problem)
  – May need distance metrics other than Euclidean distance
- Also can be helpful to reduce the total # of instances
  – Efficiency
  – Sometimes accuracy
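Why full leave-one-out CV is cheap for k-nn: there is no model to retrain, so each instance is simply excluded from its own neighbor search. A minimal sketch with an unweighted 3-nn vote (illustrative names):

```python
import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def loo_accuracy(train, k=3):
    """Leave-one-out accuracy: classify each stored instance using all the
    other instances, with no retraining step needed."""
    correct = 0
    for i, (features, label) in enumerate(train):
        rest = train[:i] + train[i + 1:]
        neighbors = sorted(rest, key=lambda xy: euclidean(xy[0], features))[:k]
        votes = Counter(lbl for _, lbl in neighbors)
        correct += votes.most_common(1)[0][0] == label
    return correct / len(train)
```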
K-nn Lab
- See Learning Suite
- Regression part
  – Normalize output values? No need, but it will change the MSE value
- C++ default is RMSE?
  – Don't normalize outputs so that we can have consistent MSEs
Radial Basis Function Networks
  f(x) = w_0 + \sum_{i=1}^{h} w_i \, K(d(\mu_i, x))

where f(x) is the output of each linear output node, K is the RBF (kernel function), and d is a distance
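A minimal sketch of this forward pass for one output node, with a Gaussian kernel over Euclidean distance; the kernel width, weights, and prototype locations are illustrative assumptions:

```python
import math

def rbf_output(x, prototypes, weights, bias, sigma=1.0):
    """f(x) = w0 + sum_i w_i * K(d(mu_i, x)) for a single linear output node.
    prototypes holds the centers mu_i; weights holds one w_i per prototype."""
    def dist(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    def kernel(d):
        return math.exp(-(d ** 2) / (2 * sigma ** 2))   # Gaussian RBF
    return bias + sum(w * kernel(dist(mu, x)) for mu, w in zip(prototypes, weights))

prototypes = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
print(rbf_output((0.9, 1.1), prototypes, weights=[0.2, 0.5, -0.1], bias=0.0))
```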
Radial Basis Function (RBF) Networks
- One linear output node per class, with weights and a bias
- Each hidden (prototype) node computes the distance from itself to the input instance (Gaussian is common) – not like an MLP hidden node
- An arbitrary number of prototype nodes forms the hidden layer of the Radial Basis Function network; prototype nodes are typically non-adaptive
- The prototype layer expands the input space into a new prototype space, translating the data set into a new set with more features
- Output layer weights are learned with a simple linear model (e.g., the delta rule)
  – Not a preset label vote like in k-nearest neighbor
- Thus, output nodes learn 1st-order prototype weightings for each class
(Figure: RBF network with inputs x and y, prototype nodes 1–4, and output nodes A and B)
Radial Basis Function Networks
- Neural network variation of the nearest neighbor algorithm
- Output layer execution and weight learning (see the sketch after this list)
  – Highest node/class net value wins (or can output confidences)
  – Each output node collects weighted votes from the prototypes: the weighting from distance (the prototype activation, like k-nn) is unlearned, but unlike k-nn, the vote value is learned
  – Weight learning: delta rule variations or direct matrix weight calculation; linear or non-linear node activation function
- Could use an MLP at the top layer if desired
- Key issue: how many prototype nodes should there be, and where should they be placed (means)?
- Prototype node sphere of influence – the kernel basis function (deviation) – is like choosing k for k-NN
  – Too small: less generalization; should have some overlap
  – Too large: saturation, lose local effects, longer training
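A sketch of that output-layer learning: fixed prototype activations feed linear output nodes (one per class) trained with the delta rule; the targets, learning rate, and Gaussian kernel are assumptions for illustration:

```python
import math

def activations(x, prototypes, sigma=1.0):
    # Unlearned prototype activations: Gaussian of the Euclidean distance
    return [math.exp(-sum((p - q) ** 2 for p, q in zip(mu, x)) / (2 * sigma ** 2))
            for mu in prototypes]

def delta_rule_step(weights, acts, targets, lr=0.1):
    """weights[c][i] is the weight from prototype i to output node c (linear
    units). One online update: w <- w + lr * (target - output) * activation."""
    for c, w_c in enumerate(weights):
        out = sum(w * a for w, a in zip(w_c, acts))   # net value of output node c
        err = targets[c] - out
        for i, a in enumerate(acts):
            w_c[i] += lr * err * a
    return weights

prototypes = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
weights = [[0.1, 0.1, 0.1], [0.0, 0.0, 0.0]]          # two output classes
acts = activations((0.9, 1.1), prototypes)
delta_rule_step(weights, acts, targets=[0.0, 1.0])    # instance belongs to the 2nd class
```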
Node Placement
- Random coverage: prototypes potentially placed in areas where instances don't occur; curse of dimensionality
- One prototype node for each instance of the training set
- Random subset of the training set instances
- Clustering: unsupervised or supervised, k-means style vs. constructive (see the sketch below)
- Genetic algorithms
- Node adjustment: adaptive prototypes (competitive learning style)
- Dynamic addition and deletion of prototype nodes
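A rough sketch of the clustering option: a plain k-means loop whose resulting centers become the RBF prototype means; the number of prototypes, iteration count, and data are arbitrary choices:

```python
import random

def kmeans_centers(points, k=3, iters=20, seed=0):
    """Plain k-means over a list of equal-length tuples; the returned centers
    can be used as the RBF prototype locations (means)."""
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest current center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty)
        new_centers = []
        for j, cl in enumerate(clusters):
            new_centers.append(tuple(sum(col) / len(cl) for col in zip(*cl)) if cl
                               else centers[j])
        centers = new_centers
    return centers

points = [(0.1, 0.2), (0.0, 0.1), (1.0, 1.1), (0.9, 1.0), (2.0, 0.1), (2.1, 0.0)]
print(kmeans_centers(points, k=3))
```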
RBF Homework
- Assume you have an RBF with
  – Two inputs
  – Three output classes A, B, and C (linear units)
  – Three prototype nodes at (0,0), (.5,1) and (1,.5)
  – The radial basis function of the prototype nodes is max(0, 1 – Manhattan distance between the node and the instance in question)
  – Assume no bias and initial weights of .6 into output node A, -.4 into output node B, and 0 into output node C
  – Assume top layer training is the delta rule with LR = .1
- Assume we input the single instance (.6, .8)
  – Which class would be the winner?
  – What would the weights be updated to if it were a training instance of (.6, .8) with target class B? (thus B has target 1 and A has target 0)
RBF vs. BP
- Line vs. sphere decision surfaces; mix-and-match approaches exist
  – Multiple spheres still create Voronoi-style decision surfaces
- Potentially faster training due to nearest-neighbor localization, yet more data and more hidden nodes are typically needed
- Local vs. distributed representation: less extrapolation (à la BP), and a reject capability (avoid false positives)
- RBF will have problems with irrelevant features just like nearest neighbor (or any distance-based approach which treats all inputs equally)
  – Could be improved by adding learning into the prototype layer to learn attribute weighting
Distributed vs Local
- MLP vs. k-NN (RBF): exponential vs. linear representation potential – but how usable is it? (overfit, exponential training data?) Which is best is an open question.
- (Figure: decision surfaces for an MLP with 3 hidden nodes and for k-NN with 3 nodes)