CS 472 - Nearest Neighbor Learning
Nearest Neighbor Learning (Instance Based Learning)
- Classify based on local similarity
- Ranges from simple nearest neighbor to case-based and analogical reasoning
- Use local information near the current query instance to decide the classification of the new instance
k-Nearest Neighbor Approach
- Simply store all (or some representative subset) of the examples in the training set
- When generalizing to a new instance, measure the distance from the new instance to all the stored instances; the nearest ones vote to decide the class of the new instance
- No need to build a hypothesis ahead of time (lazy vs. eager learning)
  – Fast learning
  – Can be slow during execution and require significant storage
  – Some models index the data or reduce the instances stored to enhance efficiency
k-Nearest Neighbor (cont)
- Naturally supports real-valued attributes
- Typically use Euclidean distance
- Nominal/unknown attributes can simply be given a 0/1 distance (0 if the values match, 1 otherwise; more on other distance metrics later)
- The output class for the query instance is set to the most common class among its k nearest neighbors (could output a confidence/probability)

  dist(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}

  \hat{f}(x_q) = \operatorname{argmax}_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i)), where \delta(a, b) = 1 if a = b, else 0

- k greater than 1 is more noise resistant, but too large a k leads to lower accuracy as less relevant neighbors gain influence (common values: k = 3, k = 5)
  – Usually choose k by cross validation (trying different values for a task)
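A minimal sketch of the voting scheme above in plain Python (the helper names and the tiny data set are illustrative, not from the slides):

```python
import math
from collections import Counter

def euclidean(x, y):
    # dist(x, y) = sqrt(sum_i (x_i - y_i)^2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(train, query, k=3):
    """train is a list of (feature_vector, label) pairs; nothing is built
    ahead of time (lazy learning) -- all the work happens at query time."""
    # Take the k stored instances nearest to the query ...
    neighbors = sorted(train, key=lambda xy: euclidean(xy[0], query))[:k]
    # ... and let them vote: the most common class wins
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), 'A'), ((1.2, 0.8), 'A'), ((5.0, 5.0), 'B'), ((5.5, 4.5), 'B')]
print(knn_classify(train, (1.1, 1.0), k=3))   # 'A'
```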
Decision Surface
- For 1-nn, the decision boundary between the 2 closest points of different classes is linear
Decision Surface
- Combining all the appropriate intersections gives a Voronoi diagram
(Figure: Voronoi diagram under Euclidean distance – each point its own class)
(Figure: the same points under Manhattan distance)
k-Nearest Neighbor (cont)
- Usually do distance-weighted voting, where the strength of a neighbor's influence is inversely proportional to its distance
- The inverse of the distance squared is a common weight
- A Gaussian of the distance is another common weight
- With distance weighting the choice of k is more robust: k could be even and/or larger (even all points if desired), because the more distant points have negligible influence
  \hat{f}(x_q) = \operatorname{argmax}_{v \in V} \sum_{i=1}^{k} w_i \, \delta(v, f(x_i)), where w_i = \frac{1}{\mathrm{dist}(x_q, x_i)^2}
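A sketch of the distance-weighted vote (reuses the illustrative setup from the earlier sketch; the small epsilon guarding against a zero distance is an added assumption):

```python
import math
from collections import defaultdict

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify_weighted(train, query, k=3, eps=1e-9):
    """Each of the k nearest neighbors votes with weight 1 / dist^2, so
    distant neighbors contribute little and k can safely be larger."""
    neighbors = sorted(train, key=lambda xy: euclidean(xy[0], query))[:k]
    votes = defaultdict(float)
    for features, label in neighbors:
        d = euclidean(features, query)
        votes[label] += 1.0 / (d * d + eps)   # w_i = 1 / dist(x_q, x_i)^2
    return max(votes, key=votes.get)
```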
*Challenge Question* - k-Nearest Neighbor
- Assume the following data set
- Assume a new point (2, 6)
  – For the nearest neighbor distance use Manhattan distance
  – What would the output be for 3-nn with no distance weighting? What is the total vote?
  – What would the output be for 3-nn with distance weighting? What is the total vote?
Answer choices (unweighted output, weighted output):
A. A, A
B. A, B
C. B, A
D. B, B
E. None of the above
x    y    Label
1    5    A
?    8    B
9    9    B
10   10   A
*Challenge Question* - k-Nearest Neighbor
- Assume the following data set
- Assume a new point (2, 6)
  – For the nearest neighbor distance use Manhattan distance
  – What would the output be for 3-nn with no distance weighting? What is the total vote?
    B wins with a vote of 2 out of 3
  – What would the output be for 3-nn with distance weighting? What is the total vote?
    A wins with a vote of .25 vs. B's vote of .0625 + .01 = .0725
x    y    Label   Distance      Weighted Vote
1    5    A       1 + 1 = 2     1/2² = .25
?    8    B       2 + 2 = 4     1/4² = .0625
9    9    B       7 + 3 = 10    1/10² = .01
10   10   A       8 + 4 = 12    1/12² = .0069
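A quick check of the arithmetic above, using the Manhattan distances from the table directly:

```python
# (label, Manhattan distance to the query (2, 6)) for the 3 nearest neighbors
neighbors = [('A', 2), ('B', 4), ('B', 10)]

# Unweighted 3-nn: B wins 2 votes to 1
unweighted = {}
for label, _ in neighbors:
    unweighted[label] = unweighted.get(label, 0) + 1

# Distance-weighted 3-nn: w = 1 / dist^2, so A wins .25 to .0625 + .01 = .0725
weighted = {}
for label, d in neighbors:
    weighted[label] = weighted.get(label, 0.0) + 1.0 / d ** 2

print(unweighted, weighted)
```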
Regression with k-nn
- Can do regression by letting the output be the mean of the k nearest neighbors
Weighted Regression with k-nn
- Can do weighted regression by letting the output be the weighted mean of the k nearest neighbors
- For distance-weighted regression, where f(x) is the output value for instance x:

  \hat{f}(x_q) = \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}, where w_i = \frac{1}{\mathrm{dist}(x_q, x_i)^2}
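A sketch of the weighted-mean computation above (illustrative helper names; the epsilon is an added guard against an exact match at distance 0):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_regress_weighted(train, query, k=3, eps=1e-9):
    """train is a list of (feature_vector, target_value) pairs; the output is
    the distance-weighted mean of the k nearest targets, with w_i = 1/dist^2."""
    neighbors = sorted(train, key=lambda xy: euclidean(xy[0], query))[:k]
    num = den = 0.0
    for features, target in neighbors:
        w = 1.0 / (euclidean(features, query) ** 2 + eps)
        num += w * target      # sum_i w_i * f(x_i)
        den += w               # sum_i w_i -- the denominator renormalizes the value
    return num / den
```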
Regression Example
- What is the value of the new instance?
- Assume dist(x_q, n8) = 2, dist(x_q, n5) = 3, dist(x_q, n3) = 4, where n8, n5, n3 are the neighbors whose output values are 8, 5, and 3
- f(x_q) = (8/2² + 5/3² + 3/4²) / (1/2² + 1/3² + 1/4²) = 2.74 / .42 ≈ 6.5
- The denominator renormalizes the value
(Figure: query point x_q with its three nearest neighbors, whose output values are 8, 5, and 3)
k-Nearest Neighbor Homework
- Assume the following training set
- Assume a new point (.5, .2)
  – For all below, use Manhattan distance where required, and show your work
  – What would the output class for 3-nn be with no distance weighting?
  – What would the output class for 3-nn be with squared inverse distance weighting?
  – What would the 3-nn regression value for the point be if we used the regression labels rather than the class labels and used squared inverse distance weighting?
x     y     Class Label   Regression Label
.3    .8    A             .6
-.3   1.6   B             -.3
.9    ?     B             .8
1     1     A             1.2
Attribute Weighting
- One of the main weaknesses of nearest neighbor is irrelevant features, since they can dominate the distance (see the sketch after this list)
  – Example: assume 2 relevant and 10 irrelevant features
- Can create algorithms which weight the attributes (note that backprop and ID3 do higher-order weighting of features)
- Attribute weighting is no longer pure lazy evaluation, since you need to come up with a portion of your hypothesis (the attribute weights) before generalizing
- Still an open area of research
  – Higher-order weighting: 1st-order weighting helps, but not enough
  – Even if all features are relevant, all distances become similar as the number of features increases, since not all features are relevant at the same time, and the currently irrelevant ones can dominate the distance
  – A problem with all pure distance-based techniques; need higher-order weighting to ignore currently irrelevant features
  – What is the best method, etc.? An important research area
  – Dimensionality reduction can be useful (feature pre-processing, PCA, NLDR, etc.)
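A small illustration of irrelevant features swamping the distance, and of hand-set (not learned) attribute weights restoring the comparison; all numbers are made up:

```python
import math, random

def weighted_dist(x, y, w):
    # Weighted Euclidean distance: sqrt(sum_i w_i * (x_i - y_i)^2)
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, y)))

random.seed(0)
noise = lambda: [random.random() for _ in range(10)]
a = [0.1, 0.1] + noise()   # 2 relevant features followed by 10 irrelevant ones
b = [0.1, 0.2] + noise()   # close to a in the relevant features
c = [0.9, 0.9] + noise()   # far from a in the relevant features

uniform  = [1.0] * 12                 # every feature treated equally
informed = [1.0, 1.0] + [0.0] * 10    # irrelevant features weighted out

print(weighted_dist(a, b, uniform),  weighted_dist(a, c, uniform))   # can look similar
print(weighted_dist(a, b, informed), weighted_dist(a, c, informed))  # b is clearly closer
```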
Reduction Techniques
- Create a subset or other representative set of prototype nodes
  – Faster execution, and could even improve accuracy if noisy instances are removed
- Approaches
  – Leave-one-out reduction: drop an instance if it would still be classified correctly without it (see the sketch below)
  – Growth algorithm: only add an instance if it is not already classified correctly
  – Both are order dependent, with similar results
  – More global optimizing approaches
  – Just keep central points: lower accuracy (mostly linear Voronoi decision surface), best space savings
  – Just keep border points: best accuracy (pre-process noisy instances – Drop5)
  – Drop5 (Wilson & Martinez) maintains almost full accuracy with approximately 15% of the original instances
- Wilson, D. R. and Martinez, T. R., "Reduction Techniques for Instance-Based Learning Algorithms," Machine Learning, vol. 38, no. 3, pp. 257-286, 2000.
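A rough sketch of the leave-one-out style reduction described above, for 1-nn (a simplification, not the Drop algorithms from the cited paper):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nn_label(instances, query):
    # 1-nn label of query among instances (a list of (features, label) pairs)
    return min(instances, key=lambda xy: euclidean(xy[0], query))[1]

def leave_one_out_reduce(train):
    """Drop an instance if the remaining kept set still classifies it
    correctly; order dependent, as noted above."""
    kept = list(train)
    i = 0
    while i < len(kept):
        rest = kept[:i] + kept[i + 1:]
        if rest and nn_label(rest, kept[i][0]) == kept[i][1]:
            kept.pop(i)    # still classified correctly without it, so drop it
        else:
            i += 1
    return kept
```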
Distance Metrics
- Wilson, D. R. and Martinez, T. R., "Improved Heterogeneous Distance Functions," Journal of Artificial Intelligence Research, vol. 6, no. 1, pp. 1-34, 1997.
- Normalization of features is critical
- Don't-know (missing) values in novel instances or in the data set
  – Can do some type of imputation and then use the normal distance
  – Or assign a distance (between 0 and 1) for don't-know values
- Original main question: how best to handle nominal features
Value Difference Metric
- Assume an example with 2 output classes (A, B)
- Attribute 1 = Shape (Round, Square, Triangle, etc.)
- 10 total Round instances
  – 6 class A and 4 class B
- 5 total Square instances
  – 3 class A and 2 class B
- Since both attribute values suggest the same probabilities for the output class, the distance between Round and Square would be 0
  – If Triangle and Round suggested very different outputs, Triangle and Round would have a large distance
- The distance between two attribute values is a measure of how similar they are in inferring the output class
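The slide's counts plugged into one common form of the Value Difference Metric, vdm(a, b) = \sum_c |P(c|a) - P(c|b)|^q; the exponent q = 2 and the dictionary-of-counts layout are assumptions for illustration:

```python
def vdm(counts_a, counts_b, q=2):
    """counts_x maps output class -> number of training instances having
    attribute value x and that class; the distance compares the class
    probabilities the two attribute values induce."""
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    classes = set(counts_a) | set(counts_b)
    return sum(abs(counts_a.get(c, 0) / total_a - counts_b.get(c, 0) / total_b) ** q
               for c in classes)

round_counts  = {'A': 6, 'B': 4}   # 10 Round instances: 6 A, 4 B
square_counts = {'A': 3, 'B': 2}   # 5 Square instances: 3 A, 2 B
print(vdm(round_counts, square_counts))   # 0.0 -- both values imply P(A) = .6, P(B) = .4
```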
IVDM
- Distance metrics make a difference
- IVDM also helps deal with the many/irrelevant-feature problem of k-NN, because features only add significantly to the overall distance if their values lead to different outputs
- Two features which tend to lead to the same output probabilities (exactly what irrelevant features should do) will have zero or little distance, while their Euclidean distance could have been significantly larger
- Need to take it further: find distance approaches that take higher-order combinations of features into account in the distance metric
k-Nearest Neighbor Notes
- Note that full leave-one-out CV is easy with k-nn (see the sketch below)
- A very powerful yet simple scheme which does well on many tasks
- Overfitting is handled by just using a larger k
- Struggles with irrelevant inputs
  – Needs better incorporation of feature-weighting schemes
- Issues with distance on very high-dimensional tasks
  – Too many features wash out the effects of the specifically important ones (akin to the irrelevant-feature problem)
  – May need distance metrics other than Euclidean distance
- Also can be helpful to reduce the total # of instances
  – Efficiency
  – Sometimes accuracy
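Why full leave-one-out CV is cheap for k-nn: there is no model to retrain, so each instance is simply excluded from its own neighbor search. A minimal sketch with an unweighted 3-nn vote (illustrative names):

```python
import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def loo_accuracy(train, k=3):
    """Leave-one-out accuracy: classify each stored instance using all the
    other instances, with no retraining step needed."""
    correct = 0
    for i, (features, label) in enumerate(train):
        rest = train[:i] + train[i + 1:]
        neighbors = sorted(rest, key=lambda xy: euclidean(xy[0], features))[:k]
        votes = Counter(lbl for _, lbl in neighbors)
        correct += votes.most_common(1)[0][0] == label
    return correct / len(train)
```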
K-nn Lab
- See Learning Suite
- Regression part
  – Normalize output values? No need, but it will change the MSE value
- C++ default is RMSE?
  – Don't normalize outputs so that we can have consistent MSEs
Radial Basis Function Networks
  f(x) = w_0 + \sum_{i=1}^{h} w_i \, K(d(\mu_i, x))

where f(x) is the output of each linear output node, K is the RBF (kernel function), and d is a distance
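A minimal sketch of this forward pass for one output node, with a Gaussian kernel over Euclidean distance; the kernel width, weights, and prototype locations are illustrative assumptions:

```python
import math

def rbf_output(x, prototypes, weights, bias, sigma=1.0):
    """f(x) = w0 + sum_i w_i * K(d(mu_i, x)) for a single linear output node.
    prototypes holds the centers mu_i; weights holds one w_i per prototype."""
    def dist(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    def kernel(d):
        return math.exp(-(d ** 2) / (2 * sigma ** 2))   # Gaussian RBF
    return bias + sum(w * kernel(dist(mu, x)) for mu, w in zip(prototypes, weights))

prototypes = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
print(rbf_output((0.9, 1.1), prototypes, weights=[0.2, 0.5, -0.1], bias=0.0))
```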
Radial Basis Function (RBF) Networks
- One linear output node per class, with weights and a bias
- Each hidden (prototype) node computes the distance from itself to the input instance (Gaussian is common) – not like an MLP hidden node
- An arbitrary number of prototype nodes forms the hidden layer of the Radial Basis Function network; prototype nodes are typically non-adaptive
- The prototype layer expands the input space into a new prototype space, translating the data set into a new set with more features
- Output layer weights are learned with a simple linear model (e.g., the delta rule)
  – Not a preset label vote like in k-nearest neighbor
- Thus, output nodes learn 1st-order prototype weightings for each class
(Figure: RBF network with inputs x and y, prototype nodes 1–4, and output nodes A and B)
Radial Basis Function Networks
- Neural network variation of the nearest neighbor algorithm
- Output layer execution and weight learning (see the sketch after this list)
  – Highest node/class net value wins (or can output confidences)
  – Each output node collects weighted votes from the prototypes: the weighting from distance (the prototype activation, like k-nn) is unlearned, but unlike k-nn, the vote value is learned
  – Weight learning: delta rule variations or direct matrix weight calculation; linear or non-linear node activation function
- Could use an MLP at the top layer if desired
- Key issue: how many prototype nodes should there be, and where should they be placed (means)?
- Prototype node sphere of influence – the kernel basis function (deviation) – is like choosing k for k-NN
  – Too small: less generalization; should have some overlap
  – Too large: saturation, lose local effects, longer training
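A sketch of that output-layer learning: fixed prototype activations feed linear output nodes (one per class) trained with the delta rule; the targets, learning rate, and Gaussian kernel are assumptions for illustration:

```python
import math

def activations(x, prototypes, sigma=1.0):
    # Unlearned prototype activations: Gaussian of the Euclidean distance
    return [math.exp(-sum((p - q) ** 2 for p, q in zip(mu, x)) / (2 * sigma ** 2))
            for mu in prototypes]

def delta_rule_step(weights, acts, targets, lr=0.1):
    """weights[c][i] is the weight from prototype i to output node c (linear
    units). One online update: w <- w + lr * (target - output) * activation."""
    for c, w_c in enumerate(weights):
        out = sum(w * a for w, a in zip(w_c, acts))   # net value of output node c
        err = targets[c] - out
        for i, a in enumerate(acts):
            w_c[i] += lr * err * a
    return weights

prototypes = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
weights = [[0.1, 0.1, 0.1], [0.0, 0.0, 0.0]]          # two output classes
acts = activations((0.9, 1.1), prototypes)
delta_rule_step(weights, acts, targets=[0.0, 1.0])    # instance belongs to the 2nd class
```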
Node Placement
- Random coverage: prototypes potentially placed in areas where instances don't occur; curse of dimensionality
- One prototype node for each instance of the training set
- Random subset of the training set instances
- Clustering: unsupervised or supervised, k-means style vs. constructive (see the sketch below)
- Genetic algorithms
- Node adjustment: adaptive prototypes (competitive learning style)
- Dynamic addition and deletion of prototype nodes
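A rough sketch of the clustering option: a plain k-means loop whose resulting centers become the RBF prototype means; the number of prototypes, iteration count, and data are arbitrary choices:

```python
import random

def kmeans_centers(points, k=3, iters=20, seed=0):
    """Plain k-means over a list of equal-length tuples; the returned centers
    can be used as the RBF prototype locations (means)."""
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest current center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty)
        new_centers = []
        for j, cl in enumerate(clusters):
            new_centers.append(tuple(sum(col) / len(cl) for col in zip(*cl)) if cl
                               else centers[j])
        centers = new_centers
    return centers

points = [(0.1, 0.2), (0.0, 0.1), (1.0, 1.1), (0.9, 1.0), (2.0, 0.1), (2.1, 0.0)]
print(kmeans_centers(points, k=3))
```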
RBF Homework
- Assume you have an RBF with
  – Two inputs
  – Three output classes A, B, and C (linear units)
  – Three prototype nodes at (0,0), (.5,1) and (1,.5)
  – The radial basis function of the prototype nodes is max(0, 1 – Manhattan distance between the node and the instance in question)
  – Assume no bias and initial weights of .6 into output node A, -.4 into output node B, and 0 into output node C
  – Assume top layer training is the delta rule with LR = .1
- Assume we input the single instance (.6, .8)
  – Which class would be the winner?
  – What would the weights be updated to if it were a training instance of (.6, .8) with target class B? (thus B has target 1 and A has target 0)
RBF vs. BP
- Line vs. sphere decision surfaces; mix-and-match approaches exist
  – Multiple spheres still create Voronoi-style decision surfaces
- Potentially faster training due to nearest-neighbor localization, yet more data and more hidden nodes are typically needed
- Local vs. distributed representation: less extrapolation (à la BP), and a reject capability (avoid false positives)
- RBF will have problems with irrelevant features just like nearest neighbor (or any distance-based approach which treats all inputs equally)
  – Could be improved by adding learning into the prototype layer to learn attribute weighting
Distributed vs Local
- MLP vs. k-NN (RBF): exponential vs. linear representation potential – but how usable is it? (overfit, exponential training data?) Which is best is an open question.
- (Figure: decision surfaces for an MLP with 3 hidden nodes and for k-NN with 3 nodes)