Instance-based Learning (CE-717: Machine Learning, Sharif University of Technology)


SLIDE 1

Instance-based Learning

CE-717: Machine Learning

Sharif University of Technology

M. Soleymani

Fall 2018

SLIDE 2

Outline

• Non-parametric approach
• Unsupervised: non-parametric density estimation
  - Parzen windows
  - $k_n$-nearest-neighbor density estimation
• Supervised: instance-based learners
  - Classification: kNN classification, weighted (or kernel) kNN
  - Regression: kNN regression, locally weighted linear regression

SLIDE 3

Introduction

• Estimation of arbitrary density functions
  - Parametric density functions cannot usually fit the densities we encounter in practical problems.
  - e.g., parametric densities are unimodal.
• Non-parametric methods do not assume that the form of the underlying densities is known in advance.
• Non-parametric methods (for classification) can be categorized into:
  - Generative: estimate $p(\mathbf{x} \mid \mathcal{C}_i)$ from $\mathcal{D}_i$ using non-parametric density estimation
  - Discriminative: estimate $p(\mathcal{C}_i \mid \mathbf{x})$ from $\mathcal{D}$

SLIDE 4

Parametric vs. non-parametric methods

• Parametric methods need to find parameters from data and then use the inferred parameters to decide on new data points.
  - Learning: finding parameters from data
• Non-parametric methods:
  - Training examples are used explicitly.
  - A training phase is not required.
• Both supervised and unsupervised learning methods can be categorized into parametric and non-parametric methods.

SLIDE 5

Histogram approximation idea

• Histogram approximation of an unknown pdf:
  - $P(b_m) \approx k_n(b_m)/n$,  $m = 1, \dots, M$
  - $k_n(b_m)$: number of samples (among the $n$ ones) that lie in bin $b_m$
• The corresponding estimated pdf:
  - $\hat{p}(x) = \dfrac{P(b_m)}{h}$  for  $|x - \bar{x}_{b_m}| \le \dfrac{h}{2}$, where $h$ is the bin width and $\bar{x}_{b_m}$ is the mid-point of bin $b_m$.
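
To make this estimator concrete, here is a minimal NumPy sketch (not from the slides; the bin edges, toy data, and function name are illustrative assumptions):

```python
# Histogram pdf estimate: p_hat(x) = (k_n(b_m)/n) / h for the bin b_m containing x.
import numpy as np

def histogram_pdf(samples, bin_edges):
    """Return a function x -> p_hat(x) built from 1-D samples."""
    counts, _ = np.histogram(samples, bins=bin_edges)
    n = len(samples)
    widths = np.diff(bin_edges)
    densities = counts / (n * widths)            # k_n(b_m) / (n * h)

    def p_hat(x):
        m = np.searchsorted(bin_edges, x, side="right") - 1
        return densities[m] if 0 <= m < len(densities) else 0.0
    return p_hat

# Usage: bins of width h = 0.5 on samples from a standard normal.
rng = np.random.default_rng(0)
data = rng.normal(size=1000)
p = histogram_pdf(data, bin_edges=np.arange(-4, 4.5, 0.5))
print(p(0.0))   # should be close to the standard normal density at 0 (about 0.40)
```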

SLIDE 6

Non-parametric density estimation

• Probability of falling in a region $\mathcal{R}$:
  - $P = \int_{\mathcal{R}} p(\mathbf{x}')\, d\mathbf{x}'$  (a smoothed version of $p(\mathbf{x})$)
• $\mathcal{D} = \{\mathbf{x}^{(i)}\}_{i=1}^{n}$: a set of samples drawn i.i.d. according to $p(\mathbf{x})$
• The probability that $k$ of the $n$ samples fall in $\mathcal{R}$:
  - $P_k = \binom{n}{k} P^k (1 - P)^{n-k}$
  - $E[k] = nP$
• This binomial distribution peaks sharply about the mean:
  - $k \approx nP \;\Rightarrow\; k/n$ as an estimate for $P$
  - More accurate for larger $n$

SLIDE 7

Non-parametric density estimation

• We can estimate the smoothed $p(\mathbf{x})$ by estimating $P$:
  - Assumption: $p(\mathbf{x})$ is continuous and the region $\mathcal{R}$ enclosing $\mathbf{x}$ is so small that $p$ is nearly constant in it:
    $P = \int_{\mathcal{R}} p(\mathbf{x}')\, d\mathbf{x}' = p(\mathbf{x}) \times V$,  where $V = \mathrm{Vol}(\mathcal{R})$ and $\mathbf{x} \in \mathcal{R}$
    $\Rightarrow\; p(\mathbf{x}) = \dfrac{P}{V} \approx \dfrac{k/n}{V}$
• Let $V$ approach zero if we want to find $p(\mathbf{x})$ itself instead of the averaged version.

SLIDE 8

Necessary conditions for convergence

• $p_n(\mathbf{x})$ is the estimate of $p(\mathbf{x})$ using $n$ samples:
  - $V_n$: the volume of the region around $\mathbf{x}$
  - $k_n$: the number of samples falling in the region
  - $p_n(\mathbf{x}) = \dfrac{k_n/n}{V_n}$
• Necessary conditions for convergence of $p_n(\mathbf{x})$ to $p(\mathbf{x})$:
  - $\lim_{n \to \infty} V_n = 0$
  - $\lim_{n \to \infty} k_n = \infty$
  - $\lim_{n \to \infty} k_n/n = 0$

SLIDE 9

Non-parametric density estimation: main approaches

• Two approaches to satisfying the conditions:
  - k-nearest-neighbor density estimator: fix $k$ and determine the value of $V$ from the data.
    - The volume grows until it contains the $k$ nearest neighbors of $\mathbf{x}$.
    - Converges to the true probability density in the limit $n \to \infty$ when $k$ grows with $n$ (e.g., $k_n = k_1 \sqrt{n}$).
  - Kernel density estimator (Parzen window): fix $V$ and determine $k$ from the data.
    - The number of points falling inside the volume can vary from point to point.
    - Converges to the true probability density in the limit $n \to \infty$ when $V$ shrinks suitably with $n$ (e.g., $V_n = V_1/\sqrt{n}$).
SLIDE 10

Parzen window

• Extension of the histogram idea:
  - Hyper-cubes with side length $h$ (i.e., volume $h^d$) are located on the samples.
• Hypercube as a simple window function:
  - $\varphi(\mathbf{u}) = \begin{cases} 1 & |u_1| \le 1/2 \ \wedge \ \dots \ \wedge \ |u_d| \le 1/2 \\ 0 & \text{otherwise} \end{cases}$
• Resulting estimate:
  $p_n(\mathbf{x}) = \dfrac{1}{n} \times \dfrac{1}{h_n^d} \sum_{i=1}^{n} \varphi\!\left(\dfrac{\mathbf{x} - \mathbf{x}^{(i)}}{h_n}\right)$
• Equivalently $p_n(\mathbf{x}) = \dfrac{k_n/n}{V_n}$ with
  - $k_n = \sum_{i=1}^{n} \varphi\!\left(\dfrac{\mathbf{x} - \mathbf{x}^{(i)}}{h_n}\right)$: the number of samples in the hypercube around $\mathbf{x}$
  - $V_n = h_n^d$
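
A minimal sketch of this hypercube Parzen estimator (not from the slides; the NumPy implementation and toy data are illustrative assumptions):

```python
# Parzen window estimate p_n(x) = (1 / (n * h^d)) * sum_i phi((x - x_i) / h)
# with the hypercube window phi.
import numpy as np

def parzen_hypercube(x, samples, h):
    """Density estimate at point x from an (n, d) array of samples."""
    n, d = samples.shape
    u = (x - samples) / h                        # (n, d) scaled offsets
    inside = np.all(np.abs(u) <= 0.5, axis=1)    # phi(u) = 1 inside the unit cube
    return inside.sum() / (n * h**d)

# Usage: 2-D samples, window width h = 0.5.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
print(parzen_hypercube(np.array([0.0, 0.0]), X, h=0.5))
```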

SLIDE 11

Window function

• Necessary conditions on the window function so that the result is a legitimate density function:
  - $\varphi(\mathbf{x}) \ge 0$
  - $\int \varphi(\mathbf{x})\, d\mathbf{x} = 1$
• Windows are also called kernels or potential functions.

SLIDE 12

Density estimation: non-parametric

• Gaussian kernel with width $\sigma = h$:
  $\hat{p}_n(x) = \dfrac{1}{n} \sum_{i=1}^{n} \mathcal{N}(x \mid x^{(i)}, \sigma^2) = \dfrac{1}{n} \sum_{i=1}^{n} \dfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - x^{(i)})^2}{2\sigma^2}}$
• Example samples: 1, 1.2, 1.4, 1.5, 1.6, 2, 2.1, 2.15, 4, 4.3, 4.7, 4.75, 5
• The choice of $\sigma$ is crucial.
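
A minimal sketch of this Gaussian-kernel estimate on the slide's sample points (the evaluation point and the loop over widths are illustrative assumptions):

```python
# p_hat(x) = (1/n) * sum_i N(x | x_i, sigma^2) for 1-D data.
import numpy as np

def gaussian_kde(x, samples, sigma):
    z = (x - samples) / sigma
    return np.mean(np.exp(-0.5 * z**2) / (np.sqrt(2 * np.pi) * sigma))

data = np.array([1, 1.2, 1.4, 1.5, 1.6, 2, 2.1, 2.15, 4, 4.3, 4.7, 4.75, 5])

# The width sigma controls smoothness; compare the widths shown on the next slide.
for sigma in (0.02, 0.1, 0.5, 1.5):
    print(sigma, gaussian_kde(2.0, data, sigma))
```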

SLIDE 13

Density estimation: non-parametric

• Figure: Gaussian-kernel estimates on the previous slide's samples for $\sigma = 0.02$, $0.1$, $0.5$, and $1.5$.

SLIDE 14

Window (or kernel) function: width parameter

$p_n(\mathbf{x}) = \dfrac{1}{n} \times \dfrac{1}{h_n^d} \sum_{i=1}^{n} \varphi\!\left(\dfrac{\mathbf{x} - \mathbf{x}^{(i)}}{h_n}\right)$

• Choosing $h_n$:
  - Too large: low resolution
  - Too small: much variability
• For unlimited $n$, by letting $V_n$ slowly approach zero as $n$ increases, $p_n(\mathbf{x})$ converges to $p(\mathbf{x})$.

[Duda, Hart, and Stork]

SLIDE 15

Parzen window: example

• $\varphi(u) = \mathcal{N}(0, 1)$, true density $p(x) = \mathcal{N}(0, 1)$, $h_n = h_1/\sqrt{n}$

[Duda, Hart, and Stork]
SLIDE 16

Width parameter

• For fixed $n$, a smaller $h$ results in higher variance, while a larger $h$ leads to higher bias.
• For a fixed $h$, the variance decreases as the number of sample points $n$ tends to infinity.
  - For a large enough number of samples, the smaller $h$ is, the better the accuracy of the resulting estimate.
• In practice, where only a finite number of samples is available, a compromise between $h$ and $n$ must be made.
  - $h$ can be set using techniques like cross-validation when the density estimate is used for learning tasks such as classification.

SLIDE 17

Practical issues: curse of dimensionality

• A large $n$ is necessary to find an acceptable density estimate in high-dimensional feature spaces.
  - $n$ must grow exponentially with the dimensionality $d$.
  - If $n$ equidistant points are required to densely fill a one-dimensional interval, $n^d$ points are needed to fill the corresponding $d$-dimensional hypercube.
  - We need an exponentially large quantity of training data to ensure that the cells are not empty.
• There are also computational complexity requirements.

Figure: grids of points for $d = 1$, $d = 2$, $d = 3$.

SLIDE 18

$k_n$-nearest-neighbor estimation

• The cell volume is a function of the point's location.
• To estimate $p(\mathbf{x})$, let the cell around $\mathbf{x}$ grow until it captures $k_n$ samples, called the $k_n$ nearest neighbors of $\mathbf{x}$.
• Two possibilities can occur:
  - High density near $\mathbf{x}$ ⇒ the cell will be small, which provides good resolution.
  - Low density near $\mathbf{x}$ ⇒ the cell will grow large and stop only when higher-density regions are reached.

SLIDE 19

$k_n$-nearest-neighbor estimation

$p_n(\mathbf{x}) = \dfrac{k_n/n}{V_n} \;\Rightarrow\; V_n \approx \dfrac{1}{p(\mathbf{x})} \cdot \dfrac{k_n}{n}$   ($V_n$ is a function of $\mathbf{x}$)

• A family of estimates is obtained by setting $k_n = k_1 \sqrt{n}$ and choosing different values for $k_1$; for $k_1 = 1$:

$p_n(\mathbf{x}) = \dfrac{k_n/n}{V_n} \;\Rightarrow\; V_n \approx \dfrac{1}{p(\mathbf{x})\,\sqrt{n}}$
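
A minimal sketch of the $k_n$-NN density estimator in one dimension, where the "cell" is the smallest interval around the query that contains its $k$ nearest neighbors (the toy data and the $k_1 = 1$ choice are illustrative assumptions):

```python
# k-NN density estimate p_n(x) = (k/n) / V_n, with V_n the length of the
# interval [x - r, x + r] reaching the k-th nearest neighbor.
import numpy as np

def knn_density_1d(x, samples, k):
    n = len(samples)
    dists = np.sort(np.abs(samples - x))
    r = dists[k - 1]                  # radius to the k-th nearest neighbor
    volume = 2 * r                    # interval length in 1-D
    return (k / n) / volume

rng = np.random.default_rng(0)
data = rng.normal(size=1000)
print(knn_density_1d(0.0, data, k=int(np.sqrt(1000))))   # k_n = k_1 * sqrt(n), k_1 = 1
```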

SLIDE 20

$k_n$-nearest-neighbor estimation: example

• Figure: discontinuities in the slopes of the estimate. [Bishop]

SLIDE 21

$k_n$-nearest-neighbor estimation: example

• $k_n = \sqrt{n}$
• For $n = 1$: $\hat{p}_1(x) = \dfrac{1}{2\,|x - x^{(1)}|}$

[Duda, Hart, and Stork]

SLIDE 22

Non-parametric density estimation: summary

• Generality of distributions:
  - With enough samples, convergence to an arbitrarily complicated target density can be obtained.
• The number of required samples must be very large to assure convergence:
  - It grows exponentially with the dimensionality of the feature space.
  - These methods are very sensitive to the choice of window width or number of nearest neighbors.
• There may be severe requirements for computation time and storage (needed to save all training samples):
  - The 'training' phase simply requires storage of the training set.
  - The computational cost of evaluating $p(\mathbf{x})$ grows linearly with the size of the data set.

SLIDE 23

Non-parametric learners

• Memory-based or instance-based learners
  - Lazy learning: (almost) all the work is done at test time.
• Generic description:
  - Memorize the training set $(\mathbf{x}^{(1)}, y^{(1)}), \dots, (\mathbf{x}^{(n)}, y^{(n)})$.
  - Given a test point $\mathbf{x}$, predict $\hat{y} = f(\mathbf{x};\, \mathbf{x}^{(1)}, y^{(1)}, \dots, \mathbf{x}^{(n)}, y^{(n)})$.
  - $f$ is typically expressed in terms of the similarity of the test sample $\mathbf{x}$ to the training samples $\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(n)}$.

SLIDE 24

Parzen window & generative classification

• Decision rule: if
  $\dfrac{\frac{1}{n_1 V} \sum_{\mathbf{x}^{(i)} \in \mathcal{D}_1} \varphi\!\left(\frac{\mathbf{x} - \mathbf{x}^{(i)}}{h}\right)}{\frac{1}{n_2 V} \sum_{\mathbf{x}^{(i)} \in \mathcal{D}_2} \varphi\!\left(\frac{\mathbf{x} - \mathbf{x}^{(i)}}{h}\right)} \times \dfrac{P(\mathcal{C}_1)}{P(\mathcal{C}_2)} > 1$,  decide $\mathcal{C}_1$; otherwise decide $\mathcal{C}_2$.
• $n_j = |\mathcal{D}_j|$ ($j = 1, 2$): number of training samples in class $\mathcal{C}_j$
• $\mathcal{D}_j$: set of training samples labeled $\mathcal{C}_j$
• For large $n$, it needs both high time and memory requirements.
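
A minimal sketch of this generative rule (not from the slides): class-conditional Parzen estimates times class priors are compared. Note it substitutes a Gaussian window for the hypercube, and the toy class data are illustrative assumptions:

```python
# Compare p_hat(x | C1) * P(C1) against p_hat(x | C2) * P(C2).
import numpy as np

def parzen_class_density(x, class_samples, h):
    d = class_samples.shape[1]
    z = (x - class_samples) / h
    kernel = np.exp(-0.5 * np.sum(z**2, axis=1)) / ((2 * np.pi)**(d / 2) * h**d)
    return kernel.mean()                         # (1/n_j) * sum of window values

def classify(x, D1, D2, h, n_total):
    s1 = parzen_class_density(x, D1, h) * (len(D1) / n_total)   # p(x|C1) P(C1)
    s2 = parzen_class_density(x, D2, h) * (len(D2) / n_total)   # p(x|C2) P(C2)
    return 1 if s1 > s2 else 2

rng = np.random.default_rng(0)
D1 = rng.normal(loc=[0, 0], size=(100, 2))
D2 = rng.normal(loc=[2, 2], size=(80, 2))
print(classify(np.array([0.5, 0.5]), D1, D2, h=0.5, n_total=180))
```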

SLIDE 25

Parzen window & generative classification: example

• Figure: decision regions for a smaller $h$ vs. a larger $h$. [Duda, Hart, and Stork]

SLIDE 26

Estimate the posterior

$p(\mathbf{x} \mid y = j) = \dfrac{k_j}{n_j V}$,  $p(y = j) = \dfrac{n_j}{n}$,  $p(\mathbf{x}) = \dfrac{k}{n V}$

$\Rightarrow\; p(y = j \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid y = j)\, p(y = j)}{p(\mathbf{x})} = \dfrac{k_j}{k}$

SLIDE 27

k-Nearest-Neighbor (kNN) rule

• k-NN classifier: $k > 1$ nearest neighbors
  - The label for $\mathbf{x}$ is predicted by majority voting among its k nearest neighbors.
• Example figure: $k = 5$, query point $\mathbf{x} = [x_1, x_2]$ in the $(x_1, x_2)$ plane.
• What is the effect of $k$?

SLIDE 28

kNN classifier

• Given:
  - Training data $\{(\mathbf{x}^{(1)}, y^{(1)}), \dots, (\mathbf{x}^{(n)}, y^{(n)})\}$ are simply stored.
  - Test sample: $\mathbf{x}$
• To classify $\mathbf{x}$:
  - Find the $k$ nearest training samples to $\mathbf{x}$.
  - Out of these $k$ samples, identify the number of samples $k_j$ belonging to class $\mathcal{C}_j$ ($j = 1, \dots, C$).
  - Assign $\mathbf{x}$ to the class $\mathcal{C}_{j^*}$ where $j^* = \operatorname{argmax}_{j = 1, \dots, C} k_j$.
• It can be considered a discriminative method.
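
A minimal sketch of this rule (not from the slides): Euclidean distance plus majority vote; the toy two-class data are illustrative assumptions:

```python
# kNN classification: majority vote among the k nearest training samples.
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=5):
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to all samples
    nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbors
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]             # majority class

# Usage on toy 2-D data with two classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(knn_classify(np.array([2.5, 2.5]), X, y, k=5))   # expected: 1
```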

SLIDE 29

Probabilistic perspective of kNN

• kNN as a discriminative non-parametric classifier
  - Non-parametric estimation of $P(\mathcal{C}_j \mid \mathbf{x})$:
    $P(\mathcal{C}_j \mid \mathbf{x}) \approx \dfrac{k_j}{k}$, where $k_j$ is the number of training samples among the $k$ nearest neighbors of $\mathbf{x}$ that are labeled $\mathcal{C}_j$.
  - Majority voting then corresponds to the Bayes decision rule for assigning labels.

SLIDE 30

Nearest-neighbor classifier: example

• Voronoi tessellation:
  - Each cell consists of all points closer to a given training point than to any other training point.
  - All points in a cell are labeled by the category of the corresponding training point.

[Duda, Hart, and Stork]

SLIDE 31

kNN classifier: effect of k

• Need to determine an appropriate value for $k$ (e.g., by cross-validation). [Bishop]

SLIDE 32

Nearest-neighbor classifier: error bound

• Nearest neighbor (NN): kNN with $k = 1$
  - Decision rule: $\hat{y} = y^{(NN(\mathbf{x}))}$ where $NN(\mathbf{x}) = \operatorname{argmin}_{i = 1, \dots, n} \lVert \mathbf{x} - \mathbf{x}^{(i)} \rVert$
• Cover & Hart (1967): the asymptotic risk of the NN classifier satisfies
  $R^* \le R_\infty^{NN} \le 2 R^* (1 - R^*) \le 2 R^*$
  - $R_n$: expected risk of the NN classifier with $n$ training examples drawn from $p(\mathbf{x}, y)$
  - $R_\infty^{NN} = \lim_{n \to \infty} R_n^{NN}$
  - $R^*$: the optimal Bayes risk

SLIDE 33

k-NN classifier: error bound

• Devroye (1996): the asymptotic risk of the kNN classifier, $R_\infty^{kNN} = \lim_{n \to \infty} R_n^{kNN}$, satisfies
  $R^* \le R_\infty^{kNN} \le R^* + \sqrt{\dfrac{2\, R_\infty^{NN}}{k}}$
  where $R^*$ is the optimal Bayes risk.
SLIDE 34

Bound on the k-nearest-neighbor error rate

• As $k$ increases, the upper bounds get closer to the lower bound (the Bayes error rate).
• When $k \to \infty$, the two bounds meet and the k-NN rule becomes optimal.

Figure: bounds on the k-NN error rate for different values of $k$ (infinite training data). [Duda, Hart, and Stork]

SLIDE 35

Instance-based learner

• Main things needed to construct an instance-based learner:
  - A distance metric
  - The number of nearest neighbors of the test point to look at
  - A weighting function (optional)
  - How to compute the output based on the neighbors

SLIDE 36

Distance measures

• Euclidean distance:
  $d(\mathbf{x}, \mathbf{x}') = \lVert \mathbf{x} - \mathbf{x}' \rVert_2 = \sqrt{(x_1 - x_1')^2 + \dots + (x_d - x_d')^2}$
  - Sensitive to irrelevant features
• Weighted Euclidean distance:
  $d_{\mathbf{w}}(\mathbf{x}, \mathbf{x}') = \sqrt{w_1 (x_1 - x_1')^2 + \dots + w_d (x_d - x_d')^2}$
• Mahalanobis distance:
  $d_{\mathbf{A}}(\mathbf{x}, \mathbf{x}') = \sqrt{(\mathbf{x} - \mathbf{x}')^{\top} \mathbf{A}\, (\mathbf{x} - \mathbf{x}')}$
  - Distance (metric) learning methods can be used for this purpose, i.e., to find the weights or the matrix.
• Other distances: Hamming, angle, and the $L_p$ family:
  $L_p(\mathbf{x}, \mathbf{x}') = \left( \sum_{i=1}^{d} |x_i - x_i'|^p \right)^{1/p}$
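
A minimal sketch computing these distances with NumPy (not from the slides; the example vectors, weights, and matrix $\mathbf{A}$ are illustrative assumptions):

```python
# Euclidean, feature-weighted Euclidean, and Mahalanobis distances.
import numpy as np

def euclidean(x, xp):
    return np.sqrt(np.sum((x - xp)**2))

def weighted_euclidean(x, xp, w):
    return np.sqrt(np.sum(w * (x - xp)**2))

def mahalanobis(x, xp, A):
    diff = x - xp
    return np.sqrt(diff @ A @ diff)              # A must be positive semi-definite

x, xp = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(euclidean(x, xp))                                           # sqrt(1 + 4)  ~ 2.236
print(weighted_euclidean(x, xp, w=np.array([1.0, 3.0])))          # sqrt(1 + 12) ~ 3.606
print(mahalanobis(x, xp, A=np.array([[2.0, 0.0], [0.0, 1.0]])))   # sqrt(2 + 4)  ~ 2.449
```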

SLIDE 37

Distance measure: example

$d(\mathbf{x}, \mathbf{x}') = \sqrt{(x_1 - x_1')^2 + (x_2 - x_2')^2}$  vs.  $d(\mathbf{x}, \mathbf{x}') = \sqrt{(x_1 - x_1')^2 + 3\,(x_2 - x_2')^2}$

SLIDE 38

LMNN (Large Margin Nearest Neighbor)

$d(\mathbf{x}, \mathbf{x}') = \sqrt{(\mathbf{x} - \mathbf{x}')^{\top} \mathbf{M}\, (\mathbf{x} - \mathbf{x}')}$

[Wikipedia]

SLIDE 39

Weighted kNN classification

• Weight nearer neighbors more heavily:
  $\hat{y} = f(\mathbf{x}) = \operatorname{argmax}_{c = 1, \dots, C} \sum_{j \in N_k(\mathbf{x})} w_j(\mathbf{x})\, I(c = y^{(j)})$
  with, for example, $w_j(\mathbf{x}) = \dfrac{1}{\lVert \mathbf{x} - \mathbf{x}^{(j)} \rVert^2}$ as the weighting function.
• In weighted kNN, we can use all training examples instead of just the $k$ nearest (Shepard's method):
  $\hat{y} = f(\mathbf{x}) = \operatorname{argmax}_{c = 1, \dots, C} \sum_{j=1}^{n} w_j(\mathbf{x})\, I(c = y^{(j)})$
• The weights can be found using a kernel function $w_j(\mathbf{x}) = K(\mathbf{x}, \mathbf{x}^{(j)})$:
  - e.g., $K(\mathbf{x}, \mathbf{x}^{(j)}) = e^{-d(\mathbf{x}, \mathbf{x}^{(j)})^2 / \sigma^2}$
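
A minimal sketch of weighted kNN classification with the inverse-squared-distance weights above (not from the slides; the epsilon guard against zero distance and the toy data are illustrative assumptions):

```python
# Weighted kNN: each of the k neighbors votes with weight 1 / ||x - x_j||^2.
import numpy as np

def weighted_knn_classify(x, X_train, y_train, k=5, eps=1e-12):
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest]**2 + eps)     # w_j(x) = 1 / ||x - x_j||^2
    classes = np.unique(y_train)
    scores = [weights[y_train[nearest] == c].sum() for c in classes]
    return classes[int(np.argmax(scores))]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(weighted_knn_classify(np.array([1.4, 1.4]), X, y, k=7))
```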

SLIDE 40

Weighting functions

• Figure: example weighting functions of the distance $d = d(\mathbf{x}, \mathbf{x}')$.

[Fig. adopted from Andrew Moore's tutorial on "Instance-based learning"]

SLIDE 41

kNN regression

• Simplest k-NN regression:
  - Let $\mathbf{x}'^{(1)}, \dots, \mathbf{x}'^{(k)}$ be the $k$ nearest neighbors of $\mathbf{x}$ and $y'^{(1)}, \dots, y'^{(k)}$ be their labels:
    $\hat{y} = \dfrac{1}{k} \sum_{j=1}^{k} y'^{(j)}$
• Problems of kNN regression for fitting functions:
  - Problem 1: discontinuities in the estimated function
    - Solution: weighted (or kernel) regression
  - 1NN: noise-fitting problem
  - kNN ($k > 1$) smooths away noise, but there are other deficiencies:
    - it flattens the ends.
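
A minimal sketch of this plain averaging rule (not from the slides; the noisy 1-D toy problem is an illustrative assumption):

```python
# kNN regression: predict the mean label of the k nearest neighbors.
import numpy as np

def knn_regress(x, X_train, y_train, k=5):
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)
print(knn_regress(np.array([np.pi / 2]), X, y, k=9))   # close to sin(pi/2) = 1
```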

SLIDE 42

kNN regression: examples

• Figure: kNN regression with $k = 1$ and $k = 9$ on Dataset 1, Dataset 2, and Dataset 3.

[Figs. adopted from Andrew Moore's tutorial on "Instance-based learning"]

SLIDE 43

Weighted (or kernel) kNN regression

• Give higher weights to nearer neighbors:
  $\hat{y} = f(\mathbf{x}) = \dfrac{\sum_{j \in N_k(\mathbf{x})} w_j(\mathbf{x})\, y^{(j)}}{\sum_{j \in N_k(\mathbf{x})} w_j(\mathbf{x})}$
• In weighted kNN regression, we can use all training examples instead of just the $k$ nearest:
  $\hat{y} = f(\mathbf{x}) = \dfrac{\sum_{j=1}^{n} w_j(\mathbf{x})\, y^{(j)}}{\sum_{j=1}^{n} w_j(\mathbf{x})}$
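
A minimal sketch of the all-points weighted form (not from the slides); the Gaussian weighting function, taken from the earlier kernel example, and the toy data are illustrative assumptions:

```python
# Kernel (weighted) regression over all training points.
import numpy as np

def kernel_regress(x, X_train, y_train, sigma=0.5):
    d2 = np.sum((X_train - x)**2, axis=1)
    w = np.exp(-d2 / sigma**2)                  # w_j(x) = exp(-||x - x_j||^2 / sigma^2)
    return np.sum(w * y_train) / np.sum(w)

rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)
print(kernel_regress(np.array([np.pi / 2]), X, y, sigma=0.3))
```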

SLIDE 44

Kernel kNN regression

• Choosing a good parameter (kernel width) is important.
• Figure: fits with kernel width $\sigma$ equal to 1/32, 1/32, and 1/16 of the x-axis width.

[This slide has been adapted from Andrew Moore's tutorial on "Instance-based learning"]

SLIDE 45

Kernel kNN regression

• Disadvantages:
  - Not capturing the simple structure of the data
  - Failure to extrapolate at the edges
• In these datasets, some regions are without samples; the best kernel widths have been used.

[Figs. adopted from Andrew Moore's tutorial on "Instance-based learning"]

SLIDE 46

Locally weighted linear regression

• For each test sample, it produces a linear approximation to the target function in a local region.
• Instead of finding the output by weighted averaging (as in kernel regression), we fit a parametric function locally:
  $\hat{y} = f(\mathbf{x};\, \mathbf{x}^{(1)}, y^{(1)}, \dots, \mathbf{x}^{(n)}, y^{(n)})$
  $\hat{y} = f(\mathbf{x}; \mathbf{w}) = w_0 + w_1 x_1 + \dots + w_d x_d$
  $J(\mathbf{w}) = \sum_{i \in N_k(\mathbf{x})} \left( y^{(i)} - \mathbf{w}^{\top} \mathbf{x}^{(i)} \right)^2$   (unweighted linear fit on the neighbors)
• $\mathbf{w}$ is found separately for each test sample.

SLIDE 47

Locally weighted linear regression

$\hat{y} = f(\mathbf{x};\, \mathbf{x}^{(1)}, y^{(1)}, \dots, \mathbf{x}^{(n)}, y^{(n)})$

• Unweighted (on the $k$ neighbors):
  $J(\mathbf{w}; \mathbf{x}) = \sum_{i \in N_k(\mathbf{x})} \left( y^{(i)} - \mathbf{w}^{\top} \mathbf{x}^{(i)} \right)^2$
• Weighted (on the $k$ neighbors):
  $J(\mathbf{w}; \mathbf{x}) = \sum_{i \in N_k(\mathbf{x})} K(\mathbf{x}, \mathbf{x}^{(i)}) \left( y^{(i)} - \mathbf{w}^{\top} \mathbf{x}^{(i)} \right)^2$
• Weighted on all training examples:
  $J(\mathbf{w}; \mathbf{x}) = \sum_{i=1}^{n} K(\mathbf{x}, \mathbf{x}^{(i)}) \left( y^{(i)} - \mathbf{w}^{\top} \mathbf{x}^{(i)} \right)^2$
• e.g., $K(\mathbf{x}, \mathbf{x}^{(i)}) = e^{-\lVert \mathbf{x} - \mathbf{x}^{(i)} \rVert^2 / \sigma^2}$
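
A minimal sketch of the "weighted on all training examples" objective, solved in closed form via weighted normal equations (not from the slides; the bias-term handling, toy data, and $\sigma$ are illustrative assumptions):

```python
# Locally weighted linear regression: for each query x, minimize
# sum_i K(x, x_i) * (y_i - w^T x_i)^2 and predict with the local w.
import numpy as np

def lwlr_predict(x, X_train, y_train, sigma=0.5):
    Xb = np.hstack([np.ones((len(X_train), 1)), X_train])   # prepend bias feature
    xb = np.concatenate([[1.0], np.atleast_1d(x)])
    d2 = np.sum((X_train - x)**2, axis=1)
    K = np.exp(-d2 / sigma**2)                               # per-sample kernel weights
    A = Xb.T @ (K[:, None] * Xb)                             # weighted normal equations
    b = Xb.T @ (K * y_train)
    w = np.linalg.solve(A, b)                                # local parameters for this x
    return xb @ w

rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
print(lwlr_predict(np.array([np.pi / 2]), X, y, sigma=0.3))   # close to 1
```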

SLIDE 48

Locally weighted linear regression: example

• It gives more proper results than weighted kNN regression.
• Figure: fits with kernel width $\sigma$ equal to 1/16, 1/32, and 1/8 of the x-axis width.

[Figs. adopted from Andrew Moore's tutorial on "Instance-based learning"]

SLIDE 49

Locally weighted regression: summary

• Idea 1: weighted kNN regression
  - Use the weighted average of the outputs of $\mathbf{x}$'s neighbors (or of all training data):
    $\hat{y} = \dfrac{\sum_{i=1}^{k} y'^{(i)}\, K(\mathbf{x}, \mathbf{x}'^{(i)})}{\sum_{j=1}^{k} K(\mathbf{x}, \mathbf{x}'^{(j)})}$
• Idea 2: locally weighted parametric regression
  - Fit a parametric model (e.g., a linear function) to the neighbors of $\mathbf{x}$ (or to all training data).
• Implicit assumption: the target function is reasonably smooth.

SLIDE 50

Parametric vs. non-parametric methods

• Is the SVM classifier parametric?
  $\hat{y} = \operatorname{sign}\!\left( w_0 + \sum_{\alpha_i \ne 0} \alpha_i\, y^{(i)}\, K(\mathbf{x}, \mathbf{x}^{(i)}) \right)$
  - In general, we cannot summarize it in a simple parametric form.
  - We need to keep the support vectors around (possibly all of the training data).
  - However, the $\alpha_i$ are a kind of parameter that is found in the training phase.

SLIDE 51

Instance-based learning: summary

• Learning is just storing the training data.
  - Prediction on new data is based on the training data themselves.
• An instance-based learner does not rely on assumptions concerning the structure of the underlying density function.
• With large datasets, instance-based methods are slow for prediction on test data.
  - kd-trees, Locality-Sensitive Hashing (LSH), and other approximate kNN methods can help.

SLIDE 52

References

• T. Mitchell, "Machine Learning", 1997. [Chapter 8]
• C. M. Bishop, "Pattern Recognition and Machine Learning", Section 2.5.