

slide-1
SLIDE 1

Machine Learning

Statistical Geometry Processing

Winter Semester 2011/2012

slide-2
SLIDE 2

2

Topics

Topics

  • Machine Learning Intro
  • Learning is density estimation
  • The curse of dimensionality
  • Bayesian inference and estimation
  • Bayes rule in action
  • Discriminative and generative learning
  • Markov random fields (MRFs) and graphical models
  • Learning Theory
  • Bias and Variance / No free lunch
  • Significance
slide-3
SLIDE 3

Machine Learning

& Bayesian Statistics

slide-4
SLIDE 4

4

Statistics

How does machine learning work?

  • Learning: learn a probability distribution
  • Classification: assign probabilities to data

We will look only at classification problems:

  • Distinguish two classes of objects
  • From ambiguous data
slide-5
SLIDE 5

5

(Figure: automatic scale display showing "Banana 1.25 kg, Total 13.15 €")

Application

Application Scenario:

  • Automatic scales at supermarket
  • Detect type of fruit using a camera


slide-6
SLIDE 6

6

Learning Probabilities

Toy Example:

  • We want to distinguish pictures of oranges and bananas
  • We have 100 training pictures for each fruit category
  • From this, we want to derive a rule to distinguish the pictures automatically

slide-7
SLIDE 7

7

Learning Probabilities

Very simple algorithm:

  • Compute average color
  • Learn distribution

(Figure: training samples plotted in the red-green color plane)
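A minimal sketch of this step, assuming made-up image arrays and NumPy (the helper names and data are illustrative, not from the slides): each training image is reduced to its average red/green value, and one Gaussian (mean and covariance) is fitted per fruit class.

```python
import numpy as np

def average_color(image):
    """Reduce an RGB image (H x W x 3, values in [0, 1]) to its mean (red, green) value."""
    return image.reshape(-1, 3).mean(axis=0)[:2]

def fit_gaussian(features):
    """Fit a 2D Gaussian (mean vector, covariance matrix) to the rows of `features`."""
    features = np.asarray(features)
    return features.mean(axis=0), np.cov(features, rowvar=False)

# Hypothetical training data: 100 random arrays per class, standing in for real photos.
rng = np.random.default_rng(0)
banana_images = rng.uniform(0.5, 1.0, size=(100, 8, 8, 3))
orange_images = rng.uniform(0.3, 0.9, size=(100, 8, 8, 3))

banana_model = fit_gaussian([average_color(img) for img in banana_images])
orange_model = fit_gaussian([average_color(img) for img in orange_images])
print("banana mean color:", banana_model[0])
print("orange mean color:", orange_model[0])
```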

slide-8
SLIDE 8

8

Learning Probabilities

(Figure: learned distribution in the red-green color plane)

slide-9
SLIDE 9

9

Simple Learning

Simple Learning Algorithms:

  • Histograms
  • Fitting Gaussians
  • We will see more

(Figure: red-green feature space, dim = 2..3)

slide-10
SLIDE 10

10

Learning Probabilities

(Figure: learned class distributions in the red-green color plane)

slide-11
SLIDE 11

11

Learning Probabilities

(Figure: red-green feature space with the banana-orange decision boundary; query points classified as "banana" (p = 51%), "banana" (p = 90%), and "orange" (p = 95%))

slide-12
SLIDE 12

12

Machine Learning

Very simple idea:

  • Collect data
  • Estimate probability distribution
  • Use learned probabilities for classification (etc.)
  • We always decide for the most likely case

(largest probability)

Easy to see:

  • If the probability distributions are known exactly,

this decision is optimal (in expectation)

  • “Minimal Bayesian risk classifier”
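As a hedged sketch of the "decide for the largest probability" rule: given Gaussians fitted as above and assumed class priors (both made up here), a new average color is classified by the largest likelihood times prior.

```python
import numpy as np

def gaussian_density(x, mean, cov):
    """Evaluate a multivariate normal density at point x."""
    diff = x - mean
    normalizer = np.sqrt((2 * np.pi) ** len(mean) * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / normalizer

def classify(x, models, priors):
    """Minimal-Bayes-risk decision: pick the class maximizing p(x | class) * P(class)."""
    scores = {name: gaussian_density(x, mean, cov) * priors[name]
              for name, (mean, cov) in models.items()}
    return max(scores, key=scores.get)

# Hypothetical learned models (mean color, covariance) and class priors.
models = {"banana": (np.array([0.8, 0.8]), np.eye(2) * 0.01),
          "orange": (np.array([0.9, 0.5]), np.eye(2) * 0.01)}
priors = {"banana": 0.6, "orange": 0.4}
print(classify(np.array([0.85, 0.75]), models, priors))  # -> banana
```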
slide-13
SLIDE 13

13

What is the problem?

Why is machine learning difficult?

  • We need to learn the probabilities
  • Typical problem: High dimensional input data
slide-14
SLIDE 14

14

High Dimensional Spaces

color: 3D (RGB); image: 100 × 100 pixels → 30,000 dimensions

slide-15
SLIDE 15

15

High Dimensional Spaces

(Figure: average-color learning (dim = 2..3) vs. full-image learning (30,000 dimensions))

slide-16
SLIDE 16

16

High Dimensional Spaces

High dimensional probability spaces:

  • Too much space to fill
  • We can never get a sufficient number of examples
  • Learning is almost impossible

What can we do?

  • We need additional assumptions
  • Simplify probability space
  • Model statistical dependencies

This makes machine learning a hard problem.

slide-17
SLIDE 17

17

Learn From High Dimensional Input

Learning Strategies:

  • Features to reduce the dimension
  • Average color
  • Boundary shape
  • Other heuristics

Usually chosen manually. (black magic?)

  • High-dimensional learning techniques
  • Neural networks (old school)
  • Support vector machines (current “standard” technique)
  • Ada-boost, decision trees, ... (many other techniques)
  • Usually used in combination
slide-18
SLIDE 18

18

Basic Idea: Neural Networks

Classic Solution: Neural Networks

  • Non-linear functions
  • Features as input
  • Combine basic functions with weights
  • Optimize to yield
  • (1,0) on bananas
  • (0,1) on oranges
  • Fit non-linear decision boundary to data

(Figure: network diagram with inputs, weights w1, w2, ..., and outputs)
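To make "combine basic functions with weights, optimize to yield (1,0) / (0,1)" concrete, here is a rough sketch of a one-hidden-layer network trained with plain gradient descent on squared error; the toy 2D features, network size, and learning rate are all assumptions for illustration, not the slides' setup.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy 2D features: two clusters standing in for "banana" and "orange" feature vectors.
X = np.vstack([rng.normal([0.8, 0.8], 0.05, (100, 2)),    # class 0
               rng.normal([0.9, 0.5], 0.05, (100, 2))])   # class 1
X -= X.mean(axis=0)                                        # center features for easier training
T = np.vstack([np.tile([1.0, 0.0], (100, 1)),              # target (1, 0) on class 0 ("banana")
               np.tile([0.0, 1.0], (100, 1))])             # target (0, 1) on class 1 ("orange")

# One hidden layer of 8 units; W1, W2 hold the weights "w1, w2, ..." being optimized.
W1, b1 = rng.normal(0, 0.5, (2, 8)), np.zeros(8)
W2, b2 = rng.normal(0, 0.5, (8, 2)), np.zeros(2)

learning_rate = 1.0
for step in range(5000):
    H = sigmoid(X @ W1 + b1)                 # hidden activations
    Y = sigmoid(H @ W2 + b2)                 # network outputs
    dZ2 = (Y - T) * Y * (1 - Y)              # squared-error gradient through the output sigmoid
    dZ1 = (dZ2 @ W2.T) * H * (1 - H)         # backpropagated to the hidden layer
    W2 -= learning_rate * H.T @ dZ2 / len(X); b2 -= learning_rate * dZ2.mean(axis=0)
    W1 -= learning_rate * X.T @ dZ1 / len(X); b1 -= learning_rate * dZ1.mean(axis=0)

predictions = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).argmax(axis=1)
print("training accuracy:", (predictions == np.array([0] * 100 + [1] * 100)).mean())
```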

slide-19
SLIDE 19

19

Neural Networks

(Figure: neural network with layers l1, l2, ..., inputs, outputs, and a bottleneck layer)

slide-20
SLIDE 20

20

Support Vector Machines

(Figure: training set and the best separating hyperplane)

slide-21
SLIDE 21

21

Kernel Support Vector Machine

Example Mapping:

$(x, y) \mapsto (x^2, y^2, xy, x, y)$

original space → "feature space"
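A small sketch of why such a mapping helps (the circular toy data set is an assumption for illustration): points inside vs. outside a circle are not linearly separable in the original (x, y) space, but after mapping to (x², y², xy, x, y) the boundary x² + y² = r² becomes a hyperplane with normal w = (1, 1, 0, 0, 0).

```python
import numpy as np

def phi(p):
    """Explicit feature map (x, y) -> (x^2, y^2, xy, x, y), as on the slide."""
    x, y = p
    return np.array([x * x, y * y, x * y, x, y])

rng = np.random.default_rng(2)
points = rng.uniform(-1.0, 1.0, size=(500, 2))
labels = (points[:, 0] ** 2 + points[:, 1] ** 2 > 0.5).astype(int)   # outside circle r^2 = 0.5

# Linear classifier in feature space: w . phi(p) + bias > 0  <=>  x^2 + y^2 > 0.5
w, bias = np.array([1.0, 1.0, 0.0, 0.0, 0.0]), -0.5
predictions = np.array([int(w @ phi(p) + bias > 0) for p in points])
print("agreement with true labels:", (predictions == labels).mean())  # -> 1.0
```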

slide-22
SLIDE 22

22

Other Learning Algorithms

Popular Learning Algorithms

  • Fitting Gaussians
  • Linear discriminant functions
  • Ada-boost
  • Decision trees
  • ...
slide-23
SLIDE 23

More Complex Learning Tasks

slide-24
SLIDE 24

24

Learning Tasks

Examples of Machine Learning Problems

  • Pattern recognition
  • Single class (banana / non-banana)
  • Multi class (banana, orange, apple, pear)
  • Howto: Density estimation, highest density minimizes risk
  • Regression
  • Fit curve to sparse data
  • Howto: Curve with parameters, density estimation for

parameters

  • Latent variable regression
  • Regression between observables and hidden variables
  • Howto: Parametrize, density estimation
slide-25
SLIDE 25

25

Supervision

Supervised learning

  • Training set is labeled

Semi-supervised

  • Part of the training set is labeled

Unsupervised

  • No labels, find structure on your own (“Clustering”)

Reinforcement learning

  • Learn from experience (losses/gains; robotics)
slide-26
SLIDE 26

26

Principle

(Diagram: training set x1, x2, ..., xn → model parameters → hypothesis)

slide-27
SLIDE 27

27

Two Types of Learning

Estimation:

  • Output most likely parameters
  • Maximum density

– “Maximum likelihood” – “Maximum a posteriori”

  • Mean of the distribution

Inference:

  • Output probability density
  • Distribution for parameters
  • More information
  • Marginalize to reduce dimension

(Figure: density p(x) with its maximum, its mean, and the full distribution marked, for two differently shaped densities)

slide-28
SLIDE 28

28

Bayesian Models

Scenario

  • Customer picks banana (X = 0) or orange (X = 1)
  • Object X creates image D

Modeling

  • Given image D (observed), what was X (latent)?

$P(X \mid D) = \dfrac{P(D \mid X)\,P(X)}{P(D)}$, i.e. $P(X \mid D) \propto P(D \mid X)\,P(X)$

slide-29
SLIDE 29

29

Bayesian Models

Model for Estimating X:

$P(X \mid D) \propto P(D \mid X)\,P(X)$

posterior ∝ data term (likelihood) × prior

slide-30
SLIDE 30

30

Generative vs. Discriminative

Generative Model: Properties

  • Comprehensive model:

Full description of how data is created

  • Might be complex (how to create images of fruit?)

$P(\text{fruit} \mid \text{img}) \propto P(\text{img} \mid \text{fruit})\,P(\text{fruit})$

  • Generative approach: learn the likelihood P(img | fruit) and the prior P(fruit) (frequency of fruits), then compute the posterior P(fruit | img)

slide-31
SLIDE 31

31

Generative vs. Discriminative

Discriminative Model: Properties

  • Easier:
  • Learn mapping from phenomenon to explanation
  • Not trying to explain / understand the whole phenomenon
  • Often easier, but less powerful

$P(\text{fruit} \mid \text{img}) \propto P(\text{img} \mid \text{fruit})\,P(\text{fruit})$

  • Discriminative approach: learn the posterior P(fruit | img) directly; ignore the likelihood P(img | fruit) and the prior P(fruit)
slide-32
SLIDE 32

Statistical Dependencies

Markov Random Fields and Graphical Models

slide-33
SLIDE 33

33

Problem

Estimation Problem:

  • X = 3D mesh (10K vertices)
  • D = noisy scan (or the like)
  • Assume P(D|X) is known
  • But: Model P(X) cannot be built
  • Not even enough training data
  • In this part of the universe :-)

$P(X \mid D) \propto P(D \mid X)\,P(X)$

posterior ∝ data term (likelihood) × prior

(Figure: X lives in a 30,000-dimensional space, too large to learn P(X) directly)

slide-34
SLIDE 34

34

Reducing dependencies

Problem:

  • p(x1, x2, ..., x10000) is too high-dimensional
  • k states, n variables: O(k^n) density entries
  • General dependencies kill the model

Idea

  • Hand-craft dependencies
  • We might know or guess what actually depends on each other and what does not

  • This is the art of machine learning
slide-35
SLIDE 35

35

Graphical Models

Factorize Models

  • Pairwise models (one factor per variable and one per dependent pair):

$p(x_1, \dots, x_n) = \frac{1}{Z} \prod_{i=1}^{n} p_i^{(1)}(x_i) \prod_{(i,j) \in E} p_{i,j}^{(2)}(x_i, x_j)$

  • Model complexity:
  • O(nk^2) parameters
  • Higher order models:
  • Triplets, quadruples as factors
  • Local neighborhoods

(Figure: grid of variables x1 ... x12 connected by pairwise factors f_{1,2}, f_{2,3}, ...; unary factors $p_i^{(1)}(x_i)$ and pairwise factors $p_{i,j}^{(2)}(x_i, x_j)$)
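To show what the factorized density looks like in code, here is a minimal sketch that evaluates the unnormalized pairwise model for a small chain of variables; the unary and pairwise tables are made-up examples, and the normalization constant Z is computed by brute force, which is only feasible for tiny models.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n, k = 4, 3                                   # 4 variables, 3 states each
unary = rng.uniform(0.1, 1.0, size=(n, k))    # p_i^(1)(x_i), made-up positive factors
edges = [(0, 1), (1, 2), (2, 3)]              # chain-structured neighborhood E
pairwise = {e: rng.uniform(0.1, 1.0, size=(k, k)) for e in edges}   # p_ij^(2)(x_i, x_j)

def unnormalized(x):
    """Product of unary and pairwise factors for one assignment x = (x_1, ..., x_n)."""
    value = np.prod([unary[i, x[i]] for i in range(n)])
    value *= np.prod([pairwise[(i, j)][x[i], x[j]] for (i, j) in edges])
    return value

# Brute-force partition function Z (only k^n = 81 terms here; exponential in general).
Z = sum(unnormalized(x) for x in itertools.product(range(k), repeat=n))
print("p(x) =", unnormalized((0, 2, 2, 1)) / Z)
# Parameter count: O(n k) unary + O(|E| k^2) pairwise entries, instead of O(k^n) for a full table.
```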

slide-36
SLIDE 36

36

Graphical Models

Markov Random fields

  • Factorize density in local

“cliques”

Graphical model

  • Connect variables that are

directly dependent

  • Formal model:

Conditional independence

(Figure: grid of variables x1 ... x12 with unary factors $p_i^{(1)}(x_i)$ and pairwise factors $p_{i,j}^{(2)}(x_i, x_j)$)

slide-37
SLIDE 37

37

Graphical Models

Conditional Independence

  • A node is conditionally

independent of all others given the values of its direct neighbors

  • I.e. set these values to

constants, x7 is independent of all others

Theorem (Hammersley–Clifford):

  • Given conditional independence as graph, a (positive)

probability density factors over cliques in the graph

(Figure: grid of variables x1 ... x12; fixing the direct neighbors of x7 makes it independent of the rest)

slide-38
SLIDE 38

Example: Texture Synthesis

slide-39
SLIDE 39

(Figure: selected region and its completion)

slide-40
SLIDE 40

40

Texture Synthesis

Idea

  • One or more images as examples
  • Learn image statistics
  • Use knowledge:
  • Specify boundary conditions
  • Fill in texture

(Figure: example data and boundary conditions)

slide-41
SLIDE 41

41

The Basic Idea

Markov Random Field Model

  • Image statistics
  • How pixels are colored depends on the local neighborhood only (Markov Random Field)
  • Predict color from neighborhood

(Figure: pixel and its neighborhood)

slide-42
SLIDE 42

42

A Little Bit of Theory...

Image statistics:

  • An image of n × m pixels
  • Random variable: $\mathbf{x} = (x_{11}, \dots, x_{nm}) \in \{0, 1, \dots, 255\}^{n \times m}$
  • Probability distribution:

p(x) = p(x11, ..., xnm)

It is impossible to learn full images from examples!

256 choices per pixel ⇒ $256^{n \times m}$ probability values

slide-43
SLIDE 43

43

Simplification

Problem:

  • Statistical dependencies
  • A simple model can express dependencies on all kinds of combinations

Markov Random Field:

  • Each pixel is conditionally independent of the rest of the

image given a small neighborhood

  • In English: likelihood only depends on neighborhood, not

rest of the image

slide-44
SLIDE 44

44

Markov Random Field

Example:

  • The red pixel depends on the light red region
  • Not on the black region
  • If the region is known, the probability is fixed and independent of the rest

However:

  • Regions overlap
  • Indirect global dependency

(Figure: pixel and its neighborhood)

slide-45
SLIDE 45

45

Texture Synthesis

Use for Texture Synthesis

$p_{i,j} = p_{i,j}(N_{i,j}) = p_{i,j}(x_{i-l,j-l}, \dots, x_{i+l,j+l}) \propto \exp\!\left(-\frac{\operatorname{dist}(N_{i,j}, \text{data})^2}{2\sigma^2}\right)$

$p(\mathbf{x}) = \frac{1}{Z} \prod_{i=1}^{n} \prod_{j=1}^{m} p_{i,j}(N_{i,j})$

(N_{i,j} denotes the neighborhood of pixel (i, j).)

slide-46
SLIDE 46

46

Inference

Inference Problem

  • Computing p(x) is trivial for known x.
  • Finding the x that maximizes p(x) is very complicated.
  • In general: NP-hard
  • No efficient solution known (not even for the image case)

In practice

  • Different approximation strategies

("heuristics", strict approximation is also NP-hard)

slide-47
SLIDE 47

47

Simple Practical Algorithm

Here is the short story:

  • Unknown pixels: consider the known neighborhood
  • Match it to all of the known data
  • Copy the pixel with the best matching neighborhood (see the sketch below)
  • Region growing, outside in

Approximation only

  • Can run into bad local minima
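A heavily reduced sketch of this greedy idea: it fills pixels row by row using only three already-known neighbors (left, up, up-left) instead of true outside-in region growing, and it uses random numbers as a stand-in for a real example texture.

```python
import numpy as np

def synthesize(example, height, width):
    """Grow a new grayscale texture pixel by pixel: for each unknown pixel, compare its
    already-filled (left, up, up-left) neighbors with every position in the example image
    and copy the example pixel whose neighborhood matches best."""
    out = np.zeros((height, width))
    out[0, :] = example[0, :width]             # seed first row and column from the example
    out[:, 0] = example[:height, 0]
    eh, ew = example.shape
    for i in range(1, height):
        for j in range(1, width):
            target = np.array([out[i, j - 1], out[i - 1, j], out[i - 1, j - 1]])
            best_value, best_dist = 0.0, np.inf
            for a in range(1, eh):             # exhaustive match against the known data
                for b in range(1, ew):
                    cand = np.array([example[a, b - 1], example[a - 1, b], example[a - 1, b - 1]])
                    dist = float(np.sum((cand - target) ** 2))
                    if dist < best_dist:
                        best_value, best_dist = example[a, b], dist
            out[i, j] = best_value             # greedy copy; can get stuck in local minima
    return out

example = np.random.default_rng(4).random((16, 16))   # stand-in for a real example texture
print(synthesize(example, 8, 8).round(2))
```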
slide-48
SLIDE 48

Learning Theory

There is no such thing as a free lunch...

slide-49
SLIDE 49

49

Overfitting

Problem: Overfitting

  • Two steps:
  • Learn model on training data
  • Use model on more data (“test data”)
  • Overfitting
  • High accuracy in training is no guarantee for later performance
slide-50
SLIDE 50

50

Learning Probabilities

(Figure: red-green feature space with possible banana-orange decision boundaries)

slide-51
SLIDE 51

51

Learning Probabilities

(Figure: red-green feature space with possible banana-orange decision boundaries)

slide-52
SLIDE 52

52

Learning Probabilities

(Figure: red-green feature space with possible banana-orange decision boundaries)

slide-53
SLIDE 53

53

Regression Example

Housing Prices in Springfield

(Plot: housing prices in Springfield, 100 K to 600 K, 1960 to 2010)

Disclaimer: numbers are made up; this is not investment advice.

slide-54
SLIDE 54

54

Regression Example

Housing Prices in Springfield

(Plot: housing prices in Springfield, 100 K to 600 K, 1960 to 2010)

Disclaimer: numbers are made up; this is not investment advice.

slide-55
SLIDE 55

55

Regression Example

Housing Prices in Springfield

(Plot: housing prices in Springfield, 100 K to 600 K, 1960 to 2010)

Disclaimer: numbers are made up; this is not investment advice.

slide-56
SLIDE 56

56

Regression Example

Housing Prices in Springfield

(Plot: housing prices in Springfield, 100 K to 600 K, 1960 to 2010)

Disclaimer: numbers are made up; this is not investment advice.

slide-57
SLIDE 57

57

Regression Example

Housing Prices in Springfield

(Plot: housing prices in Springfield, 1960 to 2010, annotated: oil crisis (recession), up again, housing bubble, great recession starts)

Disclaimer: numbers are made up; this is not investment advice.

slide-58
SLIDE 58

58

Bias – Variance Tradeoff

There is a trade-off:

Bias:

  • Coarse prior assumptions to regularize the model

Variance:

  • Bad generalization performance
slide-59
SLIDE 59

59

Model Selection

How to choose the right model? For example

  • Linear
  • Quadratic
  • Higher order

Standard heuristic: Cross validation

  • Partition the data into two parts (halves, leave-one-out, ...)
  • Train on part 1, test on part 2
  • Choose according to performance on part 2
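A minimal sketch of this heuristic for the polynomial-regression example; the synthetic "housing price" data, the 50/50 split, and the candidate degrees are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
years = rng.uniform(1960, 2010, size=40)
prices = 200 + 4.0 * (years - 1960) + rng.normal(0, 25, size=40)  # noisy synthetic prices (in K)
x = (years - 1985) / 25.0        # rescale years to roughly [-1, 1] for stable polynomial fitting

# Partition the data into two halves: train on part 1, evaluate on part 2.
order = rng.permutation(len(years))
train, test = order[:20], order[20:]

def validation_error(degree):
    """Fit a degree-d polynomial on part 1 and return the mean squared error on part 2."""
    coefficients = np.polyfit(x[train], prices[train], degree)
    predictions = np.polyval(coefficients, x[test])
    return np.mean((predictions - prices[test]) ** 2)

errors = {d: validation_error(d) for d in (1, 2, 5, 10)}   # linear, quadratic, higher order
print(errors)
print("chosen degree:", min(errors, key=errors.get))        # pick by performance on part 2
```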
slide-60
SLIDE 60

60

Cross Validation

Housing Prices in Springfield

(Plot: housing prices in Springfield, 100 K to 600 K, 1960 to 2010)

Disclaimer: numbers are made up; this is not investment advice.

slide-61
SLIDE 61

61

Cross Validation

Housing Prices in Springfield

(Plot: housing prices in Springfield, 100 K to 600 K, 1960 to 2010)

Disclaimer: numbers are made up; this is not investment advice.

slide-62
SLIDE 62

62

Cross Validation

Housing Prices in Springfield

(Plot: housing prices in Springfield, 100 K to 600 K, 1960 to 2010)

Disclaimer: numbers are made up; this is not investment advice.

slide-63
SLIDE 63

63

No Free Lunch Theorem

Given

  • Labeling problem (holds in general as well)
  • Data $\mathbf{x}_i \in \Omega$ (for example: images of fruit)
  • Labels $l_i \in \{1, \dots, k\}$ (for example: fruit type)
  • Training data $D = \{(\mathbf{x}_1, l_1), \dots, (\mathbf{x}_n, l_n)\}$

Looking for

  • Hypothesis h that works everywhere on Ω
  • 1 MPixel photos: $256^{1\,000\,000}$ possible data items
  • Cannot cover everything with examples
  • Off-training error: predictions on $\Omega \setminus D$
slide-64
SLIDE 64

64

No Free Lunch Theorem

Unknown:

  • True labeling function $L: \Omega \to \{1, \dots, k\}$

Assumption

  • No prior information
  • All true labeling functions are equally likely

Theorem (“no free lunch”)

  • Under these assumptions, all learning algorithms have the

same expected performance (i.e.: averaged over all potential true L)

slide-65
SLIDE 65

65

Consequences

Without prior knowledge:

  • The expected off-training error of the following

algorithms is the same

  • Fancy Multi-Class Support Vector machine
  • Output random numbers
  • Output always 0
  • Learning with cross validation

There is no “ultimate learning algorithm”

  • Learning from data needs further knowledge (structure

assumptions)

  • No truly “fully automatic” machine learning
slide-66
SLIDE 66

66

Example: Regression

Housing Prices in Springfield

(Plot: housing prices in Springfield, 100 K to 600 K, 1960 to 2010)

slide-67
SLIDE 67

67

Example: Regression

Housing Prices in Springfield

(Plot: housing prices in Springfield, 100 K to 600 K, 1960 to 2010)

slide-68
SLIDE 68

68

Example: Regression

Housing Prices in Springfield

(Plot: housing prices in Springfield, 1960 to 2010; same likelihood for all in-between values)

slide-69
SLIDE 69

69

Example: Density Estimation

Relativity of orange-banana spaces vs. "smooth densities" (in this case: Gaussians)

slide-70
SLIDE 70

70

Significance and Capacity

Scenario

  • We have two hypotheses h0, h1
  • One is correct

Solution

  • Choose the one with higher likelihood

Significance test

  • For example: Does new drug help?
  • h0: Just random outcome
  • Show that P(h0) is small
slide-71
SLIDE 71

71

Machine Learning: Capacity

We have:

  • Complex models

Example

  • Polynomial fitting
  • d continuous parameters $a_i$:

$p(x) = \sum_{i=0}^{d-1} a_i x^i$

  • "Capacity" grows with d

(Plot: housing prices in Springfield, 100 K to 600 K, 1960 to 2010)

slide-72
SLIDE 72

72

Significance?

Simple criterion

  • Model must be able to predict training data
  • Order d – 1 polynomial can always fit d points perfectly
  • Credit card numbers: 16 digits, 15-th order polynomial?
  • Need O(d) training points at least
  • Random sampling: Overhead
  • d bins need O(d log d) random draws
  • Rule of thumb “10 samples per parameter”
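A quick numeric check of the claim that an order d − 1 polynomial can always fit d points perfectly; the 16 "credit card digits" are made up for illustration.

```python
import numpy as np

digits = np.array([4, 5, 3, 9, 1, 4, 8, 8, 0, 3, 4, 3, 6, 4, 6, 7], dtype=float)  # made-up digits
positions = np.linspace(-1.0, 1.0, 16)   # scaled positions keep the fit numerically stable

# A degree-15 polynomial has 16 free parameters, so it interpolates all 16 points exactly.
coefficients = np.polyfit(positions, digits, deg=15)
residual = np.max(np.abs(np.polyval(coefficients, positions) - digits))
print("max fitting error:", residual)    # ~0: perfect on training data, says nothing about new data
```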
slide-73
SLIDE 73

73

Simple Model

Single Hypothesis

  • Hypothesis $h: \mathbb{R}^d \to \{0,1\}$, maps features to decisions
  • Ground truth $f: \mathbb{R}^d \to \{0,1\}$, the correct labeling
  • Stream of data, drawn i.i.d. from a fixed distribution $\mathcal{D}$: $(\mathbf{x}_i, y_i) \sim \mathcal{D}$ with $f(\mathbf{x}_i) = y_i$
  • Expected error: $\vartheta(h) = P_{\mathcal{D}}\big[\, h(\mathbf{x}) \neq f(\mathbf{x}) \,\big]$

slide-74
SLIDE 74

74

Simple Model

Empirical vs. True Error

  • Infinite stream $(\mathbf{x}_i, y_i) \sim \mathcal{D}$, drawn i.i.d.
  • Finite training set $\{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)\} \sim \mathcal{D}$, drawn i.i.d.
  • Expected error:

$\vartheta(h) = P_{\mathcal{D}}\big[\, h(\mathbf{x}) \neq f(\mathbf{x}) \,\big]$

  • Empirical error (training error):

$\hat{\vartheta}(h) = \frac{1}{n} \sum_{i=1}^{n} \big|\, h(\mathbf{x}_i) - f(\mathbf{x}_i) \,\big|$

  • Bernoulli experiment: Chernoff bound

$P\big[\, |\hat{\vartheta}(h) - \vartheta(h)| > \delta \,\big] \le 2 \exp(-2\delta^2 n)$
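A quick numeric illustration of how fast this bound shrinks with n; the deviation δ = 0.05 is an arbitrary example value.

```python
import math

def chernoff_bound(delta, n):
    """Bound on P[ |empirical error - true error| > delta ] for n i.i.d. training examples."""
    return 2.0 * math.exp(-2.0 * delta ** 2 * n)

for n in (100, 1000, 10000):
    print(n, chernoff_bound(0.05, n))
# n = 100 gives ~1.21 (vacuous), n = 1000 gives ~0.013, n = 10000 gives ~3.9e-22
```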

slide-75
SLIDE 75

75

Simple Model

Empirical vs. True Error

  • Finite training set $\{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)\} \sim \mathcal{D}$, drawn i.i.d.
  • Training error bound:

$P\big[\, |\hat{\vartheta}(h) - \vartheta(h)| > \delta \,\big] \le 2 \exp(-2\delta^2 n)$

Result

  • Reliability of assessment of hypothesis quality grows

quickly with increasing number of trials

  • We can bound generalization error
slide-76
SLIDE 76

76

Machine Learning

We have multiple hypotheses

  • Hypothesis set $\mathcal{H} = \{h_1, \dots, h_k\}$
  • Need a bound on the generalization-error estimate for all of them after n training examples:

$P\big[\, \exists h_i \in \mathcal{H}: |\hat{\vartheta}(h_i) - \vartheta(h_i)| > \delta \,\big] = P\big[\, h_1 \text{ breaks} \cup \dots \cup h_k \text{ breaks} \,\big] \le k \, P\big[\, h_i \text{ breaks} \,\big] = 2k \exp(-2\delta^2 n)$

slide-77
SLIDE 77

77

Machine Learning

Result

  • After n training examples, we know the training error up to $\delta$, uniformly for all k hypotheses, with probability at least $1 - 2k \exp(-2\delta^2 n)$.
  • For this to hold with probability at least $1 - \varepsilon$, it is sufficient to use

$n \ge \frac{1}{2\delta^2} \log \frac{2k}{\varepsilon}$    (logarithmic in k)

training examples.

  • With probability $1 - \varepsilon$, the error is bounded by

$\forall h_i \in \mathcal{H}: \; |\hat{\vartheta}(h_i) - \vartheta(h_i)| \le \sqrt{\frac{1}{2n} \log \frac{2k}{\varepsilon}}$
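A small sketch that turns the sample-size bound into numbers (δ, ε, and the hypothesis-set sizes k are arbitrary example values); note how weakly n depends on k.

```python
import math

def samples_needed(delta, epsilon, k):
    """n >= 1/(2 delta^2) * log(2k / epsilon): training examples needed so that all k
    hypotheses have empirical error within delta of the true error, w.p. >= 1 - epsilon."""
    return math.ceil(math.log(2 * k / epsilon) / (2 * delta ** 2))

for k in (10, 1000, 10 ** 6):
    print(k, samples_needed(delta=0.05, epsilon=0.01, k=k))
# k = 10 -> 1521, k = 1000 -> 2442, k = 10**6 -> 3823: only logarithmic growth in k
```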

slide-78
SLIDE 78

78

Empirical Risk Minimization

ERM Learning Algorithm

  • Evaluate all hypotheses $\mathcal{H} = \{h_1, \dots, h_k\}$ on the training set
  • Choose the $\hat{h}$ with the lowest training error:

$\hat{h} = \arg\min_{i = 1..k} \hat{\vartheta}(h_i)$

  • "Empirical Risk Minimization"
slide-79
SLIDE 79

79

Empirical Risk Minimization

Guarantees

  • When using empirical risk minimization
  • With probability ≥ 1 − 𝜀, we get:
  • Not far from optimum:

$\vartheta(\hat{h}) \le \vartheta(h_{\text{best}}) + 2\delta$

  • Trade-off:

$\vartheta(\hat{h}) \le \underbrace{\vartheta(h_{\text{best}})}_{\text{Bias}} + \underbrace{2\sqrt{\tfrac{1}{2n} \log \tfrac{2k}{\varepsilon}}}_{\text{Variance}}$

slide-80
SLIDE 80

80

Generalization

Can be generalized

  • For multi-class learning, regression, etc.

Continuous sets of hypotheses

  • Simple: k bits encode hypothesis
  • More sophisticated model:

Vapnik-Chervonenkis (VC) dimension

  • "Capacity" of a classifier
  • Maximum number of points that can be labeled arbitrarily by the hypothesis set
  • $O(\mathrm{VC}(\mathcal{H}))$ training examples needed

slide-81
SLIDE 81

81

Conclusion

Two theoretical insights

  • No free lunch:

Without additional information, no prediction possible about off-training examples

  • Significance: Yes, we can...
  • ...estimate expected generalization error with high probability
  • ...choose a good hypothesis from a set (with h.p. / error bounds)
slide-82
SLIDE 82

82

Conclusion

Two theoretical insights

  • There is no contradiction here
  • Still, some non-training points might be misclassified all the time
  • But they cannot show up frequently
  • Have to choose hypothesis set

– Infinite capacity leads to unbounded error – Thus: We do need prior knowledge

slide-83
SLIDE 83

83

Conclusions

Machine Learning

  • Is basically density estimation
  • Curse of dimensionality
  • High dimensionality makes things intractable
  • Model dependencies to fight the problem
slide-84
SLIDE 84

84

Conclusions

Machine Learning

  • No free lunch
  • You can only learn when you already know something
  • Math won’t tell you where knowledge initially came from
  • Significance
  • Beware of overfitting!
  • Need to adapt plasticity of model to available training data
slide-85
SLIDE 85

85

Recommended Further Readings

Short intro:

  • Aaron Hertzman: Siggraph 2004 Course

“Introduction to Bayesian Learning” http://www.dgp.toronto.edu/~hertzman/ibl2004/

Bayesian learning, no free lunch:

  • R. Duda, P. Hart, D. Stork: Pattern Classification, 2nd edition, Wiley.

Significance of multiple hypotheses:

  • Andrew Ng, Stanford University

“CS 229 – Machine Learning” Course notes (Lecture 4) http://cs229.stanford.edu/materials.html