CS 678 Machine Learning

Lecture Notes

1 Week 1 - chapter 1 and probability

1.1 General

syllabus; what do students know (prog. lang., stats, math, calculus)

1.2 machine learning

1.2.1 general concepts

  • example of predicting basketball players (height and speed)
  • detecting patterns or regularities
  • application of ML to large databases is data mining
  • pattern recognition (face recognition, fingerprint, character, etc.)
  • combines math, statistics and computer science

1.2.2 examples of ML

  • learning associations
  • classification

    – classes
    – discriminant, prediction
    – OCR, face recognition, medical diagnosis, speech recognition
    – knowledge extraction, compression, outlier detection

  • regression


1.3 probability

  • events, probability and sample space
  • axioms

    – 0 ≤ P(E) ≤ 1
    – P(S) = 1, example:
      ∗ Ei = the die shows i
      ∗ S = E1 ∪ E2 ∪ E3 ∪ E4 ∪ E5 ∪ E6
      ∗ p(E2) = 1/6
      ∗ p(S) = 1
    – P(∪i Ei) = Σ_i P(Ei) for mutually exclusive events Ei
    – P(E ∪ Ec) = P(E) + P(Ec) = 1
    – P(E ∪ F) = P(E) + P(F) − P(E ∩ F)

  • conditional prob:

    – P(E|F) = P(E ∩ F)/P(F)
    – P(F|E) = P(E|F)P(F)/P(E), Bayes' formula (show derivation)
      ∗ E = have lung cancer, F = smoke
      ∗ P(E) = (people with lung cancer)/(all people) = .05
      ∗ P(F) = (people who smoke)/(all people) = .50
      ∗ P(F|E) = (people who smoke and have lung cancer)/(people who have lung cancer) = .80
      ∗ P(E|F) = .80 · .05/.5 = .08 (see the sketch at the end of this section)
    – marginals
      ∗ P(X) = Σ_i P(X|Yi)P(Yi)
      ∗ Ei = first die is i
      ∗ P(T = 7|Ei) = 1/6
      ∗ so P(T = 7) = P(T = 7|E1)P(E1) + P(T = 7|E2)P(E2) + ... = 6 · (1/36) = 1/6
      ∗ also do the same with P(E3)
      ∗ can also be done with continuous distributions...
    – P(E1|F) = P(F|E1)P(E1)/P(F) = P(F|E1)P(E1) / Σ_i P(F|Ei)P(Ei)

    – P(E ∩ F) = P(E)P(F) if E and F are independent
      ∗ P(E|F) = P(E ∩ F)/P(F)
      ∗ P(E ∩ F) = P(E|F)P(F)
      ∗ so if E and F are independent, P(E|F) = P(E)
      ∗ for example, given the first die is 2, P(die2 = 3) = 1/6
      ∗ independence is THE big assumption in machine learning: i.i.d.

  • random variables


    – probability distributions
      ∗ F(a) = P{X ≤ a}
      ∗ P{a < X ≤ b} = F(b) − F(a)
      ∗ F(a) = Σ_{x ≤ a} P(x) (discrete)
      ∗ F(a) = ∫_{−∞}^{a} p(x)dx (continuous)
    – joint distributions
      ∗ F(x, y) = P{X ≤ x, Y ≤ y}
      ∗ FX(x) = P{X ≤ x, Y ≤ ∞} marginal (show both the discrete and continuous)
    – conditional distributions: P_{X|Y}(x|y) = P{X = x, Y = y}/P{Y = y}
    – Bayes' rule: P(y|x) = P(x|y)P_Y(y)/P_X(x) (posterior = likelihood · prior / evidence)
    – expectation (mean): E[X] = Σ_i x_i P(x_i) or E[X] = ∫ x p(x)dx
    – variance: Var(X) = E[(X − µ)^2] = E[X^2] − µ^2
    – distributions
      ∗ binomial
      ∗ multinomial
      ∗ uniform
      ∗ normal
      ∗ others (chi-sq, t, F, etc.)
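A minimal Python sketch checking the Bayes'-rule and marginalization arithmetic from the examples above (the probability values are the ones used in the notes):

```python
# Bayes' rule check for the smoking / lung-cancer example.
# E = has lung cancer, F = smokes.
p_E = 0.05          # P(E): fraction of people with lung cancer
p_F = 0.50          # P(F): fraction of people who smoke
p_F_given_E = 0.80  # P(F|E): fraction of lung-cancer patients who smoke

# Bayes' rule: P(E|F) = P(F|E) * P(E) / P(F)
p_E_given_F = p_F_given_E * p_E / p_F
print(p_E_given_F)  # 0.08

# Marginalization check for two dice: P(T = 7) = sum_i P(T = 7 | E_i) P(E_i)
p_T7 = sum((1 / 6) * (1 / 6) for first_die in range(1, 7))
print(p_T7)         # 0.1666... = 1/6
```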


2 Week 2 - chapter 2 supervised learning

2.1 learning from examples

  • positive, negative examples
  • x = x1 ... xd input representation (just the pertinent attributes)
  • X = {x^t, r^t}_{t=1}^{N}
  • hypothesis h, hypothesis class, parameters. h(x) = 1 if h classifies x as positive
  • empirical error - number of predictions that do not match the labels in X: E(h|X) = Σ_{t=1}^{N} 1(h(x^t) ≠ r^t)
  • generalization - most specific hypothesis (S) vs. most general hypothesis (G) (trading off false positives and false negatives)
  • doubt - instances falling in G − S (inside G but outside S) are not certain, so we do not make a decision

2.2 vapnik-chervonenkis dimension

The VC dimension of a hypothesis class H is the maximum number of points that can be shattered by H, i.e. for which every possible labeling can be realized by some h ∈ H. Draw example with 4 points and axis-aligned rectangles (see the sketch below).
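A minimal sketch of that 4-point check: for axis-aligned rectangles, a labeling is realizable exactly when the tightest rectangle around the positive points excludes every negative point. The diamond configuration below is an illustrative assumption, not a layout from the notes.

```python
from itertools import combinations

# Four points in a "diamond" configuration (illustrative choice, not from the notes).
points = [(0, 1), (1, 0), (0, -1), (-1, 0)]

def realizable(pos, neg):
    """Can an axis-aligned rectangle contain all of pos and none of neg?"""
    if not pos:
        return True  # a degenerate (empty) rectangle labels everything negative
    xs = [p[0] for p in pos]
    ys = [p[1] for p in pos]
    xmin, xmax, ymin, ymax = min(xs), max(xs), min(ys), max(ys)
    # The tightest rectangle around the positives works iff it excludes every negative.
    return not any(xmin <= x <= xmax and ymin <= y <= ymax for (x, y) in neg)

# Check all 2^4 labelings: if every one is realizable, the 4 points are shattered.
shattered = all(
    realizable(list(pos), [p for p in points if p not in pos])
    for k in range(len(points) + 1)
    for pos in combinations(points, k)
)
print("shattered:", shattered)  # True, so the VC dimension of rectangles is at least 4
```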

2.3 PAC learning

  • want the error of the learned rectangle to be at most ε; each of the 4 boundary strips then gets at most ε/4
  • prob that the tightest rectangle has error > ε (i.e. some strip is missed by all N samples) ≤ 4(1 − ε/4)^N
  • given the inequality (1 − x) ≤ e^{−x}, we want to choose N and δ so that 4e^{−εN/4} ≤ δ, which leads to
  • N ≥ (4/ε) ln(4/δ)
  • example: ε = .1 and δ = .05, so we need at least (4/.1) ln(4/.05) ≈ 176 samples (see the sketch below)
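A minimal sketch of the sample-complexity bound above, using the natural log that comes out of the derivation:

```python
import math

def pac_samples(eps: float, delta: float) -> int:
    """Smallest N with 4 * exp(-eps * N / 4) <= delta, i.e. N >= (4/eps) * ln(4/delta)."""
    return math.ceil((4 / eps) * math.log(4 / delta))

print(pac_samples(0.1, 0.05))  # 176
```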

2.4 noise

imprecision in recording, labeling mistakes, additional (hidden) attributes. Question: do you think it is possible to predict with certainty something like "will so-and-so like a particular movie" given all pertinent data? Complex models can fit the data more accurately, but simple models are easier to use, train, and explain, and may generalize better because complex models can overfit - Occam's razor.

2.5 learning multiple classes

create rectangles for each class


2.6 regression

  • X = {x^t, r^t}_{t=1}^{N} where r^t ∈ ℜ
  • interpolation: r^t = f(x^t), regression: r^t = f(x^t) + ε
  • empirical error: E(g|X) = (1/N) Σ_{t=1}^{N} [r^t − g(x^t)]^2
  • if linear: g(x) = w1 x1 + ... + wd xd + w0 = Σ_{j=1}^{d} wj xj + w0

  • with one attribute: g(x) = w1x1 + w0
  • error function: E(w1, w0|X) = Σ_{t=1}^{N} [r^t − (w1 x^t + w0)]^2

  • taking the partials, setting to zero and solving:

    – w1 = (Σ_{t=1}^{N} x^t r^t − N x̄ r̄) / (Σ_{t=1}^{N} (x^t)^2 − N x̄^2), where x̄ and r̄ are the sample means
    – w0 = r̄ − w1 x̄ (see the sketch below)

  • quadratic and higher-order polynomials
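A minimal sketch of the one-attribute least-squares solution above; the small dataset is made up for illustration:

```python
# Closed-form least squares for g(x) = w1*x + w0 (single attribute).
# The data below is illustrative, not from the notes.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
r = [1.2, 1.9, 3.2, 3.8, 5.1]

N = len(x)
x_bar = sum(x) / N
r_bar = sum(r) / N

# w1 = (sum_t x^t r^t - N*x_bar*r_bar) / (sum_t (x^t)^2 - N*x_bar^2)
w1 = (sum(xt * rt for xt, rt in zip(x, r)) - N * x_bar * r_bar) / (
    sum(xt ** 2 for xt in x) - N * x_bar ** 2)
w0 = r_bar - w1 * x_bar

print(w1, w0)
print("empirical error:", sum((rt - (w1 * xt + w0)) ** 2 for xt, rt in zip(x, r)) / N)
```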

2.7 model selection and generalization

  • Go over example in table 2.1.
  • When the data does not identify a model with certainty, it is an ill-posed problem.
  • Inductive bias is the set of assumptions that are adopted.
  • model selection is choosing the right bias.
  • Underfitting is when the hypothesis is less complex than the underlying function
  • Overfitting is when the hypothesis is more complex than the underlying function (see the sketch below)
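A minimal sketch of under- and overfitting, assuming a noisy sine-shaped target (the data and polynomial degrees are illustrative, not the book's table 2.1):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 20)
x_val = np.linspace(0.025, 0.975, 20)
f = lambda x: np.sin(2 * np.pi * x)                       # assumed "true" function
r_train = f(x_train) + rng.normal(0, 0.2, x_train.size)   # noisy training labels
r_val = f(x_val) + rng.normal(0, 0.2, x_val.size)         # noisy validation labels

for degree in (1, 3, 9):                      # too simple, about right, too complex
    w = np.polyfit(x_train, r_train, degree)  # least-squares polynomial fit
    mse = lambda xs, rs: np.mean((rs - np.polyval(w, xs)) ** 2)
    print(degree, round(mse(x_train, r_train), 3), round(mse(x_val, r_val), 3))
```

Typically the degree-1 fit has high error everywhere (underfitting), while the degree-9 fit drives the training error down at the cost of validation error (overfitting).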

2.8 dimensions of supervised ML algorithm (recap)

  • model: g(x|θ)
  • loss function: E(θ|X) = Σ_{t=1}^{N} L(r^t, g(x^t|θ))

  • optimization method: θ* = argmin_θ E(θ|X)


2.9 implementation

  • program to find most specific parameters
  • program to find most general parameters
  • program to learn for multiple classes
  • program to do regression (many packages)
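A minimal sketch of the first item above, finding the most specific hypothesis S for the rectangle class, i.e. the tightest axis-aligned rectangle around the positive examples (the toy data is an assumption):

```python
# Most specific hypothesis S for axis-aligned rectangles:
# the tightest rectangle that contains every positive example.
# Each example is ((x1, x2), r) with r = 1 for positive, 0 for negative (toy data).
X = [((2, 3), 1), ((4, 5), 1), ((3, 4), 1), ((1, 1), 0), ((6, 6), 0)]

pos = [x for x, r in X if r == 1]
x1s = [p[0] for p in pos]
x2s = [p[1] for p in pos]
S = (min(x1s), max(x1s), min(x2s), max(x2s))   # (x1_min, x1_max, x2_min, x2_max)

def h(x, rect):
    """h(x) = 1 if x falls inside the rectangle."""
    x1_min, x1_max, x2_min, x2_max = rect
    return int(x1_min <= x[0] <= x1_max and x2_min <= x[1] <= x2_max)

# empirical error of S on the training set
print(S, sum(h(x, S) != r for x, r in X))
```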


3 Week 3 - chapter 3 Bayesian decision theory

  • observable (x) and unobservable (z) variables x = f(z)
  • choose the most probable event
  • estimate P(X) using samples, i.e. p̂0 = (# heads)/(total # tosses)

3.1 classification

  • use the observable variables to predict the class
  • choose C = 1 if P(C = 1|x1, x2) > .5
  • prob of error is 1 − max(P(C = 1|x1, x2), P(C = 0|x1, x2))
  • Bayes' rule: P(C|x) = p(x|C)P(C)/p(x)

  • prior is the probability of the class
  • class likelihood is the probability of the data given the class
  • evidence is the probability of the data, normalization constant
  • classifier: choose the class with the highest posterior prob: choose Ci if P(Ci|x) = max_k P(Ck|x) (see the sketch below)
  • example: want to predict success of college applicant given: gpa, sat score
  • example: predict a patient's reaction (get better, no diff, get worse) given their blood pressure and ethnic background
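A minimal sketch of the highest-posterior classifier above; the priors and likelihood values are made-up numbers for illustration:

```python
# Choose the class with the highest posterior P(C_i|x) = p(x|C_i) P(C_i) / p(x).
# Priors and likelihoods below are illustrative numbers, not from the notes.
priors = {"C0": 0.7, "C1": 0.3}        # P(C_i)
likelihood = {"C0": 0.2, "C1": 0.9}    # p(x|C_i) for one observed x

evidence = sum(likelihood[c] * priors[c] for c in priors)             # p(x)
posterior = {c: likelihood[c] * priors[c] / evidence for c in priors}

print(posterior, "-> choose", max(posterior, key=posterior.get))
```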

3.2 losses and risks

need to weight decisions, as not all decisions have the same consequences

  • let αi be the action of choosing Ci
  • and λik be the loss associated with taking action αi when the class is really Ck
  • then the risk of taking action αi is R(αi|x) = Σ_{k=1}^{K} λik P(Ck|x)
  • zero-one loss is often assumed to simplify things; assigning risks can always be done as a post-processing step
  • example: say P(C0|x) = .4 and P(C1|x) = .6, but λ00 = 0, λ01 = 10, λ10 = 20 and λ11 = 0. So
    – R(α0|x) = 0 · .4 + 10 · .6 = 6
    – R(α1|x) = 20 · .4 + 0 · .6 = 8
    – we take action α0 even though C1 is the more probable class (see the sketch below)

  • reject option - create one more α and λ
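A minimal sketch of the risk calculation in the example above:

```python
# Expected risk R(alpha_i|x) = sum_k lambda_ik * P(C_k|x), with the example numbers above.
posterior = [0.4, 0.6]            # P(C0|x), P(C1|x)
loss = [[0, 10],                  # lambda_0k: losses for action alpha_0
        [20, 0]]                  # lambda_1k: losses for action alpha_1

risk = [sum(l * p for l, p in zip(row, posterior)) for row in loss]
print(risk)                                    # [6.0, 8.0]
print("choose action", risk.index(min(risk)))  # alpha_0, despite C1 being more probable
```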


3.3 discriminant functions

  • gi(x) = −R(αi|x)
  • gi(x) = P(x|Ci)P(Ci) when zero-one loss function is considered
  • show briefly the quadratic discriminator

3.4 utility theory

  • utility function: EU(αi|x) = Σ_k Uik P(Sk|x)
  • choose αi if EU(αi|x) = max_j EU(αj|x)
  • typically defined in monetary terms

3.5 value of information

  • assessing the value of additional information (attributes)
  • expected utility of the current best action: EU(x) = max_i Σ_k Uik P(Sk|x)
  • with a new feature z: EU(x, z) = max_i Σ_k Uik P(Sk|x, z)
  • if EU(x, z) > EU(x), then z is useful, but only if the gain in utility from the additional feature exceeds the cost of observing and processing it

3.6 bayesian nets

  • define probabilistic networks, graphical models and DAG
  • (slides) define causes and diagnostic arcs in network
  • explain P(R|W) = P(W|R)P(R)/P(W)
  • explain P(W|S) = P(W|R, S)P(R|S) + P(W|¬R, S)P(¬R|S)
  • P(W) = P(W|R, S)P(R, S) + P(W|¬R, S)P(¬R, S) + P(W|R, ¬S)P(R, ¬S) + P(W|¬R, ¬S)P(¬R, ¬S) (see the sketch below)
  • explain why P(S|R, W) is less than P(S|W)
  • local structure - results in storing fewer parameters and making computations easier
  • belief propagation and junction trees are methods of efficiently solving nets
  • classification
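A minimal sketch of the marginalization of P(W) written out above, assuming R and S are independent parents of W; all the probability values are made-up for illustration:

```python
# P(W) = sum over r, s of P(W|r, s) * P(r, s), with P(r, s) = P(r) P(s)
# (R and S assumed independent). All numbers below are illustrative assumptions.
P_R = 0.4                                     # P(R); P(not R) = 1 - P_R
P_S = 0.2                                     # P(S)
P_W_given = {(True, True): 0.95, (True, False): 0.90,
             (False, True): 0.85, (False, False): 0.10}   # P(W | R=r, S=s)

P_W = sum(P_W_given[(r, s)]
          * (P_R if r else 1 - P_R)
          * (P_S if s else 1 - P_S)
          for r in (True, False) for s in (True, False))
print(P_W)
```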

3.7 influence diagrams

3.8 association rules


4 Week 4 - chapter 4 Parametric methods

use statistics to help estimate the likelihoods and priors

4.1 maximum likelihood

  • assume that any given sample x is drawn from a probability density p(x|θ)
  • to find the parameters θ we start by maximizing the likelihood of the sample: p(X|θ) = ∏_{t=1}^{N} p(x^t|θ)

  • difficult to work with derivative of products so we use the log likelihood
  • L(θ|X) ≡ log l(θ|X) = Σ_{t=1}^{N} log p(x^t|θ)

  • now we can plug in the formula for a specific distribution to calculate the statistics used to estimate the parameters (see the sketch below)
    – Bernoulli: p(x) = p^x (1 − p)^{1−x}
      L(p|X) = log ∏_{t=1}^{N} p^{x^t} (1 − p)^{1−x^t}
      p̂ = Σ_t x^t / N
    – multinomial: p̂_i = Σ_t x^t_i / N
    – Gaussian: p(x) = (1/(√(2π) σ)) e^{−(x−µ)^2 / 2σ^2}
      L(µ, σ|X) = −(N/2) log(2π) − N log σ − Σ_t (x^t − µ)^2 / 2σ^2
      m = Σ_t x^t / N
      s^2 = Σ_t (x^t − m)^2 / N
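A minimal sketch of the Gaussian maximum-likelihood estimates above, checked on a sample drawn with numpy (the true µ and σ are assumed values for the demo):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=1000)  # sample from N(mu=5, sigma=2), assumed values

# ML estimates from the notes: m = sample mean, s^2 = average squared deviation (divide by N)
N = x.size
m = x.sum() / N
s2 = ((x - m) ** 2).sum() / N

print(m, s2)   # close to mu = 5 and sigma^2 = 4
```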

4.2 evaluating an estimator: bias and variance

  • bias is a measure of how much the expected value of the estimator differs from the true parameter θ
  • MSE: r(d, θ) = E[(d(X) − θ)^2]
  • bias: b_θ(d) = E[d(X)] − θ
  • for the mean m: E[m] = E[Σ_t x^t / N] = (1/N) Σ_t E[x^t] = Nµ/N = µ, so m is unbiased
  • the mean is also consistent, that is Var(m) → 0 as N → ∞
  • the sample variance is not unbiased: E[s^2] = ((N − 1)/N) σ^2 (see the sketch below)

  • MSE can be written: r(d, θ) = E[(d − E[d])^2] + (E[d] − θ)^2 (variance plus bias squared)
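A minimal sketch checking E[s^2] = ((N − 1)/N) σ^2 by simulation (the distribution parameters and N are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
N, sigma2 = 5, 4.0      # a small N makes the bias visible; assumed values

# Draw many samples of size N and average the ML variance estimate s^2 (divide by N).
samples = rng.normal(0.0, np.sqrt(sigma2), size=(100_000, N))
s2 = ((samples - samples.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)

print(s2.mean())             # close to ((N - 1) / N) * sigma2 = 3.2, not 4.0
print((N - 1) / N * sigma2)
```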


4.3 the bayes’ estimator

  • θ itself can be seen as a random variable.
  • with a prior on θ we can calculate p(θ|X) = p(X|θ)p(θ) / ∫ p(X|θ′)p(θ′)dθ′
  • then p(x|X) = ∫ p(x, θ|X)dθ = ∫ p(x|θ, X)p(θ|X)dθ = ∫ p(x|θ)p(θ|X)dθ
  • often the integrals are difficult to evaluate so we narrow down p(θ|X) to a single point
  • using MAP: θ_MAP = argmax_θ p(θ|X)
  • then we calculate p(x|X) = p(x|θ_MAP)
  • if we have no prior information, the posterior has the same form as the likelihood p(X|θ), so we can use ML

  • Bayes estimator: θ_Bayes = E[θ|X] = ∫ θ p(θ|X)dθ
  • for both x^t and θ normal (x^t ∼ N(θ, σ0^2), prior θ ∼ N(m1, σ1^2)):
    E[θ|X] = [(N/σ0^2) / (N/σ0^2 + 1/σ1^2)] m0 + [(1/σ1^2) / (N/σ0^2 + 1/σ1^2)] m1,
    a precision-weighted average of the sample mean m0 and the prior mean m1

4.4 parametric classification

  • using Bayes' rule: p(Ci|x) = p(x|Ci)p(Ci)/p(x) = p(x|Ci)p(Ci) / Σ_{k=1}^{K} p(x|Ck)p(Ck)
  • discriminant: gi(x) = p(x|Ci)p(Ci), or gi(x) = log p(x|Ci) + log p(Ci)
  • Gaussian: gi(x) = −(1/2) log 2π − log σi − (x − µi)^2 / 2σi^2 + log p(Ci)

  • calculate parameters using statistics:

    – mi = Σ_{t=1}^{N} x^t r^t_i / Σ_{t=1}^{N} r^t_i, where r^t_i = 1 if x^t ∈ Ci and 0 otherwise
    – s^2_i = Σ_{t=1}^{N} (x^t − mi)^2 r^t_i / Σ_{t=1}^{N} r^t_i
    – p̂(Ci) = Σ_{t=1}^{N} r^t_i / N (see the sketch below)

  • example:
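A minimal sketch of the Gaussian parametric classifier described in this section, applied to made-up one-dimensional data (this is not the worked example from the notes):

```python
import numpy as np

rng = np.random.default_rng(3)
# Made-up 1-D training data for two classes (assumed means and spreads).
x0 = rng.normal(0.0, 1.0, 40)                 # samples from class C0
x1 = rng.normal(3.0, 1.5, 60)                 # samples from class C1
x = np.concatenate([x0, x1])
r = np.array([0] * 40 + [1] * 60)             # r^t = class label of x^t

def g(xq, xi, prior):
    """Gaussian discriminant g_i(x) = -log s_i - (x - m_i)^2 / (2 s_i^2) + log P(C_i)
    (the -(1/2) log 2*pi term is dropped since it is common to all classes)."""
    m = xi.mean()                             # m_i
    s2 = ((xi - m) ** 2).mean()               # s_i^2, ML estimate (divide by N_i)
    return -0.5 * np.log(s2) - (xq - m) ** 2 / (2 * s2) + np.log(prior)

xq = 1.2                                      # a query point
scores = [g(xq, x[r == i], (r == i).mean()) for i in (0, 1)]
print("choose class", int(np.argmax(scores)))
```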
