CS 678 Machine Learning

Lecture Notes

1 Week 1 - chapter 1 and probability

1.1 General

syllabus; what do students know (prog. lang., stats, math, calculus)

1.2 machine learning

1.2.1 general concepts

  • example of predicting basketball players (height and speed)
  • detecting patterns or regularities
  • application of ML to large databases is data mining
  • pattern recognition (face recognition, fingerprint, character, etc.)
  • combines math, statistics and computer science

1.2.2 examples of ML

  • learning associations
  • classification

    – classes
    – discriminant, prediction
    – OCR, face recognition, medical diagnosis, speech recognition
    – knowledge extraction, compression, outlier detection

  • regression


1.3 probability

  • events, probability and sample space
  • axioms

    – 0 ≤ P(E) ≤ 1
    – P(S) = 1, example:
      ∗ Ei = the die shows i
      ∗ S = E1 ∪ E2 ∪ E3 ∪ E4 ∪ E5 ∪ E6
      ∗ p(E2) = 1/6
      ∗ p(S) = 1
    – P(∪i Ei) = Σ_i P(Ei) for mutually exclusive events Ei
    – P(E ∪ Ec) = P(E) + P(Ec) = 1
    – P(E ∪ F) = P(E) + P(F) − P(E ∩ F)

  • conditional prob:

    – P(E|F) = P(E ∩ F)/P(F)
    – P(F|E) = P(E|F)P(F)/P(E), Bayes' formula (show derivation)
      ∗ E = have lung cancer, F = smoke
      ∗ P(E) = (people with lung cancer)/(all people) = .05
      ∗ P(F) = (people who smoke)/(all people) = .50
      ∗ P(F|E) = (people who smoke and have lung cancer)/(people who have lung cancer) = .80
      ∗ P(E|F) = .80 · .05/.5 = .08 (see the sketch at the end of this section)
    – marginals
      ∗ P(X) = Σ_i P(X|Yi)P(Yi)
      ∗ Ei = first die is i
      ∗ P(T = 7|Ei) = 1/6
      ∗ so P(T = 7) = P(T = 7|E1)P(E1) + P(T = 7|E2)P(E2) + ... = 6 · (1/36) = 1/6
      ∗ also do the same with P(E3)
      ∗ can also be done with continuous distributions...
    – P(E1|F) = P(F|E1)P(E1)/P(F) = P(F|E1)P(E1) / Σ_i P(F|Ei)P(Ei)

    – P(E ∩ F) = P(E)P(F) if E and F are independent
      ∗ P(E|F) = P(E ∩ F)/P(F)
      ∗ P(E ∩ F) = P(E|F)P(F)
      ∗ so if E and F are independent, P(E|F) = P(E)
      ∗ for example, given the first die is 2, P(die2 = 3) = 1/6
      ∗ independence is THE big assumption in machine learning: i.i.d.

  • random variables


    – probability distributions
      ∗ F(a) = P{X ≤ a}
      ∗ P{a < X ≤ b} = F(b) − F(a)
      ∗ F(a) = Σ_{x ≤ a} P(x) (discrete)
      ∗ F(a) = ∫_{−∞}^{a} p(x)dx (continuous)
    – joint distributions
      ∗ F(x, y) = P{X ≤ x, Y ≤ y}
      ∗ FX(x) = P{X ≤ x, Y ≤ ∞} marginal (show both the discrete and continuous)
    – conditional distributions: P_{X|Y}(x|y) = P{X = x, Y = y}/P{Y = y}
    – Bayes' rule: P(y|x) = P(x|y)P_Y(y)/P_X(x) (posterior = likelihood · prior / evidence)
    – expectation (mean): E[X] = Σ_i x_i P(x_i) or E[X] = ∫ x p(x)dx
    – variance: Var(X) = E[(X − µ)^2] = E[X^2] − µ^2
    – distributions
      ∗ binomial
      ∗ multinomial
      ∗ uniform
      ∗ normal
      ∗ others (chi-sq, t, F, etc.)
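A minimal Python sketch checking the Bayes'-rule and marginalization arithmetic from the examples above (the probability values are the ones used in the notes):

```python
# Bayes' rule check for the smoking / lung-cancer example.
# E = has lung cancer, F = smokes.
p_E = 0.05          # P(E): fraction of people with lung cancer
p_F = 0.50          # P(F): fraction of people who smoke
p_F_given_E = 0.80  # P(F|E): fraction of lung-cancer patients who smoke

# Bayes' rule: P(E|F) = P(F|E) * P(E) / P(F)
p_E_given_F = p_F_given_E * p_E / p_F
print(p_E_given_F)  # 0.08

# Marginalization check for two dice: P(T = 7) = sum_i P(T = 7 | E_i) P(E_i)
p_T7 = sum((1 / 6) * (1 / 6) for first_die in range(1, 7))
print(p_T7)         # 0.1666... = 1/6
```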


2 Week 2 - chapter 2 supervised learning

2.1 learning from examples

  • positive, negative examples
  • x = x1 ... xd input representation (just the pertinent attributes)
  • X = {x^t, r^t}_{t=1}^{N}
  • hypothesis h, hypothesis class, parameters. h(x) = 1 if h classifies x as positive
  • empirical error - number of predictions that do not match the labels in X: E(h|X) = Σ_{t=1}^{N} 1(h(x^t) ≠ r^t)
  • generalization - most specific hypothesis (S) vs. most general hypothesis (G) (trading off false positives and false negatives)
  • doubt - instances falling in G − S (inside G but outside S) are not certain, so we do not make a decision

2.2 vapnik-chervonenkis dimension

The VC dimension of a hypothesis class H is the maximum number of points that can be shattered by H, i.e. for which every possible labeling can be realized by some h ∈ H. Draw example with 4 points and axis-aligned rectangles (see the sketch below).
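A minimal sketch of that 4-point check: for axis-aligned rectangles, a labeling is realizable exactly when the tightest rectangle around the positive points excludes every negative point. The diamond configuration below is an illustrative assumption, not a layout from the notes.

```python
from itertools import combinations

# Four points in a "diamond" configuration (illustrative choice, not from the notes).
points = [(0, 1), (1, 0), (0, -1), (-1, 0)]

def realizable(pos, neg):
    """Can an axis-aligned rectangle contain all of pos and none of neg?"""
    if not pos:
        return True  # a degenerate (empty) rectangle labels everything negative
    xs = [p[0] for p in pos]
    ys = [p[1] for p in pos]
    xmin, xmax, ymin, ymax = min(xs), max(xs), min(ys), max(ys)
    # The tightest rectangle around the positives works iff it excludes every negative.
    return not any(xmin <= x <= xmax and ymin <= y <= ymax for (x, y) in neg)

# Check all 2^4 labelings: if every one is realizable, the 4 points are shattered.
shattered = all(
    realizable(list(pos), [p for p in points if p not in pos])
    for k in range(len(points) + 1)
    for pos in combinations(points, k)
)
print("shattered:", shattered)  # True, so the VC dimension of rectangles is at least 4
```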

2.3 PAC learning

  • want the error of the learned rectangle to be at most ε; each of the 4 boundary strips then gets at most ε/4
  • prob that the tightest rectangle has error > ε (i.e. some strip is missed by all N samples) ≤ 4(1 − ε/4)^N
  • given the inequality (1 − x) ≤ e^{−x}, we want to choose N and δ so that 4e^{−εN/4} ≤ δ, which leads to
  • N ≥ (4/ε) ln(4/δ)
  • example: ε = .1 and δ = .05, so we need at least (4/.1) ln(4/.05) ≈ 176 samples (see the sketch below)
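A minimal sketch of the sample-complexity bound above, using the natural log that comes out of the derivation:

```python
import math

def pac_samples(eps: float, delta: float) -> int:
    """Smallest N with 4 * exp(-eps * N / 4) <= delta, i.e. N >= (4/eps) * ln(4/delta)."""
    return math.ceil((4 / eps) * math.log(4 / delta))

print(pac_samples(0.1, 0.05))  # 176
```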

2.4 noise

imprecision in recording, labeling mistakes, additional (hidden) attributes. Question: do you think it is possible to predict with certainty something like "will so-and-so like a particular movie" given all pertinent data? Complex models can fit the data more accurately, but simple models are easier to use, train, and explain, and may generalize better because complex models can overfit - Occam's razor.

2.5 learning multiple classes

create rectangles for each class


2.6 regression

  • X = {x^t, r^t}_{t=1}^{N} where r^t ∈ ℜ
  • interpolation: r^t = f(x^t), regression: r^t = f(x^t) + ε
  • empirical error: E(g|X) = (1/N) Σ_{t=1}^{N} [r^t − g(x^t)]^2
  • if linear: g(x) = w1 x1 + ... + wd xd + w0 = Σ_{j=1}^{d} wj xj + w0

  • with one attribute: g(x) = w1x1 + w0
  • error function: E(w1, w0|X) = Σ_{t=1}^{N} [r^t − (w1 x^t + w0)]^2

  • taking the partials, setting to zero and solving:

    – w1 = (Σ_{t=1}^{N} x^t r^t − N x̄ r̄) / (Σ_{t=1}^{N} (x^t)^2 − N x̄^2), where x̄ and r̄ are the sample means
    – w0 = r̄ − w1 x̄ (see the sketch below)

  • quadratic and higher-order polynomials
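A minimal sketch of the one-attribute least-squares solution above; the small dataset is made up for illustration:

```python
# Closed-form least squares for g(x) = w1*x + w0 (single attribute).
# The data below is illustrative, not from the notes.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
r = [1.2, 1.9, 3.2, 3.8, 5.1]

N = len(x)
x_bar = sum(x) / N
r_bar = sum(r) / N

# w1 = (sum_t x^t r^t - N*x_bar*r_bar) / (sum_t (x^t)^2 - N*x_bar^2)
w1 = (sum(xt * rt for xt, rt in zip(x, r)) - N * x_bar * r_bar) / (
    sum(xt ** 2 for xt in x) - N * x_bar ** 2)
w0 = r_bar - w1 * x_bar

print(w1, w0)
print("empirical error:", sum((rt - (w1 * xt + w0)) ** 2 for xt, rt in zip(x, r)) / N)
```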

2.7 model selection and generalization

  • Go over example in table 2.1.
  • When the data does not identify a model with certainty, it is an ill-posed problem.
  • Inductive bias is the set of assumptions that are adopted.
  • model selection is choosing the right bias.
  • Underfitting is when the hypothesis is less complex than the underlying function
  • Overfitting is when the hypothesis is more complex than the underlying function (see the sketch below)
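A minimal sketch of under- and overfitting, assuming a noisy sine-shaped target (the data and polynomial degrees are illustrative, not the book's table 2.1):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 20)
x_val = np.linspace(0.025, 0.975, 20)
f = lambda x: np.sin(2 * np.pi * x)                       # assumed "true" function
r_train = f(x_train) + rng.normal(0, 0.2, x_train.size)   # noisy training labels
r_val = f(x_val) + rng.normal(0, 0.2, x_val.size)         # noisy validation labels

for degree in (1, 3, 9):                      # too simple, about right, too complex
    w = np.polyfit(x_train, r_train, degree)  # least-squares polynomial fit
    mse = lambda xs, rs: np.mean((rs - np.polyval(w, xs)) ** 2)
    print(degree, round(mse(x_train, r_train), 3), round(mse(x_val, r_val), 3))
```

Typically the degree-1 fit has high error everywhere (underfitting), while the degree-9 fit drives the training error down at the cost of validation error (overfitting).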

2.8 dimensions of supervised ML algorithm (recap)

  • model: g(x|θ)
  • loss function: E(θ|X) = Σ_{t=1}^{N} L(r^t, g(x^t|θ))

  • optimization method: θ* = argmin_θ E(θ|X)


2.9 implementation

  • program to find most specific parameters
  • program to find most general parameters
  • program to learn for multiple classes
  • program to do regression (many packages)
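A minimal sketch of the first item above, finding the most specific hypothesis S for the rectangle class, i.e. the tightest axis-aligned rectangle around the positive examples (the toy data is an assumption):

```python
# Most specific hypothesis S for axis-aligned rectangles:
# the tightest rectangle that contains every positive example.
# Each example is ((x1, x2), r) with r = 1 for positive, 0 for negative (toy data).
X = [((2, 3), 1), ((4, 5), 1), ((3, 4), 1), ((1, 1), 0), ((6, 6), 0)]

pos = [x for x, r in X if r == 1]
x1s = [p[0] for p in pos]
x2s = [p[1] for p in pos]
S = (min(x1s), max(x1s), min(x2s), max(x2s))   # (x1_min, x1_max, x2_min, x2_max)

def h(x, rect):
    """h(x) = 1 if x falls inside the rectangle."""
    x1_min, x1_max, x2_min, x2_max = rect
    return int(x1_min <= x[0] <= x1_max and x2_min <= x[1] <= x2_max)

# empirical error of S on the training set
print(S, sum(h(x, S) != r for x, r in X))
```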


3 Week 3 - chapter 3 Bayesian decision theory

  • observable (x) and unobservable (z) variables x = f(z)
  • choose the most probable event
  • estimate P(X) using samples, i.e. p̂0 = (# heads)/(total # tosses)

3.1 classification

  • use the observable variables to predict the class
  • choose C = 1 if P(C = 1|x1, x2) > .5
  • prob of error is 1 − max(P(C = 1|x1, x2), P(C = 0|x1, x2))
  • Bayes' rule: P(C|x) = p(x|C)P(C)/p(x)

  • prior is the probability of the class
  • class likelihood is the probability of the data given the class
  • evidence is the probability of the data, normalization constant
  • classifier: choose the class with the highest posterior prob: choose Ci if P(Ci|x) = max_k P(Ck|x) (see the sketch below)
  • example: want to predict success of college applicant given: gpa, sat score
  • example: predict a patient's reaction (get better, no diff, get worse) given their blood pressure and ethnic background
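A minimal sketch of the highest-posterior classifier above; the priors and likelihood values are made-up numbers for illustration:

```python
# Choose the class with the highest posterior P(C_i|x) = p(x|C_i) P(C_i) / p(x).
# Priors and likelihoods below are illustrative numbers, not from the notes.
priors = {"C0": 0.7, "C1": 0.3}        # P(C_i)
likelihood = {"C0": 0.2, "C1": 0.9}    # p(x|C_i) for one observed x

evidence = sum(likelihood[c] * priors[c] for c in priors)             # p(x)
posterior = {c: likelihood[c] * priors[c] / evidence for c in priors}

print(posterior, "-> choose", max(posterior, key=posterior.get))
```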

3.2 losses and risks

need to weight decisions, as not all decisions have the same consequences

  • let αi be the action of choosing Ci
  • and λik be the loss associated with taking action αi when the class is really Ck
  • then the risk of taking action αi is R(αi|x) = Σ_{k=1}^{K} λik P(Ck|x)
  • zero-one loss is often assumed to simplify things; assigning risks can always be done as a post-processing step
  • example: say P(C0|x) = .4 and P(C1|x) = .6, but λ00 = 0, λ01 = 10, λ10 = 20 and λ11 = 0. So
    – R(α0|x) = 0 · .4 + 10 · .6 = 6
    – R(α1|x) = 20 · .4 + 0 · .6 = 8
    – we take action α0 even though C1 is the more probable class (see the sketch below)

  • reject option - create one more α and λ
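A minimal sketch of the risk calculation in the example above:

```python
# Expected risk R(alpha_i|x) = sum_k lambda_ik * P(C_k|x), with the example numbers above.
posterior = [0.4, 0.6]            # P(C0|x), P(C1|x)
loss = [[0, 10],                  # lambda_0k: losses for action alpha_0
        [20, 0]]                  # lambda_1k: losses for action alpha_1

risk = [sum(l * p for l, p in zip(row, posterior)) for row in loss]
print(risk)                                    # [6.0, 8.0]
print("choose action", risk.index(min(risk)))  # alpha_0, despite C1 being more probable
```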


3.3 discriminant functions

  • gi(x) = −R(αi|x)
  • gi(x) = P(x|Ci)P(Ci) when zero-one loss function is considered
  • show briefly the quadratic discriminator

3.4 utility theory

  • utility function: EU(αi|x) = Σ_k Uik P(Sk|x)
  • choose αi if EU(αi|x) = max_j EU(αj|x)
  • typically defined in monetary terms

3.5 value of information

  • assessing the value of additional information (attributes)
  • expected utility of the current best action: EU(x) = max_i Σ_k Uik P(Sk|x)
  • with a new feature z: EU(x, z) = max_i Σ_k Uik P(Sk|x, z)
  • if EU(x, z) > EU(x), then z is useful, but only if the gain in utility from the additional feature exceeds the cost of observing and processing it

3.6 bayesian nets

  • define probabilistic networks, graphical models and DAG
  • (slides) define causes and diagnostic arcs in network
  • explain P(R|W) = P(W|R)P(R)/P(W)
  • explain P(W|S) = P(W|R, S)P(R|S) + P(W|¬R, S)P(¬R|S)
  • P(W) = P(W|R, S)P(R, S) + P(W|¬R, S)P(¬R, S) + P(W|R, ¬S)P(R, ¬S) + P(W|¬R, ¬S)P(¬R, ¬S) (see the sketch below)
  • explain why P(S|R, W) is less than P(S|W)
  • local structure - results in storing fewer parameters and making computations easier
  • belief propagation and junction trees are methods of efficiently solving nets
  • classification
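A minimal sketch of the marginalization of P(W) written out above, assuming R and S are independent parents of W; all the probability values are made-up for illustration:

```python
# P(W) = sum over r, s of P(W|r, s) * P(r, s), with P(r, s) = P(r) P(s)
# (R and S assumed independent). All numbers below are illustrative assumptions.
P_R = 0.4                                     # P(R); P(not R) = 1 - P_R
P_S = 0.2                                     # P(S)
P_W_given = {(True, True): 0.95, (True, False): 0.90,
             (False, True): 0.85, (False, False): 0.10}   # P(W | R=r, S=s)

P_W = sum(P_W_given[(r, s)]
          * (P_R if r else 1 - P_R)
          * (P_S if s else 1 - P_S)
          for r in (True, False) for s in (True, False))
print(P_W)
```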

3.7 influence diagrams

3.8 association rules


4 Week 4 - chapter 4 Parametric methods

use statistics to help estimate the likelihoods and priors

4.1 maximum likelihood

  • assume that any given sample x is drawn from a probability density p(x|θ)
  • to find the parameters θ we start by maximizing the likelihood of the sample: p(X|θ) = ∏_{t=1}^{N} p(x^t|θ)

  • difficult to work with derivative of products so we use the log likelihood
  • L(θ|X) ≡ log l(θ|X) = Σ_{t=1}^{N} log p(x^t|θ)

  • now we can plug in the formula for a specific distribution to calculate the statistics used to estimate the parameters (see the sketch below)
    – Bernoulli: p(x) = p^x (1 − p)^{1−x}
      L(p|X) = log ∏_{t=1}^{N} p^{x^t} (1 − p)^{1−x^t}
      p̂ = Σ_t x^t / N
    – multinomial: p̂_i = Σ_t x^t_i / N
    – Gaussian: p(x) = (1/(√(2π) σ)) e^{−(x−µ)^2 / 2σ^2}
      L(µ, σ|X) = −(N/2) log(2π) − N log σ − Σ_t (x^t − µ)^2 / 2σ^2
      m = Σ_t x^t / N
      s^2 = Σ_t (x^t − m)^2 / N
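A minimal sketch of the Gaussian maximum-likelihood estimates above, checked on a sample drawn with numpy (the true µ and σ are assumed values for the demo):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=1000)  # sample from N(mu=5, sigma=2), assumed values

# ML estimates from the notes: m = sample mean, s^2 = average squared deviation (divide by N)
N = x.size
m = x.sum() / N
s2 = ((x - m) ** 2).sum() / N

print(m, s2)   # close to mu = 5 and sigma^2 = 4
```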

4.2 evaluating an estimator: bias and variance

  • bias is a measure of how much the expected value of the estimator differs from the true parameter θ
  • MSE: r(d, θ) = E[(d(X) − θ)^2]
  • bias: b_θ(d) = E[d(X)] − θ
  • for the mean m: E[m] = E[Σ_t x^t / N] = (1/N) Σ_t E[x^t] = Nµ/N = µ, so m is unbiased
  • the mean is also consistent, that is Var(m) → 0 as N → ∞
  • the sample variance is not unbiased: E[s^2] = ((N − 1)/N) σ^2 (see the sketch below)

  • MSE can be written: r(d, θ) = E[(d − E[d])^2] + (E[d] − θ)^2 (variance plus bias squared)
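A minimal sketch checking E[s^2] = ((N − 1)/N) σ^2 by simulation (the distribution parameters and N are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
N, sigma2 = 5, 4.0      # a small N makes the bias visible; assumed values

# Draw many samples of size N and average the ML variance estimate s^2 (divide by N).
samples = rng.normal(0.0, np.sqrt(sigma2), size=(100_000, N))
s2 = ((samples - samples.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)

print(s2.mean())             # close to ((N - 1) / N) * sigma2 = 3.2, not 4.0
print((N - 1) / N * sigma2)
```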


4.3 the bayes’ estimator

  • θ itself can be seen as a random variable.
  • with a prior on θ we can calculate p(θ|X) = p(X|θ)p(θ) / ∫ p(X|θ′)p(θ′)dθ′
  • then p(x|X) = ∫ p(x, θ|X)dθ = ∫ p(x|θ, X)p(θ|X)dθ = ∫ p(x|θ)p(θ|X)dθ
  • often the integrals are difficult to evaluate so we narrow down p(θ|X) to a single point
  • using MAP: θ_MAP = argmax_θ p(θ|X)
  • then we calculate p(x|X) = p(x|θ_MAP)
  • if we have no prior information, the posterior has the same form as the likelihood p(X|θ), so we can use ML

  • Bayes estimator: θ_Bayes = E[θ|X] = ∫ θ p(θ|X)dθ
  • for both x^t and θ normal (x^t ∼ N(θ, σ0^2), prior θ ∼ N(m1, σ1^2)):
    E[θ|X] = [(N/σ0^2) / (N/σ0^2 + 1/σ1^2)] m0 + [(1/σ1^2) / (N/σ0^2 + 1/σ1^2)] m1,
    a precision-weighted average of the sample mean m0 and the prior mean m1

4.4 parametric classification

  • using Bayes' rule: p(Ci|x) = p(x|Ci)p(Ci)/p(x) = p(x|Ci)p(Ci) / Σ_{k=1}^{K} p(x|Ck)p(Ck)
  • discriminant: gi(x) = p(x|Ci)p(Ci), or gi(x) = log p(x|Ci) + log p(Ci)
  • Gaussian: gi(x) = −(1/2) log 2π − log σi − (x − µi)^2 / 2σi^2 + log p(Ci)

  • calculate parameters using statistics:

    – mi = Σ_{t=1}^{N} x^t r^t_i / Σ_{t=1}^{N} r^t_i, where r^t_i = 1 if x^t ∈ Ci and 0 otherwise
    – s^2_i = Σ_{t=1}^{N} (x^t − mi)^2 r^t_i / Σ_{t=1}^{N} r^t_i
    – p̂(Ci) = Σ_{t=1}^{N} r^t_i / N (see the sketch below)

  • example:
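A minimal sketch of the Gaussian parametric classifier described in this section, applied to made-up one-dimensional data (this is not the worked example from the notes):

```python
import numpy as np

rng = np.random.default_rng(3)
# Made-up 1-D training data for two classes (assumed means and spreads).
x0 = rng.normal(0.0, 1.0, 40)                 # samples from class C0
x1 = rng.normal(3.0, 1.5, 60)                 # samples from class C1
x = np.concatenate([x0, x1])
r = np.array([0] * 40 + [1] * 60)             # r^t = class label of x^t

def g(xq, xi, prior):
    """Gaussian discriminant g_i(x) = -log s_i - (x - m_i)^2 / (2 s_i^2) + log P(C_i)
    (the -(1/2) log 2*pi term is dropped since it is common to all classes)."""
    m = xi.mean()                             # m_i
    s2 = ((xi - m) ** 2).mean()               # s_i^2, ML estimate (divide by N_i)
    return -0.5 * np.log(s2) - (xq - m) ** 2 / (2 * s2) + np.log(prior)

xq = 1.2                                      # a query point
scores = [g(xq, x[r == i], (r == i).mean()) for i in (0, 1)]
print("choose class", int(np.argmax(scores)))
```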
