Estimation: Sample Complexity and the Bias-Variance Tradeoff
CMPUT 296: Basics of Machine Learning
Textbook §3.4-3.5
Logistics

Reminders:
- Thought Question 1 (due Thursday, September 17)
- Assignment 1 (due Thursday, September 24)
Recap:
- An estimator predicts the value of an unobserved quantity based on observed data.
- The variance $\mathrm{Var}[X]$ measures the expected squared distance of $X$ from the mean.
- Concentration inequalities bound the probability of an estimator being at least $\epsilon$ from the estimated quantity.
Popoviciu's inequality: If $a \le X_i \le b$, then $\mathrm{Var}[X_i] \le \frac{1}{4}(b-a)^2$.

Hoeffding's inequality: with probability at least $1 - \delta$,
$$\epsilon = (b-a)\sqrt{\frac{\ln(2/\delta)}{2n}} = \sqrt{\frac{\ln(2/\delta)}{2}}\,(b-a)\,\frac{1}{\sqrt{n}}.$$

Chebyshev's inequality: with probability at least $1 - \delta$,
$$\epsilon = \sqrt{\frac{\sigma^2}{\delta n}} \le \sqrt{\frac{(b-a)^2}{4\delta n}} = \frac{1}{2\sqrt{\delta}}\,(b-a)\,\frac{1}{\sqrt{n}}.$$

✴ Hoeffding's bound is tighter whenever $\sqrt{\ln(2/\delta)/2} < \frac{1}{2\sqrt{\delta}}$.
✴ E.g., if $\mathrm{Var}[X_i] \approx \frac{1}{4}(b-a)^2$, then Hoeffding's bound is tighter whenever $\delta \lesssim 0.232$.
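To see the comparison numerically, here is a minimal sketch (my own, not from the slides; the helper names are made up) that evaluates both half-widths for a given $n$ and $\delta$:

```python
import numpy as np

def hoeffding_eps(n, delta, a=0.0, b=1.0):
    # Hoeffding half-width: valid for any distribution supported on [a, b]
    return (b - a) * np.sqrt(np.log(2 / delta) / (2 * n))

def chebyshev_eps(n, delta, var):
    # Chebyshev half-width: needs only the variance of each X_i
    return np.sqrt(var / (delta * n))

n, delta = 100, 0.05
print(hoeffding_eps(n, delta))            # ~0.136
print(chebyshev_eps(n, delta, var=0.25))  # ~0.224, using the Popoviciu worst case (b-a)^2/4
# Hoeffding is tighter here, as expected: delta = 0.05 < ~0.232
```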
Definition: The sample complexity of an estimator is the number of samples required to guarantee an error of at most $\epsilon$ with probability at least $1 - \delta$, for given $\epsilon$ and $\delta$.
The convergence rate indicates how quickly the error in an estimator decays as the number of samples grows.

Example: Estimating the mean of a distribution using
$$\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i.$$
By Chebyshev's inequality,
$$\Pr\left(\left|\bar{X} - \mathbb{E}[\bar{X}]\right| \le \sqrt{\frac{\sigma^2}{\delta n}}\right) \ge 1 - \delta,$$
so the error decays at rate $O(1/\sqrt{n})$ (why?).
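As a quick illustration (a sketch of my own, not from the slides), the root-mean-squared error of the sample mean should roughly halve each time $n$ quadruples:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 0.5  # true mean of Uniform(0, 1)
for n in [10, 40, 160, 640]:
    means = rng.uniform(0, 1, size=(10_000, n)).mean(axis=1)  # 10,000 replications
    rmse = np.sqrt(np.mean((means - mu) ** 2))
    print(n, rmse)  # rmse ~ sigma / sqrt(n): halves as n quadruples
```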
Example: Now assume that $X_i \overset{\text{i.i.d.}}{\sim} N(\mu, \sigma^2)$, and we know $\sigma^2$ but not $\mu$. Then $\bar{X} \sim N(\mu, \sigma^2/n)$ (why?). Find $\epsilon$ such that $\Pr(|\bar{X} - \mu| < \epsilon) = 0.95$ by finding $\epsilon$ such that
$$\int_{-\infty}^{-\epsilon} p(x)\,dx = 0.025 \implies \epsilon = 1.96\,\frac{\sigma}{\sqrt{n}},$$
where $p$ is the density of $\bar{X} - \mu$.
Questions:
1. What is the expected value of $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$?
2. What is the variance of $\bar{X}$?
3. What is the distribution of $\bar{X}$?
The constant 1.96 is obtained from the inverse CDF of the standard normal, evaluated at $0.975$.
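For instance, in Python (a small sketch assuming scipy is available):

```python
from scipy.stats import norm

z = norm.ppf(0.975)  # inverse CDF of the standard normal at 0.975
print(z)             # ~1.96: leaves 0.025 probability in each tail
```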
For $\delta = 0.05$, Chebyshev gives
$$\epsilon = \sqrt{\frac{\sigma^2}{\delta n}} = \sqrt{\frac{1}{0.05}}\,\frac{\sigma}{\sqrt{n}} \iff \epsilon \approx 4.47\,\frac{\sigma}{\sqrt{n}} \iff \sqrt{n} \approx 4.47\,\frac{\sigma}{\epsilon} \iff n \approx 19.98\,\frac{\sigma^2}{\epsilon^2}.$$
With the Gaussian assumption and $\delta = 0.05$,
$$\epsilon = 1.96\,\frac{\sigma}{\sqrt{n}} \iff \sqrt{n} = 1.96\,\frac{\sigma}{\epsilon} \iff n = 3.84\,\frac{\sigma^2}{\epsilon^2}.$$
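A small sketch (my own; the helper names are made up) comparing the two sample-size formulas:

```python
import math

def n_chebyshev(sigma, eps, delta=0.05):
    # n >= sigma^2 / (delta * eps^2), i.e. ~20 sigma^2 / eps^2 for delta = 0.05
    return math.ceil(sigma ** 2 / (delta * eps ** 2))

def n_gaussian(sigma, eps):
    # n >= (1.96 sigma / eps)^2, i.e. ~3.84 sigma^2 / eps^2
    return math.ceil((1.96 * sigma / eps) ** 2)

print(n_chebyshev(sigma=1.0, eps=0.1))  # 2000
print(n_gaussian(sigma=1.0, eps=0.1))   # 385 -- about 5x fewer samples
```

Knowing the distribution (rather than only its variance) buys roughly a 5x saving in samples.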
Recall: The sample complexity of an estimator is the number of samples required to guarantee an error of at most $\epsilon$ with probability at least $1 - \delta$, for given $\epsilon$ and $\delta$.
Definition: The mean squared error of an estimator $\hat{X}$ of a quantity $X$ is
$$\mathrm{MSE}(\hat{X}) = \mathbb{E}\left[(\hat{X} - \mathbb{E}[X])^2\right].$$
Note that $\mathbb{E}[\hat{X}]$ and $\mathbb{E}[X]$ may be different!

Writing $\mu = \mathbb{E}[X]$ and $b = \mathrm{Bias}(\hat{X}) = \mathbb{E}[\hat{X}] - \mu$:
$$\begin{aligned}
\mathrm{MSE}(\hat{X}) &= \mathbb{E}[(\hat{X} - \mathbb{E}[X])^2] = \mathbb{E}[(\hat{X} - \mu)^2] \\
&= \mathbb{E}[(\hat{X} - \mathbb{E}[\hat{X}] + \mathbb{E}[\hat{X}] - \mu)^2] && (-\mathbb{E}[\hat{X}] + \mathbb{E}[\hat{X}] = 0) \\
&= \mathbb{E}[((\hat{X} - \mathbb{E}[\hat{X}]) + b)^2] \\
&= \mathbb{E}[(\hat{X} - \mathbb{E}[\hat{X}])^2 + 2b(\hat{X} - \mathbb{E}[\hat{X}]) + b^2] \\
&= \mathbb{E}[(\hat{X} - \mathbb{E}[\hat{X}])^2] + \mathbb{E}[2b(\hat{X} - \mathbb{E}[\hat{X}])] + \mathbb{E}[b^2] && \text{(linearity of } \mathbb{E}\text{)} \\
&= \mathbb{E}[(\hat{X} - \mathbb{E}[\hat{X}])^2] + 2b\,\mathbb{E}[\hat{X} - \mathbb{E}[\hat{X}]] + b^2 && \text{(constants come out of } \mathbb{E}\text{)} \\
&= \mathrm{Var}[\hat{X}] + 2b\,(\mathbb{E}[\hat{X}] - \mathbb{E}[\hat{X}]) + b^2 && \text{(linearity of } \mathbb{E}\text{)} \\
&= \mathrm{Var}[\hat{X}] + b^2 \\
&= \mathrm{Var}[\hat{X}] + \mathrm{Bias}(\hat{X})^2. \qquad \blacksquare
\end{aligned}$$
Sometimes a biased estimator can be closer to the estimated quantity than an unbiased one: if adding bias reduces the variance by more than the squared bias it adds, then the error goes down! Typically, we bias the estimator toward what we believe to be true (based on prior information).
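Here is a quick Monte Carlo check (a sketch of my own, not from the slides) that the decomposition holds, using the biased estimator introduced in the next example:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 0.1, 1.0, 10
X = rng.normal(mu, sigma, size=(1_000_000, n))  # 1,000,000 replications of n samples
Y = X.sum(axis=1) / (n + 100)                   # the biased estimator from the example below

mse = np.mean((Y - mu) ** 2)
bias = np.mean(Y) - mu
var = np.var(Y)
print(mse, var + bias ** 2)  # the two agree up to simulation noise
```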
Example: Let's estimate $\mu$ given $X_1, \ldots, X_n$ i.i.d. with $\mathbb{E}[X_i] = \mu$, using
$$Y = \frac{1}{n+100} \sum_{i=1}^n X_i.$$
This estimator is biased:
$$\mathbb{E}[Y] = \mathbb{E}\left[\frac{1}{n+100}\sum_{i=1}^n X_i\right] = \frac{1}{n+100}\sum_{i=1}^n \mathbb{E}[X_i] = \frac{n}{n+100}\,\mu$$
$$\mathrm{Bias}(Y) = \frac{n}{n+100}\,\mu - \mu = \frac{-100}{n+100}\,\mu.$$
This estimator has low variance (using independence to split the variance of the sum):
$$\mathrm{Var}(Y) = \mathrm{Var}\left[\frac{1}{n+100}\sum_{i=1}^n X_i\right] = \frac{1}{(n+100)^2}\,\mathrm{Var}\left[\sum_{i=1}^n X_i\right] = \frac{1}{(n+100)^2}\sum_{i=1}^n \mathrm{Var}[X_i] = \frac{n}{(n+100)^2}\,\sigma^2.$$
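A short sketch (mine, not from the slides) of these closed forms as functions of $n$; note that both the bias and the variance vanish as $n$ grows:

```python
def bias_Y(n, mu):
    return -100.0 / (n + 100) * mu          # Bias(Y) = -100 mu / (n + 100)

def var_Y(n, sigma):
    return n / (n + 100) ** 2 * sigma ** 2  # Var(Y) = n sigma^2 / (n + 100)^2

for n in [10, 100, 1000, 10000]:
    print(n, bias_Y(n, mu=1.0), var_Y(n, sigma=1.0))
# Both terms -> 0 as n -> infinity, so Y still converges to mu despite its bias.
```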
Example: Suppose that $\sigma = 1$, $n = 10$, and $\mu = 0.1$. Since $\mathrm{Bias}(\bar{X}) = 0$ and $\mathrm{Var}(\bar{X}) = \sigma^2/n$:
$$\mathrm{MSE}(\bar{X}) = \mathrm{Var}(\bar{X}) + \mathrm{Bias}(\bar{X})^2 = \mathrm{Var}(\bar{X}) = \frac{1}{10}$$
$$\mathrm{MSE}(Y) = \mathrm{Var}(Y) + \mathrm{Bias}(Y)^2 = \frac{n}{(n+100)^2}\,\sigma^2 + \left(\frac{100}{n+100}\,\mu\right)^2 = \frac{10}{110^2} + \left(\frac{100}{110} \cdot 0.1\right)^2 \approx 9.1 \times 10^{-3}.$$
Example: Suppose that $\sigma = 1$, $n = 10$, and $\mu = 5$.
$$\mathrm{MSE}(\bar{X}) = \mathrm{Var}(\bar{X}) + \mathrm{Bias}(\bar{X})^2 = \mathrm{Var}(\bar{X}) = \frac{1}{10}$$
$$\mathrm{MSE}(Y) = \mathrm{Var}(Y) + \mathrm{Bias}(Y)^2 = \frac{n}{(n+100)^2}\,\sigma^2 + \left(\frac{-100}{n+100}\,\mu\right)^2 = \frac{10}{110^2} + \left(-\frac{100}{110} \cdot 5\right)^2 \approx 20.66.$$
Whoa! What went wrong?
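To see the contrast concretely, a small sketch reproducing both scenarios with the formulas above:

```python
def mse_xbar(n, sigma):
    return sigma ** 2 / n  # unbiased, so MSE = Var

def mse_Y(n, sigma, mu):
    return n / (n + 100) ** 2 * sigma ** 2 + (100.0 / (n + 100) * mu) ** 2

print(mse_xbar(10, 1.0))       # 0.1 in both scenarios
print(mse_Y(10, 1.0, mu=0.1))  # ~0.0091: Y wins when mu is close to 0
print(mse_Y(10, 1.0, mu=5.0))  # ~20.66: Y loses badly when mu is far from 0
# Y shrinks its estimates toward 0, which helps only when the truth is near 0.
```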
Summary:
- The sample complexity of an estimator is the number of samples needed to guarantee a given error bound $\epsilon$ at a desired probability $1 - \delta$.
- The mean squared error of an estimator decomposes into its bias and variance.