Tsybakov noise adaptive margin-based active learning, Aarti Singh


Tsybakov noise adaptive margin-based active learning. Aarti Singh, A. Nico Habermann Associate Professor. NIPS workshop on Learning Faster from Easy Data.

SLIDE 1

Tsybakov noise adaptive margin-based active learning

Aarti Singh

A. Nico Habermann Associate Professor

NIPS workshop on Learning Faster from Easy Data II, Dec 11, 2015

SLIDE 2

Passive Learning

SLIDE 3

Active Learning

[Figure: unlabeled points $(X_i, ?)$, $(X_j, ?)$ and labeled points $(X_i, Y_i)$, $(X_j, Y_j)$.]

SLIDE 4

Streaming setting

SLIDE 5

Streaming setting

The algorithm obtains $X_t$ sampled iid from the marginal distribution $P_X$.
Based on previous labeled and unlabeled data, the algorithm decides whether or not to accept $X_t$ and query its label.
If the label is queried, the algorithm receives $Y_t$ sampled iid from the conditional distribution $P(Y \mid X = X_t)$.
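The streaming protocol above can be sketched as a small loop. This is only an illustration: the function names (`stream_active_learning`, `sample_x`, `query_label`, `should_query`), the toy marginal on a square, the noise level, and the fixed query rule are all assumptions made for the example and do not come from the slides.

```python
import numpy as np

def stream_active_learning(sample_x, query_label, should_query, n_rounds):
    """Generic streaming active-learning loop.

    sample_x():      draws one X_t from the marginal P_X
    query_label(x):  draws Y_t from P(Y | X = x); called only when querying
    should_query(x, labeled, unlabeled): the learner's accept/reject rule
    """
    labeled, unlabeled = [], []
    for _ in range(n_rounds):
        x = sample_x()
        if should_query(x, labeled, unlabeled):
            labeled.append((x, query_label(x)))   # pay for one label
        else:
            unlabeled.append(x)                   # keep the point, no label
    return labeled, unlabeled

rng = np.random.default_rng(0)
w_star = np.array([1.0, 0.0])    # hypothetical Bayes-optimal direction
w_guess = np.array([0.9, 0.1])   # hypothetical current classifier

def sample_x():
    return rng.uniform(-1.0, 1.0, size=2)         # toy marginal (a square)

def query_label(x):
    p_plus = 0.5 + 0.4 * np.sign(w_star @ x)      # P(Y=+1 | x) is 0.9 or 0.1
    return 1 if rng.random() < p_plus else -1

def near_boundary(x, labeled, unlabeled):
    return abs(w_guess @ x) <= 0.2                # illustrative query rule

labeled, unlabeled = stream_active_learning(
    sample_x, query_label, near_boundary, n_rounds=200)
```

Only points accepted by the query rule cost a label; everything else stays in the unlabeled pool.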

SLIDE 6

Problem setup

$X$ is $d$-dimensional, $P_X$ is uniform (or log-concave + isotropic)
Binary classification: labels $Y \in \{+1, -1\}$
Homogeneous linear classifiers $\mathrm{sign}(w \cdot X)$ with $\|w\|_2 = 1$

$\mathrm{err}(w) = P(\mathrm{sign}(w \cdot X) \ne Y)$

Bayes optimal classifier is linear, $w^*$:

$\arg\max_Y P(Y \mid X) = \mathrm{sign}(w^* \cdot X)$
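The quantities in this setup can be estimated numerically. The sketch below is an illustration under stated assumptions: the helper names (`sample_unit_ball`, `empirical_error`, `angle`) are hypothetical, $P_X$ is taken to be uniform on the unit ball, and the labels in the usage lines are noiseless, which is a simplification and not the noise model of the talk.

```python
import numpy as np

def sample_unit_ball(n, d, rng):
    """Draw n points uniformly from the d-dimensional unit ball."""
    g = rng.standard_normal((n, d))
    g /= np.linalg.norm(g, axis=1, keepdims=True)   # uniform direction
    radius = rng.random((n, 1)) ** (1.0 / d)        # radial density ~ r^(d-1)
    return g * radius

def empirical_error(w, X, Y):
    """Fraction of points where sign(w . X) disagrees with Y."""
    return float(np.mean(np.sign(X @ w) != Y))

def angle(w1, w2):
    """Angle theta(w1, w2) in [0, pi] between two nonzero vectors."""
    c = (w1 @ w2) / (np.linalg.norm(w1) * np.linalg.norm(w2))
    return float(np.arccos(np.clip(c, -1.0, 1.0)))

rng = np.random.default_rng(0)
X = sample_unit_ball(20000, 3, rng)
w_star = np.array([1.0, 0.0, 0.0])
w = np.array([0.0, 1.0, 0.0])
Y = np.sign(X @ w_star)                             # noiseless labels
print(angle(w, w_star), empirical_error(w, X, Y))
```

With noiseless labels and a rotationally symmetric $P_X$, the error of $w$ equals $\theta(w, w^*)/\pi$, so the two printed numbers should be close to $\pi/2$ and $1/2$.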

SLIDE 7

Tsybakov Noise Condition

For all linear classifiers $w$ with $\|w\|_2 = 1$:

$\mu \, \theta(w, w^*)^{\kappa} \le \mathrm{err}(w) - \mathrm{err}(w^*)$

where $\kappa \in [1, \infty)$ is the TNC exponent and $0 < \mu < \infty$ is a constant.

$\kappa$ characterizes the noise in the label distribution.
$\kappa$ makes the problem easy or hard: small $\kappa$ implies an easier problem.

[Figure: classifiers $w$ and $w^*$ with the angle $\theta(w, w^*)$ between them; a plot over $X$ with tick marks 0, 0.5, 1.]

SLIDE 8

Minimax active learning rates

If the Tsybakov Noise Condition (TNC) holds, then the minimax optimal active learning rate is

$\mathbb{E}[\mathrm{err}(w_T) - \mathrm{err}(w^*)] = \tilde{O}\big((d/T)^{\kappa/(2\kappa-2)}\big)$

$\kappa = \infty$: passive rate $1/\sqrt{T}$
$\kappa = 1$: exponential rate $e^{-T}$

Lower bound: Castro-Nowak'06 ($d = 1$), Hanneke-Yang'14 ($d$, $P_X$), Singh-Wang'14 ($d$, lower-bounded/uniform $P_X$)
Upper bound (margin-based active learning): Balcan-Broder-Zhang'07 (uniform $P_X$), Balcan-Long'13 (log-concave + isotropic $P_X$)

Algorithms need to know $\kappa$!!

Model selection for active learning: can we adapt to easy cases, while being robust to the worst case?
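As a quick numerical check of the exponent $\kappa/(2\kappa-2)$ in the rate above, the sketch below evaluates it for a few values of $\kappa$. The function name `rate_exponent` is hypothetical, and treating $\kappa = \infty$ as the limiting value $1/2$ is a convention chosen for this example.

```python
import math

def rate_exponent(kappa):
    """Exponent kappa / (2*kappa - 2) applied to (d/T) in the rate, for kappa > 1."""
    if kappa <= 1:
        raise ValueError("exponent is unbounded at kappa = 1 (exponential-rate regime)")
    if math.isinf(kappa):
        return 0.5                      # limit as kappa -> infinity: passive rate
    return kappa / (2.0 * kappa - 2.0)

for kappa in (1.5, 2.0, 4.0, math.inf):
    print(kappa, rate_exponent(kappa))
```

The exponent decreases from arbitrarily large values near $\kappa = 1$ to $1/2$ as $\kappa \to \infty$, matching the two extreme cases listed on the slide.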

SLIDE 9

Margin-based active learning

Input: desired accuracy $\varepsilon$, failure probability $\delta$
Initialize: $E$; for $e = 1, \dots, E$: epoch budgets $T_e$, search radii $R_e$, acceptance regions $b_e$, precision values $\varepsilon_e$; random classifier $w_0$

For $e = 1, \dots, E$
    While labeled examples $< T_e$
        Obtain a sample $X_t$ from $P_X$
        If $|w_{e-1} \cdot X_t| \le b_{e-1}$, query label $Y_t$
    end

Balcan-Broder-Zhang'07

[Figure: classifier $w_{e-1}$ with acceptance region $b_{e-1}$; unlabeled points $X$ and labeled points $(X, Y)$.]

SLIDE 10

Margin-based active learning

Input: desired accuracy $\varepsilon$, failure probability $\delta$
Initialize: $E$; for $e = 1, \dots, E$: epoch budgets $T_e$, search radii $R_e$, acceptance regions $b_e$, precision values $\varepsilon_e$; random classifier $w_0$

For $e = 1, \dots, E$
    While labeled examples $< T_e$
        Obtain a sample $X_t$ from $P_X$
        If $|w_{e-1} \cdot X_t| \le b_{e-1}$, query label $Y_t$
    end
    Find $w_e$ that (approximately) minimizes training error up to precision $\varepsilon_e$ on the $T_e$ labeled examples among all $w$ s.t. $\theta(w, w_{e-1}) \le R_{e-1}$

Output: $w_T = w_E$

Balcan-Broder-Zhang'07

[Figure: classifiers $w_{e-1}$ and $w_e$ with search radius $R_{e-1}$.]

SLIDE 11

Margin-based active learning

Input: desired accuracy $\varepsilon$, failure probability $\delta$
Initialize: $E$; for $e = 1, \dots, E$: epoch budgets $T_e$, search radii $R_e$, acceptance regions $b_e$, precision values $\varepsilon_e$; random classifier $w_0$

For $e = 1, \dots, E$
    While labeled examples $< T_e$
        Obtain a sample $X_t$ from $P_X$
        If $|w_{e-1} \cdot X_t| \le b_{e-1}$, query label $Y_t$
    end
    Find $w_e$ that (approximately) minimizes training error up to precision $\varepsilon_e$ on the $T_e$ labeled examples among all $w$ s.t. $\theta(w, w_{e-1}) \le R_{e-1}$

Output: $w_T = w_E$

Balcan-Broder-Zhang'07

The parameters $E$, $T_e$, $R_e$, $b_e$, $\varepsilon_e$ all depend on $\kappa$.

SLIDE 12

Adaptive margin-based active learning

Input: query budget $T$, failure probability $\delta$, shrink rate $r$
Initialize: $E = \log \sqrt{T}$; for $e = 1, \dots, E$: epoch budgets $T_e = T/E$; search radius $R_0 = \pi$, acceptance region $b_0 = \infty$; random classifier $w_0$

For $e = 1, \dots, E$
    While labeled examples $< T_e$
        Obtain a sample $X_t$ from $P_X$
        If $|w_{e-1} \cdot X_t| \le b_{e-1}$, query label $Y_t$
    end
    Find $w_e$ that (approximately) minimizes training error on the $T_e$ labeled examples among all $w$ s.t. $\theta(w, w_{e-1}) \le R_{e-1}$
    $R_e = r R_{e-1}$; $b_e = 2 R_e \sqrt{E (1 + \log(1/r)) / d}$

Output: $w_T = w_E$

No knowledge of $\kappa$
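The adaptive procedure on this slide can be sketched in code as follows. This is a rough illustration, not the authors' implementation. Several pieces are assumptions: the logarithms are read as base 2, $E$ is rounded down to an integer, the constrained error minimization is replaced by a random search over candidate directions within angle $R_{e-1}$ of $w_{e-1}$ (a heuristic stand-in), and the names `adaptive_margin_al`, `candidates_near`, `n_cand`, `max_draws` and the 10% label-flip noise in the usage lines are all hypothetical.

```python
import math
import numpy as np

def sample_unit_ball(d, rng):
    """One draw from the uniform distribution on the d-dim unit ball."""
    g = rng.standard_normal(d)
    return g / np.linalg.norm(g) * rng.random() ** (1.0 / d)

def candidates_near(w_prev, radius, n_cand, rng):
    """w_prev plus random unit vectors at angle <= radius from it."""
    cands = [w_prev]
    for _ in range(n_cand - 1):
        g = rng.standard_normal(w_prev.size)
        u = g - (g @ w_prev) * w_prev          # direction orthogonal to w_prev
        u /= np.linalg.norm(u)
        phi = rng.uniform(0.0, min(radius, math.pi))
        cands.append(math.cos(phi) * w_prev + math.sin(phi) * u)
    return np.array(cands)

def adaptive_margin_al(sample_x, query_label, d, T, r=0.5,
                       n_cand=2000, max_draws=10**6, seed=0):
    rng = np.random.default_rng(seed)
    E = max(1, int(math.log2(math.sqrt(T))))   # E = log sqrt(T); base 2 assumed
    T_e = T // E                               # epoch budget T/E
    R, b = math.pi, math.inf                   # R_0 = pi, b_0 = infinity
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)                     # random classifier w_0
    used = 0
    for _ in range(E):
        X, Y, draws = [], [], 0
        while len(X) < T_e and draws < max_draws:
            x = sample_x()
            draws += 1
            if abs(w @ x) <= b:                # query only inside the margin
                X.append(x)
                Y.append(query_label(x))
        if not X:                              # no label obtained: stop early
            break
        used += len(X)
        X, Y = np.array(X), np.array(Y)
        cands = candidates_near(w, R, n_cand, rng)
        errs = np.mean(np.sign(X @ cands.T) != Y[:, None], axis=0)
        w = cands[int(np.argmin(errs))]        # heuristic constrained ERM
        R = r * R                              # R_e = r * R_{e-1}
        b = 2.0 * R * math.sqrt(E * (1.0 + math.log2(1.0 / r)) / d)
    return w, used

# Toy usage: Bayes classifier along the first axis, labels flipped w.p. 0.1.
data_rng = np.random.default_rng(1)
d = 5
w_star = np.zeros(d)
w_star[0] = 1.0

def sample_x():
    return sample_unit_ball(d, data_rng)

def query_label(x):
    y = 1 if w_star @ x >= 0 else -1
    return -y if data_rng.random() < 0.1 else y

w_hat, used = adaptive_margin_al(sample_x, query_label, d=d, T=400)
print(used, float(w_hat @ w_star))
```

Note that none of the parameters depend on $\kappa$ or $\mu$: the epoch count, budgets, radii and margins are fixed by $T$, $d$ and $r$ alone. For a budget this small the margin $b_e$ stays close to or above 1, so almost every point of the unit ball is accepted; the acceptance region only starts to bite for larger $T$.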

SLIDE 13

Adaptive margin-based active learning

Let $T \ge 4$, $d \ge 4$, $r \in (0, 1/2)$, let $P_X$ be uniform on the $d$-dimensional unit ball, and let $P_{Y|X}$ satisfy TNC($\mu$, $\kappa$). Then the streaming adaptive active learning algorithm achieves, with probability $\ge 1 - \delta$,

$\mathrm{err}(w_T) - \mathrm{err}(w^*) = \tilde{O}\Big(\big((d + \log(1/\delta))/T\big)^{\kappa/(2\kappa-2)}\Big)$

for all $1 + 1/\log(1/r) \le \kappa < \infty$.

Minimax optimal rate without knowing $\mu$, $\kappa$, up to log factors!!
Adapt to easy cases, while being robust to the worst case!
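The lower end of the admissible range, $1 + 1/\log(1/r)$, depends only on the shrink rate $r$. The sketch below evaluates it, assuming the logarithm is base 2; that reading is an assumption, chosen because it makes $r = 1/2$ correspond to $\kappa \ge 2$, the range used in the proof sketch. The function name `smallest_adaptive_kappa` is hypothetical, and it accepts any $r$ in $(0, 1)$ purely for illustration.

```python
import math

def smallest_adaptive_kappa(r):
    """Lower end 1 + 1/log2(1/r) of the kappa range for shrink rate r in (0, 1)."""
    if not 0.0 < r < 1.0:
        raise ValueError("shrink rate r must lie in (0, 1)")
    return 1.0 + 1.0 / math.log2(1.0 / r)

for r in (0.5, 0.25, 1.0 / 1024.0):
    print(r, smallest_adaptive_kappa(r))
```

A smaller shrink rate widens the range of $\kappa$ the guarantee covers, but the range never reaches $\kappa = 1$ for any fixed $r > 0$.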

SLIDE 14

Why does it work? (proof sketch)

Consider shrink rate $r = 1/2$. We will argue adaptivity to $\kappa \in [2, \infty)$.

Let $w_e^*$ denote the best linear classifier among all $w$ s.t. $\theta(w, w_{e-1}) \le R_{e-1}$ in acceptance region $b_{e-1}$.

For all $e$, with high probability, $\mathrm{err}(w_e) - \mathrm{err}(w_e^*) = \tilde{O}\big(R_{e-1} (d/T)^{1/2}\big)$ (passive rate).
For $e = 1$, we have $(d/T)^{1/2}$.
For $e = E$, we have $d/T$, since $R_E = R_0/2^E = R_0/\sqrt{T}$ (but $w_E^* \ne w^*$).

Therefore, there exists an epoch $e'$ s.t. with high probability $\mathrm{err}(w_{e'}) - \mathrm{err}(w_{e'}^*) = \tilde{O}\big((d/T)^{\kappa/(2\kappa-2)}\big)$.
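The identity for the final search radius follows directly from the update $R_e = r R_{e-1}$ with $r = 1/2$ and $E = \log \sqrt{T}$, reading the logarithm as base 2 (an assumption; the slides do not state the base):

```latex
R_E \;=\; r^{E} R_0 \;=\; 2^{-E} R_0 \;=\; 2^{-\log_2 \sqrt{T}}\, R_0 \;=\; \frac{R_0}{\sqrt{T}},
\qquad R_0 = \pi .
```

So the search radius shrinks geometrically from $\pi$ in the first epoch to order $1/\sqrt{T}$ in the last.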

SLIDE 15

Why does it work? (proof sketch)

Consider shrink rate $r = 1/2$. We will argue adaptivity to $\kappa \in [2, \infty)$.

Let $w_e^*$ denote the best linear classifier among all $w$ s.t. $\theta(w, w_{e-1}) \le R_{e-1}$ in acceptance region $b_{e-1}$.

Therefore, there exists an epoch $e'$ s.t. with high probability $\mathrm{err}(w_{e'}) - \mathrm{err}(w_{e'}^*) = \tilde{O}\big((d/T)^{\kappa/(2\kappa-2)}\big)$.

Also, $w_{e'}^* = w^*$ (using the same argument as Balcan-Broder-Zhang'07).

Point of departure: they ensure $w_e^* = w^*$ for all $e$; we allow $w_e^* \ne w^*$ for all $e \ge e'$.

SLIDE 16

Why does it work? (proof sketch)

Let $w_e^*$ denote the best linear classifier among all $w$ s.t. $\theta(w, w_{e-1}) \le R_{e-1}$ in acceptance region $b_{e-1}$.

[Figure: three panels showing $w_1^* = w^*$ with $w_0$, $w_2^* = w^*$ with $w_1$, and $w_{e'}^* = w^*$ with $w_{e'}$.]

There exists an epoch $e'$ s.t. $\mathrm{err}(w_{e'}) - \mathrm{err}(w_{e'}^*) = \tilde{O}\big((d/T)^{\kappa/(2\kappa-2)}\big)$.

SLIDE 17

Why does it work? (proof sketch)

Let $w_e^*$ denote the best linear classifier among all $w$ s.t. $\theta(w, w_{e-1}) \le R_{e-1}$ in acceptance region $b_{e-1}$.

[Figure: two panels, one showing $w_{e'}^* = w^*$ with $w_{e'}$, the other showing $w_e^* \ne w^*$ with $w_{e'}$ and $w_e$.]

There exists an epoch $e'$ s.t. $\mathrm{err}(w_{e'}) - \mathrm{err}(w_{e'}^*) = \tilde{O}\big((d/T)^{\kappa/(2\kappa-2)}\big)$.

For all epochs $e \ge e'$, $w_e$ stays close to $w_{e'}$.

SLIDE 18

Why does it work? (proof sketch)

Consider shrink rate $r = 1/2$. We will argue adaptivity to $\kappa \in [2, \infty)$.

Let $w_e^*$ denote the best linear classifier among all $w$ s.t. $\theta(w, w_{e-1}) \le R_{e-1}$ in acceptance region $b_{e-1}$.

Therefore, there exists an epoch $e'$ s.t. with high probability $\mathrm{err}(w_{e'}) - \mathrm{err}(w_{e'}^*) = \tilde{O}\big((d/T)^{\kappa/(2\kappa-2)}\big)$.

Also, $w_{e'}^* = w^*$ (using the same argument as Balcan-Broder-Zhang'07).

For all epochs $e \ge e'$: $\mathrm{err}(w_e) - \mathrm{err}(w_{e'}) = \tilde{O}\big((d/T)^{\kappa/(2\kappa-2)}\big)$.

Therefore, $\mathrm{err}(w_E) - \mathrm{err}(w^*) = \tilde{O}\big((d/T)^{\kappa/(2\kappa-2)}\big)$.

SLIDE 19

Adaptive margin-based active learning

Let $T \ge 4$, $d \ge 4$, $r \in (0, 1/2)$, let $P_X$ be log-concave and isotropic on the $d$-dimensional unit ball, and let $P_{Y|X}$ satisfy TNC($\mu$, $\kappa$). Then the streaming adaptive active learning algorithm with $b_e = C R_e \log T$ achieves, with probability $\ge 1 - \delta$,

$\mathrm{err}(w_T) - \mathrm{err}(w^*) = \tilde{O}\Big(\big((d + \log(1/\delta))/T\big)^{\kappa/(2\kappa-2)}\Big)$

for all $1 + 1/\log(1/r) \le \kappa < \infty$.

Minimax optimal rate without knowing $\mu$, $\kappa$, up to log factors!!
Adapt to easy cases, while being robust to the worst case!

SLIDE 20

Limitations/Open questions

Constants become large as $\kappa$ tends to 1; log factors:
    $\mu^{-1/(\kappa-1)} \, r^{-(\kappa-2)/(\kappa-1)}$

How to adapt to $\kappa = 1$? The current guarantee covers
    $1 + 1/\log(1/r) \le \kappa < \infty$

Adaptive active learning given desired accuracy $\varepsilon$ (instead of query budget $T$)

Agnostic setting (Bayes optimal classifier not in hypothesis space)

SLIDE 21

Related work

Juditsky-Nesterov'14: adaptive stochastic optimization of uniformly convex functions ($\kappa \ge 2$)

Our analysis extends to achieve adaptive optimization of TNC functions ($\kappa \ge 1$):

$f(x) - f(x^*) \ge \lambda \|x - x^*\|^{\kappa}$

For $d = 1$:

$\|f(\hat{x}) - f(x^*)\| \asymp T^{-\kappa/(2\kappa-2)}$

Rates exactly the same as 1-dim active learning!

[Figure: $f(X)$ plotted against $X$, with minimizer $x^*$.]

SLIDE 22

Related work

Same algorithm also studied by Awasthi et al.'14 for a different question:
    Maximum amount of adversarial noise tolerated by the algorithm for constant excess risk and polylog sample complexity (exponential rate for error)

We study convergence of excess risk to zero with increasing samples, not restricted to be polylog.

SLIDE 23

References + Acknowledgements

Yining Wang, Aaditya Ramdas

Noise-adaptive Margin-based Active Learning and Lower Bounds under Tsybakov Noise Condition, AAAI'16 (to appear).
Algorithmic connections between active learning and stochastic convex optimization, ALT'13.
Optimal rates for stochastic convex optimization under Tsybakov noise condition, ICML'13.
