Weak Supervision, noisy labels, and error propagation
Marat Freytsis
hep-ai journal club — December 11, 2018
based on Yu et al. [arXiv:1402.5902], Cohen, MF, Ostdiek [arXiv:1706.09451] + bits of others
Why weak supervision?
Fully supervised learning needs per-event labels $y_n$. Weak supervision instead trains on bags of events labeled only by their class proportions, $f_i(\tilde y) = \frac{1}{r}\sum_{n=1}^{r} I_{y_n = i}$ for a bag of $r$ events. From a set of $m$ such bags, we want to recover a per-event classifier.
For experimental measurements, $f_i(\tilde y)$ has a natural interpretation as, e.g., a rate or cross-section measurement/calculation, even if individual events cannot be perfectly separated by their features.
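To make the bag construction concrete, here is a minimal sketch (illustrative, not from the talk; the toy data and helper names are my own) of turning individually labeled events into proportion-labeled bags:

import numpy as np

rng = np.random.default_rng(0)

def make_bags(x, y, r):
    """Group events into bags of size r, keeping only each bag's
    label proportion; the per-event labels are discarded."""
    idx = rng.permutation(len(x))
    bags = []
    for start in range(0, len(x) - r + 1, r):
        sel = idx[start:start + r]
        # f(y~) = (1/r) sum_n I[y_n = 1]: the only label information kept
        bags.append((x[sel], y[sel].mean()))
    return bags

# toy data: two overlapping Gaussian classes
N = 10_000
y = rng.integers(0, 2, N)
x = rng.normal(loc=y[:, None], scale=1.0, size=(N, 2))
bags = make_bags(x, y, r=50)
print(len(bags), bags[0][1])  # number of bags, first bag's signal fraction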
Given two mixed samples $A$ and $B$ with different signal fractions $f_{A,1} \neq f_{B,1}$, the pure distributions can be recovered by inverting the mixtures,

$p_S = \dfrac{(1 - f_{B,1})\, p_A - (1 - f_{A,1})\, p_B}{f_{A,1} - f_{B,1}}$,

so everything is controlled by the fraction gap $f_{A,1} - f_{B,1}$.
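The inversion is just linear algebra on the two mixtures; a quick numerical check (illustrative code, with toy Gaussian densities standing in for the real distributions):

import numpy as np

fA, fB = 0.7, 0.3        # signal fractions of the two mixed samples
grid = np.linspace(-4, 4, 201)

def gauss(x, mu):
    return np.exp(-(x - mu) ** 2 / 2) / np.sqrt(2 * np.pi)

pS, pBkg = gauss(grid, 1.0), gauss(grid, -1.0)  # true pure densities
pA = fA * pS + (1 - fA) * pBkg                  # mixed sample A
pB = fB * pS + (1 - fB) * pBkg                  # mixed sample B

# invert the 2x2 mixing: p_S = [(1 - fB) pA - (1 - fA) pB] / (fA - fB)
pS_rec = ((1 - fB) * pA - (1 - fA) * pB) / (fA - fB)
print(np.max(np.abs(pS_rec - pS)))  # ~0 up to float rounding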
Two ways to train on proportion-labeled data:

LLP (match the batch-averaged output to the proportion):
$\hat h = \arg\min_{h \in \mathcal H} \sum_{\text{batches}} \ell\big(\langle h(x_i)\rangle_{\text{batch}},\, f(\tilde y)\big)$

CWoLa-style (use the bag's label directly as a per-event target):
$\hat h = \arg\min_{h \in \mathcal H} \sum_i \ell\big(h(x_i),\, f(\tilde y)\big)$
Metodiev et al. [arXiv:1708.02949]
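Schematically, the two objectives differ only in where the average is taken. A hedged sketch (squared-error loss as a stand-in for whatever loss the network actually uses; bags is a list of (features, proportion) pairs as built above):

def llp_loss(h, bags):
    """LLP: compare the batch-averaged output to the bag proportion."""
    return sum((h(xb).mean() - f) ** 2 for xb, f in bags)

def cwola_loss(h, bags):
    """CWoLa-style: use the bag's label as a per-event target."""
    return sum(((h(xb) - f) ** 2).sum() for xb, f in bags)

Here h maps an (r, d) array of bag features to r scores; only llp_loss actually needs the events grouped into bags.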
Given mixed samples $M_1$ and $M_2$ defined in terms of pure samples $S$ and $B$ with signal fractions $f_1 > f_2$, an optimal classifier trained to distinguish $M_1$ from $M_2$ is also optimal for distinguishing $S$ from $B$.
The optimal classifier to distinguish examples drawn from $p_{M_1}$ and $p_{M_2}$ is the likelihood ratio $L_{M_1/M_2}(x) = p_{M_1}(x)/p_{M_2}(x)$. Similarly, the optimal classifier to distinguish examples drawn from $p_S$ and $p_B$ is the likelihood ratio $L_{S/B}(x) = p_S(x)/p_B(x)$. Where $p_B$ has support, we can relate these two likelihood ratios algebraically:

$L_{M_1/M_2} = \dfrac{p_{M_1}}{p_{M_2}} = \dfrac{f_1\, p_S + (1 - f_1)\, p_B}{f_2\, p_S + (1 - f_2)\, p_B} = \dfrac{f_1\, L_{S/B} + (1 - f_1)}{f_2\, L_{S/B} + (1 - f_2)}$,

which is a monotonically increasing rescaling of the likelihood $L_{S/B}$ as long as $f_1 > f_2$, since

$\partial_{L_{S/B}} L_{M_1/M_2} = \dfrac{f_1 - f_2}{(f_2\, L_{S/B} - f_2 + 1)^2} > 0$.

If $f_1 < f_2$, then one obtains the reversed classifier.
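The monotonicity is easy to check numerically; a quick illustrative verification (not from the paper):

import numpy as np

f1, f2 = 0.8, 0.2                  # f1 > f2
L_SB = np.linspace(0, 50, 10_000)  # likelihood ratio p_S / p_B >= 0

# L_{M1/M2} as a function of L_{S/B}
L_M = (f1 * L_SB + (1 - f1)) / (f2 * L_SB + (1 - f2))

assert np.all(np.diff(L_M) > 0)    # strictly increasing rescaling
print(L_M[0], L_M[-1])             # runs from (1-f1)/(1-f2) toward f1/f2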
For the analysis, define the bag-level predictor $\varphi_r(h) = \frac{1}{r}\sum_{n=1}^{r} h(x_n)$, the mean classifier output over a bag, and train by minimizing the bag-level loss over $h \in \mathcal H$ (the LLP objective above).
The quantity to control is the expected bag-level risk,

$G(h) = \mathbb E_{(\tilde x, \tilde y)}\, \ell\big(\varphi_r(h),\, f(\tilde y)\big)$.

A uniform-convergence argument gives $G(h) \leq \mathrm{err}_\ell(h) + \epsilon$, the empirical bag-level error plus $\epsilon$, if the number of bags $m$ is large enough.
For this proof and the following, see arXiv:1402.5902.
Translating back to per-event performance: if $G(h) \leq \epsilon$ with probability $1 - \delta$, and each bag is at least $(1 - \eta)$-pure with probability $1 - \rho$, the induced instance-level classifier has accuracy at least

$\tfrac{1}{2}\big[1 + (1 - \delta - \rho)(1 - 2\eta - \epsilon)\big]$.
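To get a feel for the numbers, plug representative values into the bound as written above (my arithmetic on the reconstructed expression, for illustration only): with $\eta = 0.1$, $\epsilon = 0.05$, and $\delta = \rho = 0.05$,

$\mathrm{acc} \geq \tfrac{1}{2}\big[1 + (0.90)(0.75)\big] = \tfrac{1}{2}(1.675) \approx 0.84$,

so even moderately impure bags and loose convergence still pin the instance-level accuracy well above chance.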
The general answer becomes quite involved in this case, and I won’t attempt to reproduce it.
[Figure: family of curves for bag sizes r = 10, 15, ..., 50 and r = 100; both axes run from 0.2 to 1.]
The same fraction gap controls error propagation: an uncertainty on the input fractions is amplified by $1/(f_{A,1} - f_{B,1})$ in everything extracted from the inversion.
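The blow-up is visible directly in the inversion; an illustrative sketch (my toy setup, reusing the Gaussian densities from before) of how a fixed fraction error $\delta$ is amplified as the fraction gap shrinks:

import numpy as np

grid = np.linspace(-4, 4, 201)
gauss = lambda x, mu: np.exp(-(x - mu) ** 2 / 2) / np.sqrt(2 * np.pi)
pS, pBkg = gauss(grid, 1.0), gauss(grid, -1.0)

def recovery_error(fA, fB, delta):
    """Peak error on p_S when the assumed fraction is fA + delta."""
    pA = fA * pS + (1 - fA) * pBkg      # mixtures built with the true fA
    pB = fB * pS + (1 - fB) * pBkg
    pS_rec = ((1 - fB) * pA - (1 - fA - delta) * pB) / (fA + delta - fB)
    return np.max(np.abs(pS_rec - pS))

for gap in (0.4, 0.2, 0.1, 0.05):
    # shrink fA - fB at fixed delta: the error grows roughly like 1/gap
    print(gap, recovery_error(0.5 + gap / 2, 0.5 - gap / 2, delta=0.01))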
[Diagram: classifier output axis running from more signal (left) to more background (right), with a good and a bad cut location marked; $\bar z'$ shown for all cuts vs. bad cuts only.]
Shifting the assumed fraction, $f_A \to f_A + \delta$, changes the combinations entering the extraction to $\dfrac{1 - f_A - \delta}{1 - f_B}$ and $\dfrac{1 - 2 f_A - 2\delta}{1 - 2 f_B}$, and the per-event shift works out to

$\bar z'_i - \bar z_i = \dfrac{f_A - f_B}{1 - 2 f_B} + 2\delta \left( \dfrac{1 - f_B}{1 - 2 f_B} - \bar z_i \right)$.
[Figure: ROC curves, true positive rate (0.4–1.0) vs. false positive rate ($10^{-5}$–$10^0$, log scale), for fully supervised (original), weakly supervised (original), fully supervised (mis-modeled), and weakly supervised (mis-modeled) classifiers. Left panel: randomly swap 15% of each class. Right panel: swap the 10% (15%) most signal-like (background-like) events.]
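Both mis-modeling scenarios are straightforward to emulate on toy labels; an illustrative sketch (my construction, following the panel captions; score is any signal-likeness score such as a pre-trained classifier output):

import numpy as np

rng = np.random.default_rng(1)

def swap_random(y, frac=0.15):
    """Randomly flip `frac` of the labels in each class."""
    y = y.copy()
    for cls in (0, 1):
        idx = np.flatnonzero(y == cls)
        flip = rng.choice(idx, size=int(frac * len(idx)), replace=False)
        y[flip] = 1 - cls
    return y

def swap_extremes(y, score, f_sig=0.10, f_bkg=0.15):
    """Flip the most signal-like backgrounds and most
    background-like signals, as in the right panel."""
    y = y.copy()
    sig, bkg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    k_b = int(f_bkg * len(sig))         # most background-like signal
    y[sig[np.argsort(score[sig])[:k_b]]] = 0
    k_s = int(f_sig * len(bkg))         # most signal-like background
    y[bkg[np.argsort(score[bkg])[len(bkg) - k_s:]]] = 1
    return y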