Entropy property testing with finitely many errors (Changlong Wu) - PowerPoint presentation



SLIDE 1

Intro Results Related Work Conclusion

Entropy property testing with finitely many errors

Changlong Wu (Univ of Hawaii, Manoa)

Joint work with Narayana Santhanam (Univ of Hawaii, Manoa)

ISIT2020 Online Talk June, 2020

SLIDE 5

Introduction

Meta-question: when will a scientist find the perfect theory, eventually almost surely?

Consider a scientist building a theory that describes a natural phenomenon by making observations. The scientist may refine his theory every time new observations arrive (e.g. Newton → Einstein).

Will the scientist perpetually refine his theory, or settle on a perfect theory after making finitely many observations?

SLIDE 10

A toy example

Let p be a distribution over {1, 2, · · · , m}, and let H(p) be the entropy of p. For some fixed h ∈ [0, log m], we would like to decide “Is H(p) = h?” by observing i.i.d. samples X1, X2, · · · ∼ p.

This seems to be an ill-posed problem, since one cannot decide for distributions p with H(p) arbitrarily close to, but not equal to, h.

SLIDE 13

A toy example

We are allowed to sample as long as we want, but after some point we must make the right decision.

We show that for any h ∈ [0, log m], there exists a universal decision rule Φ such that for any distribution p over [m], we have

Φ(X₁ⁿ) → 1{H(p) = h}, almost surely as n → ∞,

where X1, X2, · · · ∼ p independently. In other words, Φ makes the right decision eventually almost surely.

SLIDE 16

Proof?

Let p̂n be the empirical distribution of p with n samples.

A standard concentration inequality yields that there exists a number N such that for any n ≥ N, we have

P(‖p̂n − p‖TV ≥ log²n/√n) ≤ 1/n².

Since the entropy function is uniformly continuous over a bounded support, there is a function t(n) → 0 such that for n ≥ N,

P(|H(p̂n) − H(p)| ≥ t(n)) ≤ 1/n².
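The concentration step can be sanity-checked numerically. The sketch below is not from the slides; the distribution p and the sample size are arbitrary illustrative choices. It draws i.i.d. samples, forms the empirical distribution, and compares the TV distance against the deviation level log²n/√n:

```python
import math
import random
from collections import Counter

def tv_distance(p, q):
    """Total variation distance between two distributions given as dicts."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def empirical(samples):
    """Empirical distribution of a sample sequence, as a dict."""
    counts = Counter(samples)
    n = len(samples)
    return {x: c / n for x, c in counts.items()}

random.seed(0)
p = {1: 0.4, 2: 0.3, 3: 0.2, 4: 0.1}    # a fixed distribution over [m], m = 4
n = 10_000
samples = random.choices(list(p), weights=list(p.values()), k=n)

tv = tv_distance(empirical(samples), p)
bound = math.log(n) ** 2 / math.sqrt(n)  # the slide's deviation level log²n/√n
print(tv, bound)
```

For n = 10,000 the empirical TV distance comes out far below the deviation level, consistent with the stated tail bound.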

SLIDE 19

Proof?

The decision rule is as follows: if |H(p̂n) − h| ≤ t(n) we decide “yes”, otherwise we decide “no”.

Now, if indeed H(p) = h, the Borel-Cantelli lemma gives that the rule is correct for all but finitely many n ≥ N with probability 1.

If H(p) ≠ h, there exists a number Np such that for all but finitely many n ≥ Np we have |H(p̂n) − h| > |H(p) − h| − t(n) > t(n) with probability 1, since t(n) → 0.
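The rule above can be sketched in code. The slides leave the exact threshold implicit beyond t(n) → 0; the choice t(n) = log n/√n below is an illustrative assumption, as are the two test distributions:

```python
import math
import random
from collections import Counter

def plug_in_entropy(samples):
    """Entropy (in bits) of the empirical distribution of the samples."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

def decide_entropy_equals(samples, h):
    """The slide's rule: 'yes' iff |H(p_hat_n) - h| <= t(n),
    with an illustrative threshold t(n) = log(n)/sqrt(n)."""
    n = len(samples)
    t = math.log(n) / math.sqrt(n)
    return abs(plug_in_entropy(samples) - h) <= t

random.seed(1)
n = 10_000
fair = [random.randint(0, 1) for _ in range(n)]           # H(p) = 1 bit
biased = random.choices([0, 1], weights=[0.9, 0.1], k=n)  # H(p) ≈ 0.47 bits

print(decide_entropy_equals(fair, 1.0))    # fair coin vs h = 1
print(decide_entropy_equals(biased, 1.0))  # biased coin vs h = 1
```

At this sample size the fair coin is accepted for h = 1 and the 0.9/0.1 coin is rejected, matching the eventual-almost-sure guarantee.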

SLIDE 20

Testing general entropy property

Let P be a class of distributions over N, and let A ⊂ R+. For which combinations of P and A can we find a decision rule Φ such that

Φ(X₁ⁿ) → 1{H(p) ∈ A}, almost surely as n → ∞,

for all p ∈ P and X1, X2, · · · i.i.d. ∼ p?

SLIDE 21

Fσ-separable

Sets A ⊂ R+ and Ac = R+ \ A are said to be Fσ-separable if there exist collections of sets {Bn}n∈N and {Cn}n∈N such that

  • 1. A = ∪n∈N Bn and Ac = ∪n∈N Cn;
  • 2. for all n ∈ N, Bn ⊂ Bn+1 and Cn ⊂ Cn+1;
  • 3. for all n ∈ N, inf{|x − y| : x ∈ Bn, y ∈ Cn} > 0.
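As a concrete instance (an illustration, not on the slides), the toy problem A = {h} fits this definition with:

```latex
A = \{h\}, \qquad B_n = \{h\}, \qquad
C_n = \{\, x \in [0, \log m] : |x - h| \ge 1/n \,\}.
```

Then A = ∪n Bn, Ac = ∪n Cn, both families are nested, and inf{|x − y| : x ∈ Bn, y ∈ Cn} = 1/n > 0, so the singleton question “Is H(p) = h?” is decidable, consistent with the toy example.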

SLIDE 22

Bounded support case

Theorem. For any A ⊂ [0, log m], we can decide “Is H(p) ∈ A?” eventually almost surely for all distributions p over [m] if and only if A and Ac are Fσ-separable.

SLIDE 25

Infinite Alphabets

Does the result extend to distributions on the naturals with arbitrary support?

The answer is no; we prove the following theorem:

Theorem. For any k ≥ 1, there is no decision rule that decides

  • 1. Is H(p) ≥ k?
  • 2. Is H(p) finite?

eventually almost surely for all distributions over N.

The proof uses a diagonalization argument...

SLIDE 27

Infinite Alphabets

We note the following somewhat surprising theorem:

Theorem. For any k ≥ 1, there exists a decision rule that decides “Is H(p) > k?” eventually almost surely for all distributions over N.

The difference from the H(p) ≥ k case is that one can construct an estimator Ĥ such that Ĥ(X₁ⁿ) ≤ H(p) and Ĥ(X₁ⁿ) → H(p) almost surely. Decide “yes” if Ĥ(X₁ⁿ) > k and “no” otherwise.
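The slides do not spell out the construction of Ĥ. As a weaker but related illustration (a sketch under my own assumptions, not the paper's estimator): the plug-in estimator H(p̂n) underestimates H(p) on average, since Jensen's inequality gives E[H(p̂n)] ≤ H(p). The distribution, sample size, and trial count below are illustrative choices:

```python
import math
import random
from collections import Counter

def plug_in_entropy(samples):
    """Entropy (in bits) of the empirical distribution of the samples."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

random.seed(2)
true_H = 2.0         # uniform over 4 symbols: H = log2(4) = 2 bits
n, trials = 8, 2000  # a small n makes the downward bias clearly visible

avg = sum(
    plug_in_entropy([random.randrange(4) for _ in range(n)])
    for _ in range(trials)
) / trials
print(avg, true_H)   # the average plug-in estimate sits below 2.0
```

This only shows an in-expectation underestimate; the almost-sure lower-bound property claimed on the slide is a strictly stronger guarantee.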

SLIDE 28

Preparing for the main result: Tail entropy

For a function ρ : N → R+ and a class P of distributions over N, we say the tail entropy of P is eventually dominated by ρ if for all p ∈ P there exists a number Np such that for all n ≥ Np we have

Hn(p) = Σi≥n −pi log pi ≤ ρ(n).

SLIDE 29

Main Result

Theorem. Let ρ : N → R+ be an arbitrary function such that ρ(n) → 0 as n → ∞, let P be eventually dominated by ρ, and let A ⊂ R+. Then there exists a decision rule that decides “Is H(p) ∈ A?” eventually almost surely for all p ∈ P, if and only if A and Ac are Fσ-separable.

SLIDE 30

Sketch of Proof

  • 1. Since A and Ac are Fσ-separable, there exist B1 ⊂ B2 ⊂ · · · ⊂ A and C1 ⊂ C2 ⊂ · · · ⊂ Ac, with A = ∪n Bn and Ac = ∪n Cn, such that for all n,

inf{|x − y| : x ∈ Bn, y ∈ Cn} = εn > 0.

SLIDE 33

Sketch of Proof (Cont.)

  • 2. Define

Pn = {p ∈ P : H(p) ∈ Bn ∪ Cn and ∀k > N(n), Hk(p) ≤ ρ(k)},

where N(n) ↗ +∞ is chosen so that ρ(N(n)) ≤ εn/8.

  • 3. We have P = ∪n Pn and Pn ⊂ Pn+1, by the eventual dominance of ρ and the properties of {Bn} and {Cn}.

SLIDE 34

Sketch of Proof (Cont.)

  • 4. By construction, the problem restricted to Pn can be decided with arbitrary confidence using a bounded number of samples. Denote by bn the sample complexity that achieves confidence 1 − 2⁻ⁿ.

  • 5. The decision rule for P is as follows: once the sample size reaches bn, we use the decision rule for Pn to make the decision, and we retain that decision until the sample size reaches bn+1. Repeat the process for n + 1.
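Steps 4 and 5 can be sketched schematically. Everything here is a hypothetical stand-in: `budgets` plays the role of the sample complexities bn, `restricted_rules[i]` plays the role of the decision rule for Pn, and the toy rules at the bottom simply threshold a sample mean:

```python
import bisect

def stitched_decision(samples, budgets, restricted_rules):
    """Steps 4-5 as a schematic: find the largest n with b_n <= sample size,
    and return the decision of the rule for P_n on the first b_n samples.
    `budgets` is the increasing list [b_1, b_2, ...]; `restricted_rules[i]`
    stands in for the rule deciding the property on P_{i+1}."""
    if len(samples) < budgets[0]:
        return None  # not enough samples to decide yet
    i = bisect.bisect_right(budgets, len(samples)) - 1
    return restricted_rules[i](samples[: budgets[i]])

# Toy illustration with dummy restricted rules that threshold the sample mean.
budgets = [4, 16, 64]
rules = [lambda xs: sum(xs) / len(xs) > 0.5] * 3
print(stitched_decision([1, 1, 0, 1, 1], budgets, rules))
```

The point of the stitching is that each stage's decision is held fixed between budgets, so the overall rule changes its answer only finitely often once the true Pn is reached.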

SLIDE 36

When do we have eventual dominance?

Clearly, the class of distributions with finite support is eventually dominated.

The following lemma shows that a finite first moment is also sufficient:

Lemma. Let P be the class of all distributions over N with finite first moment; then P is eventually dominated by ρ(n) = log²n/n.
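The lemma can be checked numerically for one concrete member of P. The geometric distribution pᵢ = 2⁻ⁱ is an illustrative choice (mean 2, hence finite first moment); its tail entropy decays geometrically, so it drops below ρ(n) = log²n/n quickly. Natural log is assumed throughout:

```python
import math

def geometric_pmf(i):
    """p_i = 2^{-i} over i = 1, 2, ...; the mean is 2 (finite first moment)."""
    return 2.0 ** (-i)

def tail_entropy(n, terms=500):
    """H_n(p) = sum_{i >= n} -p_i * log p_i (natural log), truncated."""
    return sum(-geometric_pmf(i) * math.log(geometric_pmf(i))
               for i in range(n, n + terms))

def rho(n):
    """The lemma's dominating function rho(n) = log^2(n) / n."""
    return math.log(n) ** 2 / n

# The tail entropy decays geometrically while rho(n) decays polynomially,
# so domination holds from a small n onward for this distribution.
print(all(tail_entropy(n) <= rho(n) for n in range(5, 60)))
```

Of course, the lemma's content is that a single ρ works uniformly over the whole class with finite first moment; this check only exhibits the behavior for one member.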

SLIDE 38

Relation with regularization

A model class P together with a binary property f : P → {0, 1} is said to be regularizable if P can be decomposed as P = ∪n∈N Pn such that each Pn is uniformly testable for property f.

Our result shows that a model class is regularizable for some property if and only if the class is finitely decidable by testing the same property.

SLIDE 39

In conclusion

  • 1. Under mild conditions, we completely characterize the decidability of entropy properties of distributions over N.
  • 2. Our approach also yields elementary proofs of the results in (Cover, 1973), (Dembo-Peres, 1994) and (Koplowitz et al., 1995).
  • 3. A full version of this work, with more problem setups, is available at: https://arxiv.org/abs/2001.03710

SLIDE 40

Related Work

Problems with a similar flavor were initiated by (Cover, 1973). A substantial extension of Cover's work appears in (Dembo-Peres, 1994). A line of research follows this work: (Kulkarni-Tse, 1994), (Koplowitz et al., 1995), (Newman, 2016), (Newman, 2019). A prediction analogue appears in (Santhanam-Anantharam, 2016) and (Wu-Santhanam, 2019). A deterministic computational analogue was extensively studied in the TCS community; see (Zeugmann-Zilles, 2006) for a survey.

SLIDE 41

Thank you!
