Calibration of Convex Surrogate Losses via Property Elicitation
Jessie Finocchiaro
October 10, 2019
Introduction
Machine learning: predictions about future events, i.e., some property of the data.
Outline
- Background
- Surrogates
- Calibration
- Properties
- Necessary and Sufficient Conditions
- Case Study: Abstain
- Dimensionality
- Conclusion
Empirical Risk Minimization (ERM)
Assumption: data comes i.i.d. from a distribution $P$ over $\mathcal{X} \times \mathcal{Y}$. Goal: minimize risk:
$$\mathrm{Risk}(h) = \int \ell(h(x), y)\, dP(x, y)$$
Problem: we don't know $P$.
Settle for minimizing empirical risk. Take $m$ samples $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$:
$$\widehat{\mathrm{Risk}}(h) = \frac{1}{m} \sum_{i=1}^{m} \ell(h(x_i), y_i),$$
the average loss of the prediction $h(x_i)$, given feature $x_i$, against outcome $y_i$.
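A minimal sketch of this computation in Python (the dataset and threshold classifier below are hypothetical):

```python
import numpy as np

def zero_one_loss(r, y):
    """0-1 loss: 1 if the report mismatches the outcome, else 0."""
    return float(r != y)

def empirical_risk(h, xs, ys, loss=zero_one_loss):
    """Average loss of hypothesis h over m labeled samples."""
    return np.mean([loss(h(x), y) for x, y in zip(xs, ys)])

# Hypothetical data: a threshold classifier on 1-d features.
xs = np.array([0.2, 0.8, 0.5, 0.9])
ys = np.array([0, 1, 0, 1])
h = lambda x: int(x > 0.5)
print(empirical_risk(h, xs, ys))  # 0.0
```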
Form a hypothesis $h: \mathcal{X} \to \mathcal{R}$. Want to learn $h$ minimizing $\mathbb{E}_{(X,Y)\sim P}\, \ell(h(X), Y)$. If $h(x)$ minimizes $\mathbb{E}\, \ell(r, Y \mid X = x)$ over reports $r$ for all $x \in \mathcal{X}$, this clearly holds. Abstract $h(x)$ to a report $r$, and look at conditional distributions on the simplex $\Delta_{\mathcal{Y}}$.
[Figure: outcome distribution $\Pr[Y]$ and conditional distribution $\Pr[Y \mid X = 3]$.]
Setting
Classification-like problems
- Multiclass classification (with reject option)
- Ranking
- Top-k classification
Notation
Finite outcome set $\mathcal{Y}$ with $|\mathcal{Y}| = n$. Report set $\mathcal{R}$. Probability distribution over $\mathcal{Y}$: $p \in \Delta_{\mathcal{Y}}$, with $p_y = \Pr[Y = y]$.
Outline
- Background
- Surrogates
- Calibration
- Properties
- Necessary and Sufficient Conditions
- Case Study: Abstain
- Dimensionality
- Conclusion
Surrogates
Loss functions measure error. They are created with a task in mind, and are often discrete.
Discrete losses are hard to optimize, so we use surrogates, which should approximate the original loss well.
[Figure: surrogates for 0-1 loss, $\mathcal{Y} = \{-1, 1\}$; plot of $L(r, 1)$ against report $r$.]
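To make the picture concrete, a minimal sketch (assuming $\mathcal{Y} = \{-1, 1\}$ and real-valued reports) comparing the 0-1 loss with the hinge surrogate at a few reports:

```python
import numpy as np

def loss_01(r, y):
    """Discrete 0-1 loss on the sign of the report."""
    return float(np.sign(r) != y)

def hinge(r, y):
    """Convex surrogate: (1 - y*r)_+ upper-bounds the 0-1 loss."""
    return max(0.0, 1.0 - y * r)

for r in [-1.5, -0.2, 0.3, 2.0]:
    print(f"r={r:+.1f}  0-1: {loss_01(r, 1):.0f}  hinge: {hinge(r, 1):.1f}")
```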
Calibration
A calibrated loss $L$ is a "good approximation" of the discrete loss $\ell(r, y)$. Let $\ell$ be a discrete loss. A surrogate loss function $L: \mathbb{R}^d \times \mathcal{Y} \to \mathbb{R}_+$ is said to be $\ell$-calibrated if there exists a link function $\psi: \mathbb{R}^d \to \mathcal{R}$ such that, for all $p \in \Delta_{\mathcal{Y}}$,
$$\inf_{u:\, \psi(u) \notin \operatorname{argmin}_r \mathbb{E}_{Y \sim p} \ell(r, Y)} \mathbb{E}_{Y \sim p}\, L(u, Y) \;>\; \inf_{u \in \mathbb{R}^d} \mathbb{E}_{Y \sim p}\, L(u, Y).$$
That is, the inf over reports not linked to the argmin of the discrete loss is strictly worse than the unrestricted inf.
Consistency
Let $f_m: \mathcal{X} \to \mathbb{R}^d$ be the hypothesis learned from training on $m$ samples drawn i.i.d. from $P$.
$\mathbb{E}_P\, L(f(X), Y)$ is the expected loss $L$ of predicting $f(X)$ when $(X, Y) \sim P$.
The sequence $(f_m)$ is said to be $L$-consistent if
$$\mathbb{E}_P\, L(f_m(X), Y) \longrightarrow \inf_{f: \mathcal{X} \to \mathbb{R}^d} \mathbb{E}_P\, L(f(X), Y) =: L^*_P.$$
Losses are calibrated. Hypotheses are consistent.
Relating calibration and consistency
Let $\ell: \mathcal{R} \times \mathcal{Y} \to \mathbb{R}_+$ be a discrete loss. A surrogate loss function $L: \mathbb{R}^d \times \mathcal{Y} \to \mathbb{R}_+$ is $\ell$-calibrated if and only if there exists a link function $\psi: \mathbb{R}^d \to \mathcal{R}$ such that, for all distributions $P$ on $\mathcal{X} \times \mathcal{Y}$ and all sequences of surrogate hypotheses $f_m: \mathcal{X} \to \mathbb{R}^d$, we have
$$\mathbb{E}_P\, L(f_m(X), Y) \to L^*_P \;\implies\; \mathbb{E}_P\, \ell(\psi(f_m(X)), Y) \to \ell^*_P.$$
Ramaswamy et al. (2015), Theorem 3; originally Tewari and Bartlett (2007), Theorem 2. Converging to the optimal hypothesis for the surrogate means the linked hypothesis converges to the optimal discrete loss.
Pause
Questions so far? Next we formalize properties, which we use to study calibration.
Properties
A property $\Gamma: \Delta_{\mathcal{Y}} \to \mathcal{R}$ is a function mapping probability distributions to reports. If it's easier, think of $p \in \Delta_{\mathcal{Y}}$ as a conditional probability. A property $\Gamma$ is elicitable if there is a loss function $\ell: \mathcal{R} \times \mathcal{Y} \to \mathbb{R}_+$ such that, for all $p \in \Delta_{\mathcal{Y}}$, $\Gamma(p) = \operatorname{argmin}_r \mathbb{E}_{Y \sim p}\, \ell(r, Y)$. Here, we say the loss $\ell$ elicits $\Gamma$. Elicitable properties have convex level sets (Lambert and Shoham 2009):
$$\Gamma_r = \{p \in \Delta_{\mathcal{Y}} : r \in \Gamma(p)\}.$$
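As a sanity check on the definition, a sketch (Python; the three-outcome distribution is hypothetical, and the mean is used as the example property) verifying numerically that squared loss elicits the mean:

```python
import numpy as np

# Hypothetical distribution p over outcomes Y in {0, 1, 2}.
outcomes = np.array([0.0, 1.0, 2.0])
p = np.array([0.2, 0.5, 0.3])

def expected_loss(r):
    """E_{Y~p} (r - Y)^2, the expected squared loss of report r."""
    return np.sum(p * (r - outcomes) ** 2)

reports = np.linspace(0, 2, 2001)
best = reports[np.argmin([expected_loss(r) for r in reports])]
print(best, p @ outcomes)  # both ~1.1: squared loss elicits the mean
```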
βDrawingβ a property
The simplex $\Delta_{\mathcal{Y}}$ over $n$ outcomes sits in $(n-1)$-dimensional space. Example: $n = 3$.
Calibration⦠in terms of properties
A property $\Gamma: \Delta_{\mathcal{Y}} \to \mathbb{R}^d$ and link $\psi: \mathbb{R}^d \to \mathcal{R}$ are $\ell$-calibrated if
$$u \in \Gamma(p) \implies \mathbb{E}_{Y \sim p}\, \ell(\psi(u), Y) = \min_r \mathbb{E}_{Y \sim p}\, \ell(r, Y),$$
i.e., the property value can always be linked to the argmin of the loss. This is a tool to study geometric properties of losses eliciting $\Gamma$.
Definition courtesy of Agarwal and Agarwal (2015).
Property Papers
Lambert, Shoham (2009): Eliciting Truthful Answers to Multiple-Choice Questions.
Finite properties are elicitable iff their level sets form a power diagram.
Agarwal, Agarwal (2015): On Consistent Surrogate Risk Minimization and Property Elicitation.
There's a connection between properties and surrogate losses.
Frongillo, Kash (2015): On Elicitation Complexity.
Every property is elicitable, but the question is how elicitable.
Calibrated surrogates
Positive normal sets. Necessary conditions. Sufficient conditions. Relationship between positive normal sets and level sets of the property.
Positive Normal Sets
Finite outcome setting: rewrite $L: \mathbb{R}^d \to \mathbb{R}^n$ as the vector of loss values, one per outcome: $L(u) = (L(u, y))_{y \in \mathcal{Y}}$.
By linearity of expectation, $\mathbb{E}_{Y \sim p}\, L(u, Y) = \langle p, L(u) \rangle$.
Positive normal set, defined for sequences $(u_m)$ whose expected loss converges to the infimum:
$$\mathcal{N}_L\big((u_m)\big) = \Big\{ p \in \Delta_{\mathcal{Y}} : \lim_{m} \langle p, L(u_m) \rangle = \inf_{u \in \mathbb{R}^d} \langle p, L(u) \rangle \Big\},$$
i.e., the distributions under which the expected loss on the outcome vectors attains the infimum of the expected loss possible.
Necessary Condition
Let $\ell$ be a discrete loss and let $L$ be $\ell$-calibrated. Let $\Gamma$ be the property elicited by $\ell$. For all $u$ in $\mathcal{S}_L$, the convex hull of the surrogate's loss vectors, there exists an $r \in \mathcal{R}$ such that $\mathcal{N}_L(u) \subseteq \Gamma_r$.
Ramaswamy et al. (2015) Theorem 6
Sufficient Condition
Suppose there exists some finite set of points $u_i \in \mathcal{S}_L$ such that $\bigcup_i \mathcal{N}_L(u_i) = \Delta_{\mathcal{Y}}$, and for each $i$ there exists an $r_i \in \mathcal{R}$ such that $\mathcal{N}_L(u_i) \subseteq \Gamma_{r_i}$. Then $L$ is $\ell$-calibrated.
Ramaswamy et al. (2015), Theorem 8. Example: 0-1 loss and hinge:
$$\mathcal{N}_{\text{hinge}}\big((2,0)\big) = \{p \in \Delta_2 : p_1 \geq \tfrac{1}{2}\} \subseteq \Gamma_1, \qquad \mathcal{N}_{\text{hinge}}\big((0,2)\big) = \{p \in \Delta_2 : p_1 \leq \tfrac{1}{2}\} \subseteq \Gamma_{-1},$$
with the two sets meeting at the boundary $p_1 = \tfrac{1}{2}$.
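A numeric sketch of this example (assuming $\mathcal{Y} = \{-1, 1\}$, identifying $p$ with $p_1 = \Pr[Y = 1]$, and linking by sign): the minimizer of expected hinge loss links to the 0-1 argmin on each half of the simplex:

```python
import numpy as np

def exp_hinge(u, p1):
    """E_{Y~p} hinge(u, Y), with p1 = Pr[Y = 1]."""
    return p1 * max(0.0, 1.0 - u) + (1.0 - p1) * max(0.0, 1.0 + u)

us = np.linspace(-2, 2, 4001)
for p1 in [0.2, 0.5, 0.8]:
    u_star = us[np.argmin([exp_hinge(u, p1) for u in us])]
    # At p1 = 0.5 the minimizer is not unique; any u in [-1, 1] is optimal.
    print(p1, u_star, np.sign(u_star))
```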
Outline
- Background
- Surrogates
- Calibration
- Properties
- Necessary and Sufficient Conditions
- Case Study: Abstain
- Dimensionality
- Conclusion
Case Study: Abstain
Situations where the cost of misclassification is high.
College admissions. Medical diagnoses.
Discrete loss for this problem:
$$\ell_{1/2}(r, y) = \begin{cases} 0, & r = y \\ \tfrac{1}{2}, & r = \bot \\ 1, & r \neq y,\; r \neq \bot \end{cases}$$
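A small sketch of this loss and its Bayes-optimal report (Python; $\bot$ written as the string "abstain"): predict the mode when it has mass at least $1/2$, else abstain, since reporting $y$ costs $1 - p_y$ in expectation while abstaining costs $1/2$.

```python
import numpy as np

ABSTAIN = "abstain"

def abstain_loss(r, y):
    """The abstain loss l_{1/2}: 0 if correct, 1/2 if abstain, 1 if wrong."""
    if r == ABSTAIN:
        return 0.5
    return 0.0 if r == y else 1.0

def bayes_report(p):
    """Optimal report under l_{1/2}: the mode if it has mass >= 1/2, else abstain."""
    y_star = int(np.argmax(p))
    return y_star if p[y_star] >= 0.5 else ABSTAIN

print(bayes_report(np.array([0.7, 0.1, 0.1, 0.1])))    # 0
print(bayes_report(np.array([0.25, 0.25, 0.25, 0.25])))  # abstain
```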
Historical calibrated surrogates
- Crammer, Singer (2001), sketched in code below:
$$L_{CS}(u, y) = \Big(1 + \max_{j \neq y} u_j - u_y\Big)_+, \qquad \psi_{CS}(u) = \begin{cases} \operatorname{argmax}_{1 \le j \le n} u_j, & u_{(1)} - u_{(2)} > \tau \\ \bot, & \text{otherwise} \end{cases}$$
where $u_{(1)}, u_{(2)}$ are the two largest coordinates of $u$ and $\tau$ is a threshold.
- One vs. All (Rifkin, Klautau 2004):
$$L_{OvA}(u, y) = \sum_j \Big[ \mathbb{1}\{y = j\}(1 - u_j)_+ + \mathbb{1}\{y \neq j\}(1 + u_j)_+ \Big], \qquad \psi_{OvA}(u) = \begin{cases} \operatorname{argmax}_{1 \le j \le n} u_j, & \max_j u_j > \tau \\ \bot, & \text{otherwise} \end{cases}$$
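A sketch of both surrogates and their links (Python; outcomes indexed $0, \dots, n-1$, $\bot$ written as the string "abstain", and tau a stand-in threshold whose calibrated values are given in the respective papers):

```python
import numpy as np

def cs_loss(u, y):
    """Crammer-Singer surrogate: (1 + max_{j != y} u_j - u_y)_+."""
    return max(0.0, 1.0 + np.max(np.delete(u, y)) - u[y])

def cs_link(u, tau):
    """Predict the argmax if the top-two gap exceeds tau, else abstain."""
    top = np.sort(u)[::-1]
    return int(np.argmax(u)) if top[0] - top[1] > tau else "abstain"

def ova_loss(u, y):
    """One-vs-All surrogate: hinge on coordinate y, reverse hinge elsewhere."""
    return sum(max(0.0, 1.0 - u[j]) if j == y else max(0.0, 1.0 + u[j])
               for j in range(len(u)))

def ova_link(u, tau):
    """Predict the argmax if its score exceeds tau, else abstain."""
    return int(np.argmax(u)) if np.max(u) > tau else "abstain"

u = np.array([0.9, -0.8, -0.7])
print(cs_loss(u, 0), cs_link(u, 0.5))    # 0.0 0
print(ova_loss(u, 0), ova_link(u, 0.0))  # ~0.6 0
```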
BEP Surrogate and Link
Ramaswamy et al. (2018), Section 4. [Figure: BEP surrogate and link illustrated at $p = (\tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4})$ and $p = (\tfrac{7}{10}, \tfrac{1}{10}, \tfrac{1}{10}, \tfrac{1}{10})$.]
Why BEP?
BEP, CS, and OvA are all $\ell_{1/2}$-calibrated for the abstain loss.
Convex surrogates take a report $u \in \mathbb{R}^d$.
BEP: $d = \lceil \log_2 n \rceil$. CS and OvA: $d = n$.
Why does the dimension $d$ matter?
Outline
- Background
- Surrogates
- Calibration
- Properties
- Necessary and Sufficient Conditions
- Case Study: Abstain
- Dimensionality
- Conclusion
Dimensionality
Want algorithms that are efficient and accurate.
Reducing the dimension makes the optimization problem more efficient; calibration guarantees accuracy.
Elicitation Complexity
Elic($\Gamma$) = minimum dimension $d$ such that $\Gamma$ is elicitable by a $d$-dimensional loss. Maybe $\Gamma$ isn't itself 1-elicitable, but it can be computed as $g \circ \hat{\Gamma}$, where $\hat{\Gamma}$ is 1-elicitable.
Then Elic($\Gamma$) = 1. This is called indirect elicitation.
Example: $\Gamma(p) = (\mathbb{E}_p[Y])^2$.
Elicit $\hat{\Gamma}(p) = \mathbb{E}_p[Y]$ and take $g: y \mapsto y^2$.
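A sketch of this indirect elicitation (Python; the sample from a hypothetical $p$ stands in for data): fit the mean, which squared loss elicits, then apply the link $g$.

```python
import numpy as np

rng = np.random.default_rng(0)
ys = rng.choice([0.0, 1.0, 2.0], p=[0.2, 0.5, 0.3], size=100_000)

# Step 1: elicit Gamma_hat(p) = E_p[Y] via squared loss
# (the empirical minimizer of squared loss is the sample mean).
mean_hat = ys.mean()

# Step 2: apply g(y) = y^2 to obtain Gamma(p) = (E_p[Y])^2 indirectly.
print(mean_hat ** 2)  # ~1.21 = 1.1^2
```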
Convex Calibration Dimension
Special case of elicitation complexity. ccdim($\ell$) = minimum dimension $d$ such that there is a convex $\ell$-calibrated surrogate $L: \mathbb{R}^d \times \mathcal{Y} \to \mathbb{R}_+$. Example: ccdim($\ell_{1/2}$) $\le \lceil \log_2 n \rceil$ because of the BEP surrogate.
From Ramaswamy et al. (2015), Definition 10.
Bounds on CC dimension
Understood through the feasible subspace dimension. Tight bound for properties whose level sets all intersect in the interior of the simplex: ccdim($\ell$) = $n - 1$.
Does not apply to abstain.
These results are from Ramaswamy et al. (2015).
Feasible Subspace Dimension
The feasible subspace dimension $\mu_C(p)$ of a convex set $C$ at the point $p \in C$ is the dimension of $\mathcal{F}_C(p) \cap -\mathcal{F}_C(p)$, where $\mathcal{F}_C(p)$ is the cone of feasible directions of $C$ at $p$.
Essentially: the dimension of the smallest face of $C$ containing $p$.
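As a concrete instance (a sketch; for $C = \Delta_{\mathcal{Y}}$ itself, the smallest face containing $p$ is spanned by the support of $p$, so $\mu_{\Delta}(p) = \|p\|_0 - 1$):

```python
import numpy as np

def simplex_face_dim(p, tol=1e-12):
    """Dimension of the smallest face of the simplex containing p: ||p||_0 - 1."""
    return int(np.sum(np.asarray(p) > tol)) - 1

print(simplex_face_dim([0.5, 0.5, 0.0]))    # 1: an edge of the triangle
print(simplex_face_dim([1/3, 1/3, 1/3]))    # 2: the triangle's interior
```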
Lower bound on CC Dimension
Let $\ell$ be the discrete loss eliciting the finite property $\Gamma: \Delta_{\mathcal{Y}} \to \mathcal{R}$. For $p \in \Delta_{\mathcal{Y}}$ and $r$ such that $p \in \Gamma_r$,
$$\text{ccdim}(\ell) \ge \|p\|_0 - \mu_{\Gamma_r}(p) - 1.$$
Ramaswamy et al. (2016), Theorem 16.
Proof Sketch
Consider $p \in \operatorname{relint}(\Delta_{\mathcal{Y}})$; the other case reduces to this one. Suppose the surrogate $L: \mathbb{R}^d \times \mathcal{Y} \to \mathbb{R}_+$ is $\ell$-calibrated.
Want to show $d \ge \|p\|_0 - \mu_{\Gamma_r}(p) - 1$.
Want to show there is a set $\mathcal{Q} \subseteq \Delta_{\mathcal{Y}}$ and $r' \in \mathcal{R}$ so that three things are true:
- 1. $p \in \mathcal{Q}$
- 2. $\mu_{\mathcal{Q}}(p) \ge n - d - 1$
- 3. $\mathcal{Q} \subseteq \Gamma_{r'}$
Then $p \in \Gamma_{r'}$ and $\mu_{\Gamma_r}(p) = \mu_{\Gamma_{r'}}(p) \ge \mu_{\mathcal{Q}}(p) \ge n - d - 1$; since $\|p\|_0 = n$ on the relative interior, rearranging gives the bound.
Proof Sketch: Conditions 1 and 2
Construct $\mathcal{Q}$ and $r'$. Consider $u$ (a point or sequence) so that
$$\inf_{z \in \mathcal{S}_L} \langle p, z \rangle = \inf_{u' \in \mathbb{R}^d} \langle p, L(u') \rangle.$$
Claim: there exist $x_y \in \partial L_y(u)$ for all $y \in \mathcal{Y}$ so that $\sum_y p_y x_y = 0$. Set $B = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^{d \times n}$ and $\mathcal{Q} = \{q \in \Delta_{\mathcal{Y}} : Bq = 0\}$. Then $\mu_{\mathcal{Q}}(p) \ge \operatorname{nullity}(B) - 1 \ge n - d - 1$ by Ramaswamy Lemma 15.
- Conditions 1 and 2 satisfied.
Condition 3 Intuition
We construct $\mathcal{Q}$ using sequences that converge to $u$; the $\epsilon$-subdifferential is used to pass to the infimum. They show $q \in \mathcal{Q} \implies q \in \mathcal{N}_L(u)$ with a bunch of arithmetic.
The result follows since $\mathcal{N}_L(u) \subseteq \Gamma_{r'}$ for some $r' \in \mathcal{R}$, by the necessary condition. "Let $\ell$ be the discrete loss eliciting the finite property $\Gamma: \Delta_{\mathcal{Y}} \to \mathcal{R}$. For $p \in \Delta_{\mathcal{Y}}$ and $r$ such that $p \in \Gamma_r$, ccdim($\ell$) $\ge \|p\|_0 - \mu_{\Gamma_r}(p) - 1$."
Mode: CC dimension bound is tight
Trivial bound: ccdim($\ell_{0\text{-}1}$) $\le n - 1$. Previous result: for $p \in \Gamma_r$, ccdim($\ell$) $\ge \|p\|_0 - \mu_{\Gamma_r}(p) - 1$. Take $p \in \bigcap_r \Gamma_r$ (the uniform distribution, with $\|p\|_0 = n$): ccdim($\ell_{0\text{-}1}$) $\ge n - 0 - 1$. Thus the bound is tight.
Abstain: Unknown bounds
$\text{ccdim} \ge 3 - 1 - 1 = 1$; $\quad \text{ccdim} \ge 3 - 2 - 1 = 0$; $\quad \text{ccdim} \ge 2 - 0 - 1 = 1$.
With abstain, $\bigcap_r \Gamma_r = \emptyset$. Thus, the lower-bound argument above does not apply.
Calibrated surrogate papers
Ramaswamy, Agarwal (2015): Convex calibration dimension for multiclass loss matrices.
Reducing surrogate dimension is important. Here are some bounds on the dimension of calibrated convex surrogate losses.
Ramaswamy et al. (2018): Consistent Algorithms for Multiclass Classification with a Reject Option.
Abstain is cool, and here's a log(n)-dimensional consistent convex surrogate.
Other discrete examples
Ramaswamy et al. (2015): Convex calibrated surrogates for hierarchical classification.
Here's a calibrated surrogate for this problem using the abstain surrogate.
Lapin et al. (2015): Analysis and Optimization of Loss Functions for Multiclass, Top-k, and Multilabel Classification.
Here are some surrogates for these problems. Not sure if they're calibrated.
Yu, Blaschko (2015): The Lovász Hinge: A Novel Convex Surrogate for Submodular Losses.
Here's how you go from submodular functions to convex surrogates.
Future Work
Embedding Dimension. Constructing Link Functions. Approximate Elicitation.
Summary
$$\ell_{1/2}(r, y) = \begin{cases} 0, & r = y \\ \tfrac{1}{2}, & r = \bot \\ 1, & r \neq y,\; r \neq \bot \end{cases}$$