Calibration of Convex Surrogate Losses via Property Elicitation
Jessie Finocchiaro
October 10, 2019
Introduction
Machine learning: predictions about future events, i.e., some property of the data.
Outline
- Background
- Surrogates
- Calibration
- Properties
- Necessary and Sufficient Conditions
- Case Study: Abstain
- Dimensionality
- Conclusion
Empirical Risk Minimization (ERM)
Assumption: data comes i.i.d. from a distribution $P$ over $\mathcal{X} \times \mathcal{Y}$. Goal: minimize risk:
$$\mathrm{Risk}(h) = \int \ell(h(x), y)\, dP(x, y)$$
Problem: we don't know $P$.
Settle for minimizing empirical risk. Take $m$ samples $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$:
$$\widehat{\mathrm{Risk}}(h) = \frac{1}{m} \sum_{i=1}^{m} \ell(h(x_i), y_i),$$
the average loss of the prediction $h(x_i)$, given feature $x_i$, against outcome $y_i$.
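A minimal sketch of this computation in Python (the dataset and threshold classifier below are hypothetical):

```python
import numpy as np

def zero_one_loss(r, y):
    """0-1 loss: 1 if the report mismatches the outcome, else 0."""
    return float(r != y)

def empirical_risk(h, xs, ys, loss=zero_one_loss):
    """Average loss of hypothesis h over m labeled samples."""
    return np.mean([loss(h(x), y) for x, y in zip(xs, ys)])

# Hypothetical data: a threshold classifier on 1-d features.
xs = np.array([0.2, 0.8, 0.5, 0.9])
ys = np.array([0, 1, 0, 1])
h = lambda x: int(x > 0.5)
print(empirical_risk(h, xs, ys))  # 0.0
```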
Form a hypothesis $h: \mathcal{X} \to \mathcal{R}$. Want to learn $h$ minimizing $\mathbb{E}_{(X,Y)\sim P}\, \ell(h(X), Y)$. If $h(x)$ minimizes $\mathbb{E}\, \ell(r, Y \mid X = x)$ over reports $r$ for all $x \in \mathcal{X}$, this clearly holds. Abstract $h(x)$ to a report $r$, and look at conditional distributions on the simplex $\Delta_{\mathcal{Y}}$.
[Figure: outcome distribution $\Pr[Y]$ and conditional distribution $\Pr[Y \mid X = 3]$.]
Setting
Classification-like problems
- Multiclass classification (with reject option)
- Ranking
- Top-k classification
Notation
Finite outcome set $\mathcal{Y}$ with $|\mathcal{Y}| = n$. Report set $\mathcal{R}$. Probability distribution over $\mathcal{Y}$: $p \in \Delta_{\mathcal{Y}}$, with $p_y = \Pr[Y = y]$.
Outline
- Background
- Surrogates
- Calibration
- Properties
- Necessary and Sufficient Conditions
- Case Study: Abstain
- Dimensionality
- Conclusion
Surrogates
Loss functions measure error. They are created with a task in mind, and are often discrete.
Discrete losses are hard to optimize, so we use surrogates, which should approximate the original loss well.
[Figure: surrogates for 0-1 loss, $\mathcal{Y} = \{-1, 1\}$; plot of $L(r, 1)$ against report $r$.]
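To make the picture concrete, a minimal sketch (assuming $\mathcal{Y} = \{-1, 1\}$ and real-valued reports) comparing the 0-1 loss with the hinge surrogate at a few reports:

```python
import numpy as np

def loss_01(r, y):
    """Discrete 0-1 loss on the sign of the report."""
    return float(np.sign(r) != y)

def hinge(r, y):
    """Convex surrogate: (1 - y*r)_+ upper-bounds the 0-1 loss."""
    return max(0.0, 1.0 - y * r)

for r in [-1.5, -0.2, 0.3, 2.0]:
    print(f"r={r:+.1f}  0-1: {loss_01(r, 1):.0f}  hinge: {hinge(r, 1):.1f}")
```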
Calibration
A calibrated loss $L$ is a "good approximation" of the discrete loss $\ell(r, y)$. Let $\ell$ be a discrete loss. A surrogate loss function $L: \mathbb{R}^d \times \mathcal{Y} \to \mathbb{R}_+$ is said to be $\ell$-calibrated if there exists a link function $\psi: \mathbb{R}^d \to \mathcal{R}$ such that, for all $p \in \Delta_{\mathcal{Y}}$,
$$\inf_{u:\, \psi(u) \notin \operatorname{argmin}_r \mathbb{E}_{Y \sim p} \ell(r, Y)} \mathbb{E}_{Y \sim p}\, L(u, Y) \;>\; \inf_{u \in \mathbb{R}^d} \mathbb{E}_{Y \sim p}\, L(u, Y).$$
That is, the inf over reports not linked to the argmin of the discrete loss is strictly worse than the unrestricted inf.
Consistency
Let $f_m: \mathcal{X} \to \mathbb{R}^d$ be the hypothesis learned from training on $m$ samples drawn i.i.d. from $P$.
$\mathbb{E}_P\, L(f(X), Y)$ is the expected loss $L$ of predicting $f(X)$ when $(X, Y) \sim P$.
The sequence $(f_m)$ is said to be $L$-consistent if
$$\mathbb{E}_P\, L(f_m(X), Y) \longrightarrow \inf_{f: \mathcal{X} \to \mathbb{R}^d} \mathbb{E}_P\, L(f(X), Y) =: L^*_P.$$
Losses are calibrated. Hypotheses are consistent.
Relating calibration and consistency
Let $\ell: \mathcal{R} \times \mathcal{Y} \to \mathbb{R}_+$ be a discrete loss. A surrogate loss function $L: \mathbb{R}^d \times \mathcal{Y} \to \mathbb{R}_+$ is $\ell$-calibrated if and only if there exists a link function $\psi: \mathbb{R}^d \to \mathcal{R}$ such that, for all distributions $P$ on $\mathcal{X} \times \mathcal{Y}$ and all sequences of surrogate hypotheses $f_m: \mathcal{X} \to \mathbb{R}^d$, we have
$$\mathbb{E}_P\, L(f_m(X), Y) \to L^*_P \;\implies\; \mathbb{E}_P\, \ell(\psi(f_m(X)), Y) \to \ell^*_P.$$
Ramaswamy et al. (2015), Theorem 3; originally Tewari and Bartlett (2007), Theorem 2. Converging to the optimal hypothesis for the surrogate means the linked hypothesis converges to the optimal discrete loss.
Pause
Questions so far? Next we formalize properties, which we use to study calibration.
Properties
A property $\Gamma: \Delta_{\mathcal{Y}} \to \mathcal{R}$ is a function mapping probability distributions to reports. If it's easier, think of $p \in \Delta_{\mathcal{Y}}$ as a conditional probability. A property $\Gamma$ is elicitable if there is a loss function $\ell: \mathcal{R} \times \mathcal{Y} \to \mathbb{R}_+$ such that, for all $p \in \Delta_{\mathcal{Y}}$, $\Gamma(p) = \operatorname{argmin}_r \mathbb{E}_{Y \sim p}\, \ell(r, Y)$. Here, we say the loss $\ell$ elicits $\Gamma$. Elicitable properties have convex level sets (Lambert and Shoham 2009):
$$\Gamma_r = \{p \in \Delta_{\mathcal{Y}} : r \in \Gamma(p)\}.$$
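As a sanity check on the definition, a sketch (Python; the three-outcome distribution is hypothetical, and the mean is used as the example property) verifying numerically that squared loss elicits the mean:

```python
import numpy as np

# Hypothetical distribution p over outcomes Y in {0, 1, 2}.
outcomes = np.array([0.0, 1.0, 2.0])
p = np.array([0.2, 0.5, 0.3])

def expected_loss(r):
    """E_{Y~p} (r - Y)^2, the expected squared loss of report r."""
    return np.sum(p * (r - outcomes) ** 2)

reports = np.linspace(0, 2, 2001)
best = reports[np.argmin([expected_loss(r) for r in reports])]
print(best, p @ outcomes)  # both ~1.1: squared loss elicits the mean
```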
βDrawingβ a property
The simplex $\Delta_{\mathcal{Y}}$ over $n$ outcomes sits in $(n-1)$-dimensional space. Example: $n = 3$.
Calibration⦠in terms of properties
A property $\Gamma: \Delta_{\mathcal{Y}} \to \mathbb{R}^d$ and link $\psi: \mathbb{R}^d \to \mathcal{R}$ are $\ell$-calibrated if
$$u \in \Gamma(p) \implies \mathbb{E}_{Y \sim p}\, \ell(\psi(u), Y) = \min_r \mathbb{E}_{Y \sim p}\, \ell(r, Y),$$
i.e., the property value can always be linked to the argmin of the loss. This is a tool to study geometric properties of losses eliciting $\Gamma$.
Definition courtesy of Agarwal and Agarwal (2015).
Property Papers
Lambert, Shoham (2009): Eliciting Truthful Answers to Multiple-Choice Questions.
Finite properties are elicitable iff their level sets form a power diagram.
Agarwal, Agarwal (2015): On Consistent Surrogate Risk Minimization and Property Elicitation.
There's a connection between properties and surrogate losses.
Frongillo, Kash (2015): On Elicitation Complexity.
Every property is elicitable, but the question is how elicitable.
Calibrated surrogates
Positive normal sets. Necessary conditions. Sufficient conditions. Relationship between positive normal sets and level sets of the property.
Positive Normal Sets
Finite outcome setting: rewrite $L: \mathbb{R}^d \to \mathbb{R}^n$ as the vector of loss values, one per outcome: $L(u) = (L(u, y))_{y \in \mathcal{Y}}$.
By linearity of expectation, $\mathbb{E}_{Y \sim p}\, L(u, Y) = \langle p, L(u) \rangle$.
Positive normal set, defined for sequences $(u_m)$ whose expected loss converges to the infimum:
$$\mathcal{N}_L\big((u_m)\big) = \Big\{ p \in \Delta_{\mathcal{Y}} : \lim_{m} \langle p, L(u_m) \rangle = \inf_{u \in \mathbb{R}^d} \langle p, L(u) \rangle \Big\},$$
i.e., the distributions under which the expected loss on the outcome vectors attains the infimum of the expected loss possible.
Necessary Condition
Let $\ell$ be a discrete loss and let $L$ be $\ell$-calibrated. Let $\Gamma$ be the property elicited by $\ell$. For all $u$ in $\mathcal{S}_L$, the convex hull of the surrogate's loss vectors, there exists an $r \in \mathcal{R}$ such that $\mathcal{N}_L(u) \subseteq \Gamma_r$.
Ramaswamy et al. (2015) Theorem 6
Sufficient Condition
Suppose there exists some finite set of points $u_i \in \mathcal{S}_L$ such that $\bigcup_i \mathcal{N}_L(u_i) = \Delta_{\mathcal{Y}}$, and for each $i$ there exists an $r_i \in \mathcal{R}$ such that $\mathcal{N}_L(u_i) \subseteq \Gamma_{r_i}$. Then $L$ is $\ell$-calibrated.
Ramaswamy et al. (2015), Theorem 8. Example: 0-1 loss and hinge:
$$\mathcal{N}_{\text{hinge}}\big((2,0)\big) = \{p \in \Delta_2 : p_1 \geq \tfrac{1}{2}\} \subseteq \Gamma_1, \qquad \mathcal{N}_{\text{hinge}}\big((0,2)\big) = \{p \in \Delta_2 : p_1 \leq \tfrac{1}{2}\} \subseteq \Gamma_{-1},$$
with the two sets meeting at the boundary $p_1 = \tfrac{1}{2}$.
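A numeric sketch of this example (assuming $\mathcal{Y} = \{-1, 1\}$, identifying $p$ with $p_1 = \Pr[Y = 1]$, and linking by sign): the minimizer of expected hinge loss links to the 0-1 argmin on each half of the simplex:

```python
import numpy as np

def exp_hinge(u, p1):
    """E_{Y~p} hinge(u, Y), with p1 = Pr[Y = 1]."""
    return p1 * max(0.0, 1.0 - u) + (1.0 - p1) * max(0.0, 1.0 + u)

us = np.linspace(-2, 2, 4001)
for p1 in [0.2, 0.5, 0.8]:
    u_star = us[np.argmin([exp_hinge(u, p1) for u in us])]
    # At p1 = 0.5 the minimizer is not unique; any u in [-1, 1] is optimal.
    print(p1, u_star, np.sign(u_star))
```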
Outline
- Background
- Surrogates
- Calibration
- Properties
- Necessary and Sufficient Conditions
- Case Study: Abstain
- Dimensionality
- Conclusion
Case Study: Abstain
Situations where the cost of misclassification is high.
College admissions. Medical diagnoses.
Discrete loss for this problem:
$$\ell_{1/2}(r, y) = \begin{cases} 0, & r = y \\ \tfrac{1}{2}, & r = \bot \\ 1, & r \neq y,\; r \neq \bot \end{cases}$$
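A small sketch of this loss and its Bayes-optimal report (Python; $\bot$ written as the string "abstain"): predict the mode when it has mass at least $1/2$, else abstain, since reporting $y$ costs $1 - p_y$ in expectation while abstaining costs $1/2$.

```python
import numpy as np

ABSTAIN = "abstain"

def abstain_loss(r, y):
    """The abstain loss l_{1/2}: 0 if correct, 1/2 if abstain, 1 if wrong."""
    if r == ABSTAIN:
        return 0.5
    return 0.0 if r == y else 1.0

def bayes_report(p):
    """Optimal report under l_{1/2}: the mode if it has mass >= 1/2, else abstain."""
    y_star = int(np.argmax(p))
    return y_star if p[y_star] >= 0.5 else ABSTAIN

print(bayes_report(np.array([0.7, 0.1, 0.1, 0.1])))    # 0
print(bayes_report(np.array([0.25, 0.25, 0.25, 0.25])))  # abstain
```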
Historical calibrated surrogates
- Crammer, Singer (2001), sketched in code below:
$$L_{CS}(u, y) = \Big(1 + \max_{j \neq y} u_j - u_y\Big)_+, \qquad \psi_{CS}(u) = \begin{cases} \operatorname{argmax}_{1 \le j \le n} u_j, & u_{(1)} - u_{(2)} > \tau \\ \bot, & \text{otherwise} \end{cases}$$
where $u_{(1)}, u_{(2)}$ are the two largest coordinates of $u$ and $\tau$ is a threshold.
- One vs. All (Rifkin, Klautau 2004):
$$L_{OvA}(u, y) = \sum_j \Big[ \mathbb{1}\{y = j\}(1 - u_j)_+ + \mathbb{1}\{y \neq j\}(1 + u_j)_+ \Big], \qquad \psi_{OvA}(u) = \begin{cases} \operatorname{argmax}_{1 \le j \le n} u_j, & \max_j u_j > \tau \\ \bot, & \text{otherwise} \end{cases}$$
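A sketch of both surrogates and their links (Python; outcomes indexed $0, \dots, n-1$, $\bot$ written as the string "abstain", and tau a stand-in threshold whose calibrated values are given in the respective papers):

```python
import numpy as np

def cs_loss(u, y):
    """Crammer-Singer surrogate: (1 + max_{j != y} u_j - u_y)_+."""
    return max(0.0, 1.0 + np.max(np.delete(u, y)) - u[y])

def cs_link(u, tau):
    """Predict the argmax if the top-two gap exceeds tau, else abstain."""
    top = np.sort(u)[::-1]
    return int(np.argmax(u)) if top[0] - top[1] > tau else "abstain"

def ova_loss(u, y):
    """One-vs-All surrogate: hinge on coordinate y, reverse hinge elsewhere."""
    return sum(max(0.0, 1.0 - u[j]) if j == y else max(0.0, 1.0 + u[j])
               for j in range(len(u)))

def ova_link(u, tau):
    """Predict the argmax if its score exceeds tau, else abstain."""
    return int(np.argmax(u)) if np.max(u) > tau else "abstain"

u = np.array([0.9, -0.8, -0.7])
print(cs_loss(u, 0), cs_link(u, 0.5))    # 0.0 0
print(ova_loss(u, 0), ova_link(u, 0.0))  # ~0.6 0
```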
BEP Surrogate and Link
Ramaswamy et al. (2018), Section 4. [Figure: BEP surrogate and link illustrated at $p = (\tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4})$ and $p = (\tfrac{7}{10}, \tfrac{1}{10}, \tfrac{1}{10}, \tfrac{1}{10})$.]
Why BEP?
BEP, CS, and OvA are all $\ell_{1/2}$-calibrated for the abstain loss.
Convex surrogates take a report $u \in \mathbb{R}^d$.
BEP: $d = \lceil \log_2 n \rceil$. CS and OvA: $d = n$.
Why does the dimension $d$ matter?
Outline
- Background
- Surrogates
- Calibration
- Properties
- Necessary and Sufficient Conditions
- Case Study: Abstain
- Dimensionality
- Conclusion
Dimensionality
Want algorithms that are efficient and accurate.
Reducing the dimension makes the optimization problem more efficient; calibration guarantees accuracy.
Elicitation Complexity
Elic($\Gamma$) = minimum dimension $d$ such that $\Gamma$ is elicitable by a $d$-dimensional loss. Maybe $\Gamma$ isn't itself 1-elicitable, but it can be computed as $g \circ \hat{\Gamma}$, where $\hat{\Gamma}$ is 1-elicitable.
Then Elic($\Gamma$) = 1. This is called indirect elicitation.
Example: $\Gamma(p) = (\mathbb{E}_p[Y])^2$.
Elicit $\hat{\Gamma}(p) = \mathbb{E}_p[Y]$ and take $g: y \mapsto y^2$.
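A sketch of this indirect elicitation (Python; the sample from a hypothetical $p$ stands in for data): fit the mean, which squared loss elicits, then apply the link $g$.

```python
import numpy as np

rng = np.random.default_rng(0)
ys = rng.choice([0.0, 1.0, 2.0], p=[0.2, 0.5, 0.3], size=100_000)

# Step 1: elicit Gamma_hat(p) = E_p[Y] via squared loss
# (the empirical minimizer of squared loss is the sample mean).
mean_hat = ys.mean()

# Step 2: apply g(y) = y^2 to obtain Gamma(p) = (E_p[Y])^2 indirectly.
print(mean_hat ** 2)  # ~1.21 = 1.1^2
```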
Convex Calibration Dimension
Special case of elicitation complexity. ccdim($\ell$) = minimum dimension $d$ such that there is a convex $\ell$-calibrated surrogate $L: \mathbb{R}^d \times \mathcal{Y} \to \mathbb{R}_+$. Example: ccdim($\ell_{1/2}$) $\le \lceil \log_2 n \rceil$ because of the BEP surrogate.
From Ramaswamy et al. (2015), Definition 10.
Bounds on CC dimension
Understood through the feasible subspace dimension. Tight bound for properties whose level sets all intersect in the interior of the simplex: ccdim($\ell$) = $n - 1$.
Does not apply to abstain.
These results are from Ramaswamy et al. (2015).
Feasible Subspace Dimension
The feasible subspace dimension $\mu_C(p)$ of a convex set $C$ at the point $p \in C$ is the dimension of $\mathcal{F}_C(p) \cap -\mathcal{F}_C(p)$, where $\mathcal{F}_C(p)$ is the cone of feasible directions of $C$ at $p$.
Essentially: the dimension of the smallest face of $C$ containing $p$.
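As a concrete instance (a sketch; for $C = \Delta_{\mathcal{Y}}$ itself, the smallest face containing $p$ is spanned by the support of $p$, so $\mu_{\Delta}(p) = \|p\|_0 - 1$):

```python
import numpy as np

def simplex_face_dim(p, tol=1e-12):
    """Dimension of the smallest face of the simplex containing p: ||p||_0 - 1."""
    return int(np.sum(np.asarray(p) > tol)) - 1

print(simplex_face_dim([0.5, 0.5, 0.0]))    # 1: an edge of the triangle
print(simplex_face_dim([1/3, 1/3, 1/3]))    # 2: the triangle's interior
```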
Lower bound on CC Dimension
Let $\ell$ be the discrete loss eliciting the finite property $\Gamma: \Delta_{\mathcal{Y}} \to \mathcal{R}$. For $p \in \Delta_{\mathcal{Y}}$ and $r$ such that $p \in \Gamma_r$,
$$\text{ccdim}(\ell) \ge \|p\|_0 - \mu_{\Gamma_r}(p) - 1.$$
Ramaswamy et al. (2016), Theorem 16.
Proof Sketch
Consider $p \in \operatorname{relint}(\Delta_{\mathcal{Y}})$; the other case reduces to this one. Suppose the surrogate $L: \mathbb{R}^d \times \mathcal{Y} \to \mathbb{R}_+$ is $\ell$-calibrated.
Want to show $d \ge \|p\|_0 - \mu_{\Gamma_r}(p) - 1$.
Want to show there is a set $\mathcal{Q} \subseteq \Delta_{\mathcal{Y}}$ and $r' \in \mathcal{R}$ so that three things are true:
- 1. $p \in \mathcal{Q}$
- 2. $\mu_{\mathcal{Q}}(p) \ge n - d - 1$
- 3. $\mathcal{Q} \subseteq \Gamma_{r'}$
Then $p \in \Gamma_{r'}$ and $\mu_{\Gamma_r}(p) = \mu_{\Gamma_{r'}}(p) \ge \mu_{\mathcal{Q}}(p) \ge n - d - 1$; since $\|p\|_0 = n$ on the relative interior, rearranging gives the bound.
Proof Sketch: Conditions 1 and 2
Construct $\mathcal{Q}$ and $r'$. Consider $u$ (a point or sequence) so that
$$\inf_{z \in \mathcal{S}_L} \langle p, z \rangle = \inf_{u' \in \mathbb{R}^d} \langle p, L(u') \rangle.$$
Claim: there exist $x_y \in \partial L_y(u)$ for all $y \in \mathcal{Y}$ so that $\sum_y p_y x_y = 0$. Set $B = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^{d \times n}$ and $\mathcal{Q} = \{q \in \Delta_{\mathcal{Y}} : Bq = 0\}$. Then $\mu_{\mathcal{Q}}(p) \ge \operatorname{nullity}(B) - 1 \ge n - d - 1$ by Ramaswamy Lemma 15.
- Conditions 1 and 2 satisfied.
Condition 3 Intuition
We construct $\mathcal{Q}$ using sequences that converge to $u$; the $\epsilon$-subdifferential is used to pass to the infimum. They show $q \in \mathcal{Q} \implies q \in \mathcal{N}_L(u)$ with a bunch of arithmetic.
The result follows since $\mathcal{N}_L(u) \subseteq \Gamma_{r'}$ for some $r' \in \mathcal{R}$, by the necessary condition. "Let $\ell$ be the discrete loss eliciting the finite property $\Gamma: \Delta_{\mathcal{Y}} \to \mathcal{R}$. For $p \in \Delta_{\mathcal{Y}}$ and $r$ such that $p \in \Gamma_r$, ccdim($\ell$) $\ge \|p\|_0 - \mu_{\Gamma_r}(p) - 1$."
Mode: CC dimension bound is tight
Trivial bound: ccdim($\ell_{0\text{-}1}$) $\le n - 1$. Previous result: for $p \in \Gamma_r$, ccdim($\ell$) $\ge \|p\|_0 - \mu_{\Gamma_r}(p) - 1$. Take $p \in \bigcap_r \Gamma_r$ (the uniform distribution, with $\|p\|_0 = n$): ccdim($\ell_{0\text{-}1}$) $\ge n - 0 - 1$. Thus the bound is tight.
Abstain: Unknown bounds
$\text{ccdim} \ge 3 - 1 - 1 = 1$; $\quad \text{ccdim} \ge 3 - 2 - 1 = 0$; $\quad \text{ccdim} \ge 2 - 0 - 1 = 1$.
With abstain, $\bigcap_r \Gamma_r = \emptyset$. Thus, the lower-bound argument above does not apply.
Calibrated surrogate papers
Ramaswamy, Agarwal (2015): Convex calibration dimension for multiclass loss matrices.
Reducing surrogate dimension is important. Here are some bounds on the dimension of calibrated convex surrogate losses.
Ramaswamy et al. (2018): Consistent Algorithms for Multiclass Classification with a Reject Option.
Abstain is cool, and here's a log(n)-dimensional consistent convex surrogate.
Other discrete examples
Ramaswamy et al. (2015): Convex calibrated surrogates for hierarchical classification.
Here's a calibrated surrogate for this problem using the abstain surrogate.
Lapin et al. (2015): Analysis and Optimization of Loss Functions for Multiclass, Top-k, and Multilabel Classification.
Here are some surrogates for these problems. Not sure if they're calibrated.
Yu, Blaschko (2015): The Lovász Hinge: A Novel Convex Surrogate for Submodular Losses.
Here's how you go from submodular functions to convex surrogates.
Future Work
Embedding Dimension. Constructing Link Functions. Approximate Elicitation.
Summary
$$\ell_{1/2}(r, y) = \begin{cases} 0, & r = y \\ \tfrac{1}{2}, & r = \bot \\ 1, & r \neq y,\; r \neq \bot \end{cases}$$