Performance-Aligned Learning Algorithms
with Statistical Guarantees
Rizal Zaini Ahmad Fathony
Committee:
- Prof. Brian Ziebart (Chair)
- Prof. Bhaskar DasGupta
- Prof. Xinhua Zhang
- Prof. Lev Reyzin
- Prof. Simon Lacoste-Julien
Outline:
1. Introduction & Motivation
2. General Multiclass Classification
3. Graphical Models
4. Bipartite Matching in Graphs
5. Conclusion & Future Directions

Thesis: "New learning algorithms that align with performance/loss metrics and provide the statistical guarantees of Fisher consistency"
[Figure: the supervised learning setting]
Data: drawn from the data distribution P(X, Y).
Training: learn a predictor from samples (x₁, y₁), …, (xₙ, yₙ).
Testing: predict ŷₙ₊₁ for xₙ₊₁, ŷₙ₊₂ for xₙ₊₂, …
Evaluation: loss/performance metrics loss(ŷ, y) / score(ŷ, y).

Prediction settings: Multiclass Classification, Multivariate Performance, Structured Prediction.
Non-convex, non-continuous metrics → intractable optimization.
A convex surrogate loss needs to be employed!
A desirable property of convex surrogates: under ideal conditions (given the true distribution and a fully expressive model), optimizing the surrogate also minimizes the loss metric.
1. Probabilistic approach: Logistic Regression, Conditional Random Fields (CRF)
2. Large-margin approach: Support Vector Machine (SVM), Structured SVM
(Pictures taken from the MLPP book by Kevin Murphy.)
1. Multiclass Logistic Regression: statistical guarantee of Fisher consistency (minimizes the zero-one loss metric in the limit), but no dual parameter sparsity.
2. Multiclass SVM: computational efficiency (via the kernel trick & dual parameter sparsity), but current multiclass SVM formulations lack a Fisher consistency guarantee.
1. Conditional Random Fields (CRF): statistical guarantee of Fisher consistency; no easy mechanism to incorporate customized loss/performance metrics; computation of the normalization term may be intractable.
2. Structured SVM: flexibility to incorporate customized loss/performance metrics; relatively more efficient in computation; no Fisher consistency guarantee.
Our goals:
- Provide the Fisher consistency guarantee
- Align better with the loss/performance metric (by incorporating the metric into the learning objective)
- Be computationally efficient
- Perform well in practice

How? The robust adversarial learning approach:
"What predictor best maximizes the performance metric (or minimizes the loss metric) in the worst case given the statistical summaries of the empirical distributions?"
Based on:
- Fathony, R., Asif, K., Liu, A., Bashiri, M. A., Xing, W., Behpour, S., Zhang, X., and Ziebart, B. D. "Consistent Robust Adversarial Prediction for General Multiclass Classification." arXiv preprint arXiv:1812.07526, 2018 (submitted to JMLR).
- Fathony, R., Liu, A., Asif, K., and Ziebart, B. "Adversarial Multiclass Classification: A Risk Minimization Perspective." NIPS 2016.
- Fathony, R., Bashiri, M. A., and Ziebart, B. "Adversarial Surrogate Losses for Ordinal Regression." NIPS 2017.
Multiclass Classification: a finite set of possible values of y: {1, 2, 3, …, k}.
Example: Digit Recognition (labels 1, 2, 3, …)
Loss metric: zero-one loss, loss(ŷ, y) = 1[ŷ ≠ y].
[Figure: the zero-one loss matrix L]
Example: Movie Rating Prediction (ordinal labels, e.g., 1 to 5)
Loss metric: absolute loss, loss(ŷ, y) = |ŷ − y|, the distance between the predicted and the actual label.
[Figure: the absolute loss matrix L]
Classification with abstention: the predictor can also say "abstain".
Loss metric: abstention loss, loss(ŷ, y) = α if the predictor abstains, and 1[ŷ ≠ y] otherwise.
[Figure: the abstention loss matrix L, with an extra row for the abstain option]
Other loss metrics:
- Squared loss: loss(ŷ, y) = (ŷ − y)²
- Cost-sensitive loss: loss(ŷ, y) = C_{ŷ,y} for a given cost matrix C
- Taxonomy-based loss: loss(ŷ, y) defined by distances in a label taxonomy
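These metrics are straightforward to state in code; a minimal sketch (the abstention penalty α and the cost matrix C are illustrative placeholders):

```python
import numpy as np

def zero_one_loss(y_hat, y):
    # loss(ŷ, y) = 1[ŷ ≠ y]
    return float(y_hat != y)

def absolute_loss(y_hat, y):
    # loss(ŷ, y) = |ŷ − y|, for ordinal labels
    return abs(y_hat - y)

def abstention_loss(y_hat, y, alpha=0.4):
    # loss(ŷ, y) = α if the predictor abstains, 1[ŷ ≠ y] otherwise
    return alpha if y_hat == "abstain" else float(y_hat != y)

def squared_loss(y_hat, y):
    # loss(ŷ, y) = (ŷ − y)²
    return (y_hat - y) ** 2

def cost_sensitive_loss(y_hat, y, C):
    # C is a k×k cost matrix; labels are 0-indexed here
    return C[y_hat, y]
```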
Empirical Risk Minimization: approximate the original loss metric (non-convex, non-continuous) with convex surrogates.
Robust Adversarial Learning: keep the original loss metric, but evaluate the predictor's probabilistic prediction against an adversary's probabilistic prediction instead of the empirical data, constraining the statistics of the adversary's distribution to match the empirical statistics.
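In symbols, the approach can be sketched as the following constrained minimax problem (notation assumed from the cited papers: P̃ is the empirical distribution and φ the feature function):

```latex
\min_{\hat{P}(\hat{y}\mid x)} \; \max_{\check{P}(\check{y}\mid x)} \;
  \mathbb{E}_{x \sim \tilde{P};\, \hat{y} \sim \hat{P};\, \check{y} \sim \check{P}}
  \left[ \operatorname{loss}(\hat{y}, \check{y}) \right]
\quad \text{s.t.} \quad
  \mathbb{E}_{x \sim \tilde{P};\, \check{y} \sim \check{P}}
  \left[ \phi(x, \check{y}) \right]
  = \mathbb{E}_{(x, y) \sim \tilde{P}} \left[ \phi(x, y) \right]
```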
Primal → Dual: via Lagrange multipliers and minimax duality, the formulation reduces to ERM with the adversarial surrogate loss (AL):
min_θ E_{(x,y)∼P̃}[ AL_{f_θ}(x, y) ], with the simplified notation f_θ(x, ·) = θ · φ(x, ·) for the Lagrangian potentials.
The objective is convex in θ.
Adversarial Surrogate Loss: convert to a linear program.
The constraints form a convex polytope (example: four-class classification). There is always an optimal solution that is an extreme point of the (bounded) polytope, so computing AL = finding the best extreme point.
A generic LP solver costs O(k^3.5).
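A minimal sketch of this LP computation with scipy, assuming the saddle-point form AL(ψ, L) = max_{q∈Δ} [ min_j (Lq)_j + ψᵀq ] with ψ_j = f_j − f_y (function and variable names are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def adversarial_loss_lp(psi, L):
    """Compute AL = max_{q in simplex} [ min_j (Lq)_j + psi^T q ] as an LP:
    maximize t + psi^T q  s.t.  t <= (Lq)_j for all j, q >= 0, sum(q) = 1.
    Decision variables: [q_1, ..., q_k, t]."""
    k = len(psi)
    c = -np.concatenate([psi, [1.0]])          # linprog minimizes, so negate
    A_ub = np.hstack([-L, np.ones((k, 1))])    # t - (Lq)_j <= 0 for all j
    b_ub = np.zeros(k)
    A_eq = np.concatenate([np.ones(k), [0.0]])[None, :]   # sum(q) = 1
    b_eq = [1.0]
    bounds = [(0, None)] * k + [(None, None)]  # q >= 0, t free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return -res.fun

# Example: zero-one loss matrix for four-class classification
k = 4
L = 1.0 - np.eye(k)
f = np.array([0.2, 0.8, 0.5, 0.1])   # example potentials
y = 1                                 # true class index
print(adversarial_loss_lp(f - f[y], L))
```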
The Adversarial Surrogate Loss for the Zero-One Loss Metric (AL0-1)
The extreme points of the polytope (e_i is the vector with a single 1 at the i-th index and 0 elsewhere) yield a closed form:
AL0-1(f, y) = max over non-empty S ⊆ {1, …, k} of ( Σ_{j∈S} ψ_j + |S| − 1 ) / |S|, where ψ_j = f_j − f_y.
Computation of AL0-1: sort ψ in non-increasing order and grow S greedily until adding more potentials decreases the loss value: O(k log k), where k is the number of classes.
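A sketch of this O(k log k) greedy computation (function and variable names are illustrative):

```python
import numpy as np

def al_zero_one(f, y):
    """AL0-1: max over non-empty sets S of (sum_{j in S} psi_j + |S| - 1)/|S|,
    with psi_j = f_j - f_y.  After sorting, the best S is a prefix of the
    sorted potentials, so one pass over prefix sizes suffices."""
    psi = np.sort(f - f[y])[::-1]            # non-increasing order
    best, running_sum = 0.0, 0.0
    for m, p in enumerate(psi, start=1):
        running_sum += p
        value = (running_sum + m - 1) / m
        if value < best:                     # adding more only decreases it
            break
        best = value
    return best
```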
Convex Polytope of the AL0-1
[Figure: the polytope for binary classification and for three-class classification]
Ordinal Regression with the Absolute Loss Metric
The extreme points of the polytope (e_i is the vector with a single 1 at the i-th index and 0 elsewhere) yield the adversarial surrogate loss ALord.
Computation cost: O(k), where k is the number of classes.
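A sketch of the O(k) computation, assuming ALord takes the pairwise form max_{i,j} [ (f_i + f_j + j − i)/2 ] − f_y, which splits into two independent one-dimensional maximizations:

```python
import numpy as np

def al_ordinal(f, y):
    """ALord = max_{i,j} (f_i + f_j + j - i)/2 - f_y
             = [max_i (f_i - i) + max_j (f_j + j)] / 2 - f_y.
    f: potential vector over the k ordinal classes; y: 0-based true label."""
    idx = np.arange(1, len(f) + 1)           # 1-based label positions
    return (np.max(f - idx) + np.max(f + idx)) / 2 - f[y]
```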
Classification with Abstention (0 ≤ α ≤ 0.5)
The extreme points of the polytope (e_i as above) yield the adversarial surrogate loss ALabstain.
Computation cost: O(k), where k is the number of classes.
Fisher Consistency Requirement in Multiclass Classification
Fisher consistency: under the true distribution P* and with a fully expressive model, the minimizer of the surrogate loss must also be a Bayes risk minimizer of the original loss metric.
Optimization: sub-gradient descent. Rich feature spaces are incorporated via the kernel trick: map the input space x_i into a rich feature space φ(x_i) and compute only the dot products φ(x_i)·φ(x_j).
Example for AL0-1: the sub-gradient is determined by S*, the set that maximizes AL0-1.
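A sketch of the sub-gradient computation for AL0-1 (names illustrative): each class in the maximizing set S* receives weight 1/|S*|, and the true class receives an extra −1.

```python
import numpy as np

def al_zero_one_subgradient(f, y):
    """Sub-gradient of AL0-1 w.r.t. the potential vector f.
    S* is the best prefix of classes sorted by psi_j = f_j - f_y.
    d/df_j = 1/|S*| for j in S*, plus -1 on the true class y."""
    psi = f - f[y]
    order = np.argsort(-psi)                 # non-increasing order
    best_val, best_m, running = -np.inf, 1, 0.0
    for m in range(1, len(f) + 1):
        running += psi[order[m - 1]]
        val = (running + m - 1) / m
        if val > best_val:
            best_val, best_m = val, m
    g = np.zeros_like(f)
    g[order[:best_m]] += 1.0 / best_m
    g[y] -= 1.0
    return best_val, g
```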
Multiclass Support Vector Machine (SVM)
Current multiclass SVM formulations fall into two groups:
- Relative margin models: perform well with low-dimensional features (Dogan et al., 2016), but are not Fisher consistent (Tewari and Bartlett, 2007; Liu, 2007).
- Absolute margin model (LLW): Fisher consistent, but performs poorly with low-dimensional features (Dogan et al., 2016).
Dataset Properties and AL0-1 Constraints
[Table: properties of the 12 experimental datasets and the dual parameter sparsity of AL0-1]
Results for Linear Kernel and Gaussian Kernel
[Table: the mean (standard deviation) of the accuracy; bold numbers: best or not significantly worse than the best]
Linear kernel: AL0-1 shows a slight benefit; LLW performs poorly.
Gaussian kernel: LLW's performance improves; AL0-1 maintains its benefit.
Summary: AL0-1 is a relative margin model that is both Fisher consistent and performs well with low-dimensional features.

General Multiclass Classification
[Table: loss metric options for general multiclass classification and the corresponding adversarial surrogate losses]
Based on: Rizal Fathony, Ashkan Rezaei, Mohammad Bashiri, Xinhua Zhang, Brian D. Ziebart. "Distributionally Robust Graphical Models." Advances in Neural Information Processing Systems 31 (NeurIPS), 2018.
Some popular graphical structures in structured prediction:
- Chain structure: activity prediction, sequence tagging, NLP tasks (e.g., named entity recognition)
- Tree structure: parse-tree-based NLP tasks (semantic role labeling, sentiment analysis)
- Lattice structure: computer vision tasks (e.g., image segmentation)
1. Conditional Random Fields (CRF) (Lafferty et al., 2001):
   - Fisher consistent: produces Bayes-optimal predictions in the ideal case.
   - No easy mechanism to incorporate customized loss/performance metrics: the algorithm optimizes the conditional likelihood; metric-based prediction can only be performed after the learning process.
2. Structured SVM (SSVM) (Tsochantaridis et al., 2005):
   - Aligns with the loss/performance metrics: accepts a customized loss/performance metric in its optimization objective.
   - No Fisher consistency guarantee: based on the multiclass SVM-CS; not consistent for distributions with no majority label.
Primal: the loss metric decomposes additively over the nodes: loss(Ŷ, Y̌) = Σ_{i=1}^n loss(ŷ_i, y̌_i).
Dual: θ_e are the Lagrange multipliers for constraints with edge features; θ_w for constraints with node features.
The resulting game matrix has size kⁿ × kⁿ: intractable for modestly-sized n.
Dual, marginal formulation: the objective depends on the predictor's distribution P̂(Ŷ|X) only through its node marginal probabilities P̂(ŷ_i|X), and on the adversary's distribution P̌(Y̌|X) only through its node and edge marginal probabilities P̌(y̌_i|X) and P̌(y̌_i, y̌_j|X).
For general graphical models this remains intractable, similar to CRF and SSVM. Focus: graphs with low treewidth (e.g., chains, trees, simple loops), where the optimization is tractable.
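For intuition, on a chain the node and edge marginals that the objective depends on are exactly what the forward-backward algorithm computes from potentials; a generic sketch (not the AGM optimization itself; no log-space stabilization):

```python
import numpy as np

def chain_marginals(node_pot, edge_pot):
    """Forward-backward on a chain MRF with n nodes and k states.
    node_pot: (n, k) log-potentials; edge_pot: (k, k) shared log-potential.
    Returns node marginals (n, k) and edge marginals (n-1, k, k)."""
    P, E = np.exp(node_pot), np.exp(edge_pot)
    n, k = P.shape
    alpha, beta = np.zeros((n, k)), np.ones((n, k))
    alpha[0] = P[0]
    for i in range(1, n):                    # forward messages
        alpha[i] = P[i] * (alpha[i - 1] @ E)
    for i in range(n - 2, -1, -1):           # backward messages
        beta[i] = E @ (P[i + 1] * beta[i + 1])
    Z = alpha[-1].sum()                      # partition function
    node_marg = alpha * beta / Z
    edge_marg = np.stack([np.outer(alpha[i], P[i + 1] * beta[i + 1]) * E / Z
                          for i in range(n - 1)])
    return node_marg, edge_marg
```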
Matrix Notation (Tree-Structure AGM)
Outer optimization over the Lagrange multipliers θ_e and θ_w; the inner marginal game is solved with dual decomposition.
Runtime (for a single subgradient update): AGM O(lnk log k + nk²), where k = # classes, n = # nodes, l = # iterations in dual decomposition; CRF O(nk²); SSVM O(nk²).
General graphs with low treewidth: O(lmx·k^(x+1) log k + m·k^(2(x+1))), where m = # cliques, x = treewidth of the graph.
AGM is consistent when f is optimized over all measurable functions on the input space.
If the loss function is additive, AGM is also consistent when f is optimized over a restricted set of functions: all measurable functions that are additive over the edge and node potentials.
Facial Emotion Intensity Prediction (Chain Structure, Labels with Ordinal Category)
Results: The mean (standard deviation) of the average loss metrics.
Bold numbers: best or not significantly worse than the best
Semantic Role Labeling (Tree Structure)
Results: [table]
Summary:
- CRF (Lafferty et al., 2001): consistent, but not performance-aligned.
- Structured SVM (Tsochantaridis et al., 2005): performance-aligned, but not consistent.
- Adversarial Graphical Models (our approach): both consistent and performance-aligned.
Based on: Rizal Fathony*, Sima Behpour*, Xinhua Zhang, Brian D. Ziebart. "Efficient and Consistent Adversarial Bipartite Matching." International Conference on Machine Learning (ICML), 2018.
Maximum weighted bipartite matching:
[Figure: two node sets A = {1, 2, 3, 4} and B = {1, 2, 3, 4} with the matching π = [4, 3, 1, 2]]
Machine learning task: learn the appropriate edge weights. Objective: minimize a loss metric, e.g., the Hamming loss between the predicted and the ground-truth matching.
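A minimal sketch of the prediction step and the Hamming loss, given already-learned edge weights (the weight matrix here is a hypothetical example):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical learned edge weights: W[i, j] = weight of matching i in A to j in B
W = np.array([[0.1, 0.4, 0.2, 0.9],
              [0.3, 0.8, 0.1, 0.2],
              [0.7, 0.2, 0.1, 0.3],
              [0.2, 0.5, 0.9, 0.1]])

# Maximum weighted bipartite matching (Hungarian algorithm, maximizing)
rows, cols = linear_sum_assignment(W, maximize=True)
print("matching:", cols + 1)   # permutation in 1-based notation

def hamming_loss(perm_hat, perm):
    # fraction of positions matched differently from the ground truth
    return np.mean(np.asarray(perm_hat) != np.asarray(perm))
```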
Applications:
1. Word alignment (Taskar et al., 2005; Padó & Lapata, 2006; MacCartney et al., 2008), e.g., aligning the words of "natürlich ist das haus klein" with those of its English translation.
2. Correspondence between images (Belongie et al., 2002; Dellaert et al., 2003).
3. Learning to rank documents (Dwork et al., 2001; Le & Smola, 2007).
A non-bipartite matching task can be converted to a bipartite matching problem.
1. CRF (Petterson et al., 2009; Volkovs & Zemel, 2012):
   - Fisher consistent: produces Bayes-optimal predictions in the ideal case.
   - Computationally intractable: the normalization term requires computing the matrix permanent (a #P-hard problem); approximation is needed even for modestly sized problems.
2. Structured SVM (Tsochantaridis et al., 2005):
   - Computationally efficient: solved using constraint generation, with the Hungarian algorithm computing the most violated constraints.
   - No Fisher consistency guarantee: based on the multiclass SVM-CS; not consistent for distributions with no majority label.
Primal and Dual: the objective combines the Hamming loss with a Lagrangian potential term.
[Figure: the augmented Hamming loss matrix for n = 3, indexed by the 3! permutations]
The matrix has size n! × n!: intractable for modestly-sized n.
Marginal distribution matrices: for the predictor, P with p_{i,j} = P̂(π̂_i = j); for the adversary, Q with q_{i,j} = P̌(π̌_i = j).
Birkhoff–von Neumann theorem: the convex hull of the permutation matrices (for n = 3: 123, 132, 213, 231, 312, 321) is exactly the convex polytope of doubly stochastic matrices. This reduces the space of optimization from O(n!) to O(n²).
Dual, marginal formulation. Optimization techniques used: rearrange the optimization order and add regularization and smoothing penalties, then solve with
- projected Quasi-Newton (Schmidt et al., 2009),
- a closed-form solution for the inner subproblem, and
- projection onto the set of doubly stochastic matrices.
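For illustration, one standard way to push a positive matrix toward the doubly stochastic polytope is Sinkhorn's alternating row/column normalization; this is a sketch only, and the paper's exact projection step may differ:

```python
import numpy as np

def sinkhorn_projection(M, n_iters=200):
    """Alternately normalize rows and columns of a positive matrix so it
    approaches a doubly stochastic matrix (all row and column sums = 1)."""
    Q = np.asarray(M, dtype=float).copy()
    for _ in range(n_iters):
        Q /= Q.sum(axis=1, keepdims=True)   # make row sums 1
        Q /= Q.sum(axis=0, keepdims=True)   # make column sums 1
    return Q
```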
Empirical Risk Perspective of Adversarial Bipartite Matching
ALperm is consistent when f is optimized over all measurable functions on the input space (x, π).
ALperm is also consistent when f is optimized over a restricted set of additive functions, f(x, π) = Σ_i f_i(x, π_i), provided each f_i is allowed to range over all measurable functions on the individual input space (x, π_i).
Application: Video Tracking
[Figure: empirical runtime until convergence]
The empirical runtime grows (roughly) quadratically in n; CRF is impractical even for n = 20 (Petterson et al., 2009).
Public Benchmark Datasets
Results: competitive with SSVM.
Summary:
- CRF (Petterson et al., 2009; Volkovs & Zemel, 2012): consistent, but not efficient.
- Structured SVM (Tsochantaridis et al., 2005): efficient and performs well, but not consistent.
- Adversarial Bipartite Matching (our approach): consistent, efficient, and performs well.
The proposed algorithms:
- provide the Fisher consistency guarantee;
- align better with the loss/performance metric (by incorporating the metric into the learning objective);
- are computationally efficient; and
- perform well in practice.
Future directions:
- Fairness: our formulation only enforces moment-matching constraints on the adversary. Can we add fairness constraints to the predictor? Fairness is an important issue in automated decision-making with ML algorithms and requires the algorithm to produce fair predictions.
- Stronger guarantees: is there a stronger statistical guarantee that separates high-performing Fisher consistent algorithms from low-performing ones? In multiclass classification, both AL0-1 and SVM-LLW are Fisher consistent, yet their performance is quite different.
- Graphical models: can we develop learning algorithms for general graphical models? More complex graphical structures are popular in some applications (e.g., computer vision), but exact learning algorithms for AGM may be intractable there. What kinds of approximation algorithms are applicable?
- Deep learning: deep learning has been applied successfully to many prediction problems, yet most architectures are not designed to optimize customized loss metrics. How can the robust adversarial learning approach help in designing deep learning architectures?
Collaborators: Mohammad Bashiri, Ashkan Rezaei, Kaiser Asif, Anqi Liu, Sima Behpour, Xinhua Zhang, Brian Ziebart, Wei Xing.
Rizal Fathony, Kaiser Asif, Anqi Liu, Mohammad Bashiri, Wei Xing, Sima Behpour, Xinhua Zhang, Brian D. Ziebart. "Consistent Robust Adversarial Prediction for General Multiclass Classification." Submitted to JMLR.
Rizal Fathony, Ashkan Rezaei, Mohammad Bashiri, Xinhua Zhang, Brian D. Ziebart. "Distributionally Robust Graphical Models." Advances in Neural Information Processing Systems 31 (NeurIPS), 2018.
Rizal Fathony*, Sima Behpour*, Xinhua Zhang, Brian D. Ziebart. "Efficient and Consistent Adversarial Bipartite Matching." International Conference on Machine Learning (ICML), 2018.
Rizal Fathony, Mohammad Bashiri, Brian D. Ziebart. "Adversarial Surrogate Losses for Ordinal Regression." Advances in Neural Information Processing Systems 30 (NIPS), 2017.
Rizal Fathony, Anqi Liu, Kaiser Asif, Brian D. Ziebart. "Adversarial Multiclass Classification: A Risk Minimization Perspective." Advances in Neural Information Processing Systems 29 (NIPS), 2016.
Anqi Liu, Rizal Fathony, Brian D. Ziebart. arXiv preprint, 2016.