SLIDE 1

Knowledge-Uncertainty Axiomatized Framework with Support Vector Machines for Hyperparameter Optimization

Marcin Orchel

AGH University of Science and Technology in Poland

1 / 54

SLIDE 2

1 Introduction
2 Problem Definition
3 Solution
4 SVM
5 Measures of Knowledge and Uncertainty
6 Experiments
7 Summary

2 / 54

SLIDE 3

Introduction

Introduction 3 / 54

SLIDE 4

Introduction

There are multiple reformulations of support vector machines (SVM). Reformulations regard the objective function, the constraints, and the representation of a solution (kernel function).

How to group reformulations into one framework? We introduce a framework of knowledge and uncertainty: a multi-objective optimization problem with two goals, maximizing knowledge and minimizing uncertainty. We generalize the regularization term in SVM to an uncertainty measure, and the hinge loss to a knowledge measure. How can we improve the generalization performance or simplicity of SVM?

Introduction 4 / 54

SLIDE 5

Introduction

Define the most efficient measures of knowledge and uncertainty. Define the concepts of knowledge and uncertainty with a set of axioms. Requirement: use the existing SVM optimization problem. Define a knowledge-uncertainty framework for selecting optimal values of hyperparameters.

Select optimal values of hyperparameters over a finite set of candidates generated by a double grid search method.
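The double grid search can be sketched as a coarse pass over log-spaced candidates for (C, σ), followed by a finer pass around the coarse winner. This is a minimal sketch: the `score` callback, the grids, and the refinement factor are illustrative assumptions, not the exact procedure from the slides.

```python
import math

def geomspace(a, b, n):
    # n log-spaced points from a to b, inclusive
    return [a * (b / a) ** (i / (n - 1)) for i in range(n)]

def double_grid_search(score, c_grid, sigma_grid, refine=3):
    # first (coarse) pass over the full candidate grids
    _, c0, s0 = max((score(c, s), c, s) for c in c_grid for s in sigma_grid)
    # second (fine) pass on a narrower grid around the coarse optimum
    _, c1, s1 = max((score(c, s), c, s)
                    for c in geomspace(c0 / 2, c0 * 2, refine)
                    for s in geomspace(s0 / 2, s0 * 2, refine))
    return c1, s1

# toy validation score with a known optimum at C = 1, sigma = 1
score = lambda c, s: -(math.log10(c) ** 2 + math.log10(s) ** 2)
c_best, s_best = double_grid_search(score,
                                    geomspace(0.01, 100.0, 5),
                                    geomspace(0.01, 100.0, 5))
```

In practice `score` would be a cross-validation accuracy of an SVM trained with the given (C, σ).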

Introduction 5 / 54

SLIDE 6

Introduction

There are no practical methods for selecting the value of the hyperparameter C. We use a double grid search method, or global optimization methods such as evolutionary computation, for selecting the values of C and σ. We use cross validation for comparing different sets of hyperparameter values. We minimize statistical generalization bounds. We could aggregate solutions for multiple values of hyperparameters.

Introduction 6 / 54

SLIDE 7

Introduction

A similar idea to uncertainty is risk in financial economics; risk relates more to potential loss than to uncertainty. Knowledge has been axiomatized in epistemic modal logic, but those axioms are based on logic rather than on mathematical spaces.

Introduction 7 / 54

SLIDE 8

Problem Definition

Problem Definition 8 / 54

SLIDE 9

Classification problem

Definition 1 (Classification problem with a training set)

For a universe X of objects, a set C of classes and a set of mappings MT : XT ⊂ X → C\{c0} called a training set T, a classification problem is to find classes for all elements of X.

Definition 2 (Classification problem with hypotheses)

For a classification problem with a training set, we define additionally a space of hypotheses H. Each hypothesis is a function h : X → C. Finding a class for all elements of X is replaced by finding a hypothesis.

Problem Definition 9 / 54

SLIDE 10

Knowledge set

Definition 3 (knowledge set)

A knowledge set K is a tuple K = (X, C, MT; U, c); shortly, without environment objects, it is a pair K = (U, c), where c ≠ c0. It is a set U ⊂ X of points with the information that every u ∈ U maps to c ∈ C\{c0}. The c is called the class of a knowledge set. The difference between a hypothesis and a knowledge set is that a knowledge set defines mappings only for some objects, while a hypothesis does so for all objects. We also define an unknowledge set as a pair U = (U, c0).

Definition 4 (knowledge setting)

A knowledge setting is a pair of knowledge sets, (K1, K2), shortly (K1,2), where K1 = (U1, c1), K2 = (U2, c2), c1, c2 ∈ C\{c0}, c1 ≠ c2.

Problem Definition 10 / 54

SLIDE 11

Knowledge set

We can generalize knowledge settings to multiple knowledge sets, including the case of only one knowledge set. A special type of knowledge setting is a knowledge setting consistent with a training set, (T1,2). We define the operation of inclusion as inclusion of mappings: (K1,2) ⊆ (L1,2) if and only if M(K1,2) ⊆ M(L1,2). The difference between knowledge settings is the difference between the corresponding sets of mappings.

Problem Definition 11 / 54

SLIDE 12

Knowledge and uncertainty measures

We define a space of knowledge measures KK for knowledge settings. Each knowledge measure is a function kK : (K1,2) → R. The goal is to find a knowledge setting with maximal knowledge measure. We define a space of uncertainty measures UK for knowledge settings. Each uncertainty measure is a function uK : (K1,2) → R. The goal is to find a knowledge setting with minimal uncertainty measure. A special type of uncertainty measure is a measure dependent on U1 and U2 but independent of the mappings. More formally, it is a measure on an uncertain setting (U1,2), for example on (X, c0).

Problem Definition 12 / 54

SLIDE 13

Knowledge measure axioms

Axiom 1 (monotonicity of a knowledge measure)

When (K1,2) ⊆ (L1,2), then kK ((K1,2)) ≤ kK ((L1,2)) . (1)

Axiom 2 (strict monotonicity of a knowledge measure)

When (K1,2) ⊂ (L1,2) and ((L1,2)\(K1,2)) ∩ (T1,2) ≠ ∅, then kK ((K1,2)) < kK ((L1,2)) . (2)

Axiom 3 (non-negativity)

kK ((K1,2)) ≥ 0 , (3)

Problem Definition 13 / 54

SLIDE 14

Knowledge measure axioms

Axiom 4 (null empty set)

kK ((∅, ∅)) = 0 , (4)

Axiom 5 (knowledge in a training set)

kK ((K1,2) \ (T1,2)) = 0 , (5)

Axiom 6 (knowledge in a training set 2)

When (K1,2) ⊂ (L1,2) and ((L1,2)\(K1,2)) ∩ (T1,2) = ∅, then kK ((K1,2)) = kK ((L1,2)) . (6)

Problem Definition 14 / 54

SLIDE 15

Knowledge measure axioms

Axiom 7

The maximal value of kK exists, that is kK < ∞.

Axiom 8

When (K1,2) ∩ (T1,2) ≠ ∅, then kK ((K1,2)) > 0 . (7)

Axiom 9

kK ((T1,2)) = kK ((U′1, c1)) + kK ((U′2, c2)) + kK ((K1,2)) (8)

Problem Definition 15 / 54

SLIDE 16

Knowledge measure axioms

Axiom 10 (optional, additivity)

When (K1,2) ∩ (L1,2) = ∅, kK ((K1,2) ∪ (L1,2)) = kK ((K1,2)) + kK ((L1,2)) (10)

Problem Definition 16 / 54

SLIDE 17

Knowledge measure axioms

Example 1

An example of a knowledge measure is

kK ((K1,2)) = |(T1,2) ∩ (K1,2)| . (11)

The knowledge kK might be interpreted as an upper bound on the number of training examples with correct classification included in a knowledge setting. In general, the knowledge kK is a quality measure.

Problem Definition 17 / 54

SLIDE 18

Uncertainty measure axioms

Axiom 11 (monotonicity of an uncertainty measure)

When (K1,2) ⊆ (L1,2), then uK ((K1,2)) ≤ uK ((L1,2)) , (12)

Axiom 12 (non-negativity)

uK ((K1,2)) ≥ 0 . (13)

Axiom 13 (null empty set)

uK ((∅, ∅)) = 0 . (14)

Problem Definition 18 / 54

SLIDE 19

Uncertainty measure axioms

Axiom 14 (uncertainty outside a training set)

When (K1,2) ⊂ (T1,2), uK ((K1,2)) > 0 . (15)

Axiom 15 (uncertainty outside a training set 2)

When (K1,2) ⊂ (L1,2) and (L1,2) \(K1,2) ⊂ (T1,2), uK ((K1,2)) < uK ((L1,2)) . (16)

Axiom 16

The maximal value of uK exists, that is uK < ∞.

Problem Definition 19 / 54

SLIDE 20

Uncertainty measure axioms

Axiom 17

uK ((K1,2)) = uK((X, c0)) − uK((U1 ∪ U2)′ , c0) (17)

Axiom 18 (optional, additivity)

When (K1,2) ∩ (L1,2) = ∅, uK ((K1,2) ∪ (L1,2)) = uK ((K1,2)) + uK ((L1,2)) (18)

Problem Definition 20 / 54

SLIDE 21

Uncertainty measure axioms

Example 2

An example of an uncertainty measure is

uK ((K1,2)) = |(K1,2)| . (19)

The uncertainty uK might be related to uncertain knowledge from a knowledge setting, so the classification might be incorrect. It can also be interpreted as an upper bound on the knowledge in a knowledge setting that might be incorrect. Uncertainty 0 means that we are sure about correct classification.
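The example measures (11) and (19) can be tried on a toy knowledge setting. Here a knowledge setting is encoded simply as a set of (object, class) mappings; this encoding is an illustrative assumption, not the paper's formal construction.

```python
# Toy training setting and knowledge setting as sets of (object, class) mappings.
T = {(1, "+"), (2, "+"), (3, "-"), (4, "-")}
K = {(1, "+"), (3, "-"), (5, "+")}  # two training mappings plus one extra point

def k_measure(K, T):
    # knowledge measure (11): number of training mappings covered by K
    return len(K & T)

def u_measure(K):
    # uncertainty measure (19): size of the knowledge setting
    return len(K)
```

Both measures are monotone under inclusion of mappings, matching Ax. 1 and Ax. 11.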

Problem Definition 21 / 54

SLIDE 22

Solution

Solution 22 / 54

SLIDE 23

Solution

We define a solution for the classification problem as a solution of a multi-objective optimization problem for knowledge settings

Optimization problem (OP) 1

(K1,2) ∈ Ks : max kK ((K1,2)) , min uK ((K1,2)) . (20)

We maximize the knowledge kK and minimize the uncertainty uK. We are interested in a Pareto optimal set. For a finite set Ks, the solution always exists. The objective function is lower bounded by (0, 0).
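For a finite candidate set Ks, the Pareto optimal set of OP 1 can be filtered directly. A minimal sketch, where each candidate is represented only by its (kK, uK) value pair (an illustrative encoding):

```python
def pareto_front(candidates):
    # keep the (k, u) pairs not dominated by any other pair:
    # we maximize knowledge k and minimize uncertainty u
    def dominated(p):
        k, u = p
        return any(k2 >= k and u2 <= u and (k2 > k or u2 < u)
                   for k2, u2 in candidates)
    return [p for p in candidates if not dominated(p)]

front = pareto_front([(3, 1), (2, 2), (1, 1), (3, 2), (2, 0)])
```

Only the non-dominated trade-offs survive; the remaining candidates would then be compared by the oracle.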

Solution 23 / 54

SLIDE 24

Solution

We are interested only in Pareto optimal solutions. This is an a priori assumption and might be considered an axiom.

Axiom 19

The best solutions for OP 1 are Pareto optimal solutions. When we have multiple Pareto optimal solutions, they must be compared using an oracle for knowledge settings in order to obtain the best single solution. By selecting only Pareto optimal solutions, we limit the number of knowledge settings to validate with the oracle.

Solution 24 / 54

SLIDE 25

Solution

Proposition 1

The set of Pareto optimal solutions includes only subsets of a training knowledge setting (T1,2), if they are included in Ks, assuming axioms Ax. 6 and Ax. 15.

Not all subsets are Pareto optimal solutions in general. We could have two separated knowledge settings, where one has more knowledge and less uncertainty.

Solution 25 / 54

SLIDE 26

Online setting

Proposition 2

After adding a new object to a training set, we only need to check the old Pareto optimal solutions, and all of them with the new object added, in order to solve OP 1, assuming Ax. 10 and Ax. 18.

Solution 26 / 54

SLIDE 27

Spaces

Due to Prop. 1, in order to generate Pareto optimal solutions on examples from outside a training set, we need to provide additional structure S to the problem. Usually the structure for the data is assumed a priori. Without any assumptions about the structure of the space, we would not be able to generalize, in the sense of Pareto optimal solutions, beyond a training set.

Solution 27 / 54

SLIDE 28

Spaces

Because we do not know the space S, we may introduce multiple spaces Si ∈ S. In order to solve OP 1, we need to be able to compare measures kK and uK on different spaces (X, Si). So we need an additional axiom about comparability for the measures on different spaces.

Axiom 20

A knowledge measure kK and an uncertainty measure uK on different spaces (X, Si) must be comparable. We need measures kK and uK that are invariant to spaces. When the measures depend only on some basic space So such that So ⊂ Si, they are comparable. In the extreme, when a measure depends only on X, it is also comparable.

Solution 28 / 54

SLIDE 29

Spaces

In order to achieve comparability, one strategy might be to compute the maximal value of a measure in a given space. Then we could divide a measure by the maximal possible value of that measure. We need to be sure that the maximal value exists: for an uncertainty measure due to Ax. 16, and for a knowledge measure due to Ax. 7. Due to Ax. 11, we expect the maximal value of uK for the biggest possible knowledge setting; we can define such a setting for the whole X. Due to Ax. 7, we expect the maximal value of kK for the training set.

Solution 29 / 54

SLIDE 30

SVM

SVM 30 / 54

SLIDE 31

SVM

We have a set of n training vectors xi for i ∈ {1, . . . , n}, where xi = (x1i, . . . , xmi). The m is the dimension of the problem. The support vector classification (SVC) soft margin optimization problem with the ℓ1 norm is

OP 2

min over wc, bc, ξc of f(wc, bc, ξc) = 1/2 ‖wc‖² + Cc ∑i=1..n ξic (21)

subject to yic h(xi) ≥ 1 − ξic , (22)

ξc ≥ 0 (23)

for i ∈ {1, . . . , n}, where h(xi) = wc · ϕ(xi) + bc, Cc > 0 . (24)
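The objective (21) and the slacks implied by constraint (22) can be evaluated for a fixed candidate (wc, bc) on toy one-dimensional data. This sketch only evaluates the primal with an identity feature map (a simplifying assumption); it does not solve OP 2.

```python
def svc_primal(w, b, C, X, y):
    # h(x) = w . x + b: linear decision function (identity feature map)
    h = lambda x: sum(wj * xj for wj, xj in zip(w, x)) + b
    # slack xi_i = max(0, 1 - y_i h(x_i)), the hinge term from (22)
    slacks = [max(0.0, 1.0 - yi * h(xi)) for xi, yi in zip(X, y)]
    # objective (21): 0.5 ||w||^2 + C * sum_i xi_i
    return 0.5 * sum(wj * wj for wj in w) + C * sum(slacks), slacks

X = [(2.0,), (-2.0,), (0.5,)]
y = [1, -1, 1]
obj, slacks = svc_primal([1.0], 0.0, 1.0, X, y)
```

Only the third point falls inside the margin, so it alone contributes a nonzero slack.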

SVM 31 / 54

SLIDE 32

SVM

The h∗(x) = w∗c · ϕ(x) + b∗c = 0 is the decision curve of the classification problem.

SVM 32 / 54

SLIDE 33

SVM

OP 3

(K1,2) ∈ Ks : max − ∑i=1..n max(0, 1 − yic h(xi)) , (25)

min − 2/‖wc‖² , (26)

where Ks is a space of margin knowledge settings

K1 = ({x : h(x) > 1} , 1) , (27)

K2 = ({x : h(x) < −1} , −1) . (28)

Theorem 1

The solution of OP 2 for any C is equivalent to a set of Pareto optimal solutions of OP 3.

SVM 33 / 54

SLIDE 34

SVM

We define the following knowledge measure

kK ((K1,2)) = kK ((T1,2)) − ∑i=1..n max(0, 1 − yic h(xi)) . (29)

Proposition 3

The knowledge measure (29) fulfills axioms for a knowledge measure.

SVM 34 / 54

SLIDE 35

SVM

We define the following uncertainty measure

uK ((K1,2)) = uK ((X, c0)) − 2/‖wc‖² (30)

Proposition 4

The uncertainty measure (30) fulfills axioms for an uncertainty measure.

SVM 35 / 54

SLIDE 36

Measures of Knowledge and Uncertainty

Measures of Knowledge and Uncertainty 36 / 54

SLIDE 37

Measures

The measures (29) and (30) are comparable for the same dot product space. Consider different spaces S1 and S2, for example for different values of the σ parameter of the radial basis function (RBF) kernel. The uncertainty measure (30) depends on a space. In order to convert it to an independent measure according to the requirement Ax. 20, we define the following measure

uK ((K1,2)) = 1 − 2/(‖wc‖² R²) , (31)

where R is the radius of a minimal hypersphere containing all examples in a dot product space.
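A sketch of measure (31), reading it as 1 − 2/(‖wc‖²R²) (my reconstruction of the garbled slide formula). R is approximated here by the maximal distance to the centroid, a cheap stand-in for the exact minimal enclosing hypersphere radius.

```python
import math

def enclosing_radius(X):
    # distance from the centroid to the farthest point: an approximation
    # of the minimal enclosing hypersphere radius R
    m = len(X[0])
    c = [sum(x[j] for x in X) / len(X) for j in range(m)]
    return max(math.dist(x, c) for x in X)

def u_normalized(w, X):
    # space-independent uncertainty in the spirit of (31)
    R = enclosing_radius(X)
    norm2 = sum(wj * wj for wj in w)
    return 1.0 - 2.0 / (norm2 * R * R)

X = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
u = u_normalized([2.0, 0.0], X)
```

Dividing out R makes the measure invariant to a rescaling of the feature space, which is the point of Ax. 20.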

Measures of Knowledge and Uncertainty 37 / 54

SLIDE 38

Measures

The knowledge measure is also dependent on a space, so we redefine it as

kK ((K1,2)) = kK ((T1,2)) − ∑i=1..n sgn(max(0, 1 − yic h(xi))) . (32)

Because of the sign function, Ax. 10 is also fulfilled. This measure directly minimizes the number of support vectors.
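The sgn term in (32) turns the summed hinge losses into a count of margin violators, that is, of (candidate) support vectors. A minimal sketch for a fixed linear decision function h (an illustrative assumption):

```python
def count_margin_violators(w, b, X, y):
    # sum_i sgn(max(0, 1 - y_i h(x_i))): counts examples with y_i h(x_i) < 1,
    # the quantity subtracted in knowledge measure (32)
    h = lambda x: sum(wj * xj for wj, xj in zip(w, x)) + b
    return sum(1 for xi, yi in zip(X, y) if yi * h(xi) < 1.0)

X = [(2.0,), (-2.0,), (0.5,), (0.9,)]
y = [1, -1, 1, 1]
n_sv = count_margin_violators([1.0], 0.0, X, y)
```

Unlike the summed hinge in (29), this count does not grow with the depth of a violation, only with the number of violators.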

Measures of Knowledge and Uncertainty 38 / 54

SLIDE 39

Measures

We consider grouping of knowledge and uncertainty measures. Usually data are given by specifying n-tuples, so each object is described by a corresponding ordered set of real numbers. This is a reference system (coordinate system). We usually assume that a coordinate system is equipped with a Euclidean vector space, that is, a finite-dimensional real unitary vector space. For SVM, we use the transformation of each object to a feature space, which is a dot product space. We characterize knowledge and uncertainty measures in terms of a minimal space S in which the measure might be defined. For the knowledge measure (29), we need to compute an inner product, so we need a unitary vector space to define it. For the uncertainty measures (30) and (31), we need a normed vector space to compute them.

Measures of Knowledge and Uncertainty 39 / 54

SLIDE 40

Measures

The knowledge measure (32) might be formulated more generally as

kK ((K1,2)) = kK ((T1,2)) − kK ((U′1, c1)) − kK ((U′2, c2)) . (33)

This measure can be defined in a minimal space X without any additional structure S, so such measures are comparable (Ax. 20).

Measures of Knowledge and Uncertainty 40 / 54

SLIDE 41

Measures

When we have more specific spaces, we have more freedom to define knowledge sets and knowledge and uncertainty measures. But with more specific spaces there is more risk that the space is incorrect for a real-world problem. So defining more specific spaces and measures means more accurate prediction of a space for some real-world problems, but less accurate average prediction over all real-world problems together. The uncertainty measure is related to generalization, so without a space S it does not make sense to define an uncertainty measure for generalization.

Measures of Knowledge and Uncertainty 41 / 54

SLIDE 42

Generally, we could also use a classification error as a knowledge measure for a knowledge set. For such a measure, all knowledge sets with the same decision boundary would have the same value of knowledge. We can generally use only a knowledge measure by artificially setting the uncertainty to a constant value. Because classification error cannot distinguish cases with the same decision boundary, we might add another knowledge measure. Finally, we use the hybrid method: a combination of uncertainty and knowledge measures. As uncertainty measures, we use (31) and a constant value. As knowledge measures, we use (32) and a classification error. The latter means that we regularize classification error (for a nonconstant uncertainty measure). We compare each knowledge measure with each uncertainty measure. We count cases for, against, and draw. Based on the counted cases, we select the best set of hyperparameter values.
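The for/against/draw counting can be sketched as pairwise Pareto comparisons of (knowledge, uncertainty) outcomes; the outcome pairs below are made-up toy values, not experimental results.

```python
def pareto_compare(a, b):
    # 1 if a dominates b, -1 if b dominates a, 0 for a Pareto draw
    (ka, ua), (kb, ub) = a, b
    if ka >= kb and ua <= ub and (ka > kb or ua < ub):
        return 1
    if kb >= ka and ub <= ua and (kb > ka or ub < ua):
        return -1
    return 0

def tally(outcomes_a, outcomes_b):
    # count cases for / against / draw over paired outcomes
    wins = [pareto_compare(a, b) for a, b in zip(outcomes_a, outcomes_b)]
    return wins.count(1), wins.count(-1), wins.count(0)

for_, against, draw = tally([(3, 1), (2, 2), (1, 1)],
                            [(2, 2), (2, 1), (2, 0)])
```

The hyperparameter set with the best tally across measure pairs would then be selected.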

Measures of Knowledge and Uncertainty 42 / 54

SLIDE 43

Experiments

Experiments 43 / 54

SLIDE 44

Results I

Table 1: Performance of SVC and knowledge-uncertainty machines (KUM) for real world data, part 1. The numbers in the column descriptions denote the methods: 1 - SVC, 2 - KUM. Column descriptions: dn – the name of a data set, size – the number of all examples, dim – the dimension of a problem, trse – the mean classification error (the best method is in bold), sv – the number of support vectors (the smallest number is in bold), pd – the Pareto draw ratio.

Experiments 44 / 54

SLIDE 45

Results II

dn size dim trse1 trse2 sv1 sv2 pd2
a1a 24947 123 0.468 0.474 31.0 24.0 0.16
australian 690 14 0.417 0.413 30.0 28.0 0.17
breast-cancer 675 10 0.203 0.2 27.0 16.0 0.25
cod-rna 100000 8 0.324 0.301 22.0 21.0 0.22
colon-cancer 62 2000 0.365 0.365 39.0 36.0 0.19
covtype 100000 54 0.63 0.634 39.0 33.0 0.09
diabetes 768 8 0.542 0.54 31.0 29.0 0.18
fourclass 862 2 0.246 0.357 29.0 24.0 0.15
german_numer 1000 24 0.556 0.557 39.0 29.0 0.25
heart 270 13 0.452 0.443 30.0 28.0 0.07
HIGGS 100000 28 0.685 0.685 44.0 39.0 0.1
ijcnn1 100000 22 0.329 0.323 14.0 12.0 0.22
ionosphere_scale 350 33 0.298 0.361 33.0 30.0 0.13
liver-disorders 341 5 0.614 0.62 40.0 38.0 0.14
madelon 2600 500 0.699 0.699 50.0 47.0 0.2
mushrooms 8124 111 0.232 0.234 40.0 26.0 0.15
phishing 5785 68 0.355 0.35 39.0 33.0 0.15
skin_nonskin 51432 3 0.239 0.269 33.0 16.0 0.2
splice 2990 60 0.518 0.533 44.0 40.0 0.09
sonar_scale 208 60 0.506 0.522 39.0 31.0 0.11
SUSY 100000 18 0.581 0.579 35.0 33.0 0.1
svmguide1 6910 4 0.271 0.275 15.0 12.0 0.16
svmguide3 1243 21 0.492 0.484 28.0 22.0 0.24
w1a 34703 300 0.166 0.166 3.0 2.0 0.48
websam_unigram 100000 134 0.427 0.426 32.0 28.0 0.13

Experiments 45 / 54

SLIDE 46

Results

Table 2: Performance of SVC and KUM for real world data, part 2. The numbers in the column descriptions denote the methods: 1 - SVC, 2 - KUM. The test is over all tests from Table 1. Column descriptions: rs – the average rank for the mean classification error (the best method is in bold), tst – the p value of the statistical test comparing classification error (a significant value is in bold), sv – the average rank for the number of support vectors, svt – the p value of the statistical test comparing the number of support vectors, pd – the average Pareto draw ratio.

rs1 rs2 tst sv1 sv2 svt pd
1.5 1.5 0.4 1.78 1.22 0.0 0.17

Experiments 46 / 54

SLIDE 47

Results

The results show that KUM has almost the same generalization performance as SVM (columns rs and tst in Table 2). We achieved a smaller number of support vectors (columns sv1, sv2, svt in Table 2) with statistical significance. The average ratio of comparisons for KUM ending in a Pareto draw is 17%. Moreover, we observed not too much freedom in selecting the uncertainty measure; for example, removing the radius R causes deterioration of the results. We also tested an uncertainty measure quantifying uncertainty in X based on distances between four hyperplanes, with similar results. For knowledge, we tested other measures that were slightly worse, but without statistical significance, for example least squares and the sum of slack variables on a validation set.

Experiments 47 / 54

SLIDE 48

Applications

We reformulate Fisher's classifier to OP 1 with

kK = (wT E[U1] − wT E[U2])² , (34)

uK = wT (Σ1 + Σ2) w . (35)

It is not exactly equivalent, because in Fisher's classifier we have a division of (34) by (35), so we do not have any hyperparameter. The knowledge measure maximizes the distance between the two knowledge sets. It can be interpreted as knowledge about a knowledge setting with two knowledge sets. The expected value on a set is knowledge about a knowledge set.

Proposition 5

The knowledge measure (34) does not fulfill the monotonicity axioms Ax. 1 and Ax. 2, and also Ax. 8 and Ax. 10. The rest of the knowledge axioms are fulfilled.
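Measures (34) and (35) for a one-dimensional projection w can be checked on toy class samples. The scalar projection and the use of population variances are simplifying assumptions of this sketch.

```python
def fisher_measures(w, U1, U2):
    # knowledge (34): squared distance between projected class means;
    # uncertainty (35): sum of projected class variances (1-D sketch)
    mean = lambda U: sum(U) / len(U)
    var = lambda U: sum((u - mean(U)) ** 2 for u in U) / len(U)
    kK = (w * mean(U1) - w * mean(U2)) ** 2
    uK = w * w * (var(U1) + var(U2))
    return kK, uK

kK, uK = fisher_measures(2.0, [1.0, 3.0], [-1.0, -3.0])
```

Fisher's criterion itself would take the ratio kK/uK, which is why no trade-off hyperparameter appears there.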

Experiments 48 / 54

SLIDE 49

Applications

The uncertainty measure is a sum of variances of the random variables Z|Y = 1 and Z|Y = −1 for data limited by a knowledge setting.

Proposition 6

All axioms for an uncertainty measure are fulfilled.

For linear discriminant analysis (LDA), we get the same classifier as for Fisher's classifier, assuming a multivariate normal distribution and equal a priori class probabilities. For principal component analysis (PCA) for dimensionality reduction, we have only the measure (35), but with one covariance matrix. For PCA, the idea is to maximize this measure, and thus the measure is a knowledge measure. The goal of dimensionality reduction is to preserve variability.

Experiments 49 / 54

SLIDE 50

Applications

Modern portfolio theory (MPT), known as mean-variance analysis, is used to maximize the expected return of a portfolio of assets given a level of risk, defined as variance. The trade-off between expected return and risk is called the risk-return spectrum (risk-return tradeoff, risk-reward). Portfolio return is the proportion-weighted combination of the assets' returns. Portfolio return variance is computed from the standard deviations of the periodic returns on the assets and the correlation coefficients between the returns on assets (covariance). We generalize the risk-return model to OP 1 with

kK = rT w , uK = wT Σ w , (36)

where r is a vector of expected returns. We add to OP 1 the additional constraint that the sum of the wi equals 1. So we have knowledge and uncertainty measures with the additional constraint.
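The generalized risk-return measures (36), with the sum-to-one constraint on the weights, can be evaluated directly. The two-asset returns and covariance matrix below are made-up toy numbers.

```python
def portfolio_measures(w, r, Sigma):
    # knowledge (36): expected return r^T w; uncertainty: variance w^T Sigma w
    assert abs(sum(w) - 1.0) < 1e-9, "weights must sum to 1"
    kK = sum(ri * wi for ri, wi in zip(r, w))
    Sw = [sum(Sigma[i][j] * w[j] for j in range(len(w)))
          for i in range(len(w))]
    uK = sum(wi * si for wi, si in zip(w, Sw))
    return kK, uK

w = [0.5, 0.5]                        # equal-weight two-asset portfolio
r = [0.10, 0.06]                      # expected returns
Sigma = [[0.04, 0.00], [0.00, 0.01]]  # diagonal covariance (uncorrelated)
kK, uK = portfolio_measures(w, r, Sigma)
```

Sweeping w and keeping the Pareto optimal (kK, uK) pairs traces out the efficient frontier of MPT.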

Experiments 50 / 54

SLIDE 51

Applications

We define knowledge sets on a finite set of assets. The problem is to find the optimal ratio for each asset, not to generalize to unknown assets. Knowledge and uncertainty measures are defined for subsets of all assets. The uncertainty is very similar to (35), so Prop. 6 is also fulfilled.

Proposition 7

The knowledge measure (36) does not fulfill the monotonicity axioms Ax. 1 and Ax. 2, and also Ax. 8 and Ax. 10. The rest of the knowledge axioms are fulfilled.

While the measure (36) does not fulfill all axioms for a knowledge measure, incorporating an additional knowledge measure, for example the number of assets in a portfolio, would lead to a more diversified portfolio.

Experiments 51 / 54

SLIDE 52

Summary

Summary 52 / 54

SLIDE 53

Summary

We axiomatized knowledge and uncertainty for solving a classification problem with SVM. We proposed a framework for designing machine learning methods based on the knowledge-uncertainty model. We showed that directly minimizing the number of support vectors, together with regularization on a validation set in the knowledge-uncertainty hybrid model, leads to sparser solutions. The idea of minimizing the number of support vectors has already been mentioned in terms of statistical bounds [1]; however, we derived this fact from the axiomatic system and used it in practice. The idea of regularization on a validation set, in particular regularization of the classification error or of the number of support vectors, is to our knowledge novel. The knowledge-uncertainty framework allows for combining different types of uncertainty and knowledge measures. Future research might address deriving the best combination of measures.

Summary 53 / 54

SLIDE 54

References

References 54 / 54

SLIDE 55

[1] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 1st edition, March 2000. ISBN 0521780195.

Summary 54 / 54