SLIDE 1

The Online Approach to Machine Learning

Nicolò Cesa-Bianchi
Università degli Studi di Milano

SLIDE 2

Summary

1. My beautiful regret
2. A supposedly fun game I'll play again
3. A graphic novel
4. The joy of convex
5. The joy of convex (without the gradient)

SLIDE 3

Summary (Part 1: My beautiful regret)

SLIDE 4

Machine learning

- Classification/regression tasks
- Predictive models: $h$ maps data instances $X$ to labels $Y$ (e.g., a binary classifier)
- Training data: $S_T = \big((X_1,Y_1),\dots,(X_T,Y_T)\big)$ (e.g., email messages annotated as spam vs. nonspam)
- A learning algorithm $A$ (e.g., a Support Vector Machine) maps the training data $S_T$ to a model $h = A(S_T)$
- Evaluate the risk of the trained model $h$ with respect to a given loss function

SLIDE 5

Two notions of risk

View data as a statistical sample: statistical risk
$\mathbb{E}\big[\mathrm{loss}\big(\underbrace{A(S_T)}_{\text{trained model}},\ \underbrace{(X,Y)}_{\text{test example}}\big)\big]$
where the training set $S_T = \big((X_1,Y_1),\dots,(X_T,Y_T)\big)$ and the test example $(X,Y)$ are drawn i.i.d. from the same unknown and fixed distribution.

View data as an arbitrary sequence: sequential risk
$\sum_{t=1}^{T} \mathrm{loss}\big(\underbrace{A(S_{t-1})}_{\text{trained model}},\ \underbrace{(X_t,Y_t)}_{\text{test example}}\big)$
with a sequence of models trained on growing prefixes $S_t = \big((X_1,Y_1),\dots,(X_t,Y_t)\big)$ of the data sequence.

SLIDE 6

Regrets, I had a few

A learning algorithm $A$ maps datasets to models in a given class $\mathcal{H}$.

Variance error in statistical learning:
$\mathbb{E}\big[\mathrm{loss}\big(A(S_T),(X,Y)\big)\big] - \inf_{h\in\mathcal{H}} \mathbb{E}\big[\mathrm{loss}\big(h,(X,Y)\big)\big]$
(compare to the expected loss of the best model in the class)

Regret in online learning:
$\sum_{t=1}^{T} \mathrm{loss}\big(A(S_{t-1}),(X_t,Y_t)\big) - \inf_{h\in\mathcal{H}} \sum_{t=1}^{T} \mathrm{loss}\big(h,(X_t,Y_t)\big)$
(compare to the cumulative loss of the best model in the class)

SLIDE 7

Incremental model update

A natural blueprint for online learning algorithms. For $t = 1, 2, \dots$:

1. Apply the current model $h_{t-1}$ to the next data element $(X_t, Y_t)$
2. Update the current model: $h_{t-1} \to h_t \in \mathcal{H}$

Goal: control the regret
$\sum_{t=1}^{T} \mathrm{loss}\big(h_{t-1},(X_t,Y_t)\big) - \inf_{h\in\mathcal{H}} \sum_{t=1}^{T} \mathrm{loss}\big(h,(X_t,Y_t)\big)$

View this as a repeated game between a player generating predictors $h_t \in \mathcal{H}$ and an opponent generating data $(X_t, Y_t)$.
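As a concrete rendering of this blueprint, here is a minimal Python sketch of the predict-then-update loop, with regret tracked against a fixed comparator. The `model` interface (`predict`/`update`) and the `stream` iterable are hypothetical placeholders, not part of the talk.

```python
def online_loop(model, comparator, stream, loss):
    """Run the predict-then-update protocol; return regret vs. a fixed comparator."""
    cumulative, comparator_loss = 0.0, 0.0
    for x, y in stream:                  # opponent reveals (X_t, Y_t)
        y_hat = model.predict(x)         # apply current model h_{t-1}
        cumulative += loss(y_hat, y)     # suffer loss
        comparator_loss += loss(comparator.predict(x), y)
        model.update(x, y)               # incremental update: h_{t-1} -> h_t
    return cumulative - comparator_loss
```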

SLIDE 8

Summary (Part 2: A supposedly fun game I'll play again)

SLIDE 9

Theory of repeated games

James Hannan (1922–2010) and David Blackwell (1919–2010), "Learning to play a game" (1956): play a game repeatedly against a possibly suboptimal opponent.

SLIDE 10

Zero-sum 2-person games played more than once

A known $N \times M$ loss matrix with entries $\ell(i,j)$: the row player (the player) has $N$ actions, the column player (the opponent) has $M$ actions.

For each game round $t = 1, 2, \dots$:
- The player chooses action $i_t$ and the opponent chooses action $y_t$
- The player suffers loss $\ell(i_t, y_t)$ (= the gain of the opponent)
- The player can learn from the opponent's history of past choices $y_1, \dots, y_{t-1}$

SLIDE 11

Prediction with expert advice

(Volodya Vovk, Manfred Warmuth)

The opponent's moves $y_1, y_2, \dots$ define a sequential prediction problem with a time-varying loss function: $\ell(i_t, y_t) = \ell_t(i_t)$. The losses form a table with one row per action $i = 1, \dots, N$ and one column per round $t$, with entries $\ell_t(i)$.

SLIDES 12–14

Playing the experts game

$N$ actions. For $t = 1, 2, \dots$:

1. A loss $\ell_t(i) \in [0,1]$ is assigned to every action $i = 1, \dots, N$ (hidden from the player)
2. The player picks an action $I_t$ (possibly using randomization) and incurs loss $\ell_t(I_t)$
3. The player gets full feedback: all of $\ell_t(1), \dots, \ell_t(N)$ are revealed

SLIDE 15

Oblivious opponents

The losses $\ell_t(1), \dots, \ell_t(N)$ for all $t = 1, 2, \dots$ are fixed beforehand and unknown to the (randomized) player.

Oblivious regret minimization:
$R_T \stackrel{\mathrm{def}}{=} \mathbb{E}\Big[\sum_{t=1}^{T}\ell_t(I_t)\Big] - \min_{i=1,\dots,N}\sum_{t=1}^{T}\ell_t(i)$, and we want $R_T = o(T)$.

SLIDE 16

Bounds on regret [Experts' paper, 1997]

Lower bound using random losses: replace $\ell_t(i)$ with $L_t(i) \in \{0,1\}$, independent random coin flips. For any player strategy,
$\mathbb{E}\Big[\sum_{t=1}^{T} L_t(I_t)\Big] = \frac{T}{2}$.
The expected regret is then
$\mathbb{E}\Big[\max_{i=1,\dots,N}\sum_{t=1}^{T}\Big(\frac12 - L_t(i)\Big)\Big] = \big(1 - o(1)\big)\sqrt{\frac{T\ln N}{2}}$.

SLIDE 17

Exponentially weighted forecaster

At time $t$ pick action $I_t = i$ with probability proportional to
$\exp\Big(-\eta \sum_{s=1}^{t-1}\ell_s(i)\Big)$
(the sum in the exponent is the total loss of action $i$ up to now).

Regret bound [Experts' paper, 1997]: if $\eta = \sqrt{(8\ln N)/T}$ then
$R_T \le \sqrt{\frac{T\ln N}{2}}$,
matching the lower bound including constants. The dynamic choice $\eta_t = \sqrt{(8\ln N)/t}$ only loses small constants.
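A minimal Python sketch of this forecaster, assuming losses in $[0,1]$ and a known horizon $T$ used for the tuning above; the loss matrix is a stand-in for the opponent's choices.

```python
import numpy as np

def hedge(loss_matrix):
    """Exponentially weighted forecaster; loss_matrix[t, i] in [0, 1]."""
    rng = np.random.default_rng(0)
    T, N = loss_matrix.shape
    eta = np.sqrt(8 * np.log(N) / T)        # tuning from the bound above
    cum_loss = np.zeros(N)                  # total loss of each action so far
    player_loss = 0.0
    for t in range(T):
        w = np.exp(-eta * (cum_loss - cum_loss.min()))  # shifted for stability
        p = w / w.sum()
        i = rng.choice(N, p=p)              # draw the randomized action I_t
        player_loss += loss_matrix[t, i]    # suffer l_t(I_t)
        cum_loss += loss_matrix[t]          # full-information feedback
    return player_loss - cum_loss.min()     # realized regret
```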

SLIDES 18–21

The bandit problem: playing an unknown game

$N$ actions. For $t = 1, 2, \dots$:

1. A loss $\ell_t(i) \in [0,1]$ is assigned to every action $i = 1, \dots, N$ (hidden from the player)
2. The player picks an action $I_t$ (possibly using randomization) and incurs loss $\ell_t(I_t)$
3. The player gets partial feedback: only $\ell_t(I_t)$ is revealed

Many applications: ad placement, dynamic content adaptation, routing, online auctions.

SLIDE 22

Summary (Part 3: A graphic novel)

SLIDE 23

Relationships between actions [Mannor and Shamir, 2011]

[Figures: an undirected and a directed feedback graph over actions.]

SLIDES 24–26

A graph of relationships over actions

[Figures: playing an action reveals its own loss together with the losses of its neighbors in the graph; the losses of non-neighboring actions stay hidden.]

SLIDE 27

Recovering expert and bandit settings

Experts: the clique (playing any action reveals the losses of all actions). Bandits: the empty graph (only the played action's loss is revealed).

SLIDE 28

Exponentially weighted forecaster — Reprise

Player's strategy [Alon, C-B, Gentile, Mannor, Mansour and Shamir, 2013]:
$P_t(I_t = i) \propto \exp\Big(-\eta\sum_{s=1}^{t-1}\hat\ell_s(i)\Big)$, $i = 1,\dots,N$
with the importance sampling estimator
$\hat\ell_t(i) = \dfrac{\ell_t(i)}{P_t\big(\ell_t(i)\ \text{observed}\big)}$ if $\ell_t(i)$ is observed, and $\hat\ell_t(i) = 0$ otherwise.

The estimator is unbiased, $\mathbb{E}_t\big[\hat\ell_t(i)\big] = \ell_t(i)$, with variance control
$\mathbb{E}_t\big[\hat\ell_t(i)^2\big] \le \dfrac{1}{P_t\big(\ell_t(i)\ \text{observed}\big)}$.
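One round of this strategy can be sketched as follows, assuming a hypothetical `neighbors[i]` structure listing the actions whose losses are observed when $i$ is played (with $i$ in `neighbors[i]` itself):

```python
import numpy as np

def graph_feedback_round(cum_est_loss, losses, neighbors, eta, rng):
    """Play one round; update importance-weighted loss estimates in place."""
    N = len(cum_est_loss)
    w = np.exp(-eta * (cum_est_loss - cum_est_loss.min()))
    p = w / w.sum()
    i = rng.choice(N, p=p)                        # play I_t
    for j in neighbors[i]:                        # losses revealed this round
        # P_t(loss of j observed) = sum of p[k] over actions k observing j
        obs_prob = sum(p[k] for k in range(N) if j in neighbors[k])
        cum_est_loss[j] += losses[j] / obs_prob   # importance sampling update
    return i
```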

SLIDES 29–30

Independence number α(G)

The independence number $\alpha(G)$ is the size of the largest independent set of the graph $G$ (a set of vertices no two of which are joined by an edge). [Figures: an example graph and its largest independent set.]

SLIDE 31

Regret bounds

Analysis (undirected graphs):
$R_T \le \frac{\ln N}{\eta} + \frac{\eta}{2}\sum_{t=1}^{T}\underbrace{\sum_{i=1}^{N}\mathbb{P}\big(I_t = i \mid \ell_t(i)\ \text{observed}\big)}_{\le\ \alpha(G)} \le \sqrt{\alpha(G)\,T\ln N}$ by tuning $\eta$.

If the graph is directed, the bound worsens only by logarithmic factors.

Special cases:
- Experts: $\alpha(G) = 1$, giving $R_T \lesssim \sqrt{T\ln N}$
- Bandits: $\alpha(G) = N$, giving $R_T \lesssim \sqrt{TN\ln N}$

SLIDE 32

Reactive opponents [Dekel, Koren and Peres, 2014]

The loss of action $i$ at time $t$ depends on the player's past $m$ actions:
$\ell_t(i) \to L_t(I_{t-m},\dots,I_{t-1},i)$

Adaptive regret:
$R_T^{\mathrm{ada}} = \mathbb{E}\Big[\sum_{t=1}^{T} L_t(I_{t-m},\dots,I_{t-1},I_t) - \min_{i=1,\dots,N}\sum_{t=1}^{T} L_t(\underbrace{i,\dots,i}_{m\ \text{times}},\, i)\Big]$

Minimax adaptive regret (for any constant $m > 1$): $R_T^{\mathrm{ada}} = \Theta\big(T^{2/3}\big)$

SLIDE 33

Partial monitoring: not observing any loss

Dynamic pricing: perform as well as the best fixed price.
1. Post a T-shirt price
2. Observe whether the next customer buys or not
3. Adjust the price

The feedback does not reveal the player's loss. With posted prices $i$ and customer valuations $j$ in $\{1,\dots,5\}$, the loss is the forgone revenue $j - i$ when the sale happens ($j \ge i$) and a fixed cost $c$ when the customer walks away ($j < i$); the feedback only says whether the customer bought.

Loss matrix:
        j=1  j=2  j=3  j=4  j=5
  i=1    0    1    2    3    4
  i=2    c    0    1    2    3
  i=3    c    c    0    1    2
  i=4    c    c    c    0    1
  i=5    c    c    c    c    0

Feedback matrix (1 = customer buys):
        j=1  j=2  j=3  j=4  j=5
  i=1    1    1    1    1    1
  i=2    0    1    1    1    1
  i=3    0    0    1    1    1
  i=4    0    0    0    1    1
  i=5    0    0    0    0    1

SLIDE 34

A characterization of minimax regret

Special case: multiarmed bandits, where the loss and feedback matrices coincide.

A general gap theorem [Bartók, Foster, Pál, Rakhlin and Szepesvári, 2013]: a constructive characterization of the minimax regret for any pair of loss/feedback matrices. Only three possible rates for nontrivial games:
1. Easy games (e.g., bandits): $\Theta(\sqrt{T})$
2. Hard games (e.g., a revealing action): $\Theta(T^{2/3})$
3. Impossible games: $\Theta(T)$

SLIDE 35

Summary (Part 4: The joy of convex)

SLIDE 36

A game equivalent to prediction with expert advice

Online linear optimization in the simplex:
1. Play a point $p_t$ from the $N$-dimensional simplex $\Delta_N$
2. Incur the linear loss $\mathbb{E}\big[\ell_t(I_t)\big] = p_t^\top \ell_t$
3. Observe the loss gradient $\ell_t$

Regret: compete against the best point in the simplex,
$\sum_{t=1}^{T} p_t^\top \ell_t - \min_{q\in\Delta_N}\sum_{t=1}^{T} q^\top \ell_t$
and, since a linear function on the simplex is minimized at a vertex,
$\min_{q\in\Delta_N}\sum_{t=1}^{T} q^\top \ell_t = \min_{i=1,\dots,N}\sum_{t=1}^{T}\ell_t(i)$.

SLIDE 37

From game theory to machine learning

[Figure: a classification system receives unlabeled data, outputs a guessed label, and the opponent supplies the true label.]

The opponent's moves $y_t$ are viewed as values or labels assigned to observations $x_t \in \mathbb{R}^d$ (e.g., categories of documents). This is a repeated game between the player, who chooses an element $w_t$ of a linear space, and the opponent, who chooses a label $y_t$ for $x_t$. Regret is measured with respect to the best element of the linear space.

SLIDE 38

Online convex optimization [Zinkevich, 2003]

1. Play a point $w_t$ from a convex set $S$ in a linear space
2. Incur the convex loss $\ell_t(w_t)$
3. Observe the loss gradient $\nabla\ell_t(w_t)$
4. Update the point: $w_t \to w_{t+1} \in S$

Examples:
- Regression with the square loss: $\ell_t(w) = \big(w^\top x_t - y_t\big)^2$, $y_t \in \mathbb{R}$
- Classification with the hinge loss: $\ell_t(w) = \big[1 - y_t\, w^\top x_t\big]_+$, $y_t \in \{-1,+1\}$

Regret: $\sum_{t=1}^{T} \ell_t(w_t) - \inf_{u\in S}\sum_{t=1}^{T}\ell_t(u)$

SLIDE 39

Finding a good online algorithm

Follow the leader: $w_{t+1} = \operatorname*{arginf}_{w\in S}\sum_{s=1}^{t}\ell_s(w)$.
Regret can be linear due to lack of stability.

Example: $S = [-1,+1]$, $\ell_1(w) = \frac{1+w}{2}$, and for $t \ge 2$, $\ell_t(w) = -w$ if $t$ is even and $+w$ if $t$ is odd. The leader then flips between the endpoints $\pm 1$ and suffers loss 1 on every round after the first, while the best fixed point incurs cumulative loss close to 0 (see the numerical check below).
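The sketch below plays this loss sequence against follow the leader, dropping the additive constant $\frac12$ from $\ell_1$ (it cancels in the regret); the returned regret grows linearly with $T$, illustrating the instability.

```python
def ftl_regret(T):
    """Follow-the-leader on S = [-1, 1] with the alternating loss sequence."""
    grads = [0.5] + [(-1.0 if t % 2 == 0 else 1.0) for t in range(2, T + 1)]
    cum_grad, ftl_loss = 0.0, 0.0
    for g in grads:
        w = -1.0 if cum_grad > 0 else 1.0   # leader: minimize past linear loss
        ftl_loss += g * w                   # suffer l_t(w) = g_t * w
        cum_grad += g
    best_fixed = min(cum_grad * u for u in (-1.0, 1.0))
    return ftl_loss - best_fixed

print(ftl_regret(1000))                     # 1000.0: regret linear in T
```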

SLIDE 40

Regularized online learning

Strong convexity: $\Phi : S \to \mathbb{R}$ is $\beta$-strongly convex w.r.t. a norm $\|\cdot\|$ if for all $u, v \in S$:
$\Phi(v) \ge \Phi(u) + \nabla\Phi(u)^\top (v - u) + \frac{\beta}{2}\|u - v\|^2$
Example: $\Phi(v) = \frac12\|v\|^2$.

Follow the regularized leader [Shalev-Shwartz, 2007; Abernethy, Hazan and Rakhlin, 2008]:
$w_{t+1} = \operatorname*{argmin}_{w\in S}\Big(\eta\sum_{s=1}^{t}\ell_s(w) + \Phi(w)\Big)$
where $\Phi$ is a strongly convex regularizer defined on $S$.

SLIDE 41

Deriving an incremental update

Linearization of convex losses: by convexity,
$\ell_t(w_t) - \ell_t(u) \le \nabla\ell_t(w_t)^\top w_t - \nabla\ell_t(w_t)^\top u$
so it suffices to control regret on the linearized losses.

Follow the regularized leader with linearized losses (write $\theta_t = \sum_{s=1}^{t}\nabla\ell_s(w_s)$):
$w_{t+1} = \operatorname*{argmin}_{w\in S}\big(\eta\,\theta_t^\top w + \Phi(w)\big) = \operatorname*{argmax}_{w\in S}\big(-\eta\,\theta_t^\top w - \Phi(w)\big) = \nabla\Phi^*\big(-\eta\,\theta_t\big)$
where $\Phi^*$ is the convex dual of $\Phi$.

SLIDE 42

The Mirror Descent algorithm [Nemirovsky and Yudin, 1983]

Recall: $w_{t+1} = \nabla\Phi^*(-\eta\,\theta_t) = \nabla\Phi^*\Big(-\eta\sum_{s=1}^{t}\nabla\ell_s(w_s)\Big)$

Online Mirror Descent
Parameters: a strongly convex regularizer $\Phi$ and a learning rate $\eta > 0$.
Initialize: $\theta_1 = 0$ (primal parameter).
For $t = 1, 2, \dots$:
1. Use $w_t = \nabla\Phi^*(\theta_t)$ (dual parameter, via the mirror step)
2. Suffer loss $\ell_t(w_t)$
3. Observe the loss gradient $\nabla\ell_t(w_t)$
4. Update $\theta_{t+1} = \theta_t - \eta\,\nabla\ell_t(w_t)$ (gradient step)
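A minimal Python sketch of this loop, where `grad_phi_star` is the mirror map $\nabla\Phi^*$ supplied by the caller and each round provides a gradient callback; both are illustrative placeholders, not part of the talk.

```python
import numpy as np

def online_mirror_descent(grad_phi_star, gradient_rounds, eta, d):
    """Run OMD; `gradient_rounds` yields one gradient callable per round t."""
    theta = np.zeros(d)                  # theta_1 = 0
    plays = []
    for grad_fn in gradient_rounds:
        w = grad_phi_star(theta)         # mirror step: w_t = grad Phi*(theta_t)
        plays.append(w)
        g = grad_fn(w)                   # observe the gradient of l_t at w_t
        theta = theta - eta * g          # gradient step
    return plays
```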

SLIDE 43

Some examples

- Exponentiated gradient: $S = \Delta_N$ and $\Phi(w) = \sum_{i=1}^{N} w_i \ln w_i$ [Kivinen and Warmuth, 1997]
- Online Gradient Descent: $S = \mathbb{R}^d$ and $\Phi(w) = \frac12\|w\|^2$ [Zinkevich, 2003]
- $p$-norm gradient descent: $S = \mathbb{R}^d$ and $\Phi(w) = \frac{1}{2(p-1)}\|w\|_p^2$ [Gentile, 2003]
- Matrix gradient descent [Cavallanti, C-B and Gentile, 2010; Kakade, Shalev-Shwartz and Tewari, 2012]
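For the first two instances, the mirror maps to plug into the sketch after SLIDE 42 are, respectively, a softmax (negative entropy on the simplex) and the identity ($\frac12\|w\|^2$ on $\mathbb{R}^d$, recovering online gradient descent); the function names are illustrative.

```python
import numpy as np

def eg_map(theta):
    """Mirror map of the entropic regularizer on the simplex (softmax)."""
    z = np.exp(theta - theta.max())      # shift for numerical stability
    return z / z.sum()

def ogd_map(theta):
    """Mirror map of (1/2)||w||^2 on R^d: the identity."""
    return theta
```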

SLIDE 44

General regret bound

The analysis relies on the smoothness of $\Phi^*$ to bound the increments $\Phi^*(\theta_{t+1}) - \Phi^*(\theta_t)$ via $\|\nabla\ell_t(w_t)\|^2$.

Oracle bound [Kakade, Shalev-Shwartz and Tewari, 2012]:
$\underbrace{\sum_{t=1}^{T}\ell_t(w_t)}_{\text{cumulative loss}} \le \inf_{u\in S}\Bigg(\underbrace{\sum_{t=1}^{T}\ell_t(u)}_{\text{model fit}} + \underbrace{\frac{\Phi(u)}{\eta}}_{\text{model cost}}\Bigg) + \frac{\eta}{2\beta}\sum_{t=1}^{T}\big\|\nabla\ell_t(w_t)\big\|^2$

Here $\ell_1, \ell_2, \dots$ are arbitrary convex losses. If the gradients are bounded, then $R_T = O(\sqrt{T})$, which is optimal for general convex losses. If all $\ell_t$ are strongly convex, then $R_T = O(\ln T)$.

SLIDE 45

Regularization via stochastic smoothing

Follow the perturbed leader [Kalai and Vempala, 2005]:
$w_{t+1} = \mathbb{E}\Big[\operatorname*{argmin}_{w\in S}\big(\eta\,\theta_t^\top w + Z^\top w\big)\Big]$
The distribution of $Z$ must be "stable" (small variance and small average sensitivity). For some choices of $Z$, FPL becomes equivalent to OMD [Abernethy, Lee, Sinha and Tewari, 2014].

SLIDE 46

Adaptive regularization

Online ridge regression [Vovk, 2001; Azoury and Warmuth, 2001]:
$\sum_{t=1}^{T}\big(w_t^\top x_t - y_t\big)^2 \le \inf_{u\in\mathbb{R}^d}\Bigg(\sum_{t=1}^{T}\big(u^\top x_t - y_t\big)^2 + \|u\|^2\Bigg) + d\ln\Big(1 + \frac{T}{d}\Big)$
obtained with the time-varying regularizer $\Phi_t(w) = \frac12\|w\|_{A_t}^2$, where $A_t = I + \sum_{s=1}^{t} x_s x_s^\top$.

More examples:
- Online Newton Step [Hazan, Agarwal and Kale, 2007]: logarithmic regret for exp-concave loss functions
- AdaGrad [Duchi, Hazan and Singer, 2010]: competitive with the "optimal" fixed regularizer
- Scale-invariant algorithms [Ross, Mineiro and Langford, 2013]: regret invariant w.r.t. rescaling of single features
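A sketch of one way to implement the ridge update, maintaining $A_t$ and $b_t = \sum_s y_s x_s$ directly and solving a linear system per prediction (a production version would instead maintain $A_t^{-1}$ via rank-one updates); the class interface is illustrative, not from the talk.

```python
import numpy as np

class OnlineRidge:
    """Predict with the ridge solution computed from all past examples."""
    def __init__(self, d):
        self.A = np.eye(d)                    # A_0 = I
        self.b = np.zeros(d)                  # running sum of y_s * x_s

    def predict(self, x):
        w = np.linalg.solve(self.A, self.b)   # ridge weights from the past
        return w @ x

    def update(self, x, y):
        self.A += np.outer(x, x)              # A_t = A_{t-1} + x_t x_t^T
        self.b += y * x
```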

SLIDE 47

Shifting regret [Herbster and Warmuth, 2001]

Nonstationarity: if the data source is not fitted well by any single model in the class, then comparing to the best fixed model $u \in S$ is trivial. Compare instead to the best sequence $u_1, u_2, \dots \in S$ of models.

Shifting regret for Online Mirror Descent [Zinkevich, 2003]:
$\underbrace{\sum_{t=1}^{T}\ell_t(w_t)}_{\text{cumulative loss}} \le \inf_{u_1,\dots,u_T\in S}\Bigg(\underbrace{\sum_{t=1}^{T}\ell_t(u_t)}_{\text{model fit}} + \underbrace{\sum_{t=1}^{T}\|u_t - u_{t-1}\|}_{\text{shifting model cost}}\Bigg) + \mathrm{diam}(S) + \cdots$

SLIDE 48

Online active learning

[Figure: a classifier receives unlabeled data, outputs a guessed label, and may request the true label from a human expert.]

Observing the data process is cheap; observing the label process is expensive, so the human expert must be queried sparingly.

Question: how much better can we do by adaptively subsampling the label process?

SLIDE 49

A game with the opponent [C-B, Gentile, Zaniboni, 2006]

[Figure: vectorized documents plotted against a decision surface.]

The opponent avoids causing mistakes on documents far away from the decision surface. The probability of querying a document is therefore made proportional to the inverse of its distance from the decision surface. The binary classification performance guarantee remains identical (in expectation) to the full-sampling case.
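A sketch of one round of such a margin-based selective sampler, assuming the common $b/(b+|\text{margin}|)$ query probability with a hypothetical scale parameter $b$ and a perceptron-style update on queried mistakes; details may differ from the cited paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def selective_sampling_step(w, x, b, get_label):
    """One round: maybe query the label, update only on queried mistakes."""
    margin = w @ x                              # proxy for distance to surface
    if rng.random() < b / (b + abs(margin)):    # query prob. decays with |margin|
        y = get_label()                         # expensive call to the expert
        if y * margin <= 0:                     # queried and mispredicted
            w = w + y * x                       # perceptron-style update
    return w
```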

SLIDE 50

Experiments on document categorization

[Figure: experimental results on document categorization.]

SLIDE 51

Stochastic Online Mirror Descent

Parameters: a strongly convex regularizer $\Phi$ and a learning rate $\eta > 0$.
Initialize: $\theta_1 = 0$ (primal parameter).
For $t = 1, 2, \dots$:
1. Use $w_t = \nabla\Phi^*(\theta_t)$ (mirror step, with projection on $S$)
2. Suffer loss $\ell_t(w_t)$
3. Compute an estimate $g_t$ of the loss gradient $\nabla\ell_t(w_t)$
4. Update $\theta_{t+1} = \theta_t - \eta\,g_t$ (gradient step)

Typically $\Phi(w) = \frac12\|w\|^2$ (stochastic OGD).

SLIDE 52

Attribute efficient learning [C-B, Shamir, Shalev-Shwartz, 2011]

Obtain only a few attributes from each training example. Use stochastic OGD with the square loss:
$\ell_t(w) = \frac12\big(w^\top x_t - y_t\big)^2$, $\nabla\ell_t(w) = \big(w^\top x_t - y_t\big)\,x_t$

An unbiased estimate of the square loss gradient using two attributes:
1. Estimate of $w^\top x$: query $x_i$ with probability $p(i) = \dfrac{|w_i|}{\|w\|_1}$
2. Estimate of $x$: query $x_j$ uniformly at random
3. Gradient estimate: $g = \big(\|w\|_1\,\mathrm{sgn}(w_i)\,x_i - y\big)\, d\, x_j\, e_j$
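A sketch of this estimator, where `query(k)` stands for the budgeted access to coordinate $x_k$ of the current example (an assumed interface); taking expectations over $i$ and $j$ recovers $(w^\top x - y)\,x$.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_attribute_gradient(w, query, y):
    """Unbiased gradient estimate from two queried attributes (assumes w != 0)."""
    d = len(w)
    w1 = np.abs(w).sum()                        # ||w||_1
    i = rng.choice(d, p=np.abs(w) / w1)         # p(i) proportional to |w_i|
    pred_est = w1 * np.sign(w[i]) * query(i)    # unbiased estimate of w^T x
    j = rng.integers(d)                         # uniform coordinate
    g = np.zeros(d)
    g[j] = (pred_est - y) * d * query(j)        # g = (est - y) * d * x_j * e_j
    return g
```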

SLIDE 53

Summary (Part 5: The joy of convex (without the gradient))

SLIDE 54

Online convex optimization with bandit feedback

For $t = 1, 2, \dots$:
1. Play a point $w_t$ from a convex set $S$
2. Incur and observe only the loss value $\ell_t(w_t)$ (a convex loss; no gradient is revealed)
3. Update the point: $w_t \to w_{t+1} \in S$

Regret: $R_T = \mathbb{E}\Big[\sum_{t=1}^{T}\ell_t(w_t)\Big] - \inf_{u\in S}\sum_{t=1}^{T}\ell_t(u)$

SLIDE 55

Gradient descent without a gradient [Flaxman, Kalai and McMahan, 2004]

Run stochastic OGD using a perturbed version of $w_t$: play $w_t + \delta U$, where $U$ is a random unit vector and $\delta > 0$.

Gradient estimate: $g_t = \dfrac{d}{\delta}\,\ell_t\big(w_t + \delta U\big)\, U$

Fact (Stokes' theorem): if $\ell_t$ were differentiable, then
$\mathbb{E}\big[g_t\big] = \nabla\,\mathbb{E}\big[\ell_t(w_t + \delta B)\big]$
where $B$ is a random vector in the unit ball. So $g_t$ estimates the gradient of a locally smoothed version of $\ell_t$.
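The one-point estimator in code, assuming `loss_value_at` is the bandit feedback oracle returning only the scalar loss at the queried point:

```python
import numpy as np

rng = np.random.default_rng(0)

def one_point_gradient(loss_value_at, w, delta):
    """Estimate the smoothed gradient from a single bandit loss value."""
    u = rng.normal(size=len(w))
    u /= np.linalg.norm(u)                  # random unit vector U
    value = loss_value_at(w + delta * u)    # only feedback: l_t(w_t + delta U)
    return (len(w) / delta) * value * u     # g_t = (d / delta) l_t(.) U
```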

SLIDES 56–57

Guarantees

If $\ell_t$ is Lipschitz, then the smoothed version is a good approximation of $\ell_t$. The radius $\delta$ of the perturbation controls a bias/variance trade-off. Regret of stochastic OGD for convex and Lipschitz loss sequences: $R_T = O\big(T^{3/4}\big)$.

The linear case: assume the losses are linear functions on $S$, $\ell_t(w) = \ell_t^\top w$. Can we achieve a better rate?

SLIDE 58

Self-concordant functions [Abernethy, Hazan and Rakhlin, 2008]

Run stochastic OGD regularized with a self-concordant function for $S$, with variance control through the associated Dikin ellipsoid. The loss estimate $\hat\ell_t$ is obtained via the perturbed point $W_t \pm \sqrt{\lambda_i}\,e_i$, where $(e_i, \lambda_i)$ is a randomly drawn eigenvector-eigenvalue pair of the Dikin ellipsoid.

Regret for linear losses: $R_T = O\big(d^{3/2}\sqrt{T\ln T}\big)$

SLIDE 59

Algorithm GeometricHedge [Dani, Hayes and Kakade, 2008]

Build an $\varepsilon$-cover $S_0 \subseteq S$ of size $\varepsilon^{-d}$. Use an experts algorithm (e.g., exponential weights) to draw actions $W_t \in S_0$, with the unbiased linear loss estimator
$\hat\ell_t = P_t^{-1}\, W_t W_t^\top\, \ell_t$, where $P_t = \mathbb{E}\big[W_t W_t^\top\big]$.

Mix the exponential weights distribution with an exploration distribution $\mu$ over the actions in the cover:
$p_t(w) = (1-\gamma)\,\underbrace{q_t(w)}_{\text{exp. weights}} + \gamma\,\mu(w)$, $0 \le \gamma \le 1$.
$\mu$ controls the variance of the loss estimates by ensuring that all directions are sampled often enough.
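Note that the estimator only needs the observed scalar loss $W_t^\top \ell_t$. A short sketch, with `actions` and `p` describing the sampling distribution over the cover $S_0$ (an assumed representation):

```python
import numpy as np

def geometric_hedge_estimate(actions, p, w_played, observed_loss):
    """Unbiased estimate of the loss vector from one bandit observation."""
    # P_t = E[W W^T] under the current sampling distribution p over S_0
    P = sum(pi * np.outer(a, a) for pi, a in zip(p, actions))
    return np.linalg.solve(P, w_played) * observed_loss  # P^{-1} W_t (W_t^T l_t)
```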

SLIDE 60

Analysis [C-B and Lugosi, 2010]

Regret bound:
$R_T = O\Big(d\,\Big(\frac{1}{d\,\lambda_{\min}} + 1\Big)\sqrt{T\ln T}\Big)$
where $\lambda_{\min}$ is the smallest eigenvalue of $\mathbb{E}_\mu\big[W W^\top\big]$, and $\lambda_{\min}^{-1}$ is proportional to the variance of the loss estimates.

When $\lambda_{\min} \approx \frac{1}{d}$ we get the optimal bound $\Theta\big(d\sqrt{T\ln T}\big)$. If $\mu$ is uniform over all actions, this happens when the action space is approximately isotropic.

SLIDE 61

Good exploration bases [Bubeck, C-B and Kakade, 2012]

Choose a basis under which the action set looks isotropic. There are at most $O(d^2)$ contact points between $S_0$ and its Löwner ellipsoid (the minimum-volume ellipsoid enclosing $S_0$). Put the exploration distribution $\mu$ on these contact points: this makes $\mathbb{E}_\mu\big[W W^\top\big]$ isotropic, with $\lambda_{\min} = \frac{1}{d}$.

Exploration on the contact points of the Löwner ellipsoid achieves the optimal regret $R_T = O\big(d\sqrt{T\ln T}\big)$. However, this construction is not efficient in general; an efficient construction uses volumetric ellipsoids [Hazan, Gerber and Meka, 2014].

SLIDE 62

Conclusions

More applications: portfolio management, matrix completion, competitive analysis of algorithms, recommendation systems.

Some open problems: exact rates for bandit convex optimization; trade-offs between regret bounds and running times; online tensor and spectral learning; problems with states.