SLIDE 1

The Online Approach to Machine Learning

Nicolò Cesa-Bianchi
Università degli Studi di Milano

SLIDE 2

Summary

1. My beautiful regret
2. A supposedly fun game I'll play again
3. A graphic novel
4. The joy of convex
5. The joy of convex (without the gradient)

SLIDE 3

Summary (Part 1: My beautiful regret)

SLIDE 4

Machine learning

- Classification/regression tasks
- Predictive models: $h$ maps data instances $X$ to labels $Y$ (e.g., a binary classifier)
- Training data: $S_T = \big((X_1,Y_1),\dots,(X_T,Y_T)\big)$ (e.g., email messages annotated as spam vs. nonspam)
- A learning algorithm $A$ (e.g., a Support Vector Machine) maps the training data $S_T$ to a model $h = A(S_T)$
- Evaluate the risk of the trained model $h$ with respect to a given loss function

SLIDE 5

Two notions of risk

View data as a statistical sample: statistical risk
$\mathbb{E}\big[\mathrm{loss}\big(\underbrace{A(S_T)}_{\text{trained model}},\ \underbrace{(X,Y)}_{\text{test example}}\big)\big]$
where the training set $S_T = \big((X_1,Y_1),\dots,(X_T,Y_T)\big)$ and the test example $(X,Y)$ are drawn i.i.d. from the same unknown and fixed distribution.

View data as an arbitrary sequence: sequential risk
$\sum_{t=1}^{T} \mathrm{loss}\big(\underbrace{A(S_{t-1})}_{\text{trained model}},\ \underbrace{(X_t,Y_t)}_{\text{test example}}\big)$
with a sequence of models trained on growing prefixes $S_t = \big((X_1,Y_1),\dots,(X_t,Y_t)\big)$ of the data sequence.

SLIDE 6

Regrets, I had a few

A learning algorithm $A$ maps datasets to models in a given class $\mathcal{H}$.

Variance error in statistical learning:
$\mathbb{E}\big[\mathrm{loss}\big(A(S_T),(X,Y)\big)\big] - \inf_{h\in\mathcal{H}} \mathbb{E}\big[\mathrm{loss}\big(h,(X,Y)\big)\big]$
(compare to the expected loss of the best model in the class)

Regret in online learning:
$\sum_{t=1}^{T} \mathrm{loss}\big(A(S_{t-1}),(X_t,Y_t)\big) - \inf_{h\in\mathcal{H}} \sum_{t=1}^{T} \mathrm{loss}\big(h,(X_t,Y_t)\big)$
(compare to the cumulative loss of the best model in the class)

SLIDE 7

Incremental model update

A natural blueprint for online learning algorithms. For $t = 1, 2, \dots$:

1. Apply the current model $h_{t-1}$ to the next data element $(X_t, Y_t)$
2. Update the current model: $h_{t-1} \to h_t \in \mathcal{H}$

Goal: control the regret
$\sum_{t=1}^{T} \mathrm{loss}\big(h_{t-1},(X_t,Y_t)\big) - \inf_{h\in\mathcal{H}} \sum_{t=1}^{T} \mathrm{loss}\big(h,(X_t,Y_t)\big)$

View this as a repeated game between a player generating predictors $h_t \in \mathcal{H}$ and an opponent generating data $(X_t, Y_t)$.
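As a concrete rendering of this blueprint, here is a minimal Python sketch of the predict-then-update loop, with regret tracked against a fixed comparator. The `model` interface (`predict`/`update`) and the `stream` iterable are hypothetical placeholders, not part of the talk.

```python
def online_loop(model, comparator, stream, loss):
    """Run the predict-then-update protocol; return regret vs. a fixed comparator."""
    cumulative, comparator_loss = 0.0, 0.0
    for x, y in stream:                  # opponent reveals (X_t, Y_t)
        y_hat = model.predict(x)         # apply current model h_{t-1}
        cumulative += loss(y_hat, y)     # suffer loss
        comparator_loss += loss(comparator.predict(x), y)
        model.update(x, y)               # incremental update: h_{t-1} -> h_t
    return cumulative - comparator_loss
```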

SLIDE 8

Summary (Part 2: A supposedly fun game I'll play again)

SLIDE 9

Theory of repeated games

James Hannan (1922–2010) and David Blackwell (1919–2010), "Learning to play a game" (1956): play a game repeatedly against a possibly suboptimal opponent.

SLIDE 10

Zero-sum 2-person games played more than once

A known $N \times M$ loss matrix with entries $\ell(i,j)$: the row player (the player) has $N$ actions, the column player (the opponent) has $M$ actions.

For each game round $t = 1, 2, \dots$:
- The player chooses action $i_t$ and the opponent chooses action $y_t$
- The player suffers loss $\ell(i_t, y_t)$ (= the gain of the opponent)
- The player can learn from the opponent's history of past choices $y_1, \dots, y_{t-1}$

SLIDE 11

Prediction with expert advice

(Volodya Vovk, Manfred Warmuth)

The opponent's moves $y_1, y_2, \dots$ define a sequential prediction problem with a time-varying loss function: $\ell(i_t, y_t) = \ell_t(i_t)$. The losses form a table with one row per action $i = 1, \dots, N$ and one column per round $t$, with entries $\ell_t(i)$.

SLIDES 12–14

Playing the experts game

$N$ actions. For $t = 1, 2, \dots$:

1. A loss $\ell_t(i) \in [0,1]$ is assigned to every action $i = 1, \dots, N$ (hidden from the player)
2. The player picks an action $I_t$ (possibly using randomization) and incurs loss $\ell_t(I_t)$
3. The player gets full feedback: all of $\ell_t(1), \dots, \ell_t(N)$ are revealed

SLIDE 15

Oblivious opponents

The losses $\ell_t(1), \dots, \ell_t(N)$ for all $t = 1, 2, \dots$ are fixed beforehand and unknown to the (randomized) player.

Oblivious regret minimization:
$R_T \stackrel{\mathrm{def}}{=} \mathbb{E}\Big[\sum_{t=1}^{T}\ell_t(I_t)\Big] - \min_{i=1,\dots,N}\sum_{t=1}^{T}\ell_t(i)$, and we want $R_T = o(T)$.

SLIDE 16

Bounds on regret [Experts' paper, 1997]

Lower bound using random losses: replace $\ell_t(i)$ with $L_t(i) \in \{0,1\}$, independent random coin flips. For any player strategy,
$\mathbb{E}\Big[\sum_{t=1}^{T} L_t(I_t)\Big] = \frac{T}{2}$.
The expected regret is then
$\mathbb{E}\Big[\max_{i=1,\dots,N}\sum_{t=1}^{T}\Big(\frac12 - L_t(i)\Big)\Big] = \big(1 - o(1)\big)\sqrt{\frac{T\ln N}{2}}$.

SLIDE 17

Exponentially weighted forecaster

At time $t$ pick action $I_t = i$ with probability proportional to
$\exp\Big(-\eta \sum_{s=1}^{t-1}\ell_s(i)\Big)$
(the sum in the exponent is the total loss of action $i$ up to now).

Regret bound [Experts' paper, 1997]: if $\eta = \sqrt{(8\ln N)/T}$ then
$R_T \le \sqrt{\frac{T\ln N}{2}}$,
matching the lower bound including constants. The dynamic choice $\eta_t = \sqrt{(8\ln N)/t}$ only loses small constants.
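A minimal Python sketch of this forecaster, assuming losses in $[0,1]$ and a known horizon $T$ used for the tuning above; the loss matrix is a stand-in for the opponent's choices.

```python
import numpy as np

def hedge(loss_matrix):
    """Exponentially weighted forecaster; loss_matrix[t, i] in [0, 1]."""
    rng = np.random.default_rng(0)
    T, N = loss_matrix.shape
    eta = np.sqrt(8 * np.log(N) / T)        # tuning from the bound above
    cum_loss = np.zeros(N)                  # total loss of each action so far
    player_loss = 0.0
    for t in range(T):
        w = np.exp(-eta * (cum_loss - cum_loss.min()))  # shifted for stability
        p = w / w.sum()
        i = rng.choice(N, p=p)              # draw the randomized action I_t
        player_loss += loss_matrix[t, i]    # suffer l_t(I_t)
        cum_loss += loss_matrix[t]          # full-information feedback
    return player_loss - cum_loss.min()     # realized regret
```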

SLIDES 18–21

The bandit problem: playing an unknown game

$N$ actions. For $t = 1, 2, \dots$:

1. A loss $\ell_t(i) \in [0,1]$ is assigned to every action $i = 1, \dots, N$ (hidden from the player)
2. The player picks an action $I_t$ (possibly using randomization) and incurs loss $\ell_t(I_t)$
3. The player gets partial feedback: only $\ell_t(I_t)$ is revealed

Many applications: ad placement, dynamic content adaptation, routing, online auctions.

SLIDE 22

Summary (Part 3: A graphic novel)

SLIDE 23

Relationships between actions [Mannor and Shamir, 2011]

[Figures: an undirected and a directed feedback graph over actions.]

SLIDES 24–26

A graph of relationships over actions

[Figures: playing an action reveals its own loss together with the losses of its neighbors in the graph; the losses of non-neighboring actions stay hidden.]

SLIDE 27

Recovering expert and bandit settings

Experts: the clique (playing any action reveals the losses of all actions). Bandits: the empty graph (only the played action's loss is revealed).

SLIDE 28

Exponentially weighted forecaster — Reprise

Player's strategy [Alon, C-B, Gentile, Mannor, Mansour and Shamir, 2013]:
$P_t(I_t = i) \propto \exp\Big(-\eta\sum_{s=1}^{t-1}\hat\ell_s(i)\Big)$, $i = 1,\dots,N$
with the importance sampling estimator
$\hat\ell_t(i) = \dfrac{\ell_t(i)}{P_t\big(\ell_t(i)\ \text{observed}\big)}$ if $\ell_t(i)$ is observed, and $\hat\ell_t(i) = 0$ otherwise.

The estimator is unbiased, $\mathbb{E}_t\big[\hat\ell_t(i)\big] = \ell_t(i)$, with variance control
$\mathbb{E}_t\big[\hat\ell_t(i)^2\big] \le \dfrac{1}{P_t\big(\ell_t(i)\ \text{observed}\big)}$.
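One round of this strategy can be sketched as follows, assuming a hypothetical `neighbors[i]` structure listing the actions whose losses are observed when $i$ is played (with $i$ in `neighbors[i]` itself):

```python
import numpy as np

def graph_feedback_round(cum_est_loss, losses, neighbors, eta, rng):
    """Play one round; update importance-weighted loss estimates in place."""
    N = len(cum_est_loss)
    w = np.exp(-eta * (cum_est_loss - cum_est_loss.min()))
    p = w / w.sum()
    i = rng.choice(N, p=p)                        # play I_t
    for j in neighbors[i]:                        # losses revealed this round
        # P_t(loss of j observed) = sum of p[k] over actions k observing j
        obs_prob = sum(p[k] for k in range(N) if j in neighbors[k])
        cum_est_loss[j] += losses[j] / obs_prob   # importance sampling update
    return i
```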

SLIDES 29–30

Independence number α(G)

The independence number $\alpha(G)$ is the size of the largest independent set of the graph $G$ (a set of vertices no two of which are joined by an edge). [Figures: an example graph and its largest independent set.]

SLIDE 31

Regret bounds

Analysis (undirected graphs):
$R_T \le \frac{\ln N}{\eta} + \frac{\eta}{2}\sum_{t=1}^{T}\underbrace{\sum_{i=1}^{N}\mathbb{P}\big(I_t = i \mid \ell_t(i)\ \text{observed}\big)}_{\le\ \alpha(G)} \le \sqrt{\alpha(G)\,T\ln N}$ by tuning $\eta$.

If the graph is directed, the bound worsens only by logarithmic factors.

Special cases:
- Experts: $\alpha(G) = 1$, giving $R_T \lesssim \sqrt{T\ln N}$
- Bandits: $\alpha(G) = N$, giving $R_T \lesssim \sqrt{TN\ln N}$

SLIDE 32

Reactive opponents [Dekel, Koren and Peres, 2014]

The loss of action $i$ at time $t$ depends on the player's past $m$ actions:
$\ell_t(i) \to L_t(I_{t-m},\dots,I_{t-1},i)$

Adaptive regret:
$R_T^{\mathrm{ada}} = \mathbb{E}\Big[\sum_{t=1}^{T} L_t(I_{t-m},\dots,I_{t-1},I_t) - \min_{i=1,\dots,N}\sum_{t=1}^{T} L_t(\underbrace{i,\dots,i}_{m\ \text{times}},\, i)\Big]$

Minimax adaptive regret (for any constant $m > 1$): $R_T^{\mathrm{ada}} = \Theta\big(T^{2/3}\big)$

SLIDE 33

Partial monitoring: not observing any loss

Dynamic pricing: perform as well as the best fixed price.
1. Post a T-shirt price
2. Observe whether the next customer buys or not
3. Adjust the price

The feedback does not reveal the player's loss. With posted prices $i$ and customer valuations $j$ in $\{1,\dots,5\}$, the loss is the forgone revenue $j - i$ when the sale happens ($j \ge i$) and a fixed cost $c$ when the customer walks away ($j < i$); the feedback only says whether the customer bought.

Loss matrix:
        j=1  j=2  j=3  j=4  j=5
  i=1    0    1    2    3    4
  i=2    c    0    1    2    3
  i=3    c    c    0    1    2
  i=4    c    c    c    0    1
  i=5    c    c    c    c    0

Feedback matrix (1 = customer buys):
        j=1  j=2  j=3  j=4  j=5
  i=1    1    1    1    1    1
  i=2    0    1    1    1    1
  i=3    0    0    1    1    1
  i=4    0    0    0    1    1
  i=5    0    0    0    0    1

SLIDE 34

A characterization of minimax regret

Special case: multiarmed bandits, where the loss and feedback matrices coincide.

A general gap theorem [Bartók, Foster, Pál, Rakhlin and Szepesvári, 2013]: a constructive characterization of the minimax regret for any pair of loss/feedback matrices. Only three possible rates for nontrivial games:
1. Easy games (e.g., bandits): $\Theta(\sqrt{T})$
2. Hard games (e.g., a revealing action): $\Theta(T^{2/3})$
3. Impossible games: $\Theta(T)$

SLIDE 35

Summary (Part 4: The joy of convex)

SLIDE 36

A game equivalent to prediction with expert advice

Online linear optimization in the simplex:
1. Play a point $p_t$ from the $N$-dimensional simplex $\Delta_N$
2. Incur the linear loss $\mathbb{E}\big[\ell_t(I_t)\big] = p_t^\top \ell_t$
3. Observe the loss gradient $\ell_t$

Regret: compete against the best point in the simplex,
$\sum_{t=1}^{T} p_t^\top \ell_t - \min_{q\in\Delta_N}\sum_{t=1}^{T} q^\top \ell_t$
and, since a linear function on the simplex is minimized at a vertex,
$\min_{q\in\Delta_N}\sum_{t=1}^{T} q^\top \ell_t = \min_{i=1,\dots,N}\sum_{t=1}^{T}\ell_t(i)$.

SLIDE 37

From game theory to machine learning

[Figure: a classification system receives unlabeled data, outputs a guessed label, and the opponent supplies the true label.]

The opponent's moves $y_t$ are viewed as values or labels assigned to observations $x_t \in \mathbb{R}^d$ (e.g., categories of documents). This is a repeated game between the player, who chooses an element $w_t$ of a linear space, and the opponent, who chooses a label $y_t$ for $x_t$. Regret is measured with respect to the best element of the linear space.

SLIDE 38

Online convex optimization [Zinkevich, 2003]

1. Play a point $w_t$ from a convex set $S$ in a linear space
2. Incur the convex loss $\ell_t(w_t)$
3. Observe the loss gradient $\nabla\ell_t(w_t)$
4. Update the point: $w_t \to w_{t+1} \in S$

Examples:
- Regression with the square loss: $\ell_t(w) = \big(w^\top x_t - y_t\big)^2$, $y_t \in \mathbb{R}$
- Classification with the hinge loss: $\ell_t(w) = \big[1 - y_t\, w^\top x_t\big]_+$, $y_t \in \{-1,+1\}$

Regret: $\sum_{t=1}^{T} \ell_t(w_t) - \inf_{u\in S}\sum_{t=1}^{T}\ell_t(u)$

SLIDE 39

Finding a good online algorithm

Follow the leader: $w_{t+1} = \operatorname*{arginf}_{w\in S}\sum_{s=1}^{t}\ell_s(w)$.
Regret can be linear due to lack of stability.

Example: $S = [-1,+1]$, $\ell_1(w) = \frac{1+w}{2}$, and for $t \ge 2$, $\ell_t(w) = -w$ if $t$ is even and $+w$ if $t$ is odd. The leader then flips between the endpoints $\pm 1$ and suffers loss 1 on every round after the first, while the best fixed point incurs cumulative loss close to 0 (see the numerical check below).
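The sketch below plays this loss sequence against follow the leader, dropping the additive constant $\frac12$ from $\ell_1$ (it cancels in the regret); the returned regret grows linearly with $T$, illustrating the instability.

```python
def ftl_regret(T):
    """Follow-the-leader on S = [-1, 1] with the alternating loss sequence."""
    grads = [0.5] + [(-1.0 if t % 2 == 0 else 1.0) for t in range(2, T + 1)]
    cum_grad, ftl_loss = 0.0, 0.0
    for g in grads:
        w = -1.0 if cum_grad > 0 else 1.0   # leader: minimize past linear loss
        ftl_loss += g * w                   # suffer l_t(w) = g_t * w
        cum_grad += g
    best_fixed = min(cum_grad * u for u in (-1.0, 1.0))
    return ftl_loss - best_fixed

print(ftl_regret(1000))                     # 1000.0: regret linear in T
```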

SLIDE 40

Regularized online learning

Strong convexity: $\Phi : S \to \mathbb{R}$ is $\beta$-strongly convex w.r.t. a norm $\|\cdot\|$ if for all $u, v \in S$:
$\Phi(v) \ge \Phi(u) + \nabla\Phi(u)^\top (v - u) + \frac{\beta}{2}\|u - v\|^2$
Example: $\Phi(v) = \frac12\|v\|^2$.

Follow the regularized leader [Shalev-Shwartz, 2007; Abernethy, Hazan and Rakhlin, 2008]:
$w_{t+1} = \operatorname*{argmin}_{w\in S}\Big(\eta\sum_{s=1}^{t}\ell_s(w) + \Phi(w)\Big)$
where $\Phi$ is a strongly convex regularizer defined on $S$.

SLIDE 41

Deriving an incremental update

Linearization of convex losses: by convexity,
$\ell_t(w_t) - \ell_t(u) \le \nabla\ell_t(w_t)^\top w_t - \nabla\ell_t(w_t)^\top u$
so it suffices to control regret on the linearized losses.

Follow the regularized leader with linearized losses (write $\theta_t = \sum_{s=1}^{t}\nabla\ell_s(w_s)$):
$w_{t+1} = \operatorname*{argmin}_{w\in S}\big(\eta\,\theta_t^\top w + \Phi(w)\big) = \operatorname*{argmax}_{w\in S}\big(-\eta\,\theta_t^\top w - \Phi(w)\big) = \nabla\Phi^*\big(-\eta\,\theta_t\big)$
where $\Phi^*$ is the convex dual of $\Phi$.

SLIDE 42

The Mirror Descent algorithm [Nemirovsky and Yudin, 1983]

Recall: $w_{t+1} = \nabla\Phi^*(-\eta\,\theta_t) = \nabla\Phi^*\Big(-\eta\sum_{s=1}^{t}\nabla\ell_s(w_s)\Big)$

Online Mirror Descent
Parameters: a strongly convex regularizer $\Phi$ and a learning rate $\eta > 0$.
Initialize: $\theta_1 = 0$ (primal parameter).
For $t = 1, 2, \dots$:
1. Use $w_t = \nabla\Phi^*(\theta_t)$ (dual parameter, via the mirror step)
2. Suffer loss $\ell_t(w_t)$
3. Observe the loss gradient $\nabla\ell_t(w_t)$
4. Update $\theta_{t+1} = \theta_t - \eta\,\nabla\ell_t(w_t)$ (gradient step)
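A minimal Python sketch of this loop, where `grad_phi_star` is the mirror map $\nabla\Phi^*$ supplied by the caller and each round provides a gradient callback; both are illustrative placeholders, not part of the talk.

```python
import numpy as np

def online_mirror_descent(grad_phi_star, gradient_rounds, eta, d):
    """Run OMD; `gradient_rounds` yields one gradient callable per round t."""
    theta = np.zeros(d)                  # theta_1 = 0
    plays = []
    for grad_fn in gradient_rounds:
        w = grad_phi_star(theta)         # mirror step: w_t = grad Phi*(theta_t)
        plays.append(w)
        g = grad_fn(w)                   # observe the gradient of l_t at w_t
        theta = theta - eta * g          # gradient step
    return plays
```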

SLIDE 43

Some examples

- Exponentiated gradient: $S = \Delta_N$ and $\Phi(w) = \sum_{i=1}^{N} w_i \ln w_i$ [Kivinen and Warmuth, 1997]
- Online Gradient Descent: $S = \mathbb{R}^d$ and $\Phi(w) = \frac12\|w\|^2$ [Zinkevich, 2003]
- $p$-norm gradient descent: $S = \mathbb{R}^d$ and $\Phi(w) = \frac{1}{2(p-1)}\|w\|_p^2$ [Gentile, 2003]
- Matrix gradient descent [Cavallanti, C-B and Gentile, 2010; Kakade, Shalev-Shwartz and Tewari, 2012]
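For the first two instances, the mirror maps to plug into the sketch after SLIDE 42 are, respectively, a softmax (negative entropy on the simplex) and the identity ($\frac12\|w\|^2$ on $\mathbb{R}^d$, recovering online gradient descent); the function names are illustrative.

```python
import numpy as np

def eg_map(theta):
    """Mirror map of the entropic regularizer on the simplex (softmax)."""
    z = np.exp(theta - theta.max())      # shift for numerical stability
    return z / z.sum()

def ogd_map(theta):
    """Mirror map of (1/2)||w||^2 on R^d: the identity."""
    return theta
```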

SLIDE 44

General regret bound

The analysis relies on the smoothness of $\Phi^*$ to bound the increments $\Phi^*(\theta_{t+1}) - \Phi^*(\theta_t)$ via $\|\nabla\ell_t(w_t)\|^2$.

Oracle bound [Kakade, Shalev-Shwartz and Tewari, 2012]:
$\underbrace{\sum_{t=1}^{T}\ell_t(w_t)}_{\text{cumulative loss}} \le \inf_{u\in S}\Bigg(\underbrace{\sum_{t=1}^{T}\ell_t(u)}_{\text{model fit}} + \underbrace{\frac{\Phi(u)}{\eta}}_{\text{model cost}}\Bigg) + \frac{\eta}{2\beta}\sum_{t=1}^{T}\big\|\nabla\ell_t(w_t)\big\|^2$

Here $\ell_1, \ell_2, \dots$ are arbitrary convex losses. If the gradients are bounded, then $R_T = O(\sqrt{T})$, which is optimal for general convex losses. If all $\ell_t$ are strongly convex, then $R_T = O(\ln T)$.

SLIDE 45

Regularization via stochastic smoothing

Follow the perturbed leader [Kalai and Vempala, 2005]:
$w_{t+1} = \mathbb{E}\Big[\operatorname*{argmin}_{w\in S}\big(\eta\,\theta_t^\top w + Z^\top w\big)\Big]$
The distribution of $Z$ must be "stable" (small variance and small average sensitivity). For some choices of $Z$, FPL becomes equivalent to OMD [Abernethy, Lee, Sinha and Tewari, 2014].

SLIDE 46

Adaptive regularization

Online ridge regression [Vovk, 2001; Azoury and Warmuth, 2001]:
$\sum_{t=1}^{T}\big(w_t^\top x_t - y_t\big)^2 \le \inf_{u\in\mathbb{R}^d}\Bigg(\sum_{t=1}^{T}\big(u^\top x_t - y_t\big)^2 + \|u\|^2\Bigg) + d\ln\Big(1 + \frac{T}{d}\Big)$
obtained with the time-varying regularizer $\Phi_t(w) = \frac12\|w\|_{A_t}^2$, where $A_t = I + \sum_{s=1}^{t} x_s x_s^\top$.

More examples:
- Online Newton Step [Hazan, Agarwal and Kale, 2007]: logarithmic regret for exp-concave loss functions
- AdaGrad [Duchi, Hazan and Singer, 2010]: competitive with the "optimal" fixed regularizer
- Scale-invariant algorithms [Ross, Mineiro and Langford, 2013]: regret invariant w.r.t. rescaling of single features
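A sketch of one way to implement the ridge update, maintaining $A_t$ and $b_t = \sum_s y_s x_s$ directly and solving a linear system per prediction (a production version would instead maintain $A_t^{-1}$ via rank-one updates); the class interface is illustrative, not from the talk.

```python
import numpy as np

class OnlineRidge:
    """Predict with the ridge solution computed from all past examples."""
    def __init__(self, d):
        self.A = np.eye(d)                    # A_0 = I
        self.b = np.zeros(d)                  # running sum of y_s * x_s

    def predict(self, x):
        w = np.linalg.solve(self.A, self.b)   # ridge weights from the past
        return w @ x

    def update(self, x, y):
        self.A += np.outer(x, x)              # A_t = A_{t-1} + x_t x_t^T
        self.b += y * x
```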

SLIDE 47

Shifting regret [Herbster and Warmuth, 2001]

Nonstationarity: if the data source is not fitted well by any single model in the class, then comparing to the best fixed model $u \in S$ is trivial. Compare instead to the best sequence $u_1, u_2, \dots \in S$ of models.

Shifting regret for Online Mirror Descent [Zinkevich, 2003]:
$\underbrace{\sum_{t=1}^{T}\ell_t(w_t)}_{\text{cumulative loss}} \le \inf_{u_1,\dots,u_T\in S}\Bigg(\underbrace{\sum_{t=1}^{T}\ell_t(u_t)}_{\text{model fit}} + \underbrace{\sum_{t=1}^{T}\|u_t - u_{t-1}\|}_{\text{shifting model cost}}\Bigg) + \mathrm{diam}(S) + \cdots$

SLIDE 48

Online active learning

[Figure: a classifier receives unlabeled data, outputs a guessed label, and may request the true label from a human expert.]

Observing the data process is cheap; observing the label process is expensive, so the human expert must be queried sparingly.

Question: how much better can we do by adaptively subsampling the label process?

SLIDE 49

A game with the opponent [C-B, Gentile, Zaniboni, 2006]

[Figure: vectorized documents plotted against a decision surface.]

The opponent avoids causing mistakes on documents far away from the decision surface. The probability of querying a document is therefore made proportional to the inverse of its distance from the decision surface. The binary classification performance guarantee remains identical (in expectation) to the full-sampling case.
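A sketch of one round of such a margin-based selective sampler, assuming the common $b/(b+|\text{margin}|)$ query probability with a hypothetical scale parameter $b$ and a perceptron-style update on queried mistakes; details may differ from the cited paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def selective_sampling_step(w, x, b, get_label):
    """One round: maybe query the label, update only on queried mistakes."""
    margin = w @ x                              # proxy for distance to surface
    if rng.random() < b / (b + abs(margin)):    # query prob. decays with |margin|
        y = get_label()                         # expensive call to the expert
        if y * margin <= 0:                     # queried and mispredicted
            w = w + y * x                       # perceptron-style update
    return w
```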

SLIDE 50

Experiments on document categorization

[Figure: experimental results on document categorization.]

SLIDE 51

Stochastic Online Mirror Descent

Parameters: a strongly convex regularizer $\Phi$ and a learning rate $\eta > 0$.
Initialize: $\theta_1 = 0$ (primal parameter).
For $t = 1, 2, \dots$:
1. Use $w_t = \nabla\Phi^*(\theta_t)$ (mirror step, with projection on $S$)
2. Suffer loss $\ell_t(w_t)$
3. Compute an estimate $g_t$ of the loss gradient $\nabla\ell_t(w_t)$
4. Update $\theta_{t+1} = \theta_t - \eta\,g_t$ (gradient step)

Typically $\Phi(w) = \frac12\|w\|^2$ (stochastic OGD).

SLIDE 52

Attribute efficient learning [C-B, Shamir, Shalev-Shwartz, 2011]

Obtain only a few attributes from each training example. Use stochastic OGD with the square loss:
$\ell_t(w) = \frac12\big(w^\top x_t - y_t\big)^2$, $\nabla\ell_t(w) = \big(w^\top x_t - y_t\big)\,x_t$

An unbiased estimate of the square loss gradient using two attributes:
1. Estimate of $w^\top x$: query $x_i$ with probability $p(i) = \dfrac{|w_i|}{\|w\|_1}$
2. Estimate of $x$: query $x_j$ uniformly at random
3. Gradient estimate: $g = \big(\|w\|_1\,\mathrm{sgn}(w_i)\,x_i - y\big)\, d\, x_j\, e_j$
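A sketch of this estimator, where `query(k)` stands for the budgeted access to coordinate $x_k$ of the current example (an assumed interface); taking expectations over $i$ and $j$ recovers $(w^\top x - y)\,x$.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_attribute_gradient(w, query, y):
    """Unbiased gradient estimate from two queried attributes (assumes w != 0)."""
    d = len(w)
    w1 = np.abs(w).sum()                        # ||w||_1
    i = rng.choice(d, p=np.abs(w) / w1)         # p(i) proportional to |w_i|
    pred_est = w1 * np.sign(w[i]) * query(i)    # unbiased estimate of w^T x
    j = rng.integers(d)                         # uniform coordinate
    g = np.zeros(d)
    g[j] = (pred_est - y) * d * query(j)        # g = (est - y) * d * x_j * e_j
    return g
```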

SLIDE 53

Summary (Part 5: The joy of convex (without the gradient))

SLIDE 54

Online convex optimization with bandit feedback

For $t = 1, 2, \dots$:
1. Play a point $w_t$ from a convex set $S$
2. Incur and observe only the loss value $\ell_t(w_t)$ (a convex loss; no gradient is revealed)
3. Update the point: $w_t \to w_{t+1} \in S$

Regret: $R_T = \mathbb{E}\Big[\sum_{t=1}^{T}\ell_t(w_t)\Big] - \inf_{u\in S}\sum_{t=1}^{T}\ell_t(u)$

SLIDE 55

Gradient descent without a gradient [Flaxman, Kalai and McMahan, 2004]

Run stochastic OGD using a perturbed version of $w_t$: play $w_t + \delta U$, where $U$ is a random unit vector and $\delta > 0$.

Gradient estimate: $g_t = \dfrac{d}{\delta}\,\ell_t\big(w_t + \delta U\big)\, U$

Fact (Stokes' theorem): if $\ell_t$ were differentiable, then
$\mathbb{E}\big[g_t\big] = \nabla\,\mathbb{E}\big[\ell_t(w_t + \delta B)\big]$
where $B$ is a random vector in the unit ball. So $g_t$ estimates the gradient of a locally smoothed version of $\ell_t$.
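The one-point estimator in code, assuming `loss_value_at` is the bandit feedback oracle returning only the scalar loss at the queried point:

```python
import numpy as np

rng = np.random.default_rng(0)

def one_point_gradient(loss_value_at, w, delta):
    """Estimate the smoothed gradient from a single bandit loss value."""
    u = rng.normal(size=len(w))
    u /= np.linalg.norm(u)                  # random unit vector U
    value = loss_value_at(w + delta * u)    # only feedback: l_t(w_t + delta U)
    return (len(w) / delta) * value * u     # g_t = (d / delta) l_t(.) U
```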

SLIDES 56–57

Guarantees

If $\ell_t$ is Lipschitz, then the smoothed version is a good approximation of $\ell_t$. The radius $\delta$ of the perturbation controls a bias/variance trade-off. Regret of stochastic OGD for convex and Lipschitz loss sequences: $R_T = O\big(T^{3/4}\big)$.

The linear case: assume the losses are linear functions on $S$, $\ell_t(w) = \ell_t^\top w$. Can we achieve a better rate?

SLIDE 58

Self-concordant functions [Abernethy, Hazan and Rakhlin, 2008]

Run stochastic OGD regularized with a self-concordant function for $S$, with variance control through the associated Dikin ellipsoid. The loss estimate $\hat\ell_t$ is obtained via the perturbed point $W_t \pm \sqrt{\lambda_i}\,e_i$, where $(e_i, \lambda_i)$ is a randomly drawn eigenvector-eigenvalue pair of the Dikin ellipsoid.

Regret for linear losses: $R_T = O\big(d^{3/2}\sqrt{T\ln T}\big)$

SLIDE 59

Algorithm GeometricHedge [Dani, Hayes and Kakade, 2008]

Build an $\varepsilon$-cover $S_0 \subseteq S$ of size $\varepsilon^{-d}$. Use an experts algorithm (e.g., exponential weights) to draw actions $W_t \in S_0$, with the unbiased linear loss estimator
$\hat\ell_t = P_t^{-1}\, W_t W_t^\top\, \ell_t$, where $P_t = \mathbb{E}\big[W_t W_t^\top\big]$.

Mix the exponential weights distribution with an exploration distribution $\mu$ over the actions in the cover:
$p_t(w) = (1-\gamma)\,\underbrace{q_t(w)}_{\text{exp. weights}} + \gamma\,\mu(w)$, $0 \le \gamma \le 1$.
$\mu$ controls the variance of the loss estimates by ensuring that all directions are sampled often enough.
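Note that the estimator only needs the observed scalar loss $W_t^\top \ell_t$. A short sketch, with `actions` and `p` describing the sampling distribution over the cover $S_0$ (an assumed representation):

```python
import numpy as np

def geometric_hedge_estimate(actions, p, w_played, observed_loss):
    """Unbiased estimate of the loss vector from one bandit observation."""
    # P_t = E[W W^T] under the current sampling distribution p over S_0
    P = sum(pi * np.outer(a, a) for pi, a in zip(p, actions))
    return np.linalg.solve(P, w_played) * observed_loss  # P^{-1} W_t (W_t^T l_t)
```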

SLIDE 60

Analysis [C-B and Lugosi, 2010]

Regret bound:
$R_T = O\Big(d\,\Big(\frac{1}{d\,\lambda_{\min}} + 1\Big)\sqrt{T\ln T}\Big)$
where $\lambda_{\min}$ is the smallest eigenvalue of $\mathbb{E}_\mu\big[W W^\top\big]$, and $\lambda_{\min}^{-1}$ is proportional to the variance of the loss estimates.

When $\lambda_{\min} \approx \frac{1}{d}$ we get the optimal bound $\Theta\big(d\sqrt{T\ln T}\big)$. If $\mu$ is uniform over all actions, this happens when the action space is approximately isotropic.

SLIDE 61

Good exploration bases [Bubeck, C-B and Kakade, 2012]

Choose a basis under which the action set looks isotropic. There are at most $O(d^2)$ contact points between $S_0$ and its Löwner ellipsoid (the minimum-volume ellipsoid enclosing $S_0$). Put the exploration distribution $\mu$ on these contact points: this makes $\mathbb{E}_\mu\big[W W^\top\big]$ isotropic, with $\lambda_{\min} = \frac{1}{d}$.

Exploration on the contact points of the Löwner ellipsoid achieves the optimal regret $R_T = O\big(d\sqrt{T\ln T}\big)$. However, this construction is not efficient in general; an efficient construction uses volumetric ellipsoids [Hazan, Gerber and Meka, 2014].

SLIDE 62

Conclusions

More applications: portfolio management, matrix completion, competitive analysis of algorithms, recommendation systems.

Some open problems: exact rates for bandit convex optimization; trade-offs between regret bounds and running times; online tensor and spectral learning; problems with states.