CS6501: Topics in Learning and Game Theory (Fall 2019)

Intro to Online Learning

Instructor: Haifeng Xu
Outline

Ø Online Learning/Optimization
Ø Measure Algorithm Performance via Regret
Ø Warm-up: A Simple Example
Overview of Machine Learning

Ø Supervised learning: labeled training data → ML algorithm → classifier/regression function
Ø Unsupervised learning: unlabeled training data → ML algorithm → clusters/knowledge
Ø Semi-supervised learning (a combination of the two)
What else are there?
Ø Supervised learning
Ø Unsupervised learning
Ø Semi-supervised learning
Ø Online learning
Ø Reinforcement learning
Ø Active learning
Ø ...
Online Learning: When Data Come Online

The online learning pipeline: start from an initial ML algorithm; then repeatedly make predictions/decisions, receive loss/reward, observe one more training instance, and update the ML algorithm.
Ø Statistical feedback: instances are drawn from a fixed distribution (e.g., a gambling machine, a.k.a. bandits)
Ø Adversarial feedback: instances are drawn adversarially
Ø Markovian feedback: instances are drawn from a distribution that is dynamically changing
Ø Learn to commute to school
Ø Learn to gamble or buy stocks
Ø Advertisers learn to bid for keywords
Ø Recommendation systems learn to make recommendations
Ø Clinical trials
Ø Robots learn to react
Ø Learn to play games (video games and strategic games)
Ø Even how you learn to make decisions in your life
Ø ...
Ø A learner acts in an uncertain world for T time steps
Ø At each step t = 1, ⋯, T, the learner takes an action i_t ∈ [n] = {1, ⋯, n}
Ø The learner observes a cost vector c_t, where c_t(i) ∈ [0,1] is the cost of action i ∈ [n]
Ø Adversarial feedback: c_t is chosen by an adversary who knows the history (actions, past costs, etc.) up to time t − 1 and also the learner's algorithm
Ø Learner's goal: minimize ∑_{t∈[T]} c_t(i_t)
At each time step t = 1, ⋯, T, the following occurs in order:

1. Learner picks a distribution p_t over actions [n]
2. Adversary picks cost vector c_t ∈ [0,1]^n (he knows p_t)
3. Action i_t ∼ p_t is drawn and the learner incurs cost c_t(i_t)
4. Learner observes c_t (for use in future time steps)

Ø The learner tries to pick the distribution sequence p_1, ⋯, p_T to minimize expected cost 𝔼[∑_{t∈[T]} c_t(i_t)]
Ø The adversary does not have to really exist – it is assumed mainly for the purpose of worst-case analysis
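To make the protocol concrete, here is a minimal Python sketch of the interaction loop (the function names and the two stand-in players are illustrative assumptions, not part of the lecture):

```python
import random

def play(T, n, learner, adversary):
    """Run the online learning protocol for T rounds with n actions."""
    history = []                      # list of (p_t, i_t, c_t) observed so far
    total_cost = 0.0
    for t in range(T):
        p = learner(history, n)       # 1. learner picks distribution p_t over [n]
        c = adversary(history, p, n)  # 2. adversary picks c_t in [0,1]^n, knowing p_t
        i = random.choices(range(n), weights=p)[0]  # 3. draw action i_t ~ p_t
        total_cost += c[i]            #    learner incurs cost c_t(i_t)
        history.append((p, i, c))     # 4. learner observes c_t
    return total_cost

# Illustrative stand-ins: a uniform learner, and the "all costs are 1" adversary
# discussed on the next slide
uniform_learner = lambda history, n: [1.0 / n] * n
all_ones_adversary = lambda history, p, n: [1.0] * n
```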
Ø The adversary can choose c_t ≡ 1, ∀t; the learner then suffers total cost T regardless
Ø But if c_t ≡ 1 ∀t, when you look back at the end you do not regret anything – had you known such costs in hindsight, you could not have done better

So what is a good measure for the performance of an online algorithm?
Outline

Ø Online Learning/Optimization
Ø Measure Algorithm Performance via Regret
Ø Warm-up: A Simple Example
Ø Regret measures how much the learner regrets, had he known the cost vectors c_1, ⋯, c_T in hindsight
Ø Formally,

   R_T = 𝔼_{i_t ∼ p_t} [ ∑_{t∈[T]} c_t(i_t) ] − min_{i∈[n]} ∑_{t∈[T]} c_t(i)

Ø The benchmark min_{i∈[n]} ∑_t c_t(i) is the learner's cost had he known c_1, ⋯, c_T and been allowed to take the best single action across all rounds; this best-fixed-action benchmark is the one mostly used

Regret is an appropriate performance measure of online algorithms
Ø Define the average regret

   R̄_T = R_T / T = 𝔼_{i_t ∼ p_t} [ (1/T) ∑_{t∈[T]} c_t(i_t) ] − min_{i∈[n]} (1/T) ∑_{t∈[T]} c_t(i)

Ø When R̄_T → 0 as T → ∞, we say the algorithm has vanishing regret or no regret; such an algorithm is called a no-regret online learning algorithm

Our goal: design no-regret algorithms by minimizing regret
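As a concrete illustration, here is a small Python helper (my own sketch, not from the slides) that computes the regret of one realized play against the best fixed action in hindsight:

```python
def regret(costs, actions):
    """Regret of a realized play.

    costs:   list of T cost vectors, each a list of n values in [0, 1]
    actions: list of T action indices actually played (i_1, ..., i_T)
    """
    n = len(costs[0])
    learner_cost = sum(c[i] for c, i in zip(costs, actions))
    # Best single action in hindsight: the fixed i minimizing total cost
    best_fixed = min(sum(c[i] for c in costs) for i in range(n))
    return learner_cost - best_fixed

# Example: two rounds, two actions; playing action 0 both rounds
# against the costs below gives regret 2 - 1 = 1
print(regret([[1.0, 0.0], [1.0, 1.0]], [0, 0]))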
A first attempt: Follow The Leader (FTL) – that is, pick the action with the smallest accumulated cost so far.

What is the worst-case regret of FTL? Answer: the worst (largest) regret is about T/2.

Ø Consider the following instance with 2 actions, where the two cost rows alternate between 1 and 0:

   t        1   2   3   4   5   ⋯   T
   c_t(1)   1   0   1   0   1   ⋯   ∗
   c_t(2)   0   1   0   1   0   ⋯   ∗

Ø FTL always picks the action that is about to have cost 1 → total cost T
Ø The best action in hindsight has cost at most T/2
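A short simulation of this bad instance (a sketch; the tie-breaking rule, which the adversary exploits, is an assumption here):

```python
def ftl_on_alternating_costs(T):
    """Follow The Leader on the alternating 2-action instance above."""
    acc = [0.0, 0.0]          # accumulated cost of each action so far
    total = 0.0
    for t in range(1, T + 1):
        i = 0 if acc[0] <= acc[1] else 1   # FTL; ties broken toward action 0 (assumption)
        c = [1.0, 0.0] if t % 2 == 1 else [0.0, 1.0]  # alternating cost vectors
        total += c[i]
        acc[0] += c[0]
        acc[1] += c[1]
    best_fixed = min(acc)      # best single action in hindsight
    return total - best_fixed  # regret, roughly T/2

print(ftl_on_alternating_costs(1000))  # prints 500.0
```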
FTL fails because it is deterministic:

Ø Recall that the adversary knows the history and the learner's algorithm
Ø So if p_t is deterministic, the action i_t can also be inferred
Ø The adversary simply sets c_t(i_t) = 1 and c_t(i) = 0 for all i ≠ i_t
Ø The learner suffers total cost T
Ø The best action in hindsight has cost at most T/n (the T units of cost are spread across n actions, so some action accumulates at most T/n)

In fact, any deterministic algorithm suffers (linear) regret at least (n − 1)T/n. Can a randomized algorithm achieve sublinear regret?
Outline

Ø Online Learning/Optimization
Ø Measure Algorithm Performance via Regret
Ø Warm-up: A Simple Example
Ø Only two types of costs: c_t(i) ∈ {0, 1}
Ø One of the actions is perfect – it always has cost 0
Is it possible to achieve sublinear regret in this simpler setting?
Observations:

1. If an action ever had a non-zero cost, it is not perfect
2. Among the actions with all-zero costs so far, we do not really know how to distinguish them currently

These observations motivate the following natural algorithm (a code sketch follows below):

For t = 1, ⋯, T:
Ø Identify the set of actions with zero total cost so far, and pick one of them uniformly at random

Note: there is always at least one action to pick, since the perfect action is always a candidate
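Here is a minimal Python sketch of this algorithm (the function name is mine; it assumes binary costs and the existence of a perfect action):

```python
import random

def pick_among_zero_cost(cost_vectors):
    """Each round, pick uniformly among actions with zero total cost so far.

    cost_vectors: list of T lists, each with n entries in {0, 1};
    assumes at least one perfect (all-zero-cost) action exists.
    """
    n = len(cost_vectors[0])
    good = set(range(n))                       # actions with zero total cost so far
    total = 0.0
    for c in cost_vectors:
        i = random.choice(sorted(good))        # uniform pick among surviving actions
        total += c[i]
        good = {j for j in good if c[j] == 0}  # eliminate actions that just paid a cost
    return total  # equals the regret, since the perfect action has total cost 0
```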
Ø Fix a round t; we examine the expected cost from this round
Ø Let S_good = {actions with zero total cost before round t} and k = |S_good|
Ø For any parameter ε ∈ [0,1], one of the following two cases happens:
   • Case 1: at most εk actions from S_good have cost 1, in which case we suffer expected cost at most ε
   • Case 2: at least εk actions from S_good have cost 1, in which case we suffer expected cost at most 1 – but at least an ε fraction of S_good is eliminated
Ø How many times can Case 2 happen? At most log_{1/(1−ε)} n times: each occurrence shrinks S_good by a factor of at least (1 − ε), S_good starts at size n, and the perfect action never leaves it
Ø The total cost of the algorithm is therefore at most T × ε + log_{1/(1−ε)} n × 1
Ø The cost upper bound can be further bounded as follows:

   Total Cost ≤ T × ε + log_{1/(1−ε)} n × 1
             = Tε + ln n / (−ln(1 − ε))    (since log_b a = ln a / ln b)
             ≤ Tε + ln n / ε               (since −ln(1 − ε) ≥ ε, ∀ε ∈ (0,1))

Ø The above upper bound holds for any ε, so picking ε = √(ln n / T) we have

   R_T = Total Cost ≤ 2√(T ln n),

   which is sublinear in T. (Here the regret equals the total cost because the perfect action has total cost 0.)
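As a side calculation (mine, not on the slides), this choice of ε is exactly the one that minimizes the upper bound:

```latex
f(\varepsilon) = T\varepsilon + \frac{\ln n}{\varepsilon},
\qquad
f'(\varepsilon) = T - \frac{\ln n}{\varepsilon^{2}} = 0
\;\Longrightarrow\;
\varepsilon^{*} = \sqrt{\frac{\ln n}{T}},
\qquad
f(\varepsilon^{*}) = 2\sqrt{T \ln n}.
```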
The general setting:

Ø c_t ∈ [0,1]^n, and there is no perfect action
Ø The previous algorithm can be re-written in a more "mathematically beautiful" way, which turns out to generalize:

   Initialize weight w_1(i) = 1, ∀i = 1, ⋯, n
   For t = 1, ⋯, T:
   1. Let W_t = ∑_{i∈[n]} w_t(i); pick action i with probability w_t(i)/W_t
   2. Observe cost vector c_t ∈ {0,1}^n
   3. Update w_{t+1}(i) = w_t(i) ⋅ (1 − c_t(i))

Ø With binary costs this is exactly "pick uniformly among zero-total-cost actions": an action's weight drops to 0 the moment it incurs cost 1
Ø For general costs c_t ∈ [0,1]^n the weight update process is still okay, but we must be more conservative when eliminating actions: change the update in step 3 to

   3'. Update w_{t+1}(i) = w_t(i) ⋅ (1 − ε ⋅ c_t(i))

This is the Multiplicative Weight Update (MWU) algorithm. (A code sketch follows below.)
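A compact Python sketch of MWU as stated above (function and variable names are mine; `random.choices` handles the normalization by W_t):

```python
import math
import random

def mwu(cost_vectors, epsilon):
    """Multiplicative Weight Update over a sequence of cost vectors.

    cost_vectors: list of T lists, each with n entries in [0, 1]
    epsilon:      step-size parameter in (0, 1), e.g. sqrt(ln(n) / T)
    Returns the realized total cost.
    """
    n = len(cost_vectors[0])
    w = [1.0] * n                                   # w_1(i) = 1 for all i
    total = 0.0
    for c in cost_vectors:
        i = random.choices(range(n), weights=w)[0]  # pick i with prob. w_t(i)/W_t
        total += c[i]
        w = [w[j] * (1.0 - epsilon * c[j]) for j in range(n)]  # step 3'
    return total

# Typical usage with n actions and horizon T:
#   total = mwu(costs, math.sqrt(math.log(n) / T))
```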
Theorem. MWU (with an appropriately chosen ε) achieves regret at most O(√(T ln n)) for the previously described general setting.

Ø The proof of the theorem is left to the next lecture
Ø Note: we really care about theoretical bounds for online algorithms, since there is no good way to experimentally evaluate an algorithm against a worst-case adversary

Is O(√(T ln n)) the best possible regret? Next, we show √(T ln n) is tight.
Ø Consider any T ≈ ln(n − 1)
Ø We will construct a sequence of random costs such that there is a perfect action, yet any algorithm has expected cost T/2
Ø At each round, among the actions that have remained perfect so far, uniformly at random pick half of them to have cost 1; the remaining actions have cost 0
Ø Since T < ln(n), at least one action remains perfect at the end
Ø But any algorithm suffers expected cost 1/2 at each round (why?); the total expected cost is therefore T/2
Ø Costs here are stochastic, not adversarial? → It will be provably worse when costs become adversarial

This shows the (ln n) term in the regret bound is necessary.
Ø Now consider 2 actions only, still with stochastic costs
Ø For t = 1, ⋯, T, the cost vector c_t = (0,1) or (1,0), uniformly at random
Ø Any algorithm has a 50% chance of getting cost 1 at each round, and thus suffers total expected cost T/2
Ø What about the best action in hindsight?
   • The total cost of action 1 is c(1) = ∑_{t∈[T]} c_t(1), a sum of T bits drawn independently and uniformly at random, and c(2) = T − c(1)
   • So the best action in hindsight has expected cost 𝔼[min(c(1), c(2))] = T/2 − Θ(√T), a standard property of the binomial distribution

Hence any algorithm has expected regret Θ(√T): the (√T) term is necessary. (A quick Monte Carlo check follows below.)
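A quick Monte Carlo check of this construction (my own sketch, not from the slides); the gap T/2 − 𝔼[min(c(1), c(2))] is about √(T/2π) ≈ 0.4√T:

```python
import random

def best_in_hindsight_gap(T, trials=1000):
    """Estimate T/2 - E[best fixed action's cost] on the random 2-action instance."""
    gap = 0.0
    for _ in range(trials):
        c1 = sum(random.randint(0, 1) for _ in range(T))  # total cost of action 1
        gap += T / 2 - min(c1, T - c1)                    # c(2) = T - c(1)
    return gap / trials

print(best_in_hindsight_gap(10_000))  # ≈ 0.4 * sqrt(10000) = 40
```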
Haifeng Xu
University of Virginia
hx4ad@virginia.edu