Ensemble Methods + Recommender Systems
10-601 Introduction to Machine Learning
Machine Learning Department, School of Computer Science, Carnegie Mellon University
Matt Gormley, Lecture 28, Apr. 29, 2019

Reminders: Homework 9: Learning
Q&A (review):
Q: How do we pick k?
A: Plot the objective function J(c, z) as a function of k and pick the value at the "elbow" of the curve.
Q: How do we improve performance given a nonconvex objective?
A: Run the algorithm multiple times and pick the run that gives the lowest training objective function value. The objective function is nonconvex, so we're just looking for the best local minimum.
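The two answers can be sketched together in a few lines; this is our own illustration (scikit-learn's KMeans and the synthetic data are assumptions, not from the slides):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.randn(500, 2)  # placeholder data

objectives = []
for k in range(1, 11):
    # n_init=10 runs k-means from 10 random initializations and keeps
    # the run with the lowest objective, i.e. the best local minimum
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    objectives.append(km.inertia_)  # J(c, z) at convergence

# Look for the "elbow": the k after which J stops dropping sharply.
for k, j in zip(range(1, 11), objectives):
    print(f"k={k}: J(c,z)={j:.1f}")
```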
Learning Paradigms: What data is available and when? What form of prediction?
Problem Formulation: What is the structure of our output prediction?
– boolean: Binary Classification
– categorical: Multiclass Classification
– ordinal: Ordinal Classification
– real: Regression
– Ranking
– multiple discrete: Structured Prediction
– multiple continuous (e.g. dynamical systems)
– both discrete & continuous (e.g. mixed graphical models)
Theoretical Foundations: What principles guide learning?
– probabilistic
– information theoretic
– evolutionary search
– ML as optimization
Facets of Building ML Systems: How to build systems that are robust, efficient, adaptive, effective?
1. Data prep
2. Model selection
3. Training (optimization / search)
4. Hyperparameter tuning on validation data
5. (Blind) assessment on test data
Big Ideas in ML: Which are the ideas driving development of the field?
Application Areas: Key challenges? NLP, Speech, Computer Vision, Robotics, Medicine, Search
The Netflix Prize: $1 million to the first system to perform 10% better than Netflix's existing system on 3 million held-out ratings.
Top performing systems were ensembles
Weighted Majority Algorithm
– Setting: given a pool of classifiers ("experts" you know nothing about), with examples arriving one at a time (an online learning setting)
– Goal: combine the predictions of the pool to make new predictions
– Algorithm (a runnable sketch follows below):
  – Initially weight all classifiers equally
  – Receive a training example and predict the (weighted) majority vote of the classifiers in the pool
  – Down-weight classifiers that contribute to a mistake by a factor of β
(Littlestone & Warmuth, 1994)
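A minimal sketch of the algorithm above, under our own naming (weighted_majority, stream); each classifier is assumed to be a function mapping an example to a label in {-1, +1}:

```python
def weighted_majority(classifiers, stream, beta=0.5):
    """classifiers: list of functions x -> {-1, +1};
    stream: iterable of (x, y) examples arriving online."""
    weights = [1.0] * len(classifiers)   # initially weight all equally
    for x, y in stream:
        preds = [h(x) for h in classifiers]
        # predict the weighted majority vote of the pool
        score = sum(w * p for w, p in zip(weights, preds))
        y_hat = 1 if score >= 0 else -1
        # down-weight classifiers that contributed to a mistake
        if y_hat != y:
            weights = [w * beta if p != y else w
                       for w, p in zip(weights, preds)]
        yield y_hat
```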
Weighted Majority Algorithm vs. AdaBoost
Weighted Majority Algorithm: an ensemble method that
– assumes the classifiers are learned ahead of time
– only learns a (majority vote) weight for each classifier
AdaBoost: a boosting method that simultaneously learns
– the classifiers themselves
– a (majority vote) weight for each classifier
AdaBoost toy example (slides from Schapire's NIPS tutorial):
– Setup: initial distribution D1 over the training points; weak classifiers = vertical or horizontal half-planes.
– Round 1: weak hypothesis h1 with error ε1 = 0.30 and weight α1 = 0.42; reweight the examples to get D2.
– Round 2: weak hypothesis h2 with error ε2 = 0.21 and weight α2 = 0.65; reweight to get D3.
– Round 3: weak hypothesis h3 with error ε3 = 0.14 and weight α3 = 0.92.
– Final hypothesis: H_final = sign(0.42·h1 + 0.65·h2 + 0.92·h3).
Given: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in X$, $y_i \in \{-1, +1\}$.
Initialize $D_1(i) = 1/m$.
For $t = 1, \ldots, T$:
– Train weak learner using distribution $D_t$.
– Get weak hypothesis $h_t : X \to \{-1, +1\}$ with error $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$.
– Choose $\alpha_t = \frac{1}{2} \ln \left( \frac{1 - \epsilon_t}{\epsilon_t} \right)$.
– Update:
$$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_i) = y_i \\ e^{\alpha_t} & \text{if } h_t(x_i) \neq y_i \end{cases} = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$$
where $Z_t$ is a normalization factor (chosen so that $D_{t+1}$ will be a distribution).
Output the final hypothesis: $H(x) = \mathrm{sign}\left( \sum_{t=1}^{T} \alpha_t h_t(x) \right)$.
Algorithm from (Freund & Schapire, 1999)
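A runnable sketch of this pseudocode, with depth-1 decision trees (stumps) from scikit-learn standing in as the weak learner; function names are ours, not Freund & Schapire's:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """X: (m, d) features; y: (m,) numpy array of labels in {-1, +1}."""
    m = len(y)
    D = np.full(m, 1.0 / m)                 # D_1(i) = 1/m
    hyps, alphas = [], []
    for _ in range(T):
        # train weak learner using distribution D_t
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = D[pred != y].sum()            # weighted error eps_t
        alpha = 0.5 * np.log((1 - eps) / eps)
        # D_{t+1}(i) = D_t(i) exp(-alpha_t y_i h_t(x_i)) / Z_t
        D = D * np.exp(-alpha * y * pred)
        D /= D.sum()                        # Z_t normalization
        hyps.append(h)
        alphas.append(alpha)
    return hyps, alphas

def adaboost_predict(hyps, alphas, X):
    # H(x) = sign(sum_t alpha_t h_t(x))
    return np.sign(sum(a * h.predict(X) for h, a in zip(hyps, alphas)))
```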
Figure from (Freund & Schapire, 1999)
[Figure 2: Error curves and the margin distribution graph for boosting C4.5 on the letter dataset as reported by Schapire et al. Left: the training and test error curves (lower and upper curves, respectively) of the combined classifier as a function of the number of rounds of boosting; the horizontal lines indicate the test error rate of the base classifier and the test error of the final combined classifier. Right: the cumulative distribution of margins of the training examples after 5, 100, and 1000 iterations, indicated by short-dashed, long-dashed (mostly hidden), and solid curves, respectively.]
Recommender Systems
– Content Filtering
– Collaborative Filtering (CF)
– CF: Neighborhood Methods
– CF: Latent Factor Methods
Matrix Factorization
– Background: Low-rank Factorizations
– Residual matrix
– Unconstrained Matrix Factorization
– Singular Value Decomposition (SVD)
– Non-negative Matrix Factorization
– Items: movies, songs, products, etc. (often many thousands)
– Users: watchers, listeners, purchasers, etc. (often many millions)
– Feedback: 5-star ratings, not-clicking "next", purchases, etc.
– Can represent ratings numerically as a user/item matrix
– Users only rate a small number of items (the matrix is sparse)
            Doctor Strange   Star Trek: Beyond   Zootopia
Alice             1                 –                5
Bob               3                 4                –
Charlie           3                 5                2
(– = unrated)
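The same toy matrix in code (a sketch; a real system would use a sparse format such as scipy.sparse rather than a dense array with NaNs):

```python
import numpy as np

# rows: Alice, Bob, Charlie
# columns: Doctor Strange, Star Trek: Beyond, Zootopia
R = np.array([[1.0, np.nan, 5.0],
              [3.0, 4.0, np.nan],
              [3.0, 5.0, 2.0]])
observed = ~np.isnan(R)   # mask of the entries users actually rated
```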
Content Filtering
– Example: Pandora's music recommendations (Music Genome Project)
– Con: requires side information about items (e.g. properties of a song)
– Pro: got a new item to add? No problem, just be sure to include the side information
Collaborative Filtering
– Example: Netflix movie recommendations
– Pro: does not require access to side information about items (e.g. does not need to know about movie genres)
– Con: does not work on new items that have no ratings
Everyday examples of collaborative filtering:
– Bestseller lists
– Top 40 music lists
– The "recent returns" shelf at the library
– Unmarked but well-used paths thru the woods
– The printer room at work
– "Read any good books lately?"
– …
Key insight: if Alice and Bob both like X and Alice likes Y, then Bob is more likely to like Y, especially (perhaps) if Bob knows Alice.
Slide from William Cohen
Figures from Koren et al. (2009)
Neighborhood methods. In the figure, assume that a green line indicates the movie was watched.
Algorithm:
1. Find neighbors based on similarity of movie preferences
2. Recommend movies that those neighbors watched (a sketch in code follows below)
Figures from Koren et al. (2009)
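A minimal user-user sketch of those two steps; the binary watch matrix, cosine similarity, and all names here are our assumptions, since the slide gives only the high-level algorithm:

```python
import numpy as np

def recommend(watched, user, k=2):
    """watched: (n_users, n_items) binary matrix; user: a row index."""
    norms = np.linalg.norm(watched, axis=1) + 1e-12
    sims = (watched @ watched[user]) / (norms * norms[user])  # cosine
    sims[user] = -np.inf                   # exclude the user themself
    neighbors = np.argsort(sims)[-k:]      # 1. find the k nearest neighbors
    # 2. recommend movies those neighbors watched that the user hasn't
    scores = watched[neighbors].sum(axis=0) * (watched[user] == 0)
    return np.argsort(scores)[::-1]        # items ranked by neighbor count
```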
Figures from Koren et al. (2009)
Latent factor methods
– Assume movies and users live in some low-dimensional space describing their properties
– Recommend a movie based on its proximity to the user in the latent space
– Example algorithm: Matrix Factorization
Question: Applied to the Netflix Prize problem, which of the following methods always requires side information about the users and movies? Select all that apply.
C. ensemble methods
E. neighborhood methods
F. recommender systems
1. Unconstrained Matrix Factorization
2. Singular Value Decomposition
3. Non-negative Matrix Factorization
For each, the usual recipe applies: 1. define a model, 2. define an objective function, 3. optimize the objective (e.g. with SGD)
[Figures from Aggarwal (2016): (a) Example of a rank-2 matrix factorization R ≈ U Vᵀ. A 7-user × 6-movie ratings matrix R (movies: Nero, Julius Caesar, Cleopatra, Sleepless in Seattle, Pretty Woman, Casablanca) factors into user loadings U and movie loadings Vᵀ over two latent factors, History and Romance. (b) The residual matrix E.]
[Figures from Aggarwal (2016): Regression vs. collaborative filtering. Regression has a clear demarcation between training and test rows, and between independent and dependent variables; collaborative filtering has no demarcation between training and test rows, nor between dependent and independent variables.]
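A schematic version of the rank-2 idea in code (the 0/1 loadings below are our simplification, not Aggarwal's exact figure):

```python
import numpy as np

# U: 7 users x 2 factors (history, romance); user 4 likes both genres
U = np.array([[1, 0], [1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 1]])
# V: 6 movies x 2 factors (Nero ... Casablanca)
V = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])
R = U @ V.T
print(np.linalg.matrix_rank(R))   # 2: six movies described by two factors
```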
Unconstrained Matrix Factorization: minimize the squared error over the set Z of observed ratings,
$J(W, H) = \sum_{(u,i) \in Z} (v_{ui} - \mathbf{w}_u^\top \mathbf{h}_i)^2$
Figures from Koren et al. (2009) and Gemulla et al. (2011)
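Transcribing the objective directly (a sketch; the boolean mask `observed` plays the role of Z):

```python
import numpy as np

def mf_objective(V_ratings, W, H, observed):
    """V_ratings: (n_users, n_items); W: (n_users, r); H: (n_items, r)."""
    residuals = V_ratings - W @ H.T          # v_ui - w_u^T h_i for all (u, i)
    return np.sum(residuals[observed] ** 2)  # sum only over observed Z
```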
Adding L2 regularization to discourage overfitting:
$J(W, H) = \sum_{(u,i) \in Z} \left[ (v_{ui} - \mathbf{w}_u^\top \mathbf{h}_i)^2 + \lambda \left( \|\mathbf{h}_i\|^2 + \|\mathbf{w}_u\|^2 \right) \right]$
Figures from Koren et al. (2009) and Gemulla et al. (2011)
Figures from Koren et al. (2009) and Gemulla et al. (2011)
Matrix factorization as SGD: why does this work? Each update samples a single observed rating and takes a gradient step on that entry's local loss, scaled by the step size.
Figure from Gemulla et al. (2011)
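A minimal SGD sketch for the regularized objective; hyperparameter values are placeholders, and eta is the step size labeled in the figure:

```python
import random
import numpy as np

def mf_sgd(ratings, n_users, n_items, r=10, eta=0.01, lam=0.1, epochs=20):
    """ratings: list of observed (u, i, v_ui) triples."""
    rng = np.random.default_rng(0)
    W = 0.1 * rng.standard_normal((n_users, r))
    H = 0.1 * rng.standard_normal((n_items, r))
    for _ in range(epochs):
        random.shuffle(ratings)
        for u, i, v in ratings:
            err = v - W[u] @ H[i]           # residual on this one entry
            w_u = W[u].copy()               # keep old value for H's update
            # gradient steps on the local loss, scaled by step size eta
            W[u] += eta * (err * H[i] - lam * W[u])
            H[i] += eta * (err * w_u - lam * H[i])
    return W, H
```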
Example Factors
[Figure 3 from Koren et al. (2009): The first two vectors from a matrix decomposition of the Netflix Prize data, with selected movies (from Freddy Got Fingered to The Sound of Music) placed by their factor vectors in two dimensions (Factor vector 1 vs. Factor vector 2). The plot reveals distinct genres, including clusters of movies with strong female leads, fraternity humor, and quirky independent films.]
Comparison of Optimization Algorithms (ALS = alternating least squares)
Figure from Gemulla et al. (2011)
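For contrast with SGD, a minimal ALS sketch: with H held fixed, each w_u is a ridge-regression solve, and vice versa (our own serial version, not Gemulla et al.'s distributed variant):

```python
import numpy as np

def mf_als(R, observed, r=10, lam=0.1, iters=10):
    """R: (n, m) ratings (any value where unobserved); observed: bool mask."""
    n, m = R.shape
    rng = np.random.default_rng(0)
    W = rng.standard_normal((n, r))
    H = rng.standard_normal((m, r))
    reg = lam * np.eye(r)
    for _ in range(iters):
        for u in range(n):                  # least squares for each user
            idx = observed[u]
            W[u] = np.linalg.solve(H[idx].T @ H[idx] + reg,
                                   H[idx].T @ R[u, idx])
        for i in range(m):                  # then for each item
            idx = observed[:, i]
            H[i] = np.linalg.solve(W[idx].T @ W[idx] + reg,
                                   W[idx].T @ R[idx, i])
    return W, H
```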
Theorem: If R is fully observed and we use no regularization, the optimal solution from SVD equals the optimal solution from Unconstrained MF.
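A quick numerical check of this setting (fully observed R, no regularization); by Eckart-Young, the truncated SVD attains the minimum squared error over rank-r factorizations:

```python
import numpy as np

R = np.random.randn(7, 6)                  # fully observed ratings
r = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_hat = (U[:, :r] * s[:r]) @ Vt[:r]        # best rank-r reconstruction
print(np.sum((R - R_hat) ** 2))            # equals the sum of the
print(np.sum(s[r:] ** 2))                  # squared discarded singular values
```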
Implicit feedback:
– In many settings, users don't have a way of expressing dislike for an item (e.g. can't provide negative ratings)
– The only mechanism for feedback is to "like" something
Examples:
– Facebook has a "Like" button, but no "Dislike" button
– Google's "+1" button
– Pinterest pins
– Purchasing an item on Amazon indicates a preference for it, but there are many reasons you might not purchase an item (besides dislike)
– Search engines collect click data but don't have a clear mechanism for observing dislike of a webpage
Examples from Aggarwal (2016)
Non-negative Matrix Factorization follows the same recipe: 1. define a model (R ≈ U Vᵀ with U, V ≥ 0), 2. define an objective function, 3. optimize the objective
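One standard way to carry out step 3 while keeping the factors nonnegative is Lee & Seung's multiplicative updates; a minimal sketch (assumes R is nonnegative and fully observed, e.g. counts of "likes"):

```python
import numpy as np

def nmf(R, r=10, iters=200, eps=1e-9):
    n, m = R.shape
    rng = np.random.default_rng(0)
    U = rng.random((n, r))                 # nonnegative initialization
    V = rng.random((m, r))
    for _ in range(iters):
        # multiplicative updates keep every entry of U and V >= 0
        U *= (R @ V) / (U @ (V.T @ V) + eps)
        V *= (R.T @ U) / (V @ (U.T @ U) + eps)
    return U, V                            # R ≈ U @ V.T with U, V >= 0
```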