SLIDE 1 Output Kernel Learning Methods
Francesco Dinuzzo¹  Cheng Soon Ong²  Kenji Fukumizu³
¹ Max Planck Institute for Intelligent Systems, Tübingen, Germany
² NICTA, Melbourne
³ Institute of Statistical Mathematics, Japan
SLIDE 2
Part I Learning multiple tasks and their relationships
SLIDE 3 Multiple regression: population pharmacokinetics
[Figure: xenobiotic concentration in 27 human subjects after a bolus administration; Time (hours) on the horizontal axis, Concentration on the vertical axis.]
The response curves have similar shapes. However, there is macroscopic inter-individual variability.
SLIDE 4 Multiple regression: population pharmacokinetics
[Figure: six panels, Subject #1 through Subject #6, each plotting Concentration against Time (hours).]
Few data points per subject with sparse sampling.
SLIDE 5 Multiple regression: population pharmacokinetics
[Figure: six panels, Subject #1 through Subject #6, each plotting Concentration against Time (hours).]
Can we combine the datasets to better estimate all the curves?
SLIDE 6
Collaborative filtering and recommender systems
Data: collections of ratings assigned by several users to a set of items.
Problem: estimate the preferences of every user for all the items.
Preference profiles differ from user to user. However, similar users have similar preferences.
SLIDE 7
Collaborative filtering and recommender systems
Additional information:
- Data about the items.
- Data about the users (e.g. gender, age, occupation).
- Data about the ratings themselves (e.g. timestamp, tags).
Can we combine all these data to better estimate individual preferences?
SLIDE 8 Multi-task learning: dataset structure
[Figure: multi-task dataset; x on one axis, task index on the other.]
Sampling can be very sparse.
SLIDE 9 Multi-task learning: dataset structure
[Figure: multi-task dataset; x on one axis, task index on the other.]
Few samples per task...
SLIDE 10 Multi-task learning: dataset structure
[Figure: multi-task dataset; x on one axis, task index on the other.]
... but each sample is shared by many tasks.
SLIDE 11 Object recognition with structure discovery
1 Build an object classifier with good generalization performance.
2 Discover relationships between the different classes.
SLIDE 12 Organizing the classes in a graph structure
[Figure: graph whose nodes are object classes: baseball-bat, chopsticks, sword, tweezer, bear, chimp, gorilla, porcupine, boom-box, vcr, video-projector, bowling-ball, cd, frisbee, brain, mars, breadmaker, photocopier, toaster, buddha, eiffel-tower, butterfly, grasshopper, hibiscus, hummingbird, iris, cactus, fern, fireworks, grapes, calculator, laptop, steering-wheel, centipede, crab, scorpion, skunk, comet, galaxy, lightning, monitor, microwave, refrigerator, toad, minaret, rainbow, skyscraper, tower-pisa, windmill, golden-gate, light-house, goose, killer-whale, swan, helicopter, speed-boat, airplanes, motorbikes, mountain-bike, spaghetti, tennis-court, teepee.]
How can we generate such a graph automatically?
SLIDE 13
Part II Output Kernel Learning
SLIDES 14–15
Kernel-based multi-task learning

Multi-task supervised learning

Synthesizing multiple functions f_j : X → Y, j = 1, …, m, from multiple datasets of input-output pairs (x_{ij}, y_{ij}).

Multi-task kernels

For every pair of inputs (x_1, x_2) and every pair of task indices (i, j), specify a similarity value K((x_1, i), (x_2, j)). Equivalently, specify a matrix-valued function H such that

[H(x_1, x_2)]_{ij} = K((x_1, i), (x_2, j))
SLIDE 16
Decomposable kernels

K((x_1, i), (x_2, j)) = K_X(x_1, x_2) K_Y(i, j)

Matrix-valued kernel

H(x_1, x_2) = K_X(x_1, x_2) · L,   where L_{ij} = K_Y(i, j)

K_X is the input kernel. K_Y is the output kernel (equivalently, L ∈ S_+^m).
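To make the decomposable structure concrete, here is a minimal NumPy sketch (the Gaussian input kernel and the random toy data are illustrative assumptions, not from the slides): the joint Gram matrix over all (input, task) pairs is the Kronecker product of the input Gram matrix with the output kernel matrix L.

```python
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    """Illustrative choice of input kernel K_X (Gaussian RBF)."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

n, m = 5, 3                       # n inputs, m tasks
X = np.random.randn(n, 2)         # toy inputs
B = np.random.randn(m, m)
L = B @ B.T                       # any PSD matrix is a valid output kernel

KX = gaussian_kernel(X, X)        # n x n input Gram matrix
# Decomposable multi-task kernel:
#   K((x1, i), (x2, j)) = K_X(x1, x2) * L[i, j]
# so the joint Gram matrix over all (input, task) pairs is a Kronecker product.
K_joint = np.kron(KX, L)          # (n*m) x (n*m)
```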
SLIDES 17–22 Kernel-based regularization methods

Kernel-based regularization

min_{f ∈ H_L}  Σ_{j=1}^{m} Σ_{i=1}^{ℓ_j} V(y_{ij}, f_j(x_{ij})) + ‖f‖_{H_L}^2

Representer theorem

f_j(x) = Σ_{k=1}^{m} L_{jk} Σ_{i=1}^{ℓ_k} c_{ik} K_X(x_{ik}, x)

(a closed-form solution for the squared-loss case is sketched after the list below)

- How to choose the output kernel?
  - Independent single-task learning: L = I.
  - Pooled single-task learning: L = 1 (the all-ones matrix).
  - Design it using prior knowledge.
  - Learn it from the data.
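For the squared loss, the coefficients of the representer expansion have a closed form. Below is a minimal NumPy sketch under a simplifying assumption not made on the slides: every task is observed at the same n inputs, so the labels form a matrix Y ∈ R^{n×m}, and, writing λ for the regularization parameter, the optimality condition is the Sylvester-type equation K C L + λC = Y, which decouples in the joint eigenbasis of K and L.

```python
import numpy as np

def solve_inner_problem(K, L, Y, lam):
    """Solve K C L + lam * C = Y for the coefficient matrix C.

    Optimality condition of regularized least squares with the
    decomposable kernel K_X * L, assuming all m tasks are sampled
    at the same n inputs (a simplification for this sketch).
    """
    s, U = np.linalg.eigh(K)                  # K = U diag(s) U^T
    t, V = np.linalg.eigh(L)                  # L = V diag(t) V^T
    Y_rot = U.T @ Y @ V                       # rotate into the joint eigenbasis
    C_rot = Y_rot / (np.outer(s, t) + lam)    # the equation is diagonal there
    return U @ C_rot @ V.T

# Predictions at the training inputs are then Y_hat = K @ C @ L.
```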
SLIDES 23–26 Multiple Kernel Learning

Multiple Kernel Learning (MKL)

K = Σ_{k=1}^{N} d_k K_k,   d_k ≥ 0.

MKL with decomposable basis kernels

K((x_1, i), (x_2, j)) = Σ_{k=1}^{N} d_k K_X^k(x_1, x_2) K_Y^k(i, j)

MKL with decomposable basis kernels (common input kernel)

K((x_1, i), (x_2, j)) = K_X(x_1, x_2) Σ_{k=1}^{N} d_k K_Y^k(i, j)

1 The maximum number of kernels is limited by memory constraints (see the sketch below).
2 Specifying the dictionary of basis kernels requires domain knowledge.
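A small sketch of the two decomposable MKL constructions above (the kernel dictionaries are placeholders; in practice each basis kernel encodes domain knowledge, which is exactly point 2). Materializing all N Kronecker factors is what runs into the memory constraint of point 1.

```python
import numpy as np

def mkl_gram(KX_list, KY_list, d):
    """Combined multi-task Gram matrix  sum_k d_k * kron(K_X^k, K_Y^k).

    Every basis pair must be stored (or recomputed), which is why the
    dictionary size N is limited by memory in practice.
    """
    d = np.asarray(d, dtype=float)
    assert np.all(d >= 0), "MKL weights must be nonnegative"
    return sum(dk * np.kron(KXk, KYk)
               for dk, KXk, KYk in zip(d, KX_list, KY_list))

def mkl_output_kernel(KY_list, d):
    """With a common input kernel K_X, the combination collapses to a
    single decomposable kernel whose output matrix is L = sum_k d_k K_Y^k."""
    return sum(dk * KYk for dk, KYk in zip(d, KY_list))
```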
SLIDES 27–28 Output Kernel Learning

Optimization problem

min_{L ∈ S_+^m}  min_{f ∈ H_L} ( Σ_{j=1}^{m} Σ_{i=1}^{ℓ_j} V(y_{ij}, f_j(x_{ij})) + ‖f‖_{H_L}^2 + Ω(L) )

Examples (sketched in code below):
- Squared Frobenius norm: Ω(L) = ‖L‖_F^2.
- Sparsity-inducing regularizer: Ω(L) = ‖L‖_1.
- Low-rank inducing regularizer: Ω(L) = tr(L) + I(rank(L) ≤ p).
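The three regularizers in plain NumPy, as a quick reference (the rank indicator is encoded as an infinite penalty; p is the rank budget):

```python
import numpy as np

def omega_frobenius(L):
    return np.sum(L ** 2)            # squared Frobenius norm ||L||_F^2

def omega_sparse(L):
    return np.sum(np.abs(L))         # ||L||_1, encourages sparse task couplings

def omega_low_rank(L, p):
    """tr(L) plus the indicator of the rank constraint rank(L) <= p."""
    return np.trace(L) + (0.0 if np.linalg.matrix_rank(L) <= p else np.inf)
```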
SLIDES 29–34 Low-Rank Output Kernel Learning

Low-Rank OKL

min_{L ∈ S_+^{m,p}, C ∈ R^{ℓ×m}}  ‖W ⊙ (Y − K C L)‖_F^2 / (2λ) + tr(C^T K C L) / 2 + tr(L) / 2

where S_+^{m,p} denotes the positive semidefinite m×m matrices with rank at most p.

- A non-linear generalization of reduced-rank regression.
- One of the reformulations only requires storing low-rank matrices.

Unconstrained reformulation

min_{A ∈ R^{ℓ×p}, B ∈ R^{m×p}}  ‖W ⊙ (Y − K A B^T)‖_F^2 / (2λ) + tr(A^T K A) / 2 + ‖B‖_F^2 / 2

Current optimization strategy: block-coordinate descent + approximate Preconditioned Conjugate Gradient (PCG); a simplified sketch follows.
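A minimal NumPy sketch of the block-coordinate structure, under the simplifying assumption that all entries of Y are observed (W all ones), so both block updates have closed forms; the actual method uses approximate PCG solves precisely to handle missing entries and large problems.

```python
import numpy as np

def low_rank_okl(K, Y, lam, p, iters=50, seed=0):
    """Block-coordinate descent on the unconstrained reformulation

        min_{A,B}  ||Y - K A B^T||_F^2 / (2 lam)
                   + tr(A^T K A) / 2 + ||B||_F^2 / 2

    Sketch only: assumes W is all ones (every entry of Y observed),
    which makes each block update a closed-form linear solve.
    """
    n, m = Y.shape
    B = np.random.default_rng(seed).standard_normal((m, p))
    s, U = np.linalg.eigh(K)                    # reused by every A-update
    for _ in range(iters):
        # A-update: solve the Sylvester equation  K A (B^T B) + lam A = Y B
        t, V = np.linalg.eigh(B.T @ B)
        G = U.T @ (Y @ B) @ V
        A = U @ (G / (np.outer(s, t) + lam)) @ V.T
        # B-update: solve  B (A^T K^2 A + lam I) = Y^T K A
        KA = K @ A
        B = np.linalg.solve(KA.T @ KA + lam * np.eye(p), KA.T @ Y).T
    return A, B                                 # learned output kernel: L = B @ B.T
```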
- F. Dinuzzo and K. Fukumizu. Learning low-rank output kernels. ACML, 2011.
- F. Dinuzzo. Learning output kernels for multi-task problems. Neurocomputing, 2013.
SLIDE 35
Part III Experiments with OKL
SLIDES 36–38
MovieLens datasets

Table: MovieLens datasets: total number of users, movies, and ratings.

Dataset         Users    Movies   Ratings
MovieLens100K     943      1682      10^5
MovieLens1M      6040      3706      10^6
MovieLens10M    69878     10677      10^7

Input Kernel (similarity between movies)

K(x_1, x_2) = δ_K(x_1^{id}, x_2^{id}) + exp(−d_H(x_1^g, x_2^g))

(a code sketch of this kernel follows the list below)

Methods:
- Independent single-task learning.
- Pooled single-task learning.
- Regularized matrix factorization.
- Low-rank output kernel learning.
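A sketch of the movie kernel in Python, reading δ_K as a Kronecker delta on the movie identifier x^{id} and d_H as the Hamming distance between binary genre indicator vectors x^g; these readings of the slide's notation are our interpretation.

```python
import numpy as np

def movie_kernel(id1, genres1, id2, genres2):
    """Similarity between two movies: Kronecker delta on the identifier
    plus exp(-Hamming distance) between genre indicator vectors."""
    delta = 1.0 if id1 == id2 else 0.0
    d_hamming = int(np.sum(np.asarray(genres1) != np.asarray(genres2)))
    return delta + np.exp(-d_hamming)

# Example: two distinct movies sharing all but one genre flag -> exp(-1)
k = movie_kernel(1, [1, 0, 1, 0], 2, [1, 0, 0, 0])
```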
SLIDE 39 MovieLens datasets: results

Setup: for each user, 50% of the ratings are used for training and the remaining 50% for testing.

Table: MovieLens datasets: test RMSE.

Dataset         Pooled   Independent      RMF      OKL
MovieLens100K   1.0209        1.0445   1.0300   0.9557
MovieLens1M     0.9811        1.0297   0.9023   0.8945
MovieLens10M    0.9441        0.9721   0.8627   0.8501
SLIDE 40
Object recognition with structure discovery
- 257 classes, 30 training examples per class.
- Input Kernel: designed as in (Gehler and Nowozin, ICCV 2009).
- Output Kernel: learned using OKL.
SLIDE 41 Structure discovery: how was this graph obtained?
[Figure: the same class graph as on Slide 12; nodes are the object classes, edges link related classes.]
Edges correspond to highest entries of the output kernel matrix L.
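For instance, the edge list can be read off the learned L as below (a sketch; the number of edges and the restriction to the upper triangle are our choices, not specified on the slide):

```python
import numpy as np

def top_edges(L, class_names, n_edges=60):
    """Strongest off-diagonal entries of the learned output kernel L,
    returned as (class_i, class_j, weight) edges for graph drawing."""
    i, j = np.triu_indices(L.shape[0], k=1)   # upper triangle, no diagonal
    vals = L[i, j]
    top = np.argsort(vals)[::-1][:n_edges]    # largest entries first
    return [(class_names[i[k]], class_names[j[k]], vals[k]) for k in top]
```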
SLIDE 42
Conclusions
- Many machine learning problems are structured and multi-task.
- Solving multiple problems simultaneously can improve performance.
- Output Kernel Learning methods can solve multi-task problems and automatically reveal inter-task relationships.
- The optimization problems are non-convex, but have a special structure.
- Code available online.
SLIDE 43 References
- F. Dinuzzo. Learning output kernels for multi-task problems. Neurocomputing, 2013.
- F. Dinuzzo and K. Fukumizu. Learning low-rank output kernels. In Proceedings of the Asian Conference on Machine Learning (ACML), 2011.
- F. Dinuzzo, C. S. Ong, P. Gehler, and G. Pillonetto. Learning output kernels with block coordinate descent. In Proceedings of the International Conference on Machine Learning (ICML), 2011.