SLIDE 1 Output Kernel Learning Methods
Francesco Dinuzzo¹  Cheng Soon Ong²  Kenji Fukumizu³
¹ Max Planck Institute for Intelligent Systems, Tübingen, Germany
² NICTA, Melbourne
³ Institute of Statistical Mathematics, Japan
SLIDE 2
Part I Learning multiple tasks and their relationships
SLIDE 3 Multiple regression: population pharmacokinetics
[Figure: xenobiotic concentration in 27 human subjects after a bolus administration; Time (hours) on the horizontal axis, Concentration on the vertical axis.]
The response curves have similar shapes. However, there is macroscopic inter-individual variability.
SLIDE 4 Multiple regression: population pharmacokinetics
[Figure: six panels, Subject #1 through Subject #6, each plotting Concentration against Time (hours).]
Few data points per subject with sparse sampling.
SLIDE 5 Multiple regression: population pharmacokinetics
[Figure: six panels, Subject #1 through Subject #6, each plotting Concentration against Time (hours).]
Can we combine the datasets to better estimate all the curves?
SLIDE 6
Collaborative filtering and recommender systems
Data: collections of ratings assigned by several users to a set of items.
Problem: estimate the preferences of every user for all the items.
Preference profiles differ from user to user. However, similar users have similar preferences.
SLIDE 7
Collaborative filtering and recommender systems
Additional information:
- Data about the items.
- Data about the users (e.g. gender, age, occupation).
- Data about the ratings themselves (e.g. timestamp, tags).
Can we combine all these data to better estimate individual preferences?
SLIDE 8 Multi-task learning: dataset structure
[Figure: multi-task dataset; x on one axis, task index on the other.]
Sampling can be very sparse.
SLIDE 9 Multi-task learning: dataset structure
[Figure: multi-task dataset; x on one axis, task index on the other.]
Few samples per task...
SLIDE 10 Multi-task learning: dataset structure
[Figure: multi-task dataset; x on one axis, task index on the other.]
... but each sample is shared by many tasks.
SLIDE 11 Object recognition with structure discovery
1 Build an object classifier with good generalization performance.
2 Discover relationships between the different classes.
SLIDE 12 Organizing the classes in a graph structure
[Figure: graph whose nodes are object classes: baseball-bat, chopsticks, sword, tweezer, bear, chimp, gorilla, porcupine, boom-box, vcr, video-projector, bowling-ball, cd, frisbee, brain, mars, breadmaker, photocopier, toaster, buddha, eiffel-tower, butterfly, grasshopper, hibiscus, hummingbird, iris, cactus, fern, fireworks, grapes, calculator, laptop, steering-wheel, centipede, crab, scorpion, skunk, comet, galaxy, lightning, monitor, microwave, refrigerator, toad, minaret, rainbow, skyscraper, tower-pisa, windmill, golden-gate, light-house, goose, killer-whale, swan, helicopter, speed-boat, airplanes, motorbikes, mountain-bike, spaghetti, tennis-court, teepee.]
How can we generate such a graph automatically?
SLIDE 13
Part II Output Kernel Learning
SLIDES 14–15
Kernel-based multi-task learning

Multi-task supervised learning

Synthesizing multiple functions f_j : X → Y, j = 1, …, m, from multiple datasets of input-output pairs (x_{ij}, y_{ij}).

Multi-task kernels

For every pair of inputs (x_1, x_2) and every pair of task indices (i, j), specify a similarity value K((x_1, i), (x_2, j)). Equivalently, specify a matrix-valued function H such that

[H(x_1, x_2)]_{ij} = K((x_1, i), (x_2, j))
SLIDE 16
Decomposable kernels

K((x_1, i), (x_2, j)) = K_X(x_1, x_2) K_Y(i, j)

Matrix-valued kernel

H(x_1, x_2) = K_X(x_1, x_2) · L,   where L_{ij} = K_Y(i, j)

K_X is the input kernel. K_Y is the output kernel (equivalently, L ∈ S_+^m).
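To make the decomposable structure concrete, here is a minimal NumPy sketch (the Gaussian input kernel and the random toy data are illustrative assumptions, not from the slides): the joint Gram matrix over all (input, task) pairs is the Kronecker product of the input Gram matrix with the output kernel matrix L.

```python
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    """Illustrative choice of input kernel K_X (Gaussian RBF)."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

n, m = 5, 3                       # n inputs, m tasks
X = np.random.randn(n, 2)         # toy inputs
B = np.random.randn(m, m)
L = B @ B.T                       # any PSD matrix is a valid output kernel

KX = gaussian_kernel(X, X)        # n x n input Gram matrix
# Decomposable multi-task kernel:
#   K((x1, i), (x2, j)) = K_X(x1, x2) * L[i, j]
# so the joint Gram matrix over all (input, task) pairs is a Kronecker product.
K_joint = np.kron(KX, L)          # (n*m) x (n*m)
```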
SLIDES 17–22 Kernel-based regularization methods

Kernel-based regularization

min_{f ∈ H_L}  Σ_{j=1}^{m} Σ_{i=1}^{ℓ_j} V(y_{ij}, f_j(x_{ij})) + ‖f‖_{H_L}^2

Representer theorem

f_j(x) = Σ_{k=1}^{m} L_{jk} Σ_{i=1}^{ℓ_k} c_{ik} K_X(x_{ik}, x)

(a closed-form solution for the squared-loss case is sketched after the list below)

- How to choose the output kernel?
  - Independent single-task learning: L = I.
  - Pooled single-task learning: L = 1 (the all-ones matrix).
  - Design it using prior knowledge.
  - Learn it from the data.
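For the squared loss, the coefficients of the representer expansion have a closed form. Below is a minimal NumPy sketch under a simplifying assumption not made on the slides: every task is observed at the same n inputs, so the labels form a matrix Y ∈ R^{n×m}, and, writing λ for the regularization parameter, the optimality condition is the Sylvester-type equation K C L + λC = Y, which decouples in the joint eigenbasis of K and L.

```python
import numpy as np

def solve_inner_problem(K, L, Y, lam):
    """Solve K C L + lam * C = Y for the coefficient matrix C.

    Optimality condition of regularized least squares with the
    decomposable kernel K_X * L, assuming all m tasks are sampled
    at the same n inputs (a simplification for this sketch).
    """
    s, U = np.linalg.eigh(K)                  # K = U diag(s) U^T
    t, V = np.linalg.eigh(L)                  # L = V diag(t) V^T
    Y_rot = U.T @ Y @ V                       # rotate into the joint eigenbasis
    C_rot = Y_rot / (np.outer(s, t) + lam)    # the equation is diagonal there
    return U @ C_rot @ V.T

# Predictions at the training inputs are then Y_hat = K @ C @ L.
```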
SLIDES 23–26 Multiple Kernel Learning

Multiple Kernel Learning (MKL)

K = Σ_{k=1}^{N} d_k K_k,   d_k ≥ 0.

MKL with decomposable basis kernels

K((x_1, i), (x_2, j)) = Σ_{k=1}^{N} d_k K_X^k(x_1, x_2) K_Y^k(i, j)

MKL with decomposable basis kernels (common input kernel)

K((x_1, i), (x_2, j)) = K_X(x_1, x_2) Σ_{k=1}^{N} d_k K_Y^k(i, j)

1 The maximum number of kernels is limited by memory constraints (see the sketch below).
2 Specifying the dictionary of basis kernels requires domain knowledge.
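A small sketch of the two decomposable MKL constructions above (the kernel dictionaries are placeholders; in practice each basis kernel encodes domain knowledge, which is exactly point 2). Materializing all N Kronecker factors is what runs into the memory constraint of point 1.

```python
import numpy as np

def mkl_gram(KX_list, KY_list, d):
    """Combined multi-task Gram matrix  sum_k d_k * kron(K_X^k, K_Y^k).

    Every basis pair must be stored (or recomputed), which is why the
    dictionary size N is limited by memory in practice.
    """
    d = np.asarray(d, dtype=float)
    assert np.all(d >= 0), "MKL weights must be nonnegative"
    return sum(dk * np.kron(KXk, KYk)
               for dk, KXk, KYk in zip(d, KX_list, KY_list))

def mkl_output_kernel(KY_list, d):
    """With a common input kernel K_X, the combination collapses to a
    single decomposable kernel whose output matrix is L = sum_k d_k K_Y^k."""
    return sum(dk * KYk for dk, KYk in zip(d, KY_list))
```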
SLIDES 27–28 Output Kernel Learning

Optimization problem

min_{L ∈ S_+^m}  min_{f ∈ H_L} ( Σ_{j=1}^{m} Σ_{i=1}^{ℓ_j} V(y_{ij}, f_j(x_{ij})) + ‖f‖_{H_L}^2 + Ω(L) )

Examples (sketched in code below):
- Squared Frobenius norm: Ω(L) = ‖L‖_F^2.
- Sparsity-inducing regularizer: Ω(L) = ‖L‖_1.
- Low-rank inducing regularizer: Ω(L) = tr(L) + I(rank(L) ≤ p).
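The three regularizers in plain NumPy, as a quick reference (the rank indicator is encoded as an infinite penalty; p is the rank budget):

```python
import numpy as np

def omega_frobenius(L):
    return np.sum(L ** 2)            # squared Frobenius norm ||L||_F^2

def omega_sparse(L):
    return np.sum(np.abs(L))         # ||L||_1, encourages sparse task couplings

def omega_low_rank(L, p):
    """tr(L) plus the indicator of the rank constraint rank(L) <= p."""
    return np.trace(L) + (0.0 if np.linalg.matrix_rank(L) <= p else np.inf)
```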
SLIDES 29–34 Low-Rank Output Kernel Learning

Low-Rank OKL

min_{L ∈ S_+^{m,p}, C ∈ R^{ℓ×m}}  ‖W ⊙ (Y − K C L)‖_F^2 / (2λ) + tr(C^T K C L) / 2 + tr(L) / 2

where S_+^{m,p} denotes the positive semidefinite m×m matrices with rank at most p.

- A non-linear generalization of reduced-rank regression.
- One of the reformulations only requires storing low-rank matrices.

Unconstrained reformulation

min_{A ∈ R^{ℓ×p}, B ∈ R^{m×p}}  ‖W ⊙ (Y − K A B^T)‖_F^2 / (2λ) + tr(A^T K A) / 2 + ‖B‖_F^2 / 2

Current optimization strategy: block-coordinate descent + approximate Preconditioned Conjugate Gradient (PCG); a simplified sketch follows.
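A minimal NumPy sketch of the block-coordinate structure, under the simplifying assumption that all entries of Y are observed (W all ones), so both block updates have closed forms; the actual method uses approximate PCG solves precisely to handle missing entries and large problems.

```python
import numpy as np

def low_rank_okl(K, Y, lam, p, iters=50, seed=0):
    """Block-coordinate descent on the unconstrained reformulation

        min_{A,B}  ||Y - K A B^T||_F^2 / (2 lam)
                   + tr(A^T K A) / 2 + ||B||_F^2 / 2

    Sketch only: assumes W is all ones (every entry of Y observed),
    which makes each block update a closed-form linear solve.
    """
    n, m = Y.shape
    B = np.random.default_rng(seed).standard_normal((m, p))
    s, U = np.linalg.eigh(K)                    # reused by every A-update
    for _ in range(iters):
        # A-update: solve the Sylvester equation  K A (B^T B) + lam A = Y B
        t, V = np.linalg.eigh(B.T @ B)
        G = U.T @ (Y @ B) @ V
        A = U @ (G / (np.outer(s, t) + lam)) @ V.T
        # B-update: solve  B (A^T K^2 A + lam I) = Y^T K A
        KA = K @ A
        B = np.linalg.solve(KA.T @ KA + lam * np.eye(p), KA.T @ Y).T
    return A, B                                 # learned output kernel: L = B @ B.T
```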
- F. Dinuzzo and K. Fukumizu. Learning low-rank output kernels. ACML, 2011.
- F. Dinuzzo. Learning output kernels for multi-task problems. Neurocomputing, 2013.
SLIDE 35
Part III Experiments with OKL
SLIDES 36–38
MovieLens datasets

Table: MovieLens datasets: total number of users, movies, and ratings.

Dataset         Users    Movies   Ratings
MovieLens100K     943      1682      10^5
MovieLens1M      6040      3706      10^6
MovieLens10M    69878     10677      10^7

Input Kernel (similarity between movies)

K(x_1, x_2) = δ_K(x_1^{id}, x_2^{id}) + exp(−d_H(x_1^g, x_2^g))

(a code sketch of this kernel follows the list below)

Methods:
- Independent single-task learning.
- Pooled single-task learning.
- Regularized matrix factorization.
- Low-rank output kernel learning.
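A sketch of the movie kernel in Python, reading δ_K as a Kronecker delta on the movie identifier x^{id} and d_H as the Hamming distance between binary genre indicator vectors x^g; these readings of the slide's notation are our interpretation.

```python
import numpy as np

def movie_kernel(id1, genres1, id2, genres2):
    """Similarity between two movies: Kronecker delta on the identifier
    plus exp(-Hamming distance) between genre indicator vectors."""
    delta = 1.0 if id1 == id2 else 0.0
    d_hamming = int(np.sum(np.asarray(genres1) != np.asarray(genres2)))
    return delta + np.exp(-d_hamming)

# Example: two distinct movies sharing all but one genre flag -> exp(-1)
k = movie_kernel(1, [1, 0, 1, 0], 2, [1, 0, 0, 0])
```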
SLIDE 39 MovieLens datasets: results

Setup: for each user, 50% of the ratings are used for training and the remaining 50% for testing.

Table: MovieLens datasets: test RMSE.

Dataset         Pooled   Independent      RMF      OKL
MovieLens100K   1.0209        1.0445   1.0300   0.9557
MovieLens1M     0.9811        1.0297   0.9023   0.8945
MovieLens10M    0.9441        0.9721   0.8627   0.8501
SLIDE 40
Object recognition with structure discovery
- 257 classes, 30 training examples per class.
- Input Kernel: designed as in (Gehler and Nowozin, ICCV 2009).
- Output Kernel: learned using OKL.
SLIDE 41 Structure discovery: how was this graph obtained?
[Figure: the same class graph as on Slide 12; nodes are the object classes, edges link related classes.]
Edges correspond to highest entries of the output kernel matrix L.
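For instance, the edge list can be read off the learned L as below (a sketch; the number of edges and the restriction to the upper triangle are our choices, not specified on the slide):

```python
import numpy as np

def top_edges(L, class_names, n_edges=60):
    """Strongest off-diagonal entries of the learned output kernel L,
    returned as (class_i, class_j, weight) edges for graph drawing."""
    i, j = np.triu_indices(L.shape[0], k=1)   # upper triangle, no diagonal
    vals = L[i, j]
    top = np.argsort(vals)[::-1][:n_edges]    # largest entries first
    return [(class_names[i[k]], class_names[j[k]], vals[k]) for k in top]
```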
SLIDE 42
Conclusions
- Many machine learning problems are structured and multi-task.
- Solving multiple problems simultaneously can improve performance.
- Output Kernel Learning methods can solve multi-task problems and automatically reveal inter-task relationships.
- The optimization problems are non-convex, but have a special structure.
- Code available online.
SLIDE 43 References
- F. Dinuzzo. Learning output kernels for multi-task problems. Neurocomputing, 2013.
- F. Dinuzzo and K. Fukumizu. Learning low-rank output kernels. In Proceedings of the Asian Conference on Machine Learning (ACML), 2011.
- F. Dinuzzo, C. S. Ong, P. Gehler, and G. Pillonetto. Learning output kernels with block coordinate descent. In Proceedings of the International Conference on Machine Learning (ICML), 2011.