
SLIDE 1

RegML 2016 Class 4 Regularization for multi-task learning

Lorenzo Rosasco UNIGE-MIT-IIT June 28, 2016

SLIDE 2

Supervised learning so far

◮ Regression $f : X \to Y \subseteq \mathbb{R}$
◮ Classification $f : X \to Y = \{-1, 1\}$

What next?

◮ Vector-valued $f : X \to Y \subseteq \mathbb{R}^T$
◮ Multiclass $f : X \to Y = \{1, 2, \dots, T\}$
◮ ...

L.Rosasco, RegML 2016 2

SLIDE 3

Multitask learning

Given $S_1 = (x_i^1, y_i^1)_{i=1}^{n_1}, \dots, S_T = (x_i^T, y_i^T)_{i=1}^{n_T}$

find $f_1 : X_1 \to Y_1, \dots, f_T : X_T \to Y_T$


SLIDE 4

Multitask learning

Given $S_1 = (x_i^1, y_i^1)_{i=1}^{n_1}, \dots, S_T = (x_i^T, y_i^T)_{i=1}^{n_T}$

find $f_1 : X_1 \to Y_1, \dots, f_T : X_T \to Y_T$

◮ vector valued regression, $S_n = (x_i, y_i)_{i=1}^{n}$, $x_i \in X$, $y_i \in \mathbb{R}^T$

MTL with equal inputs! Output coordinates are “tasks”

◮ multiclass, $S_n = (x_i, y_i)_{i=1}^{n}$, $x_i \in X$, $y_i \in \{1, \dots, T\}$


SLIDE 5

Why MTL?

[Figure: two panels, Task 1 and Task 2; axes X and Y]


SLIDE 6

Why MTL?

[Figure: four plot panels]

Real data!


SLIDE 7

Why MTL?

Examples of applications:

◮ geophysics
◮ music recommendation (Dinuzzo 08)
◮ pharmacological data (Pillonetto et al. 08)
◮ binding data (Jacob et al. 08)
◮ movie recommendation (Abernethy et al. 08)
◮ HIV Therapy Screening (Bickel et al. 08)


SLIDE 8

Why MTL?

VVR, e.g. vector fields estimation


SLIDE 9

Why MTL?

[Figure: two panels, Component 1 and Component 2; axes X and Y]


SLIDE 10

Penalized regularization for MTL

$$\mathrm{err}(w_1, \dots, w_T) + \mathrm{pen}(w_1, \dots, w_T)$$

We start with linear models $f_1(x) = w_1^\top x, \dots, f_T(x) = w_T^\top x$


SLIDE 11

Empirical error

$$E(w_1, \dots, w_T) = \sum_{i=1}^{T} \frac{1}{n_i} \sum_{j=1}^{n_i} (y_j^i - w_i^\top x_j^i)^2$$

◮ could consider other losses
◮ could try to “couple” errors


SLIDE 12

Least squares error

We focus on vector valued regression (VVR): $S_n = (x_i, y_i)_{i=1}^{n}$, $x_i \in X$, $y_i \in \mathbb{R}^T$


SLIDE 13

Least squares error

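We focus on vector valued regression (VVR): $S_n = (x_i, y_i)_{i=1}^{n}$, $x_i \in X$, $y_i \in \mathbb{R}^T$

$$\frac{1}{n} \sum_{t=1}^{T} \sum_{i=1}^{n} (y_i^t - w_t^\top x_i)^2 = \frac{1}{n} \|\hat X W - \hat Y\|_F^2$$

with $\hat X \in \mathbb{R}^{n \times d}$, $W \in \mathbb{R}^{d \times T}$, $\hat Y \in \mathbb{R}^{n \times T}$ and

$$\|W\|_F^2 = \mathrm{Tr}(W^\top W), \qquad W = (w_1, \dots, w_T), \qquad \hat Y_{it} = y_i^t, \quad i = 1, \dots, n, \quad t = 1, \dots, T$$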
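The identity between the double sum and the Frobenius norm can be checked numerically. The following is a minimal sketch, assuming NumPy and randomly generated data; the sizes and the array names `X`, `Y`, `W` are illustrative choices, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T = 50, 5, 3            # illustrative sizes (assumed)
X = rng.normal(size=(n, d))   # rows are the inputs x_i
Y = rng.normal(size=(n, T))   # Y[i, t] = y_i^t
W = rng.normal(size=(d, T))   # columns are the task vectors w_t

# Left-hand side: double sum over tasks and samples.
loss_sum = sum((Y[i, t] - W[:, t] @ X[i]) ** 2
               for t in range(T) for i in range(n)) / n

# Right-hand side: squared Frobenius norm of the residual matrix.
loss_fro = np.linalg.norm(X @ W - Y, "fro") ** 2 / n
```

Both quantities agree up to floating point error, which is why the matrix form can be used for all later computations.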


SLIDE 14

MTL by regularization

pen(w1 . . . wT )

◮ Coupling task solutions by regularization
◮ Borrowing strength
◮ Exploit structure


SLIDE 15

Regularizations for MTL

$$\mathrm{pen}(w_1, \dots, w_T) = \sum_{t=1}^{T} \|w_t\|^2$$


SLIDE 16

Regularizations for MTL

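$$\mathrm{pen}(w_1, \dots, w_T) = \sum_{t=1}^{T} \|w_t\|^2$$

Single tasks regularization!

$$\min_{w_1, \dots, w_T} \frac{1}{n} \sum_{t=1}^{T} \sum_{i=1}^{n} (y_i^t - w_t^\top x_i)^2 + \lambda \sum_{t=1}^{T} \|w_t\|^2 = \sum_{t=1}^{T} \Big( \min_{w_t} \frac{1}{n} \sum_{i=1}^{n} (y_i^t - w_t^\top x_i)^2 + \lambda \|w_t\|^2 \Big)$$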
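The decoupling can be illustrated numerically: solving the joint problem gives the same weights as solving one ridge regression per task. This is a minimal sketch on random data; the helper `ridge` and all sizes are assumptions made for illustration.

```python
import numpy as np

def ridge(X, y, lam):
    """Minimize (1/n)||X w - y||^2 + lam ||w||^2; y may have several columns."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
n, d, T, lam = 40, 5, 3, 0.1   # illustrative values (assumed)
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, T))

# Joint problem with the penalty sum_t ||w_t||^2: all tasks solved at once.
W_joint = ridge(X, Y, lam)

# T independent single-task problems, one per output coordinate.
W_split = np.column_stack([ridge(X, Y[:, t], lam) for t in range(T)])
```

The two weight matrices coincide, so this penalty shares no information across tasks.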


SLIDE 17

Regularizations for MTL

◮ Isotropic coupling

$$(1 - \alpha) \sum_{j=1}^{T} \|w_j\|^2 + \alpha \sum_{j=1}^{T} \Big\| w_j - \frac{1}{T} \sum_{i=1}^{T} w_i \Big\|^2$$


SLIDE 18

Regularizations for MTL

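◮ Isotropic coupling

$$(1 - \alpha) \sum_{j=1}^{T} \|w_j\|^2 + \alpha \sum_{j=1}^{T} \Big\| w_j - \frac{1}{T} \sum_{i=1}^{T} w_i \Big\|^2$$

◮ Graph coupling: let $M \in \mathbb{R}^{T \times T}$ be an adjacency matrix, with $M_{ts} \ge 0$

$$\sum_{t=1}^{T} \sum_{s=1}^{T} M_{ts} \|w_t - w_s\|^2 + \gamma \sum_{t=1}^{T} \|w_t\|^2$$

special case: outputs divided into clusters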


SLIDE 19

A general form of regularization

All the regularizers so far are of the form

$$\sum_{t=1}^{T} \sum_{s=1}^{T} A_{ts} \, w_t^\top w_s$$

for a suitable positive definite matrix $A$


SLIDE 20

MTL regularization revisited

◮ Single tasks $\sum_{j=1}^{T} \|w_j\|^2 \implies A = I$


SLIDE 21

MTL regularization revisited

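◮ Single tasks $\sum_{j=1}^{T} \|w_j\|^2 \implies A = I$

◮ Isotropic coupling

$$(1 - \alpha) \sum_{j=1}^{T} \|w_j\|^2 + \alpha \sum_{j=1}^{T} \Big\| w_j - \frac{1}{T} \sum_{i=1}^{T} w_i \Big\|^2 \implies A = I - \frac{\alpha}{T} \mathbf{1}$$

where $\mathbf{1}$ is the $T \times T$ matrix of all ones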


SLIDE 22

MTL regularization revisited

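◮ Single tasks $\sum_{j=1}^{T} \|w_j\|^2 \implies A = I$

◮ Isotropic coupling

$$(1 - \alpha) \sum_{j=1}^{T} \|w_j\|^2 + \alpha \sum_{j=1}^{T} \Big\| w_j - \frac{1}{T} \sum_{i=1}^{T} w_i \Big\|^2 \implies A = I - \frac{\alpha}{T} \mathbf{1}$$

where $\mathbf{1}$ is the $T \times T$ matrix of all ones

◮ Graph coupling

$$\sum_{t=1}^{T} \sum_{s=1}^{T} M_{ts} \|w_t - w_s\|^2 + \gamma \sum_{t=1}^{T} \|w_t\|^2 \implies A = L + \gamma I$$

where $L$ is the graph Laplacian of $M$: $L = D - M$, $D = \mathrm{diag}\big(\sum_j M_{1,j}, \dots, \sum_j M_{T,j}\big)$. For symmetric $M$ the double sum equals $2\,\mathrm{Tr}(W L W^\top)$, so the identification holds up to this constant factor on $L$.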
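The correspondence between each penalty and its matrix $A$ can be checked numerically. The sketch below assumes NumPy, random weights, and a random symmetric adjacency matrix; the helper `quad_penalty` is a hypothetical name, and the graph penalty is written with a factor 1/2 in front of the double sum so that it matches $L + \gamma I$ exactly.

```python
import numpy as np

def quad_penalty(W, A):
    """sum_{t,s} A[t, s] w_t^T w_s = Tr(W A W^T), with the columns of W as tasks."""
    return np.trace(W @ A @ W.T)

rng = np.random.default_rng(0)
d, T = 4, 5                     # illustrative sizes (assumed)
alpha, gamma = 0.3, 0.1         # illustrative parameters (assumed)
W = rng.normal(size=(d, T))

# Isotropic coupling: A = I - (alpha / T) * (all-ones matrix).
w_bar = W.mean(axis=1, keepdims=True)
pen_iso = (1 - alpha) * np.sum(W ** 2) + alpha * np.sum((W - w_bar) ** 2)
A_iso = np.eye(T) - (alpha / T) * np.ones((T, T))

# Graph coupling with a symmetric, nonnegative adjacency matrix M.
M = rng.uniform(size=(T, T))
M = (M + M.T) / 2
np.fill_diagonal(M, 0.0)
L = np.diag(M.sum(axis=1)) - M  # graph Laplacian L = D - M

# The factor 0.5 counts each pair (t, s) once (assumed convention).
pen_graph = 0.5 * sum(M[t, s] * np.sum((W[:, t] - W[:, s]) ** 2)
                      for t in range(T) for s in range(T)) + gamma * np.sum(W ** 2)
A_graph = L + gamma * np.eye(T)
```

Here `pen_iso` matches `quad_penalty(W, A_iso)` and `pen_graph` matches `quad_penalty(W, A_graph)`.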


SLIDE 23

A general form of regularization

Let $W = (w_1, \dots, w_T)$, $A \in \mathbb{R}^{T \times T}$. Note that

$$\sum_{t=1}^{T} \sum_{s=1}^{T} A_{ts} \, w_t^\top w_s = \mathrm{Tr}(W A W^\top)$$


SLIDE 24

A general form of regularization

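Let $W = (w_1, \dots, w_T)$, $A \in \mathbb{R}^{T \times T}$. Note that

$$\sum_{t=1}^{T} \sum_{s=1}^{T} A_{ts} \, w_t^\top w_s = \mathrm{Tr}(W A W^\top)$$

Indeed, writing $W_i$ for the $i$-th row of $W$ (as a column vector),

$$\mathrm{Tr}(W A W^\top) = \sum_{i=1}^{d} W_i^\top A W_i = \sum_{i=1}^{d} \sum_{t,s=1}^{T} A_{ts} W_{it} W_{is} = \sum_{t,s=1}^{T} A_{ts} \sum_{i=1}^{d} W_{it} W_{is} = \sum_{t,s=1}^{T} A_{ts} \, w_t^\top w_s$$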


SLIDE 25

Computations

$$\frac{1}{n} \|\hat X W - \hat Y\|_F^2 + \lambda \mathrm{Tr}(W A W^\top)$$


SLIDE 26

Computations

$$\frac{1}{n} \|\hat X W - \hat Y\|_F^2 + \lambda \mathrm{Tr}(W A W^\top)$$

Consider the SVD $A = U \Sigma U^\top$, $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_T)$


SLIDE 27

Computations

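$$\frac{1}{n} \|\hat X W - \hat Y\|_F^2 + \lambda \mathrm{Tr}(W A W^\top)$$

Consider the SVD $A = U \Sigma U^\top$, $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_T)$

let $\tilde W = W U$, $\tilde Y = \hat Y U$

then we can rewrite the above problem as

$$\frac{1}{n} \|\hat X \tilde W - \tilde Y\|_F^2 + \lambda \mathrm{Tr}(\tilde W \Sigma \tilde W^\top)$$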


SLIDE 28

Computations (cont.)

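Finally, rewrite

$$\frac{1}{n} \|\hat X \tilde W - \tilde Y\|_F^2 + \lambda \mathrm{Tr}(\tilde W \Sigma \tilde W^\top)$$

as

$$\sum_{t=1}^{T} \Big( \frac{1}{n} \sum_{i=1}^{n} (\tilde y_i^t - \tilde w_t^\top x_i)^2 + \lambda \sigma_t \|\tilde w_t\|^2 \Big)$$

and recover $W = \tilde W U^\top$

Compare to single task regularization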
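The change of coordinates suggests a direct solver: rotate the outputs, solve $T$ independent ridge problems with parameters $\lambda \sigma_t$, and rotate back. The following is a minimal sketch of that idea, assuming NumPy, a symmetric positive definite `A`, and random data; the function name `solve_mtl` is hypothetical.

```python
import numpy as np

def solve_mtl(X, Y, A, lam):
    """Minimize (1/n)||X W - Y||_F^2 + lam * Tr(W A W^T) for symmetric positive definite A."""
    n, d = X.shape
    sigma, U = np.linalg.eigh(A)      # A = U diag(sigma) U^T
    Y_tilde = Y @ U                   # rotated outputs
    G = X.T @ X
    # One ridge problem per rotated task, with regularization lam * sigma_t.
    W_tilde = np.column_stack([
        np.linalg.solve(G + n * lam * s * np.eye(d), X.T @ Y_tilde[:, t])
        for t, s in enumerate(sigma)
    ])
    return W_tilde @ U.T              # back to the original coordinates

rng = np.random.default_rng(0)
n, d, T, lam = 60, 4, 3, 0.1          # illustrative values (assumed)
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, T))
A = np.eye(T) - (0.5 / T) * np.ones((T, T))   # isotropic coupling, alpha = 0.5 (assumed)

W = solve_mtl(X, Y, A, lam)
# Gradient of the objective at the computed solution; it should vanish.
grad = 2 / n * X.T @ (X @ W - Y) + 2 * lam * W @ A
```

The only difference from single task regularization is that each rotated task gets its own regularization level $\lambda \sigma_t$.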


SLIDE 29

Computations (cont.)

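$$E_\lambda(W) = \frac{1}{n} \|\hat X W - \hat Y\|_F^2 + \lambda \mathrm{Tr}(W A W^\top)$$

Alternatively

$$\nabla E_\lambda(W) = \frac{2}{n} \hat X^\top (\hat X W - \hat Y) + 2 \lambda W A, \qquad W_{t+1} = W_t - \gamma \nabla E_\lambda(W_t)$$

Trivially extends to other loss functions.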
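The gradient iteration can be sketched in a few lines. This is a minimal illustration, assuming NumPy, a symmetric `A`, random data, and a constant step size set to the inverse of a Lipschitz constant of the gradient; the step size rule and the function name are assumptions, not taken from the slides.

```python
import numpy as np

def mtl_gradient_descent(X, Y, A, lam, n_iter=500):
    """Iterate W <- W - step * grad E_lam(W), starting from W = 0."""
    n, d = X.shape
    # Assumed step size: 1 / (Lipschitz constant of the gradient).
    lipschitz = 2 * (np.linalg.norm(X, 2) ** 2 / n
                     + lam * np.linalg.eigvalsh(A).max())
    step = 1.0 / lipschitz
    W = np.zeros((d, Y.shape[1]))
    for _ in range(n_iter):
        grad = 2 / n * X.T @ (X @ W - Y) + 2 * lam * W @ A
        W = W - step * grad
    return W

rng = np.random.default_rng(0)
n, d, T, lam = 60, 4, 3, 0.1          # illustrative values (assumed)
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, T))
A = np.eye(T) - (0.5 / T) * np.ones((T, T))   # isotropic coupling, alpha = 0.5 (assumed)

W = mtl_gradient_descent(X, Y, A, lam)
grad_norm = np.linalg.norm(2 / n * X.T @ (X @ W - Y) + 2 * lam * W @ A)
```

For another differentiable loss, only the first term of `grad` would change.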


SLIDE 30

Beyond Linearity

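$$f_t(x) = w_t^\top \Phi(x), \qquad \Phi(x) = (\varphi_1(x), \dots, \varphi_p(x))$$

$$E_\lambda(W) = \frac{1}{n} \|\hat\Phi W - \hat Y\|_F^2 + \lambda \mathrm{Tr}(W A W^\top)$$

with $\hat\Phi$ the matrix with rows $\Phi(x_1), \dots, \Phi(x_n)$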


SLIDE 31

Nonparametrics and kernels

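$$f_t(x) = \sum_{i=1}^{n} K(x, x_i) C_{it}$$

with

$$C_{\ell+1} = C_\ell - \gamma \Big( \frac{2}{n} (\hat K C_\ell - \hat Y) + 2 \lambda C_\ell A \Big)$$

◮ $C_\ell \in \mathbb{R}^{n \times T}$
◮ $\hat K \in \mathbb{R}^{n \times n}$, $\hat K_{ij} = K(x_i, x_j)$
◮ $\hat Y \in \mathbb{R}^{n \times T}$, $\hat Y_{ij} = y_i^j$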
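The kernel iteration can be sketched as follows. This is a minimal illustration, assuming NumPy, a Gaussian kernel, random one-dimensional inputs, and a constant step size based on the largest eigenvalues of the kernel matrix and of `A`; the kernel choice, the step size, and all names are assumptions made for the example.

```python
import numpy as np

def gaussian_kernel(X1, X2, width=1.0):
    """Gaussian kernel matrix between the rows of X1 and the rows of X2."""
    sq = (np.sum(X1 ** 2, axis=1)[:, None]
          + np.sum(X2 ** 2, axis=1)[None, :] - 2 * X1 @ X2.T)
    return np.exp(-sq / (2 * width ** 2))

def kernel_mtl(K, Y, A, lam, n_iter=2000):
    """Iterate C <- C - step * ((2/n)(K C - Y) + 2 lam C A), starting from C = 0."""
    n = K.shape[0]
    # Assumed step size: inverse of a bound on the largest eigenvalue of the update operator.
    step = 1.0 / (2 * (np.linalg.eigvalsh(K).max() / n
                       + lam * np.linalg.eigvalsh(A).max()))
    C = np.zeros_like(Y, dtype=float)
    for _ in range(n_iter):
        C = C - step * (2 / n * (K @ C - Y) + 2 * lam * C @ A)
    return C

rng = np.random.default_rng(0)
n, T, lam = 30, 2, 0.1                 # illustrative values (assumed)
X = rng.uniform(-3, 3, size=(n, 1))
Y = np.column_stack([np.sin(X[:, 0]), np.cos(X[:, 0])]) + 0.1 * rng.normal(size=(n, T))
A = np.eye(T) - (0.5 / T) * np.ones((T, T))    # isotropic coupling, alpha = 0.5 (assumed)

K = gaussian_kernel(X, X)
C = kernel_mtl(K, Y, A, lam)

# Predictions at new points: f_t(x) = sum_i K(x, x_i) C[i, t].
X_new = np.linspace(-3, 3, 5)[:, None]
F_new = gaussian_kernel(X_new, X) @ C
```

At convergence the coefficients satisfy $\hat K C + n \lambda C A = \hat Y$, the fixed point of the iteration.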


SLIDE 32

Spectral filtering for MTL

Beyond penalization

$$\min_{W} \frac{1}{n} \|\hat X W - \hat Y\|_F^2 + \lambda \mathrm{Tr}(W A W^\top),$$

other forms of regularization can be considered

◮ projection
◮ early stopping


SLIDE 33

Multiclass and MTL

$Y = \{1, \dots, T\}$


SLIDE 34

From Multiclass to MTL

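Encoding: for $j = 1, \dots, T$, map $j \mapsto e_j$, the canonical vector of $\mathbb{R}^T$; the problem reduces to vector valued regression

Decoding: for $f(x) \in \mathbb{R}^T$,

$$f(x) \mapsto \operatorname*{argmax}_{t = 1, \dots, T} e_t^\top f(x) = \operatorname*{argmax}_{t = 1, \dots, T} f_t(x)$$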
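Encoding and decoding are short to write down in code. The sketch below assumes NumPy and class labels in $\{1, \dots, T\}$; the function names `encode` and `decode` are hypothetical.

```python
import numpy as np

def encode(labels, T):
    """Map each class j in {1, ..., T} to the canonical vector e_j (one row per label)."""
    return np.eye(T)[np.asarray(labels) - 1]

def decode(F):
    """Map each row f(x) in R^T to argmax_t f_t(x), returned as a label in {1, ..., T}."""
    return np.argmax(F, axis=1) + 1

labels = np.array([1, 3, 2, 3])
Y = encode(labels, T=3)                        # 4 x 3 matrix of canonical vectors
scores = np.array([[0.1, 0.7, 0.2]])           # a made-up vector valued prediction
predicted = decode(scores)
```

The encoded matrix `Y` can be used as the output matrix of a vector valued regression problem, and `decode` is applied to its predictions.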


SLIDE 35

Single MTL and OVA

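Write

$$\min_{W} \frac{1}{n} \|\hat X W - \hat Y\|_F^2 + \lambda \mathrm{Tr}(W W^\top)$$

as

$$\sum_{t=1}^{T} \min_{w_t} \frac{1}{n} \sum_{i=1}^{n} (w_t^\top x_i - y_i^t)^2 + \lambda \|w_t\|^2$$

This is known as one versus all (OVA)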


SLIDE 36

Beyond OVA

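Consider

$$\min_{W} \frac{1}{n} \|\hat X W - \hat Y\|_F^2 + \lambda \mathrm{Tr}(W A W^\top),$$

that is

$$\sum_{t=1}^{T} \min_{\tilde w_t} \Big( \frac{1}{n} \sum_{i=1}^{n} (\tilde y_i^t - \tilde w_t^\top x_i)^2 + \lambda \sigma_t \|\tilde w_t\|^2 \Big)$$

Class relatedness encoded in $A$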


SLIDE 37

Back to MTL

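$$\sum_{t=1}^{T} \frac{1}{n_t} \sum_{j=1}^{n_t} (y_j^t - w_t^\top x_j^t)^2$$

⇓

$$\|(\hat X W - \hat Y) \odot M\|_F^2, \qquad n = \sum_{t=1}^{T} n_t$$

with $\hat X \in \mathbb{R}^{n \times d}$, $W \in \mathbb{R}^{d \times T}$, $\hat Y \in \mathbb{R}^{n \times T}$, $M \in \mathbb{R}^{n \times T}$

◮ $\odot$ Hadamard product
◮ $M$ mask
◮ $\hat Y$ having one non-zero value for each row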
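The masked formulation can be checked numerically by stacking the per-task datasets. The following is a minimal sketch, assuming NumPy and random data; to reproduce the $1/n_t$ weights exactly, the mask entries for task $t$ are set to $1/\sqrt{n_t}$, which is an assumption of this example (a binary mask would give the unweighted sum).

```python
import numpy as np

rng = np.random.default_rng(0)
n_t = [3, 5, 4]                      # per-task sample sizes (assumed)
T, d = len(n_t), 2
n = sum(n_t)
Xs = [rng.normal(size=(m, d)) for m in n_t]   # inputs of each task
ys = [rng.normal(size=m) for m in n_t]        # outputs of each task
W = rng.normal(size=(d, T))

# Per-task form: sum_t (1/n_t) sum_j (y_j^t - w_t^T x_j^t)^2.
per_task = sum(np.sum((ys[t] - Xs[t] @ W[:, t]) ** 2) / n_t[t] for t in range(T))

# Stacked form: every row of Y and M has a single non-zero entry.
X = np.vstack(Xs)
Y = np.zeros((n, T))
M = np.zeros((n, T))
row = 0
for t, m in enumerate(n_t):
    Y[row:row + m, t] = ys[t]
    M[row:row + m, t] = 1.0 / np.sqrt(m)      # weighted mask (assumed convention)
    row += m

masked = np.linalg.norm((X @ W - Y) * M, "fro") ** 2
```

The mask zeroes out the predictions of task $t$ on inputs that belong to other tasks, so `masked` and `per_task` agree.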


SLIDE 38

Computations

$$\min_{W} \|(\hat X W - \hat Y) \odot M\|_F^2 + \lambda \mathrm{Tr}(W A W^\top)$$

◮ can be rewritten using tensor calculus
◮ computation for vector valued regression easily extended
◮ sparsity of $M$ can be exploited


SLIDE 39

From MTL to matrix completion

Special case: take $d = n$ and $\hat X = I$

$$\|(\hat X W - \hat Y) \odot M\|_F^2$$

⇓

$$\sum_{t=1}^{T} \sum_{i=1}^{n} (w_{it} - \bar y_{it})^2 M_{it}$$


SLIDE 40

Summary so far

A regularization framework for

◮ VVR
◮ Multiclass
◮ MTL
◮ Matrix completion

if the structure of the “tasks” is known. What if it is not?


SLIDE 41

The structure of MTL

Consider

$$\min_{W} \frac{1}{n} \|\hat X W - \hat Y\|_F^2 + \lambda \mathrm{Tr}(W A W^\top),$$

the matrix $A$ encodes structure. Can we learn it?


SLIDE 42

Learning structure of MTL

Consider

$$\min_{W, A} \frac{1}{n} \|\hat X W - \hat Y\|_F^2 + \lambda \mathrm{Tr}(W A W^\top) + \gamma \, \mathrm{pen}(A)$$

Estimate a positive definite matrix $A$ using a regularizer $\mathrm{pen}(A)$


SLIDE 43

Regularizers for MTL

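For example consider

$$\min_{W, A} \frac{1}{n} \|\hat X W - \hat Y\|_F^2 + \lambda \mathrm{Tr}(W A W^\top) + \gamma \mathrm{Tr}(A^{-2})$$

using the same change of coordinates as before we have

$$\min_{\tilde w_1, \dots, \tilde w_T, \, \sigma_1, \dots, \sigma_T} \sum_{t=1}^{T} \Big( \frac{1}{n} \sum_{i=1}^{n} (\tilde y_i^t - \tilde w_t^\top x_i)^2 + \lambda \sigma_t \|\tilde w_t\|^2 \Big) + \gamma \sum_{t=1}^{T} \frac{1}{\sigma_t^2}$$

we avoid each task having too little weight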


SLIDE 44

Alternating minimization

Solving

$$\min_{W, A} \frac{1}{n} \|\hat X W - \hat Y\|_F^2 + \lambda \mathrm{Tr}(W A W^\top) + \gamma \, \mathrm{pen}(A)$$


SLIDE 45

Alternating minimization

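Solving

$$\min_{W, A} \frac{1}{n} \|\hat X W - \hat Y\|_F^2 + \lambda \mathrm{Tr}(W A W^\top) + \gamma \, \mathrm{pen}(A)$$

◮ Fix $A = A_0$
◮ Compute $W_1$ solving $\min_{W} \frac{1}{n} \|\hat X W - \hat Y\|_F^2 + \lambda \mathrm{Tr}(W A_0 W^\top)$
◮ Compute $A_1$ solving $\min_{A} \lambda \mathrm{Tr}(W_1 A W_1^\top) + \gamma \, \mathrm{pen}(A)$
◮ Repeat...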
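The alternating scheme can be sketched for the example penalty $\mathrm{pen}(A) = \mathrm{Tr}(A^{-2})$. This is a minimal illustration under several assumptions not stated in the slides: NumPy, random data with $d \ge T$ so that $W^\top W$ is invertible, and an $A$-step obtained by setting the gradient $\lambda W^\top W - 2 \gamma A^{-3}$ to zero, which gives $A = (2\gamma/\lambda)^{1/3} (W^\top W)^{-1/3}$. All function names are hypothetical.

```python
import numpy as np

def solve_W(X, Y, A, lam):
    """W-step: minimize (1/n)||X W - Y||_F^2 + lam Tr(W A W^T) with A fixed."""
    n, d = X.shape
    sigma, U = np.linalg.eigh(A)
    Y_tilde = Y @ U
    G = X.T @ X
    W_tilde = np.column_stack([
        np.linalg.solve(G + n * lam * s * np.eye(d), X.T @ Y_tilde[:, t])
        for t, s in enumerate(sigma)
    ])
    return W_tilde @ U.T

def solve_A(W, lam, gam):
    """A-step: minimize lam Tr(W A W^T) + gam Tr(A^{-2}) over positive definite A.
    Assumes W^T W is invertible; uses the stationarity condition derived in the lead-in."""
    s, V = np.linalg.eigh(W.T @ W)
    return (V * (2 * gam / (lam * s)) ** (1 / 3)) @ V.T

def objective(X, Y, W, A, lam, gam):
    n = X.shape[0]
    inv_eigs = 1.0 / np.linalg.eigvalsh(A)
    return (np.linalg.norm(X @ W - Y) ** 2 / n
            + lam * np.trace(W @ A @ W.T) + gam * np.sum(inv_eigs ** 2))

rng = np.random.default_rng(0)
n, d, T, lam, gam = 80, 6, 3, 0.1, 0.1        # illustrative values (assumed)
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=(d, T)) + 0.1 * rng.normal(size=(n, T))

A = np.eye(T)                                  # A_0 = I (assumed initialization)
history = []
for _ in range(10):
    W = solve_W(X, Y, A, lam)                  # update W with A fixed
    history.append(objective(X, Y, W, A, lam, gam))
    A = solve_A(W, lam, gam)                   # update A with W fixed
    history.append(objective(X, Y, W, A, lam, gam))
```

Since each step minimizes the objective exactly over one block of variables, the recorded values in `history` do not increase.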


SLIDE 46

This class

◮ Why MTL?
◮ Regularization for MTL to exploit structure
◮ MTL and other problems
◮ Learning tasks AND their structure


SLIDE 47

Next class

Sparsity!


Why MTL?

Related problems:

◮ conjoint analysis
◮ transfer learning
◮ collaborative filtering
◮ co-kriging
