Tensor Canonical Correlation Analysis and Its Applications - PowerPoint PPT Presentation



SLIDE 1

Tensor Canonical Correlation Analysis and Its Applications

Presenter: Yong LUO. This work was done while Yong LUO was a Research Fellow at Nanyang Technological University, Singapore.

SLIDE 2

Outline

  • Y. Luo, D. C. Tao, R. Kotagiri, C. Xu, and Y. G. Wen, “Tensor Canonical Correlation Analysis for Multi-view Dimension Reduction,” IEEE Transactions on Knowledge and Data Engineering (T-KDE), vol. 27, no. 11, pp. 3111-3124, 2015.
  • Y. Luo, Y. G. Wen, and D. C. Tao, “On Combining Side Information and Unlabeled Data for Heterogeneous Multi-task Metric Learning,” International Joint Conference on Artificial Intelligence (IJCAI), pp. 1809-1815, 2016.

SLIDE 3

Multi-view dimension reduction (MVDR)

  • Dimension reduction (DR)
  • Find a low-dimensional representation for high-dimensional data
  • Benefits: reduce the chance of over-fitting, reduce computational cost, etc.
  • Approaches: feature selection (IG, MI, sparse learning, etc.), feature transformation (PCA, LDA, LE, etc.)

SLIDE 4

MVDR

  • Real-world objects usually contain information from multiple sources, and different kinds of features can be extracted from them
  • Traditional DR methods cannot effectively handle multiple types of features

[Figure: naive fusion by feature concatenation]

SLIDE 5

MVDR

  • Multi-view learning
  • Learn to fuse multiple distinct feature representations
  • Families: weighted view combination, multi-view dimension reduction, view agreement exploration
  • Multi-view dimension reduction
  • Multi-view feature selection
  • Multi-view subspace learning: seek a low-dimensional common subspace to compactly represent the heterogeneous data; one of the most representative models is CCA

SLIDE 6

Canonical correlation analysis (CCA)

  • Objective of CCA
  • Correlation maximization on the common subspace (a small NumPy sketch follows the references below):

    \arg\max_{h_1, h_2} \rho = \mathrm{corr}(z_1, z_2) = \frac{h_1^T C_{12} h_2}{\sqrt{(h_1^T C_{11} h_1)(h_2^T C_{22} h_2)}},

    where the canonical variables are z_{1o} = y_{1o}^T h_1 and z_{2o} = y_{2o}^T h_2, and C_{pq} denotes the (cross-)covariance matrix of views p and q

  • H. Hotelling, “Relations between two sets of variates,” Biometrika, 1936.
  • D. P. Foster, et al., “Multi-view dimensionality reduction via canonical correlation analysis,” Tech. Rep., 2008.
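A minimal NumPy sketch of classical two-view CCA (my own illustration, not the authors' code): whiten each view's covariance and take the SVD of the whitened cross-covariance. The function name and the regularization value are assumptions for illustration.

    import numpy as np

    def cca(Y1, Y2, dim=2, reg=1e-3):
        """Y1: (d1, O), Y2: (d2, O) column-sample matrices; returns canonical vectors."""
        Y1 = Y1 - Y1.mean(axis=1, keepdims=True)
        Y2 = Y2 - Y2.mean(axis=1, keepdims=True)
        O = Y1.shape[1]
        C11 = Y1 @ Y1.T / O + reg * np.eye(Y1.shape[0])   # regularized view-1 covariance
        C22 = Y2 @ Y2.T / O + reg * np.eye(Y2.shape[0])   # regularized view-2 covariance
        C12 = Y1 @ Y2.T / O                               # cross-covariance
        W1 = np.linalg.inv(np.linalg.cholesky(C11)).T     # whitening: W1^T C11 W1 = I
        W2 = np.linalg.inv(np.linalg.cholesky(C22)).T
        U, s, Vt = np.linalg.svd(W1.T @ C12 @ W2)         # correlations in the whitened space
        h1 = W1 @ U[:, :dim]                              # canonical vectors of view 1
        h2 = W2 @ Vt.T[:, :dim]                           # canonical vectors of view 2
        return h1, h2, s[:dim]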
SLIDE 7

Generalizations of CCA to several views

  • CCA-MAXVAR
  • Generalizes CCA to N ≥ 2 views
  • z_n = Y_n^T h_n is the vector of canonical variables for the n'th view, and g is a centroid representation
  • Solutions can be obtained using the SVD of the Y_n

    \arg\min_{g, \{\beta_n\}, \{h_n\}} \frac{1}{N} \sum_{n=1}^{N} \| g - \beta_n z_n \|_2^2,
    \quad \text{s.t. } \| z_n \|_2 = 1, \; n = 1, \cdots, N

  • J. R. Kettenring, “Canonical analysis of several sets of variables,” Biometrika, 1971.
SLIDE 8

Generalizations of CCA to several views

  • CCA-LS
  • Equivalent to CCA-MAXVAR, but can be solved efficiently and adaptively based on LS regression (an objective sketch follows the reference below):

    \arg\min_{\{h_n\}} \frac{1}{2N(N-1)} \sum_{q \ne r}^{N} \| Y_q^T h_q - Y_r^T h_r \|_2^2,
    \quad \text{s.t. } \frac{1}{N} \sum_{n=1}^{N} h_n^T C_{nn} h_n = 1

  • J. Via et al., “A learning algorithm for adaptive canonical correlation analysis of several data sets,” Neural Networks, 2007.
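A tiny sketch (illustrative, not the authors' code) of the CCA-LS objective above: the average pairwise squared distance between the canonical variables of all views.

    import numpy as np

    def cca_ls_objective(Y_list, h_list):
        """Y_list: (d_n, O) view matrices; h_list: canonical vectors h_n."""
        N = len(Y_list)
        z = [Y.T @ h for Y, h in zip(Y_list, h_list)]     # canonical variables z_n = Y_n^T h_n
        total = sum(np.sum((z[q] - z[r]) ** 2)
                    for q in range(N) for r in range(N) if q != r)
        return total / (2 * N * (N - 1))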

SLIDE 9

The proposed TCCA framework

  • Main drawback of CCA-MAXVAR and CCA-LS
  • Only the statistics (correlation information) between pairs of features are explored, while the high-order statistics are ignored
  • Tensor CCA
  • Directly maximizes the high-order correlation between all views

[Figure: high-order tensor correlation among all the views y_1, y_2, y_3 vs. pairwise correlations between view pairs]

SLIDE 10

The proposed TCCA framework for MVDR

[Figure: the TCCA pipeline for MVDR — three views (e.g., LAB, WT, and SIFT features) Y_1, Y_2, Y_3 form the covariance tensor \mathcal{C}_{123}, which is approximated by a sum of rank-1 terms \sum_l \mu^l v_1^l \circ v_2^l \circ v_3^l; the resulting factor matrices V_1, V_2, V_3 map each view to its projected representation]

SLIDE 11

Tensor basics

  • A tensor is a multi-dimensional array; an order-N tensor generalizes scalars, vectors, and matrices

Scalar: order-0 tensor; vector: order-1 tensor; matrix: order-2 tensor; cube of values: order-3 tensor

SLIDE 12

Tensor basics

  • Tensor-matrix multiplication (a NumPy sketch is given below)
  • The n-mode product of a J_1 × J_2 × ⋯ × J_N tensor \mathcal{A} and a K_n × J_n matrix V is the tensor \mathcal{B} = \mathcal{A} \times_n V of size J_1 × ⋯ × J_{n-1} × K_n × J_{n+1} × ⋯ × J_N with elements

    \mathcal{B}(j_1, \cdots, j_{n-1}, k_n, j_{n+1}, \cdots, j_N) = \sum_{j_n=1}^{J_n} \mathcal{A}(j_1, j_2, \cdots, j_N) \, V(k_n, j_n)

  • The product of \mathcal{A} and a sequence of matrices \{V_n \in \mathbb{R}^{K_n \times J_n}\} is

    \mathcal{B} = \mathcal{A} \times_1 V_1 \times_2 V_2 \cdots \times_N V_N
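A small NumPy sketch (illustration only) of the n-mode product via the mode-n unfolding: move mode n to the front, multiply by V, and fold back. Modes are 0-based here, unlike the 1-based convention on the slide.

    import numpy as np

    def mode_n_product(A, V, n):
        """A: order-N ndarray, V: (K_n, J_n) matrix; returns A x_n V (n is 0-based)."""
        A_moved = np.moveaxis(A, n, 0)                    # bring mode n to the front
        A_unf = A_moved.reshape(A_moved.shape[0], -1)     # mode-n unfolding
        B_unf = V @ A_unf                                 # the actual multiplication
        B = B_unf.reshape((V.shape[0],) + A_moved.shape[1:])
        return np.moveaxis(B, 0, n)                       # move the new mode back

    # B = A x_1 V1 x_2 V2 x_3 V3 as a chain of mode products
    A = np.random.randn(4, 5, 6)
    V1, V2, V3 = np.random.randn(2, 4), np.random.randn(3, 5), np.random.randn(2, 6)
    B = mode_n_product(mode_n_product(mode_n_product(A, V1, 0), V2, 1), V3, 2)
    print(B.shape)   # (2, 3, 2)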

SLIDE 13

Tensor basics

  • Tensor-vector multiplication (sketched below)
  • The contracted n-mode product of \mathcal{A} and a J_n-vector v is the J_1 × ⋯ × J_{n-1} × J_{n+1} × ⋯ × J_N tensor \mathcal{B} = \mathcal{A} \bar{\times}_n v of order N − 1 with entries

    \mathcal{B}(j_1, \cdots, j_{n-1}, j_{n+1}, \cdots, j_N) = \sum_{j_n=1}^{J_n} \mathcal{A}(j_1, j_2, \cdots, j_N) \, v(j_n)

  • Tensor-tensor multiplication
  • Outer product, contracted product, inner product
  • Frobenius norm of a tensor:

    \| \mathcal{A} \|_F^2 = \langle \mathcal{A}, \mathcal{A} \rangle = \sum_{j_1=1}^{J_1} \sum_{j_2=1}^{J_2} \cdots \sum_{j_N=1}^{J_N} \mathcal{A}(j_1, j_2, \cdots, j_N)^2
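Minimal NumPy illustrations (my own) of the two operations above: contracting a mode against a vector, and the tensor Frobenius norm.

    import numpy as np

    def contracted_mode_product(A, v, n):
        """A \bar{x}_n v: sum mode n (0-based) of A against the J_n-vector v."""
        return np.tensordot(A, v, axes=([n], [0]))        # order drops by one

    A = np.random.randn(4, 5, 6)
    v = np.random.randn(5)
    print(contracted_mode_product(A, v, 1).shape)         # (4, 6)
    print(np.sum(A ** 2), np.linalg.norm(A) ** 2)         # squared Frobenius norm, two ways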

SLIDE 14

Tensor basics

  • Matricization
  • The mode-n matricization of \mathcal{A} is the J_n × (J_1 ⋯ J_{n-1} J_{n+1} ⋯ J_N) matrix A_{(n)}

[Figure: an order-3 tensor \mathcal{A} matricized along each mode (frontal and horizontal matricizing, row- and column-wise vectorizing), giving A_{(1)}, A_{(2)}, A_{(3)}]

SLIDE 15

Tensor basics

  • Matricization property (checked numerically below)
  • The n-mode multiplication \mathcal{B} = \mathcal{A} \times_n V can be carried out as a matrix multiplication by storing the tensors in matricized form, i.e., B_{(n)} = V A_{(n)}
  • A series of n-mode products can be expressed through Kronecker products:

    \mathcal{B} = \mathcal{A} \times_1 V_1 \times_2 V_2 \cdots \times_N V_N
    \;\Longleftrightarrow\;
    B_{(n)} = V_n A_{(n)} (V_{d_1} \otimes V_{d_2} \otimes \cdots \otimes V_{d_{N-1}})^T,

    where (d_1, d_2, \cdots, d_{N-1}) = (n+1, n+2, \cdots, N, 1, 2, \cdots, n-1) is a forward cyclic ordering of the tensor modes
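A quick numerical check (my own illustration; any fixed column ordering of the unfolding works for this single-mode identity) that B_(n) = V A_(n) when B = A x_n V, here with n = 2 (NumPy axis 1).

    import numpy as np

    def unfold(T, axis):
        """Mode matricization with the chosen mode indexing the rows."""
        return np.moveaxis(T, axis, 0).reshape(T.shape[axis], -1)

    A = np.random.randn(4, 5, 6)
    V = np.random.randn(3, 5)
    B = np.einsum('kj,ijl->ikl', V, A)                    # B = A x_2 V
    print(np.allclose(unfold(B, 1), V @ unfold(A, 1)))    # True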

SLIDE 16

TCCA formulation

  • Optimization problem
  • Maximize the correlation between the canonical variables z_n = Y_n^T h_n, n = 1, ⋯, N:

    \arg\max_{\{h_n\}} \rho = \mathrm{corr}(z_1, z_2, \cdots, z_N) = (z_1 \odot z_2 \odot \cdots \odot z_N)^T e,
    \quad \text{s.t. } z_n^T z_n = 1, \; n = 1, \cdots, N,

    where \odot is the element-wise product and e is the all-ones vector
  • Equivalent formulation
  • Covariance tensor (a NumPy sketch is given below): \mathcal{C}_{12 \cdots N} = \frac{1}{O} \sum_{o=1}^{O} y_{1o} \circ y_{2o} \circ \cdots \circ y_{No}

    \arg\max_{\{h_n\}} \rho = \mathcal{C}_{12 \cdots N} \bar{\times}_1 h_1^T \bar{\times}_2 h_2^T \cdots \bar{\times}_N h_N^T,
    \quad \text{s.t. } h_n^T (C_{nn} + \epsilon I) h_n = 1, \; n = 1, \cdots, N
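A small NumPy sketch (illustrative names and sizes) of the covariance tensor above for N = 3 centered views with O samples each.

    import numpy as np

    O = 100
    Y1, Y2, Y3 = np.random.randn(10, O), np.random.randn(8, O), np.random.randn(6, O)
    Y1, Y2, Y3 = [Y - Y.mean(axis=1, keepdims=True) for Y in (Y1, Y2, Y3)]

    # C123[i, j, k] = (1/O) * sum_o y_{1o}[i] * y_{2o}[j] * y_{3o}[k]
    C123 = np.einsum('io,jo,ko->ijk', Y1, Y2, Y3) / O
    print(C123.shape)   # (10, 8, 6)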

SLIDE 17

TCCA formulation

  • Reformulation
  • Let \mathcal{M} = \mathcal{C}_{12 \cdots N} \times_1 \tilde{C}_{11}^{-1/2} \times_2 \tilde{C}_{22}^{-1/2} \cdots \times_N \tilde{C}_{NN}^{-1/2} and v_n = \tilde{C}_{nn}^{1/2} h_n, where \tilde{C}_{nn} = C_{nn} + \epsilon I:

    \arg\max_{\{v_n\}} \rho = \mathcal{M} \bar{\times}_1 v_1^T \bar{\times}_2 v_2^T \cdots \bar{\times}_N v_N^T,
    \quad \text{s.t. } v_n^T v_n = 1, \; n = 1, \cdots, N

  • Main solution
  • If we define \hat{\mathcal{M}} = \rho \, v_1 \circ v_2 \circ \cdots \circ v_N, the problem becomes

    \arg\min_{\{v_n\}} \| \mathcal{M} - \hat{\mathcal{M}} \|_F^2   [Lathauwer et al., 2000a]

  • Solved by alternating least squares (ALS), the higher-order power method (HOPM), etc. (a rank-1 HOPM sketch follows the reference below)

  • L. De Lathauwer et al., “On the best rank-1 and rank-(R1, R2, ..., RN) approximation of higher-order tensors,” SIAM J. Matrix Anal. Appl., 2000.
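A compact higher-order power method sketch (my own illustration of the rank-1 step above, not the authors' implementation) for an order-3 tensor M.

    import numpy as np

    def hopm_rank1(M, iters=100):
        """Best rank-1 approximation M ≈ rho * v1 ∘ v2 ∘ v3 with unit-norm v_n."""
        v1 = np.random.randn(M.shape[0]); v1 /= np.linalg.norm(v1)
        v2 = np.random.randn(M.shape[1]); v2 /= np.linalg.norm(v2)
        v3 = np.random.randn(M.shape[2]); v3 /= np.linalg.norm(v3)
        for _ in range(iters):
            v1 = np.einsum('ijk,j,k->i', M, v2, v3); v1 /= np.linalg.norm(v1)
            v2 = np.einsum('ijk,i,k->j', M, v1, v3); v2 /= np.linalg.norm(v2)
            v3 = np.einsum('ijk,i,j->k', M, v1, v2); v3 /= np.linalg.norm(v3)
        rho = np.einsum('ijk,i,j,k->', M, v1, v2, v3)     # the attained correlation
        return rho, v1, v2, v3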

SLIDE 18

TCCA solution

  • Solutions
  • Remaining solutions: recursively maximize the same correlation as in the main TCCA problem
  • All solutions: the best sum-of-rank-1 approximation, i.e., the rank-s CP decomposition of \mathcal{M}:

    \hat{\mathcal{M}} \approx \sum_{l=1}^{s} \rho_l \, v_1^l \circ v_2^l \circ \cdots \circ v_N^l

  • Projected data (a projection sketch is given below):

    Z_n = Y_n^T \tilde{C}_{nn}^{-1/2} V_n, \quad V_n = [v_n^1, \cdots, v_n^s]
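A sketch of the projection step above (illustrative, not the authors' code): map a view's data to its s-dimensional TCCA representation Z_n = Y_n^T * Ctilde_nn^{-1/2} * V_n.

    import numpy as np

    def inv_sqrt_psd(C):
        """C^{-1/2} for a symmetric positive-definite matrix via eigendecomposition."""
        w, Q = np.linalg.eigh(C)
        return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

    def project_view(Yn, Vn, eps=1e-3):
        """Yn: (d_n, O) centered data of view n; Vn: (d_n, s) CP factors of that view."""
        O = Yn.shape[1]
        Cnn = Yn @ Yn.T / O + eps * np.eye(Yn.shape[0])   # regularized covariance Ctilde_nn
        return Yn.T @ inv_sqrt_psd(Cnn) @ Vn              # (O, s) projected representation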

SLIDE 19

KTCCA formulation

  • Non-linear extension
  • Non-linear feature mapping φ: φ(Y_n) = [φ(y_{n1}), φ(y_{n2}), ⋯, φ(y_{nO})]
  • Canonical variables: z_n = φ(Y_n)^T h_n
  • Representer theorem: h_n = φ(Y_n) α_n
  • Optimization problem (a kernel Gram-matrix sketch is given below):

    \arg\max_{\{\alpha_n\}} \rho = \mathcal{L}_{12 \cdots N} \bar{\times}_1 \alpha_1^T \bar{\times}_2 \alpha_2^T \cdots \bar{\times}_N \alpha_N^T,
    \quad \text{s.t. } \alpha_n^T (K_{nn}^2 + \epsilon K_{nn}) \alpha_n = 1, \; n = 1, \cdots, N,

    where K_{nn} = φ(Y_n)^T φ(Y_n) is the kernel Gram matrix of the n'th view, and the constraint matrix can be factorized as K_{nn}^2 + \epsilon K_{nn} = M_n^T M_n
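A small sketch (assuming a Gaussian/RBF kernel purely for illustration) of the Gram matrices K_nn = phi(Y_n)^T phi(Y_n) that the kernelized formulation operates on.

    import numpy as np

    def rbf_gram(Y, gamma=0.1):
        """Y: (d, O) column-sample matrix; returns the O x O kernel Gram matrix."""
        sq = np.sum(Y ** 2, axis=0)
        d2 = sq[:, None] + sq[None, :] - 2.0 * (Y.T @ Y)   # pairwise squared distances
        return np.exp(-gamma * d2)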

SLIDE 20

KTCCA solution

  • Reformulation
  • Let \mathcal{T} = \mathcal{L}_{12 \cdots N} \times_1 \tilde{K}_{11}^{-1/2} \times_2 \tilde{K}_{22}^{-1/2} \cdots \times_N \tilde{K}_{NN}^{-1/2} and c_n = \tilde{K}_{nn}^{1/2} \alpha_n, where \tilde{K}_{nn} = K_{nn}^2 + \epsilon K_{nn}:

    \arg\max_{\{c_n\}} \rho = \mathcal{T} \bar{\times}_1 c_1^T \bar{\times}_2 c_2^T \cdots \bar{\times}_N c_N^T,
    \quad \text{s.t. } c_n^T c_n = 1, \; n = 1, \cdots, N

  • Solved by ALS
  • Projected data:

    Z_n = K_{nn} M_n^{-1} C_n, \quad n = 1, \cdots, N, \quad C_n = [c_n^1, \cdots, c_n^s]

SLIDE 21

Experimental setup

  • Datasets
  • SecStr: protein secondary structure prediction
  • 84K instances, 100 used as labeled, plus an additional 1200K unlabeled
  • 3 views: attributes based on the left, middle, and right context generated from the sequence window of amino acids; each view is 105-D
  • Advertisement classification
  • 3279 instances, 100 used as labeled
  • 3 views: features based on the terms in the images (588-D), terms in the current URL (495-D), and terms in the anchor URL (472-D)
  • Web image annotation
  • 11189 images, {4, 6, 8} labeled instances for each of 10 concepts
  • 3 views: 500-D SIFT visual words, 144-D color, 128-D wavelet
  • Classifier: RLS and KNN
  • Evaluation criterion: prediction/classification/annotation accuracy

SLIDE 22

Experimental setup

  • Compared methods
  • BSF: best single-view feature
  • CAT: concatenation of the normalized features
  • FRAC: a recent multi-view feature selection algorithm
  • CCA: applied to each of the n(n−1)/2 subsets of two views
  • CCA (BST): the best subset
  • CCA (AVG): the average performance over all subsets
  • CCA-LS: the traditional generalization of CCA to several views
  • DSE: a popular unsupervised multi-view DR method
  • SSMVD: a recent unsupervised multi-view DR method
  • TCCA: the proposed method
SLIDE 23

Experimental results and analysis

  • Protein secondary structure prediction
  • Learning a common subspace > CAT > BSF
  • SSMVD and CCA-LS are comparable, as are DSE and CCA (BST)
  • TCCA is the best at most dimensionalities, and its performance does not decrease significantly when the dimensionality is high

[Figure: accuracy vs. subspace dimension with 84K and with 1.3M unlabeled samples]

SLIDE 24

Experimental results and analysis

  • Web image annotation
  • DSE is comparable to CCA (BST) and CCA (AVG)
  • TCCA > SSMVD, and is better than the other CCA-based methods
  • Non-linear > linear

[Figure: annotation accuracy of the linear and the non-linear (kernel) variants]

SLIDE 25

Conclusions and discussion

  • Conclusions
  • Finding a common subspace for all views using the CCA-based strategy is often better than simply concatenating all the features, especially when the feature dimension is high
  • Examining more statistics, which may require more unlabeled data to be utilized, often leads to better performance; by exploring the high-order statistics, the proposed TCCA outperforms the other methods
  • Discussion
  • Can the common subspace be used for knowledge transfer between different views?

SLIDE 26

Distance metric learning (DML)

  • Goal: learn an appropriate distance function over the input space that reflects the relationships between data points (a small sketch is given below)
  • Useful in many ML algorithms, e.g., clustering, classification, and information retrieval
  • Most common DML scheme: Mahalanobis metric learning, which amounts to learning a linear transformation
  • Non-linear and local DML are able to capture complex structure in the data

    d_B(y_j, y_k) = (y_j - y_k)^T B (y_j - y_k) = \| V^T y_j - V^T y_k \|_2^2, \quad B = V V^T
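A tiny sketch (illustration) of the Mahalanobis distance above with a low-rank metric B = V V^T, so that d_B(y_j, y_k) = ||V^T y_j - V^T y_k||_2^2.

    import numpy as np

    def mahalanobis_sq(yj, yk, V):
        """Squared Mahalanobis distance under B = V @ V.T, with V of size d x r."""
        diff = V.T @ (yj - yk)                 # transform the difference vector
        return float(diff @ diff)

    V = np.random.randn(5, 2)
    yj, yk = np.random.randn(5), np.random.randn(5)
    B = V @ V.T
    print(mahalanobis_sq(yj, yk, V), (yj - yk) @ B @ (yj - yk))   # the two forms agree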

SLIDE 27

Transfer DML (TDML)

  • Motivation
  • DML needs a large amount of side information to learn a robust distance metric
  • The training samples are insufficient in the task/domain of interest (target task/domain)
  • We have abundant labeled data in certain related, but different, tasks/domains (source tasks/domains)
  • Goal
  • Utilize the metrics obtained from the source tasks/domains to help metric learning in the target tasks/domains

SLIDE 28

Homogeneous TDML (HoTDML)

  • Data of the source domain and the target domain are drawn from different distributions (same feature space)
  • Examples [Pan and Yang, 2010]
  • Web document classification: university website -> new website
  • Indoor WiFi localization: WiFi signal-strength values change across time periods and devices
  • Sentiment classification: the distribution of reviews among different types of products can be very different
  • Challenge
  • How to utilize the source information appropriately given the different distributions, or how to find a subspace in which the distribution difference is reduced

  • S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE TKDE, 2010.
SLIDE 29

Heterogeneous TDML (HeTDML)

  • Data of the source domain and the target domain lie in different feature spaces, and may have different semantics
  • Examples
  • Multi-lingual document classification
  • Multi-view classification or retrieval, etc.
  • Challenge
  • How to find correspondences or common representations for the different domains

[Example: labeled reviews in English (abundant) and labeled reviews in Spanish (scarce); the task is to classify reviews in Spanish]

SLIDE 30

HeTDML existing solutions

  • Heterogeneous transfer learning (HTL) approaches usually transform the heterogeneous features into a common subspace, and the transformation can be used to derive a metric
  • Groups
  • Heterogeneous domain adaptation (HDA)
  • Improves the performance in the target domain
  • Most approaches only handle two domains
  • Heterogeneous multi-task learning (HMTL)
  • Improves the performance of all domains simultaneously
SLIDE 31

Heterogeneous multi-task metric learning (HMTML)

  • Limitations of existing HMTL approaches
  • Do not optimize w.r.t. the metric
  • Mainly focus on utilizing the side information
  • Can only explore the pairwise relationships between different domains; the high-order statistics that can only be obtained by simultaneously examining all domains are ignored
  • Our method
  • Handles an arbitrary number of domains, and directly optimizes w.r.t. the metrics
  • Makes use of large amounts of unlabeled data to build domain connections
  • Explores high-order statistics between all domains
SLIDE 32

HMTML framework

[Figure: the HMTML framework — unlabeled documents from each of the N domains (e.g., English and German) are mapped by the transformations V_1, ⋯, V_N into a common subspace, where a tensor-based correlation over the projected representations a_1, ⋯, a_N is maximized; the learned metrics are B_n = V_n V_n^T (and the optimal ones B_n^* = V_n^* V_n^{*T})]

SLIDE 33

HMTML formulation

  • Optimization problem
  • General formulation:

    \arg\min_{\{B_n\}} G(\{B_n\}) = \sum_{n=1}^{N} \Psi(B_n) + \gamma S(B_1, B_2, \cdots, B_N),
    \quad \text{s.t. } B_n \succeq 0, \; n = 1, 2, \cdots, N

  • \Psi(B_n) = \frac{2}{O_n (O_n - 1)} \sum_{j<k} \ell(B_n; y_{nj}, y_{nk}, z_{njk}) is the empirical loss w.r.t. B_n (an illustrative sketch follows)
  • S(B_1, B_2, \cdots, B_N) enforces information transfer across the different domains
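An illustrative sketch of the empirical loss term Psi(B_n), taking the per-pair loss to be a hinge loss on the Mahalanobis margin (an assumption; the paper's exact loss may differ). Here z_jk = +1 for similar pairs and -1 for dissimilar pairs.

    import numpy as np

    def empirical_loss(B, Y, pairs):
        """B: (d, d) metric; Y: (d, O) data; pairs: iterable of (j, k, z_jk) with j < k."""
        O = Y.shape[1]
        total = 0.0
        for j, k, z in pairs:
            diff = Y[:, j] - Y[:, k]
            d2 = diff @ B @ diff                       # squared Mahalanobis distance
            total += max(0.0, 1.0 - z * (1.0 - d2))    # hinge loss on the pair
        return 2.0 * total / (O * (O - 1))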

SLIDE 34

Knowledge transfer by high-order correlation maximization

  • Main idea
  • Decompose B_n as B_n = V_n V_n^T
  • Use V_n to project the unlabeled data points of the different domains into a common subspace, where the correlation of all domains is maximized
  • Formulation

    \arg\max_{\{V_n\}} \frac{1}{O^U} \sum_{o=1}^{O^U} \mathrm{corr}(z_{1o}^U, z_{2o}^U, \cdots, z_{No}^U),

    where \mathrm{corr}(z_{1o}^U, z_{2o}^U, \cdots, z_{No}^U) = (z_{1o}^U \odot z_{2o}^U \odot \cdots \odot z_{No}^U)^T e is the correlation of the projected representations z_{no}^U = V_n^T y_{no}^U

SLIDE 35

Knowledge transfer by high-order correlation maximization

  • Reformulation

    \arg\max_{\{V_n\}} \frac{1}{O^U} \sum_{o=1}^{O^U} \mathrm{corr}(z_{1o}^U, \cdots, z_{No}^U)
    = \arg\max_{\{V_n\}} \frac{1}{O^U} \sum_{o=1}^{O^U} \mathcal{G} \bar{\times}_1 (y_{1o}^U)^T \cdots \bar{\times}_N (y_{No}^U)^T
    = \arg\min_{\{V_n\}} \frac{1}{O^U} \sum_{o=1}^{O^U} \| \mathcal{D}_o^U - \mathcal{G} \|_F^2,   [Luo et al., 2015; Lathauwer et al., 2000b]

    where \mathcal{G} = \mathcal{E}_s \times_1 V_1 \times_2 V_2 \cdots \times_N V_N is the covariance tensor of the mappings, and \mathcal{D}_o^U is the covariance tensor of the representations of the o'th unlabeled sample (a sketch follows the references below)

  • Y. Luo et al., “Tensor Canonical Correlation Analysis for Multi-view Dimension Reduction,” IEEE TKDE, 2015.
  • L. De Lathauwer et al., “A multilinear singular value decomposition,” SIAM J. Matrix Anal. Appl., 2000.
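A sketch (my own, under the assumption that E_s is the s × s × s identity tensor) of the transfer term above for N = 3 domains: build G = E_s x_1 V1 x_2 V2 x_3 V3 = sum_l v_1^l ∘ v_2^l ∘ v_3^l and compare it with the rank-1 tensor D_o formed from one unlabeled sample.

    import numpy as np

    s = 4
    V1, V2, V3 = np.random.randn(10, s), np.random.randn(8, s), np.random.randn(6, s)
    G = np.einsum('il,jl,kl->ijk', V1, V2, V3)     # E_s x_1 V1 x_2 V2 x_3 V3

    y1, y2, y3 = np.random.randn(10), np.random.randn(8), np.random.randn(6)
    D_o = np.einsum('i,j,k->ijk', y1, y2, y3)      # outer product y1 ∘ y2 ∘ y3
    print(np.sum((D_o - G) ** 2))                  # || D_o - G ||_F^2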
SLIDE 36

HMTML formulation

  • Specific optimization problem
  • Corresponds to finding a subspace where the representations of all domains are close to each other
  • Knowledge is transferred in this subspace, so the different domains can help each other in learning the mapping V_n, or equivalently the metric B_n

    \arg\min_{\{V_n\}} G(\{V_n\}) = \sum_{n=1}^{N} \frac{1}{O_n'} \sum_{l=1}^{O_n'} h\!\left( z_{nl} \left( 1 - \boldsymbol{\epsilon}_{nl}^T V_n V_n^T \boldsymbol{\epsilon}_{nl} \right) \right)
    + \frac{\gamma}{O^U} \sum_{o=1}^{O^U} \| \mathcal{D}_o^U - \mathcal{G} \|_F^2
    + \sum_{n=1}^{N} \gamma_n \| V_n \|_1,

    where \boldsymbol{\epsilon}_{nl} is the feature-difference vector of the l'th labeled pair in domain n, z_{nl} its similar/dissimilar label, and h(\cdot) the pairwise loss

SLIDE 37

HMTML solution

  • Rewrite \| \mathcal{D}_o^U - \mathcal{G} \|_F^2 as an expression w.r.t. V_n
  • Alternate over the V_n and solve each subproblem w.r.t. V_n by projected gradient descent

    \mathcal{G} = \mathcal{E}_s \times_1 V_1 \times_2 V_2 \cdots \times_N V_N = \mathcal{B} \times_n V_n,
    \quad \mathcal{B} = \mathcal{E}_s \times_1 V_1 \cdots \times_{n-1} V_{n-1} \times_{n+1} V_{n+1} \cdots \times_N V_N

    By the matricization property [Lathauwer et al., 2000a], G_{(n)} = V_n B_{(n)}, so

    \| \mathcal{D}_o^U - \mathcal{G} \|_F^2 = \| (D_o^U)_{(n)} - G_{(n)} \|_F^2 = \| (D_o^U)_{(n)} - V_n B_{(n)} \|_F^2

  • L. De Lathauwer et al., “On the best rank-1 and rank-(R1, R2, ..., RN) approximation of higher-order tensors,” SIAM J. Matrix Anal. Appl., 2000.

SLIDE 38

Experiments

  • Datasets and features
  • Reuters multilingual collection (RMLC)
  • 6 categories, 3 domains: English (EN), Italian (IT), Spanish (SP)
  • Number of documents: EN = 18758, IT = 24039, SP = 12342
  • TF-IDF features, PCA preprocessing to find comparable and high-level patterns for transfer
  • NUS-WIDE
  • Subset of 12 animal concepts, 16519 images + tags
  • {SIFT, wavelet, tag} features + PCA preprocessing; each representation is a domain
  • Evaluation criteria
  • Accuracy, MacroF1
SLIDE 39

Experiments

  • Compared methods
  • EU: Euclidean distance between samples based on their original feature representations
  • RDML: an efficient and competitive DML algorithm; does not make use of any additional information from other domains
  • DAMA: constructs mappings V_n to link multiple heterogeneous domains using manifold alignment
  • MTDA: the multi-task extension of linear discriminant analysis
  • HMTML: the proposed method
SLIDE 40

Experiments

  • Average performance of all domains w.r.t. the number of common factors
  • Although the labeled samples in each domain are scarce, learning the distance metric separately using RDML can still improve the performance significantly

SLIDE 41

Experiments

  • Average performance of all domains w.r.t. the number of common factors
  • All three heterogeneous transfer learning approaches achieve much better performance than RDML; this indicates that it is useful to leverage information from other domains in DML

SLIDE 42

Experiments

  • Average performance of all domains w.r.t. the number of common factors
  • HMTML outperforms both DAMA and MTDA at most numbers of common factors; this indicates that the factors learned by our method are more expressive than those of the other approaches

SLIDE 43

Experiments

  • Performance for individual domains
  • RDML improves the performance in each domain, and the improvements are similar across domains, since there is no communication between them

SLIDE 44

Experiments

  • Performance for individual domains
  • The transfer learning methods achieve much larger improvements than RDML in the domains where the discriminative ability of the original representations is not very good; this demonstrates that knowledge is successfully transferred between the different domains

SLIDE 45

Experiments

  • Performance for individual domains
  • The discriminative domain obtains little benefit from the other, relatively non-discriminative domains in DAMA and MTDA, while the proposed HMTML still achieves significant improvements

SLIDE 46

Conclusions

  • The labeled-data deficiency problem can be alleviated by learning metrics for multiple heterogeneous domains simultaneously
  • The shared knowledge of different domains exploited by the transfer learning methods can benefit each domain if appropriate common factors are discovered, and the high-order statistics (correlation information) are critical in discovering such factors

SLIDE 47

Thank You! Q & A