Tensor Canonical Correlation Analysis and Its Applications
Presenter: Yong LUO. This work was done when Yong LUO was a Research Fellow at Nanyang Technological University, Singapore.
Outline: two works are presented.
Y. Luo, D. Tao, R. Kotagiri, C. Xu, and Y. Wen, "Tensor Canonical Correlation Analysis for Multi-view Dimension Reduction," IEEE Transactions on Knowledge and Data Engineering (T-KDE), vol. 27, 2015.
Y. Luo, Y. Wen, and D. Tao, "On Combining Side Information and Unlabeled Data for Heterogeneous Multi-task Metric Learning," International Joint Conference on Artificial Intelligence (IJCAI), pp. 1809-1815, 2016.
High-dimensional data incur large storage and high computational cost, etc.
Dimension reduction: feature selection (filter, wrapper, etc.) or feature transformation (PCA, LDA, LE, etc.).
Multi-view data: samples are collected from multiple sources, and different kinds of features can be extracted, so each sample is described by multiple types of features.
Feature concatenation vs. multi-view dimension reduction, which exploits the agreement between views.
Goal: learn a common subspace to compactly represent the heterogeneous data. One of the most representative models: CCA.
CCA for two views:
$$\arg\max_{\mathbf{h}_1,\mathbf{h}_2}\ \rho=\operatorname{corr}(\mathbf{z}_1,\mathbf{z}_2)=\frac{\mathbf{h}_1^T D_{12}\mathbf{h}_2}{\sqrt{(\mathbf{h}_1^T D_{11}\mathbf{h}_1)(\mathbf{h}_2^T D_{22}\mathbf{h}_2)}},$$
where $z_{1o}=\mathbf{y}_{1o}^T\mathbf{h}_1$ and $z_{2o}=\mathbf{y}_{2o}^T\mathbf{h}_2$, so that $\mathbf{z}_n=Y_n^T\mathbf{h}_n$ is the vector of canonical variables for the $n$'th view, and $D_{qr}$ is the (cross-)covariance matrix of views $q$ and $r$.
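To make the closed-form solution concrete, here is a minimal NumPy sketch of two-view CCA via the SVD of the whitened cross-covariance (the solver choice and all names are ours, not from the slides):

```python
import numpy as np

def cca(Y1, Y2, reg=1e-6):
    """First canonical pair (h1, h2) and correlation rho.
    Y1, Y2: (d_n x O) data matrices with samples as columns."""
    O = Y1.shape[1]
    Y1 = Y1 - Y1.mean(axis=1, keepdims=True)          # center each view
    Y2 = Y2 - Y2.mean(axis=1, keepdims=True)
    D11 = Y1 @ Y1.T / O + reg * np.eye(Y1.shape[0])   # regularized auto-covariances
    D22 = Y2 @ Y2.T / O + reg * np.eye(Y2.shape[0])
    D12 = Y1 @ Y2.T / O                               # cross-covariance
    L1, L2 = np.linalg.cholesky(D11), np.linalg.cholesky(D22)
    W1, W2 = np.linalg.inv(L1), np.linalg.inv(L2)     # whitening transforms
    U, s, Vt = np.linalg.svd(W1 @ D12 @ W2.T)         # whitened cross-covariance
    h1, h2 = W1.T @ U[:, 0], W2.T @ Vt[0]             # unwhiten the top pair
    return h1, h2, s[0]
```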
For more than two views, a centroid representation $\mathbf{g}$ can be introduced:
$$\arg\min_{\mathbf{g},\,\boldsymbol\beta,\,\{\mathbf{h}_n\}}\ \frac{1}{N}\sum_{n=1}^{N}\big\|\mathbf{g}-\beta_n\mathbf{z}_n\big\|_2^2,\qquad\text{s.t. }\|\mathbf{z}_n\|_2=1,$$
where $\mathbf{z}_n=Y_n^T\mathbf{h}_n$ and $\mathbf{g}$ is the centroid that all views should agree on.
The view combination can also be determined adaptively based on least squares (LS) regression:
$$\arg\min_{\{\mathbf{h}_n\}}\ \frac{1}{2N(N-1)}\sum_{q,r=1}^{N}\big\|Y_q^T\mathbf{h}_q-Y_r^T\mathbf{h}_r\big\|_2^2,\qquad\text{s.t. }\sum_{n=1}^{N}\mathbf{h}_n^T D_{nn}\mathbf{h}_n=1.$$
[Vía et al., Neural Networks, 2007]
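The slides do not spell out a solver, but a standard one for this least-squares formulation is a generalized eigendecomposition of the stacked covariance matrices; a minimal NumPy/SciPy sketch (all names ours):

```python
import numpy as np
from scipy.linalg import eigh

def multiset_cca(views, reg=1e-6):
    """views: list of (d_n x O) centered data matrices. Returns the h_n."""
    O = views[0].shape[1]
    dims = [Y.shape[0] for Y in views]
    offs = np.cumsum([0] + dims)
    C = np.zeros((offs[-1], offs[-1]))        # all cross-covariances Y_q Y_r^T / O
    D = np.zeros((offs[-1], offs[-1]))        # block-diagonal of the D_nn
    for q, Yq in enumerate(views):
        for r, Yr in enumerate(views):
            C[offs[q]:offs[q+1], offs[r]:offs[r+1]] = Yq @ Yr.T / O
    for n, Yn in enumerate(views):
        D[offs[n]:offs[n+1], offs[n]:offs[n+1]] = (Yn @ Yn.T / O
                                                   + reg * np.eye(dims[n]))
    vals, vecs = eigh(C, D)                   # generalized symmetric eigenproblem
    h = vecs[:, -1]                           # top eigenvector maximizes h^T C h
    return [h[offs[n]:offs[n+1]] for n in range(len(views))]
```

Maximizing $\mathbf{h}^T C\mathbf{h}$ under $\mathbf{h}^T D\mathbf{h}=1$ is equivalent to the pairwise objective above, since the sum of pairwise distances expands into the (constant) constraint term minus the total cross-correlation.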
Limitation of existing multi-view CCA: only the correlation between pairs of views is explored, while high-order statistics are ignored.
[Figure: pairwise correlation (between view pairs $\mathbf{y}_1,\mathbf{y}_2,\mathbf{y}_3$) vs. high-order tensor correlation (among all views simultaneously)]
[Figure: TCCA overview. Three feature types (LAB, WT, SIFT) give views $Y_1,Y_2,Y_3$; their covariance tensor $\mathcal{D}_{123}$ is approximated by a sum of rank-1 tensors $\sum_{l=1}^{s}\mu^{(l)}\,\mathbf{v}_1^{(l)}\circ\mathbf{v}_2^{(l)}\circ\mathbf{v}_3^{(l)}$, whose factors define the mapping of each view into the common subspace]
Tensor basics: a scalar is an order-0 tensor, a vector is an order-1 tensor, a matrix is an order-2 tensor, and so on (e.g., an order-3 tensor is a data cube).
Mode-$n$ product: for a tensor $\mathcal{A}\in\mathbb{R}^{J_1\times\cdots\times J_N}$ and a $K_n\times J_n$ matrix $V$, $\mathcal{B}=\mathcal{A}\times_n V$ is a tensor of size $J_1\times\cdots\times J_{n-1}\times K_n\times J_{n+1}\times\cdots\times J_N$ with elements
$$\mathcal{B}(j_1,\dots,j_{n-1},k_n,j_{n+1},\dots,j_N)=\sum_{j_n=1}^{J_n}\mathcal{A}(j_1,j_2,\dots,j_N)\,V(k_n,j_n).$$
Multiple products are written $\mathcal{B}=\mathcal{A}\times_1 V_1\times_2 V_2\cdots\times_N V_N$.
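A minimal NumPy sketch of the mode-$n$ product (our implementation, not from the slides):

```python
import numpy as np

def mode_n_product(A, V, n):
    """B = A x_n V: contract dimension n of A with the columns of V."""
    B = np.tensordot(A, V, axes=(n, 1))    # contracted axis moves to the end
    return np.moveaxis(B, -1, n)           # put the new K_n axis back at mode n

# Example: a 3x4x5 tensor times a 7x4 matrix along mode 2 gives 3x7x5.
A = np.random.rand(3, 4, 5)
V = np.random.rand(7, 4)
print(mode_n_product(A, V, 1).shape)       # (3, 7, 5)
```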
Mode-$n$ product with a vector $\mathbf{v}\in\mathbb{R}^{J_n}$: $\mathcal{B}=\mathcal{A}\,\bar\times_n\,\mathbf{v}$ is a $J_1\times\cdots\times J_{n-1}\times J_{n+1}\times\cdots\times J_N$ tensor of order $N-1$ with entries
$$\mathcal{B}(j_1,\dots,j_{n-1},j_{n+1},\dots,j_N)=\sum_{j_n=1}^{J_n}\mathcal{A}(j_1,j_2,\dots,j_N)\,\mathbf{v}(j_n).$$
Frobenius norm:
$$\|\mathcal{A}\|_F^2=\langle\mathcal{A},\mathcal{A}\rangle=\sum_{j_1=1}^{J_1}\sum_{j_2=1}^{J_2}\cdots\sum_{j_N=1}^{J_N}\mathcal{A}(j_1,j_2,\dots,j_N)^2.$$
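A companion sketch for the mode-$n$ vector product and the Frobenius norm (again our own code):

```python
import numpy as np

def mode_n_vec_product(A, v, n):
    """B = A xbar_n v: contract mode n of A with vector v (length J_n)."""
    return np.tensordot(A, v, axes=(n, 0))

def frobenius_norm(A):
    return np.sqrt(np.sum(A ** 2))         # root of the sum of squared entries

A = np.random.rand(3, 4, 5)
v = np.random.rand(4)
print(mode_n_vec_product(A, v, 1).shape)                          # (3, 5)
print(np.isclose(frobenius_norm(A), np.linalg.norm(A.ravel())))   # True
```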
Mode-$n$ matricizing (unfolding) rearranges a tensor $\mathcal{B}$ into the $J_n\times(J_1\cdots J_{n-1}J_{n+1}\cdots J_N)$ matrix $B_{(n)}$.
[Figure: mode-1 and mode-2 matricizing (frontal and horizontal) of an order-3 tensor into $B_{(1)},B_{(2)},B_{(3)}$, together with row-wise and column-wise vectorizing]
Mode-$n$ products can then be manipulated as matrix multiplications by storing the tensors in matricized form, i.e., $\mathcal{C}=\mathcal{B}\times_n V\iff C_{(n)}=VB_{(n)}$. More generally, for $\mathcal{B}=\mathcal{A}\times_1 V_1\times_2 V_2\cdots\times_N V_N$,
$$B_{(n)}=V_n A_{(n)}\big(V_{d_1}\otimes V_{d_2}\otimes\cdots\otimes V_{d_{N-1}}\big)^T,$$
a series of Kronecker products, where $(d_1,d_2,\dots,d_{N-1})=(n+1,n+2,\dots,N,1,2,\dots,n-1)$ is a forward cyclic ordering of the tensor's mode indices.
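The identity can be checked numerically; the sketch below assumes a mode-$n$ unfolding whose columns follow the same forward cyclic ordering (implementation ours):

```python
import numpy as np

def unfold(A, n):
    """Mode-n matricizing with forward cyclic ordering of the other modes."""
    N = A.ndim
    order = [n] + [(n + k) % N for k in range(1, N)]
    return np.transpose(A, order).reshape(A.shape[n], -1)

A = np.random.rand(3, 4, 5)
Vs = [np.random.rand(6, 3), np.random.rand(7, 4), np.random.rand(8, 5)]
B = A
for n, V in enumerate(Vs):                   # B = A x_1 V1 x_2 V2 x_3 V3
    B = np.moveaxis(np.tensordot(B, V, axes=(n, 1)), -1, n)

n = 1                                        # check the identity at mode 2
others = [(n + k) % 3 for k in range(1, 3)]  # forward cyclic ordering: 3, 1
K = Vs[others[0]]
for d in others[1:]:
    K = np.kron(K, Vs[d])
print(np.allclose(unfold(B, n), Vs[n] @ unfold(A, n) @ K.T))  # True
```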
TCCA: let $\mathbf{z}_n=Y_n^T\mathbf{h}_n$, $n=1,\dots,N$, and define the covariance tensor
$$\mathcal{D}_{12\cdots N}=\frac{1}{O}\sum_{o=1}^{O}\mathbf{y}_{1o}\circ\mathbf{y}_{2o}\circ\cdots\circ\mathbf{y}_{No}.$$
TCCA directly maximizes the high-order correlation of all views:
$$\arg\max_{\{\mathbf{h}_n\}}\ \rho=\operatorname{corr}(\mathbf{z}_1,\mathbf{z}_2,\dots,\mathbf{z}_N)=(\mathbf{z}_1\odot\mathbf{z}_2\odot\cdots\odot\mathbf{z}_N)^T\mathbf{f},\qquad\text{s.t. }\mathbf{z}_n^T\mathbf{z}_n=1,\ n=1,\dots,N,$$
where $\odot$ is the element-wise product and $\mathbf{f}$ is the all-ones vector, so $\rho$ sums the products of the canonical variables over all samples.
Equivalently, in tensor form:
$$\arg\max_{\{\mathbf{h}_n\}}\ \rho=\mathcal{D}_{12\cdots N}\,\bar\times_1\,\mathbf{h}_1^T\,\bar\times_2\,\mathbf{h}_2^T\cdots\bar\times_N\,\mathbf{h}_N^T,\qquad\text{s.t. }\mathbf{h}_n^T(D_{nn}+\vartheta I)\mathbf{h}_n=1,\ n=1,\dots,N,$$
where $\vartheta I$ is a regularization term.
Define
$$\mathcal{M}=\mathcal{D}_{12\cdots N}\times_1\tilde D_{11}^{-1/2}\times_2\tilde D_{22}^{-1/2}\cdots\times_N\tilde D_{NN}^{-1/2},\qquad\mathbf{v}_n=\tilde D_{nn}^{1/2}\mathbf{h}_n,\qquad\text{where }\tilde D_{nn}=D_{nn}+\vartheta I.$$
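A sketch of these two preprocessing steps for three views (the regularization value and all names are ours):

```python
import numpy as np
from scipy.linalg import sqrtm, inv

def tcca_setup(views, theta=1e-4):
    """views: list of three centered (d_n x O) matrices.
    Returns M and the whiteners W_n = (D_nn + theta*I)^(-1/2)."""
    O = views[0].shape[1]
    # Covariance tensor D_123 = (1/O) sum_o y_1o o y_2o o y_3o
    M = np.einsum('io,jo,ko->ijk', *views) / O
    whiteners = []
    for n, Yn in enumerate(views):
        Dnn = Yn @ Yn.T / O + theta * np.eye(Yn.shape[0])
        W = inv(np.real(sqrtm(Dnn)))                    # D~_nn^(-1/2)
        whiteners.append(W)
        M = np.moveaxis(np.tensordot(M, W, axes=(n, 1)), -1, n)  # mode-n product
    return M, whiteners
```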
Writing $\widehat{\mathcal{M}}=\rho\,\mathbf{v}_1\circ\mathbf{v}_2\circ\cdots\circ\mathbf{v}_N$, the problem becomes
$$\arg\max_{\{\mathbf{v}_n\}}\ \rho=\mathcal{M}\,\bar\times_1\,\mathbf{v}_1^T\,\bar\times_2\,\mathbf{v}_2^T\cdots\bar\times_N\,\mathbf{v}_N^T,\qquad\text{s.t. }\mathbf{v}_n^T\mathbf{v}_n=1,\ n=1,\dots,N,$$
which is equivalent to the best rank-1 approximation
$$\arg\min_{\{\mathbf{v}_n\}}\ \big\|\mathcal{M}-\widehat{\mathcal{M}}\big\|_F^2\qquad\text{[Lathauwer et al., 2000a]},$$
and can be solved by alternating least squares (ALS), the higher-order power method (HOPM), etc.
[Lathauwer et al., 2000a; 2000b] L. De Lathauwer, B. De Moor, and J. Vandewalle, SIAM J. Matrix Anal. Appl., 2000.
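A minimal HOPM implementation for the best rank-1 approximation (our own rendering of the cited technique):

```python
import numpy as np

def hopm(M, iters=100):
    """Return (rho, [v_1, ..., v_N]) with unit v_n maximizing the multilinear form."""
    vs = [np.ones(d) / np.sqrt(d) for d in M.shape]   # simple initialization
    for _ in range(iters):
        for n in range(M.ndim):
            T = M
            for m in reversed(range(M.ndim)):         # contract all modes but n
                if m != n:
                    T = np.tensordot(T, vs[m], axes=(m, 0))
            vs[n] = T / np.linalg.norm(T)             # power step + renormalize
    rho = M.copy()
    for m in reversed(range(M.ndim)):                 # rho = M xbar_1 v1 ... xbar_N vN
        rho = np.tensordot(rho, vs[m], axes=(m, 0))
    return float(rho), vs
```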
To obtain more than one projection direction while maximizing the correlation as presented in the main TCCA problem, perform a rank-$s$ CP decomposition of $\mathcal{M}$:
$$\mathcal{M}\approx\sum_{l=1}^{s}\rho^{(l)}\,\mathbf{v}_1^{(l)}\circ\mathbf{v}_2^{(l)}\circ\cdots\circ\mathbf{v}_N^{(l)}.$$
The final $s$-dimensional representation of the $n$'th view is
$$Z_n=Y_n^T\tilde D_{nn}^{-1/2}V_n,\qquad V_n=\big[\mathbf{v}_n^{(1)},\dots,\mathbf{v}_n^{(s)}\big].$$
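A hedged sketch of extracting $s$ components by greedy deflation with the HOPM code above; greedy deflation only approximates a true rank-$s$ CP decomposition, but it shows how the per-view factors $V_n$ are collected:

```python
import numpy as np

def tcca_components(M, s, iters=100):
    """Greedy deflation: s approximate rank-1 components of a 3-way tensor M."""
    factors = [np.zeros((d, s)) for d in M.shape]
    rhos = np.zeros(s)
    R = M.copy()
    for l in range(s):
        rho, vs = hopm(R, iters)                      # hopm() from the sketch above
        rhos[l] = rho
        for n, v in enumerate(vs):
            factors[n][:, l] = v
        R = R - rho * np.einsum('i,j,k->ijk', *vs)    # subtract the rank-1 term
    return rhos, factors                              # factors[n] is V_n (d_n x s)
```

The projections would then be computed as `Z_n = Y_n.T @ W_n @ factors[n]`, with the whiteners `W_n` from the earlier `tcca_setup` sketch.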
Nonlinear extension (kernel TCCA): map each view via $\phi(Y_n)=[\phi(\mathbf{y}_{n1}),\phi(\mathbf{y}_{n2}),\dots,\phi(\mathbf{y}_{nO})]$ and solve
$$\arg\max_{\{\mathbf{b}_n\}}\ \rho=\mathcal{K}_{12\cdots N}\,\bar\times_1\,\mathbf{b}_1^T\,\bar\times_2\,\mathbf{b}_2^T\cdots\bar\times_N\,\mathbf{b}_N^T,\qquad\text{s.t. }\mathbf{b}_n^T\big(L_{nn}^2+\vartheta L_{nn}\big)\mathbf{b}_n=1,\ n=1,\dots,N,$$
where $L_{nn}$ is the Gram matrix of the $n$'th view and $\mathcal{K}_{12\cdots N}$ is the covariance tensor in kernel space. Choosing $M_n$ with $M_n^TM_n=L_{nn}^2+\vartheta L_{nn}$, define
$$\mathcal{T}=\mathcal{K}_{12\cdots N}\times_1 M_1^{-T}\times_2 M_2^{-T}\cdots\times_N M_N^{-T},\qquad\mathbf{c}_n=M_n\mathbf{b}_n.$$
The problem then takes the same rank-1 (or rank-$s$) approximation form:
$$\arg\max_{\{\mathbf{c}_n\}}\ \rho=\mathcal{T}\,\bar\times_1\,\mathbf{c}_1^T\,\bar\times_2\,\mathbf{c}_2^T\cdots\bar\times_N\,\mathbf{c}_N^T,\qquad\text{s.t. }\mathbf{c}_n^T\mathbf{c}_n=1,\ n=1,\dots,N,$$
and the final representations are
$$Z_n=L_{nn}M_n^{-1}C_n,\quad n=1,\dots,N,\qquad C_n=\big[\mathbf{c}_n^{(1)},\dots,\mathbf{c}_n^{(s)}\big].$$
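A small sketch of the kernel-side ingredients; the RBF kernel is our illustrative choice (the slides do not fix one), and Gram-matrix centering is omitted for brevity:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import sqrtm

def gram_rbf(Y, gamma=1.0):
    """Y: (d x O) data matrix. Gram matrix L = phi(Y)^T phi(Y) for an RBF kernel."""
    return np.exp(-gamma * cdist(Y.T, Y.T, 'sqeuclidean'))

def whitening_factor(L, theta=1e-4):
    """An M_n with M_n^T M_n = L^2 + theta*L (symmetric square-root choice)."""
    return np.real(sqrtm(L @ L + theta * L))
```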
Experimental setup: on the biological data, each view is extracted from the sequence window of amino acids and is 105-D; on the web page data, the views include terms in the current URL (495-D) and terms in the anchor URL (472-D). The evaluation criterion is classification accuracy. Methods that handle only two views are run on all $n(n-1)/2$ subsets of two views. Two scales of unlabeled data are tested: 84K and 1.3M. Both linear and non-linear (kernel) methods are compared.
Observations: the subspace-based strategy is often better than simply concatenating all the features, especially when the feature dimension is high; the non-linear (kernel) extension, which also allows unlabeled data to be utilized, often leads to better performance; and by exploring the high-order statistics, the proposed TCCA outperforms the other methods.
This raises a question: can knowledge also be transferred between different views or domains?
Distance metric learning (DML) learns a metric in the input space to reflect relationships between data, and is important for applications such as classification and information retrieval. Learning a Mahalanobis metric is equivalent to learning a linear transformation; non-linear variants can capture more complex structure in the data.
Mahalanobis metric:
$$d_B(\mathbf{y}_j,\mathbf{y}_k)=(\mathbf{y}_j-\mathbf{y}_k)^T B\,(\mathbf{y}_j-\mathbf{y}_k)=\big\|V^T\mathbf{y}_j-V^T\mathbf{y}_k\big\|_2^2,\qquad B=VV^T.$$
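As a quick check, the two forms of the metric coincide (sketch ours):

```python
import numpy as np

def mahalanobis_sq(yj, yk, V):
    """(yj - yk)^T B (yj - yk) with B = V V^T, via the linear map V^T."""
    d = V.T @ (yj - yk)
    return float(d @ d)

# Sanity check that the quadratic-form and projected-distance forms agree.
yj, yk, V = np.random.rand(5), np.random.rand(5), np.random.rand(5, 3)
B = V @ V.T
assert np.isclose(mahalanobis_sq(yj, yk, V), (yj - yk) @ B @ (yj - yk))
```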
Labeled data in a single domain are often insufficient for learning a reliable distance metric; information from different tasks/domains (source tasks/domains) can be leveraged to help metric learning in the target tasks/domains.
Homogeneous setting: the data of the different domains come from different distributions but share the same feature space. For example, the appearance of a website may change in different time periods or on different devices, and reviews of different types of products can be very different. Typical remedies re-weight the samples given the different distributions, or find a subspace in which the distribution difference is reduced.
Heterogeneous setting: the domains have different feature spaces, and the features may have different semantics, so knowledge must be transferred across the different domains.
Example: labeled reviews in English are abundant while labeled reviews in Spanish are scarce; the goal is to classify reviews in Spanish.
Existing heterogeneous methods usually transform the heterogeneous features into a common subspace, and the transformation can be used to derive a metric. However, because only pairs of different domains are examined, the high-order statistics that can only be obtained by simultaneously examining all domains are ignored, and some domain connections are missed.
[Figure: HMTML framework. For each domain $n=1,\dots,N$ (e.g., English, ..., German), unlabeled data $Y_n=[\mathbf{y}_{n1},\mathbf{y}_{n2},\dots,\mathbf{y}_{nO}]$ are projected by a mapping $V_n$ to representations $[\mathbf{z}_{n1},\mathbf{z}_{n2},\dots,\mathbf{z}_{nO}]$; the projections of all domains are coupled by tensor-based correlation maximization, and the learned mappings yield the metrics $B_1^*=V_1^*V_1^{*T},\dots,B_N^*=V_N^*V_N^{*T}$]
Overall formulation:
$$\arg\min_{\{B_n\}}\ \sum_{n=1}^{N}G(B_n)=\sum_{n=1}^{N}\Psi(B_n)+\delta\,S(B_1,B_2,\dots,B_N),\qquad\text{s.t. }B_n\succeq 0,\ n=1,2,\dots,N,$$
where
$$\Psi(B_n)=\frac{2}{O_n(O_n-1)}\sum_{j<k}M\big(B_n;\mathbf{y}_{nj},\mathbf{y}_{nk},z_{njk}\big)$$
is the empirical loss w.r.t. $B_n$, and $S(\cdot)$ couples the metrics of the different domains.
Each mapping $V_n^T$ projects its domain into a common subspace, where the correlation of all domains is maximized:
$$\operatorname{corr}\big(\mathbf{z}_{1o}^V,\mathbf{z}_{2o}^V,\dots,\mathbf{z}_{No}^V\big)=\big(\mathbf{z}_{1o}^V\odot\mathbf{z}_{2o}^V\odot\cdots\odot\mathbf{z}_{No}^V\big)^T\mathbf{f}$$
is the correlation of the projected representations $\mathbf{z}_{no}^V=V_n^T\mathbf{y}_{no}^V$ of the $o$'th unlabeled sample.
Over the $O_V$ unlabeled samples, maximize the average correlation:
$$\arg\max_{\{V_n\}}\ \frac{1}{O_V}\sum_{o=1}^{O_V}\operatorname{corr}\big(\mathbf{z}_{1o}^V,\mathbf{z}_{2o}^V,\dots,\mathbf{z}_{No}^V\big)=\arg\max_{\{V_n\}}\ \frac{1}{O_V}\sum_{o=1}^{O_V}\mathcal{G}\,\bar\times_1\,(\mathbf{y}_{1o}^V)^T\cdots\bar\times_N\,(\mathbf{y}_{No}^V)^T.$$
This is equivalent to
$$\arg\min_{\{V_n\}}\ \frac{1}{O_V}\sum_{o=1}^{O_V}\big\|\mathcal{D}_o^V-\mathcal{G}\big\|_F^2\qquad\text{[Luo et al., 2015]},$$
where $\mathcal{G}=\mathcal{E}_s\times_1 V_1\times_2 V_2\cdots\times_N V_N$ is the covariance tensor of the mappings ($\mathcal{E}_s$ denotes the order-$N$ super-diagonal unit tensor), and $\mathcal{D}_o^V=\mathbf{y}_{1o}^V\circ\mathbf{y}_{2o}^V\circ\cdots\circ\mathbf{y}_{No}^V$ is the covariance tensor of the representations for the $o$'th unlabeled sample. [Lathauwer et al., 2000b]
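A sketch of this coupling term for three domains (our implementation of the objective above):

```python
import numpy as np

def coupling_loss(Vs, unlabeled):
    """(1/O_V) * sum_o ||D_o - G||_F^2 for three domains.
    Vs: list of (d_n x s) mappings; unlabeled: list of (d_n x O_V) matrices."""
    s = Vs[0].shape[1]
    E = np.zeros((s, s, s))
    np.fill_diagonal(E, 1.0)                  # super-diagonal unit tensor E_s
    G = E                                     # G = E_s x_1 V1 x_2 V2 x_3 V3
    for n, V in enumerate(Vs):
        G = np.moveaxis(np.tensordot(G, V, axes=(n, 1)), -1, n)
    O_V = unlabeled[0].shape[1]
    loss = 0.0
    for o in range(O_V):
        D_o = np.einsum('i,j,k->ijk', *(Y[:, o] for Y in unlabeled))
        loss += np.sum((D_o - G) ** 2)
    return loss / O_V
```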
Intuition: the projected representations of all domains are encouraged to be close to each other, so the different domains can help each other in learning the mappings $V_n$, or equivalently the metrics $B_n$.
Full objective:
$$\arg\min_{\{V_n\}}\ \sum_{n=1}^{N}G(V_n)=\sum_{n=1}^{N}\frac{1}{O_n'}\sum_{l=1}^{O_n'}\ell\Big(z_{nl}\big(1-\boldsymbol{\varepsilon}_{nl}^T V_nV_n^T\boldsymbol{\varepsilon}_{nl}\big)\Big)+\frac{\delta}{O_V}\sum_{o=1}^{O_V}\big\|\mathcal{D}_o^V-\mathcal{G}\big\|_F^2+\sum_{n=1}^{N}\delta_n\|V_n\|_1,$$
where $\boldsymbol{\varepsilon}_{nl}$ is the feature difference of the $l$'th labeled pair in domain $n$, $z_{nl}$ its (dis)similarity label, and $\ell(\cdot)$ a margin-based loss.
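A sketch of evaluating this objective for three domains; the concrete loss $\ell$ is our assumption (logistic), since the slides only show its argument:

```python
import numpy as np

def hmtml_objective(Vs, pairs, unlabeled, delta, deltas):
    """pairs[n]: list of (y_j, y_k, z) with z = +1 (similar) / -1 (dissimilar).
    Uses coupling_loss() from the sketch above."""
    obj = 0.0
    for n, V in enumerate(Vs):
        losses = []
        for yj, yk, z in pairs[n]:
            e = yj - yk
            margin = z * (1.0 - e @ (V @ (V.T @ e)))   # z_nl(1 - eps^T V V^T eps)
            losses.append(np.log1p(np.exp(-margin)))   # assumed logistic loss
        obj += np.mean(losses)
    obj += delta * coupling_loss(Vs, unlabeled)        # unlabeled coupling term
    obj += sum(dn * np.abs(V).sum() for dn, V in zip(deltas, Vs))  # L1 terms
    return obj
```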
Optimization: alternate over the domains; convert $\|\mathcal{D}_o^V-\mathcal{G}\|_F^2$ into an explicit expression w.r.t. $V_n$ and minimize it w.r.t. $V_n$ by projected gradient descent.
Since
$$\mathcal{G}=\mathcal{E}_s\times_1 V_1\times_2 V_2\cdots\times_N V_N=\mathcal{B}\times_n V_n,\qquad\mathcal{B}=\mathcal{E}_s\times_1 V_1\cdots\times_{n-1}V_{n-1}\times_{n+1}V_{n+1}\cdots\times_N V_N,$$
the matricizing property $G_{(n)}=V_nB_{(n)}$ gives
$$\big\|\mathcal{D}_o^V-\mathcal{G}\big\|_F^2=\big\|D_{o(n)}^V-G_{(n)}\big\|_F^2=\big\|D_{o(n)}^V-V_nB_{(n)}\big\|_F^2,$$
which is an explicit function of $V_n$. [Lathauwer et al., 2000a]
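With this unfolding identity, the coupling term admits a closed-form least-squares update of $V_n$ when the other mappings are fixed; a sketch (ours; a projected-gradient step would be used instead once the constraints and the $\ell_1$ regularizer are included):

```python
import numpy as np

def update_Vn(unlabeled, Vs, n):
    """Closed-form minimizer of (1/O_V) sum_o ||D_o(n) - V_n B_(n)||_F^2
    for three domains. Uses unfold() from the matricizing sketch above."""
    s = Vs[0].shape[1]
    E = np.zeros((s, s, s))
    np.fill_diagonal(E, 1.0)
    B = E                                      # B = E_s x_m V_m for all m != n
    for m, V in enumerate(Vs):
        if m != n:
            B = np.moveaxis(np.tensordot(B, V, axes=(m, 1)), -1, m)
    Bn = unfold(B, n)
    O_V = unlabeled[0].shape[1]
    # The objective depends on the D_o only through their average.
    Dbar = sum(np.einsum('i,j,k->ijk', *(Y[:, o] for Y in unlabeled))
               for o in range(O_V)) / O_V
    Dn = unfold(Dbar, n)
    return Dn @ Bn.T @ np.linalg.pinv(Bn @ Bn.T)
```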
The common factors shared across domains serve as high-level patterns for transfer. Baselines: RDML, which learns the metric of each domain separately and does not make use of any additional information from the other domains; DAMA, which aligns the heterogeneous domains using manifold alignment; and multi-task discriminant analysis (MTDA).
Observations: even learning the distance metric separately using RDML can still improve the performance significantly. The transfer learning approaches achieve much better performance than the single-domain baseline, since they utilize information from the other domains in DML. Our method performs best under different numbers of common factors; this indicates that the factors learned by our method are more expressive than those of the other approaches. For RDML, the improvements are similar for different domains, since there is no communication between them. Our method achieves larger improvements than RDML in the domains where the discriminative ability of the original representations is not very good; this demonstrates that knowledge is successfully transferred between the different domains. Only limited improvements are achieved by DAMA and MTDA, while the proposed HMTML still achieves significant improvements.
Conclusions: the deficiency of labeled data can be alleviated by learning metrics for multiple heterogeneous domains simultaneously; the information exploited by the transfer learning methods can benefit each domain if appropriate common factors are discovered, and the high-order statistics (correlation information) are critical in discovering such factors.