From Past to Present: Personalized Attention Session-Aware RNN Recommender System
Presenter: Mei Wang Mentors: Weizhi Li, Yan Yan 08/22/2018
Recommender application domains are expanding and wide-ranging: e-commerce, music, social media, video, app stores.
Influential Factors
- Internal: personality, culture, fashion style, aesthetic taste, age, education, figure, marriage, ...
- External: environments, weather, festivals, location, ads, friends, family, income, ...
- Past: periodic purchases (makeup, car wiper blades), favorite brands, preferred colors, clothing styles, fast-moving consumer goods, ...
- Present: current needs (umbrella, birthday, travel suits), mood, promotions, hanging around, ...
Context-Aware, Time-Aware, and Sequence-Aware recommendation: three complementary sources of information about users' needs, drawn from their interactions with items and alternatives.
[Figure: data flow of User 1's item interactions over time, grouped into Sessions 1-3.]
The sequential interactions between users and items are crucial data sources for recommender systems.
Challenges:
- Training with too long a sequence length will result in extreme training latency and memory cost.
- Click data are very noisy: some clicks are meaningful, some reflect only slight interest, and some may even be clicked by mistake.
- Past sessions should play different roles from the present session, but there is no specific rule for integrating them: the current session yields short-term profiles, while past sessions yield long-term profiles.
The motivation of our model design is inspired by real-data observations and analysis. The data come from an online e-commerce navigation log containing user-item interactions across many categories. We come up with the following three observations:
The blue bar represents the mean percentage of user interactions falling in the top 10 categories during the same session (around 50%), and the second bar the corresponding percentage at the item level (around 20%). Overall, both are subject to exponential decrease, which indicates that a user's short-term shopping goal plays a predominant role in intra-session interaction choices.
The figure shows the mean percentage of a user clicking repeated categories and items that he/she had clicked in the previous 10 sessions. Inter-session information provides 30 to 60 percent of the information for next-basket category prediction, and 5 to 20 percent of the knowledge about repeated items.
The figure shows the normalized histogram of the click gap (view dwell time) of user interactions, which follows a gamma distribution with its maximum around 10 seconds. Generally speaking, the longer a user spends on an item, the more interest he has in it. This bridges the gap between discrete interaction sequence data and potential weights.
In this paper, we want to quantify, exploit, and integrate the effectiveness of these observations:
Ø Dwell times represent an important piece of context information.
Ø We choose GRU4REC, a session-based RNN recommender system, as the basis of our model design.
Ø We use an efficient embedding layer to automatically train and activate short- and long-term profiles from session representations; long-term profiles are important for recommender systems, while current state-of-the-art session-based approaches fail to model them effectively.
Ø Finally, with the help of a temporal dynamics scheme, we incorporate temporal context into the RNN and efficiently combine short-term session sequence information with long-term user and item profiles.
Contributions:
Ø Propose a novel personalized attention session-aware recommender system that seamlessly integrates intra-session and inter-session profiles.
Ø Offer a temporal dynamics scheme that exploits more intra-session information along the time dimension.
Ø Include long-term user profiles for sessions to learn cross-session patterns and user-preference evolution seamlessly.
Ø Conduct extensive experiments on four real datasets and demonstrate the effectiveness of our personalized recommendation.
Related works compared by methodology category and the domain features they exploit (general, multiplicative, evolution, time, and sequence methods):

| Methods \ Features | User Taste | Item Impression | User-Item Interaction | User Favorite | Item Trend | Temporal Drift | Sequence Feature |
| CF-based: BPR-MF | ✓ | ✓ | ✓ | ✕ | ✕ | ✕ | ✕ |
| CF-based: TimeSVD++ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✕ |
| CF-based: FPMC | ✓ | ✓ | ✓ | ✕ | ✕ | ✕ | ✓ |
| NN-based: DNN | ✓ | ✓ | ✓ | ✕ | ✕ | ✕ | ✕ |
| NN-based: GRU4REC | ✕ | ✓ | ✓ | ✕ | ✕ | ✕ | ✓ |
| NN-based: PASAR | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Time-Aware Context-Aware Sequence-Aware CF-Based Methods
2008 Netflix Prize Ø Collaborative Filtering
(CF, Schafer et al, AdaptiveWeb’07 )
Ø Matrix Factorization
(MF, Koren et al, Computer’09)
Ø SVD++ model
(Koren et al, KDD’08)
Ø BPR-MF model
(Rendle et al, AUAI Press’09)
Ø kNN model
(Koren, TKDD’10)
Ø Behavior Factorization
(Zhao et al, WWW’17)
Ø TimeSVD++ model
(Koren et al, KDD’09)
Ø Factorizing Personalized Markov Chain
(FPMC, Rendle et al, WWW’10)
Although CF-based methods are theoretically well developed and computationally cheaper, their practicality and scalability fall short of NN-based approaches.
Time-Aware Context-Aware Sequence-Aware NN-Based Methods
2016 YouTube DNN
Ø YouTube DNN Model
(Covington et al, RecSys’16 )
Ø Neural Collaborative Filtering
(NCF, He et al, WWW’17)
Ø Session-based RNN
(GRU4REC, Hidasi , ICLR’16 )
Ø Time-LSTM model
(Zhu et al, IJCAI’17 )
Ø Recurrent Recommender Networks
(RRN, Wu et al, WSDM’17)
Ø Neural Survival Recommendation
(NSR, Jing et al, WSDM’17)
Ø Temporal Point Process
(TPS, Song et al, GaTech)
Ø Deep & Cross model
(DCN, Wang et al, ADKDD’17)
Ø Latent Cross model
(Beutel et al, WSDM’18)
However, these approaches do not adopt a session-based scheme, which, as shown in our EDA, can play a predominant role in recommendation.
Improvements ++User Profile Session-Based Methods
2016
GRU4REC
RNN
Ø GRU4REC RNN Model
(Hidasi et al, ICLR’16 )
Ø Data augmentation (Improved RNN, Tan et al, DLRS’16)
Ø Exploit dwell time (Dallmann et al, RecSys’17)
Ø Add different types of interactions (Wu et al, CIKM’17)
Ø Hierarchical RNN (Quadrana et al, RecSys’17)
These works made incremental improvements to GRU4REC, but they do not modify it significantly and have not considered long-term profile information or the user action gap-time feature, both of which can bring great gains.
To sum up, RNNs show their advantage over item-based or Markov chain-based approaches in short-term sequential pattern mining. To equip RNNs with long-term profiling, the goal of this paper is to make effective use of both long-term and short-term profiles and construct a better personalized session-aware RNN recommender system.
Related: [CIKM’15] STAR, using LDA (Latent Dirichlet Allocation) and MCMC (Markov chain Monte Carlo); [RecSys’17] HRNN, with a user-level GRU (GRUuser).
We define a session as a sequence of continuous navigation activities without long interruption; a user's sessions are separated by periods of inactivity.
We formulate this as a top-K ranking problem. We define the activity session sequence as $\mathcal{S} = \{S_m \mid m = 1, \dots, M\}$, where $S_m = \{i^m_1, \dots, i^m_{n_m}\}$ represents the $m$-th session. There are $M$ sessions in total, and each session $S_m$ has length $n_m$. The prediction target is $\hat{y}_n = f(i_n \mid i_1, \dots, i_{n-1})$, where $i \in \mathcal{I}$ and $\mathcal{I}$ is the item set.
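To make the formulation concrete, here is a toy sketch of top-K ranking over an item set given a session prefix. The co-occurrence scorer and all names are hypothetical stand-ins for the model's learned score $f$:

```python
def top_k_next_items(session_prefix, item_set, score_fn, k=3):
    """Score every candidate not already in the prefix; return the top K."""
    scores = {item: score_fn(session_prefix, item)
              for item in item_set if item not in session_prefix}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy stand-in score: count past sessions containing both the candidate
# and at least one prefix item.
history = [["i1", "i2", "i3"], ["i2", "i3"], ["i1", "i4"]]

def cooccurrence_score(prefix, candidate):
    return sum(1 for s in history
               if candidate in s and any(p in s for p in prefix))

items = ["i1", "i2", "i3", "i4"]
print(top_k_next_items(["i2"], items, cooccurrence_score, k=2))
```

A learned model replaces `cooccurrence_score` with the RNN's softmax scores; the ranking step is unchanged.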
Data flow of user and item interactions over time. [Figure: Sessions 1-5 of a user's item interactions; the next item to predict is marked "?".]
[Figure: session-parallel mini-batches built from Sessions 1-5, showing the input layer and the shifted output targets.]
Session-parallel mini-batch approach (sessions have different lengths): the mini-batch at step $t$ is $E = \{x_t\} = \{x_t^1, x_t^2, \dots, x_t^B\}$. Each one-hot mini-batch vector is fed into a GRU layer, and the hidden states are reset when switching sessions: $h_t = \mathrm{GRU}(x_t, h_{t-1})$. The output of the RNN can be treated as the session representation, from which the predictor computes the likelihood of the next item. Intra-session interactions are treated as independent across sessions. ✓
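The batching scheme can be sketched as a minimal re-implementation of GRU4REC-style session-parallel mini-batching in plain Python (names hypothetical). Each of the B "lanes" walks one session; when a lane's session ends, the next unused session takes its slot and a reset flag marks that lane's hidden state for zeroing:

```python
def session_parallel_batches(sessions, batch_size):
    lanes = list(range(batch_size))          # lane -> session index
    pos = [0] * batch_size                   # lane -> position within session
    next_session = batch_size
    reset = [True] * batch_size              # all hidden states start fresh
    while True:
        inputs, targets = [], []
        for lane in range(batch_size):
            s = sessions[lanes[lane]]
            inputs.append(s[pos[lane]])      # current item
            targets.append(s[pos[lane] + 1]) # next item to predict
        yield inputs, targets, list(reset)
        reset = [False] * batch_size
        for lane in range(batch_size):
            pos[lane] += 1
            if pos[lane] + 1 >= len(sessions[lanes[lane]]):  # session exhausted
                if next_session >= len(sessions):
                    return
                lanes[lane], pos[lane] = next_session, 0
                next_session += 1
                reset[lane] = True           # zero this lane's hidden state

sessions = [["i1", "i2", "i3"], ["i4", "i5"], ["i6", "i7", "i8"]]
for x, y, r in session_parallel_batches(sessions, batch_size=2):
    print(x, y, r)
```

The reset flags are exactly the "hidden states are reset when switching sessions" step: the trainer zeroes the GRU state for flagged lanes before the next step.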
[Figure: timeline of a user's sessions annotated with the time quantities defined below.]
Time notation:
- t: action time (timestamp of each interaction)
- δ(t): action gap time
- τ: session duration
- δ(τ): session gap time
- T: the total time span
Action timestamps can be used directly as contextual information for training periodic-purchasing features. Session gap time is helpful for survival analysis to predict user return time.
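As a small illustration of these quantities, here is a pure-Python sketch that splits a sorted list of action timestamps (in seconds) into sessions and collects the per-action gap times. The 30-minute inactivity threshold is an assumed value, not one given in the talk:

```python
SESSION_GAP = 30 * 60  # assumed inactivity threshold between sessions (s)

def split_sessions(timestamps):
    """Group sorted action timestamps into sessions; also return the
    action gap times (view dwell times) within each session."""
    sessions, gaps = [[timestamps[0]]], [[]]
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev                  # action gap time δ(t)
        if delta > SESSION_GAP:             # long idle => session gap δ(τ)
            sessions.append([cur])
            gaps.append([])
        else:
            sessions[-1].append(cur)
            gaps[-1].append(delta)
    return sessions, gaps

ts = [0, 12, 40, 5000, 5020]                # two bursts of activity
sessions, gaps = split_sessions(ts)
print(sessions, gaps)
```

Session duration τ is then `s[-1] - s[0]` per session, and the session gap time is the difference between one session's end and the next session's start.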
Scott Binning and Scaled Embedding. We create a dwell-time sequence with the same dimension as the item sequence: $d = \{d_1, d_2, \dots, d_n\}$. Since dwell time follows a gamma distribution, we apply Scott binning to reduce dimensionality and accelerate training, with bin width $h = 3.49\,\hat{\sigma}\,n^{-1/3}$. Next, we use an embedding method to represent dwell-time importance within sessions: $E = \{e_{d,t}\} = \{e_{d,1}, e_{d,2}, \dots, e_{d,n}\}$.
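Scott's rule sets the histogram bin width from the sample standard deviation; a pure-Python sketch of the binning step, assuming the classic $h = 3.49\,\hat{\sigma}\,n^{-1/3}$ form of the rule:

```python
import math

def scott_bin_width(xs):
    """Scott's rule: h = 3.49 * sigma * n^(-1/3)."""
    n = len(xs)
    mean = sum(xs) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return 3.49 * sigma * n ** (-1.0 / 3.0)

def bin_indices(xs):
    """Replace each dwell time by its bin index, shrinking the vocabulary
    the time-embedding layer must cover."""
    h = scott_bin_width(xs)
    lo = min(xs)
    return [int((x - lo) // h) for x in xs]

dwell = [2.0, 5.0, 8.0, 11.0, 60.0, 300.0]  # made-up dwell times in seconds
print(bin_indices(dwell))
```

The resulting integer bin IDs are what the scaled embedding layer looks up, in place of raw continuous dwell times.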
[Figure: model architecture with input layer, embedding and dropout layers, LSTM layer, and attention scheme over hidden states h_1 ... h_n, plus Scott binning and scaled embedding of dwell times t_1 ... t_n.]
Attention design: attention is a tool for extracting the importance of each position in a sequence; intuitively, such attention vectors are well suited to modulate the RNN outputs. Attention problem: to train on all prefix samples, all subsequences need to be forwarded through the attention network.
Triangle parallel attention method:
[Figure: the input layer, embedding and dropout layers, LSTM layer, and attention scheme, with a trigonometric transformation of the time features; the attention scores for all prefixes are arranged in a lower-triangular matrix and normalized by a softmax.]
$u_i = \tanh(W_a h_{i} + b_a)$
$\alpha_i = \dfrac{\exp(u_i^{\top} q)}{\sum_j \exp(u_j^{\top} q)}$
$c = \sum_i \alpha_i\, h_{i}$
Each row of the triangular score matrix $P = [p_i] = [u_1^{\top} q, \dots, u_i^{\top} q, 0, \dots, 0]$ covers one prefix, so all subsequences share a single forward pass.
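A minimal numeric sketch of this attention scheme in pure Python. The weight values $W_a$, $b_a$, $q$ are made up for illustration; a real implementation learns them:

```python
import math

def attention_pool(H, W, b, q):
    """H: list of hidden-state vectors; W: square matrix; b, q: vectors.
    Returns (attention weights alpha_i, pooled context vector c)."""
    def matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]
    # u_i = tanh(W h_i + b)
    U = [[math.tanh(x + bi) for x, bi in zip(matvec(W, h), b)] for h in H]
    # scores u_i . q, then numerically stable softmax
    scores = [sum(u * qi for u, qi in zip(u_vec, q)) for u_vec in U]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    alphas = [e / sum(exps) for e in exps]
    # c = sum_i alpha_i h_i
    context = [sum(a * h[d] for a, h in zip(alphas, H))
               for d in range(len(H[0]))]
    return alphas, context

H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy hidden states
W = [[1.0, 0.0], [0.0, 1.0]]               # identity weights for clarity
alphas, context = attention_pool(H, W, b=[0.0, 0.0], q=[1.0, 1.0])
print(alphas, context)
```

The triangular trick in the slide amounts to evaluating this pooling for every prefix of H at once via a masked score matrix instead of one call per prefix.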
User-grouped session-parallel mini-batch approach: $E(u) = \{x_{u,t}\} = \{x_{u,t}^1, x_{u,t}^2, \dots, x_{u,t}^{B_u}\}$.
[Figure: user-grouped session-parallel mini-batches for User 1 and User 2, each contributing their own Sessions 1-5 to the input layer and the shifted output targets.]
User-based negative sampling: we select negative samples in proportion to item popularity within the mini-batch sequences. Furthermore, for each user, we rule out items that appear in his/her history. This local negative sampling method not only improves performance but also reduces computation time.
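A sketch of this sampler in plain Python (function and variable names hypothetical); `random.choices` draws with probability proportional to the supplied weights:

```python
import random

def sample_negatives(popularity, user_history, k, rng):
    """popularity: {item: count}. Draw k popularity-weighted negatives,
    restricted to items the user has never interacted with."""
    candidates = [i for i in popularity if i not in user_history]
    weights = [popularity[i] for i in candidates]
    return rng.choices(candidates, weights=weights, k=k)

pop = {"i1": 50, "i2": 30, "i3": 15, "i4": 5}
negs = sample_negatives(pop, user_history={"i1"}, k=3, rng=random.Random(0))
print(negs)
```

In practice the popularity counts come from the current mini-batch sequences, so the sampler stays local and cheap.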
[Figure: final model; the hidden outputs h_1 ... h_n pass through the trigonometric transformation and are combined with the user embedding via two alternative heads, each ending in a fully connected layer.]
Attention design: a self-attention mechanism weights the session hidden states, with scores $\tanh(W_s h_i + b_s)$ normalized by a softmax, $\alpha_i = \dfrac{\exp(q^{\top} p_i)}{\sum_j \exp(q^{\top} p_j)}$; the weighted sum is concatenated with the user bias $b_u$ before the fully connected layer.
Concatenation design: the hidden outputs are concatenated directly with the user embedding and fed through the fully connected layer.
The likelihood in the predictor is $f(y = i_{n+1} \mid i_{1,\dots,n}, u, t)$.
Loss functions (with positive score $\hat{y}_i$, sampled negative scores $\hat{y}_j$ for $j = 1, \dots, N_S$, and logistic sigmoid $\sigma$):
BPR loss: $L = -\dfrac{1}{N_S}\sum_{j=1}^{N_S} \log \sigma(\hat{y}_i - \hat{y}_j)$
TOP1 loss: $L = \dfrac{1}{N_S}\sum_{j=1}^{N_S} \left[\sigma(\hat{y}_j - \hat{y}_i) + \sigma(\hat{y}_j^2)\right]$
Hinge loss: $L = \max\{\hat{y}_j - \hat{y}_i + 1,\, 0\}$
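The three losses named above (BPR, TOP1, hinge) can be written directly from their formulas. A pure-Python sketch for one positive score against a set of sampled negatives; averaging the hinge loss over negatives is an assumption for symmetry with the other two:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bpr_loss(r_pos, r_negs):
    # -mean_j log sigma(r_pos - r_j)
    return -sum(math.log(sigmoid(r_pos - rj)) for rj in r_negs) / len(r_negs)

def top1_loss(r_pos, r_negs):
    # mean_j [sigma(r_j - r_pos) + sigma(r_j^2)]  (score regularization term)
    return sum(sigmoid(rj - r_pos) + sigmoid(rj * rj)
               for rj in r_negs) / len(r_negs)

def hinge_loss(r_pos, r_negs):
    # mean_j max(r_j - r_pos + 1, 0)
    return sum(max(rj - r_pos + 1.0, 0.0) for rj in r_negs) / len(r_negs)

print(bpr_loss(2.0, [0.5, -1.0]), top1_loss(2.0, [0.5, -1.0]))
```

All three decrease as the positive item's score pulls away from the sampled negatives, which is the ranking behavior the model is trained for.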
Data augmentation: first, we train each sequence with all hidden states as outputs, which fully exploits the subsequence information. Second, we apply a dropout layer to the sequences, which both regularizes the model and diversifies the input sequence data.
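The subsequence part of the augmentation can be sketched as prefix expansion: every prefix of a session becomes its own training example (next item as target):

```python
def augment_prefixes(session):
    """[i1, i2, i3] -> [([i1], i2), ([i1, i2], i3)]: each prefix paired
    with the item that followed it."""
    return [(session[:k], session[k]) for k in range(1, len(session))]

print(augment_prefixes(["i1", "i2", "i3"]))
```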
We use four datasets in our experiments:

| | Dataset 1 | Dataset 2 | Dataset 3 | Dataset 4 |
| Events | 53,309 | 17,920,066 | 6,921,446 | 254,398 |
| Users | 237 | / | 12,332 | 3,035 |
| Items | 1,395 | 23,459 | 31,893 | 1,173 |
| Sessions | 3,609 | 4,247,567 | 93,287 | 45,878 |
| Session support | 2 | 2 | 2 | 2 |
| Item support | 10 | 20 | 10 | 20 |
| User support | 10 | / | 10 | 20 |
| Models | Description |
| Baseline: BPR-MF | Matrix factorization applying SVD to factor the user-item rating matrix |
| Baseline: YouTube DNN | Two stages: candidate generation and ranking |
| Baseline: WaveNet CNN | Inner multiplication exploited by stacked causal atrous convolutions |
| Baseline: GRU4REC RNN | Basic GRU layers with TOP1 loss and the session-parallel mini-batching mechanism |
| PASAR_user_att | Adds user profile embedding via a self-attention network |
| PASAR_user_cat | Adds the user profile by concatenating hidden outputs and user embeddings |
| PASAR_time_att | Adds time profile embedding via a global attention network |
| PASAR_time_cat | Adds the time profile by concatenating time embeddings and item embeddings |
| PASAR_time_user | Integrates both time and user profiles (the final PASAR model) |
Experimental comparison results: MRR@20 and Recall@20 scores of the four baseline models and five PASAR variants on the four datasets. Focal improvements are highlighted in bold; the best results are underlined.
The detailed MRR@10, MRR@20, MRR@30, MRR@40, and MRR@all results for each dataset.
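The reported metrics can be computed straightforwardly: for each test event, look up the target item's rank in the model's ranked list and keep only hits within the top K (a sketch; `ranked_lists` holds each test event's model-ranked items):

```python
def mrr_at_k(ranked_lists, targets, k):
    """Mean reciprocal rank, counting only targets ranked in the top K."""
    rr = 0.0
    for ranked, t in zip(ranked_lists, targets):
        if t in ranked[:k]:
            rr += 1.0 / (ranked.index(t) + 1)
    return rr / len(targets)

def recall_at_k(ranked_lists, targets, k):
    """Fraction of test events whose target appears in the top K."""
    hits = sum(1 for ranked, t in zip(ranked_lists, targets) if t in ranked[:k])
    return hits / len(targets)

preds = [["i1", "i2", "i3"], ["i2", "i1", "i3"]]   # toy ranked predictions
gold = ["i2", "i3"]                                 # true next items
print(mrr_at_k(preds, gold, k=2), recall_at_k(preds, gold, k=2))
```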
Training speed (iterations/second) and training memory cost (MiB). We ran experiments on NVIDIA Tesla P40 GPUs. The MF method is fastest, and the CNN method takes the most memory. Our model is about half as fast as the baseline RNN model and has a similar memory cost.
Many thanks to my mentors Weizhi and Chris. Love JD. ♥