From Past to Present: Personalized Attention Session-Aware RNN Recommender System
Presenter: Mei Wang Mentors: Weizhi Li, Yan Yan 08/22/2018
Recommender application domains are expanding and wide-ranging: e-commerce, music, social media, video, app stores.
Influential Factors
- Internal: personality, culture, fashion style, aesthetic taste, age, education, figure, marriage, ...
- External: environments, weather, festivals, location, ads, friends, family, income, ...
- Past: periodic purchases (makeup, car wiper blades), favorite brands, preferred colors, clothing styles, fast-moving consumer goods, ...
- Present: current needs (umbrella, birthday, travel suits), mood, promotions, hanging around, ...
Context-Aware, Time-Aware, and Sequence-Aware recommendation: three complementary sources of information about users' needs, drawn from their interactions with items and alternatives.
[Figure: data flow of User 1's item interactions over time, grouped into Sessions 1-3.]
The sequential interactions between users and items are crucial data sources for recommender systems.
Challenges:
- Training with too long a sequence length will result in extreme training latency and memory cost.
- Click data are very noisy: some clicks are meaningful, some reflect only slight interest, and some may even be clicked by mistake.
- Past sessions should play different roles from the present session, but there is no specific rule for integrating them: the current session yields short-term profiles, while past sessions yield long-term profiles.
The motivation of our model design is inspired by real-data observations and analysis. The data come from an online e-commerce navigation log containing user-item interactions across many categories. We come up with the following three observations:
The blue bar represents the mean percentage of user interactions falling in the top 10 categories during the same session (around 50%), and the second bar the corresponding percentage at the item level (around 20%). Overall, both are subject to exponential decrease, which indicates that a user's short-term shopping goal plays a predominant role in intra-session interaction choices.
The figure shows the mean percentage of a user clicking repeated categories and items that he/she had clicked in the previous 10 sessions. Inter-session information provides 30 to 60 percent of the information for next-basket category prediction, and 5 to 20 percent of the knowledge about repeated items.
The figure shows the normalized histogram of the click gap (view dwell time) of user interactions, which follows a gamma distribution with its maximum around 10 seconds. Generally speaking, the longer a user spends on an item, the more interest he has in it. This bridges the gap between discrete interaction sequence data and potential weights.
In this paper, we want to quantify, exploit, and integrate the effectiveness of these observations:
Ø Dwell times represent an important piece of context information.
Ø We choose GRU4REC, a session-based RNN recommender system, as the basis of our model design.
Ø We use an efficient embedding layer to automatically train and activate short- and long-term profiles from session representations; long-term profiles are important for recommender systems, while current state-of-the-art session-based approaches fail to model them effectively.
Ø Finally, with the help of a temporal dynamics scheme, we incorporate temporal context into the RNN and efficiently combine short-term session sequence information with long-term user and item profiles.
Contributions:
Ø Propose a novel personalized attention session-aware recommender system that seamlessly integrates intra-session and inter-session profiles.
Ø Offer a temporal dynamics scheme that exploits more intra-session information along the time dimension.
Ø Include long-term user profiles for sessions to learn cross-session patterns and user-preference evolution seamlessly.
Ø Conduct extensive experiments on four real datasets and demonstrate the effectiveness of our personalized recommendation.
Related works compared by methodology category and the domain features they exploit (general, multiplicative, evolution, time, and sequence methods):

| Methods \ Features | User Taste | Item Impression | User-Item Interaction | User Favorite | Item Trend | Temporal Drift | Sequence Feature |
| CF-based: BPR-MF | ✓ | ✓ | ✓ | ✕ | ✕ | ✕ | ✕ |
| CF-based: TimeSVD++ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✕ |
| CF-based: FPMC | ✓ | ✓ | ✓ | ✕ | ✕ | ✕ | ✓ |
| NN-based: DNN | ✓ | ✓ | ✓ | ✕ | ✕ | ✕ | ✕ |
| NN-based: GRU4REC | ✕ | ✓ | ✓ | ✕ | ✕ | ✕ | ✓ |
| NN-based: PASAR | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Time-Aware Context-Aware Sequence-Aware CF-Based Methods
2008 Netflix Prize Ø Collaborative Filtering
(CF, Schafer et al, AdaptiveWeb’07 )
Ø Matrix Factorization
(MF, Koren et al, Computer’09)
Ø SVD++ model
(Koren et al, KDD’08)
Ø BPR-MF model
(Rendle et al, AUAI Press’09)
Ø kNN model
(Koren, TKDD’10)
Ø Behavior Factorization
(Zhao et al, WWW’17)
Ø TimeSVD++ model
(Koren et al, KDD’09)
Ø Factorizing Personalized Markov Chain
(FPMC, Rendle et al, WWW’10)
Although CF-based methods are theoretically well developed and computationally cheaper, their practicality and scalability fall short of NN-based approaches.
Time-Aware Context-Aware Sequence-Aware NN-Based Methods
2016 YouTube DNN
Ø YouTube DNN Model
(Covington et al, RecSys’16 )
Ø Neural Collaborative Filtering
(NCF, He et al, WWW’17)
Ø Session-based RNN
(GRU4REC, Hidasi , ICLR’16 )
Ø Time-LSTM model
(Zhu et al, IJCAI’17 )
Ø Recurrent Recommender Networks
(RRN, Wu et al, WSDM’17)
Ø Neural Survival Recommendation
(NSR, Jing et al, WSDM’17)
Ø Temporal Point Process
(TPS, Song et al, GaTech)
Ø Deep & Cross model
(DCN, Wang et al, ADKDD’17)
Ø Latent Cross model
(Beutel et al, WSDM’18)
However, these approaches do not adopt a session-based scheme, which, as shown in our EDA, can play a predominant role in recommendation.
Improvements ++User Profile Session-Based Methods
2016
GRU4REC
RNN
Ø GRU4REC RNN Model
(Hidasi et al, ICLR’16 )
Ø Data augmentation (Improved RNN, Tan et al, DLRS’16)
Ø Exploit dwell time (Dallmann et al, RecSys’17)
Ø Add different types of interactions (Wu et al, CIKM’17)
Ø Hierarchical RNN (Quadrana et al, RecSys’17)
These works made incremental improvements to GRU4REC, but they do not modify it significantly and have not considered long-term profile information or the user action gap-time feature, both of which can bring great gains.
To sum up, RNNs show their advantage over item-based or Markov chain-based approaches in short-term sequential pattern mining. To equip RNNs with long-term profiling, the goal of this paper is to make effective use of both long-term and short-term profiles and construct a better personalized session-aware RNN recommender system.
Related: [CIKM’15] STAR, using LDA (Latent Dirichlet Allocation) and MCMC (Markov chain Monte Carlo); [RecSys’17] HRNN, with a user-level GRU (GRUuser).
We define a session as a sequence of continuous navigation activities without long interruption; a user's sessions are separated by periods of inactivity.
We formulate this as a top-K ranking problem. We define the activity session sequence as $\mathcal{S} = \{S_m \mid m = 1, \dots, M\}$, where $S_m = \{i^m_1, \dots, i^m_{n_m}\}$ represents the $m$-th session. There are $M$ sessions in total, and each session $S_m$ has length $n_m$. The prediction target is $\hat{y}_n = f(i_n \mid i_1, \dots, i_{n-1})$, where $i \in \mathcal{I}$ and $\mathcal{I}$ is the item set.
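To make the formulation concrete, here is a toy sketch of top-K ranking over an item set given a session prefix. The co-occurrence scorer and all names are hypothetical stand-ins for the model's learned score $f$:

```python
def top_k_next_items(session_prefix, item_set, score_fn, k=3):
    """Score every candidate not already in the prefix; return the top K."""
    scores = {item: score_fn(session_prefix, item)
              for item in item_set if item not in session_prefix}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy stand-in score: count past sessions containing both the candidate
# and at least one prefix item.
history = [["i1", "i2", "i3"], ["i2", "i3"], ["i1", "i4"]]

def cooccurrence_score(prefix, candidate):
    return sum(1 for s in history
               if candidate in s and any(p in s for p in prefix))

items = ["i1", "i2", "i3", "i4"]
print(top_k_next_items(["i2"], items, cooccurrence_score, k=2))
```

A learned model replaces `cooccurrence_score` with the RNN's softmax scores; the ranking step is unchanged.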
Data flow of user and item interactions over time. [Figure: Sessions 1-5 of a user's item interactions; the next item to predict is marked "?".]
[Figure: session-parallel mini-batches built from Sessions 1-5, showing the input layer and the shifted output targets.]
Session-parallel mini-batch approach (sessions have different lengths): the mini-batch at step $t$ is $E = \{x_t\} = \{x_t^1, x_t^2, \dots, x_t^B\}$. Each one-hot mini-batch vector is fed into a GRU layer, and the hidden states are reset when switching sessions: $h_t = \mathrm{GRU}(x_t, h_{t-1})$. The output of the RNN can be treated as the session representation, from which the predictor computes the likelihood of the next item. Intra-session interactions are treated as independent across sessions. ✓
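The batching scheme can be sketched as a minimal re-implementation of GRU4REC-style session-parallel mini-batching in plain Python (names hypothetical). Each of the B "lanes" walks one session; when a lane's session ends, the next unused session takes its slot and a reset flag marks that lane's hidden state for zeroing:

```python
def session_parallel_batches(sessions, batch_size):
    lanes = list(range(batch_size))          # lane -> session index
    pos = [0] * batch_size                   # lane -> position within session
    next_session = batch_size
    reset = [True] * batch_size              # all hidden states start fresh
    while True:
        inputs, targets = [], []
        for lane in range(batch_size):
            s = sessions[lanes[lane]]
            inputs.append(s[pos[lane]])      # current item
            targets.append(s[pos[lane] + 1]) # next item to predict
        yield inputs, targets, list(reset)
        reset = [False] * batch_size
        for lane in range(batch_size):
            pos[lane] += 1
            if pos[lane] + 1 >= len(sessions[lanes[lane]]):  # session exhausted
                if next_session >= len(sessions):
                    return
                lanes[lane], pos[lane] = next_session, 0
                next_session += 1
                reset[lane] = True           # zero this lane's hidden state

sessions = [["i1", "i2", "i3"], ["i4", "i5"], ["i6", "i7", "i8"]]
for x, y, r in session_parallel_batches(sessions, batch_size=2):
    print(x, y, r)
```

The reset flags are exactly the "hidden states are reset when switching sessions" step: the trainer zeroes the GRU state for flagged lanes before the next step.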
[Figure: timeline of a user's sessions annotated with the time quantities defined below.]
Time notation:
- t: action time (timestamp of each interaction)
- δ(t): action gap time
- τ: session duration
- δ(τ): session gap time
- T: the total time span
Action timestamps can be used directly as contextual information for training periodic-purchasing features. Session gap time is helpful for survival analysis to predict user return time.
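As a small illustration of these quantities, here is a pure-Python sketch that splits a sorted list of action timestamps (in seconds) into sessions and collects the per-action gap times. The 30-minute inactivity threshold is an assumed value, not one given in the talk:

```python
SESSION_GAP = 30 * 60  # assumed inactivity threshold between sessions (s)

def split_sessions(timestamps):
    """Group sorted action timestamps into sessions; also return the
    action gap times (view dwell times) within each session."""
    sessions, gaps = [[timestamps[0]]], [[]]
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev                  # action gap time δ(t)
        if delta > SESSION_GAP:             # long idle => session gap δ(τ)
            sessions.append([cur])
            gaps.append([])
        else:
            sessions[-1].append(cur)
            gaps[-1].append(delta)
    return sessions, gaps

ts = [0, 12, 40, 5000, 5020]                # two bursts of activity
sessions, gaps = split_sessions(ts)
print(sessions, gaps)
```

Session duration τ is then `s[-1] - s[0]` per session, and the session gap time is the difference between one session's end and the next session's start.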
Scott Binning and Scaled Embedding. We create a dwell-time sequence with the same dimension as the item sequence: $d = \{d_1, d_2, \dots, d_n\}$. Since dwell time follows a gamma distribution, we apply Scott binning to reduce dimensionality and accelerate training, with bin width $h = 3.49\,\hat{\sigma}\,n^{-1/3}$. Next, we use an embedding method to represent dwell-time importance within sessions: $E = \{e_{d,t}\} = \{e_{d,1}, e_{d,2}, \dots, e_{d,n}\}$.
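Scott's rule sets the histogram bin width from the sample standard deviation; a pure-Python sketch of the binning step, assuming the classic $h = 3.49\,\hat{\sigma}\,n^{-1/3}$ form of the rule:

```python
import math

def scott_bin_width(xs):
    """Scott's rule: h = 3.49 * sigma * n^(-1/3)."""
    n = len(xs)
    mean = sum(xs) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return 3.49 * sigma * n ** (-1.0 / 3.0)

def bin_indices(xs):
    """Replace each dwell time by its bin index, shrinking the vocabulary
    the time-embedding layer must cover."""
    h = scott_bin_width(xs)
    lo = min(xs)
    return [int((x - lo) // h) for x in xs]

dwell = [2.0, 5.0, 8.0, 11.0, 60.0, 300.0]  # made-up dwell times in seconds
print(bin_indices(dwell))
```

The resulting integer bin IDs are what the scaled embedding layer looks up, in place of raw continuous dwell times.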
[Figure: model architecture with input layer, embedding and dropout layers, LSTM layer, and attention scheme over hidden states h_1 ... h_n, plus Scott binning and scaled embedding of dwell times t_1 ... t_n.]
Attention design: attention is a tool for extracting the importance of each position in a sequence; intuitively, such attention vectors are well suited to modulate the RNN outputs. Attention problem: to train on all prefix samples, all subsequences need to be forwarded through the attention network.
Triangle parallel attention method:
[Figure: the input layer, embedding and dropout layers, LSTM layer, and attention scheme, with a trigonometric transformation of the time features; the attention scores for all prefixes are arranged in a lower-triangular matrix and normalized by a softmax.]
$u_i = \tanh(W_a h_{i} + b_a)$
$\alpha_i = \dfrac{\exp(u_i^{\top} q)}{\sum_j \exp(u_j^{\top} q)}$
$c = \sum_i \alpha_i\, h_{i}$
Each row of the triangular score matrix $P = [p_i] = [u_1^{\top} q, \dots, u_i^{\top} q, 0, \dots, 0]$ covers one prefix, so all subsequences share a single forward pass.
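A minimal numeric sketch of this attention scheme in pure Python. The weight values $W_a$, $b_a$, $q$ are made up for illustration; a real implementation learns them:

```python
import math

def attention_pool(H, W, b, q):
    """H: list of hidden-state vectors; W: square matrix; b, q: vectors.
    Returns (attention weights alpha_i, pooled context vector c)."""
    def matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]
    # u_i = tanh(W h_i + b)
    U = [[math.tanh(x + bi) for x, bi in zip(matvec(W, h), b)] for h in H]
    # scores u_i . q, then numerically stable softmax
    scores = [sum(u * qi for u, qi in zip(u_vec, q)) for u_vec in U]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    alphas = [e / sum(exps) for e in exps]
    # c = sum_i alpha_i h_i
    context = [sum(a * h[d] for a, h in zip(alphas, H))
               for d in range(len(H[0]))]
    return alphas, context

H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy hidden states
W = [[1.0, 0.0], [0.0, 1.0]]               # identity weights for clarity
alphas, context = attention_pool(H, W, b=[0.0, 0.0], q=[1.0, 1.0])
print(alphas, context)
```

The triangular trick in the slide amounts to evaluating this pooling for every prefix of H at once via a masked score matrix instead of one call per prefix.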
User-grouped session-parallel mini-batch approach: $E(u) = \{x_{u,t}\} = \{x_{u,t}^1, x_{u,t}^2, \dots, x_{u,t}^{B_u}\}$.
[Figure: user-grouped session-parallel mini-batches for User 1 and User 2, each contributing their own Sessions 1-5 to the input layer and the shifted output targets.]
User-based negative sampling: we select negative samples in proportion to item popularity within the mini-batch sequences. Furthermore, for each user, we rule out items that appear in his/her history. This local negative sampling method not only improves performance but also reduces computation time.
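A sketch of this sampler in plain Python (function and variable names hypothetical); `random.choices` draws with probability proportional to the supplied weights:

```python
import random

def sample_negatives(popularity, user_history, k, rng):
    """popularity: {item: count}. Draw k popularity-weighted negatives,
    restricted to items the user has never interacted with."""
    candidates = [i for i in popularity if i not in user_history]
    weights = [popularity[i] for i in candidates]
    return rng.choices(candidates, weights=weights, k=k)

pop = {"i1": 50, "i2": 30, "i3": 15, "i4": 5}
negs = sample_negatives(pop, user_history={"i1"}, k=3, rng=random.Random(0))
print(negs)
```

In practice the popularity counts come from the current mini-batch sequences, so the sampler stays local and cheap.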
[Figure: final model; the hidden outputs h_1 ... h_n pass through the trigonometric transformation and are combined with the user embedding via two alternative heads, each ending in a fully connected layer.]
Attention design: a self-attention mechanism weights the session hidden states, with scores $\tanh(W_s h_i + b_s)$ normalized by a softmax, $\alpha_i = \dfrac{\exp(q^{\top} p_i)}{\sum_j \exp(q^{\top} p_j)}$; the weighted sum is concatenated with the user bias $b_u$ before the fully connected layer.
Concatenation design: the hidden outputs are concatenated directly with the user embedding and fed through the fully connected layer.
The likelihood in the predictor is $f(y = i_{n+1} \mid i_{1,\dots,n}, u, t)$.
Loss functions (with positive score $\hat{y}_i$, sampled negative scores $\hat{y}_j$ for $j = 1, \dots, N_S$, and logistic sigmoid $\sigma$):
BPR loss: $L = -\dfrac{1}{N_S}\sum_{j=1}^{N_S} \log \sigma(\hat{y}_i - \hat{y}_j)$
TOP1 loss: $L = \dfrac{1}{N_S}\sum_{j=1}^{N_S} \left[\sigma(\hat{y}_j - \hat{y}_i) + \sigma(\hat{y}_j^2)\right]$
Hinge loss: $L = \max\{\hat{y}_j - \hat{y}_i + 1,\, 0\}$
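The three losses named above (BPR, TOP1, hinge) can be written directly from their formulas. A pure-Python sketch for one positive score against a set of sampled negatives; averaging the hinge loss over negatives is an assumption for symmetry with the other two:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bpr_loss(r_pos, r_negs):
    # -mean_j log sigma(r_pos - r_j)
    return -sum(math.log(sigmoid(r_pos - rj)) for rj in r_negs) / len(r_negs)

def top1_loss(r_pos, r_negs):
    # mean_j [sigma(r_j - r_pos) + sigma(r_j^2)]  (score regularization term)
    return sum(sigmoid(rj - r_pos) + sigmoid(rj * rj)
               for rj in r_negs) / len(r_negs)

def hinge_loss(r_pos, r_negs):
    # mean_j max(r_j - r_pos + 1, 0)
    return sum(max(rj - r_pos + 1.0, 0.0) for rj in r_negs) / len(r_negs)

print(bpr_loss(2.0, [0.5, -1.0]), top1_loss(2.0, [0.5, -1.0]))
```

All three decrease as the positive item's score pulls away from the sampled negatives, which is the ranking behavior the model is trained for.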
Data augmentation: first, we train each sequence with all hidden states as outputs, which fully exploits the subsequence information. Second, we apply a dropout layer to the sequences, which both regularizes the model and diversifies the input sequence data.
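The subsequence part of the augmentation can be sketched as prefix expansion: every prefix of a session becomes its own training example (next item as target):

```python
def augment_prefixes(session):
    """[i1, i2, i3] -> [([i1], i2), ([i1, i2], i3)]: each prefix paired
    with the item that followed it."""
    return [(session[:k], session[k]) for k in range(1, len(session))]

print(augment_prefixes(["i1", "i2", "i3"]))
```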
We use four datasets in our experiments:

| | Dataset 1 | Dataset 2 | Dataset 3 | Dataset 4 |
| Events | 53,309 | 17,920,066 | 6,921,446 | 254,398 |
| Users | 237 | / | 12,332 | 3,035 |
| Items | 1,395 | 23,459 | 31,893 | 1,173 |
| Sessions | 3,609 | 4,247,567 | 93,287 | 45,878 |
| Session support | 2 | 2 | 2 | 2 |
| Item support | 10 | 20 | 10 | 20 |
| User support | 10 | / | 10 | 20 |
| Models | Description |
| Baseline: BPR-MF | Matrix factorization applying SVD to factor the user-item rating matrix |
| Baseline: YouTube DNN | Two stages: candidate generation and ranking |
| Baseline: WaveNet CNN | Inner multiplication exploited by stacked causal atrous convolutions |
| Baseline: GRU4REC RNN | Basic GRU layers with TOP1 loss and the session-parallel mini-batching mechanism |
| PASAR_user_att | Adds user profile embedding via a self-attention network |
| PASAR_user_cat | Adds the user profile by concatenating hidden outputs and user embeddings |
| PASAR_time_att | Adds time profile embedding via a global attention network |
| PASAR_time_cat | Adds the time profile by concatenating time embeddings and item embeddings |
| PASAR_time_user | Integrates both time and user profiles (the final PASAR model) |
Experimental comparison results: MRR@20 and Recall@20 scores of the four baseline models and five PASAR variants on the four datasets. Focal improvements are highlighted in bold; the best results are underlined.
The detailed MRR@10, MRR@20, MRR@30, MRR@40, and MRR@all results for each dataset.
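The reported metrics can be computed straightforwardly: for each test event, look up the target item's rank in the model's ranked list and keep only hits within the top K (a sketch; `ranked_lists` holds each test event's model-ranked items):

```python
def mrr_at_k(ranked_lists, targets, k):
    """Mean reciprocal rank, counting only targets ranked in the top K."""
    rr = 0.0
    for ranked, t in zip(ranked_lists, targets):
        if t in ranked[:k]:
            rr += 1.0 / (ranked.index(t) + 1)
    return rr / len(targets)

def recall_at_k(ranked_lists, targets, k):
    """Fraction of test events whose target appears in the top K."""
    hits = sum(1 for ranked, t in zip(ranked_lists, targets) if t in ranked[:k])
    return hits / len(targets)

preds = [["i1", "i2", "i3"], ["i2", "i1", "i3"]]   # toy ranked predictions
gold = ["i2", "i3"]                                 # true next items
print(mrr_at_k(preds, gold, k=2), recall_at_k(preds, gold, k=2))
```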
Training speed (iterations/second) and training memory cost (MiB). We ran experiments on NVIDIA Tesla P40 GPUs. The MF method is fastest, and the CNN method takes the most memory. Our model is about half as fast as the baseline RNN model and has a similar memory cost.
Many thanks to my mentors Weizhi and Chris. Love JD. ♥